[HN Gopher] Show HN: SeekStorm - open-source sub-millisecond sea...
___________________________________________________________________
Show HN: SeekStorm - open-source sub-millisecond search in Rust
Author : wolfgarbe
Score : 102 points
Date : 2024-12-02 13:06 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| Leoko wrote:
| Sub-millisecond latency sounds impressive, but isn't network
| latency going to overshadow these gains in most real-world
| scenarios?
| wolfgarbe wrote:
| It depends on the application.
|
| When using SeekStorm as a server, keeping the latency per query
| low increases the throughput and the number of parallel queries
| a server can handle on top of a given hardware. An efficient
| search server can reduce the required investments in server
| hardware.
|
| In other cases, only the local search performance matters,
| e.g., for data mining or RAG.
|
| Also, it's not only about averages but also about tail
| latencies. While network latencies dominate the average search
| time, that is not the case for tail latencies, which in turn
| heavily influence user satisfaction and revenue in online
| shopping.
| pornel wrote:
| When search is cheap and quick, it's possible to improve search
| by postprocessing search results and running more queries when
| necessary.
|
| I use Tantivy, and add refinements like: if the top result is
| objectively a low-quality one, it's usually a query with a typo
| finding a document with the same typo, so I run the query again
| with fuzzy spelling. If all the top results have the same tag
| (that isn't in the query), then I mix in results from another
| search with the most common tag excluded. If the query is a
| word that has multiple meanings, I can ensure that each meaning
| is represented in the top results.
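The retry-on-typo refinement described above can be sketched as follows. This is a toy stand-in, not Tantivy's API: `search`, `fuzzy_search`, and the corpus are hypothetical, and a real quality check would be richer than "no exact hits".

```rust
// Sketch of the "rerun the query with fuzzy spelling" refinement.
// All names here are illustrative, not a real engine's API.

// Classic Levenshtein edit distance via dynamic programming.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

// Exact search: a document matches if it contains the term verbatim.
fn search<'a>(corpus: &[&'a str], term: &str) -> Vec<&'a str> {
    corpus
        .iter()
        .filter(|d| d.split_whitespace().any(|w| w == term))
        .copied()
        .collect()
}

// Fuzzy search: a word within edit distance 1 of the term also matches.
fn fuzzy_search<'a>(corpus: &[&'a str], term: &str) -> Vec<&'a str> {
    corpus
        .iter()
        .filter(|d| d.split_whitespace().any(|w| edit_distance(w, term) <= 1))
        .copied()
        .collect()
}

// The refinement: rerun the query with fuzzy matching when the exact
// query comes up empty (a low-quality-top-result test would go here too).
fn search_with_retry<'a>(corpus: &[&'a str], term: &str) -> Vec<&'a str> {
    let hits = search(corpus, term);
    if hits.is_empty() {
        fuzzy_search(corpus, term)
    } else {
        hits
    }
}

fn main() {
    let corpus = ["apple silicon benchmarks", "rust search engine"];
    // The typo "aple" finds nothing exactly, so the fuzzy retry kicks in.
    println!("{:?}", search_with_retry(&corpus, "aple"));
}
```

This only pays off when each query is cheap, which is the point of the parent comment: a fast engine leaves a latency budget for a second pass.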
| intelVISA wrote:
| A typical server is serving more than one request at a time,
| hopefully.
| tlofreso wrote:
| Demo = impressed.
|
| How's SeekStorm's prowess in mid-cap enterprise? How hairy is
| the ingest pipeline for sources like decade-old SharePoint
| sites, PDFs with partial text layers, Excel, email .msg
| files, etc.?

| wolfgarbe wrote:
| Yes, integration into complex legacy systems is always
| challenging. As a small startup, we are concentrating on core
| search technology to make search faster and to make the most of
| available server infrastructure. As SeekStorm is open-source,
| system integrators can take it from there.
| fiedzia wrote:
| Same as any other full-text search solution - it's your job to
| integrate it.
| jazzyjackson wrote:
| On that topic, can anybody chime in on the state of the art
| in PDF OCR? Even if that's a multimodal LLM - I've used
| ChatGPT to extract tabular data from images, but I need
| something I can self-host for proprietary data.
| DonnyV wrote:
| The fact that the only way they could squeeze out more
| performance was by switching from C# to Rust says a lot
| about C#. .NET has come a long way with performance.
| neonsunset wrote:
| I find the note unfortunate. They state 2-4x performance
| improvement. I'm sure looking at the implementation with a
| profiler and tactically optimizing critical paths would have
| yielded them 2-3x as is. They could have also reached out to
| .NET JIT team via issues or discussions on GitHub for guidance.
| Especially since .NET has a rich set of SIMD APIs very well
| suited for implementing SOTA text search algorithms (and also
| comes with many out of box, seriously, look at e.g.
| https://devblogs.microsoft.com/dotnet/performance-
| improvemen...)
|
| The note also states "No framework dependencies (CLR or JVM
| virtual machines)" which isn't true either - 'dotnet publish
| /p:PublishSingleFile=true /p:PublishTrimmed=true' gives the
| same "dependency-less" experience. "Ahead-of-time instead of
| just-in-time compilation" is similarly wrong - replace previous
| args with '/p:PublishAot=true' and you get a native binary.
| wolfgarbe wrote:
| The 2-4 speed ratio was not meant to denounce C#, which is a
| great language I loved to program in for over two decades,
| coming from Delphi. Unfortunately, C# does not have complete
| SIMD support. See our request to support the SSE4.2 _mm_cmpistrm
| instruction
| https://github.com/dotnet/runtime/discussions/63332, which we
| required for a vectorized intersection between two sorted
| 16-bit arrays. We did not take the switch from C# to Rust
| lightly, as porting a fairly large codebase is costly and
| time-consuming. We just wanted to share our experience for
| our specific task, not as a general statement.
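For context, the scalar operation that `_mm_cmpistrm` is used to vectorize is roughly the following two-pointer merge over sorted 16-bit posting lists. This is a sketch of the general technique, not SeekStorm's actual code; the SSE4.2 path compares eight values per instruction instead of one.

```rust
use std::cmp::Ordering;

// Scalar baseline for intersecting two sorted u16 arrays (e.g. doc-id
// postings within a block). Advances whichever pointer holds the
// smaller value; emits values present in both lists.
fn intersect_sorted(a: &[u16], b: &[u16]) -> Vec<u16> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len().min(b.len()));
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    println!("{:?}", intersect_sorted(&[2, 5, 9, 12], &[5, 6, 12, 13]));
}
```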
| neonsunset wrote:
| Thank you. It is indeed true that .NET has some gaps in its
| SIMD API, which might require either writing a specific
| routine in C and pinvoking it or implementing the algorithm
| differently.
|
| Were there any other factors that contributed to the
| decision?
|
| FWIW, I forwarded the issue the discussion links to the
| dotnetevolution Discord server.
| wolfgarbe wrote:
| Yes. We waited a long time for AOT compilation to become
| mature, to remove the need for users to install the .NET
| framework. But two years ago, when we decided to switch,
| we still couldn't just get the AOT compilation of our
| codebase to work without changes (perhaps it was somehow
| possible, but the available documentation was not very
| verbose about this). Also, there is still a performance
| gap. Of course, this doesn't matter for most of the
| applications, where the completeness and consistency of
| the framework, and the number of programmers fluent in
| that language might matter more. But for a search server,
| we needed to squeeze out every bit of performance we could
| get. And other benchmarks seemed to echo our experience:
| https://programming-language-benchmarks.vercel.app/rust-
| vs-c...
| neonsunset wrote:
| That specific suite is... not the best.
| https://benchmarksgame-
| team.pages.debian.net/benchmarksgame/... is more focused
| on optimized implementations, and it better shows where
| .NET's performance lands once someone has spent some time
| optimizing a submission.
|
| It is true that 2 years ago NativeAOT was in its infancy;
| it has improved substantially since then. Self-contained
| trimmed binaries already worked back then, however.
|
| I guess it is more about unfortunate timing than anything
| else - even the compiler itself moves fast, and in some
| areas the difference in codegen quality between .NET 7, 8,
| and 9 is very significant.
| remram wrote:
| What is the story for a multi-language corpus? Do I have to
| do my own stop-word pruning, tokenizing, lemmatization, etc.?
| This is usually the case with full-text search solutions, and
| it is a pain.
| wolfgarbe wrote:
| We started with making the core search technology faster. Then
| we added a Unicode character folding/normalization tokenizer
| (diacritics, accents, umlauts, bold, italic, full-width
| chars...). Last week we added a tokenizer that supports Chinese
| word segmentation. Currently, we are working on a multi-
| language tokenizer that segments Chinese, Japanese, and
| Korean without switching tokenizers.
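The character-folding step mentioned above can be sketched like this. The fold table here is a tiny illustrative subset I made up for the example; real folding (as in SeekStorm's tokenizer or Unicode's folding tables) covers far more mappings.

```rust
// Map a few accented characters to their base letters. Illustrative
// subset only - real character folding tables are much larger.
fn fold(c: char) -> char {
    match c {
        'á' | 'à' | 'â' | 'ä' => 'a',
        'é' | 'è' | 'ê' | 'ë' => 'e',
        'ö' => 'o',
        'ü' => 'u',
        _ => c,
    }
}

// Lowercase, fold, then split on non-alphanumeric characters, so that
// "Müller" and "muller" index to the same term.
fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .chars()
        .map(fold)
        .collect::<String>()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

fn main() {
    println!("{:?}", tokenize("Müller café"));
}
```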
| jazzyjackson wrote:
| Re: stemming and lemmatization, I just want to plug the most
| impressive NLP stack I ever used, ChatScript. It's really for
| building dialog trees: it walks down a branch of conversation
| using what are effectively switch statements, but with very
| rich conceptual pattern matching and capturing. Somewhere in
| the middle of the stack it does an excellent job of
| abstracting from word input to a general concept (in WordNet),
| performing all the spell correction (against your dictionary),
| stemming, lemmatization, and disambiguation.
|
| I've had it in mind for a while to build a fuzzy search tool
| based on parsing each phrase into concepts, parsing the search
| query into concepts, and finding nearest match based on that.
| It's a C library and very fast.
|
| https://github.com/ChatScript/ChatScript
|
| Looks like it hasn't been committed to in some time, I'll have
| to check out their blog and see what's up. I guess with the
| advent of LLMs, dialog trees are passé.
| athompsondog wrote:
| I wonder how BurntSushi feels about this.
| throwaway888abc wrote:
| Impressive, bookmarked, upvoted.
|
| Appreciate the demo: https://deephn.org/?q=apple+silicon
| justmarc wrote:
| I really like your approach. Impressed by your care for
| performance and your fast pace of adding what appears to be
| pretty complex stuff, while making sure it stays performant.
|
| Keep it up!
|
| Bookmarked.
| distracted_boy wrote:
| How does this compare to PostgreSQL?
| wolfgarbe wrote:
| PostgreSQL is an SQL database that also offers full-text
| search (FTS); with extensions like pg_search, it supports
| BM25 scoring, which is essential for lexical search.
| SeekStorm is centered on full-text search only; it doesn't
| offer SQL.
|
| Performance-wise, it would indeed be interesting to run a
| benchmark. The third-party open-source benchmark we are
| currently using (search_benchmark_game) does not yet support
| PostgreSQL. So yes, that comparison is still pending.
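For readers unfamiliar with the BM25 scoring mentioned above, the per-term weight is usually written with two free parameters, k1 and b. A hedged sketch (common default values, not SeekStorm's or pg_search's actual implementation):

```rust
// BM25 weight of one query term in one document. `tf` is the term's
// frequency in the document, `idf` its inverse document frequency,
// and `doc_len`/`avg_doc_len` drive the length normalization.
fn bm25_term(tf: f64, doc_len: f64, avg_doc_len: f64, idf: f64) -> f64 {
    let (k1, b) = (1.2, 0.75); // commonly used defaults
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
}

fn main() {
    // For an average-length document the length normalization cancels,
    // so a single occurrence with idf = 1.0 scores exactly 1.0.
    println!("{}", bm25_term(1.0, 100.0, 100.0, 1.0));
}
```

The document score is the sum of this weight over all query terms; longer-than-average documents are penalized, which plain term-frequency scoring does not do.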
| treefarmer wrote:
| Is there distributed server support? I see it on the list of new
| features with (currently PoC) next to it, but is the code for the
| PoC available anywhere?
|
| Also, would there be any potential issues if the index was
| mounted on shared storage between multiple instances?
| Thaxll wrote:
| It feels like everyone re-implements the same application;
| searching text in language x.y.z has been done a million
| times, and search speed is not a problem. So what
| differentiates this solution from the dozen-plus mature ones?
___________________________________________________________________
(page generated 2024-12-02 23:01 UTC)