[HN Gopher] Turbopuffer: Fast search on object storage
       ___________________________________________________________________
        
       Turbopuffer: Fast search on object storage
        
       Author : Sirupsen
       Score  : 320 points
       Date   : 2024-07-09 14:48 UTC (1 days ago)
        
 (HTM) web link (turbopuffer.com)
 (TXT) w3m dump (turbopuffer.com)
        
       | softwaredoug wrote:
       | Having worked with Simon he knows his sh*t. We talked a lot about
       | what the ideal search stack would look when we worked together at
       | Shopify on search (him more infra, me more ML+relevance). I
       | discussed how I just want a thing in the cloud to provide my
       | retrieval arms, let me express ranking in a fluent "py-data"
       | first way, and get out of my way
       | 
       | My ideal is that turbopuffer ultimately is like a Polars
       | dataframe where all my ranking is expressed in my search API. I
       | could just lazily express some lexical or embedding similarity,
       | boost with various attributes like, maybe by recency, popularity,
       | etc to get a first pass (again all just with dataframe math).
       | Then compute features for a reranking model I run on my side -
       | dataframe math - and it "just works" - runs all this as some kind
       | of query execution DAG - and stays out of my way.
        
         | bkitano19 wrote:
         | +1, had the fortune to work with him at a previous startup and
         | meetup in person. Our convo very much broadened my perspective
         | on engineering as a career and a craft, always excited to see
         | what he's working on. Good luck Simon!
        
       | cmcollier wrote:
       | Unrelated to the core topic, I really enjoy the aesthetic of
       | their website. Another similar one is from Fixie.ai (also,
       | interestingly, one of their customers).
        
         | itunpredictable wrote:
         | This website rocks
        
         | nsguy wrote:
         | Yeah! fast, clean, cool, unique.
        
         | swyx wrote:
         | what does fixie do these days?
        
           | sitkack wrote:
           | They pivoted, but will probably pivot back to their original
           | quest.
        
             | zkoch wrote:
             | Nah, we're pretty happy with the new trajectory. :)
        
         | xarope wrote:
         | Yes, I like the turboxyz123 animation and contrast to the
         | minimalist website (reminds me of the zen garden with a single
         | rock). I think people forget nowadays in their haste to add the
         | latest and greatest react animation, that too much noise is a
         | thing.
        
         | k2so wrote:
         | This was my first thought too, after reading through their
         | blog. This feels like a no-frills website made by an engineer,
         | who makes things that just work.
         | 
         | The documentation is great, I really appreciate them putting
         | the roadmap front and centre.
        
         | 5- wrote:
         | indeed! what a nice, minimal page... that comes with ~1.6mb of
         | javascript.
        
       | bigbones wrote:
       | Sounds like a source-unavailable version of Quickwit?
       | https://quickwit.io/
        
         | pushrax wrote:
         | LSM tree storage engine vs time series storage engine, similar
         | philosophy but different use cases
        
           | singhrac wrote:
           | Maybe I misunderstood both products but I think neither
           | Quickwit or Turbopuffer is either of those things
           | intrinsically (though log structured messages are a good fit
           | for Quickfit). I think Quickwit is essentially
           | Lucene/Elasticsearch (i.e. sparse queries or BM25) and
           | Turbopuffer does vector search (or dense queries) like say
           | Faiss/Pinecone/Qdrant/Vectorize, both over object storage.
        
             | pushrax wrote:
             | It's true that turbopuffer does vector search, though it
             | also does BM25.
             | 
             | The biggest difference at a low level is that turbopuffer
             | records have unique primary keys, and can be updated, like
             | in a normal database. Old records that were overwritten
             | won't be returned in searches. The LSM tree storage engine
             | is used to achieve this. The LSM tree also enables
             | maintenance of global indexes that can be used for
             | efficient retrieval without any time-based filter.
             | 
             | Quickwit records are immutable. You can't overwrite a
             | record (well, you can, but overwritten records will also be
             | returned in searches). The data files it produces are
             | organized into a time series, and if you don't pass a time-
             | based filter it has to look at every file.
        
               | singhrac wrote:
               | Ah I didn't catch that Quickwit had immutable records.
               | That explains the focus on log usage. Thanks!
        
       | vidar wrote:
       | Can you compare to S3 Athena (ELI5)?
        
       | CyberDildonics wrote:
       | Sounds like a filesystem with attributes in a database.
        
       | drodgers wrote:
       | I love the object-storage-first approach; it seems like such a
       | natural fit for the could.
        
       | eknkc wrote:
       | Is there a good general purpose solution where I can store a
       | large read only database in s3 or something and do lookups
       | directly on it?
       | 
       | Duckdb can open parquet files over http and query them but I
       | found it to trigger a lot of small requests reading bunch of
       | places from the files. I mean a lot.
       | 
       | I mostly need key / value lookups and could potentially store
       | each key in a seperate object in s3 but for a couple hundred
       | million objects.. It would be a lot more managable to have a
       | single file and maybe a cacheable index.
        
         | jiggawatts wrote:
         | > trigger a lot of small requests reading bunch of places from
         | the files. I mean a lot.
         | 
         | That's... the whole point. That's how Parquet files are
         | supposed to be used. They're an improvement over CSV or JSON
         | because clients can read small subsets of them efficiently!
         | 
         | For comparison, I've tried a few other client products that
         | don't use Parquet files properly and just read the whole file
         | every time, no matter how trivial the query is.
        
           | eknkc wrote:
           | This makes sense but the problem I had with duckdb + parquet
           | is it looks like there is no metadata caching so each and
           | every query triggers a lot of requests.
           | 
           | Duckdb can query a remote duckdb database too, in that case
           | it looks like there is caching. Which might be better.
           | 
           | I wonder if anyone actually worked on a specific file format
           | for this use case (relatively high latency random access) to
           | minimize reads to as little blocks as possible.
        
             | jiggawatts wrote:
             | Sounds like a bug or missing feature in DuckDB more than an
             | issue with the format
        
         | imiric wrote:
         | ClickHouse can also read from S3. I'm not sure how it compares
         | to DuckDB re efficiency, but it worked fine for my simple use
         | case.
        
           | masterj wrote:
           | Neither of these support indexes afaik. They are designed to
           | do fast scans / computation.
        
             | hodgesrm wrote:
             | It depends on what you mean by "support." ClickHouse as I
             | recall can read min/max indexes from Parquet row groups.
             | One of my colleagues is working on a PR to add support for
             | bloom filter indexes. So that will be covered as well.
             | 
             | Right now one of the main performance problems is that
             | Clickhouse does not cache index metadata yet, so you still
             | have to scan files rather than keeping the metadata in
             | memory. ClickHouse does this for native MergeTree tables.
             | There are a couple of steps to get there but I have no
             | doubt that metadata caching will be properly handled soon.
             | 
             | Disclaimer: I work for Altinity, an enterprise provider for
             | ClickHouse software.
        
             | orthecreedence wrote:
             | Depends what you mean by "indexes." DuckDB can read path
             | parameters (ex s3://my-
             | bucket/category=beverages/month=2022-01-01/*/*.parquet)
             | where `category` and `month` can be filtered at the query
             | level, skipping any non-matching files. I think that
             | qualifies as an index. Obviously, you'd have to create
             | these up-front, or risk moving lots of data between paths.
        
         | tionis wrote:
         | You could use a sqlite database and use range queries using
         | something like this: https://github.com/psanford/sqlite3vfshttp
         | https://github.com/phiresky/sql.js-httpvfs
         | 
         | Simon Willison wrote about it:
         | https://simonwillison.net/2022/Aug/10/sqlite-http/
        
           | arcanemachiner wrote:
           | That whole thing still blows my mind.
        
           | eknkc wrote:
           | Yep this thing is the reason I thought about doing it in the
           | first place. Tried duckdb which has built in support for
           | range requests over http.
           | 
           | Whole idea makes sense but I feel like the file format should
           | be specifically tuned for this use case. Otherwise you end up
           | with a lot of range requests because it was designed for disk
           | access. I wondered if anything was actually designed for
           | that.
        
             | hobofan wrote:
             | Parquet and other columnar storage formats are essentially
             | already tuned for that.
             | 
             | A lot of requests in themselves shouldn't be that horrible
             | with Cloudfront nowadays, as you both have low latency and
             | with HTTP2 a low-overhead RPC channel.
             | 
             | There are some potential remedies, but each come with
             | significant architetural impact:
             | 
             | - Bigger range queries; For smallish tables, instead of
             | trying to do point-based access for individual rows,
             | instead retrieve bigger chunks at once and scan through
             | them locally -> Less requests, but likely also more wasted
             | bandwidth
             | 
             | - Compute the specific view live with a remote DuckDB ->
             | Has the downside of having to introduce a DuckDB instance
             | that you have to manage between the browser and S3
             | 
             | - Precompute the data you are interested into new parquest
             | files -> Only works if you can anticipate the query
             | patterns enough
             | 
             | I read in the sibling comment that your main issue seems to
             | be re-reading of metadata. DuckDB is AFAIK able to cache
             | the metadata, but won't across instances. I've seen someone
             | have the same issue, and the problem was that they only
             | created short-lived DuckDB in-memory instances (every time
             | the wanted to run a query), so every time the fresh DB had
             | to retrieve the metadata again.
        
               | eknkc wrote:
               | Thanks for the insights. Precomputing is not really
               | suitable for this and the thing is, I'm mostly using it
               | as a lookup table on key / value queries. I know Duckdb
               | is mostly suitable for aggregation but the http range
               | query support was too attractive to pass on.
               | 
               | I did some tests, querying "where col = 'x'". If the
               | database was a remote duckdb native db, it would issue a
               | bunch of http range requests and the second exact call
               | would not trigger any new requests. Also, querying for
               | col = foo and then col = foob would yield less and less
               | requests as I assume it has the necesary data on hand.
               | 
               | Doing it on parquet, with a single long running duckdb
               | cli instance, I get the same requests over and over
               | again. The difference though, I'd need to "attach" the
               | duckdb database under a schema name but would query the
               | parquet file using "select from 'http://.../x.parquet'"
               | syntax. Maybe this causes it to be ephemeral for each
               | query. Will see if the attach syntax also works for
               | parquet.
        
               | hobofan wrote:
               | I think both should work, but you have to set the object
               | cache pragma IIRC: https://duckdb.org/docs/configuration/
               | pragmas.html#object-ca...
        
         | cdchn wrote:
         | >Is there a good general purpose solution where I can store a
         | large read only database in s3 or something and do lookups
         | directly on it?
         | 
         | I think this is pretty much what AWS Athena is.
        
           | tiew9Vii wrote:
           | Cloud backed SQLLite looks like it might be good for this.
           | Doesn't support S3 though
           | 
           | https://sqlite.org/cloudsqlite/doc/trunk/www/index.wiki
        
         | canadiantim wrote:
         | LanceDB
        
       | omneity wrote:
       | > In 2022, production-grade vector databases were relying on in-
       | memory storage
       | 
       | This is irking me. pg_vector has existed from before that,
       | doesn't require in-memory storage and can definitely handle
       | vector search for 100m+ documents in a decently performant
       | manner. Did they have a particular requirement somewhere?
        
         | jbellis wrote:
         | Have you tried it? pgvector performance falls off a cliff once
         | you can't cache in ram. Vector search isn't like "normal"
         | workloads that follow a nice pareto distribution.
        
           | omneity wrote:
           | Tried and deployed in production with similar sized
           | collections.
           | 
           | You only need enough memory to load the index, definitely not
           | the whole collection. A typical index would most likely fit
           | within a few GBs. And even if you need dozens of GBs of RAM
           | it won't cost nearly as much as $20k/month as the article
           | surmises.
        
             | lyu07282 wrote:
             | How do you get to "a few GBs"? A hundred million
             | embeddings, if you have 4 byte floats 1024 dimensions would
             | be >400 GB alone.
        
               | omneity wrote:
               | I did say the index, not the embeddings themselves. The
               | index is a more compact representation of your embeddings
               | collection, and that's what you need in memory. One
               | approach for indexing is to calculate centroids of your
               | embeddings.
               | 
               | You have multiple parameters to tweak, that affect
               | retrieval performance as well as the memory footprint of
               | your indexes. Here's a rundown on that:
               | https://tembo.io/blog/vector-indexes-in-pgvector
        
       | yamumsahoe wrote:
       | unsure if they are comparable, but is this and quickwit
       | comparable?
        
       | hipadev23 wrote:
       | That's some woefully disappointing and incorrect metrics (read
       | and write latency are both sub-second, storage medium would be "
       | Memory + Replicated SSDs") you've got for Clickhouse there, but I
       | understand what you're going for and why you categorized it where
       | you did.
        
       | endisneigh wrote:
       | Slightly relevant - do people really want article
       | recommendations? I don't think I've ever read an article and
       | wanted a recommendation. Even with this one - I sort of read it
       | and that's it; no feeling of wanting recommendations.
       | 
       | Am I alone in this?
       | 
       | In any case this seems like a pretty interesting approach.
       | Reminds me of Warpstream which does something similar with S3 to
       | replace Kafka.
        
       | nh2 wrote:
       | > $3600.00/TB/month
       | 
       | It doesn't have to be that way.
       | 
       | At Hetzner I pay $200/TB/month for RAM. That's 18x cheaper.
       | 
       | Sometimes you can reach the goal faster with less complexity by
       | removing the part with the 20x markup.
        
         | TechDebtDevin wrote:
         | I will likely never leave Hetzner.
        
         | AYBABTME wrote:
         | 200$/TB/month for raw RAM, not RAM that's presented to you
         | behind a usable API that's distributed and operated by someone
         | else, freeing you of time.
         | 
         | It's not particularly useful to compare the cost of raw
         | unorganized information medium on a single node, to highly
         | organized information platform. It's like saying "this CPU chip
         | is expensive, just look at the price of this sand".
        
           | kirmerzlikin wrote:
           | AFAIU, 3600$ is also a price for "raw RAM" that will be used
           | by your common database via sys calls and not via a "usable
           | API operated by someone else"
        
           | hodgesrm wrote:
           | > It's not particularly useful to compare the cost of raw
           | unorganized information medium on a single node, to highly
           | organized information platform.
           | 
           | Except that it does prompt you to ask what you could do to
           | use that cheap compute and RAM. In the case of Hetzner that
           | might be large caches that allow you to apply those resources
           | on remote data whilst minimizing transfer and API costs.
        
         | formerly_proven wrote:
         | You seem to be quoting the highest figure from the article out
         | of context as-if that is their pricing, but the opposite is the
         | case.
         | 
         | > $3600.00/TB/month (incumbents)
         | 
         | > $70.00/TB/month (turbopuffer)
         | 
         | That's still 3x cheaper than your number and it's a SaaS API,
         | not just a piece of rented hardware.
        
           | nh2 wrote:
           | > as-if that is their pricing
           | 
           | No, that's not what I'm saying. Their "Storage Costs" table
           | shows costs to rent storage from some provider (AWS?). It's
           | clear that those are costs that the user has to pay for
           | infrastructure needed for certain types of software (e.g.
           | Turbopuffer is designed to be running on "S3 + SSD Cache",
           | while other software may be designed to run on "RAM + 3x
           | SSD").
           | 
           | I'm comparing RAM costs from that table with RAM costs in the
           | real world.
           | 
           | The idea backed by that table is "RAM is so expensive, so we
           | need to build software to run it on cheaper storage instead".
           | 
           | My statement is "RAM is that expensive only on that provider,
           | there are others where it is not; on those, you may just run
           | it in RAM and save on software complexity".
           | 
           | You will still need some software for your SaaS API to serve
           | queries from RAM, but it won't need the complexity of trying
           | to make it fast when serving from a higher-latency storage
           | backend (S3).
        
       | yawnxyz wrote:
       | can't wait for the day the get into GA!
        
       | cdchn wrote:
       | The very long introductory page has a ton of very juicy data in
       | it, even if you don't care about the product itself.
        
       | zX41ZdbW wrote:
       | A correction to the article. It mentions
       | Warehouse BigQuery, Snowflake, Clickhouse >=1s Minutes
       | 
       | For ClickHouse, it should be: read latency <= 100ms, write
       | latency <= 1s.
       | 
       | Logging, real-time analytics, and RAG are also suitable for
       | ClickHouse.
        
         | Sirupsen wrote:
         | Yeah, thinking about this more I now understand Clickhouse to
         | be more of an operational warehouse similar to Materialize,
         | Pinot, Druid, etc. if I understand correctly? So bunching with
         | BigQuery/Snowflake/Trino/Databricks... wasn't the right
         | category (although operational warehouses certainly can have a
         | ton of overlap)
         | 
         | I left that category out for simplicity (plenty of others that
         | didn't make it into the taxonomy, e.g. queues, nosql, time-
         | series, graph, embedded, ..)
        
       | arnorhs wrote:
       | This looks super interesting. I'm not that familiar with vector
       | databases. I thought they were mostly something used for RAG and
       | other AI-related stuff.
       | 
       | Seems like a topic I need to delive into a bit more.
        
       | solatic wrote:
       | Is it feasible to try to build this kind of approach (hot SSD
       | cache nodes sitting in front of object storage) with prior open-
       | source art (Lucene)? Or are the search indexes themselves also
       | proprietary in this solution?
       | 
       | Having witnessed some very large Elasticsearch production
       | deployments, being able to throw everything into S3 would be
       | _incredible_. The applicability here isn 't only for vector
       | search.
        
         | francoismassot wrote:
         | If you don't need vector search and have very large
         | Elasticsearch deployment, you can have a look at Quickwit, it's
         | a search engine on object storage, it's OSS and works for
         | append-only datasets (like logs, traces, ...)
         | 
         | Repo: https://github.com/quickwit-oss/quickwit
        
         | rohitnair wrote:
         | Elasticsearch and OpenSearch already support S3 backed indices.
         | See features like https://opensearch.org/docs/latest/tuning-
         | your-cluster/avail... The files in S3 are plain old Lucene
         | segment files (just wrapped in OpenSearch snapshots which
         | provide a way to track metadata around those files).
        
           | francoismassot wrote:
           | But you don't have fast search on those files stored on
           | object storage.
        
             | rohitnair wrote:
             | Yes, there is a cold start penalty but once the data is
             | cached, it is equivalent to disk backed indices. There is
             | also active work being done to improve the performance,
             | example https://github.com/opensearch-
             | project/OpenSearch/issues/1380...
        
       ___________________________________________________________________
       (page generated 2024-07-10 23:01 UTC)