[HN Gopher] Turbopuffer: Fast search on object storage
___________________________________________________________________
Turbopuffer: Fast search on object storage
Author : Sirupsen
Score : 320 points
Date   : 2024-07-09 14:48 UTC (1 day ago)
(HTM) web link (turbopuffer.com)
(TXT) w3m dump (turbopuffer.com)
| softwaredoug wrote:
| Having worked with Simon, I can say he knows his sh*t. We talked
| a lot about what the ideal search stack would look like when we
| worked together at Shopify on search (him more infra, me more
| ML+relevance). I discussed how I just want a thing in the cloud
| to provide my retrieval arms, let me express ranking in a fluent
| "py-data"-first way, and get out of my way.
|
| My ideal is that turbopuffer ultimately is like a Polars
| dataframe where all my ranking is expressed in my search API. I
| could just lazily express some lexical or embedding similarity,
| boost with various attributes, maybe by recency, popularity,
| etc., to get a first pass (again, all just with dataframe math).
| Then compute features for a reranking model I run on my side -
| dataframe math - and it "just works" - runs all this as some kind
| of query execution DAG - and stays out of my way.
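A rough sketch of that "ranking as dataframe math" idea using Polars over an already-retrieved candidate set. The column names, weights, and data are invented for illustration; this is not turbopuffer's actual API.

```python
# Hypothetical first-pass candidates (doc ids plus a lexical score, an
# embedding similarity, and a couple of boost signals) from retrieval.
import polars as pl

candidates = pl.DataFrame({
    "doc_id": [1, 2, 3],
    "bm25": [12.1, 9.4, 15.0],
    "vector_sim": [0.82, 0.91, 0.55],
    "days_old": [3, 40, 1],
    "popularity": [120, 15, 900],
})

ranked = (
    candidates.lazy()
    .with_columns(
        # Blend lexical and embedding similarity, then boost by recency and
        # popularity -- all as plain, lazily evaluated dataframe math.
        (
            0.5 * pl.col("bm25") / pl.col("bm25").max()
            + 0.5 * pl.col("vector_sim")
            + pl.lit(0.1) / (pl.col("days_old") + 1)
            + 0.05 * pl.col("popularity").log1p()
        ).alias("score")
    )
    .sort("score", descending=True)
    .collect()
)
print(ranked)
```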
| bkitano19 wrote:
| +1, had the fortune to work with him at a previous startup and
| meet up in person. Our convo very much broadened my perspective
| on engineering as a career and a craft, always excited to see
| what he's working on. Good luck Simon!
| cmcollier wrote:
| Unrelated to the core topic, I really enjoy the aesthetic of
| their website. Another similar one is from Fixie.ai (also,
| interestingly, one of their customers).
| itunpredictable wrote:
| This website rocks
| nsguy wrote:
| Yeah! fast, clean, cool, unique.
| swyx wrote:
| what does fixie do these days?
| sitkack wrote:
| They pivoted, but will probably pivot back to their original
| quest.
| zkoch wrote:
| Nah, we're pretty happy with the new trajectory. :)
| xarope wrote:
| Yes, I like the turboxyz123 animation and the contrast with the
| minimalist website (reminds me of a zen garden with a single
| rock). I think people forget nowadays, in their haste to add the
| latest and greatest React animation, that too much noise is a
| thing.
| k2so wrote:
| This was my first thought too, after reading through their
| blog. This feels like a no-frills website made by an engineer,
| who makes things that just work.
|
| The documentation is great, I really appreciate them putting
| the roadmap front and centre.
| 5- wrote:
| indeed! what a nice, minimal page... that comes with ~1.6mb of
| javascript.
| bigbones wrote:
| Sounds like a source-unavailable version of Quickwit?
| https://quickwit.io/
| pushrax wrote:
| LSM tree storage engine vs time series storage engine, similar
| philosophy but different use cases
| singhrac wrote:
| Maybe I misunderstood both products, but I think neither
| Quickwit nor Turbopuffer is either of those things
| intrinsically (though log-structured messages are a good fit
| for Quickwit). I think Quickwit is essentially
| Lucene/Elasticsearch (i.e. sparse queries or BM25) and
| Turbopuffer does vector search (or dense queries) like, say,
| Faiss/Pinecone/Qdrant/Vectorize, both over object storage.
| pushrax wrote:
| It's true that turbopuffer does vector search, though it
| also does BM25.
|
| The biggest difference at a low level is that turbopuffer
| records have unique primary keys, and can be updated, like
| in a normal database. Old records that were overwritten
| won't be returned in searches. The LSM tree storage engine
| is used to achieve this. The LSM tree also enables
| maintenance of global indexes that can be used for
| efficient retrieval without any time-based filter.
|
| Quickwit records are immutable. You can't overwrite a
| record (well, you can, but overwritten records will also be
| returned in searches). The data files it produces are
| organized into a time series, and if you don't pass a time-
| based filter it has to look at every file.
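A toy sketch of that read path, purely to illustrate the shadowing behavior described above (nothing here is turbopuffer's or Quickwit's actual code):

```python
# A toy illustration of why an LSM layout gives update semantics: reads
# consult newer data first, so a row that was overwritten by primary key is
# shadowed rather than returned.
from typing import Optional


class ToyLSM:
    def __init__(self) -> None:
        self.segments: list[dict] = []  # immutable segments, oldest -> newest
        self.memtable: dict = {}        # in-memory buffer for fresh writes

    def upsert(self, key: str, value: str) -> None:
        self.memtable[key] = value

    def flush(self) -> None:
        # Freeze the memtable into an immutable segment (think: a file on S3).
        if self.memtable:
            self.segments.append(dict(self.memtable))
            self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        # Newest-first lookup: the first hit wins, hiding older versions.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            if key in segment:
                return segment[key]
        return None


db = ToyLSM()
db.upsert("doc1", "v1")
db.flush()
db.upsert("doc1", "v2")   # overwrite; the old version is now shadowed
print(db.get("doc1"))     # -> "v2"
```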
| singhrac wrote:
| Ah I didn't catch that Quickwit had immutable records.
| That explains the focus on log usage. Thanks!
| vidar wrote:
| Can you compare to S3 Athena (ELI5)?
| CyberDildonics wrote:
| Sounds like a filesystem with attributes in a database.
| drodgers wrote:
| I love the object-storage-first approach; it seems like such a
| natural fit for the cloud.
| eknkc wrote:
| Is there a good general-purpose solution where I can store a
| large read-only database in s3 or something and do lookups
| directly on it?
|
| Duckdb can open parquet files over http and query them, but I
| found it to trigger a lot of small requests reading from a bunch
| of places in the files. I mean a lot.
|
| I mostly need key / value lookups and could potentially store
| each key in a separate object in s3, but for a couple hundred
| million objects.. It would be a lot more manageable to have a
| single file and maybe a cacheable index.
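For reference, the pattern being described might look roughly like this with DuckDB's Python API and its httpfs extension; the URL, column names, and key are placeholders.

```python
# A minimal sketch of key/value lookups against a Parquet file served over
# HTTP with DuckDB. The extension fetches byte ranges (footer, row-group
# metadata, then the needed column chunks) instead of the whole file.
import duckdb

con = duckdb.connect()               # a long-lived connection
con.execute("INSTALL httpfs;")       # http(s)/S3 range-request support
con.execute("LOAD httpfs;")

rows = con.execute(
    "SELECT value FROM read_parquet('https://example.com/data.parquet') "
    "WHERE key = ?",
    ["some-key"],
).fetchall()
print(rows)
```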
| jiggawatts wrote:
| > trigger a lot of small requests reading from a bunch of places
| in the files. I mean a lot.
|
| That's... the whole point. That's how Parquet files are
| supposed to be used. They're an improvement over CSV or JSON
| because clients can read small subsets of them efficiently!
|
| For comparison, I've tried a few other client products that
| don't use Parquet files properly and just read the whole file
| every time, no matter how trivial the query is.
| eknkc wrote:
| This makes sense but the problem I had with duckdb + parquet
| is it looks like there is no metadata caching so each and
| every query triggers a lot of requests.
|
| Duckdb can query a remote duckdb database too, in that case
| it looks like there is caching. Which might be better.
|
| I wonder if anyone has actually worked on a specific file format
| for this use case (relatively high-latency random access) to
| minimize reads to as few blocks as possible.
| jiggawatts wrote:
| Sounds like a bug or missing feature in DuckDB more than an
| issue with the format.
| imiric wrote:
| ClickHouse can also read from S3. I'm not sure how it compares
| to DuckDB re efficiency, but it worked fine for my simple use
| case.
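For comparison, a rough sketch of that route with the clickhouse-connect Python client and ClickHouse's s3() table function; the host and bucket are placeholders, and a private bucket would also need credentials passed to s3().

```python
# Point ClickHouse at Parquet files in S3 via the s3() table function;
# the reads against object storage happen server-side.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # placeholder host
result = client.query(
    "SELECT count() FROM s3("
    "'https://my-bucket.s3.amazonaws.com/data/*.parquet', 'Parquet')"
)
print(result.result_rows)
```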
| masterj wrote:
| Neither of these supports indexes afaik. They are designed to
| do fast scans / computation.
| hodgesrm wrote:
| It depends on what you mean by "support." ClickHouse as I
| recall can read min/max indexes from Parquet row groups.
| One of my colleagues is working on a PR to add support for
| bloom filter indexes. So that will be covered as well.
|
| Right now one of the main performance problems is that
| Clickhouse does not cache index metadata yet, so you still
| have to scan files rather than keeping the metadata in
| memory. ClickHouse does this for native MergeTree tables.
| There are a couple of steps to get there but I have no
| doubt that metadata caching will be properly handled soon.
|
| Disclaimer: I work for Altinity, an enterprise provider for
| ClickHouse software.
| orthecreedence wrote:
| Depends what you mean by "indexes." DuckDB can read path
| parameters (ex s3://my-
| bucket/category=beverages/month=2022-01-01/*/*.parquet)
| where `category` and `month` can be filtered at the query
| level, skipping any non-matching files. I think that
| qualifies as an index. Obviously, you'd have to create
| these up-front, or risk moving lots of data between paths.
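A small sketch of that partition-pruning pattern with DuckDB's Python API; the bucket layout and column names are invented for illustration.

```python
# Hive-style partition pruning: only files whose path segments match the
# filters on the partition columns (category, month) are touched at all.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
# Credentials/region for a private bucket would be configured here, e.g.:
# con.execute("SET s3_region = 'us-east-1';")

rows = con.execute(
    "SELECT * FROM read_parquet('s3://my-bucket/*/*/*.parquet', "
    "hive_partitioning = true) "
    "WHERE category = 'beverages' AND month = '2022-01-01'"
).fetchall()
```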
| tionis wrote:
| You could use a SQLite database and do range queries using
| something like this: https://github.com/psanford/sqlite3vfshttp
| https://github.com/phiresky/sql.js-httpvfs
|
| Simon Willison wrote about it:
| https://simonwillison.net/2022/Aug/10/sqlite-http/
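The primitive underneath both of those projects is plain HTTP range requests; a minimal sketch of just that piece (placeholder URL), independent of the SQLite VFS machinery:

```python
# Pull only the byte ranges (e.g. SQLite pages, index blocks) a query
# actually touches, instead of downloading the whole database file.
import requests

url = "https://example.com/big.sqlite3"                      # placeholder
resp = requests.get(url, headers={"Range": "bytes=0-4095"})  # first 4 KiB
print(resp.status_code)   # 206 Partial Content when the server honors Range
print(len(resp.content), "bytes fetched")
```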
| arcanemachiner wrote:
| That whole thing still blows my mind.
| eknkc wrote:
| Yep, this thing is the reason I thought about doing it in the
| first place. Tried duckdb, which has built-in support for
| range requests over http.
|
| The whole idea makes sense, but I feel like the file format
| should be specifically tuned for this use case. Otherwise you
| end up with a lot of range requests because it was designed
| for disk access. I wondered if anything was actually designed
| for that.
| hobofan wrote:
| Parquet and other columnar storage formats are essentially
| already tuned for that.
|
| A lot of requests in themselves shouldn't be that horrible
| with Cloudfront nowadays, as you both have low latency and
| with HTTP2 a low-overhead RPC channel.
|
| There are some potential remedies, but each comes with
| significant architectural impact:
|
| - Bigger range queries; for smallish tables, instead of
| trying to do point-based access for individual rows,
| retrieve bigger chunks at once and scan through them
| locally -> Fewer requests, but likely also more wasted
| bandwidth
|
| - Compute the specific view live with a remote DuckDB ->
| Has the downside of having to introduce a DuckDB instance
| that you have to manage between the browser and S3
|
| - Precompute the data you are interested in into new
| parquet files -> Only works if you can anticipate the query
| patterns enough
|
| I read in the sibling comment that your main issue seems to
| be re-reading of metadata. DuckDB is AFAIK able to cache
| the metadata, but not across instances. I've seen someone
| have the same issue, and the problem was that they only
| created short-lived DuckDB in-memory instances (every time
| they wanted to run a query), so every time the fresh DB had
| to retrieve the metadata again.
| eknkc wrote:
| Thanks for the insights. Precomputing is not really
| suitable for this and the thing is, I'm mostly using it
| as a lookup table on key / value queries. I know Duckdb
| is mostly suitable for aggregation but the http range
| query support was too attractive to pass on.
|
| I did some tests, querying "where col = 'x'". If the
| database was a remote duckdb native db, it would issue a
| bunch of http range requests, and the second exact call
| would not trigger any new requests. Also, querying for
| col = foo and then col = foob would yield fewer and fewer
| requests, as I assume it has the necessary data on hand.
|
| Doing it on parquet, with a single long running duckdb
| cli instance, I get the same requests over and over
| again. The difference, though, is that I'd need to "attach"
| the duckdb database under a schema name but would query the
| parquet file using "select from 'http://.../x.parquet'"
| syntax. Maybe this causes it to be ephemeral for each
| query. Will see if the attach syntax also works for
| parquet.
| hobofan wrote:
| I think both should work, but you have to set the object
| cache pragma IIRC:
| https://duckdb.org/docs/configuration/pragmas.html#object-ca...
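A sketch of how that might look in practice, assuming a placeholder remote Parquet file: keep one long-lived connection and enable the object cache so the file's metadata is reused across queries.

```python
# Keep one long-lived connection and turn on DuckDB's object cache so Parquet
# metadata is reused across queries instead of being re-fetched every time.
import duckdb

con = duckdb.connect()                          # reuse this, don't recreate it
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET enable_object_cache = true;")  # cache remote Parquet metadata

# Repeated point lookups should now reuse cached metadata rather than issuing
# the same header/footer range requests on every query.
for key in ("foo", "foob"):
    con.execute(
        "SELECT * FROM read_parquet('https://example.com/x.parquet') "
        "WHERE col = ?",
        [key],
    ).fetchall()
```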
| cdchn wrote:
| >Is there a good general purpose solution where I can store a
| large read only database in s3 or something and do lookups
| directly on it?
|
| I think this is pretty much what AWS Athena is.
| tiew9Vii wrote:
| Cloud-backed SQLite looks like it might be good for this.
| Doesn't support S3 though.
|
| https://sqlite.org/cloudsqlite/doc/trunk/www/index.wiki
| canadiantim wrote:
| LanceDB
| omneity wrote:
| > In 2022, production-grade vector databases were relying on in-
| memory storage
|
| This is irking me. pg_vector has existed since before that,
| doesn't require in-memory storage, and can definitely handle
| vector search for 100m+ documents in a decently performant
| manner. Did they have a particular requirement somewhere?
| jbellis wrote:
| Have you tried it? pgvector performance falls off a cliff once
| you can't cache in ram. Vector search isn't like "normal"
| workloads that follow a nice pareto distribution.
| omneity wrote:
| Tried and deployed in production with similar sized
| collections.
|
| You only need enough memory to load the index, definitely not
| the whole collection. A typical index would most likely fit
| within a few GBs. And even if you need dozens of GBs of RAM
| it won't cost nearly as much as $20k/month as the article
| surmises.
| lyu07282 wrote:
| How do you get to "a few GBs"? A hundred million
| embeddings at 1024 dimensions with 4-byte floats would
| be >400 GB alone.
| omneity wrote:
| I did say the index, not the embeddings themselves. The
| index is a more compact representation of your embeddings
| collection, and that's what you need in memory. One
| approach for indexing is to calculate centroids of your
| embeddings.
|
| You have multiple parameters to tweak, that affect
| retrieval performance as well as the memory footprint of
| your indexes. Here's a rundown on that:
| https://tembo.io/blog/vector-indexes-in-pgvector
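For concreteness, a minimal sketch of that memory/recall trade-off using pgvector's IVFFlat index; it assumes an existing `items` table with an `embedding vector(1024)` column, and the connection string is a placeholder.

```python
# Build a centroid-based (IVFFlat) index so the compact index, rather than
# every raw embedding, is what needs to stay hot.
import psycopg2

conn = psycopg2.connect("dbname=search user=postgres")
with conn, conn.cursor() as cur:
    # `lists` sets the number of centroids: more lists -> bigger index and
    # finer partitioning; fewer lists -> smaller memory footprint but
    # coarser recall.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS items_embedding_ivfflat "
        "ON items USING ivfflat (embedding vector_cosine_ops) "
        "WITH (lists = 1000);"
    )
    # At query time, probing more lists trades latency for recall.
    cur.execute("SET ivfflat.probes = 10;")
conn.close()
```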
| yamumsahoe wrote:
| Unsure if they are comparable, but are this and Quickwit
| comparable?
| hipadev23 wrote:
| Those are some woefully disappointing and incorrect metrics (read
| and write latency are both sub-second, and the storage medium
| would be "Memory + Replicated SSDs") you've got for ClickHouse
| there, but I understand what you're going for and why you
| categorized it where you did.
| endisneigh wrote:
| Slightly relevant - do people really want article
| recommendations? I don't think I've ever read an article and
| wanted a recommendation. Even with this one - I sort of read it
| and that's it; no feeling of wanting recommendations.
|
| Am I alone in this?
|
| In any case this seems like a pretty interesting approach.
| Reminds me of Warpstream which does something similar with S3 to
| replace Kafka.
| nh2 wrote:
| > $3600.00/TB/month
|
| It doesn't have to be that way.
|
| At Hetzner I pay $200/TB/month for RAM. That's 18x cheaper.
|
| Sometimes you can reach the goal faster with less complexity by
| removing the part with the 20x markup.
| TechDebtDevin wrote:
| I will likely never leave Hetzner.
| AYBABTME wrote:
| $200/TB/month for raw RAM, not RAM that's presented to you
| behind a usable API that's distributed and operated by someone
| else, freeing up your time.
|
| It's not particularly useful to compare the cost of a raw,
| unorganized storage medium on a single node to a highly
| organized information platform. It's like saying "this CPU chip
| is expensive, just look at the price of this sand".
| kirmerzlikin wrote:
| AFAIU, $3600 is also the price for "raw RAM" that will be used
| by your common database via syscalls and not via a "usable
| API operated by someone else".
| hodgesrm wrote:
| > It's not particularly useful to compare the cost of raw
| unorganized information medium on a single node, to highly
| organized information platform.
|
| Except that it does prompt you to ask what you could do to
| use that cheap compute and RAM. In the case of Hetzner that
| might be large caches that allow you to apply those resources
| on remote data whilst minimizing transfer and API costs.
| formerly_proven wrote:
| You seem to be quoting the highest figure from the article out
| of context as if it were their pricing, but the opposite is the
| case.
|
| > $3600.00/TB/month (incumbents)
|
| > $70.00/TB/month (turbopuffer)
|
| That's still 3x cheaper than your number and it's a SaaS API,
| not just a piece of rented hardware.
| nh2 wrote:
| > as-if that is their pricing
|
| No, that's not what I'm saying. Their "Storage Costs" table
| shows costs to rent storage from some provider (AWS?). It's
| clear that those are costs that the user has to pay for
| infrastructure needed for certain types of software (e.g.
| Turbopuffer is designed to be running on "S3 + SSD Cache",
| while other software may be designed to run on "RAM + 3x
| SSD").
|
| I'm comparing RAM costs from that table with RAM costs in the
| real world.
|
| The idea backed by that table is "RAM is so expensive, so we
| need to build software to run it on cheaper storage instead".
|
| My statement is "RAM is that expensive only on that provider,
| there are others where it is not; on those, you may just run
| it in RAM and save on software complexity".
|
| You will still need some software for your SaaS API to serve
| queries from RAM, but it won't need the complexity of trying
| to make it fast when serving from a higher-latency storage
| backend (S3).
| yawnxyz wrote:
| Can't wait for the day they get into GA!
| cdchn wrote:
| The very long introductory page has a ton of very juicy data in
| it, even if you don't care about the product itself.
| zX41ZdbW wrote:
| A correction to the article. It mentions, for the "Warehouse" row
| (BigQuery, Snowflake, ClickHouse): read latency >=1s, write
| latency: minutes.
|
| For ClickHouse, it should be: read latency <= 100ms, write
| latency <= 1s.
|
| Logging, real-time analytics, and RAG are also suitable for
| ClickHouse.
| Sirupsen wrote:
| Yeah, thinking about this more I now understand Clickhouse to
| be more of an operational warehouse similar to Materialize,
| Pinot, Druid, etc. if I understand correctly? So bunching with
| BigQuery/Snowflake/Trino/Databricks... wasn't the right
| category (although operational warehouses certainly can have a
| ton of overlap)
|
| I left that category out for simplicity (plenty of others that
| didn't make it into the taxonomy, e.g. queues, nosql, time-
| series, graph, embedded, ..)
| arnorhs wrote:
| This looks super interesting. I'm not that familiar with vector
| databases. I thought they were mostly something used for RAG and
| other AI-related stuff.
|
| Seems like a topic I need to delve into a bit more.
| solatic wrote:
| Is it feasible to try to build this kind of approach (hot SSD
| cache nodes sitting in front of object storage) with prior open-
| source art (Lucene)? Or are the search indexes themselves also
| proprietary in this solution?
|
| Having witnessed some very large Elasticsearch production
| deployments, being able to throw everything into S3 would be
| _incredible_. The applicability here isn't only for vector
| search.
| francoismassot wrote:
| If you don't need vector search and have a very large
| Elasticsearch deployment, you can have a look at Quickwit. It's
| a search engine on object storage, it's OSS, and it works for
| append-only datasets (like logs, traces, ...)
|
| Repo: https://github.com/quickwit-oss/quickwit
| rohitnair wrote:
| Elasticsearch and OpenSearch already support S3-backed indices.
| See features like
| https://opensearch.org/docs/latest/tuning-your-cluster/avail...
| The files in S3 are plain old Lucene segment files (just wrapped
| in OpenSearch snapshots, which provide a way to track metadata
| around those files).
| francoismassot wrote:
| But you don't have fast search on those files stored on
| object storage.
| rohitnair wrote:
| Yes, there is a cold-start penalty, but once the data is
| cached, it is equivalent to disk-backed indices. There is
| also active work being done to improve the performance, for
| example:
| https://github.com/opensearch-project/OpenSearch/issues/1380...
___________________________________________________________________
(page generated 2024-07-10 23:01 UTC)