[HN Gopher] A sharded DuckDB on 63 nodes runs 1T row aggregation...
___________________________________________________________________
A sharded DuckDB on 63 nodes runs 1T row aggregation challenge in 5
sec
Author : tanelpoder
Score : 197 points
Date : 2025-10-24 12:54 UTC (10 hours ago)
(HTM) web link (gizmodata.com)
(TXT) w3m dump (gizmodata.com)
| maxmcd wrote:
| Are there any open sourced sharded query planners like this?
| Something that can aggregate queries across many duckdb/sqlite
| dbs?
| hobofan wrote:
| Not directly DuckDB (though I think it could probably be
| connected to it), but I think Apache Datafusion Ballista[0]
| would be a typical modern open-source benchmark here.
|
| [0]: https://datafusion.apache.org/ballista/contributors-
| guide/ar...
| mritchie712 wrote:
| DeepSeek released smallpond
|
| 0 - https://github.com/deepseek-ai/smallpond
|
| 1 - https://www.definite.app/blog/smallpond (overview for data
| engineers, practical application)
| tgv wrote:
| Impressive, but those 63 nodes were "Azure Standard E64pds v6
| nodes, each providing 64 vCPUs and 504 GiB of RAM." That's 4000
| CPUs and 30TB memory.
| ralegh wrote:
| Just noting that 4000 vCPUs usually means 2000 cores, 4000
| threads
| electroly wrote:
| It doesn't mean that here. Epdsv6 is 1 core = 1 vCPU.
| ralegh wrote:
| I stand corrected...
| RamtinJ95 wrote:
| At that scale it can't be cheaper than just running the same
| workload on BigQuery or Snowflake, can it?
| philbe77 wrote:
| A Standard E64pds v6 costs: $3.744 / hr on demand. At 63
| nodes - the cost is: $235.872 / hr - still cheaper than a
| Snowflake 4XL cluster - costing: 128 credits / hr at
| $3/credit = $384 / hr.
| philbe77 wrote:
| At 5 seconds - the query technically cost: $0.3276
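|
| A rough sketch of that proration (list price assumed; real cloud
| billing granularity would differ):
|
|   # prorate the on-demand hourly price down to a 5-second query
|   nodes, price_per_hr = 63, 3.744         # Azure E64pds v6 on-demand
|   cluster_per_hr = nodes * price_per_hr   # ~= $235.87 / hr
|   print(round(cluster_per_hr / 3600 * 5, 4))   # ~= 0.3276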
| Keyframe wrote:
| That's like calculating a trip cost based on gas cost
| without accounting for car rental, gas station food, and
| especially mandatory bathroom fee after said food.
| philbe77 wrote:
| If I used "spot" instances - it would have been 63 x
| $0.732/hr for a total of: $45.99 / hr.
| ramraj07 wrote:
| Sounds like the equivalent of a 4xl snowflake warehouse, which
| for such queries would take 30 seconds, with the added benefit
| of the data being cold stored in s3. Thus you only pay by the
| minute.
| hobs wrote:
| No, that would be equivalent to 64 4xl snowflake warehouses
| (though the rest of your point still stands).
| philbe77 wrote:
| Cost-wise, 64 4xl Snowflake clusters would cost: 64 x
| $384/hr - for a total of: $24,576/hr (I believe)
| __mharrison__ wrote:
| What was the cost of the duck implementation?
| ramraj07 wrote:
| Apologies for getting it wrong by a few orders of magnitude,
| but that's even more ghastly if it's so overpowered and yet
| takes this long.
| philbe77 wrote:
| Challenge accepted - I'll try it on a 4XL Snowflake to get
| actual perf/cost
| shinypenguin wrote:
| Is the dataset somewhere accessible? Does anyone know more about
| the "1T challenge", or is it just the 1B challenge moved up a
| notch?
|
| Would be interesting to see if it would be possible to handle
| such data on one node, since the servers they are using are quite
| beefy.
| philbe77 wrote:
| Hi shinypenguin - the dataset and challenge are detailed here:
| https://github.com/coiled/1trc
|
| The data is in a publicly accessible bucket, but the requester
| is responsible for any egress fees...
| shinypenguin wrote:
| Hi, thank you for the link and quick response! :)
|
| Do you know if anyone attempted to run this on the least
| amount of hardware possible with reasonable processing times?
| philbe77 wrote:
| Yes - I also had GizmoSQL (a single-node DuckDB database
| engine) take the challenge - with very good performance (2
| minutes for $0.10 in cloud compute cost):
| https://gizmodata.com/blog/gizmosql-one-trillion-row-
| challen...
| simonw wrote:
| I suggest linking to that from the article, it is a useful
| clarification.
| philbe77 wrote:
| Good point - I'll update it...
| achabotl wrote:
| The One Trillion Row Challenge was proposed by Coiled in 2024.
| https://docs.coiled.io/blog/1trc.html
| MobiusHorizons wrote:
| > Once trusted, each worker executes its local query through
| DuckDB and streams intermediate Arrow IPC datasets back to the
| server over secure WebSockets. The server merges and aggregates
| all results in parallel to produce the final SQL result--often in
| seconds.
|
| Can someone explain why you would use websockets in an
| application where neither end is a browser? Why not just use
| regular sockets and cut the overhead of the http layer? Is there
| a real benefit I'm missing?
| philbe77 wrote:
| Hi MobiusHorizons, I happened to use websockets because it was
| the technology I was familiar with. I will try to learn more about
| normal sockets to see if I could perhaps make them work with
| the app. Thanks for the suggestion...
| DanielHB wrote:
| if you really want maximum performance maybe consider using
| CoAP for node-communication:
|
| https://en.wikipedia.org/wiki/Constrained_Application_Protoc.
| ..
|
| It is UDP-based but adds handshakes and retransmissions. But
| I am guessing for your benchmark transmission overhead isn't
| a major concern.
|
| Websockets are not that bad, only the initial connection is
| HTTP. As long as you don't create a ton of connections all
| the time it shouldn't be much slower than a TCP-based socket
| (purely theoretical assumption on my part, I never tested).
| gopalv wrote:
| > will try to learn more about normal sockets to see if I
| could perhaps make them work with the app.
|
| There's a whole skit in the vein of "What have the Romans
| ever done for us?" about ZeroMQ[1] which has probably been lost
| to the search index by now.
|
| As someone who has held a socket wrench before, fought
| tcp_cork and dsack, Websockets isn't a bad abstraction to be
| on top of, especially if you are intending to throw TLS in
| there anyway.
|
| Low-level sockets are like assembly: you can use them, but it is
| a whole box of complexity (you might use them completely raw
| sometimes, like the tickle ACK in the ctdb[2] implementation).
|
| [1] - https://news.ycombinator.com/item?id=32242238
|
| [2] - https://linux.die.net/man/1/ctdb
| sureglymop wrote:
| Wait but websockets aren't over http right? Just the initiation
| and then there is a protocol upgrade or am I wrong? What
| overhead is there otherwise?
| tsimionescu wrote:
| You're right, WebSockets aren't over HTTP, they just use HTTP
| for the connection initiation. They do add some overhead in
| two places: one, when opening a new connection, since you go
| TCP -> TLS -> HTTP -> WebSockets -> Your protocol ; and two,
| they do add some per packet overhead, since there is a
| WebSocket encapsulation of your data - but this is much
| smaller than typical HTTP request/response overhead.
| kevincox wrote:
| > the overhead of the http layer
|
| There isn't much overhead here other than connection setup. For
| HTTP/1 the connection is just "upgraded" to websockets. For
| HTTP/2 I think the HTTP layer still lives on a bit so that you
| can use connection multiplexing (which may be overhead if you
| have no use for it here) but that is still a very thin layer.
|
| So I think the question isn't so much HTTP overhead but
| WebSocket overhead. WebSockets add a bit of message framing and
| whatnot that may be overhead if you don't need it.
|
| In 99% of applications if you need encryption, authentication
| and message framing you would be hard-pressed to find a
| significantly more efficient option.
| toast0 wrote:
| > In 99% of applications if you need encryption,
| authentication and message framing you would be hard-pressed
| to find a significantly more efficient option.
|
| AFAIK, websockets doesn't do authentication? And the
| encryption it does is minimal, optional xor with a key
| disclosed in the handshake. It does do framing.
|
| It's not super common, but if all your messages have a 16-bit
| length, you can just use TLS framing. I would argue that TLS
| framing is ineffecient (multiple length terms), but using it
| by itself is better than adding a redundant framing layer.
|
| But IMHO, there is significant benefit from removing a layer
| where it's unneeded.
| LunaSea wrote:
| > AFAIK, websockets doesn't do authentication?
|
| WebSockets allow for custom headers and query parameters,
| which make it possible to run a basic authentication scheme
| and, later on, additional authorisation in the messages
| themselves if really necessary.
|
| > And the encryption it does is minimal, optional xor with
| a key disclosed in the handshake. It does do framing.
|
| WebSocket Secure (WSS) is the TLS-encrypted version of
| WebSockets (WS) (similar to HTTP vs. HTTPS).
| fyrn_ wrote:
| Worth noting that websockets in the browser don't allow
| custom headers, and custom header support is spotty
| across server impls. It's just not exposed in the
| JavaScript API. There has been an open Chrome bug for
| that for like 15 years.
| LunaSea wrote:
| > Worth noting that websockets in the browser don't allow
| custom headers
|
| They do during the initial handshake (protocol upgrade
| from HTTP to WebSocket).
|
| Afterwards the message body can be used to send
| authorisation data.
|
| Server support will depend on tech but Node.js has great
| support.
| simonw wrote:
| If you're using sockets you still need to come up with some
| kind of protocol on top of those sockets for the data that's
| being transferred - message delimiter, a data format etc. Then
| you have to build client libraries for that protocol.
|
| WebSockets solve a bunch of those low level problems for you,
| in a well specified way with plenty of existing libraries.
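|
| For illustration, this is the sort of length-prefixed framing
| you'd end up hand-rolling on a raw TCP socket (a minimal sketch,
| not any particular project's protocol):
|
|   import socket, struct
|
|   def send_msg(sock: socket.socket, payload: bytes) -> None:
|       # 4-byte big-endian length prefix, then the payload
|       sock.sendall(struct.pack("!I", len(payload)) + payload)
|
|   def recv_exact(sock: socket.socket, n: int) -> bytes:
|       buf = b""
|       while len(buf) < n:
|           chunk = sock.recv(n - len(buf))
|           if not chunk:
|               raise ConnectionError("peer closed mid-message")
|           buf += chunk
|       return buf
|
|   def recv_msg(sock: socket.socket) -> bytes:
|       (length,) = struct.unpack("!I", recv_exact(sock, 4))
|       return recv_exact(sock, length)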
| HumblyTossed wrote:
| ASCII table codes 1, 2, 3 & 4 are pretty simple to use.
| jcheng wrote:
| Not if you're passing binary data
| woodruffw wrote:
| Even beyond that: the ASCII delimiter control codes are
| perfectly valid UTF-8 (despite not being printable), so
| using them for in-band signaling is a recipe for pain on
| arbitrary UTF-8 data.
| mananaysiempre wrote:
| If you know your data is UTF-8, then bytes 0xFE and 0xFF
| are guaranteed to be free. Strictly speaking, 0xC0, 0xC1,
| and 0xF5 through 0xFD also are, but the two top values
| are free even if you are very lax and allow overlong
| encodings as well as codepoints up to 2^32 - 1.
| woodruffw wrote:
| I think it would probably be better to invest in a proper
| framing design than trying to poke holes in UTF-8.
|
| (This is true regardless of UTF-8 -- in-band encodings
| are almost always brittle!)
| dns_snek wrote:
| Sure, in principle. Someone already mentioned binary data,
| then you come up with a framing scheme and get to write
| protocol documentation, but why? What's the benefit?
| HumblyTossed wrote:
| Simplicity.
| ryanjshaw wrote:
| You misspelled "bugs and maintenance nightmare"
| tracker1 wrote:
| Now solve for encryption, authorization, authentication...
|
| WS(S) has in the box solutions for a lot of these... on top
| of that, application gateways, distribution, failover etc.
| You get a lot of already solved solutions in the box, so to
| speak. If you use raw sockets, now you have to implement
| all of these things yourself, and you aren't gaining much
| over just using WSS.
| zerd wrote:
| WebSocket doesn't specify data format, it's just bytes, so
| they have to handle that themselves. It looks like they're
| using Arrow IPC.
|
| Since they're using Arrow they might look into Flight RPC [1]
| which is made for this use case.
|
| [1] https://arrow.apache.org/docs/format/Flight.html
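|
| A minimal sketch of that Arrow IPC round trip with pyarrow (not
| GizmoEdge's actual code; schema and names are made up):
|
|   import pyarrow as pa
|
|   # worker side: serialize a partial aggregate to Arrow IPC bytes
|   table = pa.table({"station": ["a", "b"], "mean_t": [1.5, 2.5]})
|   sink = pa.BufferOutputStream()
|   with pa.ipc.new_stream(sink, table.schema) as writer:
|       writer.write_table(table)
|   payload = sink.getvalue().to_pybytes()  # bytes for the websocket
|
|   # server side: deserialize and merge with other workers' results
|   merged = pa.ipc.open_stream(payload).read_all()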
| lucideer wrote:
| > _overhead of the http layer_
|
| Details of this are well covered in sibling comments, but at a
| higher level, two thoughts on this:
|
| 1. I see a lot of backlash lately against everything being
| HTTP-ified, with little justification other than a presumption
| that it necessarily adds overhead. Perf-wise, HTTP has come a
| long way & modern HTTP is a very efficient protocol. I think
| this has cleared the way for it to be a foundation for many
| more things than in the past. HTTP/3 being over UDP might clear
| the way for more of this (albeit I think the overhead of TCP/IP
| is also often overstated - see e.g. MQTT).
|
| 2. Overhead can be defined in two ways: perf. & maintenance
| complexity. Modern HTTP does add a bit of the latter, so in
| that context it may be a fair concern, but I think the large
| range of competing implementations probably obviates any
| concern here & the alternative usually involves doing something
| custom (albeit simpler), so you run into inconsistency, re-
| invented wheels & bus factor issues there.
| fweimer wrote:
| One reason comes to my mind: HTTP is no longer a stable
| protocol with well-understood security properties. If you
| deploy it today, people expect interoperability with clients
| and servers that implement future protocol upgrades,
| resulting in an ongoing maintenance burden that a different
| protocol choice would avoid.
| lucideer wrote:
| I'm absolutely not an expert of any kind on protocol
| details, so pardon my ignorance here but this surprises me:
| is this true?
|
| High-level spec changes have been infrequent, with long
| dual-support periods, & have generally seen pretty slow, gradual
| client & server adoption. 1.1 was 1997 & continues to have
| widespread support today. 2 & 3 were proposed in 2015 &
| 2016 - almost 2 decades later - & 2 is only really starting
| to see wide support today, with 3 still broadly
| unsupported.
|
| I'm likely missing a lot of nuance in between versioned
| releases though - I know e.g. 2 saw at least two major
| additions/updates, though I thought those were mostly
| additive security features rather than changes to existing
| protocol features.
| immibis wrote:
| Using stuff like HTTP signals a lack of understanding of the
| whole stack. IMO it's important for programmers to understand
| computers. You can write programs without understanding
| computers, but it's best if you go and learn about computers
| first. You can use abstractions but you should also
| understand the abstractions.
|
| There are two ways I've noticed to design an application.
|
| Some people grab some tools out of their toolbox that look
| like they fit - I need a client/server, I know web
| clients/servers, so I'll use a web client/server.
|
| Other people think about what the computer actually has to do
| and then write code to achieve that: Computer A has to send a
| block of data to computer B, and this has to work on Linux
| (which means no bit-banging - you can only go as low as raw
| sockets). This type of person may still take shortcuts, but
| it's by intention, not because it's the only thing they know:
| if HTTP is only one function call in Python, it makes sense
| to use HTTP, not because it's the only thing you know but
| because it's good enough, you know it works well enough for
| this problem, and you can change it later if it becomes a
| bottleneck.
|
| Websockets are an odd choice because they're sort of the
| worst of both worlds: they're barely more convenient than raw
| sockets (there's framing, but framing is easy), but they also
| add a bunch of performance and complexity overhead over raw
| sockets, and more things that can go wrong. So it doesn't
| seem to win on the convenience/laziness front nor the
| performance/security/robustness front. If your client had to
| be a web browser, or could sometimes be a web browser, or if
| you wanted to pass the connections through an HTTP reverse
| proxy, those would be good reasons to choose websockets, but
| none of them are the case here.
| lucideer wrote:
| Acknowledging that a huge number of people (the vast
| majority) are going to use the only option they know rather
| than the best of a set of options they know, I still think
| that for a person who's fully versed in all available
| options, Websockets is a better option than you make out.
|
| > _they're barely more convenient than raw sockets_
|
| Honestly, raw sockets are pretty convenient - I'm not
| convinced Websockets are more convenient at all (assuming
| you already know both & there's no learning curves). Raw
| sockets might even be more convenient.
|
| I think it's features rather than convenience that is more
| likely to drive Websocket usage when comparing the two.
|
| > _they also add a bunch of performance and complexity
| overhead over raw sockets_
|
| This is the part that I was getting at in my above comment.
| I agree in theory, but I just think that the "a bunch"
| quantifier is a bit of an exaggeration. They really add very
| very little performance overhead in practice: a negligible
| amount in most cases.
|
| So for a likely-negligible performance loss, & a likely-
| negligible convenience difference, you're getting a
| protocol with built-in encryption, widespread documentation
| & community support - especially important if you're
| writing code that other people will need to take over &
| maintain - & as you alluded to: extensibility (you may
| never need browser support or http proxying, but having the
| option is compelling when the trade-offs are so
| negligible).
| gr4vityWall wrote:
| > they're barely more convenient than raw sockets (there's
| framing, but framing is easy)
|
| I think it's significantly more convenient if your stack
| touches multiple programming languages. Otherwise you'd
| have to implement framing yourself for all of them. Not
| hard, but I don't see the benefit either.
|
| > they also add a bunch of performance and complexity
| overhead over raw sockets
|
| What performance overhead is there over raw sockets once
| you're past the protocol upgrade? It seems negligible if
| your connection is even slightly long-lived.
| nurettin wrote:
| http 101 upgrade isn't much of an overhead and there are tried
| and tested websocket/ssl libraries with pretty callback
| interfaces versus your custom binary protocol. I would still
| choose the latter but I wouldn't recommend it.
| immibis wrote:
| you can apply this reasoning to a lot of protocols, like why
| not use Nostr over websockets? I mean, I don't see any reason
| to do this with Nostr over websockets, but also, why not?
| it's not much overhead right?
| lantastic wrote:
| Others pointed out plenty of arguments, but the ones I find most
| compelling (not necessarily useful in this context) are:
|
| - you can serve any number of disjoint websocket services on the
| same port via HTTP routing
|
| - this also means you can do TLS termination in one place, so the
| downstream websocket service doesn't have to deal with the
| nitty-gritty of certificates.
|
| Sure, it adds a hop compared to socket passing, and there are
| ways to get similar fanout with TCP with a custom protocol. But
| you need to add this to every stack that interacting components
| use, while websockets libraries exist for most languages that
| are likely to be used in such an endeavor.
| nodesocket wrote:
| > Each GizmoEdge worker pod was provisioned with 3.8 vCPUs (3800
| m) and 30 GiB RAM, allowing roughly 16 workers per node--meaning
| the test required about 63 nodes in total.
|
| How was this node setup chosen? Specifically, 3.8 vCPUs and 30 GiB
| of RAM per worker? Why not just run 16 workers total, each using
| the entire 64 vCPUs and 504 GiB of memory?
| philbe77 wrote:
| Hi nodesocket - I tried to do 4 CPUs per node, but Kubernetes
| takes a small (about 200m) CPU request amount for daemon
| processes - so if you try to request 4 (4000m) CPUs x 16 -
| you'll spill one pod over - fitting only 15 per node.
|
| I was out of quota in Azure - so I had to fit in the 63
| nodes... :)
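|
| The bin-packing arithmetic, roughly (the ~200m system reserve is
| an approximation):
|
|   node_vcpus, reserved = 64.0, 0.2     # ~200m for kubelet/daemons
|   allocatable = node_vcpus - reserved  # ~63.8 vCPUs per node
|   print(int(allocatable // 4.0))       # 15 pods at 4000m requests
|   print(int(allocatable // 3.8))       # 16 pods at 3800m requests
|   print(63 * 16)                       # ~1008 workers on 63 nodes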
| nodesocket wrote:
| But why split up a vm into so many workers instead of
| utilizing the entire vm as a dedicated single worker? What's
| the performance gain and strategy?
| philbe77 wrote:
| I'm not exactly sure yet. My goal was to not have the
| shards be too large so as to be un-manageable. In theory -
| I could just have had 63 (or 64) huge shards - and 1 worker
| per K8s node, but I haven't tried it.
|
| There are so many variables to try - it is a little
| overwhelming...
| nodesocket wrote:
| Would be interesting to test. I'm thinking there may not
| be a benefit to having so many workers on a vm instead of
| just the entire vm resources as a single worker. Could be
| wrong, but that would be a bit surprising.
| boshomi wrote:
| >"In our talk, we will describe the design rationale of the
| DuckLake format and its principles of simplicity, scalability,
| and speed. We will show the DuckDB implementation of DuckLake in
| action and discuss the implications for data architecture in
| general."
|
| Prof. Hannes Muhleisen, cofounder of DuckDB:
|
| [DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us
| by Prof. Hannes
| Muhleisen](https://www.youtube.com/watch?v=YQEUkFWa69o) (53 min)
| Talk from Systems Distributed '25: https://systemsdistributed.com
| ferguess_k wrote:
| Wait until you see a 800-line Tableau query that joins TB data
| with TB data /s
| philbe77 wrote:
| :D that is scary!
| kwillets wrote:
| Don't forget the 2 hour tableau cloud runtime limit.
| NorwegianDude wrote:
| This is very silly. You're not doing the challenge if you do the
| work up front. The idea is that you start with a file and the
| goal is to get the result as fast as possible.
|
| How long did it take to distribute and import the data to all
| workers, what is the total time from file to result?
|
| I can do this a million times faster on one machine, it just
| depends on what work I do up front.
| philbe77 wrote:
| You should do it then, and post it here. I did do it with one
| machine as well: https://gizmodata.com/blog/gizmosql-one-
| trillion-row-challen...
| NorwegianDude wrote:
| Nobody cares if I can do it a million times faster, everyone
| can. It's cheating.
|
| The whole reason you have to account for the time you spend
| setting it up is so that all work spent processing the data
| is timed. Otherwise we can just precompute the answer and
| print it on demand; that is very fast and easy.
|
| Just getting it into memory is a large bottleneck in the
| actual challenge.
|
| If I first put it into a DB with statistics that tracks the
| needed min/max/mean then it's basically instant to retrieve,
| but also slower to set up because that work needs to be done
| somewhere. That's why the challenge is time from file to
| result.
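|
| To make the distinction concrete, a sketch with DuckDB's Python
| API (the 1TRC column names are assumed here):
|
|   import duckdb
|   con = duckdb.connect()
|
|   # "from file to result": scan the raw Parquet every time
|   cold = con.sql("""
|       SELECT station, min(measure), avg(measure), max(measure)
|       FROM read_parquet('measurements-*.parquet')
|       GROUP BY station
|   """).fetchall()
|
|   # vs. paying the cost up front, then answering from a tiny
|   # summary table later
|   con.sql("""
|       CREATE TABLE station_stats AS
|       SELECT station, min(measure) AS min_m,
|              avg(measure) AS mean_m, max(measure) AS max_m
|       FROM read_parquet('measurements-*.parquet')
|       GROUP BY station
|   """)
|   warm = con.sql("SELECT * FROM station_stats").fetchall()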
| ta12653421 wrote:
| When reading such extreme numbers, I'm always wondering what I may
| be doing wrong, when my MSSQL-based CRUD application warms up its
| caches with around 600,000 rows and it takes 30 seconds to load
| them from the DB into RAM on my 4x3GHz machine :-D
|
| Maybe I'm missing something fundamental here
| dgan wrote:
| I also had the misfortune of working with MSSQL; it was so
| unbearably slow because I couldn't upload data in bulk. I guess
| it's forbidden technology.
| Foobar8568 wrote:
| Or you didn't use MSSQL properly; there are at least 2 or 3
| ways to do bulk uploads on MS SQL, not sure about today's era.
| dgan wrote:
| Maybe? Don't know. I never had problems bulk uploading
| into Postgres though; it's right there in the documentation and I
| don't have to have a weird executable on my corporately
| castrated laptop
| rovr138 wrote:
| Would OLAP be better than OLTP for those queries you're doing?
| zwnow wrote:
| This type of stuff is usually hyperoptimized for no reason and
| serves no real purpose, you are doing just fine
| RobinL wrote:
| Yes - OLAP databases are built with a completely different
| performance tradeoff. The way data is stored and the query
| planner are optimised for exactly these types of queries. If
| you're working in an oltp system, you're not necessarily doing
| it wrong, but you may wish to consider exporting the data to
| use in an OLAP tool if you're frequently doing big queries. And
| nowadays there are ways to 'do both', e.g. you can run the DuckDB
| query engine within a Postgres instance.
| riku_iki wrote:
| Could you run some query like select sum(bunch of columns) from
| my_table and see how long it will take?
|
| 600k rows is likely less than 1GB of data, and should take
| about a second to load into RAM on modern NVMe SSD RAIDs.
| mosselman wrote:
| Are there any good instructions somewhere on how to set this up?
| As in, not 63 nodes, but a distributed DuckDB instance.
| philbe77 wrote:
| Hi mosselman, GizmoEdge is not open-source. DeepSeek has
| "smallpond" however, which is open-source:
| https://github.com/deepseek-ai/smallpond
|
| I plan on getting GizmoEdge to production-grade quality
| eventually so folks can use it as a service or licensed
| software. There is a lot of work to do, though :)
| djhworld wrote:
| Interesting and fun
|
| > Workers download, decompress, and materialize their shards into
| DuckDB databases built from Parquet files.
|
| I'm interested to know whether the 5s query time includes this
| materialization step of downloading the files etc, or is this
| result from workers that have been "pre-warmed". Also is the data
| in DuckDB in memory or on disk?
| philbe77 wrote:
| hi djhworld. The 5s does not include the
| download/materialization step. That part takes the worker
| about 1 to 2 minutes for this data set. I didn't know that this
| was going on HackerNews or would be this popular - I will try
| to get more solid stats on that part, and update the blog
| accordingly.
|
| You can have GizmoEdge reference cloud (remote) data as well,
| but of course that would be slower than what I did for the
| challenge here...
|
| The data is on disk - on locally mounted NVMe on each worker -
| in the form of a DuckDB database file (once the worker has
| converted it from parquet). I originally kept the data in
| parquet, but the duckdb format was about 10 to 15% faster - and
| since I was trying to squeeze every drop of performance - I
| went ahead and did that...
|
| Thanks for the questions.
|
| GizmoEdge is not production yet - this was just to demonstrate
| the art of the possible. I wanted to divide-and-conquer a huge
| dataset with a lot of power...
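|
| Per worker, that conversion step presumably looks something like
| this (a sketch; paths and table names are made up):
|
|   import duckdb
|
|   # materialize this worker's Parquet shard into a DuckDB file on
|   # local NVMe, then reopen it read-only for the actual queries
|   con = duckdb.connect("/mnt/nvme/shard_042.duckdb")
|   con.sql("""
|       CREATE TABLE measurements AS SELECT *
|       FROM read_parquet('/mnt/nvme/incoming/shard_042_*.parquet')
|   """)
|   con.close()
|
|   con = duckdb.connect("/mnt/nvme/shard_042.duckdb", read_only=True)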
| philbe77 wrote:
| I've since learned (from a DuckDB blog) - that DuckDB seems
| to do better with the XFS filesystem. I used ext4 for this, so
| I may be able to get another 10 to 15% (maybe!).
|
| DuckDB blog: https://duckdb.org/2025/10/09/benchmark-
| results-14-lts
| sammy2255 wrote:
| How would a 63 node Clickhouse cluster compare? >:)
| lolive wrote:
| Why doesn't such a large-scale test exercise the big feature
| everyone needs, which is inner joins at scale?
| philbe77 wrote:
| This is something we are trying to take a novel approach to as
| well. We have a video demonstrating some TPC-H SF10TB queries
| which perform inner joins, etc. - with GizmoEdge as well:
| https://www.youtube.com/watch?v=hlSx0E2jGMU
| lolive wrote:
| Does that study go into the global vision of DuckLake ?
| 1a527dd5 wrote:
| The title buries the lede a little
|
| > Our cluster ran on Azure Standard E64pds v6 nodes, each
| providing 64 vCPUs and 504 GiB of RAM.
|
| Yes, I would _expect_ when each node has that kind of power it
| should return very impressive speeds.
| vysakh0 wrote:
| DuckDB is an excellent OLAP DB. I have had customers with an S3
| data lake of Parquet who used Databricks or other expensive tools
| when they could easily have used DuckDB. Given we have Cursor /
| Claude Code, it is not that hard for a lot of use cases. I think
| the lack of documentation on how DuckDB functions -- in terms of
| how it loads these files, etc. -- is one of the reasons companies
| are not even trying to adopt it. I think blogs like this are a
| great testament to DuckDB's performance!
| mrtimo wrote:
| I have experience with duckDB but not databricks... from the
| perspective of a company, is a tool like databricks more
| "secure" than duckdb? If my company adopts duckdb as a
| datalake, how do we secure it?
| rapatel0 wrote:
| Duckdb can run as a local instance that points to parquet
| files in an S3 bucket. So your "auth" can live on the layer
| that gives permissions to access that bucket.
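|
| A minimal sketch of that pattern (bucket and path are
| hypothetical; credentials come from the environment / IAM role
| rather than from the database itself):
|
|   import duckdb
|
|   con = duckdb.connect()
|   con.execute("INSTALL httpfs")
|   con.execute("LOAD httpfs")
|   # defer auth to the usual AWS credential chain (env vars, IAM
|   # role); DuckDB autoloads its aws extension for this provider
|   con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain)")
|   print(con.sql("""
|       SELECT count(*)
|       FROM read_parquet('s3://my-datalake/events/*.parquet')
|   """).fetchone())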
| lopatin wrote:
| DuckDB is great but it's barely OLAP right? A key part of OLAP
| is "online". Since the writer process blocks any other
| processes from doing reads, calling it OLAP is a stretch I
| think.
| ansgri wrote:
| Isn't the Online part here about getting results immediately
| after query, as opposed to overnight batch reports? So if you
| don't completely overwhelm DuckDB with writes, it still
| qualifies. The quality you're describing is something like
| "realtime analytics", and is a whole another category:
| Clickhouse doesn't qualify (batching updates, merging etc. --
| but it's clearly OLAP), Druid does.
| lopatin wrote:
| Huh yeah looks like I was totally wrong about what online
| meant. So yeah DuckDB is OLAP. Not that anyone was asking
| me in the first place. Carry on :)
| sdairs wrote:
| ClickHouse is the market leader in real-time analytics so
| it's an interesting take that you don't think it qualifies.
| ansgri wrote:
| For certain definition of realtime, certainly (as would
| any system with bounded ingestion latency), but it's not
| low-latency streaming realtime. Tens of seconds or more
| can pass before new data becomes visible in queries in
| normal operation. There's batching, there's merging, and
| its overall architecture prioritizes throughput over
| latency.
| adammarples wrote:
| I have been playing today with ducklake, and I have to confess
| I don't quite get what it does that duckdb doesn't already do,
| if duckdb can just run on top of parquet files quite happily
| without this extension...
| RobinL wrote:
| Its main purpose is to solve the problem of upserts to a
| data lake, because upsert operations to file based data
| storage are a real pain.
| sdairs wrote:
| Pretty big caveat; 5 seconds AFTER all data has been loaded into
| memory - over 2 minutes if you also factor in reading the files
| from S3 and loading them into memory. So to get this performance
| you will need to run hot: 4000 CPUs and 30TB of memory going 24/7.
| CaptainOfCoit wrote:
| Yeah, pretty misleading it feels like.
|
| For background, here is the initial ideation of the "One
| Trillion Row Challenge" challenge this submission originally
| aimed to participate in: https://docs.coiled.io/blog/1trc.html
| lumost wrote:
| It does make me wonder whether all of the investment in hot-
| loading of GPU infrastructure for LLM workloads is portable to
| databases. 30TB of GPU memory will be roughly 200 B200 cards or
| roughly 1200 per hour compared to the $240/hour pricing for the
| CPU based cluster. The GPU cluster would assuredly crush the
| CPU cluster with a suitable DB given it has 80x the FP32 FLOP
| capacity. You'd expect the in-memory GPU solution to be cheaper
| (assuming optimized software) with a 5x growth in GPU memory
| per card, or today if the workload can be bin-packed
| efficiently.
| eulgro wrote:
| Do databases do matrix multiplication? Why would they even
| use floats?
| lumost wrote:
| Lots of columns are float-valued, and GPU tensor cores can be
| programmed to do many operations between different
| float/int valued vectors. Strings can also be processed in
| this manner as they are simply vectors of integers. NVidia
| publishes official TPC benchmarks for each GPU release.
|
| The idea of a GPU database has been reasonably well
| explored, they are extremely fast - but have been cost
| ineffective due to GPU costs. When the dataset is larger
| than GPU memory, you also incur slowdowns due to cycling
| between CPU and GPU memory.
| radarsat1 wrote:
| what do you think vector databases are? absolutely. i think
| the idea of a database and a "model" could start to really
| be merged this way..
| lisbbb wrote:
| That's a great question. I never worked on any cool NASA
| stuff which would involve large scale number crunching. In
| the corpo space, that's not been my experience at all. We
| were trying to solve big data problems of like, how to
| report on medical claims that are in flight (which are
| hardly ever static until much later after the claim is long
| completed and no longer interesting to anyone) and do it at
| scale of tens of thousands per hour. It never went that
| well, tbh, because it is so hard to validate what a "claim"
| even is since it is changing in real time. I don't think
| excess GPUs would help with that.
| mey wrote:
| So how would that compare to DynamoDB or BigQuery? (I have zero
| interest in paying for running that experiment).
|
| In theory a Zen 5 / Eypc Turin can have up to 4TB of ram. So
| how would a more traditional non-clustered DB stand up?
|
| 1000 k8s pods, each with 30gb of ram, there has to be a bit of
| overhead/wastage going on.
| mulmen wrote:
| Are you asking how Dynamo compares at the storage level? Like
| in comparison to S3? As a key-value database it doesn't even
| have a native aggregation capability. It's a very poor choice
| for OLAP.
|
| BigQuery is comparable to DuckDB. I'm curious how the various
| Redshift flavors (provisioned, serverless, spectrum) and
| Spark compare.
|
| I don't have a lot of experience with DuckDB but it seems
| like Spark is the most comparable.
| fifilura wrote:
| BigQuery is built for the distributed case while DuckDB is
| single-node and requires the workarounds described in the
| article to act like a distributed engine.
| mulmen wrote:
| Ah ok. Maybe that does make sense as a comparison to ask
| if you need an analytics stack or can just grind through
| your prod Dynamo.
| trhway wrote:
| the https://sortbenchmark.org has always stipulated "Must sort
| to and from operating system files on secondary storage." and
| thus felt like a more reasonable estimate of overall system
| performance
| philbe77 wrote:
| hi sdairs, we did store the data on the worker nodes for the
| challenge, but not in memory. We wrote the data to the local
| NVMe SSD storage on the node. Linux may cache the filesystem
| data, but we didn't load the data directly into memory. We like
| to preserve the memory for aggregations, joins, etc. as much as
| possible...
|
| It is true you would need to run the instance(s) 24/7 to get
| the performance all day, the startup time over a couple minutes
| is not ideal. We have a lot of work to do on the engine, but it
| has been a fun learning experience...
| otterley wrote:
| "Linux may cache the filesystem data" means there's a non-
| zero likelihood that the data is in memory unless you dropped
| caches right before you began the benchmark. You don't have
| to explicitly load it into memory for this to be true. What's
| more, unless you are in charge of how memory is used, the
| kernel is going to make its own decisions as to what to cache
| and what to evict, which can make benchmarks unreproducible.
|
| It's important to know what you are benchmarking before you
| start and to control for extrinsic factors as explicitly as
| possible.
| sdairs wrote:
| Thanks for clarifying; I'm not trying to take anything away
| from you, I work in the OLAP space too so it's always good to
| see people pushing it forwards. It would be interesting to
| see a comparison of totally cold Vs hot caches.
|
| Are you looking at distributed queries directly over S3? We
| did this in ClickHouse and can do instant virtual sharding
| over large data sets in S3. We call it parallel replicas
| https://clickhouse.com/blog/clickhouse-parallel-replicas
| tanelpoder wrote:
| (I submitted this link). My interest in this approach in
| general is about observability infra at scale - thinking
| about buffering detailed events, metrics and thread samples
| at the edge and later only extracting things of interest,
| after early filtering at the edge. I'm a SQL & database
| nerd, thus this approach looks interesting.
| jamesblonde wrote:
| With 2 modern NVMe disks per host (15 GB/s) and pcie 5.0, it
| should only take 15s to read 30 TB into memory on 63 hosts.
|
| You can find those disks on Hetzner. Not AWS, though.
| lisbbb wrote:
| Wow (Owen Wilson voice). That's still impressive that it can be
| done. Just having 4k cpus going reliably for any period of time
| is pretty nifty. The problem I have run into is that even big
| companies say they want this kind of compute until they get the
| bill for it.
| afpx wrote:
| I've never used DuckDB, but I was surprised by the 30 GiB of
| memory. Many years ago when I used to use EMR a lot, I would go
| for > 10 TiB of RAM to keep all the data in memory and only spill
| over to SSD on big joins.
| kwillets wrote:
| This is fun, but I'm confused by the architecture. Duckdb is
| based on one-off queries that can scale momentarily and then
| disappear, but this seems to run on k8s and maintain a persistent
| distributed worker pool.
|
| This pool lacks many of the features of a distributed cluster
| such as recovery, quorum, and storage state management, and
| queries run through a single server. What happens when a node
| goes down? Does it give up, replan, or just hang? How does it
| divide up resources between multiple requests? Can it distribute
| joins and other intermediate operators?
|
| I have a soft spot in my heart for duckdb, but its uniqueness is
| in avoiding the large-scale clustering that other engines already
| do reasonably well.
| fHr wrote:
| Better and worth more than all the quantum bs I have to listen
| to.
| hbarka wrote:
| SELECT COUNT(DISTINCT) has entered the challenge.
| philbe77 wrote:
| good point :) - we can re-aggregate HyperLogLog (HLL) sketches
| to get a pretty accurate NDV (Count Distinct) - see
| Query.farm's DataSketches DuckDB extension here:
| https://github.com/Query-farm/datasketches
|
| We also have Bitmap aggregation capabilities for exact count
| distinct - something I worked with Oracle, Snowflake,
| Databricks, and DuckDB labs on implementing. It isn't as fast
| as HLL - but it is 100% accurate...
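|
| For reference, plain DuckDB already exposes an HLL-style
| estimate; the DataSketches extension's extra value is that the
| sketches themselves can be merged across shards (a small sketch
| without the extension):
|
|   import duckdb
|
|   con = duckdb.connect()
|   con.sql("""
|       CREATE TABLE t AS
|       SELECT (random() * 1000000)::INT AS user_id
|       FROM range(10000000)
|   """)
|   print(con.sql("""
|       SELECT approx_count_distinct(user_id),  -- HLL-based estimate
|              count(DISTINCT user_id)          -- exact, but heavier
|       FROM t
|   """).fetchall())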
| fifilura wrote:
| I remember BigQuery had Distinct with HLL accuracy 10 years
| ago but rather quickly replaced it with actual accuracy.
|
| How would you compare this solution to BigQuery?
| up2isomorphism wrote:
| Sensational title, a reflection of "attention is all you
| need".(pun intended)
| fifilura wrote:
| Isn't Trino built for exactly this, without the quirky
| workarounds?
___________________________________________________________________
(page generated 2025-10-24 23:01 UTC)