[HN Gopher] A sharded DuckDB on 63 nodes runs 1T row aggregation...
       ___________________________________________________________________
        
       A sharded DuckDB on 63 nodes runs 1T row aggregation challenge in 5
       sec
        
       Author : tanelpoder
       Score  : 197 points
       Date   : 2025-10-24 12:54 UTC (10 hours ago)
        
 (HTM) web link (gizmodata.com)
 (TXT) w3m dump (gizmodata.com)
        
       | maxmcd wrote:
       | Are there any open sourced sharded query planners like this?
       | Something that can aggregate queries across many duckdb/sqlite
       | dbs?
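        | 
        | (For context, the do-it-yourself version without a planner is
        | just ATTACH plus UNION ALL over the shard files. A minimal
        | sketch, assuming two local shards that each hold a
        | measurements(station, v) table - names are made up:
        | 
        |     import duckdb
        | 
        |     con = duckdb.connect()
        |     # scatter-gather by hand: attach each shard read-only
        |     con.sql("ATTACH 'shard1.duckdb' AS s1 (READ_ONLY)")
        |     con.sql("ATTACH 'shard2.duckdb' AS s2 (READ_ONLY)")
        |     print(con.sql("""
        |         SELECT station, min(v), max(v), avg(v)
        |         FROM (SELECT * FROM s1.main.measurements
        |               UNION ALL
        |               SELECT * FROM s2.main.measurements)
        |         GROUP BY station
        |     """).fetchall())
        | 
        | What I'm really after is something that plans and pushes the
        | partial aggregates down automatically instead of me writing
        | the UNION ALL by hand.)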
        
         | hobofan wrote:
          | Not DuckDB directly (though I think it could be connected to
          | it), but I think Apache DataFusion Ballista[0] would be the
          | typical modern open-source benchmark here.
         | 
         | [0]: https://datafusion.apache.org/ballista/contributors-
         | guide/ar...
        
         | mritchie712 wrote:
         | DeepSeek released smallpond
         | 
         | 0 - https://github.com/deepseek-ai/smallpond
         | 
         | 1 - https://www.definite.app/blog/smallpond (overview for data
         | engineers, practical application)
        
       | tgv wrote:
        | Impressive, but those 63 nodes were "Azure Standard E64pds v6
        | nodes, each providing 64 vCPUs and 504 GiB of RAM." That's
        | roughly 4,000 vCPUs and over 30 TB of memory.
        
         | ralegh wrote:
         | Just noting that 4000 vCPUs usually means 2000 cores, 4000
         | threads
        
           | electroly wrote:
           | It doesn't mean that here. Epdsv6 is 1 core = 1 vCPU.
        
             | ralegh wrote:
             | I stand corrected...
        
         | RamtinJ95 wrote:
         | At that scale it cannot be cheaper than just running the same
         | workload on BigQuery or Snowflake or?
        
           | philbe77 wrote:
           | A Standard E64pds v6 costs: $3.744 / hr on demand. At 63
           | nodes - the cost is: $235.872 / hr - still cheaper than a
           | Snowflake 4XL cluster - costing: 128 credits / hr at
           | $3/credit = $384 / hr.
        
             | philbe77 wrote:
             | At 5 seconds - the query technically cost: $0.3276
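              | 
              | The arithmetic, for anyone checking (a quick sketch using
              | the on-demand rate above):
              | 
              |     hourly = 63 * 3.744        # $235.872/hr for 63 nodes
              |     print(round(hourly / 3600 * 5, 4))  # 0.3276 for 5 sec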
        
               | Keyframe wrote:
               | That's like calculating a trip cost based on gas cost
               | without accounting for car rental, gas station food, and
               | especially mandatory bathroom fee after said food.
        
             | philbe77 wrote:
             | If I used "spot" instances - it would have been 63 x
             | $0.732/hr for a total of: $45.99 / hr.
        
         | ramraj07 wrote:
         | Sounds like the equivalent of a 4xl snowflake warehouse, which
         | for such queries would take 30 seconds, with the added benefit
         | of the data being cold stored in s3. Thus you only pay by the
         | minute.
        
           | hobs wrote:
           | No, that would be equivalent to 64 4xl snowflake warehouses
           | (though the rest of your point still stands).
        
             | philbe77 wrote:
             | Cost-wise, 64 4xl Snowflake clusters would cost: 64 x
             | $384/hr - for a total of: $24,576/hr (I believe)
        
               | __mharrison__ wrote:
               | What was the cost of the duck implementation?
        
             | ramraj07 wrote:
              | Apologies for getting it wrong by a few orders of
              | magnitude, but that's even more ghastly if it's so
              | overpowered and yet takes this long.
        
           | philbe77 wrote:
           | Challenge accepted - I'll try it on a 4XL Snowflake to get
           | actual perf/cost
        
       | shinypenguin wrote:
       | Is the dataset somewhere accessible? Does anyone know more about
       | the "1T challenge", or is it just the 1B challenge moved up a
       | notch?
       | 
       | Would be interesting to see if it would be possible to handle
       | such data on one node, since the servers they are using are quite
       | beefy.
        
         | philbe77 wrote:
         | Hi shinypenguin - the dataset and challenge are detailed here:
         | https://github.com/coiled/1trc
         | 
         | The data is in a publicly accessible bucket, but the requester
         | is responsible for any egress fees...
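          | 
          | The query itself is a single GROUP BY. A minimal sketch of
          | the aggregation in plain DuckDB - the file glob and the
          | station/measure column names are assumptions here, so check
          | the 1trc repo for the exact layout:
          | 
          |     import duckdb
          | 
          |     con = duckdb.connect()
          |     print(con.sql("""
          |         SELECT station,
          |                min(measure) AS min_m,
          |                avg(measure) AS mean_m,
          |                max(measure) AS max_m
          |         FROM read_parquet('measurements-*.parquet')
          |         GROUP BY station
          |         ORDER BY station
          |     """).df())
          | 
          | GizmoEdge's job is to run that same local aggregation on each
          | worker's shard and merge the partial results at the server.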
        
           | shinypenguin wrote:
           | Hi, thank you for the link and quick response! :)
           | 
           | Do you know if anyone attempted to run this on the least
           | amount of hardware possible with reasonable processing times?
        
             | philbe77 wrote:
             | Yes - I also had GizmoSQL (a single-node DuckDB database
             | engine) take the challenge - with very good performance (2
             | minutes for $0.10 in cloud compute cost):
             | https://gizmodata.com/blog/gizmosql-one-trillion-row-
             | challen...
        
           | simonw wrote:
           | I suggest linking to that from the article, it is a useful
           | clarification.
        
             | philbe77 wrote:
             | Good point - I'll update it...
        
         | achabotl wrote:
         | The One Trillion Row Challenge was proposed by Coiled in 2024.
         | https://docs.coiled.io/blog/1trc.html
        
       | MobiusHorizons wrote:
       | > Once trusted, each worker executes its local query through
       | DuckDB and streams intermediate Arrow IPC datasets back to the
       | server over secure WebSockets. The server merges and aggregates
       | all results in parallel to produce the final SQL result--often in
       | seconds.
       | 
       | Can someone explain why you would use websockets in an
       | application where neither end is a browser? Why not just use
       | regular sockets and cut the overhead of the http layer? Is there
       | a real benefit I'm missing?
        
         | philbe77 wrote:
         | Hi MobiusHorizons, I happened to use websockets b/c it was the
         | technology I was familiar with. I will try to learn more about
         | normal sockets to see if I could perhaps make them work with
         | the app. Thanks for the suggestion...
        
           | DanielHB wrote:
           | if you really want maximum performance maybe consider using
           | CoAP for node-communication:
           | 
           | https://en.wikipedia.org/wiki/Constrained_Application_Protoc.
           | ..
           | 
           | It is UDP-based but adds handshakes and retransmissions. But
           | I am guessing for your benchmark transmission overhead isn't
           | a major concern.
           | 
            | Websockets are not that bad; only the initial connection is
            | HTTP. As long as you don't create a ton of connections all
            | the time, it shouldn't be much slower than a TCP-based
            | socket (a purely theoretical assumption on my part - I've
            | never tested it).
        
           | gopalv wrote:
           | > will try to learn more about normal sockets to see if I
           | could perhaps make them work with the app.
           | 
            | There's a whole skit in the vein of "What have the Romans
            | ever done for us?" about ZeroMQ[1], which has probably been
            | lost to the search index by now.
           | 
           | As someone who has held a socket wrench before, fought
           | tcp_cork and dsack, Websockets isn't a bad abstraction to be
           | on top of, especially if you are intending to throw TLS in
           | there anyway.
           | 
            | Low-level sockets are like assembly: you can use them, but
            | they are a whole box of complexity (though you might use
            | them completely raw sometimes, like the tickle ACKs in the
            | ctdb[2] implementation).
           | 
           | [1] - https://news.ycombinator.com/item?id=32242238
           | 
           | [2] - https://linux.die.net/man/1/ctdb
        
         | sureglymop wrote:
         | Wait but websockets aren't over http right? Just the initiation
         | and then there is a protocol upgrade or am I wrong? What
         | overhead is there otherwise?
        
           | tsimionescu wrote:
           | You're right, WebSockets aren't over HTTP, they just use HTTP
           | for the connection initiation. They do add some overhead in
           | two places: one, when opening a new connection, since you go
           | TCP -> TLS -> HTTP -> WebSockets -> Your protocol ; and two,
           | they do add some per packet overhead, since there is a
           | WebSocket encapsulation of your data - but this is much
           | smaller than typical HTTP request/response overhead.
        
         | kevincox wrote:
         | > the overhead of the http layer
         | 
          | There isn't much overhead here other than connection setup.
          | For HTTP/1 the connection is just "upgraded" to websockets.
          | For HTTP/2 I think the HTTP layer still lives on a bit so that
          | you can use connection multiplexing (which may be overhead if
          | you have no use for it here), but that is still a very thin
          | layer.
         | 
         | So I think the question isn't so much HTTP overhead but
         | WebSocket overhead. WebSockets add a bit of message framing and
         | whatnot that may be overhead if you don't need it.
         | 
         | In 99% of applications if you need encryption, authentication
         | and message framing you would be hard-pressed to find a
         | significantly more efficient option.
        
           | toast0 wrote:
           | > In 99% of applications if you need encryption,
           | authentication and message framing you would be hard-pressed
           | to find a significantly more efficient option.
           | 
           | AFAIK, websockets doesn't do authentication? And the
           | encryption it does is minimal, optional xor with a key
           | disclosed in the handshake. It does do framing.
           | 
            | It's not super common, but if all your messages have a
            | 16-bit length, you can just use TLS framing. I would argue
            | that TLS framing is inefficient (multiple length terms), but
            | using it by itself is better than adding a redundant framing
            | layer.
            | 
            | But IMHO, there is significant benefit in removing a layer
            | where it's unneeded.
        
             | LunaSea wrote:
             | > AFAIK, websockets doesn't do authentication?
             | 
              | WebSocket allows for custom headers and query parameters,
              | which make it possible to run a basic authentication
              | scheme, plus additional authorisation in the messages
              | themselves later on if really necessary.
             | 
             | > And the encryption it does is minimal, optional xor with
             | a key disclosed in the handshake. It does do framing.
             | 
              | WebSocket Secure (WSS) is the TLS-encrypted version of
              | WebSocket (WS), similar to HTTP vs. HTTPS.
        
               | fyrn_ wrote:
                | Worth noting that websockets in the browser don't allow
                | custom headers, and custom header support is spotty
                | across server impls. It's just not exposed in the
                | JavaScript API. There has been an open Chrome bug for
                | that for like 15 years.
        
               | LunaSea wrote:
               | > Worth noting that wbesockets in the browser don't allow
               | custom headers
               | 
               | They do during the initial handshake (protocol upgrade
               | from HTTP to WebSocket).
               | 
               | Afterwards the message body can be used to send
               | authorisation data.
               | 
               | Server support will depend on tech but Node.js has great
               | support.
        
         | simonw wrote:
         | If you're using sockets you still need to come up with some
         | kind of protocol on top of those sockets for the data that's
         | being transferred - message delimiter, a data format etc. Then
         | you have to build client libraries for that protocol.
         | 
         | WebSockets solve a bunch of those low level problems for you,
         | in a well specified way with plenty of existing libraries.
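          | 
          | As a sketch of what "coming up with a protocol" means in
          | practice, even the bare-minimum length-prefix framing you'd
          | hand-roll over a plain TCP socket looks something like this
          | (illustrative only):
          | 
          |     import struct
          | 
          |     def send_msg(sock, payload: bytes):
          |         # 4-byte big-endian length prefix, then the payload
          |         sock.sendall(struct.pack("!I", len(payload)) + payload)
          | 
          |     def recv_exactly(sock, n: int) -> bytes:
          |         buf = bytearray()
          |         while len(buf) < n:
          |             chunk = sock.recv(n - len(buf))
          |             if not chunk:
          |                 raise ConnectionError("peer closed mid-message")
          |             buf.extend(chunk)
          |         return bytes(buf)
          | 
          |     def recv_msg(sock) -> bytes:
          |         (length,) = struct.unpack("!I", recv_exactly(sock, 4))
          |         return recv_exactly(sock, length)
          | 
          | ...and that's before TLS, auth, reconnects, backpressure and
          | client libraries in other languages.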
        
           | HumblyTossed wrote:
            | ASCII control codes 1, 2, 3 & 4 are pretty simple to use.
        
             | jcheng wrote:
             | Not if you're passing binary data
        
               | woodruffw wrote:
               | Even beyond that: the ASCII delimiter control codes are
               | perfectly valid UTF-8 (despite not being printable), so
               | using them for in-band signaling is a recipe for pain on
               | arbitrary UTF-8 data.
        
               | mananaysiempre wrote:
                | If you know your data is UTF-8, then bytes 0xFE and 0xFF
                | are guaranteed to be free. Strictly speaking, 0xC0, 0xC1,
                | and 0xF5 through 0xFD also are, but the two top values
                | are free even if you are very lax and allow overlong
                | encodings as well as codepoints up to 2^32 - 1.
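                | 
                | A minimal sketch of that trick, for text-only payloads
                | (0xFE never appears in well-formed UTF-8):
                | 
                |     DELIM = b"\xfe"
                | 
                |     def frame(messages):
                |         return DELIM.join(
                |             m.encode("utf-8") for m in messages)
                | 
                |     def unframe(buf):
                |         return [c.decode("utf-8")
                |                 for c in buf.split(DELIM) if c]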
        
               | woodruffw wrote:
               | I think it would probably be better to invest in a proper
               | framing design than trying to poke holes in UTF-8.
               | 
               | (This is true regardless of UTF-8 -- in-band encodings
               | are almost always brittle!)
        
             | dns_snek wrote:
             | Sure, in principle. Someone already mentioned binary data,
             | then you come up with a framing scheme and get to write
             | protocol documentation, but why? What's the benefit?
        
               | HumblyTossed wrote:
               | Simplicity.
        
               | ryanjshaw wrote:
               | You misspelled "bugs and maintenance nightmare"
        
             | tracker1 wrote:
             | Now solve for encryption, authorization, authentication...
             | 
             | WS(S) has in the box solutions for a lot of these... on top
             | of that, application gateways, distribution, failover etc.
             | You get a lot of already solved solutions in the box, so to
             | speak. If you use raw sockets, now you have to implement
             | all of these things yourself, and you aren't gaining much
             | over just using WSS.
        
           | zerd wrote:
            | WebSocket doesn't specify a data format - it's just bytes -
            | so they have to handle that themselves. It looks like
            | they're using Arrow IPC.
           | 
           | Since they're using Arrow they might look into Flight RPC [1]
           | which is made for this use case.
           | 
           | [1] https://arrow.apache.org/docs/format/Flight.html
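            | 
            | A minimal Flight sketch of the worker/coordinator exchange
            | (host, port and ticket name are made up; in the real system
            | each worker would run its local DuckDB shard query inside
            | do_get):
            | 
            |     import pyarrow as pa
            |     import pyarrow.flight as flight
            | 
            |     class ShardServer(flight.FlightServerBase):
            |         def __init__(self, location, table: pa.Table):
            |             super().__init__(location)
            |             self._table = table
            | 
            |         def do_get(self, context, ticket):
            |             # stream this shard's partial result as Arrow
            |             return flight.RecordBatchStream(self._table)
            | 
            |     # worker process would do:
            |     #   ShardServer("grpc://0.0.0.0:8815", table).serve()
            | 
            |     # coordinator side: pull partial results from a worker
            |     client = flight.connect("grpc://worker-1:8815")
            |     partial = client.do_get(
            |         flight.Ticket(b"partial-agg")).read_all()
            | 
            | You get gRPC transport, TLS and zero-copy Arrow streaming
            | without inventing a wire format.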
        
         | lucideer wrote:
         | > _overhead of the http layer_
         | 
          | The details of this are well covered in sibling comments, but
          | at a higher level, two thoughts on this:
         | 
         | 1. I see a lot of backlash lately against everything being
         | HTTP-ified, with little justification other than a presumption
         | that it necessarily adds overhead. Perf-wise, HTTP has come a
         | long way & modern HTTP is a very efficient protocol. I think
         | this has cleared the way for it to be a foundation for many
         | more things than in the past. HTTP/3 being over UDP might clear
         | the way for more of this (albeit I think the overhead of TCP/IP
         | is also often overstated - see e.g. MQTT).
         | 
         | 2. Overhead can be defined in two ways: perf. & maintenance
         | complexity. Modern HTTP does add a bit of the latter, so in
         | that context it may be a fair concern, but I think the large
         | range of competing implementations probably obviates any
         | concern here & the alternative usually involves doing something
         | custom (albeit simpler), so you run into inconsistency, re-
         | invented wheels & bus factor issues there.
        
           | fweimer wrote:
           | One reason comes to my mind: HTTP is no longer a stable
           | protocol with well-understood security properties. If you
           | deploy it today, people expect interoperability with clients
           | and servers that implement future protocol upgrades,
           | resulting in an ongoing maintenance burden that a different
           | protocol choice would avoid.
        
             | lucideer wrote:
             | I'm absolutely not an expert of any kind on protocol
             | details, so pardon my ignorance here but this surprises me:
             | is this true?
             | 
              | High-level spec changes have been infrequent, with long
              | dual-support periods, & have generally seen pretty slow,
              | gradual client & server adoption. 1.1 was 1997 & continues
              | to have
             | widespread support today. 2 & 3 were proposed in 2015 &
             | 2016 - almost 2 decades later - & 2 is only really starting
             | to see wide support today, with 3 still broadly
             | unsupported.
             | 
             | I'm likely missing a lot of nuance in between versioned
             | releases though - I know e.g. 2 saw at least two major
             | additions/updates, though I thought those were mostly
             | additive security features rather than changes to existing
             | protocol features.
        
           | immibis wrote:
           | Using stuff like HTTP signals a lack of understanding of the
           | whole stack. IMO it's important for programmers to understand
           | computers. You can write programs without understanding
           | computers, but it's best if you go and learn about computers
           | first. You can use abstractions but you should also
           | understand the abstractions.
           | 
           | There are two ways I've noticed to design an application.
           | 
           | Some people grab some tools out of their toolbox that look
           | like they fit - I need a client/server, I know web
           | clients/servers, so I'll use a web client/server.
           | 
           | Other people think about what the computer actually has to do
           | and then write code to achieve that: Computer A has to send a
           | block of data to computer B, and this has to work on Linux
           | (which means no bit-banging - you can only go as low as raw
           | sockets). This type of person may still take shortcuts, but
           | it's by intention, not because it's the only thing they know:
           | if HTTP is only one function call in Python, it makes sense
           | to use HTTP, not because it's the only thing you know but
           | because it's good enough, you know it works well enough for
           | this problem, and you can change it later if it becomes a
           | bottleneck.
           | 
           | Websockets are an odd choice because they're sort of the
            | worst of both worlds: they're barely more convenient than
            | raw sockets (there's framing, but framing is easy), but they
            | also
           | add a bunch of performance and complexity overhead over raw
           | sockets, and more things that can go wrong. So it doesn't
           | seem to win on the convenience/laziness front nor the
           | performance/security/robustness front. If your client had to
           | be a web browser, or could sometimes be a web browser, or if
           | you wanted to pass the connections through an HTTP reverse
           | proxy, those would be good reasons to choose websockets, but
           | none of them are the case here.
        
             | lucideer wrote:
             | Acknowledging that a huge number of people (the vast
             | majority) are going to use the only option they know rather
             | than the best of a set of options they know, I still think
             | that for a person who's fully versed in all available
             | options, Websockets is a better option than you make out.
             | 
              | > _they're barely more convenient than raw sockets_
             | 
             | Honestly, raw sockets are pretty convenient - I'm not
             | convinced Websockets are more convenient at all (assuming
             | you already know both & there's no learning curves). Raw
             | sockets might even be more convenient.
             | 
             | I think it's features rather than convenience that is more
             | likely to drive Websocket usage when comparing the two.
             | 
             | > _they also add a bunch of performance and complexity
             | overhead over raw sockets_
             | 
             | This is the part that I was getting at in my above comment.
             | I agree in theory, but I just think that the "a bunch"
              | quantifier is a bit of an exaggeration. They really add very
             | very little performance overhead in practice: a negligible
             | amount in most cases.
             | 
             | So for a likely-negligible performance loss, & a likely-
             | negligible convenience difference, you're getting a
             | protocol with built-in encryption, widespread documentation
             | & community support - especially important if you're
             | writing code that other people will need to take over &
             | maintain - & as you alluded to: extensibility (you may
             | never need browser support or http proxying, but having the
             | option is compelling when the trade-offs are so
             | negligible).
        
             | gr4vityWall wrote:
             | > they're barely more convenient as raw sockets (there's
             | framing, but framing is easy)
             | 
              | I think it's significantly more convenient if your stack
              | touches multiple programming languages. Otherwise you'd
              | have to implement framing yourself in all of them. Not
              | hard, but I don't see the benefit either.
             | 
             | > they also add a bunch of performance and complexity
             | overhead over raw sockets
             | 
              | What performance overhead is there over raw sockets once
              | you're past the protocol upgrade? It seems negligible if
              | your connection is even slightly long-lived.
        
         | nurettin wrote:
         | http 101 upgrade isn't much of an overhead and there are tried
         | and tested websocket/ssl libraries with pretty callback
         | interfaces versus your custom binary protocol. I would still
         | choose the latter but I wouldn't recommend it.
        
           | immibis wrote:
            | You can apply this reasoning to a lot of protocols - like,
            | why not use Nostr over websockets? I mean, I don't see any
            | reason to do this with Nostr over websockets, but also, why
            | not? It's not much overhead, right?
        
         | lantastic wrote:
          | Others have pointed out plenty of arguments, but the ones I
          | find most compelling (not necessarily useful in this context)
          | are:
          | 
          | - you can serve any number of disjoint websocket services on
          | the same port via HTTP routing
          | - this also means you can do TLS termination in one place, so
          | the downstream websocket service doesn't have to deal with
          | the nitty-gritty of certificates.
         | 
         | Sure, it adds a hop compared to socket passing, and there are
         | ways to get similar fanout with TCP with a custom protocol. But
         | you need to add this to every stack that interacting components
         | use, while websockets libraries exist for most languages that
         | are likely to be used in such an endeavor.
        
       | nodesocket wrote:
       | > Each GizmoEdge worker pod was provisioned with 3.8 vCPUs (3800
       | m) and 30 GiB RAM, allowing roughly 16 workers per node--meaning
       | the test required about 63 nodes in total.
       | 
        | How was this node setup chosen? Specifically, 3.8 vCPUs and 30
        | GiB RAM per worker? Why not just run 16 workers total, using
        | the entire 64 vCPUs and 504 GiB of memory each?
        
         | philbe77 wrote:
         | Hi nodesocket - I tried to do 4 CPUs per node, but Kubernetes
         | takes a small (about 200m) CPU request amount for daemon
         | processes - so if you try to request 4 (4000m) CPUs x 16 -
         | you'll spill one pod over - fitting only 15 per node.
         | 
         | I was out of quota in Azure - so I had to fit in the 63
         | nodes... :)
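          | 
          | The packing math, roughly (the ~200m kubelet/daemon
          | reservation is an approximation):
          | 
          |     allocatable_m = 64 * 1000 - 200  # millicores for pods
          |     print(allocatable_m // 4000)     # 15 pods at 4000m each
          |     print(allocatable_m // 3800)     # 16 pods at 3800m each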
        
           | nodesocket wrote:
           | But why split up a vm into so many workers instead of
           | utilizing the entire vm as a dedicated single worker? What's
           | the performance gain and strategy?
        
             | philbe77 wrote:
              | I'm not exactly sure yet. My goal was to not have the
              | shards be so large as to be unmanageable. In theory I
              | could just have had 63 (or 64) huge shards and 1 worker
              | per K8s node, but I haven't tried it.
             | 
             | There are so many variables to try - it is a little
             | overwhelming...
        
               | nodesocket wrote:
               | Would be interesting to test. I'm thinking there may not
               | be a benefit to having so many workers on a vm instead of
               | just the entire vm resources as a single worker. Could be
               | wrong, but that would be a bit surprising.
        
       | boshomi wrote:
       | >"In our talk, we will describe the design rationale of the
       | DuckLake format and its principles of simplicity, scalability,
       | and speed. We will show the DuckDB implementation of DuckLake in
       | action and discuss the implications for data architecture in
       | general.
       | 
       | Prof. Hannes Muhleisen, cofounder of DuckDB:
       | 
       | [DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us
       | by Prof. Hannes
       | Muhleisen](https://www.youtube.com/watch?v=YQEUkFWa69o) (53 min)
       | Talk from Systems Distributed '25: https://systemsdistributed.com
        
       | ferguess_k wrote:
       | Wait until you see a 800-line Tableau query that joins TB data
       | with TB data /s
        
         | philbe77 wrote:
         | :D that is scary!
        
         | kwillets wrote:
         | Don't forget the 2 hour tableau cloud runtime limit.
        
       | NorwegianDude wrote:
       | This is very silly. You're not doing the challenge if you do the
       | work up front. The idea is that you start with a file and the
       | goal is to get the result as fast as possible.
       | 
       | How long did it take to distribute and import the data to all
       | workers, what is the total time from file to result?
       | 
       | I can do this a million times faster on one machine, it just
       | depends on what work I do up front.
        
         | philbe77 wrote:
         | You should do it then, and post it here. I did do it with one
         | machine as well: https://gizmodata.com/blog/gizmosql-one-
         | trillion-row-challen...
        
           | NorwegianDude wrote:
           | Nobody cares if I can do it a million times faster, everyone
           | can. It's cheating.
           | 
            | The whole reason you have to account for the time you spend
            | setting it up is so that all the work spent processing the
            | data is timed. Otherwise we can just precompute the answer
            | and print it on demand, which is very fast and easy.
           | 
           | Just getting it into memory is a large bottleneck in the
           | actual challenge.
           | 
            | If I first put it into a DB with statistics that track the
            | needed min/max/mean, then it's basically instant to
            | retrieve, but also slower to set up, because that work needs
            | to be done somewhere. That's why the challenge measures the
            | time from file to result.
        
       | ta12653421 wrote:
        | When reading such extreme numbers, I always wonder what I might
        | be doing wrong when my MSSQL-based CRUD application warms up its
        | caches with around 600,000 rows and it takes 30 seconds to load
        | them from the DB into RAM on my 4x3GHz machine :-D
       | 
       | Maybe I'm missing something fundamental here
        
         | dgan wrote:
          | I also had the misfortune of working with MSSQL; it was so,
          | so unbearably slow because I couldn't upload data in bulk. I
          | guess it's forbidden technology.
        
           | Foobar8568 wrote:
            | Or you didn't use MSSQL properly - there are at least 2 or 3
            | ways to do bulk uploads in MS SQL. Not sure about today's
            | era.
        
             | dgan wrote:
              | Maybe? Don't know. I never had problems bulk uploading
              | into Postgres though - it's right there in the
              | documentation, and I don't have to have a weird executable
              | on my corporately castrated laptop.
        
         | rovr138 wrote:
         | Would OLAP be better than OLTP for those queries you're doing?
        
         | zwnow wrote:
         | This type of stuff is usually hyperoptimized for no reason and
         | serves no real purpose, you are doing just fine
        
         | RobinL wrote:
          | Yes - OLAP databases are built with a completely different
          | performance tradeoff. The way data is stored and the query
          | planner are optimised for exactly these types of queries. If
          | you're working in an OLTP system, you're not necessarily doing
          | it wrong, but you may wish to consider exporting the data to
          | use in an OLAP tool if you're frequently doing big queries.
          | And nowadays there are ways to 'do both', e.g. you can run the
          | DuckDB query engine within a Postgres instance.
        
         | riku_iki wrote:
          | Could you run a query like select sum(bunch of columns) from
          | my_table and see how long it takes?
          | 
          | 600k rows is likely less than 1 GB of data, and should take
          | about a second to load into RAM on modern NVMe SSD RAIDs.
        
       | mosselman wrote:
        | Are there any good instructions somewhere on how to set this up?
        | As in, not 63 nodes, but a distributed DuckDB instance.
        
         | philbe77 wrote:
         | Hi mosselman, GizmoEdge is not open-source. DeepSeek has
         | "smallpond" however, which is open-source:
         | https://github.com/deepseek-ai/smallpond
         | 
         | I plan on getting GizmoEdge to production-grade quality
         | eventually so folks can use it as a service or licensed
         | software. There is a lot of work to do, though :)
        
       | djhworld wrote:
       | Interesting and fun
       | 
       | > Workers download, decompress, and materialize their shards into
       | DuckDB databases built from Parquet files.
       | 
       | I'm interested to know whether the 5s query time includes this
       | materialization step of downloading the files etc, or is this
       | result from workers that have been "pre-warmed". Also is the data
       | in DuckDB in memory or on disk?
        
         | philbe77 wrote:
          | hi djhworld. The 5s does not include the
          | download/materialization step. That part takes the worker
          | about 1 to 2 minutes for this data set. I didn't know that
          | this was going on HackerNews or would be this popular - I will
          | try to get more solid stats on that part, and update the blog
          | accordingly.
         | 
         | You can have GizmoEdge reference cloud (remote) data as well,
         | but of course that would be slower than what I did for the
         | challenge here...
         | 
         | The data is on disk - on locally mounted NVMe on each worker -
         | in the form of a DuckDB database file (once the worker has
         | converted it from parquet). I originally kept the data in
         | parquet, but the duckdb format was about 10 to 15% faster - and
         | since I was trying to squeeze every drop of performance - I
         | went ahead and did that...
         | 
         | Thanks for the questions.
         | 
         | GizmoEdge is not production yet - this was just to demonstrate
         | the art of the possible. I wanted to divide-and-conquer a huge
         | dataset with a lot of power...
        
           | philbe77 wrote:
            | I've since learned (from a DuckDB blog) that DuckDB seems to
            | do better with the XFS filesystem. I used ext4 for this, so
            | I may be able to get another 10 to 15% (maybe!).
           | 
           | DuckDB blog: https://duckdb.org/2025/10/09/benchmark-
           | results-14-lts
        
       | sammy2255 wrote:
       | How would a 63 node Clickhouse cluster compare? >:)
        
       | lolive wrote:
        | Why doesn't such a large-scale test cover the big feature
        | everyone needs, which is inner joins at scale?
        
         | philbe77 wrote:
         | This is something we are trying to take a novel approach to as
         | well. We have a video demonstrating some TPC-H SF10TB queries
         | which perform inner joins, etc. - with GizmoEdge as well:
         | https://www.youtube.com/watch?v=hlSx0E2jGMU
        
           | lolive wrote:
            | Does that study go into the global vision of DuckLake?
        
       | 1a527dd5 wrote:
       | The title buries the lede a little
       | 
       | > Our cluster ran on Azure Standard E64pds v6 nodes, each
       | providing 64 vCPUs and 504 GiB of RAM.
       | 
       | Yes, I would _expect_ when each node has that kind of power it
       | should return very impressive speeds.
        
       | vysakh0 wrote:
        | DuckDB is an excellent OLAP DB. I have had customers with an S3
        | data lake of Parquet who use Databricks or other expensive
        | tools, when they could easily use DuckDB. Given we have
        | Cursor/Claude Code, it is not that hard for a lot of use cases.
        | I think the lack of documentation on how DuckDB functions - in
        | terms of how it loads these files, etc. - is one of the reasons
        | companies are not even trying to adopt DuckDB. I think blogs
        | like this are a great testament to DuckDB's performance!
        
         | mrtimo wrote:
         | I have experience with duckDB but not databricks... from the
         | perspective of a company, is a tool like databricks more
         | "secure" than duckdb? If my company adopts duckdb as a
         | datalake, how do we secure it?
        
           | rapatel0 wrote:
            | DuckDB can run as a local instance that points to Parquet
            | files in an S3 bucket. So your "auth" can live on the layer
            | that grants permissions to access that bucket.
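            | 
            | A minimal sketch of that setup - bucket name, region and
            | column layout are placeholders, and credentials can also
            | come from DuckDB's S3 secrets or an instance role:
            | 
            |     import duckdb
            | 
            |     con = duckdb.connect()
            |     con.sql("INSTALL httpfs; LOAD httpfs;")
            |     con.sql("SET s3_region = 'us-east-1'")
            |     print(con.sql("""
            |         SELECT count(*)
            |         FROM read_parquet('s3://my-data-lake/*.parquet')
            |     """).fetchall())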
        
         | lopatin wrote:
         | DuckDB is great but it's barely OLAP right? A key part of OLAP
         | is "online". Since the writer process blocks any other
         | processes from doing reads, calling it OLAP is a stretch I
         | think.
        
           | ansgri wrote:
            | Isn't the Online part here about getting results immediately
            | after a query, as opposed to overnight batch reports? So if
            | you don't completely overwhelm DuckDB with writes, it still
            | qualifies. The quality you're describing is something like
            | "realtime analytics", and that is a whole other category:
            | Clickhouse doesn't qualify (batching updates, merging, etc.
            | -- but it's clearly OLAP), Druid does.
        
             | lopatin wrote:
             | Huh yeah looks like I was totally wrong about what online
             | meant. So yeah DuckDB is OLAP. Not that anyone was asking
             | me in the first place. Carry on :)
        
             | sdairs wrote:
             | ClickHouse is the market leader in real-time analytics so
             | it's an interesting take that you don't think it qualifies.
        
               | ansgri wrote:
                | For a certain definition of realtime, certainly (as
                | would any system with bounded ingestion latency), but
                | it's not low-latency streaming realtime. Tens of seconds
                | or more can pass before new data becomes visible in
                | queries in normal operation. There's batching, there's
                | merging, and its overall architecture prioritizes
                | throughput over latency.
        
         | adammarples wrote:
         | I have been playing today with ducklake, and I have to confess
         | I don't quite get what it does that duckdb doesn't already do,
         | if duckdb can just run on top of parquet files quite happily
         | without this extension...
        
           | RobinL wrote:
            | Its main purpose is to solve the problem of upserts to a
            | data lake, because upsert operations on file-based data
            | storage are a real pain.
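            | 
            | Without it, an "update" to plain Parquet means rewriting
            | files. A rough sketch of the pain (file names made up):
            | 
            |     import duckdb
            | 
            |     con = duckdb.connect()
            |     con.sql("COPY (SELECT 1 AS id, 'old' AS status) "
            |             "TO 'events.parquet' (FORMAT parquet)")
            |     # no in-place UPDATE on a Parquet file:
            |     # re-materialize the whole thing instead
            |     con.sql("""
            |         COPY (
            |             SELECT id,
            |                    CASE WHEN id = 1 THEN 'new'
            |                         ELSE status END AS status
            |             FROM read_parquet('events.parquet')
            |         ) TO 'events_v2.parquet' (FORMAT parquet)
            |     """)
            | 
            | DuckLake (like Iceberg/Delta) tracks those rewrites in
            | metadata so the engine can expose ordinary UPDATE/MERGE on
            | top.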
        
       | sdairs wrote:
        | Pretty big caveat: 5 seconds AFTER all data has been loaded into
        | memory - over 2 minutes if you also factor in reading the files
        | from S3 and loading them into memory. So to get this performance
        | you will need to run hot: 4,000 CPUs and 30 TB of memory going
        | 24/7.
        
         | CaptainOfCoit wrote:
         | Yeah, pretty misleading it feels like.
         | 
         | For background, here is the initial ideation of the "One
         | Trillion Row Challenge" challenge this submission originally
         | aimed to participate in: https://docs.coiled.io/blog/1trc.html
        
         | lumost wrote:
          | It does make me wonder whether all of the investment in hot-
          | loading of GPU infrastructure for LLM workloads is portable to
          | databases. 30 TB of GPU memory would be roughly 200 B200 cards,
          | or roughly $1,200 per hour compared to the $240/hour pricing
          | for the CPU-based cluster. The GPU cluster would assuredly
          | crush the CPU cluster with a suitable DB, given it has 80x the
          | FP32 FLOP capacity. You'd expect the in-memory GPU solution to
          | be cheaper (assuming optimized software) with a 5x growth in
          | GPU memory per card, or today if the workload can be bin-
          | packed efficiently.
        
           | eulgro wrote:
           | Do databases do matrix multiplication? Why would they even
           | use floats?
        
             | lumost wrote:
              | Lots of columns are float-valued, and GPU tensor cores can
              | be programmed to do many operations between different
              | float/int-valued vectors. Strings can also be processed in
              | this manner, as they are simply vectors of integers.
              | NVidia publishes official TPC benchmarks for each GPU
              | release.
             | 
             | The idea of a GPU database has been reasonably well
             | explored, they are extremely fast - but have been cost
             | ineffective due to GPU costs. When the dataset is larger
             | than GPU memory, you also incur slowdowns due to cycling
             | between CPU and GPU memory.
        
             | radarsat1 wrote:
              | What do you think vector databases are? Absolutely. I
              | think the idea of a database and a "model" could really
              | start to be merged this way.
        
             | lisbbb wrote:
             | That's a great question. I never worked on any cool NASA
             | stuff which would involve large scale number crunching. In
             | the corpo space, that's not been my experience at all. We
             | were trying to solve big data problems of like, how to
             | report on medical claims that are in flight (which are
             | hardly ever static until much later after the claim is long
             | completed and no longer interesting to anyone) and do it at
             | scale of tens of thousands per hour. It never went that
             | well, tbh, because it is so hard to validate what a "claim"
             | even is since it is changing in real time. I don't think
             | excess GPUs would help with that.
        
         | mey wrote:
         | So how would that compare to DynamoDB or BigQuery? (I have zero
         | interest in paying for running that experiment).
         | 
          | In theory a Zen 5 / Epyc Turin can have up to 4 TB of RAM. So
          | how would a more traditional non-clustered DB stand up?
          | 
          | 1000 k8s pods, each with 30 GB of RAM - there has to be a bit
          | of overhead/wastage going on.
        
           | mulmen wrote:
            | Are you asking how Dynamo compares at the storage level?
            | Like in comparison to S3? As a key-value database it doesn't
            | even have a native aggregation capability. It's a very poor
            | choice for OLAP.
           | 
           | BigQuery is comparable to DuckDB. I'm curious how the various
           | Redshift flavors (provisioned, serverless, spectrum) and
           | Spark compare.
           | 
           | I don't have a lot of experience with DuckDB but it seems
           | like Spark is the most comparable.
        
             | fifilura wrote:
                | BigQuery is built for the distributed case, while DuckDB
                | is single-node and requires the workarounds described in
                | the article to act like a distributed engine.
        
               | mulmen wrote:
               | Ah ok. Maybe that does make sense as a comparison to ask
               | if you need an analytics stack or can just grind through
               | your prod Dynamo.
        
         | trhway wrote:
          | https://sortbenchmark.org has always stipulated "Must sort to
          | and from operating system files on secondary storage." and
          | thus feels like a more reasonable estimate of overall system
          | performance.
        
         | philbe77 wrote:
         | hi sdairs, we did store the data on the worker nodes for the
         | challenge, but not in memory. We wrote the data to the local
         | NVMe SSD storage on the node. Linux may cache the filesystem
         | data, but we didn't load the data directly into memory. We like
         | to preserve the memory for aggregations, joins, etc. as much as
         | possible...
         | 
         | It is true you would need to run the instance(s) 24/7 to get
         | the performance all day, the startup time over a couple minutes
         | is not ideal. We have a lot of work to do on the engine, but it
         | has been a fun learning experience...
        
           | otterley wrote:
           | "Linux may cache the filesystem data" means there's a non-
           | zero likelihood that the data in memory unless you dropped
           | caches right before you began the benchmark. You don't have
           | to explicitly load it into memory for this to be true. What's
           | more, unless you are in charge of how memory is used, the
           | kernel is going to make its own decisions as to what to cache
           | and what to evict, which can make benchmarks unreproducible.
           | 
           | It's important to know what you are benchmarking before you
           | start and to control for extrinsic factors as explicitly as
           | possible.
        
           | sdairs wrote:
           | Thanks for clarifying; I'm not trying to take anything away
           | from you, I work in the OLAP space too so it's always good to
           | see people pushing it forwards. It would be interesting to
            | see a comparison of totally cold vs hot caches.
            | 
            | Are you looking at distributed queries directly over S3? We
            | did this in ClickHouse and can do instant virtual sharding
            | over large data sets in S3. We call it parallel replicas:
           | https://clickhouse.com/blog/clickhouse-parallel-replicas
        
             | tanelpoder wrote:
             | (I submitted this link). My interest in this approach in
             | general is about observability infra at scale - thinking
             | about buffering detailed events, metrics and thread samples
              | at the edge and later only extracting things of interest,
             | after early filtering at the edge. I'm a SQL & database
             | nerd, thus this approach looks interesting.
        
           | jamesblonde wrote:
            | With 2 modern NVMe disks per host (15 GB/s each) and PCIe
            | 5.0, it should only take ~15s to read 30 TB into memory
            | across 63 hosts.
            | 
            | You can find those disks on Hetzner. Not AWS, though.
        
         | lisbbb wrote:
         | Wow (Owen Wilson voice). That's still impressive that it can be
         | done. Just having 4k cpus going reliably for any period of time
         | is pretty nifty. The problem I have run into is that even big
         | companies say they want this kind of compute until they get the
         | bill for it.
        
       | afpx wrote:
       | I've never used DuckDB, but I was surprised by the 30 GiB of
       | memory. Many years ago when I used to use EMR a lot, I would go
       | for > 10 TiB of RAM to keep all the data in memory and only spill
       | over to SSD on big joins.
        
       | kwillets wrote:
       | This is fun, but I'm confused by the architecture. Duckdb is
       | based on one-off queries that can scale momentarily and then
       | disappear, but this seems to run on k8s and maintain a persistent
       | distributed worker pool.
       | 
       | This pool lacks many of the features of a distributed cluster
       | such as recovery, quorum, and storage state management, and
       | queries run through a single server. What happens when a node
       | goes down? Does it give up, replan, or just hang? How does it
       | divide up resources between multiple requests? Can it distribute
       | joins and other intermediate operators?
       | 
       | I have a soft spot in my heart for duckdb, but its uniqueness is
       | in avoiding the large-scale clustering that other engines already
       | do reasonably well.
        
       | fHr wrote:
        | Better and worth more than all the quantum BS I have to listen
        | to.
        
       | hbarka wrote:
       | SELECT COUNT(DISTINCT) has entered the challenge.
        
         | philbe77 wrote:
         | good point :) - we can re-aggregate HyperLogLog (HLL) sketches
         | to get a pretty accurate NDV (Count Distinct) - see
         | Query.farm's DataSketches DuckDB extension here:
         | https://github.com/Query-farm/datasketches
         | 
         | We also have Bitmap aggregation capabilities for exact count
         | distinct - something I worked with Oracle, Snowflake,
         | Databricks, and DuckDB labs on implementing. It isn't as fast
         | as HLL - but it is 100% accurate...
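          | 
          | For anyone wondering why COUNT(DISTINCT) is the hard case:
          | the simple partial aggregates merge, distinct counts don't.
          | A quick illustration with toy tables (not the challenge
          | data):
          | 
          |     import duckdb
          | 
          |     con = duckdb.connect()
          |     con.sql("CREATE TABLE shard_a AS "
          |             "SELECT * FROM (VALUES (1),(2),(3)) t(u)")
          |     con.sql("CREATE TABLE shard_b AS "
          |             "SELECT * FROM (VALUES (3),(4)) t(u)")
          | 
          |     # summing per-shard distinct counts overcounts value 3
          |     parts = con.sql("""
          |         SELECT count(DISTINCT u) FROM shard_a
          |         UNION ALL
          |         SELECT count(DISTINCT u) FROM shard_b
          |     """).fetchall()
          |     print(sum(n for (n,) in parts))        # 5
          | 
          |     # the true answer needs the raw values, a mergeable
          |     # sketch (HLL), or a bitmap
          |     print(con.sql("""
          |         SELECT count(DISTINCT u) FROM (
          |             SELECT u FROM shard_a
          |             UNION ALL SELECT u FROM shard_b)
          |     """).fetchall())                       # [(4,)]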
        
           | fifilura wrote:
           | I remember BigQuery had Distinct with HLL accuracy 10 years
           | ago but rather quickly replaced it with actual accuracy.
           | 
           | How would you compare this solution to BigQuery?
        
       | up2isomorphism wrote:
       | Sensational title, a reflection of "attention is all you
       | need".(pun intended)
        
       | fifilura wrote:
       | Isn't Trino built for exactly this, without the quirky
       | workarounds?
        
       ___________________________________________________________________
       (page generated 2025-10-24 23:01 UTC)