[HN Gopher] Sirius: A GPU-native SQL engine
___________________________________________________________________
Sirius: A GPU-native SQL engine
Author : qianli_cs
Score : 134 points
Date : 2025-06-28 14:18 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| cpard wrote:
| It's great to see Substrait getting more seriously used!
|
| It has been supported by engines like DuckDB, but the main
| serious use case I'm aware of is Apache Gluten, where it is used
| to plug Velox in as the execution engine for Spark.
|
| It's an ambitious project and certainly has limitations, but more
| projects like this are needed to push it forward.
| gavinray wrote:
| At Hasura/PromptQL, we attempted to use the Substrait IR through
| DataFusion for representing query engine plans, but found that
| not all semantics were supported.
|
| We ended up having to roll our own [0], but I still think
| Substrait is a fantastic idea (someone has to solve this
| problem, eventually) and it's got a good group of minds
| consistently working on it, so my outlook for it is bright.
|
| [0] https://hasura.github.io/ndc-
| spec/reference/types.html#query...
| cpard wrote:
| Yeah, there's definitely a lot of work left for Substrait, and
| that's why it makes me happy to see projects like this.
|
| Substrait is the type of project that can only be built by
| trying to engineer real systems, just like you tried to do.
| b0a04gl wrote:
| Query plans are tightly coupled to the engine that emits them:
| cost models, memory layout, parallelism strategy, and codegen
| behavior all vary. Substrait enables structural portability, but
| actual execution efficiency depends on engine-specific rewrites.
| A plan optimized in DuckDB might underperform in Sirius unless
| it's reshaped. How is this handled for now?
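|
| For concreteness, the kind of handoff in question, sketched in
| Python against DuckDB's substrait extension (the extension's
| install source and the get_substrait/from_substrait helpers are
| assumptions; check your DuckDB version):
|
|   import duckdb
|
|   con = duckdb.connect()
|   # May be plain "INSTALL substrait" on older DuckDB releases.
|   con.execute("INSTALL substrait FROM community")
|   con.execute("LOAD substrait")
|   con.execute("CREATE TABLE t AS SELECT range AS x FROM range(1000)")
|
|   # Serialize the optimized logical plan to a Substrait blob.
|   plan = con.get_substrait("SELECT sum(x) FROM t").fetchone()[0]
|
|   # A consuming engine (e.g. a GPU backend) would take this blob
|   # and apply its own physical rewrites; here we simply replay it.
|   print(con.from_substrait(proto=plan).fetchall())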
| tucnak wrote:
| Reminds me of PG-Strom[1] which is a Postgres extension for GPU-
| bound index access methods (most notably BRIN, select GIS
| functions) and the like; it relies on proprietary NVIDIA
| GPUDirect tech for peer-to-peer PCIe transactions between the GPU
| and NVMe devices. I'm not sure whether the amdgpu kernel driver
| has this capability in the first place, and last I checked (~6
| mo. ago) ROCm didn't have it in software.
|
| However, I wonder whether GPUs are a good fit for this to begin
| with.
|
| Counterpoint: the Xilinx side of the AMD shop has developed the
| Alveo series of accelerators, which used to be pretty basic
| SmartNIC platforms but have since evolved to include A LOT more
| programmable logic and compute IP. You may have heard about these
| in video encoding applications, HFT, blockchain stuff,
| what-have-you. A lot of it has to do with AI; see Versal[2]. Raw
| compute figures are often cited as "underwhelming," and it's
| unfortunate that so many pundits are missing the forest for the
| trees here. I don't think the AI tiles in these devices are
| really meant for end-to-end LLM inference, even though memory
| bandwidth in the high-end devices allows it.
|
| The sauce is compute-in-network over fabrics.
|
| Similarly to how PG-Strom would feed the GPU with relational data
| from disk or directly from the network, many AI teams on the
| datacenter side are now experimenting with data movement and
| intermediate computations (think K/V cache management) over
| 100/200/800+G fabrics. IMHO, compute-in-network is the MapReduce
| of this decade. Obviously, there's demand for it in the AI space,
| but a lot of it lends itself nicely to more general-purpose
| applications, like databases. If you're into experimental
| networking like that, Corundum[3] by Alex Forencich is a great,
| perhaps the best, open source NIC design for up to 100G line
| rate. Some of the cards it supports also expose direct-attach
| NVMe drives over MCIO for latency, and typically have two or four
| SFP28 ports for bandwidth.
|
| This is a bit of a naive way to think about it, but it will have
| to do!
|
| Postgres is not typically considered to "scale well," but
| oftentimes this is a statement about its tablespaces more than
| anything; it has the foreign data wrapper[4] API, which is how
| you extend Postgres into a single point of consumption, forgoing
| some transactional guarantees in the process. This is how
| pg_analytics[5] brings DuckDB to Postgres, and how Steampipe[6]
| similarly exposes many Cloud and SaaS applications. Depending on
| where you stand on this, the so-called alternative SQL engines
| may seem like a move in the wrong direction. Shrug.
|
| [1] https://heterodb.github.io/pg-strom/
|
| [2] https://xilinx.github.io/AVED/latest/AVED%2BOverview.html
|
| [3] https://github.com/corundum/corundum
|
| [4] https://wiki.postgresql.org/wiki/Foreign_data_wrappers
|
| [5] https://github.com/paradedb/pg_analytics
|
| [6] https://hub.steampipe.io/#plugins
| bob1029 wrote:
| > However, I wonder whether GPUs are a good fit for this to begin
| with.
|
| I think the GPU could be a great fit for OLAP, but when it
| comes to the nasty OLTP use cases the CPU will absolutely
| dominate.
|
| Strictly serialized transaction processing facilities demand
| extremely low latency compute to achieve meaningful throughput.
| When the behavior of transaction B depends on transaction A
| being fully resolved, there are no magic tricks you can play
| anymore.
|
| Consider that talking to L1 is _at least_ 1,000x faster than
| talking to the GPU. Unless you can get a shitload of work done
| with each CPU-GPU message (and it is _usually_ the case that
| you can), this penalty is horrifyingly crippling.
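|
| A rough, assumed-numbers sketch (order-of-magnitude latencies,
| not measurements) of how much work each round trip has to carry:
|
|   # Assumed latencies; substitute figures for your own hardware.
|   L1_HIT_NS = 1.0             # L1 cache hit
|   GPU_ROUND_TRIP_NS = 10_000  # kernel launch + PCIe round trip
|
|   # Rows per round trip needed before the per-row overhead is
|   # roughly comparable to a single L1 hit.
|   break_even = GPU_ROUND_TRIP_NS / L1_HIT_NS
|   print(f"~{break_even:,.0f} rows per batch to amortize the trip")
|
|   # An OLAP scan ships millions of rows per launch, so the
|   # penalty washes out; a strictly serialized OLTP transaction
|   # that must wait on its predecessor cannot batch, so it pays
|   # the full round trip every time.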
| tucnak wrote:
| I think TrueTime would constitute a "trick," insofar as ordering
| is concerned?
|
| > Consider that talking to L1 is at least 1,000x faster than
| talking to the GPU.
|
| This is largely true for "traditional" architectures, but
| s/GPU/TPU and s/L1/CMEM and suddenly this is no big deal
| anymore. I'd like Googlers to correct me here, but it seems
| well in line with classic MapReduce, and probably something
| that they're doing a lot outside of LLM inference... ads?
| bob1029 wrote:
| How does the information get to & from the GPU in the first
| place?
|
| If a client wishes to use your GPU-based RDBMS engine, it
| needs to make a trip through the CPU first, does it not?
| tucnak wrote:
| Not necessarily! The setup I'm discussing is explicitly
| non-GPU, and it's not necessarily a TPU either. Any
| accelerator card with NoC capability will do: the requests are
| queued/batched from the network, trickle through the adjacent
| compute/network nodes, and are written back to the network. This
| is what "compute-in-network" means; the CPU is never involved,
| main memory is never involved. You read from the network, you
| write to the network, that's it. On-chip memory on these
| accelerators is orders of magnitude larger than L1 (FPGAs are
| known for low-latency systolic stuff), and the on-package memory
| is large HBM stacks similar to those you would find in a GPU.
| dbetteridge wrote:
| Could you (assuming no care about efficiency) send the query to
| both the GPU and CPU pipelines at the same time and use whichever
| comes back first?
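|
| A toy sketch of that race in Python, with hypothetical
| run_on_cpu / run_on_gpu stand-ins for real engine calls (a real
| system would also need to cancel the loser and reconcile
| resource usage):
|
|   from concurrent.futures import (ThreadPoolExecutor,
|                                   FIRST_COMPLETED, wait)
|
|   def run_on_cpu(query: str) -> str:
|       return f"cpu result for {query!r}"  # placeholder engine call
|
|   def run_on_gpu(query: str) -> str:
|       return f"gpu result for {query!r}"  # placeholder engine call
|
|   def race(query: str) -> str:
|       with ThreadPoolExecutor(max_workers=2) as pool:
|           futures = {pool.submit(run_on_cpu, query),
|                      pool.submit(run_on_gpu, query)}
|           done, pending = wait(futures, return_when=FIRST_COMPLETED)
|           # Best effort: cancel() cannot stop already-running work.
|           for f in pending:
|               f.cancel()
|           return next(iter(done)).result()
|
|   print(race("SELECT count(*) FROM lineitem"))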
| Joel_Mckay wrote:
| Most database query optimizer engines do a few tests to
| figure out the most pragmatic approach.
|
| GPUs can incur higher failure risks, and thus one will not
| normally find them in high-reliability roles. =3
| Joel_Mckay wrote:
| Thanks for reminding us of the project name.
|
| Personally, I'd rather have another dual-CPU Epyc host with
| maximum ECC RAM, as I have witnessed NVIDIA GPUs fail closed and
| take out host power supplies. =3
| Joel_Mckay wrote:
| If I recall correctly, PostgreSQL had GPU accelerators many years
| back.
|
| Personally, the risk associated with GPU failure rates is
| important, and I have witnessed NVIDIA cards take out entire
| hosts' power systems by failing closed, i.e. no back-plane
| diagnostics because the power supplies are in a "safe" off
| condition.
|
| I am sure the use-cases for SQL + GPU exist, but for database
| reliability no GPU should be allowed in those racks. =3
| RachelF wrote:
| Pity it requires Volta (compute capability 7.0), which is rather
| high-end for fiddling around with at home.
| qayxc wrote:
| Really? Any NVIDIA GPU released six years ago or newer should be
| able to meet that requirement; in other words, any RTX 2000
| series card and up suffices [1].
|
| [1] https://developer.nvidia.com/cuda-gpus
| RachelF wrote:
| It requires "CUDA >= 11.2"
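|
| For a quick local check against the bar discussed above (compute
| capability 7.0+ and CUDA >= 11.2), here is a sketch using numba's
| CUDA bindings as an assumed convenience (nvcc --version covers
| the toolkit side):
|
|   from numba import cuda
|
|   dev = cuda.get_current_device()
|   major, minor = dev.compute_capability
|   print(f"{dev.name.decode()}: compute capability {major}.{minor}")
|   print("meets the bar" if (major, minor) >= (7, 0)
|         else "GPU too old for Sirius")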
___________________________________________________________________
(page generated 2025-06-29 23:01 UTC)