[HN Gopher] Sirius: A GPU-native SQL engine
       ___________________________________________________________________
        
       Sirius: A GPU-native SQL engine
        
       Author : qianli_cs
       Score  : 134 points
       Date   : 2025-06-28 14:18 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | cpard wrote:
        | It's great to see Substrait getting more serious use!
        | 
        | It has been supported by engines like DuckDB, but the main
        | serious use of it I'm aware of is Apache Gluten, where it is used
        | to plug Velox in as the execution engine for Spark.
        | 
        | It's an ambitious project and certainly has limitations, but more
        | projects like this are needed to push it forward.
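        | 
        | If anyone wants to poke at this, DuckDB's substrait extension can
        | emit a plan that another consumer could pick up. A rough Python
        | sketch (extension install path and function names as I remember
        | them, so double-check against the current docs):
        | 
        |     import duckdb
        | 
        |     con = duckdb.connect()
        |     # Newer DuckDB builds may need: INSTALL substrait FROM community
        |     con.execute("INSTALL substrait")
        |     con.execute("LOAD substrait")
        | 
        |     con.execute("CREATE TABLE t AS SELECT range AS id, "
        |                 "range % 10 AS grp FROM range(1000)")
        | 
        |     q = "SELECT grp, count(*) FROM t GROUP BY grp"
        | 
        |     # Protobuf blob of the plan; this is what a Substrait consumer
        |     # (Sirius, DataFusion, Gluten/Velox, ...) would ingest.
        |     blob = con.execute(f"CALL get_substrait('{q}')").fetchone()[0]
        | 
        |     # Human-readable JSON form of the same plan.
        |     print(con.execute(f"CALL get_substrait_json('{q}')").fetchone()[0])
        | 
        |     # DuckDB can also replay the blob itself via CALL from_substrait(...).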
        
         | gavinray wrote:
          | At Hasura/PromptQL, we attempted to use Substrait IR through
          | DataFusion for representing query engine plans, but found that
          | not all semantics were supported.
         | 
         | We ended up having to roll our own [0], but I still think
         | Substrait is a fantastic idea (someone has to solve this
         | problem, eventually) and it's got a good group of minds
         | consistently working on it, so my outlook for it is bright.
         | 
         | [0] https://hasura.github.io/ndc-
         | spec/reference/types.html#query...
        
           | cpard wrote:
            | Yeah, there's definitely a lot of work left for Substrait,
            | and that's why it makes me happy to see projects like this.
           | 
           | Substrait is the type of project that can only be built by
           | trying to engineer real systems, just like you tried to do.
        
       | b0a04gl wrote:
        | query plans are tightly coupled to the engine that emits them:
        | cost models, memory layout, parallelism strategy, codegen
        | behavior all vary. substrait enables structural portability, but
        | the actual execution efficiency depends on engine-specific
        | rewrites. a plan optimized in duckdb might underperform in sirius
        | unless it's reshaped. how is this handled for now?
        
       | tucnak wrote:
        | Reminds me of PG-Strom[1], which is a Postgres extension for GPU-
        | bound index access methods (most notably BRIN, select GIS
        | functions) and the like; it relies on proprietary NVIDIA
        | GPUDirect tech for peer-to-peer PCIe transactions between the GPU
        | and NVMe devices. I'm not sure whether the amdgpu kernel driver
        | has this capability in the first place, and last I checked (~6
        | mo. ago) ROCm didn't have it in software.
       | 
        | However, I wonder whether GPUs are a good fit for this to begin
        | with.
       | 
        | Counterpoint: the Xilinx side of the AMD shop has developed the
        | Alveo series of accelerators, which used to be pretty basic
        | SmartNIC platforms but have since evolved to include A LOT more
        | programmable logic and compute IP. You may have heard about these
        | in video encoding applications, HFT, Blockchain stuff, what-have-
        | you. A lot of it has to do with AI stuff; see Versal[2]. Raw
        | compute figures are often cited as "underwhelming," and it's
        | unfortunate that so many pundits are missing the forest for the
        | trees here. I don't think the AI tiles in these devices are
        | really meant for end-to-end LLM inference, even though memory
        | bandwidth in the high-end devices allows it.
       | 
       | The sauce is compute-in-network over fabrics.
       | 
        | Similarly to how PG-Strom feeds the GPU with relational data
        | straight from disk or the network, many AI teams on the
        | datacenter side are now experimenting with data movement and
        | intermediate computations (think K/V cache management) over
        | 100/200/800+G fabrics. IMHO, compute-in-network is the MapReduce
        | of this decade. Obviously there's demand for it in the AI space,
        | but a lot of it lends itself nicely to more general-purpose
        | applications, like databases. If you're into experimental
        | networking like that, Corundum[3] by Alex Forencich is a great,
        | perhaps the best, open-source NIC design for up to 100G line
        | rate. Some of the cards it supports also expose direct-attach
        | NVMe over MCIO for latency, and typically have two or four SFP28
        | ports for bandwidth.
        | 
        | This is a bit of a naive way to think about it, but it would have
        | to do!
       | 
        | Postgres is not typically considered to "scale well," but
        | oftentimes this is a statement about its tablespaces more than
        | anything; it has a foreign data[4] API, which is how you extend
        | Postgres into a single point of consumption, foregoing some
        | transactional guarantees in the process. This is how
        | pg_analytics[5] brings DuckDB to Postgres, and how Steampipe[6]
        | similarly exposes many cloud and SaaS applications. Depending on
        | where you stand on this, the so-called alternative SQL engines
        | may seem like a move in the wrong direction. Shrug.
       | 
       | [1] https://heterodb.github.io/pg-strom/
       | 
       | [2] https://xilinx.github.io/AVED/latest/AVED%2BOverview.html
       | 
       | [3] https://github.com/corundum/corundum
       | 
       | [4] https://wiki.postgresql.org/wiki/Foreign_data_wrappers
       | 
       | [5] https://github.com/paradedb/pg_analytics
       | 
       | [6] https://hub.steampipe.io/#plugins
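        | 
        | To make the foreign-data point above concrete: wiring up the
        | stock postgres_fdw wrapper looks roughly like the sketch below
        | (psycopg2; pg_analytics and Steampipe register their own wrappers
        | the same way, just with different names and options, and the
        | connection details here are made up):
        | 
        |     import psycopg2
        | 
        |     conn = psycopg2.connect("dbname=frontdb user=postgres")
        |     conn.autocommit = True
        |     cur = conn.cursor()
        | 
        |     cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw")
        |     cur.execute("""
        |         CREATE SERVER IF NOT EXISTS warehouse
        |             FOREIGN DATA WRAPPER postgres_fdw
        |             OPTIONS (host 'warehouse.internal', dbname 'analytics')
        |     """)
        |     cur.execute("""
        |         CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
        |             SERVER warehouse
        |             OPTIONS (user 'reader', password 'secret')
        |     """)
        |     # Remote tables become local relations; the FDW pushes work
        |     # down, which is the "single point of consumption" idea.
        |     cur.execute("IMPORT FOREIGN SCHEMA public "
        |                 "FROM SERVER warehouse INTO public")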
        
         | bob1029 wrote:
          | > However, I wonder whether GPUs are a good fit for this to
          | begin with.
         | 
         | I think the GPU could be a great fit for OLAP, but when it
         | comes to the nasty OLTP use cases the CPU will absolutely
         | dominate.
         | 
         | Strictly serialized transaction processing facilities demand
         | extremely low latency compute to achieve meaningful throughput.
         | When the behavior of transaction B depends on transaction A
         | being fully resolved, there are no magic tricks you can play
         | anymore.
         | 
          | Consider that talking to L1 is _at least_ 1,000x faster than
          | talking to the GPU. Unless you can amortize each CPU-GPU
          | message over a shitload of work (and with strictly serialized
          | transactions you usually can't), this penalty is horrifyingly
          | crippling.
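          | 
          | Rough numbers to put a scale on that (the latencies below are
          | ballpark assumptions, not measurements):
          | 
          |     # Strictly serialized steps can't overlap, so throughput
          |     # is bounded by 1 / (round-trip latency per dependent step).
          |     L1_HIT_S = 1e-9      # ~1 ns L1 hit, ballpark
          |     GPU_HOP_S = 10e-6    # ~10 us CPU->GPU->CPU hop, ballpark
          | 
          |     for name, lat in [("CPU, L1-resident", L1_HIT_S),
          |                       ("GPU hop per txn", GPU_HOP_S)]:
          |         print(f"{name:>18}: {1 / lat:,.0f} serial steps/sec")
          | 
          |     # The only way out is amortization: batch N independent
          |     # queries per hop, which is the OLAP story, not OLTP.
          |     N = 10_000
          |     print(f"batched ({N}/hop): {N / GPU_HOP_S:,.0f} queries/sec")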
        
           | tucnak wrote:
            | I think TrueTime would constitute a "trick," insofar as
            | ordering is concerned?
           | 
           | > Consider that talking to L1 is at least 1,000x faster than
           | talking to the GPU.
           | 
           | This is largely true for "traditional" architectures, but
           | s/GPU/TPU and s/L1/CMEM and suddenly this is no big deal
           | anymore. I'd like Googlers to correct me here, but it seems
           | well in line with classic MapReduce, and probably something
           | that they're doing a lot outside of LLM inference... ads?
        
             | bob1029 wrote:
             | How does the information get to & from the GPU in the first
             | place?
             | 
             | If a client wishes to use your GPU-based RDBMS engine, it
             | needs to make a trip through the CPU first, does it not?
        
               | tucnak wrote:
               | Not necessarily! The setup I'm discussing is explicitly
               | non-GPU, and it's not necessarily a TPU either. Any
                | accelerator card with NoC capability will do: requests
                | are queued/batched from the network, trickle through the
                | adjacent compute/network nodes, and are written back to
                | the network. This is what "compute-in-network" means; the
                | CPU is never involved, main memory is never involved. You
                | read from the network, you write to the network, that's
                | it. On-chip memory on these accelerators is orders of
                | magnitude larger than L1 (FPGAs are known for low-latency
                | systolic stuff), and the on-package memory is large HBM
                | stacks similar to what you would find in a GPU.
        
           | dbetteridge wrote:
            | Could you (assuming no care about efficiency) send the query
            | to both the GPU and CPU pipelines at the same time and use
            | whichever comes back first?
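            | 
            | Mechanically that's just racing two executors and taking
            | whichever finishes first; a toy sketch where the two run_*
            | functions are stand-ins, not real engine calls:
            | 
            |     import concurrent.futures as cf
            | 
            |     def run_on_cpu(sql):  # stand-in for the CPU pipeline
            |         return ("cpu", f"rows for {sql}")
            | 
            |     def run_on_gpu(sql):  # stand-in for the GPU pipeline
            |         return ("gpu", f"rows for {sql}")
            | 
            |     def race(sql):
            |         with cf.ThreadPoolExecutor(max_workers=2) as pool:
            |             futs = [pool.submit(f, sql)
            |                     for f in (run_on_cpu, run_on_gpu)]
            |             done, pending = cf.wait(futs,
            |                                     return_when=cf.FIRST_COMPLETED)
            |             for f in pending:
            |                 f.cancel()  # best effort; a real engine
            |                             # needs genuine cancellation
            |             return next(iter(done)).result()
            | 
            |     print(race("SELECT count(*) FROM t"))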
        
             | Joel_Mckay wrote:
             | Most database query optimizer engines do a few tests to
             | figure out the most pragmatic approach.
             | 
             | GPUs can incur higher failure risks, and thus one will not
             | normally find them in high-reliability roles. =3
        
         | Joel_Mckay wrote:
         | Thanks for reminding us of the project name.
         | 
          | Personally, I'd rather have another dual-CPU Epyc host with
          | maximum ECC RAM, as I have witnessed NVIDIA GPUs fail closed
          | and take out host power supplies. =3
        
       | Joel_Mckay wrote:
        | If I recall correctly, PostgreSQL had GPU accelerators many years
        | back.
        | 
        | Personally, the risk associated with GPU failure rates is
        | important, and I have witnessed NVIDIA cards take out entire
        | hosts' power systems by failing closed, i.e. no back-plane
        | diagnostics because the power supplies are in a "safe" off
        | condition.
       | 
       | I am sure the use-cases for SQL + GPU exist, but for database
       | reliability no GPU should be allowed in those racks. =3
        
       | RachelF wrote:
        | Pity it requires Volta (compute capability 7.0), which is rather
        | high end for fiddling around with at home.
        
         | qayxc wrote:
          | Really? Any NVIDIA GPU released in the last six years or so
          | should meet that requirement; in other words, any RTX 2000-
          | series card and up suffices [1].
         | 
         | [1] https://developer.nvidia.com/cuda-gpus
        
           | RachelF wrote:
           | It requires "CUDA >= 11.2"
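            | 
            | If you want to check what a given box reports, here's a quick
            | sanity check (assuming a CUDA-enabled PyTorch build;
            | numba.cuda exposes the same info):
            | 
            |     import torch  # assumes a CUDA-enabled PyTorch build
            | 
            |     # Per this thread, Sirius reportedly wants compute
            |     # capability >= 7.0 (Volta) and CUDA >= 11.2.
            |     major, minor = torch.cuda.get_device_capability(0)
            |     print("compute capability:", f"{major}.{minor}")
            |     print("PyTorch built against CUDA:", torch.version.cuda)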
        
       ___________________________________________________________________
       (page generated 2025-06-29 23:01 UTC)