[HN Gopher] Show HN: I built a tensor library from scratch in C++/CUDA
       ___________________________________________________________________
        
       Show HN: I built a tensor library from scratch in C++/CUDA
        
       Hi HN,

       Over the past few months, I've been building `dsc`, a tensor
       library from scratch in C++/CUDA. My main focus has been on
       getting the basics right, prioritizing a clean API, simplicity,
       and clear observability for running small LLMs locally.

       The key features are:

       - C++ core with CUDA support, written from scratch.
       - A familiar, PyTorch-like Python API.
       - Runs real models: it's complete enough to load a model like
         Qwen from HuggingFace and run inference on both CUDA and CPU
         with a single line change[1].
       - Simple, built-in observability for both Python and C++.

       Next on the roadmap is adding BF16 support, and then I'll be
       working on visualization for GPU workloads.

       The project is still early and I would be incredibly grateful
       for any feedback, code reviews, or questions from the HN
       community!

       GitHub Repo: https://github.com/nirw4nna/dsc

       [1]: https://github.com/nirw4nna/dsc/blob/main/examples/models/qw...
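
       To illustrate the "single line change": the ergonomics follow
       the familiar PyTorch pattern, where switching between CPU and
       CUDA is just a different device string. Shown here with torch
       itself for familiarity (a sketch of the pattern, not dsc's own
       API; see [1] for the dsc version):

           import torch

           # The same script runs on CPU or GPU by changing a single
           # device string; dsc aims for the same ergonomics.
           device = "cuda" if torch.cuda.is_available() else "cpu"

           x = torch.randn(4, 8, device=device)
           w = torch.randn(8, 2, device=device)
           y = x @ w          # dispatched to the selected backend
           print(y.device, y.shape)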
        
       Author : nirw4nna
       Score  : 79 points
       Date   : 2025-06-18 15:20 UTC (7 hours ago)
        
        
       | helltone wrote:
       | This is very cool. I'm wondering if some of the templates and
       | switch statements would be nicer if there was an intermediate
       | representation and a compiler-like architecture.
       | 
       | I'm also curious about how this compares to something like Jax.
       | 
       | Also curious about how this compares to zml.
        
         | nirw4nna wrote:
         | You are absolutely correct! I started working on a sort of
         | compiler a while back but decided to get the basics down
         | first. The templates and switch statements are not really
         | the issue; the real cost is going back and forth between C
         | and Python. An experiment I did a few months ago
         | (https://x.com/nirw4nna/status/1904114563672354822) showed
         | a ~20% perf gain just from generating a naive C++ kernel
         | for softmax instead of calling 5 separate kernels.
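         | 
         | For context, a numerically stable softmax decomposes into
         | roughly five small ops, and in an eager library each one is
         | a separate kernel launch plus a Python/C round trip. A rough
         | NumPy sketch of that decomposition (not dsc's actual code):
         | 
         |     import numpy as np
         | 
         |     def softmax_eager(x):
         |         # Each line maps to one kernel launch / round trip
         |         # in an eager library.
         |         m = x.max(axis=-1, keepdims=True)   # 1. reduce-max
         |         z = x - m                           # 2. subtract
         |         e = np.exp(z)                       # 3. exp
         |         s = e.sum(axis=-1, keepdims=True)   # 4. reduce-sum
         |         return e / s                        # 5. divide
         | 
         |     # A generated, fused kernel does the same math in one
         |     # pass, removing launch overhead and temporaries.
         |     x = np.random.randn(4, 16).astype(np.float32)
         |     print(softmax_eager(x).sum(axis=-1))    # rows sum to ~1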
        
       | kajecounterhack wrote:
       | Cool stuff! Is the goal of this project personal learning,
       | inference performance, or something else?
       | 
       | Would be nice to see how inference speed stacks up against say
       | llama.cpp
        
         | liuliu wrote:
         | Both use cuBLAS under the hood, so I think prefill
         | performance is similar (of course, this framework is still
         | early and doesn't seem to have FP16 / BF16 support for GEMM
         | yet). Hand-rolled GEMV is faster for token generation,
         | hence llama.cpp is better there.
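         | 
         | Rough shape intuition, with made-up sizes: prefill pushes
         | the whole prompt through each weight matrix as one big
         | matmul, while decoding handles one new token per step, so
         | the same op degenerates to a matrix-vector product.
         | 
         |     import numpy as np
         | 
         |     d, seq_len = 4096, 512
         |     W = np.random.randn(d, d).astype(np.float32)
         | 
         |     # Prefill: (seq_len x d) @ (d x d) is a GEMM, where
         |     # cuBLAS is hard to beat.
         |     prompt = np.random.randn(seq_len, d).astype(np.float32)
         |     prefill_out = prompt @ W
         | 
         |     # Decode: (1 x d) @ (d x d) is effectively a GEMV, which
         |     # is memory-bound and where hand-rolled kernels tend to
         |     # win.
         |     token = np.random.randn(1, d).astype(np.float32)
         |     decode_out = token @ W
         |     print(prefill_out.shape, decode_out.shape)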
        
         | nirw4nna wrote:
         | Thanks! To be honest, it started purely as a learning project.
         | I was really inspired when llama.cpp first came out and tried
         | to build something similar in pure C++
         | (https://github.com/nirw4nna/YAMI), mostly for fun and to
         | practice low-level coding. The idea for DSC came when I
         | realized how hard it was to port new models to that C++ engine,
         | especially since I don't have a deep ML background. I wanted
         | something that felt more like PyTorch, where I could experiment
         | with new architectures easily. As for llama.cpp, it's
         | definitely faster! They have hand-optimized kernels for a
         | whole bunch of architectures, models and data types. DSC is
         | more of a general-purpose toolkit. I'm excited to work on
         | performance later on, but for now, I'm focused on getting the
         | API and core features right.
        
       | aklein wrote:
       | I noticed you interface with the native code via ctypes. I
       | think cffi is generally preferred (e.g.,
       | https://cffi.readthedocs.io/en/stable/overview.html#api-mode...).
       | Although you'd have more flexibility if you built your own
       | Python extension module (e.g. using pybind), which would free
       | you from a simple/strict ABI. Curious if this strict
       | separation of C & Python was a deliberate design choice.
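       | 
       | For reference, the "API mode" in that link compiles a small C
       | extension up front instead of going through libffi on every
       | call. A minimal sketch, with a made-up function (my_scale)
       | purely for illustration:
       | 
       |     # build_example.py: cffi API mode generates and compiles
       |     # a tiny extension module.
       |     from cffi import FFI
       | 
       |     ffibuilder = FFI()
       |     ffibuilder.cdef("double my_scale(double x, double f);")
       |     ffibuilder.set_source("_example_cffi", """
       |     static double my_scale(double x, double f) {
       |         return x * f;
       |     }
       |     """)
       | 
       |     if __name__ == "__main__":
       |         ffibuilder.compile(verbose=True)
       | 
       |     # After running this once:
       |     #   from _example_cffi import lib
       |     #   lib.my_scale(2.0, 3.0)   # -> 6.0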
        
         | nirw4nna wrote:
         | Yes, when I designed the API I wanted to keep a clear
         | distinction between Python and C. At some point I had two
         | APIs, one in Python and the other in high-level C++, and
         | they both shared the same low-level C API. I find this
         | design quite clean and easy to work with when multiple
         | languages are involved. When I get to perf I plan to
         | experiment a bit with nanobind
         | (https://github.com/wjakob/nanobind) and see if there's a
         | noticeable difference compared to ctypes.
        
           | almostgotcaught wrote:
           | The call overhead of using ctypes vs nanobind/pybind is
           | enormous
           | 
           | https://news.ycombinator.com/item?id=31378277
           | 
           | Even if the number reported there is off, it's not far
           | off, because ctypes just calls out to libffi, which is
           | known to be the slowest way to do FFI.
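           | 
           | The per-call cost is easy to see by timing a trivial call
           | through ctypes; the libm lookup below assumes a Unix-like
           | system, and exact numbers vary by machine:
           | 
           |     import ctypes, ctypes.util, math, timeit
           | 
           |     # Declare cos(double) -> double from libm.
           |     libm = ctypes.CDLL(ctypes.util.find_library("m"))
           |     libm.cos.argtypes = [ctypes.c_double]
           |     libm.cos.restype = ctypes.c_double
           | 
           |     n = 1_000_000
           |     t_ct = timeit.timeit(lambda: libm.cos(1.0), number=n)
           |     t_py = timeit.timeit(lambda: math.cos(1.0), number=n)
           | 
           |     # Most of the ctypes time is per-call marshalling via
           |     # libffi, the overhead nanobind/pybind mostly avoid.
           |     print(f"ctypes {t_ct:.3f}s  math.cos {t_py:.3f}s")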
        
       | rrhjm53270 wrote:
       | Do you have any plans for serialization and deserialization
       | in your tensor and nn library?
        
         | nirw4nna wrote:
         | Right now I can load tensors directly from a safetensors
         | file or from a NumPy array, so I'm not really planning to
         | add my own custom format, but I do plan to support GGUF
         | files.
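         | 
         | For anyone curious how light the safetensors side can be:
         | the format is an 8-byte little-endian header length, a JSON
         | header describing each tensor's dtype, shape and byte
         | offsets, then the raw data. A rough reader sketch (dtype
         | map abbreviated, no error handling, not dsc's actual
         | loader):
         | 
         |     import json, struct
         |     import numpy as np
         | 
         |     DTYPES = {"F32": np.float32, "F16": np.float16,
         |               "I64": np.int64}
         | 
         |     def load_safetensors(path):
         |         with open(path, "rb") as f:
         |             (hlen,) = struct.unpack("<Q", f.read(8))
         |             header = json.loads(f.read(hlen))
         |             data = f.read()   # raw bytes after the header
         |         tensors = {}
         |         for name, info in header.items():
         |             if name == "__metadata__":
         |                 continue
         |             start, end = info["data_offsets"]
         |             arr = np.frombuffer(data[start:end],
         |                                 dtype=DTYPES[info["dtype"]])
         |             tensors[name] = arr.reshape(info["shape"])
         |         return tensors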
        
       | amtk2 wrote:
       | Super n00b question: what kind of laptop do you need for a
       | project like this? Is a Mac OK, or do you need a dedicated
       | Linux laptop?
        
         | kadushka wrote:
         | Any laptop with an Nvidia card
        
           | amtk2 wrote:
           | Does a gaming laptop running Windows work? I've always
           | used a Mac for development because the toolchain is so
           | much easier; I'm wondering if there is a difference
           | between Windows and Linux for CUDA development.
        
             | Anerudhan wrote:
             | You can always use WSL
        
         | nirw4nna wrote:
         | I developed this on an HP Omen 15 with an i7-8750H, a GTX
         | 1050 Ti and 32GB of RAM, with Linux Mint as my OS.
        
       | einpoklum wrote:
       | It's very C-like: heavy use of macros, prefixes instead of
       | namespaces, raw pointers for arrays, etc. Technically you're
       | compiling C++, but... not really.
       | 
       | No negative or positive comment on its usability though, I'm not
       | an ML/Neural Network simulation person.
        
         | caned wrote:
         | I've found adherence to C++ conventions in low-level software
         | to be a rather contentious issue, most recently when working
         | in an ML compiler group. One set abhorred the use of macros,
         | the other any kind of polymorphism or modern C++ feature.
         | 
         | Coming from a background of working with OS kernels and systems
         | software, I don't mind the kind of explicit "C++ lite" style
         | used by the OP. Left to my own devices, I usually write things
         | that way. I would think twice if I was trying to design a large
         | framework, but ... I try to avoid those.
        
       ___________________________________________________________________
       (page generated 2025-06-18 23:00 UTC)