[HN Gopher] Show HN: I built a tensor library from scratch in C+...
___________________________________________________________________
Show HN: I built a tensor library from scratch in C++/CUDA
Hi HN,

Over the past few months, I've been building `dsc`, a tensor
library from scratch in C++/CUDA. My main focus has been on
getting the basics right, prioritizing a clean API, simplicity,
and clear observability for running small LLMs locally.

The key features are:

- A C++ core with CUDA support, written from scratch.
- A familiar, PyTorch-like Python API.
- Runs real models: it's complete enough to load a model like
  Qwen from HuggingFace and run inference on both CUDA and CPU
  with a single-line change [1] (see the sketch below).
- Simple, built-in observability for both Python and C++.

Next on the roadmap is adding BF16 support; after that I'll be
working on visualization for GPU workloads.

The project is still early and I would be incredibly grateful
for any feedback, code reviews, or questions from the HN
community!

GitHub repo: https://github.com/nirw4nna/dsc

[1]: https://github.com/nirw4nna/dsc/blob/main/examples/models/qw...
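To illustrate the "single line change" pattern: plain PyTorch is
shown below, since dsc's Python API follows it. This is a sketch
of the pattern, not dsc's exact code.

    import torch
    import torch.nn as nn

    # Illustration only: in a PyTorch-like API the backend is
    # just the device string; the rest of the code is identical.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = nn.Linear(16, 16).to(device)
    x = torch.randn(1, 16, device=device)
    print(model(x).shape)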
Author : nirw4nna
Score : 79 points
Date : 2025-06-18 15:20 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| helltone wrote:
| This is very cool. I'm wondering if some of the templates and
| switch statements would be nicer if there was an intermediate
| representation and a compiler-like architecture.
|
| I'm also curious about how this compares to something like Jax.
|
| Also curious about how this compares to zml.
| nirw4nna wrote:
| You are absolutely correct! I started working on a sort of
| compiler a while back but decided to get the basics down first.
| The templates and switch statements are not really the issue;
| the real cost is going back and forth between C & Python. This
| is an experiment I did a few months ago:
| https://x.com/nirw4nna/status/1904114563672354822 and as you
| can see there is a ~20% perf gain just from generating a naive
| C++ kernel instead of calling 5 separate kernels in the case of
| softmax.
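| For reference, here is where the count of 5 comes from: a
| numerically stable softmax, evaluated eagerly, is 5 separate
| ops, and each one is a kernel launch plus a round trip through
| global memory (NumPy sketch):
|
|     import numpy as np
|
|     def softmax_unfused(x):
|         m = x.max(axis=-1, keepdims=True)  # 1. reduce-max
|         s = x - m                          # 2. subtract
|         e = np.exp(s)                      # 3. exp
|         z = e.sum(axis=-1, keepdims=True)  # 4. reduce-sum
|         return e / z                       # 5. divide
|
| A single generated kernel does all 5 steps in one pass over the
| data, which is where the speedup comes from.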
| kajecounterhack wrote:
| Cool stuff! Is the goal of this project personal learning,
| inference performance, or something else?
|
| Would be nice to see how inference speed stacks up against say
| llama.cpp
| liuliu wrote:
| Both use cuBLAS under the hood, so I think it is similar for
| prefilling (of course, this framework is still early and
| doesn't seem to have FP16/BF16 support for GEMM yet). A
| hand-rolled GEMV is faster for token generation, hence
| llama.cpp is better there.
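| Roughly, the shapes explain it (NumPy sketch, sizes made up):
|
|     import numpy as np
|
|     d = 4096
|     W = np.random.randn(d, d).astype(np.float32)
|
|     # Prefill: the whole prompt at once -> a real GEMM, which
|     # cuBLAS handles well.
|     prompt = np.random.randn(512, d).astype(np.float32)
|     prefill = prompt @ W  # (512, d) @ (d, d)
|
|     # Decode: one token per step -> a GEMV, memory-bound, where
|     # a hand-rolled kernel can beat a generic GEMM path.
|     token = np.random.randn(1, d).astype(np.float32)
|     step = token @ W      # (1, d) @ (d, d)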
| nirw4nna wrote:
| Thanks! To be honest, it started purely as a learning project.
| I was really inspired when llama.cpp first came out and tried
| to build something similar in pure C++
| (https://github.com/nirw4nna/YAMI), mostly for fun and to
| practice low-level coding. The idea for DSC came when I
| realized how hard it was to port new models to that C++ engine,
| especially since I don't have a deep ML background. I wanted
| something that felt more like PyTorch, where I could experiment
| with new architectures easily. As for llama.cpp, it's
| definitely faster! They have hand-optimized kernels for a
| whole bunch of architectures, models and data types. DSC is
| more of a general-purpose toolkit. I'm excited to work on
| performance later on, but for now, I'm focused on getting the
| API and core features right.
| aklein wrote:
| I noticed you interface with the native code via ctypes. I think
| cffi is generally preferred (eg,
| https://cffi.readthedocs.io/en/stable/overview.html#api-mode...).
| Although you'd have more flexibility if you built your own
| Python extension module (e.g. using pybind), which would free
| you from a simple/strict ABI. Curious if this strict separation
| of C & Python was a deliberate design choice.
| nirw4nna wrote:
| Yes, when I designed the API I wanted to keep a clear
| distinction between Python and C. At some point I had two APIs:
| one in Python and the other in high-level C++, and they both
| shared the same low-level C API. I find this design quite clean
| and easy to work with when multiple languages are involved.
| When I get to perf I plan to experiment a bit with nanobind
| (https://github.com/wjakob/nanobind) and see if there's a
| noticeable difference wrt ctypes.
| almostgotcaught wrote:
| The call overhead of using ctypes vs nanobind/pybind is
| enormous
|
| https://news.ycombinator.com/item?id=31378277
|
| Even if the number reported there is off, it's not far off:
| ctypes just calls out to libffi, which is known to be the
| slowest way to do FFI.
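| You can measure the per-call overhead yourself in a few lines
| (assumes a Unix libc; exact numbers will vary by machine):
|
|     import ctypes, ctypes.util, timeit
|
|     libc = ctypes.CDLL(ctypes.util.find_library("c"))
|     libc.labs.argtypes = [ctypes.c_long]
|     libc.labs.restype = ctypes.c_long
|
|     n = 1_000_000
|     t = timeit.timeit(lambda: libc.labs(-1), number=n)
|     print(f"{t / n * 1e9:.0f} ns per ctypes call")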
| rrhjm53270 wrote:
| Do you have any plans for serialization and deserialization in
| your tensor and nn library?
| nirw4nna wrote:
| Right now I can load tensors directly from a safetensors file
| or from a NumPy array, so I don't really plan to add my own
| custom format, but I do plan to support GGUF files.
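| Part of why safetensors is enough for now is how simple the
| format is: an 8-byte little-endian header size, a JSON header
| mapping tensor names to dtype/shape/offsets, then one flat byte
| buffer. A minimal header reader looks like this (a sketch, not
| dsc's actual loader):
|
|     import json, struct
|
|     def read_safetensors_header(path):
|         with open(path, "rb") as f:
|             (n,) = struct.unpack("<Q", f.read(8))  # header size
|             header = json.loads(f.read(n))
|         # name -> {"dtype", "shape", "data_offsets"}, plus an
|         # optional "__metadata__" entry
|         return header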
| amtk2 wrote:
| Super n00b question: what kind of laptop do you need to do a
| project like this? Is a Mac OK, or do you need a dedicated
| Linux laptop?
| kadushka wrote:
| Any laptop with an Nvidia card
| amtk2 wrote:
| Does a gaming laptop with Windows work? I have always used a
| Mac for development because the toolchain is so much easier;
| wondering if there is a difference between Windows and Linux
| for CUDA development.
| Anerudhan wrote:
| You can always use WSL
| nirw4nna wrote:
| I developed this on an HP Omen 15 with an i7-8750H, a GTX
| 1050 Ti, and 32GB of RAM, with Linux Mint as my OS.
| einpoklum wrote:
| It's very C-like, heavy use of macros, prefixes instead of
| namespaces, raw pointers for arrays, etc. Technically you're
| compiling C++, but... not really.
|
| No negative or positive comment on its usability though, I'm not
| an ML/Neural Network simulation person.
| caned wrote:
| I've found adherence to C++ conventions in low-level software
| to be a rather contentious issue, most recently when working
| in an ML compiler group. One set abhorred the use of macros,
| the other any kind of polymorphism or modern C++ feature.
|
| Coming from a background of working with OS kernels and systems
| software, I don't mind the kind of explicit "C++ lite" style
| used by the OP. Left to my own devices, I usually write things
| that way. I would think twice if I was trying to design a large
| framework, but ... I try to avoid those.
___________________________________________________________________
(page generated 2025-06-18 23:00 UTC)