[HN Gopher] New ScyllaDB Go Driver: Faster Than GoCQL and Its Ru...
       ___________________________________________________________________
        
       New ScyllaDB Go Driver: Faster Than GoCQL and Its Rust Counterpart
        
       Author : truth_seeker
       Score  : 175 points
       Date   : 2022-10-13 05:47 UTC (17 hours ago)
        
 (HTM) web link (www.scylladb.com)
 (TXT) w3m dump (www.scylladb.com)
        
       | henrydark wrote:
        | With all this work to remove allocations, I get the feeling
        | that what's really needed is C++ with Go's concurrency syntax
        | and runtime.
        
         | [deleted]
        
         | pjmlp wrote:
          | It is a myth that C and C++ are free of such issues; it is
          | always a matter of how much one cares about performance.
         | 
         | https://groups.google.com/a/chromium.org/g/chromium-dev/c/EU...
        
         | avgcorrection wrote:
         | Why C++?
        
       | bborud wrote:
        | Somewhat unrelated observation: I have never looked at ScyllaDB
       | so I went to the web page. In the most prominent space they take
       | a dump on the competition. Normally that would be a red flag for
       | me, but in this case it made me curious.
       | 
       | Now I want to know more. :-)
        
       | midislack wrote:
        | Go driver? What the heck? What does this even mean?
        
       | metadat wrote:
       | Clever ideas to optimize this baby. Nice work.
       | 
       | Hadn't heard of the pre-coalesce millisecond pile up technique.
       | 
       | Favorited, thank you, sincerely!
        
         | gpderetta wrote:
         | > Hadn't heard of the pre-coalesce millisecond pile up
         | technique.
         | 
          | This is basically Nagling and/or TCP_CORK, right?
        
         | loosescrews wrote:
          | This is similar to Nagle's algorithm (controllable on Linux
          | with the TCP_NODELAY socket option).
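          | 
          | As a rough illustration (not from the article), this is how a
          | client can set that option in Rust with just the standard
          | library, so small writes go out immediately:
          | 
          |   use std::net::TcpStream;
          | 
          |   fn main() -> std::io::Result<()> {
          |       // 9042 is the usual CQL port; address is an example
          |       let stream = TcpStream::connect("127.0.0.1:9042")?;
          |       // true == set TCP_NODELAY, i.e. disable Nagle
          |       stream.set_nodelay(true)?;
          |       Ok(())
          |   }
          | 
          | Drivers that coalesce writes themselves typically disable
          | Nagle like this and batch in userspace, so the driver rather
          | than the kernel decides the latency/throughput trade-off.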
        
       | insanitybit wrote:
       | Are there simple benchmarks that I can run for the Rust
       | counterpart? I've worked a bit on the scylla rust code and I see
       | plenty of room for improving efficiency (there's a lot of
       | unnecessary allocation imo, and the hashing algorithm is 10x
       | slower than it needs to be), but I don't want to make a PR for
       | improvements without evidence.
       | 
       | > The big difference between our Rust and Go drivers comes from
       | coalescing; however, even with this optimization disabled in the
       | Go driver, it's still a bit faster.
       | 
       | For anyone who's wondering, the Rust driver has coalescing
       | support as of 9 days ago.
        
         | mvelbaum wrote:
         | I wonder if Tokio is also a reason for worse performance
         | compared to Go's concurrency runtime.
        
           | raverbashing wrote:
           | I wonder how many people are using 'async' just for the sake
           | of it without a real need for it and shooting themselves in
           | the foot while at it
        
             | pkolaczk wrote:
             | Actually in this case async is the only way to get sane
             | performance and both drivers deliver excellent performance
              | thanks to async. I've been using the Scylla Rust driver in
              | my C* benchmarking project, and it is an order of magnitude
              | faster than tools that use threads.
             | 
             | https://github.com/pkolaczk/latte
        
               | raverbashing wrote:
               | Cool, good to know. I know threads are a limiting factor,
               | but sometimes people jump into async while the problem is
               | somewhere else
        
               | pkolaczk wrote:
                | In this case each request is a very tiny amount of work on
               | the client, so waking up a thread to do that work just to
               | immediately block waiting on the response from the server
               | is very wasteful. With async you can send hundreds of
               | requests in a simple loop, on a single thread. It's not
               | only more efficient but also actually easier to write.
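                | 
                | As a rough sketch (not the actual driver API;
                | `send_request` here is a hypothetical stand-in for
                | any async call):
                | 
                |   use futures::future::join_all;
                | 
                |   // hypothetical async call, e.g. a driver query
                |   async fn send_request(i: i32) -> i32 { i }
                | 
                |   #[tokio::main(flavor = "current_thread")]
                |   async fn main() {
                |       // hundreds of requests in flight, one OS thread
                |       let futs = (0..500).map(send_request);
                |       let results = join_all(futs).await;
                |       println!("completed {}", results.len());
                |   }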
        
           | insanitybit wrote:
           | There are probably a bunch of reasons, which is why I want an
           | easy "run benchmarks" command that I can use. I'd even be
           | fine using infra so long as I had pulumi/terraform to set it
           | all up for me.
           | 
           | I just don't want to spin up EC2 instances manually, get the
           | connections all working, make sure I can reset state, etc.
           | 
           | I already have a fork of Scylla where I removed a lot of
           | unnecessary cloning of `String` but no way I'm gonna PR it
           | without a benchmark.
           | 
           | I also opened a PR to replace the hash algorithm used in
           | their PreparedStatement cache, which gets hit for every
           | query, but they wanted benchmarks before accepting
           | (completely fair) and I have none. `ahash` is extremely fast
            | compared to Rust's default
            | (https://github.com/tkaitchuck/ahash), and with compile-time
            | randomness (more than sufficient for the Scylla use case) you
            | can avoid a system call when creating the HashMap.
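            | 
            | As a sketch of what I mean (assuming `ahash` is built with
            | its `compile-time-rng` feature, so the hasher keys come from
            | build-time randomness rather than the OS; the cache type
            | here is just a stand-in):
            | 
            |   use ahash::RandomState;
            |   use std::collections::HashMap;
            | 
            |   fn main() {
            |       // same std HashMap, just a different BuildHasher
            |       let mut cache: HashMap<String, u64, RandomState> =
            |           HashMap::with_hasher(RandomState::new());
            |       cache.insert("some prepared stmt".to_owned(), 1);
            |   }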
           | 
           | There are also some performance improvements I have in mind
           | for the response parsing, among other things.
        
             | indiv0 wrote:
             | I just did a comparison between almost every hashing
             | algorithm I could find on crates.io. On my machine t1ha2
             | (under the t1ha crate) beat the pants off of every other
             | algorithm. By like an order of magnitude. Others in the
             | lead were blake3 (from the blake3 crate) and metrohash.
             | Worth taking a look at those if you're going for hash
             | speed.
             | 
             | I don't have the exact numbers on me right now but I can
             | share them tomorrow (along with the benchmark code) if
             | you're interested.
        
               | ComputerGuru wrote:
                | FYI: small hashes beat better-quality hashes for hash
                | table purposes.
        
               | virtualritz wrote:
               | aHash claims it is faster than t1ha[1].
               | 
               | The t1ha crate also hasn't been updated in over three
               | years so the benchmark in this link should be current.
               | 
               | [1] https://github.com/tkaitchuck/aHash/blob/master/compa
               | re/read...
               | 
                | Edit: if you really think t1ha is faster, I would open an
                | issue on the aHash repo to update their benchmark.
        
               | insanitybit wrote:
                | The PR I have lets the caller provide the algorithm,
                | although I did benchmark against fxhash, and I think it
                | would be a good idea to suggest `ahash`. I'm certainly
                | interested.
               | 
               | `ahash` has some good benchmarks here:
               | https://github.com/tkaitchuck/aHash/blob/master/FAQ.md
        
             | PoignardAzur wrote:
              | > _I just don't want to spin up EC2 instances manually,
              | get the connections all working, make sure I can reset
              | state, etc._
             | 
             | I've been thinking about this lately.
             | 
             | I wonder if we could standardize a benchmark format so that
             | you could automatically do the steps of downloading the
             | code, setting up a container (on your computer or in the
             | cloud), running the benchmarks, producing an output file,
             | and making a PR with the output.
             | 
             | So developers would go "here's my benchmark suite, but I've
             | only tested it on my machine", and users would call "cargo
             | bench --submit-results-in-pr" or whatever, and thus the
             | benchmark would quickly get more samples.
             | 
             | (With graphs being auto-generated as more samples come in,
             | based on some config files plus the bench samples)
        
               | insanitybit wrote:
               | Interesting idea. I could imagine something like that but
               | it's a bit tough.
        
             | ianpurton wrote:
              | So would the ideal solution be if ScyllaDB had a GitHub
              | Action to run benchmarks against PRs?
              | 
              | Not sure how decent a benchmark would be without spinning
              | up servers in the cloud. So I guess provisioning infra
              | would be a requirement?
             | 
              | So perhaps this could be run manually. But certainly
              | possible:
              | 
              | - Pulumi up infra
              | - Run benchmarks
              | - Collect results
              | - Attach to PR
        
               | insanitybit wrote:
               | I'd be happy with a few things:
               | 
               | 1. Benchmarks of "pure" code like the response parser,
               | which I could `cargo bench`. I may actually work on
               | contributing this.
               | 
               | 2. Some way to run benchmarks against a deployed server.
                | I wouldn't necessarily recommend a GitHub Action; a
                | nightly or manual job would probably be a better use of
                | money/resources. If I could plug in some AWS creds and
               | have it do the deployment and spit out a bunch of metrics
               | for me that'd be wonderful.
        
             | QuadDamaged wrote:
             | Hi, do you know if there's a recent hash benchmark I can
             | look into? I am using `FnvHash` as my go-to non-crypto-
              | secure hash for performance reasons; I didn't realise
              | there could be faster contenders.
             | 
             | Thanks!
        
               | insanitybit wrote:
               | These seem decent:
               | https://github.com/tkaitchuck/aHash/blob/master/FAQ.md
        
               | llimllib wrote:
               | This is the best, most comprehensive hash test suite I
               | know of: https://github.com/rurban/smhasher/
               | 
                | You might particularly want to look into murmur, spooky,
               | and metrohash. I'm not exactly sure of what the tradeoffs
               | involved are, or what your need is, but that site should
               | serve as a good starting point at least.
        
             | insanitybit wrote:
             | So, I got sniped _hard_ by this.
             | 
              | 1. I've re-opened my hashing PR and I'm going to suggest
             | that they adopt ahash as the default hasher in the future.
             | 
             | 2. I've re-written my "reduce allocations" work as a POC.
              | Another dev has done similar work to reduce allocations;
              | we took different approaches to the same area of code. I'm
             | going to try to push the conversation forward until we have
             | a PR'able plan.
             | 
              | 3. I'm going to push for a change that will move multiple
              | large allocations (of PreparedStatement) out of the query
             | path.
             | 
             | 4. Another two devs have started work on the response
             | deserialization optimizations, which is awesome and means I
             | don't have to even think about it.
             | 
             | I think we'll see really significant performance gains if
             | all of these changes come in.
        
         | PeterCorless wrote:
         | You might want to check out ScyllaDB Stress Orchestrator. Not
         | sure of the current state of the code, but it's meant to do
         | what you are talking about:
         | 
         | https://github.com/scylladb/scylla-stress-orchestrator/wiki/...
        
         | enedil wrote:
          | I wonder why you got downvoted; the comments are on point.
         | 
         | Disclaimer: I work for ScyllaDB, although not on drivers. I can
         | forward your question to relevant people.
        
           | insanitybit wrote:
           | FWIW my company is a customer so we've already got a shared
            | Slack and account reps :P Feel free to reach out to
           | colin@graplsecurity.com though (me) if you want to chat about
           | it.
        
             | yitr wrote:
              | FYI, the social links at the bottom of
              | https://www.graplsecurity.com/ map to the wrong targets
              | (LinkedIn to Discord, GitHub to LinkedIn).
        
         | dorlaor wrote:
         | My understanding is that the new Rust coalescing will make the
          | situation on par with the new Go driver. However, in the second
          | part of the blog there is a no-coalescing test where Go is
          | still faster and allocates less memory. I'm sure the Rust
          | driver can get there too.
        
       | pengaru wrote:
       | Sounds like GoCQL and its Rust counterpart are poorly
       | implemented.
        
       | Sin2x wrote:
        | How does it compare to ClickHouse regarding speed?
        
         | PeterCorless wrote:
          | ClickHouse is a column store designed for analytics [OLAP]
         | workloads. It would compete with, say, Apache Druid or Apache
         | Pinot.
         | 
         | ScyllaDB is a wide column store which is, in fact, a row store;
          | you can call it a "key-key-value," since it has a partitioning
          | key and a clustering [or "sort"] key, which makes it more
          | suited to transactional [OLTP] workloads. So it is more
          | comparable with Cassandra or DynamoDB.
         | 
         | So they are really designed for different sorts of things.
         | 
         | That being said, ScyllaDB has some features, like Workload
          | Prioritization, that let you run analytics (like range or full
          | table scans) against it without hammering your incoming
         | transactions. But it wasn't designed specifically for that.
        
           | Sin2x wrote:
            | Thanks, for some reason I thought they aimed at comparable
            | use cases.
        
       | pjmlp wrote:
        | Very good write-up on how performance can be improved without
        | the typical rewrite in X.
        
         | erichocean wrote:
         | And yet...ScyllaDB is famous for being a 10x faster rewrite of
         | Cassandra (written in Java) in C++.
         | 
         | Your general comment is correct. I see it often with GPU
         | algorithms which, no surprise, are also much faster on CPUs
         | (using something like ISPC to compile them).
        
           | pjmlp wrote:
            | A performance improvement that could have been obtained by
            | rewriting only the critical paths in C++ and integrating
            | them via JNI, instead of rewriting the world.
            | 
            | An approach that tends to be ignored by those who rewrite X
            | in Y.
        
             | seastarer wrote:
             | ScyllaDB is a rearchitecting of Cassandra, not just a
             | rewrite.
        
               | pjmlp wrote:
               | The point stands.
        
       | habibur wrote:
       | > We also paid close attention to proper memory management,
       | producing as little garbage as possible.
       | 
       | The key.
       | 
        | And I was wondering how a tracing GC can outperform a non-
        | tracing memory manager.
        
         | chrisseaton wrote:
         | Manual memory management means your memory management is
         | tightly interleaved with application code, with both competing
         | for the same limited cache space. A tracing GC batches memory
          | management, using the cache better and evicting your
          | application's data from the cache less frequently.
          | 
          | A GC can also let you use more efficient concurrent data
          | structures - many sophisticated concurrent objects require a
          | tracing GC to implement correctly - which can improve the
         | performance of your application code.
        
           | tialaramex wrote:
           | > many sophisticated concurrent objects require a tracing GC
            | to implement correctly
           | 
           | What sort of "sophisticated concurrent objects" are you
           | thinking of?
        
             | chrisseaton wrote:
             | Implementing many lock-free data structures without a GC is
             | still a research topic, for example requiring conversion to
              | use epoch-based reclamation.
        
               | tialaramex wrote:
                | This still seems very vague; I was looking for concrete
               | examples where quite different arrangements like hazard
               | pointers are for some reason impossible.
        
             | ruuda wrote:
              | Here's one design that's nice to do for servers in GC'd
              | functional languages, and hard to achieve without GC. Have
              | the state of your application be a persistent data
              | structure (one that you can efficiently update such that
              | both the old and new versions remain available, as opposed
              | to in-place mutation). Then hold a "current state" pointer
              | that gets updated atomically. Endpoints that only read the
              | state can read the pointer and then complete the request
              | with a consistent view of the state, while writes can be
              | serialized, build the new state in the background, and when
              | it's done, swap the pointer to "publish" the new state.
              | This way reads get a consistent view without blocking
              | writes, and writes do not block reads. (Unlike an
              | implementation where you mutate the state in place, where
              | you would need to protect it with a mutex.) It is possible
              | to do this in non-GC'd languages too, but then persistent
              | data structures are unwieldy, and cloning the full state
              | for every update may be prohibitively expensive.
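              | 
              | A rough sketch of just the pointer-swap part in Rust,
              | using the arc-swap crate (reference counting stands in
              | for the GC here; rebuilding the whole state is exactly
              | the cost a persistent structure would avoid):
              | 
              |   use arc_swap::ArcSwap;
              |   use std::sync::Arc;
              | 
              |   struct State { n: u64 }
              | 
              |   fn main() {
              |       let cur = ArcSwap::from_pointee(State { n: 0 });
              | 
              |       // reader: consistent snapshot, no locking
              |       let snapshot = cur.load();
              |       let _ = snapshot.n;
              | 
              |       // writer: build a new state from the old one,
              |       // then publish it with an atomic pointer swap
              |       let old = cur.load_full();
              |       cur.store(Arc::new(State { n: old.n + 1 }));
              |   }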
        
             | yencabulator wrote:
             | Even just read-copy-update without refcounts is very hard
              | without GC. The Linux kernel can do RCU without refcounts
              | largely because it's in complete control of CPU cores and
              | scheduling; userspace can't pull off the same tricks.
             | Meanwhile, with GC, it's just
             | https://pkg.go.dev/sync/atomic#Value
             | 
             | https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html
        
               | tialaramex wrote:
               | > Even just read-copy-update without refcounts is very
               | hard without GC
               | 
               | To me "it could be more difficult without" and "requires"
               | are quite different claims, especially in the context of
               | what's possible and why.
        
         | forrestthewoods wrote:
         | The key to garbage collection is... to go super far out of your
         | way to avoid allocating memory. This is not ideal.
        
           | girvo wrote:
           | I'm probably biased because I live in firmware development
            | these days, but that's true even for non-garbage-collected
            | languages when it comes to making sure things are fast.
        
           | ekidd wrote:
           | Speaking as someone who has spent time optimizing C++ and
           | Rust, memory allocation in hot loops is often where
            | performance goes to die. GC or not, if you want to go fast,
            | reducing allocation is one of the first things to benchmark.
           | 
           | (One fast way to manage allocations is to use an arena
           | allocator that allocates memory by incrementing a pointer,
           | and frees memory all at once. This is pretty effective for
           | simple, short-lived requests.)
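            | 
            | A tiny sketch of that pattern, using the bumpalo crate (the
            | names are just illustrative):
            | 
            |   use bumpalo::Bump;
            | 
            |   fn main() {
            |       // per-request arena: each allocation is a pointer
            |       // bump, and everything is freed together when
            |       // `arena` is dropped
            |       let arena = Bump::new();
            |       let name: &str = arena.alloc_str("user-42");
            |       let buf = arena.alloc_slice_fill_copy(64, 0u8);
            |       println!("{} {}", name, buf.len());
            |   }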
        
             | forrestthewoods wrote:
             | Yes, minimize malloc in all cases. The difference is that
             | GC languages are fundamentally designed around the concept
             | that it's cheap and easy to malloc/free. Avoiding
             | allocations can be excruciatingly difficult.
             | 
             | In C++ you also need to minimize allocations, but it's
             | radically easier to do in C++ than in C#.
        
         | throwaway81523 wrote:
          | > And I was wondering how a tracing GC can outperform a non-
          | tracing memory manager.
         | 
          | The cliché is that malloc/free-style memory management has to
         | touch all the garbage in order to free it, while a semispace GC
         | only has to copy the live data once in a while. The garbage is
         | ignored.
        
           | pkolaczk wrote:
           | However, when a tracing GC runs, it has to touch massive
           | amounts of cold data and pushes hot data out of cache.
            | Traditional malloc/free touches data in small chunks, and
            | freeing happens close to the last use, when most of the
            | data is still hot in cache. Stable, predictable performance
           | is often more important than the peak performance.
        
           | masklinn wrote:
            | AFAIK Go uses a non-moving GC, so it can't be a semispace
            | collector.
        
       | lukeqsee wrote:
        | ScyllaDB's obsession with performance, built on a deep
        | understanding of hardware and software rather than simply adding
        | more machines, is really impressive.
        | 
        | They consistently demonstrate that we are underusing our CPUs
        | relative to their potential.
        
         | nvarsj wrote:
          | I think it's a common principle of modern computing. We trade
          | performance for productivity all the time, the idea being that
          | machines are fast enough that it doesn't matter. There are
          | order-of-magnitude gains to be made at most levels of the
          | modern stack - it's just that the effort required is immense.
        
           | rob74 wrote:
            | The effort doesn't _have_ to be immense though. I bet there
            | is plenty of "low-hanging performance fruit" in most
            | codebases; it's just that there's no real reward for picking
            | it...
        
             | jerf wrote:
             | My own experience backs this. I don't sit here obsessing
             | about performance. In a database context that is
             | appropriate but I'm not implementing databases. Nor do I
             | prematurely optimize. I just give some thought to the
             | matter occasionally, especially at the architecture level,
             | and as a result I tend to produce systems that often
             | surprise my fellow programmers at its performance. And I
             | _strongly_ assert that is not because I 'm some sort of
             | design genius... ask me about my design mistakes! I've got
             | 'em. It's just that I try a little. My peer's intuitions
             | are visibly tuned for systems where nobody even tried.
             | 
             | I'm not sure how much "low hanging" fruit there is, though.
             | A lot of modern slowdown is architectural. Nobody sat down
             | and thought about the flow through the system holistically
             | at the design phase, and the system rolled out with a
             | design that intrinsically depends on a synchronous network
             | transaction every time the user types a key, or the code
             | passes back and forth between three layers of architecture
             | internally getting wrapped and unwrapped in intermediate
             | objects a billion times per second (... loops are terrible
             | magnifiers of architecture failures, a minor "oops" becomes
             | a performance disaster when done a billion times...) when a
             | better design could have just done the thing in one shot,
             | etc. I think a lot of the things we have fundamental
             | performance issues with are actually so hard to fix they
             | all but require new programs to be written in a lot of
             | cases.
             | 
             | Then again, there is also visibly a lot of code in the
             | world that has simply never been run through a profiler,
             | not even for fun (and it is _so much_ fun to profile a code
             | base that has never been profiled before, I _highly_
              | recommend it, no sarcasm), and it's hard to get a
             | statistically-significant sense of how much of the
             | performance issues we face are low-hanging fruit and how
             | much are bad architecture.
        
       | hintymad wrote:
        | Is the DB driver a bottleneck in applications? Somehow I usually
        | see bottlenecks in other places in a service, and the DB
        | bottlenecks are usually on the database side rather than on the
        | driver side.
        
       | wejick wrote:
        | Will this be compatible with other DBs using CQL, like Cassandra
        | itself or Yugabyte, for example?
        
         | PeterCorless wrote:
         | Yes. ScyllaDB writes all of its drivers to be
         | backward/generically compatible with other CQL-based databases
         | like Cassandra, etc.
         | 
         | There are some specific features like shard-aware queries and
          | shard-aware ports that naturally won't apply. But the drivers
          | will still work.
        
       | neoyagami wrote:
        | Is anyone using ScyllaDB in production nowadays? I tried it some
        | years ago, hit data loss, and it kinda spooked the bejesus out
        | of me (and I went to Cassandra).
        
         | PeterCorless wrote:
         | 400+ companies. https://www.scylladb.com/users/
        
       | pornel wrote:
        | This is a dare to get people to rewrite their Rust driver for
        | performance.
        
         | avgcorrection wrote:
         | Go/Rust rivalry is a meme.
        
           | _wldu wrote:
           | Every language dreams of being faster and safer than C and
           | C++. Can't have it both ways.
        
             | yencabulator wrote:
             | A different language can enable easy expression of designs
              | that would be a nightmare to maintain in C/C++:
             | https://go.dev/talks/2013/oscon-dl.slide
        
       | psaux wrote:
       | One of my fav companies! Dor is an amazing person, highly
       | recommend working with them. We started to move to ScyllaDB at my
        | last engineering job. Doing 100 TB a day in IoT data.
        
         | snapetom wrote:
          | My last job had far less data than yours (more like 100s of
          | GB/day), and it wasn't surprising it handled that with ease. I
         | love that DB so much. So performant, and far easier to admin
         | than Cassandra.
        
         | ZephyrBlu wrote:
         | What is that amount of data used for?
         | 
         | Is it along the lines of "we want to collect all the data we
         | can in case we want to use or analyze it at some point", or are
         | there real use cases?
        
           | PeterCorless wrote:
            | For IIoT, there are digital twins, machine health, sensor
            | data aggregation, and a lot more. While a lot of it can be
            | classified as "write once, read hardly ever," it is vital for
            | triggers, alarms, and real-time failure and security alerts,
            | and it is eventually forwarded over to analytics systems for
            | longer-term trends.
        
       | usrusr wrote:
        | Almost reads like a case where an initial implementation in Rust
        | forced a clear mental image of ownership that could then be
        | transferred into another language much more easily than the same
        | clarity could have been reached outside the pedantic reign of
        | the Rust compiler.
        
         | philosopher1234 wrote:
          | I think this is possible, but I see no evidence in the article
          | that would support this interpretation.
        
       ___________________________________________________________________
       (page generated 2022-10-13 23:02 UTC)