[HN Gopher] New ScyllaDB Go Driver: Faster Than GoCQL and Its Ru...
___________________________________________________________________
New ScyllaDB Go Driver: Faster Than GoCQL and Its Rust Counterpart
Author : truth_seeker
Score : 175 points
Date : 2022-10-13 05:47 UTC (17 hours ago)
(HTM) web link (www.scylladb.com)
(TXT) w3m dump (www.scylladb.com)
| henrydark wrote:
| With all this work to remove allocations I get the feeling that
| what really is needed is C++ with Go's concurrency syntax and
| runtime
| [deleted]
| pjmlp wrote:
| It is a myth that C and C++ are free of such issues, it is
| always a matter of how much one cares about performance.
|
| https://groups.google.com/a/chromium.org/g/chromium-dev/c/EU...
| avgcorrection wrote:
| Why C++?
| bborud wrote:
| Somewhat unrelated observation: I have never looked at ScyllaDB
| so I went to the web page. In the most prominent space they take
| a dump on the competition. Normally that would be a red flag for
| me, but in this case it made me curious.
|
| Now I want to know more. :-)
| midislack wrote:
| Go d-driver? What the heck? What's this even mean?
| metadat wrote:
| Clever ideas to optimize this baby. Nice work.
|
| Hadn't heard of the pre-coalesce millisecond pile up technique.
|
| Favorited, thank you, sincerely!
| gpderetta wrote:
| > Hadn't heard of the pre-coalesce millisecond pile up
| technique.
|
| This is basically Nagling and/or TCP_CORK right?
| loosescrews wrote:
| This is similar to Nagle's algorithm (controllable on Linux
| with TCP_NODELAY).
| insanitybit wrote:
| Are there simple benchmarks that I can run for the Rust
| counterpart? I've worked a bit on the scylla rust code and I see
| plenty of room for improving efficiency (there's a lot of
| unnecessary allocation imo, and the hashing algorithm is 10x
| slower than it needs to be), but I don't want to make a PR for
| improvements without evidence.
|
| > The big difference between our Rust and Go drivers comes from
| coalescing; however, even with this optimization disabled in the
| Go driver, it's still a bit faster.
|
| For anyone who's wondering, the Rust driver has coalescing
| support as of 9 days ago.
| mvelbaum wrote:
| I wonder if Tokio is also a reason for worse performance
| compared to Go's concurrency runtime.
| raverbashing wrote:
| I wonder how many people are using 'async' just for the sake
| of it without a real need for it and shooting themselves in
| the foot while at it
| pkolaczk wrote:
| Actually in this case async is the only way to get sane
| performance and both drivers deliver excellent performance
| thanks to async. I've been using Scylla Rust driver in my
| C* benchmarking project and it is an order of magnitude
| faster than the tools which use threads.
|
| https://github.com/pkolaczk/latte
| raverbashing wrote:
| Cool, good to know. I know threads are a limiting factor,
| but sometimes people jump into async while the problem is
| somewhere else
| pkolaczk wrote:
| In this case each request is very tiny amount of work on
| the client, so waking up a thread to do that work just to
| immediately block waiting on the response from the server
| is very wasteful. With async you can send hundreds of
| requests in a simple loop, on a single thread. It's not
| only more efficient but also actually easier to write.
| insanitybit wrote:
| There are probably a bunch of reasons, which is why I want an
| easy "run benchmarks" command that I can use. I'd even be
| fine using infra so long as I had pulumi/terraform to set it
| all up for me.
|
| I just don't want to spin up EC2 instances manually, get the
| connections all working, make sure I can reset state, etc.
|
| I already have a fork of Scylla where I removed a lot of
| unnecessary cloning of `String` but no way I'm gonna PR it
| without a benchmark.
|
| I also opened a PR to replace the hash algorithm used in
| their PreparedStatement cache, which gets hit for every
| query, but they wanted benchmarks before accepting
| (completely fair) and I have none. `ahash` is extremely fast
| compared to Rust's default -
| https://github.com/tkaitchuck/ahash and with the `comptime`
| randomness (more than sufficient for the scylla use case) you
| can avoid a system call when creating the HashMap.
|
| There are also some performance improvements I have in mind
| for the response parsing, among other things.
| indiv0 wrote:
| I just did a comparison between almost every hashing
| algorithm I could find on crates.io. On my machine t1ha2
| (under the t1ha crate) beat the pants off of every other
| algorithm. By like an order of magnitude. Others in the
| lead were blake3 (from the blake3 crate) and metrohash.
| Worth taking a look at those if you're going for hash
| speed.
|
| I don't have the exact numbers on me right now but I can
| share them tomorrow (along with the benchmark code) if
| you're interested.
| ComputerGuru wrote:
| FYI Small hashes beat better quality hashes for hash
| table purposes.
| virtualritz wrote:
| aHash claims it is faster than t1ha[1].
|
| The t1ha crate also hasn't been updated in over three
| years, so the benchmark in this link should still be current.
|
| [1] https://github.com/tkaitchuck/aHash/blob/master/compa
| re/read...
|
| Edit: if you really think t1ha is faster I would open an
| issue on the aHash repo to update their benchmark.
| insanitybit wrote:
| The PR I have lets you provide the algorithm as the
| caller, although I did benchmark against fxhash and I
| think it would be a good idea to suggest `ahash`. I'm
| certainly interested.
|
| `ahash` has some good benchmarks here:
| https://github.com/tkaitchuck/aHash/blob/master/FAQ.md
| PoignardAzur wrote:
| > _I just don't want to spin up EC2 instances manually,
| get the connections all working, make sure I can reset
| state, etc._
|
| I've been thinking about this lately.
|
| I wonder if we could standardize a benchmark format so that
| you could automatically do the steps of downloading the
| code, setting up a container (on your computer or in the
| cloud), running the benchmarks, producing an output file,
| and making a PR with the output.
|
| So developers would go "here's my benchmark suite, but I've
| only tested it on my machine", and users would call "cargo
| bench --submit-results-in-pr" or whatever, and thus the
| benchmark would quickly get more samples.
|
| (With graphs being auto-generated as more samples come in,
| based on some config files plus the bench samples)
| insanitybit wrote:
| Interesting idea. I could imagine something like that but
| it's a bit tough.
| ianpurton wrote:
| So would the ideal solution be if ScyllaDB had a github
| action to run benchmarks against PR's?
|
| Not sure how decent a benchmark would be without running up
| servers in the cloud. So I guess provisioning infra would
| be a requirement?
|
| So perhaps this could be run manually. But certainly
| possible
|
| - Pulumi up infra
| - Run benchmarks
| - Collect results
| - Attach to PR
| insanitybit wrote:
| I'd be happy with a few things:
|
| 1. Benchmarks of "pure" code like the response parser,
| which I could `cargo bench`. I may actually work on
| contributing this.
|
| 2. Some way to run benchmarks against a deployed server.
| I wouldn't recommend a Github action necessarily, a
| nightly job or manual job would probably be a better use
| of money/resources. If I could plug in some AWS creds and
| have it do the deployment and spit out a bunch of metrics
| for me that'd be wonderful.
| QuadDamaged wrote:
| Hi, do you know if there's a recent hash benchmark I can
| look into? I am using `FnvHash` as my go-to non-crypto-
| secure hash for performance reason, didn't realise there
| could be faster contenders.
|
| Thanks!
| insanitybit wrote:
| These seem decent:
| https://github.com/tkaitchuck/aHash/blob/master/FAQ.md
| llimllib wrote:
| This is the best, most comprehensive hash test suite I
| know of: https://github.com/rurban/smhasher/
|
| you might want to particularly look into murmur, spooky,
| and metrohash. I'm not exactly sure of what the tradeoffs
| involved are, or what your need is, but that site should
| serve as a good starting point at least.
| insanitybit wrote:
| So, I got sniped _hard_ by this.
|
| 1. I've re-opened my hashing PR and I'm going to suggest
| that they adopt ahash as the default hasher in the future.
|
| 2. I've re-written my "reduce allocations" work as a POC.
| Another dev has done similar work to reduce allocations, we
| took different approaches to the same area of code. I'm
| going to try to push the conversation forward until we have
| a PR'able plan.
|
| 3. I'm going to push for a change that will remove multiple
| large allocations (of PreparedStatement) out of the query
| path.
|
| 4. Another two devs have started work on the response
| deserialization optimizations, which is awesome and means I
| don't have to even think about it.
|
| I think we'll see really significant performance gains if
| all of these changes come in.
| PeterCorless wrote:
| You might want to check out ScyllaDB Stress Orchestrator. Not
| sure of the current state of the code, but it's meant to do
| what you are talking about:
|
| https://github.com/scylladb/scylla-stress-orchestrator/wiki/...
| enedil wrote:
| I wonder why you got downvoted, the comments are on point.
|
| Disclaimer: I work for ScyllaDB, although not on drivers. I can
| forward your question to relevant people.
| insanitybit wrote:
| FWIW my company is a customer so we've already got a shared
| Slack and account reps :P Feel free to reach out to
| colin@graplsecurity.com (me) though if you want to chat about
| it.
| yitr wrote:
| fyi social links at the bottom of
| https://www.graplsecurity.com/ map to the wrong things
| (linkedin to discord, github to linkedin)
| dorlaor wrote:
| My understanding is that the new Rust coalescing will make the
| situation on-par with the new go driver. However, in the second
| part of the blog there is a no-coalescing test where go is
| still faster and allocates less memory. I'm sure that the Rust
| driver can get there too
| pengaru wrote:
| Sounds like GoCQL and its Rust counterpart are poorly
| implemented.
| Sin2x wrote:
| How does it compare to Clickhouse regarding speed?
| PeterCorless wrote:
| Clickhouse is a column store designed for analytics [OLAP]
| workloads. It would compete with, say, Apache Druid or Apache
| Pinot.
|
| ScyllaDB is a wide column store which is, in fact, a row store;
| you can call it a "key-key-value" store, since it has a
| partitioning key and a clustering [or "sort"] key, and it is
| more for transactional workloads [OLTP]. So it is more
| comparable with Cassandra or DynamoDB.
|
| So they are really designed for different sorts of things.
|
| That being said, ScyllaDB has some features, like Workload
| Prioritization, so you can run analytics, like range or full
| table scans, against it without hammering your incoming
| transactions. But it wasn't designed specifically for that.
| Sin2x wrote:
| Thanks, for some reason I thought they aimed at comparable
| use cases.
| pjmlp wrote:
| Very good write-up on how performance can be improved without
| the typical rewrite-in-X.
| erichocean wrote:
| And yet...ScyllaDB is famous for being a 10x faster rewrite of
| Cassandra (written in Java) in C++.
|
| Your general comment is correct. I see it often with GPU
| algorithms which, no surprise, are also much faster on CPUs
| (using something like ISPC to compile them).
| pjmlp wrote:
| A performance improvement that could have been obtained by
| rewriting only the critical paths in C++ and integrating them
| via JNI, instead of rewriting the world.
|
| An approach that tends to be ignored by those who rewrite X in Y.
| seastarer wrote:
| ScyllaDB is a rearchitecting of Cassandra, not just a
| rewrite.
| pjmlp wrote:
| The point stands.
| habibur wrote:
| > We also paid close attention to proper memory management,
| producing as little garbage as possible.
|
| The key.
|
| And I was wondering how a tracing GC can outperform a
| non-tracing-GC memory manager.
| chrisseaton wrote:
| Manual memory management means your memory management is
| tightly interleaved with application code, with both competing
| for the same limited cache space. A tracing GC batches memory
| management, using cache better and not as frequently evicting
| your application's data out of cache.
|
| A GC can also let you use more efficient concurrent data
| structures - many sophisticated concurrent objects require a
| tracing GC for implementing correctly - which can improve the
| performance of your application code.
| tialaramex wrote:
| > many sophisticated concurrent objects require a tracing GC
| for implementing correctly
|
| What sort of "sophisticated concurrent objects" are you
| thinking of?
| chrisseaton wrote:
| Implementing many lock-free data structures without a GC is
| still a research topic, for example requiring conversion to
| use epoch allocation.
| tialaramex wrote:
| This still seems very vague, I was looking for concrete
| examples where quite different arrangements like hazard
| pointers are for some reason impossible.
| ruuda wrote:
| Here's one design that's nice to do for servers in GC'd
| functional languages, and hard to achieve without GC. Have
| the state of your application be a persistent data
| structure (one that you can efficiently update such that
| both the old and new version remain available, as opposed
| to in-place mutation). Then hold a "current state" pointer,
| that gets updated atomically. Endpoints that only read the
| state can read the pointer and then complete the request
| with a consistent view of the state, while writes can be
| serialized, build the new state in the background, and when
| it's done, swap the pointer to "publish" the new state.
| This way reads get a consistent view without blocking
| writes, and writes do not block reads. (Unlike an
| implementation where you mutate the state in-place, and you
| would need to protect it with a mutex.) It is possible to
| do this in non-gc'd languages too, but then persistent data
| structures are unwieldy, and cloning the full state for
| every update may be prohibitively expensive.
| yencabulator wrote:
| Even just read-copy-update without refcounts is very hard
| without GC. Linux kernel can do RCU without refcounts
| largely because it's in complete control of CPU cores and
| scheduling; userspace can't pull off the same tricks.
| Meanwhile, with GC, it's just
| https://pkg.go.dev/sync/atomic#Value
|
| https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html
| tialaramex wrote:
| > Even just read-copy-update without refcounts is very
| hard without GC
|
| To me "it could be more difficult without" and "requires"
| are quite different claims, especially in the context of
| what's possible and why.
| forrestthewoods wrote:
| The key to garbage collection is... to go super far out of your
| way to avoid allocating memory. This is not ideal.
| girvo wrote:
| I'm probably biased because I live in firmware development
| these days, but that's true even for non-garbage collected
| languages when it comes to making sure things are fast
| ekidd wrote:
| Speaking as someone who has spent time optimizing C++ and
| Rust, memory allocation in hot loops is often where
| performance goes to die. GC or not, if you want to go fast,
| reducing allocation is one of the first things to benchmark.
|
| (One fast way to manage allocations is to use an arena
| allocator that allocates memory by incrementing a pointer,
| and frees memory all at once. This is pretty effective for
| simple, short-lived requests.)
| forrestthewoods wrote:
| Yes, minimize malloc in all cases. The difference is that
| GC languages are fundamentally designed around the concept
| that it's cheap and easy to malloc/free. Avoiding
| allocations can be excruciatingly difficult.
|
| In C++ you also need to minimize allocations, but it's
| radically easier to do in C++ than in C#.
| throwaway81523 wrote:
| > And I was wondering how can a tracing GC outperform a non-
| tracing GC memory manager.
|
| The cliche is that malloc/free style memory management has to
| touch all the garbage in order to free it, while a semispace GC
| only has to copy the live data once in a while. The garbage is
| ignored.
| pkolaczk wrote:
| However, when a tracing GC runs, it has to touch massive
| amounts of cold data and pushes hot data out of cache.
| Traditional malloc/free touches data in small chunks and
| freeing happens close to the last use, so when most of the
| data is still hot in cache. Stable, predictable performance
| is often more important than the peak performance.
| masklinn wrote:
| Afaik Go uses a non-moving GC, so it can't be a semispace
| collector.
| lukeqsee wrote:
| ScyllaDB's obsession with performance, pursued through a deep
| understanding of hardware and software rather than simply adding
| more machines, is really impressive.
|
| They consistently demonstrate that we are underusing our CPUs
| relative to their potential.
| nvarsj wrote:
| I think it's a common principle of modern computing. We trade
| productivity for performance all the time. The idea being
| machines are fast enough it doesn't matter. There are order of
| magnitude gains to be made at most levels in the modern stack -
| it's just the effort required is immense.
| rob74 wrote:
| The effort doesn't _have_ to be immense though. I bet there
| are plenty of "low-hanging performance fruit" in most
| codebases, it's just that there's no real reward to pick
| them...
| jerf wrote:
| My own experience backs this. I don't sit here obsessing
| about performance. In a database context that is
| appropriate but I'm not implementing databases. Nor do I
| prematurely optimize. I just give some thought to the
| matter occasionally, especially at the architecture level,
| and as a result I tend to produce systems that often
| surprise my fellow programmers with their performance. And I
| _strongly_ assert that is not because I'm some sort of
| design genius... ask me about my design mistakes! I've got
| 'em. It's just that I try a little. My peers' intuitions
| are visibly tuned for systems where nobody even tried.
|
| I'm not sure how much "low hanging" fruit there is, though.
| A lot of modern slowdown is architectural. Nobody sat down
| and thought about the flow through the system holistically
| at the design phase, and the system rolled out with a
| design that intrinsically depends on a synchronous network
| transaction every time the user types a key, or the code
| passes back and forth between three layers of architecture
| internally getting wrapped and unwrapped in intermediate
| objects a billion times per second (... loops are terrible
| magnifiers of architecture failures, a minor "oops" becomes
| a performance disaster when done a billion times...) when a
| better design could have just done the thing in one shot,
| etc. I think a lot of the things we have fundamental
| performance issues with are actually so hard to fix they
| all but require new programs to be written in a lot of
| cases.
|
| Then again, there is also visibly a lot of code in the
| world that has simply never been run through a profiler,
| not even for fun (and it is _so much_ fun to profile a code
| base that has never been profiled before, I _highly_
| recommend it, no sarcasm), and it's hard to get a
| statistically-significant sense of how much of the
| performance issues we face are low-hanging fruit and how
| much are bad architecture.
| hintymad wrote:
| Is DB driver a bottleneck in applications? Somehow I usually see
| bottlenecks in other places in a service, and the db bottlenecks
| are usually on the database-side instead on the driver side.
| wejick wrote:
| Will this be compatible with other DB using CQL? Like Cassandra
| itself or Yugabyte for example.
| PeterCorless wrote:
| Yes. ScyllaDB writes all of its drivers to be
| backward/generically compatible with other CQL-based databases
| like Cassandra, etc.
|
| There are some specific features like shard-aware queries and
| shard-aware ports that naturally won't apply. But they will
| work.
| neoyagami wrote:
| Is anyone using ScyllaDB in production nowadays? I tried it
| some years ago, hit data loss, and it kinda spooked the
| bejesus out of me (so I went to Cassandra).
| PeterCorless wrote:
| 400+ companies. https://www.scylladb.com/users/
| pornel wrote:
| This is a dare to get people rewrite their Rust driver for
| performance.
| avgcorrection wrote:
| Go/Rust rivalry is a meme.
| _wldu wrote:
| Every language dreams of being faster and safer than C and
| C++. Can't have it both ways.
| yencabulator wrote:
| A different language can enable easy expression of designs
| that would be nightmare to maintain in C/C++:
| https://go.dev/talks/2013/oscon-dl.slide
| psaux wrote:
| One of my fav companies! Dor is an amazing person, highly
| recommend working with them. We started to move to ScyllaDB at
| my last engineering job. Doing 100 TB a day in IoT data.
| snapetom wrote:
| My last job was far less data than yours (more like 100's of
| GB/day), and it wasn't surprising it handled it with ease. I
| love that DB so much. So performant, and far easier to admin
| than Cassandra.
| ZephyrBlu wrote:
| What is that amount of data used for?
|
| Is it along the lines of "we want to collect all the data we
| can in case we want to use or analyze it at some point", or are
| there real use cases?
| PeterCorless wrote:
| For IIoT, there's digital twins, machine health, sensor data
| aggregation, and a lot more. While a lot of it can be
| classified as "write once, read hardly ever," it is vital for
| triggers, alarms, real-time failure and security alerts, and,
| eventually being forwarded over to analytics systems for
| longer-term trends.
| usrusr wrote:
| Almost reads like a case where an initial implementation in Rust
| forced a clear mental image of ownership that could then be
| transferred into another language more easily than the same
| clarity could have been reached outside the pedantic reign of
| the Rust compiler.
| philosopher1234 wrote:
| I think this is possible but I see no evidence in the article
| that would support this interpretation
___________________________________________________________________
(page generated 2022-10-13 23:02 UTC)