[HN Gopher] IceFireDB: Distributed disk storage database based o...
       ___________________________________________________________________
        
       IceFireDB: Distributed disk storage database based on Raft and
       Redis protocol
        
       Author : thunderbong
       Score  : 124 points
       Date   : 2021-08-21 14:23 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | wiremine wrote:
       | One of the child comments made the observation that "this speaks
       | Redis."
       | 
       | Makes me wonder if there is any spec for the Redis commands.
       | I.e., in the same way that SQL defines an interface, but leaves
       | the details up to individual implementations, is there a "Redis"
       | interface that leaves the details up to the implementation?
       | 
        | I'm thinking of something similar to an ISO standard or an
        | RFC.
        
         | WJW wrote:
         | https://redis.io/topics/protocol ?
        
         | rndgermandude wrote:
          | I've implemented a subset of redis in the past, and went by
          | their official docs: first the wire protocol[1], then the
          | docs for individual commands such as SET[2]. They also have
          | a test suite, and I extracted the bits that applied to my
          | partial implementation from there.
         | 
          | The only real pitfall was figuring out which parts of the
          | CONFIG command I needed to implement to make popular redis
          | client libs talk to me and/or use the newer protocol
          | features.
         | 
          | The rest was pretty straightforward: just read the docs for
          | a command, implement the stuff, run the test suite, fix any
          | bugs, repeat.
         | 
         | As far as I know there is no RFC let alone an ISO standard.
         | 
         | [1] https://redis.io/topics/protocol
         | 
         | [2] https://redis.io/commands/set
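          | 
          | If you want a feel for how little the wire protocol itself
          | demands, here is a minimal sketch of a server that answers
          | SET and GET over RESP. Everything in it (the names, the
          | :6380 port, the map-backed store) is illustrative, and it
          | skips inline commands, expiry and RESP3 entirely:
          | 
          |   package main
          |   
          |   import (
          |       "bufio"
          |       "fmt"
          |       "io"
          |       "net"
          |       "strconv"
          |       "strings"
          |   )
          |   
          |   // Toy store; one goroutine per connection, so a real
          |   // server would need locking or a single owner goroutine.
          |   var store = map[string]string{}
          |   
          |   // readCommand parses one RESP array of bulk strings, e.g.
          |   // "*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n".
          |   func readCommand(r *bufio.Reader) ([]string, error) {
          |       header, err := r.ReadString('\n') // "*<count>\r\n"
          |       if err != nil {
          |           return nil, err
          |       }
          |       n, err := strconv.Atoi(strings.TrimSpace(header[1:]))
          |       if err != nil {
          |           return nil, err
          |       }
          |       args := make([]string, 0, n)
          |       for i := 0; i < n; i++ {
          |           lenLine, err := r.ReadString('\n') // "$<len>\r\n"
          |           if err != nil {
          |               return nil, err
          |           }
          |           size, err := strconv.Atoi(strings.TrimSpace(lenLine[1:]))
          |           if err != nil {
          |               return nil, err
          |           }
          |           buf := make([]byte, size+2) // payload + "\r\n"
          |           if _, err := io.ReadFull(r, buf); err != nil {
          |               return nil, err
          |           }
          |           args = append(args, string(buf[:size]))
          |       }
          |       return args, nil
          |   }
          |   
          |   func handle(conn net.Conn) {
          |       defer conn.Close()
          |       r := bufio.NewReader(conn)
          |       for {
          |           args, err := readCommand(r)
          |           if err != nil || len(args) == 0 {
          |               return
          |           }
          |           switch strings.ToUpper(args[0]) {
          |           case "SET":
          |               if len(args) != 3 {
          |                   fmt.Fprint(conn, "-ERR wrong number of arguments\r\n")
          |               } else {
          |                   store[args[1]] = args[2]
          |                   fmt.Fprint(conn, "+OK\r\n")
          |               }
          |           case "GET":
          |               if len(args) != 2 {
          |                   fmt.Fprint(conn, "-ERR wrong number of arguments\r\n")
          |               } else if v, ok := store[args[1]]; ok {
          |                   fmt.Fprintf(conn, "$%d\r\n%s\r\n", len(v), v)
          |               } else {
          |                   fmt.Fprint(conn, "$-1\r\n") // null bulk string
          |               }
          |           default:
          |               fmt.Fprintf(conn, "-ERR unknown command '%s'\r\n", args[0])
          |           }
          |       }
          |   }
          |   
          |   func main() {
          |       ln, err := net.Listen("tcp", ":6380")
          |       if err != nil {
          |           panic(err)
          |       }
          |       for {
          |           conn, err := ln.Accept()
          |           if err != nil {
          |               continue
          |           }
          |           go handle(conn)
          |       }
          |   }
          | 
          | Something like redis-cli -p 6380 set foo bar is enough to
          | poke at it.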
        
         | aranchelk wrote:
          | You'd probably want to define two specs, a basic one and a
          | full one. There are several Redis-compatible data stores,
          | but (if memory serves) you'll find they almost always lack
          | some advanced Redis features, e.g. transactions.
        
         | linux2647 wrote:
         | Probably not like an ISO or RFC. Probably more like AWS S3: it
         | has an API that other software conforms to, but it isn't
         | strictly speaking a standard
        
       | spookylettuce wrote:
       | Quasi-related: what are some good hosted alternatives to AWS
       | dynamodb / GCloud Firestore that are a) fast b) affordable at
       | scale c) have a good local dev experience?
       | 
        | A hosted, disk-based, Redis-protocol-compatible store capable
        | of handling sub-TB datasets would be a dream for me.
        
         | skinnyarms wrote:
          | I was surprised at how easy it was to get started with
          | Cassandra on DataStax: https://www.datastax.com/
        
         | tomnipotent wrote:
          | Cloudflare Workers KV is really promising, but needs a
          | better local dev story (there's no stable project to
          | simulate the services locally, e.g. cloudworkers). Pricing
          | is reasonable depending on what you interpret "scale" to be.
        
       | didip wrote:
       | Does it have a helm chart?
        
       | edoceo wrote:
       | See also Tendis https://github.com/Tencent/Tendis
       | 
       | Tendis is a high-performance distributed storage system which is
       | fully compatible with the Redis protocol.
        
       | tyingq wrote:
        |   SET: 253232.12 requests per second
        |   GET: 2130875.50 requests per second
       | 
        | The roughly 10:1 throughput ratio for GET vs SET is
        | interesting. Redis being in-memory, its read and write rates
        | are pretty close to the same.
       | 
       | Is a 10:1 ratio typical for a storage backed distributed kv
       | store?
       | 
        | Edit: Looks like CockroachDB has roughly a 3:1 ratio, similar
        | for YugabyteDB:
       | 
       | https://www.cockroachlabs.com/docs/stable/performance.html
       | 
       | https://forum.yugabyte.com/t/large-cluster-perf-1-25-nodes/5...
       | 
       | Also ~3:1 for etcd:
       | 
       | https://etcd.io/docs/v3.4/op-guide/performance/
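        | 
        | Numbers in that shape typically come from redis-benchmark, so
        | if anyone wants to reproduce them against their own setup,
        | something along these lines (flag values are just an example)
        | prints a per-command requests-per-second summary:
        | 
        |   redis-benchmark -t set,get -n 1000000 -q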
        
         | bob1029 wrote:
         | > Is a 10:1 ratio typical for a storage backed distributed kv
         | store?
         | 
         | In a single-node system, the best way to increase your write
         | throughput is to batch requests over small chunks of time.
          | Ultimately, the number of writes you can perform per unit
          | time is bounded either by the underlying sequential I/O
          | throughput or by business constraints on maximum allowable
          | request latency. In the most trivial case, you are writing a buffer
         | containing the entire day's work to disk in 1 shot while
         | everyone sleeps. Imagine how fast that could be.
         | 
         | A distributed system has all of the same properties, but then
         | you have to put this over a denominator that additionally
         | factors in the number of nodes and the latency between all
         | participants. A single node is always going to give you the
         | most throughput when talking about 1 serial narrative of events
         | wherein any degree of contention is expected.
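          | 
          | A minimal sketch of that batching ("group commit") idea in
          | Go, where the file name, the 5ms timer and the 256-entry
          | cap are all made up for illustration. Callers block until
          | their write is durable, but many writes share one fsync:
          | 
          |   package main
          |   
          |   import (
          |       "os"
          |       "time"
          |   )
          |   
          |   type write struct {
          |       data []byte
          |       done chan error // signalled once the write is durable
          |   }
          |   
          |   // groupCommit buffers incoming writes and flushes them
          |   // with a single fsync, every 5ms or every 256 entries.
          |   func groupCommit(f *os.File, in <-chan write) {
          |       ticker := time.NewTicker(5 * time.Millisecond)
          |       defer ticker.Stop()
          |       var batch []write
          |       flush := func() {
          |           if len(batch) == 0 {
          |               return
          |           }
          |           var err error
          |           for _, w := range batch {
          |               if _, werr := f.Write(w.data); werr != nil {
          |                   err = werr
          |               }
          |           }
          |           if serr := f.Sync(); serr != nil { // one fsync
          |               err = serr
          |           }
          |           for _, w := range batch {
          |               w.done <- err
          |           }
          |           batch = batch[:0]
          |       }
          |       for {
          |           select {
          |           case w := <-in:
          |               batch = append(batch, w)
          |               if len(batch) >= 256 {
          |                   flush()
          |               }
          |           case <-ticker.C:
          |               flush()
          |           }
          |       }
          |   }
          |   
          |   func main() {
          |       f, err := os.OpenFile("wal.log",
          |           os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
          |       if err != nil {
          |           panic(err)
          |       }
          |       in := make(chan write, 1024)
          |       go groupCommit(f, in)
          |       w := write{data: []byte("hello\n"), done: make(chan error, 1)}
          |       in <- w
          |       if err := <-w.done; err != nil { // wait for the fsync
          |           panic(err)
          |       }
          |   }
          | 
          | The tradeoff is exactly the one above: a larger window means
          | fewer fsyncs per write but higher request latency.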
        
         | jandrewrogers wrote:
          | Comparisons of read/write ratios have to account for
          | several differences in design and implementation.
          | Representative benchmarks are difficult.
         | 
         | Things that can make a difference: Databases have subtly
         | different definitions of "durability", so they aren't always
         | doing semantically equivalent operations. Write throughput
         | sometimes scales with the number of clients and it is not
         | possible to saturate the server with a single client due to
         | limitations of the client protocol, so single client benchmarks
         | are misleading. Some databases allow read and write operations
         | to be pipelined; in these implementations it is possible for
         | write performance to sometimes exceed read performance.
         | 
         | For open source databases in particular, read and write
         | throughput is significantly throttled by poor storage engine
         | performance, so the ratio of read/write performance is almost
         | arbitrary. That 3:1 ratio isn't a good heuristic because the
         | absolute values in these cases could be much higher. A more
         | optimal design would offer integer factor throughput
         | improvements for both reading and writing, but it is difficult
         | to estimate what the ratio "should" be on a given server absent
         | a database engine that can really drive the hardware.
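          | 
          | To make the pipelining point concrete, here's what it looks
          | like with the go-redis client (the v8 API; the address, key
          | names and counts are placeholders). Thousands of SETs go out
          | before a single reply round trip is awaited:
          | 
          |   package main
          |   
          |   import (
          |       "context"
          |       "fmt"
          |   
          |       "github.com/go-redis/redis/v8"
          |   )
          |   
          |   func main() {
          |       ctx := context.Background()
          |       rdb := redis.NewClient(&redis.Options{
          |           Addr: "localhost:6379",
          |       })
          |   
          |       // Queue 10k SETs locally, then flush them in one
          |       // batch; the server processes them back to back
          |       // instead of waiting one client RTT per command.
          |       pipe := rdb.Pipeline()
          |       for i := 0; i < 10000; i++ {
          |           pipe.Set(ctx, fmt.Sprintf("key:%d", i), i, 0)
          |       }
          |       if _, err := pipe.Exec(ctx); err != nil {
          |           panic(err)
          |       }
          |   }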
        
         | AlphaSite wrote:
         | I wonder how geode performs here.
        
         | refenestrator wrote:
         | Raft involves waiting for fsync on a majority of nodes, so
         | that's not too surprising.
         | 
         | 'Typical' is a matter of what guarantees you want to give.
        
           | toolz wrote:
           | Typically people use raft for leader election which in turn
           | can coordinate writes. I don't think the writes are being
           | fsync'd in the raft logs here. At least I wouldn't expect
           | that behavior.
        
             | alexchamberlain wrote:
             | Each write should be fsync'd to the WAL, right?
        
               | stingraycharles wrote:
                | Yes, but those can happen at the convenience of the
                | particular node, not necessarily as a globally
                | checkpointed fsync()
        
               | refenestrator wrote:
               | Then you're sacrificing consistency guarantees. If less
               | than a majority have committed a write, it could be lost
               | while the cluster still has a quorum up.
               | 
                | Waiting to report success until a majority have
                | committed allows you to make guarantees with a
                | straight face. "It will probably be committed in the
                | near future" is not the same thing.
        
             | jimsimmons wrote:
             | You don't understand Raft. Quorum has to fsync for commit
        
             | skyde wrote:
              | For the Raft algorithm to be correct, fsync is required
              | on a majority of nodes; otherwise you are technically
              | not implementing Raft.
              | 
              | The reason is that in Raft, if a node acknowledges to
              | the leader that it wrote something to the log, it must
              | not later accept a different write in the same log
              | position.
              | 
              | This means that if for some reason a server rebooted
              | with dirty buffered writes that could not be flushed in
              | time, it's supposed to forget everything it knows and
              | rejoin the cluster using a brand new node id.
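              | 
              | A minimal sketch of that rule (illustrative types, not
              | IceFireDB's or any real Raft library's code): the
              | follower makes appended entries durable before it
              | acknowledges them to the leader.
              | 
              |   package main
              |   
              |   import (
              |       "encoding/binary"
              |       "os"
              |   )
              |   
              |   type Entry struct {
              |       Term, Index uint64
              |       Data        []byte
              |   }
              |   
              |   type WAL struct{ f *os.File }
              |   
              |   // AppendAndAck returns true only once the entries
              |   // are on stable storage; acking before the fsync is
              |   // what lets a crashed node "forget" an entry it
              |   // promised to keep. (A real record would also encode
              |   // the data length and a checksum.)
              |   func (w *WAL) AppendAndAck(entries []Entry) bool {
              |       for _, e := range entries {
              |           var hdr [16]byte
              |           binary.LittleEndian.PutUint64(hdr[0:8], e.Term)
              |           binary.LittleEndian.PutUint64(hdr[8:16], e.Index)
              |           if _, err := w.f.Write(hdr[:]); err != nil {
              |               return false
              |           }
              |           if _, err := w.f.Write(e.Data); err != nil {
              |               return false
              |           }
              |       }
              |       if err := w.f.Sync(); err != nil { // the fsync
              |           return false
              |       }
              |       return true // only now tell the leader "match"
              |   }
              |   
              |   func main() {
              |       f, err := os.OpenFile("raft.wal",
              |           os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
              |       if err != nil {
              |           panic(err)
              |       }
              |       w := &WAL{f: f}
              |       w.AppendAndAck([]Entry{
              |           {Term: 1, Index: 1, Data: []byte("set foo bar")},
              |       })
              |   }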
        
         | inglor wrote:
          | Often reads of already-committed data only need to hit one
          | node, but writes have to wait for a majority, i.e. for
          | multiple nodes to receive and acknowledge the write.
         | 
         | I haven't checked the code though so I might be off.
        
       | maxpert wrote:
        | A database without any test harness? While this could be a
        | good toy or PoC, I would never use it in production. Readers
        | should be aware: just because it's on HN doesn't mean it's
        | production ready.
        
         | hughrr wrote:
          | It uses Raft underneath, which in my experience means a
          | bunch of non-determinism and hell for anyone who operates
          | it. The thing is cursed.
          | 
          | Source: several years dealing with vault and consul.
        
           | tempest_ wrote:
            | We use consul a bit for some "light" service discovery
            | and as a KV store for a few things, without much issue so
            | far.
           | 
           | What demons did you encounter with vault/consul?
        
             | hughrr wrote:
             | We had some massive problems including complete cluster
             | collapse requiring rebuilds from scratch, eternal
             | leadership elections and occasionally nodes would just
             | entirely stop responding to KV requests causing cascading
             | failures outside. Vault is a massive damage multiplier for
             | these issues plus some other nasty ones like buggy barely
             | supported plugins.
        
           | PYTHONDJANGO wrote:
           | Please post URLs to bugs / issues to give your comment some
           | cred. Thanks!
        
           | cortesoft wrote:
           | What is a better consensus protocol to use?
        
             | hughrr wrote:
             | Look higher up the problem domain and solve it without
             | requiring a consensus protocol.
        
               | cortesoft wrote:
               | Hmm, how do you have high availability consistent data
               | without a consensus protocol? No matter where in the
               | problem chain you move, you have to eventually solve that
               | problem.
        
         | tomnipotent wrote:
         | A database without any code, actually.
         | 
         | It's less than a few hundred lines of Go that just wraps two
         | other databases (syndtr/goleveldb and ledisdb/ledisdb) with a
         | third library (tidwall/uhaha) that provides a Raft API.
        
           | 3np wrote:
            | Oooh, that means I can fork it to do Redis instead,
            | right? Because that could be a nice way to do Redis
            | clustering
        
             | tomnipotent wrote:
             | I suppose you could stick tidwall/uhaha directly in front
             | of redis, but I'm not entirely certain what you'd call
             | that...
             | 
             | Here's the LSET code:
             | 
              | https://github.com/gitsrc/IceFireDB/blob/main/lists.go#L232
              | 
              |   func cmdLSET(m uhaha.Machine, args []string) (interface{}, error) {
              |       if len(args) != 4 {
              |           return nil, rafthub.ErrWrongNumArgs
              |       }
              |   
              |       index, err := ledis.StrInt64([]byte(args[2]), nil)
              |       if err != nil {
              |           return nil, err
              |       }
              |   
              |       if err := ldb.LSet([]byte(args[1]), int32(index),
              |           []byte(args[3])); err != nil {
              |           return nil, err
              |       }
              |       return redcon.SimpleString("OK"), nil
              |   }
             | 
             | So what "IceFireDB" is:
             | 
             | 1. tidwall/uhaha - Raft server (m uhaha.Machine, rafthub)
             | 
             | 2. tidwall/redcon - Read/write redis protocol
             | (redcon.SimpleString)
             | 
             | 3. ledisdb/ledisdb - Redis-compatible with disk persistence
             | via leveldb (ldb.LSet)
             | 
             | 4. syndtr/goleveldb/leveldb - Provides snapshots, other
             | scattered references throughout code
             | 
              | It also includes this seemingly random file below,
              | which implements zero-copy conversions between strings
              | and byte slices using unsafe.Pointer:
             | 
             | https://github.com/siddontang/go/blob/master/hack/hack.go
              |   // no copy to change slice to string
              |   // use your own risk
              |   func String(b []byte) (s string) {
              |       pbytes := (*reflect.SliceHeader)(unsafe.Pointer(&b))
              |       pstring := (*reflect.StringHeader)(unsafe.Pointer(&s))
              |       pstring.Data = pbytes.Data
              |       pstring.Len = pbytes.Len
              |       return
              |   }
              |   
              |   // no copy to change string to slice
              |   // use your own risk
              |   func Slice(s string) (b []byte) {
              |       pbytes := (*reflect.SliceHeader)(unsafe.Pointer(&b))
              |       pstring := (*reflect.StringHeader)(unsafe.Pointer(&s))
              |       pbytes.Data = pstring.Data
              |       pbytes.Len = pstring.Len
              |       pbytes.Cap = pstring.Len
              |       return
              |   }
        
       | bayesian_horse wrote:
       | Not quite sure what this is supposed to be good for.
        
         | detaro wrote:
         | I'd guess for "I want redis, but more durable clustering"
         | (although I don't quite remember how much redis nowadays offers
         | there itself). Would want a lot more info before trusting it
         | for that though.
        
           | bayesian_horse wrote:
           | You can configure Redis to be plenty durable... but all the
           | data has to fit into memory somewhere.
        
         | inglor wrote:
         | It speaks Redis so ideally it can replace Redis in cases where
         | persistence is required.
         | 
         | There are already several community solutions for Redis
         | persistence - this one provides different guarantees.
         | 
         | The name implies the goal is to make it easy to mix "hot" (from
         | memory) and "cold" (from disk) data. The author suggests this.
        
           | brasic wrote:
           | Redis itself already supports a number of persistence schemes
           | and has since the beginning:
           | https://redis.io/topics/persistence
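            | 
            | For reference, the two main schemes are a couple of lines
            | in redis.conf (the thresholds here are just common example
            | values):
            | 
            |   # RDB: snapshot to disk if >= 1 key changed in 900s
            |   save 900 1
            |   
            |   # AOF: log every write, fsync at most once per second
            |   appendonly yes
            |   appendfsync everysec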
        
             | inglor wrote:
             | That's a good point and I should have been clearer.
             | 
             | I might be off (and probably am) but if I remember
             | correctly Redis persistence is more for disaster recovery -
             | you can create snapshots and recover them or replay a log
             | file. That's very different in terms of performance
             | guarantees from persisting the data itself to disk and
             | reading from it.
             | 
             | I was under the impression that's what tools (like this
             | one) and stuff like Ardb try to solve.
        
               | parhamn wrote:
                | You're right. Redis will persist either an RDB
                | snapshot or the AOF log, but your whole dataset must
                | still fit in memory (the AOF file is replayed to
                | rebuild memory on boot).
        
               | bayesian_horse wrote:
               | I wouldn't call it disaster recovery per se.
               | 
               | It's just that Redis is mostly an in-memory database and
               | if the process is terminated and restarted (for all sorts
               | of reasons) the data can be restored from disk.
               | 
               | So what IceFireDB might be good for is data which would
               | not fit easily into the memory of one node.
               | 
               | Again, it's really not clear to me.
        
               | rubicon33 wrote:
               | I often see projects like this posted on HN, and it's
               | very unclear to me what the actual use case is. Does
               | anyone even end up actually using these things? I guess
               | the developers hope it takes off, and they gain notoriety
               | as 'the guy who made X'?
               | 
               | It's unclear.
        
       | tptacek wrote:
        | See also rqlite, which seems like this, but with sqlite
        | instead of Redis. Super interesting.
       | 
       | https://github.com/rqlite/rqlite
        
         | otoolep wrote:
         | rqlite author here, happy to answer any questions.
        
       ___________________________________________________________________
       (page generated 2021-08-21 23:00 UTC)