[HN Gopher] IceFireDB: Distributed disk storage database based o...
___________________________________________________________________
IceFireDB: Distributed disk storage database based on Raft and
Redis protocol
Author : thunderbong
Score : 124 points
Date : 2021-08-21 14:23 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| wiremine wrote:
| One of the child comments made the observation that "this speaks
| Redis."
|
| Makes me wonder if there is any spec for the Redis commands.
| I.e., in the same way that SQL defines an interface, but leaves
| the details up to individual implementations, is there a "Redis"
| interface that leaves the details up to the implementation?
|
| I'm thinking of something similar to ISO or RFC.
| WJW wrote:
| https://redis.io/topics/protocol ?
| rndgermandude wrote:
| I've implemented a subset of Redis in the past, and went by
| their official docs: first the wire protocol[1], then the docs
| for individual commands such as SET[2]. They also have
| a test suite, and I extracted the bits that applied to my
| partial implementation from there.
|
| The only real pitfall was figuring out which parts of the
| CONFIG command I needed to implement to make popular Redis
| client libs talk to me and/or use the newer protocol features.
|
| The rest was pretty straightforward: just read the docs for a
| command, implement the stuff, run the test suite, fix any bugs,
| repeat.
|
| As far as I know there is no RFC let alone an ISO standard.
|
| [1] https://redis.io/topics/protocol
|
| [2] https://redis.io/commands/set
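|
| For anyone curious, "going by the docs" boils down to something
| like this minimal RESP sketch in Go (standard library only; it
| handles just PING, SET and GET, skips inline commands, RESP3 and
| most error handling, and is not taken from any of the projects
| discussed here):
|
|     package main
|
|     import (
|         "bufio"
|         "fmt"
|         "net"
|         "strconv"
|         "strings"
|         "sync"
|     )
|
|     var (
|         mu    sync.Mutex
|         store = map[string]string{}
|     )
|
|     // readCommand parses one RESP array of bulk strings, e.g.
|     // "*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n" -> [SET foo bar].
|     // Values containing CRLF are not handled; this is a sketch.
|     func readCommand(r *bufio.Reader) ([]string, error) {
|         header, err := r.ReadString('\n')
|         if err != nil {
|             return nil, err
|         }
|         n, err := strconv.Atoi(strings.TrimSpace(header[1:]))
|         if err != nil {
|             return nil, err
|         }
|         args := make([]string, 0, n)
|         for i := 0; i < n; i++ {
|             if _, err := r.ReadString('\n'); err != nil { // "$<len>" line
|                 return nil, err
|             }
|             data, err := r.ReadString('\n')
|             if err != nil {
|                 return nil, err
|             }
|             args = append(args, strings.TrimRight(data, "\r\n"))
|         }
|         return args, nil
|     }
|
|     func handle(c net.Conn) {
|         defer c.Close()
|         r := bufio.NewReader(c)
|         for {
|             args, err := readCommand(r)
|             if err != nil || len(args) == 0 {
|                 return
|             }
|             switch strings.ToUpper(args[0]) {
|             case "PING":
|                 fmt.Fprint(c, "+PONG\r\n")
|             case "SET":
|                 if len(args) != 3 {
|                     fmt.Fprint(c, "-ERR wrong number of arguments\r\n")
|                     break
|                 }
|                 mu.Lock()
|                 store[args[1]] = args[2]
|                 mu.Unlock()
|                 fmt.Fprint(c, "+OK\r\n")
|             case "GET":
|                 if len(args) != 2 {
|                     fmt.Fprint(c, "-ERR wrong number of arguments\r\n")
|                     break
|                 }
|                 mu.Lock()
|                 v, ok := store[args[1]]
|                 mu.Unlock()
|                 if !ok {
|                     fmt.Fprint(c, "$-1\r\n") // nil bulk reply
|                 } else {
|                     fmt.Fprintf(c, "$%d\r\n%s\r\n", len(v), v)
|                 }
|             default:
|                 fmt.Fprintf(c, "-ERR unknown command '%s'\r\n", args[0])
|             }
|         }
|     }
|
|     func main() {
|         ln, err := net.Listen("tcp", ":6380")
|         if err != nil {
|             panic(err)
|         }
|         for {
|             c, err := ln.Accept()
|             if err != nil {
|                 panic(err)
|             }
|             go handle(c)
|         }
|     }
|
| With redis-cli pointed at :6380, PING/SET/GET round-trip;
| everything else (CONFIG included, as mentioned above) comes back
| as an error.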
| aranchelk wrote:
| You'd probably want to define two specs, a basic one and a
| full one.
| There are several Redis-compatible data stores, but (if memory
| serves) you'll find they almost always lack some advanced Redis
| features, e.g. transactions.
| linux2647 wrote:
| Probably not like an ISO or RFC. Probably more like AWS S3: it
| has an API that other software conforms to, but it isn't
| strictly speaking a standard
| spookylettuce wrote:
| Quasi-related: what are some good hosted alternatives to AWS
| dynamodb / GCloud Firestore that are a) fast b) affordable at
| scale c) have a good local dev experience?
|
| A hosted, disk-based, Redis-protocol-compliant store capable of
| handling sub-TB datasets would be a dream for me.
| skinnyarms wrote:
| I was surprised at how easy it was to get started with
| Cassandra on DataStax: https://www.datastax.com/
| tomnipotent wrote:
| Cloudflare Workers KV is really promising, but needs a better
| local dev story (no stable project to simulate services
| locally, e.g. cloudworkers). Pricing is reasonable depending on
| what you
| interpret "scale" to be.
| didip wrote:
| Does it have a helm chart?
| edoceo wrote:
| See also Tendis https://github.com/Tencent/Tendis
|
| Tendis is a high-performance distributed storage system which is
| fully compatible with the Redis protocol.
| tyingq wrote:
|       SET: 253232.12 requests per second
|       GET: 2130875.50 requests per second
|
| The 10:1 throughput ratio for GET vs SET is interesting. Redis
| being in-memory, the rates there are pretty close to the same for
| read/write.
|
| Is a 10:1 ratio typical for a storage backed distributed kv
| store?
|
| Edit: Looks like CockroachDb has roughly a 3:1 ratio, similar for
| YugabyteDB:
|
| https://www.cockroachlabs.com/docs/stable/performance.html
|
| https://forum.yugabyte.com/t/large-cluster-perf-1-25-nodes/5...
|
| Also ~3:1 for etcd:
|
| https://etcd.io/docs/v3.4/op-guide/performance/
| bob1029 wrote:
| > Is a 10:1 ratio typical for a storage backed distributed kv
| store?
|
| In a single-node system, the best way to increase your write
| throughput is to batch requests over small chunks of time.
| Ultimately, the number of writes you can perform per unit time
| is bounded either by the underlying sequential I/O throughput
| or by business constraints on maximum allowable request
| latency. In the most trivial case, you write a buffer
| containing the entire day's work to disk in one shot while
| everyone sleeps. Imagine how fast that could be.
|
| A distributed system has all of the same properties, but then
| you have to put this over a denominator that additionally
| factors in the number of nodes and the latency between all
| participants. A single node is always going to give you the
| most throughput when talking about one serial narrative of
| events wherein any degree of contention is expected.
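|
| To make the batching point concrete, here is a rough sketch of
| a group-commit write loop in Go (the file name and the 5 ms
| window are made-up parameters): callers are only acknowledged
| after the batch they rode in on has been fsynced, so the fsync
| cost is amortized across the whole batch.
|
|     package main
|
|     import (
|         "os"
|         "time"
|     )
|
|     // write is one caller's request; done is signalled once the
|     // data is durable on disk.
|     type write struct {
|         data []byte
|         done chan error
|     }
|
|     // batcher accumulates writes for up to 5ms and makes each
|     // batch durable with a single fsync.
|     func batcher(f *os.File, in <-chan write) {
|         tick := time.NewTicker(5 * time.Millisecond)
|         defer tick.Stop()
|         var pending []write
|         flush := func() {
|             if len(pending) == 0 {
|                 return
|             }
|             var err error
|             for _, w := range pending {
|                 if _, e := f.Write(w.data); e != nil && err == nil {
|                     err = e
|                 }
|             }
|             if e := f.Sync(); e != nil && err == nil {
|                 err = e // one fsync covers every write in the batch
|             }
|             for _, w := range pending {
|                 w.done <- err // acknowledge all callers at once
|             }
|             pending = pending[:0]
|         }
|         for {
|             select {
|             case w, ok := <-in:
|                 if !ok {
|                     flush()
|                     return
|                 }
|                 pending = append(pending, w)
|             case <-tick.C:
|                 flush()
|             }
|         }
|     }
|
|     func main() {
|         f, err := os.OpenFile("wal.log",
|             os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
|         if err != nil {
|             panic(err)
|         }
|         in := make(chan write)
|         go batcher(f, in)
|
|         done := make(chan error, 1)
|         in <- write{data: []byte("hello\n"), done: done}
|         if err := <-done; err != nil {
|             panic(err)
|         }
|     }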
| jandrewrogers wrote:
| Comparisons of read/write ratios have to account for several
| differences in design and implementation. Representative
| benchmarks are difficult.
|
| Things that can make a difference: Databases have subtly
| different definitions of "durability", so they aren't always
| doing semantically equivalent operations. Write throughput
| sometimes scales with the number of clients and it is not
| possible to saturate the server with a single client due to
| limitations of the client protocol, so single client benchmarks
| are misleading. Some databases allow read and write operations
| to be pipelined; in these implementations it is possible for
| write performance to sometimes exceed read performance.
|
| For open source databases in particular, read and write
| throughput is significantly throttled by poor storage engine
| performance, so the ratio of read/write performance is almost
| arbitrary. That 3:1 ratio isn't a good heuristic because the
| absolute values in these cases could be much higher. A more
| optimal design would offer integer factor throughput
| improvements for both reading and writing, but it is difficult
| to estimate what the ratio "should" be on a given server absent
| a database engine that can really drive the hardware.
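|
| On the pipelining point, a pipelined workload with, say, the
| go-redis v8 client looks like the sketch below (key names and
| counts are arbitrary): many commands are queued locally and
| flushed in one round trip, which is how write throughput can end
| up looking better than a strictly one-request-per-round-trip
| read benchmark.
|
|     package main
|
|     import (
|         "context"
|         "fmt"
|
|         "github.com/go-redis/redis/v8"
|     )
|
|     func main() {
|         ctx := context.Background()
|         rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
|
|         // Queue many writes locally, then send them in one round trip.
|         pipe := rdb.Pipeline()
|         for i := 0; i < 1000; i++ {
|             pipe.Set(ctx, fmt.Sprintf("key:%d", i), i, 0)
|         }
|         if _, err := pipe.Exec(ctx); err != nil {
|             panic(err)
|         }
|
|         // Reads can be pipelined the same way.
|         pipe = rdb.Pipeline()
|         get := pipe.Get(ctx, "key:42")
|         if _, err := pipe.Exec(ctx); err != nil {
|             panic(err)
|         }
|         fmt.Println(get.Val())
|     }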
| AlphaSite wrote:
| I wonder how geode performs here.
| refenestrator wrote:
| Raft involves waiting for fsync on a majority of nodes, so
| that's not too surprising.
|
| 'Typical' is a matter of what guarantees you want to give.
| toolz wrote:
| Typically people use raft for leader election which in turn
| can coordinate writes. I don't think the writes are being
| fsync'd in the raft logs here. At least I wouldn't expect
| that behavior.
| alexchamberlain wrote:
| Each write should be fsync'd to the WAL, right?
| stingraycharles wrote:
| Yes, but those can happen at the convenience of the
| particular node, not necessarily as a globally
| checkpointed fsync().
| refenestrator wrote:
| Then you're sacrificing consistency guarantees. If less
| than a majority have committed a write, it could be lost
| while the cluster still has a quorum up.
|
| Waiting to report success until a majority have committed
| allows you to make guarantees with a straight face. "It
| will probably be committed in the near future" is not the
| same thing.
| jimsimmons wrote:
| You don't understand Raft. A quorum has to fsync for a commit.
| skyde wrote:
| For the Raft algorithm to be correct, fsync is required on a
| majority of nodes; otherwise you are technically not
| implementing Raft.
|
| The reason is that in Raft, if a node acknowledges to the
| leader that it wrote something to the log, it must not later
| accept a different write in the same log position.
|
| This means that if a server reboots having lost dirty
| buffered writes that were never flushed in time, it is
| supposed to forget everything it knew and rejoin the cluster
| using a brand new node ID.
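|
| A highly simplified sketch of the commit rule being described
| (toy types, replication shown sequentially for brevity, not
| from any real Raft library): the leader appends and syncs
| locally, replicates to followers, and only reports success once
| a majority of durable acknowledgements are in.
|
|     package main
|
|     import (
|         "errors"
|         "fmt"
|     )
|
|     // follower models the one thing that matters here: an append
|     // call that returns once the entry is durably written (fsynced).
|     type follower interface {
|         AppendAndSync(entry []byte) error
|     }
|
|     // replicate returns nil only after the entry is durable on a
|     // majority of the cluster (leader included), which is when Raft
|     // may report success to the client.
|     func replicate(leader *diskLog, followers []follower, entry []byte) error {
|         if err := leader.AppendAndSync(entry); err != nil {
|             return err
|         }
|         acks := 1 // the leader's own durable copy counts
|         for _, f := range followers {
|             if f.AppendAndSync(entry) == nil {
|                 acks++
|             }
|         }
|         if acks <= (len(followers)+1)/2 {
|             return errors.New("no quorum: entry not committed")
|         }
|         return nil // committed: safe to acknowledge the client
|     }
|
|     type diskLog struct{ entries [][]byte }
|
|     func (l *diskLog) AppendAndSync(entry []byte) error {
|         l.entries = append(l.entries, entry) // a real log would fsync here
|         return nil
|     }
|
|     func main() {
|         leader := &diskLog{}
|         followers := []follower{&diskLog{}, &diskLog{}} // 3-node cluster
|         fmt.Println(replicate(leader, followers, []byte("SET foo bar")))
|     }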
| inglor wrote:
| Often reads of data already committed only need to hit one
| node, but writes have to wait for a majority of nodes to
| receive and acknowledge them.
|
| I haven't checked the code though so I might be off.
| maxpert wrote:
| A database without any test harness? While this could be a good
| toy or PoC, I would never use it in production. Readers should
| be aware: just because it's on HN doesn't mean it's
| production-ready.
| hughrr wrote:
| It uses Raft underneath as well, which in my experience means a
| bunch of non-determinism and hell for anyone who operates it.
| The thing is cursed.
|
| Source: several years dealing with vault and consul.
| tempest_ wrote:
| We use consul a bit for some "light" service discovery and as
| a KV store for a few things, without much issue so far.
|
| What demons did you encounter with vault/consul?
| hughrr wrote:
| We had some massive problems, including complete cluster
| collapse requiring rebuilds from scratch, eternal leadership
| elections, and nodes that would occasionally just stop
| responding to KV requests entirely, causing cascading
| failures outside the cluster. Vault is a massive damage
| multiplier for these issues, plus some other nasty ones of
| its own like buggy, barely supported plugins.
| PYTHONDJANGO wrote:
| Please post URLs to bugs / issues to give your comment some
| cred. Thanks!
| cortesoft wrote:
| What is a better consensus protocol to use?
| hughrr wrote:
| Look higher up the problem domain and solve it without
| requiring a consensus protocol.
| cortesoft wrote:
| Hmm, how do you have highly available, consistent data
| without a consensus protocol? No matter where in the
| problem chain you move, you have to eventually solve that
| problem.
| tomnipotent wrote:
| A database without any code, actually.
|
| It's less than a few hundred lines of Go that just wraps two
| other databases (syndtr/goleveldb and ledisdb/ledisdb) with a
| third library (tidwall/uhaha) that provides a Raft API.
| 3np wrote:
| Oooh, does that mean I could make it front Redis instead?
| Because that could be a nice way to do Redis clustering.
| tomnipotent wrote:
| I suppose you could stick tidwall/uhaha directly in front
| of redis, but I'm not entirely certain what you'd call
| that...
|
| Here's the LSET code:
|
| https://github.com/gitsrc/IceFireDB/blob/main/lists.go#L232
|     func cmdLSET(m uhaha.Machine, args []string) (interface{}, error) {
|         if len(args) != 4 {
|             return nil, rafthub.ErrWrongNumArgs
|         }
|         index, err := ledis.StrInt64([]byte(args[2]), nil)
|         if err != nil {
|             return nil, err
|         }
|         if err := ldb.LSet([]byte(args[1]), int32(index),
|             []byte(args[3])); err != nil {
|             return nil, err
|         }
|         return redcon.SimpleString("OK"), nil
|     }
|
| So what "IceFireDB" is:
|
| 1. tidwall/uhaha - Raft server (m uhaha.Machine, rafthub)
|
| 2. tidwall/redcon - Read/write redis protocol
| (redcon.SimpleString)
|
| 3. ledisdb/ledisdb - Redis-compatible with disk persistence
| via leveldb (ldb.LSet)
|
| 4. syndtr/goleveldb/leveldb - Provides snapshots, other
| scattered references throughout code
|
| It also pulls in this seemingly random file below, which
| implements zero-copy string/[]byte conversions using
| unsafe.Pointer:
|
| https://github.com/siddontang/go/blob/master/hack/hack.go
|     // no copy to change slice to string
|     // use your own risk
|     func String(b []byte) (s string) {
|         pbytes := (*reflect.SliceHeader)(unsafe.Pointer(&b))
|         pstring := (*reflect.StringHeader)(unsafe.Pointer(&s))
|         pstring.Data = pbytes.Data
|         pstring.Len = pbytes.Len
|         return
|     }
|
|     // no copy to change string to slice
|     // use your own risk
|     func Slice(s string) (b []byte) {
|         pbytes := (*reflect.SliceHeader)(unsafe.Pointer(&b))
|         pstring := (*reflect.StringHeader)(unsafe.Pointer(&s))
|         pbytes.Data = pstring.Data
|         pbytes.Len = pstring.Len
|         pbytes.Cap = pstring.Len
|         return
|     }
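|
| The catch with those conversions is aliasing: the resulting
| string shares the byte slice's memory, so mutating the slice
| silently changes the "immutable" string. A tiny sketch
| (assuming the package is imported as
| github.com/siddontang/go/hack, per the URL above):
|
|     package main
|
|     import (
|         "fmt"
|
|         "github.com/siddontang/go/hack"
|     )
|
|     func main() {
|         b := []byte("hello")
|         s := hack.String(b) // zero-copy: s points at b's backing array
|         b[0] = 'J'
|         fmt.Println(s) // prints "Jello": the string changed underneath us
|     }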
| bayesian_horse wrote:
| Not quite sure what this is supposed to be good for.
| detaro wrote:
| I'd guess for "I want redis, but more durable clustering"
| (although I don't quite remember how much redis nowadays offers
| there itself). Would want a lot more info before trusting it
| for that though.
| bayesian_horse wrote:
| You can configure Redis to be plenty durable... but all the
| data has to fit into memory somewhere.
| inglor wrote:
| It speaks Redis so ideally it can replace Redis in cases where
| persistence is required.
|
| There are already several community solutions for Redis
| persistence - this one provides different guarantees.
|
| The name implies the goal is to make it easy to mix "hot"
| (in-memory) and "cold" (on-disk) data, and the author suggests
| as much.
| brasic wrote:
| Redis itself already supports a number of persistence schemes
| and has since the beginning:
| https://redis.io/topics/persistence
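|
| For reference, both schemes (RDB snapshots and the append-only
| file) are just configuration; a small sketch with the go-redis
| v8 client against a local instance ("appendonly" and "save" are
| real Redis config keys):
|
|     package main
|
|     import (
|         "context"
|         "fmt"
|
|         "github.com/go-redis/redis/v8"
|     )
|
|     func main() {
|         ctx := context.Background()
|         rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
|
|         // Turn on append-only-file persistence (same as
|         // "appendonly yes" in redis.conf); RDB snapshotting is
|         // governed by the "save" rules.
|         if err := rdb.ConfigSet(ctx, "appendonly", "yes").Err(); err != nil {
|             panic(err)
|         }
|         fmt.Println(rdb.ConfigGet(ctx, "save").Val())
|     }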
| inglor wrote:
| That's a good point and I should have been clearer.
|
| I might be off (and probably am) but if I remember
| correctly Redis persistence is more for disaster recovery -
| you can create snapshots and recover them or replay a log
| file. That's very different in terms of performance
| guarantees from persisting the data itself to disk and
| reading from it.
|
| I was under the impression that's what tools (like this
| one) and stuff like Ardb try to solve.
| parhamn wrote:
| You're right. Redis will persist either an append-only file
| (AOF) or snapshots, but your whole dataset must fit in memory
| (the persisted data is only used to repopulate memory on
| boot).
| bayesian_horse wrote:
| I wouldn't call it disaster recovery per se.
|
| It's just that Redis is mostly an in-memory database and
| if the process is terminated and restarted (for all sorts
| of reasons) the data can be restored from disk.
|
| So what IceFireDB might be good for is data which would
| not fit easily into the memory of one node.
|
| Again, it's really not clear to me.
| rubicon33 wrote:
| I often see projects like this posted on HN, and it's
| very unclear to me what the actual use case is. Does
| anyone even end up actually using these things? I guess
| the developers hope it takes off, and they gain notoriety
| as 'the guy who made X'?
|
| It's unclear.
| tptacek wrote:
| See also rqlite, which seems like this, but sqlite instead of
| Redis. Super interesting.
|
| https://github.com/rqlite/rqlite
| otoolep wrote:
| rqlite author here, happy to answer any questions.
___________________________________________________________________
(page generated 2021-08-21 23:00 UTC)