[HN Gopher] Garage, our self-hosted distributed object storage s...
___________________________________________________________________
Garage, our self-hosted distributed object storage solution
Author : lxpz
Score : 343 points
Date : 2022-02-08 10:20 UTC (12 hours ago)
(HTM) web link (garagehq.deuxfleurs.fr)
(TXT) w3m dump (garagehq.deuxfleurs.fr)
| southerntofu wrote:
| Hello, quick question: from what I understand, since it's based
| on CRDTs, any server in the cluster may save something to the
| storage, which then gets replicated elsewhere.
|
| That's fine when there is absolute trust between the server
| operators, but am I correct to assume it's not the same threat
| model as encrypted/signed backups pushed to some friends who
| lend you storage space (and who could corrupt your backup but
| can't corrupt your working data)?
|
| If my understanding is correct, maybe your homepage should
| outline the threat model more clearly (i.e. a single trusted
| operator for the whole cluster) and point to other solutions like
| TAHOE-LAFS for other use-cases.
|
| Congratulations on doing self-hosting with friends, that's
| pretty cool! Do you have any idea whether some hosting
| cooperatives (CHATONS) or ISPs (FFDN) have practical use-cases
| in mind? They already have many physical locations and rather
| good bandwidth, but I personally can't think of an interesting
| idea.
| ClumsyPilot wrote:
| I was thinking StorJ and Filecoin are the ones attempting to
| provide storage with a 'trust no one' model
| lxpz wrote:
| You are absolutely correct concerning the trust model of
| Garage: any administrator of one node is an administrator of
| the entire cluster, and can manipulate data arbitrarily in the
| system.
|
| From an ideological perspective, we are strongly attached to
| the building of tight-knit communities in which strong trust
| bonds can emerge, as it gives us more meaning than living in an
| individualized society where all exchanges between individuals
| are mediated by a market, or worse, by blockchain technology.
| This means that trusting several system administrators makes
| sense to us.
|
| Note that in the case of cooperatives such as CHATONS, most
| users are non-technical and have to trust their sysadmin
| anyway; here, they just have to trust several sysadmins
| instead of just one. We know that several hosting cooperatives
| of the CHATONS network are thinking like us and are interested
| in setting up systems such as Garage that work under this
| assumption.
|
| In the meantime, we also do perfectly recognize the possibility
| of a variety of attack scenarios against which we want to
| consider practical defenses, such as the following:
|
| 1/ An honest-but-curious system administrator, or an intruder in
| the network, who wants to read users' private data; or,
| equivalently, a police raid where the servers are carried off
| for inspection by state services;
|
| 2/ A malicious administrator or an intruder that wants to
| manipulate the users by introducing fake data;
|
| 3/ A malicious administrator or an intruder that simply wants
| to wreak havoc by deleting everything.
|
| Point 1 is the biggest risk in my eyes. Several solutions can
| be built to add an encryption layer over S3 for different usage
| scenarios. For instance, for storing personal files, Rclone can
| be used to add simple file encryption to an S3 bucket and can
| also be mounted directly via FUSE, allowing us to access Garage
| as an end-to-end encrypted network drive. For backups, programs
| like Restic and Borg allow us to upload our backups encrypted
| into Garage.
|
| Point 2 can be at least partially solved by adding signatures
| and verification in the encryption layer which is handled on
| the client (I don't know for sure if Rclone, Restic and Borg
| are doing this).
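The client-side verification idea in point 2 can be sketched with a symmetric MAC. This is only an illustration of the pattern (the helper names are invented; real tools would typically use authenticated encryption or public-key signatures rather than a bare HMAC):

```python
import hashlib
import hmac

# Client-side integrity protection, sketched with an HMAC: the key
# never leaves the client, so neither the storage provider nor its
# administrators can forge or silently alter objects undetected.
KEY = b"client-secret-never-uploaded"

def seal(payload: bytes) -> bytes:
    """Prepend a MAC tag to the payload before uploading to S3."""
    tag = hmac.new(KEY, payload, hashlib.sha256).digest()
    return tag + payload

def open_sealed(blob: bytes) -> bytes:
    """Verify the tag on download; refuse tampered data."""
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("object was tampered with")
    return payload

blob = seal(b"my backup chunk")
assert open_sealed(blob) == b"my backup chunk"
```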
|
| Point 3 is harder and probably requires adaptation on the side
| of Garage to be solved, for instance by adding a restriction on
| which nodes are allowed to propagate updates in the network and
| thus establishing a hierarchy between two categories of nodes:
| those that implement the S3 gateway and thus have full power in
| the network, and those that are only responsible for storing
| data given by the gateway but cannot originate modifications
| themselves (under the assumption that modifications must be
| accompanied by a digital signature and that only gateway nodes
| have the private keys to generate such signatures). Then we
| could separate gateway nodes according to the different buckets
| in which they can write, for better separation of concerns.
| However, as long as we are trying to implement the S3 protocol
| I think we are stuck with some imperfect solution like this,
| because S3 itself does not implement using public-key
| cryptography to attest operations sent by the users.
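The gateway/storage split described above could look roughly like this. It is a toy model under the stated assumption that only gateway nodes hold signing keys; an HMAC stands in for a real public-key signature (which blurs the public/private-key distinction — in practice Ed25519 or similar would let storage nodes verify without being able to sign):

```python
import hashlib
import hmac

GATEWAY_KEY = b"held-only-by-gateway-nodes"  # illustrative

def gateway_put(bucket: str, key: str, data: bytes) -> dict:
    """A gateway node signs every update it originates."""
    msg = bucket.encode() + b"/" + key.encode() + data
    sig = hmac.new(GATEWAY_KEY, msg, hashlib.sha256).hexdigest()
    return {"bucket": bucket, "key": key, "data": data, "sig": sig}

class StorageNode:
    """A storage-only node: it stores signed updates but cannot
    originate modifications accepted by its peers."""
    def __init__(self, verify_key: bytes):
        self.verify_key = verify_key
        self.store = {}

    def accept(self, update: dict) -> bool:
        msg = (update["bucket"].encode() + b"/" +
               update["key"].encode() + update["data"])
        expected = hmac.new(self.verify_key, msg, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(update["sig"], expected):
            return False              # reject forged/unsigned updates
        self.store[(update["bucket"], update["key"])] = update["data"]
        return True

node = StorageNode(GATEWAY_KEY)
ok = node.accept(gateway_put("photos", "a.jpg", b"..."))
```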
|
| It is perfectly true that solutions that are explicitly
| designed to handle these threats (such as TAHOE-LAFS) would
| provide better guarantees in a system that is maybe more
| consistent as a whole. However, we are also trying to juggle
| these security constraints with deployment constraints such as
| keeping compatibility with standard protocols such as S3 (and
| soon also IMAP for mailbox storage), which restricts us in the
| design choices we make.
|
| Just to clarify, the fact that any node can tamper with any of
| the data in the cluster is not strictly linked to the fact that
| we use CRDTs internally. It is true that CRDTs make it more
| difficult, but we do believe that there are solutions to this
| and we intend to implement at least some of them in our next
| project that involves mailbox storage.
| ddrdrck_ wrote:
| "or worse, by blockchain technology" -> I really would
| appreciate if you could elaborate on this. Blockchain
| technology has its pros and cons, depending on the underlying
| consensus algorithm, but it could certainly be put to good
| use in the context of a political aware project promoting
| decentralization and freedom ?
| lxpz wrote:
| I'd much rather have a system that doesn't try to solve
| human conflict by algorithms. Blockchains do this: you
| write code, the code becomes the law, and then there is no
| recourse when things go wrong. I deeply believe that we
| should rather focus on developing our interpersonal
| communication skills and building social structures that
| empower us to do things together by trusting each other and
| taking care together of the group. Algorithms are just
| tools, and we don't want them interfering in our
| relationships; in particular in the case of blockchains,
| the promise of trustlessness is absolutely opposite to this
| ideal.
| prmoustache wrote:
| From the name of the association, I gather the founders are
| Terry Pratchett / Discworld fans.
|
| For non-French speakers, Deuxfleurs is the French translation of
| Twoflower, the character appearing in The Colour of Magic, The
| Light Fantastic and Interesting Times from Terry Pratchett's
| Discworld book series.
| adrn10 wrote:
| Busted! We might even have a Bagage[0] side-project...
|
| [0] https://git.deuxfleurs.fr/Deuxfleurs/bagage
| iam-TJ wrote:
| Your git repository is not reachable over IPv6 despite publishing
| a DNS AAAA record:
|
|     $ curl -v https://git.deuxfleurs.fr/Deuxfleurs/garage/issues
|     *   Trying 2001:41d0:8:ba0b::1:443...
|     *   Trying 5.135.179.11:443...
|     * Immediate connect fail for 5.135.179.11: Network is unreachable
|
|     $ dig +noquestion +nocmd aaaa git.deuxfleurs.fr
|     ;; Got answer:
|     ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46366
|     ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 2
|     ;; OPT PSEUDOSECTION:
|     ; EDNS: version: 0, flags:; udp: 65494
|     ;; ANSWER SECTION:
|     git.deuxfleurs.fr.                6822  IN  CNAME  hammerhead.machine.deuxfleurs.fr.
|     hammerhead.machine.deuxfleurs.fr. 6822  IN  AAAA   2001:41d0:8:ba0b::1
|
| This from an IPv6-only network. Since you publish an IPv6 record
| our DNS64/NAT64 gateway doesn't get involved hence the IPv4
| immediate connect fail.
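The failure mode reported here can be illustrated with a sketch of the fallback loop most clients perform (the function name is invented): `getaddrinfo` results are tried in order, IPv6 typically first, so a published-but-unreachable AAAA record stalls or breaks the connection before the A record is ever tried:

```python
import socket

# Try each resolved address in order, falling back to the next one
# (e.g. the IPv4 A record) only if the previous attempt fails.
def connect_first_working(host: str, port: int, timeout: float = 2.0):
    last_err = OSError("no addresses for %s" % host)
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError as err:
            last_err = err        # remember and move on to the next address
    raise last_err
```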
| superboum wrote:
| Thanks for reporting the problem.
|
| I just deleted the AAAA entry for this machine. In the
| meantime, if the result is cached for you, you can pass the
| `-4` argument to force IPv4:
|
|     git clone -4 git@git.deuxfleurs.fr:Deuxfleurs/garage.git
|
| As a second step, we will work on a better/working IPv6
| configuration for our Git repository and all of our services
| (we use Nomad+Docker and have not yet found a satisfying way
| to expose IPv6).
| iam-TJ wrote:
| Looks like the TTL on the records is still affecting local
| resolution. I'm aware of the git protocol work-around to
| force IPv4 but that doesn't help when using a regular web-
| browser to visit the published URL for issues (first thing I
| evaluate for a new-to-me project):
|
| https://git.deuxfleurs.fr/Deuxfleurs/garage/issues
| adrn10 wrote:
| My bad T_T My colleagues keep picking on me for this. Now that
| you, complete stranger, see it too, I feel obliged to improve
| the situation!
| lxpz wrote:
| For people interested in digging in, we have an interesting
| benchmark here showing how it significantly outperforms Minio in
| the geo-distributed setting:
| https://garagehq.deuxfleurs.fr/documentation/design/benchmar...
|
| We also made a presentation at FOSDEM'22, the video should be out
| soon.
| superboum wrote:
| I tried to answer a comment that was deleted, probably because
| of its tone. Instead of deleting my answer, I want to share the
| reworded criticisms and my answers.
|
| > Why not use Riak and add an S3 API around it?
|
| Riak was developed by a company named Basho that went bankrupt
| some years ago; the software is not developed anymore. In fact,
| we do not need to add an S3 API around Riak KV: Basho even
| released "Riak Cloud Storage"[0], which does exactly that,
| providing an S3 API on top of the Riak KV architecture. We plan
| to release a comparison between Garage and Riak CS; Garage has
| some interesting features that Riak CS does not have! In
| practice, implementing an object store on top of a DynamoDB-like
| KV store is not that straightforward. For example, Exoscale, a
| cloud provider, went this way for the first implementation of
| their object store, Pithos[1], but rewrote it later, as you need
| special logic to handle your chunks (they did not publish Pithos
| v2).
|
| > Most apps don't have S3 support
|
| We maintain an "integrations" section in our documentation
| listing all the compatible applications. Garage already works
| with Matrix, Mastodon, Peertube, Nextcloud, Restic (an
| alternative to Borg), Hugo and Publii (a static site generator
| with a GUI). These applications are only a fraction of all
| existing applications, but our software is targeted at the
| needs of its users/hosters.
|
| > A distributed system is not necessarily highly available
|
| I will not fight on the wording: we come from an academic
| background where the term "distributed computing" has a specific
| meaning that may differ outside. In our field, we define models
| where we study systems made of processes that can crash.
| Depending on your algorithms and the properties you want, you can
| prove that your system will work despite some crashes. We want to
| build software on these academic foundations. This is also the
| reason we put "Standing on the shoulders of giants" on our front
| page and link to research papers. To put it in a nutshell, one
| criticism we address to other software is that they sometimes
| lack theoretical/academic foundations, which leads to unexpected
| failures and more work for sysadmins. But on the theoretical
| side, Basho and Riak were exemplary and a model for us!
|
| [0]: https://docs.riak.com/riak/cs/2.1.1/index.html [1]:
| https://github.com/exoscale/pithos
| lxpz wrote:
| I also had some things to answer to that comment which I'll put
| back here for the record.
|
| > I have been building distributed systems for 20 years, and
| they are not more reliable. They are probabilistically more
| likely to fail.
|
| It depends on what kind of faults you want to protect against.
| In our case, we are hosting servers at home, meaning that any
| one of them could be disconnected at any time due to a power
| outage, a fiber being cut off, or any number of reasons. We are
| also running old hardware where individual machines are more
| likely to fail. We also do not run clusters composed of very
| large numbers of machines, meaning that the number of
| simultaneous failures that can be expected actually remains
| quite low. This means that the choices made by Garage's
| architecture make sense for us.
|
| But maybe your point was about the distinction between
| distributed systems and high availability, in which case I
| agree. Several of us have studied distributed systems in the
| academic setting, and in our vocabulary, a distributed system
| almost by definition includes crash-tolerance and is thus HA.
| I understand that in the engineering community the vocabulary
| might be different, and thanks to your insight we might orient
| our communication more towards presenting Garage as HA, as it
| is one of our core, defining features.
|
| > However, this isn't it. This is distributed S3 with CRDTs.
| Still too application-specific, because every app that wants to
| use it has to be integrated with S3. They could have just
| downloaded something like Riak and added an S3 API around it.
|
| Garage is almost that, except that we didn't download Riak but
| made our own CRDT-based distributed storage system. It's
| actually not the most complex part at all, and most of the
| development time was spent on S3 compatibility. Rewriting the
| storage layer means that we have better integration between
| components, as everything is built in Rust and heavily depends
| on the type system to ensure things work well together. In the
| future, we plan to reuse the storage layer we built for Garage
| for other projects, in particular to build an e-mail storage
| server.
| razzio wrote:
| This is a fantastic project and I can't wait to try it out!
|
| One question on the network requirements. The web page says for
| networking: "200 ms or less, 50 Mbps or more".
|
| How hard are these requirements? For folks like me that can't
| afford a guaranteed 50Mbps internet connection, is this still
| usable?
|
| There are plenty of places in the world where 50Mbps internet
| connectivity would be a dream. Even here in Canada there are
| plenty of places with a max of 10Mbps. The African continent for
| example will have many more.
| adrn10 wrote:
| This depends on your workload. It would work, but we first
| target fiber optics domestic connections.
| lxpz wrote:
| It's not a hard requirement; you might just have a harder time
| handling large files, as Garage nodes will in all cases have
| to transfer data internally in the cluster (remember that files
| have to be sent to three nodes when they are stored in Garage,
| meaning 3x more bandwidth is used). 10Mbps is already
| pretty good if it is stable and your ping isn't off the charts,
| and it might be totally workable depending on your use case.
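The 3x bandwidth point can be made concrete with a rough, illustrative calculation (assumed numbers; worst case where all replica traffic shares one uplink):

```python
# Uploading a 1 GB file over a 10 Mbps uplink when the receiving
# node must forward two extra copies (3x replication) over that
# same link.
file_megabits = 1000 * 8        # 1 GB expressed in megabits
uplink_mbps = 10
replication = 3

seconds = file_megabits * replication / uplink_mbps
print(round(seconds / 60))      # 40 minutes; non-replicated would take ~13
```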
| latchkey wrote:
| I see zero unit/integration tests in the source tree, not even a
| `test` target in the Makefile. Terrifying.
| lxpz wrote:
| Then you didn't look close enough. Have a look at the
| .drone.yml to see the entry points, we have both unit tests and
| integration tests that run at each build.
| lloydatkinson wrote:
| > account the geographical location of servers, and ensures that
| copies of your data are located at different locations when
| possible for maximal redundancy, a unique feature in the
| landscape of distributed storage systems.
|
| Azure already does this, so claiming it's unique seems untrue.
| lxpz wrote:
| I mean, sure, but we're operating under the premise that we
| don't want to give out our data to a cloud provider and instead
| want to store data ourselves. Half of the post is dedicated to
| explaining that, so pretending that you can't infer that from
| context seems a bit unfair.
| mediocregopher wrote:
| I've been looking through the available options for this use-
| case, with the same motivation given in this post, and was
| surprised to see there wasn't something which could fit this
| niche (until now!).
|
| Really happy to see this being worked on by a dedicated team.
| Combined with a tool like https://github.com/slackhq/nebula this
| could help form the foundation of a fully autonomous, durable,
| online community.
| [deleted]
| morelish wrote:
| Think the idea is a good one. As S3 becomes a standard, it
| would be nice if someone could rip Amazon's face off (like
| Amazon has done to other open-source companies, e.g. Elastic).
| social_quotient wrote:
| Elastic was a bit with cause?
| morelish wrote:
| How do you mean?
| malteg wrote:
| what about https://min.io/
| morelish wrote:
| Cheers, I'll have a look at that.
| belfalas wrote:
| Looks very worth checking out! Will this be made available for
| installation via Cloudron some time in the future?
| merb wrote:
| this might be a stupid comment.
|
| but is this safe to use inside a commercial project as a backend
| S3 solution? i.e. just using it instead of local storage and not
| changing it?
| superboum wrote:
| Our software is published under the AGPLv3 license and comes
| with no guarantee, like any other FOSS project (if you do not
| pay for support). We consider our software to be of "public
| beta" quality, which is to say we think it works well, at least
| for us.
|
| On the plus side, it survived the Hacker News Hug of Death.
| Indeed, the website we linked is hosted on our own Garage
| cluster made of old Lenovo ThinkCentre M83s (with Intel Pentium
| G3420 CPUs and 8GB of RAM) and the cluster seems fine. We also
| host more than 100k objects in our Matrix (a chat service)
| bucket.
|
| On the minus side, this is the first time we have so much
| coverage, so our software has not yet been tested by thousands
| of people. It is possible that in the near future, some edge
| cases we never triggered will be reported. This is the reason
| why most people wait until an application reaches a certain
| level of adoption before using it; in other words, they don't
| want to pay "the early adopter cost".
|
| In the end, it's up to you :-)
| cvwright wrote:
| Awesome. Are you using the matrix-media-repo to connect the
| Matrix homeserver to Garage?
| lxpz wrote:
| No, we are simply using synapse-s3-storage-provider:
| https://github.com/matrix-org/synapse-s3-storage-provider
| [deleted]
| lxpz wrote:
| License-wise, Garage is AGPLv3 like many similar software
| projects such as MinIO and MongoDB. We are not lawyers, but in
| spirit, this should not preclude you from using it in
| commercial software as long as it is not modified. If you do
| modify it, e.g. by adding a nice administration GUI, we kindly
| ask that you publish your modifications as open source so that
| we can incorporate them back into the main product if we want.
| k8sToGo wrote:
| Do you guys expose Prometheus metrics or how is monitoring
| supposed to be done (e.g. on replication status)?
| superboum wrote:
| Currently, we have no elegant way to achieve what you want.
|
| When failures occur, repair is done through workers that log
| when they launch, when they repair chunks, and when they exit.
| We also have `garage status` and `garage stats`. The first
| command displays healthy and non-healthy nodes; the second one
| displays the queue lengths of our tables and chunks: if their
| values are greater than zero, we are repairing the cluster. We
| document failure recovery in our documentation:
| https://garagehq.deuxfleurs.fr/documentation/cookbook/recove...
|
| For the near future, we plan to integrate opentelemetry. But we
| are still discussing the design and information we want to
| track and report. We are currently discussing these questions
| in our issue tracker:
| https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/111
| https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/207
|
| If you have some knowledge/experience on this subject, feel
| free to share it in these issues.
| lxpz wrote:
| Exposing Prometheus metrics is an ongoing work which we haven't
| had much time to advance yet. For now we can check on
| replication progress or overall cluster health by inspecting
| Garage's state from the command line.
| leathersoft wrote:
| I swear to god, I thought this was a parody of yet another
| product with a fancy name for just a physical garage
| adrn10 wrote:
| This is an actual object store - we just have a talent for
| naming artifacts
| bamazizi wrote:
| Github mirror? It's extremely easy to forget and stop following
| up on a project that's not on Github!
|
| You might have personal reasons for your current choice, but
| maybe the project can have a better shot at success if it's in
| a more social and high-traffic ecosystem.
| lxpz wrote:
| Exists, already linked in another comment :
| https://github.com/deuxfleurs-org/garage
| pnocera wrote:
| I'd see a nice fit with Tailscale. Or how to create a secured
| private network of redundant and HA storage...
| adrn10 wrote:
| We haven't looked too much into VPNs, although we do need some
| plumbing to distribute our whole infrastructure at Deuxfleurs.
| Loic wrote:
| An interesting point to note is that they have some _external_
| funding from the EU.
|
| > The Deuxfleurs association has received a grant from NGI
| POINTER[0], to fund 3 people working on Garage full-time for a
| year : from October 2021 to September 2022.
|
| [0]: https://pointer.ngi.eu/
| sofixa wrote:
| That's really great, I didn't know this programme existed.
| adrn10 wrote:
| (Deuxfleurs administrative council member here:) Starting
| from a little association with 200EUR in a cash box last
| year, we were indeed amazed to actually get this funding.
|
| I strongly recommend NGI grants to any European citizen. Just
| look at the variety of profiles that NGI POINTER (one of the
| grant types) funded last year: https://pointer.ngi.eu/wp-
| content/uploads/2021/10/NGI-POINTE...
|
| They even finance individuals wanting to contribute to FOSS
| up to 50kEUR for a year.
| malteg wrote:
| what is the benefit compared to https://min.io/
| adrn10 wrote:
| We made a comparison here:
| https://garagehq.deuxfleurs.fr/documentation/design/benchmar...
|
| Other comments from my colleagues also shed light on Garage's
| specific features (do check the discussion on Raft).
|
| In a nutshell, Garage is designed for higher inter-node
| latency, thanks to fewer round-trips for reads & writes. Garage
| does not intend to compete with MinIO, though - rather, to
| expand the application domain of object stores.
| superboum wrote:
| Another benefit compared to MinIO is that we have "flexible
| topologies".
|
| Due to our design choices, you can add and remove nodes without
| any constraint on the number of nodes or the size of the
| storage. So you do not have to overprovision your cluster as
| recommended by MinIO[0].
|
| Additionally (we have planned a full blog post on this
| subject), adding or removing a node in the cluster does not
| lead to a full rebalance of the cluster. To understand why, I
| must explain how it works traditionally and how we improved on
| existing work.
|
| When you initialize the cluster, we split the cluster into
| partitions, then assign partitions to nodes (see Maglev[1]).
| Later, based on its hash, we store each piece of data in its
| corresponding partition. When a node is added or removed,
| traditional approaches rerun the whole algorithm and come up
| with a totally different partition assignment. Instead, we try
| to compute a new partition distribution that minimizes
| assignment changes, which in the end minimizes the number of
| partitions moved.
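The movement-minimizing idea can be illustrated with a toy model (not Garage's actual algorithm): when a node joins, transfer only the partitions the new node must own to even out the load, and leave everything else where it is:

```python
from collections import Counter

P = 256  # number of partitions (illustrative)

def add_node(assignment, new_node):
    """Reassign just enough partitions to the new node to even out
    load; every other partition keeps its current owner."""
    target = len(assignment) // (len(set(assignment)) + 1)
    load = Counter(assignment)
    result, moved = list(assignment), 0
    for i, owner in enumerate(result):
        if moved == target:
            break
        if load[owner] > target:          # take from overloaded nodes only
            load[owner] -= 1
            result[i] = new_node
            moved += 1
    return result, moved

initial = [f"node{i % 3}" for i in range(P)]   # 3 nodes, round-robin
after, moved = add_node(initial, "node3")
print(moved)   # 64: only 1/4 of the partitions move, not a full reshuffle
```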
|
| On the drawback side, Garage does not implement erasure coding
| (which is also the reason for many of MinIO's limitations) and
| duplicates data 3 times, which is less storage-efficient.
| Garage also implements fewer S3 endpoints than MinIO (for
| example, we do not support versioning); the full list is
| available in our documentation[2].
|
| [0]: https://docs.min.io/minio/baremetal/installation/deploy-
| mini...
|
| [1]: https://www.usenix.org/conference/nsdi16/technical-
| sessions/...
|
| [2]: https://garagehq.deuxfleurs.fr/documentation/reference-
| manua...
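The storage-overhead gap mentioned above is easy to quantify (illustrative numbers; the RS(10,4) code below is a common erasure-coding choice, not necessarily MinIO's exact parameters):

```python
data_tb = 10                         # logical data

# 3-way replication, as Garage does:
replicated_tb = data_tb * 3          # 30 TB of raw disk

# Reed-Solomon erasure coding with 10 data + 4 parity shards,
# which also tolerates the loss of any 4 shards:
k, m = 10, 4
erasure_tb = data_tb * (k + m) / k   # 14 TB of raw disk
```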
| rhim wrote:
| This is written in Rust!
| skratlo wrote:
| What's with the obsession with S3 API? Why not FUSE and just use
| any programming language without libraries to access the files?
| (Obviously without some S3/FUSE layer, without HTTP, just direct
| access to garage with the fastest protocol possible)
| skybrian wrote:
| S3 is simpler because you don't have to deal with hierarchical
| directories.
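This flatness is worth making concrete: S3 keys are plain strings, and "folders" are emulated by listing with a prefix and delimiter, roughly the way S3's ListObjects with a delimiter behaves (a simplified sketch, not the full API):

```python
# S3 has no real directories: keys are flat strings, and "folders"
# are emulated at list time.
keys = ["photos/2021/a.jpg", "photos/2022/b.jpg", "docs/c.txt"]

def list_prefix(keys, prefix, delimiter="/"):
    """Return immediate 'subdirectories' (common prefixes) and the
    objects directly under a prefix."""
    common, objects = set(), []
    for k in keys:
        if not k.startswith(prefix):
            continue
        rest = k[len(prefix):]
        if delimiter in rest:
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(k)
    return sorted(common), objects

print(list_prefix(keys, "photos/"))  # (['photos/2021/', 'photos/2022/'], [])
```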
| Fiahil wrote:
| NICE ! Very Nice ! Congrats on the project and the naming !
|
| If I may add a few thoughts: you may be very well placed to be
| a more suitable replacement for MinIO in data science / AI
| projects. Why? Because any serious MLOps setup needs 2 things:
| an append-only storage and a way to stream large content to a
| single place. The first one is very hard to get right; the
| second is relatively easy.
|
| Being CRDT-based it should be _very easy_ for you to provide an
| append-only storage that can store partitioned, ordered logs of
| immutable objects (think dataframes, and think kafka). Once you
| have that, it's really easy to build the remaining missing pieces
| (UI, API, ...) for creating a _much better_ (and distributed)
| version of MLFlow.
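The append-only, partitioned log being suggested here can be sketched in a few lines (a toy model of the Kafka-like structure the comment describes, not anything Garage ships):

```python
class AppendOnlyLog:
    """Partitioned log of immutable records: records are only ever
    appended, and offsets give a total order within each partition."""
    def __init__(self, partitions: int = 4):
        self.partitions = [[] for _ in range(partitions)]

    def append(self, key: str, record) -> tuple:
        p = hash(key) % len(self.partitions)   # same key -> same partition
        self.partitions[p].append(record)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

log = AppendOnlyLog()
p, off = log.append("experiment-1", {"step": 0, "loss": 1.0})
```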
|
| Finally, S3 protocol is an "okay" version for file storage, but
| as you are probably aware, it's clearly a huge limiter. So, trash
| it. Provide a read-only S3 gateway for compatibility, but writes
| should use a different API.
|
| PS: Galette-Saucisse <3
| adrn10 wrote:
| Thanks for your enthusiastic feedback, fellow breton!
|
| It's an interesting take you make about ML workloads. We
| haven't investigated that yet (next in line is e-mail: you see
| we target ubiquitous, low-tech needs that are horrendous to
| host). But we will definitely consider it for future works:
| small breton food trucks do need their ML solutions for better
| galettes.
| Fiahil wrote:
| If you ever need further insights into the ML/AI use case,
| don't hesitate to ping me at {my-username}-at-gmail. I'll
| forward it to my work address :)
| adrn10 wrote:
| So basically, we are a hosting association that wanted to put
| their servers at home _and_ sleep at night. This demanded inter-
| home redundancy of our data. But none of the existing solutions
| (MinIO, Ceph...) are designed for high inter-node latency. Hence
| Garage! Your cheap and proficient object store: designed to run
| on a toaster, through carrier-pigeon-grade networks, while still
| supporting tons of workloads (static websites, backups,
| Nextcloud... you name it)
| ddorian43 wrote:
| Did you try seaweedfs? Or contributing to any of them?
| adrn10 wrote:
| Basically, our association is trying to move away from file
| systems, for scalability and availability purposes. There's a
| similarity between RDBMSs (say, SQL) and filesystems, in the
| sense that they distribute like crap, the main reason being
| that their consistency properties are too strong. That said,
| one can use Garage as a file system, notably using WebDAV. It's
| a beta feature, called Bagage:
| https://git.deuxfleurs.fr/Deuxfleurs/bagage I'll let my
| knowledgeable fellows answer, if you'd like to hear about
| differences between seaweedfs and Garage in more detail :)
| ddorian43 wrote:
| seaweedfs is an object storage at the core. In my
| (personal) case, I disqualify it because of the missing
| erasure coding.
| StillBored wrote:
| Garage says that isn't a goal either.
|
| https://garagehq.deuxfleurs.fr/documentation/design/goals/
|
| "Storage optimizations: erasure coding or any other
| coding technique both increase the difficulty of placing
| data and synchronizing; we limit ourselves to
| duplication."
| superboum wrote:
| (Garage Contributor here) We reviewed many of the existing
| solutions and none of them had the feature set we wanted.
| Compared to SeaweedFS, the main difference we introduce with
| Garage is that our nodes are not specialized, which lead to
| the following benefits:
|
| - Garage is easier to deploy and to operate: you don't have
| to manage independent components like the filer, the volume
| manager, the master, etc. It also seems that a bucket must be
| pinned to a volume server on SeaweedFS. In Garage, all
| buckets are spread on the whole cluster. So you do not have
| to worry that your bucket fills one of your volume server.
|
| - Garage works better in the presence of crashes: I would be
| very interested in a deep analysis of SeaweedFS's "automatic
| master failover". They use Raft, I suppose either by running a
| healthcheck every second, which leads to data loss on a crash,
| or by sending a request for each transaction, which creates a
| huge bottleneck in their design.
|
| - Better scalability: because there is no special node, there
| are no bottlenecks. I suppose that with SeaweedFS, all the
| requests have to pass through the master. We do not have such
| limitations.
|
| As a conclusion, we chose a radically different design for
| Garage. We plan to do a more in-depth comparison in the
| future, but even today, I can say that even if we implement
| the same API, our radically different designs lead to
| radically different properties and trade-offs.
| [deleted]
| faeyanpiraat wrote:
| > In Garage, all buckets are spread on the whole cluster.
| So you do not have to worry that your bucket fills one of
| your volume server.
|
| Are you saying I do not have to worry about ONE volume
| server getting full, but instead I can worry about ALL of
| them getting full at the same time?
| lxpz wrote:
| Yes, in which case you just make your cluster bigger or
| delete some data. Seems like a reasonable compromise to
| me.
|
| Garage has advanced functionality to control how much
| data is stored on each node, and is also very flexible
| with respect to adding or removing nodes, making all of
| this very easy to do.
| singron wrote:
| Unless garage has a special mitigation against this,
| usually performance gets much worse in large clusters as
| the filesystem fills up. As files are added and deleted,
| it struggles to keep nodes exactly balanced and a growing
| percentage of nodes will be full and unavailable for new
| writes.
|
| So in a high throughput system, you may notice a soft
| performance degradation before actually running out of
| space.
|
| If you aren't performance sensitive or don't have high
| write throughput, you might not notice. This is
| definitely something you should be forecasting and
| alerting on so you can acquire and add capacity or delete
| old data.
|
| If you really don't like the idea of everything falling
| over at the same time, you could use multiple clusters or
| come up with a quota system (e.g. disable workload X if
| it's using more than Y TiB).
| ddorian43 wrote:
| > independent components like the filer, the volume
| manager, the master, etc.
|
| You can run volume/master/filer in a single server (single
| command).
|
| > filer probably needs an external rdbms to handle the
| metadata
|
| This is true. You can use an external db. Or build/embed
| some other db inside it (think a distributed kv in golang
| that you embed inside to host the metadata).
|
| > It also seems that a bucket must be pinned to a volume
| server on SeaweedFS.
|
| This is not true. A bucket will use its own volumes,
| but it can be, and is, distributed over the whole cluster
| by default.
|
| > They use Raft, I suppose either by running a healthcheck
| every second, which leads to data loss on a crash, or running
| a request for each transaction, which creates a huge
| bottleneck.
|
| Raft is for synchronized writes. A single write is slower
| because you have to wait for an "ok" from the replicas,
| which is a good thing (compared to async replication in,
| say, Cassandra/DynamoDB). Keep in mind that S3 also moved
| to synced replication. This is offset by having more
| parallelism.
|
| > Better scalability: because there is no special node,
| there are no bottlenecks. I suppose that in SeaweedFS, all
| the requests have to pass through the master. We do not
| have such limitations.
|
| Going to the master is only needed for writes, to get a
| unique id. This can easily be fixed with a plugin to, say,
| generate Twitter-snowflake IDs, which are very efficient.
| For reads, you keep a cache in your client for the volume-
| to-server mapping so you can do reads directly from the
| server that has the data, or you can randomly query a
| server and it will handle everything underneath.
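| A Twitter-snowflake generator is simple enough to sketch
| (illustrative Python, not SeaweedFS code; the classic
| 41-bit timestamp / 10-bit worker / 12-bit sequence layout):

```python
import threading
import time

class SnowflakeIds:
    """Unique, roughly time-ordered 63-bit ids:
    41-bit ms timestamp | 10-bit worker id | 12-bit sequence."""

    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024
        self.worker_id = worker_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            # never go backwards, even if the clock does
            now = max(int(time.time() * 1000), self.last_ms)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:  # 4096 ids in one ms: spin to next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.seq
```

| Each proxy node gets its own worker id, so ids can be
| minted locally with zero messages to the master.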
|
| I'm pretty sure seaweedfs has very good fundamentals, from
| having researched all the other open-source distributed
| object storage systems that exist.
| lxpz wrote:
| > Raft is for synchronized writes. It's slow in the case
| of a single-write being slow because you have to wait for
| an "ok" from replicas, which is a good thing (compared to
| async-replication in, say, cassandra/dynamodb). Keep in
| mind that s3 also moved to synced replication. This is
| fixed by having more parallelism.
|
| We have synchronous writes without Raft, meaning we are
| both much faster and still strongly consistent (in the
| sense of read-after-write consistency, not
| linearizability). This is all thanks to CRDTs.
| ddorian43 wrote:
| > This is all thanks to CRDTs.
|
| If you don't sync immediately, you may lose a node before
| it has replicated, and lose the data forever. There's no
| fancy algorithm when the machine gets destroyed before it
| replicated the data. And you can't write to 2 replicas
| simultaneously from the client like, say, when using a
| Cassandra smart driver, since S3 doesn't support that.
|
| CRDTs are nice but not magic.
| superboum wrote:
| So let's take the example of a 9-node cluster with a 100ms
| RTT over the network. In this specific (yet a little bit
| artificial) situation, Garage particularly shines compared
| to Minio or SeaweedFS (or any Raft-based object store)
| while providing the same consistency properties.
|
| For a Raft-based object store, your gateway will receive
| the write request and forward it to the leader (+100ms, 2
| messages). Then, the leader will forward this write in
| parallel to the 9 nodes of the cluster and wait for a
| majority to answer (+100ms, 18 messages). Then the leader
| will confirm the write to the whole cluster and wait for a
| majority again (+100ms, 18 messages). Finally, it will
| answer your gateway (already counted in the first step).
| In the end, our write took 300ms and generated 38 messages
| over the cluster.
|
| Another critical point with Raft is that your writes do
| not scale: they all have to go through the leader. So from
| the point of view of writes, it is not very different from
| having a single server.
|
| For a DynamoDB-like object store (Riak CS, Pithos,
| Openstack Swift, Garage), the gateway receives the request
| and knows directly which nodes must store the write. In
| Garage, we chose to store every write on 3 different
| nodes. So the gateway sends the write request to the 3
| nodes and waits for at least 2 of them to confirm the
| write (+100ms, 6 messages). In the end, our write took
| 100ms, generated 6 messages over the cluster, and the
| number of messages does not depend on the number of nodes
| in the cluster.
|
| With this model, we can still always serve up-to-date
| values. When performing a read request, we also query the
| 3 nodes that must contain the data and wait for 2 of them.
| Because we have 3 nodes, wrote to at least 2 of them, and
| read from 2 of them, we will necessarily get the last
| value. This algorithm is discussed in Amazon's Dynamo
| paper[0].
|
| I reasoned in a model with no bandwidth limits, no CPU
| limits, no contention at all. In real systems, these
| limits apply, and we think that's another argument in
| favor of Garage :-)
|
| [0]: https://dl.acm.org/doi/abs/10.1145/1323293.1294281
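| The accounting above is easy to reproduce with a toy model
| (Python, latency and message counts only; this is not
| Garage or Raft code):

```python
def raft_write(n_nodes, rtt_ms):
    """Gateway -> leader, leader -> all nodes (wait for a
    majority), leader -> all nodes again to commit (wait for
    a majority), then reply to the gateway."""
    latency = 3 * rtt_ms                       # three sequential round trips
    messages = 2 + 2 * n_nodes + 2 * n_nodes   # 2 + 18 + 18 for n = 9
    return latency, messages

def quorum_write(replicas, rtt_ms):
    """Gateway writes to all replicas in parallel and waits
    for a quorum of acks: a single round trip."""
    latency = rtt_ms
    messages = 2 * replicas                    # request + ack per replica
    return latency, messages
```

| With a 9-node cluster and 100ms RTT this gives (300, 38)
| for Raft and (100, 6) for the 3-replica quorum, matching
| the numbers above.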
| ddorian43 wrote:
| > For a Raft-based object store, your gateway will
| receive the write request and forward it to the leader
| (+100ms, 2 messages). Then, the leader will forward this
| write in parallel to the 9 nodes of the cluster and wait
| for a majority to answer (+100ms, 18 messages). Then the
| leader will confirm the write to the whole cluster and
| wait for a majority again (+100ms, 18 messages). Finally,
| it will answer your gateway (already counted in the first
| step). Our write took 300ms and generated 38 messages
| over the cluster.
|
| No. The "proxy" node, a random node that you connect to,
| will do:
|
| 0. split the file into chunks of ~4MB (can be changed)
| while streaming
|
| for each chunk (you can write chunks in parallel):
|
| 1. get id from master (can be fixed by generating an id
| in the proxy node with some custom code, 0 messages with
| custom plugin)
|
| 2. write to 1 volume-server (which will write to another
| node for majority) (2 messages)
|
| 3. update metadata layer, to keep track of chunks so you
| can resume/cancel/clean-up failed uploads (metadata may be
| another raft subsystem, think yugabytedb/cockroachdb, so
| it needs to do its own 2 writes) (2 messages)
|
| Mark as "ok" in metadata layer and return ok to client.
| (2 messages)
|
| The chunking is more complex, and you have to track more
| data, but in the end it's better. You spread a file over
| multiple servers & disks. If a server fails when using
| erasure-coding and you need to read a file, you won't have
| to "erasure-decode" the whole file, only the missing
| chunks. If you have a hot file, you can spread reads over
| many machines/disks. You can upload very big files
| (terabytes), and you can "append" to a file. You can have
| a smart client (or colocate a proxy on your client server)
| for smart routing and stuff.
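| The chunk-and-track flow could be sketched like this
| (illustrative Python; `volume_write` and `meta_put` stand
| in for the real RPCs and are made up here):

```python
CHUNK = 4 * 1024 * 1024  # ~4 MiB, as described above

def upload(stream, file_id, volume_write, meta_put):
    """Split a stream into chunks, write each chunk to a
    volume server, and record progress in the metadata layer
    so a failed upload can be resumed or cleaned up."""
    chunks = []
    while True:
        data = stream.read(CHUNK)
        if not data:
            break
        chunk_id = f"{file_id}/{len(chunks)}"
        volume_write(chunk_id, data)                 # replicated volume write
        chunks.append(chunk_id)
        meta_put(file_id, list(chunks), done=False)  # resumable state
    meta_put(file_id, chunks, done=True)             # mark upload complete
    return chunks
```

| A 9 MiB stream would produce three chunks (4 + 4 + 1 MiB),
| each independently placed, replicated and readable.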
| ricardobeat wrote:
| If you're still talking about SeaweedFS, the answer seems
| to be that it's _not_ a "raft-based object store", hence
| it's not as chatty as the parent comment described.
|
| That proxy node is a volume server itself, and uses
| simple replication to mirror its volume on another
| server. Raft consensus is not used for the writes. Upon
| replication failure, the data becomes read-only [1], thus
| giving up partition tolerance. These are not really
| comparable.
|
| [1]
| https://github.com/chrislusf/seaweedfs/wiki/Replication
| [deleted]
| vlovich123 wrote:
| How does step 1 work? My understanding is that the ID
| from the master tells you which volume server to write
| to. If you're generating it randomly, then are you saying
| you have queried the master server for the number of
| volumes upfront & then just randomly distribute it that
| way?
| chrislusf wrote:
| In snowflake id generation mode, the "which volume is
| writable" information can be read from other follower
| masters.
| lxpz wrote:
| We ensure the CRDT is synced with at least two nodes in
| different geographical areas before returning an OK
| status to a write operation. We are using CRDTs not so
| much for their asynchronous replication properties (what
| is usually touted as eventual consistency), but more as a
| way to avoid conflicts between concurrent operations so
| that we don't need a consensus algorithm like Raft. By
| combining this with the quorum system (two writes out of
| three need to be successful before returning ok), we
| ensure durability of written data but without having to
| pay the synchronization penalty of Raft.
| Ne02ptzero wrote:
| > We ensure the CRDT is synced with at least two nodes in
| different geographical areas before returning an OK
| status to a write operation [...] we ensure durability of
| written data but without having to pay the
| synchronization penalty of Raft.
|
| This is, in essence, strongly-consistent replication, in
| the sense that you wait for a majority of writes before
| answering a request. So you're still paying the latency
| cost of a round trip to at least one other node on each
| write. How is this any better than a Raft cluster with
| the same behavior? (N/2+1 write consistency)
| lxpz wrote:
| Raft consensus apparently needs more round-trips than
| that (maybe two round-trips to another node per write?),
| as evidenced by this benchmark we made against Minio:
|
| https://garagehq.deuxfleurs.fr/documentation/design/bench
| mar...
|
| Yes we do round-trips to other nodes, but we do much
| fewer of them to ensure the same level of consistency.
|
| This is to be expected from a distributed systems theory
| perspective, as consensus (or total order) is a much
| harder problem to solve than what we are doing.
|
| We haven't (yet) gone into dissecting the Raft protocol
| or Minio's implementation to figure out why exactly it is
| much slower, but the benchmark I linked above is already
| strong enough evidence for us.
| nh2 wrote:
| I think it would be great if you could make a GitHub repo
| that is just about summarising the performance
| characteristics and round-trip types of different storage
| systems.
|
| You would invite Minio/Ceph/SeaweedFS/etc. authors to
| make pull requests in there to get their numbers right
| and explanations added.
|
| This way, you could learn a lot about how other systems
| work, and users would have an easier time choosing the
| right system for their problems.
|
| Currently, one only gets detailed comparisons from HN
| discussions, which arguably aren't a great place for
| reference and easily get outdated.
| withinboredom wrote:
| Raft needs a lot more round trips than that: it needs to
| send a message about the transaction, the nodes need to
| confirm that they received it, then the leader needs to
| send back that it was committed (no response required).
| This is largely implementation-specific (etcd does more
| round trips than that IIRC), but that's the bare minimum.
| lxpz wrote:
| We don't currently have a comparison with SeaweedFS. For the
| record, we have been developing Garage for almost two years
| now, and we hadn't heard of SeaweedFS at the time we started.
|
| To me, the two key differentiators of Garage over its
| competitors are as follows:
|
| - Garage contains an evolved metadata system based on
| CRDTs and consistent hashing inspired by Dynamo, solidly
| grounded in distributed systems theory. This allows us to
| be very efficient, as we don't use Raft or other consensus
| algorithms between nodes, and we also do not rely on an
| external service for metadata storage (Postgres,
| Cassandra, whatever), meaning we don't pay an additional
| communication penalty.
|
| - Garage was designed from the start to be multi-
| datacenter aware, again helped by insights from
| distributed systems theory. In practice we explicitly
| chose not to implement erasure coding; instead we spread
| three full copies of the data over different zones, so
| that overall availability is maintained with no
| degradation in performance when a full zone goes down,
| and data locality is preserved at all locations for
| faster access (in the case of a system with three zones,
| our ideal deployment scenario).
| atombender wrote:
| Can you go into a bit more detail about how using CRDTs
| avoids the need for consensus-based replication?
| lxpz wrote:
| Sure!
|
| CRDTs are often presented as a way of building an
| asynchronous replication system with eventual
| consistency. This means that when modifying a CRDT
| object, you just update your local copy, and then
| synchronize in the background with other nodes.
|
| However this is not the only way of using CRDTs. At their
| core, CRDTs are a way to resolve conflicts between
| different versions of an object without coordination:
| when all of the different states of the object that exist
| in the network become known to a node, it applies a local
| procedure known as a merge that produces a deterministic
| outcome. All nodes do this, and once they all have, they
| are all in the same state. This way, nodes do not need to
| coordinate beforehand when making modifications, in the
| sense that they do not need to run a two-phase commit
| protocol that ensures operations are applied one after
| the other in a specific order replicated identically at
| all nodes. (This is the problem of consensus, which is
| theoretically much harder to solve from a distributed
| systems theory perspective, as well as from an
| implementation perspective.)
|
| In Garage, we have a bit of a special way of using CRDTs.
| To simplify a bit, each file stored in Garage is a CRDT
| that is replicated on three known nodes
| (deterministically decided). While these three nodes
| could synchronize in the background when an update is
| made, this would mean two things that we don't want: 1/
| when a write is made, it would be written only on one
| node, so if that node crashes before it had a chance to
| synchronize with other nodes, data would be lost; 2/
| reading from a node wouldn't necessarily ensure that you
| have the last version of the data, therefore the system
| is not read-after-write consistent. To fix this, we add a
| simple synchronization system based on read/write quorums
| to our CRDT system. More precisely, when updating a CRDT,
| we wait for the value to be known to at least two of the
| three responsible nodes before returning OK, which allows
| us to tolerate one node failure while always ensuring
| durability of stored data. Further, when performing a
| read, we ask at least two of the three nodes for their
| current state of the CRDT: this ensures that at least one
| of the two will know about the last version that was
| written (the intersection of the quorums being
| non-empty), making the system read-after-write
| consistent. These are basically the same principles
| applied in CRDT databases such as Riak.
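| A minimal sketch of this quorum scheme (Python,
| simplified; real Garage merges full CRDT state rather
| than a bare (timestamp, value) pair):

```python
class Node:
    """One of the three replicas responsible for a key."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def put(self, key, version):
        # keep the newest version: a trivial last-writer-wins merge
        if key not in self.store or version > self.store[key]:
            self.store[key] = version
        return "ok"

    def get(self, key):
        return self.store.get(key)

def write(nodes, key, ts, value, quorum=2):
    # send to all replicas, succeed once a quorum has acked
    acks = sum(n.put(key, (ts, value)) == "ok" for n in nodes)
    return acks >= quorum

def read(two_nodes, key):
    # any two of the three replicas intersect the write quorum,
    # so the newest acknowledged version is among the answers
    answers = [n.get(key) for n in two_nodes]
    answers = [a for a in answers if a is not None]
    return max(answers)[1] if answers else None
```

| Even if a write only reached two of the three replicas,
| every possible pair of replicas contains at least one
| up-to-date copy, which is where read-after-write
| consistency comes from.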
| eternalban wrote:
| I'll also admit to having difficulty understanding how
| all this is distinct from non-CRDT replication
| mechanisms. Great mission and work by the Deuxfleurs team
| btw. Bonne chance!
|
| Edit: Blogs on Riak's use of CRDTs:
|
| http://christophermeiklejohn.com/erlang/lasp/2019/03/08/m
| ono...
|
| https://web.archive.org/web/20170801114916/http://docs.ba
| sho...
|
| https://naveennegi.medium.com/rendezvous-with-riak-crdts-
| par...)
| lxpz wrote:
| Thanks!
|
| I'll give a bit more details about CRDTs, then.
|
| The thing is, if you don't have CRDTs, the only way to
| replicate things over nodes such that they end up in
| consistent states is to have a way of ordering operations
| so that all nodes apply them in the same order, which is
| costly.
|
| Let me give you a small example. Suppose we have a very
| simple key-value storage, and that two clients are
| writing different values at the same time on the same
| key. The first one will invoke write(k, v1), and the
| second one will invoke write(k, v2), where v1 and v2 are
| different values.
|
| If all nodes receive the two write operations but don't
| have a way to know which one came first, some will
| receive v1 before v2 and end up with v2 as the last
| written value, while other nodes will receive v2 before
| v1, meaning they will keep v1 as the definitive value.
| The system is now in an inconsistent state.
|
| There are several ways to avoid this.
|
| The first one is Raft consensus: all write operations
| will go through a specific node of the system, the
| leader, which is responsible for putting them in order
| and informing everyone of the order it selected for the
| operations. This adds a cost of talking to the leader at
| each operation, as well as a cost of simply selecting
| which node is the leader node.
|
| CRDT are another way to ensure that we have a consistent
| result after applying the two writes, not by having a
| leader that puts everything in a certain order, but by
| embedding certain metadata with the write operation
| itself, which is enough to disambiguate between the two
| writes in a consistent fashion.
|
| In our example, now, the node that does write(k, v1)
| will, for instance, generate a timestamp ts1
| corresponding to the (approximate) time at which v1 was
| written, and it will also generate a UUID id1. Similarly,
| the node that does write(k, v2) will generate ts2 and
| id2. Now when they send their new values to other nodes
| in the network, they will send along their values of ts1,
| id1, ts2 and id2. Nodes now know enough to always
| deterministically select either v1 or v2, consistently
| everywhere: if ts1 > ts2, or ts1 = ts2 and id1 > id2,
| then v1 is selected, otherwise v2 is selected (we suppose
| that id1 = id2 has a negligible probability of
| happening). In terms of message round trips between
| nodes, we see that with this new method, nodes simply
| communicate once with all other nodes in the network,
| which is much faster than having to pass through a
| leader.
|
| Here the example uses timestamps as a way to disambiguate
| between v1 and v2, but CRDTs are more general and other
| ways of handling concurrent updates can be devised. The
| core property is that concurrent operations are combined
| a-posteriori using a deterministic rule once they reach
| the storage nodes, and not a-priori by putting them in a
| certain order.
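| The (timestamp, UUID) rule above fits in a few lines of
| Python (a sketch of a last-writer-wins register, not
| Garage's actual implementation):

```python
import time
import uuid

def tagged_write(value):
    # each writer tags its value locally: no coordination needed
    return (time.time(), uuid.uuid4().hex, value)

def merge(a, b):
    # deterministic: compare (ts, id) lexicographically, so every
    # node that sees both versions picks the same winner
    return max(a, b)
```

| merge is commutative, associative and idempotent, which
| is exactly what makes the replicas converge regardless of
| the order in which versions are delivered.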
| eternalban wrote:
| Got it, thanks. This reminded me of e-Paxos (e for
| Egalitarian) and a bit of digging there are now 2
| additional contenders (Atlas and Gryff, both 2020) in
| this space per below:
|
| http://rystsov.info/2020/06/27/pacified-consensus.html
| lxpz wrote:
| Very interesting, thanks for sharing.
| dboreham wrote:
| > I'll also admit to having difficulty understanding how
| is all this distinct from non-CRDT replication
| mechanisms.
|
| This is because "CRDT" is not about a new or different
| approach to replication, although for some reason this
| has become a widely held perception. CRDT is about a new
| approach to the _analysis_ of replication mechanisms,
| using order theory. If you read the original CRDT
| paper(s) you'll find old-school mechanisms like Lamport
| Clocks.
|
| So when someone says "we're using a CRDT" this can be
| translated as: "we're using an eventually consistent
| replication mechanism proven to converge using techniques
| from the CRDT paper".
| atombender wrote:
| Sounds interesting. How do you handle the case where
| you're unable to send the update to other nodes?
|
| So an update goes to node A, but not to B and C.
| Meanwhile, the connection to the client may be disrupted,
| so the client doesn't know the fate of the update. If
| you're unlucky here, a subsequent read will ask B and C
| for data, but the newest data is actually on A. Right?
|
| I assume there's some kind of async replication between
| the nodes to ensure that B and C _eventually_ catch up,
| but you do have an inconsistency there.
|
| You also say there is no async replication, but surely
| there must be some, since by definition there is a
| quorum, and updates aren't hitting all of the nodes.
|
| I understand that CRDTs make it easier to order updates,
| which solves part of consistent replication, but you
| still need a consistent _view_ of the data, which is
| something Paxos, Raft, etc. solve, but CRDTs separated
| across multiple nodes don't automatically give you that,
| unless I am missing something. You need more than one
| node in order to figure out what the newest version is,
| assuming the client needs perfect consistency.
| lxpz wrote:
| True; we don't solve this. If there is a fault and data
| is stored only on node A and not nodes B and C, the view
| might be inconsistent until the next repair procedure is
| run (which happens on each read operation if an
| inconsistency is detected, and also regularly on the
| whole dataset using an anti-entropy procedure). However
| if that happens, the client that sends an update will not
| have received an OK response, so it will know that its
| update is in an indeterminate state. The only guarantee
| that Garage gives you is that if you have an OK, the
| update will be visible in all subsequent reads (read-
| after-write consistency).
| atombender wrote:
| Thanks. That seems risky if one's client is stateless,
| but I understand it's a compromise.
| RobLach wrote:
| I've been researching a solution for this problem for some time
| and you all have landed on exactly how I thought this should be
| architected. Excellent work and I foresee much of the future
| infrastructure of the internet headed in this direction.
| mro_name wrote:
| I love that attitude towards self-hosting and it not being the
| sole hobby from that day on. Casual self-hosting if you will.
|
| House-keeping UX is key for self-hosting by laypersons.
| depingus wrote:
| Hi this is a very cool project, and I love that it's geared
| towards self-hosting. I'm about inherit some hardware so I'm
| actually looking at clustered network storage options right now.
|
| In my setup, I don't care about redundancy. I much prefer to
| maximize storage capacity. Can Garage be configured without
| replication?
|
| If not, maybe someone else can point me in the right direction.
| Something like a glusterfs distributed volume. At first glance,
| SeaweedFS has a "no replication" option, but their docs make it
| seem like they're geared towards handling billions of tiny files.
| I have about 24TB of 1Gb to 20GB files.
| lxpz wrote:
| You probably want at least _some_ redundancy, otherwise
| you're at risk of losing data as soon as a single one of
| your hard drives fails. However if you're interested in
| maximizing
| capacity, you probably need erasure coding, which Garage does
| not provide. Gluster, Ceph and Minio might all be good
| candidates for your use case. Or if you have the possibility of
| putting all your drives in a single box, just do that and make
| a ZFS pool.
| depingus wrote:
| > You probably want at least some redundancy otherwise you're
| at risk of losing data as soon as a single one of your hard
| drive fails.
|
| If by "fails" you mean the network connection drops out. Then
| yes, that would be a huge problem. I was hoping some project
| had a built-in solution to this. Currently, I'm using
| MergerFS to effectively create 1 disk out of 3 external USB
| drives and it handles accidental drive disconnects with no
| problems (I can't gush enough over how great mergerfs is).
|
| But, if by "fails" you mean actual hardware failure. Then, I
| don't really care. I keep 1 to 1 backups. A few days of
| downtime to restore the data isn't a big deal; this is just
| my home network.
|
| > Or if you have the possibility of putting all your drives
| in a single box...
|
| Unfortunately, I've maxed out the drive bays on my TS140
| server. Buying new, larger drives to replace existing drives
| seems wasteful. Also, I've just been gifted another TS140,
| which is a good platform to start building another file
| server.
|
| You've given me something to think about, thanks. I
| appreciate you taking the time to respond!
| Rygu wrote:
| Github repo (mirror): https://github.com/deuxfleurs-org/garage
| detritus wrote:
| Y'all might want to get to the point a bit quicker in your pitch.
| A lot of fluff to wade through before the lede.
|
| "Garage is a distributed storage solution, that automatically
| replicates your data on several servers. Garage takes into
| account the geographical location of servers, and ensures that
| copies of your data are located at different locations when
| possible for maximal redundancy, a unique feature in the
| landscape of distributed storage systems.
|
| Garage implements the Amazon S3 protocol, a de-facto standard
| that makes it compatible with a large variety of existing
| software. For instance it can be used as a storage back-end for
| many self-hosted web applications such as NextCloud, Matrix,
| Mastodon, Peertube, and many others, replacing the local file
| system of a server by a distributed storage layer. Garage can
| also be used to synchronize your files or store your backups with
| utilities such as Rclone or Restic. Last but not least, Garage
| can be used to host static websites, such as the one you are
| currently reading, which is served directly by the Garage cluster
| we host at Deuxfleurs."
| lxpz wrote:
| Thanks for the feedback! We will definitely be adjusting our
| pitch in further iterations.
| jfindley wrote:
| I'm not quite sure who the article is actually written for.
| If it's written to generate interest from investors, or for
| some other such semi/non technical audience then please
| ignore this (although the fact that I can't tell who it's
| written for is in itself a warning signal).
|
| If it's written to attract potential users, however, then
| it's woefully light on useful content. I don't care in the
| slightest about your views on tech monopolies - why on earth
| waste such a huge amount of space talking about this? What I
| care about are three things: 1) what workloads does it do
| well at, and which ones does it do poorly at (if you don't
| tell me about the latter I won't believe you - no storage
| system is great at everything). 2) What
| consistency/availability model does it use, and 3) how you've
| tested it.
|
| As written, it's full of fluff and vague handwavy promises
| (we have PhDs!) - the only technical information in the
| entire post is that it's written in rust. For users of your
| application, what programming language it's written in is
| about the least interesting thing possible to say about it.
| Even going through your docs I can't find a single proper
| discussion of your consistency and availability model, which
| is a huge red flag to me.
| adrn10 wrote:
| This article broadly explains the motivation and vision
| behind our project. There are many object stores, why build
| a new one? To push self-hosting, in the face of
| democratic/political threats we feel are daunting.
|
| Sorry it didn't float your boat. You're just not the
| article's target audience I guess! (And, to answer your
| question: the target audience is mildly-knowledgeable but
| politically concerned citizens.)
|
| I agree this is maybe not the type of post you'd expect on
| HN. But, we don't want to _compete_ with other object
| stores. We want to expand the application domain of
| distributed data stores.
|
| > I can't find a single proper discussion of your
| consistency and availability model
|
| LGTM, I would be dubious too. This will come in time,
| please be patient (the project is less than 2 yo!). You
| know Cypherunks' saying? "We write code." Well, for now,
| we've been focusing on the code, and we want to continue
| doing so until 1.0. Some by-standers will do the paperware
| for sure, but it's not our utmost priority.
| jfindley wrote:
| Thanks for the response. It's great that you're trying to
| build something that will allow folks to do self-hosting
| well, I'm fully on board with this aim. My point about
| the political bits though, is that even for folks that
| agree with you, pushing the political bits too much can
| signal that you're not confident enough in the technical
| merits of what you're doing for them to stand on their
| own.
|
| Re: the paperwork vs code thing - While a full white
| paper on your theoretical foundations would be nice, IMO
| the table stakes for getting people to take your product
| seriously is a sentence or two on the model you're using.
| For example, adding something like: "We aim to prioritise
| consistency and partition tolerance over availability,
| and use the raft algorithm for leader elections. Our
| particular focus is making sure the storage system is
| really good at handling X and Y." or whatever would be a
| huge improvement, until you can get around to writing up
| the details.
|
| The problem with "we write code" is that your users are
| unlikely to trust you to be writing the correct code,
| when there's so little evidence that you've thought
| deeply about how to solve the hard problems inherent in
| distributed storage.
| adrn10 wrote:
| It was actually deliberate to present our project through
| the political lens, today.
|
| You will find a more streamlined presentation of Garage
| on our landing page at https://garagehq.deuxfleurs.fr/
|
| Technical details are also documented (although it is
| insufficient and needs an overhaul):
| https://garagehq.deuxfleurs.fr/documentation/design/
|
| Thank you for your comments!
| StillBored wrote:
| For the average user it shouldn't matter what its written
| in, but lately I've returned to the theory that language
| choice for a project signals a certain amount of technical
| sophistication. Particularly, for things like this, which
| should be considered systems programming. The overhead of a
| slow JIT, GC, threading model, IO model, whatever basically
| means the project is doomed to subpar perf forever and the
| cost of developing it has been passed on to the users,
| who will be forced to buy 10x the hardware to compensate.
| JKCalhoun wrote:
| I'm the layest of laymen, is Garage "a distributed Dropbox"?
| adrn10 wrote:
| It's the layer below: a sound distributed storage backend.
| You can use it to distribute your Nextcloud data, store
| your backups, host websites... You can even mount it in
| your OS' file manager and use it as an external drive!
| erwincoumans wrote:
| Who owns the servers?
| superboum wrote:
| You own the servers. This is a tool to build your own
| object-storage cluster. For example, you can get 3 old
| desktop PCs, install Linux on them, download and launch
| Garage on them, configure your 3 instances in a single
| cluster, then send data to this cluster. Your data will be
| spread and duplicated on the 3 machines. If one machine
| fails or is offline, you can still access and write data to
| the cluster.
| [deleted]
| erwincoumans wrote:
| Then, how does Garage achieve 'different geographical
| locations'? I only have my house to put my server(s).
| That's one of the main reasons I'm using cloud storage.
| Or is the point that I can arrange those servers abroad
| myself, independent of the software solution (S3 etc)?
| lxpz wrote:
| Garage is designed for self-hosting by collectives of
| system administrators, what we could call "inter-
| hosting". Basically you ask your friends to put a server
| box at their home and achieve redundancy that way.
| rglullis wrote:
| Is the data encrypted at rest, or should I encrypt the
| data myself?
| superboum wrote:
| The content is currently stored in plaintext on the disk
| by Garage, so you have to encrypt the data yourself. For
| example, you can configure your server to encrypt at rest
| the partition that contains your `data_dir` and your
| `meta_dir`, or build/use applications that support
| client-side encryption such as rclone with its crypt
| module[0] or Nextcloud with its end-to-end encryption
| module[1].
|
| [0]: https://rclone.org/crypt/
|
| [1]: https://nextcloud.com/endtoend/
| erwincoumans wrote:
| Great, thanks. It would be great to add this in the main
| explanation (how to get access to servers).
___________________________________________________________________
(page generated 2022-02-08 23:01 UTC)