[HN Gopher] Garage, our self-hosted distributed object storage s...
       ___________________________________________________________________
        
       Garage, our self-hosted distributed object storage solution
        
       Author : lxpz
       Score  : 343 points
       Date   : 2022-02-08 10:20 UTC (12 hours ago)
        
 (HTM) web link (garagehq.deuxfleurs.fr)
 (TXT) w3m dump (garagehq.deuxfleurs.fr)
        
       | southerntofu wrote:
        | Hello, quick question: from what i understand, since it's based
        | on CRDTs, any server in the cluster may save something to the
        | storage, which then gets replicated elsewhere.
       | 
       | That's fine when there is absolute trust between the server
       | operators, but am i correct to assume it's not the same threat
       | model as encrypted/signed backups pushed to some friends lending
       | storage space for you (who could corrupt your backup but can't
       | corrupt your working data)?
       | 
       | If my understanding is correct, maybe your homepage should
       | outline the threat model more clearly (i.e. a single trusted
       | operator for the whole cluster) and point to other solutions like
       | TAHOE-LAFS for other use-cases.
       | 
       | Congratulations on doing selfhosting with friends, that's pretty
       | cool! Do you have any idea if some hosting cooperatives (CHATONS)
       | or ISPs (FFDN) have practical use-cases in mind? They already
       | have many physical locations and rather good bandwidth, but i
       | personally can't think of an interesting idea.
        
         | ClumsyPilot wrote:
          | I was thinking StorJ and Filecoin are the ones attempting to
          | provide storage with a 'trust no one' model
        
         | lxpz wrote:
         | You are absolutely correct concerning the trust model of
         | Garage: any administrator of one node is an administrator of
         | the entire cluster, and can manipulate data arbitrarily in the
         | system.
         | 
         | From an ideological perspective, we are strongly attached to
         | the building of tight-knit communities in which strong trust
         | bonds can emerge, as it gives us more meaning than living in an
         | individualized society where all exchanges between individuals
         | are mediated by a market, or worse, by blockchain technology.
          | This means that trusting several system administrators makes
          | sense to us.
         | 
         | Note that in the case of cooperatives such as CHATONS, most
         | users are non-technical and have to trust their sysadmin
         | anyways; here, they just have to trust several sysadmins
         | instead of just one. We know that several hosting cooperatives
         | of the CHATONS network are thinking like us and are interested
         | in setting up systems such as Garage that work under this
         | assumption.
         | 
          | At the same time, we also fully recognize the possibility of a
          | variety of attack scenarios against which we want to consider
          | practical defenses, such as the following:
         | 
          | 1/ An honest-but-curious system administrator or an intruder in
          | the network who wants to read users' private data, or
          | equivalently, a police raid where the servers are seized for
          | inspection by state services;
         | 
         | 2/ A malicious administrator or an intruder that wants to
         | manipulate the users by introducing fake data;
         | 
         | 3/ A malicious administrator or an intruder that simply wants
         | to wreak havoc by deleting everything.
         | 
         | Point 1 is the biggest risk in my eyes. Several solutions can
         | be built to add an encryption layer over S3 for different usage
          | scenarios. For instance, for storing personal files, Rclone can
         | be used to add simple file encryption to an S3 bucket and can
         | also be mounted directly via FUSE, allowing us to access Garage
         | as an end-to-end encrypted network drive. For backups, programs
         | like Restic and Borg allow us to upload our backups encrypted
         | into Garage.
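          | 
          | As an illustration, a minimal rclone.conf along these lines
          | could look like the sketch below. Everything in it is a
          | placeholder to adapt to your own deployment: the endpoint and
          | region depend on your Garage configuration, the keys are your
          | own bucket credentials, and the crypt password must be
          | generated with `rclone config` or `rclone obscure`:
          | 
          |     [garage]
          |     type = s3
          |     provider = Other
          |     endpoint = http://localhost:3900
          |     region = garage
          |     access_key_id = <your Garage key id>
          |     secret_access_key = <your Garage secret key>
          | 
          |     [garage-crypt]
          |     type = crypt
          |     remote = garage:my-bucket
          |     password = <password obscured by rclone>
          | 
          | The encrypted remote can then be mounted as a FUSE network
          | drive with something like `rclone mount garage-crypt: /mnt/drive`.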
         | 
         | Point 2 can be at least partially solved by adding signatures
         | and verification in the encryption layer which is handled on
         | the client (I don't know for sure if Rclone, Restic and Borg
         | are doing this).
         | 
         | Point 3 is harder and probably requires adaptation on the side
         | of Garage to be solved, for instance by adding a restriction on
         | which nodes are allowed to propagate updates in the network and
         | thus establishing a hierarchy between two categories of nodes:
         | those that implement the S3 gateway and thus have full power in
         | the network, and those that are only responsible for storing
         | data given by the gateway but cannot originate modifications
         | themselves (under the assumption that modifications must be
         | accompanied by a digital signature and that only gateway nodes
         | have the private keys to generate such signatures). Then we
         | could separate gateway nodes according to the different buckets
          | in which they can write, for better separation of concerns.
          | However, as long as we are trying to implement the S3 protocol,
          | I think we are stuck with an imperfect solution like this,
          | because S3 itself does not use public-key cryptography to
          | attest to the operations sent by users.
         | 
          | It is perfectly true that solutions that are explicitly
          | designed to handle these threats (such as TAHOE-LAFS) would
          | provide better guarantees in a system that is maybe more
          | consistent as a whole. However, we are also trying to juggle
          | these security constraints with deployment constraints such as
         | keeping compatibility with standard protocols such as S3 (and
         | soon also IMAP for mailbox storage), which restricts us in the
         | design choices we make.
         | 
         | Just to clarify, the fact that any node can tamper with any of
         | the data in the cluster is not strictly linked to the fact that
         | we use CRDTs internally. It is true that CRDTs make it more
         | difficult, but we do believe that there are solutions to this
         | and we intend to implement at least some of them in our next
         | project that involves mailbox storage.
        
           | ddrdrck_ wrote:
           | "or worse, by blockchain technology" -> I really would
           | appreciate if you could elaborate on this. Blockchain
           | technology has its pros and cons, depending on the underlying
            | consensus algorithm, but it could certainly be put to good
            | use in the context of a politically aware project promoting
            | decentralization and freedom?
        
             | lxpz wrote:
             | I'd much rather have a system that doesn't try to solve
             | human conflict by algorithms. Blockchains do this: you
             | write code, the code becomes the law, and then there is no
              | recourse when things go wrong. I deeply believe that we
              | should rather focus on developing our interpersonal
              | communication skills and building social structures that
              | empower us to do things together by trusting each other and
              | taking care together of the group. Algorithms are just
             | tools, and we don't want them interfering in our
             | relationships; in particular in the case of blockchains,
             | the promise of trustlessness is absolutely opposite to this
             | ideal.
        
       | prmoustache wrote:
        | I understand from the name of the association that the founders
        | are Terry Pratchett / Discworld fans.
        | 
        | For non-French speakers, Deuxfleurs is the French translation of
        | Twoflower, a character appearing in The Colour of Magic, The
        | Light Fantastic and Interesting Times from Terry Pratchett's
        | Discworld book series.
        
         | adrn10 wrote:
         | Busted! We might even have a Bagage[0] side-project...
         | 
         | [0] https://git.deuxfleurs.fr/Deuxfleurs/bagage
        
       | iam-TJ wrote:
        | Your git repository is not reachable over IPv6 despite
        | publishing a DNS AAAA record:
        | 
        |     $ curl -v https://git.deuxfleurs.fr/Deuxfleurs/garage/issues
        |     *   Trying 2001:41d0:8:ba0b::1:443...
        |     *   Trying 5.135.179.11:443...
        |     * Immediate connect fail for 5.135.179.11: Network is unreachable
        | 
        |     $ dig +noquestion +nocmd aaaa git.deuxfleurs.fr
        |     ;; Got answer:
        |     ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46366
        |     ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 2
        | 
        |     ;; OPT PSEUDOSECTION:
        |     ; EDNS: version: 0, flags:; udp: 65494
        |     ;; ANSWER SECTION:
        |     git.deuxfleurs.fr.      6822    IN  CNAME  hammerhead.machine.deuxfleurs.fr.
        |     hammerhead.machine.deuxfleurs.fr.  6822  IN  AAAA  2001:41d0:8:ba0b::1
       | 
        | This is from an IPv6-only network. Since you publish an IPv6
        | record, our DNS64/NAT64 gateway doesn't get involved, hence the
        | immediate IPv4 connect fail.
        
         | superboum wrote:
         | Thanks for reporting the problem.
         | 
          | I just deleted the AAAA entry for this machine. In the
          | meantime, if the result is cached for you, you can pass the
          | `-4` argument to force IPv4:
          | 
          |     git clone -4 git@git.deuxfleurs.fr:Deuxfleurs/garage.git
          | 
          | As a second step, we will work on a better/working IPv6
          | configuration for our Git repository and all of our services
          | (we use Nomad+Docker and did not find a way to expose IPv6 in
          | a satisfying way yet).
        
           | iam-TJ wrote:
           | Looks like the TTL on the records is still affecting local
           | resolution. I'm aware of the git protocol work-around to
           | force IPv4 but that doesn't help when using a regular web-
           | browser to visit the published URL for issues (first thing I
           | evaluate for a new-to-me project):
           | 
           | https://git.deuxfleurs.fr/Deuxfleurs/garage/issues
        
         | adrn10 wrote:
          | My bad T_T My colleagues keep picking on me for this. Now that
          | you, a complete stranger, see it too, I feel obliged to improve
          | the situation!
        
       | lxpz wrote:
       | For people interested in digging in, we have an interesting
        | benchmark here showing how it significantly outperforms MinIO in
       | the geo-distributed setting:
       | https://garagehq.deuxfleurs.fr/documentation/design/benchmar...
       | 
       | We also made a presentation at FOSDEM'22, the video should be out
       | soon.
        
       | superboum wrote:
        | I tried to answer a comment that was deleted, probably because
        | of its wording. Instead of deleting my answer, I want to share
        | the reworded criticisms and my answers.
        | 
        | > Why not use Riak and add an S3 API around it?
        | 
        | Riak was developed by a company named Basho that went bankrupt
        | some years ago; the software is not developed anymore. In fact,
        | we would not even need to add an S3 API around Riak KV: Basho
        | released "Riak Cloud Storage"[0], which does exactly this:
        | provide an S3 API on top of the Riak KV architecture. We plan to
        | release a comparison between Garage and Riak CS; Garage has some
        | interesting features that Riak CS does not have! In practice,
        | implementing an object store on top of a DynamoDB-like KV store
        | is not that straightforward. For example, Exoscale, a cloud
        | provider, went this way for the first implementation of their
        | object store, Pithos[1], but rewrote it later, as you need
        | special logic to handle your chunks (they did not publish Pithos
        | v2).
       | 
       | > Most apps don't have S3 support
       | 
       | We are maintaining in our documentation an "integration" section
       | listing all the compatible applications. Garage already works
       | with Matrix, Mastodon, Peertube, Nextcloud, Restic (an
       | alternative to Borg), Hugo and Publii (a static site generator
       | with a GUI). These applications are only a fraction of all
       | existing applications, but our software is targeted at its
       | users/hosters.
       | 
       | > A distributed system is not necessarily highly available
       | 
        | I will not fight over the wording: we come from an academic
        | background where the term "distributed computing" has a specific
        | meaning that may differ outside academia. In our field, we
        | define models where we study systems made of processes that can
        | crash. Depending on your algorithms and the properties you want,
        | you can prove that your system will keep working despite some
        | crashes. We want to build software on these academic
        | foundations. This is also the reason we put "Standing on the
        | shoulders of giants" on our front page and link to research
        | papers. To put it in a nutshell, one criticism we level at other
        | software is that it sometimes lacks theoretical/academic
        | foundations, which leads to unexpected failures and more work
        | for sysadmins. But on the theoretical side, Basho and Riak were
        | exemplary and a model for us!
       | 
        | [0]: https://docs.riak.com/riak/cs/2.1.1/index.html
        | [1]: https://github.com/exoscale/pithos
        
         | lxpz wrote:
         | I also had some things to answer to that comment which I'll put
         | back here for the record.
         | 
         | > I have been building distributed systems for 20 years, and
         | they are not more reliable. They are probabilistically more
         | likely to fail.
         | 
         | It depends on what kind of faults you want to protect against.
         | In our case, we are hosting servers at home, meaning that any
         | one of them could be disconnected at any time due to a power
         | outage, a fiber being cut off, or any number of reasons. We are
         | also running old hardware where individual machines are more
         | likely to fail. We also do not run clusters composed of very
         | large numbers of machines, meaning that the number of
         | simultaneous failures that can be expected actually remains
         | quite low. This means that the choices made by Garage's
         | architecture make sense for us.
         | 
         | But maybe your point was about the distinction between
         | distributed systems and high availability, in which case I
          | agree. Several of us have studied distributed systems in an
          | academic setting, and in our vocabulary, distributed systems
          | almost by definition include crash-tolerance and thus make
          | systems HA. I understand that in the engineering community the
          | vocabulary might be different, and thanks to your insight we
          | might orient our communication more towards presenting Garage
          | as HA, as it is one of our core, defining features.
         | 
         | > However, this isn't it. This is distributed S3 with CRDTs.
         | Still too application-specific, because every app that wants to
         | use it has to be integrated with S3. They could have just
         | downloaded something like Riak and added an S3 API around it.
         | 
         | Garage is almost that, except that we didn't download Riak but
         | made our own CRDT-based distributed storage system. It's
         | actually not the most complex part at all, and most of the
          | development time was spent on S3 compatibility. Rewriting the
         | storage layer means that we have better integration between
         | components, as everything is built in Rust and heavily depends
         | on the type system to ensure things work well together. In the
         | future, we plan to reuse the storage layer we built for Garage
         | for other projects, in particular to build an e-mail storage
         | server.
        
       | razzio wrote:
       | This is a fantastic project and I can't wait to try it out!
       | 
       | One question on the network requirements. The web page says for
       | networking: "200 ms or less, 50 Mbps or more".
       | 
       | How hard are these requirements? For folks like me that can't
       | afford a guaranteed 50Mbps internet connection, is this still
       | usable?
       | 
       | There are plenty of places in the world where 50Mbps internet
       | connectivity would be a dream. Even here in Canada there are
       | plenty of places with a max of 10Mbps. The African continent for
       | example will have many more.
        
         | adrn10 wrote:
          | This depends on your workload. It would work, but we primarily
          | target domestic fiber connections.
        
         | lxpz wrote:
          | It's not a hard requirement; you might just have a harder time
          | handling large files, as Garage nodes will in any case have to
          | transfer data internally in the cluster (remember that files
          | have to be sent to three nodes when they are stored in Garage,
          | meaning 3x more bandwidth is used). 10Mbps is already pretty
          | good if it is stable and your ping isn't off the charts, and
          | it might be totally workable depending on your use case.
        
       | latchkey wrote:
       | I see zero unit/integration tests in the source tree, not even a
       | `test` target in the Makefile. Terrifying.
        
         | lxpz wrote:
         | Then you didn't look close enough. Have a look at the
         | .drone.yml to see the entry points, we have both unit tests and
         | integration tests that run at each build.
        
       | lloydatkinson wrote:
       | > account the geographical location of servers, and ensures that
       | copies of your data are located at different locations when
       | possible for maximal redundancy, a unique feature in the
       | landscape of distributed storage systems.
       | 
       | Azure already does this, so claiming it's unique seems untrue.
        
         | lxpz wrote:
         | I mean, sure, but we're operating under the premise that we
         | don't want to give out our data to a cloud provider and instead
         | want to store data ourselves. Half of the post is dedicated to
         | explaining that, so pretending that you can't infer that from
         | context seems a bit unfair.
        
       | mediocregopher wrote:
       | I've been looking through the available options for this use-
       | case, with the same motivation given in this post, and was
       | surprised to see there wasn't something which could fit this
       | niche (until now!).
       | 
       | Really happy to see this being worked on by a dedicated team.
       | Combined with a tool like https://github.com/slackhq/nebula this
       | could help form the foundation of a fully autonomous, durable,
       | online community.
        
       | [deleted]
        
       | morelish wrote:
        | I think the idea is a good one. As S3 becomes a standard, it
        | would be nice if someone could rip Amazon's face off (like
        | Amazon has done to other open-source companies, e.g. Elastic).
        
         | social_quotient wrote:
         | Elastic was a bit with cause?
        
           | morelish wrote:
           | How do you mean?
        
         | malteg wrote:
         | what about https://min.io/
        
           | morelish wrote:
           | Cheers, I'll have a look at that.
        
       | belfalas wrote:
       | Looks very worth checking out! Will this be made available for
       | installation via Cloudron some time in the future?
        
       | merb wrote:
       | this might be a stupid comment.
       | 
        | but is this safe to use inside a commercial project as a backend
        | S3 solution? i.e. just using it instead of local storage and not
        | changing it?
        
         | superboum wrote:
          | Our software is published under the AGPLv3 license and comes
          | with no guarantee, like any other FOSS project (if you do not
          | pay for support). We consider our software to be of "public
          | beta" quality, so we think it works well, at least for us.
          | 
          | On the plus side, it survived the Hacker News hug of death.
          | Indeed, the website we linked is hosted on our own Garage
          | cluster, made of old Lenovo ThinkCentre M83 machines (with
          | Intel Pentium G3420 and 8GB of RAM), and the cluster seems
          | fine. We also host more than 100k objects in our Matrix (a
          | chat service) bucket.
          | 
          | On the minus side, this is the first time we have had so much
          | coverage, so our software has not yet been tested by thousands
          | of people. It is possible that in the near future, some edge
          | cases we never triggered will be reported. This is the reason
          | most people wait until an application reaches a certain level
          | of adoption before using it; in other words, they don't want
          | to pay the "early adopter cost".
         | 
         | In the end, it's up to you :-)
        
           | cvwright wrote:
           | Awesome. Are you using the matrix-media-repo to connect the
           | Matrix homeserver to Garage?
        
             | lxpz wrote:
             | No, we are simply using synapse-s3-storage-provider:
             | https://github.com/matrix-org/synapse-s3-storage-provider
        
         | [deleted]
        
         | lxpz wrote:
         | License-wise, Garage is AGPLv3 like many similar software
         | projects such as MinIO and MongoDB. We are not lawyers, but in
          | the spirit, this should not preclude you from using it in
          | commercial software as long as it is not modified. If you do
          | modify it, e.g. by adding a nice administration GUI, we kindly
          | ask that you publish your modifications as open source so that
          | we can incorporate them back into the main product if we want.
        
       | k8sToGo wrote:
       | Do you guys expose Prometheus metrics or how is monitoring
       | supposed to be done (e.g. on replication status)?
        
         | superboum wrote:
         | Currently, we have no elegant way to achieve what you want.
         | 
          | When failures occur, repair is done through workers that log
          | when they launch, when they repair chunks, and when they exit.
          | We also have `garage status` and `garage stats`. The first
          | command displays healthy and unhealthy nodes; the second
          | displays the queue lengths of our tables and chunks: if these
          | values are greater than zero, the cluster is being repaired.
          | We document failure recovery in our documentation:
          | https://garagehq.deuxfleurs.fr/documentation/cookbook/recove...
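          | 
          | To illustrate, the two commands mentioned above look like this
          | (output omitted, as its exact shape may change between
          | versions):
          | 
          |     garage status   # lists nodes and flags the unhealthy ones
          |     garage stats    # table/chunk queue lengths (>0 means a repair is running)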
         | 
          | For the near future, we plan to integrate OpenTelemetry. But we
         | are still discussing the design and information we want to
         | track and report. We are currently discussing these questions
         | in our issue tracker:
         | https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/111
         | https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/207
         | 
         | If you have some knowledge/experience on this subject, feel
         | free to share it in these issues.
        
         | lxpz wrote:
          | Exposing Prometheus metrics is ongoing work that we haven't
          | had much time to advance yet. For now, we can check replication
          | progress or overall cluster health by inspecting Garage's state
          | from the command line.
        
       | leathersoft wrote:
        | I swear to god, I thought this was a parody of yet another
        | product with a fancy name for just a physical garage
        
         | adrn10 wrote:
         | This is an actual object store - we just have a talent for
         | naming artifacts
        
       | bamazizi wrote:
        | GitHub mirror? It's extremely easy to forget about and stop
        | following up on a project that's not on GitHub!
        | 
        | You might have personal reasons for your current choice, but the
        | project may have a better shot at success if it's in a more
        | social, high-traffic ecosystem.
        
         | lxpz wrote:
          | It exists and is already linked in another comment:
          | https://github.com/deuxfleurs-org/garage
        
       | pnocera wrote:
       | I'd see a nice fit with Tailscale. Or how to create a secured
       | private network of redundant and HA storage...
        
         | adrn10 wrote:
         | We haven't looked too much into VPNs, although we do need some
         | plumbing to distribute our whole infrastructure at Deuxfleurs.
        
       | Loic wrote:
        | An interesting point to note is that they have some _external_
       | funding from the EU.
       | 
       | > The Deuxfleurs association has received a grant from NGI
       | POINTER[0], to fund 3 people working on Garage full-time for a
        | year: from October 2021 to September 2022.
       | 
       | [0]: https://pointer.ngi.eu/
        
         | sofixa wrote:
         | That's really great, i didn't know this programme existed.
        
           | adrn10 wrote:
           | (Deuxfleurs administrative council member here:) Starting
           | from a little association with 200EUR in a cash box last
           | year, we were indeed amazed to actually get this funding.
           | 
           | I strongly recommend NGI grants to any European citizen. Just
           | look at the variety of profiles that NGI POINTER (one of the
           | grant type) funded last year: https://pointer.ngi.eu/wp-
           | content/uploads/2021/10/NGI-POINTE...
           | 
           | They even finance individuals wanting to contribute to FOSS
           | up to 50kEUR for a year.
        
       | malteg wrote:
       | what is the benefit compared to https://min.io/
        
         | adrn10 wrote:
         | We made a comparison here:
         | https://garagehq.deuxfleurs.fr/documentation/design/benchmar...
         | 
         | Other comments from my colleagues also shed light on Garage's
         | specific features (do check the discussion on Raft).
         | 
         | In a nutshell, Garage is designed for higher inter-node
          | latency, thanks to fewer round-trips for reads & writes. Garage
         | does not intend to compete with MinIO, though - rather, to
         | expand the application domain of object stores.
        
         | superboum wrote:
          | Another benefit compared to MinIO is that we have "flexible
          | topologies".
          | 
          | Thanks to our design choices, you can add and remove nodes
          | without any constraint on the number of nodes or the size of
          | the storage. So you do not have to overprovision your cluster
          | as recommended by MinIO[0].
          | 
          | Additionally (we have planned a full blog post on this
          | subject), adding or removing a node in the cluster does not
          | lead to a full rebalance of the cluster. To understand why, I
          | must explain how it works traditionally and how we improved on
          | existing work.
          | 
          | When you initialize the cluster, we split the cluster into
          | partitions, then assign partitions to nodes (see Maglev[1]).
          | Later, data is stored in its corresponding partition based on
          | its hash. When a node is added or removed, traditional
          | approaches rerun the whole algorithm and come up with a
          | totally different partition assignment. Instead, we compute a
          | new partition distribution that minimizes changes to the
          | assignment, which in the end minimizes the number of
          | partitions moved.
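          | 
          | As a rough illustration of that idea, here is a much
          | simplified sketch (not Garage's actual algorithm: it assigns
          | one node per partition instead of three, and ignores zones):
          | 
          |     use std::collections::HashMap;
          | 
          |     // Reassign partitions after the node set changed, keeping as many
          |     // existing assignments as possible so that few partitions move.
          |     fn reassign(
          |         current: &HashMap<u16, String>, // partition id -> node id
          |         nodes: &[String],               // nodes now in the cluster
          |     ) -> HashMap<u16, String> {
          |         let target = current.len() / nodes.len() + 1; // rough per-node capacity
          |         let mut load: HashMap<&str, usize> =
          |             nodes.iter().map(|n| (n.as_str(), 0)).collect();
          |         let mut next = HashMap::new();
          |         let mut orphans = Vec::new();
          | 
          |         // Keep a partition where it is if its node is still present
          |         // and not overloaded.
          |         for (&p, node) in current {
          |             match load.get_mut(node.as_str()) {
          |                 Some(l) if *l < target => {
          |                     *l += 1;
          |                     next.insert(p, node.clone());
          |                 }
          |                 _ => orphans.push(p),
          |             }
          |         }
          |         // Only orphaned partitions move, each to the least-loaded node.
          |         for p in orphans {
          |             let n = nodes.iter().min_by_key(|n| load[n.as_str()]).unwrap().clone();
          |             *load.get_mut(n.as_str()).unwrap() += 1;
          |             next.insert(p, n);
          |         }
          |         next
          |     }
          | 
          |     fn main() {
          |         // Partitions 0..4 start out spread over nodes "a" and "b".
          |         let mut current = HashMap::new();
          |         for p in 0u16..4 {
          |             current.insert(p, if p % 2 == 0 { "a".to_string() } else { "b".to_string() });
          |         }
          |         // Node "b" leaves and node "c" joins: only b's partitions move.
          |         let nodes = vec!["a".to_string(), "c".to_string()];
          |         println!("{:?}", reassign(&current, &nodes));
          |     }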
         | 
          | On the drawback side, Garage does not implement erasure coding
          | (erasure coding is also the reason for many of MinIO's
          | limitations) and duplicates data 3 times, which is less
          | efficient. Garage also implements fewer S3 endpoints than
          | MinIO (for example, we do not support versioning); the full
          | list is available in our documentation[2].
         | 
         | [0]: https://docs.min.io/minio/baremetal/installation/deploy-
         | mini...
         | 
         | [1]: https://www.usenix.org/conference/nsdi16/technical-
         | sessions/...
         | 
         | [2]: https://garagehq.deuxfleurs.fr/documentation/reference-
         | manua...
        
         | rhim wrote:
         | This is written in Rust!
        
       | skratlo wrote:
       | What's with the obsession with S3 API? Why not FUSE and just use
       | any programming language without libraries to access the files?
       | (Obviously without some S3/FUSE layer, without HTTP, just direct
       | access to garage with the fastest protocol possible)
        
         | skybrian wrote:
         | S3 is simpler because you don't have to deal with hierarchical
         | directories.
        
       | Fiahil wrote:
       | NICE ! Very Nice ! Congrats on the project and the naming !
       | 
        | If I may add a few thoughts: you may be very well placed to be a
        | more suitable replacement for MinIO in data science / AI
        | projects. Why? Because any serious MLOps setup has 2 hard
        | requirements: an append-only storage and a way to stream large
        | content to a single place. The first one is very hard to get
        | right, the second is relatively easy.
       | 
       | Being CRDT-based it should be _very easy_ for you to provide an
       | append-only storage that can store partitioned, ordered logs of
       | immutable objects (think dataframes, and think kafka). Once you
       | have that, it's really easy to build the remaining missing pieces
       | (UI, API, ...) for creating a _much better_ (and distributed)
       | version of MLFlow.
       | 
       | Finally, S3 protocol is an "okay" version for file storage, but
       | as you are probably aware, it's clearly a huge limiter. So, trash
       | it. Provide a read-only S3 gateway for compatibility, but writes
       | should use a different API.
       | 
       | PS: Galette-Saucisse <3
        
         | adrn10 wrote:
         | Thanks for your enthusiastic feedback, fellow breton!
         | 
          | That's an interesting take on ML workloads. We haven't
          | investigated it yet (next in line is e-mail: as you can see, we
          | target ubiquitous, low-tech needs that are horrendous to
          | host). But we will definitely consider it for future work:
         | small breton food trucks do need their ML solutions for better
         | galettes.
        
           | Fiahil wrote:
           | If you ever need further insights into the ML/AI use case,
           | don't hesitate to ping me at {my-username}-at-gmail. I'll
           | forward it to my work address :)
        
       | adrn10 wrote:
       | So basically, we are a hosting association that wanted to put
       | their servers at home _and_ sleep at night. This demanded inter-
       | home redundancy of our data. But none of the existing solutions
       | (MinIO, Ceph...) are designed for high inter-node latency. Hence
       | Garage! Your cheap and proficient object store: designed to run
       | on a toaster, through carrier-pigeon-grade networks, while still
       | supporting a tons of workloads (static websites, backups,
       | Nextcloud... you name it)
        
         | ddorian43 wrote:
         | Did you try seaweedfs? Or contributing to any of them?
        
           | adrn10 wrote:
           | Basically, our association is trying to move away from file
           | systems, for scalability and availability purposes. There's a
           | similarity between RDBMS (say SQL) and filesystems, in the
            | sense that they distribute like crap. The main reason is that
            | they have overly strong consistency properties. That said,
            | one can use Garage as a file system, notably via WebDAV. It's
            | a beta feature called Bagage:
            | https://git.deuxfleurs.fr/Deuxfleurs/bagage I'll let my
            | knowledgeable fellows answer if you'd like to hear about the
            | differences between SeaweedFS and Garage in more detail :)
        
             | ddorian43 wrote:
              | SeaweedFS is an object store at its core. In my
              | (personal) case, I disqualified it because of the missing
              | erasure coding.
        
               | StillBored wrote:
                | Garage says it isn't a goal either.
               | 
               | https://garagehq.deuxfleurs.fr/documentation/design/goals
               | /
               | 
               | "Storage optimizations: erasure coding or any other
               | coding technique both increase the difficulty of placing
               | data and synchronizing; we limit ourselves to
               | duplication."
        
           | superboum wrote:
           | (Garage Contributor here) We reviewed many of the existing
           | solutions and none of them had the feature set we wanted.
            | Compared to SeaweedFS, the main difference we introduce with
            | Garage is that our nodes are not specialized, which leads to
            | the following benefits:
            | 
            | - Garage is easier to deploy and to operate: you don't have
            | to manage independent components like the filer, the volume
            | manager, the master, etc. It also seems that a bucket must be
            | pinned to a volume server on SeaweedFS. In Garage, all
            | buckets are spread over the whole cluster, so you do not have
            | to worry that your bucket fills one of your volume servers.
           | 
            | - Garage works better in the presence of crashes: I would be
            | very interested in a deep analysis of SeaweedFS's "automatic
            | master failover". They use Raft, I suppose either by running
            | a healthcheck every second, which leads to data loss on a
            | crash, or by sending a request for each transaction, which
            | creates a huge bottleneck in their design.
           | 
            | - Better scalability: because there is no special node, there
            | are no bottlenecks. I suppose that with SeaweedFS, all the
            | requests have to pass through the master. We do not have such
            | limitations.
           | 
            | In conclusion, we chose a radically different design with
            | Garage. We plan to do a more in-depth comparison in the
            | future, but even today, I can say that even though we
            | implement the same API, our radically different designs lead
            | to radically different properties and trade-offs.
        
             | [deleted]
        
             | faeyanpiraat wrote:
              | > In Garage, all buckets are spread over the whole cluster,
              | so you do not have to worry that your bucket fills one of
              | your volume servers.
             | 
             | Are you saying I do not have to worry about ONE volume
             | server getting full, but instead I can worry about ALL of
             | them getting full at the same time?
        
               | lxpz wrote:
               | Yes, in which case you just make your cluster bigger or
               | delete some data. Seems like a reasonable compromise to
               | me.
               | 
                | Garage has advanced functionality to control how much
               | data is stored on each node, and is also very flexible
               | with respect to adding or removing nodes, making all of
               | this very easy to do.
        
               | singron wrote:
               | Unless garage has a special mitigation against this,
               | usually performance gets much worse in large clusters as
               | the filesystem fills up. As files are added and deleted,
               | it struggles to keep nodes exactly balanced and a growing
               | percentage of nodes will be full and unavailable for new
               | writes.
               | 
               | So in a high throughput system, you may notice a soft
               | performance degradation before actually running out of
               | space.
               | 
               | If you aren't performance sensitive or don't have high
               | write throughput, you might not notice. This is
               | definitely something you should be forecasting and
               | alerting on so you can acquire and add capacity or delete
               | old data.
               | 
               | If you really don't like the idea of everything falling
               | over at the same time, you could use multiple clusters or
               | come up with a quota system (e.g. disable workload X if
               | it's using more than Y TiB).
        
             | ddorian43 wrote:
             | > independent components like the filer, the volume
             | manager, the master, etc.
             | 
             | You can run volume/master/filer in a single server (single
             | command).
             | 
             | > filer probably needs an external rdbms to handle the
             | metadata
             | 
             | This is true. You can use an external db. Or build/embed
             | some other db inside it (think a distributed kv in golang
             | that you embed inside to host the metadata).
             | 
             | > It also seems that a bucket must be pinned to a volume
             | server on SeaweedFS.
             | 
              | This is not true. A bucket will be using its own volumes,
             | but can be and is distributed on the whole cluster by
             | default.
             | 
              | > They use Raft, I suppose either by running a healthcheck
              | every second, which leads to data loss on a crash, or by
              | sending a request for each transaction, which creates a
              | huge bottleneck in their design.
              | 
              | Raft is for synchronized writes. It's slow in the case of a
              | single write being slow because you have to wait for an
              | "ok" from replicas, which is a good thing (compared to
              | async replication in, say, Cassandra/DynamoDB). Keep in
              | mind that S3 also moved to synced replication. This is
              | fixed by having more parallelism.
             | 
              | > Better scalability: because there is no special node,
              | there are no bottlenecks. I suppose that with SeaweedFS,
              | all the requests have to pass through the master. We do not
              | have such limitations.
             | 
             | Going to the master is only needed for writes, to get a
              | unique id. This can be easily fixed with a plugin to, say,
              | generate Twitter-snowflake IDs, which are very efficient.
             | For reads, you keep a cache in your client for the volume-
             | to-server mapping so you can do reads directly from the
             | server that has the data, or you can randomly query a
             | server and it will handle everything underneath.
             | 
              | I'm pretty sure SeaweedFS has very good fundamentals, based
              | on my research of all the other open-source distributed
              | object storage systems that exist.
        
               | lxpz wrote:
               | > Raft is for synchronized writes. It's slow in the case
               | of a single-write being slow because you have to wait for
               | an "ok" from replicas, which is a good thing (compared to
               | async-replication in, say, cassandra/dynamodb). Keep in
               | mind that s3 also moved to synced replication. This is
               | fixed by having more parallelism.
               | 
               | We have synchronous writes without Raft, meaning we are
               | both much faster and still strongly consistent (in the
               | sense of read-after-write consistency, not
               | linearizability). This is all thanks to CRDTs.
        
               | ddorian43 wrote:
               | > This is all thanks to CRDTs.
               | 
                | If you don't sync immediately, you may lose the node
                | before it has replicated and lose the data forever.
                | There's no fancy algorithm for when the machine gets
                | destroyed before it replicated the data. And you can't
                | write to 2 replicas simultaneously from the client,
                | like, say, when using a Cassandra smart driver, since S3
                | doesn't support that.
               | 
               | CRDTs are nice but not magic.
        
               | superboum wrote:
                | So let's take the example of a 9-node cluster with a
                | 100ms RTT over the network. In this specific (yet a
                | little bit artificial) situation, Garage particularly
                | shines compared to MinIO or SeaweedFS (or any Raft-based
                | object store) while providing the same consistency
                | properties.
               | 
                | For a Raft-based object store, your gateway will receive
                | the write request and forward it to the leader (+100ms,
                | 2 messages). Then, the leader will forward this write in
                | parallel to the 9 nodes of the cluster and wait for a
                | majority to answer (+100ms, 18 messages). Then the leader
                | will confirm the write to the whole cluster and wait for
                | a majority again (+100ms, 18 messages). Finally, it will
                | answer your gateway (already counted in the first step).
                | In the end, our write took 300ms and generated 38
                | messages over the cluster.
                | 
                | Another critical point with Raft is that your writes do
                | not scale: they all have to go through the leader. So
                | from the point of view of writes, it is not very
                | different from having a single server.
               | 
                | For a DynamoDB-like object store (Riak CS, Pithos,
                | OpenStack Swift, Garage), the gateway receives the
                | request and knows directly on which nodes it must store
                | the write. In Garage, we chose to store every write on 3
                | different nodes. So the gateway sends the write request
                | to the 3 nodes and waits for at least 2 nodes to confirm
                | the write (+100ms, 6 messages). In the end, our write
                | took 100ms, generated 6 messages over the cluster, and
                | the number of messages does not grow with the number of
                | nodes in the cluster.
               | 
                | With this model, we can still provide always up-to-date
                | values. When performing a read request, we also query the
                | 3 nodes that must contain the data and wait for 2 of
                | them. Because we have 3 nodes, wrote to at least 2 of
                | them, and read from 2 of them, we will necessarily get
                | the last value. This algorithm is discussed in Amazon's
                | Dynamo paper[0].
                | 
                | I reasoned in a model with no bandwidth limits, no CPU
                | limits, and no contention at all. In real systems, these
                | limits apply, and we think that's another argument in
                | favor of Garage :-)
               | 
               | [0]: https://dl.acm.org/doi/abs/10.1145/1323293.1294281
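                | 
                | A minimal sketch of that write path (my illustration,
                | not Garage's code): the value is sent to the 3
                | responsible nodes in parallel and the write is
                | acknowledged as soon as 2 of them answer, so one slow or
                | crashed node never blocks it:
                | 
                |     use std::sync::mpsc;
                |     use std::thread;
                |     use std::time::Duration;
                | 
                |     // Placeholder for a real RPC: we only simulate a 100ms round-trip.
                |     fn send_to_node(node: &str, value: &[u8]) -> bool {
                |         thread::sleep(Duration::from_millis(100));
                |         println!("stored {} bytes on {}", value.len(), node);
                |         true
                |     }
                | 
                |     // Quorum write: 3 replicas, success once 2 of them acknowledge.
                |     fn quorum_write(nodes: [&'static str; 3], value: Vec<u8>) -> bool {
                |         let (tx, rx) = mpsc::channel();
                |         for node in nodes {
                |             let tx = tx.clone();
                |             let value = value.clone();
                |             thread::spawn(move || {
                |                 let _ = tx.send(send_to_node(node, &value));
                |             });
                |         }
                |         drop(tx); // only the worker threads keep senders alive
                |         // One network round-trip of latency and 6 messages in total
                |         // (3 requests + 3 acks), independent of the cluster size.
                |         rx.iter().filter(|ok| *ok).take(2).count() == 2
                |     }
                | 
                |     fn main() {
                |         let ok = quorum_write(["node-a", "node-b", "node-c"], b"hello".to_vec());
                |         println!("write acknowledged: {}", ok);
                |     }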
        
               | ddorian43 wrote:
                | > For a Raft-based object store, your gateway will
                | receive the write request and forward it to the leader
                | (+100ms, 2 messages). Then, the leader will forward this
                | write in parallel to the 9 nodes of the cluster and wait
                | for a majority to answer (+100ms, 18 messages). Then the
                | leader will confirm the write to the whole cluster and
                | wait for a majority again (+100ms, 18 messages). Finally,
                | it will answer your gateway (already counted in the first
                | step). In the end, our write took 300ms and generated 38
                | messages over the cluster.
               | 
               | No. The "proxy" node, a random node that you connect to
               | will do:
               | 
               | 0. split the file into chunks of ~4MB (can be changed)
               | while streaming
               | 
               | for each chunk (you can write chunks in parallel):
               | 
               | 1. get id from master (can be fixed by generating an id
               | in the proxy node with some custom code, 0 messages with
               | custom plugin)
               | 
               | 2. write to 1 volume-server (which will write to another
               | node for majority) (2 messages)
               | 
               | 3. update metadata layer, to keep track of chunks so you
               | can resume/cancel/clean-failed uploads (metadata may be
               | another raft subsystem, think yugabytedb/cockroachdb, so
                | it needs to do its own 2 writes) (2 messages)
               | 
               | Mark as "ok" in metadata layer and return ok to client.
               | (2 messages)
               | 
                | The chunking is more complex and you have to track more
                | data, but in the end it is better. You spread a file over
                | multiple servers & disks. If a server fails with erasure
                | coding and you need to read a file, you won't have to
                | "erasure-decode" the whole file, since you'll only have
                | to do it for the missing chunks. If you have a hot file,
                | you can spread reads over many machines/disks. You can
                | upload very big files (terabytes), and you can "append"
                | to a file. You can have a smart client (or colocate a
                | proxy on your client server) for smart routing and such.
        
               | ricardobeat wrote:
               | If you're still talking about SeaweedFS, the answer seems
               | to be that it's _not_ a  "raft-based object store", hence
               | it's not as chatty as the parent comment described.
               | 
               | That proxy node is a volume server itself, and uses
               | simple replication to mirror its volume on another
               | server. Raft consensus is not used for the writes. Upon
               | replication failure, the data becomes read-only [1], thus
               | giving up partition tolerance. These are not really
               | comparable.
               | 
               | [1]
               | https://github.com/chrislusf/seaweedfs/wiki/Replication
        
               | [deleted]
        
               | vlovich123 wrote:
               | How does step 1 work? My understanding is that the ID
               | from the master tells you which volume server to write
               | to. If you're generating it randomly, then are you saying
               | you have queried the master server for the number of
               | volumes upfront & then just randomly distribute it that
               | way?
        
               | chrislusf wrote:
               | In snowflake id generation mode, the "which volume is
               | writable" information can be read from other follower
               | masters.
        
               | lxpz wrote:
               | We ensure the CRDT is synced with at least two nodes in
               | different geographical areas before returning an OK
               | status to a write operation. We are using CRDTs not so
               | much for their asynchronous replication properties (what
               | is usually touted as eventual consistency), but more as a
               | way to avoid conflicts between concurrent operations so
               | that we don't need a consensus algorithm like Raft. By
               | combining this with the quorum system (two writes out of
                | three need to be successful before returning ok), we
               | ensure durability of written data but without having to
               | pay the synchronization penalty of Raft.
        
               | Ne02ptzero wrote:
               | > We ensure the CRDT is synced with at least two nodes in
               | different geographical areas before returning an OK
               | status to a write operation [...] we ensure durability of
               | written data but without having to pay the
               | synchronization penalty of Raft.
               | 
                | This is, in essence, strongly-consistent replication, in
                | the sense that you wait for a majority of writes before
                | answering a request. So you're still paying the latency
                | cost of a round trip to at least one other node on each
                | write. How is this any better than a Raft cluster with
                | the same behavior? (N/2+1 write consistency)
        
               | lxpz wrote:
               | Raft consensus apparently needs more round-trips than
               | that (maybe two round-trips to another node per write?),
                | as evidenced by this benchmark we made against MinIO:
               | 
               | https://garagehq.deuxfleurs.fr/documentation/design/bench
               | mar...
               | 
               | Yes we do round-trips to other nodes, but we do much
               | fewer of them to ensure the same level of consistency.
               | 
                | This is to be expected from a distributed systems theory
                | perspective, as consensus (or total order) is a much
               | harder problem to solve than what we are doing.
               | 
                | We haven't (yet) gone into dissecting the Raft protocol
                | or MinIO's implementation to figure out why exactly it is
                | much slower, but the benchmark I linked above is already
                | strong enough evidence for us.
        
               | nh2 wrote:
               | I think it would be great if you could make a Github repo
               | that is just about summarising performance
                | characteristics and round-trip types of different storage
               | systems.
               | 
               | You would invite Minio/Ceph/SeaweedFS/etc. authors to
               | make pull requests in there to get their numbers right
               | and explanations added.
               | 
               | This way, you could learn a lot about how other systems
               | work, and users would have an easier time choosing the
               | right system for their problems.
               | 
               | Currently, one only gets detailed comparisons from HN
               | discussions, which arguably aren't a great place for
               | reference and easily get outdated.
        
               | withinboredom wrote:
                | Raft needs a lot more round trips than that: it needs to
               | send a message about the transaction, the nodes need to
               | confirm that they received it, then the leader needs to
               | send back that it was committed (no response required).
               | This is largely implementation specific (etcd does more
               | round trips than that IIRC), but that's the bare minimum.
        
           | lxpz wrote:
           | We don't currently have a comparison with SeaweedFS. For the
            | record, we have been developing Garage for almost two years
           | now, and we hadn't heard of SeaweedFS at the time we started.
           | 
           | To me, the two key differentiators of Garage over its
           | competitors are as follows:
           | 
           | - Garage contains an evolved metadata system that is based on
           | CRDTs and consistent hashing inspired by Dynamo, solidly
            | grounded in distributed systems theory. This allows us to be
            | very efficient as we don't use Raft or other consensus
            | algorithms between nodes, and we also do not rely on an
            | external service for metadata storage (Postgres, Cassandra,
            | whatever), meaning we don't pay an additional communication
            | penalty.
           | 
           | - Garage was designed from the start to be multi-datacenter
            | aware, again helped by insights from distributed systems
            | theory. In practice, we explicitly chose against implementing
           | erasure coding, instead we spread three full copies of data
           | over different zones so that overall availability is
           | maintained with no degradation in performance when one full
           | zone goes down, and data locality is preserved at all
           | locations for faster access (in the case of a system with
           | three zones, our ideal deployment scenario).
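            | 
            | A hypothetical, much simplified sketch of that zone-aware
            | placement (not Garage's placement code):
            | 
            |     use std::collections::HashSet;
            | 
            |     #[derive(Debug)]
            |     struct Node { id: &'static str, zone: &'static str }
            | 
            |     // Pick `count` replica nodes, preferring distinct zones when possible.
            |     fn pick_replicas<'a>(nodes: &'a [Node], count: usize) -> Vec<&'a Node> {
            |         let mut chosen: Vec<&Node> = Vec::new();
            |         let mut zones: HashSet<&str> = HashSet::new();
            | 
            |         // First pass: at most one node per zone.
            |         for n in nodes {
            |             if chosen.len() == count { break; }
            |             if zones.insert(n.zone) { chosen.push(n); }
            |         }
            |         // Second pass: if there are fewer zones than copies, fill up anyway.
            |         for n in nodes {
            |             if chosen.len() == count { break; }
            |             if !chosen.iter().any(|c| c.id == n.id) { chosen.push(n); }
            |         }
            |         chosen
            |     }
            | 
            |     fn main() {
            |         let nodes = [
            |             Node { id: "n1", zone: "paris" },
            |             Node { id: "n2", zone: "paris" },
            |             Node { id: "n3", zone: "lyon" },
            |             Node { id: "n4", zone: "brussels" },
            |         ];
            |         // The 3 copies land in paris, lyon and brussels: one whole
            |         // zone can go down without making the data unavailable.
            |         println!("{:?}", pick_replicas(&nodes, 3));
            |     }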
        
             | atombender wrote:
             | Can you go into a bit more detail about how using CRDTs
             | avoids the need for consensus-based replication?
        
               | lxpz wrote:
               | Sure!
               | 
               | CRDTs are often presented as a way of building
                | asynchronous replication systems with eventual
               | consistency. This means that when modifying a CRDT
               | object, you just update your local copy, and then
               | synchronize in background with other nodes.
               | 
                | However, this is not the only way of using CRDTs. At
                | their core, CRDTs are a way to resolve conflicts between
                | different versions of an object without coordination:
                | when all of the different states of the object that exist
                | in the network become known to a node, it applies a local
                | procedure known as a merge that produces a deterministic
                | outcome. All nodes do this, and once they have all done
                | it, they are all in the same state. In that way, nodes do
                | not need to coordinate beforehand when making
                | modifications, in the sense that they do not need to
                | run a two-phase commit protocol that ensures that
                | operations are applied one after the other in a specific
                | order that is replicated identically at all nodes. (This
                | is the problem of consensus, which is theoretically much
                | harder to solve from a distributed systems theory
                | perspective, as well as from an implementation
                | perspective.)
               | 
               | In Garage, we have a bit of a special way of using CRDTs.
               | To simplify a bit, each file stored in Garage is a CRDT
               | that is replicated on three known nodes
               | (deterministically decided). While these three nodes
               | could synchronize in the background when an update is
               | made, this would mean two things that we don't want: 1/
               | when a write is made, it would be written only on one
               | node, so if that node crashes before it had a chance to
               | synchronize with other nodes, data would be lost; 2/
               | reading from a node wouldn't necessarily ensure that you
               | have the last version of the data, therefore the system
               | is not read-after-write consistent. To fix this, we add a
               | simple synchronization system based on read/write quorums
               | to our CRDT system. More precisely, when updating a CRDT,
               | we wait for the value to be known to at least two of the
               | three responsible nodes before returning OK, which allows
               | us to tolerate one node failure while always ensuring
               | durability of stored data. Further, when performing a
               | read, we ask at least two of the three nodes for their
               | current state of the CRDT: this ensures that at least
               | one of the two will know about the last version that
               | was written (because the two quorums always intersect),
               | making the system read-after-write consistent. These
               | are basically the same principles that are applied in
               | CRDT databases such as Riak.
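               | 
               | To make the quorum logic concrete, here is a rough Rust
               | sketch (my simplification, not Garage's actual code) of
               | the write and read paths over three in-memory replicas:
               | a write returns OK only once two replicas have stored
               | the new version, and a read merges the answers of two
               | replicas.
               | 
               |   use std::collections::HashMap;
               | 
               |   type Replica = HashMap<String, Lww>;
               | 
               |   // A version: the value plus the metadata used to
               |   // merge concurrent versions (last-writer-wins here).
               |   #[derive(Clone, Debug)]
               |   struct Lww { ts: u64, id: u64, value: String }
               | 
               |   // Highest (timestamp, id) pair wins, on every node.
               |   fn merge(a: Lww, b: Lww) -> Lww {
               |       if (b.ts, b.id) > (a.ts, a.id) { b } else { a }
               |   }
               | 
               |   // Write quorum: OK only once 2 of the 3 replicas
               |   // have stored (and merged) the new version.
               |   fn put(reps: &mut [Replica; 3], key: &str, v: Lww,
               |          up: [bool; 3]) -> Result<(), ()> {
               |       let mut acks = 0;
               |       for (i, r) in reps.iter_mut().enumerate() {
               |           if !up[i] { continue; }
               |           let new = match r.remove(key) {
               |               Some(cur) => merge(cur, v.clone()),
               |               None => v.clone(),
               |           };
               |           r.insert(key.to_string(), new);
               |           acks += 1;
               |       }
               |       if acks >= 2 { Ok(()) } else { Err(()) }
               |   }
               | 
               |   // Read quorum: ask 2 of the 3 replicas and merge
               |   // their answers; the quorums overlap, so at least
               |   // one of them has seen the last acknowledged write.
               |   fn get(reps: &[Replica; 3], key: &str, ask: [usize; 2])
               |       -> Option<Lww> {
               |       let mut out: Option<Lww> = None;
               |       for &i in &ask {
               |           if let Some(v) = reps[i].get(key) {
               |               out = Some(match out {
               |                   Some(cur) => merge(cur, v.clone()),
               |                   None => v.clone(),
               |               });
               |           }
               |       }
               |       out
               |   }
               | 
               |   fn main() {
               |       let mut reps: [Replica; 3] = Default::default();
               |       let v = Lww { ts: 1, id: 7, value: "hello".into() };
               |       // Replica 2 is down, but 2 acks out of 3 are enough.
               |       put(&mut reps, "k", v, [true, true, false]).unwrap();
               |       // Reading replicas 1 and 2 still sees the write,
               |       // because replica 1 is in both quorums.
               |       println!("{:?}", get(&reps, "k", [1, 2]));
               |   }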
        
               | eternalban wrote:
               | I'll also admit to having difficulty understanding how
               | all this is distinct from non-CRDT replication
               | mechanisms. Great mission and work by the Deuxfleurs
               | team btw. Bonne chance!
               | 
               | Edit: Blogs on Riak's use of CRDTs:
               | 
               | http://christophermeiklejohn.com/erlang/lasp/2019/03/08/m
               | ono...
               | 
               | https://web.archive.org/web/20170801114916/http://docs.ba
               | sho...
               | 
               | https://naveennegi.medium.com/rendezvous-with-riak-crdts-
               | par...)
        
               | lxpz wrote:
               | Thanks!
               | 
               | I'll give a bit more details about CRDTs, then.
               | 
               | The thing is, if you don't have CRDTs, the only way to
               | replicate things over nodes in such a way that they end
               | up in consistent states is to have a way of ordering
               | operations so that all nodes apply them in the same
               | order, which is costly.
               | 
               | Let me give you a small example. Suppose we have a very
               | simple key-value storage, and that two clients are
               | writing different values at the same time on the same
               | key. The first one will invoke write(k, v1), and the
               | second one will invoke write(k, v2), where v1 and v2 are
               | different values.
               | 
               | If all nodes receive the two write operations but don't
               | have a way to know which one came first, some will
               | receive v1 before v2 and end up with v2 as the last
               | written value, while other nodes will receive v2 before
               | v1, meaning they will keep v1 as the definitive value.
               | The system is now in an inconsistent state.
               | 
               | There are several ways to avoid this.
               | 
               | The first one is Raft consensus: all write operations
               | will go through a specific node of the system, the
               | leader, which is responsible for putting them in order
               | and informing everyone of the order it selected for the
               | operations. This adds a cost of talking to the leader at
               | each operation, as well as a cost of simply selecting
               | which node is the leader node.
               | 
               | CRDTs are another way to ensure that we have a consistent
               | result after applying the two writes, not by having a
               | leader that puts everything in a certain order, but by
               | embedding certain metadata with the write operation
               | itself, which is enough to disambiguate between the two
               | writes in a consistent fashion.
               | 
               | In our example, now, the node that does write(k, v1)
               | will for instance generate a timestamp ts1 that
               | corresponds to the (approximate) time at which v1 was
               | written, and it will also generate a UUID id1.
               | Similarly, the node that does write(k, v2) will
               | generate ts2 and id2. Now, when they send their new
               | values to other nodes in the network, they will send
               | along their values of ts1, id1, ts2 and id2. Nodes now
               | know enough to always deterministically select either
               | v1 or v2, consistently everywhere: if ts1 > ts2, or if
               | ts1 = ts2 and id1 > id2, then v1 is selected, otherwise
               | v2 is selected (we suppose that id1 = id2 has a
               | negligible probability of happening). In terms of
               | message round-trips between nodes, we see that with
               | this new method, nodes simply communicate once with all
               | other nodes in the network, which is much faster than
               | having to pass through a leader.
               | 
               | Here the example uses timestamps as a way to disambiguate
               | between v1 and v2, but CRDTs are more general and other
               | ways of handling concurrent updates can be devised. The
               | core property is that concurrent operations are combined
               | a-posteriori using a deterministic rule once they reach
               | the storage nodes, and not a-priori by putting them in a
               | certain order.
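               | 
               | Here is that timestamp-plus-UUID rule as a tiny Rust
               | sketch (an illustration only; field names are mine,
               | not Garage's). The point is that the merged result
               | does not depend on the order in which the two writes
               | arrive:
               | 
               |   #[derive(Clone, Debug, PartialEq)]
               |   struct Version { ts: u64, id: u128, value: &'static str }
               | 
               |   // The deterministic rule described above: keep the
               |   // version with the larger (timestamp, uuid) pair.
               |   fn merge(a: Version, b: Version) -> Version {
               |       if (b.ts, b.id) > (a.ts, a.id) { b } else { a }
               |   }
               | 
               |   fn main() {
               |       let v1 = Version { ts: 17, id: 0xaaaa, value: "v1" };
               |       let v2 = Version { ts: 17, id: 0xbbbb, value: "v2" };
               |       // A node that saw v1 first and a node that saw v2
               |       // first still converge to the same value:
               |       assert_eq!(merge(v1.clone(), v2.clone()),
               |                  merge(v2.clone(), v1.clone()));
               |       // Receiving the same version twice changes nothing:
               |       assert_eq!(merge(v1.clone(), v1.clone()), v1);
               |       println!("both nodes keep {}", merge(v1, v2).value);
               |   }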
        
               | eternalban wrote:
               | Got it, thanks. This reminded me of e-Paxos (e for
               | Egalitarian), and a bit of digging shows there are now
               | 2 additional contenders (Atlas and Gryff, both 2020) in
               | this space, per the link below:
               | 
               | http://rystsov.info/2020/06/27/pacified-consensus.html
        
               | lxpz wrote:
               | Very interesting, thanks for sharing.
        
               | dboreham wrote:
               | > I'll also admit to having difficulty understanding
               | how all this is distinct from non-CRDT replication
               | mechanisms.
               | 
               | This is because "CRDT" is not about a new or different
               | approach to replication, although for some reason this
               | has become a widely held perception. CRDT is about a new
               | approach to the _analysis_ of replication mechanisms,
               | using order theory. If you read the original CRDT
               | paper(s) you'll find old-school mechanisms like Lamport
               | Clocks.
               | 
               | So when someone says "we're using a CRDT" this can be
               | translated as: "we're using an eventually consistent
               | replication mechanism proven to converge using techniques
               | from the CRDT paper".
        
               | atombender wrote:
               | Sounds interesting. How do you handle the case where
               | you're unable to send the update to other nodes?
               | 
               | So an update goes to node A, but not to B and C.
               | Meanwhile, the connection to the client may be disrupted,
               | so the client doesn't know the fate of the update. If
               | you're unlucky here, a subsequent read will ask B and C
               | for data, but the newest data is actually on A. Right?
               | 
               | I assume there's some kind of async replication between
               | the nodes to ensure that B and C _eventually_ catch up,
               | but you do have an inconsistency there.
               | 
               | You also say there is no async replication, but surely
               | there must be some, since by definition there is a
               | quorum, and updates aren't hitting all of the nodes.
               | 
               | I understand that CRDTs make it easier to order updates,
               | which solves part of consistent replication, but you
               | still need a consistent _view_ of the data, which is
               | something Paxos, Raft, etc. solve, but CRDTs separated
               | across multiple nodes don't automatically give you that,
               | unless I am missing something. You need more than one
               | node in order to figure out what the newest version is,
               | assuming the client needs perfect consistency.
        
               | lxpz wrote:
               | True; we don't solve this. If there is a fault and data
               | is stored only on node A and not nodes B and C, the view
               | might be inconsistent until the next repair procedure is
               | run (which happens on each read operation if an
               | inconsistency is detected, and also regularly on the
               | whole dataset using an anti-entropy procedure). However
               | if that happens, the client that sends an update will not
               | have received an OK response, so it will know that its
               | update is in an indeterminate state. The only guarantee
               | that Garage gives you is that if you have an OK, the
               | update will be visible in all subsequent reads (read-
               | after-write consistency).
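               | 
               | For illustration, here is the read-repair step as a
               | small Rust sketch (my own simplification, not Garage's
               | code), with versions reduced to a bare number: a read
               | consults two replicas, keeps the newest version it
               | sees, and writes it back to any consulted replica that
               | was stale. The third replica only catches up later,
               | through the anti-entropy pass.
               | 
               |   // Merging is simply taking the newest version.
               |   fn read_with_repair(reps: &mut [Option<u64>; 3],
               |                       ask: [usize; 2]) -> Option<u64> {
               |       // Merge the answers of the two consulted replicas.
               |       let merged = [reps[ask[0]], reps[ask[1]]]
               |           .iter().flatten().copied().max();
               |       // Read-repair: push the merged version back to any
               |       // consulted replica that turned out to be stale.
               |       for &i in &ask {
               |           if reps[i] != merged {
               |               reps[i] = merged;
               |           }
               |       }
               |       merged
               |   }
               | 
               |   fn main() {
               |       // Only replica 0 received version 2 before the fault.
               |       let mut reps = [Some(2), Some(1), Some(1)];
               |       // A read asking replicas 0 and 2 sees the mismatch,
               |       // returns the newest version and repairs replica 2.
               |       let got = read_with_repair(&mut reps, [0, 2]);
               |       assert_eq!(got, Some(2));
               |       // Replica 1 stays stale until anti-entropy runs.
               |       assert_eq!(reps, [Some(2), Some(1), Some(2)]);
               |   }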
        
               | atombender wrote:
               | Thanks. That seems risky if one's client is stateless,
               | but I understand it's a compromise.
        
         | RobLach wrote:
         | I've been researching a solution for this problem for some time
         | and you all have landed on exactly how I thought this should be
         | architected. Excellent work and I foresee much of the future
         | infrastructure of the internet headed in this direction.
        
         | mro_name wrote:
         | I love that attitude towards self-hosting and it not being the
         | sole hobby from that day on. Casual self-hosting if you will.
         | 
         | House-keeping UX is key for self-hosting by laypersons.
        
       | depingus wrote:
       | Hi this is a very cool project, and I love that it's geared
       | towards self-hosting. I'm about to inherit some hardware, so I'm
       | actually looking at clustered network storage options right now.
       | 
       | In my setup, I don't care about redundancy. I much prefer to
       | maximize storage capacity. Can Garage be configured without
       | replication?
       | 
       | If not, maybe someone else can point me in the right direction.
       | Something like a glusterfs distributed volume. At first glance,
       | SeaweedFS has a "no replication" option, but their docs make it
       | seem like they're geared towards handling billions of tiny files.
       | I have about 24TB of 1GB to 20GB files.
        
         | lxpz wrote:
         | You probably want at least _some_ redundancy, otherwise
         | you're at risk of losing data as soon as a single one of
         | your hard drives fails. However, if you're interested in
         | maximizing capacity, you probably need erasure coding, which
         | Garage does not provide. Gluster, Ceph and Minio might all be
         | good candidates for your use case. Or, if you have the
         | possibility of putting all your drives in a single box, just
         | do that and make a ZFS pool.
        
           | depingus wrote:
           | > You probably want at least some redundancy otherwise you're
           | at risk of losing data as soon as a single one of your hard
           | drive fails.
           | 
           | If by "fails" you mean the network connection drops out. Then
           | yes, that would be a huge problem. I was hoping some project
           | had a built-in solution to this. Currently, I'm using
           | MergerFS to effectively create 1 disk out of 3 external USB
           | drives and it handles accidental drive disconnects with no
           | problems (I can't gush enough over how great mergerfs is).
           | 
           | But, if by "fails" you mean actual hardware failure. Then, I
           | don't really care. I keep 1 to 1 backups. A few days of
           | downtime to restore the data isn't a big deal; this is just
           | my home network.
           | 
           | > Or if you have the possibility of putting all your drives
           | in a single box...
           | 
           | Unfortunately, I've maxed out the drive bays on my TS140
           | server. Buying new, larger drives to replace existing drives
           | seems wasteful. Also, I've just been gifted another TS140,
           | which is a good platform to start building another file
           | server.
           | 
           | You've given me something to think about, thanks. I
           | appreciate you taking the time to respond!
        
       | Rygu wrote:
       | Github repo (mirror): https://github.com/deuxfleurs-org/garage
        
       | detritus wrote:
       | Y'all might want to get to the point a bit quicker in your pitch.
       | A lot of fluff to wade through before the lede.
       | 
       | "Garage is a distributed storage solution, that automatically
       | replicates your data on several servers. Garage takes into
       | account the geographical location of servers, and ensures that
       | copies of your data are located at different locations when
       | possible for maximal redundancy, a unique feature in the
       | landscape of distributed storage systems.
       | 
       | Garage implements the Amazon S3 protocol, a de-facto standard
       | that makes it compatible with a large variety of existing
       | software. For instance it can be used as a storage back-end for
       | many self-hosted web applications such as NextCloud, Matrix,
       | Mastodon, Peertube, and many others, replacing the local file
       | system of a server by a distributed storage layer. Garage can
       | also be used to synchronize your files or store your backups with
       | utilities such as Rclone or Restic. Last but not least, Garage
       | can be used to host static websites, such as the one you are
       | currently reading, which is served directly by the Garage cluster
       | we host at Deuxfleurs."
        
         | lxpz wrote:
         | Thanks for the feedback! We will definitely be adjusting our
         | pitch in further iterations.
        
           | jfindley wrote:
           | I'm not quite sure who the article is actually written for.
           | If it's written to generate interest from investors, or for
           | some other such semi/non technical audience then please
           | ignore this (although the fact that I can't tell who it's
           | written for is in itself a warning signal).
           | 
           | If it's written to attract potential users, however, then
           | it's woefully light on useful content. I don't care in the
           | slightest about your views on tech monopolies - why on earth
           | waste such a huge amount of space talking about this? What I
           | care about are three things: 1) what workloads does it do
           | well at, and which ones does it do poorly at (if you don't
           | tell me about the latter I won't believe you - no storage
           | system is great at everything). 2) What
           | consistency/availability model does it use, and 3) how you've
           | tested it.
           | 
           | As written, it's full of fluff and vague handwavy promises
           | (we have PhDs!) - the only technical information in the
           | entire post is that it's written in rust. For users of your
           | application, what programming language it's written in is
           | about the least interesting thing possible to say about it.
           | Even going through your docs I can't find a single proper
           | discussion of your consistency and availability model, which
           | is a huge red flag to me.
        
             | adrn10 wrote:
             | This article broadly explains the motivation and vision
             | behind our project. There are many object stores, why build
             | a new one? To push self-hosting, in the face of
             | democratic/political threats we feel are daunting.
             | 
             | Sorry it didn't float your boat. You're just not the
             | article's target audience I guess! (And, to answer your
             | question: the target audience is mildly-knowledgeable but
             | politically concerned citizens.)
             | 
             | I agree this is maybe not the type of post you'd expect on
             | HN. But, we don't want to _compete_ with other object
             | stores. We want to expand the application domain of
             | distributed data stores.
             | 
             | > I can't find a single proper discussion of your
             | consistency and availability model
             | 
             | LGTM, I would be dubious too. This will come in time,
             | please be patient (the project is less than 2 yo!). You
             | know the Cypherpunks' saying? "We write code." Well, for now,
             | we've been focusing on the code, and we want to continue
             | doing so until 1.0. Some bystanders will do the paperware
             | for sure, but it's not our utmost priority.
        
               | jfindley wrote:
               | Thanks for the response. It's great that you're trying to
               | build something that will allow folks to do self-hosting
               | well, I'm fully on board with this aim. My point about
               | the political bits though, is that even for folks that
               | agree with you, pushing the political bits too much can
               | signal that you're not confident enough in the technical
               | merits of what you're doing for them to stand on their
               | own.
               | 
               | Re: the paperwork vs code thing - While a full white
               | paper on your theoretical foundations would be nice, IMO
               | the table stakes for getting people to take your product
               | seriously is a sentence or two on the model you're using.
               | For example, adding something like: "We aim to prioritise
               | consistency and partition tolerance over availability,
               | and use the raft algorithm for leader elections. Our
               | particular focus is making sure the storage system is
               | really good at handling X and Y." or whatever would be a
               | huge improvement, until you can get around to writing up
               | the details.
               | 
               | The problem with "we write code" is that your users are
               | unlikely to trust you to be writing the correct code,
               | when there's so little evidence that you've thought
               | deeply about how to solve the hard problems inherent in
               | distributed storage.
        
               | adrn10 wrote:
               | It was actually deliberate to present our project through
               | the political lens, today.
               | 
               | You will find a more streamlined presentation of Garage
               | on our landing page at https://garagehq.deuxfleurs.fr/
               | 
               | Technical details are also documented (although the
               | documentation is insufficient and needs an overhaul):
               | https://garagehq.deuxfleurs.fr/documentation/design/
               | 
               | Thank you for your comments!
        
             | StillBored wrote:
             | For the average user it shouldn't matter what it's written
             | in, but lately I've returned to the theory that language
             | choice for a project signals a certain amount of technical
             | sophistication. Particularly, for things like this, which
             | should be considered systems programming. The overhead of a
             | slow JIT, GC, threading model, IO model, whatever basically
             | means the project is doomed to subpar perf forever and the
             | cost of developing it has been passed on to the users,
             | who will be forced to buy 10x the hardware to compensate.
        
           | JKCalhoun wrote:
           | I'm the layest of laymen, is Garage "a distributed Dropbox"?
        
             | adrn10 wrote:
             | It's the layer below: a sound distributed storage backend.
             | You can use it to distribute your Nextcloud data, store
             | your backups, host websites... You can even mount it in
             | your OS' file manager and use it as an external drive!
        
           | erwincoumans wrote:
           | Who owns the servers?
        
             | superboum wrote:
             | You own the servers. This is a tool to build your own
             | object-storage cluster. For example, you can get 3 old
             | desktop PCs, install Linux on them, download and launch
             | Garage on them, configure your 3 instances in a single
             | cluster, then send data to this cluster. Your data will be
             | spread and duplicated on the 3 machines. If one machine
             | fails or is offline, you can still access and write data to
             | the cluster.
        
               | [deleted]
        
               | erwincoumans wrote:
               | Then, how does Garage achieve 'different geographical
               | locations'? I only have my house to put my server(s).
               | That's one of the main reasons I'm using cloud storage.
               | Or is the point that I can arrange those servers abroad
               | myself, independent of the software solution (S3 etc)?
        
               | lxpz wrote:
               | Garage is designed for self-hosting by collectives of
               | system administrators, what we could call "inter-
               | hosting". Basically you ask your friends to put a server
               | box at their home and achieve redundancy that way.
        
               | rglullis wrote:
               | Is the data encrypted at rest, or should I encrypt the
               | data myself?
        
               | superboum wrote:
               | The content is currently stored in plaintext on the disk
               | by Garage, so you have to encrypt the data yourself. For
               | example, you can configure your server to encrypt the
               | partition that contains your `data_dir` and your
               | `meta_dir` at rest, or build/use applications that
               | support client-side encryption, such as rclone with its
               | crypt module[0] or Nextcloud with its end-to-end
               | encryption module[1].
               | 
               | [0]: https://rclone.org/crypt/
               | 
               | [1]: https://nextcloud.com/endtoend/
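               | 
               | For the rclone route, a minimal rclone.conf along these
               | lines layers a crypt remote on top of a Garage S3
               | remote (the bucket name, endpoint and keys below are
               | placeholders; generate the passwords with `rclone
               | obscure`):
               | 
               |   [garage]
               |   type = s3
               |   provider = Other
               |   endpoint = http://localhost:3900
               |   access_key_id = ...
               |   secret_access_key = ...
               | 
               |   [garage-crypt]
               |   type = crypt
               |   remote = garage:my-encrypted-bucket
               |   password = ...
               | 
               | Then a command like `rclone copy ./backups garage-crypt:`
               | should send only encrypted file contents and names to
               | the cluster.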
        
               | erwincoumans wrote:
               | Great, thanks. It would be great to add this in the main
               | explanation (how to get access to servers).
        
       ___________________________________________________________________
       (page generated 2022-02-08 23:01 UTC)