[HN Gopher] Infinite Git repos on Cloudflare workers
       ___________________________________________________________________
        
       Infinite Git repos on Cloudflare workers
        
       Author : plesiv
       Score  : 117 points
       Date   : 2024-10-25 17:34 UTC (5 hours ago)
        
 (HTM) web link (gitlip.com)
 (TXT) w3m dump (gitlip.com)
        
       | yjftsjthsd-h wrote:
       | > It allows us to easily host an infinite number of repositories
       | 
       | I like this system in general, but I don't understand why scaling
       | the number of repos is treated as a pinch point? Are there git
       | hosts that struggle with the number of repos hosted in
       | particular? (I don't think the "Motivation" section answers this,
       | either.)
        
         | plesiv wrote:
         | OP here.
         | 
         | It's unlikely any Git providers struggle with the number of
         | repos they're hosting, but most are larger companies.
         | 
         | Currently, we're a bootstrapped team of 2. I think our approach
         | changes the kind of product we can build as a small team.
        
           | rad_gruchalski wrote:
           | How? What makes it so much more powerful than gitea hosted on
           | a cheap vps with some backup in s3?
           | 
           | Unless, of course, your product is infinite git repos with cf
           | workers.
        
         | icambron wrote:
          | Seems like it enables you to do things like use git repos as
          | per-customer or per-some-business-object storage, which you
          | otherwise wouldn't even consider. Like imagine you were setting
          | up a blogging site where each blog was backed by a repo.
        
           | abraae wrote:
           | Or perhaps a SaaS product where individual customers had
           | their own fork of the code.
           | 
           | There are many reasons not to do this, perhaps this scratches
           | away at one of them.
        
         | bhl wrote:
          | Serverless git repos would be useful if you wanted to make a
          | product like real-time collaborative code editing in the
          | browser with offline support.
         | 
         | You can still sync to a platform like GitHub or BitBucket after
         | all users close their tabs.
         | 
          | A long time ago, I looked into using isomorphic-git with
          | lightning-fs to build a light note-taking app in the browser:
          | pull your markdown files in, edit them in a rich-text editor a
          | la Notion, then stage and commit the changes back using git.
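          | 
          | For illustration, a minimal sketch of that flow with
          | isomorphic-git and lightning-fs (the repo URL, file name and
          | auth are placeholders, and a real browser setup usually also
          | needs a CORS proxy):
          | 
          |   import LightningFS from '@isomorphic-git/lightning-fs'
          |   import git from 'isomorphic-git'
          |   import http from 'isomorphic-git/http/web'
          |   
          |   const fs = new LightningFS('notes')   // IndexedDB-backed fs
          |   const dir = '/notes'
          |   const url = 'https://example.com/notes.git'
          |   
          |   // Pull the markdown files into the browser.
          |   await git.clone({ fs, http, dir, url, depth: 1 })
          |   
          |   // Edit a note, then stage and commit it locally.
          |   await fs.promises.writeFile(`${dir}/inbox.md`, '# Inbox\n', 'utf8')
          |   await git.add({ fs, dir, filepath: 'inbox.md' })
          |   await git.commit({
          |     fs, dir,
          |     message: 'Update inbox',
          |     author: { name: 'Me', email: 'me@example.com' },
          |   })
          |   
          |   // Sync back to a hosting platform when you're done.
          |   await git.push({
          |     fs, http, dir, remote: 'origin',
          |     onAuth: () => ({ username: 'TOKEN' }),
          |   })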
        
           | aphantastic wrote:
           | That's essentially what github.dev and vscode.dev do FWIW.
        
       | jauntywundrkind wrote:
       | > _After extensive research, we rewrote significant parts of
       | Emscripten to support asynchronous file system calls._
       | 
       | > _We ended up creating our own Emscripten filesystem on top of
       | Durable Objects, which we call DOFS._
       | 
       | > _We abandoned the porting efforts and ended up implementing the
       | missing Git server functionality ourselves by leveraging
       | libgit2's core functionality, studying all available
       | documentation, and painstakingly investigating Git's behavior._
       | 
        | Using a ton of great open source & taking it all further. Would
        | sure be great if y'all could contribute some of this forward!
       | 
        | Libgit2 is GPL with a linking exception, and Emscripten is MIT,
        | so I think legally everything is in the clear. But it sure would
        | be such a boon to share.
        
         | plesiv wrote:
         | Definitely! We're focused on launching right now, but once we
         | have more bandwidth, we'd be happy to do it.
         | 
         | I believe our changes are solid, but they're tailored
         | specifically to our use case and can't be merged as-is. For
         | example, our modifications to libgit2 would need at least as
         | much additional code to make them toggleable in the build
         | process, which requires extra effort.
        
           | abstractbeliefs wrote:
            | No free software, no support. You don't have to merge it
           | upstream right away, but publish it for others to study and
           | use as permitted by the license.
        
       | scosman wrote:
       | Serverless git repos: super cool
       | 
       | But I can't figure out what makes this an AI company. Seems like
       | a collaboration tool?
        
         | plesiv wrote:
         | OP here. We're not an AI company; we're aiming to be AI-
         | adjacent and simplify the practical application of AI models.
        
           | ijamj wrote:
           | Honest question: how is this "AI-adjacent"? How does it
           | specifically "simplify the practical application of AI
           | models"? Focus of the question being on "AI"...
        
       | gavindean90 wrote:
        | I really like the idea of a file system over durable objects.
        
         | eastdakota wrote:
         | I do too.
        
       | yellow_lead wrote:
        | The latency on the examples seems quite high, around 7 seconds
        | for a full load for me.
       | 
       | https://gitlip.com/@nataliemarleny/test-repo
        
         | plesiv wrote:
         | OP here. That's expected for now, and we're working on a
         | solution. We didn't explain the reason in the post because we
         | plan to cover it in a separate write-up.
        
           | yellow_lead wrote:
           | I see you haven't launched yet so that's fair. Looking
           | forward to trying it
        
       | ecshafer wrote:
       | Github doesn't stop me from making an infinite number of git
       | repos. Or maybe they do, but I have never hit the limit. And if I
       | am hitting that limit, and become a large enterprise customer, I
       | am sure they would work with me on getting around that limit.
       | 
       | Where does this fit into a product? Maybe I am blind, but while
       | this is cool, I don't really see where I would want this.
        
         | plesiv wrote:
         | OP here. We're building a new kind of Git platform. "Infinity"
         | is more beneficial for us as platform builders (simplifying
         | infrastructure) but less relevant to our customers as users.
        
         | aftbit wrote:
         | Github would definitely reach out if you tried to make 100k+
         | Github repos. We once automatically opened issues in response
         | to exceptions (sort of a ghetto Bugsnag / Sentry) and received
         | a nice email from an engineer asking us if we really needed to
         | do that when we hit around the 200k mark.
        
           | foota wrote:
           | In some ways, you could imagine repos might be more scalable
           | than issues within a repo, since you could reasonably assume
           | a bound on the number of issues in a single repo.
        
           | no_wizard wrote:
           | Oh here's an interesting idea.
           | 
           | What if these bug reporting platforms could create a branch
           | and tag it for each issue.
           | 
            | This would be particularly useful for point-in-time things
           | where you have an immutable deployment branch. So it could
           | create a branch off that immutable deployment branch and tag
           | it, so you always have a point in time code reference for
           | bugs.
           | 
           | Would that be useful? I feel like what you're doing here
           | isn't that different if I get what's going on (basically
           | creating one repository per bug?)
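            | 
            | A hypothetical sketch of that flow, shelling out to the git
            | CLI (the branch, tag and issue names are made up):
            | 
            |   import { execSync } from 'node:child_process'
            |   
            |   const repo = '/path/to/repo'
            |   const issue = 123
            |   const deployBranch = 'deploy/2024-10-25'  // immutable branch
            |   
            |   // Branch off the deployment branch and tag it for the bug,
            |   // so the issue keeps a fixed point-in-time code reference.
            |   execSync(`git branch bug/${issue} ${deployBranch}`, { cwd: repo })
            |   execSync(
            |     `git tag -a -m "code as deployed" issue-${issue} bug/${issue}`,
            |     { cwd: repo },
            |   )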
        
             | justincormack wrote:
              | Github weren't terribly happy with the number of branches we
             | created for this type of use case at one point.
        
               | 0zymandiass wrote:
               | A branch doesn't use any more space than a commit... I'm
               | curious what their complaint was with a large number of
               | branches?
               | 
               | There are various repositories with 500k+ commits
        
               | dizhn wrote:
                | It might be something silly like the number of items in
                | the Branches dropdown menu.
        
             | aphantastic wrote:
              | Why not just keep the sha of the release in the bug report?
        
         | shivasaxena wrote:
          | Imagine every Notion doc or every Airtable base being a git
          | repo. Imagine the PR workflow that we developers love being
          | available to everyone.
        
       | VoidWhisperer wrote:
       | Not the main purpose of the article but they mention they were
       | working on a notetaking app oriented towards developers - did
       | anything ever come of that? If not, does anyone know products
       | that might fit this niche? (I currently use obsidian)
        
         | plesiv wrote:
         | OP here. Not yet - it's about 50% complete. I plan to open-
         | source it in the future.
        
           | nbbaier wrote:
           | Definitely interested in seeing this as well. What are the
           | key features?
        
       | tredre3 wrote:
       | > Wanting to avoid managing the servers ourselves, we
       | experimented with a serverless approach.
       | 
        | I must be getting old, but building a gigantic house of cards of
        | interlinked components only to arrive at a more limited solution
        | is truly bizarre to me.
       | 
       | The maintenance burden for a VPS: periodically run apt update
       | upgrade. Use filesystem snapshots to create periodic backups. If
       | something happens to your provider, spin up a new VM elsewhere
       | with your last snapshot.
       | 
       | The maintenance burden for your solution: Periodically merge
       | upstream libgit2 in your custom fork, maintain your custom git
       | server code and audit it for vulnerabilities, make sure
       | everything still compiles with emscripten, deploy it. Rotate API
       | keys to make sure your database service can talk to your storage
       | service and your worker service. Then I don't even know how you'd
        | back up all this to get it back online quickly if something
       | happened to cloudflare. And all that only to end up with worse
       | latency than a VPS, and more size constraints on the repo and
       | objects.
       | 
       | But hey, at least it scales infinitely!
        
         | notamy wrote:
         | > The maintenance burden for a VPS: periodically run apt update
         | upgrade. Use filesystem snapshots to create periodic backups.
         | If something happens to your provider, spin up a new VM
         | elsewhere with your last snapshot.
         | 
         | And make sure it reboots for kernel upgrades (or set up live-
         | patching), and make sure that service updates don't go
         | wrong[0], and make sure that your backups work consistently,
         | and make sure that you're able to vertically or horizontally
         | scale, and make sure it's all automated and repeatable, and
         | make sure the automation is following best-practices, and make
         | sure you're not accidentally configuring any services to be
         | vulnerable[1], and ...
         | 
          | Making this stuff someone else's problem by using managed
          | services is a lot easier, especially with a smaller team,
          | because then you can focus on what you're building rather than
          | making sure your SPOF VPS is still running correctly.
         | 
         | [0] I self-host some stuff for a side-project right now, and
          | package updates are miserable because they're not simply `apt-
         | get update && apt-get upgrade`. Instead, the documented upgrade
         | process for some services is more/less "dump the entire DB,
         | stop the service, rm -rf the old DB, upgrade the service
         | package, start the service, load the dump in, hope it works."
         | 
          | [1] Because it's _so easy_ to configure something to be
          | vulnerable when that makes things easier, even if the
          | vulnerability was unintentional.
        
         | kentonv wrote:
         | > Periodically merge upstream libgit2 in your custom fork,
         | maintain your custom git server code and audit it for
         | vulnerabilities, make sure everything still compiles with
         | emscripten, deploy it.
         | 
         | There's only a difference here because there exist off-the-
         | shelf git packages for traditional VPS environments but there
         | do not yet exist off-the-shelf git packages for serverless
         | stacks. The OP is a pioneer here. The work they are doing is
         | what will eventually make this an off-the-shelf thing for
         | everyone else.
         | 
         | > Rotate API keys to make sure your database service can talk
         | to your storage service and your worker service.
         | 
         | Huh? With Durable Objects the storage is local to each object.
         | There is no API key involved in accessing it.
         | 
         | > Then I don't even know how you'd backup all this
         | 
         | Durable Object storage (under the new beta storage engine)
         | automatically gives you point-in-time recovery to any point in
         | time in the last 30 days.
         | 
         | https://developers.cloudflare.com/durable-objects/api/storag...
         | 
         | > And all that only to end up with worse latency than a VPS
         | 
         | Why would it be worse? It should be better, because Cloudflare
         | can locate each DO (git repo) close to whoever is accessing it,
         | whereas your VPS is going to sit in one single central location
         | that's probably further away.
         | 
         | > and more size constraints on the repo and objects.
         | 
         | While each individual repo may be more constrained, this
         | solution can scale to far more total repos than a single-server
         | VPS could.
         | 
         | (I'm the tech lead for Cloudflare Workers.)
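          | 
          | For anyone unfamiliar with the model, a minimal sketch of a
          | Durable Object whose storage lives with the object itself (the
          | class and key names are illustrative, not Gitlip's actual code):
          | 
          |   // One Durable Object per repo; storage.put/get are local
          |   // calls, with no external database or API key involved.
          |   export class RepoObject {
          |     constructor(private state: DurableObjectState) {}
          |   
          |     async fetch(request: Request): Promise<Response> {
          |       const key = new URL(request.url).pathname
          |   
          |       if (request.method === 'PUT') {
          |         // Persist a git object or pack under its path.
          |         await this.state.storage.put(key, await request.arrayBuffer())
          |         return new Response('stored')
          |       }
          |   
          |       const value = await this.state.storage.get<ArrayBuffer>(key)
          |       return value
          |         ? new Response(value)
          |         : new Response('not found', { status: 404 })
          |     }
          |   }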
        
       | ericyd wrote:
       | Engaging read! For me, just the right balance of technical detail
       | and narrative content. It's a hard balance to strike and I'm sure
       | preferences vary widely which makes it an impossible target for
       | every audience.
        
       | betaby wrote:
        | Somewhat related question. Assume I have ~1k ~200MB XML files
        | that get ~20% of their content changed. What are my best options
        | for storing them? While using vanilla git on an SSD raid10 works,
        | it's quite slow at retrieving historical data dating back ~3-6
        | months. Are there other options for a quicker back-end? I'm fine
        | with it not being all that storage-efficient, to a degree.
        
         | nomel wrote:
          | If you can share, I'd be curious to know what that large an XML
          | file might be used for, and what benefits it might have over
          | other formats. My personal and professional use of XML has been
          | pretty limited, but XSD was super powerful, and the reason we
          | chose it when we did.
        
           | hobs wrote:
            | It's a good question, because my answer for a system like
            | this, which had very little schema change, was to just dump it
            | into a database and add historical tracking per object that
            | way: hash, compare, insert, and add a historical record.
        
             | betaby wrote:
              | I do have the current state in the DB. However, I sometimes
              | need to compare today's file with the one from 6 months ago.
        
               | hobs wrote:
                | So I assumed something like this: you have the same schema
                | with the same tabular format inside the XML document, and
                | the state changes carry a timestamp, so you can bring up
                | both states at the same time and compare across the
                | attributes for wrongness.
                | 
                | EXCEPT/INTERSECT make this easy for a bunch of columns
                | (excluding the timestamps of course; I usually hash these
                | for performance reasons) but won't tell you what the
                | difference itself is. You have to do column-by-column
                | comparisons there, which is where I usually shell out to
                | my language of choice, because SQL sucks at doing that.
        
           | betaby wrote:
            | Juniper routers configs, something like below.
            | 
            |   adamc@router> show arp | display xml
            |   <rpc-reply xmlns:JUNOS="http://xml.juniper.net/JUNOS/15.1F6/JUNOS">
            |     <arp-table-information
            |         xmlns="http://xml.juniper.net/JUNOS/15.1F6/JUNOS-arp"
            |         JUNOS:style="normal">
            |       <arp-table-entry>
            |         <mac-address>0a:00:27:00:00:00</mac-address>
            |         <ip-address>10.0.201.1</ip-address>
            |         <hostname>adamc-mac</hostname>
            |         <interface-name>em0.0</interface-name>
            |         <arp-table-entry-flags>
            |           <none/>
            |         </arp-table-entry-flags>
            |       </arp-table-entry>
            |     </arp-table-information>
            |     <cli>
            |       <banner></banner>
            |     </cli>
            |   </rpc-reply>
        
         | tln wrote:
         | > get ~20% of their content changed
         | 
         | ...daily? monthly? how many versions do you have to keep
         | around?
         | 
          | I'd look at a simple zstd dictionary-based scheme first. Put
          | your history/metadata into a database. Put the XML data into a
          | file system/S3/Backblaze B2, zstd-compressed against a
          | dictionary.
         | 
          | Create the dictionary:
          | 
          |   zstd --train PathToTrainingSet/* -o dictionaryName
          | 
          | Compress with the dictionary:
          | 
          |   zstd FILE -D dictionaryName
          | 
          | Decompress with the dictionary:
          | 
          |   zstd --decompress FILE.zst -D dictionaryName
         | 
          | Although you say you're fine with it not being that storage
          | efficient to a degree, I think if you were OK with storing
          | every version of every XML file uncompressed, you wouldn't
          | have to ask, right?
        
           | betaby wrote:
            | If one stores whole versions of the files, that defeats the
            | idea of git and would consume too much space. I suppose I
            | don't even need zstd if I have ZFS with compression, although
            | the compression levels won't be as good.
        
             | tln wrote:
             | You're relying on compression either way... my hunch is
             | that controlling the compression yourself may get you a
             | better result.
             | 
              | Git does not store diffs, it stores every version. These get
              | compressed into packfiles:
              | https://git-scm.com/book/en/v2/Git-Internals-Packfiles
              | It looks like it uses zlib.
        
         | hokkos wrote:
         | You can compress in EXI, it's a format for XML and if it is
         | informed by the schema can give a big boost in compression.
        
         | adobrawy wrote:
         | I don't know what your "best" criterion is (implementation
         | costs, implementation time, maintainability, performance,
         | compression ratio, etc.). Still, the easiest way to start is to
         | delegate it to the file system, so zfs + compression. Access
         | time should be decent. No application-level changes are
         | required to enable that.
        
           | betaby wrote:
           | It is already on ZFS with compression.
        
         | o11c wrote:
         | I'm not sure if this _quite_ fits your workload, but a lot of
         | times people use `git` when `casync` would be more appropriate.
        
       | tln wrote:
       | Congrats, you've done a lot of interesting work to get here.
       | 
       | This could be a fantastic building block for headless CMS and the
       | like.
        
         | plesiv wrote:
         | OP here. Thank you and good catch! :-) We have a blog post
         | planned on that topic.
        
       | sluongng wrote:
       | @plesiv could you please elaborate on how repack/gc is handled
       | with a libgit2 backend? I know that Alibaba has done something
       | similar in the past based on libgit2, but I have yet to see
       | another implementation in the wild like this.
       | 
        | Very cool project. I hope Cloudflare Workers can support more
        | protocols like SSH and gRPC. It's one of the reasons why I prefer
        | Fly.io over Cloudflare Workers for special servers like this.
        
         | plesiv wrote:
         | Great question! By default, with libgit2 each write to a repo
         | (e.g. push) will create a new pack file. We have written a
         | simple packing algorithm that runs after each write. It works
         | like this:
         | 
         | Choose these values:
         | 
         | * P, pack "Planck" size, e.g. 100kB
         | 
         | * N, branching factor, e.g. 8
         | 
         | After each write:
         | 
          | 1. iterate over each pack (pack size is S) and assign each pack
          | a class C, which is the smallest integer that satisfies
          | P * N^C > S
          | 
          | 2. iterate a variable c from 0 to the maximum value of C that
          | you got in step 1
          | 
          | * if there are N packs of class c, repack them into a new pack;
          | the new pack is going to be at most of class c+1
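          | 
          | A rough TypeScript sketch of that loop (sizes in bytes; the
          | actual merging of pack files is elided, only the resulting pack
          | size is tracked):
          | 
          |   const P = 100_000 // pack "Planck" size, e.g. 100 kB
          |   const N = 8       // branching factor
          |   
          |   // Class C is the smallest integer with P * N^C > size.
          |   function packClass(size: number): number {
          |     let c = 0
          |     for (let limit = P; limit <= size; limit *= N) c++
          |     return c
          |   }
          |   
          |   // Whenever N packs share a class, merge them; the merged pack
          |   // is at most one class larger, so merges can cascade upward.
          |   function repackAfterWrite(sizes: number[]): number[] {
          |     const packs = [...sizes]
          |     for (let c = 0; c <= Math.max(0, ...packs.map(packClass)); c++) {
          |       let group = packs.filter((s) => packClass(s) === c)
          |       while (group.length >= N) {
          |         for (const s of group.slice(0, N)) {
          |           packs.splice(packs.indexOf(s), 1)
          |         }
          |         // The merged pack is at worst the sum of its parts.
          |         packs.push(group.slice(0, N).reduce((a, b) => a + b, 0))
          |         group = packs.filter((s) => packClass(s) === c)
          |       }
          |     }
          |     return packs
          |   }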
        
       | gkoberger wrote:
       | This is really cool! I've been building something on libgit2 +
       | EFS, and this approach is really interesting.
       | 
       | Between libgit2 on emscripten, the number of file writes to DO,
       | etc, how is performance?
        
       | seanvelasco wrote:
        | this leverages Durable Objects, but as i remember from two years
        | ago, DO's way of guaranteeing uniqueness is that there can only
        | be one instance of that DO in the world.
        | 
        | what if there are two users who want to access the same DO repo
        | at the same time, one in the US and the other in Singapore? the
        | DO must live either on US servers or SG servers, but not both at
        | the same time. so one of the two users must have high latency
        | then?
       | 
       | then after some time, a user in Australia accesses this DO repo -
       | the DO bounces to AU servers - US and SG users will have high
       | latency?
       | 
       | but please correct me if i'm wrong
        
         | skybrian wrote:
         | Last I heard, durable objects don't move while running. It
         | doesn't seem worse than hosting in US-East, though.
        
       | skybrian wrote:
       | Not having a technical limit is nice, because then it's a matter
       | of spending money. But whenever I see "infinite," I ask what it
       | will cost. How expensive is it to host git repos this way?
       | 
       | As a hobbyist, "free" is pretty appealing. I'm pretty sure my
       | repos on GitHub won't cost me anything, and that's unlikely to
       | change anytime soon. Not sure about the new stuff.
        
         | jsheard wrote:
         | With CloudFlare at least when you overstay your welcome on the
         | free plan they just start nagging you to start paying, and
         | possibly kick you out if you don't, rather than sending you a
         | surprise bill for $10,000 like AWS or Azure or GCP might do.
        
       | koolba wrote:
       | > We're building Gitlip - the collaborative devtool for the AI
       | era. An all-in-one combination of Git-powered version control,
       | collaborative coding and 1-click deployments.
       | 
       | Did they get a waiver from the git team to name it as such?
       | 
       | Per the trademark policy, new "git${SUFFIX}" names aren't
       | allowed: https://git-scm.com/about/trademark
       | 
       | >> In addition, you may not use any of the Marks as a syllable in
       | a new word or as part of a portmanteau (e.g., "Gitalicious",
       | "Gitpedia") used as a mark for a third-party product or service
       | without Conservancy's written permission. For the avoidance of
       | doubt, this provision applies even to third-party marks that use
       | the Marks as a syllable or as part of a portmanteau to refer to a
       | product or service's use of Git code.
        
         | WorkerBee28474 wrote:
         | You don't need their permission to make a portmanteau, all you
         | need is to follow trademark law (which may or may not allow
         | it). The policy page can go kick sand.
        
           | saurik wrote:
           | While true, using someone else's trademark as a _prefix_ of
           | your name when you are actively _intending_ it to reference
           | the protected use seems egregious.
        
             | eli wrote:
             | Do you think many users will mistakenly believe Gitlip is
             | an official Git project put out by the same authors as Git?
             | 
             | There can't be trademark infringement unless there is a
             | likelihood of confusion.
        
         | plesiv wrote:
         | OP here. Oops, thank you for pointing that out! We weren't
         | aware of it. We will investigate ASAP. In the worst case, we'll
         | change our name.
        
           | benatkin wrote:
              | Doesn't sound like the worst case to me. It could use a
              | better name anyway.
        
             | Spunkie wrote:
             | gitlip is not a good name, but you can be sure that if a
             | new name does not include git it will be a worse name.
        
         | rzzzt wrote:
         | What about an old word? Agitator, legitimate, cogitate?
        
         | fumplethumb wrote:
          | What about... GitHub, GitLab, GitKraken, GitButler (featured on
          | HN recently)? The list goes on forever!
        
           | afiori wrote:
           | Supposedly they got written permission
        
       | bagels wrote:
       | Infinite sounds like a bug happened. It's obviously not infinite,
       | some resource will eventually be exhausted, in this case, memory.
        
       | iampims wrote:
       | Some serious engineering here. Kudos!
        
       | nathants wrote:
       | this is very cool!
       | 
       | i prototyped a similar serverless git product recently using a
       | different technique.
       | 
       | i used aws lambda holding leases in dynamo backed by s3. i zipped
       | git binaries into the lambda and invoked them directly. i used
       | shallow clone style repos stored in chunks in s3, that could be
       | merged as needed in lambda /tmp.
       | 
       | lambda was nice because for cpu heavy ops like merging many
       | shallow clones, i could do that in a larger cpu lambda, and cache
       | the result.
       | 
       | other constraints were similar to what is described here. mainly
       | that an individual push/pull cannot exceed the api gateway max
       | payload size, a few MB.
       | 
       | i looked at isomorphic, but did not try emscripten libgit2. using
       | cloudflare is nice because of free egress, which opens up many
       | new use cases that don't make sense on $0.10/GB egress.
       | 
       | i ended up shelving this while i build a different product. glad
       | to see others pursuing the same thing, serverless git is an
       | obvious win! do you back your repos with r2?
       | 
       | for my own git usage, what i ended up building was a trustless
       | git system backed by dynamo and s3 directly. this removes the
       | push/pull size limit, and makes storage trustless. this uses git
       | functionality i had no idea about prior, git bundle and
       | unbundle[1]. they are used for transfer of git objects without a
       | server, serverless git! this one i published[2].
       | 
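        | for the curious, the bundle round-trip looks roughly like this
        | (paths and refs below are placeholders, shelling out to the git
        | cli):
        | 
        |   import { execSync } from 'node:child_process'
        |   
        |   const repo = '/path/to/repo'
        |   
        |   // "push": pack all refs + objects into a single bundle file
        |   // that can be parked on dumb storage like s3.
        |   execSync('git bundle create /tmp/repo.bundle --all', { cwd: repo })
        |   
        |   // "pull": clone or fetch straight from the bundle, no server.
        |   execSync('git clone /tmp/repo.bundle /tmp/restored')
        |   
        |   // follow-ups only need the commits since the last sync point
        |   // (here "last-sync" is a hypothetical ref marking that point).
        |   execSync('git bundle create /tmp/delta.bundle last-sync..main', {
        |     cwd: repo,
        |   })
        | 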
       | good luck with your raise and your project. looking forward to
       | the next blog. awesome stuff.
       | 
       | 1. https://git-scm.com/docs/git-bundle
       | 
       | 2. https://github.com/nathants/git-remote-aws
        
       | stavros wrote:
       | This is a very impressive technical achievement, and it's clear
       | that a lot of work went into it.
       | 
       | Unfortunately, the entrepreneur in me continues that thought with
       | "work that could have gone into finding customers instead". Now
       | you have a system that could store "infinite" git repos, but how
       | many customers?
        
         | akerl_ wrote:
         | I agree. That's a really unfortunate way to view somebody's
         | project.
        
           | stavros wrote:
           | Is it a project, or is it a company?
        
         | deadbunny wrote:
         | TFA is literally marketing for them. HN is their target
         | audience and a good way to capture that audience is to show
         | them something technically interesting.
        
       | Spunkie wrote:
        | I've been wondering what to do to back up our github repos other
        | than keeping a local copy and/or dumping them on something like
        | S3.
       | 
       | I would love to use this to serve as a live/working automatic
       | backup for my github repos on CF infrastructure.
        
       ___________________________________________________________________
       (page generated 2024-10-25 23:01 UTC)