[HN Gopher] Infinite Git repos on Cloudflare workers
___________________________________________________________________
Infinite Git repos on Cloudflare workers
Author : plesiv
Score : 117 points
Date : 2024-10-25 17:34 UTC (5 hours ago)
(HTM) web link (gitlip.com)
(TXT) w3m dump (gitlip.com)
| yjftsjthsd-h wrote:
| > It allows us to easily host an infinite number of repositories
|
| I like this system in general, but I don't understand why scaling
| the number of repos is treated as a pinch point? Are there git
| hosts that struggle with the number of repos hosted in
| particular? (I don't think the "Motivation" section answers this,
| either.)
| plesiv wrote:
| OP here.
|
| It's unlikely any Git providers struggle with the number of
| repos they're hosting, but most are larger companies.
|
| Currently, we're a bootstrapped team of 2. I think our approach
| changes the kind of product we can build as a small team.
| rad_gruchalski wrote:
| How? What makes it so much more powerful than gitea hosted on
| a cheap vps with some backup in s3?
|
| Unless, of course, your product is infinite git repos with cf
| workers.
| icambron wrote:
| Seems like it enables you to do things like use git repos as
| per-customer or per-business-object storage, which you
| otherwise wouldn't even consider. Like imagine you were setting
| up a blogging site where each blog was backed by a repo.
| abraae wrote:
| Or perhaps a SaaS product where individual customers had
| their own fork of the code.
|
| There are many reasons not to do this, perhaps this scratches
| away at one of them.
| bhl wrote:
| Serverless git repos would be useful if you wanted to make a
| product like real-time collaborative code editing in the
| browser with offline support.
|
| You can still sync to a platform like GitHub or Bitbucket after
| all users close their tabs.
|
| A long time ago, I looked into using isomorphic-git with
| lightning-fs to build a light note-taking app in the browser:
| pull your markdown files in, edit them in a rich-text editor a
| la Notion, stage and then commit changes back using git.
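|
| For illustration, a rough sketch of that flow with
| isomorphic-git and lightning-fs (the repo URL, file name, and
| author here are placeholders, not a real setup):
|
|     import LightningFS from "@isomorphic-git/lightning-fs";
|     import git from "isomorphic-git";
|     import http from "isomorphic-git/http/web";
|
|     // IndexedDB-backed filesystem that persists in the browser
|     const fs = new LightningFS("notes");
|     const dir = "/notes";
|
|     // pull the markdown files into the browser
|     await git.clone({ fs, http, dir,
|       url: "https://example.com/notes.git", depth: 1 });
|
|     // ...edit in the rich-text editor, then write the result back
|     const updatedMarkdown = "# Today\n- try serverless git\n";
|     await fs.promises.writeFile(`${dir}/todo.md`, updatedMarkdown, "utf8");
|
|     // stage and commit the change locally (works offline)
|     await git.add({ fs, dir, filepath: "todo.md" });
|     await git.commit({ fs, dir, message: "Update todo",
|       author: { name: "Note Taker", email: "notes@example.com" } });
|
|     // push back to GitHub/Bitbucket once online again
|     await git.push({ fs, http, dir, remote: "origin", ref: "main" });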
| aphantastic wrote:
| That's essentially what github.dev and vscode.dev do FWIW.
| jauntywundrkind wrote:
| > _After extensive research, we rewrote significant parts of
| Emscripten to support asynchronous file system calls._
|
| > _We ended up creating our own Emscripten filesystem on top of
| Durable Objects, which we call DOFS._
|
| > _We abandoned the porting efforts and ended up implementing the
| missing Git server functionality ourselves by leveraging
| libgit2's core functionality, studying all available
| documentation, and painstakingly investigating Git's behavior._
|
| Using a ton of great open source & taking it all further. Would
| sure be great if y'all could contribute some of this forward!
|
| Libgit2 is GPL with Linking Exception, and Emscripten MIT so I
| think legally everything is in the clear. But it sure would be
| such a boon to share.
| plesiv wrote:
| Definitely! We're focused on launching right now, but once we
| have more bandwidth, we'd be happy to do it.
|
| I believe our changes are solid, but they're tailored
| specifically to our use case and can't be merged as-is. For
| example, our modifications to libgit2 would need at least as
| much additional code to make them toggleable in the build
| process, which requires extra effort.
| abstractbeliefs wrote:
| No free software no support. You don't have to merge it
| upstream right away, but publish it for others to study and
| use as permitted by the license.
| scosman wrote:
| Serverless git repos: super cool
|
| But I can't figure out what makes this an AI company. Seems like
| a collaboration tool?
| plesiv wrote:
| OP here. We're not an AI company; we're aiming to be AI-
| adjacent and simplify the practical application of AI models.
| ijamj wrote:
| Honest question: how is this "AI-adjacent"? How does it
| specifically "simplify the practical application of AI
| models"? Focus of the question being on "AI"...
| gavindean90 wrote:
| I really like the idea of a file system over durable objects
| eastdakota wrote:
| I do too.
| yellow_lead wrote:
| The latency on the examples seems quite slow, around 7 seconds
| to fully load for me.
|
| https://gitlip.com/@nataliemarleny/test-repo
| plesiv wrote:
| OP here. That's expected for now, and we're working on a
| solution. We didn't explain the reason in the post because we
| plan to cover it in a separate write-up.
| yellow_lead wrote:
| I see you haven't launched yet so that's fair. Looking
| forward to trying it
| ecshafer wrote:
| Github doesn't stop me from making an infinite number of git
| repos. Or maybe they do, but I have never hit the limit. And if I
| am hitting that limit, and become a large enterprise customer, I
| am sure they would work with me on getting around that limit.
|
| Where does this fit into a product? Maybe I am blind, but while
| this is cool, I don't really see where I would want this.
| plesiv wrote:
| OP here. We're building a new kind of Git platform. "Infinity"
| is more beneficial for us as platform builders (simplifying
| infrastructure) but less relevant to our customers as users.
| aftbit wrote:
| Github would definitely reach out if you tried to make 100k+
| Github repos. We once automatically opened issues in response
| to exceptions (sort of a ghetto Bugsnag / Sentry) and received
| a nice email from an engineer asking us if we really needed to
| do that when we hit around the 200k mark.
| foota wrote:
| In some ways, you could imagine repos might be more scalable
| than issues within a repo, since you could reasonably assume
| a bound on the number of issues in a single repo.
| no_wizard wrote:
| Oh here's an interesting idea.
|
| What if these bug reporting platforms could create a branch
| and tag it for each issue?
|
| This would be particularly useful for point-in-time things
| where you have an immutable deployment branch. So it could
| create a branch off that immutable deployment branch and tag
| it, so you always have a point-in-time code reference for
| bugs.
|
| Would that be useful? I feel like what you're doing here
| isn't that different if I get what's going on (basically
| creating one repository per bug?)
| justincormack wrote:
| GitHub weren't terribly happy with the number of branches we
| created for this type of use case at one point.
| 0zymandiass wrote:
| A branch doesn't use any more space than a commit... I'm
| curious what their complaint was with a large number of
| branches?
|
| There are various repositories with 500k+ commits
| dizhn wrote:
| It might be something silly like the number of items in
| the Branches dropdown menu.
| aphantastic wrote:
| Why not just keep the sha of the release in the bug report?
| shivasaxena wrote:
| Imagine every Notion doc or every Airtable base being a git
| repo. Imagine the PR workflow that we developers love being
| available to everyone.
| VoidWhisperer wrote:
| Not the main purpose of the article but they mention they were
| working on a notetaking app oriented towards developers - did
| anything ever come of that? If not, does anyone know products
| that might fit this niche? (I currently use obsidian)
| plesiv wrote:
| OP here. Not yet - it's about 50% complete. I plan to open-
| source it in the future.
| nbbaier wrote:
| Definitely interested in seeing this as well. What are the
| key features?
| tredre3 wrote:
| > Wanting to avoid managing the servers ourselves, we
| experimented with a serverless approach.
|
| I must be getting old, but building a gigantic house of cards
| of interlinked components only to arrive at a more limited
| solution is truly bizarre to me.
|
| The maintenance burden for a VPS: periodically run apt update
| upgrade. Use filesystem snapshots to create periodic backups. If
| something happens to your provider, spin up a new VM elsewhere
| with your last snapshot.
|
| The maintenance burden for your solution: Periodically merge
| upstream libgit2 in your custom fork, maintain your custom git
| server code and audit it for vulnerabilities, make sure
| everything still compiles with emscripten, deploy it. Rotate API
| keys to make sure your database service can talk to your storage
| service and your worker service. Then I don't even know how
| you'd back all this up to get it back online quickly if
| something happened to Cloudflare. And all that only to end up
| with worse
| latency than a VPS, and more size constraints on the repo and
| objects.
|
| But hey, at least it scales infinitely!
| notamy wrote:
| > The maintenance burden for a VPS: periodically run apt update
| upgrade. Use filesystem snapshots to create periodic backups.
| If something happens to your provider, spin up a new VM
| elsewhere with your last snapshot.
|
| And make sure it reboots for kernel upgrades (or set up live-
| patching), and make sure that service updates don't go
| wrong[0], and make sure that your backups work consistently,
| and make sure that you're able to vertically or horizontally
| scale, and make sure it's all automated and repeatable, and
| make sure the automation is following best-practices, and make
| sure you're not accidentally configuring any services to be
| vulnerable[1], and ...
|
| Making this stuff someone else's problem by using managed
| services is a lot easier, especially with a smaller team,
| because then you can focus on what you're building instead of
| making sure your SPOF VPS is still running correctly.
|
| [0] I self-host some stuff for a side project right now, and
| package updates are miserable because they're not simply `apt-
| get update && apt-get upgrade`. Instead, the documented upgrade
| process for some services is more or less "dump the entire DB,
| stop the service, rm -rf the old DB, upgrade the service
| package, start the service, load the dump in, hope it works."
|
| [1] Because it's _so easy_ to configure something to be
| vulnerable because it makes it easier, even if the
| vulnerability was unintentional.
| kentonv wrote:
| > Periodically merge upstream libgit2 in your custom fork,
| maintain your custom git server code and audit it for
| vulnerabilities, make sure everything still compiles with
| emscripten, deploy it.
|
| There's only a difference here because there exist off-the-
| shelf git packages for traditional VPS environments but there
| do not yet exist off-the-shelf git packages for serverless
| stacks. The OP is a pioneer here. The work they are doing is
| what will eventually make this an off-the-shelf thing for
| everyone else.
|
| > Rotate API keys to make sure your database service can talk
| to your storage service and your worker service.
|
| Huh? With Durable Objects the storage is local to each object.
| There is no API key involved in accessing it.
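|
| For example, a minimal Durable Object sketch (hypothetical
| RepoShard class, assuming the current class-based Workers
| runtime API): each object reads and writes its own storage
| directly, with no connection string or API key anywhere.
|
|     import { DurableObject } from "cloudflare:workers";
|
|     export class RepoShard extends DurableObject {
|       async fetch(request: Request): Promise<Response> {
|         // storage is transactional and local to this object
|         const hits = ((await this.ctx.storage.get<number>("hits")) ?? 0) + 1;
|         await this.ctx.storage.put("hits", hits);
|         return new Response(`hits: ${hits}`);
|       }
|     }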
|
| > Then I don't even know how you'd backup all this
|
| Durable Object storage (under the new beta storage engine)
| automatically gives you point-in-time recovery to any point in
| time in the last 30 days.
|
| https://developers.cloudflare.com/durable-objects/api/storag...
|
| > And all that only to end up with worse latency than a VPS
|
| Why would it be worse? It should be better, because Cloudflare
| can locate each DO (git repo) close to whoever is accessing it,
| whereas your VPS is going to sit in one single central location
| that's probably further away.
|
| > and more size constraints on the repo and objects.
|
| While each individual repo may be more constrained, this
| solution can scale to far more total repos than a single-server
| VPS could.
|
| (I'm the tech lead for Cloudflare Workers.)
| ericyd wrote:
| Engaging read! For me, just the right balance of technical detail
| and narrative content. It's a hard balance to strike and I'm sure
| preferences vary widely which makes it an impossible target for
| every audience.
| betaby wrote:
| Somewhat related question. Assume I have ~1k ~200MB XML files
| that get ~20% of their content changed. What are my best
| options for storing them? While using vanilla git on an SSD
| raid10 works, that's quite slow at retrieving historical data
| dating back ~3-6 months. Are there other options for a quick
| back-end? I'm fine with it being not that storage-efficient, to
| a degree.
| nomel wrote:
| If you can share, I'd be curious to know what that large of an
| XML file might be used for, and what benefits it might have
| over other formats. My personal and professional use of XML has
| been pretty limited, but XSD was super powerful, and the reason
| we chose it when we did.
| hobs wrote:
| It's a good question, because my answer for a system like this
| with very little schema change was to just dump it into a
| database and add historical tracking per object that way:
| hash, compare, insert, and add a historical record.
| betaby wrote:
| I do have the current state in the DB. However, I sometimes
| need to compare today's file with the one from 6 months ago.
| hobs wrote:
| So I assumed something like: you have the same schema with the
| same tabular format inside of the XML document, and those state
| changes are stored in a way that lets you tell the timestamp -
| then you can bring up both states at the same time and compare
| across the attributes for wrongness.
|
| EXCEPT/INTERSECT make this easy for a bunch of columns
| (excluding the times of course, which I usually hash for
| performance reasons) but won't tell you what the actual
| difference is; you have to do column-by-column comparisons
| here, which is where I usually shell out to my language of
| choice because SQL sucks at doing that.
| betaby wrote:
| Juniper routers configs, something like below.
|
|     adamc@router> show arp | display xml
|     <rpc-reply xmlns:JUNOS="http://xml.juniper.net/JUNOS/15.1F6/JUNOS">
|       <arp-table-information
|           xmlns="http://xml.juniper.net/JUNOS/15.1F6/JUNOS-arp"
|           JUNOS:style="normal">
|         <arp-table-entry>
|           <mac-address>0a:00:27:00:00:00</mac-address>
|           <ip-address>10.0.201.1</ip-address>
|           <hostname>adamc-mac</hostname>
|           <interface-name>em0.0</interface-name>
|           <arp-table-entry-flags>
|             <none/>
|           </arp-table-entry-flags>
|         </arp-table-entry>
|       </arp-table-information>
|       <cli>
|         <banner></banner>
|       </cli>
|     </rpc-reply>
| tln wrote:
| > get ~20% of their content changed
|
| ...daily? monthly? how many versions do you have to keep
| around?
|
| I'd look at a simple zstd dictionary based scheme, first. Put
| your history/metadata into a database. Put the XML data into
| file system/S3/BackBlaze/B2, zstd compressed against a
| dictionary.
|
|     Create the dictionary:
|       zstd --train PathToTrainingSet/* -o dictionaryName
|     Compress with the dictionary:
|       zstd FILE -D dictionaryName
|     Decompress with the dictionary:
|       zstd --decompress FILE.zst -D dictionaryName
|
| Although you say you're fine with it being not that storage
| efficient to a degree, I think if you were OK with storing
| every version of every XML file, uncompressed, you wouldn't
| have to ask right?
| betaby wrote:
| If one stores whole versions of the files, that defeats the
| idea of git and would consume too much space. I suppose I don't
| even need zstd if I have ZFS with compression, although
| compression levels won't be as good.
| tln wrote:
| You're relying on compression either way... my hunch is
| that controlling the compression yourself may get you a
| better result.
|
| Git does not store diffs, it stores every version. These get
| compressed into packfiles:
| https://git-scm.com/book/en/v2/Git-Internals-Packfiles
| It looks like it uses zlib.
| hokkos wrote:
| You can compress with EXI; it's a format for XML, and if it is
| informed by the schema it can give a big boost in compression.
| adobrawy wrote:
| I don't know what your "best" criterion is (implementation
| costs, implementation time, maintainability, performance,
| compression ratio, etc.). Still, the easiest way to start is to
| delegate it to the file system, so zfs + compression. Access
| time should be decent. No application-level changes are
| required to enable that.
| betaby wrote:
| It is already on ZFS with compression.
| o11c wrote:
| I'm not sure if this _quite_ fits your workload, but a lot of
| times people use `git` when `casync` would be more appropriate.
| tln wrote:
| Congrats, you've done a lot of interesting work to get here.
|
| This could be a fantastic building block for headless CMS and the
| like.
| plesiv wrote:
| OP here. Thank you and good catch! :-) We have a blog post
| planned on that topic.
| sluongng wrote:
| @plesiv could you please elaborate on how repack/gc is handled
| with a libgit2 backend? I know that Alibaba has done something
| similar in the past based on libgit2, but I have yet to see
| another implementation in the wild like this.
|
| Very cool project. I hope Cloudflare Workers can support more
| protocols like SSH and gRPC. It's one of the reasons why I
| prefer Fly.io over Cloudflare Workers for special servers like
| this.
| plesiv wrote:
| Great question! By default, with libgit2 each write to a repo
| (e.g. push) will create a new pack file. We have written a
| simple packing algorithm that runs after each write. It works
| like this:
|
| Choose these values:
|
| * P, pack "Planck" size, e.g. 100kB
|
| * N, branching factor, e.g. 8
|
| After each write:
|
| 1. iterate over each pack (with pack size S) and assign each
| pack a class C, which is the smallest integer that satisfies
| P * N^C > S
|
| 2. iterate a variable c from 0 to the maximum value of C that
| you got in step 1:
|
| * if there are N packs of class c, repack them into a new pack;
| the new pack will be of class at most c+1
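|
| For illustration, here's a rough TypeScript sketch of that
| consolidation loop (the Pack shape and the repack() stub are
| placeholders, not the real libgit2-side implementation):
|
|     interface Pack { name: string; size: number }
|
|     // smallest C such that P * N^C > size
|     function classOf(size: number, P = 100 * 1024, N = 8): number {
|       let c = 0;
|       for (let limit = P; limit <= size; limit *= N) c++;
|       return c;
|     }
|
|     // stand-in for the real repack: merge N packs into one
|     function repack(packs: Pack[]): Pack {
|       return { name: `pack-${Date.now()}`,
|                size: packs.reduce((s, p) => s + p.size, 0) };
|     }
|
|     // run after each write, once the new pack has been created
|     function consolidate(packs: Pack[], P = 100 * 1024, N = 8): Pack[] {
|       // 1. bucket every pack by its class
|       const byClass = new Map<number, Pack[]>();
|       for (const p of packs) {
|         const c = classOf(p.size, P, N);
|         if (!byClass.has(c)) byClass.set(c, []);
|         byClass.get(c)!.push(p);
|       }
|       // 2. walk classes upward; N packs of class c merge into
|       //    one new pack of class at most c + 1
|       const maxClass = Math.max(0, ...byClass.keys());
|       for (let c = 0; c <= maxClass; c++) {
|         const bucket = byClass.get(c) ?? [];
|         if (bucket.length >= N) {
|           const merged = repack(bucket.splice(0, N));
|           const mc = classOf(merged.size, P, N);
|           if (!byClass.has(mc)) byClass.set(mc, []);
|           byClass.get(mc)!.push(merged);
|         }
|       }
|       return [...byClass.values()].flat();
|     }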
| gkoberger wrote:
| This is really cool! I've been building something on libgit2 +
| EFS, and this approach is really interesting.
|
| Between libgit2 on emscripten, the number of file writes to DO,
| etc, how is performance?
| seanvelasco wrote:
| this leverages Durable Objects, but as i remember from two years
| ago, DO's way of guaranteeing uniqueness is that there can only
| be one instance of that DO in the world.
|
| what if there are two users who want to access the same DO repo
| at the same time, one in the US and the other in Singapore? the
| DO must live either in US servers or SG servers, but not both at
| the same time. so one of the two users must have high latency
| then?
|
| then after some time, a user in Australia accesses this DO repo -
| the DO bounces to AU servers - US and SG users will have high
| latency?
|
| but please correct me if i'm wrong
| skybrian wrote:
| Last I heard, durable objects don't move while running. It
| doesn't seem worse than hosting in US-East, though.
| skybrian wrote:
| Not having a technical limit is nice, because then it's a matter
| of spending money. But whenever I see "infinite," I ask what it
| will cost. How expensive is it to host git repos this way?
|
| As a hobbyist, "free" is pretty appealing. I'm pretty sure my
| repos on GitHub won't cost me anything, and that's unlikely to
| change anytime soon. Not sure about the new stuff.
| jsheard wrote:
| With CloudFlare at least when you overstay your welcome on the
| free plan they just start nagging you to start paying, and
| possibly kick you out if you don't, rather than sending you a
| surprise bill for $10,000 like AWS or Azure or GCP might do.
| koolba wrote:
| > We're building Gitlip - the collaborative devtool for the AI
| era. An all-in-one combination of Git-powered version control,
| collaborative coding and 1-click deployments.
|
| Did they get a waiver from the git team to name it as such?
|
| Per the trademark policy, new "git${SUFFIX}" names aren't
| allowed: https://git-scm.com/about/trademark
|
| >> In addition, you may not use any of the Marks as a syllable in
| a new word or as part of a portmanteau (e.g., "Gitalicious",
| "Gitpedia") used as a mark for a third-party product or service
| without Conservancy's written permission. For the avoidance of
| doubt, this provision applies even to third-party marks that use
| the Marks as a syllable or as part of a portmanteau to refer to a
| product or service's use of Git code.
| WorkerBee28474 wrote:
| You don't need their permission to make a portmanteau, all you
| need is to follow trademark law (which may or may not allow
| it). The policy page can go kick sand.
| saurik wrote:
| While true, using someone else's trademark as a _prefix_ of
| your name when you are actively _intending_ it to reference
| the protected use seems egregious.
| eli wrote:
| Do you think many users will mistakenly believe Gitlip is
| an official Git project put out by the same authors as Git?
|
| There can't be trademark infringement unless there is a
| likelihood of confusion.
| plesiv wrote:
| OP here. Oops, thank you for pointing that out! We weren't
| aware of it. We will investigate ASAP. In the worst case, we'll
| change our name.
| benatkin wrote:
| Doesn't sound like a worst case to me. It could use a better
| name anyway.
| Spunkie wrote:
| gitlip is not a good name, but you can be sure that if a
| new name does not include git it will be a worse name.
| rzzzt wrote:
| What about an old word? Agitator, legitimate, cogitate?
| fumplethumb wrote:
| What about... GitHub, GitLab, GitKraken, GitButler (featured on
| HN recently)? The list goes on forever!
| afiori wrote:
| Supposedly they got written permission
| bagels wrote:
| Infinite sounds like a bug happened. It's obviously not
| infinite; some resource will eventually be exhausted - in this
| case, memory.
| iampims wrote:
| Some serious engineering here. Kudos!
| nathants wrote:
| this is very cool!
|
| i prototyped a similar serverless git product recently using a
| different technique.
|
| i used aws lambda holding leases in dynamo backed by s3. i zipped
| git binaries into the lambda and invoked them directly. i used
| shallow clone style repos stored in chunks in s3, that could be
| merged as needed in lambda /tmp.
|
| lambda was nice because for cpu heavy ops like merging many
| shallow clones, i could do that in a larger cpu lambda, and cache
| the result.
|
| other constraints were similar to what is described here. mainly
| that an individual push/pull cannot exceed the api gateway max
| payload size, a few MB.
|
| i looked at isomorphic-git, but did not try emscripten libgit2.
| using cloudflare is nice because of free egress, which opens up
| many new use cases that don't make sense at $0.10/GB egress.
|
| i ended up shelving this while i build a different product. glad
| to see others pursuing the same thing, serverless git is an
| obvious win! do you back your repos with r2?
|
| for my own git usage, what i ended up building was a trustless
| git system backed by dynamo and s3 directly. this removes the
| push/pull size limit, and makes storage trustless. this uses git
| functionality i had no idea about prior, git bundle and
| unbundle[1]. they are used for transfer of git objects without a
| server, serverless git! this one i published[2].
|
| good luck with your raise and your project. looking forward to
| the next blog. awesome stuff.
|
| 1. https://git-scm.com/docs/git-bundle
|
| 2. https://github.com/nathants/git-remote-aws
| stavros wrote:
| This is a very impressive technical achievement, and it's clear
| that a lot of work went into it.
|
| Unfortunately, the entrepreneur in me continues that thought with
| "work that could have gone into finding customers instead". Now
| you have a system that could store "infinite" git repos, but how
| many customers?
| akerl_ wrote:
| I agree. That's a really unfortunate way to view somebody's
| project.
| stavros wrote:
| Is it a project, or is it a company?
| deadbunny wrote:
| TFA is literally marketing for them. HN is their target
| audience and a good way to capture that audience is to show
| them something technically interesting.
| Spunkie wrote:
| I've been wondering what to do to back up our GitHub repos
| other than keeping a local copy and/or dumping them on
| something like S3.
|
| I would love to use this to serve as a live/working automatic
| backup for my GitHub repos on CF infrastructure.
___________________________________________________________________
(page generated 2024-10-25 23:01 UTC)