[HN Gopher] Show HN: We scaled Git to support 1 TB repos
___________________________________________________________________
Show HN: We scaled Git to support 1 TB repos
I've been in the MLOps space for ~10 years, and data is still the
hardest unsolved open problem. Code is versioned using Git, data is
stored somewhere else, and context often lives in a 3rd location
like Slack or GDocs. This is why we built XetHub, a platform that
enables teams to treat data like code, using Git. Unlike Git LFS,
we don't just store the files. We use content-defined chunking and
Merkle Trees to dedupe against everything in history. This allows
small changes in large files to be stored compactly. Read more
here: https://xethub.com/assets/docs/how-xet-deduplication-works
Today, XetHub works for 1 TB repositories, and we plan to scale to
100 TB in the next year. Our implementation is in Rust (client &
cache + storage) and our web application is written in Go. XetHub
includes a GitHub-like web interface that provides automatic CSV
summaries and allows custom visualizations using Vega. Even at 1
TB, we know downloading an entire repository is painful, so we
built git-xet mount - which, in seconds, provides a user-mode
filesystem view over the repo. XetHub is available today (Linux &
Mac now, Windows coming soon) and we would love your feedback!
Read more here: - https://xetdata.com/blog/2022/10/15/why-xetdata
- https://xetdata.com/blog/2022/12/13/introducing-xethub
Author : reverius42
Score : 194 points
Date : 2022-12-13 15:14 UTC (7 hours ago)
(HTM) web link (xethub.com)
(TXT) w3m dump (xethub.com)
| timsehn wrote:
| Founder of DoltHub here. One of my team pointed me at this
| thread. Congrats on the launch. Great to see more folks tackling
| the data versioning problem.
|
| Dolt hasn't come up here yet, probably because we're focused on
| OLTP use cases, not MLOps, but we do have some customers using
| Dolt as the backing store for their training data.
|
| https://github.com/dolthub/dolt
|
| Dolt also scales to the 1TB range and offers you full SQL query
| capabilities on your data and diffs.
| ylow wrote:
| CEO/Cofounder here. Thanks! Agreed, we think data versioning is
| an important problem and we are at related, but opposite parts
| of the space. (BTW we really wanted gitfordata.com. Or perhaps
| we can split the domain? OLTP goes here, Unstructured data goes
| there :-) Shall we chat? )
| chubot wrote:
| Can it be used to store container images (Docker)? As far as I
| remember they are just compressed tar files. Does the compression
| defeat Xet's own chunking?
|
| Can you sync to another machine without Xethub ?
|
| How about cleaning up old files?
| ylow wrote:
| Yeah... The compression does defeat the chunking (your mileage
| may vary; we do see a small amount of dedupe in some experiments
| but never investigated it in detail). That said, we have
| experimental preprocessors / chunkers that are file-type
| specific, so we could potentially do something about tar.gz.
| Not something we have explored much yet.
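|
| A quick way to see the effect (illustrative Python sketch only,
| not our chunker or storage format):
|
|     # Compare how many 16 KB chunks two nearly identical inputs
|     # share, raw vs gzip-compressed. A one-record edit barely
|     # disturbs the raw chunks, but the compressed streams
|     # diverge almost entirely after the edit.
|     import gzip, hashlib
|
|     def chunk_hashes(data, size=16 * 1024):
|         # Fixed-size chunking is enough to show the effect; a
|         # real content-defined chunker tolerates insertions too.
|         return {hashlib.sha256(data[i:i + size]).digest()
|                 for i in range(0, len(data), size)}
|
|     base = b"".join(b"record %08d, some payload text\n" % i
|                     for i in range(200_000))
|     edited = base.replace(b"record 00000042,",
|                           b"record 000000XX,", 1)
|
|     raw = chunk_hashes(base), chunk_hashes(edited)
|     gz = (chunk_hashes(gzip.compress(base, mtime=0)),
|           chunk_hashes(gzip.compress(edited, mtime=0)))
|     print("raw shared:", len(raw[0] & raw[1]), "of", len(raw[0]))
|     print("gz shared:", len(gz[0] & gz[1]), "of", len(gz[0]))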
| the_arun wrote:
| Signed up & browsed the "Flickr30k" repo (auto generated) & it
| was really slow for me. Like CSV, does it also support other
| data formats like JSON, YAML, etc.?
| ylow wrote:
| We are file format agnostic and you should be able to put
| anything in the repo. We have special support for CSV files for
| visualizations. Sorry for the UI perf... there are a lot of
| optimizations we need to work on.
| amadvance wrote:
| How is data split into chunks? Just curious.
| [deleted]
| sesm wrote:
| They mention 'content-defined chunking', but as far as I
| understand it, it requires different chunking algorithms for
| different content types. Does it support plugins for chunking
| different file formats?
| ylow wrote:
| Today we just have a variation of FastCDC in production, but
| we have alternate experimental chunkers for some file formats
| (ex: a heuristic chunker for CSV files that will enable
| almost free subsampling). Hope to have them enter production
| in the next 6 months.
| sesm wrote:
| That's interesting. Can a CSV chunker make adding a column
| not affect all of the chunks?
| ylow wrote:
| The simplest really is to chunk row-wise so adding
| columns will unfortunately rewrite all the chunks. If you
| have a parquet file, adding columns will be cheap.
| ylow wrote:
| CEO/Cofounder here! Content defined chunking. Specifically a
| variation of FastCDC. We have a paper coming out soon with a
| lot more technical details.
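|
| If it helps, the general shape of content-defined chunking looks
| roughly like this (a toy Python sketch with a toy rolling hash,
| not our production FastCDC variant):
|
|     import hashlib, os
|
|     def cdc_chunks(data, mask=(1 << 13) - 1,
|                    min_size=2048, max_size=65536):
|         chunks, start, h = [], 0, 0
|         for i, byte in enumerate(data):
|             h = ((h << 1) + byte) & 0xFFFFFFFF  # toy rolling hash
|             size = i + 1 - start
|             boundary = size >= min_size and (h & mask) == 0
|             if boundary or size >= max_size:
|                 chunks.append(
|                     hashlib.sha256(data[start:i + 1]).digest())
|                 start, h = i + 1, 0
|         if start < len(data):
|             chunks.append(hashlib.sha256(data[start:]).digest())
|         return chunks
|
|     a = os.urandom(1 << 20)             # ~1 MB of sample data
|     b = a[:5000] + b"a few inserted bytes" + a[5000:]
|     ca, cb = cdc_chunks(a), cdc_chunks(b)
|     print(len(set(ca) & set(cb)), "of", len(ca), "chunks reused")
|
| Because the boundaries come from the bytes themselves, an
| insertion only disturbs the chunks around it; the rest realign
| and dedupe against the original.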
| subnetwork wrote:
| This feels like something that is prime for abuse. I agree with
| @bastardoperator: treating git as file storage is going to go
| nowhere good.
| ledauphin wrote:
| The link takes me to a login page. It would be nice to see that
| fixed to somehow match the title.
| reverius42 wrote:
| Visit https://xetdata.com for more info! (Sorry, can't edit the
| post link now.)
| jrockway wrote:
| There are a couple of other contenders in this space. DVC
| (https://dvc.org/) seems most similar.
|
| If you're interested in something you can self-host... I work on
| Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't
| have a Git-like interface, but also implements data versioning.
| Our approach de-duplicates between files (even very small files),
| and our storage algorithm doesn't create objects proportional to
| O(n) directory nesting depth as Xet appears to. (Xet is very much
| like Git in that respect.)
|
| The data versioning system enables us to run pipelines based on
| changes to your data; the pipelines declare what files they read,
| and that allows us to schedule processing jobs that only
| reprocess new or changed data, while still giving you a full view
| of what "would" have happened if all the data had been
| reprocessed. This, to me, is the key advantage of data
| versioning; you can save hundreds of thousands of dollars on
| compute. Being able to undo an oopsie is just icing on the cake.
|
| Xet's system for mounting a remote repo as a filesystem is a good
| idea. We do that too :)
| ylow wrote:
| By the way, our mount mechanism has one very interesting
| novelty. It does not depend on a FUSE driver on Mac :-)
| jrockway wrote:
| That's smart! I think users have to install a kext still?
| ylow wrote:
| Nope. No kernel driver needed :-) We wrote a localhost NFS
| server.
| catiopatio wrote:
| Based on unfsd or entirely in-house?
| ylow wrote:
| Entirely in house. In Rust!
| catiopatio wrote:
| Fancy! That's awesome.
| chubot wrote:
| Is DVC useful/efficient at storing container images (Docker)?
| As far as I remember they are just compressed tar files. Does
| the compression defeat its chunking / differential compression?
|
| How about cleaning up old versions?
| ylow wrote:
| We have found pointer files to be _surprisingly_ efficient as
| long as you don't have to actually materialize those files.
| (Git's internals are actually very well done.) Our mount
| mechanism avoids materializing pointer files, which makes it
| pretty fast even for repos with a very large number of files.
| unqueued wrote:
| For bigger annex repos with lots of pointer files, I just
| disable the git-annex smudge filters. Consider whether smudge
| filters are a requirement or a convenience. The smudge filter
| interface does not scale that well at all.
| Izmaki wrote:
| If I had to "version control" a 1 TB large repo - and assuming I
| wouldn't quit in anger - I would use a tool which is built for
| this kind of need and has been used in the industry for decades:
| Perforce.
| ryneandal wrote:
| This was my thought as well. Perforce has its own issues, but
| is an industry standard in game dev for a reason: it can handle
| immense amounts of data.
| Phelinofist wrote:
| What does immense mean in the context of game dev?
| llanowarelves wrote:
| On "real" (AA/AAA) games? Easily hundreds of gigabytes or
| several terabytes of raw assets + project files.
|
| Sometimes even individual art project files can be many
| gigabytes each. I saw a .psd that was 30gb because of the
| embedded hi-res reference images.
|
| You can throw pretty much anything in there, in one place,
| and get things like locking, partial-checkout, etc., which
| gets artists to use it.
| hinkley wrote:
| Perforce also has support for proxies right? It's not
| just the TB of data, it's all of your coworkers in a
| branch office having to pull all the updates first thing
| in the morning. If each person has to pull from origin,
| that's a lot of bandwidth, and wasted mornings. If the
| first person in pays and everyone else gets it off the
| LAN, then you have a better situation.
| mentos wrote:
| I work in gamedev and think perforce is good but far from
| great. Would love to see someone bring some competition to the
| space - maybe XetHub can.
| tinco wrote:
| So, you wouldn't consider using a new tool that someone
| developed to solve the same problem despite an older solution
| already existing? Your advice to that someone is to just use
| the old solution?
| TylerE wrote:
| When the new solution involves voluntary use of git? Not just
| yea, but hell yes. I hate git.
| xur17 wrote:
| Why do you hate git? I've been pretty happy with it for
| code, and wouldn't mind being able to use it for data
| repositories as well.
| TylerE wrote:
| Is it really worth re-hashing at this point? Reams have
| been written about the UX
| xur17 wrote:
| It's used by the vast majority of software engineers, so
| apparently it's "good enough".
| hinkley wrote:
| Don't ascribe positive feelings to popularity. I'm only
| using git until the moment there's a viable alternative
| written by someone who knows what DX is.
| JZL003 wrote:
| I also have a lot of issues with versioning data. But look at git
| annex - it's free, self-hosted, and has a very simple underlying
| data structure [1]. I don't even use the magic commands it has
| for remote data mounting/multi-device coordination; I just back
| up using basic S3 commands and can use rclone mounting. Very
| robust, open source, and useful.
|
| [1] When you run `git annex add` it hashes the file and moves the
| original file to a `.git/annex/data` folder under the
| hash/content addressable file system, like git. Then it replaces
| the original file with a symlink to this hashed file path. The
| file is marked as read only, so any command in any language which
| tries to write to it will error (you can always `git annex
| unlock` so you can write to it). If you have duplicated files,
| they easily point to the same hashed location. As long as you git
| push normally and back up the `.git/annex/data` you're totally
| version controlled, and you can share the subset of files as
| needed
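|
| In code, the add step is roughly this (a simplified Python
| sketch; git-annex's real key format and directory layout are
| more involved):
|
|     import hashlib, os, stat, sys
|
|     def annex_add(path, annex_dir=".git/annex/objects"):
|         with open(path, "rb") as f:
|             digest = hashlib.sha256(f.read()).hexdigest()
|         key = "SHA256-" + digest
|         os.makedirs(annex_dir, exist_ok=True)
|         dest = os.path.join(annex_dir, key)
|         if not os.path.exists(dest):   # duplicates dedupe here
|             os.rename(path, dest)
|             os.chmod(dest, stat.S_IREAD)   # read-only content
|         else:
|             os.remove(path)
|         rel = os.path.relpath(dest, os.path.dirname(path) or ".")
|         os.symlink(rel, path)          # symlink replaces the file
|
|     if __name__ == "__main__":
|         annex_add(sys.argv[1])
|
| Git then only ever tracks the tiny symlink, so the repo itself
| stays small no matter how big the annexed files are.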
| kspacewalk2 wrote:
| Sounds like `git annex` is file-level deduplication, whereas
| this tool is block-level, but with some intelligent, context-
| specific way of defining how to split up the data (i.e.
| Content-Defined Chunking). For data management/versioning,
| that's usually a big difference.
| cma wrote:
| If git annex stores large files uncompressed you could use
| filesystem block-level deduplication in combination with it.
| synergy20 wrote:
| Can you be more specific here? Very interested.
| dark-star wrote:
| There are filesystems that support inline or post-process
| deduplication. btrfs[1] and zfs[2] come to mind as free
| ones, but there are also commercial ones like WAFL etc.
|
| It's always a tradeoff. Deduplication is a CPU-heavy
| process, and if it's done inline, it is also memory-
| heavy, so you're basically trading CPU and memory for
| storage space. It heavily depends on the use-case (and
| the particular FS / deduplication implementation) whether
| it's worth it or not
|
| [1]:
| https://btrfs.wiki.kernel.org/index.php/Deduplication
|
| [2]: https://docs.oracle.com/cd/E36784_01/html/E39134/fsd
| edup-1.h...
| cma wrote:
| One problem is if you need to support Windows clients.
| Microsoft charges $1600 for deduplication support or
| something like that: https://learn.microsoft.com/en-
| us/windows-server/storage/dat...
| mattpallissard wrote:
| Yeah, which is great for storage but doesn't help over
| the wire.
| xmodem wrote:
| ZFS at least supports sending a deduplicated stream.
| mattpallissard wrote:
| Right, and btrfs can send a compressed stream as well,
| but we aren't sending raw filesystem data via VCS.
| alchemist1e9 wrote:
| zbackup is a great block level deduplication trick.
| rsync wrote:
| "Sounds like `git annex` is file-level deduplication, whereas
| this tool is block-level ..."
|
| I am not a user of git annex but I do know that it works
| perfectly with an rsync.net account as a target:
|
| https://git-
| annex.branchable.com/forum/making_good_use_of_my...
|
| ... _which means_ that you could do a _dumb mirror_ of your
| repo(s) - perhaps just using rsync - and then let the ZFS
| snapshots handle the versioning/rotation which would give
| you the benefits of _block level diffs_.
|
| One additional benefit, beyond more efficient block level
| diffs, is that the ZFS snapshots are immutable/readonly as
| opposed to your 'git' or 'git annex' produced versions which
| could be destroyed by Mallory ...
| darau1 wrote:
| > let the ZFS snapshots handle the versioning/rotation
| which would give you the benefits of block level diffs
|
| Can you explain this a bit? I don't know anything about
| ZFS, but it sounds as though it creates snapshots based on
| block level differences? Maybe a git-annex backend could be
| written to take advantage of that -- I don't know.
| unqueued wrote:
| No, that is not correct, git-annex uses a variety of special
| remotes[2], some of which support deduplication. Mentioned in
| another comment[1]
|
| When you have checked something out and fetched it, then it
| consumes space on disk, but that is true with git-lfs, and
| most other tools like it. It does NOT consume any space in
| any git object files.
|
| I regularly use a git-annex repo that contains about 60G of
| files, which I can use with github or any git host, and uses
| about 6G in its annex, and 1M in the actual git repo itself.
| I chain git-annex to an internal .bup repo, so I can keep
| track of the location, and benefit from dedup.
|
| I honestly have not found anything that comes close to the
| versatility of git-annex.
|
| [1]: https://news.ycombinator.com/item?id=33976418
|
| [2]: https://git-annex.branchable.com/special_remotes/
| rajatarya wrote:
| XetHub Co-founder here. Yes, one illustrative example of the
| difference is:
|
| Imagine you have a 500MB file (lastmonth.csv) where every day
| 1MB is changed.
|
| With file-based deduplication every day 500MB will be
| uploaded, and all clones of the repo will need to download
| 500MB.
|
| With block-based deduplication, only around the 1MB that
| changed is uploaded and downloaded.
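|
| Back-of-the-envelope, over a month of those daily edits
| (illustrative arithmetic only, using the numbers above):
|
|     days, file_mb, changed_mb = 30, 500, 1
|     file_level = days * file_mb          # whole file, every day
|     block_level = file_mb + (days - 1) * changed_mb
|     print(f"file-level dedupe:  ~{file_level} MB uploaded")
|     print(f"block-level dedupe: ~{block_level} MB uploaded")
|
| That's roughly 15,000 MB versus 529 MB uploaded for the same
| month of history.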
| unqueued wrote:
| I combine git-annex with the bup special remote[1], which
| lets me still externalize big files, while benefiting from
| block level deduplication. Or depending on your needs, you
| can just use a tool like bup[2] or borg directly. Bup
| actually uses the git pack file format and git metadata.
|
| I actually wrote a script which I'm happy to share, that
| makes this much easier, and even lets you mount your bup
| repo over .git/annex/objects for direct access.
|
| [1]: https://git-
| annex.branchable.com/walkthrough/using_bup/
|
| [2]: https://github.com/bup/bup
| civilized wrote:
| Does that work equally well whether the changes are
| primarily row-based or primarily column-based?
| rajatarya wrote:
| Yes, see this for more details of how XetHub deduplication
| works: https://xethub.com/assets/docs/xet-
| specifics/how-xet-dedupli...
| prirun wrote:
| HashBackup author here. Your question is (I think) about
| how well block-based dedup functions on a database -
| whether rows are changed or columns are changed. This
| answer is how most block-based dedup software, including
| HashBackup, works.
|
| Block-based dedup can be done either with fixed block
| sizes or variable block sizes. For a database with fixed
| page sizes, a fixed block size matching the page size is
| most efficient. For a database with variable page sizes,
| a variable block size will work better, assuming the dedup
| "chunking" algorithm is fine-grained enough to
| detect the database page size. For example, if the db
| used a 4-6K variable page size and the dedup algo used a
| 1M variable block size, it could not save a single
| modified db page but would save more like 20 db pages
| surrounding the modified page.
|
| Your column vs row question depends on how the db stores
| data, whether key fields are changed, etc. The main dedup
| efficiency criteria are whether the changes are
| physically clustered together in the file or whether they
| are dispersed throughout the file, and how fine-grained
| the dedup block detection algorithm is.
| AustinDev wrote:
| Have you tested this out with Unreal Engine blueprint
| files? If you all can do block-based diffing on those and
| other binary assets used in game development, it'd be huge
| for the industry.
|
| I have a couple ~1TB repositories I've had the misfortune
| of working with using perforce in the past.
| rajatarya wrote:
| Not yet. Would be happy to try - can you point me to a
| project to use?
|
| Do you have a repo you could try us out with?
|
| We have tried a couple Unity projects (41% smaller due to
| deduplication) but not much from Unreal projects yet.
| AustinDev wrote:
| Most of my examples of that size are AAA game source that
| I can't share. However, I think this is an Unreal-based
| project using similar files, so it should show if there is
| any benefit: https://github.com/CesiumGS/cesium-
| unreal-samples (where the .umap binaries have been
| updated), and in this example the .uasset blueprints
| have been updated:
| https://github.com/renhaiyizhigou/Unreal-Blueprint-
| Project
| timbotron wrote:
| If you like git annex, check out datalad
| (http://handbook.datalad.org/en/latest/); it provides
| some useful wrappers around git annex oriented towards
| scientific computing.
| blobbers wrote:
| ... why do you have 1TB of source code? (You don't! Mandatory
| hacker snark.) Is git really supposed to be used for data? Or is
| this just a git-like interface for source control on data?
| IshKebab wrote:
| Git is only not "supposed" to be used for data because it
| doesn't work very well with data by default. Not because that's
| not a useful and sensible thing to want from a VCS.
| TillE wrote:
| It's a fundamentally bad idea because of how any DVCS works.
| You really don't want to be dragging around gigabytes of
| obsolete data forever.
|
| Something like git-lfs is the appropriate solution. You need
| a little bit of centralization.
| IshKebab wrote:
| Because of how _Git's current implementation of DVCS_
| works. There's nothing fundamental about it. Git already
| supports partial clones and on-demand checkouts in some
| ways, it's just not very ergonomic.
|
| All that's really needed is a way to mark individual files
| as lazily fetched from a remote only when needed. LFS is a
| hacky substandard way to emulate that behaviour. It should
| be built into Git.
| stevelacy wrote:
| Game development, especially Unreal engine, can produce repos
| in excess of 1TB. Git LFS is used extensively for binary file
| support.
| Eleison23 wrote:
| Aperocky wrote:
| I see a lot of reasons to version code.
|
| I see far fewer reasons to version data; in fact, I find reasons
| _against_ versioning data and storing it in diffs.
| treeman79 wrote:
| Anything that might be audited. Being able to look at how things
| were, how they changed, how they got to where they currently
| are, and who did what is amazing for many applications. Finance,
| healthcare, elections, etc.
|
| Well unless fraud is the goal.
| bfm wrote:
| Shameless plug for https://snapdir.org which focuses on this
| particular use case using regular git and auditable plain
| text manifests
| zachmu wrote:
| You're suffering from a failure of imagination, maybe because
| you've never been able to version data usefully before. There
| are already lots of interesting applications, and it's still
| quite new.
|
| https://www.dolthub.com/blog/2022-07-11-dolt-case-studies/
| WorldMaker wrote:
| Something that I've experienced from many years in enterprise
| software: 90% of enterprise software is about versioning data
| in some way. SharePoint is half as complicated as it is because
| it has be a massive document and data version manager. (Same
| with Confluence and other competitors.) "Everyone" needs deep
| audit logs for some likely overlap of SOX compliance, PCI
| compliance, HIPAA compliance, and/or other industry specific
| standards and practices. Most business analysts want accurate
| "point in time" reporting tools to revisit data as it looked at
| almost any point in the past, and if you don't build it for
| them they often build it as ad hoc file cabinets full of Excel
| export dumps for themselves.
|
| The wheels of data versioning just get reinvented over and over
| and over again, with all sorts of slightly different tools.
| Most of the job of "boring CRUD app development" is data
| version management and some of the "joy" is how every database
| you ever encounter is often its own little snowflake with
| respect to how it versions its data.
|
| There have been times I've pined for being able to just store
| it all in git and reduce things to a single paradigm. That
| said, I'd never actually want to teach business analysts or
| accountants how to _use_ git (and would probably spend nearly
| as much time building custom CRUD apps against git as against
| any other sort of database). There are times though where I
| have thought for backend work "if I could just checkout the
| database at the right git tag instead of needing to write this
| five table join SQL statement with these eighteen differently
| named timestamp fields that need to be sorted in four different
| ways...".
|
| Reasons to version data are plenty and most of the data
| versioning in the world is ad hoc and/or operationally
| incompatible/inconsistent across systems. (Ever had to ETL
| SharePoint lists and its CVC-based versioning with a timestamp
| based data table? Such "fun".) I don't think git is necessarily
| the savior here, though there remains some appeal in "I can use
| the same systems I use for code", killing two birds with one
| stone.
| Relatedly, content-addressed storage and/or merkle trees are a
| growing tool for Enterprise and do look a lot like a git
| repository and sometimes you also have the feeling like if you
| are already using git why build your own merkle tree store when
| git gives you a swiss army knife tool kit on top of that merkle
| tree store.
| ch71r22 wrote:
| What are the reasons against?
| ltbarcly3 wrote:
| The lack of reasons for doing it IS the reason against. GIT
| isn't a magic 'good way' to store arbitrary data, it's a good
| way to collaborate on projects implemented using most
| programming languages which store code as plain text broken
| into short lines, where edits to non-sequential lines can
| generally be applied concurrently without careful human
| verification. That is an extremely specific use case, and
| anything outside of that very specific use case leaves git
| terrible, inefficient, and gives almost no benefit despite
| huge problems.
|
| People in ML ops use git because they aren't very
| sophisticated with programming professionally and they have
| git available to them and they haven't run into the
| consequences of using it to store large binary blobs, namely
| that it becomes impossible to live with eventually and wastes
| a huge amount of time and space.
|
| ML didn't invent the need for large artifacts that can't be
| versioned in source control but must be versioned with it,
| but they don't know that because they are new to professional
| programming and aren't familiar with how it's done.
| ylow wrote:
| Indeed, there is a lot of pain if you actually try to store
| large binary data in git. But we managed to make that work!
| So a question worth asking is how might things change IF
| you can store large binary data in git??
| ltbarcly3 wrote:
| I think this is a foot-gun, it's a bad idea even if it
| works great, and I doubt it works very well. You should
| manage your build artifacts explicitly, not just jam them
| in git along with the code that generates them because
| you are already using it and you haven't thought it
| through.
| wpietri wrote:
| I don't think you've made your case here. The practices
| you describe are partly an artifact of computation,
| bandwidth, and storage costs. But not the current ones,
| the ones when git was invented more than 15 years ago. In
| the short term, we have to conform to the computer's
| needs. But in the long term, it has to be the other way
| around.
| ltbarcly3 wrote:
| You're right! It makes way more sense, in the long run,
| to abuse a tool like git in a way that it isn't designed
| for and which it can't actually support, and then, instead
| of actually using git, use a proprietary service that may
| or may not be around in a week. Here I was thinking short
| term.
| Game_Ender wrote:
| Xet's initial focus appears to be on data files used to
| drive machine learning pipelines, not on any resulting
| binaries.
| sk0g wrote:
| That is exactly what git-lfs is, a way to "version
| control" binary files, by storing revisions - possibly
| separately, while the actual repo contains text files +
| "pointer" files that references a binary file.
|
| It's not perfect, and still feels like a bit of a hack
| compared to something like p4 for the context I use LFS
| in (game dev), but it works, and doesn't require
| expensive custom licenses when teams grow beyond an
| arbitrary number like 3 or 5.
| rajatarya wrote:
| XetHub Co-founder here. Yes, we use the same Git
| extension mechanism as Git LFS (clean/smudge filters) and
| we store pointer files in the git repository. Unlike Git
| LFS we do block-level deduplication (Git LFS does file-
| level deduplication) and this can result in a significant
| savings in storage and bandwidth.
|
| As an example, a Unity game repo reduced in size by 41%
| using our block-level deduplication vs Git LFS. Raw repo
| was 48.9GB, Git LFS was 48.2GB, and with XetHub was
| 28.7GB.
|
| Why do you think using a Git-based solution is a hack
| compared to p4? What part of the p4 workflow feels more
| natural to you?
| mardifoufs wrote:
| I literally don't know anyone or any team in ML using git
| as a data versioning tool. It doesn't even make sense to
| me, and most mlops people I have talked to would agree. Is
| that really the point of this tool? To be a general purpose
| data store for mlops? I thought it was for very specialized
| ML use cases. Because even 1TB isn't much for ML data
| versioning
|
| Mlops people are very aware of tools that are more suited
| for the job... even too aware in fact. The entire field is
| full of tools, databases, etc. to the point where it's hard to
| make sense of it. So your comment is a bit weird to me.
| ltbarcly3 wrote:
| I think you'll find varying levels of maturity in ML ops.
| Anyway I think we basically agree, if you use something
| like this you aren't that mature, and if you are mature
| you would avoid this thing.
| oftenwrong wrote:
| One use-case would be for including dependencies in a repo. For
| example, it is common for companies to operate their own
| artifact caches/mirrors to protect their access to artifacts
| from npm, pypi, dockerhub, maven central, pkg.go.dev, etc. With
| the ability to efficiently work with a big repo, it would be
| possible to store the artifacts in git, saving the trouble of
| having to operate artifact mirrors. Additionally, it guarantees
| that the artifacts for a given, known-buildable revision are
| available offline.
| guardian5x wrote:
| As always it depends on the application. It can definitely be
| useful in some applications.
| substation13 wrote:
| Versioning data is great, but storing as diffs is inefficient
| when 99% of the file changes each version.
| reverius42 wrote:
| We don't store as diffs, we store as snapshots -- and it's
| efficient thanks to the way we do dedupe. See
| https://xethub.com/assets/docs/how-xet-deduplication-works/
| ylow wrote:
| Cofounder/CEO here! I think it's less about "versioning" and
| more about the ability to modify with confidence, knowing that
| you can go back in time anytime. (Minor clarification: we are
| not quite storing diffs; we hold snapshots just like Git, plus a
| bunch of data dedupe.)
| krageon wrote:
| > the ability to modify with confidence knowing that you can
| go back in time anytime
|
| This is versioning
| rafael09ed wrote:
| Versioning is a technique. Backups, copy+paste+rename also
| does it
| angrais wrote:
| How's this differ from using git LFS?
| ylow wrote:
| We are _significantly_ faster? :-) Also, block-level dedupe,
| scalability, perf, visualization, mounting, etc.
| polemic wrote:
| There seem to be a lot of data version control systems built
| around ML pipelines or software development needs, but not so
| much on the sort of data editing that happens outside of software
| development & analysis.
|
| Kart (https://kartproject.org) is built on git to provide data
| version control for geospatial vector & tabular data. Per-row
| (feature & attribute) version control and the ability to
| collaborate with a team of people is sorely missing from those
| workflows. It's focused on geographic use-cases, but you can work
| with 'plain old tables' too, with MySQL, PostgreSQL and MSSQL
| working copies (you don't have to pick - you can push and pull
| between them).
| dandigangi wrote:
| One monorepo to rule them all, and in the darkness pull them.
| - Gandalf, probably
| irrational wrote:
| And in the darkness merge conflicts.
| amelius wrote:
| Does this fix the problem that Git becomes unreasonably slow when
| you have large binary files in the repo?
|
| Also, why can't Git show me an accurate progress-bar while
| fetching?
| reverius42 wrote:
| Mostly! (At the moment, it doesn't fully fix the slowdown
| associated with storing large binary files, but reduces it by
| 90-99%. We're working on getting that closer to 100% by
| moving even the Merkle Tree storage outside the git repo
| contents.)
|
| As for why git can't show you an accurate progress bar while
| fetching (specifically when using an extension like git-lfs or
| git-xet), this has to do with the way git extensions work --
| each file gets "cleaned" by the extension through a Unix pipe,
| and the protocol for that is too simple to reflect progress
| information back to the user. In git-xet, we do write a
| percent-complete to stdout so you get some more info (but a
| real progress bar would be nice).
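|
| For the curious, the "clean" side of that mechanism does roughly
| the following (pointer format and store location invented for
| illustration; this is not git-xet's or Git LFS's actual format):
|
|     # git pipes the file's contents in on stdin and stores
|     # whatever this prints to stdout in place of the file.
|     import hashlib, os, sys
|
|     STORE = os.path.expanduser("~/.toy-store")
|
|     def clean():
|         data = sys.stdin.buffer.read()
|         oid = hashlib.sha256(data).hexdigest()
|         os.makedirs(STORE, exist_ok=True)
|         blob = os.path.join(STORE, oid)
|         if not os.path.exists(blob):   # identical files dedupe
|             with open(blob, "wb") as f:
|                 f.write(data)
|         print("toy-pointer v1")        # the tiny pointer file
|         print("oid", oid)
|         print("size", len(data))
|
|     if __name__ == "__main__":
|         clean()
|
| The matching "smudge" filter does the reverse on checkout, and
| since git only ever sees the tiny pointer going through the
| pipe, there's no obvious place in that protocol to report
| progress for the real upload/download.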
| Game_Ender wrote:
| The tl;dr is that "xet" is like GitLFS (it stores pointers in
| Git, with the data in a remote server and uses smudge filters to
| make this transparent) with some additional features:
|
| - Automatically includes all files >256KB in size
|
| - By default data is de-duplicated in 16KB chunks instead of whole
| files (with the ability to customize this per file type).
|
| - Has a "mount" command to allow read-only browse without
| downloading
|
| When launching on HN it would be better if the team was a bit
| more transparent with the internals. I get that "we made a better
| GitLFS" doesn't market as well. But you can couple that with a
| credible vision and story about how you are better and where
| you are headed next. Instead this is mostly closer to market
| speak of "trust our magic solution to solve your problem".
| nightpool wrote:
| These details seemed.... really clear to me from the post the
| OP made? Did you just not read it, or have they updated it
| since you commented?
|
| (excerpt from the OP post:
|
| > Unlike Git LFS, we don't just store the files. We use
| content-defined chunking and Merkle Trees to dedupe against
| everything in history. This allows small changes in large files
| to be stored compactly. Read more here:
| https://xethub.com/assets/docs/how-xet-deduplication-works)
| culanuchachamim wrote:
| Maybe a silly question:
|
| Why do you need 1 TB for repos? What do you store inside, besides
| code and some images?
| layer8 wrote:
| Some docker images? ;)
| lazide wrote:
| A whole lot of images?
|
| I personally would love to be able to store datasets next to
| code for regression testing, easier deployment, easier dev
| workstation spin up, etc.
| culanuchachamim wrote:
| Still, 1TB?
|
| Once you get to that amount of images it would be much easier
| to manage with some file storage solution.
|
| Or am I missing something important?
| lazide wrote:
| All of them require having some sort of parallel
| authentication, synchronization, permissions management,
| change tracking, etc.
|
| Which is a huge hassle, and a lot of work I'd rather not
| do.
|
| My current photogrammetry dataset is well over 1TB, and it
| isn't a lot for the industry by any stretch of the
| imagination.
|
| In fact, the only thing that considers it 'a lot' and is
| hard to work with is git.
| dafelst wrote:
| Repositories for games are often larger than 1TB, and with
| things like UE5's Nanite becoming more viable, they're only
| going to get bigger.
| Wojtkie wrote:
| Can I upload a full .pbix file to this and use it for versioning?
| If so, I'd use it in a heartbeat.
| ylow wrote:
| CEO/Cofounder here. We are file format agnostic and will
| happily take everything. Not too familiar with the needs around
| pbix, but please do try it out and let us know what you think!
| COMMENT___ wrote:
| What about SVN?
|
| Besides other features, Subversion supports representation
| sharing. So adding new textual or binary files with identical
| data won't increase the size of your repository.
|
| I'm not familiar with ML data sets, but it seems that SVN may
| work great with them. It already works great for huge and small
| game dev projects.
| iFire wrote:
| https://github.com/facebook/sapling is doing good work, and they
| have suggested that their git server for large repositories
| exists.
| wnzl wrote:
| Just in case you are wondering about alternatives: there is
| Unity's Plastic https://unity.com/products/plastic-scm which
| happens to use bidirectional sync with git. I'm curious how this
| solution compares to it! I'll definitely give it a try over the
| weekend!
| ziml77 wrote:
| I was already upset about Codice Software pulling Semantic
| Merge and only making it available as an integrated part of
| Plastic SCM. Now that I see the reason such a useful tool was
| taken away was to stuff the pockets of a large company, I'm
| fuming.
|
| I know that they're well within their rights to do this as they
| only ever offered subscription licensing for Semantic Merge,
| but that doesn't make it suck less to lose access.
| web007 wrote:
| Please consider https://sso.tax/ before making that an
| "enterprise" feature.
| IshKebab wrote:
| I mean yeah, that's working as intended surely? Some of those
| price differences are pretty egregious but in general companies
| have to actually make money, and charging more for features
| that are mainly needed by richer customers is a very obvious
| thing to do.
| mdaniel wrote:
| I believe the counter-argument is that they should charge for
| _features_ but that security should be available to anyone.
| Imagine if "passwords longer than 6 chars: now only $8/mo!"
|
| That goes double for products where paying for "enterprise"
| is _only_ to get SAML, which at least in my experience causes
| me to go shopping for an entirely different product because I
| view it as extortion
| IshKebab wrote:
| Security _is_ available for everyone. It's centralised
| security that can be easily managed by IT that isn't.
|
| I don't see an issue with charging more for SSO, though as I
| said, some of the prices are egregious.
| Alifatisk wrote:
| Very sad to see Bitwarden in this list
| unqueued wrote:
| I have a 1.96 TB git repo:
| https://github.com/unqueued/repo.macintoshgarden.org-fileset (It
| is a mirror of a Macintosh abandonware site), per `git
| annex info .`
|
| Of course, it uses pointer files for the binary blobs that are
| not going to change much anyway.
|
| And the datalad project has neuro imaging repos that are tens of
| TB in size.
|
| Consider whether you actually need to track differences in all of
| your files. Honestly git-annex is one of the most powerful tools
| I have ever used. You can use git for tracking changes in text,
| but use a different system for tracking binaries.
|
| I love how satisfying it is to be able to store the index for
| hundreds of gigs of files on a floppy disk if I wanted.
| bastardoperator wrote:
| I actually encountered a 4TB git repo. After pulling all the
| binary shit out of it the repo was actually 200MB. Anything that
| promotes treating git like a filesystem is a bad idea in my
| opinion.
| frognumber wrote:
| Yes... and no. The git userspace is horrible for this. The git
| data model is wonderful.
|
| The git userspace would need to be able to easily:
|
| 1. Not grab all files
|
| 2. Not grab the whole version history
|
| ... and that's more-or-less it. At that point, it'd do great
| with large files.
| ylow wrote:
| Exactly - for the giant repo use case, we have a mount feature
| that will let you get a filesystem mount of any repo at any
| commit very, very quickly.
| TacticalCoder wrote:
| What does a Merkle Tree bring here? (honest question) I mean: for
| content-based addressing of chunks (and hence deduplication of
| these chunks), a regular tree works too if I'm not mistaken (I
| may be wrong but I literally wrote a "deduper" splitting files
| into chunks and using content-based addressing to dedupe the
| chunks: but I just used a dumb tree).
|
| Is the Merkle tree used because it brings something other than
| deduplication, like chunk integrity verification or something
| like that?
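|
| To make the question concrete, here is roughly the structure I
| have in mind (simplified Python, surely not XetHub's actual
| layout): interior nodes hash the hashes of their children, so
| identical files or directories collapse to a single node, and
| one root hash verifies the whole tree.
|
|     import hashlib
|
|     def h(data: bytes) -> bytes:
|         return hashlib.sha256(data).digest()
|
|     def node(children):
|         # a subtree's identity is the hash of its child hashes
|         return h(b"node" + b"".join(children))
|
|     file_a = node([h(b"chunk-1"), h(b"chunk-2"), h(b"chunk-3")])
|     file_b = node([h(b"chunk-1"), h(b"chunk-2"), h(b"chunk-3")])
|     assert file_a == file_b       # identical files, one node
|
|     root_v1 = node([file_a, node([h(b"chunk-4")])])
|     root_v2 = node([file_a, node([h(b"chunk-9")])])
|     print(root_v1 != root_v2)     # any change shows at the root
|
| If subtree-level dedupe plus cheap verification is the main win,
| that would answer my question.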
| V1ndaar wrote:
| You say you support up to 1TB repositories, but from your pricing
| page all I see is the free tier for up to 20GB and one for teams.
| The latter doesn't have a price, only a contact option, and I
| assume it will likely be too expensive for an individual.
|
| As someone who'd love to put their data into a git like system,
| this sounds pretty interesting. Aside from not offering a tier
| for someone like me who would maybe have a couple of repositories
| of size O(250GB), it's unclear how e.g. bandwidth would work &
| whether other people could simply mount and clone the full repo
| for free if desired, etc.
| rajatarya wrote:
| XetHub Co-founder here. We are still trying to figure out
| pricing and would love to understand what sort of pricing tier
| would work for you.
|
| In general, we are thinking about usage-based pricing (which
| would include bandwidth and storage) - what are your thoughts
| for that?
|
| Also, where would you be mounting your repos from? We have
| local caching options that can greatly reduce the overall
| bandwidth needed to support data center workloads.
| V1ndaar wrote:
| Thanks for the reply!
|
| Generally usage based pricing sounds fair. In the end for
| cases like mine where it's "read rarely, but should be
| available publicly long term" it would need to compete with
| pricing offered by the big cloud providers.
|
| I'm about to leave my academic career and I'm thinking about
| how to make sure all my detector data will be available to
| other researchers in my field in the future. Aside from the
| obvious candidate https://zenodo.org it's an annoying problem
| as usually most universities I'm familiar with only archive
| data internally, which is hard to access for researchers from
| different institutions. As I don't want to rely on a single
| place to have that data available I'm looking for an
| additional alternative (that I'm willing to pay for out of my
| own pocket, it just shouldn't be a financial burden).
|
| In particular, while still taking data a couple of years ago, I
| would have loved being able to commit each day's data taking
| in the same way as I commit code. That way having things
| timestamped, backed up and all possible notes that came up
| that day associated straight in the commit message would have
| been very nice.
|
| Regarding mounting I don't have any specific needs there
| anymore. Just thinking about how other researchers would be
| able to clone the repo to access the data.
| blagie wrote:
| My preferences on pricing.
|
| First, it's all open-source, so I can take it and run it.
| Second, you provide a hosted service, and by virtue of being
| the author, you're the default SaaS host. You charge a
| premium over AWS fees for self-hosting, which works out to:
|
| 1. Enough to sustain you.
|
| 2. Less than the cost of doing dev-ops myself (AWS fees +
| engineer).
|
| 3. A small premium over potential cut-rate competitors.
|
| You offer value-added premium services too. Whether that's
| economically viable, I don't know.
___________________________________________________________________
(page generated 2022-12-13 23:01 UTC)