[HN Gopher] Elfshaker: Version control system fine-tuned for bin...
___________________________________________________________________
Elfshaker: Version control system fine-tuned for binaries
Author : jim90
Score : 466 points
Date : 2021-11-19 12:41 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mrich wrote:
| I'm guessing this doesn't yield such high compression for
| release builds, where code can be optimized across translation
| units? Likewise when a commit changes a header that is included
| in many .cpp files?
| peterwaller-arm wrote:
| Author here. The executables shipped in manyclangs are release
| builds! The catch is that manyclangs stores object files pre-
| link. Executables are materialized by relinking after they are
| extracted with elfshaker.
|
| The stored object files are compiled with -ffunction-sections
| and -fdata-sections, which ensures that insertions/deletions to
| the object file only have a local effect (they don't cause
| relative addresses to change across the whole binary).
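|
| For illustration, the compile step looks something like this
| (a sketch; the file name is just an example and the real
| build scripts may differ):
|
|   clang++ -c -O2 -ffunction-sections -fdata-sections \
|       -o lib/Support/APInt.cpp.o llvm/lib/Support/APInt.cpp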
|
| As you observe, anything which causes significant non-local
| changes in the data you store is going to have a negative
| effect when it comes to compression ratio. This is why we don't
| store the original executables directly.
| zeotroph wrote:
| Thank you for the explanation. So pre-link storage is one of
| the magical ingredients; maybe mention this in the README as
| well?
|
| Is this the reason why manyclangs (using LLVM's CMake-based
| build system) can be provided easily, but it would be more
| difficult for gcc? Or is the object -> binary dependency
| automatically deduced?
| peterwaller-arm wrote:
| > maybe mention this as well in the README?
|
| We've tweaked the readme, I hope it's clearer.
|
| It would be great to provide this for gcc too. The project
| is new and we've just started out. I know less about gcc's
| build system and how hard it will be to apply these
| techniques there. It seems as though it should be possible
| though and I'd love to see it happen.
|
| To infer the object->executable dependencies we currently
| read the compilation database and produce a stand-alone
| link.sh shell script, which gets packaged into each
| manyclangs snapshot.
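|
| As a rough sketch of the idea (illustrative, not the actual
| script; it assumes the database entries carry an "output"
| field, otherwise you'd parse the "command" field instead):
|
|   # Collect object paths from the compilation database and
|   # emit a standalone link line using a response file.
|   jq -r '.[].output' compile_commands.json > objects.txt
|   echo 'clang++ @objects.txt -o bin/clang "$@"' > link.sh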
| zeotroph wrote:
| Ah, the compilation database is where more magic
| originates from :)
| peterwaller-arm wrote:
| Yes, this is less great than I would like! :( :)
| mrich wrote:
| Thanks. I had a use case in mind where LTO is enabled.
| Unfortunately the LTO step is quite expensive so relinking
| does not seem like a viable option. If I find some time I'll
| give it a try though.
| peterwaller-arm wrote:
| ThinLTO can be pretty quick if you have enough cores, so it
| might work. I'm not sure how well the LTO objects compress
| against each other when you have small changes to them, but
| it might work reasonably.
|
| manyclangs is optimized to provide you with a binary
| quickly. The binary is not necessarily itself optimized to
| be fast, because it's expected that a developer might want
| to access any version of it for the purposes of testing
| whether some input manifests a bug or has a particular
| codegen output. In that scenario, it's likely that the
| developer is able to reduce the size of the input such that
| the speed of the compiler itself is not terribly
| significant in the overall runtime. Therefore, I don't see
| LTO for manyclangs as such a significant win. But it is
| still hoped that the overall end-to-end runtime is good,
| and the binaries are optimized, just not with LTO.
| nh2 wrote:
| I experimented with something similar with a Linux distribution's
| package binary cache.
|
| Using `bup` (deduplicating backup tool using git packfile format)
| I deduplicated 4 Chromium builds into the size of 1. It could
| probably pack thousands into the size of a few.
|
| Large download/storage requirements for updates are one of
| NixOS's few drawbacks, and I think deduplication could solve that
| pretty much completely.
|
| Details: https://github.com/NixOS/nixpkgs/issues/89380
| peterwaller-arm wrote:
| Author here. I've used bup, and elfshaker was partially
| inspired by it! It's great. However, during initial experiments
| on this project I found bup to be slow, taking quite a long
| time to snapshot and extract. I think this could in principle
| be fixed in bup one day.
| Siira wrote:
| Is elfshaker any good for backing up non-text data?
| ybkshaw wrote:
| Thank you for having such a good description of the project!
| Sometimes the links from HN lead to a page that takes a few
| minutes of puzzling to figure out what is going on -- but not
| yours.
| nh2 wrote:
| I have also used bup for a long time, but found that for very
| large server backups I'm hitting performance problems (both
| in time and memory usage).
|
| I'm currently evaluating `bupstash` (also written in Rust) as
| a replacement. It's faster and uses a lot less memory, but is
| younger and thus lacks some features.
|
| Here is somebody's benchmark of bupstash (unfortunately not
| including `bup`):
| https://acha.ninja/blog/encrypted_backup_shootout/
|
| The `bupstash` author is super responsive on Gitter/Matrix,
| it may make sense to join there to discuss
| approaches/findings together.
|
| I would really like to eventually have deduplication-as-a-
| library, to make it easier to put into programs like nix, or
| also other programs, e.g. for versioned "Save" functionality
| in software like Blender or Meshlab that work with huge files
| and for which diff-based incremental saving is more
| difficult/fragile to implement than deduplicating snapshot-
| based saving.
| pdimitar wrote:
| I used `bupstash` and evaluated it for a while. I am
| looking to do 5+ offsite backups of a small personal
| directory to services that offer 5GB of cloud space for
| free.
|
| `bupstash` lacked good compression. I settled with `borg`
| because I could use `zstd` compression with it. Currently
| at 60 snapshots of the directory and the `borg` repo
| directory is at ~1.52GB out of 5GB quota. The source
| directory is ~12.19GB uncompressed. Very happy with `borg`
| + `zstd` and how they handle my scenario.
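|
| For reference, the whole setup is just (illustrative
| invocation; zstd levels go up to 22):
|
|   borg init --encryption=repokey backup-repo
|   borg create --compression zstd,19 \
|       backup-repo::snap-{now} ~/my-directory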
|
| I liked `bupstash` a lot, and the author is responsive and
| friendly. But I won't be giving it another try until it
| implements much more aggressive compression compared to
| what it can do now. It's a shame, I _really_ wanted to use
| it.
|
| I do recognize that for many other scenarios `bupstash` is
| very solid though.
| veselink1 wrote:
| An author here, we've opened a Q&A discussion on GitHub:
| https://github.com/elfshaker/elfshaker/discussions/58.
| thristian wrote:
| This seems very much like the Git repository format, with loose
| objects being collected into compressed pack files - except I
| think Git has smarter heuristics about which files are likely to
| compress well together. It would be interesting to see a
| comparison between this tool and Git used to store the same
| collection of similar files.
| peterwaller-arm wrote:
| An author here, I agree! The packfile format is heavily
| inspired by git, and git may also do quite well at this.
|
| We did some preliminary experiments with git a while back but
| found we were able to do the packing and extraction much faster
| and smaller than git was able to manage. However, we haven't
| had the time to repeat the experiments with our latest
| knowledge and the latest version of git. So it is entirely
| possible that git might be an even better answer here in the
| end. We just haven't done the best experiments yet. It's
| something to bear in mind. If someone wants, they could measure
| this fairly easily by unpacking our snapshots and storing them
| into git.
|
| On our machines, forming a snapshot of one llvm+clang build
| takes hundreds of milliseconds. Forming a packfile for 2,000
| clang builds with elfshaker can take seconds during the pack
| phase with a 'low' compression level (a minute or two for the
| best compression level, which gets it down to the ~50-100MiB/mo
| range), and extracting takes less than a second. Initial
| experiments with git showed it was going to be much slower.
| johnyzee wrote:
| As far as I was able to learn (don't remember the details,
| sorry), git does not do well with large binary files. I
| believe it ends up with a lot of duplication. It is the major
| thing I am missing from git, currently we store assets (like
| big PSDs that change often) outside of version control and it
| is suboptimal.
| peterwaller-arm wrote:
| Performing poorly with non-textual data happens for a
| number of reasons. Binary data, when changed, often has a
| lot of 'non-local' changes in it. For example, a PSD file
| might well have a compression algorithm already applied to
| it. An insertion/deletion is going to result in a very
| different compressed representation for which there is no
| good way to have an efficient delta. elfshaker will suffer
| the same problem here.
| JoshTriplett wrote:
| Can you talk a bit more about what ELF-specific
| heuristics elfshaker uses? What kind of preprocessing do
| you do before zstd? Do you handle offsets changing in
| instructions, like the BCJ/BCJ2 filter? Do you do
| anything to detect insertions/deletions?
| peterwaller-arm wrote:
| We've just added an applicability section, which explains
| a bit more what we do. We don't have any ELF specific
| heuristics [0].
|
| https://github.com/elfshaker/elfshaker#applicability
|
| In summary, for manyclangs, we compile with -ffunction-
| sections and -fdata-sections, and store the resulting
| object files. These are fairly robust to insertions and
| deletions, since the addresses are section relative, so
| the damage of any addresses changing is contained within
| the sections. A somewhat surprising thing is that this
| works well enough when building many revisions of
| clang/llvm -- as you go from commit to commit, many
| commits have bit identical object files, even though the
| build system often wants to rebuild them because some
| input has changed.
|
| elfshaker packs use a heuristic of sorting all unique
| objects by size, before concatenating them and storing
| them with zstandard. This gives us an amortized cost-per-
| commit of something like 40kiB after compression with
| zstandard.
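|
| In shell terms the heuristic is roughly this (an illustrative
| sketch, not the actual Rust implementation, and it assumes
| GNU find and well-behaved file names):
|
|   # Sort unique object files by size, concatenate, compress.
|   find objects/ -type f -printf '%s %p\n' | sort -n \
|       | cut -d' ' -f2- | xargs cat | zstd -19 > pack.zst
|
| The idea is that similar-sized objects are often revisions of
| the same file, so they land near each other in the stream.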
|
| [0] (edit: despite the playful name suggesting otherwise
| -- when we chose the name we planned to do more with ELF
| files, but it turned out to be unnecessary for our use
| case)
| JoshTriplett wrote:
| Ah, I see! Makes sense that you can do much better if you
| get to compile the programs with your choice of options.
| derefr wrote:
| One could, in theory, write a _git-clean_ filter (like
| the one used for git-lfs), that teaches git various
| heuristic approaches to "take apart" well-known binary
| container formats into trees of binary object leaf-nodes.
|
| Then, when you committed a large binary that git could
| understand, what git would really be committing in its
| place would be a directory tree -- sort of like the
| "resource tree" you see if you edit an MKV file, PNG
| file, etc., but realized as files in directories. Git
| would generate it, then commit it.
|
| On checkout, this process would happen in reverse: a
| matching _git-smudge_ filter could notice a metadata file
| in each of these generated directories, and collapse the
| contents of the directory together to form a binary
| chunk; recursively, up the tree, until you hit the
| toplevel, and end up with the original large binary
| again.
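|
| For reference, git-lfs wires up its own clean/smudge pair
| like this; a hypothetical "take-apart" filter would hook in
| the same way:
|
|   git config filter.lfs.clean "git-lfs clean -- %f"
|   git config filter.lfs.smudge "git-lfs smudge -- %f"
|   # and in .gitattributes:
|   *.psd filter=lfs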
|
| Since most of the _generated leaf-nodes_ from this
| process wouldn't change on each commit, this would
| eliminate most of the _storage_ overhead of having many
| historical versions of large files in git. (In exchange
| for: 1. the potentially-huge CPU overhead of doing this
| "taking apart" of the file on every commit; 2. the added
| IOPS for temporarily creating the files to commit them;
| and 3. the loss of any file-level compression [though git
| itself compresses its packfiles, so that's a wash.])
|
| I'm almost inspired to try this out for a simple binary
| tree format like
| https://en.wikipedia.org/wiki/Interchange_File_Format.
| But ELF wouldn't be too hard, either! (You could even go
| well past the "logical tree" of ELF by splitting the text
| section into objects per symbol, and ensuring the object
| code for each symbol is stored in a PIC representation in
| git, even if it isn't in the binary.)
| ChrisMarshallNY wrote:
| _> we store assets (like big PSDs that change often)
| outside of version control and it is suboptimal._
|
| Perforce is still used by game developers and other
| creatives because it handles large binaries quite well.
|
| In fact, I'm not sure if they still do it, but one of the
| game engines (I think, maybe, Unreal) used to have a free
| tier that also included a free Perforce install.
| mdaniel wrote:
| It was my recollection, and I confirmed it, that they've
| almost always had a "the first hit is free" model for
| small teams, and they also explicitly call out indie game
| studios as getting free stuff too:
| https://www.perforce.com/how-buy
| 3np wrote:
| Do you think it would be feasible to do a git-lfs replacement
| based on elfshaker?
|
| Down the line maybe it would even be possible to have
| binaries as "first-class" (save for diff I guess)
| londons_explore wrote:
| I'd like to see a version of this built into things like IPFS.
|
| It seems obvious that whenever something is saved into IPFS,
| there might be a similar object already stored. If there is, go
| make a diff, and only store the diff.
| hcs wrote:
| It should be possible to do this in IPFS already if you use the
| go-ipfs --chunker option with a content-sensitive chunking
| algorithm like rabin or buzhash [1]. With this there's a good
| chance that a file with small changes from something already on
| IPFS will have some chunks that hash identically, so they'll be
| shared.
|
| [1] https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
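|
| For example (flags as in current go-ipfs; chunk-size tuning
| is optional):
|
|   ipfs add --chunker=buzhash large-file.bin
|   # or rabin with min/avg/max chunk sizes:
|   ipfs add --chunker=rabin-16384-65536-131072 large-file.bin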
| londons_explore wrote:
| But that isn't quite as good as something like this that can
| 'understand' diffs in files, rather than simply relying on
| the fact a bunch of bytes in a row might be the same.
| hcs wrote:
| I don't think elfshaker actually does do any binary diffing
| (e.g. xdelta or bsdiff). It works well because it uses pre-
| link objects which are built to change as little as
| possible between versions. Then when it compresses similar
| files together in a pack, Zstandard can recognize the
| trivial repeats.
| peterwaller-arm wrote:
| Author here. This is correct, we set out to do binary
| diffing but we soon discovered that if you put similar
| enough object files together in a stream, and then
| compress the stream, zstandard does a fantastic job at
| compressing and decompressing quickly with a high
| compression ratio. The existing binary diffing tools can
| produce small patches, but they are relatively expensive
| both to compute the delta and to apply the patches.
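|
| You can see the effect for yourself with plain zstd
| (illustrative; file names are made up):
|
|   # Compress two similar object files separately vs. together.
|   zstd -19 -o old.zst build-old/foo.o
|   zstd -19 -o new.zst build-new/foo.o
|   cat build-old/foo.o build-new/foo.o | zstd -19 -o both.zst
|
| both.zst typically comes out far smaller than old.zst +
| new.zst, because zstd finds the repeats across the stream.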
| mal10c wrote:
| This project reminded me of something I've been looking for for a
| while - although it's not exactly what I'm looking for...
|
| I use SolidWorks PDM at work to control drawings, BOMs, test
| procedures, etc. In all honesty, PDM does an alright job when it
| works, but when I have problems with our local server, all hell
| breaks loose and worst case, the engineers can't move forward.
|
| In that light, I'd love to switch to another option. Preferably
| something decentralized just to ensure we have more backups. Git
| almost gets us there but doesn't include things like "where
| used."
|
| All that being said, am I overlooking some features of Elfshaker
| that would fit well into my hopes of finding an alternative to
| PDM?
|
| I also see there's another HN thread that asks the question I'm
| asking - just not through the lens of Elfshaker:
| https://news.ycombinator.com/item?id=20644770
| kvnhn wrote:
| Maybe not precisely what you want, but I built a CLI tool[1]
| that's like a simplified and decoupled Git-LFS. It tracks large
| files in a content-addressed directory, and then you track the
| references to that store in source control. Data compression
| isn't a top priority for my tool; it uses immutable symlinks,
| not archives.
|
| [1]: https://github.com/kevin-hanselman/dud
| erichocean wrote:
| Seems like the Nix people would be interested in enabling this
| kind of thing for Nix packages...
| lxpz wrote:
| This should be integrated with Cargo to reduce the size of the
| target directories which are becoming ridiculously large.
| peterwaller-arm wrote:
| Author here. I'm unsure whether this would apply very well to
| cargo or not. If it has lots of pre-link object files, then
| maybe.
| lxe wrote:
| > There are many files,
|
| > Most of them don't change very often so there are a lot of
| duplicate files,
|
| > When they do change, the deltas of the [binaries] are not huge.
|
| We need this but for node_modules
| ithkuil wrote:
| The novel trick here is splitting up huge binary files and
| treating them as if they were many small files.
|
| node_modules is already tons and tons of files, and when they
| are large, they are usually minified and hard to split on any
| "natural" boundary (like ELF sections/symbols etc).
| i_like_waiting wrote:
| Thanks, seems like that could be a good solution for storing
| daily backups of a DB. I didn't know I needed it, but it seems
| like I do.
| phil294 wrote:
| Have a look at Borg, it handles incremental backups very well
| peterwaller-arm wrote:
| Author here, this software is young, please don't use it for
| backups!
|
| But also, in general, it might not work well for your use case,
| and our use case is niche. Please give it a try before making
| assumptions about any suitability for use.
| wpietri wrote:
| In this age of rampant puffery, it's so... soothing to see
| somebody be positive and frank about the limits of their
| creation. Thanks for this and all your comments here!
| peterwaller-arm wrote:
| <3
| the_duke wrote:
| Borg, bup and restic are relatively popular incremental backup
| tools that deduplicate with chunking.
| goodpoint wrote:
| I'm surprised nobody mentioned git-annex. It does the same using
| git for metadata. It's extremely efficient.
| kristjansson wrote:
| AFAIK, git-annex doesn't address sub-file
| deduplication/compression at all; it just stores a new copy
| for each new hash it sees? I suppose that content-addressed
| storage, combined with the pre-link strategy discussed
| elsewhere for the related manyclangs project, would produce
| similar, if less spectacular, results?
| jankotek wrote:
| Does it make sense to turn it into a FUSE fs, with
| transparent deduplication?
| peterwaller-arm wrote:
| Author here. Maybe, it's a fun idea. I have toyed with
| providing a fuse filesystem for access to a pack but my time
| for completing this is limited at the moment.
| nh2 wrote:
| Many packfile-deduplicating backup tools (bup, kopia, borg,
| restic) can mount the deduplicated storage as FUSE.
|
| It might make sense to check how they do it.
|
| I'd also be interested in how elfshaker compares to those
| (and `bupstash`, which is written in Rust but doesn't have a
| FUSE mount yet) in terms of compression and speed.
|
| Did you know of their existence when making elfshaker?
|
| Edit: Question also posted in your Q&A:
| https://github.com/elfshaker/elfshaker/discussions/58#discus...
| peterwaller-arm wrote:
| (Copying from Q&A) Before starting out some time ago, I did
| some experiments with bup. I had a good experience with bup
| and high expectations for it. However, I found that quite a
| lot of performance was left on the table, so I was
| motivated to start elfshaker. Unfortunately that time has
| passed, so I don't have rigorous numbers measured against
| other software for you at this time.
|
| As an idea of how elfshaker performs, we see ~300ms time to
| create a snapshot for clang, and ~seconds-to-minute to
| create a binary pack containing thousands of revisions.
| Extraction takes less than a second. One difference of
| elfshaker compared with some other software I tested is
| that we do the compression and decompression in parallel,
| which can make a very big difference on today's many-core
| machines.
| mhx77 wrote:
| Somewhat related (and definitely born out of a very similar use
| case): https://github.com/mhx/dwarfs
|
| I initially built this for having access to 1000+ Perl
| installations (spanning decades of Perl releases). The
| compression in this case is not quite as impressive (50 GiB to
| around 300 MiB), but access times are typically in the
| millisecond region.
| pdimitar wrote:
| That's super impressive, I will definitely give it a go. Thanks
| for sharing!
| peterwaller-arm wrote:
| Nice, I bet dwarfs would do well at our use case too. Thanks
| for sharing.
| tttsxhub wrote:
| Why does it depend on the CPU architecture?
| peterwaller-arm wrote:
| (Disclosure: I work for Arm, opinions are my own)
|
| Author here. elfshaker itself does not have a dependency on any
| architecture to our knowledge. We support the architectures we
| have use of. Contributions to add missing support are welcome.
|
| manyclangs provides binary pack files for aarch64 because
| that's what we have immediate use of. If elfshaker and
| manyclangs prove useful to people, I would love to see
| resources invested to make them more widely useful.
|
| You can still run the manyclangs binaries on other
| architectures using qemu [0], with some performance cost, which
| may be tolerable depending on your use case.
|
| [0] https://github.com/elfshaker/manyclangs/tree/main/docker-qem...
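|
| For example, with qemu-user on an x86-64 host (illustrative;
| the docker-qemu setup linked above wraps this up for you):
|
|   qemu-aarch64 -L /usr/aarch64-linux-gnu ./bin/clang --version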
| henvic wrote:
| Interesting. I wonder if this can also be [ab]used to, say,
| deliver deltas of programs, so that you can have faster updates,
| but maybe it doesn't make sense.
|
| https://en.wikipedia.org/wiki/Binary_delta_compression
| peterwaller-arm wrote:
| Author here, I don't think it would apply well to that
| scenario. elfshaker is good for manyclangs where we ship 2,000
| revisions in one file (pack), so the cost of an individual
| revision is amortized. If one build of llvm+clang costs you
| some ~400 MiB, a single elfshaker pack containing 2,000 builds
| has an amortized cost of around 40 KiB/build. But this amazing
| win only happens because you are shipping 2,000 builds at
| once. If you wanted to ship a single delta, you can't compress
| against all the other builds.
| necovek wrote:
| How fast would it be to get a delta between any two of the
| 2,000 builds in a single elfshaker pack?
|
| If that's reasonably fast, perhaps an approach like that
| could work: server stores the entire pack, but upon user
| request extracts a delta between user's version and target
| binary.
|
| Still, the devil is in the details of building all revisions
| of all software a single distribution has.
| peterwaller-arm wrote:
| Yes you could do that. On the other hand, all revisions for
| a month is 100MiB, and all revisions we've built spanning
| 2019-now are a total of 2.8GiB, so we opted to forego
| implementing any object negotiation and just say 'you have
| to download the 100MiB for the month to access it'. I think
| a push/pull protocol could be implemented, but at that point
| git would probably do a reasonable job of it :)
| henvic wrote:
| Thank you for the insight!
| wlll wrote:
| Related, and impressive: https://github.com/elfshaker/manyclangs
|
| > manyclangs is a project enabling you to run any commit of clang
| within a few seconds, without having to build it.
|
| > It provides elfshaker pack files, each containing ~2000 builds
| of LLVM packed into ~100MiB. Running any particular build takes
| about 4s.
| Tobu wrote:
| The clever idea that makes manyclangs compress well is to store
| object files before they are linked, with each function and
| each variable in its own elf section so that changes are mostly
| local; addresses will indirect through sections and a change to
| one item won't cascade into moving every address.
|
| I'm not sure the linking step they provide is
| deterministic/hermetic; if it is, that would be a decent way
| to compress the final binaries while shaving most of the
| compilation time. Maybe the manyclangs repo could store hashes
| of the linked binaries if so?
|
| I'm not seeing any particular tricks done in elfshaker itself
| to enable this; the packfile system orders objects by size as
| a heuristic for grouping similar objects together and
| compresses everything (using zstd and parallel streams for,
| well, parallelism). Sorting by size seems to be part of the
| Git heuristic for delta packing:
| https://git-scm.com/docs/pack-heuristics
|
| I'd like to see a comparison with Git and others listed here
| (same unlinked clang artifacts, compare packing and access):
| https://github.com/elfshaker/elfshaker/discussions/58#discus...
| peterwaller-arm wrote:
| Author here, I'd like to see such a comparison too actually,
| but I'm not in the position to do the work at the moment. We
| did some preliminary experiments at the beginning, but a lot
| changed over the course of the project and I don't know how
| well elfshaker fares ultimately against all the options out
| there. Some basic tests against git found that git is quite a
| bit slower (10s vs 100ms) during 'git add' and git checkout.
| Maybe that can be fixed with some tuning or finding
| appropriate options.
| perth wrote:
| Reminds me of how Microsoft packages the Windows installer,
| actually. If you've ever unpacked Microsoft's install.esd,
| it's interestingly insane how heavily it's compressed. I
| assume it's full of a lot of stuff that provides semi-
| redundant binaries for compatibility with a lot of different
| systems, because the unpacked esd container goes from a few
| GiB to around 40-50 iirc.
| derefr wrote:
| The emulation community also has "ROMsets" -- collections of
| game ROM images, where the ROM images _for a given game
| title_ are all grouped together into an archive. So you'd
| have one archive for e.g. "every release, dump, and ROMhack
| of Super Mario Bros 1."
|
| These ROM-set archives -- especially when using more modern
| compression algorithms, like LZMA/7zip -- end up about 1.1x
| the size of a _single one_ of the contained game ROM images,
| despite sometimes containing literally hundreds of variant
| images.
| Daishiman wrote:
| How does this work? Do all the game series use the same
| engine code and assets?
| bena wrote:
| Sort of. ROMHacks are modified ROM images of a certain
| game.
|
| If you knew where in the ROM image the level data was
| contained, you could modify it. As long as you didn't
| violate any constraints, the game would run fine.
|
| You could also potentially influence game behavior as
| well.
|
| The Game Genie and Gameshark were kind of based on this
| concept. Except, being further along the chain, they could
| write values coming into and out of memory, so other
| effects were possible.
|
| So, in the case of Super Mario Bros. ROMHacks, they all
| use Super Mario Bros. as a base ROM. Then from there, all
| you need to do is store the diff from the base.
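|
| (For illustration, tools like xdelta3 express exactly that
| kind of diff:
|
|   xdelta3 -e -s base.rom hack.rom hack.xdelta   # encode
|   xdelta3 -d -s base.rom hack.xdelta hack.rom   # decode
|
| which is one reason ROMhacks are usually distributed as
| patches rather than full images.)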
| notafraudster wrote:
| I think you're slightly misinterpreting what the parent
| said. Take the game Super Mario World for the console
| Super Nintendo. It was released in Japan. It was released
| in the US. It was released in Europe. It was released in
| Korea. It was released in Australia. It was probably
| released in various minor regions and given unique
| translations. There are almost certainly re-releases of
| the game on Super Nintendo that issued new ROM files to
| correct minor bugs. Maybe there's a Greatest Hits version
| which might be the same game, but with an updated
| copyright date to reflect the re-release. This might
| amount to 10-12 versions of the same game, but 99.99% of
| what's in the ROM file is the same across all of them, so
| they compress very well when stored together.
|
| A copy of Super Mario Advance 2 for Game Boy Advance,
| which is also a re-release of Super Mario World, almost
| surely uses its own engine and would not be part of the
| same rom set. Likewise, other Mario games (like Mario 64,
| Super Mario Bros, etc.) would not be part of the same rom
| set. So it's nothing about the series using the same
| engine code or assets.
|
| We're talking bugfixes and different regions for the same
| game on the same console. But this still has the effect
| of dropping the size for complete console collections by
| 50% or more, because most consoles have 2-3 regions per
| game for most games.
| derefr wrote:
| You're generally correct. But there are interesting
| exceptions!
|
| Sometimes, ROM-image-based game titles _were_ based on
| the same "engine" (i.e. the same core set of assembler
| source-files with fixed address-space target locations,
| and so fixed locations in a generated ROM image), but
| with a few engine modifications, and entirely different
| assets.
|
| In a sense, this makes these different games effectively
| into mutual "full conversion ROMhacks" of one another.
|
| You'll usually find these different game titles
| compressed together into the _same_ ROMset (with one game
| title -- usually the one with the oldest official release
| -- being considered the prototype for the others, and so
| naming the ROMset), because they _do_ compress together
| very well -- not near-totally, the way bugfix patches do,
| but adding only the total amount to the archive size that
| you'd expect for the additional new assets.
|
| Well-known examples of this are _Doki Doki Panic_ vs.
| _Super Mario Bros 2_; _Panel de Pon_ vs. _Tetris Attack_;
| _Gradius III_ vs. _Parodius_; and any game with editions,
| e.g. _Pokemon_ or _Megaman Battle Network_.
|
| But there are more "complete" examples as well, where
| you'd never even suspect the two titles are related, with
| the games perhaps existing in entirely-different genres.
| (I don't have a ROMset library on-hand to dig out
| examples, but if you dig through one, you'll find some
| amazing examples of engine reuse.)
| wpietri wrote:
| Ooh, neat. I was wondering why anybody would make a binary-
| specific VCS. And why "elf" was in the name. This answers both
| questions. Thanks!
| [deleted]
| yincrash wrote:
| Could this be useful for packing Xcode's DerivedData folder
| for caching in CI builds?
| svilen_dobrev wrote:
| Will some of these work for (compressed) variants of audio?
| They're never the same...
| peterwaller-arm wrote:
| Author here. Compressed data is unlikely to work well in
| general, unless it never changes.
| cyounkins wrote:
| Cool! I wonder how this would compare to ZFS deduplication.
| veselink1 wrote:
| An author here. elfshaker uses per-file deduplication. When
| building manyclangs packs, we observed that the deduplicated
| content is about 10 GiB in size. After compression with
| `elfshaker pack`, that comes down to ~100 MiB.
|
| There is also a usability difference: elfshaker stores data in
| pack files, which are more easily shareable. Each of the pack
| files released as part of manyclangs is ~100 MiB and contains
| enough data to materialize ~2,000 builds of clang and LLVM.
| bogwog wrote:
| Does this work well with image files? (PNG, JPEG, etc)
| peterwaller-arm wrote:
| Author here, it works particularly well for our presented use
| case because it has these properties:
|
| * There are many files,
|
| * Most of them don't change very often,
|
| * When they do change, the deltas of the binaries are not huge.
|
| So, if the image files aren't changing very much, then it might
| work well for you. If the images are changing, their binary
| deltas would be quite large, so you'd get a compression ratio
| somewhat equivalent to if you'd concatenated the two revisions
| of the file and compressed them using ZStandard.
| shp0ngle wrote:
| Ahhh that's the key insight I have been missing, and that
| should be higher somewhere.
|
| Thanks
| IceWreck wrote:
| Please add these points under a usecase heading in your
| README.
| peterwaller-arm wrote:
| Done, hopefully this is clearer. Please let us know if you
| see a way to improve it further:
| https://github.com/elfshaker/elfshaker/pull/60
| ghoul2 wrote:
| If I already have, let's say, a 100MB pack file containing
| (say) 200 builds of clang, and then I import the 201st build
| into that pack file - is it possible to send across a small
| delta of this new, updated pack file to someone else who
| already had the older pack file (with 200 builds), such that
| they can apply the delta to the old pack and get the new pack
| containing 201 builds?
| carlmr wrote:
| I find the description a bit confusing; is there an example
| where we can see the usage?
| mxuribe wrote:
| Same here. There is a usage guide, which helped a tiny bit:
| https://github.com/elfshaker/elfshaker/blob/main/docs/users/...
|
| Honestly, I sort of looked at it for a conventional backup
| strategy...as in, I wonder if it could work as a replacement
| for tar-zipping up a directory, etc. But I'm not sure if the
| use case is appropriate.
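|
| From skimming that guide, the basic flow seems to be roughly
| this (exact syntax may differ; check the docs):
|
|   elfshaker store snap-1        # snapshot the working dir
|   elfshaker pack my-pack        # compress snapshots to a pack
|   elfshaker extract my-pack:snap-1   # restore a snapshot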
| xdfgh1112 wrote:
| For backup you probably want something like Borg to handle
| deduplication of identical content between backups.
| peterwaller-arm wrote:
| Author here, I agree with xdfgh1112, please take care
| before using brand new software to store your backups!
| mxuribe wrote:
| Yes, any time that I use something new or different (or
| both) for something as essential as backups, I take great
| and deliberate care...and test, test, test...well before
| standardizing on it. ;-)
| peterwaller-arm wrote:
| Author here. We'd love this to be a thing, but this is young
| software, so we don't recommend relying on this as a single
| way of doing a backup for now. Bear in mind that our main use
| case is for things that you can reproduce in principle
| (builds of a commit history, see manyclangs).
| mxuribe wrote:
| > our main use case is for things that you can reproduce in
| principle (builds of a commit history, see manyclangs)
|
| I appreciate your response, and thanks very much for the
| clarification of use case; very helpful! Thanks also of
| course for building this!
| w0m wrote:
| My top-level takeaway is that it's a VCS (like Git)
| specialized for binaries, with commands baked in to prevent
| the slowdown that often comes with large Git repositories.
| throw_away wrote:
| Specifically, it's for ELF binaries built in such a way that
| adding a new function or new data does not break the caching
| of existing functions/data.
|
| I wonder if this concept could be extended to other binary
| types that git has problems with, if you were able to
| know/control more about the underlying binary format.
| wyldfire wrote:
| There is an associated presentation on manyclangs at the LLVM
| dev meeting. I think they presented yesterday?
|
| Unfortunately it won't be uploaded until later, but it will
| show up on the LLVM YouTube channel:
|
| https://www.youtube.com/c/LLVMPROJ
| ot wrote:
| I would guess it's a way to quickly bisect on compiler
| versions.
| peterwaller-arm wrote:
| One of the authors here, thanks for the feedback. We've tried
| to improve it here:
| https://github.com/elfshaker/elfshaker/pull/59
| xpe wrote:
| Never shake a baby elf!
| 0942v8653 wrote:
| Does it do any architecture-specific processing, i.e. BCJ filter?
| Or is there a generic version of this? The performance seems
| quite good.
| peterwaller-arm wrote:
| Author here. No architecture specific processing currently.
| Most of the magic happens in zstandard (hat tip to this amazing
| project).
|
| Please see our new applicability section which explains the
| result in a bit more detail:
|
| https://github.com/elfshaker/elfshaker/blob/1bedd4eacd3ddd83...
|
| In manyclangs (which uses elfshaker for storage) we arrange
| that the object code has stable addresses when you do
| insertions/deletions, which means you don't need such a filter.
| But today I learned about such filters, so thanks for sharing
| your question!
| dilap wrote:
| Huh, interesting, could you maybe use this as an in-repo
| alternative to something like git-lfs?
| peterwaller-arm wrote:
| Author here, I don't currently know how this compares to git-
| lfs. It is possible git-lfs would perform quite well on the
| same inputs as elfshaker works on. If git-lfs does already work
| well for your use case, I'd recommend using that rather than
| elfshaker, as it is more established.
| dilap wrote:
| Thanks for the response! I was more just curious about future
| possibilities vs immediate practical use.
|
| git-lfs just offloads the storage of the large binaries to a
| remote site, and then downloads on demand.
|
| If you have a lot of binary assets like artwork or huge excel
| spreadsheets, it's very useful, because in those cases,
| without git-lfs, the git repo will get very large, git will
| get extremely slow, and github will get angry at you for
| having too large a repo.
|
| But it's not all roses with git-lfs, since now you're reliant
| on the external network to do checkouts, vs having fetched
| everything at once w/ the initial clone, and also of course
| just switching between revisions can get slower since you're
| network-limited to fetch those large files. (And though I'm
| not sure, it doesn't seem like git-lfs is doing any local
| caching.)
|
| So you could imagine how something like having elfshaker
| embedded in the repo and integrated as a checkout filter
| could potentially be a useful alternative. Basically an
| efficient way to store binaries directly in the repo.
|
| (Maybe it would be too small a band of use cases to be
| practical though? Obviously if you have lots of distinct art
| assets, that's just going to be big, no matter what...)
| axismundi wrote:
| Does it work on Intel Macs?
___________________________________________________________________
(page generated 2021-11-19 23:00 UTC)