[HN Gopher] Colliding with the SHA prefix of Linux's initial Git...
       ___________________________________________________________________
        
       Colliding with the SHA prefix of Linux's initial Git commit
        
       Author : 2bluesc
       Score  : 199 points
       Date   : 2024-12-30 22:49 UTC (1 days ago)
        
 (HTM) web link (people.kernel.org)
 (TXT) w3m dump (people.kernel.org)
        
       | philips wrote:
       | Relatedly: Kees's keynote on Linux security from a month ago was
       | great: https://www.youtube.com/watch?v=orO8czP5Bxw
        
       | Terr_ wrote:
       | There were some plans to migrate to SHA256, but somehow it still
       | hasn't happened.
       | 
       | The practical upshot is a git commit hash is _not enough l_ to
       | know you are distributing /executing the legitimate code, as
       | opposed to a malicious doppelganger. This is particularly
       | important for tools that rely on it for dependency management,
       | local caches, etc.
        
         | shakna wrote:
         | Compatibility between remotes using one or the other hasn't
         | arrived yet, and git doesn't want to break compatibility. But
         | you can create SHA256 one's today. [0]
         | 
         | [0] https://lwn.net/Articles/898522/
        
         | onedognight wrote:
         | The is not a full git commit hash collision. It has to do with
         | a git note which only needs to matche a 12 character prefix of
         | the git commit.
        
           | usr1106 wrote:
           | While you corrected one mistake you added a new one:)
           | 
           | Those are git trailers, see git-interpret-trailers(1).
           | 
           | git-notes(1) is something completely different and not used
           | by the kernel.
        
         | theamk wrote:
         | people don't really care, because current collision methods are
         | mitigated.
         | 
         | Git does not actually use "sha1", despite what all the docs
         | say, it uses "sha1dc", which is just like sha1 except for
         | inputs which can cause collisions, in which case it either
         | fails with clear error message or returns completely different
         | value.
         | 
         | https://news.ycombinator.com/item?id=17825441
         | 
         | so don't worry, git hashes _are_ enough to know you are
         | distributing/executing the legitimate code.
         | 
         | (not to mention you need a preimage attack to replace known
         | commit, and this is not yet possible with sha1)
        
           | akoboldfrying wrote:
           | From your link:
           | 
           | >In this case "hash" will be the same as SHA1(input) in all
           | cases, except those where the input is detected to be
           | malicious (as in the SHAttered attack)
           | 
           | I don't see how this can be more than a fundamentally
           | forward-incompatible sticking plaster over the problem. The
           | problem isn't merely that "detecting maliciousness" seems
           | fraught in itself (how does one infer intent reliably?) --
           | it's that today's SHA1DC() implementation can only detect and
           | optionally correct _today 's_ known attacks, so each new
           | attack necessitates a new, incompatible version of SHA1DC().
        
             | Dylan16807 wrote:
             | Sure it's plaster, but plaster can last a good while.
             | Sufficiently new attacks don't come around all that often,
             | and _every_ hash has a risk of new attacks showing up.
        
               | hcs wrote:
               | fwiw, I wasn't too familiar with this usage, looked it up
               | and sticking plaster is in the sense of bandaid here
               | https://en.wikipedia.org/wiki/Adhesive_bandage
               | 
               | But agreed that it's a rather tougher patch than I'd
               | originally thought, reading elsewhere in this thread.
        
             | theamk wrote:
             | Each new _unrelated_ SHA1 attack will need an update in
             | SHA1DC. But it has to be truly unrelated, as the collision
             | detection method is fairly robust. I recommend reading the
             | original "Counter-cryptanalysis" paper [0] for details on
             | how attacks work and how they can be mitigated (there is
             | certain internal state in SHA1 that is used in all known
             | attacks). BTW, this paper has an interesting anecdote:
             | apparently Flame malware had exploited MD5 collisions using
             | novel unpublished attack method... and yet it was detected
             | by collision detector (section 3.2). Another example is
             | that SHA-mbles attack, published 3 years after "Counter-
             | cryptanalysis", was detected as well, with no required code
             | changes.
             | 
             | No, there is nothing "fundamentally incompatible" in the
             | new SHA1DC method. After all, git came out in 2005, 11
             | years before SHA1 attacks were known, so it used regular
             | SHA1. The collision detector was added in 2017 and nothing
             | broke, because false positive chance is 2^-90 [1].
             | 
             | I have not heard of any new SHA1 collision results, but if
             | they are based on no-difference differential paths, git has
             | nothing to worry about. And if they are not, it may be
             | possible to extend DC detector to seamlessly detect and
             | prevent those attacks, and then only upgrade git clients,
             | keeping backward and forward compatibility for data.
             | 
             | Of course there is always a chance that someone will come
             | out with all-new SHA1 preimage attack that cannot be
             | detected without high rate of false positive, so it's
             | prudent to switch git to sha256. There is a lot of work
             | being done: git's sha256 mode went out of beta in 2.42
             | (2023), but neither github nor gitlab support it.
             | 
             | But since the current state of git's sha is that there is
             | nothing broken, and git git commit hash _is_ "enough to
             | know you are distributing/executing the legitimate code, as
             | opposed to a malicious doppelganger", there is no real
             | pressure.
             | 
             | [0] https://marc-stevens.nl/research/papers/C13-S.pdf
             | 
             | [1] https://github.com/cr-
             | marcstevens/sha1collisiondetection
        
             | bawolff wrote:
             | > The problem isn't merely that "detecting maliciousness"
             | seems fraught in itself (how does one infer intent
             | reliably?)
             | 
             | Its not detecting "intent" it is detecting that the hash is
             | one vulnerable to the attack, which is extremely unlikely
             | to happen by accident, so if you see it you can assume
             | malice.
             | 
             | It might be a band-aid, and sha256 is certainly a much
             | better solution, but its more robust than it sounds at
             | first glance (since it sounds crazy at first glance)
        
         | bandrami wrote:
         | The hash space is atoms-in-the-universe range; this is a
         | collision in a much, much smaller subset of that space
        
         | TheDong wrote:
         | > The practical upshot is a git commit hash is not enough l to
         | know you are distributing/executing the legitimate code, as
         | opposed to a malicious doppelganger
         | 
         | Really now? Mind if I challenge you?
         | 
         | I have on my machine a git repo with commit
         | '75eb4e3b1369706a4dcd61cc80e49660ac341ea4'.
         | 
         | If you can give me a second git repo with such a commit
         | containing different contents, I'll happily send you $10k USD,
         | or donate it to a charity of your choice.
        
           | sgjohnson wrote:
           | > If you can give me a second git repo with such a commit
           | containing different contents, I'll happily send you $10k
           | USD, or donate it to a charity of your choice.
           | 
           | Calculating that SHA1 collision is going to be a bit more
           | expensive than $10k, by a couple of orders of magnitude.
           | 
           | Finding it in the wild is improbable, but calculating it is
           | definitely possible, and has been done before.
           | http://shattered.io/
        
             | chippiewill wrote:
             | Shattered didn't produce a collision for an arbitrary hash,
             | it produced two documents with the same hash (which is a
             | slightly easier problem, about 100,000x faster).
             | 
             | SHA1 is certainly insecure at this point, but not even
             | close to trivially so.
        
               | codeflo wrote:
               | That is enough to distribute malicious code though, at
               | least in certain scenarios. Someone might create a setup
               | where reviewers check/sign one version of the source
               | code, and what gets distributed is another version with
               | the same hash.
        
               | ExoticPearTree wrote:
               | Can you create a proof of concept and show it here?
        
               | codeflo wrote:
               | What's your point?
        
               | ExoticPearTree wrote:
               | My point is that you need to put the money where your
               | mouth is.
        
               | usr1106 wrote:
               | Code review in the Linux kernel still happens by email to
               | a large degree.
               | 
               | Further up in contribution tree there is additional
               | signing. Would that further complicate the insertion of a
               | false commit? I am not convinced that signing is used all
               | the way down to every contribution.
        
               | codeflo wrote:
               | Linux probably has enough eyeballs on its source to make
               | attacks like that unlikely anyway, but Git isn't just
               | used by Linux.
        
               | bawolff wrote:
               | Indeed. We can't even do this for md5, let alone sha1.
               | 
               | Preimage attacks are very different from collision
               | attacks.
        
         | oefrha wrote:
         | Switching to SHA-256 and switching to longer substrings of
         | hashes for identification are basically orthogonal problems.
         | The former is hardly going to help with the latter, except in
         | the we already broke everything so why not take the chance to
         | break some more sense.
        
         | usr1106 wrote:
         | TFA has nothing to do with SHA-1 or SHA-1 collisions. It's
         | about abbreviated hash values introduced for readability by
         | humans. Now these values are used by auxiliary scripts. Which
         | again has little to do with git proper. It's just what the
         | kernel community writes into commit messages and what scripts
         | they use to parse those messages.
        
         | dec0dedab0de wrote:
         | I think multiple hashes is the way to go to avoid collisions.
         | it can even be something simple like md5. the chances of
         | finding a collision that matches two or more algorithms is near
         | impossible. Obviously that doesn't work for passwords, but for
         | verifying that data hasn't been tampered with, it works.
        
       | throw0101d wrote:
       | First twelve and last twelve characters are the same:
       | $ echo -n retr0id_662d970782071aa7a038dce6 | sha256sum
       | 307e0e71a409d2bf67e76c676d81bd0ff87ee228cd8f991714589d0564e6ea9a
       | -                  $ echo -n retr0id_430d19a6c51814d895666635 |
       | sha256sum
       | 307e0e71a4098e7fb7d72c86cd041a006181c6d8e29882b581d69d0564e6ea9a
       | -
       | 
       | * Via: https://news.ycombinator.com/item?id=38668893
        
         | Retr0id wrote:
         | Oh hey, it's me. Shortly after then I did a write-up on how I
         | crafted them, which was discussed here:
         | https://news.ycombinator.com/item?id=38718314
        
           | susam wrote:
           | Excellent article. Thanks for resharing it here.
        
       | userbinator wrote:
       | In other words, 6 out of 20 bytes match. Good luck colliding the
       | other 112 bits.
        
         | bandrami wrote:
         | The problem is some (broken and wrong but in the wild) tooling
         | relies on substrings like that
        
           | Dilettante_ wrote:
           | Somebody ought to tell them not to do that! [Spoken in the
           | tone of 'You can't park there sir!']
        
       | cwillu wrote:
       | https://people.kernel.org/kees/colliding-with-the-sha-prefix...
       | is the actual link; the lwn article is just a link to it with a
       | one sentence summary and a short quote.
        
         | dang wrote:
         | Changed. Thanks!
         | 
         | (Submitted URL was https://lwn.net/Articles/1003797/)
        
       | timewizard wrote:
       | Would tooling break if the radix was changed from 16 to something
       | higher? You could double your resistance by just changing to base
       | 32 representation in the same length
        
         | Dylan16807 wrote:
         | _So much_ would break if you changed the character set. And
         | there would be pretty much no benefit. Making it a hundred
         | times harder* is not nearly enough to prevent short collisions.
         | 
         | * The exact amount would depend on repo and method. Using base
         | 32 and changing nothing else would slow collisions by 2^length.
        
           | timewizard wrote:
           | > would slow collisions by 2^length.
           | 
           | It would add a bit to each position, going from the current
           | 48 bits, to 60 bits. Or 4,096 more resistant to _accidental_
           | collision. You can have as many bits as you want you just
           | make it more time consuming to create one, but if you're
           | entire infrastructure blithely relies on one never
           | _intentionally_ being submitted, you've solved nothing
           | really.
           | 
           | Moreover, if you find a commit with a collision, it's very
           | easy to slightly alter it's contents to alleviate the
           | problem, presuming we're only having to contend with
           | unintentional conflicts.
        
             | Dylan16807 wrote:
             | The length of the short hash automatically adjusts to keep
             | the chance of accidental collisions low. There's no need to
             | lower the risk of accidental collisions.
             | 
             | So changing the character set doesn't notably improve the
             | accidental situation, and it doesn't notably improve the
             | deliberate situation. And it would be a huge pain to do.
        
               | timewizard wrote:
               | In the kernel case the length of the short hash is fixed
               | at 12 characters. This is part of the article.
        
               | Dylan16807 wrote:
               | Okay, sorry, I was talking about the _normal_ short hash
               | rather than this specific tool. And I first looked before
               | the link was changed.
               | 
               | Still, they already figured out a solution. They don't
               | need multiple mechanisms for the same thing.
        
       | chrishill89 wrote:
       | Presumably the problem is that these tools only take the
       | abbreviated hash into account. Not also the subject:
       | <abbrev. hash> ("<subject>")
       | 
       | You also have another data point. You only need to search in the
       | history from the commit that you are reading. Assuming that the
       | "Fixes" commit is an ancestor of the commit whose commit footer
       | you are reading.
       | 
       | I always just assumed that tools would take all the data into
       | account. Which means that you both need to collide with the
       | abbreviated hash as well as the subject. Now I don't do that
       | since I just copy-paste the hash, but I would quickly notice in
       | case the subject is different (and likely the commit message and
       | the diff just look irrelevant).
       | 
       | I don't understand why the Linux Kernel has this hard-coded
       | rule[2] -- again, you were going to get collisions eventually, so
       | the tools should have just taken all the data into account (at
       | least the subject) from the start. The recommendation in the Git
       | project is to use `git show -s --pretty=reference`, without any
       | fiddling with the abbreviation:                   <abbrev. hash>
       | (subject, ISO date)
       | 
       | Although the Git maintainer uses `--abbrev=8` since git-show will
       | just use a longer abbreviation in case the output would be
       | ambiguous[1].
       | 
       | They could have used this instead if they wanted simpler, future-
       | proof tooling:                   Fixes: <full hash>
       | 
       | Just like tools like git-revert and git-cherry-pick do.
       | 
       | [1]: https://lore.kernel.org/git/xmqq34j5h7v9.fsf@gitster.g/
       | 
       | [2]: Edit: hard-coded as opposed to Git just figuring out how
       | long the abbreviation should be based on how many objects there
       | are.
        
         | chrishill89 wrote:
         | > Presumably the problem is that these tools only take the
         | abbreviated hash into account. Not also the subject:
         | 
         | Well the first mentioned script:
         | 
         | > > Tools like linux-next's "Fixes tag checker",
         | 
         | has `get_full_hash`[1] which uses the subject to search through
         | the abbreviated matches.
         | 
         | Edit: And that check was added two weeks ago by Kees [2].
         | 
         | [1]: https://github.com/kees/kernel-
         | tools/blob/trunk/helpers/chec...
         | 
         | [2]: https://github.com/kees/kernel-
         | tools/commit/5bf6a1e71df59a23...
        
         | weinzierl wrote:
         | git itself does not use a fixed size abbreviation but
         | determines the length necessary with the birthday paradox
         | formula. It just happens to be seven characters most of the
         | time, because most repos are small.
         | 
         | This is just for the abbreviated hash which git uses only for
         | specific cases like display in the git log and similar, where
         | it is a user experience improvement and relatively safe.
        
           | dotancohen wrote:
           | > with the birthday paradox formula.
           | 
           | At what probability? Can that probability be configured as a
           | global Git option?
        
             | cesnja wrote:
             | It can be configured with `core.abbrev`.
             | 
             | https://git-scm.com/docs/git-config#Documentation/git-
             | config...
        
               | dotancohen wrote:
               | I love when the word "hopefully" is used in technical
               | documentation ))
        
               | MadnessASAP wrote:
               | All technical documentation is wishful thinking, they're
               | just being explicit about it.
        
           | chrishill89 wrote:
           | I know. I don't understand why the Linux Kernel uses a hard-
           | coded abbreviation instead of just letting Git figure out how
           | long it should be (edited my comment now).
        
       | quotemstr wrote:
       | We could just encode hashes on base64 instead of hex and get
       | several times more entropy with the same size text
        
         | Retr0id wrote:
         | base64 only gets you 1.5x more entropy per symbol (measuring in
         | bits)
        
           | quotemstr wrote:
           | Oh, right. It's logarithmic. That'll teach me to post before
           | coffee.
        
       | hn_throwaway_99 wrote:
       | Great example of Hyrum's Law, https://www.hyrumslaw.com/.
       | 
       | Comments about SHA256 are irrelevant - you can misuse the prefix
       | of a SHA256 hash just as easily. The issue is that people got
       | used to human-readable hash prefixes of 10-12 characters as
       | "unique" for all intents and purposes, despite the fact that
       | there were never any uniqueness guarantees for prefixes and git
       | has always handled collisions with short object IDs as ambiguous
       | - it's just that it's so rare to happen in the real world that
       | lots of script writers treated that "mostly unique" as a
       | guarantee.
       | 
       | IMO support for short object IDs is a mistake, as is any behavior
       | that "works this way 99.999% of the time, but hey developer don't
       | forget you need to also code for that .001% edge case". I'm
       | always just copying and pasting things around anyway, so it
       | really doesn't make much difference to me if I'm copying 12 chars
       | or 64.
        
       | loeg wrote:
       | Cute. I've seen CPU commit prefix brute-forcers but not GPU ones.
       | On the CPU, you can generate chosen 7-8 digit prefixes pretty
       | easily (within a few minutes, I forget exactly). 12 digits is, of
       | course, a massively bigger (factor of 2^16) space.
        
       ___________________________________________________________________
       (page generated 2024-12-31 23:01 UTC)