[HN Gopher] Reasons to Prefer Blake3 over Sha256
___________________________________________________________________
Reasons to Prefer Blake3 over Sha256
Author : ementally
Score : 168 points
Date : 2023-11-13 12:34 UTC (10 hours ago)
(HTM) web link (peergos.org)
(TXT) w3m dump (peergos.org)
| ndsipa_pomu wrote:
| At this rate, it's going to take over 700 years before we get
| Blake's 7
| benj111 wrote:
| I had to scroll disappointingly far down to get to the Blake's
| 7 reference.
|
| Thank you for not disappointing though.
|
| The downside of that algorithm, though, is that everything dies
| at the end.
| dragontamer wrote:
| > BLAKE3 is much more efficient (in time and energy) than SHA256,
| like 14 times as efficient in typical use cases on typical
| platforms.
|
| [snip]
|
| > AVX in Intel/AMD, Neon and Scalable Vector Extensions in Arm,
| and RISC-V Vector computing in RISC-V. BLAKE3 can take advantage
| of all of it.
|
| Uh huh... AVX/x86 and NEON/ARM you say?
|
| https://www.felixcloutier.com/x86/sha256rnds2
|
| https://developer.arm.com/documentation/ddi0596/2021-12/SIMD...
|
| If we're talking about vectorized instruction sets like AVX
| (Intel/AMD) or NEON (aka: ARM), the advantage is clearly with
| SHA256. I don't think Blake3 has any hardware implementation at
| all yet.
|
| Your typical cell phone running ARMv8 / NEON will be more
| efficient with the SHA256 instructions than whatever software
| routine you need to run Blake3. Dedicated hardware inside the
| cores is very difficult to beat on execution speed or efficiency.
|
| I admit that I haven't run any benchmarks on my own. But I'd be
| very surprised if any software routine were comparable to the
| dedicated SHA256 instructions found on modern cores.
| eatonphil wrote:
| From another thread:
|
| > On my machine with sha extensions, blake3 is about 15% faster
| (single threaded in both cases) than sha256.
|
| https://news.ycombinator.com/item?id=22237387
| vluft wrote:
| followup to this now with further blake3 improvements, on a
| faster machine now but with sha extensions vs single-threaded
| blake3; blake3 is about 2.5x faster than sha256 now. (b3sum
| 1.5.0 vs openssl 3.0.11). b3sum is about 9x faster than
| sha256sum from coreutils (GNU, 9.3) which does not use the
| sha extensions.
|
|     Benchmark 1: openssl sha256 /tmp/rand_1G
|       Time (mean +- s):     576.8 ms +- 3.5 ms  [User: 415.0 ms, System: 161.8 ms]
|       Range (min ... max):  569.7 ms ... 580.3 ms  (10 runs)
|
|     Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
|       Time (mean +- s):     228.7 ms +- 3.7 ms  [User: 168.7 ms, System: 59.5 ms]
|       Range (min ... max):  223.5 ms ... 234.9 ms  (13 runs)
|
|     Benchmark 3: sha256sum /tmp/rand_1G
|       Time (mean +- s):     2.062 s +- 0.025 s  [User: 1.923 s, System: 0.138 s]
|       Range (min ... max):  2.046 s ... 2.130 s  (10 runs)
|
|     Summary
|       b3sum --num-threads 1 /tmp/rand_1G ran
|         2.52 +- 0.04 times faster than openssl sha256 /tmp/rand_1G
|         9.02 +- 0.18 times faster than sha256sum /tmp/rand_1G
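[Ed.: the single-threaded comparison above can be roughly sanity-checked with nothing but the Python standard library. BLAKE3 is not in `hashlib`, so BLAKE2b stands in here as a software-only stand-in; the b3sum numbers in the thread come from the optimized Rust implementation, so treat this only as a rough, machine-specific sketch.]

```python
# Rough single-threaded throughput check using only the standard library.
# BLAKE2b stands in for BLAKE3, which hashlib does not provide.
import hashlib
import time

def throughput(name, data, rounds=3):
    """Return MB/s for hashlib algorithm `name` over `data` (best of `rounds`)."""
    best = float("inf")
    for _ in range(rounds):
        h = hashlib.new(name)
        start = time.perf_counter()
        h.update(data)
        best = min(best, time.perf_counter() - start)
    return len(data) / best / 1e6

data = b"\x00" * (64 * 1024 * 1024)  # 64 MiB of input
for name in ("sha256", "blake2b"):
    print(f"{name}: {throughput(name, data):.0f} MB/s")
```

Whether sha256 or blake2b wins here depends on whether your OpenSSL/CPython build uses SHA-NI, exactly as the cpuid discussion below this comment illustrates.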
| 0xdeafbeef wrote:
| Interestingly, sha256sum and openssl don't use sha_ni.
|
|     iced-cpuid $(which b3sum)
|       AVX AVX2 AVX512F AVX512VL BMI1 CET_IBT CMOV SSE SSE2 SSE4_1 SSSE3
|       SYSCALL XSAVE
|
|     iced-cpuid $(which openssl)
|       CET_IBT CMOV SSE SSE2
|
|     iced-cpuid $(which sha256sum)
|       CET_IBT CMOV SSE SSE2
|
| Also, sha256sum in my case is a bit faster:
|
|     Benchmark 1: openssl sha256 /tmp/rand_1G
|       Time (mean +- s):     540.0 ms +- 1.1 ms  [User: 406.2 ms, System: 132.0 ms]
|       Range (min ... max):  538.5 ms ... 542.3 ms  (10 runs)
|
|     Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
|       Time (mean +- s):     279.6 ms +- 0.8 ms  [User: 213.9 ms, System: 64.4 ms]
|       Range (min ... max):  278.6 ms ... 281.1 ms  (10 runs)
|
|     Benchmark 3: sha256sum /tmp/rand_1G
|       Time (mean +- s):     509.0 ms +- 6.3 ms  [User: 386.4 ms, System: 120.5 ms]
|       Range (min ... max):  504.6 ms ... 524.2 ms  (10 runs)
| vluft wrote:
| not sure that tool is correct; on my openssl it shows the
| same output as you have there, but not AES-NI, which is
| definitely enabled and functional.
|
| ETA: ahh, you want to run that on libcrypto:
|
|     iced-cpuid <...>/libcrypto.so.3:
|       ADX AES AVX AVX2 AVX512BW AVX512DQ AVX512F AVX512VL AVX512_IFMA
|       BMI1 BMI2 CET_IBT CLFSH CMOV D3NOW MMX MOVBE MSR PCLMULQDQ
|       PREFETCHW RDRAND RDSEED RTM SHA SMM SSE SSE2 SSE3 SSE4_1 SSSE3
|       SYSCALL TSC VMX XOP XSAVE
| vluft wrote:
| further research suggests that GNU coreutils cksum will
| use libcrypto in some configurations (though not mine); I
| expect that both of your commands above are actually
| using sha-ni
| TacticalCoder wrote:
| > ...on a faster machine now but with sha extensions vs
| single-threaded blake3; blake3 is about 2.5x faster than
| sha256 now
|
| But one of the sweet benefits of blake3 is that it is
| parallelized _by default_. I picked blake3 not because
| it's 2.5x faster than sha256 when running b3sum with
| "--num-threads 1" but because, with the default, it's 16x
| faster than sha256 (on a machine with "only" 8 cores).
|
| And Blake3, unlike some other "parallelizable" hashes,
| gives the same hash whether you run it with one thread or
| any number of threads (IIRC there are hashes which have
| different executables depending on whether you want to
| run the single-threaded or multi-threaded version of the
| hash, and they give different checksums).
|
| I tried on my machine (which is a bit slower than yours)
| and I get:
|
|     990 ms  openssl sha256
|     331 ms  b3sum --num-threads 1
|      70 ms  b3sum
|
| That's where the big performance gain is when using Blake3
| IMO (even though 2.5x faster than a fast sha256 even when
| single-threaded is already nice).
| vluft wrote:
| yup, for comparison, same file as above using all the
| threads (32) on my system, I get about 45ms with fully
| parallel permitted b3. It does run into diminishing
| returns fairly quickly though; unsurprisingly no
| improvements in perf using hyperthreading, but also
| improvements drop off pretty fast with more cores.
|     b3sum --num-threads 16 /tmp/rand_1G ran
|       1.01 +- 0.02 times faster than b3sum --num-threads 15 /tmp/rand_1G
|       1.01 +- 0.02 times faster than b3sum --num-threads 14 /tmp/rand_1G
|       1.03 +- 0.02 times faster than b3sum --num-threads 13 /tmp/rand_1G
|       1.04 +- 0.02 times faster than b3sum --num-threads 12 /tmp/rand_1G
|       1.07 +- 0.02 times faster than b3sum --num-threads 11 /tmp/rand_1G
|       1.10 +- 0.02 times faster than b3sum --num-threads 10 /tmp/rand_1G
|       1.13 +- 0.02 times faster than b3sum --num-threads 9 /tmp/rand_1G
|       1.20 +- 0.03 times faster than b3sum --num-threads 8 /tmp/rand_1G
|       1.27 +- 0.03 times faster than b3sum --num-threads 7 /tmp/rand_1G
|       1.37 +- 0.02 times faster than b3sum --num-threads 6 /tmp/rand_1G
|       1.53 +- 0.05 times faster than b3sum --num-threads 5 /tmp/rand_1G
|       1.72 +- 0.03 times faster than b3sum --num-threads 4 /tmp/rand_1G
|       2.10 +- 0.04 times faster than b3sum --num-threads 3 /tmp/rand_1G
|       2.84 +- 0.06 times faster than b3sum --num-threads 2 /tmp/rand_1G
|       5.03 +- 0.12 times faster than b3sum --num-threads 1 /tmp/rand_1G
|
| (over 16 elided from this run as they're all ~= the
| 16-thread time)
| oconnor663 wrote:
| Figure 4 from https://github.com/BLAKE3-team/BLAKE3-specs
| /blob/master/blak... is a related benchmark. On that
| particular machine we saw good scaling up to 16 threads,
| but yeah somewhere in that neighborhood you start to run
| into memory bandwidth issues. Which is the problem you
| want I guess :)
| api wrote:
| I'd be curious to see power consumption. SHA (and AES) are
| usually available as what amounts to an ASIC built into the
| processor, while this requires a lot more work to be done
| with vector instructions.
| vluft wrote:
this is less precise than the perf numbers as I don't
| really have a way to measure power directly, but
| rerunning the benchmarks above locked to a CPU core, it
| boosted to about the same level for all 3 commands (about
| 5.5 GHz), so power usage should be about the same.
| wmf wrote:
If Blake3 is 2.5x faster then it's going to use roughly
| 2.5x less energy.
| silotis wrote:
| That's not how it works on modern CPUs. Power draw at
| "100%" utilization can vary widely depending on what part
| of the core is being utilized. The SIMD units are
| typically the most power hungry part of the CPU by a
| large margin so just because a job finishes in less time
| doesn't mean total energy is necessarily lower.
| wmf wrote:
| I'm assuming that both SHA256 and Blake3 are using
| integer SIMD.
| api wrote:
| In some CPUs that may be true, but in many there are
| dedicated SHA instructions that amount to a SHA ASIC in
| the CPU.
|
| AES units are even more common. Most non-tiny CPUs have
| them these days.
| insanitybit wrote:
| > I don't think Blake3 has any hardware implementation at all
| yet.
|
| > https://github.com/BLAKE3-team/BLAKE3
|
| > The blake3 Rust crate, which includes optimized
| implementations for SSE2, SSE4.1, AVX2, AVX-512, and NEON, with
| automatic runtime CPU feature detection on x86. The rayon
| feature provides multithreading.
|
| There aren't blake3 instructions, like some hardware has for
| SHA1, but it does use hardware acceleration.
|
| edit: Re-reading, I think you're saying "If we're going to talk
| about hardware acceleration, SHA1 still has the advantage
| because of specific instructions" - that is true.
| jonhohle wrote:
| I just tested the C implementation on a utility I wrote[0] and
| at least on macOS where SHA256 is hardware accelerated beyond
| just NEON, BLAKE3 ends up being slower than SHA256 from
| CommonCrypto (the Apple provided crypto library). BLAKE3 ends
| up being 5-10% slower for the same input set.
|
| As far as I'm aware, Apple does not expose any of the hardware
| crypto functions, so unless what exists supports BLAKE3 and
| they add support in CommonCrypto, there's no advantage to using
| it from a performance perspective.
|
| The rust implementation is multithreaded and ends up beating
| SHA256 handily, but again, for my use case the C impl is only
| single threaded, and the utility assumes a single threaded
| hasher with one running on each core. Hashing is the bottleneck
| for `dedup`, so finding a faster hasher would have a lot of
| benefits.
|
| 0 - https://github.com/ttkb-oss/dedup
| tromp wrote:
| For short inputs, Blake3 behaves very similarly to Blake2, on
| which it is based. From Blake's Wikipedia page [1]:
|
| BLAKE3 is a single algorithm with many desirable features
| (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE and
| BLAKE2, which are algorithm families with multiple variants.
| BLAKE3 has a binary tree structure, so it supports a practically
| unlimited degree of parallelism (both SIMD and multithreading)
| given long enough input.
|
| [1] https://en.wikipedia.org/wiki/BLAKE_(hash_function)
| cesarb wrote:
| While I really like Blake3, for all reasons mentioned in this
| article, I have to say it does have one tiny disadvantage over
| older hashes like SHA-256: its internal state is slightly bigger
| (due to the tree structure which allows it to be highly
| parallelizable). This can matter when running on tiny
| microcontrollers with only a few kilobytes of memory.
| londons_explore wrote:
| The internal state is no bigger when hashing small things
| though right?
|
| I assume most microcontrollers are unlikely to be hashing
| things much bigger than RAM.
| oconnor663 wrote:
| It's hard to give a short answer to that question :)
|
| - Yes, if you know your input is short, you can use a smaller
| state. The limit is roughly a BLAKE2s state plus (32 bytes
| times the log_2 of the number of KiB you need to hash).
| Section 5.4 of https://github.com/BLAKE3-team/BLAKE3-specs/bl
| ob/master/blak... goes into this.
|
| - But it's hard to take advantage of this space optimization,
| because no libraries implement it in practice.
|
| - But the reason libraries don't implement it is that almost
| no one needs it. The max state size is just under 2 KiB,
| which is small enough even for
| https://github.com/oconnor663/blake3-6502.
|
| - But it would be super easy to implement if we just put the
| "CV stack" on the heap instead of allocating the whole thing
| as an array up front.
|
| - But the platforms that care about this don't have a heap.
|
@cesarb mentioned _really_ tiny microcontrollers, even
| tinier than the 6502 maybe. The other place I'd expect to
| see this optimization is in a full hardware implementation,
| but those are rare. Most hardware accelerators for hash
| functions provide the block operation, and they leave it to
| software to deal with this sort of bookkeeping.
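[Ed.: the state-size limit quoted above ("32 bytes times the log_2 of the number of KiB") can be sketched as a back-of-envelope calculation. This estimates only the CV-stack portion of the state; the "just under 2 KiB" figure in the comment also includes the chunk/compression state on top of this.]

```python
# Back-of-envelope CV-stack size estimate for BLAKE3, following the
# "32 bytes times log2 of the number of KiB to hash" figure above.
import math

CHUNK = 1024   # BLAKE3 chunk size in bytes
CV = 32        # chaining value size in bytes

def cv_stack_bytes(input_len):
    """Worst-case CV-stack size (bytes) for hashing `input_len` bytes."""
    chunks = max(1, math.ceil(input_len / CHUNK))
    return CV * math.ceil(math.log2(chunks)) if chunks > 1 else 0

# Maximum input BLAKE3 supports is 2**64 bytes -> 2**54 chunks,
# so the stack never needs more than 54 levels * 32 bytes.
print(cv_stack_bytes(2**64))  # 1728
```

If you know your input is bounded (say, 1 MiB), the stack shrinks to 10 levels, which is the space optimization the comment says no library bothers to implement.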
| gavinhoward wrote:
| Good, terse article that basically reinforces everything I've
| seen in my research about cryptographic hashing.
|
| Context: I'm building a VCS meant for any size of file, including
| massive ones. It needs a cryptographic hash for the Merkle Tree.
|
| I've chosen BLAKE3, and I'm going to use the original
| implementation because of its speed.
|
| However, I'm going to make it easy to change hash algorithms _per
| commit_ , just so I don't run into the case that Git had trying
| to get rid of SHA1.
| AdamN wrote:
| Smart idea doing the hash choice per-commit. Just make sure
| that somebody putting in an obscure hash doesn't mess up
| everybody's usage of the repo if they don't have a library to
| evaluate that hash installed.
| gavinhoward wrote:
| I agree.
|
| There will be a set of presets of hash function and settings;
| if BLAKE3 fails, then I'll actually have to add SHA3 or
| something, with a set of settings, as presets.
|
| The per-commit storage will then be an enum identifying the
| hash and its settings.
|
| This will let me do other things, like letting companies use
| a 512-bit hash if they expect the repo to be large.
| agodfrey wrote:
| Maybe you're already aware, but you glossed over something:
Since you're using the hash to locate/identify the content
| (you mentioned Merkle and git), if you support multiple
| hash functions you need some assurance that the chance of
| collisions is low _across all supported hash functions_.
| For example two identical functions that differ only in the
| value of their padding bytes (when the input size doesn't
| match the block size) can't coexist.
| gavinhoward wrote:
| You are absolutely right. And yes, I am aware.
|
| Location will actually be done by prefixing the hash with
| the value of the enum for the hash function/settings pair
| that made the hash.
| tczMUFlmoNk wrote:
| Since you seem to have done a fair bit of research in
| this area, do you have any opinions or thoughts about the
| Multihash format?
|
| https://multiformats.io/multihash/
|
| It fills in some of the blanks in your "prefixing the
| hash with the value of the enum for the hash" step.
| gavinhoward wrote:
| The multihash format is an excellent format that I am
| tempted to use.
|
However, there are two general problems:
|
* The spec is not finished, and it doesn't look like much
| work is happening on it.
|
| * While I agree with the FAQ that agreeing on a set of
| hash functions is possible, the format only allows 256
| possible hash functions, so it can run out of space,
| especially since there can be multiple formats of the
| same function (BLAKE2B and BLAKE2S, for example), and
| especially since they want to allow non-cryptographic
| hash functions.
|
| Then specifically for my use case:
|
| * There is the problem brought up by AdamN [1]: if
| multihash is supported, an obscure hash may be supported,
| and that may cause problems.
|
| * As an extension of that, once a hash function and set
| of settings is supported, it's supported forever, so I
| want to be more picky, and an enum allows me to restrict
| what is usable.
|
| * By using a 16-bit enum, I have more space to grow.
|
| * And finally, by using an enum, if my software
encounters a repo with an enum value greater than all of
| the values it knows, it knows that that repo needs a
| later version of the software, since I will only _add_
| enum values.
|
| [1]: https://news.ycombinator.com/item?id=38250444
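[Ed.: the 16-bit-enum-prefix scheme described above can be sketched in a few lines. All names, ids, and preset choices here are illustrative, not from the actual VCS.]

```python
# Hypothetical sketch of an enum-prefixed hash: a big-endian 16-bit
# algorithm id prepended to the digest. Preset ids are only ever added,
# so an unknown id means the repo needs a newer version of the software.
import hashlib
import struct

PRESETS = {
    1: ("sha256", hashlib.sha256),
    2: ("blake2b-256", lambda: hashlib.blake2b(digest_size=32)),
}

def tagged_hash(preset_id, data):
    _name, ctor = PRESETS[preset_id]
    h = ctor()
    h.update(data)
    return struct.pack(">H", preset_id) + h.digest()

def check(tag_and_digest, data):
    preset_id = struct.unpack(">H", tag_and_digest[:2])[0]
    if preset_id not in PRESETS:
        raise ValueError("repo needs a newer version of this software")
    return tagged_hash(preset_id, data) == tag_and_digest
```

Because the id is part of the stored value, two different algorithms can never collide in the object store even if their raw digests happened to match, which addresses agodfrey's cross-function collision concern.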
| FabHK wrote:
| > letting companies use a 512-bit hash if they expect the
| repo to be large.
|
| A repo would have to have more than 1e32 documents for a
| one in a trillion chance of a collision with a 256 bit
| hash. (Total annual world data production is estimated at
| less than 1e24 bytes.)
|
| A 512 bit hash thus seems overkill for almost all purposes.
|
| https://en.wikipedia.org/wiki/Birthday_problem
|
| https://www.weforum.org/agenda/2021/05/world-data-
| produced-s...
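[Ed.: the one-in-a-trillion figure follows from the standard birthday-bound approximation, which is easy to check directly.]

```python
# Birthday-bound sanity check for the claim above, using the standard
# approximation p ~= n**2 / 2**(b+1) for n items and a b-bit hash.
def collision_probability(n, bits):
    return n * n / 2 ** (bits + 1)

# ~1e32 documents against a 256-bit hash: roughly 4e-14, i.e. still
# below a one-in-a-trillion (1e-12) collision chance.
p = collision_probability(10**32, 256)
print(p)
```

Pushing n to about 5e32 is what it takes to cross the 1e-12 threshold, which is why "more than 1e32 documents" is the right order of magnitude.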
| gavinhoward wrote:
| For normal VCS's, you are absolutely right. And you're
| actually right for mine, but I decided to redo the math
| to make sure.
|
| My VCS will track things at a finer level than documents.
| In a C file, it will track individual functions and
| structs. In a Java file, it will track individual fields
| and classes. In a Microsoft Word document, it might track
| individual paragraphs. And in a Blender file, it will
| track each object, material, texture, etc. individually.
|
| Yes, it will handle binary files.
|
| Anyway, it will also be designed for non-technical users.
| To that end, it will hook into the source software and do
| a "commit" every time the user saves.
|
| It will also track individual directories to make renames
| work.
|
| I am a ctrl+s freak, so I save once a minute or more.
| However, other people are not, so let's assume 10 minutes
| (for autosave, perhaps).
|
| Now let's assume a monorepo for a company of 100,000
| people. And let's assume that when they save every 10
| minutes, they save one object in one file (also tracked)
| two directories down. That means they create 5 hashes
| every 10 minutes (the fifth is the top level).
|
| Let's assume an effective 6-hour work day.
|
| That's 5 objects times 6 times per hour times 6 hours.
| That's 180 objects a day per person.
|
| That's 18,000,000 total objects per day. Times 5 for days
| in a week, times 50 for work weeks in a year.
|
| That's 4.5 billion.
|
| Let's multiply that by 40 for 40 years that the repo
| exists, which includes some of the oldest software.
|
| That's 1.8e11 objects. According to [1], a 128-bit hash
| would not be enough for the error correction on a disk at
| that point.
|
However, a 256-bit hash would give us about 10^31 objects
| before reaching that point, which gives us 10^20 times 40
| years of space.
|
| Yep, you're absolutely right that 512 bits is overkill. I
| stand corrected.
|
| [1]: https://en.m.wikipedia.org/wiki/Birthday_attack
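[Ed.: the arithmetic in the estimate above checks out; here it is redone in code with the same assumptions (100k people, 5 objects per save, a save every 10 minutes, 6-hour days, 250 work days a year, a 40-year repo).]

```python
# Re-deriving the object-count estimate from the comment above.
people = 100_000
objects_per_save = 5
saves_per_hour = 6        # one save every 10 minutes
hours_per_day = 6         # effective work day
work_days_per_year = 5 * 50

per_person_per_day = objects_per_save * saves_per_hour * hours_per_day  # 180
per_year = per_person_per_day * people * work_days_per_year             # 4.5e9
total = per_year * 40                                                   # 1.8e11
print(total)
```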
| deadbeeves wrote:
| You're tracking things at the content level? How will you
| deal with files that are purposely broken, or which cause
| the parser to take impractical (but finite) times to
| complete? Also, tracking the history of a class makes
| sense to some extent, but you say you want to commit
| every time there's a save. How will you maintain a
| history when most commits are likely to contain
| unparseable code and so break the continuity of objects?
| nextaccountic wrote:
| In those cases you can just do error recovery in the
| parser (truncating an erroring function for example) and
| then store out-of-band information necessary to
| reconstruct the original file
|
This is also necessary to deal with whitespace, for
| example (if you just reformat the code, you didn't change
| the AST but you changed the file)
| gavinhoward wrote:
| Good questions.
|
| > How will you deal with files that are purposely broken,
| or which cause the parser to take impractical (but
| finite) times to complete?
|
| I've never seen a language parser do that, but if I run
| into a language that does that, I'll probably have my VCS
| track it at the file level, based on tokens or lines.
|
| Dumb languages don't get nice things. :)
|
| > How will you maintain a history when most commits are
| likely to contain unparseable code and so break the
| continuity of objects?
|
| This is less of a problem with binary files (assuming the
| source software does not have bugs in output), but with
| source files, you're right that that problem does exist.
|
| As of right now, I would do a token-based approach. This
| approach removes the need for whitespace-only commits,
| and if I track the tokens right, I should be able to
| identify which right brace _used_ to end the function
| until the broken code was saved. Then I would just save
| the function as broken using that same right brace.
|
For example, say you have this:
|
|     int main() {
|         return 0;
|     }
|
| My VCS would know that the right brace corresponds to the
| end of the function.
|
Then you write this:
|
|     int main() {
|         if (global_bool) {
|             return 0;
|     }
|
| Yes, a dumb system might think that the right brace is
| for the `if`.
|
| However, if you break it down by tokens, the VCS will see
| that `if (global_bool) {` were added before the return,
| so it should be able to tell that the right brace still
| ends the function.
|
| I hope that makes sense.
|
| Another plausible way to do it (at least in C) would be
| to look for things that look like declarations. The
| series of tokens `<type> <name> <left_paren>` is probably
| a function declaration. Java would be easier; its
| declarations are more wordy.
|
| I still have to prove this is possible, but I think it
| is.
| EdSchouten wrote:
| What I dislike about BLAKE3 is that they added explicit logic to
| ensure that identical chunks stored at different offsets result
| in different Merkle tree nodes (a.k.a. the 'chunk counter').
|
| Though this feature is well intended, it makes this hash function
| hard to use for a storage system where you try to do aggressive
| data deduplication.
|
| Furthermore, on platforms that provide native instructions for
| SHA hashing, BLAKE3 isn't necessarily faster, and it is
| certainly more power hungry.
| lazide wrote:
| Huh?
|
| The storage system doing this wouldn't use that part of the
| hash, it would do it itself so no issues? (Hash chunks, instead
| of feeding everything in linearly)
|
| Otherwise the hash isn't going to be even remotely safe for
| most inputs?
| persnickety wrote:
| Could you point to how this is implemented and how it can be
| used? From the sound of it, you're trying to do something like
| rsync's running-window comparison?
| EdSchouten wrote:
| Imagine the case where you're trying to create a storage
| system for a large number of virtual machine images (e.g.,
| you're trying to build your own equivalent of AWS Machine
| Images). There is obviously a lot of duplication between
| parts of images. And not necessarily at the same offset, but
| also at different offsets that are n*2^k bytes apart, where
| 2^k represents the block/sector size.
|
| You could consider building this storage system on top of
BLAKE3's tree model: namely, you store blocks as small Merkle
| trees, and an image is basically a collection of blocks that
| has a different 'hat' on top of it. Unfortunately, BLAKE3
| makes this hard, because the same block will end up having a
| different Merkle tree node depending on the offset at which
| it's stored.
| luoc wrote:
| You mean something like a CDC algorithm? I know that some
| Backup tools like Restic use this.
|
| https://en.m.wikipedia.org/wiki/Rolling_hash
| EdSchouten wrote:
You can use a CDC algorithm, but if you know that
| duplication mostly occurs at power-of-two boundaries,
| there is no need to use that. Deduplicating on a binary
| tree basis is sufficient.
| londons_explore wrote:
| Sounds to me like you are trying to use the innards of a
| hash algorithm for something for which it was not
| designed...
|
| Either modify the algorithm to your needs, and rename it.
|
Or just use something that's already suitable off-the-shelf.
| Plenty of merkle-trees out there.
| luoc wrote:
I think CDC is what you're looking for. Some backup tools
| like restic use it. See
| https://en.m.wikipedia.org/wiki/Rolling_hash
| luoc wrote:
| Duplicated myself, sry
| marktangotango wrote:
| > You could consider building this storage system on top of
| BLAKE3's tree model.
|
| Consider a crypto currency pow that did that without the
| chunk counter. It'd be trivially exploitably by
| precalculating all the tree but the chunk that changed per
| nonce.
| prirun wrote:
| Author of HashBackup here. I don't see how any kind of hash
| tree would be effective at de-duplicating VM machine
| images, other than the degenerate case of an exact copy,
| which is easy to detect with a single file hash.
|
| Most OSes use 4K block sizes. To get the best dedup you
| have to hash every 4K block and lookup each one
| individually in a dedup table. Two VM images could both
| contain an identical 4GB file, but every 4K block of that
| file could be stored at different offsets in the VM images.
| A tree hash wouldn't let you dedup anything but identical
| sections stored at identical offsets, whereas a dedup table
| and 4K blocks allows you to dedup the entire file.
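[Ed.: the flat dedup table prirun describes can be sketched in a few lines: hash every 4 KiB block and key storage on the digest, so identical blocks dedup regardless of their offset in each image. This uses SHA-256 purely for illustration.]

```python
# Minimal sketch of per-block dedup: identical 4 KiB blocks are stored
# once, keyed by digest, no matter where they sit in each image.
import hashlib

BLOCK = 4096

def store(image_bytes, table):
    """Add an image to the dedup table; return its list of block digests."""
    recipe = []
    for off in range(0, len(image_bytes), BLOCK):
        block = image_bytes[off:off + BLOCK]
        digest = hashlib.sha256(block).digest()
        table.setdefault(digest, block)  # stored once, however many refs
        recipe.append(digest)
    return recipe

# The same 8 KiB payload at different (4 KiB-aligned) offsets dedups fully.
payload = b"A" * BLOCK + b"B" * BLOCK
img_a = payload + b"\x00" * BLOCK
img_b = b"\xff" * BLOCK + payload
table = {}
store(img_a, table)
store(img_b, table)
print(len(table))  # 4 unique blocks stored, not 6
```

This is exactly why an offset-keyed tree (as in BLAKE3's chunk counter) can't be reused directly as the dedup index: the same payload would hash differently at each offset.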
| oconnor663 wrote:
| We go over some of our reasoning around that in section 7.5 of
| https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...
| . An early BLAKE3 prototype actually didn't include the chunk
| counter (https://github.com/oconnor663/bao/blob/master/docs/spe
| c_0.9....), so I'm definitely sympathetic to the use cases that
| wish it wasn't there. However, after publication we found out
| that something like a chunk counter is necessary for the
| security of the Bao streaming verification tool:
| https://github.com/oconnor663/bao/issues/41. It could be that
| there's a design that's the best of both worlds, but I'm not
| sure.
| stylepoints wrote:
Until it comes installed by default on Linux and other
| major OSes, it won't be mainstream.
| theamk wrote:
| Python 3.11 will have it https://bugs.python.org/issue39298
| latexr wrote:
| That says "Resolution: rejected" and Python is currently at
| 3.12.0. Did the feature land?
| theamk wrote:
| oops I misread it.. seems it was rejected because it was
| not standard enough...
|
| https://github.com/python/cpython/issues/83479#issuecomment
| -...
| syonfox wrote:
| murmur3
| Snawoot wrote:
| murmur3 is not a cryptographic hash, so it's not even in the
| same field.
| latexr wrote:
It bears mentioning that `shasum` is better supported in that it
| ships with operating systems (macOS, I guess Linux depends on the
| distro, don't know about Windows) and is available directly from
| programming languages (Ruby, Swift, Python, ...) without the need
| for libraries.
|
| Even if BLAKE3 is massively faster, it's not like I ever noticed
| SHA256's supposed slowness. But I do notice its ubiquity.
|
| Based on the article, I would consider switching to BLAKE3
| immediately where I use checksums. But until it gets wider
| support (might be easier with a public domain license instead of
| the current ones) I can't really do it because I need to do
| things with minimal dependencies.
|
| Best of luck to the BLAKE3 team on making their tool more widely
| available.
| upget_tiding wrote:
| > might be easier with a public domain license instead of the
| current ones
|
The reference implementation is public domain (CC0) or, at
| your choice, Apache 2.0.
|
| https://github.com/BLAKE3-team/BLAKE3/blob/master/LICENSE
| zahllos wrote:
| I agree that if you can, BLAKE3 (or even BLAKE2) are nicer
| choices than SHA2. However I would like to add the following
| comments:
|
| * SHA-2 fixes the problems with SHA-1. SHA-1 was a step up over
| SHA-0 that did not completely resolve flaws in SHA-0's design
| (SHA-0 was broken very quickly).
|
| * JP Aumasson (one of the B3 authors) has said publicly a few
| times SHA-2 will never be broken:
| https://news.ycombinator.com/item?id=13733069 is an indirect
| source, can't seem to locate a direct one from Xitter (thanks
| Elon).
|
| Thus it does not necessarily follow that SHA-2 is a bad choice
| because SHA-1 is broken.
| gavinhoward wrote:
| All that may be true.
|
| However, I don't think we can say for sure if SHA2 will be
| broken. Cryptography is hard like that.
|
| In addition, SHA2 is still vulnerable to length extension
| attacks, so in a sense, SHA2 _is_ broken, at least when length
| extension attacks are part of the threat model.
| Godel_unicode wrote:
| I don't understand why people use sha256 when sha512 is often
| significantly faster:
|
| https://crypto.stackexchange.com/questions/26336/sha-512-fas...
| oconnor663 wrote:
| A couple reasons just on the performance side:
|
| - SHA-256 has hardware acceleration on many platforms, but
| SHA-512 mostly doesn't.
|
| - Setting aside hardware acceleration, SHA-256 is faster on
| 32-bit platforms, like a lot of embedded devices. If you have
| to choose between "fast on a desktop" vs "fast in embedded", it
| can make sense to assume that desktops are always fast enough
| and that your bottlenecks will be in embedded.
| adrian_b wrote:
| On older 64-bit CPUs without hardware SHA-256 (i.e. up to the
| Skylake derivatives), SHA-512 is faster.
|
| Many recent Arm CPUs have hardware SHA-512 (and SHA-3).
|
| Intel will add hardware SHA-512 starting with Arrow Lake S,
| to be launched at the end of 2024 (the successor in desktops
| of the current Raptor Lake Refresh).
|
| Most 64-bit CPUs that have been sold during the last 4 years
| and many of those older than that have hardware SHA-256.
| garblegarble wrote:
| This may only be applicable to certain CPUs - e.g. sha512 is a
lot slower on M1:
|
|     $ openssl speed sha256 sha512
|     type      16 bytes    64 bytes     256 bytes    1024 bytes   8192 bytes   16384 bytes
|     sha256   146206.63k  529723.90k  1347842.65k  2051092.82k  2409324.54k  2446518.95k
|     sha512    85705.68k  331953.22k   707320.92k  1149420.20k  1406851.34k  1427259.39k
| nayuki wrote:
| It's an interesting set of reasons, but I prefer Keccak/SHA-3
| over SHA-256, SHA-512, and BLAKE. I trust the standards body and
| public competition and auditing that took place - more so than a
| single author trumpeting the virtues of BLAKE.
| tptacek wrote:
| I'd probably use a Blake too. But:
|
| _SHA256 was based on SHA1 (which is weak). BLAKE was based on
| ChaCha20, which was based on Salsa20 (which are both strong)._
|
| _NIST /NSA have repeatedly signaled lack of confidence in
| SHA256: first by hastily organising the SHA3 contest in the
| aftermath of Wang's break of SHA1_
|
No: SHA2 lacks the structure the SHA1 attack relies on (SHA1
has a linear message schedule, which made it possible to work out
a differential cryptanalysis attack on it).
|
| Blake's own authors keep saying SHA2 is secure (modulo length
| extension), but people keep writing stuff like this. Blake3 is a
| good and interesting choice on the real merits! It doesn't need
| the elbow throw.
| ianopolous wrote:
| Would be interesting to hear Zooko's response to this. (Peergos
| lead here)
| pclmulqdq wrote:
| Most people who publicly opine on the Blake vs. SHA2 debate
| seem to be relatively uninformed on the realities of each one.
| SHA2 and the Blakes are both usually considered to be secure.
|
| The performance arguments most people make are also outdated or
| specious: the original comparisons of Blake vs SHA2 performance
| on CPUs were largely done before Intel and AMD had special SHA2
| instructions.
| ianopolous wrote:
| The author is one of the creators of blake3, Zooko.
| tptacek wrote:
| Sorry, I should have been more precise. JP Aumasson is
| specifically who I'm thinking of; he's made the semi-
| infamous claim that SHA2 won't be broken in his lifetime.
| The subtext I gather is that there's just nothing on the
| horizon that's going to get it. SHA1 we saw coming a ways
| away!
| zahllos wrote:
| Quoting directly from https://nostarch.com/crypto-
| dictionary under the entry SHA-2:
|
| > Unlike SHA-1, SHA-2 algorithms aren't broken and are
| unlikely to ever be.
|
| There's also the fact NIST themselves deprecated SHA-1 in
| 2011 (https://csrc.nist.gov/news/2017/research-results-
| on-sha-1-co... not mentioned, but otherwise nice timeline
| here: https://crypto.stackexchange.com/a/60655), yet
| SHA-2 is still OK. Wiki has a table of cryptanalysis of
| SHA-2: https://en.wikipedia.org/wiki/SHA-2#Cryptanalysis_
| and_valida...
|
| The summary is that either you attack a heavily reduced-round
| variant and get "practical" complexity for the attack, or you
| attack a variant with nearly the full number of rounds and the
| attack is entirely impractical.
|
| So I think your interpretation of the subtext is entirely
| correct.
| honzaik wrote:
| Also, the NSA is currently recommending changing the
| SHA3/Keccak primitives inside Dilithium and Kyber to SHA2-based
| ones...
| https://groups.google.com/a/list.nist.gov/g/pqc-forum/c/SPTp...
| twiss wrote:
| For those who didn't click the link, it should be noted that
| they're suggesting this because it would be easier to deploy
| (in places that have a SHA-2 implementation but not SHA-3),
| not for reasons related to security or anything like that.
| Looking at the responses, there's also some disagreement on
| whether it would offer equal security for the particular use
| case of ML-DSA and ML-KEM (as the final version of Dilithium
| and Kyber standardized by NIST will be called).
| hellcow wrote:
| > they're suggesting this because it would be easier to
| deploy (in places that have a SHA-2 implementation but not
| SHA-3), not for reasons related to security
|
| That's a bit absurd, right? Sure, the NSA didn't overtly
| say, "we propose you use SHA-2 because we can break it."
| That doesn't mean it's secure against them.
|
| We can't look at their stated justification for supporting
| one algorithm over another because the NSA lies. Their very
| _purpose_ as an organization is to defeat encryption, and
| one tactic is to encourage the industry to use something
| they can defeat while reassuring people it's secure. We
| need to look at their recommendations with a lot of
| suspicion and assume an ulterior motive.
| tptacek wrote:
| The article uses inferred NSA preferences as
| justification to avoid SHA2. Can't have it both ways.
| pbsd wrote:
| While there is more confidence now on the security of SHA-2, or
| rather the lack of transference of the SHA-1 approach to SHA-2,
| this was not the case in 2005-2006 when NIST decided to hold
| the SHA-3 competition. See for example the report on Session 4
| of the 2005 NIST workshop on hash functions [1].
|
| [1] https://csrc.nist.gov/events/2005/first-cryptographic-
| hash-w...
| omginternets wrote:
| What do you mean by "weak" and "strong", here?
| nabla9 wrote:
| _Weak_ means that a mathematical flaw has been discovered that
| makes it inherently insecure, or that it is so simple that
| modern computer technology makes it possible to crack it by
| brute force. _Strong_ means that neither of those is the case.
| chx wrote:
| There are fundamentally two kinds of attacks. The first,
| preimage, splits into two:
|
| In a first-preimage attack, you know a hash value but not the
| message that created it, and you want to discover any message
| with the known hash value; in the second-preimage attack, you
| have a message and you want to find a second message that has
| the same hash. Attacks that can find one type of preimage can
| often find the other as well. A weak algorithm allows this to
| be done in less than 2^(hash length) attempts.
|
| And then there is collision: two messages which produce the
| same hash. A weak algorithm allows this to be done in less
| than 2^(half of hash length) attempts.
|
| Source: https://www.rfc-editor.org/rfc/rfc4270
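| A toy illustration of those bounds (my sketch, not from the
| RFC): truncate SHA-256 to 32 bits, so that the birthday bound of
| roughly 2^16 attempts is reachable interactively, while a
| preimage would still take on the order of 2^32:

```python
import hashlib
from itertools import count

def h32(msg: bytes) -> bytes:
    """Toy 32-bit hash: the first 4 bytes of SHA-256."""
    return hashlib.sha256(msg).digest()[:4]

# Collision search: with a 32-bit output, the birthday bound says we
# expect a match after roughly 2^(32/2) = 65536 attempts, not 2^32.
seen = {}
attempts = 0
collision = None
for i in count():
    msg = i.to_bytes(8, "big")
    digest = h32(msg)
    attempts += 1
    if digest in seen:
        collision = (seen[digest], msg)
        break
    seen[digest] = msg

print(f"collision after {attempts} attempts: {collision}")
```

| Finding a (second) preimage against the same toy hash would
| instead require searching on the order of the full 2^32 output
| space.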
| RcouF1uZ4gsC wrote:
| SHA-256 has the advantage that it is used for Bitcoin. It is
| the biggest bug bounty of all time. There are literally billions
| riding on the security of SHA-256.
| aburan28 wrote:
| There has been a mountain of cryptanalysis done on SHA256 with
| no major breaks, compared to a much smaller amount of analysis
| on blake3.
| jrockway wrote:
| Fast hash functions are really important, and SHA256 is really
| slow. Switching the hash function where you can is enough to
| produce user-visible speedups for common hashing use cases:
| verifying build artifacts, seeing if on-disk files changed, etc.
| I was writing something to produce OCI container images a few
| months ago, and the 3x SHA256 required by the spec for layers
| actually takes on the order of seconds. (0.5s to sha256 a 50MB
| file, on my 2019-era Threadripper!) I was shocked to discover
| this. (gzip is also very slow, like shockingly slow, but
| fortunately the OCI spec lets you use Zstd, which is
| significantly faster.)
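| A quick way to sanity-check this on your own machine (my
| sketch, using Python's hashlib; hashlib has no blake3, and the
| absolute numbers depend on whether your build's OpenSSL uses the
| hardware SHA extensions):

```python
import hashlib
import time

def throughput_mb_s(name: str, data: bytes) -> float:
    """Hash `data` once with the named algorithm and return MB/s."""
    h = hashlib.new(name)
    start = time.perf_counter()
    h.update(data)
    h.digest()
    return len(data) / (time.perf_counter() - start) / 1e6

data = bytes(50 * 1024 * 1024)  # a 50MB buffer, as in the comment
for name in ("sha256", "blake2b", "sha1"):
    print(f"{name:8s} {throughput_mb_s(name, data):8.1f} MB/s")
```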
| coppsilgold wrote:
| sha256 is not slow on modern hardware. openssl doesn't have
| blake3, but here is blake2:
|
|   type          16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes  16384 bytes
|   BLAKE2s256   75697.37k  308777.40k  479373.40k  567875.81k  592687.09k   591254.18k
|   BLAKE2b512   63478.11k  243125.73k  671822.08k  922093.51k 1047833.51k  1048959.57k
|   sha256      129376.82k  416316.32k 1041909.33k 1664480.49k 2018678.67k  2043838.46k
|
| This is with the x86 sha256 instructions: sha256msg1,
| sha256msg2, sha256rnds2
| dralley wrote:
| "modern hardware" deserves some caveats. AMD has supported
| those extensions since the original Zen, but Intel CPUs
| generally lacked them until only about 2 years ago.
| adrian_b wrote:
| For many years, starting in 2016, Intel has supported
| SHA-256 only in their Atom CPUs.
|
| The reason seems to be that the Atom CPUs were compared in
| Geekbench with ARM CPUs, and without hardware SHA the Intel
| CPUs would have obtained worse benchmark scores.
|
| In their big cores, SHA has been added in 2019, in Ice Lake
| (while Comet Lake still lacked it, being a Skylake
| derivative), and since then all newer Intel CPUs have it.
|
| So except for the Intel Core CPUs, the x86 and ARM CPUs
| have had hardware SHA for at least 7 years, while the Intel
| Core CPUs have had it for the last 4 years.
| richardwhiuk wrote:
| If you want a fast hash function (and don't care about its
| cryptographic properties), don't use a cryptographic hash
| function.
| dralley wrote:
| BLAKE3 is actually competitive with non-cryptographic hashes
| like crc32.
| oconnor663 wrote:
| To be fair, it _really_ depends on the platform. There's an
| argument to be made that the platforms where you care about
| the difference are specifically the ones where BLAKE3 is
| slower (no SIMD, no threads).
| adrian_b wrote:
| SHA256 is very fast on most modern CPUs, i.e. all AMD Zen, all
| Intel Atom since 2016, Intel Core Ice Lake or newer, Armv8 and
| Armv9.
|
| I use every day both SHA-256 and BLAKE3. BLAKE3 is faster only
| because it is computed by multiple threads using all available
| CPU cores. When restricted to a single thread, it is slower on
| CPUs with fast hardware SHA-256.
|
| The extra speed of BLAKE3 is not always desirable. The fact
| that it uses all cores can slow down other concurrent
| activities, without decreasing the overall execution time of
| the application.
|
| There are cases when the computation of a hash like SHA-256 can
| be done as a background concurrent activity, or when the speed
| of hashing is limited by the streaming speed of data from the
| main memory or from a SSD, so spawning multiple threads does
| not gain anything and it only gets in the way of other
| activities.
|
| So the right choice between SHA-256 and BLAKE3 depends on the
| application. Both can be useful. SHA-256 has the additional
| advantage that it needs negligible additional code, only a few
| lines are necessary to write a loop that computes the hash
| using the hardware instructions.
| ur-whale wrote:
| One metric that is seldom mentioned for crypto algos is code
| complexity.
|
| I really wish researchers would at least pay lip service to it.
|
| TEA (an unfortunately somewhat weak symmetric cipher) was a very
| nice push in that direction.
|
| TweetNaCl was another very nice push in that direction by djb.
|
| Why care about that metric, you ask?
|
| Well, here are a couple of reasons:
|
| - algo fits in head
| - algo is short -> cryptanalysis likely easier
| - algo is short -> less likely to have a buggy implementation
| - algo is short -> side-channel attacks likely easier to analyse
| - algo fits in a 100 line c++ header -> can be incorporated
|   into anything
| - algo can be printed on a t-shirt, thereby skirting export
|   control restrictions
| - algo can easily be implemented on tiny micro-controllers
|
| etc ...
| oconnor663 wrote:
| We put a lot of effort into section 5.1.2 of
| https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...,
| and the
| complicated part of BLAKE3 (incrementally building the Merkle
| tree) ends up being ~4 lines of code. Let me know what you
| think.
| rstuart4133 wrote:
| > One metric that is seldom mentioned for crypto algos is code
| complexity. ... TEA (an unfortunately somewhat weak symmetric
| cipher) was a very nice push in that direction.
|
| Speck is also a push in that direction [0]. Its code looks to
| be as complex as TEA's (half a page of C), it is blindingly
| fast, yet as far as I know it has no known attacks despite
| being subject to a fair bit of scrutiny. About the only reason
| I can see for it being largely ignored is that it was designed
| by the NSA.
|
| SHA3 is also a simple algorithm. Downright pretty, in fact.
| It's a pity it's so slow.
|
| [0] https://en.wikipedia.org/wiki/Speck_(cipher)
| colmmacc wrote:
| It's very hard to see Blake3 getting included in FIPS, whereas
| SHA256 already is. That's probably the biggest deciding factor
| in whether you want to use it or not.
| Retr0id wrote:
| Blake3 is a clear winner for large inputs.
|
| However, for smaller inputs (~1024 bytes and down), the
| performance gap between it and everything else (blake2, sha256)
| gets much narrower, because you don't get to benefit from the
| structural parallelization.
|
| If you're mostly dealing with small inputs, raw hash throughput
| is _probably_ not high on your list of concerns - In the context
| of a protocol or application, other costs like IO latency
| probably completely dwarf the actual CPU time spent hashing.
|
| If raw performance is no longer high on your list of priorities,
| you care more about the other things - ubiquitous and battle-
| tested library support (blake3 is still pretty bleeding-edge, in
| the grand scheme of things), FIPS compliance (sha256), greater
| on-paper security margin (blake2). Which is all to say, while
| blake3 _is_ great, there are still plenty of reasons not to
| prefer it for a particular use-case.
| LegibleCrimson wrote:
| How does the extended output work, and what's the point of
| extended output?
|
| From what I can see, BLAKE3 has 256 bits of security, and
| extended output doesn't provide any extra security. In this case,
| what's the point of extended output over doing something like
| padding with 0-bits or extending by re-hashing the previous
| output and appending it to the previous output (eg, for 1024
| bits, doing h(m) . h(h(m)) . h(h(h(m))) . h(h(h(h(m))))). Either
| way, you get 256 bits of security.
|
| Is it just because the design of the hash makes it simple to do,
| so it's just offered as a consistent option for arbitrary output
| sizes where needed, or is there some greater purpose that I'm
| missing?
| oconnor663 wrote:
| > From what I can see, BLAKE3 has 256 bits of security, and
| extended output doesn't provide any extra security.
|
| 128 bits of collision resistance but otherwise correct. As a
| result of that we usually just call it 128 bits across the
| board, but yes in an HMAC-like use case you would generally
| expect 256 bits of security from the 256 bit output. Extended
| outputs don't change that, because the internal chaining values
| are 256 bits even when the output is larger.
|
| > extending by re-hashing the previous output and appending it
| to the previous output
|
| It's not quite that simple, because you don't want later parts
| of your output to be predictable from earlier parts (which
| might be published, depending on the use case). You also want
| it to be parallelizable.
|
| You could compute H(m) as a "pre-hash" and then make an
| extended output something like H(H(m)|1)|H(H(m)|2)|... That's
| basically what BLAKE3 is doing in the inside. The advantage of
| having the algorithm do it for you is that 1) it's an "off the
| shelf" feature that doesn't require users to roll their own
| crypto and 2) it's slightly faster when the input is short,
| because you don't have to spend an extra block operation
| computing the pre-hash.
|
| > what's the point of extended output?
|
| It's kind of niche, but for example Ed25519 needs a 512 bit
| hash output internally to "stretch" its secret seed into two
| 256-bit keys. You could also use a BLAKE3 output reader as a
| stream cipher or a random byte generator. (These sorts of use
| cases are why it's nice not to make the caller tell you the
| output length in advance.)
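| That counter construction can be sketched generically (my toy
| version, using SHA-256 as the stand-in pre-hash; BLAKE3's actual
| internals use its own tree mode and skip the separate pre-hash):

```python
import hashlib

def toy_xof(msg: bytes, out_len: int) -> bytes:
    """Extendable output built as H(H(m) || counter) blocks."""
    pre = hashlib.sha256(msg).digest()  # the "pre-hash" H(m)
    out = b""
    counter = 0
    while len(out) < out_len:
        out += hashlib.sha256(pre + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:out_len]

# e.g. stretching one seed into two 256-bit keys, Ed25519-style
stretched = toy_xof(b"secret seed", 64)
key_a, key_b = stretched[:32], stretched[32:]
```

| Each block depends only on the pre-hash and its counter, so the
| blocks can be computed in parallel and the caller never has to
| commit to an output length in advance.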
| 15155 wrote:
| Keccak is my preference. Keccak is substantially easier to
| implement in hardware: fewer operations and no carry propagation
| delay issue because there's no addition.
___________________________________________________________________
(page generated 2023-11-13 23:00 UTC)