[HN Gopher] A 100LOC C impl of memset, that is faster than glibc's
       ___________________________________________________________________
        
       A 100LOC C impl of memset, that is faster than glibc's
        
       Author : Q26124
       Score  : 76 points
       Date   : 2021-11-12 07:43 UTC (15 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | stephc_int13 wrote:
       | A long time ago, as I was working with the Nintendo SDK for the
       | DS console I wondered if the provided memcpy implementation was
       | optimal.
       | 
       | Turned out it was quite slow.
       | 
        | I replaced it with an Intel hand-optimized version written
        | for the StrongARM, and replaced the prefetch opcode with a
        | simple load, because that opcode was not supported by this
        | console's CPU architecture.
       | 
        | The result was 50% faster, which is quite significant for
        | such a low-level, already-optimized routine used extensively
        | in many stages of a game engine.
       | 
        | I think we should never assume that standard implementations
        | are optimal: trust, but verify.
        
         | gavinray wrote:
         | This is brilliant and really interesting/neat, thanks for
         | posting.
        
         | OskarS wrote:
          | Modern compilers have quite a deep understanding of
          | memcpy, and they will recognize the pattern and emit
          | optimal assembly (on x86, probably "rep movsb" or
          | whatever), even if you don't literally call memcpy. This
          | is why the GCC implementation of memcpy is, like, trivial:
          | [1]. The compiler will recognize that this is a memcpy and
          | substitute the better implementation.
         | 
         | I wonder though: it seems to me that memory bandwidth should
         | far and away be the limiting factor for a memcpy, so I would
         | think even a straight-forward translation of the "trivial"
         | implementation wouldn't be that far off from an "optimal" one.
         | I guess memory prefetching would make a difference, but would
         | minimizing the number of loads/stores (or unrolling the loop)
         | really matter that much?
         | 
         | [1]: https://github.com/gcc-
         | mirror/gcc/blob/master/libgcc/memcpy....
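          | 
          | (For illustration, the "trivial" loop in question is
          | essentially the following -- a sketch from memory, not the
          | exact libgcc source:)
          | 
          |       void *memcpy (void *dest, const void *src, size_t len)
          |       {
          |         char *d = dest;
          |         const char *s = src;
          |         while (len--)
          |           *d++ = *s++;   /* compilers pattern-match this */
          |         return dest;
          |       }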
        
           | stephc_int13 wrote:
            | From my experience, good prefetching and pre-alignment
            | change a lot.
            | 
            | Compiler-optimized memcpy is good for small copies that
            | will be inlined, but copying big chunks is another
            | story, and I've seen non-marginal differences depending
            | on the implementation.
           | 
           | The most difficult problem is that each implementation is
           | usually tuned for a specific CPU and might be sub-optimal
           | with a different brand or revision...
        
           | stephc_int13 wrote:
           | If you think that a specific routine or algorithm is memory
           | bound, you should always do a quick benchmark to check this
           | assumption.
           | 
           | In practice everything is memory bound because of course the
           | CPU is faster than memory, but you'd be surprised by how
           | difficult it can be to reach the full CPU capacity.
           | 
           | "Memory bound" or "Network bound" are way too frequently used
           | as poor excuses by lazy coders.
        
           | saagarjha wrote:
           | > on x86, probably "rep movsb" or whatever
           | 
            | Only on recent x86, and with a long list of caveats.
            | Look up discussion about ERMS (enhanced "rep movsb")
            | online.
           | 
           | > I wonder though: it seems to me that memory bandwidth
           | should far and away be the limiting factor for a memcpy, so I
           | would think even a straight-forward translation of the
           | "trivial" implementation wouldn't be that far off from an
           | "optimal" one. I guess memory prefetching would make a
           | difference, but would minimizing the number of loads/stores
           | (or unrolling the loop) really matter that much?
           | 
            | Memory bandwidth is often the limiting factor, but not
            | always. But your simple byte-by-byte loop is not going
            | to get anywhere near saturating it; you'll need to
            | unroll and use vector instructions, which might dispatch
            | slower but copy an order of magnitude more data per
            | instruction.
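            | 
            | (A minimal sketch of "unroll + vectorize", assuming
            | x86-64 with SSE2 and a size that's a multiple of 64; a
            | real memcpy also needs head/tail handling:)
            | 
            |       #include <emmintrin.h>  /* SSE2 */
            |       #include <stddef.h>
            |       
            |       /* 16 bytes per load/store, unrolled 4x. */
            |       void copy64 (char *dst, const char *src, size_t n)
            |       {
            |         for (size_t i = 0; i < n; i += 64) {
            |           __m128i a = _mm_loadu_si128 ((const __m128i *)(src + i));
            |           __m128i b = _mm_loadu_si128 ((const __m128i *)(src + i + 16));
            |           __m128i c = _mm_loadu_si128 ((const __m128i *)(src + i + 32));
            |           __m128i d = _mm_loadu_si128 ((const __m128i *)(src + i + 48));
            |           _mm_storeu_si128 ((__m128i *)(dst + i), a);
            |           _mm_storeu_si128 ((__m128i *)(dst + i + 16), b);
            |           _mm_storeu_si128 ((__m128i *)(dst + i + 32), c);
            |           _mm_storeu_si128 ((__m128i *)(dst + i + 48), d);
            |         }
            |       }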
        
           | josefx wrote:
            | > on x86, probably "rep movsb" or whatever
            | 
            | Sadly I don't have a link, but as far as I remember, rep
            | movsb was always hilariously slow. So memcpy
            | implementations tried to optimize copies using half a
            | page of vector instructions with size and alignment
            | tests, which of course killed the CPU's instruction
            | cache.
        
             | benttoothpaste wrote:
             | Always hilariously slow? That must have been before Ivy
             | Bridge.
        
       | dathinab wrote:
        | But is it still faster when used in the real world instead
        | of benchmarks?
        | 
        | Given its comparatively huge implementation, it probably
        | messes massively with the instruction cache of the rest of
        | your program, or am I overlooking something?
        
       | AreYouSirius wrote:
        | What is not faster than glibc?
        
       | jvanderbot wrote:
       | Is this one of those times where you ignore 2% of the edge cases
       | or legacy compatibilities and get a bunch of extra performance?
        
         | [deleted]
        
       | rurban wrote:
        | I have a memset that's not only 10x faster than glibc's,
        | but also secure. The trick is to bypass the generic asm and
        | let the compiler optimize it, especially with constant
        | sizes and known alignments, and especially with clang.
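        | 
        | (A minimal sketch of the idea, not the actual 10x code: keep
        | the memset visible to the optimizer instead of jumping to
        | opaque asm. With a constant n and known alignment the
        | compiler folds this into a handful of stores; for large n,
        | compilers may still lower the loop back into a libc memset
        | call unless built with -fno-builtin-memset.)
        | 
        |       #include <stddef.h>
        |       
        |       static inline void *memset_inline (void *s, int c, size_t n)
        |       {
        |         unsigned char *p = (unsigned char *)s;
        |         for (size_t i = 0; i < n; i++)
        |           p[i] = (unsigned char)c;
        |         return s;
        |       }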
        
         | errcorrectcode wrote:
          | Don't do this where security matters: a compiler-optimized
          | memset can be elided entirely, whereas explicit_bzero()
          | won't be optimized away.
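          | 
          | (A sketch of the secure-wipe pattern; explicit_bzero is
          | available in glibc 2.25+ and on the BSDs:)
          | 
          |       #include <string.h>
          |       
          |       void wipe_secret (char *key, size_t len)
          |       {
          |         /* memset (key, 0, len) here would be a dead store
          |            that the optimizer may delete once key is no
          |            longer read; explicit_bzero is guaranteed to
          |            perform the writes. */
          |         explicit_bzero (key, len);
          |       }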
        
       | nynyny7 wrote:
        | In all fairness, it needs to be said that the libc
        | implementation has to consider portability to more "exotic"
        | architectures. For example, not every CPU allows unaligned
        | 32-bit or 64-bit writes, and those that do may take a huge
        | penalty for such writes.
        
         | asdfasgasdgasdg wrote:
         | Does glibc not have feature detection and conditional
         | compilation for cases like this? That is surprising to me.
        
           | rwmj wrote:
           | It does. Each subdirectory of sysdeps/ can contain specific
           | implementations per platform, arch, etc. eg: the aarch64
           | assembler memset is:
           | 
           | https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/aar.
           | ..
           | 
           | There's also the "ifunc" mechanism which can be used to make
           | the choice at runtime, eg:
           | 
           | https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/aar.
           | ..
        
         | dolmen wrote:
         | In all fairness we need the fastest memset on every
         | architecture. Whatever the cost of maintenance.
        
         | nynyny7 wrote:
          | What do I mean by "not every CPU allows unaligned 32-bit
          | or 64-bit writes"? Let's test the code (as of commit
          | eac67b6) on a Raspberry Pi 400:
          |       pi@rasppi400:~/memset_benchmark $ uname -a
          |       Linux rasppi400 5.10.63-v8+ #1459 SMP PREEMPT Wed Oct 6 16:42:49 BST 2021 aarch64 GNU/Linux
          |       pi@rasppi400:~/memset_benchmark $ ./bench_memset
          |       size, alignment, offset, libc, local
          |       0, 16, 0, 1237452, 834116, 1.483549,
          |       1, 16, 0, 1612697, 945325, 1.705971,
          |       2, 16, 0, 1779538, 945320, 1.882472,
          |       3, 16, 0, 1557081, 945324, 1.647140,
          |       4, 16, 0, 1779527, 889736, 2.000062,
          |       5, 16, 0, 1557103, 1000940, 1.555641,
          |       6, 16, 0, 1779551, 1000944, 1.777873,
          |       7, 16, 0, 1557111, 1000945, 1.555641,
          |       8, 16, 0, 1334654, 889723, 1.500078,
          |       Bus error
          |       pi@rasppi400:~/memset_benchmark $ gdb ./bench_memset
          |       [...]
          |       (gdb) run
          |       Starting program: /home/pi/memset_benchmark/bench_memset
          |       size, alignment, offset, libc, local
          |       0, 16, 0, 1557105, 722928, 2.153887,
          |       1, 16, 0, 1557103, 889797, 1.749953,
          |       2, 16, 0, 1557107, 889849, 1.749855,
          |       3, 16, 0, 1557108, 889759, 1.750033,
          |       4, 16, 0, 1557117, 889789, 1.749985,
          |       5, 16, 0, 1557110, 889745, 1.750063,
          |       6, 16, 0, 1557116, 889754, 1.750052,
          |       7, 16, 0, 1557110, 889758, 1.750038,
          |       8, 16, 0, 1557109, 889803, 1.749948,
          |       
          |       Program received signal SIGBUS, Bus error.
          |       small_memset (n=<optimized out>, c=<optimized out>, s=0x29690)
          |           at /home/pi/memset_benchmark/src/lib.c:33
          |       33          *((uint64_t *)last) = val8;
        
           | rwmj wrote:
            | Is that 32-bit ARM code on a 64-bit kernel? I thought
            | ARM (since v6) allows unaligned access, although it
            | might have to be emulated through the kernel, which is
            | going to be super-slow.
           | 
           | On SPARC you have no choice, align or die!
        
             | nynyny7 wrote:
              | Yes, it is 32-bit code on a 64-bit kernel. I didn't
              | debug which instruction ultimately causes the bus
              | error.
              | 
              |       pi@rasppi400:~/memset_benchmark $ uname -a
              |       Linux rasppi400 5.10.63-v8+ #1459 SMP PREEMPT Wed Oct 6 16:42:49 BST 2021 aarch64 GNU/Linux
              |       pi@rasppi400:~/memset_benchmark $ file ./bench_memset
              |       ./bench_memset: ELF 32-bit LSB executable, ARM, EABI5
              |       version 1 (SYSV), dynamically linked, interpreter
              |       /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0,
              |       BuildID[sha1]=ebeb69b6cb9664d78c1256a2c862f3d28f11e15e,
              |       with debug_info, not stripped
        
               | nynyny7 wrote:
                | PS: It's STRD, which, as far as I understand the Arm
                | Architecture Reference Manual, always requires word
                | alignment.
                | 
                |       Program received signal SIGBUS, Bus error.
                |       small_memset (n=<optimized out>, c=<optimized out>, s=0x29690)
                |           at /home/pi/memset_benchmark/src/lib.c:33
                |       33          *((uint64_t *)last) = val8;
                |       1: x/i $pc
                |       => 0x11c8c <local_memset+1560>: strd    r0, [r7, #-8]
                |       (gdb) info registers
                |       r0             0x0                 0
                |       r1             0x0                 0
                |       r2             0x0                 0
                |       r3             0x1475              5237
                |       r4             0x5f5e100           100000000
                |       r5             0x11674             71284
                |       r6             0x29690             169616
                |       r7             0x29699             169625
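                | 
                | (A sketch of the usual portable fix: express the
                | possibly-unaligned store through memcpy and let the
                | compiler pick a legal instruction sequence for the
                | target -- a single store where that's allowed.)
                | 
                |       #include <stdint.h>
                |       #include <string.h>
                |       
                |       static inline void store64 (void *p, uint64_t v)
                |       {
                |         memcpy (p, &v, sizeof v);
                |       }
                |       
                |       /* i.e. replace  *((uint64_t *)last) = val8;
                |          with          store64 (last, val8);       */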
        
           | dividuum wrote:
           | Can't look right now, but you might not have benchmarked
           | against libc, but against an optimized version included in
           | Raspbian (https://github.com/simonjhall/copies-and-fills).
           | I'm not sure if that's still active in the latest Raspberry
           | Pi OS releases.
        
         | gpderetta wrote:
         | glibc has hand crafted assembler implementations of memcpy
         | (often specialized for specific size ranges) for many
         | architectures.
        
       | cryptonector wrote:
        | > if (n == 0)
        | >     return s;
        | 
        | That branch is not needed, because memset() is UB if the
        | length is 0, but it's nice that it's safer.
        
         | [deleted]
        
         | Filligree wrote:
          | You wouldn't rather have an arguably very slightly faster
          | memset, with the caveat that it might explode in your
          | face?
        
       | cma wrote:
        | Calls to memset outside of the benchmark may be of
        | heterogeneous sizes, which may heavily affect branch
        | prediction, since every branch here depends on the size.
        | 
        | I'm not saying it would go either way; it's just a big flaw
        | to consider with a benchmarking method that only does
        | repeated calls of the same size.
        | 
        | It is surprising that the GCC version does an integer
        | multiply, if I am reading it right (several cycles, unless
        | it is cheaper for uint32 * char).
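        | 
        | (For reference, the multiply in question splats the fill
        | byte across a word; a shift/or chain is the multiplier-free
        | alternative. A sketch -- which form wins depends on the
        | microarchitecture:)
        | 
        |       #include <stdint.h>
        |       
        |       static inline uint64_t splat_mul (uint8_t c)
        |       {
        |         return (uint64_t)c * 0x0101010101010101ULL;
        |       }
        |       
        |       static inline uint64_t splat_shift (uint8_t c)
        |       {
        |         uint64_t v = c;
        |         v |= v << 8;
        |         v |= v << 16;
        |         v |= v << 32;
        |         return v;
        |       }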
        
         | BeeOnRope wrote:
         | Yes, it's hard to take the benchmark results seriously in light
         | of the failure to use anything other than a single size at a
         | time.
        
           | stephencanon wrote:
           | There's a bunch of things that make benchmarking memset and
           | similar functions really hard:
           | 
            | Measuring the time of _repeated_ small calls to memset
            | usually doesn't make any sense, even when the lengths
            | are heterogeneous; this results in an instruction
            | stream that's almost all memset, but in real use small
            | memsets almost always have lots of "other stuff" mixed
            | in. This can lead you to a suboptimal implementation.
           | 
           | You have to factor in what distribution of sizes actually
           | occurs in a live system. I haven't looked at this for a
           | decade or so, but the last time I checked the majority of
           | system-wide time in memset (on macOS running diverse
           | applications) was spent in length-4096 calls, and the next
           | highest spike was (perversely) length-zero. A system
           | implementation has to balance the whole system's needs; a
           | memset for just your program can certainly do better. Dtrace
           | or similar tooling is invaluable to gather this information.
           | 
           | As with any benchmarking, the only way to actually know is to
           | swap out the implementation and measure real app / system
           | performance. All that said, Nadav's implementation looks
           | pretty plausible. It's branchier than I would like, and
           | doesn't take advantage of specialized instructions for large
           | buffers, but for some input distributions that's a very
           | reasonable tradeoff, and I don't doubt that it's competitive
           | with system memsets.
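            | 
            | (A minimal sketch of a mixed-size harness along these
            | lines; the uniform 0..511 distribution is made up, so
            | substitute one measured from a real workload:)
            | 
            |       #include <stdint.h>
            |       #include <stdio.h>
            |       #include <stdlib.h>
            |       #include <string.h>
            |       #include <time.h>
            |       
            |       int main (void)
            |       {
            |         enum { CALLS = 1 << 20 };
            |         static unsigned char buf[512];
            |         static uint16_t sizes[CALLS];
            |         srand (1);
            |         for (int i = 0; i < CALLS; i++)
            |           sizes[i] = (uint16_t)(rand () % 512);
            |       
            |         struct timespec t0, t1;
            |         clock_gettime (CLOCK_MONOTONIC, &t0);
            |         for (int i = 0; i < CALLS; i++) {
            |           memset (buf, 0, sizes[i]);
            |           __asm__ volatile ("" ::: "memory"); /* keep the stores */
            |         }
            |         clock_gettime (CLOCK_MONOTONIC, &t1);
            |       
            |         double ns = (t1.tv_sec - t0.tv_sec) * 1e9
            |                   + (double)(t1.tv_nsec - t0.tv_nsec);
            |         printf ("%.1f ns/call\n", ns / CALLS);
            |         return 0;
            |       }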
        
             | jeffbee wrote:
             | As far as real-world performance goes, this paper claims
             | (and shows) that code size is the relevant aspect of mem*
             | functions, and concludes that `rep stosb` is optimal in
             | practice, even though it obviously loses to exotic hand-
             | rolled memset and memcmp in microbenchmarks.
             | 
             | https://storage.googleapis.com/pub-tools-public-
             | publication-...
        
               | stephencanon wrote:
               | rep stos _is_ worth using, but that paper makes no
               | mention of it (it does show that in their use rep cmps
               | beat a more complicated memcmp implementation).
        
         | mhh__ wrote:
          | This is why you should benchmark like you test.
          | 
          | Spot regressions early (locally), but make decisions based
          | on the big picture.
        
         | gpderetta wrote:
         | Also repeatedly zeroing the same memory can have different
         | performance characteristics than zeroing different memory each
         | time. Haven't checked what the benchmark does though.
        
       | teddyh wrote:
       | > _The implementation of memset in this repository is around 25%
       | faster than the glibc implementation. This is the geometric mean
       | for sizes 0..512._
        
       | marcodiego wrote:
        | Great! Now benchmark it with all the compiler x architecture
        | combinations supported by glibc.
        
       | jmull wrote:
        | I wonder how well this handles unaligned memory?
        | 
        | That used to be table stakes for this kind of thing, but
        | maybe it doesn't matter much anymore?
        
         | mananaysiempre wrote:
          | On x86 in the P4 era, the best-performing bulk operations
          | essentially required using SIMD, _and_ that SIMD hated you
          | unless you aligned your memory accesses. The result was
          | horrible bloated code to handle leading and trailing data,
          | and thus also a need to split off the implementations for
          | small sizes. The unaligned-access penalty is much lower
          | now, _and_ REP-prefixed operations have microcoded
          | implementations that use the maximum memory access width
          | (which you can't get otherwise without SIMD instructions).
         | 
         | I'm curious about what the referenced code compiles down to,
         | actually, because not only could GCC be auto-vectorizing it, it
         | could be replacing it with a REP STOSQ or indeed _a call to
         | memset_.
        
           | ninkendo wrote:
           | Here's the code in godbolt:
           | 
           | gcc: https://godbolt.org/z/6xG5dKjj9
           | 
           | clang: https://godbolt.org/z/Mh9zozjvK
           | 
            | I'm no asm expert, but there don't look to be many
            | vector instructions in the gcc compilation of this,
            | while the clang compilation seems to make more use of
            | the 128-bit xmm registers (at least on x86_64). You can
            | also just see visibly how many more instructions the gcc
            | version outputs.
        
             | mananaysiempre wrote:
             | Thank you! Indeed GCC does not use SIMD here unless you set
             | -O3 (... I seem to remember this enables some
             | vectorization?) or allow it to use AVX with -mavx or
             | -march=x86-64-v3. For some reason I'm unable to get it to
             | use plain SSE (always available on x86-64) with any -mtune
             | setting or even with -march=x86-64-v2.
        
         | kevin_b_er wrote:
          | Probably poorly. It is a violation to cast a pointer to a
          | type with a stricter alignment requirement than the
          | underlying pointer satisfies. And the code looks like it
          | does just that, right here:
         | https://github.com/nadavrot/memset_benchmark/blob/main/src/l...
         | 
          | This is undefined behavior under C99 §6.3.2.3 paragraph 7:
          | "If the resulting pointer is not correctly aligned for the
          | pointed-to type, the behavior is undefined."
         | 
         | The musl code referenced has handling for this.
        
       | rwmj wrote:
       | There is an interesting related problem - how do you efficiently
       | test if a buffer contains only zeroes? We use this for
       | automatically sparsifying disk images. There's no standard C
       | function for this. My colleague came up with the following nice
       | trick. It reuses the (presumably already maximally optimized)
       | memcmp function from libc:
       | 
       | https://gitlab.com/nbdkit/nbdkit/-/blob/b31859402d1404ba0433...
        |       static inline bool __attribute__((__nonnull__ (1)))
        |       is_zero (const char *buffer, size_t size)
        |       {
        |         size_t i;
        |         const size_t limit = size < 16 ? size : 16;
        |       
        |         for (i = 0; i < limit; ++i)
        |           if (buffer[i])
        |             return false;
        |         if (size != limit)
        |           return ! memcmp (buffer, buffer + 16, size - 16);
        |         return true;
        |       }
        | 
        | (It works because once the first 16 bytes are known to be
        | zero, the overlapping memcmp compares every later byte with
        | the byte 16 positions before it, so by induction the whole
        | buffer must be zero.)
       | 
       | Example usage for sparsifying while copying disk images:
       | https://gitlab.com/nbdkit/libnbd/-/blob/46fa6ecc7422e830f10d...
        
         | brettdeveffect wrote:
          | I'd be curious how this does in practice. Would it make
          | sense to trade off by probing in various places first,
          | since zeroes may be spatially correlated?
        
         | junon wrote:
         | There are a number of standard functions that can achieve this,
         | namely in string.h. Performance is a question of course.
        
           | rwmj wrote:
           | Which functions in particular?
        
         | tehjoker wrote:
          | It would be interesting if there were a way to measure the
          | voltage difference between two memory addresses: if it
          | were equal, the bits would be all ones or all zeros, and
          | then you would just need to read one byte to see which it
          | is. I don't know how practical that is, but it would be a
          | constant-time check.
        
         | secondcoming wrote:
          | I have the same issue. I want to use SIMD to AND large
          | buffers together and also be able to detect whether a
          | buffer is empty (all zeroes) after an AND. It doesn't seem
          | possible to do this without iterating over the entire
          | buffer again, because vpand doesn't affect the EFLAGS
          | register.
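          | 
          | (One way around it, sketched assuming AVX2 and a length
          | that's a multiple of 32: OR the AND results into an
          | accumulator and test it once at the end with VPTEST, which
          | does set ZF:)
          | 
          |       #include <immintrin.h>
          |       #include <stdbool.h>
          |       #include <stddef.h>
          |       #include <stdint.h>
          |       
          |       static bool and_into_all_zero (uint8_t *dst, const uint8_t *a,
          |                                      const uint8_t *b, size_t n)
          |       {
          |         __m256i acc = _mm256_setzero_si256 ();
          |         for (size_t i = 0; i < n; i += 32) {
          |           __m256i v = _mm256_and_si256 (
          |             _mm256_loadu_si256 ((const __m256i *)(a + i)),
          |             _mm256_loadu_si256 ((const __m256i *)(b + i)));
          |           _mm256_storeu_si256 ((__m256i *)(dst + i), v);
          |           acc = _mm256_or_si256 (acc, v);
          |         }
          |         return _mm256_testz_si256 (acc, acc); /* 1 iff acc == 0 */
          |       }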
        
         | dahfizz wrote:
         | I wonder if it would be meaningfully faster if you checked the
         | first 16 bytes as uint64_t or uint128_t instead of byte by
         | byte. It would save you 14 or 15 comparisons per function call.
        
           | rwmj wrote:
           | GCC (-O3) actually unrolls the loop completely into 16 x
           | (cmpb + jne), which I find slightly surprising.
           | 
           | We can't easily use a larger size because we mustn't read
           | beyond the end of the buffer if it's shorter than 16 bytes
           | and not a multiple of 2, 4, etc.
        
         | dolmen wrote:
         | Sources:
         | https://gitlab.com/nbdkit/libnbd/-/blob/46fa6ecc7422e830f10d...
         | 
         | https://listman.redhat.com/archives/libguestfs/2017-April/ms...
        
         | _3u10 wrote:
          | Use the popcnt instruction or the popcnt intrinsic. It
          | counts how many bits are set to 1; you're looking for 0.
          | 
          | You can also cast it into a uint64_t integer and do an
          | equality test. There might be a way to use fused
          | multiply-add.
          | 
          | Also, are you mmapping the file so you can just read it
          | directly as a single buffer? You should be able to madvise
          | to free pages after they've been checked.
          | 
          | PSHUFB can also be used:
          | http://0x80.pl/articles/sse-popcount.html
          | 
          | Essentially you want to vectorize this, though memcmp may
          | already be vectorized and do the CPU detection.
          | 
          | Edit: also... you should be able to load 15 x 256 bits and
          | then test them. Try VPTEST
          | https://www.intel.com/content/www/us/en/develop/documentatio...
        
         | bitwize wrote:
         | > There's no standard C function for this. My colleague came up
         | with the following nice trick.
         | 
         | One of the big things about C is that there is no standard
         | library function for anything remotely nontrivial. So
         | successfully coding in C relies on "tricks", and snippets and
         | lore that have been passed on over the years.
         | 
         | Rust, meanwhile, has a check_for_all_zeroes crate or something.
        
         | omegalulw wrote:
         | Clever, I love it. Maybe I'm just dumb, but it took me a lil
         | bit to convince myself it's correct.
        
         | TonyTrapp wrote:
         | Assuming that vector instructions are available, shouldn't it
         | be much faster to actually compare the buffer contents against
         | a vector register initialized to all-zeros rather than
         | comparing against some other memory? Or would memcmp
         | automatically optimize that away because of the precondition
         | that the first 16 bytes are already known to be 0?
        
           | BeeOnRope wrote:
            | It is probably faster, yes (half the number of reads),
            | but the point of this trick is that you can re-use the
            | (hopefully) vectorized memcmp on every platform with
            | portable code, rather than getting on the SIMD ISA
            | treadmill yourself.
        
             | rwmj wrote:
             | The qemu implementation does indeed do it the hard way.
             | It's a lot of code: https://gitlab.com/qemu-
             | project/qemu/-/blob/master/util/buff...
        
               | pm215 wrote:
               | Interestingly, we used to have a special-case aarch64
               | version, but we dropped it because the C version was
               | faster: https://gitlab.com/qemu-
               | project/qemu/-/commit/2250d3a293d36e...
               | 
               | (Might or might not still be true on more modern aarch64
               | hardware...)
        
               | pm215 wrote:
                | Does your memset version beat QEMU's plain-old-C
                | fallback version?
        
         | tkhattra wrote:
         | just make sure it's not "overly optimized" -
         | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95189 (in this
         | case it was gcc's builtin memcmp that was broken, not glibc's)
         | :)
        
         | yissp wrote:
         | Is the choice of 16 as the "limit" value based on benchmarking?
         | As opposed to just doing something like "!buffer[0] &&
         | !memcmp(buffer, buffer + 1, size - 1)" which uses the same
         | principle.
        
           | BeeOnRope wrote:
           | Not the OP, but 16 has the benefit of keeping both pointers
           | in the comparison 16-byte aligned if the buffer was initially
           | aligned.
           | 
           | This would eliminate split loads and provide a decent
           | speedup.
        
         | [deleted]
        
       | jstanley wrote:
       | I'm curious why the time is so much worse for sizes just slightly
       | larger than 400, but then better again for sizes larger than
       | this?
        
         | MauranKilom wrote:
         | Could be hitting a critical cache stride. See also
         | https://stackoverflow.com/questions/11413855/why-is-transpos...
        
       | nly wrote:
       | I once maintained a project that had a custom memcpy
       | implementation because they didn't want to link to libc.
       | 
       | They assumed aligned writes were safe and the compiler optimized
       | it incorrectly, resulting in memory corruption.
        
       | rasz wrote:
       | memset is something JEDEC SDRAM standard should of implemented on
       | a hardware level back in 1993. Why even bother writing to ram
       | byte by byte when we could of had dedicated command to fill up to
       | whole row (8-16kbit per chip, 8-32KB per DIMM) at a time with
       | _single command_. Safe zero fill memory allocation would be free
       | and standard.
       | 
       | For background: https://faculty-
       | web.msoe.edu/johnsontimoj/EE4980/files4980/m... Since 1993 ram
       | chips have integrated state machines receiving and interpreting
       | higher level commands. They also have wide sense amplifier banks
       | being loaded/stored all at once.
        
         | gpderetta wrote:
          | What's the use of filling your RAM with zeros when the
          | data needs to be in L1, L2, or L3? Unless you are
          | memsetting hundreds of MBs of memory, memset/memcpy in
          | practice needs to be handled by the CPU or something very
          | close to it.
         | 
         | Zen has CLZERO which can clear a cacheline in one go, but not
         | sure how good it is.
        
           | FpUser wrote:
           | >"Unless you are memsetting hundreds of MBs of memory"
           | 
            | Not hundreds, but in one of my apps I do have tens of MB
            | of contiguous cache that has to be zeroed before use /
            | reuse.
        
             | knome wrote:
              | I wonder if someone zeroing enough memory, where the
              | memory is a private anonymous mapping, might use
              | madvise() with MADV_DONTNEED, which in Linux will
              | effectively zero the pages.
              | 
              | It returns the memory to the OS, and later accesses
              | will pagefault, remapping them as zero-filled. It
              | works in pages; sizes smaller than a page result in
              | whatever else is in the same page getting nuked.
              | 
              | If you don't immediately reuse the whole cache, it
              | might spread out the zeroing/remapping over time
              | rather than doing it in a single large go. I imagine
              | some testing would be in order to see whether the
              | syscall plus the mapping changes (these require TLB
              | flushes for the process?) would be cheaper than a
              | straight run of writing zeros at some point.
              | 
              | IIRC, the zeroing is not something you can expect from
              | non-Linux madvise implementations.
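              | 
              | (A sketch, with Linux-specific semantics; addr and len
              | must be page-aligned:)
              | 
              |       #include <stddef.h>
              |       #include <sys/mman.h>
              |       
              |       /* Drop the pages of a private anonymous mapping;
              |          the next touch faults in fresh zero pages. */
              |       static int lazy_zero (void *addr, size_t len)
              |       {
              |         return madvise (addr, len, MADV_DONTNEED);
              |       }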
        
           | convolvatron wrote:
            | PPC has an instruction to 'load' a line while ignoring
            | its previous contents (it just sets up the cache state).
            | It's useful whenever you know you're going to overwrite
            | the whole thing.
        
             | xgkickt wrote:
             | I used dcbz extensively back on the Wii.
        
           | rasz wrote:
            | Such a RAM capability would result in implementing
            | hybrid compressed caches. Why waste a whole cache line
            | storing zeroes when you can have a dedicated compressed
            | representation?
            | 
            | On a similar note, part of ATI's HyperZ (2000)
            | https://en.wikipedia.org/wiki/HyperZ was fast Z clear,
            | today the norm on every GPU.
        
           | vlovich123 wrote:
            | This would be a CPU command that works with the RAM
            | controller, rather than something you control yourself
            | (kernels, to my knowledge, don't talk directly to the
            | controller beyond maybe some basic power management, if
            | that).
            | 
            | There is a definite need to do hundreds of MB: the Linux
            | kernel has a background thread that does nothing but
            | zero out pages. What do you think happens to the GBs of
            | RAM freed by closing Chrome? Once the command is
            | available in one spot, there's no reason others couldn't
            | use it (e.g. a hardened malloc implementation, etc.).
        
             | gpderetta wrote:
              | Interesting that you mention Linux, because Linus has
              | very, very strong opinions about this :)
        
               | xvmt wrote:
               | What is his opinion about this?
        
               | gpderetta wrote:
               | He strongly believes that something like rep stos/rep mov
               | is the right interface for memset/memcpy and off-core
               | accelerators (like DMA) are misguided.
        
               | Something1234 wrote:
               | His reasoning or rant for this?
        
               | monocasa wrote:
                | I'm not sure about Linus's objections, but I've
                | found that DMA accelerators for time-sharing systems
                | with general workloads haven't reaped much benefit,
                | as the overhead of multiplexing and synchronizing
                | with them kills most of their gains. At that point
                | it's easier to blit memory yourself.
        
               | kwertyoowiyop wrote:
               | What are they?
        
             | monocasa wrote:
             | If you find yourself doing this a lot, there's write
             | combining memory to coalesce writes to be more friendly to
             | the RAM controller.
             | 
              | Additionally, CLZERO ends up doing very similar work,
              | since the resulting cache flush is seen by the RAM
              | controller as a block write.
        
         | _zoltan_ wrote:
         | *should have
        
         | mananaysiempre wrote:
         | Modern microcontrollers can have DMA units that you can program
         | to, among other things, do a memset or even a memcpy when the
         | memory bus happens to be idle, and they'll interrupt you when
         | they're done. The design point is different (a microcontroller
         | application can be limited by processor cycles but rarely by
         | memory bus bandwidth), but I still wonder why PCs don't have
         | anything like that.
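          | 
          | (Roughly like this; the register layout below is invented
          | purely for illustration -- real DMA engines, e.g. STM32
          | memory-to-memory mode, differ in the details:)
          | 
          |       #include <stdint.h>
          |       
          |       struct dma_regs {
          |         volatile uint32_t src;   /* source address or fill pattern */
          |         volatile uint32_t dst;   /* destination address */
          |         volatile uint32_t count; /* bytes to transfer */
          |         volatile uint32_t ctrl;  /* bit0 start, bit1 fill, bit2 irq */
          |       };
          |       #define DMA ((struct dma_regs *)0x40020000u) /* made-up address */
          |       
          |       static void dma_fill_async (void *dst, uint8_t c, uint32_t n)
          |       {
          |         DMA->src   = 0x01010101u * c;  /* replicated fill byte */
          |         DMA->dst   = (uint32_t)(uintptr_t)dst;
          |         DMA->count = n;
          |         DMA->ctrl  = 0x7;  /* start, fill mode, interrupt on done */
          |       }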
        
           | com2kid wrote:
            | Programming microcontrollers was such an interesting and
            | different experience; designing code to be asynchronous
            | with regard to memory operations was a whole 'nother
            | level of arranging code.
            | 
            | Likewise for copies from external RAM to internal SRAM:
            | external memory was slow enough compared to the 1-cycle
            | latency of accessing SRAM, and CPU cycles were precious
            | enough, that code copying lots of memory from external
            | memory was designed to stop execution, let other code
            | run, and resume once the copy was finished.
            | 
            | We were able to get some _serious_ speed out of the
            | 96 MHz CPU because we optimized everything around our
            | memory bus.
        
           | drran wrote:
            | Just implement a driver for your memory controller and
            | update all software to use a syscall into the kernel
            | (about 10k total instructions per syscall) that performs
            | the memset or memcpy, then measure the performance
            | improvement and tell us.
        
       | moonchild wrote:
       | Also in assembly, up to 370% the speed of glibc -
       | https://github.com/moon-chilled/fancy-memset
        
       | Jyaif wrote:
       | I think that Duff's device could be used on lines 65-86 (
       | https://github.com/nadavrot/memset_benchmark/blob/eac67b6205... )
        
       ___________________________________________________________________
       (page generated 2021-11-12 23:01 UTC)