[HN Gopher] A 100LOC C impl of memset, that is faster than glibc's
___________________________________________________________________
A 100LOC C impl of memset, that is faster than glibc's
Author : Q26124
Score : 76 points
Date : 2021-11-12 07:43 UTC (15 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| stephc_int13 wrote:
| A long time ago, as I was working with the Nintendo SDK for the
| DS console I wondered if the provided memcpy implementation was
| optimal.
|
| Turned out it was quite slow.
|
| I replaced it with an Intel hand-optimized version made for the
| StrongARM, replacing the prefetch opcode with a simple load
| because that opcode was not supported by this console's CPU.
|
| 50% faster, this is quite significant for such a low-level,
| already optimized routine, used extensively in many stages of a
| game engine.
|
| I think that we should never assume that standard implementations
| are optimal, trust but verify.
| gavinray wrote:
| This is brilliant and really interesting/neat, thanks for
| posting.
| OskarS wrote:
| Modern compilers have quite a deep understanding of memcpy, and
| they will recognize the pattern and emit optimal assembly (on
| x86, probably "rep movsb" or whatever), even if you don't
| literally call memcpy. This is why the GCC implementation of
| memcpy is, like, trivial: [1]. The compiler will recognize that
| this is a memcpy and substitute the better implementation.
|
| I wonder though: it seems to me that memory bandwidth should
| far and away be the limiting factor for a memcpy, so I would
| think even a straight-forward translation of the "trivial"
| implementation wouldn't be that far off from an "optimal" one.
| I guess memory prefetching would make a difference, but would
| minimizing the number of loads/stores (or unrolling the loop)
| really matter that much?
|
| [1]: https://github.com/gcc-
| mirror/gcc/blob/master/libgcc/memcpy....
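The pattern recognition described above can be seen with a plain loop; this is a sketch (the function name is illustrative, not from the thread) of the kind of code GCC's loop-idiom recognition typically replaces with a call to memcpy or a `rep movsb` sequence at -O2:

```c
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* A naive byte-copy loop; optimizing compilers commonly
   recognize this idiom and emit a memcpy call or equivalent
   bulk-copy instructions instead of a byte-at-a-time loop. */
void copy_bytes(char *dst, const char *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```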
| stephc_int13 wrote:
| From my experience, good prefetching and pre-alignment change a
| lot of things.
|
| Compiler-optimized memcpy is good for small copies that will be
| inlined, but copying big chunks is another story, and I've seen
| non-marginal differences depending on the implementation.
|
| The most difficult problem is that each implementation is
| usually tuned for a specific CPU and might be sub-optimal
| with a different brand or revision...
| stephc_int13 wrote:
| If you think that a specific routine or algorithm is memory
| bound, you should always do a quick benchmark to check this
| assumption.
|
| In practice everything is memory bound because of course the
| CPU is faster than memory, but you'd be surprised by how
| difficult it can be to reach the full CPU capacity.
|
| "Memory bound" or "Network bound" are way too frequently used
| as poor excuses by lazy coders.
| saagarjha wrote:
| > on x86, probably "rep movsb" or whatever
|
| Only on recent x86, and with a long list of caveats. Look up
| discussion about erms online.
|
| > I wonder though: it seems to me that memory bandwidth
| should far and away be the limiting factor for a memcpy, so I
| would think even a straight-forward translation of the
| "trivial" implementation wouldn't be that far off from an
| "optimal" one. I guess memory prefetching would make a
| difference, but would minimizing the number of loads/stores
| (or unrolling the loop) really matter that much?
|
| Memory bandwidth is often the limiting factor, but not
| always. But your simple byte-by-byte loop is not going to get
| anywhere near saturating that; you'll need to unroll and use
| vector instructions, which might dispatch slower but copy
| several orders of magnitude more data.
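The widening described above can be sketched in portable C (an illustration of the idea, not the article's implementation): broadcast the fill byte into a 64-bit word and store eight bytes per iteration, with a byte loop for the tail. Real implementations go further with SIMD registers.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Illustrative wide memset: 8 bytes per store instead of 1.
   The wide store goes through memcpy to avoid alignment UB;
   compilers lower it to a single unaligned store where legal. */
void *memset_wide(void *dst, int c, size_t n) {
    unsigned char *p = dst;
    /* Replicate the byte into all 8 lanes of a 64-bit word. */
    uint64_t v = 0x0101010101010101ULL * (unsigned char)c;
    while (n >= 8) {
        memcpy(p, &v, 8);
        p += 8;
        n -= 8;
    }
    while (n--)                      /* 0..7 byte tail */
        *p++ = (unsigned char)c;
    return dst;
}
```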
| josefx wrote:
| > on x86, probably "rep movsb" or whatever)
|
| Sadly I don't have a link, but as far as I remember rep movsb
| was always hilariously slow. So memcpy implementations tried to
| optimize copies using half a page of vector instructions with
| size and alignment tests, which of course killed the CPU's
| instruction cache.
| benttoothpaste wrote:
| Always hilariously slow? That must have been before Ivy
| Bridge.
| dathinab wrote:
| But is it still faster if used in the real world instead of
| benchmarks?
|
| Given its comparatively huge implementation, it probably
| massively messes with the instruction cache of the rest of your
| program, or am I overlooking something?
| AreYouSirius wrote:
| What is not faster than glibc?
| jvanderbot wrote:
| Is this one of those times where you ignore 2% of the edge cases
| or legacy compatibilities and get a bunch of extra performance?
| [deleted]
| rurban wrote:
| I have a memset that's not only 10x faster than glibc, but also
| secure. The trick is to bypass the generic asm, and let the
| compiler optimize it, esp. with constant sizes and known
| alignments. Esp. with clang.
| errcorrectcode wrote:
| Don't do this where security matters: a memset the compiler can
| see through may be optimized away entirely. explicit_bzero()
| won't be optimized away.
| nynyny7 wrote:
| In all fairness, it needs to be said that libc's implementation
| has to consider portability to more "exotic" architectures. For
| example, not every CPU allows unaligned 32-bit or 64-bit writes,
| or it takes a huge penalty for such writes.
| asdfasgasdgasdg wrote:
| Does glibc not have feature detection and conditional
| compilation for cases like this? That is surprising to me.
| rwmj wrote:
| It does. Each subdirectory of sysdeps/ can contain specific
| implementations per platform, arch, etc. eg: the aarch64
| assembler memset is:
|
| https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/aar.
| ..
|
| There's also the "ifunc" mechanism which can be used to make
| the choice at runtime, eg:
|
| https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/aar.
| ..
| dolmen wrote:
| In all fairness we need the fastest memset on every
| architecture. Whatever the cost of maintenance.
| nynyny7 wrote:
| What do I mean by "not every CPU allows unaligned 32-bit or
| 64-bit writes"? Let's test the code (as of commit eac67b6) on a
| Raspberry Pi 400:
        pi@rasppi400:~/memset_benchmark $ uname -a
        Linux rasppi400 5.10.63-v8+ #1459 SMP PREEMPT Wed Oct 6 16:42:49 BST 2021 aarch64 GNU/Linux
        pi@rasppi400:~/memset_benchmark $ ./bench_memset
        size, alignment, offset, libc, local
        0, 16, 0, 1237452,  834116, 1.483549,
        1, 16, 0, 1612697,  945325, 1.705971,
        2, 16, 0, 1779538,  945320, 1.882472,
        3, 16, 0, 1557081,  945324, 1.647140,
        4, 16, 0, 1779527,  889736, 2.000062,
        5, 16, 0, 1557103, 1000940, 1.555641,
        6, 16, 0, 1779551, 1000944, 1.777873,
        7, 16, 0, 1557111, 1000945, 1.555641,
        8, 16, 0, 1334654,  889723, 1.500078,
        Bus error
        pi@rasppi400:~/memset_benchmark $ gdb ./bench_memset
        [...]
        (gdb) run
        Starting program: /home/pi/memset_benchmark/bench_memset
        size, alignment, offset, libc, local
        0, 16, 0, 1557105, 722928, 2.153887,
        1, 16, 0, 1557103, 889797, 1.749953,
        2, 16, 0, 1557107, 889849, 1.749855,
        3, 16, 0, 1557108, 889759, 1.750033,
        4, 16, 0, 1557117, 889789, 1.749985,
        5, 16, 0, 1557110, 889745, 1.750063,
        6, 16, 0, 1557116, 889754, 1.750052,
        7, 16, 0, 1557110, 889758, 1.750038,
        8, 16, 0, 1557109, 889803, 1.749948,
        Program received signal SIGBUS, Bus error.
        small_memset (n=<optimized out>, c=<optimized out>, s=0x29690)
            at /home/pi/memset_benchmark/src/lib.c:33
        33          *((uint64_t *)last) = val8;
| rwmj wrote:
| Is that 32 bit ARM code on 64 bit kernel? I thought ARM
| (since v6) allows unaligned access, although it might have to
| be emulated through the kernel which is going to be super-
| slow.
|
| On SPARC you have no choice, align or die!
| nynyny7 wrote:
| Yes, it is 32-bit code on a 64-bit kernel. I didn't debug which
| instruction ultimately causes the bus error.
        pi@rasppi400:~/memset_benchmark $ uname -a
        Linux rasppi400 5.10.63-v8+ #1459 SMP PREEMPT Wed Oct 6 16:42:49 BST 2021 aarch64 GNU/Linux
        pi@rasppi400:~/memset_benchmark $ file ./bench_memset
        ./bench_memset: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV),
        dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 3.2.0,
        BuildID[sha1]=ebeb69b6cb9664d78c1256a2c862f3d28f11e15e, with debug_info,
        not stripped
| nynyny7 wrote:
| PS: It's STRD, which as far as I understand the Arm Architecture
| Reference Manual always requires word alignment.
        Program received signal SIGBUS, Bus error.
        small_memset (n=<optimized out>, c=<optimized out>, s=0x29690)
            at /home/pi/memset_benchmark/src/lib.c:33
        33          *((uint64_t *)last) = val8;
        1: x/i $pc
        => 0x11c8c <local_memset+1560>:  strd r0, [r7, #-8]
        (gdb) info registers
        r0             0x0        0
        r1             0x0        0
        r2             0x0        0
        r3             0x1475     5237
        r4             0x5f5e100  100000000
        r5             0x11674    71284
        r6             0x29690    169616
        r7             0x29699    169625
| dividuum wrote:
| Can't look right now, but you might not have benchmarked
| against libc, but against an optimized version included in
| Raspbian (https://github.com/simonjhall/copies-and-fills).
| I'm not sure if that's still active in the latest Raspberry
| Pi OS releases.
| gpderetta wrote:
| glibc has hand crafted assembler implementations of memcpy
| (often specialized for specific size ranges) for many
| architectures.
| cryptonector wrote:
| > if (n == 0)
| >     return s;
|
| That branch is not needed, because calling memset() with a
| length of 0 is UB anyway (the pointer must still be valid), but
| it's nice that it's safer.
| [deleted]
| Filligree wrote:
| You wouldn't rather have an arguably very slightly faster
| memset, with the caveat that it might explode in your face?
| cma wrote:
| Calls to memset outside of the benchmark may be of heterogenous
| sizes, which may heavily affect branch prediction since every
| branch relates to size.
|
| I'm not saying it would go either way, just that it's a big flaw
| to consider with the benchmarking method, which only does
| repeated calls of the same size.
|
| It is surprising that the GCC version does an integer multiply,
| if I am reading it right (several cycles, unless it is cheaper
| for uint32 * char).
| BeeOnRope wrote:
| Yes, it's hard to take the benchmark results seriously in light
| of the failure to use anything other than a single size at a
| time.
| stephencanon wrote:
| There's a bunch of things that make benchmarking memset and
| similar functions really hard:
|
| Measuring the time of _repeated_ small calls to memset usually
| doesn't make any sense, even when the lengths are heterogeneous;
| this results in an instruction stream that's almost all memset,
| but for small memsets you almost always have lots of "other
| stuff" mixed in in real use. This can lead you to a suboptimal
| implementation.
|
| You have to factor in what distribution of sizes actually
| occurs in a live system. I haven't looked at this for a
| decade or so, but the last time I checked the majority of
| system-wide time in memset (on macOS running diverse
| applications) was spent in length-4096 calls, and the next
| highest spike was (perversely) length-zero. A system
| implementation has to balance the whole system's needs; a
| memset for just your program can certainly do better. Dtrace
| or similar tooling is invaluable to gather this information.
|
| As with any benchmarking, the only way to actually know is to
| swap out the implementation and measure real app / system
| performance. All that said, Nadav's implementation looks
| pretty plausible. It's branchier than I would like, and
| doesn't take advantage of specialized instructions for large
| buffers, but for some input distributions that's a very
| reasonable tradeoff, and I don't doubt that it's competitive
| with system memsets.
| jeffbee wrote:
| As far as real-world performance goes, this paper claims
| (and shows) that code size is the relevant aspect of mem*
| functions, and concludes that `rep stosb` is optimal in
| practice, even though it obviously loses to exotic hand-
| rolled memset and memcmp in microbenchmarks.
|
| https://storage.googleapis.com/pub-tools-public-
| publication-...
| stephencanon wrote:
| rep stos _is_ worth using, but that paper makes no
| mention of it (it does show that in their use rep cmps
| beat a more complicated memcmp implementation).
| mhh__ wrote:
| This is why you should benchmark like you test.
|
| Spot regressions early (locally), but make the decisions based
| on the big picture.
| gpderetta wrote:
| Also repeatedly zeroing the same memory can have different
| performance characteristics than zeroing different memory each
| time. Haven't checked what the benchmark does though.
| teddyh wrote:
| > _The implementation of memset in this repository is around 25%
| faster than the glibc implementation. This is the geometric mean
| for sizes 0..512._
| marcodiego wrote:
| Great! Now benchmark it with all compiler X architectures
| supported by glibc.
| jmull wrote:
| I wonder how well this handles unaligned memory?
|
| That used to be table stakes for this kind of thing, but maybe
| it doesn't matter much anymore?
| mananaysiempre wrote:
| On x86, in the P4 era, the best-performing bulk operations
| essentially required using SIMD, _and_ that SIMD hated you
| unless you aligned your memory accesses. The result was
| horrible bloated code to handle leading and trailing data and
| thus also a need to split off the implementations for small
| sizes. The unaligned access penalty is much lower now, _and_
| REP-prefixed operations have microcoded implementations that
| use the maximum memory access width (which you can't do
| otherwise without SIMD instructions).
|
| I'm curious about what the referenced code compiles down to,
| actually, because not only could GCC be auto-vectorizing it, it
| could be replacing it with a REP STOSQ or indeed _a call to
| memset_.
| ninkendo wrote:
| Here's the code in godbolt:
|
| gcc: https://godbolt.org/z/6xG5dKjj9
|
| clang: https://godbolt.org/z/Mh9zozjvK
|
| I'm no asm expert, but it doesn't look like a lot of vector
| instructions in the gcc compilation of this, while the clang
| compilation seems to have more calls with the 128-bit xmm
| registers (at least on x86_64.) You can also just see visibly
| how many more instructions the gcc version outputs.
| mananaysiempre wrote:
| Thank you! Indeed GCC does not use SIMD here unless you set
| -O3 (... I seem to remember this enables some
| vectorization?) or allow it to use AVX with -mavx or
| -march=x86-64-v3. For some reason I'm unable to get it to
| use plain SSE (always available on x86-64) with any -mtune
| setting or even with -march=x86-64-v2.
| kevin_b_er wrote:
| Probably poorly. It is a violation to cast a pointer to a type
| whose alignment requirement the pointer does not satisfy. And
| the code looks like it does just that right here:
| https://github.com/nadavrot/memset_benchmark/blob/main/src/l...
|
| This is undefined behavior under C99 §6.3.2.3 paragraph 7: "If
| the resulting pointer is not correctly aligned for the pointed-
| to type, the behavior is undefined."
|
| The musl code referenced has handling for this.
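A common portable fix for the cast called out above (shown here as a generic sketch, not the repository's or musl's actual code) is to route the wide store through memcpy, which carries no alignment requirement and compiles to a single unaligned store on targets that allow one:

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Store a 64-bit value at a possibly-unaligned address without
   undefined behavior: memcpy instead of a pointer cast. The
   compiler typically lowers this to one unaligned store. */
static void store_u64(void *p, uint64_t v) {
    memcpy(p, &v, sizeof v);
}
```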
| rwmj wrote:
| There is an interesting related problem - how do you efficiently
| test if a buffer contains only zeroes? We use this for
| automatically sparsifying disk images. There's no standard C
| function for this. My colleague came up with the following nice
| trick. It reuses the (presumably already maximally optimized)
| memcmp function from libc:
|
| https://gitlab.com/nbdkit/nbdkit/-/blob/b31859402d1404ba0433...
        static inline bool
        __attribute__((__nonnull__ (1)))
        is_zero (const char *buffer, size_t size)
        {
          size_t i;
          const size_t limit = size < 16 ? size : 16;

          for (i = 0; i < limit; ++i)
            if (buffer[i])
              return false;
          if (size != limit)
            return ! memcmp (buffer, buffer + 16, size - 16);
          return true;
        }
|
| Example usage for sparsifying while copying disk images:
| https://gitlab.com/nbdkit/libnbd/-/blob/46fa6ecc7422e830f10d...
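Why the overlapped memcmp is correct: the loop verifies bytes 0..15 are zero directly, and memcmp(buffer, buffer + 16, size - 16) then requires every byte from index 16 onward to equal the byte 16 positions earlier, so the zeros propagate inductively through the whole buffer. A standalone copy of the function (attribute omitted for brevity) can be sanity-checked like this:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Reproduced from the nbdkit source linked above: check the
   first 16 bytes directly, then compare the buffer against
   itself shifted by 16, forcing byte i == byte i-16 for i >= 16. */
static inline bool is_zero(const char *buffer, size_t size) {
    size_t i;
    const size_t limit = size < 16 ? size : 16;
    for (i = 0; i < limit; ++i)
        if (buffer[i])
            return false;
    if (size != limit)
        return !memcmp(buffer, buffer + 16, size - 16);
    return true;
}
```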
| brettdeveffect wrote:
| I'd be curious about this in practice. Would it make sense to
| trade off probing in various places as 0s may be spatially
| correlated?
| junon wrote:
| There are a number of standard functions that can achieve this,
| namely in string.h. Performance is a question of course.
| rwmj wrote:
| Which functions in particular?
| tehjoker wrote:
| It would be interesting if there was a way to measure the
| voltage difference between two memory addresses and if it was
| equal, the bits would be all one or zero and then you just need
| to read one byte to see which it is. I don't know how practical
| that is, but it would be a constant time check.
| secondcoming wrote:
| I have the same issue. I want to use SIMD to do ANDs between
| large buffers and also be able to detect if a buffer is empty
| (all zeroes) after an AND. It doesn't seem possible to do this
| without iterating over the entire buffer again because vpand()
| doesn't affect the eflags register.
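One workaround for this (a scalar sketch of my own, not from the thread) is to OR every result word into an accumulator while performing the AND, so emptiness falls out of the same pass; a SIMD version would do the same with a vector OR accumulator and a single final test such as VPTEST, rather than walking the buffer twice.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* AND src into dst and report whether the result is all zero,
   in a single pass: OR each result word into an accumulator
   instead of re-scanning the buffer afterwards. */
int and_and_test_zero(uint64_t *dst, const uint64_t *src, size_t nwords) {
    uint64_t acc = 0;
    for (size_t i = 0; i < nwords; i++) {
        dst[i] &= src[i];
        acc |= dst[i];
    }
    return acc == 0;   /* 1 if the ANDed buffer is empty */
}
```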
| dahfizz wrote:
| I wonder if it would be meaningfully faster if you checked the
| first 16 bytes as uint64_t or uint128_t instead of byte by
| byte. It would save you 14 or 15 comparisons per function call.
| rwmj wrote:
| GCC (-O3) actually unrolls the loop completely into 16 x
| (cmpb + jne), which I find slightly surprising.
|
| We can't easily use a larger size because we mustn't read
| beyond the end of the buffer if it's shorter than 16 bytes
| and not a multiple of 2, 4, etc.
| dolmen wrote:
| Sources:
| https://gitlab.com/nbdkit/libnbd/-/blob/46fa6ecc7422e830f10d...
|
| https://listman.redhat.com/archives/libguestfs/2017-April/ms...
| _3u10 wrote:
| Use the popcnt instruction or the popcnt intrinsic function.
|
| It counts how many bits are set to 1; you're looking for 0.
|
| You can also cast it to a uint64_t integer and do an equality
| test. There might be a way to use fused multiply-add.
|
| Also, are you mmapping the file so you can just read it directly
| as a single buffer? You should be able to madvise to free pages
| after they've been checked.
|
| PSHUFB can also be used. http://0x80.pl/articles/sse-
| popcount.html
|
| Essentially you want to vectorize this, though memcmp may
| already be vectorized and do the CPU detection.
|
| Edit: also... You should be able to load 15 x 256 bits and then
| test them. Try VPTEST
| https://www.intel.com/content/www/us/en/develop/documentatio...
| bitwize wrote:
| > There's no standard C function for this. My colleague came up
| with the following nice trick.
|
| One of the big things about C is that there is no standard
| library function for anything remotely nontrivial. So
| successfully coding in C relies on "tricks", and snippets and
| lore that have been passed on over the years.
|
| Rust, meanwhile, has a check_for_all_zeroes crate or something.
| omegalulw wrote:
| Clever, I love it. Maybe I'm just dumb, but it took me a lil
| bit to convince myself it's correct.
| TonyTrapp wrote:
| Assuming that vector instructions are available, shouldn't it
| be much faster to actually compare the buffer contents against
| a vector register initialized to all-zeros rather than
| comparing against some other memory? Or would memcmp
| automatically optimize that away because of the precondition
| that the first 16 bytes are already known to be 0?
| BeeOnRope wrote:
| It is probably faster, yes (half the number of reads), but the
| point of this trick is that you can re-use the (hopefully)
| vectorized memcmp on every platform with portable code, rather
| than getting on the SIMD ISA treadmill yourself.
| rwmj wrote:
| The qemu implementation does indeed do it the hard way.
| It's a lot of code: https://gitlab.com/qemu-
| project/qemu/-/blob/master/util/buff...
| pm215 wrote:
| Interestingly, we used to have a special-case aarch64
| version, but we dropped it because the C version was
| faster: https://gitlab.com/qemu-
| project/qemu/-/commit/2250d3a293d36e...
|
| (Might or might not still be true on more modern aarch64
| hardware...)
| pm215 wrote:
| Does your memset version beat QEMU's plain-old-C fallback
| version?
| tkhattra wrote:
| just make sure it's not "overly optimized" -
| https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95189 (in this
| case it was gcc's builtin memcmp that was broken, not glibc's)
| :)
| yissp wrote:
| Is the choice of 16 as the "limit" value based on benchmarking?
| As opposed to just doing something like "!buffer[0] &&
| !memcmp(buffer, buffer + 1, size - 1)" which uses the same
| principle.
| BeeOnRope wrote:
| Not the OP, but 16 has the benefit of keeping both pointers
| in the comparison 16-byte aligned if the buffer was initially
| aligned.
|
| This would eliminate split loads and provide a decent
| speedup.
| [deleted]
| jstanley wrote:
| I'm curious why the time is so much worse for sizes just slightly
| larger than 400, but then better again for sizes larger than
| this?
| MauranKilom wrote:
| Could be hitting a critical cache stride. See also
| https://stackoverflow.com/questions/11413855/why-is-transpos...
| nly wrote:
| I once maintained a project that had a custom memcpy
| implementation because they didn't want to link to libc.
|
| They assumed aligned writes were safe and the compiler optimized
| it incorrectly, resulting in memory corruption.
| rasz wrote:
| memset is something JEDEC SDRAM standard should of implemented on
| a hardware level back in 1993. Why even bother writing to RAM
| byte by byte when we could have had a dedicated command to fill
| up to a whole row (8-16kbit per chip, 8-32KB per DIMM) at a time
| with a _single command_. Safe zero-fill memory allocation would
| be free and standard.
|
| For background: https://faculty-
| web.msoe.edu/johnsontimoj/EE4980/files4980/m... Since 1993 ram
| chips have integrated state machines receiving and interpreting
| higher level commands. They also have wide sense amplifier banks
| being loaded/stored all at once.
| gpderetta wrote:
| What's the use of filling your RAM with zeros when the data
| needs to be in L1, L2 or L3? Unless you are memsetting hundreds
| of MBs of memory, memset/memcpy in practice needs to be handled
| by the CPU or something very close to it.
|
| Zen has CLZERO which can clear a cacheline in one go, but not
| sure how good it is.
| FpUser wrote:
| > "Unless you are memsetting hundreds of MBs of memory"
|
| Not hundreds, but in one of my apps I do have tens of MB of
| contiguous cache that has to be zeroed before use / reuse.
| knome wrote:
| I wonder if someone was zeroing enough memory, where the
| memory is a private anonymous mapping, they might use
| madvise() with MADV_DONTNEED, which in linux will
| effectively zero the pages.
|
| It returns the memory to the OS, and will pagefault on
| later accesses remapping them as zero-filled. It works in
| pages. Sizes smaller than a page result in whatever else is
| in the same page getting nuked.
|
| If you don't immediately reuse the whole cache, it might spread
| out the zeroing/remapping over time, rather than doing it in a
| single large go. I imagine some testing would be in order to see
| whether a syscall plus mapping changes (which require TLB
| invalidations for the process?) would be cheaper than a straight
| run of writing zeros at some point.
|
| IIRC, the zeroing is not something you can expect from non-
| Linux madvise implementations.
| convolvatron wrote:
| PPC has an instruction to 'load' a line ignoring its previous
| contents (just set up the cache state). useful in any case
| when you know you're going to overwrite the whole thing.
| xgkickt wrote:
| I used dcbz extensively back on the Wii.
| rasz wrote:
| Such a RAM capability would result in implementing hybrid
| compressed caches. Why waste a whole cache line on storing
| zeroes when you can have a dedicated compressed representation?
| On a similar note part of ATI 2000
| https://en.wikipedia.org/wiki/HyperZ was fast Z clear, today
| a norm on every GPU.
| vlovich123 wrote:
| This would be a CPU command that works with the RAM
| controller rather than something you control yourself
| (kernels to my knowledge don't talk directly to the
| controller beyond maybe some basic power management, if
| that).
|
| There is a definite need to do hundreds of MB: the Linux kernel
| has a background thread that does nothing but zero out pages.
| What do you think happens to the GBs of RAM freed by closing
| Chrome? And once such a facility is available in one spot, no
| reason others couldn't use it (eg a hardened malloc
| implementation, etc).
| gpderetta wrote:
| Interesting that you mention linux, because Linus has very
| very strong opinions about this :)
| xvmt wrote:
| What is his opinion about this?
| gpderetta wrote:
| He strongly believes that something like rep stos/rep mov
| is the right interface for memset/memcpy and off-core
| accelerators (like DMA) are misguided.
| Something1234 wrote:
| His reasoning or rant for this?
| monocasa wrote:
| I'm not sure about Linus's objections, but I've found
| that DMA accelerators for time sharing systems with
| general workloads haven't reaped benefits, as the
| overhead of multiplexing and synchronizing with them
| kills most of their benefits. At that point it's easier
| to blit memory yourself.
| kwertyoowiyop wrote:
| What are they?
| monocasa wrote:
| If you find yourself doing this a lot, there's write-combining
| memory to coalesce writes to be more friendly to the RAM
| controller.
|
| Additionally, CLZERO ends up doing very similar work, since the
| resulting cache flush is seen by the RAM controller as a block
| write.
| _zoltan_ wrote:
| *should have
| mananaysiempre wrote:
| Modern microcontrollers can have DMA units that you can program
| to, among other things, do a memset or even a memcpy when the
| memory bus happens to be idle, and they'll interrupt you when
| they're done. The design point is different (a microcontroller
| application can be limited by processor cycles but rarely by
| memory bus bandwidth), but I still wonder why PCs don't have
| anything like that.
| com2kid wrote:
| Programming Microcontrollers was such an interesting and
| different experience, designing code to be asynchronous in
| regards to memory operations was a whole 'nother level of
| arranging code.
|
| Likewise for doing copies from external RAM to internal SRAM,
| it was slow enough compared to the 1 cycle latency accessing
| SRAM, and CPU cycles were precious enough, that code copying
| lots of memory from external memory was designed to stop
| execution and let other code run and resume once the copy was
| finished.
|
| We were able to get some _serious_ speed out of the 96 MHz CPU
| because we optimized everything around our memory bus.
| drran wrote:
| Just implement a driver for your memory controller and update
| all software to use syscall into kernel (about 10k total
| instructions per syscall), which will perform memset or
| memcpy, then measure performance improvement and tell it to
| us.
| moonchild wrote:
| Also in assembly, up to 370% the speed of glibc -
| https://github.com/moon-chilled/fancy-memset
| Jyaif wrote:
| I think that Duff's device could be used on lines 65-86 (
| https://github.com/nadavrot/memset_benchmark/blob/eac67b6205... )
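For reference, a Duff's-device byte fill looks like this (a generic sketch of the technique, not a drop-in for the linked lines): the switch jumps into the middle of an 8-way unrolled loop, so the first pass handles count % 8 stores and every later pass handles 8, with no separate tail loop.

```c
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Duff's device applied to a byte-wise fill. The case labels
   live inside the do-while body, which is valid C: the switch
   dispatches into the unrolled loop at the right remainder. */
void duff_fill(char *to, char c, size_t count) {
    if (count == 0)
        return;
    size_t passes = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = c;
    case 7:      *to++ = c;
    case 6:      *to++ = c;
    case 5:      *to++ = c;
    case 4:      *to++ = c;
    case 3:      *to++ = c;
    case 2:      *to++ = c;
    case 1:      *to++ = c;
            } while (--passes > 0);
    }
}
```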
___________________________________________________________________
(page generated 2021-11-12 23:01 UTC)