[HN Gopher] A single line of code made a 24-core server slower t...
___________________________________________________________________
A single line of code made a 24-core server slower than a laptop
Author : Ygg2
Score : 456 points
Date : 2021-12-31 13:40 UTC (9 hours ago)
(HTM) web link (pkolaczk.github.io)
(TXT) w3m dump (pkolaczk.github.io)
| jv22222 wrote:
| Yes! Finally, HN removed the rule to strip the word "How" from
| the title.
| hinkley wrote:
| A few hours later...
| ncmncm wrote:
| The notion of "lock-free" is fundamentally misleading.
|
| It trips up everybody who hopes "lock-free" will be a magic
| bullet freeing them from resource contention bottlenecks.
|
| When you have explicit locks, aka mutexes (condition variables,
| semaphores, what-have-you), the interaction between your threads
| is visible on the surface. Replacing that with "lock-free"
| interaction, you have essentially the same set of operations as
| when taking a mutex. On the up side, overhead cost may be lower.
| On the down side, mistakes show up as subtly wrong results
| instead of deadlocks. And, you get less control, so when you have
| a problem, it is at best harder to see why, and harder to fix.
|
| Because the lock-free operations involve similar hardware bus
| interactions under the covers, they cannot fix contention
| problems. You have no choice but to fix contention problems at a
| higher, architectural design level, by reducing actual
| contention. Having solved the problems there, the extra cost of
| explicit mutex operations often does not matter, and the extra
| control you get may be worth any extra cost.
|
| What is lock-free good for, then? Lock-free components reduce
| overhead when there is no contention. Given any actual
| contention, performance is abandoned in favor of correctness. So,
| if you have mutexes and performance is good, moving to lock-free
| operations _might_ make performance a little better. If
| performance is bad, mutexes and lock-free operations will be
| about equally bad.
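|
| (Illustrative sketch, not from the parent comment: a plain Rust
| comparison of a mutex-guarded counter and a "lock-free" atomic
| counter. The thread and iteration counts are arbitrary. Both
| versions serialize on the same cache line under contention,
| which is the point being made above.)
|
|     use std::sync::atomic::{AtomicU64, Ordering};
|     use std::sync::{Arc, Mutex};
|     use std::thread;
|
|     fn main() {
|         const N: u64 = 1_000_000;
|
|         // Explicit lock: every increment takes the mutex.
|         let locked = Arc::new(Mutex::new(0u64));
|         let handles: Vec<_> = (0..8)
|             .map(|_| {
|                 let c = Arc::clone(&locked);
|                 thread::spawn(move || {
|                     for _ in 0..N {
|                         *c.lock().unwrap() += 1;
|                     }
|                 })
|             })
|             .collect();
|         handles.into_iter().for_each(|h| h.join().unwrap());
|
|         // "Lock-free": every increment is an atomic read-modify-write.
|         // No mutex, but the cache line still bounces between cores,
|         // so the contention does not go away.
|         let lockfree = Arc::new(AtomicU64::new(0));
|         let handles: Vec<_> = (0..8)
|             .map(|_| {
|                 let c = Arc::clone(&lockfree);
|                 thread::spawn(move || {
|                     for _ in 0..N {
|                         c.fetch_add(1, Ordering::Relaxed);
|                     }
|                 })
|             })
|             .collect();
|         handles.into_iter().for_each(|h| h.join().unwrap());
|
|         println!("{} {}", *locked.lock().unwrap(),
|                  lockfree.load(Ordering::Relaxed));
|     }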
| Kranar wrote:
| Don't see anything more misleading about lock free than about
| anything else involving parallelism. Lock freedom has to do
| with guarantees about scheduling: you can be lock free and
| still have contention, but a lock-free algorithm is guaranteed
| to make progress within a bounded amount of time. A blocking
| algorithm is not guaranteed to make any kind of progress within
| a bounded amount of time.
|
| If you're working on a real time system, and I don't mean the
| colloquial meaning of that word, but rather a system that must
| guarantee that a side effect is produced no later than a given
| period of time, then you must use a lock free algorithm, there
| is no alternative. Avionics, DSP, high frequency trading, these
| are all domains where lock free algorithms are necessary.
| Making a fast lock free algorithm is great and generally the
| goal of any kind of parallelism is performance, but "fast" is not
| an unambiguous term: it can mean low latency or high bandwidth,
| and it's very unlikely you can write an algorithm that is both.
|
| Lock free algorithms are great when you need low latency and
| guaranteed throughput. If you don't need that then it's
| possible to trade those properties for much higher bandwidth
| and use blocking algorithms.
| ncmncm wrote:
| If you are counting on lock-free operations to guarantee you
| will meet your real-time scheduling requirements, you have
| chosen to fail, right out of the gate. Guaranteeing met
| deadlines takes engineering. Using lock-free operations is
| not a substitute for engineering. There is no such
| substitute.
|
| Also: All this has nothing to do with parallelism.
| Parallelism is about things happening at the same time.
| Locking or not locking, and contention, are about
| concurrency, which is about synchronization and
| serialization, a different topic.
| Kranar wrote:
| If you are using a mutex to guarantee real-time scheduling
| requirements, you have failed.
|
| Furthermore if you think a discussion about benchmarking an
| algorithm on a 24-core SMP system involving atomic reads,
| writes and shared caches has nothing to do with parallelism
| then let's just agree to disagree with one another and I
| wish you all the best in your programming endeavors.
|
| People can decide for themselves what credibility they wish
| to lend to your argument given your perspective on this
| matter.
| duped wrote:
| Unfortunately I think all this comment does is muddy the waters
| even more. All "lock free" means is that one process(+) may not
| cause any other process to block during execution.
|
| "Contention" is not very meaningful to a lock-free algorithm,
| since lock-free algorithms typically do not permit shared data
| access semantics. Rather they describe how data is _sent
| between_ processes, which is why a lock-free solution looks
| very different than something behind mutexes.
|
| But to the heart of your point, I do wish people spent more
| time understanding that lock-free vs lock-ful is a distinction
| in semantics and doesn't imply anything about performance.
| There are subtle trade offs.
|
| (+) I'm using "process" with a lot of hand waving, assume it
| means whatever your atom of parallelism is
| ncmncm wrote:
| You are confusing two fundamentally different topics.
| Anywhere there is no possibility of contention for access to
| a resource, "lock-free" is trivially meaningless.
| twoodfin wrote:
| In situations like this, it's valuable to know about all the
| event types perf can sample beyond CPU cycles, including cache
| misses at various levels and branch mis-predictions. These can go
| right into flamegraphs as well.
| Thiez wrote:
| Is this the kind of situation where assigning a core affinity to
| the various threads would have made a difference (to make sure
| they all run on the same physical processor), or is the mere
| presence of another processor enough to trigger the behavior
| shown in the article?
| loeg wrote:
| No, that would not make a significant difference here. The
| problem is that the same line has to bounce around all of the
| threads that share it; pinning doesn't reduce that very much on
| this time-scale. Thread migrations will be very infrequent
| relative to atomic operations, especially when the number of
| runnable threads is lower than the number of physical cores.
| dagmx wrote:
| Core affinity would only help if he didn't have more threads
| than clusters. Otherwise he'd either hit the wall his laptop
| did, or cross the cluster and hit the cache issues
| [deleted]
| hermitdev wrote:
| I don't think core affinity would have helped in this
| situation. They had one instance that was being shared across
| all cores, necessitating the reference count to be synced
| between sockets at the L3 cache level. They may have seen some
| benefit if they had, say, 2 instances, one per CPU socket, and
| dedicated an instance per socket to avoid cross-socket L3 cache
| traffic. But, in this case, it is far simpler, cleaner and
| better performing to not share the instance and avoid the
| reference counting entirely.
| [deleted]
| w_t_payne wrote:
| Shared-nothing dataflow models are an interesting way to engineer
| (some types of) parallelism.
| bob1029 wrote:
| > The fact that our reference counters are atomic is an
| additional problem that makes things even more complex for the
| processor. Although using atomic instructions is often referred
| to as "lockless programming", this is slightly misleading - in
| fact, atomic operations require some locking to happen at the
| hardware level. This locking is very fine-grained and cheap as
| long as there is no congestion, but as usual with locking, you
| may expect very poor performance if many things try to fight for
| the same lock at the same time. And it is of course much worse if
| those "things" are whole CPUs and not just single cores that are
| close to each other.
|
| Contention is the devil and you can _never_ hide from it if you
| bring it into your life. The fastest "contention avoidance hack"
| would be the CAS operation as mentioned above, which is still
| roughly two orders of magnitude slower than the equivalent
| single-threaded operation in the most adverse scenarios.
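|
| (A minimal sketch of the CAS retry loop being referred to; the
| function name and thread/iteration counts are made up for the
| example. Under heavy contention most attempts fail and retry,
| and every attempt drags the cache line to the current core.)
|
|     use std::sync::atomic::{AtomicU64, Ordering};
|     use std::sync::Arc;
|     use std::thread;
|
|     fn add_cas(counter: &AtomicU64, delta: u64) {
|         let mut current = counter.load(Ordering::Relaxed);
|         loop {
|             match counter.compare_exchange_weak(
|                 current,
|                 current + delta,
|                 Ordering::AcqRel,
|                 Ordering::Relaxed,
|             ) {
|                 Ok(_) => return,
|                 // Lost the race: retry with the freshly observed value.
|                 Err(observed) => current = observed,
|             }
|         }
|     }
|
|     fn main() {
|         let counter = Arc::new(AtomicU64::new(0));
|         let handles: Vec<_> = (0..8)
|             .map(|_| {
|                 let c = Arc::clone(&counter);
|                 thread::spawn(move || {
|                     for _ in 0..100_000 {
|                         add_cas(&c, 1);
|                     }
|                 })
|             })
|             .collect();
|         handles.into_iter().for_each(|h| h.join().unwrap());
|         println!("{}", counter.load(Ordering::Relaxed));
|     }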
|
| If you aren't sure of which problems actually "fit" on a modern
| x86 thread these days, rest assured that many developers in
| fintech have been able to squeeze entire financial exchanges onto
| one:
|
| https://lmax-exchange.github.io/disruptor/disruptor.html
|
| For most scenarios, you really don't need all those damn cores.
| They are almost certainly getting in the way of you solving your
| problems.
| hinkley wrote:
| I believe Rust is pulling on the right thread here but I have
| no idea how this will look in fifteen years. At some point we
| will need to be able to tell that two threads don't contend and
| thus can be "farther away" from each other, or that two threads
| are constantly communicating and need to use the same core
| complex and/or memory banks.
| dagmx wrote:
| I think it could be something built upon lifetime analysis. I
| think a sufficiently complex compiler could use the data
| available already today to infer thread affinity and other
| rules.
| hinkley wrote:
| Basically what I was thinking. In the near term it doesn't
| even have to be that precise to have a benefit (Rust builds
| on escape analysis that helped with distributed garbage
| collection, even when it wasn't that exhaustive). Even
| ranking threads by suspected degree of interaction is a lot
| of data.
|
| These definitely don't talk. Offload. These definitely
| talk. Colocate. These probably talk. Bin pack.
| cjfd wrote:
| "For most scenarios, you really don't need all those damn
| cores. They are almost certainly getting in the way of you
| solving your problems." This rings true. I still think the most
| promising multi-threading strategy is to, by default, keep
| everything on one core and if you find some expensive and more
| or less independent workload to put that in another thread.
| This is the old-fashioned way though. All of the cool kids will
| push for thread pools where every workload will jump from core
| to core and back again.... This potentially can require a lot
| of locking, though, which also is not good for performance.
| zmmmmm wrote:
| It's an interesting reflection then on the Actor paradigm
| where you are encouraged to create large numbers of fine
| grained tasks and toss them into a giant thread pool blender
| because light weight threads are "free" now. It would be
| quite interesting to see the consequences of thread affinity
| loss in that setting.
| cecilpl2 wrote:
| Games are a prime example of systems which must take advantage
| of all possible hardware resources.
| bob1029 wrote:
| The fact that we believe this is probably part of the problem
| with game development today.
| dagmx wrote:
| Games similarly struggle with multi core/ccx designed chips.
|
| Battlefield 4 was famously terrible on Ryzen when AMD first
| released it.
| littlecranky67 wrote:
| To be fair, Ryzen mostly underperformed when it was
| released, which was not AMD's fault. When for years engineers
| used Intel as their benchmarking and testing machines and
| were tweaking all sorts of parameters to measure against
| Intel only, you inevitably end up with an optimized
| implementation that resembles the hardware details of that
| chip in some way. Any new architecture would perform
| badly, unless it is a perfect clone of the original
| testing hardware.
| dagmx wrote:
| I wasn't implying it was AMD's fault, but the idea that
| the games were optimized for Intel doesn't hold true IMHO
| because they were also built for consoles that were AMD-based.
|
| The majority of the performance gap was down to not
| accounting for CCX scheduling.
| PaulDavisThe1st wrote:
| Another interesting example is audio synthesis. It is
| interesting because there are at least two models for audio
| synthesis: single sample processing (as used in, for example,
| VCV Rack) and block structured processing (as used by every
| plugin API and thus every DAW-that-runs-plugins). But it gets
| even more interesting because the degree of potential
| parallelism is entirely controlled by the signal routing
| pathways chosen by the user. If the software is processing N
| independent signal pathways that do not meet until a final
| output stage, you have a high (well, "N") level of potential
| parallelism, and if N can expand to a high value then you will
| always be able to benefit from more cores.
|
| However, if the signal pathway is more convoluted (no pun
| intended for my audio nerd crew), then not only does the degree
| of parallelism decrease but the issue of single-sample vs.
| block structured can become much more important. In the single
| sample case, the cost of computing a single sample has
| (relatively) enormous fixed overhead if the code has any
| serious level of modularity, but is still very cheap compared
| to inter-thread and inter-core synchronization. Spreading the
| computation across cores will frequently, but not always, slow
| things down quite dramatically. By contrast, block structured
| processing has a much lower per-sample cost, because the
| function call tree is invoked only once for an entire block of
| samples, and so the relative cost of synchronization decreases.
|
| This sounds like a slam dunk for block structured processing.
|
| Problem is, there are things you cannot do correctly with block
| structured processing, and so there is always a place for
| single sample processing. However, the potential for
| parallelization and the level at which it should take place
| differs significantly between the two processing models, which
| opens the door to some highly questionable design.
|
| The short version of this is that audio synthesis in the
| abstract can always expand to use any number of cores,
| particularly for cpu-intensive synthesis like physical
| modelling, but that in the presence of single-sample
| processing, the cost of synchronization may completely negate
| the benefits of parallelism.
| spacechild1 wrote:
| Great write up!
|
| What I like about Pd is that you can freely reblock and
| resample any subpatch. Want some section with single-sample-
| feedback? Just put a [block~ 1]. You can also increase the
| blocksize. Usually, this is done for upsampling and FFT
| processing. Finally, reblocking can be nested, meaning that
| you can reblock to 1024 samples and inside have another
| subpatch running at 1 sample blocksize.
|
| SuperCollider, on the other hand, has a fixed global
| blocksize and samplerate, which I think is one of its biggest
| limitations. (Needless to say, there are many things it does
| better than Pd!)
|
| ---
|
| In the last few days I have been experimenting with adding
| multi-threading support to Pd
| (https://github.com/Spacechild1/pure-data/tree/multi-
| threadin...). With the usual blocksize of 64 samples, you can
| definitely observe the scheduling overhead in the CPU meter.
| If you have a few (heavy-weight) subpatches running in
| parallel, the overhead is negligible. But for [clone] with a
| high number of (light-weight) copies, the overhead becomes
| rather noticeable. In my quick tests, reblocking to 256
| samples already reduces the overhead significantly, at the
| cost of increased latency, of course.
|
| ---
|
| Also, in my plugin host for Pd/Supercollider
| (https://git.iem.at/pd/vstplugin/) I have a multi-threading
| and bridging/sandboxing option. If the plugin itself is
| rather lightweight and the blocksize is small, the scheduling
| overhead becomes quite noticeable. In Pd you can just put
| [vstplugin~] in a subpatch + [block~]. For the SuperCollider
| version I have added a "reblock" argument to process the
| plugin at a higher blocksize, at the cost of increased
| latency.
| michaelrmmiller wrote:
| To add onto this very good explanation from Paul, in
| practice, audio processing graphs always rapidly decay into
| serial processing after starting out massively parallel. Most
| of the time, we are writing to a single output (i.e. speakers
| via your audio interface or a file). On top of that, some of
| the heaviest weight DSP happens on this final output stage
| (mastering chains with multiband compressors, linear phase
| equalizers, limiters etc.) So every 5.3ms (256 sample buffer
| @ 48kHz sample rate) we start massively parallel processing
| all the leaf nodes (audio tracks with sound files, virtual
| instruments and synths) and end up bottlenecking as the tree
| collapses into a line. Then we are stuck doing some of the
| most CPU intensive work on a single core since we can't
| process the limiter DSP plug-in until the EQ finishes its
| work, for example.
|
| We need to meet our real-time deadline or risk dropping
| buffers and making nasty pops and clicks. That mastering
| stage can pretty easily be the limiting (hah) step that
| causes us to miss the deadline, even if we processed hundreds
| of tracks in parallel moments before in less time.
|
| The plug-in APIs (AudioUnits, VSTs, AAX) which are
| responsible for all the DSP and virtual instruments are also
| designed to process synchronously. Some plug-ins implement
| their own threading under the hood but this can often get in
| the way of the host application's real-time processing. On
| top of that, because the API isn't designed to be
| asynchronous, the host's processing thread is tied up waiting
| for the completed result from the plug-in before it can move
| on.
|
| Add on that many DSP algorithms are time-dependent. You can't
| chop up the sample buffer into N different parts and process
| them independently. The result for sample i+1 depends on
| processing sample i first.
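|
| (Illustrative sketch, not from the parent comment: a one-pole
| low-pass filter in Rust, with made-up names, showing why such a
| loop is inherently serial -- each output sample depends on the
| previous one.)
|
|     // y[i] = y[i-1] + coeff * (x[i] - y[i-1])
|     fn one_pole_lowpass(buffer: &mut [f32], coeff: f32) {
|         let mut state = 0.0f32;
|         for sample in buffer.iter_mut() {
|             // Depends on the state produced by the previous sample,
|             // so the buffer cannot be split across threads.
|             state += coeff * (*sample - state);
|             *sample = state;
|         }
|     }
|
|     fn main() {
|         let mut block = vec![1.0f32; 256]; // one 256-sample buffer
|         one_pole_lowpass(&mut block, 0.1);
|         println!("{}", block[255]);
|     }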
| cletus wrote:
| The mantra I live by is that multi-threaded programming is so
| potentially complex and so easy to screw up (eg by erasing all
| your supposed performance gains as was the case in this article,
| by introducing subtle bugs or, worse, security flaws) that you
| should do absolutely everything you can to avoid it.
|
| IME C++ programmers seem to put an unreasonable amount of faith
| in RAII to "solve" these problems, eg:
|
|     std::mutex m;
|     ...
|     {
|         std::lock_guard<std::mutex> lg(m);
|         // do stuff with the mutex lock held
|     } // lock_guard destructor releases mutex
|
| If you're just updating a (private) map or something, this is
| fine. But as soon as this gets complex (eg by calling a function
| and you lose control over what thread is executing because some
| deep code creates a new thread) then this all turns to crap.
|
| So whenever anyone tells you "just start a thread" without any
| hesitation, it's reasonably safe to assume they have no idea what
| they're talking about.
|
| This is one reason I love the async code model used in a number
| of languages. I have the most experience with Hack [1] but it's
| not unique to that. What I tend to think of as "product" code
| (which is a real Facebook-ism) has no need to ever create a
| thread or deal with any concurrency issues.
|
| Now if you're programming an HTTP server or an RDBMS or other
| infrastructure then sure, you have a valid need. But if you're
| just serving HTTP requests, just avoid it entirely if you can
| because any performance gain is probably imaginary but the
| complexity increase isn't.
|
| [1]: https://docs.hhvm.com/hack/asynchronous-
| operations/introduct...
| otabdeveloper4 wrote:
| Don't call functions (that aren't part of the C++ standard
| library) inside the lock guard. It's not a hard rule and you'll
| get 99% of the way there.
| Thiez wrote:
| Isn't that mantra a bit too pessimistic? After fixing the
| performance issue, a speedup that was almost linear in the
| number of cores was achieved using multithreading.
|
| In addition, async code can bring many of the same challenges
| as multithreading. For example, in C# tasks may run on the
| current thread, but they may also run on a thread-pool, and
| this is mostly transparent to the programmer. When you don't
| immediately await all asynchronous operations your tasks may
| end up running at the same time.
| cletus wrote:
| Like anything you weigh up the upside against the downside.
|
| The downside is more complex code, code that's harder to
| reason about, the possibility of concurrency bugs, slower
| development times, the possibility of security flaws and you
| may just be shifting your performance bottleneck (eg creating
| a hot mutex).
|
| The upside is lower latency, more throughput and higher QPS.
|
| My argument is quite simply YAGNI. A lot of these benefits
| are for many people purely imaginary. They're the very
| definition of premature optimization. Code that is less
| complex, faster to develop in and likely "good enough" is
| probably going to help you way more than that.
|
| If you get to the scale where heavy multithreading actually
| matters, that's a good problem to have. My argument is that
| many people aren't there and I would hazard a guess that a
| contributing factor to not getting there is worrying about
| this level of performance before they need to.
| thewarrior wrote:
| This model is good for I/O heavy code that is typical for most
| web services. However if your load is a mix of I/O and compute
| something like Go should do a lot better.
| cletus wrote:
| Not sure why you're being downvoted because you raise a valid
| point: Go has a concurrency model aimed at avoiding many of
| these problems. This is of course the channel abstraction.
|
| Go has other issues. My first piece of advice to any new Go
| programmer is make every single one of your channels
| unbuffered. This way it operates a lot like the async model I
| mentioned and it tends to be easy to reason about.
|
| But people fall into a trap of thinking adding a buffer will
| improve concurrency and throughput. This might be true but
| most of the time it's a premature optimization. Buffered
| channels also often lead to obscure bugs and a system that's
| simply harder to reason about.
|
| It's really better to think of Go as a tool for organizing
| concurrent code, not necessarily a direct solution to the
| problems involved. But organization matters.
| thewarrior wrote:
| All concurrency tools are foot guns but go seems to be the
| safest foot gun :)
| FpUser wrote:
| >"But if you're just serving HTTP requests, just avoid it
| entirely if you can because any performance gain is probably
| imaginary but the complexity increase isn't"
|
| I serve HTTP requests offering JSON based RPC in my C++ servers
| that do all kinds of calculations (often CPU consuming) in
| multiple threads. Works just fine and scales well. Yes, I have
| to be careful about how I work with shared data, but it is
| centralized, encapsulated and done in a single place I call the
| request router / controller. I reuse it among projects.
| Nothing too exciting about it and no reason to be scared to
| death ;)
| amichal wrote:
| Would this tool have scaled as well as 24 separate OS
| processes?
|
| I remember hitting this same issue many years ago. Some primitive
| in the library (C++) didn't, on the surface, appear to involve
| shared state and a cache write lock, but of course it did if you
| thought about it. Since I switched over to more application-level
| coding I don't work at that low a level very much and tend to scale
| with processes now (typically not needing that much parallelism
| anyway), so I don't often have to think these things through
| anymore.
| 10000truths wrote:
| The problem outlined in the article is not with process vs
| thread, the problem is with multiple parallel execution
| contexts reading and writing the same cache line. This issue
| would have manifested in the same way if your separate
| processes touched the same offset in a shared memory mapping.
|
| Of course, designing with multiple processes in mind helps to
| design a shared-nothing architecture, as the costs of shared
| state between processes are much more explicit. But _any_ kind
| of state that needs to be shared will run into the same
| bottleneck, because all forms of shared IPC fundamentally have
| to use some form of locking somewhere in the stack to serialize
| incoming inputs.
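|
| (A minimal Unix-only sketch of that claim, assuming the `libc`
| crate: an AtomicUsize placed in a MAP_SHARED anonymous mapping
| is hammered by parent and child after fork(), so two *processes*
| still fight over the same cache line. Error handling is mostly
| omitted.)
|
|     use std::sync::atomic::{AtomicUsize, Ordering};
|
|     fn main() {
|         unsafe {
|             let len = std::mem::size_of::<AtomicUsize>();
|             // Anonymous shared mapping, zero-initialized by the kernel.
|             let ptr = libc::mmap(
|                 std::ptr::null_mut(),
|                 len,
|                 libc::PROT_READ | libc::PROT_WRITE,
|                 libc::MAP_SHARED | libc::MAP_ANONYMOUS,
|                 -1,
|                 0,
|             );
|             assert_ne!(ptr, libc::MAP_FAILED);
|             let counter = &*(ptr as *const AtomicUsize);
|
|             match libc::fork() {
|                 0 => {
|                     // Child: bump the shared counter in a tight loop.
|                     for _ in 0..10_000_000 {
|                         counter.fetch_add(1, Ordering::Relaxed);
|                     }
|                     libc::_exit(0);
|                 }
|                 child => {
|                     // Parent: same cache line, (likely) another core.
|                     for _ in 0..10_000_000 {
|                         counter.fetch_add(1, Ordering::Relaxed);
|                     }
|                     libc::waitpid(child, std::ptr::null_mut(), 0);
|                     println!("total = {}", counter.load(Ordering::Relaxed));
|                 }
|             }
|         }
|     }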
| lmeyerov wrote:
| I think the intuition on processes vs threads was right:
|
| 1. The Arc use is a case (I believe! Am not a Rust'er!) of
| true sharing of references. The cache line will be contended
| across cores. Using processes would have eliminated true
| sharing. Tracking logical core affinity for references via a
| modal type system could have flagged this at compile time,
| but I suspect that's not a thing in Rust land.
|
| 2. Even if it was false sharing at the cacheline level,
| process abstractions in shared memory systems are generally
| aligned to avoid false sharing at cacheline granularity (and
| misaligned ops at smaller granularities). Of course, in the
| extreme, a 10X hero programmer may have tried to do amazing
| levels of bit packing and syscall hackery and thus break all
| those niceties, but that's the cleanup that everyone else on
| the team has to do to make the hero a 10Xer.
| loeg wrote:
| Probably yes, unless the author went to great lengths to
| contend on shared memory segments.
| wabain wrote:
| This is an excellent writeup, but I'm curious about how dependent
| the microbenchmark is on the scripting engine doing no real work.
| My off-the-cuff guess is that the unshared version might still
| perform better with other workloads (it would have higher
| throughput and avoid variance caused by fluctuating degrees of
| cache contention) but it might be that the observable overhead
| from hammering a single cache line would disappear if each thread
| had more varied work to do. And of course if the scripting engine
| benefits substantially from caching internal state across
| invocations (things like incrementally optimized JIT code) then
| having 24 instances can incur substantial extra work.
| dathinab wrote:
| Has anyone any idea where the `` format strings come from??
|
| As far as I know rust doesn't have them.
|
| And as far as I know if rust gets format strings it most likely
| get them in form of a `f` prefix (like raw strings have an `r`
| prefix and binary strings have a `b` prefix).
|
| Is it just a thing to make the code more readable/shorter on the
| web?
| metabagel wrote:
| Markdown uses backticks around code.
|
| https://docs.github.com/en/github/writing-on-github/getting-...
| cdirkx wrote:
| The code shown in the article isn't Rust, but Rune [0], which
| has similar syntax, but is a dynamic scripting language. It has
| template literals [1].
|
| [0] https://rune-rs.github.io/
|
| [1] https://rune-rs.github.io/book/template_literals.html
| dathinab wrote:
| Oh, that makes sense.
| michalhosna wrote:
| If you are talking about "Here is an example of a complete
| workload that measures performance of selecting rows by random
| keys", then you are looking at Rune code rather than Rust code.
|
| See https://rune-rs.github.io/book/template_literals.html
| johnisgood wrote:
| Can anyone TL;DR it? A single line, in Rust, can make a 24-core
| server slower than a laptop? What line is that?
| jkelleyrtp wrote:
| In a multithreaded context, shared data was shared via atomic
| reference counting. The atomic operation acted as a single
| bottleneck for every operation. The solution was to derive deep
| cloning for the shared data (a 1 line fix) to reduce atomic
| contention.
| PhileinSophia wrote:
| vlovich123 wrote:
| Is there any tooling that can demonstrate cache coherency
| hotspots like cache line bouncing or inter-core communication?
| mhh__ wrote:
| All of the "serious" profilers support this. Intel have a bunch
| of tooling for diagnosing threading issues, AMD also have some
| less polished but still functional metrics you can use.
| vlovich123 wrote:
| Do they track it down to the actual problem point in the
| code? Or do you just see that maybe there's a problem and
| then experimenting with changes to figure out what needs to
| be fixed?
| mhh__ wrote:
| Short answer: yes. Much better than perf does.
| maxwell86 wrote:
| Yes. Linux perf has a c2c hit modify "mode"/"option" that will
| report precisely this issue.
| loeg wrote:
| What is c2c? Thanks.
| tkhattra wrote:
| c2c stands for cache-2-cache.
|
| perf-c2c allows you to measure cacheline contention - see
| https://man7.org/linux/man-pages/man1/perf-c2c.1.html and
| https://joemario.github.io/blog/2016/09/01/c2c-blog/.
| eptcyka wrote:
| Cache to cents? Kernels 2 cents on how your code is
| utilizing the cache? I have 0 clue too.
| masklinn wrote:
| Cache to Cache,
|
| It tracks a bunch of cache-related counters, and HITM
| (hit modify) tracks loads of data modified by either another
| core of the same node ("local hitm") or by another
| node entirely ("remote hitm").
|
| Here, the contention would have shown up as extreme
| amounts of both, as each core would almost certainly try
| atomically updating a value which had just been updated
| by another core, with higher-than-even chances that the value
| had been updated by a core on the other socket (= node).
| gjulianm wrote:
| I've used the Intel VTune Profiler, I think it's free. It's a
| pretty good tool for profiling code and gives you quite a lot
| of information on how you're using the core pipelines, but you
| need an Intel processor to use it. I think it would have
| quickly flagged this kind of problem and told you that the
| processors were stalled waiting on cache.
| mhh__ wrote:
| VTune is free now. It's by far the best profiler I've
| ever come across; AMD's equivalent is a bit rough and ready
| by comparison.
| game_the0ry wrote:
| Relevant video - Scott Meyers: Cpu Caches and Why You Care
|
| https://www.youtube.com/watch?v=WDIkqP4JbkE
|
| Remember, boys and girls, small mistakes can have big
| consequences.
|
| Keep your data in immutable data structures when multi-
| threading. The actor model also helps.
| teen wrote:
| That code? Thread.sleep(5000)
| errcorrectcode wrote:
| My laptop (i7-8650U) has slightly faster single-core performance
| than my 96-thread dual 7402 EPYC Milan home server. I'm fine with
| that. I'm not fine with undue inefficiencies, i.e., excessive
| cache misses, branch mis-predictions, and unparallelization due
| to sloppy or unfortunate code.
| kentonv wrote:
| An aside:
|
| > L3 cache is shared by all cores of a CPU.
|
| I recently learned this is no longer true! AMD EPYC processors
| (and maybe other recent AMDs?) divide cores into groups called
| "core complexes" (CCX), each of which has a separate L3 cache. My
| particular processor is 32-core where each set of 4 cores is a
| CCX. I discovered this when trying to figure out why a benchmark
| was performing wildly differently from one run to the next with a
| bimodal distribution -- it turned out to depend on whether Linux
| has scheduled the two processes in my benchmark to run on the
| same CCX vs. different CCXs.
|
| https://en.wikipedia.org/wiki/Epyc shows the "core config" of
| each model, which is (number of CCX) x (cores per CCX).
| [deleted]
| Ataraxic wrote:
| Just as information for some later readers, but this is
| applicable not just to Epyc but also the consumer version
| Threadripper. My understanding is that this is an example of
| Non-Uniform Memory Access (NUMA), which has also been used to
| link multiple CPUs in different sockets together for a long
| time, but now the nodes have been integrated into a CPU that
| fits in one socket.
|
| This actually matters if you are running a VM on such
| a system, since you'll run into things like the actual RAM (not
| L3 cache) often being directly linked to a particular NUMA node.
| For example, accessing memory in the first RAM stick vs the
| second will give different latencies, as it goes from ccx1 =>
| ccx2 => stick2 versus ccx1 => stick1. This is applicable to, I
| think, the 2XXX versions and earlier for Threadripper. My
| understanding is that they solved this in later versions using
| the Infinity Fabric (IO die), so now all CCXs go through the IO
| die.
|
| I ran into all of this trying to run an ubuntu machine that ran
| windows using KVM while passing through my nvidia graphics
| card.
| reilly3000 wrote:
| This sheds some light on why I had so much trouble running
| GPU passthrough with Unraid (KVM) on my 1950X w/ Nvida. Those
| were dark days.
| rob_c wrote:
| Frankly that was probably 200% unraid ...
| masklinn wrote:
| > Just as information for some later readers, but this is
| applicable not just to Epyc but also the consumer version
| Threadripper.
|
| It's applicable to any Zen design with more than one CCX,
| which is... any Zen 3 CPU of more than 8 cores (in Zen 2 it
| was 4).
|
| The wiki has the explanation under the "Core config" entry of
| the Zen CPUs, but for the 5000s it's all the 59xx (12 and 16
| cores).
|
| Zen 3 APUs are all single-CCX, though there are Zen 2 parts in
| the 5000 range which are multi-CCX (because why not confuse
| people): the 6-core 5500U is a 2x3 and the 8-core 5700U is a
| 2x4.
|
| The rest is either low-core enough to be single-CCX Zen 2
| (5300U) or topping out at a single 8-core CCX (everything
| else).
| dagmx wrote:
| This applies to Ryzen as well. The first-gen Ryzen was
| plagued by software and schedulers that couldn't handle
| CCX affinity.
| 123pie123 wrote:
| this is important when sizing and tuning VMs / hypervisors
| masklinn wrote:
| > I recently learned this is no longer true! AMD EPYC
| processors (and maybe other recent AMDs?) divide cores into
| groups called "core complexes" (CCX), each of which has a
| separate L3 cache.
|
| It's still kinda true, just that "a CPU" in that context is a
| CCX. Cross-CCX communication has shown up a fair bit in
| reviews and benches, and really in all chips at that scale
| (e.g. Ampere's Altras and Intel's Xeons):
| https://www.anandtech.com/show/16529/amd-epyc-milan-review/4
| and one of the "improvements" in Zen 3 is the CCX are much
| larger (there's one CCX per CCD, rather than 2) so there's less
| crosstalk.
|
| And it could already be untrue previously, e.g. the Pentium Ds
| were rushed out by sticking two P4 dies on the same PCB; I
| think they even had to go through the northbridge to
| communicate, so they were dual socket in all but physical
| conformation (hence being absolute turds). I don't think they
| had L3 at all though, so that wasn't really a factor, but
| still...
| tentacleuno wrote:
| > And it could already be untrue previously e.g. the Pentium
| D were rushed out by sticking two P4 dies on the same PCB, I
| think they even had to go through the northbridge to
| communicate
|
| Part of me hopes you're wrong here. That is absolutely
| absurd.
| masklinn wrote:
| Yeah you'd hope so wouldn't you?
|
| But Intel got caught completely unaware by the switch to
| multi-core, just as it had by the 64b switch.
|
| The eventual Core 2 was not ready yet (Intel even had to
| bridge to it with the intermediate Core, which really had
| more to do with the Pentium M than with the Core 2 though
| it did feature a proper dual-core design... so much so that
| the Core solo was actually a binned Duo with one core
| disabled).
|
| So anyway Intel was caught with its pants around its ankles
| for the second time, and they couldn't let that happen. And
| they actually beat AMD to market, having turned out a
| working dual-core design in the time between AMD's
| announcement of the dual-core Opteron (and strong hints of the
| X2) and the actual release, about 8 months.
|
| To manage that Intel could not rearchitect their chip
| (and probably didn't want to, as it'd become clear Netburst
| was a dead-end), so they stapled two Prescott cores
| together, FSB included, and connected both to the
| northbridge.
|
| It probably took more time to validate that solution for
| the server market, which is why where AMD released the dual
| core Opterons in April and Athlons in May, it took until
| October for the first dual-core Xeon to be available.
| danudey wrote:
| Here's the wikipedia article:
|
| https://en.wikipedia.org/wiki/Pentium_D
|
| "The Pentium D brand refers to two series of desktop dual-
| core 64-bit x86-64 microprocessors with the NetBurst
| microarchitecture, which is the dual-core variant of
| Pentium 4 "Prescott" manufactured by Intel. Each CPU
| comprised two dies, each containing a single core, residing
| next to each other on a multi-chip module package."
| IronWolve wrote:
| Yup, and that's why overclocking on AMD tests each of your
| CCXs/cores, marks golden cores that can reach 5GHz+, and slower
| cores that don't overclock. It's also why AMD tests each CCX and
| moves worse ones to lower-end products, and better ones to
| the 5950X.
|
| Then you can OC-tweak each core and try to milk out
| performance; PBO tends to be pretty good at detecting which
| cores can take a slight performance boost. With a few BIOS
| options, all that adds up without much effort.
|
| And then finally, get a better (advanced) scheduler in
| Windows/Linux that can move workloads around per core
| depending on the workload. Windows released multiple fixes for
| the scheduler on AMD starting with Win10 1903.
|
| I find scheduler mods and modest BIOS tweaks increase
| performance without much effort, and the gains are very noticeable.
|
| On Linux I use xanmod (and pf-kernel before that).
|
| On Windows I use Process Lasso and AMD PBO.
|
| Slow/fast cores and scheduler tweaks are how Win11/ARM
| Macs/Android make things appear faster too.
|
| Amusingly, scheduler/core tweaks have been around on Linux for a
| decade+, making the desktop super smooth, but they're only now
| mainstream in Win11/ARM OSX.
| Andys wrote:
| does xanmod work as a drop-in replacement in Ubuntu with an
| NVIDIA card?
| sroussey wrote:
| There are core complexes in Apple's M1 variations as well.
|
| And it's not just L3 that can be shared by various chips.
|
| Each complex also has its own memory bandwidth, so running on
| two cores in two complexes will get around twice the memory
| bandwidth of two cores in the same complex.
| lend000 wrote:
| I've always wondered how much this affects shared cloud
| instances (for example, a single c5.small instance). Can your
| neighbor, who shares your L3 cache, be using memory so much
| more aggressively than you that it causes you to evict your
| L2/L1 cache? Or is cache coherency maintained if the core that
| evicted your L3 cache is known to be living in a different
| memory space?
| aspaceman wrote:
| Coherency will be maintained (since the protocols support
| that case). But yes, a separate process would evict that
| memory. From the processor's point of view, they're just
| addresses tagged with data. Caching behavior doesn't depend on
| virtual memory addresses because I believe that info is
| stripped away at that point.
|
| So if someone is thrashing cache on the same core you're on,
| you will notice it if the processes aren't being shared
| effectively.
|
| The contents of the cache aren't stored as part of a paused
| process or context switch. But I'd appreciate a correction
| here if I'm wrong.
|
| For an example, consider two processes A and B running on a
| set of cores. If A makes many more memory accesses than B, A
| can effectively starve B of "cached" memory accesses because
| A accesses memory more frequently.
|
| If B were run alone, then its working set would fit in
| cache, effectively making the algorithm operate from cache
| instead of RAM.
|
| BUT. You really have to be hitting the caches hard. That doesn't
| happen too often in casual applications. I only encountered
| this on GPUs (where each core has a separate L1 but a shared
| L2). Even then it's only a problem if every core is hitting
| different cache lines.
| aspaceman wrote:
| Found some detail - apparently the CPU manuals are the
| place to check https://cs.stackexchange.com/a/1090
| pdpi wrote:
| Out of curiosity -- was the "bad" mode a case of two separate
| workloads competing for limited cache within one CCX, or was it
| really bad cache behaviour across CCXs causing problems because
| the two processes shared memory?
| kentonv wrote:
| The two processes were a client and a server, the client just
| sends HTTP requests to the server over a unix socket, but
| otherwise they don't share resources. TBH I'm surprised there
| was so much impact just from that. There might be something
| more to the story -- I haven't investigated much yet. But
| basically when I used `taskset` to limit the benchmark to two
| cores in different CCXs, it ran at about half the speed it did
| when limited to two cores in the same CCX. I guess another
| possibility is that the kernel is allowing both processes to
| swap between cores regularly, which would much better explain
| the disastrous performance penalty. But, in that case I don't
| understand the bimodal distribution I see when running the
| benchmark with no core affinity...
| hinkley wrote:
| This kind of [stuff] drives me nuts every time I'm trying to
| read the performance telemetry tea leaves at work.
|
| Do we have requests that take longer, or did Linux do something
| dumb with thread affinity?
| dahfizz wrote:
| Pro tip: if you care about performance, and especially if you
| care about reliable performance measurements, you should pin
| and accelerate your threads. Using isolated CPUs (isolcpus) is
| the next step.
|
| In other words: if your application is at the mercy of Linux
| making bad decisions about what threads run where, that is a
| performance bug in your app.
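|
| (A minimal Linux-only sketch of pinning the calling thread to one
| CPU, assuming the `libc` crate; the core index 0 is an arbitrary
| illustration -- real code would pick cores from the discovered
| topology, e.g. via hwloc.)
|
|     fn pin_current_thread_to(cpu: usize) -> std::io::Result<()> {
|         unsafe {
|             let mut set: libc::cpu_set_t = std::mem::zeroed();
|             libc::CPU_ZERO(&mut set);
|             libc::CPU_SET(cpu, &mut set);
|             // pid 0 means "the calling thread".
|             if libc::sched_setaffinity(
|                 0,
|                 std::mem::size_of::<libc::cpu_set_t>(),
|                 &set,
|             ) != 0
|             {
|                 return Err(std::io::Error::last_os_error());
|             }
|         }
|         Ok(())
|     }
|
|     fn main() {
|         pin_current_thread_to(0).expect("failed to set affinity");
|         // ... run the latency-sensitive work on this pinned thread ...
|     }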
| spockz wrote:
| How would you do this when running in kubernetes? Afaik it
| just guarantees that you get scheduled on some cpu for the
| "right" amount of time. Not on which cpu that is.
|
| I only know that there is one scheduler that gives you
| dedicated cores if you set the request and limit both to
| equal multiples of 1000.
| spiddy wrote:
| In post-Docker, pre-Kubernetes times I used `--cpuset-cpus`
| in the docker args to dedicate specific CPUs to Redis
| instances, using CoreOS and fleet for cluster
| orchestration.
| ncmncm wrote:
| Another reason to put off moving to kubernetes.
| masklinn wrote:
| > Do we have requests that take longer, or did Linux do
| something dumb with thread affinity?
|
| Yes.
|
| Cross-socket communications do take longer, but a properly
| configured NUMA-aware OS should probably have segregated
| threads of the same process to the same socket, so the
| performance should have increased linearly from 1 to 12
| threads, then fallen off a cliff as the cross-socket effect
| started blowing up performances.
| klysm wrote:
| Do you use numa nodes to tell the kernel about this or
| something else?
| wmf wrote:
| You can enable node-per-CCX mode in UEFI to do that.
| Otherwise you have to use tools like hwloc to discover the
| cache topology and pin workloads accordingly.
| rob_c wrote:
| Underrated and useful reply. Thank you!
| buildbot wrote:
| A fully shared L3 is really only common at smaller core counts;
| on any large Intel design the L3 isn't all in one place, it's
| broken up into slices across the cores.
| wmf wrote:
| That's still a single logical L3 (e.g. a single core could
| use the entire L3 capacity).
| cheschire wrote:
| There are only two hard things in Computer Science: cache
| invalidation and naming things. - Phil Karlton
| airstrike wrote:
| There are actually only two hard problems in computer science:
|
| 0) Cache invalidation
|
| 1) Naming things
|
| 5) Asynchronous callbacks
|
| 2) Off-by-one errors
|
| 3) Scope creep
|
| 6) Bounds checking
| c0balt wrote:
| You forgot text encodings, endianness, ...
| hinkley wrote:
| Is 4 uninitialized variables?
| Scarblac wrote:
| There is only one hard problem in computer science: _people_.
| [deleted]
| nicoburns wrote:
| Most languages will do bounds checking for you. I reckon that
| one's mostly solved at this point.
| chris_st wrote:
| But... but... but... we HAVE to use C, it's the only
| language that's fast enough!
|
| /s
| andi999 wrote:
| Too slow.
| hnaccount_rng wrote:
| You missed the off-by-one errors
| cheschire wrote:
| So you mean I was off by one?
| Pet_Ant wrote:
| Better version:
|
| There are only two hard things in Computer Science: cache
| invalidation, naming things, and off-by-one errors.
| MR4D wrote:
| Better better version:
|
| There are only two hard things in Computer Science: cache
| invalidation, and...what did you ask me again?
| jahnu wrote:
| There are only two hard things in Computer Science: cache
| invalidation, naming things, and off-by-oly two hard things
| in Computer Scivetem dy gjera te veshtira ne Shkencen
| Kompjute
| hermitdev wrote:
| Now, just toss in a byte order mark, and I think we've a
| winner!
| jacquesm wrote:
| Hehe. Ok. That is very funny :)
| gavinray wrote:
| I don't get it -- am I slow?
| [deleted]
| jacquesm wrote:
| Buffer overflow :)
| vaidhy wrote:
| This used to happen to me in C. Basically, you are
| trying to print a string which has overflowed the
| allocated length and you get junk printed after the string.
| C strings are null-terminated and the system will keep
| printing till it encounters a null. So, bounds checking
| and buffer overflows are also hard problems.
| colejohnson66 wrote:
| Gotta watch out for those nulls :)
| silon42 wrote:
| Nulls are easy compared to invalid non-null pointers.
| ludamad wrote:
| People say null in Java was a mistake, but I've never had
| any issues in my null-only dialect of Java /s
| kibwen wrote:
| There are only two hard problems in Computer Science: cache
| invalidation, naming things, off-by-one errors, and
| remembering to update the documentation.
| nikkinana wrote:
| [deleted]
| physicsgraph wrote:
| This article is a well-written deep dive on parallel performance
| debugging that ends up attributing the difference to how
| caching is handled on multi-CPU vs single-CPU systems. Thanks
| to the author for taking the time to write up their
| troubleshooting process.
| pengaru wrote:
| TL;DR: serializing your parallel work dispatch on an atomic
| global counter is bad for scaling across many cores.
|
| Or, per-cpu counters exist for a reason.
| dan-robertson wrote:
| It seems to me that one solution would be to be able to shard an
| Arc into separate counts for each thread, so that the
| synchronisation wouldn't be required (?) but I don't know how it
| would be done without making most Arcs unnecessarily expensive.
|
| Perhaps another problem is that libraries can't be (or just
| aren't) generic about the kind of smart-pointers that are used
| but the kind of smart-pointers used can matter a lot (e.g.
| libraries often have to use Arc for some of their clients even
| though Rc would often suffice).
|
| But maybe I'm just totally wrong about this?
| Kranar wrote:
| It's a great idea and in fact a variation of it is being
| explored by Python to eliminate the use of the GIL without
| introducing performance regressions.
|
| The variation is instead of one count per thread, have two
| reference counts, where one is atomic and the other is not.
| The heuristic is that at any given time, the majority of
| writes are performed within a single thread, so that
| thread has exclusive read/write access to the non-atomic
| reference count, while all other threads share the atomic
| reference count. There is some work needed to coordinate the atomic and
| non-atomic reference count so as to test when an object can be
| deleted, but the overhead of this work is less than the cost of
| always using an atomic.
|
| http://iacoma.cs.uiuc.edu/iacoma-papers/pact18.pdf
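|
| (A conceptual Rust sketch of that two-counter idea -- not the
| paper's actual implementation, and with the merge/deallocation
| protocol omitted: the owning thread bumps a plain non-atomic
| counter, everyone else falls back to the atomic one.)
|
|     use std::cell::Cell;
|     use std::sync::atomic::{AtomicIsize, Ordering};
|     use std::thread::{self, ThreadId};
|
|     struct BiasedCount {
|         owner: ThreadId,
|         local: Cell<isize>,  // fast path, owner thread only
|         shared: AtomicIsize, // slow path, all other threads
|     }
|
|     // Safety (conceptual): `local` is only ever touched by `owner`,
|     // which is exactly the invariant the real scheme relies on.
|     unsafe impl Sync for BiasedCount {}
|
|     impl BiasedCount {
|         fn new() -> Self {
|             BiasedCount {
|                 owner: thread::current().id(),
|                 local: Cell::new(1),
|                 shared: AtomicIsize::new(0),
|             }
|         }
|
|         fn incref(&self) {
|             if thread::current().id() == self.owner {
|                 // Owner fast path: no atomic, no cache-line tug-of-war.
|                 self.local.set(self.local.get() + 1);
|             } else {
|                 self.shared.fetch_add(1, Ordering::AcqRel);
|             }
|         }
|
|         fn decref(&self) {
|             if thread::current().id() == self.owner {
|                 self.local.set(self.local.get() - 1);
|             } else {
|                 self.shared.fetch_sub(1, Ordering::AcqRel);
|             }
|             // A real implementation must combine `local` and `shared`
|             // to decide when the object is actually dead.
|         }
|     }
|
|     fn main() {
|         let rc = BiasedCount::new();
|         rc.incref();
|         rc.decref();
|     }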
| dan-robertson wrote:
| In the case of the original article, the reference count was
| being updated by all the threads so having an Arc-with-Rc-
| for-one-thread wouldn't actually change things, I think. But
| maybe you just mean for 'cheaper smart-pointers in general'?
|
| The python thing is interesting. I thought there were other
| reasons for the GIL than ref counts but I guess everything
| must be addressed.
| Kranar wrote:
| Yep in this case it might not have helped, but all this to
| say that some exploration of having multiple reference
| counts is a great idea. The balance becomes the overhead of
| coordinating multiple reference counts versus the gains
| from reduced atomic operations.
| Pxtl wrote:
| That actually makes me appreciate that the single-threadedness of
| Python might be a good thing, since Python uses refcounting.
| reilly3000 wrote:
| FWIW this is the kind of content I originally came to HN for and
| what keeps me coming back.
| chrisaycock wrote:
| The culprit was a shared _atomic reference counter_ (Arc). The
| cache invalidation across cores eliminated any kind of scaling
| via multiple threads.
|
| The solution is to stop sharing. The author had to make changes
| specific to this codebase for each thread to have its own copy of
| the offending dataset.
| denysvitali wrote:
| Awesome TL;DR! I read the full article but this summarizes it
| well. The article is anyway worth a full read
| divs1210 wrote:
| Immutable collections save the day!
| Thiez wrote:
| Immutable collections wouldn't really help here since Rust
| does not have a garbage collector, so they would still be
| wrapped in an `Arc<T>`, or use one internally. Unless you're
| giving each thread their own copy of the immutable
| collection, in which case the immutability doesn't really add
| much.
| ericbarrett wrote:
| You're right if they're wrapped in an Arc<T>. But it's a
| lot easier to confidently clone an immutable data structure
| to threads/async functions because you know they won't be
| mutated by other cores. If you're cloning mutable data,
| you'll have to collect each "shard" at the end of whatever
| runtime loop your program has, and merge them back into the
| authoritative model. (Or drop the changes, but then why
| make the data mutable?) Gets hairy.
| amelius wrote:
| Or use a language with a good concurrent garbage collector.
| dagmx wrote:
| What language wouldn't hit this issue? The second you have an
| atomic write across multiple NUMA clusters you'll hit this.
|
| Edit: it appears they're talking specifically about the ref
| counting whereas I was considering the entirety of a shared
| context. That clarification helps me understand where their
| statement was coming from
| amelius wrote:
| The problem was caused by a memory management primitive.
| This wouldn't be an issue in a runtime with a GC designed
| for concurrent loads.
| lowbloodsugar wrote:
| Wood for trees. The problem was caused by an ill-thought-
| out design. We can do similarly performance degrading
| things in GC languages, just the details will be
| different. However, at some extreme, which, to be fair,
| most systems won't hit, GC languages perform vastly worse
| than non GC languages. In one service I own, Java's G1GC
| uses 20x more CPU than Rust for an application specific
| benchmark. Most of this time is spent in the concurrent
| phase, so Shenandoah and GenShen aren't going to make a
| dent (and we can't afford the RAM for Shenandoah). 20x
| CPU and 2x wall clock. The question we are looking at is,
| "Should we just continue to spend 20x $ on operating
| costs for the Java version to avoid writing it in Rust?"
| dagmx wrote:
| How would the GC avoid the atomic lock and cache
| invalidation across numa boundaries? What language has a
| sufficiently lock less rw capable GC?
|
| Edit: to clarify I was thinking about a mutable context
| share as shown in the code example, not solely about the
| ref counting.
| yencabulator wrote:
| If the only purpose of the Arc was to be a reference
| count, a language with a (non-refcount) GC wouldn't need
| that atomic in the first place.
| dagmx wrote:
| That's fair. It read to me, from his post, that Rune was
| using it for context sharing as well between threads
| (since that's the source of the highest-level Arc in his
| code example). If it's only for ref counts then it makes
| sense that a concurrent GC could avoid the issue
| Const-me wrote:
| > How would the GC avoid the atomic lock and cache
| invalidation across numa boundaries?
|
| By not using reference counting. State of the art GCs
| don't count references. They usually do mark and
| sweep, implement multiple generations, and/or do a
| few other things.
|
| Most of that overhead only happens while collecting.
| Merely referencing an object from another thread doesn't
| modify any shared cache lines.
|
| > What language has a sufficiently lock less rw capable
| GC?
|
| Java, C#, F#, Golang.
| dagmx wrote:
| Yeah I think my confusion was that I was thinking about a
| fully mutable context shared across threads based on the
| code example in the post.
|
| But if it's just for the ref counting part of the Arc
| then I can see how a GC would solve it by not needing the
| RC
| Const-me wrote:
| > I was thinking about a fully mutable context shared
| across threads
|
| A quote from the article: "No locks, no mutexes, no
| syscalls, no shared mutable data here. There are some
| read-only structures context and unit shared behind an
| Arc, but read-only sharing shouldn't be a problem." As
| you see, the data shared across threads was immutable.
|
| However, the library they have picked was designed around
| Rust's ref.counting Arc<> smart pointers. Apparently for
| some other use cases, not needed by the OP, that library
| needs to modify these objects.
|
| > I can see how a GC would solve it by not needing the RC
|
| Interestingly enough, C++ would also solve that. The
| language does not stop programmers from changing things
| from multiple threads concurrently. For this reason, very
| few libraries have their public APIs designed around
| std::shared_ptr<> (C++ equivalent of Rust's Arc<>).
| Instead, what usually happens, library authors write in
| the documentation things like "the object you pass to
| this API must be thread safe" and "it's your
| responsibility to make sure the pointer you pass stays
| alive for as long as you're using the API", and call it a
| day.
| amelius wrote:
| GoLang. It was designed for highly concurrent loads. For
| an overview of history and technology involved, read
| e.g.:
|
| https://blog.twitch.tv/en/2016/07/05/gos-march-to-low-
| latenc...
| dagmx wrote:
| Edit: another comment explains that you're likely talking
| about just the ref counting aspect rather than the entire
| context sharing used by the rune code shown, in which
| case, yes I see why a concurrent GC would avoid the issue
| in that scenario.
|
| ----------
|
| I'm familiar with Go's GC. Your linked post doesn't
| explain how it would avoid the hit caused by the cache
| invalidation across multiple clusters.
|
| It'll either try to put multiple goroutines on a single
| cluster (as listed in the link) or it'll need to copy the
| necessary stack per thread, which is effectively what the
| original article ends up doing.
|
| But if you encounter anything that needs to run
| concurrently across threads while using a single r/w
| object, you'll hit the same cliff surely?
| YogurtFiend wrote:
| While this isn't true with Go's GC, a GC that's stop-the-
| world can avoid these cache coherency issues altogether.
| If you pause every thread before marking and sweeping,
| then you won't run into problems--only the GC is running.
| While this may sound silly, stopping the world will
| oftentimes lead to higher throughput than concurrent
| collectors, at the cost of higher latency (which is why
| it's not suitable for Go, which is often used for
| building servers where latency is a priority).
| jhgb wrote:
| For one, I'd be surprised if Azul's C4 (Zing) didn't have
| these issues already solved (assuming anyone has solved
| that, _they_ probably did).
| YogurtFiend wrote:
| I don't know why you're being downvoted. Memory reclamation
| is hard in the face of parallelism. It's much easier to
| design lock-free algorithms in languages with a garbage
| collector, because memory reclamation issues are taken care
| of for you (and often times in a batched manner, which won't
| affect the hot path of your code). There _are_ techniques to
| avoid using a garbage collector: atomic reference counting,
| epoch-based reclamation, hazard pointers, and more. But they
| can be quite difficult to implement and difficult to use.
| quotemstr wrote:
| Right. Plus GCs can compact memory, which we can't do with
| manual memory management. I don't understand why people
| fetishize manual memory management --- do you want to spend
| your short time on Earth twiddling references?
| titzer wrote:
| After 28 years of programming in C and 25 years
| programming in C++ (on and off), I am sick to death of
| managing memory. Programmers can't do it right or well.
| Let the super-intelligent idiot savant with billions of
| cycles of computation per second and trillions of bits of
| storage do the thinking for me.
|
| Personally I'm glad the world is moving away from unsafe
| manual memory management in C++, be that with better RAII
| and managed pointers, Rust's type system, etc. But those
| things are still breakable and IMHO, ultimately a big
| waste of people's time. If Rust's lifetime annotations
| could be inferred by a compiler, then by all means. But
| if they can't, just GC already. All that extra thinking
| pushes people to think about the wrong things and designs
| end up ossified, difficult to refactor because of the
| extra overhead of radically changing ownership. Forget it
| already! Let computers figure out the whole memory
| management problem.
|
| Want your program to go fast? Don't allocate a bunch of
| garbage. Redesign it to reuse data structures carefully.
| Don't burden yourself forever twiddling with managing the
| large and small alike, obsessing over every dang byte.
| mcguire wrote:
| Rust's Arc _is_ an atomic reference counter. But, as the
| article describes, those aren't free.
| duped wrote:
| One could argue that using a GC for reclamation isn't lock-
| free if it's not a concurrent GC.
| hayley-patton wrote:
| There are very few GCs that are actually lock-free. While
| uncooperative mutators cannot block other mutators in an
| on-the-fly collector, for example, the collector is
| blocked until all mutators cooperate. Nonetheless I'd
| guess that a sufficiently concurrent collector is good
| enough.
___________________________________________________________________
(page generated 2021-12-31 23:00 UTC)