[HN Gopher] How many CPU cores can you use in parallel?
       ___________________________________________________________________
        
       How many CPU cores can you use in parallel?
        
       Author : itamarst
       Score  : 138 points
       Date   : 2023-12-18 16:04 UTC (1 days ago)
        
 (HTM) web link (pythonspeed.com)
 (TXT) w3m dump (pythonspeed.com)
        
       | BerislavLopac wrote:
       | > Last updated 15 Dec 2023, originally created 18 Dec 2023
       | 
       | I respect Itamar even more after realising that he has invented
       | the time machine... :o
        
         | not_your_vase wrote:
         | This happens when you can't sync your tasks correctly in a
          | multi-threaded environment. Synchronous tasks happen in the
          | wrong order.
        
           | pixelpoet wrote:
           | Knock, knock.
           | 
           | Race condition!
           | 
           | Who's there?
        
           | BerislavLopac wrote:
           | There are only two hard problems in distributed systems: 2.
           | Exactly-once delivery 1. Guaranteed order of messages 2.
           | Exactly-once delivery [0]
           | 
           | [0]
           | https://twitter.com/mathiasverraes/status/632260618599403520
        
         | nolongerthere wrote:
         | lol that's a Wordpress bug, I don't remember how I hit it, but
         | I've done it before too. Might have to do with not having the
         | datetime set correctly on the server.
        
       | dale_glass wrote:
       | > But there is something unexpected too: the optimal number of
       | threads is different for each function.
       | 
       | Nothing unexpected there. Amdahl's Law in its glory.
       | 
       | A fast running function finishes fast, and coordinating the job's
       | execution over many cores requires doing work every time
       | something finishes. If you split your job into chunks in the
       | microsecond time range, then you'll be handling lots and lots of
       | tiny minutia.
       | 
       | You want to set up your tasks such that each thread has a good
       | amount of stuff to chew through before it needs to communicate
       | with anything.
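        | 
        | A minimal sketch of that batching idea with Python's
        | concurrent.futures (process() here is a hypothetical per-item
        | function; the thread count and batch size are illustrative):
        | 
        |     from concurrent.futures import ThreadPoolExecutor
        | 
        |     def process_batch(batch):
        |         # pay coordination overhead once per batch, not per item
        |         return [process(item) for item in batch]
        | 
        |     def run(items, num_threads=8, batch_size=64):
        |         batches = [items[i:i + batch_size]
        |                    for i in range(0, len(items), batch_size)]
        |         with ThreadPoolExecutor(max_workers=num_threads) as pool:
        |             results = []
        |             for batch_result in pool.map(process_batch, batches):
        |                 results.extend(batch_result)
        |         return results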
        
         | itamarst wrote:
         | That's not it. I updated the article with an experiment of
         | processing 5 items at a time. The fast function doing 5 images
         | at a time is slower than the slow function doing 1 image at a
         | time (24*5 > 90).
         | 
         | If your theory was correct, we would expect the optimal number
         | of threads for the fast function processing 5 images at a time
         | to be similar to that of the slow function processing 1 image
         | at a time.
         | 
          | In fact, the optimal thread count in this case (5 images at a
          | time) was 20 for the slow function and 10 for the fast
          | function, so essentially the same as the original setup.
        
           | bogwog wrote:
           | Is it a caching thing? The slow version seems less cache
           | efficient, so if it is waiting due to cache misses, that
           | could create an opportunity for something else to get
           | scheduled in.
        
             | gmm1990 wrote:
              | I doubt it; the slow version uses division instead of bit
              | shifting. My guess would be that the fast version saturated
              | something like I/O or some non-CPU portion of the processor,
              | and the division one was bottlenecked by the division logic
              | in the processor.
        
               | bogwog wrote:
               | But it's iterating through the result vectors twice, so
               | that's basically guaranteed to miss. Moving the threshold
               | check into the loop above would at least eliminate that
               | factor.
               | 
               | Maybe division vs bit shifting does play a factor, but
               | it's hard to compare that while the cache behavior is so
               | different.
        
               | cogman10 wrote:
               | Recalibrate how you feel about division and
               | multiplication. It turns out, integer division on new
               | processors is a 1 cycle process (and has been for a while
                | now). Most of the multicycle instructions nowadays are
               | things like SIMD and encryption.
        
               | jandrewrogers wrote:
               | Which processor has 1-cycle latency for integer division?
                | Even Apple Silicon, which has the most highly optimized
                | implementation I am aware of, appears to have 2-cycle
               | latency. Recent x86 are much worse, though greatly
               | improved. Integer division is much faster than it used to
               | be but not single cycle.
               | 
               | Also, most of those SIMD and encryption instructions can
               | retire one per cycle on modern cores, but that isn't the
               | same as latency.
        
               | cogman10 wrote:
               | You are correct, I mistakenly thought it was 1/cycle
               | because I had previously remembered IMUL taking 30 cycles
               | (and now it has a 1 cycle throughput).
               | 
               | Agner Fog is reporting anywhere from 4->30 cycles on
               | semi-recent architectures.
        
           | kukkamario wrote:
           | Based on the graph the fast function runtime is really short.
           | You might be just seeing effects of efficiency vs performance
           | cores. Lower thread count makes most of the stuff run on
           | performance cores and task end times align more nicely. With
            | a larger number of threads, tasks running on performance
            | cores complete first and you are left waiting for the tasks
            | on efficiency cores to finish, or context switches have to
            | happen and tasks get moved between cores, which causes
            | overhead.
           | 
           | You could try what happens if you have 10 times more images
            | when running the fast function.
           | 
           | Also you have just 8 physical performance cores and 4
           | physical efficiency cores. Performance cores have hyper
           | threading so they act as 2 logical cores but that doesn't
           | mean that they can actually execute 2 threads at maximum
           | performance. If processing tasks use the same parts of the
            | processor core, then the processor cannot run both threads at
            | the same time and IPC will suffer. The slow task may use more
            | varied parts of the core, which allows better IPC with
            | hyper-threading. So that may also reduce the optimal thread
            | count.
        
             | dmoy wrote:
             | Hyper threading is SMT? I can never keep the branding names
             | straight.
        
               | nicoburns wrote:
               | Yes, hyper threading is SMT
        
         | adfgiknio wrote:
         | >Nothing unexpected there. Amdahl's Law in its glory.
         | 
         | That's not Amdahl's Law. Amdahl's Law describes the expected
         | total speedup from a speedup to a segment of the program. It is
         | usually used in the context of adding parallelism but is more
         | general than that. It does not assume parallelism has any
         | overhead.
         | 
         | Amdahl's Law predicts performance will approach a limit with
         | added cores, but still predicts that each core will improve
         | performance.
         | 
         | >A fast running function finishes fast, and coordinating the
         | job's execution over many cores requires doing work every time
         | something finishes. If you split your job into chunks in the
         | microsecond time range, then you'll be handling lots and lots
         | of tiny minutia.
         | 
         | Not necessarily. Modern systems have pools with independent
         | per-core queues and clever ways of distributing work to them
         | with minimal overhead. Work-stealing pools don't have any
         | overhead from synchronization except when a core runs out of
         | work, so each core just runs a `for` loop most of the time. I
         | have seen a simple (pseudocode) `list.parallelMap(foo)` give a
         | linear speedup on a modern JVM even on a large host with a
         | `foo` that takes <100ns.
         | 
         | Trying to manually batch work risks one thread running faster
         | than the others and then idling when it runs out of work. This
         | is common on modern hardware with heterogeneous cores and
         | dynamic frequency scaling.
         | 
         | If the tasks require their own coordination or the runtime
         | isn't smart, then manual batching may be needed. I wouldn't
         | assume Python does the right thing.
        
           | csdvrx wrote:
           | > Amdahl's Law predicts performance will approach a limit
           | with added cores, but still predicts that each core will
           | improve performance.
           | 
            | And I would agree: on my systems, efficiency cores are used
            | to handle other tasks and keep the performance cores cold and
            | ready to use when needed (dynamic frequency scaling).
           | 
           | > Trying to manually batch work risks one thread running
           | faster than the others and then idling when it runs out of
           | work. This is common on modern hardware with heterogeneous
           | cores and dynamic frequency scaling.
           | 
           | Yes, I think the issue here is more that python can't
           | introspect the cgroup artificial limitations placed by docker
           | on the CPU it's using, causing the kink past 9.
           | 
           | If the goal is performance, I think it would be better to
            | assign the cores manually + use cpu pinning + declare to the
            | container what resources it really has.
           | 
           | On a NOHZ kernel, you can do manual assignments like
           | nohz_full=1-3,5-7 rcu_nocbs=0-3,5-7 irqaffinity=4
           | 
           | This puts the IRQ burden on one core but you can do 2 (with
           | =4,5) etc
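            | 
            | If you also want the Python process itself pinned to the
            | cores kept free of housekeeping work, a small sketch
            | (Linux-only; the core numbers match the example above):
            | 
            |     import os
            | 
            |     # pin this process (pid 0 = self) to cores 1-3 and 5-7
            |     os.sched_setaffinity(0, {1, 2, 3, 5, 6, 7})
            |     print(os.sched_getaffinity(0))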
        
           | dale_glass wrote:
           | > That's not Amdahl's Law. Amdahl's Law describes the
           | expected total speedup from a speedup to a segment of the
           | program. It is usually used in the context of adding
           | parallelism but is more general than that. It does not assume
           | parallelism has any overhead.
           | 
            | Yeah, it's a fiction. A useful one, but still a sort of
           | spherical cow in a vacuum scenario. It doesn't exactly
           | describe real hardware.
           | 
           | > Amdahl's Law predicts performance will approach a limit
           | with added cores, but still predicts that each core will
           | improve performance.
           | 
           | Because it's a fiction. You can have slowdowns with bigger
           | tasks on real hardware, like if you run out of L1 cache.
           | 
           | > Not necessarily. Modern systems have pools with independent
           | per-core queues and clever ways of distributing work to them
           | with minimal overhead.
           | 
           | True, but that's Java and this is Python. I may be wrong, but
           | since the GIL removal is very new still, I don't expect
           | Python to have nearly the same level of efficiency in this
           | regard.
           | 
           | But I could be wrong somewhere, of course.
        
       | mihaic wrote:
        | Starting from Python 3.13 there should be a new function,
        | os.process_cpu_count(), that aims to get the number of cores
        | actually available for our process to run on.
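        | 
        | A minimal sketch of using it with a fallback on older Pythons
        | (sched_getaffinity is Linux-only):
        | 
        |     import os
        | 
        |     def available_cpus():
        |         if hasattr(os, "process_cpu_count"):   # Python 3.13+
        |             return os.process_cpu_count()
        |         try:
        |             return len(os.sched_getaffinity(0))
        |         except AttributeError:
        |             return os.cpu_count() or 1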
        
         | itamarst wrote:
         | Neat! Unfortunately at the moment it still ignores cgroups,
         | it's just a wrapper around sched_getaffinity().
         | 
         | https://github.com/python/cpython/blob/6a69b80d1b1f3987fcec3...
        
           | mihaic wrote:
           | True, I'm wondering if they might change the implementation
           | for this actually, since the function name is pretty
           | agnostic.
        
       | onetimeuse92304 wrote:
       | Personally, I dislike configuration parameters that let future
       | admins of the system change parameters like concurrency of
       | certain processes, etc.
       | 
       | A lot of the time even I can't tell what is going to be the
       | optimum setting.
       | 
       | So for a number of years, rather than expose a configuration
       | parameter, I am building in a performance test that establishes
       | the values of these parameters.
       | 
       | This is a little bit of search and a hill climb through the
       | multidimensional space of all relevant parameters. It is not
       | perfect, but in my experience it is almost always better than I
       | can do manually and always better than an operator without
       | intimate understanding of the system.
       | 
        | The results are cached in a database and the search can be
        | re-executed if a change to the configuration is found (just take
        | a number of parameters from your environment, the version of the
        | application, etc., hash them, and check if you already have
        | parameters in the database).
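        | 
        | Roughly, the caching side of this could look like the sketch
        | below; the environment fields, app_version and the tune()
        | search routine are placeholders:
        | 
        |     import hashlib, json, os, platform
        | 
        |     def environment_key(app_version):
        |         env = {"app": app_version,
        |                "machine": platform.machine(),
        |                "cpus": os.cpu_count()}
        |         blob = json.dumps(env, sort_keys=True).encode()
        |         return hashlib.sha256(blob).hexdigest()
        | 
        |     def tuned_parameters(db, app_version, tune):
        |         key = environment_key(app_version)
        |         params = db.get(key)      # db: any key/value store
        |         if params is None:
        |             params = tune()       # run the search/hill climb once
        |             db[key] = params
        |         return params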
        
         | eropple wrote:
         | This is interesting - how long does it take to write? Is it
         | reusable?
         | 
         | I'd be really interested in a blog post about this.
        
           | onetimeuse92304 wrote:
            | I am not big on writing blog posts.
           | 
           | But now that I think about it, I could write a small library
           | to automate most of it while delegating some tasks to the
           | application (providing functionality to be tested, parameters
           | and possible ranges to be searched, a way to save/restore
           | parameters to/from the database, provide fitness function,
           | etc.) It is essentially a small benchmarking framework with
           | some added functionality. I think it could be quite useful to
           | some people.
        
         | arp242 wrote:
         | This works well for expensive long-running things, but works
         | less well for more short-lived programs, and even for expensive
         | long-running things there may be more considerations than
         | "what's the fastest value?" I intentionally set the default
         | jobs of "make" to "4" rather than "8" because I _don 't_ want
         | to utilize my full system when compiling stuff, so I can still
         | use it for other things at a reasonable speed. There's lots of
         | reasons you might want to do this: on a web platform you don't
         | really want to use up all resources by the one expensive thing,
         | but what's "reasonable" and "desired" often differs - often
         | there is no objectively best "optimum setting".
        
           | onetimeuse92304 wrote:
           | You do not have to target best absolute performance. You can
           | choose your own fitness functions to balance absolute
           | performance and other aspects of the system like resource
           | usage, latency, or any other aspects of the application.
           | Think about what logic the operator would use to establish
           | the parameter and capture it as code or as fitness function.
           | 
            | As usual, there is no single universal solution to optimising
            | performance (and in fact it can be trivially proven not to
            | exist in the general case). But you do not have to find the
            | perfect solution, you just have to find one as good as or
            | better than what a human operator would.
           | 
           | When you have a large system you can be working with
           | thousands or tens of thousands of parameters. I usually see
           | that even if the application is migrated to different
           | hardware, the configuration parameters like buffer sizes,
           | concurrency, connection pools, etc. are not really revisited.
           | The effect is that usually the application will work
           | suboptimally for the environment it is in.
           | 
            | My solution is an attempt to keep this part of application
            | configuration sufficiently close to optimal to be usable, and
            | to do it automatically without operator involvement (and the
            | associated cost and possibility of defects).
        
             | vlovich123 wrote:
             | Yeah I'm in agreement with you. It can be hard for
             | performance things (vs tuning parameters for numerical
             | algorithm performance) because such tests can be inherently
             | difficult to reproduce and writing the hill climbing can be
             | tricky. It's a goal to aspire to for sure. Have you found
             | any tooling to do the hill climb search for you in a
             | multidimensional space?
        
               | geysersam wrote:
               | If you're using python `scipy.optimize` has several good
               | options for the minimization procedure.
               | 
               | I guess you could even use those if you're not using
               | python. But it'd require slightly more plumbing.
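                | 
                | For the one-dimensional case (thread count), a rough
                | sketch; benchmark(n) is a placeholder returning the
                | runtime with n threads, and noisy measurements may
                | need repeats/averaging:
                | 
                |     from scipy.optimize import minimize_scalar
                | 
                |     def objective(x):
                |         return benchmark(int(round(x)))
                | 
                |     res = minimize_scalar(objective, bounds=(1, 32),
                |                           method="bounded")
                |     best_threads = int(round(res.x))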
        
           | Palomides wrote:
           | consider running make with nice, let the scheduler work for
           | you
        
             | arp242 wrote:
              | Actually I do both:
              | 
              |     % alias make
              |     make='nice -n 20 make -j4'
             | 
             | But I found that even with a nice level of 20 using all
             | CPUs doesn't quite give me the responsiveness I want for
             | other tasks.
        
               | onetimeuse92304 wrote:
               | Put it in a container where you can explicitly limit CPU
               | / memory usage?
               | 
               | I run all builds in containers/vms anyway as I am too
               | tired to keep my personal machine able to build hundreds
               | of different things. It is better to have a dedicated
               | development environment frozen for each of the projects I
               | work on.
        
               | arp242 wrote:
               | I just want to run make and not slow down Firefox or
               | maybe some other things I'm doing. This simple alias I've
               | had for over 20 years works, on all (Unix-y) systems. Why
               | make everything a hypercomplex story?
        
               | wtallis wrote:
               | Make also has an option to only spawn new jobs when the
               | load average is below a given value, which can be used to
               | get make to reduce the number of jobs running in parallel
               | when there are plenty of other processes using CPU time.
        
         | fritzo wrote:
         | Sounds similar to ATLAS, an automatically tuned BLAS library
         | from the 1990s https://softlib.rice.edu/pub/CRPC-
         | TRs/reports/CRPC-TR98751.p...
        
           | onetimeuse92304 wrote:
           | I am not surprised. I like to say that it is hard to have an
           | original thought. Even if you think you are original, it is
           | more likely somebody else has already done it in the past.
           | Most of the computing novelties I am seeing are really old
           | ideas, revisited and potentially improved upon.
           | 
           | Thanks for the link, I will take a look when I have a
           | moment:)
        
             | bee_rider wrote:
              | Eventually the pendulum swung back for BLAS libraries: the
              | currently popular strategy is hand-tuned assembly kernels in
              | libraries where the most important kernels have been
              | identified.
             | 
             | But BLAS is a very weird case, there are only a couple
             | kernels that you care about (and the most important, GEMM,
             | has been studied like crazy). And BLAS is used under the
             | hood by pretty much the whole HPC community, so there's
             | incentive for tons of expert human time investment.
             | 
              | You can look at libflame/BLIS for the modern strategy.
             | 
             | ATLAS is still neat, perhaps one can even derive more
             | generally applicable lessons from that library, not sure...
        
               | ska wrote:
               | I thought the point of ATLAS wasn't the idea that
               | automagical tuning was going to necessarily be best.
               | Rather that it was _expensive_ to [edit hand tune] for
               | all the hardware variations that were around and this was
               | a practical approach to get to pretty damn good even if
               | nobody had spent the money to produce a hand optimized
                | library for your hardware (or, they had, but you didn't
               | want to pay what they were asking).
               | 
               | e.g. I remember vaguely people using it on AMD cpus
               | because they couldn't use icc/mkl effectively...
        
               | bee_rider wrote:
               | > Rather that it was expensive to _automate_ for all the
               | hardware variations that were around
               | 
               | Is the "automate" that I've italicized supposed to be
               | hand-tune or something along those lines? If so, I think
               | I agree. I think this was the justification for it, and
               | it made sense at the time, and it might make sense for
               | future architectures or exotic ones now.
               | 
               | But eventually Kazushige Goto wrote gotoBLAS, which
               | really emphasized the importance of the matrix-
               | multiplication kernel and showed how much performance
               | could come from getting it right. Since it is only a
               | handful of routines, it is possible to hand-tune them.
               | BLIS went on to formalize this in a nice open-source
               | library, so any vendor can drop their kernels in.
               | 
               | What is special about a particular CPU from this point of
               | view is the SIMD extension and the cache hierarchy/memory
                | system. For SIMD extensions, somebody will write a matrix-
               | multiply kernel in assembly or intrinsics before
               | optimizing compilers get really good at using that
               | extension (if they ever get to it!). Cache hierarchy is
               | more abstractable I guess, but there are lots of
               | rectangles and other nice shapes that dedicated tuners
               | can think about, i.e, (PDF warning)
               | 
                | https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
               | 
               | I'm sure one _could_ search across all the possible
               | sizes, but it is easier to just have a human identify the
               | right ones.
               | 
               | Maybe someone can train an ML model to identify good
               | sizes. Machine learning uses lots of linear algebra, so I
               | guess we'll have a nice feedback loop, maybe that will
               | kick off the singularity, haha.
        
               | ska wrote:
               | > Is the "automate" that I've italicized supposed to be
               | hand-tune or something along those lines? I
               | 
               | Gah, yes, typing faster than coffee. Fixed now, thanks.
               | 
               | Agree on your points about more recent work.
        
               | owlbite wrote:
                | Also worth observing that BLAS is an even weirder case, as
                | odds are the CPU architecture was designed to ensure it
                | could hit a high % of peak FMA throughput for that kernel,
                | so there's a lot of SW/HW co-design going on there too.
        
           | mattkrause wrote:
           | The Fourier transform library _FFTW_ does this too.
           | 
           | Its "wisdom" system will plan/measure the best way to do
           | transforms of a particular size/type and cache it for future
           | use.
           | 
           | More here: https://www.fftw.org/fftw3_doc/Words-of-
           | Wisdom_002dSaving-Pl...
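            | 
            | From Python, the pyFFTW bindings expose the same idea; a
            | rough sketch (assuming pyfftw is installed, error handling
            | omitted):
            | 
            |     import pickle
            |     import pyfftw
            | 
            |     a = pyfftw.empty_aligned(4096, dtype="complex128")
            |     # planning with FFTW_MEASURE benchmarks candidate plans
            |     fft = pyfftw.builders.fft(
            |         a, planner_effort="FFTW_MEASURE")
            | 
            |     # save the accumulated wisdom for future runs
            |     with open("wisdom.pickle", "wb") as f:
            |         pickle.dump(pyfftw.export_wisdom(), f)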
        
         | bashinator wrote:
         | As a devops type person, I've been advocating for basic
         | performance testing to be part of microservice unit tests. It'd
         | be really wonderful to get rough numbers for peak and idle
         | cpu/memory usage along with other test output.
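          | 
          | Even the standard library gets you rough numbers; a sketch of
          | the idea (run_handler_under_test is a placeholder, and
          | ru_maxrss is in kilobytes on Linux):
          | 
          |     import resource, time
          | 
          |     def test_handler_resources():
          |         start = time.process_time()
          |         run_handler_under_test()
          |         cpu_s = time.process_time() - start
          |         peak_kb = resource.getrusage(
          |             resource.RUSAGE_SELF).ru_maxrss
          |         print(f"cpu={cpu_s:.3f}s peak_rss={peak_kb}KB")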
        
       | mrlonglong wrote:
       | More cores are excellent for building large projects. My
       | threadripper is worth every penny in software development.
        
         | npoc wrote:
         | The 7950X eats large C++ projects for breakfast, especially
         | those with lots of boost includes. Every logical core counts.
        
           | Skunkleton wrote:
            | I'm still waiting for a multithreaded C++ compiler. The project
           | I work on has some nasty templating, and there are a few
           | compilation units that take a long long time.
        
           | bogwog wrote:
           | Add mold into the mix, and you're in C++ heaven.
           | 
           | https://github.com/rui314/mold
        
       | Kon-Peki wrote:
       | > you can spend a little bit of time at the beginning to
       | empirically measure the optimal number of threads, perhaps with
       | some heuristics to compensate for noise.
       | 
       | I'd like to point out that there is a lot of stuff published on
       | ArXiv; this appears to be a very active research subject. Don't
       | start from scratch :)
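        | 
        | The bare-bones version of "measure it first" is just timing the
        | real workload at a few thread counts and keeping the fastest;
        | sample_workload() below is a stand-in:
        | 
        |     import time
        |     from concurrent.futures import ThreadPoolExecutor
        | 
        |     def time_with_threads(n, tasks):
        |         start = time.perf_counter()
        |         with ThreadPoolExecutor(max_workers=n) as pool:
        |             list(pool.map(sample_workload, tasks))
        |         return time.perf_counter() - start
        | 
        |     candidates = [1, 2, 4, 8, 12, 16, 20]
        |     timings = {n: time_with_threads(n, range(100))
        |                for n in candidates}
        |     best = min(timings, key=timings.get)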
        
       | dahart wrote:
       | > our faster function could take advantage of no more than 8
       | cores; beyond that it started slowing down. Perhaps it started
       | hitting some bottleneck other than computation, like memory
       | bandwidth.
       | 
        | @itamarst Yes, this is interesting, you should profile it and get
        | to the bottom of the issue! In my experience, being limited by
        | hyperthreading or instruction-level parallelism is relatively
        | rare, and much more often it's cache or memory access
       | patterns or implicit synchronization or contention for a hardware
       | resource. There's a good chance you'll learn something useful by
       | figuring it out. Maybe it's memory bus contention, maybe it's
       | cache, maybe numba compiled in something you aren't expecting.
       | 
        | Worth noting that using 20 on the fast test isn't that much
       | slower than using 8. A good first guess/proxy for number of
       | threads to use is the number of cores, and that pays off in this
       | case compared to using too few cores.
       | 
       | Out of curiosity, do you know if your images are stored row-major
       | or column major? I see the outer loop over shape[0] and inner
       | loop over shape[1]. Is the compiled code stepping in memory by 1
        | pixel at a time, or by a whole column? If your stride is a column,
       | you may be thrashing the cache.
       | 
       | I'd also be curious to hear how the speed of this compiled code
       | compares to a numpy or PIL image threshold operation, if you
       | happen to know.
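        | 
        | (For anyone wanting to check: NumPy will tell you the layout and
        | the byte step per axis directly.)
        | 
        |     import numpy as np
        | 
        |     img = np.zeros((1080, 1920), dtype=np.uint8)
        |     print(img.flags["C_CONTIGUOUS"])  # True -> row-major
        |     print(img.strides)  # (1920, 1) bytes per axis step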
        
         | itamarst wrote:
          | NumPy's default is row-major (C order), so you iterate over the
          | earlier dimensions first.
         | 
         | The slow code is likely at least partially slow due to branch
         | misprediction (this is specific to my CPU, not true on CPUs
          | with AVX-512), see
          | https://pythonspeed.com/articles/speeding-up-numba/ where I use
          | `perf stat` to get branch misprediction numbers on similar code.
         | 
         | With SIMD disabled there's also a clear difference in IPC, I
         | believe.
         | 
         | The bigger picture though is that the goal of this article is
         | not to demonstrate speeding up code, it's to ask about level of
         | parallelism given unchanging code. Obviously all things being
         | equal you'll do better if you can make your code faster, but
         | code does get deployed, and when it's deployed you need to
         | choose parallelism levels, regardless of how good the code is.
        
           | dahart wrote:
           | > when it's deployed you need to choose parallelism levels,
           | regardless of how good the code is.
           | 
           | Yes, absolutely, exactly. That's why it can be really helpful
            | to pinpoint the cause of the slowdown, right? It might not
           | matter at deployment time if you have an automated shmoo that
           | calculates the optimal thread load, but knowing the actual
           | causes might be critical to the process, and/or really help
           | if you don't do it automatically. (For one, it's possible the
           | conditions for optimal thread load could change over the
           | course of a run.)
        
         | lumost wrote:
         | Is there a good method to split out memory and CPU bottlenecks
          | on a CPU? Most folks simply optimize until they hit 100% and
         | assume all CPU usage is created equal.
         | 
         | Seems like there needs to be a tool similar to htop with
          | utilization stats for different instructions to indicate how
         | much of the CPU is idle/busy.
        
           | grogers wrote:
           | I think most people would measure instructions per cycle as
           | an estimate of how busy the CPU execution units actually are.
           | Lots of performance tools do measure it.
        
             | dist1ll wrote:
             | IPC is not a good performance metric - especially for an
             | instruction set like x86, but particularly for SIMD ISAs.
              | Two seemingly similar instruction sequences may get
             | decoded into a different number of uops, which in turn
             | occupy different execution ports. The most extreme example
             | of this being scalar code without stalls. It tends to have
             | high IPC, even though it makes no use of FP/SIMD ports.
             | 
             | Keeping the execution pipeline busy is not a useful goal in
             | isolation. Really, IPC is better left as a sanity check.
        
           | gnufx wrote:
           | Consider appropriate hardware counters. (Note that
            | multiplexing to measure more than three(?) at once can be
           | misleading.) At least for serial code on x86, there's "top-
           | down performance analysis". Some tools (I know Maqao) suggest
           | how to address bottlenecks.
        
       | omgtehlion wrote:
       | > Intel i7-12700K processor
       | 
       | Wait! You can't benchmark on a CPU with unpredictable speed and
        | a mix of slow and fast cores. All kinds of effects come into play
       | here, and the code itself is not the most prominent among them.
       | 
        | To measure which _code_ is better, you should use a real SMP
        | machine with a fixed clock speed and HT turned off. On the machine
       | from TFA you are just fighting with Intel's thermal smarts and OS
       | scheduling shenanigans. (edit: you can use the same machine, but
       | configure it in BIOS. I, myself, use i9-12900k fixed to
       | 5ghz@8P-cores as "Intel testing machine")
        
         | sspiff wrote:
         | Except then you're not testing on the hardware configurations
         | people will actually run the software on.
         | 
         | People do run software with HyperThreading on and on E cores
         | and with Turbo boost and throttling. Having code that behaves
         | better in these dynamic, heterogeneous environments is a net
         | benefit to the user.
        
           | dijit wrote:
            | You're missing the point. You need to remove as many
            | variables as possible when benchmarking, because benchmarks
            | are always relative to each other.
            | 
            | In absolute terms _testing_ is required on a standard config,
            | but benchmarks should mostly be conducted with a minimum of
            | external variables if you want to draw meaningful conclusions.
        
           | omgtehlion wrote:
            | It is easier to deal with all the issues separately. First
            | your algo (in a single thread), then threading, then memory
            | accesses and cache effects. After you sort out everything in
            | your control, you try to deal with the real world (like
            | counting only "real" cores, ignoring or enjoying
            | hyperthreads, OS scheduling). But most of these are quite
            | unpredictable, and if you get your code to run faster on a
            | 12700K, the same settings may be slower on, say, a 5800X or
            | on server machines, while the basic stuff speeds up the
            | program on any hardware.
        
             | spenczar5 wrote:
             | Those issues are not separable. The performance of memory
             | and caches can determine which algorithm is best. This is
             | the basis of the entire field of cache-aware algorithms,
             | which routinely beat the pants off of theoretically
             | superior algorithms.
             | 
             | In my experience (which is in video transcoding and in
             | research astrophysics - both domains where it matters!), if
             | you really need to squeeze out performance, you have to
             | design with the target platform available for profiling and
             | benchmarking from the beginning.
             | 
             | Edit to add: I agree wholeheartedly with your top-level
             | comment! I just am perhaps more extreme than you; I don't
             | think "laptop benchmarks" can _ever_ be twisted into being
             | useful.
        
               | the8472 wrote:
               | You usually don't design an algorithm for a single CPU.
               | Most software has to run on different tiers and
               | generations from Intel, AMD, a whole herd of different
               | ARM processors. Even the very hottest code paths will
               | mostly do CPU feature detection but not "if cpu ==
               | i7-12700K { ... }"
        
               | spenczar5 wrote:
               | I get what you are saying, but in some domains, you
               | _really do_ design for a single SKU, beyond even just the
               | CPU. Supercomputers and computing centers like SLAC have
               | a very constrained set of SKUs that your software will
               | ever run on.
               | 
               | I know this is not how most cloud or consumer software
                | works, but that stuff _usually_ isn't as performance-
               | sensitive.
        
               | the8472 wrote:
               | There's plenty of CPU-cycle-munching work in the pro
               | segment that does care about performance and can't assume
               | fixed hardware. CI, compression, CGI, ...
        
               | pletnes wrote:
               | <<Most software>> - lots of software is run on cloud
               | compute / backend servers and can be built or tuned
               | specifically for the box/vm/container you're running it
               | on. I would expect AWS/Azure/Google to do that for their
               | PaaS offerings, for cost savings.
        
         | hinkley wrote:
         | I barely even try to run micro benchmarks on my work laptop,
         | and I'm the guy who introduced them. It's just so random.
         | 
         | Pretty much I'm looking for a slowdown that's more than 25% and
         | anything less could just be noise. When you're reaching for 5%
         | improvements, that means you have to let the CI system tell
         | you, and production response times are the final arbiter of
         | truth.
        
           | hmottestad wrote:
           | Do you have a system for running benchmarks in your CI?
        
         | adfgiknio wrote:
         | Software should be tested on a system like the one that will
         | run it. All modern machines use dynamic frequency scaling. It
         | should rarely be disabled in production. Many modern machines
         | have heterogeneous cores. Your testing procedure might be
         | reproducible but it's unrealistic.
        
       | kristjansson wrote:
       | > (you can also use perfplot, but note it's GPL-licensed)
       | 
        | Surely using a GPL-licensed development tool cannot affect the
       | licensing of the project it's being used to develop?
        
         | mroche wrote:
         | https://www.gnu.org/licenses/gpl-faq.html#GPLOutput
         | 
         |  _Is there some way that I can GPL the output people get from
         | use of my program? For example, if my program is used to
         | develop hardware designs, can I require that these designs must
         | be free?_
         | 
         |  _In general this is legally impossible; copyright law does not
         | give you any say in the use of the output people make from
         | their data using your program. If the user uses your program to
         | enter or convert her own data, the copyright on the output
         | belongs to her, not you. More generally, when a program
         | translates its input into some other form, the copyright status
         | of the output inherits that of the input it was generated
         | from._
         | 
         |  _So the only way you have a say in the use of the output is if
         | substantial parts of the output are copied (more or less) from
         | text in your program. For instance, part of the output of Bison
         | (see above) would be covered by the GNU GPL, if we had not made
         | an exception in this specific case._
         | 
         |  _You could artificially make a program copy certain text into
         | its output even if there is no technical reason to do so. But
         | if that copied text serves no practical purpose, the user could
         | simply delete that text from the output and use only the rest.
         | Then he would not have to obey the conditions on redistribution
         | of the copied text._
         | 
         | ---
         | 
         | https://www.gnu.org/licenses/gpl-faq.html#WhatCaseIsOutputGP...
         | 
         |  _In what cases is the output of a GPL program covered by the
         | GPL too?_
         | 
         |  _The output of a program is not, in general, covered by the
         | copyright on the code of the program. So the license of the
         | code of the program does not apply to the output, whether you
         | pipe it into a file, make a screenshot, screencast, or video._
         | 
         |  _The exception would be when the program displays a full
          | screen of text and/or art that comes from the program. Then
         | the copyright on that text and/or art covers the output.
         | Programs that output audio, such as video games, would also fit
         | into this exception._
         | 
          |  _If the art/music is under the GPL, then the GPL applies when
         | you copy it no matter how you copy it. However, fair use may
         | still apply._
         | 
         |  _Keep in mind that some programs, particularly video games,
          | can have artwork/audio that is licensed separately from the
         | underlying GPLed game. In such cases, the license on the
         | artwork/audio would dictate the terms under which
         | video/streaming may occur._
        
         | Lapha wrote:
         | If it were just a standalone tool it wouldn't affect the
         | licencing of the project, but it's not a tool, it's a software
         | library, so projects using the library need to follow the terms
         | of the GPL.
         | 
          | The tricky part here is that it's a library that produces
          | graphs, and the graphs themselves aren't covered by the terms
          | of the GPL. The intent of the tool seems to be that you use it
          | to produce graphs during development, for benchmarking or
          | research or whatever, where the tool is internal but the graphs
          | aren't necessarily internal. In that case the licence of the
          | library is mostly irrelevant, since the GPL is mostly concerned
          | with distributing software to other users of said software, not
          | the output of the software. But if you later choose to release
          | that tool, it really ought to follow the terms of the GPL and
          | be under a licence compatible with the GPL, so the note about
          | it being a GPL-licenced library is completely valid. You should
          | use a more permissively licenced (LGPL/MIT/BSD/whatever)
          | library like benchit if you don't want to have to think about
          | this sort of stuff.
        
           | kristjansson wrote:
           | > if you later choose to want to release that tool it really
           | aught to follow the terms of the GPL and be under a licence
           | compatible with the GPL
           | 
           | This may be in the spirit of Free-as-in-freedom software, but
           | I don't think it's grounded in the reality of copyright and
           | the GPL. With the usual disclaimer that IANAL: copyright
           | attaches to the work (the code), and is retained by the
           | author. The author grants you certain rights to use, modify,
           | and redistribute the code subject to conditions (the GPL).
           | Those constraints only encumber your works that are
           | derivative of the GPL-licensed code i.e. either direct
           | modifications of original source or linking original/derived
           | object code.
           | 
           | A product that directly and necessarily imported perfplot
           | would probably itself need to be GPL licensed to be
           | distributed. A product developed with merely the assistance
           | of something like perfplot would not need to be licensed in
           | any particular way, any more than software written in Emacs
           | needs to be GPL licensed.
        
             | cozzyd wrote:
             | As far as I understand, I don't think even importing is an
             | issue, unless you are distributing the module too as a
             | combined artifact (e.g. docker image). See e.g.
             | https://github.com/pyFFTW/pyFFTW/issues/229
             | 
              | There's a lot of GPL FUD from commercial interests, though.
        
       | fizzynut wrote:
       | I think the author has discovered memory bandwidth. When you have
        | a simple function and just scale up the number of cores, it's
        | easy to hit that limit.
        
       | jjslocum3 wrote:
       | I understand this is a Python-centric source, but without having
       | done my homework I'd have thought Python wouldn't be a
       | particularly great language for dealing with these low level
        | concerns. Wouldn't it be much easier in C? In Java it's as simple
       | as Runtime.getRuntime().availableProcessors()
        
         | mort96 wrote:
         | I mean other than being wrapped in a useless singleton, that's
         | the same as the 'os.cpu_count()' mentioned in the beginning of
         | the article.
        
       | navels wrote:
       | Related: fascinating deep dive of the Python GIL by David Beazley
       | from PyCon2010: https://www.youtube.com/watch?v=Obt-vMVdM8s
        
       | shanemhansen wrote:
       | The problem of "how many CPUs should I use?" is really only
       | answerable empirically in the general case.
       | 
       | The problem of "how many CPUs are available?" is a little more
        | tractable. Currently when running podman, the cpus allocated
        | seem to be visible in the /sys fs. I wonder if it's the same
        | under k8s?
        | 
        |     podman run --cpus=3.5 -it docker.io/library/alpine:latest /bin/sh
        |     / # cat /sys/fs/cgroup/cpu.max
        |     350000 100000
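        | 
        | Parsing that file from Python is straightforward; a rough
        | sketch, assuming the cgroup v2 hierarchy is mounted at
        | /sys/fs/cgroup:
        | 
        |     import math, os
        | 
        |     def cgroup_cpu_limit():
        |         try:
        |             quota, period = open(
        |                 "/sys/fs/cgroup/cpu.max").read().split()
        |         except OSError:
        |             return None       # no cgroup v2 limit visible
        |         if quota == "max":
        |             return None       # unlimited
        |         # e.g. 350000 / 100000 -> 3.5 -> 4 usable CPUs
        |         return math.ceil(int(quota) / int(period))
        | 
        |     print(cgroup_cpu_limit() or os.cpu_count())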
        
         | o11c wrote:
         | Per cgroups v2 documentation [0], `cpu.max` exists for all but
         | the root cgroup. If it doesn't exist, that means the kernel at
         | least is putting no restrictions compared to what the affinity
         | says (use `nproc` to call `sched_getaffinity` from the command
         | line). I'm not sure what hypervisor-based throttling exposes.
         | 
         | ***
         | 
         | That said, there are several rules of thumb to _guess_ the
         | optimal number of CPUs, even before you measure - and to figure
         | out how to improve the situation (since profiling is even
         | harder for threaded code):
         | 
         | * if you are contending for a shared resource, even relatively
         | rarely [1] (that is, assuming you're already using per-thread
         | resources when reasonable and only sharing big chunks), you're
         | likely to hit a limit for the maximum number of CPUs
         | regardless. Numbers I've seen are often 20 or lower. Using a
         | "nested" approach for resource sharing can improve this, but
         | this may require pinning threads (which in turn threatens
         | starvation - REMEMBER, YOU ARE NOT THE ONLY PROGRAM IN THE
         | WORLD!).
         | 
         | * if you are truly CPU-bound, microoptimizing to avoid (mostly:
         | random-access) memory stalls (usually only possible for purely
         | numerical code), logical CPUs are useless so you should limit
         | yourself to the physical CPU count. Usually you should have an
         | idea if this is the case.
         | 
         | * if your program is actively using an amount of memory similar
         | to a shared cache size, using fewer CPUs is often better. This
         | is not restricted to the exact counts of logical and physical
         | CPUs, except for caches that are physical-CPU-specific (L1
         | almost always is, and L2 usually is these days for P-cores but
         | not E-cores. L3 - and, for that matter, total RAM size to avoid
         | swapping - is the one where the optimal number of CPUs varies
         | arbitrarily in practice)
         | 
         | * if your code does even a small number of unpredicted loads
         | (typical object-oriented code does far more than that!) logical
          | CPUs are easily a win. Most code belongs here, so you can
          | assume this even if you're not familiar with the code, unless
          | one of the hard-to-know points applies.
         | 
         | [0]: https://www.kernel.org/doc/html/latest/admin-
         | guide/cgroup-v2...
         | 
         | [1]: https://en.wikipedia.org/wiki/Amdahl's_law
        
       | dr_kiszonka wrote:
       | PythonSpeed is one of my favorite websites. However, this article
       | leaves me with more questions than answers. (I do appreciate the
       | benchmarking code.) For example, at one point, the author
       | mentions that hyper-threading can be disabled in BIOS. Should I
       | disable it? Based on the author's own description, it sounds like
       | hyper-threading is pretty useful.
        
         | owlbite wrote:
         | Like much of the advice "it depends" and "measure it".
         | 
         | Basically hyper-threading shares a whole load of resources
         | between threads. If your code ends up contending on those
         | resources you probably run slower. If your code is blocked due
         | to lack of instruction-level parallelism, it probably runs
         | faster. Generally an expert-optimized numerical code (e.g.
         | machine learning) is going to fall into the first camp as it's
         | pretty easy to kill the ILP bottlenecks in that sort of code.
        
         | duohedron wrote:
          | When I wrote some parallel code in C, enabling hyper-threading
          | resulted in only a 30% increase in speed, which is not too
          | little, but you would expect more for doubling the thread
          | count. I don't remember what the bottleneck was, but it was
          | not I/O. On another occasion an HPC consultant advised us to
          | disable some physical cores for optimal performance of a
          | geological simulator application, because the cache-to-core
          | ratio is higher that way.
        
         | IshKebab wrote:
         | No. It's a performance optimisation and generally makes things
         | faster. The only reason to disable it is because of Spectre
         | security stuff.
        
           | CyberDildonics wrote:
           | No. It's a performance optimization when there are more
           | threads than physical cores because it helps hide memory
           | latency behind threading.
        
       | gnufx wrote:
       | Use hwloc[1] to determine the hardware topology of the system,
       | and pin processes/threads appropriately. For a heterogeneous
       | system, you presumably want to avoid low-performance cores.
       | OpenMP has a standard way of assigning threads to cores.
       | Otherwise use hwloc, or perhaps likwid[2] explicitly.
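        | 
        | For the OpenMP route, the standard environment variables have to
        | be set before the OpenMP runtime initializes; a sketch from
        | Python (the values are illustrative):
        | 
        |     import os
        | 
        |     # must happen before any OpenMP-backed library starts threads
        |     os.environ.setdefault("OMP_NUM_THREADS", "8")
        |     os.environ.setdefault("OMP_PROC_BIND", "close")
        |     os.environ.setdefault("OMP_PLACES", "cores")
        | 
        |     import numpy as np  # imported after setting the environment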
       | 
       | Most things turn out to be memory-limited, and getting the
       | algorithm right may win more than simple parallelism; GEMM is a
       | case in point. As ever, profile first, and worry about thread
       | numbers after you've pinned appropriately and looked for simple
       | things like initialization interacting with memory policy.
       | 
       | The main HPC performance tools support Python and native code;
       | Threadspotter was good for threading/cache analysis, but seems
       | moribund. Note that SMT on POWER, for instance, is different from
       | x86.
       | 
       | (This is general HPC performance engineering; I haven't
       | considered the code in point.)
       | 
       | 1. https://www.open-mpi.org/projects/hwloc/
       | 
       | 2. https://github.com/rrze-likwid/likwid
        
       ___________________________________________________________________
       (page generated 2023-12-19 23:01 UTC)