[HN Gopher] How many CPU cores can you use in parallel?
___________________________________________________________________
How many CPU cores can you use in parallel?
Author : itamarst
Score : 138 points
Date : 2023-12-18 16:04 UTC (1 day ago)
(HTM) web link (pythonspeed.com)
(TXT) w3m dump (pythonspeed.com)
| BerislavLopac wrote:
| > Last updated 15 Dec 2023, originally created 18 Dec 2023
|
| I respect Itamar even more after realising that he has invented
| the time machine... :o
| not_your_vase wrote:
| This happens when you can't sync your tasks correctly in a
| multi-threaded environment. Synchronous tasks happen in the wrong
| order.
| pixelpoet wrote:
| Knock, knock.
|
| Race condition!
|
| Who's there?
| BerislavLopac wrote:
| There are only two hard problems in distributed systems: 2.
| Exactly-once delivery 1. Guaranteed order of messages 2.
| Exactly-once delivery [0]
|
| [0]
| https://twitter.com/mathiasverraes/status/632260618599403520
| nolongerthere wrote:
| lol that's a Wordpress bug, I don't remember how I hit it, but
| I've done it before too. Might have to do with not having the
| datetime set correctly on the server.
| dale_glass wrote:
| > But there is something unexpected too: the optimal number of
| threads is different for each function.
|
| Nothing unexpected there. Amdahl's Law in all its glory.
|
| A fast running function finishes fast, and coordinating the job's
| execution over many cores requires doing work every time
| something finishes. If you split your job into chunks in the
| microsecond time range, then you'll be handling lots and lots of
| tiny minutiae.
|
| You want to set up your tasks such that each thread has a good
| amount of stuff to chew through before it needs to communicate
| with anything.
| itamarst wrote:
| That's not it. I updated the article with an experiment of
| processing 5 items at a time. The fast function doing 5 images
| at a time is slower than the slow function doing 1 image at a
| time (24*5 > 90).
|
| If your theory was correct, we would expect the optimal number
| of threads for the fast function processing 5 images at a time
| to be similar to that of the slow function processing 1 image
| at a time.
|
| In fact, the optimal threads in this case (5 images at a time)
| was 20 for slow function, 10 for fast function, so essentially
| the same as the original setup.
| bogwog wrote:
| Is it a caching thing? The slow version seems less cache
| efficient, so if it is waiting due to cache misses, that
| could create an opportunity for something else to get
| scheduled in.
| gmm1990 wrote:
| I doubt it; the slow version uses division instead of bit
| shifting. My guess would be that the fast version saturated
| something like I/O or some non-CPU portion of the processor, and
| the division one was bottlenecked by the division logic in the
| processor.
| bogwog wrote:
| But it's iterating through the result vectors twice, so
| that's basically guaranteed to miss. Moving the threshold
| check into the loop above would at least eliminate that
| factor.
|
| Maybe division vs bit shifting does play a factor, but
| it's hard to compare that while the cache behavior is so
| different.
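|
| A minimal sketch of that fusion idea (hypothetical code, not the
| article's actual functions): the two-pass version walks memory
| twice, while the fused version thresholds each value while it is
| still hot in cache:
|
|     import numpy as np
|     from numba import njit
|
|     @njit
|     def threshold_two_pass(img, limit):
|         # pass 1: compute a derived value per pixel
|         tmp = np.empty(img.shape, np.int64)
|         out = np.empty(img.shape, np.uint8)
|         for i in range(img.shape[0]):
|             for j in range(img.shape[1]):
|                 tmp[i, j] = img[i, j] // 2
|         # pass 2: threshold; touches every cache line again
|         for i in range(img.shape[0]):
|             for j in range(img.shape[1]):
|                 out[i, j] = 255 if tmp[i, j] > limit else 0
|         return out
|
|     @njit
|     def threshold_fused(img, limit):
|         # single pass: no second sweep over memory
|         out = np.empty(img.shape, np.uint8)
|         for i in range(img.shape[0]):
|             for j in range(img.shape[1]):
|                 out[i, j] = 255 if (img[i, j] // 2) > limit else 0
|         return out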
| cogman10 wrote:
| Recalibrate how you feel about division and
| multiplication. It turns out, integer division on new
| processors is a 1 cycle process (and has been for a while
| now). Most of the multicycle instructions nowadays are
| things like SIMD and encryption.
| jandrewrogers wrote:
| Which processor has 1-cycle latency for integer division?
| Even Apple Silicon, which has the most highly optimized
| implementation I am aware of, appears to have 2-cycle
| latency. Recent x86 are much worse, though greatly
| improved. Integer division is much faster than it used to
| be but not single cycle.
|
| Also, most of those SIMD and encryption instructions can
| retire one per cycle on modern cores, but that isn't the
| same as latency.
| cogman10 wrote:
| You are correct, I mistakenly thought it was 1/cycle
| because I had previously remembered IMUL taking 30 cycles
| (and now it has a 1 cycle throughput).
|
| Agner Fog is reporting anywhere from 4->30 cycles on
| semi-recent architectures.
| kukkamario wrote:
| Based on the graph, the fast function's runtime is really short.
| You might just be seeing the effects of efficiency vs performance
| cores. A lower thread count makes most of the work run on
| performance cores and the task end times align more nicely. With
| a larger number of threads, the tasks running on performance
| cores complete first and you are left waiting for the tasks on
| efficiency cores to complete, or context switches have to be done
| and tasks are moved between cores, which causes overhead.
|
| You could try what happens if you have 10 times more images when
| running the fast function.
|
| Also, you have just 8 physical performance cores and 4 physical
| efficiency cores. Performance cores have hyper-threading, so each
| acts as 2 logical cores, but that doesn't mean they can actually
| execute 2 threads at maximum performance. If the tasks use the
| same parts of the processor core, the processor cannot run both
| threads at the same time and IPC will suffer. The slow task maybe
| uses more varied parts of the core, which allows better IPC with
| hyper-threading. So that may also reduce the optimal thread
| count.
| dmoy wrote:
| Hyper threading is SMT? I can never keep the branding names
| straight.
| nicoburns wrote:
| Yes, hyper threading is SMT
| adfgiknio wrote:
| >Nothing unexpected there. Amdahl's Law in its glory.
|
| That's not Amdahl's Law. Amdahl's Law describes the expected
| total speedup from a speedup to a segment of the program. It is
| usually used in the context of adding parallelism but is more
| general than that. It does not assume parallelism has any
| overhead.
|
| Amdahl's Law predicts performance will approach a limit with
| added cores, but still predicts that each core will improve
| performance.
|
| >A fast running function finishes fast, and coordinating the
| job's execution over many cores requires doing work every time
| something finishes. If you split your job into chunks in the
| microsecond time range, then you'll be handling lots and lots
| of tiny minutia.
|
| Not necessarily. Modern systems have pools with independent
| per-core queues and clever ways of distributing work to them
| with minimal overhead. Work-stealing pools don't have any
| overhead from synchronization except when a core runs out of
| work, so each core just runs a `for` loop most of the time. I
| have seen a simple (pseudocode) `list.parallelMap(foo)` give a
| linear speedup on a modern JVM even on a large host with a
| `foo` that takes <100ns.
|
| Trying to manually batch work risks one thread running faster
| than the others and then idling when it runs out of work. This
| is common on modern hardware with heterogeneous cores and
| dynamic frequency scaling.
|
| If the tasks require their own coordination or the runtime
| isn't smart, then manual batching may be needed. I wouldn't
| assume Python does the right thing.
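|
| In Python, the closest low-effort knob for that kind of batching
| is the chunksize argument on the executor's map (a minimal
| sketch; foo and the input sizes are placeholders):
|
|     from concurrent.futures import ProcessPoolExecutor
|
|     def foo(x):
|         # placeholder per-item work
|         return x * x
|
|     if __name__ == "__main__":
|         items = range(1_000_000)
|         with ProcessPoolExecutor() as pool:
|             # chunksize batches items per task, cutting per-item
|             # IPC overhead; too large a chunk risks an idle tail
|             results = list(pool.map(foo, items, chunksize=10_000))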
| csdvrx wrote:
| > Amdahl's Law predicts performance will approach a limit
| with added cores, but still predicts that each core will
| improve performance.
|
| And I would agree: on my systems, efficiency cores are used to
| handle other tasks and keep the performance cores cold and ready
| to use when needed (dynamic frequency scaling).
|
| > Trying to manually batch work risks one thread running
| faster than the others and then idling when it runs out of
| work. This is common on modern hardware with heterogeneous
| cores and dynamic frequency scaling.
|
| Yes, I think the issue here is more that Python can't introspect
| the artificial cgroup limitations placed by Docker on the CPUs
| it's using, causing the kink past 9.
|
| If the goal is performance, I think it would be better to assign
| the cores manually, use CPU pinning, and declare to the container
| what resources it really has.
|
| On a NOHZ kernel, you can do manual assignments like
| nohz_full=1-3,5-7 rcu_nocbs=0-3,5-7 irqaffinity=4
|
| This puts the IRQ burden on one core but you can do 2 (with
| =4,5) etc
| dale_glass wrote:
| > That's not Amdahl's Law. Amdahl's Law describes the
| expected total speedup from a speedup to a segment of the
| program. It is usually used in the context of adding
| parallelism but is more general than that. It does not assume
| parallelism has any overhead.
|
| Yeah, it's a fiction. A useful one, but still a sort of
| spherical cow in a vacuum scenario. It doesn't exactly
| describe real hardware.
|
| > Amdahl's Law predicts performance will approach a limit
| with added cores, but still predicts that each core will
| improve performance.
|
| Because it's a fiction. You can have slowdowns with bigger
| tasks on real hardware, like if you run out of L1 cache.
|
| > Not necessarily. Modern systems have pools with independent
| per-core queues and clever ways of distributing work to them
| with minimal overhead.
|
| True, but that's Java and this is Python. I may be wrong, but
| since the GIL removal is very new still, I don't expect
| Python to have nearly the same level of efficiency in this
| regard.
|
| But I could be wrong somewhere, of course.
| mihaic wrote:
| Starting from Python 3.13 there should be a new method
| os.process_cpu_count() that aims to get the actually available
| number of cores our process can run on.
| itamarst wrote:
| Neat! Unfortunately at the moment it still ignores cgroups,
| it's just a wrapper around sched_getaffinity().
|
| https://github.com/python/cpython/blob/6a69b80d1b1f3987fcec3...
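|
| The three counts can disagree on a cgroup-limited host (a minimal
| sketch; process_cpu_count assumes Python 3.13+ and
| sched_getaffinity assumes Linux):
|
|     import os
|
|     print(os.cpu_count())                # all logical CPUs on the machine
|     print(len(os.sched_getaffinity(0)))  # CPUs this process may run on
|     if hasattr(os, "process_cpu_count"):
|         print(os.process_cpu_count())    # 3.13+; currently mirrors the affinity mask
|     # none of these reflect a cgroup CPU quota such as docker --cpus=2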
| mihaic wrote:
| True, I'm wondering if they might change the implementation
| for this actually, since the function name is pretty
| agnostic.
| onetimeuse92304 wrote:
| Personally, I dislike configuration parameters that let future
| admins of the system change parameters like concurrency of
| certain processes, etc.
|
| A lot of the time even I can't tell what is going to be the
| optimum setting.
|
| So for a number of years, rather than expose a configuration
| parameter, I am building in a performance test that establishes
| the values of these parameters.
|
| This is a little bit of search and a hill climb through the
| multidimensional space of all relevant parameters. It is not
| perfect, but in my experience it is almost always better than I
| can do manually and always better than an operator without
| intimate understanding of the system.
|
| The results are cached in a database and can be re-executed if a
| change to the configuration is found (just take a number of
| parameters from your environment, the version of the application,
| etc., hash the value, and check whether you already have
| parameters in the database).
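|
| A minimal sketch of the shape of this (hypothetical names; the
| benchmark, parameter ranges, and database plumbing would come
| from the application):
|
|     import hashlib, json, os, platform
|
|     def environment_key(app_version):
|         # hash whatever should invalidate the cached tuning
|         blob = json.dumps([app_version, platform.machine(), os.cpu_count()])
|         return hashlib.sha256(blob.encode()).hexdigest()
|
|     def tune_threads(run_benchmark, max_threads):
|         # one-dimensional hill climb: stop when runtime stops improving
|         best_n, best_t = 1, run_benchmark(1)
|         for n in range(2, max_threads + 1):
|             t = run_benchmark(n)
|             if t >= best_t:
|                 break
|             best_n, best_t = n, t
|         return best_n
|
|     # store tune_threads(...) in the database under environment_key(...)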
| eropple wrote:
| This is interesting - how long does it take to write? Is it
| reusable?
|
| I'd be really interested in a blog post about this.
| onetimeuse92304 wrote:
| I am not big about writing blog posts.
|
| But now that I think about it, I could write a small library
| to automate most of it while delegating some tasks to the
| application (providing functionality to be tested, parameters
| and possible ranges to be searched, a way to save/restore
| parameters to/from the database, provide fitness function,
| etc.) It is essentially a small benchmarking framework with
| some added functionality. I think it could be quite useful to
| some people.
| arp242 wrote:
| This works well for expensive long-running things, but works
| less well for more short-lived programs, and even for expensive
| long-running things there may be more considerations than
| "what's the fastest value?" I intentionally set the default
| jobs of "make" to "4" rather than "8" because I _don 't_ want
| to utilize my full system when compiling stuff, so I can still
| use it for other things at a reasonable speed. There's lots of
| reasons you might want to do this: on a web platform you don't
| really want to use up all resources by the one expensive thing,
| but what's "reasonable" and "desired" often differs - often
| there is no objectively best "optimum setting".
| onetimeuse92304 wrote:
| You do not have to target best absolute performance. You can
| choose your own fitness functions to balance absolute
| performance and other aspects of the system like resource
| usage, latency, or any other aspects of the application.
| Think about what logic the operator would use to establish
| the parameter and capture it as code or as fitness function.
|
| As usual, there is no single universal solution to optimising
| performance (and in fact it can be trivially proven not to exist
| in the general case). But you do not have to find a perfect
| solution; you just have to be able to find one as good as, or
| better than, what a human operator would.
|
| When you have a large system you can be working with
| thousands or tens of thousands of parameters. I usually see
| that even if the application is migrated to different
| hardware, the configuration parameters like buffer sizes,
| concurrency, connection pools, etc. are not really revisited.
| The effect is that usually the application will work
| suboptimally for the environment it is in.
|
| My solution is an attempt to keep this part of the application
| configuration sufficiently close to optimal to be usable, and to
| do it automatically without operator involvement (and the
| associated cost and possibility for defects).
| vlovich123 wrote:
| Yeah I'm in agreement with you. It can be hard for
| performance things (vs tuning parameters for numerical
| algorithm performance) because such tests can be inherently
| difficult to reproduce and writing the hill climbing can be
| tricky. It's a goal to aspire to for sure. Have you found
| any tooling to do the hill climb search for you in a
| multidimensional space?
| geysersam wrote:
| If you're using python `scipy.optimize` has several good
| options for the minimization procedure.
|
| I guess you could even use those if you're not using
| python. But it'd require slightly more plumbing.
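|
| For a single discrete parameter like thread count, a bounded
| scalar search with rounding inside the objective is about the
| simplest version (a sketch; runtime_for is a placeholder
| benchmark function, and noisy objectives may need repeated runs):
|
|     import os
|     from scipy.optimize import minimize_scalar
|
|     def runtime_for(n_threads):
|         # run the workload with n_threads and return seconds taken
|         ...
|
|     res = minimize_scalar(lambda x: runtime_for(max(1, round(x))),
|                           bounds=(1, os.cpu_count()), method="bounded")
|     best_threads = max(1, round(res.x))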
| Palomides wrote:
| consider running make with nice, let the scheduler work for
| you
| arp242 wrote:
| Actually I do both:
|
|     % alias make
|     make='nice -n 20 make -j4'
|
| But I found that even with a nice level of 20 using all
| CPUs doesn't quite give me the responsiveness I want for
| other tasks.
| onetimeuse92304 wrote:
| Put it in a container where you can explicitly limit CPU
| / memory usage?
|
| I run all builds in containers/vms anyway as I am too
| tired to keep my personal machine able to build hundreds
| of different things. It is better to have a dedicated
| development environment frozen for each of the projects I
| work on.
| arp242 wrote:
| I just want to run make and not slow down Firefox or
| maybe some other things I'm doing. This simple alias I've
| had for over 20 years works, on all (Unix-y) systems. Why
| make everything a hypercomplex story?
| wtallis wrote:
| Make also has an option to only spawn new jobs when the
| load average is below a given value, which can be used to
| get make to reduce the number of jobs running in parallel
| when there are plenty of other processes using CPU time.
| fritzo wrote:
| Sounds similar to ATLAS, an automatically tuned BLAS library
| from the 1990s https://softlib.rice.edu/pub/CRPC-
| TRs/reports/CRPC-TR98751.p...
| onetimeuse92304 wrote:
| I am not surprised. I like to say that it is hard to have an
| original thought. Even if you think you are original, it is
| more likely somebody else has already done it in the past.
| Most of the computing novelties I am seeing are really old
| ideas, revisited and potentially improved upon.
|
| Thanks for the link, I will take a look when I have a
| moment:)
| bee_rider wrote:
| Eventually the pendulum swung back for BLAS libraries; the
| currently popular strategy is hand-tuned assembly kernels in
| libraries where the most important kernels have been
| identified.
|
| But BLAS is a very weird case, there are only a couple
| kernels that you care about (and the most important, GEMM,
| has been studied like crazy). And BLAS is used under the
| hood by pretty much the whole HPC community, so there's
| incentive for tons of expert human time investment.
|
| You can look at libflame/BLIS for the modern strategy.
|
| ATLAS is still neat, perhaps one can even derive more
| generally applicable lessons from that library, not sure...
| ska wrote:
| I thought the point of ATLAS wasn't the idea that
| automagical tuning was going to necessarily be best.
| Rather that it was _expensive_ to [edit hand tune] for
| all the hardware variations that were around and this was
| a practical approach to get to pretty damn good even if
| nobody had spent the money to produce a hand optimized
| library for your hardware (or, they had, but you didn't
| want to pay what they were asking).
|
| e.g. I remember vaguely people using it on AMD cpus
| because they couldn't use icc/mkl effectively...
| bee_rider wrote:
| > Rather that it was expensive to _automate_ for all the
| hardware variations that were around
|
| Is the "automate" that I've italicized supposed to be
| hand-tune or something along those lines? If so, I think
| I agree. I think this was the justification for it, and
| it made sense at the time, and it might make sense for
| future architectures or exotic ones now.
|
| But eventually Kazushige Goto wrote gotoBLAS, which
| really emphasized the importance of the matrix-
| multiplication kernel and showed how much performance
| could come from getting it right. Since it is only a
| handful of routines, it is possible to hand-tune them.
| BLIS went on to formalize this in a nice open-source
| library, so any vendor can drop their kernels in.
|
| What is special about a particular CPU from this point of
| view is the SIMD extension and the cache hierarchy/memory
| system. For SIMD extensions, somebody will write a matrix-
| multiply kernel in assembly or intrinsics before optimizing
| compilers get really good at using that extension (if they
| ever get to it!). The cache hierarchy is more abstractable I
| guess, but there are lots of rectangles and other nice shapes
| that dedicated tuners can think about, e.g. (PDF warning)
|
| https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf
|
| I'm sure one _could_ search across all the possible
| sizes, but it is easier to just have a human identify the
| right ones.
|
| Maybe someone can train an ML model to identify good
| sizes. Machine learning uses lots of linear algebra, so I
| guess we'll have a nice feedback loop, maybe that will
| kick off the singularity, haha.
| ska wrote:
| > Is the "automate" that I've italicized supposed to be
| hand-tune or something along those lines? I
|
| Gah, yes, typing faster than coffee. Fixed now, thanks.
|
| Agree on your points about more recent work.
| owlbite wrote:
| Also worth observing that BLAS is an even weirder case, as
| odds are the CPU architecture was designed to ensure they
| could hit a high % of peak FMA throughput for that kernel, so
| there's a lot of SW/HW co-design going on there too.
| mattkrause wrote:
| The Fourier transform library _FFTW_ does this too.
|
| Its "wisdom" system will plan/measure the best way to do
| transforms of a particular size/type and cache it for future
| use.
|
| More here: https://www.fftw.org/fftw3_doc/Words-of-
| Wisdom_002dSaving-Pl...
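|
| From Python the same mechanism is exposed by pyFFTW (a rough
| sketch; assumes pyFFTW's export_wisdom/import_wisdom API and an
| FFTW_MEASURE planning pass):
|
|     import pickle
|     import pyfftw
|
|     a = pyfftw.empty_aligned(4096, dtype="complex128")
|     fft = pyfftw.builders.fft(a, planner_effort="FFTW_MEASURE")  # plan once, slowly
|
|     with open("wisdom.pickle", "wb") as f:
|         pickle.dump(pyfftw.export_wisdom(), f)  # cache the measured plans
|     # later runs: pyfftw.import_wisdom(pickle.load(open("wisdom.pickle", "rb")))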
| bashinator wrote:
| As a devops type person, I've been advocating for basic
| performance testing to be part of microservice unit tests. It'd
| be really wonderful to get rough numbers for peak and idle
| cpu/memory usage along with other test output.
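|
| A minimal sketch of capturing rough numbers inside a test
| (Linux-flavoured, where ru_maxrss is in kilobytes; handle_request
| is a hypothetical unit under test):
|
|     import resource
|     import time
|
|     def test_handler_perf():
|         start = time.process_time()
|         handle_request({"n": 10_000})
|         cpu_seconds = time.process_time() - start
|         peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
|         # report, or assert against a budget
|         print(f"cpu={cpu_seconds:.3f}s peak_rss={peak_kb}KB")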
| mrlonglong wrote:
| More cores are excellent for building large projects. My
| threadripper is worth every penny in software development.
| npoc wrote:
| The 7950X eats large C++ projects for breakfast, especially
| those with lots of boost includes. Every logical core counts.
| Skunkleton wrote:
| I'm still waiting for a multithreaded C++ compiler. The project
| I work on has some nasty templating, and there are a few
| compilation units that take a long long time.
| bogwog wrote:
| Add mold into the mix, and you're in C++ heaven.
|
| https://github.com/rui314/mold
| Kon-Peki wrote:
| > you can spend a little bit of time at the beginning to
| empirically measure the optimal number of threads, perhaps with
| some heuristics to compensate for noise.
|
| I'd like to point out that there is a lot of stuff published on
| ArXiv; this appears to be a very active research subject. Don't
| start from scratch :)
| dahart wrote:
| > our faster function could take advantage of no more than 8
| cores; beyond that it started slowing down. Perhaps it started
| hitting some bottleneck other than computation, like memory
| bandwidth.
|
| @itamarst Yes, this is interesting, you should profile it and get
| to the bottom of the issue! In my experience, being limited by
| hyperthreading or instruction-level parallelism is relatively
| rare, and much more often it's cache or memory access patterns or
| implicit synchronization or contention for a hardware resource.
| There's a good chance you'll learn something useful by
| figuring it out. Maybe it's memory bus contention, maybe it's
| cache, maybe numba compiled in something you aren't expecting.
|
| Worth noting that using 20 on the fast test isn't that much
| slower than using 8. A good first guess/proxy for number of
| threads to use is the number of cores, and that pays off in this
| case compared to using too few cores.
|
| Out of curiosity, do you know if your images are stored row-major
| or column major? I see the outer loop over shape[0] and inner
| loop over shape[1]. Is the compiled code stepping in memory by 1
| pixel at a time, or by a whole column? If your stride is a column,
| you may be thrashing the cache.
|
| I'd also be curious to hear how the speed of this compiled code
| compares to a numpy or PIL image threshold operation, if you
| happen to know.
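|
| For reference, the vectorized NumPy version of a threshold is
| essentially a one-liner (a sketch; the image size and threshold
| value are made up here):
|
|     import numpy as np
|
|     img = np.random.randint(0, 256, (2048, 2048), dtype=np.uint8)
|     out = np.where(img > 128, np.uint8(255), np.uint8(0))  # elementwise compare, then select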
| itamarst wrote:
| NumPy default is that you iterate over the earlier dimensions
| first.
|
| The slow code is likely at least partially slow due to branch
| misprediction (this is specific to my CPU, not true on CPUs
| with AVX-512), see https://pythonspeed.com/articles/speeding-
| up-numba/ where I use `perf stat` to get branch misprediction
| numbers on similar code.
|
| With SIMD disabled there's also a clear difference in IPC, I
| believe.
|
| The bigger picture though is that the goal of this article is
| not to demonstrate speeding up code, it's to ask about level of
| parallelism given unchanging code. Obviously all things being
| equal you'll do better if you can make your code faster, but
| code does get deployed, and when it's deployed you need to
| choose parallelism levels, regardless of how good the code is.
| dahart wrote:
| > when it's deployed you need to choose parallelism levels,
| regardless of how good the code is.
|
| Yes, absolutely, exactly. That's why it can be really helpful
| to pinpoint that cause of slowdown, right? It might not
| matter at deployment time if you have an automated shmoo that
| calculates the optimal thread load, but knowing the actual
| causes might be critical to the process, and/or really help
| if you don't do it automatically. (For one, it's possible the
| conditions for optimal thread load could change over the
| course of a run.)
| lumost wrote:
| Is there a good method to split out memory and CPU bottlenecks
| on a CPU? Most folks simply optimize until they hit 100% and
| assume all CPU usage is created equal.
|
| Seems like there needs to be a tool similar to htop with
| utilization stats for different instructions to indicate how
| much of the CPU is idle/busy.
| grogers wrote:
| I think most people would measure instructions per cycle as
| an estimate of how busy the CPU execution units actually are.
| Lots of performance tools do measure it.
| dist1ll wrote:
| IPC is not a good performance metric - especially for an
| instruction set like x86, but particularly for SIMD ISAs.
| Two seemingly similar instruction sequences may get
| decoded into a different number of uops, which in turn
| occupy different execution ports. The most extreme example
| of this being scalar code without stalls. It tends to have
| high IPC, even though it makes no use of FP/SIMD ports.
|
| Keeping the execution pipeline busy is not a useful goal in
| isolation. Really, IPC is better left as a sanity check.
| gnufx wrote:
| Consider appropriate hardware counters. (Note that
| multiplexing to measure more than three(?) at once can be
| misleading.) At least for serial code on x86, there's "top-
| down performance analysis". Some tools (I know Maqao) suggest
| how to address bottlenecks.
| omgtehlion wrote:
| > Intel i7-12700K processor
|
| Wait! You can't benchmark on a CPU with unpredictable speed and a
| mix of slow and fast cores. All kinds of effects come into play
| here, and the code itself is not the most prominent among them.
|
| To measure which _code_ is better, you should use a real SMP
| machine with a fixed clock speed and HT turned off. On the machine
| from TFA you are just fighting with Intel's thermal smarts and OS
| scheduling shenanigans. (edit: you can use the same machine, but
| configure it in the BIOS. I, myself, use an i9-12900k fixed to
| 5GHz on the 8 P-cores as my "Intel testing machine")
| sspiff wrote:
| Except then you're not testing on the hardware configurations
| people will actually run the software on.
|
| People do run software with HyperThreading on and on E cores
| and with Turbo boost and throttling. Having code that behaves
| better in these dynamic, heterogeneous environments is a net
| benefit to the user.
| dijit wrote:
| You're missing the point. You need to remove as many variables
| as possible when benchmarking, because benchmarks are always
| relative to each other.
|
| In absolute terms, _testing_ is required on a standard config,
| but benchmarks should mostly be conducted with a minimum of
| external variables if you want to draw meaningful conclusions.
| omgtehlion wrote:
| It is easier to deal with all the issues separately. First your
| algo (in a single thread), then threading, then memory accesses
| and cache effects. After you sort out everything in your control,
| you try to deal with the real world (like counting only "real"
| cores, ignoring or exploiting hyperthreads, OS scheduling, but
| most of these are quite unpredictable, and if you get your code
| to run faster on a 12700K, the same setting may be slower on,
| say, a 5800X or on server machines, while the basic stuff speeds
| up the program on any hardware).
| spenczar5 wrote:
| Those issues are not separable. The performance of memory
| and caches can determine which algorithm is best. This is
| the basis of the entire field of cache-aware algorithms,
| which routinely beat the pants off of theoretically
| superior algorithms.
|
| In my experience (which is in video transcoding and in
| research astrophysics - both domains where it matters!), if
| you really need to squeeze out performance, you have to
| design with the target platform available for profiling and
| benchmarking from the beginning.
|
| Edit to add: I agree wholeheartedly with your top-level
| comment! I just am perhaps more extreme than you; I don't
| think "laptop benchmarks" can _ever_ be twisted into being
| useful.
| the8472 wrote:
| You usually don't design an algorithm for a single CPU.
| Most software has to run on different tiers and
| generations from Intel, AMD, a whole herd of different
| ARM processors. Even the very hottest code paths will
| mostly do CPU feature detection but not "if cpu ==
| i7-12700K { ... }"
| spenczar5 wrote:
| I get what you are saying, but in some domains, you
| _really do_ design for a single SKU, beyond even just the
| CPU. Supercomputers and computing centers like SLAC have
| a very constrained set of SKUs that your software will
| ever run on.
|
| I know this is not how most cloud or consumer software
| works, but that stuff _usually_ isn't as performance-
| sensitive.
| the8472 wrote:
| There's plenty of CPU-cycle-munching work in the pro
| segment that does care about performance and can't assume
| fixed hardware. CI, compression, CGI, ...
| pletnes wrote:
| <<Most software>> - lots of software is run on cloud
| compute / backend servers and can be built or tuned
| specifically for the box/vm/container you're running it
| on. I would expect AWS/Azure/Google to do that for their
| PaaS offerings, for cost savings.
| hinkley wrote:
| I barely even try to run micro benchmarks on my work laptop,
| and I'm the guy who introduced them. It's just so random.
|
| Pretty much I'm looking for a slowdown that's more than 25% and
| anything less could just be noise. When you're reaching for 5%
| improvements, that means you have to let the CI system tell
| you, and production response times are the final arbiter of
| truth.
| hmottestad wrote:
| Do you have a system for running benchmarks in your CI?
| adfgiknio wrote:
| Software should be tested on a system like the one that will
| run it. All modern machines use dynamic frequency scaling. It
| should rarely be disabled in production. Many modern machines
| have heterogeneous cores. Your testing procedure might be
| reproducible but it's unrealistic.
| kristjansson wrote:
| > (you can also use perfplot, but note it's GPL-licensed)
|
| Surely using a GPL-licensed development tool cannot affect the
| licensing of the project it's being used to develop?
| mroche wrote:
| https://www.gnu.org/licenses/gpl-faq.html#GPLOutput
|
| _Is there some way that I can GPL the output people get from
| use of my program? For example, if my program is used to
| develop hardware designs, can I require that these designs must
| be free?_
|
| _In general this is legally impossible; copyright law does not
| give you any say in the use of the output people make from
| their data using your program. If the user uses your program to
| enter or convert her own data, the copyright on the output
| belongs to her, not you. More generally, when a program
| translates its input into some other form, the copyright status
| of the output inherits that of the input it was generated
| from._
|
| _So the only way you have a say in the use of the output is if
| substantial parts of the output are copied (more or less) from
| text in your program. For instance, part of the output of Bison
| (see above) would be covered by the GNU GPL, if we had not made
| an exception in this specific case._
|
| _You could artificially make a program copy certain text into
| its output even if there is no technical reason to do so. But
| if that copied text serves no practical purpose, the user could
| simply delete that text from the output and use only the rest.
| Then he would not have to obey the conditions on redistribution
| of the copied text._
|
| ---
|
| https://www.gnu.org/licenses/gpl-faq.html#WhatCaseIsOutputGP...
|
| _In what cases is the output of a GPL program covered by the
| GPL too?_
|
| _The output of a program is not, in general, covered by the
| copyright on the code of the program. So the license of the
| code of the program does not apply to the output, whether you
| pipe it into a file, make a screenshot, screencast, or video._
|
| _The exception would be when the program displays a full
| screen of text and/or art that comes from the program. Then
| the copyright on that text and/or art covers the output.
| Programs that output audio, such as video games, would also fit
| into this exception._
|
| _If the art/music is under the GPL, then the GPL applies when
| you copy it no matter how you copy it. However, fair use may
| still apply._
|
| _Keep in mind that some programs, particularly video games,
| can have artwork/audio that is licensed separately from the
| underlying GPLed game. In such cases, the license on the
| artwork/audio would dictate the terms under which
| video/streaming may occur._
| Lapha wrote:
| If it were just a standalone tool it wouldn't affect the
| licencing of the project, but it's not a tool, it's a software
| library, so projects using the library need to follow the terms
| of the GPL.
|
| The tricky part here is that it's a library that produces graphs
| which themselves aren't covered by the terms of the GPL. The
| intent of the tool seems to be that you use it to produce graphs
| during development, for benchmarking or research or whatever,
| where the tool is internal but the graphs aren't necessarily
| internal. In that case the licence of the library is mostly
| irrelevant, as the GPL is mostly concerned with distributing
| software to other users of said software, not the output of the
| software. But if you later choose to release that tool, it really
| ought to follow the terms of the GPL and be under a licence
| compatible with the GPL, so the note about it being a
| GPL-licenced library is completely valid. You should use a more
| permissive LGPL/MIT/BSD/whatever licenced library like benchit if
| you don't want to have to think about this sort of stuff.
| kristjansson wrote:
| > if you later choose to want to release that tool it really
| aught to follow the terms of the GPL and be under a licence
| compatible with the GPL
|
| This may be in the spirit of Free-as-in-freedom software, but
| I don't think it's grounded in the reality of copyright and
| the GPL. With the usual disclaimer that IANAL: copyright
| attaches to the work (the code), and is retained by the
| author. The author grants you certain rights to use, modify,
| and redistribute the code subject to conditions (the GPL).
| Those constraints only encumber your works that are
| derivative of the GPL-licensed code i.e. either direct
| modifications of original source or linking original/derived
| object code.
|
| A product that directly and necessarily imported perfplot
| would probably itself need to be GPL licensed to be
| distributed. A product developed with merely the assistance
| of something like perfplot would not need to be licensed in
| any particular way, any more than software written in Emacs
| needs to be GPL licensed.
| cozzyd wrote:
| As far as I understand, I don't think even importing is an
| issue, unless you are distributing the module too as a
| combined artifact (e.g. docker image). See e.g.
| https://github.com/pyFFTW/pyFFTW/issues/229
|
| There's a lot of GPL FUD from commercial interests, though.
| fizzynut wrote:
| I think the author has discovered memory bandwidth. When you have
| a simple function and just scale the number of cores it's easy to
| hit.
| jjslocum3 wrote:
| I understand this is a Python-centric source, but without having
| done my homework I'd have thought Python wouldn't be a
| particularly great language for dealing with these low level
| concerns. Wouldn't it be much easier in C? In Java it's as simple
| as Runtime.getRuntime().availableProcessors()
| mort96 wrote:
| I mean other than being wrapped in a useless singleton, that's
| the same as the 'os.cpu_count()' mentioned in the beginning of
| the article.
| navels wrote:
| Related: fascinating deep dive of the Python GIL by David Beazley
| from PyCon2010: https://www.youtube.com/watch?v=Obt-vMVdM8s
| shanemhansen wrote:
| The problem of "how many CPUs should I use?" is really only
| answerable empirically in the general case.
|
| The problem of "how many CPUs are available?" is a little more
| tractable. Currently when running podman, the cpus allocated
| seems to be available in the /sys fs. I wonder if it's the same
| under k8s?
|
|     podman run --cpus=3.5 -it docker.io/library/alpine:latest /bin/sh
|     / # cat /sys/fs/cgroup/cpu.max
|     350000 100000
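|
| A minimal sketch of turning that file into an effective CPU count
| (cgroup v2; a quota of "max" means unlimited):
|
|     def cgroup_cpu_limit(path="/sys/fs/cgroup/cpu.max"):
|         try:
|             quota, period = open(path).read().split()
|         except FileNotFoundError:
|             return None                      # no v2 cpu controller here
|         if quota == "max":
|             return None                      # no quota set
|         return float(quota) / float(period)  # e.g. 350000/100000 -> 3.5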
| o11c wrote:
| Per cgroups v2 documentation [0], `cpu.max` exists for all but
| the root cgroup. If it doesn't exist, that means the kernel at
| least is putting no restrictions compared to what the affinity
| says (use `nproc` to call `sched_getaffinity` from the command
| line). I'm not sure what hypervisor-based throttling exposes.
|
| ***
|
| That said, there are several rules of thumb to _guess_ the
| optimal number of CPUs, even before you measure - and to figure
| out how to improve the situation (since profiling is even
| harder for threaded code):
|
| * if you are contending for a shared resource, even relatively
| rarely [1] (that is, assuming you're already using per-thread
| resources when reasonable and only sharing big chunks), you're
| likely to hit a limit for the maximum number of CPUs
| regardless. Numbers I've seen are often 20 or lower. Using a
| "nested" approach for resource sharing can improve this, but
| this may require pinning threads (which in turn threatens
| starvation - REMEMBER, YOU ARE NOT THE ONLY PROGRAM IN THE
| WORLD!).
|
| * if you are truly CPU-bound, microoptimizing to avoid (mostly:
| random-access) memory stalls (usually only possible for purely
| numerical code), logical CPUs are useless so you should limit
| yourself to the physical CPU count. Usually you should have an
| idea if this is the case.
|
| * if your program is actively using an amount of memory similar
| to a shared cache size, using fewer CPUs is often better. This
| is not restricted to the exact counts of logical and physical
| CPUs, except for caches that are physical-CPU-specific (L1
| almost always is, and L2 usually is these days for P-cores but
| not E-cores. L3 - and, for that matter, total RAM size to avoid
| swapping - is the one where the optimal number of CPUs varies
| arbitrarily in practice)
|
| * if your code does even a small number of unpredicted loads
| (typical object-oriented code does far more than that!) logical
| CPUs are easily a win. Most code belongs here so you can assume
| even if you're not familiar with it, unless one of the hard-to-
| know points applies.
|
| [0]: https://www.kernel.org/doc/html/latest/admin-
| guide/cgroup-v2...
|
| [1]: https://en.wikipedia.org/wiki/Amdahl's_law
| dr_kiszonka wrote:
| PythonSpeed is one of my favorite websites. However, this article
| leaves me with more questions than answers. (I do appreciate the
| benchmarking code.) For example, at one point, the author
| mentions that hyper-threading can be disabled in BIOS. Should I
| disable it? Based on the author's own description, it sounds like
| hyper-threading is pretty useful.
| owlbite wrote:
| Like much of the advice "it depends" and "measure it".
|
| Basically hyper-threading shares a whole load of resources
| between threads. If your code ends up contending on those
| resources you probably run slower. If your code is blocked due
| to lack of instruction-level parallelism, it probably runs
| faster. Generally an expert-optimized numerical code (e.g.
| machine learning) is going to fall into the first camp as it's
| pretty easy to kill the ILP bottlenecks in that sort of code.
| duohedron wrote:
| When I wrote some parallel code in C, enabling hyper-threading
| resulted in only a 30% increase in speed, which is not too
| little, but you would expect more for doubling the thread count.
| I don't remember what the bottleneck was, but it was not I/O. On
| another occasion an HPC consultant advised us to disable some
| physical cores for the optimal performance of a geological
| simulator application, because the cache-per-core ratio is
| higher that way.
| IshKebab wrote:
| No. It's a performance optimisation and generally makes things
| faster. The only reason to disable it is because of Spectre
| security stuff.
| CyberDildonics wrote:
| No. It's a performance optimization when there are more
| threads than physical cores because it helps hide memory
| latency behind threading.
| gnufx wrote:
| Use hwloc[1] to determine the hardware topology of the system,
| and pin processes/threads appropriately. For a heterogeneous
| system, you presumably want to avoid low-performance cores.
| OpenMP has a standard way of assigning threads to cores.
| Otherwise use hwloc, or perhaps likwid[2] explicitly.
|
| Most things turn out to be memory-limited, and getting the
| algorithm right may win more than simple parallelism; GEMM is a
| case in point. As ever, profile first, and worry about thread
| numbers after you've pinned appropriately and looked for simple
| things like initialization interacting with memory policy.
|
| The main HPC performance tools support Python and native code;
| Threadspotter was good for threading/cache analysis, but seems
| moribund. Note that SMT on POWER, for instance, is different from
| x86.
|
| (This is general HPC performance engineering; I haven't
| considered the code in point.)
|
| 1. https://www.open-mpi.org/projects/hwloc/
|
| 2. https://github.com/rrze-likwid/likwid
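|
| On Linux, process-level pinning is also available from the
| standard library without hwloc (a minimal sketch; the P-core
| logical IDs below are an assumption about the machine):
|
|     import os
|
|     P_CORES = set(range(0, 16))        # assumed logical IDs of the P-cores
|     os.sched_setaffinity(0, P_CORES)   # restrict this process to those cores
|     print(os.sched_getaffinity(0))
|     # OpenMP-backed extensions can instead be steered with
|     # OMP_PLACES / OMP_PROC_BIND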
___________________________________________________________________
(page generated 2023-12-19 23:01 UTC)