[HN Gopher] Is Python Code Sensitive to CPU Caching? (2024)
___________________________________________________________________
Is Python Code Sensitive to CPU Caching? (2024)
Author : leonry
Score : 68 points
Date : 2025-04-02 09:53 UTC (3 days ago)
(HTM) web link (lukasatkinson.de)
(TXT) w3m dump (lukasatkinson.de)
| PhilipRoman wrote:
| IME CPU caches can be observed in almost all languages,
| regardless of how detached they are from the machine. What I'm
| really wondering is whether branch prediction can be observed
| from interpreted code.
| stingraycharles wrote:
| It should be, as most interpreted code uses a JIT native
| compiler at this point, which implies the branch prediction is
| native as well.
|
| I would be happy to stand corrected if someone knows more about
| this, though!
| tarruda wrote:
| AFAIK CPython doesn't JIT compile.
| pansa2 wrote:
| Recent versions of CPython do have an experimental JIT
| compiler. I'm not sure how widely-used it is, though.
|
| https://peps.python.org/pep-0744/
| Qem wrote:
| It's disabled by default, not production ready. You need
| to compile CPython yourself if you want to try it.
| pansa2 wrote:
| > _most interpreted code uses a JIT native compiler at this
| point_
|
| I suspect most JavaScript code is JIT-compiled - but most
| Python, Ruby etc is interpreted (via bytecode).
| vlovich123 wrote:
| Why would you think that branch prediction wouldn't be? Do you
| think the interpreter is adding too much false sharing to the
| branch predictor to render it worthless?
| almostgotcaught wrote:
| you do realize that the Python
|
|     if x > 128:
|         r = not r
|
| doesn't necessarily correspond to one branch at the ASM
| instruction level right? in fact, a priori, it doesn't even
| correspond to absolutely _any_ branches at the ASM
| instruction level.
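|
| (To illustrate, a sketch - the exact opcode names vary by
| CPython version - of how many interpreter-level operations
| that one Python 'if' expands into; g is just an illustrative
| function:)
|
|     import dis
|
|     def g(x, r):
|         if x > 128:
|             r = not r
|         return r
|
|     # On recent CPythons this prints LOAD_FAST / LOAD_CONST /
|     # COMPARE_OP / POP_JUMP_IF_FALSE / UNARY_NOT / ..., each of
|     # which the interpreter dispatches with its own native branches.
|     dis.dis(g)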
| vlovich123 wrote:
| Yes I do, and that's neither here nor there. There are almost
| certainly ASM-instruction-level branches implementing the
| conditional, and the branch predictor isn't tied to a single
| instruction - the rough high-level mental model is a cache
| keyed by a few of the least significant bits of the branch
| address (the PC), mapping to a prediction, although in
| practice it's far more complicated. Since the predictor is
| right like 80% of the time, even when there's false sharing
| across a lot of branches the CPU does a good job predicting.
| Its performance is primarily impacted when execution takes
| both branches closer to 50/50 than to 100/0.
|
| That's why I asked about false sharing specifically as
| that's the main way I can think of that Python code
| wouldn't be able to observe the branch predictor
| performance - because the interpreter has so many branches
| internally that it dominates any branches that might be
| caused by the Python code itself.
| exyi wrote:
| It corresponds to way more than one branch at the instruction
| level. The branch predictor AFAIK does not care what you are
| branching on; it just assumes branches will go in similar
| sequences as they did last time. If the Python 'if' is never
| taken, the instruction-level predictor will learn that after
| the comparison operation there is an 'if' operation and then
| another array access operation. If the Python 'if' is
| unpredictable, the CPU predictor can't be sure which opcode we
| are processing after the 'if', so there will be a penalty.
| mardifoufs wrote:
| Is there any public documentation on modern branch
| prediction algorithms? I know branch prediction is very
| important to modern CPUs, so SOTA techniques are probably
| not public... But it's really amazing what it can do,
| especially considering the "limited" cache sizes that
| branch predictors have.
| exyi wrote:
| I guess it depends on how deep you want to go, I think
| the real predictors are based on publicly known
| algorithms such as TAGE. This seems to be a nice overview:
| https://comparch.net/2013/06/30/why-tage-is-the-best/
| (it's 2013, so definitely not SOTA, but advanced enough
| for my taste :) )
| exyi wrote:
| It is! Although my test case is probably an unrealistically bad
| scenario:
|
| It's the classic "why is processing a sorted array faster
| than an unsorted one":
|
|     import random
|
|     def f(arr):
|         r = True
|         for x in arr:
|             if x > 128:
|                 r = not r
|         return r
|
|     arr = [ random.randint(0, 255) for i in range(0, 1000_000) ]
|     arr_sorted = list(sorted(arr))
|     %timeit f(arr)
|     %timeit f(arr_sorted)
|
| Results are (on my machine): 17.5 ms for unsorted, and 13.5 ms
| for sorted. For comparison, in a compiled language, the
| difference is 4x.
| ammar2 wrote:
| Edit: Analyzed the wrong thing earlier.
|
| This depends on the Python version, but if it has the
| specializing interpreter changes, the `COMPARE_OP` comparing
| the integers there is probably hitting a specialized
| `_COMPARE_OP_INT` [1].
|
| This specialization has a ternary that does `res = (sign_ish
| & oparg) ? PyStackRef_True : PyStackRef_False;`. This might
| be the branch that ends up getting predicted correctly?
|
| Older versions of Python go through a bunch of dynamic
| dispatch first and then end up with a similar sort of int
| comparison in `long_richcompare`. [2]
|
| [1] https://github.com/python/cpython/blob/561965fa5c8314dee5
| b86...
|
| [2] https://github.com/python/cpython/blob/561965fa5c8314dee5
| b86...
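|
| (If you want to check which path you're hitting, here's a
| sketch assuming CPython 3.12+, where dis.dis() accepts an
| adaptive= flag:)
|
|     import dis, random
|
|     def f(arr):
|         r = True
|         for x in arr:
|             if x > 128:
|                 r = not r
|         return r
|
|     # Run it enough times for the specializing interpreter to warm up.
|     f([random.randint(0, 255) for _ in range(100_000)])
|     # COMPARE_OP should now show up in its specialized form
|     # (COMPARE_OP_INT on the versions I'm aware of).
|     dis.dis(f, adaptive=True)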
| dzaima wrote:
| This isn't actually timing the sorting, but just the (dumb)
| function f.
| ammar2 wrote:
| Oh whoops, that's right. I totally missed that.
| cma wrote:
| The Python speed-up is probably from small integer caching: a
| sorted array will have runs of pointers to the same integer
| objects adjacent. The compiled language one is probably
| branch prediction, right?
| exyi wrote:
| I intentionally stayed in the small integer range to avoid
| benchmarking the cache. 256 distinct values should fit into
| L1 just fine in both cases.
|
| I'm now thinking that the difference might be even larger
| if we instead avoid small integers and let the CPU get
| stuck chasing pointers. The idea is that it gets stuck on a
| memory access, which forces it to speculate much further,
| which in turn makes it backtrack a longer path if a branch
| was mispredicted. I'm obviously no expert on this, feel
| free to correct me.
|
| The results for 1B range instead of 255 are 17.6 ms for
| unsorted / 68.2 ms for sorted! We are back to what the
| original article observed and it's a way stronger effect
| than what branch prediction can offer. So don't sort your
| arrays, keep them in the order the boxed values were
| allocated ;)
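|
| (Concretely, the variant was along these lines, reusing f and
| the %timeit setup from above:)
|
|     arr = [ random.randint(0, 1_000_000_000) for i in range(0, 1000_000) ]
|     arr_sorted = list(sorted(arr))
|     %timeit f(arr)         # boxes visited roughly in allocation order
|     %timeit f(arr_sorted)  # sorted by value, so the addresses being chased are effectively shuffled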
| cma wrote:
| How big is the pointed-to small integer object? With
| alignment etc. I'm seeing some stuff saying 256 of them would
| fill an 8KB L1. Plus other stuff for the interpreter might
| overfill it. Sorted, that would be less of an issue.
|
| The larger range one being slower when sorted makes sense,
| because the allocation order no longer matches the iteration
| order.
| exyi wrote:
| I don't know how large those boxes are, but a normal CPU L1
| cache has 32 or 48KB, which should be plenty for this.
| Python opcodes for this program are going to be tiny, and
| the interpreter itself uses the instruction-L1 cache
| (which is another 32-48KB). I hope the sequential scan of
| the big array won't flush the L1 cache (there should be
| 12-way associativity with LRU, so I don't see how it
| could).
|
| Anyway, there is no need to have 256 integers, just 2 is
| enough. When I try that, the results are similar: 17.5 ms
| (unsorted) / 12.5 ms (sorted)
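|
| (I.e. something like this for the two-value version, with the
| rest of the benchmark unchanged:)
|
|     arr = [ random.choice((0, 255)) for i in range(0, 1000_000) ]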
| bgwalter wrote:
| That seems very likely. The benchmark should probably use a
| range that is guaranteed to be outside of the cached
| smallint range.
| exyi wrote:
| Then you are back to what the article discusses. Each
| integer is in a separate box, those boxes are allocated
| in one order, sorting the array by value will shuffle it
| by address and it will be much slower. I tested this as
| well, see the other comment.
| LPisGood wrote:
| This is a really good example. It is more like branch
| prediction than standard data/instruction caching.
|
| I wonder if you could do Spectre-type vulnerabilities in
| Python. You would need some way to leak micro-architectural
| state, so without being particularly clever, maybe Python
| code could only be used as a gadget or something.
| eulgro wrote:
| My guess would be that branch misprediction does have an impact
| on interpreted languages, but much less. If bytecode
| instructions take on average 20 CPU cycles to execute, and the
| branch misprediction penalty is 50 CPU cycles, the relative
| cost of a misprediction is much smaller than in compiled code.
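|
| (Back-of-envelope with those assumed numbers: the flush costs
| the same either way, but it is a much smaller multiple of the
| per-operation cost when each bytecode op already takes ~20
| cycles.)
|
|     penalty = 50      # assumed misprediction penalty, cycles
|     interp_op = 20    # assumed average cost of one bytecode op, cycles
|     native_iter = 2   # assumed cost of one compiled loop iteration, cycles
|     print(penalty / interp_op)    # ~2.5x the cost of one bytecode op
|     print(penalty / native_iter)  # ~25x the cost of one native iteration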
| adgjlsfhk1 wrote:
| the counterpoint is that a lot of those 20 instructions are
| also branches
| saagarjha wrote:
| Yep, this is why Spectre mitigations are applied to browsers.
| PaulHoule wrote:
| "Mechanical sympathy" effects are so strong on modern CPUs you
| feel them even in languages like Python that are far from the
| metal.
| davvid wrote:
| The article didn't mention Python's atomic refcounts. They're
| very bad for cache usage as they're constantly invalidating
| cache.
| quotemstr wrote:
| The only version of Python that uses atomic reference counting
| is the very new free threaded version.
| senderista wrote:
| Non-atomic refcounts invalidate cache as well, they're just not
| as expensive.
| SteelByte wrote:
| Great overview of CPU caching and its impact on Python
| performance. The benchmark really drives home the point. It would
| be interesting to see how other dynamic languages like JavaScript
| compare, and if JIT compilers are able to optimize away some of
| these caching inefficiencies.
| saagarjha wrote:
| No, they can't. The "caching inefficiencies" (?!) are part of
| the hardware and can't be optimized away by a JIT. I think you
| have a very confused understanding of the post.
| Delk wrote:
| I really don't see why you wouldn't expect to find cache-
| sensitivity in Python, or in some other similarly high-level
| language.
|
| Python sequences are defined as "support[ing] efficient element
| access using integer indices" [1]. Python lists are sequences and
| thus must support random access. In practice that means the
| implementation is a (dynamically resizing) array allocated
| contiguously. That means spatial locality is relevant.
|
| If the list type were defined simply as an iterable collection,
| with no requirement of constant-time random access, the
| definition would be abstract enough that an implementation might
| end up being something else than a contiguous array. But if you
| define the type so that it supports constant-time random access
| [2], you pretty much end up with cache-friendly sequential access
| as well.
|
| If you don't define the list type as supporting random access,
| you also sacrifice asymptotic algorithmic efficiency for lookups
| by index. Any language that cares about efficiency _at all_
| separates collections that support efficient random access from
| those that don't. (For example, Python has lists and dicts/sets
| for different access patterns. The Java standard library has
| separate types for contiguous arrays/lists, hashtable-backed
| collections and tree-backed collections because the three have
| different algorithmic efficiencies for different access patterns.
| In practice it leads to different properties in terms of cache-
| friendliness as well.)
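|
| (As a small sketch of "different access patterns": the same
| membership query against a list and a set behaves very
| differently, because only one of them is built for it.)
|
|     import timeit
|     xs = list(range(100_000))
|     s = set(xs)
|     print(timeit.timeit(lambda: 99_999 in xs, number=1_000))  # linear scan through the array of pointers
|     print(timeit.timeit(lambda: 99_999 in s, number=1_000))   # hash lookup, roughly constant time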
|
| [1] https://docs.python.org/3/glossary.html#term-sequence
|
| [2] As usual, the Python documentation is a bit vague on the
| details. It doesn't really say random access has to be constant-
| time, only that it has to be "efficient". So you might be able to
| have a non-constant time implementation such as an indexable
| skiplist while arguing that it's efficient, but you'd have to go
| out of your way to do that.
| zahlman wrote:
| >I really don't see why you wouldn't expect to find cache-
| sensitivity in Python, or in some other similarly high-level
| language... Python lists are sequences and thus must support
| random access. In practice that means the implementation is a
| (dynamically resizing) array allocated contiguously.
|
| The contents of the dynamically allocated array, in the C
| implementation, are pointers (PyObject *), since the lists are
| heterogeneous and must be able to store any Python object.
| That entails indirection, which defeats spatial locality. Even
| if you iterate over cached data, you have to jump around to
| retrieve the actual objects to _do anything_ with them.
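|
| (A tiny sketch of that indirection: in CPython, id() happens
| to be the object's address, and each element of a list is its
| own heap-allocated box.)
|
|     xs = [i + 1000 for i in range(5)]   # values above the small-int cache, so each is a fresh object
|     print([hex(id(x)) for x in xs])     # the list itself only stores these pointers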
___________________________________________________________________
(page generated 2025-04-05 23:01 UTC)