[HN Gopher] Is Python Code Sensitive to CPU Caching? (2024)
       ___________________________________________________________________
        
       Is Python Code Sensitive to CPU Caching? (2024)
        
       Author : leonry
       Score  : 68 points
       Date   : 2025-04-02 09:53 UTC (3 days ago)
        
 (HTM) web link (lukasatkinson.de)
 (TXT) w3m dump (lukasatkinson.de)
        
       | PhilipRoman wrote:
       | IME CPU caches can be observed in almost all languages,
       | regardless of how detached they are from the machine. What I'm
        | really wondering is whether branch prediction can be observed
        | from interpreted code.
        
         | stingraycharles wrote:
          | It should be, as most interpreted code goes through a native
          | JIT compiler at this point, which implies the branch
          | prediction applies to native code as well.
         | 
         | I would be happy to stand corrected if someone knows more about
         | this, though!
        
           | tarruda wrote:
           | AFAIK CPython doesn't JIT compile.
        
             | pansa2 wrote:
             | Recent versions of CPython do have an experimental JIT
             | compiler. I'm not sure how widely-used it is, though.
             | 
             | https://peps.python.org/pep-0744/
        
               | Qem wrote:
               | It's disabled by default, not production ready. You need
               | to compile CPython yourself if you want to try it.
        
           | pansa2 wrote:
           | > _most interpreted code uses a JIT native compiler at this
           | point_
           | 
           | I suspect most JavaScript code is JIT-compiled - but most
           | Python, Ruby etc is interpreted (via bytecode).
        
         | vlovich123 wrote:
         | Why would you think that branch prediction wouldn't be? Do you
         | think the interpreter is adding too much false sharing to the
         | branch predictor to render it worthless?
        
           | almostgotcaught wrote:
            | you do realize that the Python
            | 
            |     if x > 128:
            |         r = not r
            | 
            | doesn't necessarily correspond to one branch at the ASM
            | instruction level, right? in fact, a priori, it doesn't even
            | correspond to absolutely _any_ branches at the ASM
            | instruction level.
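            | 
            | (Roughly, and only as an illustrative sketch of the idea,
            | not CPython's actual eval loop, with made-up opcode names:
            | every bytecode goes through a dispatch that is itself full
            | of branches.)
            | 
            |     # toy bytecode interpreter: one Python 'if' becomes
            |     # several native-level branches (dispatch tests, the
            |     # conditional jump itself, the loop test)
            |     def run(code, stack):
            |         pc = 0
            |         while pc < len(code):            # branch: loop test
            |             op, arg = code[pc]
            |             if op == "COMPARE_GT":       # branch: dispatch
            |                 b, a = stack.pop(), stack.pop()
            |                 stack.append(a > b)
            |             elif op == "JUMP_IF_FALSE":  # branch: dispatch
            |                 if not stack.pop():      # branch: the jump
            |                     pc = arg
            |                     continue
            |             elif op == "UNARY_NOT":      # branch: dispatch
            |                 stack.append(not stack.pop())
            |             pc += 1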
        
             | vlovich123 wrote:
              | Yes I do, and that's neither here nor there. There are
              | almost certainly ASM-level branches implementing the
              | conditional, and the branch predictor isn't tied to a
              | single instruction - the rough high-level mental model is
              | a cache keyed by a few of the least significant bits of
              | the branch's address, mapping to a prediction, although in
              | practice it's far more complicated. Since the predictor is
              | right something like 80% of the time, even when a lot of
              | branches falsely share an entry the CPU still does a good
              | job predicting. Its performance is primarily impacted when
              | execution takes both sides of a branch closer to 50/50
              | than to 100/0.
              | 
              | That's why I asked about false sharing specifically: it's
              | the main way I can think of that Python code wouldn't be
              | able to observe branch predictor performance - the
              | interpreter has so many branches internally that they
              | dominate any branches caused by the Python code itself.
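              | 
              | A toy version of that mental model (illustrative only;
              | real predictors such as TAGE are far more elaborate):
              | 
              |     # bimodal predictor: 2-bit saturating counters indexed
              |     # by the low bits of the branch address, so unrelated
              |     # branches can alias ("false sharing")
              |     TABLE_BITS = 10
              |     table = [1] * (1 << TABLE_BITS)  # 0-3; >= 2 = "taken"
              | 
              |     def predict(branch_addr):
              |         i = branch_addr & ((1 << TABLE_BITS) - 1)
              |         return table[i] >= 2
              | 
              |     def update(branch_addr, taken):
              |         i = branch_addr & ((1 << TABLE_BITS) - 1)
              |         if taken:
              |             table[i] = min(3, table[i] + 1)
              |         else:
              |             table[i] = max(0, table[i] - 1)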
        
             | exyi wrote:
              | It corresponds to way more than one branch at the
              | instruction level. AFAIK branch prediction does not care
              | what you are branching on; it just assumes branches will
              | go in similar sequences as they did last time. If the
              | Python 'if' is never taken, the instruction-level
              | predictor will learn that after the comparison operation
              | there is an 'if' operation and then another array access
              | operation. If the Python 'if' is unpredictable, the CPU
              | predictor can't be sure which opcode we are processing
              | after the 'if', so there will be a penalty.
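              | 
              | You can see the opcode sequence the interpreter walks
              | through with the dis module (exact opcode names vary by
              | CPython version):
              | 
              |     import dis
              | 
              |     def f(arr):
              |         r = True
              |         for x in arr:
              |             if x > 128: r = not r
              |         return r
              | 
              |     # shows a COMPARE_OP followed by a conditional-jump
              |     # opcode (e.g. POP_JUMP_IF_FALSE) for the 'if'
              |     dis.dis(f)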
        
               | mardifoufs wrote:
                | Is there any public documentation on modern branch
                | prediction algorithms? I know branch prediction is very
                | important to modern CPUs, so SOTA techniques are
                | probably not public... But it's really amazing what it
                | can do, especially considering the "limited" cache sizes
                | that branch predictors have.
        
               | exyi wrote:
               | I guess it depends on how deep you want to go, I think
               | the real predictors are based on publicly known
               | algorithms such as TAGE. This seems to be nice overview:
               | https://comparch.net/2013/06/30/why-tage-is-the-best/
               | (it's 2013, so definitely not SOTA, but advanced enough
               | for my taste :) )
        
         | exyi wrote:
         | It is! Although my test case is probably an unrealistically bad
         | scenario:
         | 
          | It's the classic "why is processing a sorted array faster
          | than an unsorted one?":
          | 
          |     import random
          | 
          |     def f(arr):
          |         r = True
          |         for x in arr:
          |             if x > 128: r = not r
          |         return r
          | 
          |     arr = [random.randint(0, 255) for i in range(0, 1000_000)]
          |     arr_sorted = list(sorted(arr))
          |     %timeit f(arr)
          |     %timeit f(arr_sorted)
         | 
         | Results are (on my machine): 17.5 ms for unsorted, and 13.5 ms
         | for sorted. For comparison, in a compiled language, the
         | difference is 4x
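          | 
          | Outside IPython, roughly the same measurement with the timeit
          | module (assuming the f/arr/arr_sorted above are defined;
          | numbers will vary per machine):
          | 
          |     import timeit
          |     # unsorted vs sorted, 10 runs each
          |     print(timeit.timeit(lambda: f(arr), number=10))
          |     print(timeit.timeit(lambda: f(arr_sorted), number=10))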
        
           | ammar2 wrote:
           | Edit: Analyzed the wrong thing earlier.
           | 
           | This depends on the Python version, but if it has the
           | specializing interpreter changes, the `COMPARE_OP` comparing
           | the integers there is probably hitting a specialized
           | `_COMPARE_OP_INT` [1].
           | 
           | This specialization has a ternary that does `res = (sign_ish
           | & oparg) ? PyStackRef_True : PyStackRef_False;`. This might
           | be the branch that ends up getting predicted correctly?
           | 
           | Older versions of Python go through a bunch of dynamic
           | dispatch first and then end up with a similar sort of int
           | comparison in `long_richcompare`. [2]
           | 
           | [1] https://github.com/python/cpython/blob/561965fa5c8314dee5
           | b86...
           | 
           | [2] https://github.com/python/cpython/blob/561965fa5c8314dee5
           | b86...
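            | 
            | On 3.11+ you can check whether the comparison actually got
            | specialized after the code has run (the adaptive= parameter
            | needs a recent dis; f and arr are from the parent comment):
            | 
            |     import dis
            |     f(arr)  # warm up the adaptive interpreter
            |     # COMPARE_OP should show a specialized form if it hit one
            |     dis.dis(f, adaptive=True)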
        
             | dzaima wrote:
             | This isn't actually timing the sorting, but just the (dumb)
             | function f.
        
               | ammar2 wrote:
               | Oh whoops, that's right. I totally missed that.
        
           | cma wrote:
            | The Python speed-up is probably from small integer caching:
            | a sorted array will have runs of adjacent pointers to the
            | same integer objects. The compiled-language one is probably
            | branch prediction, right?
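            | 
            | The small-int cache is easy to see (CPython keeps one shared
            | object for each int from -5 to 256):
            | 
            |     a = int("200"); b = int("200")
            |     print(a is b)   # True: both names point at the cached box
            |     c = int("1001"); d = int("1001")
            |     print(c is d)   # False: larger values get fresh boxes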
        
             | exyi wrote:
             | I intentionally stayed in the small integer range to avoid
             | benchmarking the cache. 256 distinct values should fit into
             | L1 just fine in both cases.
             | 
             | I'm now thinking that the difference might be even larger
             | if we instead avoid small integers and let the CPU get
             | stuck chasing pointers. The idea is that it gets stuck on a
             | memory access, which forces it to speculate much further,
             | which in turn makes it backtrack a longer path if a branch
             | was mispredicted. I'm obviously no expert on this, feel
              | free to correct me.
             | 
             | The results for 1B range instead of 255 are 17.6 ms for
             | unsorted / 68.2 ms for sorted! We are back to what the
             | original article observed and it's a way stronger effect
             | than what branch prediction can offer. So don't sort your
             | arrays, keep them in the order the boxed values were
             | allocated ;)
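              | 
              | For reference, that variant is just the same f with big,
              | non-cached integers (sketch; the timings are the ones
              | quoted above and will differ elsewhere):
              | 
              |     arr = [random.randint(0, 10**9) for _ in range(10**6)]
              |     arr_sorted = sorted(arr)
              |     %timeit f(arr)         # ~17.6 ms
              |     # sorting shuffles the addresses of the boxed ints:
              |     %timeit f(arr_sorted)  # ~68.2 ms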
        
               | cma wrote:
                | How big is the pointed-to small integer object? With
                | alignment etc. I'm seeing some claims that 256 of them
                | would fill an 8KB L1. Plus other stuff for the
                | interpreter might overfill it. Sorted, that would be
                | less of an issue.
                | 
                | The larger-range one being slower when sorted makes
                | sense, because the allocation order no longer matches
                | the iteration order.
        
               | exyi wrote:
                | I don't know how large those boxes are, but a normal
                | CPU L1 cache is 32 or 48KB, which should be plenty for
                | this.
               | Python opcodes for this program are going to be tiny, and
               | the interpreter itself uses the instruction-L1 cache
               | (which is another 32-48KB). I hope the sequential scan of
               | the big array won't flush the L1 cache (there should be
               | 12-way associativity with LRU, so I don't see how it
               | could).
               | 
               | Anyway, there is no need to have 256 integers, just 2 is
               | enough. When I try that, the results are similar: 17.5 ms
               | (unsorted) / 12.5 ms (sorted)
        
             | bgwalter wrote:
             | That seems very likely. The benchmark should probably use a
             | range that is guaranteed to be outside of the cached
             | smallint range.
        
               | exyi wrote:
               | Then you are back to what the article discusses. Each
               | integer is in a separate box, those boxes are allocated
               | in one order, sorting the array by value will shuffle it
               | by address and it will be much slower. I tested this as
               | well, see the other comment.
        
           | LPisGood wrote:
           | This is a really good example. It is more like branch
           | prediction than standard data/instruction caching.
           | 
           | I wonder if you could do Spectre type vulnerabilities in
           | python. You would need some way to leak micro-architectural
           | state, so without being particularly clever maybe python code
           | could only be used as a gadget or something.
        
         | eulgro wrote:
         | My guess would be that branch misprediction does have an impact
          | on interpreted languages, but much less. If bytecode
         | instructions take on average 20 CPU cycles to execute, and the
         | branch misprediction penalty is 50 CPU cycles, the relative
         | cost of a misprediction is much smaller than in compiled code.
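          | 
          | Back-of-the-envelope version of that argument (all figures
          | are rough assumptions):
          | 
          |     penalty = 50       # cycles lost per mispredict
          |     compiled_op = 1    # a compiled compare+branch, ~1 cycle
          |     bytecode_op = 20   # one interpreted bytecode, per above
          |     print(penalty / compiled_op)   # ~50x the branch itself
          |     print(penalty / bytecode_op)   # ~2.5x one bytecode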
        
           | adgjlsfhk1 wrote:
            | the counterpoint is that a lot of the instructions behind
            | those ~20 cycles are also branches
        
         | saagarjha wrote:
         | Yep, this is why Spectre mitigations are applied to browsers.
        
       | PaulHoule wrote:
       | "Mechanical sympathy" effects are so strong on modern CPUs you
       | feel them even in languages like Python that are far from the
       | metal.
        
       | davvid wrote:
       | The article didn't mention Python's atomic refcounts. They're
        | very bad for cache usage as they're constantly invalidating
        | cache lines.
        
         | quotemstr wrote:
         | The only version of Python that uses atomic reference counting
         | is the very new free threaded version.
        
         | senderista wrote:
         | Non-atomic refcounts invalidate cache as well, they're just not
         | as expensive.
        
       | SteelByte wrote:
       | Great overview of CPU caching and its impact on Python
       | performance. The benchmark really drives home the point. It would
       | be interesting to see how other dynamic languages like JavaScript
       | compare, and if JIT compilers are able to optimize away some of
       | these caching inefficiencies.
        
         | saagarjha wrote:
         | No, they can't. The "caching inefficiencies" (?!) are part of
         | the hardware and can't be optimized away by a JIT. I think you
         | have a very confused understanding of the post.
        
       | Delk wrote:
       | I really don't see why you wouldn't expect to find cache-
       | sensitivity in Python, or in some other similarly high-level
       | language.
       | 
       | Python sequences are defined as "support[ing] efficient element
       | access using integer indices" [1]. Python lists are sequences and
       | thus must support random access. In practice that means the
       | implementation is a (dynamically resizing) array allocated
       | contiguously. That means spatial locality is relevant.
       | 
       | If the list type were defined simply as an iterable collection,
       | with no requirement of constant-time random access, the
       | definition would be abstract enough that an implementation might
       | end up being something else than a contiguous array. But if you
       | define the type so that it supports constant-time random access
       | [2], you pretty much end up with cache-friendly sequential access
       | as well.
       | 
       | If you don't define the list type as supporting random access,
       | you also sacrifice asymptotic algorithmic efficiency for lookups
       | by index. Any language that cares about efficiency _at all_
       | separates collections that support efficient random access from
        | those that don't. (For example, Python has lists and dicts/sets
       | for different access patterns. The Java standard library has
       | separate types for contiguous arrays/lists, hashtable-backed
       | collections and tree-backed collections because the three have
       | different algorithmic efficiencies for different access patterns.
       | In practice it leads to different properties in terms of cache-
       | friendliness as well.)
       | 
       | [1] https://docs.python.org/3/glossary.html#term-sequence
       | 
       | [2] As usual, the Python documentation is a bit vague on the
       | details. It doesn't really say random access has to be constant-
       | time, only that it has to be "efficient". So you might be able to
       | have a non-constant time implementation such as an indexable
       | skiplist while arguing that it's efficient, but you'd have to go
       | out of your way to do that.
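        | 
        | For instance, list gives constant-time indexing while
        | collections.deque (a linked chain of blocks) does not; a quick
        | sketch of that split:
        | 
        |     import timeit
        |     from collections import deque
        | 
        |     n = 1_000_000
        |     lst = list(range(n))
        |     dq = deque(lst)
        | 
        |     # indexing the middle: O(1) for the list, O(n) for the deque
        |     print(timeit.timeit(lambda: lst[n // 2], number=1000))
        |     print(timeit.timeit(lambda: dq[n // 2], number=1000))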
        
         | zahlman wrote:
         | >I really don't see why you wouldn't expect to find cache-
         | sensitivity in Python, or in some other similarly high-level
         | language... Python lists are sequences and thus must support
         | random access. In practice that means the implementation is a
         | (dynamically resizing) array allocated contiguously.
         | 
          | The contents of the dynamically allocated array, in the C
          | implementation, are pointers (PyObject *), since the lists are
          | heterogeneous and must be able to store any Python object. That
          | entails indirection, which defeats spatial locality. Even if
          | you iterate over cached data, you have to jump around to
          | retrieve the actual objects to _do anything_ with them.
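          | 
          | A quick way to see that scattering (CPython-specific: id()
          | happens to be the object's address there):
          | 
          |     import random
          |     arr = [random.randint(0, 10**9) for _ in range(10)]
          |     print([id(x) for x in arr])          # allocation order
          |     # sorting by value scrambles the addresses you will chase:
          |     print([id(x) for x in sorted(arr)])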
        
       ___________________________________________________________________
       (page generated 2025-04-05 23:01 UTC)