[HN Gopher] Numba: A High Performance Python Compiler
       ___________________________________________________________________
        
       Numba: A High Performance Python Compiler
        
       Author : tosh
       Score  : 188 points
       Date   : 2022-12-27 13:36 UTC (9 hours ago)
        
 (HTM) web link (numba.pydata.org)
 (TXT) w3m dump (numba.pydata.org)
        
       | itamarst wrote:
       | Quick overview of the design space:
       | 
       | * PyPy JITs everything, so it can do _normal_ Python numerical
       | code that is quite fast and regular Python code that is fast.
       | However, its interactions with libraries like NumPy add overhead,
       | and it seems like it can't JIT code that interacts with NumPy in
       | a useful way (AFAIK, would be happy to be proven wrong). So not
       | useful for optimizing numeric functions that interact with
       | libraries like NumPy.
       | 
       | * Plain old NumPy and friends. This is great... if the operation
       | you want is already available as a "vectorized" API. "Vectorized"
       | in this context does NOT mean SIMD, it's a Python-specific usage,
       | see below.
       | 
        | * Numba: JIT compilation specifically focusing on interop
        | with NumPy and similar libraries. Lets you write a subset of
        | Python, but unlike NumPy you can use for loops and still go
        | fast.
       | 
        | * AOT compilation: Cython, Rust, C++, etc. You have a longer
       | feedback loop, but you have a full programming language,
       | especially if you avoid Cython. OTOH Cython has nicer Python
       | interop so for simple just-a-little-addon it can be easier to use
       | if you don't already know Rust. You really shouldn't be writing
       | new C++ in this day and age (but wrapping an existing library is
       | useful). Like C++, Cython doesn't help with memory safety. Cython
       | also suffers from two compilers, so debugging can be harder,
        | especially if you use the C++ interop; if you are wrapping an
        | existing C++ library, I'd probably start with PyBind11 based on
       | long-ago experience with Boost::Python.
       | 
       | Longer form:
       | 
       | * "Vectorization" in the context of Python:
       | https://pythonspeed.com/articles/vectorization-python/
       | 
       | * PyPy and Numba as alternatives to vectorization:
       | https://pythonspeed.com/articles/vectorization-python-altern...
       | 
       | * Choosing a compiled language:
       | https://pythonspeed.com/articles/rust-cython-python-extensio...
       | 
       | * The performance overhead of AOT compiled libraries (less
       | relevant if you're doing anything numeric):
       | https://pythonspeed.com/articles/python-extension-performanc...
       | 
       | * Numba intro: https://pythonspeed.com/articles/numba-faster-
       | python/
        
         | hedgehog wrote:
         | Good overview. "Vectorized" is an old term that's been around
         | since the early days of supercomputers and maybe before, not
         | sure where it came from. Numba does a bunch of different things
         | for code written to the Numpy API including CUDA acceleration.
         | Certain machine learning frameworks like PyTorch and JAX also
         | roughly follow the Numpy API because it is widely familiar and
         | easy enough to work with. The kind of code that benefits from
         | this kind of acceleration is hard to write yourself. A lot of
         | workloads lean on linear algebra operations that are
         | conceptually simple but complicated to implement with good
         | performance, thus why all of this tooling isn't just a couple
         | thousand lines of C. Good overview of matmul on CPU:
         | 
         | https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184...
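          | To make that concrete (a textbook sketch, not from the
          | gist): the naive triple loop below is the "conceptually
          | simple" version of matmul; np.dot dispatches to a tuned
          | BLAS that beats it by orders of magnitude on large
          | matrices.

```python
import numpy as np

def naive_matmul(A, B):
    # conceptually simple O(n^3) matmul; real BLAS kernels add
    # blocking, SIMD, and multithreading around the same arithmetic
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C
```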
        
       | chestertn wrote:
       | I will save you the pain: switch to Julia.
        
         | m_c_g wrote:
         | Indeed! Converting one's entire code base to a different
         | language ecosystem, finding equivalents to each of your third-
         | party dependencies, is less painful than employing a library to
         | selectively compile a few performance bottlenecks in your code.
         | 
         | (Modules like PyJulia facilitate a more incremental approach.)
        
           | Alifatisk wrote:
           | /s
        
           | xigoi wrote:
           | That's why you should switch before creating the codebase in
           | the first place.
        
       | IceHegel wrote:
        | Is there a standard set of benchmarks these Python JIT
        | projects use?
       | 
       | I'm very interested in adding something like this to some
       | projects but it needs to be 10-100x faster to be worth the
       | hassle. Otherwise, for our applications, it's a better time
       | investment to rewrite in Go and get the speed and pro tooling
       | than to further optimize python.
        
         | ellisv wrote:
         | I'm surprised you'd rewrite in Go rather than Julia. I'd expect
         | Julia would be much easier to translate to from Python and have
         | much better support for any mathematical operation.
        
         | doliveira wrote:
          | Lol, matrix arithmetic and scientific programming in Go
        
         | hedgehog wrote:
         | If you have numeric code that's too slow in Numba your next
         | stop will likely involve a big multi-language effort and GPU
         | specialists and none of that would be in Go except maybe a
         | wrapper for your apps.
        
       | chazeon wrote:
       | Software from our group (cij[1], qha[2]) were developed when
       | numba seems to be the best option for JIT. It generates more pain
       | in the hindsight. It generates a lot of depreciated warning due
       | to unstable API, locked numpy to a certain version (i remember
       | 1.21) due to compatibility issues, and when M1 Mac comes out,
       | there were for a long time lack of llvmlite porting to the new
       | platform, so cannot run on these new Macs.
       | 
        | If I had to do it again I would just use plain numpy, or JAX
        | from Google if JIT is really necessary.
       | 
       | [1]: https://github.com/MineralsCloud/cij
       | 
       | [2]: https://github.com/MineralsCloud/qha
        
         | gjvc wrote:
         | What if I'm (in Python) doing non-numerical stuff like parsing
         | text and generating code? What JIT / AOT tooling (if any) is
         | suitable?
        
           | fwilliams wrote:
           | I have personally gotten a lot of mileage from just writing
           | the compute heavy parts of my code in C++ and exposing it to
           | Python with a tool like PyBind11 [1] or NumpyEigen [2]. I
           | find tools like numba and cython to be more trouble than
           | they're worth.
           | 
           | [1] https://github.com/pybind/pybind11 [2]
           | https://github.com/fwilliams/numpyeigen
        
             | netjiro wrote:
             | I prototype in python or whatever, then, if the project
             | survives into market and has legs I either buy more
             | hardware or rewrite the expensive parts in C++.
             | 
             | Reduces calendar time, risk, cost. And I'm likely to make
             | better decisions once the code and market is better
             | understood after the prototype is tested under real world
             | conditions and the requirements have changed (like they
             | always seem to do).
        
           | singhrac wrote:
           | As a slight contrast to the other responses, I found setting
           | up maturin (Rust + Python) very straightforward since the
           | documentation is recent, and I find it's easy to write
           | parsers in Rust because the ADT syntax is very terse.
        
           | auxym wrote:
           | Pypy, probably. You could also consider writing pre compiled
           | extensions for your "hot" code, eg. in Cython.
        
           | chazeon wrote:
            | I think most parsing-heavy code just uses C/C++
            | extensions.
           | 
            | Examples I can think of:
            | 
            | 1. pyyaml's parser in C vs the Python version gets a huge
            | speedup on large files
            | 
            | 2. parsing a table (~GB size) with pandas vs self-
            | implemented Python code with a lot of for loops gains at
            | least a 20x speedup.
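            | For the pyyaml case, the speedup comes from picking the
            | libyaml-backed C loader (a standard pattern; the
            | fallback import below is my own defensive sketch):

```python
import yaml

# Prefer the libyaml-based C loader; fall back to the pure-Python
# loader if pyyaml was built without libyaml
try:
    from yaml import CSafeLoader as Loader
except ImportError:
    from yaml import SafeLoader as Loader

data = yaml.load("a: 1\nb: [2, 3]", Loader=Loader)
```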
        
           | ptype wrote:
            | I think what will be the most maintainable and bring you
            | the least long-term pain is Cython.
        
       | [deleted]
        
       | baggiponte wrote:
       | I am really intrigued by the Codon project, which aims to be a
       | JIT compiler for Python with Numba/JAX decorator syntax:
       | https://github.com/exaloop/codon
        
         | ipsum2 wrote:
          | It's not going to take off, since it doesn't have full (or
          | even mostly complete) API compatibility with Python. Numba
          | seems strictly better because it can interop with Python.
        
       | stared wrote:
       | As a side note, now it is easy to write Rust code, which can be
       | directly used in Python - https://github.com/PyO3/pyo3.
       | 
       | It cannot use NumPy and other libraries (since it is Rust), but
       | at the same time, I see its potential in creating high-
       | performance code to be used in Python numerical environment.
        
         | kylebarron wrote:
         | On the contrary, it can use and interface with numpy quite
         | easily: https://github.com/PyO3/rust-numpy
        
           | stared wrote:
           | Good to know!
        
       | grej wrote:
       | We were very heavy numba users at my former company. I would even
       | go so far as to say numba was probably the biggest computational
       | enabler for the product. I've also made a small contribution to
       | the library.
       | 
       | It's a phenomenal library for developing novel computationally
       | intensive algorithms on numpy arrays. It's also more versatile
       | than Jax.
       | 
        | In presentations, I've heard Leland McInnes credit numba
        | often when he speaks of his development of UMAP. We built a very
       | computationally intensive portion of our application with it and
       | it has been running in production, stable, for several years now.
       | 
       | It's not suitable for all use cases. But I recommend testing it
       | if you need to do somewhat complex calculations iterating over
       | numpy arrays for which standard numpy or scipy functions don't
       | exist. Even then, often we were surprised that we could speed up
       | some of those calculations by placing them inside numba.
       | 
       | Edit: ex of a very small function I wrote with numba that speeds
       | up an existing numpy function (note - written years ago and numba
       | has undergone quite some amount of changes since!):
       | https://github.com/grej/pure_numba_alias_sampling
       | 
       | Disclosure - I now work for Anaconda, the company that sponsors
       | the numba project.
        
         | PheonixPharts wrote:
         | > It's also more versatile than Jax
         | 
         | Does numba do automatic differentiation?
         | 
         | I view JAX as primarily an automatic differentiation tool with
         | the bonus that it makes great use of XLA and can easy make use
         | of GPU/TPUs.
         | 
         | I don't usually see numba and JAX as solving the same problem,
         | but would be excited to be wrong
        
           | fasttriggerfish wrote:
           | [dead]
        
         | melony wrote:
         | These days I have switched to
         | 
         | https://www.taichi-lang.org/
        
       | melling wrote:
       | [flagged]
        
         | dang wrote:
         | That's a bit too cynical, I think. People post follow-
         | up/related stories because the brain likes to follow chains of
         | associations.
         | 
         | You're right that these chains tend towards already-familiar
         | associations, which lower their value as HN stories. The best
         | HN stories are the ones that can't be predicted from any
         | existing sequence!
         | https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
        
       | inflationwatch wrote:
       | [dead]
        
       | _Wintermute wrote:
       | I've only used numba once but I was really impressed. We have an
       | analysis at work that runs hundreds of times a day which uses a
       | Hampel filter written in numpy, but still requires iterating over
       | an array. Just adding a @numba.jit decorator above the function
       | gave us a 10x speed improvement.
        
       | Escapado wrote:
       | When I wrote my bachelor thesis years back I worked on a
       | particle-in-cell code [1] that makes heavy use of numba for GPU
       | kernels. At the time it was the most convenient way to do that
        | from python. I remember spending weeks optimizing these
        | kernels to eke out every last bit of performance I could (which
       | interestingly enough did eventually involve using atomic
       | operations and introducing a lot of variables[2] instead of using
       | arrays everywhere to keep things in registers instead of slower
       | caches).
       | 
       | I remember the team being really responsive to feature requests
       | back then and I had a lot of fun working with it. IIRC compared
       | to using numpy we managed to get speedups of up to 60x for the
       | most critical pieces of code.
       | 
       | [1]: https://github.com/fbpic/fbpic [2]:
       | https://github.com/fbpic/fbpic/blob/1867a4f216baf4269f2314ab...
        
       | lvass wrote:
       | Does anyone know how this approach of adding decorators to
       | numerical functions compare to Elixir's Nx approach of compiling
       | those functions through a specialized macro for numerical
       | computations? Would Numba benefit if (PEP 638?) macros were added
       | to python?
        
       | usgroup wrote:
        | I think numba still makes sense for loopy algorithms, but not
        | so much if you're more vector oriented, given that JAX is
        | more or less a drop-in replacement for numpy and is
        | shockingly fast.
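        | The drop-in part looks like this (a toy sketch; the function
        | and its name are made up):

```python
import jax
import jax.numpy as jnp

# same code you'd write against numpy, jit-compiled by XLA
@jax.jit
def smooth_l1(x):
    ax = jnp.abs(x)
    return jnp.sum(jnp.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5))
```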
        
         | galangalalgol wrote:
         | I have used pytorch as a (almost) drop in replacement for
         | numpy. Are there good reasons to look at jax instead assuming
         | I'm doing DSP and not ML?
        
           | martinsmit wrote:
           | If you are doing array or vector-based work where the
           | operations can be written as maps as opposed to for loops
           | then JAX is king imo.
        
           | patrickkidger wrote:
           | Honestly, the two are now incredibly close.
           | 
           | JAX introduced a lot of cool concepts (e.g. autobatching
           | (vmap), autoparallel (pmap)) and supported a lot of things
           | that PyTorch didn't (e.g. forward mode autodiff).
           | 
           | And at least for my applications (scientific computing), it
           | was much faster (~100x) due to a much better JIT compiler and
           | reduced Python overhead.
           | 
           | ...but! PyTorch has worked hard to introduce all of the
           | former, and the recent PyTorch 2 announcement was primarily
           | about a better JIT compiler for PyTorch. (I don't think
           | anyone has done serious non-ML benchmarks for this though, so
           | it remains to be seen how this holds up.)
           | 
           | There are still a few differences. E.g. JAX has a better
           | differential equation solving ecosystem. PyTorch has a better
           | protein language model ecosystem. JAX offers some better
           | power-user features like custom vmap rules. PyTorch probably
           | has a lower barrier to entry.
           | 
           | (FWIW I don't know how either hold up specifically for DSP.)
           | 
           | I'd honestly suggest just trying both; always nice to have a
           | broader selection of tools available.
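            | For the autobatching point, vmap lifts a single-sample
            | function over a batch axis in one line (toy example, my
            | own function names):

```python
import jax
import jax.numpy as jnp

def dist(x, y):                    # written for single vectors
    return jnp.sqrt(jnp.sum((x - y) ** 2))

# vmap maps it over a leading batch axis: no loop, no reshaping
batched_dist = jax.vmap(dist)
```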
        
       | short_sells_poo wrote:
       | As someone who uses the python numerical computing libraries
       | extensively, Numba is my biggest disappointment in the ecosystem.
       | 
       | The main problem with Numba is that simple functions are easy
       | enough, and this lulls you into a false sense of security- that
       | things will work.
       | 
        | Unfortunately, every time it turns into a hair-tearing
        | exercise of trying to structure the code such that Numba's
        | vast array of unpredictable edge cases isn't hit.
       | 
       | The error messages are often infuriatingly bad.
       | 
       | At this point I've banned Numba from our codebase. If there's a
       | case for Numba, we just do it in C++ instead.
       | 
       | Edit: we've been looking at Taichi https://www.taichi-lang.org/
        
       | samsquire wrote:
       | How would this compare to Pypy?
       | 
       | I didn't think Pypy uses LLVM so I wonder who produced better
       | code.
       | 
       | That said, they're targeted at different audiences. I feel Numba
       | is targeted at data science and machine learning and even AI.
       | 
       | I feel a large portion of using or programming a computer is
       | structural and not the actual work of adding numbers together.
       | Very little of the code generated does the useful part a computer
       | does: addition. The rest is control flow management and data
       | placement! It's all preparation for the code to do an addition.
       | The hard part is putting together the structure for the computer
       | to do things that are useful.
       | 
       | So we invented methods, variables, classes, functions, closures,
       | expressions to create that structure easier.
       | 
       | I thought about creating a language which tries to eliminate the
       | structure that most programs accumulate and focus on the critical
       | addition or calculation and let the computer do the arrangement.
       | A JIT compiler for structure.
        
         | csdvrx wrote:
         | > let the computer do the arrangement
         | 
         | Isn't that constraint propagation?
         | 
         | I'm discovering JS at the moment. I don't fully understand the
         | async model, but the promise seems like a generic constraint of
         | "the result is now available"
         | 
         | Maybe you could have the "flow managements" as other
         | constraints?
        
           | samsquire wrote:
           | Thank you for your reply.
           | 
            | I'm thinking of the code for your average CRUD app or even desktop
           | compositor. A compositor copies pixels from multiple places
           | into one place. Surely that can be defined with a simple
            | loop? But no, there are hundreds of APIs in the way. Add Wayland
           | and X11 and you have something that is opaque and understood
           | by very few people.
           | 
           | The motivation behind my comment was that most of programming
           | computers is gluing together APIs to shift data from one
           | place to another before doing something useful with it. The
           | APIs themselves do very little addition or subtraction of
            | data but actually just move data around and place it into
           | the right place.
           | 
           | Maybe defining where things should be, declaratively, in
           | order to do a calculation would be useful. So the shape of
           | the calculation defines the data structure, rather than the
            | data structure defining the calculation.
        
             | csdvrx wrote:
             | For a compositor, I'd think of the set pixels being changed
             | (an "invalidation") a good example: the constraint would be
             | to update it on the screen.
             | 
             | Unchanged? Don't bother, leave it as-is. I think that's how
             | Intel power saving works.
             | 
             | Now think about the MVC model: some changes in the data
             | could result in a change in the view if the data currently
             | shown on screen is what has changed - like triggers in SQL.
             | 
             | I wonder if you could have everything work like that?
        
               | samsquire wrote:
               | You're right, and thank you for bringing async up.
               | 
                | And thank you for bringing up constraint propagation.
               | 
                | One of my ideas is the definition of formulas that
                | act as materialized views over other materialized
                | views. We can layer materialized views on top of one
                | another, then work out a derived formula that is
                | nearer to what we want; by simplifying that formula
                | we can compute it directly, without needing to
                | calculate the underlying views.
               | 
               | Is this differential dataflow?
               | 
               | I think it's an application of algebra and JIT compilers
               | could do it to expressions if we fed symbolic expressions
               | of programming languages into sympy or machine algebra.
               | 
                | In React, the library diffs virtual DOM nodes to see
                | if anything has changed. There is also dirty region
               | checking in old games and damage regions. These problems
               | are mathematically defined.
               | 
               | Here's my writings on the idea
               | https://github.com/samsquire/ideas4#31-algebraic-
               | materialise...
        
               | csdvrx wrote:
               | > I think it's an application of algebra and JIT
               | compilers could do it to expressions if we fed symbolic
               | expressions of programming languages into sympy or
               | machine algebra
               | 
                | Yes, and the constraints could be used to reduce the
               | computational costs, giving higher performance and lower
               | latency.
               | 
               | A while back, a good friend (we even shared HN accounts
               | for a while lol) pointed me to pipelinedb: a PostgreSQL
               | timeseries plugin for continuously updating
               | """materialized views"""
               | 
               | I use a lot of quotes around, because it wasn't either
               | like a regular view (computed when you query it, which
               | introduces latency) or a materialized view (frozen, needs
               | to be refreshed, same problem) but more like the NO_HZ
               | tickless kernel: the update of the calculations was
               | caused by the introduction of new data, not the passage
               | of time (which would be wasteful)
               | 
               | The general approach makes a lot of sense to me, and I
               | see how it could be used for more generic problems.
        
       | optimalsolver wrote:
       | I went out and learned C++ because Numba was so finicky to work
       | with.
        
       | dang wrote:
       | Related:
       | 
       |  _Faster Python calculations with Numba_ -
       | https://news.ycombinator.com/item?id=30392367 - Feb 2022 (66
       | comments)
       | 
       |  _Numba: a JIT compiler for Python that works best on code that
       | uses NumPy_ - https://news.ycombinator.com/item?id=21614533 - Nov
       | 2019 (9 comments)
       | 
       |  _How Numba and Cython speed up Python code_ -
       | https://news.ycombinator.com/item?id=17678758 - Aug 2018 (45
       | comments)
       | 
       |  _Numba: High-Performance Python with CUDA Acceleration_ -
       | https://news.ycombinator.com/item?id=15301766 - Sept 2017 (62
       | comments)
       | 
       |  _Numba - JIT specializing compiler for annotated Python and
       | NumPy code to LLVM_ -
       | https://news.ycombinator.com/item?id=5927787 - June 2013 (8
       | comments)
       | 
       |  _Accelerating Python Libraries with Numba (Part 2)_ -
       | https://news.ycombinator.com/item?id=5757231 - May 2013 (23
       | comments)
       | 
       |  _Accelerating Python Libraries with Numba_ -
       | https://news.ycombinator.com/item?id=5680722 - May 2013 (30
       | comments)
       | 
       |  _Numba: NumPy-aware optimizing compiler for Python_ -
       | https://news.ycombinator.com/item?id=4430780 - Aug 2012 (23
       | comments)
       | 
       |  _NumPy aware dynamic Python compiler using LLVM_ -
       | https://news.ycombinator.com/item?id=3864659 - April 2012 (9
       | comments)
       | 
       |  _Numba - A NumPy aware (LLVM-based) optimizing compiler for
       | Python_ - https://news.ycombinator.com/item?id=3692055 - March
       | 2012 (6 comments)
        
       | voz_ wrote:
       | Very impressive project. If compiling Python interests you, check
       | out the pytorch compiler stack too!
       | 
       | https://pytorch.org/get-started/pytorch-2.0/
        
       | micheles wrote:
       | I use numba a lot nowadays. Works perfectly well on all platforms
       | (linux, windows, mac, even the M1) and gives speedups as expected
       | (few percent for already well vectorized numpy code, and extra-
       | large speedups for loopy code). I strongly recommend it for the
       | performance critical part of your code. Many things are not
       | supported yet, so it has to be used with care. I remember I
        | needed a missing scipy special function and in the end I
       | implemented it myself by vectorizing math.erf: it was
       | surprisingly easy to do and a big success in terms of
       | performance.
        
       ___________________________________________________________________
       (page generated 2022-12-27 23:01 UTC)