XClose

Research Software Engineering Summer School

Home
Menu

Miscellaneous libraries to improve your code's performance

Over the course of years, Python ecosystem has developed more and more solutions to combat the relatively slow performance of the language. As seen in the previous lessons, the speed of computation matters quite a lot while using a technology in scientific setting. This lesson introduces a few such popular libraries that have been widely adopted in the scientific use of Python.

Compiling Python code

We know that Python is an interpreted and not a compiled language, but is there a way to compile Python code? There are a few libraries/frameworks that lets you just in time (JIT) or ahead of time (AOT) compile Python code. Both the techniques allow users to compile their Python code without using any explicit low level langauge's bindings, but both the techniques are different from each other.

Just in time compilers compile functions/methods at runtime whereas ahead of time compilers compile the entire code before runtime. AOT can do much more compiler optimizations than JIT, but AOT compilers produce huge binaries and that too only for specific platforms. JIT compilers have limited optimization routines but they produce small and platform independent binaries.

JIT and AOT compilers for Python include, but are not limited to -

  • Numba: a Just-In-Time Compiler for Numerical Functions in Python
  • mypyc: compiles Python modules to C extensions using standard Python type hints
  • JAX: a Python library for accelerator-oriented array computation and program transformation

Although all of them were built to speed-up Python code, each one of them is a bit different from each other when it comes to their use-case. We will specifically look into Numba in this lesson.

Numba

Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code. The good thing about Numba is that it works on plain Python loops and one doesn't need to configure any compilers manually to JIT compile Python code. Another good thing is that it understands NumPy natively, but as detailed in its documentation, it only understands a subset of NumPy functionalities.

Numba provides users with an easy to use function decorator - jit. Let's start by importing the libraries we will use in this lesson.

In [1]:
import math

from numba import jit, njit
import numpy as np

We can mark a function to be JIT compiled by decorating it with numba's @jit. The decorator takes a nopython argument that tells Numba to enter the compilation mode and not fall back to usual Python execution.

Here, we are showing a usual python function, and one that's decorated. We don't need to duplicate and change its name when using numba, but we want to keep both of them here to compare their execution times.

In [2]:
def f(x):
    return np.sqrt(x)

@jit(nopython=True)
def jit_f(x):
    return np.sqrt(x)

It looks like the jit decorator should make our Numba compile our function and make it much faster than the non-jit version. Let's test this out.

Note that the first function call is not included while timing because that is where Numba compiles the function. The compilation at runtime is called just in time compilation and resultant binaries are cached for the future runs.

In [3]:
data = np.random.uniform(low=0.0, high=100.0, size=1_000)
In [4]:
%%timeit
f(data)
1.99 μs ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [5]:
%%timeit -n 1 -r 1
_ = jit_f(data)  # compilation and run
462 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [6]:
%%timeit
jit_f(data)  # just run
1.59 μs ± 2.46 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Surprisingly, the JITted function was slower than plain Python and NumPy implementation! Why did this happen? Numba does not provide valuable performance gains over pure Python or NumPy code for simple operations and small dataset. The JITted function turned out to be slower than the non-JIT implementation because of the compilation overhead. Note that the result from the compilation run could be very noisy and could give a higher than real value, as mentioned in the previous lessons. Let's try increasing the size of our data and perform a non-NumPy list comprehension on the data.

The jit decorator with nopython=True is so widely used there exists an alias decorator for the same - @njit

In [7]:
data = np.random.uniform(low=0.0, high=100.0, size=1_000_000)
In [8]:
def f(x):
    return [math.sqrt(elem) for elem in x]

@njit
def jit_f(x):
    return [math.sqrt(elem) for elem in x]
In [9]:
%%timeit
f(data)
87.7 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [10]:
%%timeit -n 1 -r 1
_ = jit_f(data)  # compilation and run
114 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [11]:
%%timeit
jit_f(data)  # just run
37.1 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That was way faster than the non-JIT function! But, the result was still slower than the NumPy implementation. NumPy is still good for relatively simple computations, but as the complexity increases, Numba functions start outperforming NumPy implementations.

Let's go back to our plain Python mandelbrot code from the previous lessons and JIT compile it -

In [12]:
@njit
def mandel1(position, limit=50):
    
    value = position
    
    while abs(value) < 2:
        limit -= 1        
        value = value**2 + position
        if limit < 0:
            return 0
        
    return limit

xmin = -1.5
ymin = -1.0
xmax = 0.5
ymax = 1.0
resolution = 300
xstep = (xmax - xmin) / resolution
ystep = (ymax - ymin) / resolution
xs = [(xmin + (xmax - xmin) * i / resolution) for i in range(resolution)]
ys = [(ymin + (ymax - ymin) * i / resolution) for i in range(resolution)]
In [13]:
%%timeit -n 1 -r 1
data = [[mandel1(complex(x, y)) for x in xs] for y in ys]  # compilation and run
167 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [14]:
%%timeit
data = [[mandel1(complex(x, y)) for x in xs] for y in ys]  # just run
55.5 ms ± 430 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The compiled code already beats our fastest NumPy implementation! It is not necessary the compiled code will perform better than NumPy code, but it usually gives performance gains for signifantly large computations. As always, it is good to measure the performance to check if there are any gains.

Let's try JITting our NumPy code.

In [15]:
@njit
def mandel_numpy(position,limit=50):
    value = position
    diverged_at_count = np.zeros(position.shape)
    while limit > 0:
        limit -= 1
        value = value**2 + position
        diverging = value * np.conj(value) > 4
        first_diverged_this_time = np.logical_and(diverging, diverged_at_count == 0)
        diverged_at_count[first_diverged_this_time] = limit
        value[diverging] = 2
        
    return diverged_at_count

ymatrix, xmatrix = np.mgrid[ymin:ymax:ystep, xmin:xmax:xstep]
values = xmatrix + 1j * ymatrix
In [16]:
%%timeit
mandel_numpy(values)
---------------------------------------------------------------------------
TypingError                               Traceback (most recent call last)
Cell In[16], line 1
----> 1 get_ipython().run_cell_magic('timeit', '', 'mandel_numpy(values)\n')

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/IPython/core/interactiveshell.py:2541, in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
   2539 with self.builtin_trap:
   2540     args = (magic_arg_s, cell)
-> 2541     result = fn(*args, **kwargs)
   2543 # The code below prevents the output from being displayed
   2544 # when using magics with decorator @output_can_be_silenced
   2545 # when the last Python token in the expression is a ';'.
   2546 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/IPython/core/magics/execution.py:1195, in ExecutionMagics.timeit(self, line, cell, local_ns)
   1193 for index in range(0, 10):
   1194     number = 10 ** index
-> 1195     time_number = timer.timeit(number)
   1196     if time_number >= 0.2:
   1197         break

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/IPython/core/magics/execution.py:173, in Timer.timeit(self, number)
    171 gc.disable()
    172 try:
--> 173     timing = self.inner(it, self.timer)
    174 finally:
    175     if gcold:

File <magic-timeit>:1, in inner(_it, _timer)

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/numba/core/dispatcher.py:423, in _DispatcherBase._compile_for_args(self, *args, **kws)
    419         msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
    420                f"by the following argument(s):\n{args_str}\n")
    421         e.patch_message(msg)
--> 423     error_rewrite(e, 'typing')
    424 except errors.UnsupportedError as e:
    425     # Something unsupported is present in the user code, add help info
    426     error_rewrite(e, 'unsupported_error')

File /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/numba/core/dispatcher.py:364, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
    362     raise e
    363 else:
--> 364     raise e.with_traceback(None)

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function setitem>) found for signature:
 
 >>> setitem(array(float64, 2d, C), array(bool, 2d, C), int64)
 
There are 16 candidate implementations:
     - Of which 14 did not match due to:
     Overload of function 'setitem': File: <numerous>: Line N/A.
       With argument(s): '(array(float64, 2d, C), array(bool, 2d, C), int64)':
      No match.
     - Of which 2 did not match due to:
     Overload in function 'SetItemBuffer.generic': File: numba/core/typing/arraydecl.py: Line 221.
       With argument(s): '(array(float64, 2d, C), array(bool, 2d, C), int64)':
      Rejected as the implementation raised a specific error:
        NumbaTypeError: Multi-dimensional indices are not supported.
  raised from /opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/numba/core/typing/arraydecl.py:91

During: typing of setitem at /tmp/ipykernel_15020/2672903129.py (10)

File "../../../../../../tmp/ipykernel_15020/2672903129.py", line 10:
<source missing, REPL/exec in use?>

That does not work. The error might be solvable or it might just be out of Numba's scope. Numba does not distinguish between plain Python and NumPy; hence both the implementations are broken down to the same machine code. Therefore, for Numba, it makes sense to write complex computations with the ease of Python loops and lists than with NumPy functions. Moreover, Numba only understands a subset of Python and NumPy so it is possible that a NumPy snippet does not work but a simplified Python loop does.

Let's make minor adjustments to fix the NumPy implementation and measure its performance. We flatten the NumPy arrays and consider only the real part while performing the comparison.

In [17]:
@njit
def mandel_numpy(position,limit=50):
    value = position.flatten()
    diverged_at_count = np.zeros(position.shape).flatten()
    while limit > 0:
        limit -= 1
        value = value**2 + position.flatten()
        diverging = (value * np.conj(value)).real > 4
        first_diverged_this_time = (np.logical_and(diverging, diverged_at_count == 0))
        diverged_at_count[first_diverged_this_time] = limit
        value[diverging] = 2

    return diverged_at_count.reshape(position.shape)

ymatrix, xmatrix = np.mgrid[ymin:ymax:ystep, xmin:xmax:xstep]
values = xmatrix + 1j * ymatrix
In [18]:
%%timeit -n 1 -r 1
mandel_numpy(values)  # compilation and run
838 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
In [19]:
%%timeit
mandel_numpy(values)  # just run
16.4 ms ± 15.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The code performs similar to the plain Python example!

Numba also has functionalities to vectorize, parallelize, and strictly type check the code. All of these functions boost the performance even further or helps Numba to avoid falling back to the "object" mode (nopython=False). Refer to Numba's documentation for a complete list of features.

Numba support in Scientific Python ecosystem

Most of the scientific libraries nowadays ship Numba support with them. For example, Awkward Array is a library that lets users perfor JIT compilable operations on non-uniform data, something that NumPy misses. Similarly, Akimbo provides JIT compilable processing of nested, non-uniform data in dataframes. There are other niche libraries, such as, vector that provides Numba support for high energy physics vector.

Compiled Python bindings

Several Python libraries/frameworks are written in a compiled language but provide Python bindings for their compiled code. This code does not have to be explicitly marked to be compiled by developers. A few Python libraries that have their core written in compiled languages but provide Python bindings -

  • Cython: superset of Python that supports calling C functions and declaring C types on variables and class attributes
  • Scipy: wraps highly-optimized scientific implementations written in low-level languages like Fortran, C, and C++
  • Polars: dataframes powered by a multithreaded, vectorized query engine, written in Rust
  • boost-histogram: Python bindings for the C++14 Boost::Histogram library
  • Awkward Array: compiled and fast manipulation of JSON-like data with NumPy-like idioms
  • Astropy: astronomy and astrophysics core library

JIT or AOT compiling libraries and the libraries providing Python bindings for a compiled language are not mutually exclusive. For instance, PyTorch and Tensorflow offer users "eager" and "graph" executation. Eager execution builds computational graph at runtime, making it slow, easy to debug, and JIT compilable. On the other hand, "graph" execution builds the computational graph at the kernel level, making it fast, hard to debug, and with no need to JIT compile. Similarly, Awkward Array provides Python bindings for array creation, but high-level operations on these arrays can be JIT compiled.

We will specifically look into Cython in the next lesson.

Distributed and parallel computing with Python

Along with the speed of execution of a program, scientific computation usually requires a large amount of memory because of the massive nature of data produced at scientific experiments. This issue can be addressed by employing distributed and parallel computing frameworks to spread the computation across multiple machines or multiple cores in a single machine. Parallel computing at a smaller level can be achieved by distributing a task over multiple threads, but Python's GIL (Global Interpreter Lock) blocks the interpretor from doing this for computational tasks. There are ongoing efforts, such as the implementation of PEP 703 to make GIL optional, but Numba allows users to bypass GIL by pass nogil=True to @jit.

At a larger scale, Dask can be used to parallelize computation over multiple machine or "nodes". Dask can even parallelize the code marked with @numba.jit(nogil=True) to multiple threads in a machine, but it does not itself bypass the GIL.

Dask

Dask is a Python library for parallel and distributed computing. Dask supports arrays, dataframes, Python objects, as well as tasks on computer clusters and nodes. The API of Dask is very similar to that of NumPy, Pandas, and plain Python, allowing users to quickly parallelize their implementations. Let's create a dummy dataframe using dask.

In [20]:
import dask

df = dask.datasets.make_people()  # returns a dask "bag"
df = df.to_dataframe()
/opt/hostedtoolcache/Python/3.12.8/x64/lib/python3.12/site-packages/dask_expr/_collection.py:5063: FutureWarning: from_legacy_dataframe is deprecated and will be removed in a future release. The legacy implementation as a whole is deprecated and will be removed, making this method unnecessary.
  warnings.warn(
In [21]:
df
Out[21]:
Dask DataFrame Structure:
age name occupation telephone address credit-card
npartitions=10
int64 string string string string string
... ... ... ... ... ...
... ... ... ... ... ... ...
... ... ... ... ... ...
... ... ... ... ... ...
Dask Name: to_pyarrow_string, 1 expression

The computation gave us a dask object and not the actual answer. Why is that? Displaying the dataframe just displays the metadata of the variable, and not any data. This is because of the "lazy" nature of dask. Dask has "lazy" execution, which means that it will store the operations on the data and create a task graph for the same, but will not execute the operations until a user explicitly asks for the result. The metadata specifies npartitions=10, which means that the dataframe is split into 10 parts that will be accessed parallely. We can explicitly tell dask to give us the dataframe values using .compute().

In [22]:
df.compute()
Out[22]:
age name occupation telephone address credit-card
0 19 ('Felton', 'Duke') Garage Attendant +16621398281 {'address': '318 Gates Hill', 'city': 'Christi... {'number': '4209 3299 8928 6236', 'expiration-...
1 26 ('Annamaria', 'Humphrey') Student Teacher +19522790537 {'address': '1233 Talbert Drive', 'city': 'Nor... {'number': '3447 077653 19319', 'expiration-da...
2 31 ('Miles', 'Keith') Scrap Dealer +1-952-235-5807 {'address': '790 Dunnes Highway', 'city': 'Dor... {'number': '5129 1676 7043 6882', 'expiration-...
3 17 ('Earl', 'Diaz') Liaison Officer +1-716-016-9854 {'address': '535 Coral Junction', 'city': 'Moo... {'number': '4037 0023 3430 6201', 'expiration-...
4 99 ('Lynell', 'Carlson') Chiropractor +1-720-843-2039 {'address': '49 Roselyn Square', 'city': 'Wash... {'number': '3775 216157 64008', 'expiration-da...
... ... ... ... ... ... ...
995 111 ('Hwa', 'Carlson') Nurseryman +14145460299 {'address': '83 Ward Turnpike', 'city': 'Deser... {'number': '2417 5623 9154 3537', 'expiration-...
996 21 ('Rozanne', 'Callahan') Funeral Director +1-941-782-3880 {'address': '871 Kensington Route', 'city': 'H... {'number': '5294 7354 9919 9508', 'expiration-...
997 75 ('Levi', 'Morgan') Marine Pilot +1-870-397-6767 {'address': '664 Century Heights', 'city': 'No... {'number': '3730 696217 78442', 'expiration-da...
998 106 ('Keena', 'York') Telephonist +14326244518 {'address': '853 Mcnair Viaduct', 'city': 'Sta... {'number': '5360 7448 9066 1388', 'expiration-...
999 96 ('Erich', 'Delacruz') Tanner +1-404-476-6858 {'address': '589 Fallon Bayou', 'city': 'Natio... {'number': '3732 242464 65806', 'expiration-da...

10000 rows × 6 columns

We can now perform pandas like operation on our dataframe and let dask create a task graph for the same.

In [23]:
new_df = df.groupby("occupation").age.max()
new_df
Out[23]:
Dask Series Structure:
npartitions=1
    int64
      ...
Dask Name: max, 3 expressions
Expr=Max(frame=FromGraph(be923fc)[['age', 'occupation']], observed=False, chunk_kwargs={'numeric_only': False}, aggregate_kwargs={'numeric_only': False}, _slice='age')

Instead of computing, let's visualize the task graph.

In [24]:
dask.visualize(new_df, filename="visualization.png")
Out[24]:
No description has been provided for this image

We can see that the task graph starts with 10 independent branches because our dataframe was split into 10 partitions at the start. Let's compute the answer.

In [25]:
new_df.compute()
Out[25]:
occupation
Accounts Clerk         119
Accounts Manager        94
Accounts Staff         107
Acoustic Engineer       98
Actuary                106
                      ... 
Off Shore               89
Furnace Man            113
Office Manager          83
Trinity House Pilot     52
Bakery Assistant        75
Name: age, Length: 1156, dtype: int64

Similarly, one can peform such computations on arrays and selected Python data structures.

Dask support in Scientific Python ecosystem

Similar to the adoption of Numba in the scientific Python ecosystem, dask is being adopted at an increasing rate. Some examples include dask-awkward for ragged data computation, dask-histogram for parallel filling of histograms dask support in cuDF, for parallelizing CUDA dataframes, and dask-sql, a distributed SQL Engine.

GPU accelerated computing in Python

The last method for speeding up Python that we will look at is running your code on GPUs instead of CPUs. GPUs are specifically built to speed up computational tasks such as doing linear algebra or experimenting with computer graphics. Support for GPUs can be provided by writing custom kernels or by using Pythonic GPU libraris such as CuPy.

CuPy

CuPy is an array-based library enabling Python code to be scaled to GPUs. CuPy's API is very similar to NumPy and SciPy, and it was built as their extension to GPUs. There are a few differences in the API, and even a few missing functions, but it is actively developed and in a stable state. CuPy requires CUDA or AMD ROCm to be installed on your machine. After the installation, majority of the NumPy and SciPy scripts can be converted to CuPy by just replacing numpy and scipy with cupy.

import cupy as cp

print(cp.cuda.runtime.getDeviceCount())  # check if CuPy identifies the CUDA GPU

x = cp.arange(6).reshape(2, 3).astype('f')
y = cp.arange(6).reshape(2, 3).astype('f')

result = x + y

The code is not executed in this lesson as the service where these lessons are built do not posses a GPU.

CuPy support in Scientific Python ecosystem

Similar to Numba and Dask, CuPy is being adopted in multiple scientific libraries to provide GPU capabilities to the users. Dask along with its exosystem libraries, such as dask-image provide CuPy and GPU support. Similarly, packages like Awkward Array, cupy-xarray, and pyxem use CuPy internally to offer GPU support.

Writing custom GPU kernels and binding them to Python

Some Python libraries write their own custom kernels to provide GPU support to the users. For instance, cuDF uses libcudf, a C++/CUDA dataframe library to provide GPU support for dataframes in Python. Similarly, Tensorflow and PyTorch have their own GPU kernels and they do not depend on CuPy to run their code on GPUs. One can even write custom GPU kernels in CuPy, and libraries like Awkward Array leverage that for ragged data computation.