Just-in-time compilation and parallelisation with Numba#

Python interpreter overheads#

Python, being an interpreted language, executes each line of source code one at a time in a sequential manner. An interpreter is a program that directly executes the instructions in the source code without requiring them to be compiled into an executable first. Interpreters tend to be more flexible than compilers, but are less efficient when running programs. The efficiency penalty arises because the Python interpreter must repeat the interpreting process every time the program is run.

Due to these runtime overheads of the Python interpreter, a pure Python program is generally slow and may not be suitable for large-scale numerical computing demands. Each mathematical operation usually carries a considerable overhead from the Python interpreter, making especially time-critical operations inefficient.
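To get a feel for the size of this overhead, the following sketch (illustrative, not part of the course code) times summing the same array with a pure Python loop and with numpy's compiled sum:

import time
import numpy as np

n = 10_000_000
data = np.random.default_rng().random(n)

# Pure Python loop: every iteration pays interpreter and boxing overhead
t1 = time.perf_counter()
total = 0.0
for x in data:
    total += x
t2 = time.perf_counter()
print(f'Pure Python loop took {t2 - t1:.3f} s')

# Single numpy call: the loop runs inside compiled library code
t1 = time.perf_counter()
total = data.sum()
t2 = time.perf_counter()
print(f'numpy sum took {t2 - t1:.4f} s')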

Compilation to machine code#

One approach to avoid the slowdown due to the interpreting program is to compile (i.e. translate) the user-written code into machine-level instructions that can be directly executed by the CPU, thereby circumventing the need for an interpreting process. In the domain of scientific computing, this paradigm is usually adopted to execute programs written in languages such as C, C++ and Fortran.

The process of translating the user code to machine code typically involves several steps and is performed by a compiler toolchain (a set of related programs) installed on the machine. Since compilers can view the entire source code upfront, they can perform a number of analyses and optimizations when generating machine instructions, resulting in faster execution than interpreting each line individually.

However, compiled languages are usually less flexible, e.g. they typically lack features like dynamic typing, wherein the data type of each variable need not be known a priori. In general, it is accepted that languages that traditionally use interpreter runtimes are easier for new programmers to learn, and lend themselves to quick prototyping when evaluating scientific concepts.
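As a brief illustration of dynamic typing, the same Python variable can hold values of different types over its lifetime, something a statically typed compiled language would reject at compile time:

x = 42          # x holds an integer
x = 3.14        # now a float
x = 'hello'     # now a string; types are resolved at runtime
print(type(x))  # <class 'str'>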

Just-in-time (JIT) compilation#

Since interpreters and compilers have complementary strengths and weaknesses, one possibility is to combine aspects of both. Just-in-time (JIT) compilation is a process wherein only certain critical sub-sections of code are identified and compiled at runtime. The rest of the user code continues to be evaluated by the Python interpreter as usual.

The critical sections of code that need to be run several times (e.g. computational bottlenecks such as slow functions identified with the help of a profiler) can be marked for compilation into machine code.
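For example, Python's built-in cProfile module can reveal which function dominates the runtime; that function is then the natural candidate for JIT compilation. A minimal sketch (the function below is purely illustrative):

import cProfile

def slow_sum(n):
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

# The report will show slow_sum dominating the cumulative time,
# marking it as the hot spot worth compiling
cProfile.run('slow_sum(1_000_000)')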

Parallel Computing#

Modern computers are highly parallel systems. We already covered SIMD parallelisation, where within each CPU core there exist hardware vector units that allow the parallel execution of certain floating point operations.

CPUs themselves consist of multiple cores, and these can also be leveraged, i.e. the task to solve can be distributed across the different cores to achieve further speed-up. However, it is worth remembering that the serial part of the code (i.e. code that must run sequentially due to the fundamental nature of its operations, such as time-stepping for Ordinary and Partial Differential Equations) limits the achievable gain, so the speed-up will most likely be far below the number of CPU cores across which the program code was distributed.
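This limit is quantified by Amdahl's law: if a fraction p of the runtime parallelises perfectly over N cores, the overall speed-up is bounded by 1 / ((1 - p) + p / N). A quick illustrative calculation:

def amdahl_speedup(p, n_cores):
    # Upper bound on speed-up when a fraction p of the runtime parallelises
    return 1.0 / ((1.0 - p) + p / n_cores)

# Even with 90% of the runtime parallelised, 8 cores yield well under 8x
print(f'{amdahl_speedup(0.9, 8):.2f}x')   # ~4.71x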

We can also leverage General Purpose Graphics Processing Units (GPGPUs) as accelerator devices that contain highly parallel floating point units. If we scale to larger compute clusters then there is also a further level of parallelism between the individual hardware nodes.

The numba library#

Numba is a third-party accelerator library for Python that allows one to just-in-time compile Python functions into fast machine code that does not need to go through the Python interpreter. Moreover, it provides parallelisation facilities that cover a lot of use-cases for parallel computing on a single machine using shared-memory parallelism. In addition, Numba can directly cross-compile code for use on NVIDIA GPU accelerators.

Depending on the mathematical operations within the function being targeted for compilation, its performance can be close to that of hand-optimized C code. Numba is cross-platform and bundles the llvmlite library (built on the LLVM compiler toolchain) to provide high-quality compilation of Python functions.
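In its simplest form, usage amounts to decorating a function with @numba.njit. A minimal sketch (illustrative, not taken from the course material):

import numpy as np
import numba

@numba.njit
def dot(x, y):
    # A plain Python loop, but compiled to machine code on first call
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * y[i]
    return total

a = np.ones(1000)
print(dot(a, a))   # the first call triggers compilation; subsequent calls are fast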

Exercise: Accelerate large-scale vector addition using numba#

Although numba provides an extensive set of facilities and features to tackle various compilation aspects, easy gains can usually be made by making use of its automatic parallelisation. This is done by decorating critical functions of interest with the @njit(parallel=True) decorator.

Let us accelerate the addition of two numpy arrays and see the difference in performance afforded by parallelisation.

import numpy as np
import numba
import time
n = 1000000                       # 1 million array elements
rng = np.random.default_rng()     # Set up a random number generator
a = rng.random(n)                 # n uniformly distributed floating point numbers in [0, 1]
b = rng.standard_normal(n)        # n Gaussian distributed floating point numbers
c = np.empty(n, dtype='float64')  # Placeholder for result vector

First, we benchmark the numpy vector addition:

t1 = time.perf_counter_ns()
c = a + b
t2 = time.perf_counter_ns()
print(f'Numpy vectorised operation on a single CPU core took {(t2-t1)/1e6} ms')
Numpy vectorised operation on a single CPU core took 5.422406 ms

Next, we refactor the vector addition operation into a function and mark it for JIT compilation by decorating it suitably.

@numba.njit(parallel=True)
def numba_fun(in1, in2, out):
    out[:] = in1 + in2   # slice assignment writes into the caller's array; a bare 'out = in1 + in2' would only rebind the local name
        
# numba.set_num_threads(8)
t1 = time.perf_counter_ns()
numba_fun(a,b,c)
t2 = time.perf_counter_ns()
print(f'Running the function (including compilation time) with parallel CPU cores took {(t2-t1)/1e6} ms')
Running the function (including compilation time) with parallel CPU cores took 837.853423 ms

However, the performance seems rather poor. This is because the compilation process incurs a significant one-time overhead. Once compiled, though, evaluating the function is significantly faster. Hence, JIT compilation is most beneficial in cases where a time-consuming function is repeatedly evaluated (as in our jacobi function in the CFD code exercise). The speed-up of compiled function evaluation can be verified by calling the function a few more times.

for i in range(5):
    t1 = time.perf_counter_ns()
    numba_fun(a,b,c)
    t2 = time.perf_counter_ns()
    print(f'Evaluating the pre-compiled function with parallel CPU cores took {(t2-t1)/1e3} us')
Evaluating the pre-compiled function with parallel CPU cores took 11.323 us
Evaluating the pre-compiled function with parallel CPU cores took 3.182 us
Evaluating the pre-compiled function with parallel CPU cores took 1.61 us
Evaluating the pre-compiled function with parallel CPU cores took 1.422 us
Evaluating the pre-compiled function with parallel CPU cores took 1.397 us

As seen from the timing results, a significant speed-up was achieved by distributing the compiled function over multiple cores. The number of cores to use for parallelisation can optionally be set by calling numba.set_num_threads(n) before invoking the compiled function. Otherwise, numba uses all available CPU cores, which may not be the most efficient choice given the scaling characteristics of the problem at hand.
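For instance, to restrict the parallel region to four threads (the thread count here is arbitrary, chosen only for illustration):

numba.set_num_threads(4)          # must run before the parallel function is invoked
print(numba.get_num_threads())    # confirm the active thread count
numba_fun(a, b, c)                # now executes on at most 4 cores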

Exercise: Refactor jacobi function of CFD code to use numba JIT and autoparallelisation#

In the cfd_numba directory, we have provided a version of the CFD code that uses numba to accelerate the computations. Upon inspecting the code, we see that the numba version differs mainly in that the jacobi function is annotated to be just-in-time compiled and distributed across cores for parallel computation of results. Not all methods in the numpy library are compatible with numba, though. In particular, we notice through experimentation that the np.copyto method cannot be used here, and hence the code provided reverts to the simpler np.copy() for assigning the stream function's updated values.
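For orientation, the sketch below shows what such an annotated kernel can look like; the function name matches the exercise, but the signature and body are illustrative rather than the exact code in cfd_numba:

import numpy as np
import numba

@numba.njit(parallel=True)
def jacobi(psi, niter):
    # Repeatedly replace each interior point by the average of its four neighbours
    for _ in range(niter):
        tmp = 0.25 * (psi[:-2, 1:-1] + psi[2:, 1:-1]
                      + psi[1:-1, :-2] + psi[1:-1, 2:])
        psi[1:-1, 1:-1] = np.copy(tmp)   # np.copy is numba-compatible; np.copyto is not
    return psi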

Navigate to the cfd_numba directory, and run the program with the same problem size as before, i.e. a 128 x 128 grid (scaling factor of 4) and 10000 Jacobi iterations:

prompt:/path/to/cfd_numba>python cfd.py 4 10000

The timing output of the program should unfortunately result in a slowdown over the single-core numpy version, e.g.

2D CFD Simulation
=================
Scale factor = 4
Iterations   = 10000

Initialisation took 0.00021s

Grid size = 128 x 128

Starting main Jacobi loop...

...finished

Calculation took 5.27128s

However, one key observation is that we can now refine the grid and/or increase the number of Jacobi iterations to profit from the amortised cost of repeatedly invoking a pre-compiled function.

Exercise: Compare the numba and numpy versions of CFD code on a 512 x 512 grid (scale factor of 16) and with 10000 Jacobi iterations#

Sample timing output from numba parallel version:

2D CFD Simulation
=================
Scale factor = 16
Iterations   = 10000

Initialisation took 0.00142s

Grid size = 512 x 512

Starting main Jacobi loop...

...finished

Calculation took 23.12356s

Sample timing output from numpy serial version:

2D CFD Simulation
=================
Scale factor = 16
Iterations   = 10000

Initialisation took 0.00133s

Grid size = 512 x 512

Starting main Jacobi loop...

...finished

Calculation took 49.91595s

We get a speed-up of approximately 2. Experiment to see whether there are further improvements in speed-up when using a finer grid and/or increasing the number of Jacobi iterations.

Concluding remarks#

Just-in-time compilation is not a guaranteed panacea for accelerating scientific codes. The potential speed-up depends on a number of factors, such as the nature of the floating point operations involved and the library methods used within the function to be compiled. It is up to the user to investigate further and determine a reasonable trade-off between pace of development and runtime performance.

There exist various other approaches to accelerating scientific Python codes. These include distributing computations over many compute nodes using the MPI protocol and the mpi4py library, performing distributed task/data parallelism and pipelining using the dask library, exploring concurrency and multithreading using joblib, and offloading computationally intensive functions as kernels to dedicated accelerator hardware such as GPUs and FPGAs using libraries such as dace.