Just-in-time compilation and parallelisation with Numba#
Python interpreter overheads#
Python is an interpreted language: the interpreter executes the source code one statement at a time, in sequence. An interpreter is a program that directly executes the instructions in the source code without requiring them to be compiled into an executable first. Interpreters tend to be more flexible than compilers, but are less efficient when running programs, because the interpreting work must be repeated by the Python interpreter every time the program is run.
Due to these runtime overheads of the Python interpreter, a pure Python program is generally slow and may be unsuitable for large-scale numerical computing. Each mathematical operation typically carries considerable interpreter overhead, which makes time-critical operations especially inefficient.
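To see this overhead concretely, we can compare a pure Python loop against the equivalent single call into compiled code (a small illustrative benchmark; exact timings will vary from machine to machine):
import time
import numpy as np

n = 1_000_000
a = np.arange(n, dtype='float64')

# Pure Python loop: the interpreter dispatches every addition individually
t1 = time.perf_counter()
total = 0.0
for x in a:
    total += x
t2 = time.perf_counter()
print(f'Pure Python loop took {(t2 - t1) * 1e3:.2f} ms')

# NumPy sum: one call into pre-compiled machine code
t1 = time.perf_counter()
total = a.sum()
t2 = time.perf_counter()
print(f'NumPy sum took {(t2 - t1) * 1e3:.2f} ms')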
Compilation to machine code#
One approach to avoiding the slowdown of the interpreter is to compile (i.e. translate) the user-written code into machine-level instructions that can be directly executed by the CPU, thereby circumventing the need for interpretation. In scientific computing, this paradigm is usually adopted to execute programs written in languages such as C, C++ and Fortran.
The process of translating user code to machine code typically involves several steps and is performed by a compiler toolchain (a set of related programs) installed on the machine. Since compilers can view the entire source code upfront, they can perform a number of analyses and optimizations when generating machine instructions, resulting in execution that is faster than interpreting each line individually.
However, compiled languages are usually less flexible, e.g. they typically lack features like dynamic typing, wherein the data type of each variable need not be known a priori. It is generally accepted that languages with interpreter runtimes are easier for new programmers to learn, and make it quicker to prototype and evaluate scientific concepts.
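For instance, Python's dynamic typing lets the same variable hold values of different types over its lifetime, something a statically typed compiled language would reject at compile time:
x = 42           # x holds an int
x = 3.14         # now a float
x = "forty-two"  # now a string; the interpreter resolves types at runtime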
Just-in-time (JIT) compilation#
Since interpreters and compilers have complementary strengths and weaknesses, one possibility is to combine aspects of both. Just-in-time (JIT) compilation is a process wherein only certain critical sub-sections of the code are identified and compiled at runtime. The rest of the user code continues to be evaluated by the Python interpreter as usual.
The critical sections of code that need to be run many times (e.g. computational bottlenecks such as slow functions identified with the help of a profiler) can be marked for compilation into machine code.
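As a minimal sketch of this idea (using the numba library introduced below), a slow loop-heavy function identified by a profiler can be marked for compilation, while the rest of the program continues to run under the interpreter:
import numba

@numba.njit  # only this function is compiled to machine code, at its first call
def sum_of_squares(n):
    total = 0.0
    for i in range(n):
        total += i * i
    return total

print(sum_of_squares(10_000_000))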
Parallel Computing#
Modern computers are highly parallel systems. We have already covered SIMD parallelisation, where within each CPU core there exist hardware vector units that allow the parallel execution of certain floating point operations.
CPUs themselves consist of multiple cores, and the task to be solved can be distributed across these cores to achieve further speed-up. However, it is worth remembering that the serial part of the code (i.e. code that must run sequentially due to the fundamental nature of its operations, such as time-stepping for ordinary and partial differential equations) limits the achievable gain, so the speed-up will most likely be far below the number of CPU cores across which the program was distributed.
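This limit is quantified by Amdahl's law: if a fraction p of the run time can be parallelised, the best achievable speed-up on N cores is 1 / ((1 - p) + p / N). A quick illustrative calculation:
p = 0.9  # assume 90% of the run time is parallelisable
for n_cores in (2, 4, 8, 64):
    speedup = 1.0 / ((1.0 - p) + p / n_cores)
    print(f'{n_cores} cores: at most {speedup:.1f}x speed-up')
# Even with 64 cores, the speed-up stays below the 1/(1-p) = 10x ceiling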
We can also leverage General Purpose Graphics Processing Units (GPGPUs) as accelerator devices that contain highly parallel floating point units. If we scale to larger compute clusters then there is also a further level of parallelism between the individual hardware nodes.
The numba library#
Numba is a third-party accelerator library for Python that allows one to just-in-time compile Python functions into fast machine code that runs without the Python interpreter. Moreover, it offers parallelisation features that cover many use cases for parallel computing on a single machine using shared-memory parallelism. In addition, Numba can directly cross-compile code for use on NVIDIA GPU accelerators.
Depending on the mathematical operations within the function being compiled, its performance can be close to hand-optimized C code. Numba is cross-platform and bundles the llvmlite library (built on the LLVM compiler toolchain) to provide high-quality compilation of Python functions across platforms.
Exercise: Accelerate large-scale vector addition using numba#
Although numba provides an extensive set of facilities and features covering various compilation aspects, easy gains can usually be made by using its automatic parallelisation feature. This is done by decorating critical functions of interest with the @njit(parallel=True) decorator.
Let us accelerate the addition of two numpy arrays and see the difference in performance afforded by parallelisation.
import numpy as np
import numba
import time
n = 1000000 # 1 million array elements
rng = np.random.default_rng() # Set up a random number generator
a = rng.random(n) # n uniformly distributed floating point numbers in [0, 1]
b = rng.standard_normal(n) # n Gaussian distributed floating point numbers
c = np.empty(n, dtype='float64') # Placeholder for result vector
First, we benchmark the numpy vector addition:
t1 = time.perf_counter_ns()
c = a + b
t2 = time.perf_counter_ns()
print(f'Numpy vectorised operation on a single CPU core took {(t2-t1)/1e6} ms')
Numpy vectorised operation on a single CPU core took 5.422406 ms
Next, we refactor the vector addition operation into a function, and mark that function for JIT compilation by decorating it suitably.
@numba.njit(parallel=True)
def numba_fun(in1, in2, out):
    out[:] = in1 + in2  # write into the supplied array in place; 'out = in1 + in2' would only rebind the local name
# numba.set_num_threads(8)
t1 = time.perf_counter_ns()
numba_fun(a,b,c)
t2 = time.perf_counter_ns()
print(f'Running the function (including compilation time) with parallel CPU cores took {(t2-t1)/1e6} ms')
Running the function (including compilation time) with parallel CPU cores took 837.853423 ms
However, the performance seems rather poor. This is because the compilation process incurs significant overhead on the first call. Once compiled, though, evaluating the function is significantly faster. Hence, JIT compilation is most beneficial when a time-consuming function is repeatedly evaluated (as with our jacobi function in the CFD code exercise). The speed-up of the compiled function can be verified by calling it a few more times.
for i in range(5):
    t1 = time.perf_counter_ns()
    numba_fun(a,b,c)
    t2 = time.perf_counter_ns()
    print(f'Evaluating the pre-compiled function with parallel CPU cores took {(t2-t1)/1e3} us')
Evaluating the pre-compiled function with parallel CPU cores took 11.323 us
Evaluating the pre-compiled function with parallel CPU cores took 3.182 us
Evaluating the pre-compiled function with parallel CPU cores took 1.61 us
Evaluating the pre-compiled function with parallel CPU cores took 1.422 us
Evaluating the pre-compiled function with parallel CPU cores took 1.397 us
As seen from the timing results, a significant speed-up was achieved by distributing the compiled function across multiple cores. The number of threads to use for parallelisation can optionally be set by calling numba.set_num_threads(n) before invoking the compiled function. Otherwise, numba will use all available CPU cores, which may not be the most efficient choice given the scaling characteristics of the problem at hand.
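For example, to restrict the parallel execution to four threads (a hypothetical choice; tune the count to your machine and problem):
numba.set_num_threads(4)  # must be called before the compiled function is invoked
numba_fun(a,b,c)
print(f'Numba is now using {numba.get_num_threads()} threads')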
Exercise: Refactor the jacobi function of the CFD code to use numba JIT and autoparallelisation#
In the cfd_numba directory, we have provided a version of the CFD code that uses numba to accelerate the computations. Upon inspecting the code, we see that the numba version differs mainly in that the jacobi function is annotated to be just-in-time compiled and distributed across cores for parallel computation of results. Not all methods in the numpy library are compatible with numba, though. In particular, we noticed through experimentation that the np.copyto method cannot be used here, and hence the code provided reverts to the simpler np.copy() for assigning the stream function's updated values.
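For reference, a decorated Jacobi kernel might look roughly like the following sketch (hypothetical; the actual code in cfd_numba may differ in its details, e.g. it uses np.copy() for the update), where numba.prange marks the outer loop for parallel execution across cores:
import numba
import numpy as np

@numba.njit(parallel=True)
def jacobi(niter, psi):
    """Perform niter Jacobi iterations on the stream function psi, in place."""
    tmp = np.empty_like(psi)
    for _ in range(niter):
        # Each interior point becomes the average of its four neighbours;
        # the outer loop iterations are distributed across CPU cores
        for i in numba.prange(1, psi.shape[0] - 1):
            for j in range(1, psi.shape[1] - 1):
                tmp[i, j] = 0.25 * (psi[i - 1, j] + psi[i + 1, j]
                                    + psi[i, j - 1] + psi[i, j + 1])
        psi[1:-1, 1:-1] = tmp[1:-1, 1:-1]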
Navigate to the cfd_numba directory, and run the program with the same problem size as before, i.e. a 128 x 128 grid (scaling factor of 4) and 10000 Jacobi iterations:
prompt:/path/to/cfd_numba>python cfd.py 4 10000
The timing output of the program unfortunately shows a slowdown over the single-core numpy version, e.g.
2D CFD Simulation
=================
Scale factor = 4
Iterations = 10000
Initialisation took 0.00021s
Grid size = 128 x 128
Starting main Jacobi loop...
...finished
Calculation took 5.27128s
However, one key observation is that we can now refine the grid size and/or increase the number of Jacobi iterations to profit from the amortised cost of repeatedly invoking a pre-compiled function.
Exercise: Compare the numba and numpy versions of the CFD code on a 512 x 512 grid (scale factor of 16) with 10000 Jacobi iterations#
Sample timing output from the numba parallel version:
2D CFD Simulation
=================
Scale factor = 16
Iterations = 10000
Initialisation took 0.00142s
Grid size = 512 x 512
Starting main Jacobi loop...
...finished
Calculation took 23.12356s
Sample timing output from the numpy serial version:
2D CFD Simulation
=================
Scale factor = 16
Iterations = 10000
Initialisation took 0.00133s
Grid size = 512 x 512
Starting main Jacobi loop...
...finished
Calculation took 49.91595s
We get a speed-up of approximately 2x. Experiment to see whether there are further improvements in speed-up using a finer grid and/or by increasing the number of Jacobi iterations.
Concluding remarks#
Just-in-time compilation is not a guaranteed panacea for accelerating scientific codes. The potential speed-up depends on a number of factors, such as the nature of the floating point operations involved and the library methods used within the function to be compiled. It is up to the user to investigate further and determine a reasonable trade-off between pace of development and runtime performance.
There exist various other approaches to accelerating scientific Python codes. These include distributing computations over many compute nodes using the MPI protocol via the mpi4py library; performing distributed task/data parallelism and pipelining using the dask library; exploiting concurrency and multithreading using joblib; and offloading computationally intensive functions as kernels to dedicated accelerator hardware such as GPUs and FPGAs using libraries such as dace.