Parallel and asynchronous processing

Python has a good ecosystem of libraries for parallelising the processing of tasks, as well as asynchronous processing.

Parallelisation in Python is typically process-based, with code parallelised across multiple Python processes, each with its own interpreter. Alternatively, tools can run the tasks to be parallelised outside of the Python interpreter, for example via Python wrappers around external code that uses thread-based parallelism.

🟠 tools in the following tables should be chosen only if there are external reasons to use a specific interface or parallelisation scheme, for example the nature of the research problem, the high-performance computing resources available, or pre-existing code using a library like pandas.

Process-based (and thread-based) parallelism

| Name | Short description | 🚦 |
| --- | --- | --- |
| multiprocess | A fork of multiprocessing which uses dill instead of pickle, allowing a wider range of object types to be serialised, including nested and anonymous functions. We’ve found this easier to use than multiprocessing. | 🟢 |
| concurrent.futures | See the table below. | 🟠 |
| dask | Aims to make scaling existing code in familiar libraries (numpy, pandas, scikit-learn, …) easy. | 🟠 |
| multiprocessing | The standard library module for distributing tasks across multiple processes. | 🟠 |
| mpi4py | Support for MPI-based parallelism. | 🟠 |
| threading | The standard library module for multi-threading. Due to the global interpreter lock, currently only one thread can execute Python code at a time. | 🔴 |
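As a minimal sketch of process-based parallelism, the following uses the standard library's multiprocessing.Pool to map a function over inputs across worker processes (the recommended multiprocess fork exposes the same interface, so `from multiprocess import Pool` is a drop-in swap; the function and values here are illustrative):

```python
import math
from multiprocessing import Pool


def parallel_sqrt(values, processes=2):
    # Distribute math.sqrt over the inputs across a pool of worker processes.
    # The function passed to map must be importable by the workers, which is
    # why a module-level function from the standard library is used here.
    with Pool(processes=processes) as pool:
        return pool.map(math.sqrt, values)


if __name__ == "__main__":
    # The __main__ guard is required on platforms that spawn (rather than
    # fork) worker processes, such as Windows and macOS.
    print(parallel_sqrt([1.0, 4.0, 9.0]))
```

Note that arguments and return values are serialised to cross process boundaries, which is why multiprocess's use of dill (rather than pickle) widens the range of functions and objects you can pass.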

Compiler-based parallelism

Asynchronous processing

| Name | Short description | 🚦 |
| --- | --- | --- |
| asyncio | Python standard library for asynchronous programming with tasks run in a single-threaded event loop. Used for cooperative multitasking. | 🟠 |
| concurrent.futures | Another Python standard library for asynchronous processing. Provides a common interface for thread- and process-based concurrency as an alternative to using multiprocess(ing) or threading directly. | 🟠 |

See also