This lesson is in the early stages of development (Alpha version)

Running a parallel job

Overview

Teaching: 30 min
Exercises: 60 min
Questions
  • How do we execute a task in parallel?

  • What benefits arise from parallel execution?

  • What are the limits of gains from execution in parallel?

Objectives
  • Install a Python package using pip

  • Prepare a job submission script for the parallel executable.

  • Launch jobs with parallel execution.

  • Record and summarize the timing and accuracy of jobs.

  • Describe the relationship between job parallelism and performance.

We now have the tools we need to run a multi-processor job. This is a very important aspect of HPC systems, as parallelism is one of the primary tools we have to improve the performance of computational tasks.

If you disconnected, log back in to the cluster.

[you@laptop:~]$ ssh yourUsername@myriad.rc.ucl.ac.uk

Install the Amdahl Program

With the Amdahl source code on the cluster, we can install it, which will provide access to the amdahl executable. Move into the extracted directory, then use the Package Installer for Python, or pip, to install it in your (“user”) home directory:

[yourUsername@login12 ~]$  cd amdahl
[yourUsername@login12 ~]$  python3 -m pip install --user .

Amdahl is Python Code

The Amdahl program is written in Python, and installing or using it requires locating the python3 executable on the login node. If it can’t be found, try listing available modules using module avail, load the appropriate one, and try the command again.

MPI for Python

The Amdahl code has one dependency: mpi4py. If it hasn’t already been installed on the cluster, pip will attempt to collect mpi4py from the Internet and install it for you. If this fails due to a one-way firewall, you must retrieve mpi4py on your local machine and upload it, just as we did for Amdahl.

Retrieve and Upload mpi4py

If installing Amdahl failed because mpi4py could not be installed, retrieve the tarball from https://github.com/mpi4py/mpi4py/tarball/master then rsync it to the cluster, extract, and install:

[you@laptop:~]$ wget -O mpi4py.tar.gz https://github.com/mpi4py/mpi4py/releases/download/3.1.4/mpi4py-3.1.4.tar.gz
[you@laptop:~]$ scp mpi4py.tar.gz yourUsername@myriad.rc.ucl.ac.uk:
# or
[you@laptop:~]$ rsync -avP mpi4py.tar.gz yourUsername@myriad.rc.ucl.ac.uk:
[you@laptop:~]$ ssh yourUsername@myriad.rc.ucl.ac.uk
[yourUsername@login12 ~]$  tar -xvzf mpi4py.tar.gz  # extract the archive
[yourUsername@login12 ~]$  mv mpi4py* mpi4py        # rename the directory
[yourUsername@login12 ~]$  cd mpi4py
[yourUsername@login12 ~]$  python3 -m pip install --user .
[yourUsername@login12 ~]$  cd ../amdahl
[yourUsername@login12 ~]$  python3 -m pip install --user .

If pip Raises a Warning…

pip may warn that your user package binaries are not in your PATH.

WARNING: The script amdahl is installed in "${HOME}/.local/bin" which is
not on PATH. Consider adding this directory to PATH or, if you prefer to
suppress this warning, use --no-warn-script-location.

To check whether this warning is a problem, use which to search for the amdahl program:

[yourUsername@login12 ~]$  which amdahl

If the command returns no output, displaying a new prompt, it means the file amdahl has not been found. You must update the environment variable named PATH to include the missing folder. Edit your shell configuration file as follows, then log off the cluster and back on again so it takes effect.

[yourUsername@login12 ~]$  nano ~/.bashrc
[yourUsername@login12 ~]$  tail ~/.bashrc
export PATH=${PATH}:${HOME}/.local/bin

After logging back in to myriad.rc.ucl.ac.uk, which should be able to find amdahl without difficulties. If you had to load a Python module, load it again.

Help!

Many command-line programs include a “help” message. Try it with amdahl:

[yourUsername@login12 ~]$  amdahl --help
usage: amdahl [-h] [-p [PARALLEL_PROPORTION]] [-w [WORK_SECONDS]] [-t] [-e] [-j [JITTER_PROPORTION]]

optional arguments:
  -h, --help            show this help message and exit
  -p [PARALLEL_PROPORTION], --parallel-proportion [PARALLEL_PROPORTION]
                        Parallel proportion: a float between 0 and 1
  -w [WORK_SECONDS], --work-seconds [WORK_SECONDS]
                        Total seconds of workload: an integer greater than 0
  -t, --terse           Format output as a machine-readable object for easier analysis
  -e, --exact           Exactly match requested timing by disabling random jitter
  -j [JITTER_PROPORTION], --jitter-proportion [JITTER_PROPORTION]
                        Random jitter: a float between -1 and +1

This message doesn’t tell us much about what the program does, but it does tell us the important flags we might want to use when launching it.

Running the Job on a Compute Node

Create a submission file, requesting one task on a single node, then launch it.

[yourUsername@login12 ~]$  nano serial-job.sh
[yourUsername@login12 ~]$  cat serial-job.sh
#!/bin/bash -l
#$ -N solo-job
#$ -l mem=3G
#$ -l h_rt=00:30:00
#$ -pe mpi 1
#$ -cwd

# Load the computing environment we need
module load python3
module unload compilers mpi
module load mpi4py

# Execute the task
amdahl
[yourUsername@login12 ~]$  qsub serial-job.sh

As before, use the SGE status commands to check whether your job is running and when it ends:

[yourUsername@login12 ~]$  qstat -u yourUsername

Use ls to locate the output file. The -t flag sorts in reverse-chronological order: newest first. What was the output?

Read the Job Output

The cluster output should be written to a file in the folder you launched the job from. For example,

[yourUsername@login12 ~]$  ls -t
slurm-347087.out  serial-job.sh  amdahl  README.md  LICENSE.txt
[yourUsername@login12 ~]$  cat slurm-347087.out
Doing 30.000 seconds of 'work' on 1 processor,
which should take 30.000 seconds with 0.850 parallel proportion of the workload.

  Hello, World! I am process 0 of 1 on node-d00a-001. I will do all the serial 'work' for 4.500 seconds.
  Hello, World! I am process 0 of 1 on node-d00a-001. I will do parallel 'work' for 25.500 seconds.

Total execution time (according to rank 0): 30.033 seconds

As we saw before, two of the amdahl program flags set the amount of work and the proportion of that work that is parallel in nature. Based on the output, we can see that the code uses a default of 30 seconds of work that is 85% parallel. The program ran for just over 30 seconds in total, and if we run the numbers, it is true that 15% of it was marked ‘serial’ and 85% was ‘parallel’.

Since we only gave the job one CPU, this job wasn’t really parallel: the same processor performed the ‘serial’ work for 4.5 seconds, then the ‘parallel’ part for 25.5 seconds, and no time was saved. The cluster can do better, if we ask.

Running the Parallel Job

The amdahl program uses the Message Passing Interface (MPI) for parallelism – this is a common tool on HPC systems.

What is MPI?

The Message Passing Interface is a set of tools which allow multiple tasks running simultaneously to communicate with each other. Typically, a single executable is run multiple times, possibly on different machines, and the MPI tools are used to inform each instance of the executable about its sibling processes, and which instance it is. MPI also provides tools to allow communication between instances to coordinate work, exchange information about elements of the task, or to transfer data. An MPI instance typically has its own copy of all the local variables.

While MPI-aware executables can generally be run as stand-alone programs, in order for them to run in parallel they must use an MPI run-time environment, which is a specific implementation of the MPI standard. To activate the MPI environment, the program should be started via a command such as mpiexec (or mpirun, or srun, etc. depending on the MPI run-time you need to use), which will ensure that the appropriate run-time support for parallelism is included.

MPI Runtime Arguments

On their own, commands such as mpiexec can take many arguments specifying how many machines will participate in the execution, and you might need these if you would like to run an MPI program on your own (for example, on your laptop). In the context of a queuing system, however, it is frequently the case that MPI run-time will obtain the necessary parameters from the queuing system, by examining the environment variables set when the job is launched.

Let’s modify the job script to request more cores and use the MPI run-time.

[yourUsername@login12 ~]$  cp serial-job.sh parallel-job.sh
[yourUsername@login12 ~]$  nano parallel-job.sh
[yourUsername@login12 ~]$  cat parallel-job.sh
#!/bin/bash -l
#$ -N parallel-job
#$ -l mem=3G
#$ -l h_rt=00:30:00
#$ -pe mpi 4
#$ -cwd

# Load the computing environment we need
module load python3
module unload compilers mpi
module load mpi4py

# Execute the task
gerun amdahl

Then submit your job. Note that the submission command has not really changed from how we submitted the serial job: all the parallel settings are in the batch file rather than the command line.

[yourUsername@login12 ~]$  qsub parallel-job.sh

As before, use the status commands to check when your job runs.

[yourUsername@login12 ~]$  ls -t
slurm-347178.out  parallel-job.sh  slurm-347087.out  serial-job.sh  amdahl  README.md  LICENSE.txt
[yourUsername@login12 ~]$  cat slurm-347178.out
Doing 30.000 seconds of 'work' on 4 processors,
which should take 10.875 seconds with 0.850 parallel proportion of the workload.

  Hello, World! I am process 0 of 4 on node-d00a-001. I will do all the serial 'work' for 4.500 seconds.
  Hello, World! I am process 2 of 4 on node-d00a-001. I will do parallel 'work' for 6.375 seconds.
  Hello, World! I am process 1 of 4 on node-d00a-001. I will do parallel 'work' for 6.375 seconds.
  Hello, World! I am process 3 of 4 on node-d00a-001. I will do parallel 'work' for 6.375 seconds.
  Hello, World! I am process 0 of 4 on node-d00a-001. I will do parallel 'work' for 6.375 seconds.

Total execution time (according to rank 0): 10.888 seconds

Is it 4× faster?

The parallel job received 4× more processors than the serial job: does that mean it finished in ¼ the time?

Solution

The parallel job did take less time: 11 seconds is better than 30! But it is only a 2.7× improvement, not 4×.

Look at the job output:

  • While “process 0” did serial work, processes 1 through 3 did their parallel work.
  • While process 0 caught up on its parallel work, the rest did nothing at all.

Process 0 always has to finish its serial task before it can start on the parallel work. This sets a lower limit on the amount of time this job will take, no matter how many cores you throw at it.

This is the basic principle behind Amdahl’s Law, which is one way of predicting improvements in execution time for a fixed workload that can be subdivided and run in parallel to some extent.

How Much Does Parallel Execution Improve Performance?

In theory, dividing up a perfectly parallel calculation among n MPI processes should produce a decrease in total run time by a factor of n. As we have just seen, real programs need some time for the MPI processes to communicate and coordinate, and some types of calculations can’t be subdivided: they only run effectively on a single CPU.

Additionally, if the MPI processes operate on different physical CPUs in the computer, or across multiple compute nodes, even more time is required for communication than it takes when all processes operate on a single CPU.

In practice, it’s common to evaluate the parallelism of an MPI program by

Since “more is better” – improvement is easier to interpret from increases in some quantity than decreases – comparisons are made using the speedup factor S, which is calculated as the single-CPU execution time divided by the multi-CPU execution time. For a perfectly parallel program, a plot of the speedup S versus the number of CPUs n would give a straight line, S = n.

Let’s run one more job, so we can see how close to a straight line our amdahl code gets.

[yourUsername@login12 ~]$  nano parallel-job.sh
[yourUsername@login12 ~]$  cat parallel-job.sh
#!/bin/bash -l
#$ -N parallel-job
#$ -l mem=3G
#$ -l h_rt=00:30:00
#$ -pe mpi 8
#$ -cwd

# Load the computing environment we need
module load python3
module unload compilers mpi
module load mpi4py

# Execute the task
gerun amdahl

Then submit your job. Note that the submission command has not really changed from how we submitted the serial job: all the parallel settings are in the batch file rather than the command line.

[yourUsername@login12 ~]$  qsub parallel-job.sh

As before, use the status commands to check when your job runs.

[yourUsername@login12 ~]$  ls -t
slurm-347271.out  parallel-job.sh  slurm-347178.out  slurm-347087.out  serial-job.sh  amdahl  README.md  LICENSE.txt
[yourUsername@login12 ~]$  cat slurm-347178.out
which should take 7.688 seconds with 0.850 parallel proportion of the workload.

  Hello, World! I am process 4 of 8 on node-d00a-001. I will do parallel 'work' for 3.188 seconds.
  Hello, World! I am process 0 of 8 on node-d00a-001. I will do all the serial 'work' for 4.500 seconds.
  Hello, World! I am process 2 of 8 on node-d00a-001. I will do parallel 'work' for 3.188 seconds.
  Hello, World! I am process 1 of 8 on node-d00a-001. I will do parallel 'work' for 3.188 seconds.
  Hello, World! I am process 3 of 8 on node-d00a-001. I will do parallel 'work' for 3.188 seconds.
  Hello, World! I am process 5 of 8 on node-d00a-001. I will do parallel 'work' for 3.188 seconds.
  Hello, World! I am process 6 of 8 on node-d00a-001. I will do parallel 'work' for 3.188 seconds.
  Hello, World! I am process 7 of 8 on node-d00a-001. I will do parallel 'work' for 3.188 seconds.
  Hello, World! I am process 0 of 8 on node-d00a-001. I will do parallel 'work' for 3.188 seconds.

Total execution time (according to rank 0): 7.697 seconds

Non-Linear Output

When we ran the job with 4 parallel workers, the serial job wrote its output first, then the parallel processes wrote their output, with process 0 coming in first and last.

With 8 workers, this is not the case: since the parallel workers take less time than the serial work, it is hard to say which process will write its output first, except that it will not be process 0!

Now, let’s summarize the amount of time it took each job to run:

Number of CPUs Runtime (sec)
1 30.033
4 10.888
8 7.697

Then, use the first row to compute speedups S, using Python as a command-line calculator:

[yourUsername@login12 ~]$  for n in 30.033 10.888 7.697; do python3 -c "print(30.033 / $n)"; done
Number of CPUs Speedup Ideal
1 1.0 1
4 2.75 4
8 3.90 8

The job output files have been telling us that this program is performing 85% of its work in parallel, leaving 15% to run in serial. This seems reasonably high, but our quick study of speedup shows that in order to get a 4× speedup, we have to use 8 or 9 processors in parallel. In real programs, the speedup factor is influenced by

Using Amdahl’s Law, you can prove that with this program, it is impossible to reach 8× speedup, no matter how many processors you have on hand. Details of that analysis, with results to back it up, are left for the next class in the HPC Carpentry workshop, HPC Workflows.

In an HPC environment, we try to reduce the execution time for all types of jobs, and MPI is an extremely common way to combine dozens, hundreds, or thousands of CPUs into solving a single problem. To learn more about parallelization, see the parallel novice lesson lesson.

Key Points

  • Parallel programming allows applications to take advantage of parallel hardware.

  • The queuing system facilitates executing parallel tasks.

  • Performance improvements from parallel execution do not scale linearly.