Command-Line Programs
Last updated on 2024-02-23
Overview
Questions
- How can I write Python programs that will work like Unix command-line tools?
Objectives
- Use the values of command-line arguments in a program.
- Handle flags and files separately in a command-line program.
- Read data from standard input in a program so that it can be used in a pipeline.
The Jupyter Notebook and other interactive tools are great for prototyping code and exploring data, but sooner or later we will want to use our program in a pipeline or run it in a shell script to process thousands of data files. In order to do that in an efficient way, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a dataset and prints the average GDP per country.
Switching to Shell Commands
In this lesson we are switching from typing commands in a Python
interpreter to typing commands in a shell terminal window (such as
bash). When you see a $ in front of a command, that tells you to run
that command in the shell rather than the Python interpreter.
This program does exactly what we want - it prints the average GDP per country for a given file.
OUTPUT
5937.029526
36126.4927
33692.60508
...
37506.41907
8458.276384
33203.26128
We might also want to look at the minimum of the first four lines
or the maximum GDP in several files one after another:
Our scripts should do the following:
- If no filename is given on the command line, read data from standard input.
- If one or more filenames are given, read data from them and report statistics for each file separately.
- Use the --min, --mean, or --max flag to determine what statistic to print.
To make this work, we need to know how to handle command-line arguments in a program, and understand how to handle standard input. We’ll tackle these questions in turn below.
Command-Line Arguments
We are going to create a file containing our Python code, then use the
bash shell to run it. Using the text editor of your choice, save the
following in a text file called sys_version.py:
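The listing for sys_version.py is not reproduced above. A minimal
sketch that is consistent with the output shown below, assuming the
script does nothing more than report sys.version, might look like this
(the variable name message is only illustrative):

```python
import sys

# Assumed content of sys_version.py: report which Python is running.
message = 'version is ' + sys.version
print(message)
```

The exact version string depends on the Python installation that runs
the script.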
The first line imports a library called sys, which is short for
“system”. It defines values such as sys.version, which describes which
version of Python we are running. We can run this script from the
command line like this:
OUTPUT
version is 3.11.3 (main, Apr 5 2023, 15:52:25) [GCC 12.2.1 20230201]
Create another file called argv_list.py and save the following text to
it.
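The listing for argv_list.py is also missing here. Judging from the
output shown further down, a plausible sketch is a script that prints
the list sys.argv. The helper name describe is an assumption, added
only so that the same formatting can be tried on a hand-written list:

```python
import sys


def describe(argv):
    """Return the line the script prints for a given argument list."""
    return 'sys.argv is ' + str(argv)


# When run as a script, sys.argv holds the script name and its arguments.
print(describe(sys.argv))
```

Calling describe(['argv_list.py', 'first']) returns the same kind of
line that the script prints when it is run with one argument.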
The strange name argv stands for “argument values”. Whenever Python
runs a program, it takes all of the values given on the command line
and puts them in the list sys.argv so that the program can determine
what they were. If we run this program with no arguments:
OUTPUT
sys.argv is ['argv_list.py']
the only thing in the list is the name of our script, exactly as it was
given on the command line (here argv_list.py, not a full path). The
script name is always stored in sys.argv[0]. If we run it with a few
arguments, however:
OUTPUT
sys.argv is ['argv_list.py', 'first', 'second', 'third']
then Python adds each of those arguments to that magic list.
With this in hand, let’s build a version of readings.py that always
prints the per-country mean of a single data file. The first step is to
write a function that outlines our implementation, and a placeholder
for the function that does the actual work. By convention this function
is usually called main, though we can call it whatever we want:
PYTHON
import sys
import pandas as pd


def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = pd.read_csv(filename, index_col='country')
    for row_mean in data.mean(axis='columns'):
        print(row_mean)
This function gets the name of the script from sys.argv[0], because
that’s where it’s always put, and the name of the file to process from
sys.argv[1]. Here’s a simple test:
There is no output because we have defined a function, but haven’t
actually called it. Let’s add a call to main:
PYTHON
import sys
import pandas as pd


def main():
    script = sys.argv[0]
    filename = sys.argv[1]
    data = pd.read_csv(filename, index_col='country')
    for row_mean in data.mean(axis='columns'):
        print(row_mean)


if __name__ == '__main__':
    main()
and run that:
OUTPUT
9980.595634166664
17262.6228125
Running Versus Importing
Running a Python script in bash is very similar to importing that file in Python. The biggest difference is that we don’t expect anything to happen when we import a file, whereas when running a script, we expect to see some output printed to the console.
In order for a Python script to work as expected when imported or when run as a script, we typically put the part of the script that produces output in the following if statement:
When you import a Python file, __name__ is the name of that file (e.g.,
when importing readings.py, __name__ is 'readings'). However, when
running a script in bash, __name__ is always set to '__main__' in that
script so that you can determine if the file is being imported or run
as a script.
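The if statement described above is the same guard that appears at the
bottom of the readings.py listings. A minimal sketch, with a stand-in
main function that is purely illustrative:

```python
def main():
    """Stand-in for the real work of the script (hypothetical)."""
    return 'main() was called'


# __name__ is '__main__' only when this file is run as a script;
# when the file is imported, it is the module's name instead,
# so nothing is printed on import.
if __name__ == '__main__':
    print(main())
```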
The Right Way to Do It
If our programs can take complex parameters or multiple filenames, we
shouldn’t handle sys.argv directly. Instead, we should use Python’s
argparse library, which handles common cases in a systematic way, and
also makes it easy for us to provide sensible error messages for our
users. We will not cover this module in this lesson, but you can go to
Tshepang Lekhonkhobe’s Argparse tutorial, which is part of Python’s
official documentation.
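To give a flavour of what argparse looks like, here is a rough sketch
of a parser for the flags used in this lesson. It is one possible
arrangement, not the lesson's own code: the function name build_parser,
the help texts, and the choice of --mean as a default are all
assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for an optional action flag followed by filenames."""
    parser = argparse.ArgumentParser(
        description='Print per-country statistics for GDP-like CSV files.')
    # At most one of the three flags may be given; whichever is used
    # stores its own spelling in args.action.
    group = parser.add_mutually_exclusive_group()
    for flag in ['--min', '--mean', '--max']:
        group.add_argument(flag, dest='action', action='store_const',
                           const=flag,
                           help='report the ' + flag[2:] + ' of each row')
    # Assumption: fall back to --mean when no flag is given.
    parser.set_defaults(action='--mean')
    parser.add_argument('filenames', nargs='*',
                        help='files to process (standard input if none)')
    return parser


# Parse a hand-written argument list instead of the real command line.
args = build_parser().parse_args(['--max', 'a.csv', 'b.csv'])
print(args.action, args.filenames)
```

In a real script we would call parse_args() with no arguments so that
it reads sys.argv, and argparse would print a usage message and exit if
the arguments were invalid.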
Handling Multiple Files
The next step is to teach our program how to handle multiple files. Since 60 lines of output per file is a lot to page through, we’ll start by using two smaller files:
OUTPUT
small_gdp_discworld.csv small_gdp_middle-earth.csv
OUTPUT
country,800,1000,1200,1400,1600,1800
Rivendell, 100, 100, 200, 200, 300, 300
Mordor, 20, 40, 60, 80, 100, 300
Hobbiton,10, 10, 10, 10, 10, 10
Moria, 150, 250, 100, 50, 50, 0
OUTPUT
200.0
100.0
10.0
100.0
Using small data files as input also allows us to check our results more easily: here, for example, we can see that our program is calculating the mean correctly for each line, whereas we were really taking it on faith before. This is yet another rule of programming: test the simple things first.
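One way to convince ourselves that these means are right is to
recompute them independently of pandas. The sketch below does that with
the standard csv and statistics modules on a copy of the Middle-earth
table pasted in as a string. The names SMALL_CSV and row_means are
illustrative and not part of the lesson's scripts.

```python
import csv
import io
import statistics

# Copy of the small Middle-earth table shown above.
SMALL_CSV = """country,800,1000,1200,1400,1600,1800
Rivendell, 100, 100, 200, 200, 300, 300
Mordor, 20, 40, 60, 80, 100, 300
Hobbiton,10, 10, 10, 10, 10, 10
Moria, 150, 250, 100, 50, 50, 0
"""


def row_means(text):
    """Return {country: mean of its values} for CSV text with a header row."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header line
    return {row[0]: statistics.mean(float(v) for v in row[1:])
            for row in reader}


for country, mean in row_means(SMALL_CSV).items():
    print(country, mean)
```

The four means agree with the values printed by our script: 200.0,
100.0, 10.0, and 100.0.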
We want our program to process each file separately, so we need a loop
that executes once for each filename. If we specify the files on the
command line, the filenames will be in sys.argv, but we need to be
careful: sys.argv[0] will always be the name of our script, rather than
the name of a file. We also need to handle an unknown number of
filenames, since our program could be run for any number of files.
The solution to both problems is to loop over the contents of
sys.argv[1:]. The ‘1’ tells Python to start the slice at location 1, so
the program’s name isn’t included; since we’ve left off the upper
bound, the slice runs to the end of the list, and includes all the
filenames. Here’s our changed program readings_03.py:
PYTHON
import sys
import pandas as pd


def main():
    script = sys.argv[0]
    for filename in sys.argv[1:]:
        data = pd.read_csv(filename, index_col='country')
        for row_mean in data.mean(axis='columns'):
            print(row_mean)


if __name__ == '__main__':
    main()
and here it is in action:
OUTPUT
0.0
35.0
15.0
200.0
100.0
10.0
100.0
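The slicing used in readings_03.py can be tried out on a hand-written
list that stands in for sys.argv. The list below is only an example of
what Python might build for a run with two filenames:

```python
# A hand-written stand-in for sys.argv.
argv = ['readings_03.py', 'small_gdp_discworld.csv',
        'small_gdp_middle-earth.csv']

script = argv[0]      # the script name is always first
filenames = argv[1:]  # everything after it, however many there are

print('script:', script)
print('filenames:', filenames)

# With no filenames the slice is simply empty, so the loop body never runs.
no_files = ['readings_03.py'][1:]
print('no filenames:', no_files)
```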
The Right Way to Do It
At this point, we have created three versions of our script called
readings_01.py, readings_02.py, and readings_03.py. We wouldn’t do this
in real life: instead, we would have one file called readings.py that
we committed to version control every time we got an enhancement
working. For teaching, though, we need all the successive versions side
by side.
Handling Command-Line Flags
The next step is to teach our program to pay attention to the --min,
--mean, and --max flags. These always appear before the names of the
files, so we could do this:
PYTHON
import sys
import pandas as pd


def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]

    for filename in filenames:
        data = pd.read_csv(filename, index_col='country')

        if action == '--min':
            values = data.min(axis='columns')
        elif action == '--mean':
            values = data.mean(axis='columns')
        elif action == '--max':
            values = data.max(axis='columns')

        for val in values:
            print(val)


if __name__ == '__main__':
    main()
This works:
OUTPUT
0
60
20
but there are several things wrong with it:
- main is too large to read comfortably.
- If we give only one additional argument on the command line instead of two (one for the flag and one for the filename), the program does not throw an exception but runs anyway. It treats sys.argv[1] as the action, even if it is a filename, and assumes that the file list is empty, so it prints nothing. Silent failures like this are always hard to debug.
- The program should check if the submitted action is one of the three recognized flags.
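The silent failure in the second point is easy to reproduce without
touching any data files. The sketch below copies only the
argument-splitting step into a helper (the name naive_parse and the
script name readings_04.py are hypothetical) and feeds it a command
line that has a filename but no flag:

```python
def naive_parse(argv):
    """Split argv into action and filenames with no checks at all."""
    action = argv[1]
    filenames = argv[2:]
    return action, filenames


# A run where the user forgot the flag and gave only a filename.
action, filenames = naive_parse(['readings_04.py',
                                 'small_gdp_middle-earth.csv'])
print('action:', action)
print('filenames:', filenames)
```

The filename ends up in action and the list of filenames is empty, so a
loop over filenames would do nothing and report no error.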
This version pulls the processing of each file out of the loop into a
function of its own. It also checks that action is one of the allowed
flags before doing any processing, so that the program fails fast:
PYTHON
import sys
import pandas as pd


def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], \
        'Action is not one of --min, --mean, or --max: ' + action
    for filename in filenames:
        process(filename, action)


def process(filename, action):
    data = pd.read_csv(filename, index_col='country')

    if action == '--min':
        values = data.min(axis='columns')
    elif action == '--mean':
        values = data.mean(axis='columns')
    elif action == '--max':
        values = data.max(axis='columns')

    for val in values:
        print(val)


if __name__ == '__main__':
    main()
This is four lines longer than its predecessor, but broken into more digestible chunks of 8 and 12 lines.
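One caveat about checking arguments with assert is that Python skips
assert statements when it is run with the -O option. A possible
alternative, sketched here under the assumption that we are happy to
stop with a plain message, is an explicit test followed by sys.exit.
The helper name parse_arguments and the wording of the usage message
are illustrative only.

```python
import sys

VALID_ACTIONS = ['--min', '--mean', '--max']


def parse_arguments(argv):
    """Return (action, filenames), or exit with a message if the flag is bad."""
    if len(argv) < 2:
        sys.exit('Usage: python readings.py action [filenames]')
    action = argv[1]
    if action not in VALID_ACTIONS:
        # sys.exit with a string prints it to standard error and
        # ends the program with a non-zero exit status.
        sys.exit('Action is not one of --min, --mean, or --max: ' + action)
    return action, argv[2:]


# Try it on a hand-written argument list rather than the real sys.argv.
print(parse_arguments(['readings.py', '--max', 'small_gdp_discworld.csv']))
```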
Handling Standard Input
The next thing our program has to do is read data from standard input
if no filenames are given so that we can put it in a pipeline, redirect
input to it, and so on. Let’s experiment in another script called
count_stdin.py:
PYTHON
import sys

count = 0
for line in sys.stdin:
    count += 1

print(count, 'lines in standard input')
This little program reads lines from a special “file” called sys.stdin,
which is automatically connected to the program’s standard input. We
don’t have to open it (Python and the operating system take care of
that when the program starts up), but we can do almost anything with it
that we could do to a regular file. Let’s try running it as if it were
a regular command-line program:
OUTPUT
5 lines in standard input
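Because sys.stdin behaves like an open file, the same loop works on any
file-like object. The sketch below wraps the loop in a function
(count_lines is an illustrative name) and feeds it an in-memory
io.StringIO object in place of standard input, which is a convenient
way to test such code without a pipeline:

```python
import io


def count_lines(stream):
    """Count lines in any open file-like object, such as sys.stdin."""
    count = 0
    for line in stream:
        count += 1
    return count


# An in-memory stand-in for standard input with three lines of text.
fake_stdin = io.StringIO('first\nsecond\nthird\n')
print(count_lines(fake_stdin), 'lines in standard input')
```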
A common mistake is to try to run something that reads from standard input like this:
i.e., to forget the < character that redirects the file to standard
input. In this case, there’s nothing in standard input, so the program
waits at the start of the loop for someone to type something on the
keyboard. In a shell we can halt the waiting program with Ctrl-C. In
the Notebook there is no way for us to type anything, so the program is
stuck and we have to halt it using the Interrupt option from the Kernel
menu.
We now need to rewrite the program so that it loads data from sys.stdin
if no filenames are provided. Luckily, pandas.read_csv can handle
either a filename or an open file as its first parameter, so we don’t
actually need to change process. Only main changes:
PYTHON
import sys
import pandas as pd


def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], (
        'Action is not one of --min, --mean, or --max: ' + action)
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)


def process(filename, action):
    data = pd.read_csv(filename, index_col='country')

    if action == '--min':
        values = data.min(axis='columns')
    elif action == '--mean':
        values = data.mean(axis='columns')
    elif action == '--max':
        values = data.max(axis='columns')

    for val in values:
        print(val)


if __name__ == '__main__':
    main()
Let’s try it out:
OUTPUT
0
60
20
That’s better. In fact, that’s done: the program now does everything we set out to do.
PYTHON
import sys


def main():
    assert len(sys.argv) == 4, 'Need exactly 3 arguments'

    operator = sys.argv[1]
    assert operator in ['--add', '--subtract', '--multiply', '--divide'], \
        'Operator is not one of --add, --subtract, --multiply, or --divide: bailing out'
    try:
        operand1, operand2 = float(sys.argv[2]), float(sys.argv[3])
    except ValueError:
        print('cannot convert input to a number: bailing out')
        return

    do_arithmetic(operand1, operator, operand2)


def do_arithmetic(operand1, operator, operand2):
    # Compare against the same flag spellings that main() validates;
    # otherwise no branch matches and value is never assigned.
    if operator == '--add':
        value = operand1 + operand2
    elif operator == '--subtract':
        value = operand1 - operand2
    elif operator == '--multiply':
        value = operand1 * operand2
    elif operator == '--divide':
        value = operand1 / operand2
    print(value)


main()
Finding Particular Files
Using the glob module introduced earlier, write a simple version of ls
that shows files in the current directory with a particular suffix. A
call to this script should look like this:
OUTPUT
left.py
right.py
zero.py
PYTHON
import sys
import glob


def main():
    """prints names of all files with sys.argv[1] as suffix"""
    assert len(sys.argv) >= 2, 'Argument list cannot be empty'
    suffix = sys.argv[1]  # NB: behaviour is not as you'd expect if sys.argv[1] is *
    glob_input = '*.' + suffix  # construct the input
    glob_output = sorted(glob.glob(glob_input))  # call the glob function
    for item in glob_output:  # print the output
        print(item)
    return


main()
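To try the same pattern-matching idea without depending on what happens
to be in the current directory, we can point glob at a temporary
directory that we fill ourselves. This is only a sketch: the helper
name files_with_suffix and the extra file notes.txt are made up for the
example.

```python
import glob
import os
import tempfile


def files_with_suffix(directory, suffix):
    """Return the sorted names of files in directory ending in .suffix."""
    pattern = os.path.join(directory, '*.' + suffix)
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    # Create a few empty files, one of which should not match.
    for name in ['left.py', 'right.py', 'zero.py', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()
    found = files_with_suffix(tmp, 'py')

for name in found:
    print(name)
```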
Changing Flags
Rewrite readings.py so that it uses -n, -m, and -x instead of --min,
--mean, and --max respectively. Is the code easier to read? Is the
program easier to understand?
PYTHON
# this is code/readings_07.py
import sys
import pandas as pd


def main():
    script = sys.argv[0]
    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['-n', '-m', '-x'], (
        'Action is not one of -n, -m, or -x: ' + action)
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)


def process(filename, action):
    data = pd.read_csv(filename, index_col='country')

    if action == '-n':
        values = data.min(axis='columns')
    elif action == '-m':
        values = data.mean(axis='columns')
    elif action == '-x':
        values = data.max(axis='columns')

    for val in values:
        print(val)


if __name__ == '__main__':
    main()
Adding a Help Message
Separately, modify readings.py so that if no parameters are given
(i.e., no action is specified and no filenames are given), it prints a
message explaining how it should be used.
PYTHON
# this is code/readings_08.py
import sys
import pandas as pd


def main():
    script = sys.argv[0]
    if len(sys.argv) == 1:  # no arguments, so print help message
        print("Usage: python readings_08.py action filenames\n"
              "Action:\n"
              "    Must be one of --min, --mean, or --max.\n"
              "Filenames:\n"
              "    If blank, input is taken from standard input (stdin).\n"
              "    Otherwise, each filename in the list of arguments is processed in turn.")
        return

    action = sys.argv[1]
    filenames = sys.argv[2:]
    assert action in ['--min', '--mean', '--max'], (
        'Action is not one of --min, --mean, or --max: ' + action)
    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)


def process(filename, action):
    data = pd.read_csv(filename, index_col='country')

    if action == '--min':
        values = data.min(axis='columns')
    elif action == '--mean':
        values = data.mean(axis='columns')
    elif action == '--max':
        values = data.max(axis='columns')

    for val in values:
        print(val)


if __name__ == '__main__':
    main()
Adding a Default Action
Separately, modify readings.py so that if no action is given it
displays the means of the data.
PYTHON
# this is code/readings_09.py
import sys
import pandas as pd


def main():
    script = sys.argv[0]
    # Check the length first so that running with no arguments at all
    # does not fail when looking up sys.argv[1].
    if len(sys.argv) > 1 and sys.argv[1] in ['--min', '--mean', '--max']:
        action = sys.argv[1]
        filenames = sys.argv[2:]
    else:  # no action given
        action = '--mean'  # set a default action, that being mean
        # start the filenames one place earlier in the argv list
        filenames = sys.argv[1:]

    if len(filenames) == 0:
        process(sys.stdin, action)
    else:
        for filename in filenames:
            process(filename, action)


def process(filename, action):
    data = pd.read_csv(filename, index_col='country')

    if action == '--min':
        values = data.min(axis='columns')
    elif action == '--mean':
        values = data.mean(axis='columns')
    elif action == '--max':
        values = data.max(axis='columns')

    for val in values:
        print(val)


if __name__ == '__main__':
    main()
A File-Checker
Write a program called check.py that takes the names of one or more
GDP-like CSV data files as arguments and checks that all the files have
the same number of rows and columns. What is the best way to test your
program?
PYTHON
import sys
import pandas as pd


def main():
    script = sys.argv[0]
    filenames = sys.argv[1:]
    if len(filenames) <= 1:  # nothing to check
        print('Fewer than 2 files specified on input')
    else:
        nrow0, ncol0 = row_col_count(filenames[0])
        print('First file %s: %d rows and %d columns' % (
            filenames[0], nrow0, ncol0))
        for filename in filenames[1:]:
            nrow, ncol = row_col_count(filename)
            if nrow != nrow0 or ncol != ncol0:
                print('File %s does not check: %d rows and %d columns'
                      % (filename, nrow, ncol))
            else:
                print('File %s checks' % filename)
        return


def row_col_count(filename):
    try:
        nrow, ncol = pd.read_csv(filename, index_col='country').shape
    except ValueError:
        # pandas raises ValueError (or a subclass of it) when it cannot
        # parse the file, for example if rows have inconsistent lengths
        # or there is no 'country' column
        nrow, ncol = (0, 0)
    return nrow, ncol


if __name__ == '__main__':
    main()
Counting Lines
Write a program called line_count.py that works like the Unix wc
command:
- If no filenames are given, it reports the number of lines in standard input.
- If one or more filenames are given, it reports the number of lines in each, followed by the total number of lines.
PYTHON
import sys


def main():
    """print each input filename and the number of lines in it,
       and print the sum of the number of lines"""
    filenames = sys.argv[1:]
    sum_nlines = 0  # initialize counting variable

    if len(filenames) == 0:  # no filenames, just stdin
        sum_nlines = count_file_like(sys.stdin)
        print('stdin: %d' % sum_nlines)
    else:
        for filename in filenames:
            nlines = count_file(filename)
            print('%s %d' % (filename, nlines))
            sum_nlines += nlines
        print('total: %d' % sum_nlines)


def count_file(filename):
    """count the number of lines in a file"""
    # the with statement closes the file for us, even if reading fails
    with open(filename, 'r') as f:
        nlines = len(f.readlines())
    return nlines


def count_file_like(file_like):
    """count the number of lines in a file-like object (eg stdin)"""
    n = 0
    for line in file_like:
        n = n + 1
    return n


main()
Generate an Error Message
Write a program called check_arguments.py that prints usage then exits
the program if no arguments are provided. (Hint: You can use sys.exit()
to exit the program.)
OUTPUT
usage: python check_arguments.py filename.txt
OUTPUT
Thanks for specifying arguments!
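No solution listing accompanies this exercise here. One possible
sketch, assuming the two messages shown in the outputs above, puts the
check into a function so that it can be tried on hand-written argument
lists. The function name check is an assumption, and a real script
would pass it sys.argv.

```python
import sys


def check(argv):
    """Exit with a usage message if argv holds only the script name."""
    if len(argv) < 2:
        # sys.exit with a string prints it to standard error and
        # ends the program with a non-zero exit status.
        sys.exit('usage: python check_arguments.py filename.txt')
    return 'Thanks for specifying arguments!'


# Try it on a hand-written argument list; a script would use sys.argv.
print(check(['check_arguments.py', 'filename.txt']))
```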
Key Points
- The sys library connects a Python program to the system it is running on.
- The list sys.argv contains the command-line arguments that a program was run with.
- Avoid silent failures.
- The pseudo-file sys.stdin connects to a program’s standard input.