Software Carpentry at UCL: Bash, Git, and Python Workshop

:tada: Welcome!

:calendar: 19th - 20th Feb, 2025

Exercises

Lesson Material


Staff

DAY 1 (Bash, Git):

Instructors: Dimitra Salmanidou, William Graham
Helpers: Stephen Thompson, Ankur Sinha

DAY 2 (Python):

Instructors: Devaraj Gopinathan, Arindam Saha
Helpers: William Graham

Notes


Bash (Terminal)

Some SHELL commands:
ls : list contents of directory
pwd : print working directory
cd : change directory
mkdir : make directory
nano : run text editor Nano to create or edit a file
mv : move file
cp : copy file
rm : remove/delete file. BE CAREFUL! Deleting is permanent; there's no recycle bin! A good habit is to use rm -i, which asks you for confirmation before deleting each item
rmdir: remove a directory (will refuse if it is not empty)
wc : word count; counts the lines, words and characters in a file
cat : concatenate/print file contents (what does tac do?)
sort : sort text and binary files by lines
head: print lines from the start of a file
tail: print lines from the end of a file
history: print all the commands you have used previously (including wrong commands!)

General usage of shell commands

command flag/option optional-argument-for-flag another-flag/option optional-argument-for-flag2 ...

Example:

ls -F adirectory

Here, ls is the command, -F is the first flag, and adirectory is the argument to the ls command.

head -n 10 afile

Here, head is the command, -n is the option, 10 is the argument passed to the option -n, and finally afile is the argument passed to the head command.

Note: not all commands/flags take arguments

Wildcards and matching

Redirecting Outputs

The > character can be used to redirect the output of a command to somewhere else, normally a text file.

For example,

ls -a .

normally lists all the files in your current directory, and displays them in the terminal.

ls -a . > list_of_files.txt

will instead not print this list to your terminal. Instead, it will create a file called list_of_files.txt (in your current working directory) that contains the result. You can view the contents of the file using

cat list_of_files.txt

Note that > will overwrite a file that already exists! You can use >> to instead append the output to the file.

You can also "pipe" outputs to other commands, using the | character (next to "z" on a UK keyboard). This will take the output of the first command and pass it directly into the second command! The following commands will do the same thing, for example:

(Note, we ran these commands in the exercise-data/alkanes/ directory)

  1. wc -l *.txt > word_counts.txt, followed by sort -n word_counts.txt,
  2. wc -l *.txt | sort -n

The second way, using a pipe, avoids creating a temporary file (word_counts.txt) that we don't actually need afterwards.

Version Control with Git

Everything we do in this session is done inside the shell-lesson-data folder. The first thing we'll do is create a new directory, research-paper, inside that folder, next to north-pacific-gyre and exercise-data.

We can check the directory is empty with ls -aF research-paper

We will then cd research-paper to move into our new folder, and work on git in there.

Start by telling git who you are:

git config --global user.name "yourname"

If you have a github account you should use that for your name.

git config --global user.email "youremail"

You can choose what editor git will use with;

git config --global core.editor "nano -w"

And how git handles carriage return and line feeds. For mac and linux;

git config --global core.autocrlf input

For Windows;

git config --global core.autocrlf true

And what your default branch is;

git config --global init.defaultBranch main

You should still be in your research-paper directory. You can check with pwd and ls if you wish. We can now start using git. Step 1: initialise your directory.

git init

Now, using 'ls -a', you should see a new item, .git. This is a hidden folder that git uses for version control of the directory contents. Don't remove it or edit its contents.

Now try:

git status

Let's create a file to add to our repository.

nano abstract.md

Enter some text of your choice, save the file and exit nano. Now try git status again. You should see that abstract.md is listed under "Untracked files". You can add it to git's version control with;

'git add abstract.md'

then

git commit

This should open a text editor (nano if that's what you set earlier). In nano you can add a descriptive message, maybe "my first commit". Save and exit nano.

Re-run 'git status'

Now let's edit abstract.md and use git to keep track of our changes.

Use nano to edit abstract.md. Then:
git status
If you like, you can see the changes you made with:
git diff
Then git add abstract.md, then git commit. Add a commit message, save and exit nano.

Now let's create a subfolder for our analysis.

mkdir analysis

Running git status should show "nothing to commit, working tree clean". This is because git doesn't track directories on their own, only the files inside them. So create a file in analysis (maybe a Python script).

nano analysis/analysis.py

Enter some text of your choice, save and exit.

Re-run git status

Add some more files:
nano introduction.md etc.

You can selectively add files for each commit. So let's just add the analysis file.

git add analysis/analysis.py

git commit

Enter some text like "added analysis script"

This enables us to have a meaningful commit history.

We can use git commit -m "a commit message" as a shortcut to commit with the provided commit message. git will not open nano (your editor) for you to write a commit message if you use -m.

We can do git add introduction.md, git commit and "added introduction".

git log shows the complete "commit history" of your project.

The commit information includes:

git log --oneline shows a shorter summary:

To see the history of a particular file/path:

git log -- <path to file>

To see how a file differs between a particular commit and your current working copy, use:

git diff <hash> <file>

In a diff:

HEAD is a pointer to where git thinks it currently is. This usually points to the latest commit.

So, after making a change to some file, file.md, if we run this before running git add:

git diff HEAD file.md

we're asking git what has changed from HEAD to now.

Git works line-by-line.
So, even if you have added a few more words to the same line, git will say that a line was removed and a new one was added.

Once a file is staged, git diff will not show it in the diff, because git now considers it part of the "present" version that is ready to be committed.

To see the diff for a file that has been added with git add, we can use:

git diff --staged <file>

The HEAD pointer can be used to easily go back in time:

So, this command will show what has changed since the commit before HEAD (that is, the commit before the latest one):

git diff HEAD~1

To restore files to the "present" (HEAD), we can use:

git restore <files>

To restore a file to a particular commit, we can specify the commit too:

git restore --source=<commit> -- <files>

Note that restore does not automatically run git add.
git still considers it a modification and you will need to manually add it.

To remove a file from git (this deletes the file from your folder and tells git to stop tracking it from the next commit):

git rm <file>

To "undo" a commit, we can use:

git revert <hash of the commit to undo>

This will open a commit editor for you to edit the commit message, which will include information about the commit being reverted.

Sometimes, we want to keep data files but not track them in Git. So, we can tell git to "ignore" them by adding them to a .gitignore file, which is usually placed in the same folder as the .git folder. Note that you must git add .gitignore and then git commit it too.

Each line in the .gitignore file:

To ignore all csv files, including those in subdirectories, we can add *.csv to .gitignore (or **/*.csv to be explicit)

A set of gitignore files for different projects can be found on GitHub here: https://github.com/github/gitignore

Issues/bits to check later

Python

For Python today, we will be making use of Jupyter notebooks.
These let you write either code (Python code to be run) or markdown (text) "cells" so you can annotate your work as you go along.

For those who installed Anaconda as per the setup instructions, to load up Jupyter:

  1. Launch Anaconda Navigator from your Start (Windows) / Applications (Mac) menu
  2. When Anaconda loads, you should see a series of panels. On the one that says "Jupyter Lab", click launch.
  3. This should open a new tab in your web browser, which is Jupyter and where we'll be working for today!

If you can't get Jupyter / Anaconda working

Notes

Python "comments" start with a hashtag (#) - this makes Python ignore the rest of the line, and lets you write notes to yourself to remind you what your code is doing.

The equals (=) operator assigns a value to a variable.
weight_kg = 60 creates a variable called weight_kg, and stores the value 60 in it.

print is an in-built Python function, that displays the value currently stored inside a variable. print(weight_kg) - display the value of the variable weight_kg.

type is another in-built Python function, that displays what type of value is stored in a variable. It might be an int (whole number, INTeger), float (a decimal number, or FLOATing point number), or str (STRing of characters), or one of many other types!
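
As a quick recap of the three notes above, here is a minimal sketch you can paste into a cell. The names weight_lb and patient_id, and the conversion factor of roughly 2.2, are only there for illustration and are assumptions rather than part of the notes.

```python
# Assign values to variables with =
weight_kg = 60
weight_lb = 2.2 * weight_kg  # illustrative conversion, roughly 2.2 lb per kg
patient_id = "001"           # hypothetical identifier, stored as a string

# print displays the value stored in a variable
print(weight_kg)

# type tells us what kind of value each variable holds
print(type(weight_kg))   # an int
print(type(weight_lb))   # a float
print(type(patient_id))  # a str
```

Note that multiplying an int by a float gives a float, which is why weight_lb has a different type to weight_kg.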

Plotting Code

This is a recap of the code that Devaraj has written so far to create the plots.

import numpy
import matplotlib.pyplot

# Remember, if your data files are in the same folder as your notebooks, you need to use
# fname="inflammation-01.csv" instead of fname="data/inflammation-01.csv"
data = numpy.loadtxt(fname="data/inflammation-01.csv", delimiter=",")

# To create a heatmap
matplotlib.pyplot.imshow(data)
# Then to actually display it to the screen
matplotlib.pyplot.show()

# To create a line plot of the daily average inflammation
ave_inflammation = numpy.mean(data, axis=0)
ave_plot = matplotlib.pyplot.plot(ave_inflammation)
# Display our new figure
matplotlib.pyplot.show()

# We didn't actually need to create an intermediary variable (ave_inflammation)!
# So if we create a plot for the max, we can just pass in the values from the numpy.amax calculation directly
max_plot = matplotlib.pyplot.plot(numpy.amax(data, axis=0))
# Display figure
matplotlib.pyplot.show()

# Similarly we can do this for a plot of the minimum
min_plot = matplotlib.pyplot.plot(numpy.amin(data, axis=0))
# Display figure
matplotlib.pyplot.show()

Grouped Plots

This is the code that we used when we started grouping plots.

import numpy
import matplotlib.pyplot

# Remember, if your data files are in the same folder as your notebooks, you need to use
# fname="inflammation-01.csv" instead of fname="data/inflammation-01.csv"
data = numpy.loadtxt(fname="data/inflammation-01.csv", delimiter=",")

# Create a blank canvas, that we're storing in a variable called 'fig'
fig = matplotlib.pyplot.figure(figsize=(10., 3.))
# The figure is empty, so displaying it will show nothing
matplotlib.pyplot.show(fig)

# We need to add something to our figure!
# Let's start by actually adding some axes on which to plot.
# 1 row, 3 columns of subfigures. The last '1' means that axes1 will correspond to the first subfigure
axes1 = fig.add_subplot(1, 3, 1)
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)
# Notice that we are doing "fig.add_subplot" here - this means that we are using a function
# inside the 'fig' variable, which exists to add more pieces to our figure!

# Add some labels so our figure is readable!
# Again, we are using axes1.set_ylabel here because we are adding something to our axes.
axes1.set_ylabel("Average")
axes2.set_ylabel("Max")
axes3.set_ylabel("Min")

# We can see what we've got so far...
matplotlib.pyplot.show(fig)
# .. which is now 3 empty subplots with labels!

# So let's start plotting on our axes
axes1.plot(numpy.mean(data, axis=0))
axes2.plot(numpy.amax(data, axis=0))
axes3.plot(numpy.amin(data, axis=0))
matplotlib.pyplot.show(fig)

# Looks good, but our axis labels are overlapping with the other plots.
# We can fix this by forcing a tighter (stricter) layout
fig.tight_layout()
matplotlib.pyplot.show(fig)

# If you want to save your figure, we can do this too!
fig.savefig("inflammation.png")

# Other options we set were our axis limits.
# We know there are exactly 40 days, so we might as well compress our axes limits, for example!
axes1.set_xlim(0, 40)

Lists and Loops

We can define a list by using square brackets, and separating the items (elements) of the list with commas:

list_of_numbers = [1, 3, 5, 7]
list_of_names = ["Will", "Arindam", "Devaraj"]
list_with_a_mixture_of_things = [1, "seven", 3.141592]
empty_list = []

IMPORTANT Do not call your list list! list is a special function that Python uses to create lists, so doing something like

list = [1, 3, 5, 7]

will mean that you can no longer use list to create lists! If you did accidentally do this, you can "restart" your notebook using the restart button in the toolbar (next to the "run cell" button), and then re-run the code cells (obviously avoiding this one!).
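
To see what goes wrong without breaking your notebook, here is a small sketch that keeps the mistake contained inside a function. We haven't covered writing functions yet, and shadow_list_demo is a made-up name for this example; the only point is that the reassigned name list stops existing once the function finishes.

```python
def shadow_list_demo():
    # Inside this function, the name 'list' now refers to our data,
    # hiding Python's built-in list function.
    list = [1, 3, 5, 7]
    try:
        letters = list("abc")  # would normally give ['a', 'b', 'c']
    except TypeError as error:
        # Calling our data as if it were a function fails
        return str(error)
    return letters

message = shadow_list_demo()
print(message)

# Outside the function, the built-in list is untouched
print(list("abc"))
```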

Lists can be accessed in similar ways to arrays and data; we can access by index in the list

print(list_of_numbers[0])    # Remember, indexes start at 0!
print(list_of_numbers[-1])   # Count backwards from the end
print(list_of_numbers[0:3])  # Slicing also works on lists

# But you can "empty slice" lists too, if you start at an index that is later than your final index:
print(list_of_numbers[3:0])  # Will return an empty list

# We can also change the "step size" of our slices
long_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Start at index 1, go up to (but not including)
# index 8, and take every 2nd element!
print(long_list[1:8:2])

We can also add extra elements to lists using the append method.

print(list_of_numbers)
list_of_numbers.append(9)
print(list_of_numbers)

And we can check how many elements our list has with the len function:

print(len(list_of_numbers))

Lists are also mutable, meaning we can change their values in place:

print(list_of_names)
list_of_names[0] = "Graham"
print(list_of_names)

Note how this is different to strings, which up until this point have behaved like lists of characters:

my_string = "William"
# Slicing / indexing works, just like lists
print(my_string[4])
# But we can't change the individual characters
my_string[4] = "g"  # <- This produces an error!
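
If you do need a changed version of a string, one common workaround is to build a new string out of slices of the old one. This is a small sketch rather than something we typed in the session:

```python
my_string = "William"

# Strings can't be changed in place, but we can build a NEW string
# from slices of the old one and assign it back to the same name.
my_string = my_string[:4] + "g" + my_string[5:]
print(my_string)  # Willgam
```

The original string is never modified here; the name my_string is simply pointed at the newly built string.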

Nested Lists

Lists can contain anything, even other lists!

dogs = ["Coyote", "Wolf", "Dingo"]
cats = ["Manx", "Lion", "Tiger", "Leopard"]
horses = ["Horse", "Zebra", "Donkey"]
list_of_animal_types = [dogs, cats, horses]
print(list_of_animal_types)

# The 0-indexed item in this list is itself a list!
print(list_of_animal_types[0])  # dogs
# So we can ask for the 2-indexed item, inside the list at index 0
print(list_of_animal_types[0][2])  # dogs[2] = "Dingo"

# Note that we can't do
print(list_of_animal_types[0, 2])  # Causes an error
# because a list isn't 2D. It's a list, and we have to "get" the sub-list before
# we can access anything else. There's no concept of rows and columns like numpy arrays.

In general though, if you are in a situation where you're using lists-of-lists, you can normally do something easier using either numpy arrays (which we saw earlier) or your own custom classes (which we won't cover today).

Also, notice that sub-list elements are not counted by len:

print(len(list_of_animal_types))
# 3, since there are 3 "sub-lists" inside this list.
# The individual elements of the sub-lists do not contribute to the count,
# because they are not "direct" elements of the main list.
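
If you do want the total number of animals, one way is to ask for the length of each sub-list and add them up. This is just a sketch using the same example lists; total_animals is a name made up for this example.

```python
dogs = ["Coyote", "Wolf", "Dingo"]
cats = ["Manx", "Lion", "Tiger", "Leopard"]
horses = ["Horse", "Zebra", "Donkey"]
list_of_animal_types = [dogs, cats, horses]

# len of the outer list only counts the sub-lists
print(len(list_of_animal_types))  # 3

# "Get" each sub-list by index, take its length, and add them together
total_animals = (
    len(list_of_animal_types[0])
    + len(list_of_animal_types[1])
    + len(list_of_animal_types[2])
)
print(total_animals)  # 10
```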

The reason why we use lists is so that we can write loops. We will be looking at the for-loop, which is a way of letting us repeat instructions multiple times, once for each element in a list.

For example, if we want to print out every element in a list, with a comment, we could write out the following code

fibbonacci = [0, 1, 1, 2, 3, 5, 8, 13, 21]
print("Element 1 is", fibbonacci[0])
print("Element 2 is", fibbonacci[1])
print("Element 3 is", fibbonacci[2])
print("Element 4 is", fibbonacci[3])

but this is quite tedious. It would also break if you later changed fibbonacci to have fewer than 4 elements! Instead, we can reliably use a for-loop to run this print statement for every element in the list:

for number in fibbonacci:
    print(number)

Note: Notice how the print(number) line is indented. This is important, since Python uses indentation to know when your loop instructions end!

for number in fibbonacci:
    print("Start of loop instructions")
    print("Current number is", number)
    print("End of loop instructions, but still in the loop")
print("Now outside the loop - this text will only appear once")
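
Loops can also update variables as they go. As a sketch, here is one way to reproduce the "Element 1 is ..." messages from earlier for the whole list, and add up the numbers at the same time. The variables position and total are made up for this example.

```python
fibbonacci = [0, 1, 1, 2, 3, 5, 8, 13, 21]

position = 1  # which element we are currently on (counting from 1)
total = 0     # running sum of the numbers seen so far

for number in fibbonacci:
    print("Element", position, "is", number)
    total = total + number
    position = position + 1

# This line is not indented, so it runs once, after the loop has finished
print("The sum of all elements is", total)
```

Unlike the hand-written version, this keeps working if fibbonacci is made longer or shorter.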

Combining Loops and Plotting

Our plan is to use loops to run our analysis (or make plots for) each of our inflammation datasets.

import glob  # We will use this to fetch our list of files
import numpy
import matplotlib.pyplot

# First, we need to search for the csv data files.
# This is what glob is for.
# Remember, if your csv files are in the same folder as your notebook,
# you need "inflammation-*.csv" instead of "data/inflammation-*.csv".
csv_files = glob.glob("data/inflammation-*.csv")

# The csv files are not necessarily found in order, so we
# might need to sort them into alphabetical order
csv_files = sorted(csv_files)

# Now, we want to create the figure we had before in plotting, but
# we want to do this for EVERY data file!
# So first, we need to loop over our file names.
for filename in csv_files:
    # Print out a record so we can see what's happening
    print("Currently looking at:", filename)

    # Load the current datafile
    data = numpy.loadtxt(fname=filename, delimiter=",")

    # Prepare our blank canvas
    fig = matplotlib.pyplot.figure(figsize=(10., 3.))

    # Create our 3 subplots
    ax1 = fig.add_subplot(1, 3, 1)
    ax2 = fig.add_subplot(1, 3, 2)
    ax3 = fig.add_subplot(1, 3, 3)

    # Plot the average, max, and min of THIS dataset on the figure axes
    ax1.plot(numpy.mean(data, axis=0))
    ax2.plot(numpy.amax(data, axis=0))
    ax3.plot(numpy.amin(data, axis=0))

    # Add some nice axis labels
    ax1.set_ylabel("Average")
    ax2.set_ylabel("Max")
    ax3.set_ylabel("Min")

    # Add one super-title to each figure.
    # We use slicing to access just the "01", "02", "03" number
    # of each dataset:
    #   inflammation-XX.csv
    #                ^ ^
    #         index -6 index -4
    title = "Plot for dataset number " + filename[-6:-4]
    fig.suptitle(title)

    # Ensure the figure has a layout that doesn't overlap
    # different subplots
    fig.tight_layout()
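
If the slicing used for the title looks a bit mysterious, you can try it out on its own without loading any data. The sketch below uses a hand-written list of file names (an assumption, standing in for whatever glob finds on your machine) and only builds the titles.

```python
# Hypothetical file names, deliberately out of order like glob might return them
csv_files = [
    "data/inflammation-03.csv",
    "data/inflammation-01.csv",
    "data/inflammation-02.csv",
]
csv_files = sorted(csv_files)

titles = []
for filename in csv_files:
    # Characters from index -6 up to (but not including) index -4
    # are the two digits just before ".csv"
    dataset_number = filename[-6:-4]
    titles.append("Plot for dataset number " + dataset_number)

for title in titles:
    print(title)
```

Counting from the end of the string means this works whether or not the file name starts with "data/", but it does rely on every file name ending in two digits followed by ".csv".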