Research Software Engineering Summer School

Getting data from the internet

We've seen how to obtain data from our local file system.

The other common place we might want to obtain data from today is the internet.

It's very common today to treat the web as a source and store of information; we need to be able to programmatically download data, and place it in Python objects.

We may also want to be able to programmatically upload data, for example, to automatically fill in forms.

This can be really powerful if we want to, for example, do automated meta-analysis across a selection of research papers.

Uniform resource locators

All internet resources are identified by a uniform resource locator (URL), which is a particular type of uniform resource identifier (URI). For example:

In [1]:
"https://mt0.google.com:443/vt?x=658&y=340&z=10&lyrs=s"
Out[1]:
'https://mt0.google.com:443/vt?x=658&y=340&z=10&lyrs=s'

A URL consists of:

  • A scheme (http: hypertext transfer protocol, https: hypertext transfer protocol secure, ssh: secure shell, ...)
  • A host (mt0.google.com, the name of the remote computer you want to talk to)
  • A port (optional, most protocols have a typical port associated with them, e.g. 80 for HTTP, 443 for HTTPS)
  • A path (analogous to a file path on the machine, here it is just vt)
  • A query part after a ?, (optional, usually ampersand & separated parameters e.g. x=658 or z=10)

Supplementary materials: The components can actually differ between protocols, and the list above is a simplification. You can read more, for example, in the Wikipedia article on URIs.
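The components listed above can be inspected with the standard library's urllib.parse module. This is a minimal sketch using the example URL from earlier; note that urlsplit reports the path with its leading slash, and that parse_qs returns every query value as a list of strings.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://mt0.google.com:443/vt?x=658&y=340&z=10&lyrs=s"

# Split the URL into its components
parts = urlsplit(url)

print(parts.scheme)    # the scheme
print(parts.hostname)  # the host
print(parts.port)      # the port, as an integer
print(parts.path)      # the path, including the leading slash
print(parts.query)     # the raw query string after the ?

# Turn the query string into a dictionary of parameter names to value lists
query = parse_qs(parts.query)
print(query)
```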

URLs may only contain a restricted set of characters. For example, a space that appears inside a URL must be escaped by replacing it with %20, so a request for http://example.com/some file would need to be written as http://example.com/some%20file.

Supplementary materials: The code used to replace each character is its ASCII code written in hexadecimal. A space is ASCII character 32, which is 20 in hexadecimal.

Supplementary materials: The escaping rules are quite subtle. See the Wikipedia article on percent-encoding. The standard library provides the urllib.parse.urlencode function, which can take care of this for you.
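As a small illustration of percent-encoding, the sketch below uses quote and unquote from urllib.parse. The path shown is a made-up example, not a real resource.

```python
from urllib.parse import quote, unquote

# Hypothetical path containing spaces (illustration only)
raw_path = "/some folder/my file.csv"

# quote() escapes characters that are not allowed, leaving "/" untouched by default
escaped_path = quote(raw_path)
print(escaped_path)

# The escape code is the character's ASCII code written in hexadecimal
space_code = format(ord(" "), "02X")
print(space_code)

# unquote() reverses the escaping
restored_path = unquote(escaped_path)
print(restored_path)
```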

Requests

The Python Requests library can help us manipulate URLs and request the content associated with them. It is easier to use than the urllib library that is part of the standard library, and it is included with Anaconda and Canopy. It takes care of escaping, parameter encoding, and so on for us.

In [2]:
# sending requests to the web is not fully supported on jupyterlite yet, and the
# cells below might error out on the browser (jupyterlite) version of this notebook
import requests

To request the above URL, for example, we write:

In [3]:
response = requests.get(
    url="https://mt0.google.com:443/vt", 
    params={'x': 658, 'y': 340, 'lyrs': 's', 'z': 10}
)

The returned object is an instance of the requests.Response class:

In [4]:
response
Out[4]:
<Response [200]>
In [5]:
isinstance(response, requests.Response)
Out[5]:
True

The Response class defines various useful attributes associated with the response. For example, we can check the status code of our request, where a value of 200 indicates a successful request:

In [6]:
response.status_code
Out[6]:
200

We can also more directly check if the response was successful or not with the boolean Response.ok attribute

In [7]:
response.ok
Out[7]:
True
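To make the meaning of status codes a little more concrete, the sketch below uses the standard library's http.HTTPStatus to look up the standard phrase for a code. The helper names describe_status and looks_ok are hypothetical; looks_ok only mimics the documented behaviour of Response.ok (true for codes below 400) and is not the Requests implementation.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the status code together with its standard reason phrase."""
    return f"{code} {HTTPStatus(code).phrase}"


def looks_ok(code):
    """Hypothetical helper: True for codes below 400, like Response.ok."""
    return code < 400


for code in (200, 404, 500):
    print(describe_status(code), looks_ok(code))
```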

We can get the URL that was requested using the Response.url attribute

In [8]:
response.url
Out[8]:
'https://mt0.google.com:443/vt?x=658&y=340&lyrs=s&z=10'
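The parameter encoding that produced this URL can be approximated with the standard library. This is only a sketch of the idea, assuming simple parameters as in the request above; Requests handles many more cases internally.

```python
from urllib.parse import urlencode

base_url = "https://mt0.google.com:443/vt"
params = {'x': 658, 'y': 340, 'lyrs': 's', 'z': 10}

# urlencode joins the parameters with & and escapes any special characters
query_string = urlencode(params)

# The query part is attached to the base URL after a ?
full_url = f"{base_url}?{query_string}"
print(full_url)
```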

When we make a request, the associated response content, accessible at the Response.content attribute, is returned as bytes. For the JPEG image requested above, this isn't very readable:

In [9]:
type(response.content)
Out[9]:
bytes
In [10]:
response.content[:10]
Out[10]:
b'\xff\xd8\xff\xe0\x00\x10JFIF'

We can also get the content as a string using the Response.text attribute, though this is even less readable here, as some of the returned bytes do not correspond to valid characters in the text encoding used to decode them:

In [11]:
type(response.text)
Out[11]:
str
In [12]:
response.text[:10]
Out[12]:
'����\x00\x10JFIF'

To get a more useful representation of the data, we will therefore need to process the content we get using a Python function which understands the byte-encoding of the corresponding file format.
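The sketch below works offline on the ten bytes shown earlier. It illustrates two points: why decoding image bytes as text produces replacement characters, and how the leading bytes of a file can identify its format. The helper looks_like_jpeg is hypothetical, and decoding as UTF-8 with replacement is an assumption made for illustration; Requests may guess a different encoding.

```python
# The first ten bytes of the response content shown above
jpeg_start = b'\xff\xd8\xff\xe0\x00\x10JFIF'


def looks_like_jpeg(data):
    """Hypothetical check: JPEG files begin with the bytes FF D8 FF."""
    return data.startswith(b'\xff\xd8\xff')


print(looks_like_jpeg(jpeg_start))

# Decoding bytes that are not text: invalid bytes become the
# replacement character U+FFFD instead of raising an error
as_text = jpeg_start.decode('utf-8', errors='replace')
print(repr(as_text))
```

In practice we would pass the bytes to a library that understands the image format rather than inspecting them by hand.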

Again, it is important to separate the transport model (e.g. a file system, or an HTTP request for the web) from the data model of the data that is returned.

Example: sunspots

Let's try to get something scientific: the sunspot cycle data from the Sunspot Index and Long-term Solar Observations website

In [13]:
spots = requests.get('http://www.sidc.be/silso/INFO/snmtotcsv.php').text
In [14]:
spots[-100:]
Out[14]:
'91; 166.4; 23.9;  893;0\n2024;11;2024.873; 152.5; 20.9;  681;0\n2024;12;2024.958; 154.5; 25.6;  572;0\n'

This looks like semicolon-separated data, with different records on different lines. Line separators come out as \n, which is the escape sequence corresponding to a newline character in Python.

There are many scientific datasets which can now be downloaded like this - integrating the download into your data pipeline can help to keep your data flows organised.

Writing our own parser

We'll need a Python library to handle semicolon-separated data like the sunspot data.

You might be thinking: "But I can do that myself!":

In [15]:
lines = spots.split("\n")
lines[0:5]
Out[15]:
['1749;01;1749.042;  96.7; -1.0;   -1;1',
 '1749;02;1749.123; 104.3; -1.0;   -1;1',
 '1749;03;1749.204; 116.7; -1.0;   -1;1',
 '1749;04;1749.288;  92.8; -1.0;   -1;1',
 '1749;05;1749.371; 141.7; -1.0;   -1;1']
In [16]:
years = [line.split(";")[0] for line in lines]
In [17]:
years[0:15]
Out[17]:
['1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1749',
 '1750',
 '1750',
 '1750']

But don't: what if, for example, one of the records contains a separator inside it? Most programs that write such files will put the content in quotes, so that, for example,

"Something; something"; something; something

has three fields, the first of which is

Something; something

Our naive code above would however not correctly parse this input:

In [18]:
'"Something; something"; something; something'.split(';')
Out[18]:
['"Something', ' something"', ' something', ' something']

It is very hard to get all of these cases right yourself, so you'll be better off using a library to do it.
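As one possible library, the standard library's csv module handles quoted fields correctly. The sketch below parses the problematic line above, plus two records copied from the end of the sunspot data. The helper name parse_semicolon_text is hypothetical, and skipinitialspace=True is a choice made here to drop the padding spaces after each separator.

```python
import csv
import io

quoted_line = '"Something; something"; something; something'
sunspot_sample = (
    '2024;11;2024.873; 152.5; 20.9;  681;0\n'
    '2024;12;2024.958; 154.5; 25.6;  572;0\n'
)


def parse_semicolon_text(text):
    """Hypothetical helper: parse semicolon-separated text into rows of fields."""
    reader = csv.reader(io.StringIO(text), delimiter=';', skipinitialspace=True)
    # Skip any empty rows, e.g. from blank lines
    return [row for row in reader if row]


quoted_fields = parse_semicolon_text(quoted_line)[0]
print(quoted_fields)

sunspot_rows = parse_semicolon_text(sunspot_sample)
print(sunspot_rows)
```

The fields are still strings at this point; converting them to numbers is a separate step.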

Writing data to the internet

Note that we've been using requests.get: get is used to receive data from the web. You can also use post to fill in a web form programmatically.

Supplementary material: Learn about using post with Requests.

Supplementary material: Learn about the different kinds of HTTP request: Get, Post, Put, Delete...

This can be used for all kinds of things, for example, to programmatically add data to a web resource. It's all well beyond our scope for this course, but it's important to know it's possible, and start to think about the scientific possibilities.
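To give a flavour of what filling in a form involves, the sketch below builds a POST request with the standard library without sending it, so that its contents can be inspected offline. The URL and form fields are hypothetical placeholders, not a real service.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form address and fields (illustration only)
form_url = "https://example.com/submit"
form_fields = {"name": "Ada Lovelace", "year": 1843}

# Form data is sent in the request body, encoded like a query string
body = urlencode(form_fields).encode("ascii")

# Build the request object; nothing is sent until it is passed to urlopen
request = Request(form_url, data=body, method="POST")

print(request.get_method())
print(request.full_url)
print(request.data)
```

With Requests, the equivalent would be a call such as requests.post(form_url, data=form_fields), which encodes the fields and sends the request in one step.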