Case Study: Accessing PCM Data from the Web

Published

February 13, 2023

Abstract

One of the most appealling aspects of learning a programming language is to ‘join up’ all parts of your data analysis workflow in a single place. While reading in data locally can be picked up relatively quickly, obtaining data straight from the internet can often evade new users. This case study details how data can be obtained directly from the web, avoiding needing the extra human step of clicking the ‘download’ button!

1 Packages

library(readr) # read rectangular data
library(dplyr) # data manipulation
library(purrr) # iterate over a list, .x, with a function, .f
library(ggplot2) # visualise data

2 Loading a single PCM year

We’re going to obtain grid-square modelled PCM data from UK-AIR, found here. Let’s just arbitrarily choose NO₂. Right click on one of the download links and copy the address, and provide it as the first argument of readr::read_csv() to read it in.

read_csv("https://uk-air.defra.gov.uk/datastore/pcm/mappm252021g.csv")

# A tibble: 254,910 × 4
   pm2.5   ...2    ...3  ...4
   <dbl>  <dbl>   <dbl> <dbl>
 1  2021     NA      NA NA   
 2    NA     NA      NA NA   
 3    NA     NA      NA NA   
 4    NA     NA      NA NA   
 5    NA     NA      NA NA   
 6 55671 460500 1219500  1.94
 7 56360 459500 1218500  1.93
 8 56361 460500 1218500  1.94
 9 56362 461500 1218500  1.94
10 56363 462500 1218500  1.94
# … with 254,900 more rows

Note that it hasn’t quite been read in properly; let’s refine this function call a bit:

The UK AIR website notes that missing values are sometimes encoded as the word “MISSING”, so we can deal with that.
The actual data starts a few lines down, on line 6. To tell R to skip some lines, we can use the skip argument.

pcm_2021 <- 
  read_csv(
    "https://uk-air.defra.gov.uk/datastore/pcm/mappm252021g.csv",
    na = "MISSING",
    skip = 5
  )

3 Write a function to read in PCM data

It may be useful to write a function to read in the PCM data. We already have the “meat” of it in our read_csv() call, but we’ll need to add extra parts to it to make it more user-friendly.

3.1 First Pass: Reading

With functions, its useful to think of inputs, outputs, and the steps needed in between. Our function should ideally just take the PCM year, and return a data frame of PCM data.

flowchart LR
  year --construct--> URL
  URL --used to read--> data

We can achieve this with the below function, which uses paste0() to construct the URL and then runs read_csv() using it.

read_pcm <- function(year){
  # construct a url
  url <- paste0("https://uk-air.defra.gov.uk/datastore/pcm/mappm25", year, "g.csv")

  # read the data  
  read_csv(url, na = "MISSING", skip = 5)
}

pcm2019 <- read_pcm(2019)
pcm2020 <- read_pcm(2020)

There is an issue with this, however! The data are inconsistent and can’t be easily bound together - the PM concentrations and gridcodes are in four different columns!

bind_rows(pcm2019, pcm2020)

# A tibble: 694,285 × 6
   ukgridcode      x       y pm252019g gridcode pm252020g
        <dbl>  <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
 1      54291 460500 1221500        NA       NA        NA
 2      54292 461500 1221500        NA       NA        NA
 3      54294 463500 1221500        NA       NA        NA
 4      54979 458500 1220500        NA       NA        NA
 5      54980 459500 1220500        NA       NA        NA
 6      54981 460500 1220500        NA       NA        NA
 7      54982 461500 1220500        NA       NA        NA
 8      54983 462500 1220500        NA       NA        NA
 9      54984 463500 1220500        NA       NA        NA
10      54985 464500 1220500        NA       NA        NA
# … with 694,275 more rows

3.2 Second Pass: Cleaning

Let’s revisit our function. Now we’d prefer it to return consistently formatted data frame which can be easily bound.

flowchart LR
  year --construct--> URL
  URL --used to read--> raw[raw\ndata]
  raw --clean--> output[output\ndata]

Let’s add this extra cleaning step to our function. Thankfully, the columns are always in a consistent order, so we can just rename them by position (though, if we were going to use this a lot, we could do this in a more robust way in case positions change in future).

read_pcm <- function(year){
  # construct a url
  url <- paste0("https://uk-air.defra.gov.uk/datastore/pcm/mappm25", year, "g.csv")

  # read the data  
  data <- read_csv(url, na = "MISSING", skip = 5)
  
  # fix names
  names(data)[[4]] <- "pm25"
  names(data)[[1]] <- "ukgridcode"
  
  # return data
  return(data)
}


pcm2019 <- read_pcm(2019)
pcm2020 <- read_pcm(2020)

Now these should combine fine… but there’s no way to know which year is which! Back to the drawing board!

bind_rows(pcm2019, pcm2020)

# A tibble: 694,285 × 4
   ukgridcode      x       y  pm25
        <dbl>  <dbl>   <dbl> <dbl>
 1      54291 460500 1221500    NA
 2      54292 461500 1221500    NA
 3      54294 463500 1221500    NA
 4      54979 458500 1220500    NA
 5      54980 459500 1220500    NA
 6      54981 460500 1220500    NA
 7      54982 461500 1220500    NA
 8      54983 462500 1220500    NA
 9      54984 463500 1220500    NA
10      54985 464500 1220500    NA
# … with 694,275 more rows

3.3 Third Pass: Labelling

Let’s revisit our function one last time. We need to add one last step - assigning the year.

flowchart LR
  year --construct--> URL
  URL --used to read--> raw[raw\ndata]
  raw --clean--> output[output\ndata]
  year --include in--> output

This is just one extra line in our function, which will finally give us nice, tidy data to work with.

read_pcm <- function(year){
  # construct a url
  url <- paste0("https://uk-air.defra.gov.uk/datastore/pcm/mappm25", year, "g.csv")

  # read the data  
  data <- read_csv(url, na = "MISSING", skip = 5)
  
  # fix names
  names(data)[[4]] <- "pm25"
  names(data)[[1]] <- "ukgridcode"
  
  # append year
  data$year <- year
  
  # return data
  return(data)
}

pcm2019 <- read_pcm(2019)
pcm2020 <- read_pcm(2020)

bind_rows(pcm2019, pcm2020)

# A tibble: 694,285 × 5
   ukgridcode      x       y  pm25  year
        <dbl>  <dbl>   <dbl> <dbl> <dbl>
 1      54291 460500 1221500    NA  2019
 2      54292 461500 1221500    NA  2019
 3      54294 463500 1221500    NA  2019
 4      54979 458500 1220500    NA  2019
 5      54980 459500 1220500    NA  2019
 6      54981 460500 1220500    NA  2019
 7      54982 461500 1220500    NA  2019
 8      54983 462500 1220500    NA  2019
 9      54984 463500 1220500    NA  2019
10      54985 464500 1220500    NA  2019
# … with 694,275 more rows

4 Reading multiple years at once

Now we have a function, there’s nothing stopping us from reading all of the recent available data on UK AIR. We will use the purrr package to do this. The syntax for this package is a bit strange, but all you need is:

.f - a function to apply to a list of items, .x.
.x - a list of items to which a function, .f, will be applied.

As a really simple example, let’s apply the toupper() function to a vector of character strings:

map(.x = c("i'm", "normally", "really", "quiet!"),
    .f = toupper)

[[1]]
[1] "I'M"

[[2]]
[1] "NORMALLY"

[[3]]
[1] "REALLY"

[[4]]
[1] "QUIET!"

Note that map() returns a “list”. This is a special kind of R object a bit like a vector, but it can contain a lot more than vectors can, like tibbles, graphs, models, and all sorts! purrr has some specialist functions to help turn these lists into more useful forms. For example, map_vec() exists to put the output into a vector, if you know that’s what you’re expecting.

map_vec(.x = c("i'm", "normally", "really", "quiet!"),
        .f = toupper)

[1] "I'M"      "NORMALLY" "REALLY"   "QUIET!"

Now let’s use our read_pcm() function to read in a load of PCM data. We’re going to make one small change to suppress all the text that’s printed to the console so we’re not spammed by read_csv()! It also means we can use the new (as of version 1.0.0) .progress argument to see a progress bar of our data downloading.

read_pcm <- function(year){
  # construct a url
  url <- paste0("https://uk-air.defra.gov.uk/datastore/pcm/mappm25", year, "g.csv")

  # read the data  
  data <-
    read_csv(
      url,
      na = "MISSING",
      skip = 5,
      progress = FALSE,
      show_col_types = FALSE
    )
  
  # fix names
  names(data)[[4]] <- "pm25"
  names(data)[[1]] <- "ukgridcode"
  
  # append year
  data$year <- year
  
  # return data
  return(data)
}

pcm_list <- map(2010:2021, read_pcm, .progress = "Downloading PCM Data")

As before, this has returned a list. The tool to combine this into one big dataframe is purrr::list_rbind(), which acts somewhat like bind_rows() from dplyr but works on lists of dataframes.

pcm_all <- list_rbind(pcm_list)

# that's a lot of data!
pcm_all

# A tibble: 3,485,408 × 5
   ukgridcode      x       y  pm25  year
        <dbl>  <dbl>   <dbl> <dbl> <int>
 1      54291 460500 1221500    NA  2010
 2      54292 461500 1221500    NA  2010
 3      54294 463500 1221500    NA  2010
 4      54979 458500 1220500    NA  2010
 5      54980 459500 1220500    NA  2010
 6      54981 460500 1220500    NA  2010
 7      54982 461500 1220500    NA  2010
 8      54983 462500 1220500    NA  2010
 9      54984 463500 1220500    NA  2010
10      54985 464500 1220500    NA  2010
# … with 3,485,398 more rows

5 Plotting

Now we have read this data directly into R, we can do whatever we’d like with it! For example, if we had access to grid-square population data, we could create a population-weighted map. As we don’t have that to hand, let’s just plot up what we’ve got. For speed, we’ll just use a few years of data, but there’s nothing stopping us from plotting it all.

pcm2021 <- filter(pcm_all, year == 2021)
pcmdiff <- filter(pcm_all, year %in% c(2010, 2021))

Starting with just one year, let’s use ggplot2 to create a simple x/y graph. To unpack the below code:

All plots start with ggplot(), which is the canvas on which your data is drawn.
Any columns are mapped to “aesthetics” using the aes() function. Aesthetics can be anything represented visually in the plot - the axes, a color bar, different shapes, opacities, and so on.
To put “pen to paper”, you use a geom_*() function. These are all effectively different geometric ways to express your data - scatter markers, lines, polygons, box plots, etc.

Here we map “x” to the x-axis, “y” to the y-axis, and “pm25” to the “fill” axis.

plt <- 
  ggplot(pcm2021, aes(x = x, y = y)) +
  geom_tile(aes(fill = pm25))

plt

Figure 1: A rough and ready ggplot graph!

There is clearly room for improvement here! ggplot2 provides a small number of different coordinate systems. Normally you’ll use cartesian coordinates (the default), but there also polar coordinates (coord_polar()) and map coordinates (coord_sf()). coord_sf() will redraw the coordinates using a map projection, which stops the UK looking so squat!

plt <- plt + coord_sf()
plt

Figure 2: Our plot now has a proper coordinate system.

Some of the more visual elements of a plot are controlled by theme(). This changes things like text size/colour/font, line width/colour/style, panel background colour and opacitiy, and so on. In our case, we might as well strip away all of that stuff and can therefore use the “nuclear option”, theme_void(), which drops almost all theme elements from the plot.

plt <- plt + theme_void()
plt

Figure 3: We’ve removed a lot of the excess ‘chart junk’ from our figure.

Now let’s refine the colour bar. ggplot2 has a lot of “scale” functions which change how aesthetics are mapped - there are scales for the x axis, y axis, colour bars, and so on. Let’s give the colour bar a better name, and even use an openair colour scheme to make it fit with our openair plots.

plt + scale_fill_gradientn(colors = openair::openColours(),
                           name = openair::quickText("PM2.5"))

Note that we could even do fancier things with scales, like scale transforms or binning!

plt + scale_fill_stepsn(
  colors = openair::openColours(),
  n.breaks = 10,
  name = openair::quickText("PM2.5")
)

plt + scale_fill_gradientn(
  colors = openair::openColours(),
  name = openair::quickText("PM2.5"),
  trans = "log10"
)

One last thing to mention is “faceting”. This allows for our data to be split up into multiple panels. In our case, this let’s us create multiple maps for different years, using the same colour scale.

ggplot(na.omit(pcmdiff), aes(x, y)) +
  geom_tile(aes(fill = pm25)) +
  scale_fill_gradientn(colors = openair::openColours(),
                       name = openair::quickText("PM2.5")) +
  theme_void() +
  theme(legend.position = "bottom") +
  facet_wrap(vars(year)) +
  coord_sf()