Case Study: Accessing PCM Data from the Web
1 Packages
2 Loading a single PCM year
We’re going to obtain grid-square modelled PCM data from UK-AIR, found here. Let’s just arbitrarily choose NO2. Right click on one of the download links and copy the address, and provide it as the first argument of readr::read_csv()
to read it in.
read_csv("https://uk-air.defra.gov.uk/datastore/pcm/mappm252021g.csv")
# A tibble: 254,910 × 4
pm2.5 ...2 ...3 ...4
<dbl> <dbl> <dbl> <dbl>
1 2021 NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 55671 460500 1219500 1.94
7 56360 459500 1218500 1.93
8 56361 460500 1218500 1.94
9 56362 461500 1218500 1.94
10 56363 462500 1218500 1.94
# … with 254,900 more rows
Note that it hasn’t quite been read in properly; let’s refine this function call a bit:
The UK AIR website notes that missing values are sometimes encoded as the word “MISSING”, so we can deal with that.
The actual data starts a few lines down, on line 6. To tell R to skip some lines, we can use the
skip
argument.
pcm_2021 <-
read_csv(
"https://uk-air.defra.gov.uk/datastore/pcm/mappm252021g.csv",
na = "MISSING",
skip = 5
)
3 Write a function to read in PCM data
It may be useful to write a function to read in the PCM data. We already have the “meat” of it in our read_csv()
call, but we’ll need to add extra parts to it to make it more user-friendly.
3.1 First Pass: Reading
With functions, its useful to think of inputs, outputs, and the steps needed in between. Our function should ideally just take the PCM year, and return a data frame of PCM data.
flowchart LR year --construct--> URL URL --used to read--> data
We can achieve this with the below function, which uses paste0()
to construct the URL and then runs read_csv()
using it.
There is an issue with this, however! The data are inconsistent and can’t be easily bound together - the PM concentrations and gridcodes are in four different columns!
bind_rows(pcm2019, pcm2020)
# A tibble: 694,285 × 6
ukgridcode x y pm252019g gridcode pm252020g
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 54291 460500 1221500 NA NA NA
2 54292 461500 1221500 NA NA NA
3 54294 463500 1221500 NA NA NA
4 54979 458500 1220500 NA NA NA
5 54980 459500 1220500 NA NA NA
6 54981 460500 1220500 NA NA NA
7 54982 461500 1220500 NA NA NA
8 54983 462500 1220500 NA NA NA
9 54984 463500 1220500 NA NA NA
10 54985 464500 1220500 NA NA NA
# … with 694,275 more rows
3.2 Second Pass: Cleaning
Let’s revisit our function. Now we’d prefer it to return consistently formatted data frame which can be easily bound.
flowchart LR year --construct--> URL URL --used to read--> raw[raw\ndata] raw --clean--> output[output\ndata]
Let’s add this extra cleaning step to our function. Thankfully, the columns are always in a consistent order, so we can just rename them by position (though, if we were going to use this a lot, we could do this in a more robust way in case positions change in future).
read_pcm <- function(year){
# construct a url
url <- paste0("https://uk-air.defra.gov.uk/datastore/pcm/mappm25", year, "g.csv")
# read the data
data <- read_csv(url, na = "MISSING", skip = 5)
# fix names
names(data)[[4]] <- "pm25"
names(data)[[1]] <- "ukgridcode"
# return data
return(data)
}
pcm2019 <- read_pcm(2019)
pcm2020 <- read_pcm(2020)
Now these should combine fine… but there’s no way to know which year is which! Back to the drawing board!
bind_rows(pcm2019, pcm2020)
# A tibble: 694,285 × 4
ukgridcode x y pm25
<dbl> <dbl> <dbl> <dbl>
1 54291 460500 1221500 NA
2 54292 461500 1221500 NA
3 54294 463500 1221500 NA
4 54979 458500 1220500 NA
5 54980 459500 1220500 NA
6 54981 460500 1220500 NA
7 54982 461500 1220500 NA
8 54983 462500 1220500 NA
9 54984 463500 1220500 NA
10 54985 464500 1220500 NA
# … with 694,275 more rows
3.3 Third Pass: Labelling
Let’s revisit our function one last time. We need to add one last step - assigning the year.
flowchart LR year --construct--> URL URL --used to read--> raw[raw\ndata] raw --clean--> output[output\ndata] year --include in--> output
This is just one extra line in our function, which will finally give us nice, tidy data to work with.
read_pcm <- function(year){
# construct a url
url <- paste0("https://uk-air.defra.gov.uk/datastore/pcm/mappm25", year, "g.csv")
# read the data
data <- read_csv(url, na = "MISSING", skip = 5)
# fix names
names(data)[[4]] <- "pm25"
names(data)[[1]] <- "ukgridcode"
# append year
data$year <- year
# return data
return(data)
}
pcm2019 <- read_pcm(2019)
pcm2020 <- read_pcm(2020)
bind_rows(pcm2019, pcm2020)
# A tibble: 694,285 × 5
ukgridcode x y pm25 year
<dbl> <dbl> <dbl> <dbl> <dbl>
1 54291 460500 1221500 NA 2019
2 54292 461500 1221500 NA 2019
3 54294 463500 1221500 NA 2019
4 54979 458500 1220500 NA 2019
5 54980 459500 1220500 NA 2019
6 54981 460500 1220500 NA 2019
7 54982 461500 1220500 NA 2019
8 54983 462500 1220500 NA 2019
9 54984 463500 1220500 NA 2019
10 54985 464500 1220500 NA 2019
# … with 694,275 more rows
4 Reading multiple years at once
Now we have a function, there’s nothing stopping us from reading all of the recent available data on UK AIR. We will use the purrr package to do this. The syntax for this package is a bit strange, but all you need is:
.f
- a function to apply to a list of items,.x
..x
- a list of items to which a function,.f
, will be applied.
As a really simple example, let’s apply the toupper()
function to a vector of character strings:
[[1]]
[1] "I'M"
[[2]]
[1] "NORMALLY"
[[3]]
[1] "REALLY"
[[4]]
[1] "QUIET!"
Note that map()
returns a “list”. This is a special kind of R object a bit like a vector, but it can contain a lot more than vectors can, like tibbles, graphs, models, and all sorts! purrr has some specialist functions to help turn these lists into more useful forms. For example, map_vec()
exists to put the output into a vector, if you know that’s what you’re expecting.
[1] "I'M" "NORMALLY" "REALLY" "QUIET!"
Now let’s use our read_pcm()
function to read in a load of PCM data. We’re going to make one small change to suppress all the text that’s printed to the console so we’re not spammed by read_csv()
! It also means we can use the new (as of version 1.0.0) .progress
argument to see a progress bar of our data downloading.
read_pcm <- function(year){
# construct a url
url <- paste0("https://uk-air.defra.gov.uk/datastore/pcm/mappm25", year, "g.csv")
# read the data
data <-
read_csv(
url,
na = "MISSING",
skip = 5,
progress = FALSE,
show_col_types = FALSE
)
# fix names
names(data)[[4]] <- "pm25"
names(data)[[1]] <- "ukgridcode"
# append year
data$year <- year
# return data
return(data)
}
pcm_list <- map(2010:2021, read_pcm, .progress = "Downloading PCM Data")
As before, this has returned a list. The tool to combine this into one big dataframe is purrr::list_rbind()
, which acts somewhat like bind_rows()
from dplyr but works on lists of dataframes.
pcm_all <- list_rbind(pcm_list)
# that's a lot of data!
pcm_all
# A tibble: 3,485,408 × 5
ukgridcode x y pm25 year
<dbl> <dbl> <dbl> <dbl> <int>
1 54291 460500 1221500 NA 2010
2 54292 461500 1221500 NA 2010
3 54294 463500 1221500 NA 2010
4 54979 458500 1220500 NA 2010
5 54980 459500 1220500 NA 2010
6 54981 460500 1220500 NA 2010
7 54982 461500 1220500 NA 2010
8 54983 462500 1220500 NA 2010
9 54984 463500 1220500 NA 2010
10 54985 464500 1220500 NA 2010
# … with 3,485,398 more rows
5 Plotting
Now we have read this data directly into R, we can do whatever we’d like with it! For example, if we had access to grid-square population data, we could create a population-weighted map. As we don’t have that to hand, let’s just plot up what we’ve got. For speed, we’ll just use a few years of data, but there’s nothing stopping us from plotting it all.
Starting with just one year, let’s use ggplot2 to create a simple x/y graph. To unpack the below code:
All plots start with
ggplot()
, which is the canvas on which your data is drawn.Any columns are mapped to “aesthetics” using the
aes()
function. Aesthetics can be anything represented visually in the plot - the axes, a color bar, different shapes, opacities, and so on.To put “pen to paper”, you use a
geom_*()
function. These are all effectively different geometric ways to express your data - scatter markers, lines, polygons, box plots, etc.
Here we map “x” to the x-axis, “y” to the y-axis, and “pm25” to the “fill” axis.
There is clearly room for improvement here! ggplot2 provides a small number of different coordinate systems. Normally you’ll use cartesian coordinates (the default), but there also polar coordinates (coord_polar()
) and map coordinates (coord_sf()
). coord_sf()
will redraw the coordinates using a map projection, which stops the UK looking so squat!
plt <- plt + coord_sf()
plt
Some of the more visual elements of a plot are controlled by theme()
. This changes things like text size/colour/font, line width/colour/style, panel background colour and opacitiy, and so on. In our case, we might as well strip away all of that stuff and can therefore use the “nuclear option”, theme_void()
, which drops almost all theme elements from the plot.
plt <- plt + theme_void()
plt
Now let’s refine the colour bar. ggplot2 has a lot of “scale” functions which change how aesthetics are mapped - there are scales for the x axis, y axis, colour bars, and so on. Let’s give the colour bar a better name, and even use an openair colour scheme to make it fit with our openair plots.
plt + scale_fill_gradientn(colors = openair::openColours(),
name = openair::quickText("PM2.5"))
Note that we could even do fancier things with scales, like scale transforms or binning!
plt + scale_fill_stepsn(
colors = openair::openColours(),
n.breaks = 10,
name = openair::quickText("PM2.5")
)
plt + scale_fill_gradientn(
colors = openair::openColours(),
name = openair::quickText("PM2.5"),
trans = "log10"
)
One last thing to mention is “faceting”. This allows for our data to be split up into multiple panels. In our case, this let’s us create multiple maps for different years, using the same colour scale.
ggplot(na.omit(pcmdiff), aes(x, y)) +
geom_tile(aes(fill = pm25)) +
scale_fill_gradientn(colors = openair::openColours(),
name = openair::quickText("PM2.5")) +
theme_void() +
theme(legend.position = "bottom") +
facet_wrap(vars(year)) +
coord_sf()