Shiny App for the Wittgenstein Centre Population Projections

A few weeks ago a new version of the Wittgenstein Centre Data Explorer was launched. The data explorer is intended to disseminate the results of a recent global population projection exercise, which uniquely incorporates level of education (as well as age and sex) and the scientific input of more than 500 population experts around the world. Included are the projected populations used in the 5th Assessment Report of the Intergovernmental Panel on Climate Change (IPCC).


Over the past year or so I have been working (on and off) with the data lab team to create a shiny app, on which the data explorer is based. All the code and data are available on my GitHub page. Below are notes summarising some of the lessons I learnt:

1. Large data

We had a pretty large amount of data to display (31 indicators based on up to 7 scenarios x 26 time periods x 223 geographical areas x 21 age groups x 2 genders x 7 education categories)… so somewhere over 8 million rows for some indicators. Further complexity was added by the fact that some indicators were by definition not available for some dimensions of the data; for example, population median age is not available by age group. The size and complexity meant that data manipulations were a big issue. Using read.csv to load the data didn't really cut the mustard, taking over 2 minutes when running on the server. The fantastic saves package, and its loads function, came to the rescue, alongside some pre-formatting to avoid as much joining and reshaping of the data on the server as possible. This cut load times to a couple of seconds at most, and allowed the app to work with the indicator variables on the fly as demanded by the user selections. Once the data was in, the more than awesome dplyr functions finished the data manipulation jobs in style. I am sure there is some smarter way to get everything running a little quicker than it does now, but I am pretty happy with the present speed, given the initial waiting times.
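The overall pattern was: pre-format once, serialise, then load and filter on demand. A minimal sketch of that idea (using base R's saveRDS/readRDS rather than the saves package itself, and a toy data frame in place of the real indicator files):

```r
library(dplyr)

# toy stand-in for one indicator table (the real ones run to millions of rows)
pop <- expand.grid(scenario = c("SSP1", "SSP2"),
                   period   = seq(2010, 2100, 5),
                   edu      = c("none", "primary", "secondary"),
                   KEEP.OUT.ATTRS = FALSE)
pop$value <- runif(nrow(pop))

# one-off pre-formatting step: store a serialised copy of the tidied table
f <- tempfile(fileext = ".rds")
saveRDS(pop, f)

# at app start-up: reloading the binary copy is far quicker than re-parsing a csv
pop <- readRDS(f)

# then dplyr handles the on-the-fly manipulations driven by user selections
pop_sub <- pop %>%
  filter(scenario == "SSP2") %>%
  group_by(period) %>%
  summarise(total = sum(value))
```
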

2. googleVis and gvisMerge

It’s a demographic data explorer, which means population pyramids have to pop up somewhere. We needed pyramids that illustrate population sizes by education level, on top of the standard age and sex breakdown. Static versions of the education pyramids in the explorer have previously been used by my colleagues to illustrate past and future populations. For the graphic explorer I created some interactive versions, for side-by-side comparisons over time and between countries, which also have some tooltip features. These took a little while to develop. I played with ggvis but couldn’t get my bar charts to go horizontal. I also took a look at some other functions for interactive pyramids, but I couldn’t figure out a way to overlay the educational dimension. I found a solution by creating gender-specific stacked bar charts with gvisBarChart in the googleVis package and then using gvisMerge to bring them together in one plot. As with the data tables, they take a second or so to render, so I added a withProgress bar to try and keep the user entertained.
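The construction looked roughly like this (a stripped-down sketch with made-up numbers, not the app's actual code; in a real pyramid one sex's values are typically negated so the two sides mirror each other):

```r
library(googleVis)

# toy education-specific population counts for one sex
pyr <- data.frame(age       = c("0-19", "20-39", "40-59", "60+"),
                  no_edu    = c(10, 8, 6, 4),
                  primary   = c(20, 18, 12, 6),
                  secondary = c(15, 20, 14, 5))

# one stacked horizontal bar chart per sex
males <- gvisBarChart(pyr, xvar = "age",
                      yvar = c("no_edu", "primary", "secondary"),
                      options = list(isStacked = TRUE, legend = "none"))
females <- gvisBarChart(pyr, xvar = "age",
                        yvar = c("no_edu", "primary", "secondary"),
                        options = list(isStacked = TRUE))

# gvisMerge places the two charts side by side in a single HTML object
pyramid <- gvisMerge(males, females, horizontal = TRUE)
# plot(pyramid) opens the merged chart in a browser
```
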

I could not figure out a way in R to convert the HTML code outputted by the gvisMerge function to a familiar file format for users to download. Instead I used a system call to the wkhtmltopdf program to return PDF or PNG files. By default, wkhtmltopdf was a bit hit and miss, especially when converting the more complex plots and maps to PNG files. I found that setting the --enable-javascript and --javascript-delay 2000 options helped in many cases.
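The conversion step amounted to something like this (a sketch, not the app's exact download handler; file names are illustrative, and wkhtmltopdf and its companion wkhtmltoimage must be installed and on the server's path):

```r
# write out the HTML produced by gvisMerge, then shell out to convert it
# print(pyramid, file = "chart.html")   # pyramid being a gvis object

# PDF download, with JavaScript enabled and a delay so the chart can render
system("wkhtmltopdf --enable-javascript --javascript-delay 2000 chart.html chart.pdf")

# PNG download, via the companion wkhtmltoimage tool
system("wkhtmltoimage --enable-javascript --javascript-delay 2000 chart.html chart.png")
```
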

3. The shiny user community

I asked questions using the shiny tag on stackoverflow and the shiny google group a number of times. A big thank you to everyone who helped me out. Browsing through other questions and answers was also super helpful. I found this question on organising large shiny code particularly useful. Making small changes during the reviewing process became a lot easier once I broke the code up across multiple .R files with sensible names.

4. Navbar Pages

When I started building the shiny app I was using a single layout with a sidebar and tabbed pages to display data and graphics (using tabsetPanel()), adding extra tabs as we developed new features (data selection, an assumption database, population pyramids, plots of population size, maps, FAQs, etc., etc.). As these grew, the switch to the new Navbar layout helped clean up the appearance and provide a better user experience, where you can move between data, graphics and background information using the bar at the top of the page.
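In skeleton form the switch amounted to something like this (panel names follow the explorer's top-level sections, but the contents here are placeholders rather than the app's real code):

```r
library(shiny)

ui <- navbarPage("Wittgenstein Centre Data Explorer",
  tabPanel("Data",
    sidebarLayout(
      sidebarPanel(selectInput("ind", "Indicator",
                               c("Population Size", "Median Age"))),
      mainPanel(tableOutput("tab"))
    )
  ),
  tabPanel("Graphics", plotOutput("pyramid")),
  tabPanel("About", p("Background information, assumptions and FAQs"))
)
```
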

5. Shading and link buttons

I added some shading and buttons to help navigate through the data selection and between different tabs. For the shading I used an online CSS gradient generator to produce the colour of a fluidRow background. The code generated there was copied and pasted into a tags$style element for my defined row myRow1, as such:

  ui <- shinyUI(fluidPage(
      fluidRow(
        class = "myRow1",
        selectInput('variable', 'Variable', names(iris))
      ),
      tags$style(".myRow1{background: rgba(212,228,239,1); 
                  background: -moz-linear-gradient(left, rgba(212,228,239,1) 0%, rgba(44,146,208,1) 100%);
                  background: -webkit-gradient(left top, right top, color-stop(0%, rgba(212,228,239,1)), color-stop(100%, rgba(44,146,208,1)));
                  background: -webkit-linear-gradient(left, rgba(212,228,239,1) 0%, rgba(44,146,208,1) 100%);
                  background: -o-linear-gradient(left, rgba(212,228,239,1) 0%, rgba(44,146,208,1) 100%);
                  background: -ms-linear-gradient(left, rgba(212,228,239,1) 0%, rgba(44,146,208,1) 100%);
                  background: linear-gradient(to right, rgba(212,228,239,1) 0%, rgba(44,146,208,1) 100%);
                  filter: progid:DXImageTransform.Microsoft.gradient( startColorstr='#d4e4ef', endColorstr='#2c92d0', GradientType=1 );
                  border-radius: 10px 10px 10px 10px;
                  -moz-border-radius: 10px 10px 10px 10px;
                  -webkit-border-radius: 10px 10px 10px 10px;}")
  ))

  server <- function(input, output) {
  }

I added some buttons to help novice users switch between tabs once they had selected or viewed their data. This was a little tougher to implement than the shading, and in the end I needed a little help. I added some icons and defined the style of the navigation buttons (using the tags$style element again).

That is about it for the moment. I might add a few more notes to this post as they occur to me… I would encourage anyone who is tempted to learn shiny to take the plunge. I did not know JavaScript or any other web languages before I started, and I still don’t… which is the great beauty of the shiny package. I started with the RStudio tutorials, which are fantastic. The R code did not get a whole lot more complex than what I learnt there, even though the shiny app is quite large in comparison to most others I have seen.

Any comments or suggestions for improving the website are welcome.

Forecasting Environmental Immigration to the UK

A couple of months ago, a paper I worked on with co-authors from the Centre for Population Change was published in Population and Environment. It summarised work we did as part of the UK Government Office for Science Foresight project on Migration and Global Environmental Change. Our aim was to build expert-based forecasts of environmental immigrants to the UK. We conducted a Delphi survey of nearly 30 migration experts from academia, the civil service and non-governmental organisations to obtain estimates, with uncertainty, of future levels of immigration to the UK in 2030 and 2060. We also asked them what proportion of current and future immigration are/will be environmental migrants. The results were incorporated into a set of model-averaged Bayesian time series models through prior distributions on the mean and variance terms.

The plots in the journal article got somewhat butchered during the publication process. Below is the non-butchered version for future immigration to the UK, alongside past immigration data from the Office for National Statistics.
At first, I was a bit taken aback by this plot. A few experts thought there were going to be some very high levels of future immigration, which causes the rather striking large upper tail. However, at a second glance, the central percentiles show a gentle decrease, where there is only (approximately) a 30% chance of an increase in future migration from the 2010 level throughout the forecast period.

The expert based forecast for total immigration was combined with the responses to questions on the proportion of environmental migrants, to obtain an estimate on both the current level of environmental migration (which is not currently measured) and future levels:

As is the way with these things, we came across some problems in our project. The first was with the definition of an environmental migrant, which is not completely nailed down in the migration literature. As a result, part of the uncertainty in the expert-based forecasts reflects not only the future level but also the measure itself. The second was with the elicitation of uncertainty. We used a Likert-type scale, which caused some difficulties even during the later rounds of the Delphi survey. If I were to do it over, I reckon this problem could be much better addressed by getting experts to visualise their forecast fans in an interactive website, perhaps a shiny app built with the fanplot package. Such an approach would result in smoother fans than those in the plots above, which were based on interpolations from expert answers at only two points of time in the future (2030 and 2060).

Publication Details:

Abel, G.J., Bijak, J., Findlay, A.M., McCollum, D. and Wiśniowski, A. (2013). Forecasting environmental migration to the United Kingdom: An exploration using Bayesian models. Population and Environment 35 (2), 183–203.

Over the next 50 years, the potential impact of environmental change on human livelihoods could be considerable, with one possible consequence being increased levels of human mobility. This paper explores how uncertainty about the level of immigration to the United Kingdom as a consequence of environmental factors elsewhere may be forecast using a methodology involving Bayesian models. The conceptual understanding of forecasting is advanced in three ways. First, the analysis is believed to be the first time that the Bayesian modelling approach has been attempted in relation to environmental mobility. Second, the paper considers the expediency of this approach by comparing the responses to a Delphi survey with conventional expectations about environmental mobility in the research literature. Finally, the values and assumptions of the expert evidence provided in the Delphi survey are interrogated to illustrate the limited set of conditions under which forecasts of environmental mobility, as set out in this paper, are likely to hold.

Bank of England Fan Charts in R

I have updated (and will continue to update) this post as I expand the fanplot package.

I managed to catch David Spiegelhalter’s Tails You Win on BBC iPlayer last week. I missed it the first time round, only for my parents to tell me, on my last visit home, about a statistician jumping out of a plane on TV. It was a great watch. Towards the end I spotted some fan charts used by the Bank of England to illustrate uncertainty in their forecasts, similar to this one:

Bank of England February 2013 CPI Fan Chart.
Obtained from Chart 5.3 in February 2013 Inflation Report

They discussed how even in the tails of their GDP predictive distribution they missed the financial crisis by a long shot. This got me googling, trying to find out how they made the plots, something that (also) completely passed me by when I put together my fanplot package for R. As far as I could tell they did them in Excel, although (appropriately) I am not completely certain. There are also MATLAB files that can create fan charts. Anyhow, I thought I would have a go at replicating a Bank of England fan chart in R….

Split-Normal (Two-Piece Normal) Distribution.

The Bank of England produce fan charts of forecasts for CPI and GDP in their quarterly Inflation Reports. They also provide data, in the form of the mode, uncertainty and skewness parameters of the split-normal distribution that underlies their fan charts (the Bank of England predominantly refer to the equivalent, re-parametrised, two-piece normal distribution). The probability density of the split-normal distribution is given by Julio (2007) as

f(x; \mu, \sigma_1, \sigma_2) = \left\{  \begin{array}{ll}  \frac{\sqrt 2}{\sqrt\pi (\sigma_1+\sigma_2)} e^{-\frac{1}{2\sigma_1^2}(x-\mu)^2} & \mbox{for } -\infty < x \leq \mu \\  \frac{\sqrt 2}{\sqrt\pi (\sigma_1+\sigma_2)} e^{-\frac{1}{2\sigma_2^2}(x-\mu)^2} & \mbox{for } \mu < x < \infty \\  \end{array},  \right.

where \mu represents the mode parameter, and the two standard deviations \sigma_1 and \sigma_2 can be derived given the overall uncertainty parameter, \sigma, and the skewness parameter, \gamma, using the conversion formulas set out in Julio (2007).


As no split-normal distribution existed in R, I added routines for a density, distribution and quantile function, plus a random generator, to a new version (2.1) of the fanplot package. I used the formulae in Julio (2007) to code each of the functions, and checked the results against those from the fan chart MATLAB code.
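One quick sanity check of my own (not from the package documentation): with the skew parameter at zero the split normal collapses to an ordinary normal distribution, so the new functions should agree with their base R counterparts:

```r
library(fanplot)

x <- seq(-3, 3, by = 0.5)

# density and distribution functions match dnorm/pnorm when skew = 0
all.equal(dsplitnorm(x, mode = 0, sd = 1, skew = 0), dnorm(x))
all.equal(psplitnorm(x, mode = 0, sd = 1, skew = 0), pnorm(x))

# with no skew, the median of the split normal sits at the mode
qsplitnorm(0.5, mode = 1.34, sd = 0.2249, skew = 0)
```
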

Fan Chart Plots for CPI.

Once I had the qsplitnorm function working properly, producing the fan chart plot in R was pretty straightforward. I added two data objects to the fanplot package to help readers reproduce my plots below. The first, cpi, is a time series object with past values of the CPI index. The second, boe, is a data frame with historical details of the split-normal parameters for the Bank of England’s CPI inflation forecasts published between Q1 2004 and Q4 2013.

> library("fanplot")
> head(boe)
  time0    time mode uncertainty skew
1  2004 2004.00 1.34      0.2249    0
2  2004 2004.25 1.60      0.3149    0
3  2004 2004.50 1.60      0.3824    0
4  2004 2004.75 1.71      0.4274    0
5  2004 2005.00 1.77      0.4499    0
6  2004 2005.25 1.68      0.4761    0

The first column time0 refers to the base year of forecast, the second, time indexes future projections, whilst the remaining three columns provide values for the corresponding projected mode (\mu), uncertainty (\sigma) and skew (\gamma) parameters:

Users can replicate past Bank of England fan charts for a particular period after creating a matrix object that contains values of the split-normal quantile function for a set of user-defined probabilities. For example, in the code below, the subset of Bank of England parameters for future CPI published in Q1 2013 is first selected. Then a vector of probabilities is created, corresponding to the percentiles for which we ultimately would like to plot different shaded fans. Finally, in a for loop, the qsplitnorm function calculates the values for which the time-specific (i) split-normal distribution will be less than or equal to the probabilities in p.

# select relevant data
y0 <- 2013
boe0 <- subset(boe, time0==y0)
k <- nrow(boe0)

# guess work to set percentiles the BOE are plotting
p <- seq(0.05, 0.95, 0.05)
p <- c(0.01, p, 0.99)
# quantiles of split-normal distribution for each probability
# (row) at each future time point (column)
cpival <- matrix(NA, nrow = length(p), ncol = k)
for (i in 1:k) 
   cpival[, i] <- qsplitnorm(p, mode = boe0$mode[i], 
                                sd = boe0$uncertainty[i],
                                skew = boe0$skew[i])

The new object cpival contains the values evaluated from the qsplitnorm function in 21 rows and 13 columns, where rows represent the probabilities used in the calculation (p) and columns represent successive time periods.

The object cpival can then be used to add a fan chart to the active R graphics device. In the code below, the area of the plot is set up when plotting the past CPI data, contained in the time series object cpi. The xlim arguments are set to ensure space on the right-hand side of the plotting area for the fan. Following the Bank of England style for plotting fan charts, the background for future values is set to a grey colour, the y-axis is plotted on the right-hand side, a horizontal line is added for the CPI target and a vertical line for the two-year-ahead point.

# past data
plot(cpi, type = "l", col = "tomato", lwd = 2, 
     xlim = c(y0 - 5, y0 + 3), ylim = c(-2, 7), 
     xaxt = "n", yaxt = "n", ylab="")
# background
rect(y0 - 0.25, par("usr")[3] - 1, y0 + 3, par("usr")[4] + 1, 
     border = "gray90", col = "gray90")
# add fan
fan(data = cpival, data.type = "values", probs = p, 
    start = y0, frequency = 4, 
    anchor = cpi[time(cpi) == y0 - 0.25], 
    fan.col = colorRampPalette(c("tomato", "gray90")),  
    ln = NULL, rlab = NULL)
# boe aesthetics
axis(2, at = -2:7, las = 2, tcl = 0.5, labels = FALSE)
axis(4, at = -2:7, las = 2, tcl = 0.5)
axis(1, at = 2008:2016, tcl = 0.5)
axis(1, at = seq(2008, 2016, 0.25), labels = FALSE, tcl = 0.2)
abline(h = 2)  #boe cpi target
abline(v = y0 + 1.75, lty = 2)  #2 year line

The fan chart itself is output by the fan function, where arguments are set to ensure a close resemblance of the R plot to that produced by the Bank of England. The first three arguments in the fan function called in the above code provide the cpival data to be plotted, indicate that the data are a set of calculated values (as opposed to simulations) and provide the probabilities that correspond to each row of the cpival object. The next two arguments define the start time and frequency of the data. These operate in a similar fashion to those used when defining time series in R with the ts function. The anchor argument is set to the value of CPI before the start of the fan chart. This allows a join between the value of the Q1 2013 observation and the fan chart. The fan.col argument is set to a colour palette for shades between tomato and gray90. The final two arguments are set to NULL to suppress the plotting of contour lines at the boundary of each shaded fan and their labels, as per the Bank of England style.

Default Fan Chart Plot.

By default, the fan function treats objects passed to the data argument as simulations from sequential distributions, rather than user-created values corresponding to the probabilities provided in the probs argument (as above). An alternative plot below, based on simulated data and the default style settings in the fan function, produces a fan chart with a greater array of coloured fans, with labels and contour lines alongside selected percentiles of the future distribution. To illustrate, we can simulate 10,000 values from the future split-normal distributions with the Q1 2013 parameters in the boe0 data frame using the rsplitnorm function

#simulate future values
cpisim <- matrix(NA, nrow = 10000, ncol = k)
for (i in 1:k) 
   cpisim[, i] <- rsplitnorm(n=10000, mode = boe0$mode[i],
                                      sd = boe0$uncertainty[i], 
                                      skew = boe0$skew[i])

The fan chart based on the simulations in cpisim can then be added to the plot;

# truncate cpi series
cpi0 <- ts(cpi[time(cpi)<2013], start=start(cpi), 
           frequency=frequency(cpi) )

# past data
plot(cpi0, type = "l", lwd = 2, 
     xlim = c(y0 - 5, y0 + 3.25), ylim = c(-2, 7))

# add fan
fan(data = cpisim, start = y0, frequency = 4) 

The fan function calculates the values of 100 equally spaced percentiles of each future distribution when the default data.type = "simulations" is set. This allows 50 fans to be plotted from the heat.colors colour palette, providing a finer level of shading in the representation of future distributions. In addition, lines and labels are provided along each decile. The fan chart does not connect to the last observation, as anchor = NULL by default.

Does specification matter? Experiments with simple multiregional probabilistic population projections.

A paper that I am a co-author on, looking at uncertainty in population forecasting generated by different measures of migration, came out this week in Environment and Planning A. Basically, try and avoid using net migration measures. Not only do they tend to give some dodgy projections, we also found out that they give you more uncertainty. Using in- and out-migration measures in a projection model gives a big reduction in uncertainty over a net measure. They are also a fairly good approximation to the uncertainty from a full multiregional projection model. Plots in the paper were done by my good self using the fanplot package.

Publication Details:

Raymer, J., Abel, G.J. and Rogers, A. (2012). Does Specification Matter? Experiments with Simple Multiregional Probabilistic Population Projections. Environment and Planning A 44 (11), 2664–2686.

Population projection models that introduce uncertainty are a growing subset of projection models in general. In this paper we focus on the importance of decisions made with regard to the model specifications adopted. We compare the forecasts and prediction intervals associated with four simple regional population projection models: an overall growth rate model, a component model with net migration, a component model with in-migration and out-migration rates, and a multiregional model with destination-specific out-migration rates. Vector autoregressive models are used to forecast future rates of growth, birth, death, net migration, in-migration and out-migration, and destination-specific out-migration for the North, Midlands, and South regions in England. They are also used to forecast different international migration measures. The base data represent a time series of annual data provided by the Office for National Statistics from 1976 to 2008. The results illustrate how both the forecasted subpopulation totals and the corresponding prediction intervals differ for the multiregional model in comparison to other simpler models, as well as for different assumptions about international migration. The paper ends with a discussion of our results and possible directions for future research.

The fanplot package for R

I have updated (and will continue to update) this post as I expand the fanplot package.

My fanplot package has gone up on CRAN. Below is an online version of the package vignette…

Visualising Time Series Model Results

The fanplot package can also be used to display uncertainty in estimates from time series models. To illustrate, the package’s th.mcmc data frame object contains posterior density distributions of the estimated volatility of daily returns y_t of the Pound/Dollar exchange rate from 02/10/1981 to 28/6/1985. These distributions are from an MCMC simulation of a stochastic volatility model given in Meyer and Yu (2002), where it is assumed that

y_t | \theta_t = \exp\left(\frac{1}{2}\theta_t\right)u_t \qquad u_t \sim N(0, 1) \qquad t=1,\ldots,n.

The latent volatilities \theta_t, which are unknown states in state-space model terminology, are assumed to follow a Markovian transition over time, given by the state equation:

\theta_t | \theta_{t-1}, \mu, \phi, \tau^2 = \mu + \phi (\theta_{t-1} - \mu) + v_t \qquad v_t \sim N(0, \tau^2) \qquad t=1,\ldots,n

with \theta_0 \sim N(\mu, \tau^2).

The th.mcmc object consists of 1000 rows corresponding to MCMC simulations and 945 columns corresponding to each t. A fan chart of the evolution of the distribution of \theta_t can be visualised using the fanplot package via:

# empty plot
plot(NULL, main="Percentiles", xlim = c(1, 965), ylim = c(-2.5, 1.5))

# add fan
fan(data = th.mcmc)

The fan function calculates the values of 100 equally spaced percentiles of each future distribution when the default data.type = "simulations" is set. This allows 50 fans to be plotted from the heat.colors colour palette, providing a fine level of shading. In addition, lines and labels are provided along each decile.

Prediction Intervals

When the argument type = "interval" is set, the probs argument corresponds to prediction intervals. Consequently, the fan chart comprises three different shades, running from the darkest for the 50% prediction interval to the lightest for the 95% prediction interval.

# empty plot
plot(NULL, main="Prediction Intervals",
     xlim = c(-20, 965), ylim = c(-2.5, 1.5))

# add fan
fan(data = th.mcmc, type = "interval", llab=TRUE, rcex=0.6)

Contour lines are overlaid for the upper and lower bounds of each prediction interval, as set using the ln argument. A further line is plotted along the median of \theta_t, which is controlled by the med.ln argument (set to TRUE by default when type = "interval"). The default labels on the right-hand side correspond to the upper and lower bounds of each plotted line. The left labels are added by setting llab = TRUE. Note, some extra room is created for the labels by setting the xlim = c(-20, 965) argument of the plotting area to a wider range than the original data (945 observations). The text size of the right labels is controlled using the rcex argument. The left labels, by default, take the same text size, although they can be controlled separately using the lcex argument.

Alternative Colours

Alternative colour schemes to the default heat.colors, can be obtained by supplying a colorRampPalette to the fan.col argument. For example, a new palette running from blue to white, via grey can be passed using;

# empty plot
plot(NULL, main="Alternative Colour Scheme",
     xlim = c(-20, 965), ylim = c(-2.5, 1.5))

# add fan
fan(data = th.mcmc, rlab=seq(20,80,15), llab=c(10,50,90),
    fan.col=colorRampPalette(c("royalblue", "grey", "white")))

Alternative labels are specified using the rlab and llab arguments.

Spaghetti Plots

Spaghetti plots can be used to represent uncertainty shown by a range of possible future trajectories or past estimates. For example using the th.mcmc object, 20 random sets of \theta_t can be plotted when setting the argument style = "spaghetti";

# empty plot
plot(NULL, main="Spaghetti Plot", xlim = c(-20, 965), ylim = range(th.mcmc))

# transparent fan with visible lines
fan(th.mcmc, ln=c(5, 50, 95), llab=TRUE, alpha=0, ln.col="orange")

# spaghetti lines
fan(th.mcmc, style="spaghetti", n.spag=20)

The spaghetti lines are superimposed on a fan chart in order to illustrate some underlying probabilities. The initial fan chart is completely transparent from setting the transparency argument alpha to 0. In order for the percentile lines to be visible a non-transparent colour is used for the ln.col argument.

Forecast Fans

The fanplot package can also be used to illustrate probabilistic forecasts. For example, using the auto.arima function in the forecast package, a model for the time series of net migration to the United Kingdom (contained in the ips data frame of the fanplot package) can be fitted.

#load packages
library("fanplot")
library("forecast")

#create time series
net <- ts(ips$net, start=1975)
#fit model
m <- auto.arima(net)
m
Series: net
ARIMA(1,1,2) with drift         

          ar1      ma1      ma2   drift
      -0.2301  -0.0851  -0.6734  6.7625
s.e.   0.3715   0.3620   0.1924  1.4154

sigma^2 estimated as 1231:  log likelihood=-179.3
AIC=368.6   AICc=370.54   BIC=376.66

We may then simulate 1000 future sample paths (each five years ahead) from the selected model using the simulate.Arima function, and plot the results.

mm <- matrix(NA, nrow=1000, ncol=5)
for(i in 1:1000)
  mm[i,] <- simulate(m, nsim=5)

# empty plot
plot(net, main="UK Net Migration", xlim=c(1975,2020), ylim=c(-100,300))

# add fan
fan(mm, start=2013)

Users might want to connect the fan with the past data. This can be achieved by providing the last value to the anchor argument.

# empty plot
plot(net, main="UK Net Migration",
     xlim=c(1975,2020), ylim=c(-100,300))

# add fan
fan(mm, start=2013, anchor=net[time(net)==2012],
    type="interval", probs=seq(5, 95, 5), ln=c(50, 80))

More shades for the fan are added to the plot (over the default three used for interval fans) by supplying a sequence to the probs argument. Alternative contour lines (in place of the default median, 50th, 80th and 95th percentiles for interval fans) are added using the ln argument.