The tsbugs package for R

I have/will update this post as I expanded the tsbugs package.

My tsbugs package has gone up on CRAN. I decided not to write a vignette for the submission, as it would have involved doing some estimation in BUGS via R2WinBUGS or R2OpenBUGS and running into some problems when submitted the package. Instead I thought I would post a quick guide below. Note, the package code is on my github if you would like to contribute more models to the package…

The functions in the tsbugs package are aimed to automate the writing of time series models to run in WinBUGS or OpenBUGS. I created these functions a while back when I was doing some work on model averaging for time series models. I found it a lot easier to build R functions to write the BUGS models than the more error-inducing process of copy and pasting BUGS scripts, and then making slight alterations to create new models. It also allowed me to add arguments to specify different lag lengths, prior distributions, variance assumptions and data lengths. Below are examples for three types of time series models; autorgressive models with constant variance, stochastic volatility and random variance shift models.

Autoregressive Models

The ar.bugs command builds a BUGS script for autoregressive (AR) models ready to use in R2OpenBUGS. For example, consider the LakeHuron data.

LH <- LakeHuron 
par(mfrow=c(2,1)) 
plot(LH, main="Level (ft)") 
plot(diff(LH), main="Differenced Level")

tsbugs1-cvdata
We can construct a AR(1) model for this data (after differencing the data to obtain a stationary mean) as such:

library("tsbugs")
ar1 <- ar.bugs(y=diff(LH), ar.order=1)
print(ar1)

The ar.bugs function allows for alternative specifications for prior distributions, forecasts and the inclusion of mean term:

ar2 <- ar.bugs(y=diff(LH), ar.order=2, ar.prior="dunif(-1,1)", 
               var.prior="dgamma(0.001,0.001)", k = 10, 
               mean.centre = TRUE)
print(ar2)

The tsbugs objects can be used with R2OpenBUGS to easily run models from R. This is made even easier using the inits and nodes functions (also in the tsbugs package). For example:

writeLines(ar2$bug, "ar2.txt") 
library("R2OpenBUGS")
ar2.bug <- bugs(data = ar2$data,
                inits = list(inits(ar2)),
                param = c(nodes(ar2, "prior")$name, "y.new"),
                model = "ar2.txt", 
                n.iter = 11000, n.burnin = 1000, n.chains = 1)

Note, 1) the model is written to a .txt file (as required by R2OpenBUGS), 2) the data used is part of the tsbugs object. The ar.bugs command cleans the data and adds missing values at the end of the series for foretasted values, 3) the initial values offered by the inits function are very crude, and with more complicated data or models, users might be better off specifying there own list of initial values. The parameter traces and posterior distributions can be plotted using the coda package:

library("coda")
param.mcmc <- as.mcmc(ar2.bug$sims.matrix[,nodes(ar2, "prior")$name])
plot(param.mcmc[,1:4])

tsbugs1-cvparam
The fanplot package can be used to plot the entire series of posterior predictive distributions. We may also plot (after deriving using the diffinv function) the posterior predictive distributions of the lake level:

# derive future level
ynew.mcmc <- ar2.bug$sims.list$y.new
lhnew.mcmc <- apply(ynew.mcmc, 1, diffinv, xi = tail(LH,1))
lhnew.mcmc <- t(lhnew.mcmc[-1,])

# plot differenced
par(mfrow=c(2,1)) 
plot(diff(LH) ,xlim=k0+c(-50,10), main="Differenced Level")

# add fan
library("fanplot")
k0 <- end(LH)[1]
fan(ynew.mcmc, start=k0+1, rcex=0.5)

# plot undifferenced
plot(LH, xlim=k0+c(-50,10), main="Level")
fan(lhnew.mcmc, start=k0+1, rcex=0.5)

tsbugs1-cvforc

Stochastic Volatility Models

The sv.bugs command builds a BUGS script for stochastic volatility SV models ready to use in R2OpenBUGS. For example, consider the svpdx data.

# plot
plot(svpdx$pdx, type = "l", 
     main = "Return of Pound-Dollar exchange rate data 
             from 2nd October 1981 to 28th June 1985", 
     cex.main = 0.8)

tsbugs1-svdata
We can construct a AR(0)-SV model for this data, and also obtain posterior simulations using the sv.bugs command:

y <- svpdx$pdx
sv0 <- sv.bugs(y, sim=TRUE)
print(sv0)

This model closely matches those presented in
Meyer and Yu (2002). There are further options in the tsbugs package to incorporate different priors that do not involve transformations such as those for psi1 above. Using R2OpenBUGS we can fit the model,

# decent initial value for variance in first period
init <- inits(sv0, warn=FALSE)
init$psi0 <- log(var(y))
# write bug
writeLines(sv0$bug, "sv0.txt")
# might take a while to compile
sv0.bug <- bugs(data = sv0$data,
                inits = list(init),
                param = c(nodes(sv0, "prior")$name,"y.sim","h"),
                model = "sv0.txt", 
                n.iter = 11000, n.burnin = 1000, n.chains = 1)

The volatility and estimates can be easily extracted,

h.mcmc <- sv0.bug$sims.list$h

Which allows us to directly view the estimated volatility process or the time-dependent standard deviation using the fanplot package,

# plot
plot(NULL, xlim = c(1, 945)+c(0,40), ylim = c(-4,2), 
     main="Estimated Volatility from SV Model") 
# fan
fan(h.mcmc, type = "interval")

tsbugs1-svh
We can also plot the posterior simulations from the model:

# derive percentiles
y.mcmc <- sv0.bug$sims.list$y.sim
# plot
plot(NULL, type = "l", xlim = c(1, 945)+c(0,20), 
     ylim = range(y), 
     main = "Posterior Model Simulations and Data")
fan(y.mcmc)
lines(y)

tsbugs1-svysim

Random Variance Shift Models

The rv.bugs command builds a BUGS script for random variance (RV) shift models, similar to that of McCulloch and Tsay (1993) ready to use in R2OpenBUGS. Consider the ew data.

r <- ts(ew[2:167]/ew[1:166]-1, start=1841)
y <- diff(r) 
plot(y, main="Difference in England and Wales 
              Population Growth Rate")

tsbugs1-rvdata
We can create a BUGS script to fit a RV model to this data, including posterior simulations, using the rv.bugs command:

rv0<-rv.bugs(y, sim=TRUE)
print(rv0)

and then run the script in R2OpenBUGS (this can take a couple of hours):

# decent inital value for variance in first period
init <- inits(rv0, warn=FALSE)
init$isig02<-sd(y)^-2
# write bug
writeLines(rv0$bug,"rv0.txt")
# might take a while to compile
rv0.bug <- bugs(data = rv0$data,
                inits = list(init),
                param = c(nodes(rv0, "prior")$name,"y.sim",
                          "h","delta","beta"),
                model = "rv0.txt", 
                n.iter = 11000, n.burnin = 1000, n.chains = 1)

We can plot the posterior simulations from the model using the fanplot package:

# derive percentiles
y0 <- tsp(y)[1]
y.mcmc <- rv0.bug$sims.list$y.sim

# plot
plot(NULL, xlim=tsp(y)[1:2]+c(-5,5), ylim = range(y), 
     main="Posterior Simulations")
fan(y.mcmc, start = y0, rlab=c(10,50,90), llab=TRUE)
lines(y)

tsbugs1-rvysim
Alongside the posterior distributions of the standard deviations,

# derive sigma
h.mcmc <- rv0.bug$sims.list$h
sigma.mcmc <- sqrt(exp(h.mcmc))
# plots
plot(NULL, xlim =tsp(y)[1:2]+c(-5,5), ylim = c(0,0.008), 
     main="Standard Deviation")
fan(sigma.mcmc, start = y0, rlab=c(5,50,95), llab = c(5,50,95))

tsbugs1-rvhsigma
The posterior distributions of the probability of a variance shift and multiplier effect of the shift in variance (delta[t] and beta[t] in the BUGS model) can also be plotted. Note, when there is no variance shift, the posterior of the beta[t] is similar to the prior distribution.

#extract data
delta.mcmc <- rv0.bug$sims.list$delta
beta.mcmc <- rv0.bug$sims.list$beta

# plots
par(mfrow=c(2,1))
plot(NULL, xlim = tsp(y)[1:2]+c(-5,5), ylim = c(0,1),
     main="Probability of Variance Change Point")
fan(delta.mcmc, start=y0, ln = NULL, rlab = NULL)

plot(NULL, xlim = tsp(y)[1:2]+c(-5,5), ylim = c(-2,2), 
     main="Variance Multiplier")
fan(beta.mcmc, start=y0)

tsbugs1-rvbeta

Advertisements

The fanplot package for R

I have/will update this post as I expanded the fanplot package.

My fanplot package has gone up on CRAN. Below is a online version of the package vignette…

Visualising Time Series Model Results

The fanplot package can also be used to display uncertainty in estimates from time series models. To illustrate, the packages th.mcmc data frame object contains posterior density distributions of the estimated volatility of daily returns y_t from the Pound/Dollar exchange rate from 02/10/1981 to 28/6/1985. These distributions are from a MCMC simulation from a stochastic volatility model given in Meyer and Yu (2002) where it assumed;

y_t | \theta_t = \exp\left(\frac{1}{2}\theta_t\right)u_t \qquad u_t \sim N(0, 1) \qquad t=1,\ldots,n.

The latent volatilities \theta_t, which are unknown states in a state-space model terminology, are assumed to follow a Markovian transition over time given by the state equations:

\theta_t | \theta_{t-1}, \mu, \phi, \tau^2 = \mu + \phi \log \sigma^2_{t-1} + v_t \qquad v_t \sim N(0, \tau^2) \qquad t=1,\ldots,n

with \theta_0 \sim N(\mu, \tau^2).

The th.mcmc object consists of (1000) rows corresponding to MCMC simulations and (945) columns corresponding to each t. A fan chart of the evolution of the distribution of \theta_t can be visualised using the fanplot package via,

library("fanplot")
# empty plot
plot(NULL, main="Percentiles", xlim = c(1, 965), ylim = c(-2.5, 1.5))

# add fan
fan(data = th.mcmc)

fanplot-sv1
The fan function calculates the values of 100 equally spaced percentiles of each future distribution when the default data.type = "simulations" is set. This allows 50 fans to be plotted from the heat.colours colour palette, providing a fine level of shading. In addition, lines and labels are provided along each decile.

Prediction Intervals

When argument type = "interval" is set, the probs argument corresponds to prediction intervals. Consequently, the fan chart comprises of 3 different shades, running from the darkest shade for the 50th prediction interval to the lightest for the 95th prediction interval.

# empty plot
plot(NULL, main="Prediction Intervals",
     xlim = c(-20, 965), ylim = c(-2.5, 1.5))

# add fan
fan(data = th.mcmc, type = "interval", llab=TRUE, rcex=0.6)

fanplot-sv2
Contour lines are overlayed for the upper and lower bounds of each prediction intervals, as set using the ln command. A further line is plotted along the median of \theta_t, which is controlled by the med.ln argument (set to TRUE by default when type="interval"). The default labels on the right hand side correspond to the upper and lower bounds of each plotted line. The left labels are added by setting llab = TRUE. Note, some extra room is created for the labels by setting the xlim = c(-20, 965) argument of plotting area to a wider range than the original data (945 observations). The text size of the right labels are controlled using the rcex argument. The left labels, by default, take the same text size as rcex although they can be separately controlled using the lcex argument.

Alternative Colours

Alternative colour schemes to the default heat.colors, can be obtained by supplying a colorRampPalette to the fan.col argument. For example, a new palette running from blue to white, via grey can be passed using;

# empty plot
plot(NULL, main="Alternative Colour Scheme",
xlim = c(-20, 965), ylim = c(-2.5, 1.5))

# add fan
fan(data = th.mcmc, rlab=seq(20,80,15), llab=c(10,50,90),
    fan.col=colorRampPalette(c("royalblue", "grey", "white")))

fanplot-sv3
Alternative labels are specified using the rlab and llab arguments.

Spaghetti Plots

Spaghetti plots can be used to represent uncertainty shown by a range of possible future trajectories or past estimates. For example using the th.mcmc object, 20 random sets of \theta_t can be plotted when setting the argument style = "spaghetti";

# empty plot
plot(NULL, main="Spaghetti Plot", xlim = c(-20, 965), ylim = range(th.mcmc))

# transparent fan with visible lines
fan(th.mcmc, ln=c(5, 50, 95), llab=TRUE, alpha=0, ln.col="orange")

# spaghetti lines
fan(th.mcmc, style="spaghetti", n.spag=20)

fanplot-sv4
The spaghetti lines are superimposed on a fan chart in order to illustrate some underlying probabilities. The initial fan chart is completely transparent from setting the transparency argument alpha to 0. In order for the percentile lines to be visible a non-transparent colour is used for the ln.col argument.

Forecast Fans

The fanplot package can also be used to illustrate probabilistic forecasts. For example, using the auto.arima function in the forecast package a model for the time series for net migration to the United Kingdom (contained in the ips data frame of the fanplot package) can be fitted.

#create time series
net <- ts(ips$net, start=1975)
#fit model
library("forecast")
m <- auto.arima(net)
m
Series: net
ARIMA(1,1,2) with drift         

Coefficients:
          ar1      ma1      ma2   drift
      -0.2301  -0.0851  -0.6734  6.7625
s.e.   0.3715   0.3620   0.1924  1.4154

sigma^2 estimated as 1231:  log likelihood=-179.3
AIC=368.6   AICc=370.54   BIC=376.66

We may then simulate 1000 values from the selected model using the simulate.Arima function, and plot the results.

mm <- matrix(NA, nrow=1000, ncol=5)
for(i in 1:1000)
  mm[i,] &amp;lt;- simulate(m, nsim=5)

# empty plot
plot(net, main="UK Net Migration", xlim=c(1975,2020), ylim=c(-100,300))

# add fan
fan(mm, start=2013)

fanplot-ips1
Users might want to connect the fan with the past data. This can be achieved by providing the last value to the anchor argument.

# empty plot
plot(net, main="UK Net Migration",
     xlim=c(1975,2020), ylim=c(-100,300))

# add fan
fan(mm, start=2013, anchor=net[time(net)==2012],
    type="interval", probs=seq(5, 95, 5), ln=c(50, 80))

fanplot-ips2
More shades for the fan are added to the plot (over the default 3 used for a interval fans) by supplying a sequence to the probs argument. Alternative contour lines (from the default median, 50th, 80th and 95th percentiles for interval fans) are added using the ln argument.

Sequential Line Plots in R

I was trying to create some sequential plots today in R to analyse some MCMC simulations. I found the par(ask=TRUE) command very useful for looking at iterations of individual parameter values. Setting the ask graphical parameter to TRUE (before a for loop) allows you to update plots by clicking on the plotting device (in windows). Here is some example code to show how par(ask=TRUE) works.

xx <- rnorm(1000)
xx <- matrix(xx,10,100)
par(ask = TRUE)
plot(xx[1,], type="n", ylim=range(xx), ylab="")
for(i in 1:10){
   plot(xx[i,], ylim=range(xx), ylab="", type="l")
}

This will give the following output after each click:
ask1

If you want to keep previous plotted lines on the plot you can adapt the for loop in such a manner:

par(ask=TRUE)
plot(xx[1,], type="n", ylim=range(xx), ylab="")
for(i in 1:10){
   plot(xx[1,], ylim=range(xx), ylab="", type="l")
   for(j in 1:i){
      lines(xx[j,])
   }
}

This will give the following output after each click:
ask2

Combining census and registration data to estimate detailed elderly migration flows in England and Wales

During my MS.c. I worked on methods for combining internal migration data in England and Wales. Migration data is often represented in square tables of origin-destination flows. These are of particular interest to analysing migration patterns when they are disaggregated by age, sex and some other variable such as illness, ethnicity or economic status. In England and Wales the data within these detailed flow table are typically missing in non-census years. However, row and column (origin and destination) totals are regularly provided from the NHS patient registers (see the first two columns of the hypothetical data situation below). I worked on a method to estimate the detailed missing flow data to sum to the provided totals in non-census years (see the third column of the hypothetical data situation below). This method is particularly useful for estimating migration flow tables disaggregated by detailed characteristics of migrants (such as illness, ethnicity or economic status) that are only provided by the ONS for census years.

Hypothetical Example of Data Set Situation (where migrant origins are labelled on the vertical axis and destinations on the horizontal axis).

Auxiliary Data (e.g. 2001 Census) Primary Data (e.g. 2004 NHSCR Data) Detailed Estimates for 2004 Based on Methodology
Without Limiting Long Term Illness Without Limiting Long Term Illness
N M S N M S
N 80 20 50 150 N 88 56 40 183
M 50 100 50 200 Illness details unavailable M 29 145 21 195
S 10 30 110 150 N M S S 7 52 54 113
140 150 210 500 N 260 124 252 115 491
With Limiting Long Term Illness M 320 With Limiting Long Term Illness
N M S S 170 N M S
N 30 10 20 60 200 370 180 750 N 33 28 16 77
M 40 50 70 160 M 23 73 29 125
S 30 10 40 80 S 20 17 20 57
100 70 130 300 76 118 65 259

The estimated values maintain some properties (various cross product ratios) of the Census data whilst updating marginal totals to more current data. For more details see my MS.c. dissertation (which I have put online here). I also presented the method and some further results at POPFSET 2007, see here for more details. This contains the R/S-Plus code to conduct the estimation in the Appendix. There is also a published paper based on my MS.c. that uses a slightly modified R code.

Publication Details:

Raymer J., Abel G.J. and Smith P.W.F. (2007). Combining census and registration data to estimate detailed elderly migration flows in England and Wales. Journal of the Royal Statistical Society Series A (Statistics in Society) 170 (4) 891–908.

A log-linear model is developed to estimate detailed elderly migration flows by combining data from the 2001 UK census and National Health Services patient register. After showing that the census and National Health Service migration flows can be reasonably combined, elderly migration flows between groupings of local authority districts by age, sex and health status for the 2000–2001 and 2003–2004 periods are estimated and then analysed to show how the patterns have changed. By combining registration data with census data, we can provide recent estimates of detailed elderly migration flows, which can be used for improvements in social planning or policy.

Chatfield’s Plots in S-Plus

I have recently finished reading the sixth edition of The Analysis of Time Series: An Introduction by Chatfield in our Statistics reading group. Whilst enjoying most of the book I got a little confused when looking at Appendix D: Some MINITAB and S-PLUS Commands. In the S-Plus section the author gives the code below to replicate his Figure 1.2.

postscript(file="1.2.ps")
recife<-scan("Recife")
RecfS<-cts(recife, start=dates("01/01/1953"),units="months",frequency="12")
fst_as.numeric(date("01/01/1953"),format="dd/mm/yyyy")
mid1_as.numeric(dates("01/01/1954"),format="dd/mm/yyyy")
mid2_as.numeric(dates("01/01/1955"),format="dd/mm/yyyy")
mid3_as.numeric(dates("01/01/1956"),format="dd/mm/yyyy")
mid4_as.numeric(dates("01/01/1957"),format="dd/mm/yyyy")
mid5_as.numeric(dates("01/01/1958"),format="dd/mm/yyyy")
mid6_as.numeric(dates("01/01/1959"),format="dd/mm/yyyy")
mid7_as.numeric(dates("01/01/1960"),format="dd/mm/yyyy")
mid8_as.numeric(dates("01/01/1961"),format="dd/mm/yyyy")
lst_as.numeric(dates("01/01/1962"),format="dd/mm/yyyy")
ts.plot(RecfS,xaxt="n",ylab="Temperature (deg C)", xlab="Year", type="l")
axis(side=1, at = c(fst,mid1,mid2,mid3,mid4,mid5,mid6,mid7,mid8,lst),
labels=c("Jan 53", "Jan 54", "Jan 55", "Jan 56", "Jan 57", "Jan 58",
"Jan 59", "Jan 60", "Jan 61", "Jan 62"),
ticks=T)
dev.off()

I was not too sure what was going on with the code, which gave a tonne of error messages, some of which might well have been typo’s by the publisher (_ instead of <-)? In addition, the author stresses the effort required to construct nice plots in S-Plus. This got me thinking that 1) his code was excessive (not just because it does not work) and 2) he could have saved a lot of his effort by using R. The R code below proves my point

> png("...:/1.2.png", width=650, height=460, units="px")
> recife <- scan("http://people.bath.ac.uk/mascc/Recife.TS", sep="", skip=3)
> n <- length(recife)
> times <- seq(as.Date("1953/1/1"), by="month", length.out=n)
> plot(recife, xaxt="n", type= "l",ylab="Temperature (deg C)", xlab="Year")
> axis(1, at = seq(1,n,12), labels = format(times, "%b %Y")[seq(1,n,12)])
> dev.off()

Much Tidier! Here is the plot..
1.2
If you only wanted labels every other January, as in p.2 of the book (but not in the S-Plus code), then you can use..

> axis(1, at = seq(1,n,24), 
       labels = format(times, "%b %y")[seq(1,n,24)])