BISMG:TamsinE

Tamsin's page
This page now replaced by Collaboration wiki page

5th Mar 2010
Attended JCRP meeting (3rd Mar, Exeter).

Presented a paper for the online journal club I go to (started by Hadley Centre colleagues in order to collaborate with external scientists):

Annan and Hargreaves (2010), Reliability of the CMIP3 ensemble, GRL 37, L02703.

I thought this would be useful because it attempts to answer important questions about how to interpret multi-model ensemble predictions.

Summary of Annan and Hargreaves (2010)
The CMIP3 (i.e. IPCC 2007 climate model) ensemble is an 'ensemble of opportunity' rather than a designed experiment. So how should we interpret it?


 * as a random sample from a distribution centred on the truth?
 * i.e. the truth lies at the mean of the distribution
 * no: models thought to be strongly correlated (e.g. Knutti et al., 2010, 'Challenges in combining projections from multiple climate models')
 * as an 'exchangeable' ensemble?
 * i.e. the truth is sampled from the distribution, and the mean is only the expectation of the truth

In their definition of 'exchangeable', the truth is drawn from the same distribution as the ensemble and no statistical test can distinguish one from the other. There are other meanings to 'exchangeable'. Goldstein, House and Rougier (2008, draft), 'Assessing Model Discrepancy Using a Multi-Model Ensemble', use exchangeable in the sense of 'believing that each model is equally plausible'. They use the assumption of second-order exchangeability for a multi-model ensemble - that the mean, variance and covariance structure are invariant under permutations of the models.

They approach this question using the rank histogram, which is used in validating ensemble forecasts of weather. In weather forecasting, e.g. at ECMWF, it is common to interpret the frequencies of an ensemble forecast (of about 50 members) as a probabilistic prediction (a frequentist approach to probability, as opposed to Bayesian). A single probabilistic forecast cannot be validated with observations: repeated forecasts are needed in order to quantify whether the forecast and observed frequencies are similar.

Repeated ensemble forecasts can be compared with observations using a rank histogram. For each forecast: Then make a histogram of all the ranks.
 * put the N ensemble members and the single observation in order (e.g. from low to high temperature)
 * record the rank of the observations within the ensemble, {1, ... N+1}

Interpretation of shape:
 * If the ensemble and observations are sampled from the same distribution, the histogram will be flat ('uniform') because the observations are equally likely to be found at any given rank.
 * If the ensemble spread is too low, the histogram is 'U-shaped' because the observations often lie at one end of, or outside, the ensemble range.
 * If the ensemble spread is too low, the histogram is 'dome-shaped', because the observations often lie at the centre of the ensemble range.
 * If the ensemble is biased, the histogram bins follow a linear trend, because the observations often lie near one end of the ensemble range.

James and Julia analyse (a) idealised experiments, in which the ensemble members and truth are drawn from normal distributions with varying widths, and (b) CMIP3 data, in which the ensemble members are 24 different climate models and the truth is the observed mid-twentieth century climatology.

They use the idealised experiments to repeat Knutti et al. (2010)'s analysis of correlation in models and its affect on RMSE of the ensemble mean, and show rank histograms for these experiments. They then go on to the CMIP data.

In weather forecasting, a rank histogram is filled from repeated forecasts. With the CMIP data, James and Julia obtain repetition in space rather than time, by dividing the globe into 40 areas. This number of areas is chosen to minimise the spatial correlation - individual grid cells would be highly correlated. The time dimension is averaged out. (For the idealised experiments, the dimensions are arbitrary).

To evaluate whether the rank histograms are uniform, they use a 'decomposed' chi^2 test for bias and spread. They find that the ensemble:

(a) temperatures are biased low (though by a small amount relative to the ensemble spread) and the range is too large w.r.t observations (b) precipitation is about right (c) sea level pressure range is too large w.r.t. observations

They conclude that the received wisdom - that multi-model ensembles are likely to under-represent the range of possible future climate change - is wrong, and that the CMIP ensemble has the opposite properties.

They also repeat the rank histogram using each model as 'truth' in turn, to test the exchangeability of the models with each other. The chi^2 tests for uniformity fail in about 30% of cases at the p < 5% level, but if they reduce the number of degrees of freedom from 40 to 5 then the failure rate falls to about 5% and they conclude that the models are therefore exchangeable with each other as well as with the truth.

Some caveats are added about the models being tuned to the observations, and how palaeoclimate comparisons would be useful. They also state that their definition of exchangeability does not necessarily mean that all the models should be viewed as equally good (as in the Goldstein et al. paper).

Problems with this paper
The main questions and problems I have with this paper are:


 * This paper tests for uniformity of the rank histograms. It then infers from uniformity that the models are exchangeable - but this inference is not possible! Exchangeability implies uniformity, but not the reverse. Even the paper they cite on using decomposed chi^2 for rank histograms states this. ("a flat histogram does not necessarily indicate reliability of the ensemble forecasts"). In fact, that paper also states that a linear rank histogram need not imply bias, and a U-shaped histogram need to imply under-dispersion.
 * How valid is it to use the rank histogram on these 'space-casts' rather than forecasts? This kind of space-for-time substitution arises from the ergodic assumption. It's often used in climate modelling (e.g. in palaeoclimate reconstruction, or using model internal variability to estimate spatial covariance), but probably isn't valid. Worth commenting on, anyway.
 * Having said that there should be at least 40 degrees of freedom due to the correlation length, why suddenly drop to 5 d.o.f. at the end of the paper? This choice completely changes the failure rate of uniformity, and the conclusion that the models are exchangeable with each other appears to rely on it.
 * There is an assumption of independence of observations for the rank histograms. Even if spatial correlations are greatly reduced or even removed by the spatial-averaging, is this assumption still valid (e.g. if models have common structure)?

Other questions
 * Why are the idealised ensemble members correlated, if they are randomly drawn from a 40D normal distribution? (have I missed something?)
 * Why would precipitation be the closest to truth out of the three variables - surely this is surprising and worth commenting on?

Jonty thinks the whole idea is flawed. He categorises it as 'multiple testing' - doing many tests and then working out what it all means afterwards. In other fields such as biostatistics they are careful to control the rate of false rejections. I asked how you could construct a rank histogram without this multiple testing, and he said

"You don't need repetitions to do a hypothesis test. If there were 100000 model evaluations, and the observed value was larger than any of them, then you'd be fairly sure that the observed value was not exchangeable with them.  I agree that it is not so conclusive with n = 24 models, but the principle is the same. [...] there is a very elegant theory which tells us how to choose optimal test statistics, eg those that have the largest power for a given level of significance (even the most powerful test might still not be very powerful, of course).  There is also a Bayesian approach that takes into account prior probabilities on the hypotheses, and the cost of making an error."

"It [chi^2] is also rather a contentious test, since the results often depend on the way in which the observations are classed into intervals, and there is also some ambiguity, even without J&J's ad hoc modification, regarding the correct number of degrees of freedom.

The bottom line
It's an interesting back-of-the-envelope exercise - an easy way to demonstrate the broad properties of the ensemble. But the problem of the choice of the number of d.o.f., and how to treat spatial correlations, mean that it is not a good way to test a hypothesis about ensemble reliability.

29th Jan 2010
Had a good chat with Jonty about ensemble Kalman Filtering as a method for parameter estimation.

Particle filtering
Particle filtering (also known as sequential Monte Carlo) and ensemble Kalman Filtering are Bayesian techniques that can be used for data assimilation, i.e. merging information from a sequence of model simulations and a sequence of observations as they both progress through time.

In filtering methods, a large number of random samples ('particles') of the 'state vector' are generated at a given timestep, where the state vector comprises simulations of the physical state (e.g. weather, climate, ice sheet) and model parameter values (if using filtering to learn about these too). This large sample gives an estimate of the probability distributions of each component of the state vector (e.g. temperature, ..., parameter 1, ...). The distributions are updated using the observations for that timestep: in Bayesian terms, the original ('prior)' distribution is reweighted using the distribution of model agreement with observations ('likelihood') to give a new ('posterior') distribution which is therefore 'conditional' on the observations.

This updating of the estimated model distributions occurs at each timestep using all previous observations. In other words, the distributions at time t_1 are conditional on observations y_1, but the distributions at t_N are conditional on all the observations y_1, y_2,...y_N.

The physical state is time-evolving: the simulated prior distributions vary for each timestep according to the dynamical equations of the physical model.

However, the model parameters, at least in most weather, climate and ice sheet models, are static: the distributions at each timestep only vary in time through the updating with observations (typically narrowing very quickly). At the end of the filtering, the mean values can be used as fixed values in the model.

Ensemble Kalman Filtering is similar to normal particle filtering. The particles are ensemble members, and the probability distributions are assumed to be Gaussian.

Application to ice sheet modelling

 * The parameters of the ice sheet model are not time-varying, so if using ensemble Kalman Filtering:
 * the number of candidate values for the parameters decreases very quickly
 * the parameters have no width (i.e. single-valued) - but width is needed for comparison with observations
 * to overcome these problems requires a very large number of particles, ideally infinite
 * Better instead to use Particle Markov Chain Monte Carlo (PMCMC):
 * a new method by Christophe Andrieu (Bristol maths dept)
 * calculates the whole path at once, rather than moving forwards or backwards through it
 * resamples particles according to comparison with observations - i.e. stochastic
 * has been used with a soil moisture model (Cari Kaufman)
 * Jonty is writing generic code for this (with Michel Crucifix) over the next few months

But if the calibration data is a 2D timeslice rather than a timeseries, then everything is much easier - we don't need these sequential methods.

22nd Jan 2010
Finished setting up computing: installing Matlab, Office, Parallels, Windows virtual machine, printers, network drives, automatic back up / syncing. Attended BISMG meeting 22/1.

15th Jan 2010
Attended JCRP meeting 15/1, and started working my way through Greve and Blatter book, "Dynamics of Ice Sheets and Glaciers".

8th Jan 2010
Spent some time downloading software onto the new laptop, organising all my pdfs of articles, organising all my paper copies of articles, unpacking more stuff in the office. Currently more organised than I have ever been, due in part to two excellent pieces of software: 'Papers' for, er, papers, and 'Evernote' as an electronic logbook (I use a hardback logbook too).

Started reading ensemble Kalman filter papers, beginning with the method section of this paper:

Hargreaves and Annan (2002), Assimilation of paleo-data in a simple Earth system model, Climate Dynamics 19: 371-381.

They use Monte Carlo Markov Chain (essentially correlated random walk) to estimate pdfs of parameters for Saltzmann and Marsch model of glacial-interglacial cycles. Found the notation confusing at first - they appear to use the index 'i' to mean two different things (number of observations and current set of parameter values) and similarly with prime (model output rather than observations, and trial set of parameter values).

Will work my way through other Annan and Hargreaves papers from Tony in date order. Jonty has doubts about the use of a Kalman Filter if the parameters are not stochastically varying through time (as in his work with Michel) - need to ask him again about this to make sure I understand.

November - December 2009
I'm officially working for 1 day a week until the end of January, then 2 days a week in Febuary, then full-time from March. Due to a slight overload I'm nudging some of my November days into December and January.

I intend to spend the last few days before Christmas catching up on some ice2sea-related work or study. This will bring me up to about 6 days total, and I'll carry 2 days over into Jan/Feb.

I've been to these meetings and courses:


 * 20/11/09 ice sheet course
 * 13/11/09 JCRP @ UKMO
 * 11/11/09 BISMG
 * 27-28/10/09 ice2sea WP5 @ BAS
 * 22/9/09 Tony and Jonty