This post is intended to introduce an unfamiliar reader to some basic techniques in Bayesian modeling of autoregressive time series. We’ll cover the basics of autoregressive models, use the Matrix Normal Inverse Wishart (MNIW) as a conjugate prior for efficient inference, and give some examples of using this model for a point moving in a circle.

## References

For more information about autoregressive processes or the MNIW distribution, here are a few key references:

Emily Fox, PhD Thesis, http://www.stat.washington.edu/~ebfox/publications/ebfox_thesis.pdf

- See especially Appendix F for derivations proving conjugacy for the MNIW

Wang and West, “Bayesian Analysis of Matrix Normal Graphical Models”, http://ftp.stat.duke.edu/WorkingPapers/08-01.pdf

- See especially section 2 for the MN distribution

Quintana and West, “An analysis of international exchange rates using multivariate DLM’s”, http://www.stat.duke.edu/~mw/MWextrapubs/Quintana1987.pdf

- This is one of the first papers to combine the autoregressive likelihood with the MNIW prior (and the only one I can find online for free)

## Autoregressive Time Series

We seek a generative model for a time series of observed sensor data, where each measurement $y_t$ is a $D$-dimensional column vector. In many applications (tracking, motion capture, financial modeling, etc.), it is reasonable to assume that the data at time $t$ is Gaussian distributed, with its mean determined by a linear perturbation of the recent past plus some additional noise. For example, if we assume that the mean of each $y_t$ is fully determined by the previous observation $y_{t-1}$, we have a first-order autoregressive model, AR(1):

$$
y_t = A y_{t-1} + e_t, \qquad e_t \sim \mathcal{N}(0, \Sigma)
$$

Here, $A$ is a $D$-by-$D$ matrix of regression coefficients, and $\Sigma$ is a symmetric, positive definite covariance matrix. $A$ defines a linear function for the mean of $y_t$ given the previous value, and $\Sigma$ defines the Gaussian noise.

It is often useful to consider conditioning on more of the “recent past”, making $y_t$ a function of the previous $r$ values $y_{t-1}, \ldots, y_{t-r}$. This yields the order-$r$ model AR($r$),

$$
y_t = \sum_{\ell=1}^{r} A_\ell \, y_{t-\ell} + e_t, \qquad e_t \sim \mathcal{N}(0, \Sigma)
$$

In this simple expanded notation, each $A_\ell$ is a $D$-by-$D$ matrix.

We can rewrite this in more concise matrix notation as

$$
y_t = A \tilde{y}_{t-1} + e_t
$$

where $A = [A_1 \; A_2 \; \cdots \; A_r]$ is a $D$-by-$Dr$ matrix, and $\tilde{y}_{t-1}$ is a $Dr$-by-1 column vector which concatenates the $r$ previous observations:

$$
\tilde{y}_{t-1} = \begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-r} \end{bmatrix}
$$
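As a concrete sketch of this concatenated form, here is how one step of an AR($r$) process could be simulated in NumPy. The coefficient values, dimensions, and the `ar_step` helper below are all illustrative assumptions, not taken from the post:

```python
import numpy as np

def ar_step(A, history, Sigma, rng):
    """One step of an AR(r) process in the concatenated form
    y_t = A @ y_tilde_{t-1} + e_t, with e_t ~ N(0, Sigma).

    A       : (D, D*r) stacked coefficient matrix [A_1 ... A_r]
    history : list of the r most recent observations, newest first
    """
    y_tilde = np.concatenate(history)    # (D*r,) concatenated lag vector
    e = rng.multivariate_normal(np.zeros(A.shape[0]), Sigma)
    return A @ y_tilde + e

# Illustrative AR(2) in D = 2 (all parameter values hypothetical):
rng = np.random.default_rng(0)
D, r = 2, 2
A = np.hstack([0.6 * np.eye(D), 0.3 * np.eye(D)])    # (2, 4)
Sigma = 0.01 * np.eye(D)
history = [np.ones(D), np.zeros(D)]                  # [y_{t-1}, y_{t-2}]
y_t = ar_step(A, history, Sigma, rng)                # approx 0.6 * ones, plus noise
```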

## AR Parameter Learning

Given observed data $y_1, \ldots, y_T$, we are interested in learning some order-$r$ autoregressive model, so we need to infer $A$ and $\Sigma$. We take a Bayesian approach, placing priors on $A, \Sigma$ which allow tractable posterior inference.

The Matrix Normal – Inverse Wishart prior is a conjugate prior to the AR likelihood. This factorizes into separate priors for $\Sigma$ and for $A$ given $\Sigma$. First, we have an Inverse Wishart on $\Sigma$:

(1)
$$
\Sigma \sim \mathcal{IW}(\nu_0, S_0)
$$

Given the covariance $\Sigma$, we have a Matrix Normal prior on $A$:

(2)
$$
A \mid \Sigma \sim \mathcal{MN}(M, \Sigma, K)
$$

where $M$ is the $D$-by-$Dr$ prior mean and $K$ is a $Dr$-by-$Dr$ symmetric positive definite matrix.

To understand the role of the covariance parameters $\Sigma$ and $K$, it is useful to consider the following marginal distributions for the rows and columns of $A$. Writing $a_{i\cdot}$ for the $i$-th row of $A$ and $a_{\cdot j}$ for its $j$-th column (each treated as a column vector, with $m_{i\cdot}$, $m_{\cdot j}$ the corresponding pieces of $M$):

(3)
$$
a_{i\cdot} \sim \mathcal{N}\!\left(m_{i\cdot},\; \Sigma_{ii}\, K^{-1}\right), \qquad a_{\cdot j} \sim \mathcal{N}\!\left(m_{\cdot j},\; (K^{-1})_{jj}\, \Sigma\right)
$$

Thus, $K^{-1}$ determines (up to a multiplicative constant) the covariance of each row of $A$, while $\Sigma$ determines (up to a multiplicative constant) the covariance of each column.
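These row and column covariances can be sanity-checked numerically. Below is a small sketch that draws Matrix Normal samples via the standard Cholesky construction $A = L_\Sigma Z L_{K^{-1}}^T$; the parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])   # row-side covariance (illustrative)
K = np.diag([1.0, 4.0])                      # column-side precision (illustrative)

# Draw A ~ MN(0, Sigma, K) as A = L_S @ Z @ L_Kinv.T, where Sigma = L_S L_S^T
# and K^{-1} = L_Kinv L_Kinv^T; then cov(vec A) = kron(K^{-1}, Sigma).
L_S = np.linalg.cholesky(Sigma)
L_Kinv = np.linalg.cholesky(np.linalg.inv(K))
Z = rng.standard_normal((50_000, 2, 2))
samples = L_S @ Z @ L_Kinv.T                 # batched matrix products

# Empirical covariance of column j should approach (K^{-1})_{jj} * Sigma:
col0 = samples[:, :, 0]
emp_cov0 = col0.T @ col0 / len(col0)         # theory: 1.0  * Sigma
col1 = samples[:, :, 1]
emp_cov1 = col1.T @ col1 / len(col1)         # theory: 0.25 * Sigma
```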

Given the observed data $y_1, \ldots, y_T$, we can aggregate these observations to form sufficient statistics. The first, $Y$, is a $T$-by-$D$ matrix with $T$ rows, each holding one observation vector.

(4)
$$
Y = \begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_T^T \end{bmatrix}
$$

Also, we have a $T$-by-$Dr$ matrix $\tilde{Y}$ whose rows are the lagged observations $\tilde{y}_{t-1}^T$.

(5)
$$
\tilde{Y} = \begin{bmatrix} \tilde{y}_0^T \\ \tilde{y}_1^T \\ \vdots \\ \tilde{y}_{T-1}^T \end{bmatrix}
$$

Using these sufficient statistics, and assuming an all-zero prior mean $M = 0$, we have a conjugate MNIW posterior on $A, \Sigma$ with parameters

(6)
$$
A \mid \Sigma, Y \sim \mathcal{MN}\!\left(S_{y\tilde{y}} S_{\tilde{y}\tilde{y}}^{-1},\; \Sigma,\; S_{\tilde{y}\tilde{y}}\right), \qquad \Sigma \mid Y \sim \mathcal{IW}\!\left(\nu_0 + T,\; S_0 + S_{y|\tilde{y}}\right)
$$

where $S_{\tilde{y}\tilde{y}} = \tilde{Y}^T \tilde{Y} + K$, $S_{y\tilde{y}} = Y^T \tilde{Y}$, $S_{yy} = Y^T Y$, and $S_{y|\tilde{y}} = S_{yy} - S_{y\tilde{y}} S_{\tilde{y}\tilde{y}}^{-1} S_{y\tilde{y}}^T$.
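A minimal sketch of this posterior update in NumPy, assuming a zero prior mean. The simulated data, prior settings, and the `mniw_posterior` helper name are all illustrative:

```python
import numpy as np

def mniw_posterior(Y, Ytil, K, nu0, S0):
    """Posterior parameters of the MNIW, assuming a zero prior mean M = 0.

    Y    : (T, D)   observations, rows y_t^T
    Ytil : (T, D*r) lagged observations, rows y_tilde^T
    Returns (MN mean, MN column precision, IW dof, IW scale).
    """
    S_tt = Ytil.T @ Ytil + K            # S_{~y~y}
    S_yt = Y.T @ Ytil                   # S_{y~y}
    S_yy = Y.T @ Y
    M_post = S_yt @ np.linalg.inv(S_tt)
    S_cond = S_yy - M_post @ S_yt.T     # S_{y|~y}
    return M_post, S_tt, nu0 + len(Y), S0 + S_cond

# Illustrative check on simulated AR(1) data with A = 0.9 * I:
rng = np.random.default_rng(1)
D, T = 2, 2000
A_true = 0.9 * np.eye(D)
ys = [np.ones(D)]
for _ in range(T):
    ys.append(A_true @ ys[-1] + 0.1 * rng.standard_normal(D))
Y = np.stack(ys[1:])       # y_1 ... y_T
Ytil = np.stack(ys[:-1])   # lagged copies y_0 ... y_{T-1}
M_post, _, _, _ = mniw_posterior(Y, Ytil, np.eye(D), nu0=D + 2, S0=np.eye(D))
# M_post should be close to A_true once T is large.
```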

## Toy Example: Tracing a circle in 2D

We consider using an order-1 AR process to model the movement of a 2D point around a circle. We’d like to confirm that our posterior learning makes sense, and to see how much data we need to get accurate estimates.

The above picture was made with T=500 timesteps. We generate this data via an AR(1) process with known true weights $A$ and noise covariance $\Sigma = \sigma^2 I$, set to a very small standard deviation ($\sigma = 0.005$). The true regression matrix $A$ is set to a *rotation matrix* with some very small angle $\theta$: we use $\theta = \frac{2\pi}{360}$ radians, or exactly 1 degree per timestep. This means after T=500 timesteps we’ve completed almost one and a half revolutions (500 degrees).
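A minimal sketch of this data-generating process, using the 1-degree rotation and $\sigma = 0.005$ described above (the seed and starting point are arbitrary choices):

```python
import numpy as np

# AR(1) whose regression matrix is a rotation by 1 degree per timestep.
theta = np.deg2rad(1.0)
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
sigma = 0.005
T = 500

rng = np.random.default_rng(42)
y = np.zeros((T, 2))
y[0] = [1.0, 0.0]                      # start on the unit circle
for t in range(1, T):
    y[t] = A @ y[t - 1] + sigma * rng.standard_normal(2)
# Each y[t] traces (noisily) around the circle, ~1 degree per step.
```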

Here’s a plot of what happens after T=50,000. You can see that the noise makes things a bit unstable, but the trajectory remains generally circular.

Here’s a plot of the estimate of $\sigma$ as a function of the number of observed timesteps, $T$.

Startlingly, we significantly overestimate $\sigma$ after 10,000 examples, and still overestimate the noise even after 50,000 examples! I’ve experimented with many settings of the prior parameters, but haven’t been able to do much better. This should serve as a warning: when the true variance is small, posterior inference will require lots of data to converge well.

Overall, however, the circular dynamics of the regression weight matrix are well recovered. Rather than plot the matrix itself, I’ll plot simulated time series generated by samples from $A$’s posterior, with $\Sigma$ fixed to its posterior mean. These samples are unperturbed by noise, so we can understand the recovered dynamics exactly.

Here are 9 sample trajectories after observing 1000 timestep examples:

Here are some sample trajectories after 50000 timesteps:

As expected, the dynamics of the recovered matrix after seeing more data are much more stable.

## About stability

In general, it is worth noting that for most possible choices of the matrix $A$, the resulting dynamics will not be “stable”, in the sense that the time series will either converge to the zero vector or diverge to infinity, producing rather “uninteresting” dynamics. This can happen quite rapidly, sometimes within 10 or 20 timesteps.

To explain this phenomenon, we need only see that the mean of the observation at timestep $t$ is a deterministic function of the initial point $y_0$ and repeated matrix multiplication by $A$: $\mathbb{E}[y_t] = A^t y_0$.

For intuition, suppose $y_0$ is an eigenvector of $A$, with corresponding eigenvalue $\lambda$ (a scalar). Then $\mathbb{E}[y_t] = A^t y_0 = \lambda^t y_0$. This shows that if $A$ has eigenvalues with magnitude above one, the mean will diverge to infinity (or negative infinity, depending on the sign), while eigenvalues with magnitude less than one will send $\mathbb{E}[y_t]$ to zero as $t$ increases. Only systems whose eigenvalues are all equal in magnitude to one will remain “stable”, but many of these matrices (e.g. the identity matrix) are also uninteresting.
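This eigenvalue criterion is easy to check numerically. Below is a small sketch; the `is_stable` helper and its tolerance are my own hypothetical conventions:

```python
import numpy as np

def is_stable(A, tol=1e-8):
    """A linear system y_t = A @ y_{t-1} keeps its mean bounded only when
    no eigenvalue of A has magnitude (meaningfully) greater than one."""
    return np.max(np.abs(np.linalg.eigvals(A))) <= 1.0 + tol

theta = np.deg2rad(1.0)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

print(is_stable(rotation))          # True: rotation eigenvalues have |lambda| = 1
print(is_stable(1.01 * rotation))   # False: |lambda| = 1.01, means diverge
```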

Most regression weights will yield rather uninteresting dynamics over long time horizons. This motivates extending the model, for example with switching VAR processes: after some amount of time under one set of dynamics $A^{(k)}$, the system switches to other parameters $A^{(k')}$, which can explain different regimes of time-series behavior.

## About identifiability

It is important to note that under the Matrix Normal distribution, the Kronecker product $K^{-1} \otimes \Sigma$ is identifiable, but the individual parameters $\Sigma$, $K$ are only identifiable up to a constant $c$. This occurs because for any $c > 0$, we can replace $\Sigma$ with $c\Sigma$ and $K$ with $cK$ and achieve the same probability value for the matrix $A$. See Wang and West (Section 2.4) for more information.
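This scaling invariance can be verified directly: scaling the two parameters by the same constant leaves the Kronecker-product covariance unchanged. The parameter values below are illustrative:

```python
import numpy as np

# The Matrix Normal density depends on (Sigma, K) only through kron(inv(K), Sigma),
# so scaling both by the same constant c leaves the distribution of A unchanged.
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
K = np.array([[2.0, -0.5], [-0.5, 1.0]])
c = 7.0

cov1 = np.kron(np.linalg.inv(K), Sigma)
cov2 = np.kron(np.linalg.inv(c * K), c * Sigma)
print(np.allclose(cov1, cov2))   # True
```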

Thus, it can be difficult to parameterize the priors appropriately, since $\Sigma$ and $K$ can be scaled arbitrarily via the constant $c$ and yield similar results. I have not yet found a coherent way out of this mess.
