
Bayesian Autoregressive Time Series Models

This post is intended to introduce an unfamiliar reader to some basic techniques in Bayesian modeling of autoregressive time series.  We’ll cover the basics of autoregressive models, use the Matrix Normal Inverse Wishart (MNIW) as a conjugate prior for efficient inference, and give some examples of using this model for a point moving in a circle.

References

For more information about autoregressive processes or the MNIW distribution, here are a few key references:

Emily Fox, PhD Thesis, http://www.stat.washington.edu/~ebfox/publications/ebfox_thesis.pdf

  • See especially Appendix F for derivations proving conjugacy for the MNIW

Wang and West, “Bayesian Analysis of Matrix Normal Graphical Models”, http://ftp.stat.duke.edu/WorkingPapers/08-01.pdf

  • See especially section 2 for the MN distribution

Quintana and West, “An analysis of international exchange rates using multivariate DLM’s”, http://www.stat.duke.edu/~mw/MWextrapubs/Quintana1987.pdf

  • This is one of the first papers to combine the autoregressive likelihood with the MNIW prior (and the only one I can find online for free)

Autoregressive Time Series

We seek a generative model for a time-series y_1, y_2, \ldots y_T of observed sensor data, where each measurement y_t is some D-dimensional column vector.  In many applications (tracking, motion capture, financial modeling, etc.), it is reasonable to assume that the data at time t is Gaussian distributed, with its mean determined by a linear perturbation of the recent past plus some additional noise.  For example, if we assume that the mean of each y_t is fully determined by the previous observation y_{t-1}, we have a first-order autoregressive model AR(1).

    \[ y_t \sim \mathcal{N}( A y_{t-1}, \Sigma ) \]

Here, A is a D-by-D matrix of regression coefficients, and \Sigma is a symmetric, positive definite covariance matrix. A defines a linear function for the mean of y_t given the previous value, and \Sigma defines Gaussian noise.
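To make the generative process concrete, here is a minimal sketch in Python/NumPy (my own illustration, with arbitrarily chosen values for A and \Sigma that are not taken from this post) that simulates an AR(1) series:

    import numpy as np

    def simulate_ar1(A, Sigma, y1, T, seed=0):
        """Simulate y_1, ..., y_T from y_t ~ N(A y_{t-1}, Sigma)."""
        rng = np.random.default_rng(seed)
        Y = np.zeros((T, y1.shape[0]))
        Y[0] = y1
        for t in range(1, T):
            Y[t] = rng.multivariate_normal(A @ Y[t - 1], Sigma)
        return Y

    # Example: a 2D series with mildly damped linear dynamics and isotropic noise.
    A = np.array([[0.99, 0.0], [0.0, 0.95]])
    Sigma = 0.01 * np.eye(2)
    Y = simulate_ar1(A, Sigma, y1=np.array([1.0, 0.0]), T=100)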

It is often useful to condition on more of the “recent past”, making y_t a function of the previous R values y_{t-1}, y_{t-2}, \ldots, y_{t-R}. This yields

    \[ y_t \sim \mathcal{N}( A_1 y_{t-1} + A_2 y_{t-2} + \ldots + A_R y_{t-R}, \Sigma ) \]

In this simple expanded notation, each A_r is a D-by-D matrix.

We can rewrite this in more concise matrix notation as

    \[ y_t \sim \mathcal{N}( \mathbf{A} \tilde{y}_{t}, \Sigma ) \]

where \mathbf{A} = [A_1 A_2 \ldots A_R] is a D-by-DR matrix, and \tilde{y}_{t} is a DR-by-1 column vector which concatenates the previous R observations:

    \[ \tilde{y}_t = [\, y_{t-1}^T \quad y_{t-2}^T \quad \ldots \quad y_{t-R}^T \,]^T \]
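As a concrete illustration (again a sketch, with names of my own choosing), the lagged vector \tilde{y}_t is just the previous R observations stacked into one column, so the AR(R) mean becomes a single matrix-vector product:

    import numpy as np

    def lagged_vector(Y, t, R):
        """Stack y_{t-1}, ..., y_{t-R} (rows of the (T, D) array Y) into a length-DR vector."""
        return np.concatenate([Y[t - r] for r in range(1, R + 1)])

    # The AR(R) conditional mean is then A_big @ lagged_vector(Y, t, R),
    # where A_big = [A_1 A_2 ... A_R] has shape (D, D * R).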

AR Parameter Learning

Given observed data y_1, \ldots y_{T}, we are interested in learning an order-R autoregressive model, so we need to infer \mathbf{A} and \Sigma. We take a Bayesian approach, placing priors on \mathbf{A} and \Sigma that allow tractable posterior inference.

The Matrix Normal-Inverse Wishart (MNIW) prior is conjugate to the AR likelihood. It factorizes into separate priors for \Sigma and for \mathbf{A} \mid \Sigma. First, we have an Inverse Wishart prior on \Sigma:

(1)   \begin{align*} \Sigma &\sim \mathcal{W}^{-1}( v_0, S_0 ) \end{align*}

Given the covariance \Sigma, we have a Matrix Normal prior on \mathbf{A}:

(2)   \begin{align*} A | \Sigma &\sim \mathcal{N}_{D \times DR} ( M_0, \Sigma, \Psi_0 ) \\ p( A | \Sigma ) &= \frac{ \exp( -\frac{1}{2}\mbox{tr}( (A-M_0)^T\Sigma^{-1}(A-M_0)\Psi_0^{-1} ) ) } { (2\pi)^{D^2R/2} |\Sigma|^{DR/2} |\Psi_0 |^{D/2} } \end{align*}

where \Psi_{0} is a DR-by-DR symmetric positive definite matrix.

To understand the role of the covariance parameters \Sigma and \Psi_0, it is useful to consider the following marginal distributions for the rows and columns of A:

(3)   \begin{align*} A_{i,:} \sim \mathcal{N}( {M_0}_{i,:}, \Sigma_{i,i} {\Psi_0} ) \\ A_{:,j} \sim \mathcal{N}( {M_0}_{:,j}, {\Psi_0}_{j,j} \Sigma ) \end{align*}

Thus, \Psi_0 determines (up to a multiplicative constant) the covariance of all rows, while \Sigma determines (up to a multiplicative constant) the covariance of the columns.
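One way to see these row and column roles in practice is to draw samples. A standard construction (sketched below; the function name is mine) writes a Matrix Normal draw as M + L_\Sigma Z L_\Psi^T, where Z is a matrix of independent standard normals and L_\Sigma, L_\Psi are Cholesky factors of the row and column covariances:

    import numpy as np

    def sample_matrix_normal(M, Sigma, Psi, rng=None):
        """Draw A ~ MN(M, Sigma, Psi), where Sigma is the D-by-D row covariance
        and Psi is the DR-by-DR column covariance; M is D-by-DR."""
        rng = np.random.default_rng() if rng is None else rng
        Z = rng.standard_normal(M.shape)
        # vec(A) ~ N(vec(M), Psi kron Sigma)
        return M + np.linalg.cholesky(Sigma) @ Z @ np.linalg.cholesky(Psi).T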

Given the observed data y_1, \ldots y_T, we can aggregate these observations to form sufficient statistics. The first, \mathbf{Y}, is a matrix with T rows, each holding one y_t observation vector.

(4)   \begin{align*} \mathbf{Y} = \begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_T^T \end{bmatrix} \end{align*}

Also, we have a T-by-DR matrix \mathbf{\tilde{Y}} whose rows are the lagged observations \tilde{y}_t.

(5)   \begin{align*} \mathbf{\tilde{Y}} = \begin{bmatrix} \tilde{y}_1^T \\ \tilde{y}_2^T \\ \vdots \\ \tilde{y}_T^T \end{bmatrix} \end{align*}
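Assembling these two matrices from raw data is mostly bookkeeping. Here is a sketch (my own; it simply drops the first R timesteps, which lack a full lag history, a detail the post does not dwell on):

    import numpy as np

    def build_suff_stats(Y_raw, R):
        """Return (Y, Y_tilde): rows of Y are y_t, rows of Y_tilde are the stacked lags."""
        T, D = Y_raw.shape
        Y = Y_raw[R:]                                                       # (T - R, D)
        Y_tilde = np.hstack([Y_raw[R - r:T - r] for r in range(1, R + 1)])  # (T - R, D*R)
        return Y, Y_tilde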

Using these sufficient statistics, and assuming an all-zero mean M_0, we have a conjugate MNIW posterior on (\mathbf{A}, \Sigma) with parameters

(6)   \begin{align*} v_P &= v_0 + T \\ M_P &= Y^T\tilde{Y} ( \tilde{Y}^T\tilde{Y} + \Psi_0^{-1} )^{-1} \\ S_P &= S_0 + Y^TY - (M_P)(Y^T\tilde{Y})^T \\ \Psi_P &= ( \tilde{Y}^T\tilde{Y} + \Psi_0^{-1} )^{-1} \end{align*}
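Here is a direct transcription of these updates as a sketch (assuming M_0 = 0, as above; the function and variable names are mine):

    import numpy as np

    def mniw_posterior(Y, Y_tilde, v0, S0, Psi0):
        """MNIW posterior parameters for (A, Sigma) from equation (6), with M_0 = 0."""
        T = Y.shape[0]                                 # number of observations used
        G = Y_tilde.T @ Y_tilde + np.linalg.inv(Psi0)  # DR-by-DR
        Psi_P = np.linalg.inv(G)
        M_P = Y.T @ Y_tilde @ Psi_P                    # D-by-DR posterior mean of A
        S_P = S0 + Y.T @ Y - M_P @ (Y.T @ Y_tilde).T
        v_P = v0 + T
        return v_P, M_P, S_P, Psi_P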

Toy Example: Tracing a circle in 2D

We consider using an order 1 AR process to model the movement of a 2D point around a circle.  We’d like to confirm that our posterior learning makes sense, and further see exactly how much data we need to get accurate estimates.

The above picture was made with T=500 timesteps. We generate this data via an AR(1) process with known true weights A and noise covariance \Sigma = \mathrm{diag}(\sigma^2), where \sigma is set to a very small standard deviation (\sigma = .005). We set the true regression matrix A to be a rotation matrix with a very small angle, \theta = \frac{\pi}{180} radians, or exactly 1 degree per timestep. This means that after T=500 timesteps we’ve completed almost one and a half revolutions (500 degrees).
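A sketch of how such data might be generated (the rotation angle and noise level follow the text; the starting point and random seed are my own choices):

    import numpy as np

    theta = np.pi / 180.0                      # 1 degree per timestep
    A_true = np.array([[np.cos(theta), -np.sin(theta)],
                       [np.sin(theta),  np.cos(theta)]])
    sigma = 0.005
    Sigma_true = sigma ** 2 * np.eye(2)

    rng = np.random.default_rng(0)
    T = 500
    Y = np.zeros((T, 2))
    Y[0] = np.array([1.0, 0.0])                # start on the unit circle
    for t in range(1, T):
        Y[t] = rng.multivariate_normal(A_true @ Y[t - 1], Sigma_true)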

Here’s a plot of what happens after T=50,000. You can see that the noise makes things a bit unstable, but the trajectory remains generally circular.

Here’s a plot of the estimate of \mathbb{E}[\sigma] as a function of the number of observed timesteps, T.

Startlingly, we significantly overestimate the noise after 10,000 examples, and still overestimate it even after 50,000 examples!  I’ve experimented with many settings of the prior parameters, but haven’t been able to do much better.  This should serve as a warning: when the true variance is small, posterior inference will require lots of data to converge well.

Overall, however, the circular dynamics of the regression weight matrix \mathbf{A} are well recovered.  Rather than plot the A matrix itself, I’ll plot simulated time series \hat{y}_1, \ldots \hat{y}_T generated by samples from A’s posterior, with \Sigma fixed at its posterior mean.  These samples are unperturbed by noise, so we can see the recovered dynamics exactly.
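A sketch of how such noiseless rollouts could be produced (my reconstruction of the procedure described, not the original code): draw A from its Matrix Normal conditional with \Sigma fixed, then iterate the mean dynamics with no noise added.

    import numpy as np

    def sample_A_given_Sigma(M_P, Sigma_fixed, Psi_P, rng):
        """Draw A ~ MN(M_P, Sigma_fixed, Psi_P) via Cholesky factors."""
        Z = rng.standard_normal(M_P.shape)
        return M_P + np.linalg.cholesky(Sigma_fixed) @ Z @ np.linalg.cholesky(Psi_P).T

    def rollout(A, y_start, T):
        """Noise-free trajectory y_hat_t = A y_hat_{t-1} (order-1 case)."""
        traj = [y_start]
        for _ in range(T - 1):
            traj.append(A @ traj[-1])
        return np.array(traj)

    # Usage sketch: for each of 9 draws of A (with Sigma at its posterior mean),
    # roll out T steps from the last observed point and plot the trajectories.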

Here are 9 sample trajectories after observing 1000 timestep examples:

Here are some sample trajectories after 50000 timesteps:

As expected, the dynamics of the recovered A matrix after seeing more data are much more stable.

About stability

In general, it is worth noting that for most possible choices of the A matrix, the resulting dynamics will not be “stable”, in the sense that the time series y_1, \ldots y_T will either converge to the zero vector or diverge to infinity, producing rather “uninteresting” dynamics. This can happen quite rapidly, sometimes within 10 or 20 timesteps.

To explain this phenomenon, we need only see that the mean of the observation at timestep t, conditioned on the initial point y_1, is given by repeated matrix multiplication by A.

    \[ \mu_t = \mathbb{E}[ y_t \mid y_1 ] = A^{t-1} y_1 \]

To build intuition, suppose y_1 is an eigenvector of A with corresponding eigenvalue \lambda (a scalar), so that \mu_t = \lambda^{t-1} y_1. This shows that if A has eigenvalues of magnitude greater than one, the mean will diverge, while eigenvalues of magnitude less than one will send \mu_t to zero as t increases. Only systems whose eigenvalues are all equal in magnitude to one will remain “stable”, but many of these matrices (e.g. the identity matrix) are also uninteresting.
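The eigenvalue argument is easy to check numerically. As a small sketch, the rotation matrix from the toy example has eigenvalues of magnitude exactly one, while slightly scaled versions of it do not:

    import numpy as np

    theta = np.pi / 180.0
    R_mat = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])

    for label, A in [("rotation", R_mat), ("expanding", 1.01 * R_mat), ("shrinking", 0.99 * R_mat)]:
        spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))
        print(label, spectral_radius)          # 1.0, 1.01, 0.99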

Most regression weights A will thus yield rather uninteresting dynamics over long time horizons. This motivates extending the time series model to overcome this tendency, for example with switching VAR processes: after some amount of time under one set of parameters (A,\Sigma), the system switches to another (A', \Sigma'), so that different regimes of time series behavior can be explained.

About identifiability

It is important to note that under the Matrix Normal distribution, the Kronecker product \Psi \otimes \Sigma is identifiable, but the individual parameters \Psi and \Sigma are only identifiable up to a constant c.  This occurs because for any \Psi, \Sigma, we can use \Psi/c, c\Sigma and obtain exactly the same probability for the matrix A.  See Wang and West (Section 2.4) for more information.
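To spell out the invariance using the density in equation (2): for any constant c > 0, rescaling the two covariances in opposite directions leaves both the trace term and the normalizing constant unchanged, since the factor c^{D^2R/2} from |c\Sigma|^{DR/2} cancels against c^{-D^2R/2} from |\Psi_0/c|^{D/2}:

    \[ \mbox{tr}\big( (A-M_0)^T (c\Sigma)^{-1} (A-M_0) (\Psi_0/c)^{-1} \big) = \mbox{tr}\big( (A-M_0)^T \Sigma^{-1} (A-M_0) \Psi_0^{-1} \big), \qquad |c\Sigma|^{DR/2} \, |\Psi_0/c|^{D/2} = |\Sigma|^{DR/2} \, |\Psi_0|^{D/2} \]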

Thus, it can be difficult to parameterize the priors appropriately, since \Psi_0 and S_0 can be scaled arbitrarily via the constant c and yield similar results.  I have not yet found a coherent way out of this mess.
