In this document we detail how dynamic linear models (DLMs) can be implemented in Theano (or similar tensor libraries), as well as a complementary Rao-Blackwellized sampler tailored to the structure of DLMs. Furthermore, we provide an example of how DLMs can be extended–via Gaussian scale-mixtures–to model non-Gaussian observations. The example is a Pólya-Gamma extension to forward-filtering backward-sampling for negative-binomial observations.
In the future, we plan to build optimizations that automate much of the work done here (and more). This exposition sets the foundation for such work by first motivating the use and generality of Bayesian frameworks like DLMs, then by demonstrating the analytic steps that produce customized, efficient samplers. These are the steps that would undergo automation in the future.
A DLM with prior \(\theta_0 \sim \operatorname{N}\left( m_0, C_0 \right)\) is defined as follows:
for \(t \in \{1, \dots, T\}\), \(y_t \in \mathbb{R}\), and \(\theta_t \in \mathbb{R}^{M}\).
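For the reader's convenience, we restate the standard DLM observation and state equations that the notation below assumes (the noise symbols \(\nu_t\) and \(w_t\) are ours; \(\eqref{eq:basic-dlm-obs}\) and \(\eqref{eq:basic-dlm-state}\) are the corresponding labeled displays):

\[
\begin{aligned}
  y_t &= F_t^\top \theta_t + \nu_t, & \nu_t &\sim \operatorname{N}\left(0, V_t\right), \\
  \theta_t &= G_t \theta_{t-1} + w_t, & w_t &\sim \operatorname{N}\left(0, W_t\right),
\end{aligned}
\]

so that \(F_t \in \mathbb{R}^{M}\), \(G_t \in \mathbb{R}^{M \times M}\), \(V_t\) is the observation variance, and \(W_t\) is the state evolution covariance.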
The most “notationally” faithful representation of the timeseries model in \(\eqref{eq:basic-dlm-state}\) using Theano is provided in Listing 2. It represents the notion of a recursion–to the best of Theano’s ability–by way of the scan operator.
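As a rough illustration of that idea (this is not the code in Listing 2; the variable names and the fixed state dimension below are our own assumptions), the state recursion can be written with theano.scan along these lines:

```python
import theano
import theano.tensor as tt
from theano.tensor.shared_randomstreams import RandomStreams

srng = RandomStreams(seed=2309)
M = 2  # assumed latent state dimension for this sketch

def state_step(theta_tm1, G, chol_W):
    # theta_t = G theta_{t-1} + w_t, with w_t ~ N(0, W) and W = chol_W chol_W^T
    w_t = chol_W.dot(srng.normal((M,)))
    return G.dot(theta_tm1) + w_t

theta_0 = tt.vector("theta_0")
G = tt.matrix("G")
chol_W = tt.matrix("chol_W")
n_steps = tt.iscalar("n_steps")

theta_samples, updates = theano.scan(
    fn=state_step,
    outputs_info=[theta_0],
    non_sequences=[G, chol_W],
    n_steps=n_steps,
)

# The updates returned by scan carry the random-stream state and must be
# passed along when compiling the function.
simulate_states = theano.function(
    [theta_0, G, chol_W, n_steps], theta_samples, updates=updates
)
```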
Throughout we’ll use data sampled from \(\eqref{eq:basic-dlm-state}\) for demonstration purposes. Our simulation has the following model parameter values:
A sample from \(\eqref{eq:sim-settings}\) is generated in Listing 3.
In Figure 4 we plot a sample from the model in Listing 2 for a fixed RNG seed.
Analytic Posteriors
The standard DLM is essentially a Kalman filter and enjoys many well-documented closed-form results. In the following, we will simply state the relevant prior predictive and posterior results.
Given all the prior and observed data up to time \(t\), denoted \(D_t\), these distributions are given by the following:
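For reference, the well-known prior predictive and filtered moments are restated here, writing \(\left(\theta_t \mid D_{t-1}\right) \sim \operatorname{N}\left(a_t, R_t\right)\), \(\left(y_t \mid D_{t-1}\right) \sim \operatorname{N}\left(f_t, Q_t\right)\), and \(\left(\theta_t \mid D_t\right) \sim \operatorname{N}\left(m_t, C_t\right)\); the labeled display \(\eqref{eq:dlm-post-moments}\) may group them differently:

\[
\begin{aligned}
  a_t &= G_t m_{t-1}, & R_t &= G_t C_{t-1} G_t^\top + W_t, \\
  f_t &= F_t^\top a_t, & Q_t &= F_t^\top R_t F_t + V_t, \\
  m_t &= a_t + R_t F_t Q_t^{-1}\left(y_t - f_t\right), & C_t &= R_t - R_t F_t Q_t^{-1} F_t^\top R_t.
\end{aligned}
\]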
These “filtered” moments/distributions are one “kind” of posterior result for a DLM, and they only take into account the data up to time \(t\). The other kind are the “smoothed” distributions, which provide posterior distributions for each time \(t\) given all observations preceding and following \(t\).
Notationally, we’ve used \(D_t\) to signify all conditional observations and parameters up to time \(t\), so the smoothed distributions are given by \(\theta_t \mid D_T\), in contrast to the filtered \(\theta_t \mid D_t\).
The smoothed \(\theta_t\) distributions are still Gaussian, i.e. \(\left(\theta_t \mid D_T\right) \sim \operatorname{N}\left(s_t, S_t\right)\), and their moments are as follows:
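Again for reference, the standard backward recursions for these moments, initialized with \(s_T = m_T\) and \(S_T = C_T\) and using \(a_{t+1} = G_{t+1} m_t\) from the filtering step, are as follows (the labeled display \(\eqref{eq:dlm-smooth-moments}\) may present them differently):

\[
\begin{aligned}
  s_t &= m_t + C_t G_{t+1}^\top R_{t+1}^{-1} \left(s_{t+1} - a_{t+1}\right), \\
  S_t &= C_t - C_t G_{t+1}^\top R_{t+1}^{-1} \left(R_{t+1} - S_{t+1}\right) R_{t+1}^{-1} G_{t+1} C_t.
\end{aligned}
\]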
In most cases, models will not be as simple as the standard DLM. Even so, these basic closed-form solutions can still be relevant. For instance, efficient MCMC algorithms can be constructed using these closed-form results for conditionally linear models. In those cases, we can compute the posterior moments–in closed-form–conditional on samples generated by other means.
The standard approach is called forward-filtering backward-sampling (FFBS) and uses smoothed posteriors \(\theta_t \mid \theta_{t+1}, D_T\) conditioned on all other parameters.
We’ll build up to forward-filtering backward-sampling in what follows but, first, we need to establish how the requisite quantities can be computed symbolically.
Posterior Estimation
In Listings 6 and 7, we demonstrate how the posterior moments in \(\eqref{eq:dlm-post-moments}\) and \(\eqref{eq:dlm-smooth-moments}\) can be computed in Theano.
Unfortunately, if we attempt to implement the exact closed-form updates in \(\eqref{eq:dlm-post-moments}\) or \(\eqref{eq:dlm-smooth-moments}\), our results will be fraught with numerical errors. This is a very basic issue with naively implemented Kalman filters. The solution to these issues usually involves some analytic reformulations that compensate for the covariance matrix subtractions. The standard approaches generally use some form of matrix decomposition that directly accounts for the positive semi-definite nature of the covariance matrices.
The approach taken here is based on the singular value decomposition (SVD) and effectively computes only one symmetric “half” of the updated covariances. The SVD also allows for easy inversions. See Zhang & Li (1996) for more details, or Petris, Petrone & Campagnoli (2009) for a concise overview of the procedure in the context of DLMs.
SVD-based Filtering
The SVD forms of the filtering equations in \(\eqref{eq:dlm-post-moments}\) are produced through creative use of the SVDs of their component matrices. Using a slightly modified version of the formulation established in Petris, Petrone & Campagnoli (2009), the SVD for a matrix \(M\) is given by \(M = U_{M} D_{M} V_{M}^\top\). A symmetric matrix then takes the form \(M = U_{M} D_{M} U_{M}^\top\) and its “square-root” is given by \(M = N_M^\top N_M\) with \(N_M = S_{M} U_{M}^\top\) and \(S_{M} = D_{M}^{1/2}\). Likewise, matrix (generalized) inverses take the form \(M^{-1} = U_{M} D_{M}^{-1} U_{M}^\top\).
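As a quick numerical sanity check of those identities (plain NumPy, independent of the Theano listings):

```python
import numpy as np

rng = np.random.default_rng(2309)

# Build an arbitrary symmetric positive definite matrix M.
A = rng.normal(size=(5, 3))
M = A.T @ A

# For symmetric positive (semi-)definite M, the SVD coincides with the
# eigendecomposition, so M = U_M diag(D_M) U_M^T.
U_M, D_M, _ = np.linalg.svd(M)
S_M = np.sqrt(D_M)
N_M = np.diag(S_M) @ U_M.T          # square root: M = N_M^T N_M

assert np.allclose(M, N_M.T @ N_M)
assert np.allclose(np.linalg.inv(M), U_M @ np.diag(1.0 / D_M) @ U_M.T)
```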
The idea here is that we can combine these SVD identities to derive a square-root relationship between the SVD of \(C_t^{-1}\) and the SVDs of \(C_{t-1}\), \(W_t\), \(V_t\), and \(R_t\); we can then easily invert \(C_t^{-1}\) to arrive at the desired numerically stable SVD of \(C_t\).
First, note that \(N_{R_t}^\top N_{R_t} = G_t C_{t-1} G_t^\top + W_t = R_t\) for
From this, we know that the SVD of \(R_t\) can be easily derived from the SVD of its square root, \(N_{R_t}\), i.e. \(U_{R_t} = V_{N_{R_t}}\) and \(S_{R_t} = D_{N_{R_t}}\). In other words, we can obtain a matrix’s SVD by computing the SVD of its “half”, which is itself composed entirely of previous SVD components. The inherent symmetry of our covariance matrices is nicely preserved because we’re only ever using and computing one “half” of these matrices.
With the updated SVD of \(R_t\), we can use the identity \(C_t^{-1} = F_t V_t^{-1} F_t^\top + R_t^{-1}\)–obtained via the classic Sherman-Morrison-Woodbury matrix inverse identity–to employ the same technique as before and produce the SVD of \(C_t^{-1}\) by way of the SVD of yet another block square-root matrix,
Again, we compute the SVD of \(N_{C_t^{-1}}\) at this step and obtain \(V_{N_{C_t^{-1}}}\) and \(D_{N_{C_t^{-1}}}\).
This time, the block square-root matrix relationship isn’t so direct, and we have to multiply by \(U_{R_t}\): \(U_{R_t} N_{C_t^{-1}}^\top N_{C_t^{-1}} U_{R_t}^\top = C_t^{-1}\). However, since the additional \(U_{R_t}\) terms are orthogonal, we are able to derive the SVD of \(C_t\) as \(U_{C_t} = U_{R_t} V_{N_{C_t^{-1}}}\) and \(S_{C_t} = D_{N_{C_t^{-1}}}^{-1}\).
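Putting the two square-root steps together, a single filtering step can be sketched in plain NumPy as follows. The function and variable names here are ours (not those of the listings), and we assume a univariate observation with time-invariant \(F\), \(G\), \(W\), and \(V\):

```python
import numpy as np

def svd_filter_step(y_t, m_prev, U_C, S_C, G, F, N_W, V):
    """One SVD-based filtering step, (m_{t-1}, U_C, S_C) -> (m_t, U_C, S_C).

    Assumes a univariate observation y_t with variance V, an observation
    vector F of shape (M,), and N_W such that W = N_W^T N_W.
    """
    # Square root of R_t = G C_{t-1} G^T + W, stacked from previous SVD parts.
    N_R = np.vstack([np.diag(S_C) @ U_C.T @ G.T, N_W])
    _, d_R, Vt_R = np.linalg.svd(N_R, full_matrices=False)
    U_R, S_R = Vt_R.T, d_R              # R_t = U_R diag(S_R**2) U_R^T

    # Square root of U_R^T C_t^{-1} U_R, where C_t^{-1} = F V^{-1} F^T + R_t^{-1}.
    N_C_inv = np.vstack([F @ U_R / np.sqrt(V), np.diag(1.0 / S_R)])
    _, d_C, Vt_C = np.linalg.svd(N_C_inv, full_matrices=False)
    U_C_new = U_R @ Vt_C.T              # C_t = U_C_new diag(S_C_new**2) U_C_new^T
    S_C_new = 1.0 / d_C

    # Mean update m_t = a_t + C_t F V^{-1} (y_t - F^T a_t).
    a_t = G @ m_prev
    C_t = U_C_new @ np.diag(S_C_new ** 2) @ U_C_new.T
    m_t = a_t + C_t @ F * (y_t - F @ a_t) / V
    return m_t, U_C_new, S_C_new
```

The smoothing recursion described next follows the same pattern, with \(N_{H_t}\) and \(N_{S_t}\) playing the roles that \(N_{R_t}\) and \(N_{C_t^{-1}}\) play here.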
We can use the techniques above to produce SVD versions of the smoothing equations in \(\eqref{eq:dlm-smooth-moments}\). In this case, some extra steps are required in order to SVD-decompose \(S_t\) in the same manner as \(R_t\) and \(C_t^{-1}\) were.
First, notice that our target, \(S_t\), is a difference of matrices, unlike the matrix sums that comprised \(R_t\) and \(C_t^{-1}\) above. Furthermore, \(S_t\) is given as a difference of a (transformed) difference. To address the latter, we start by expanding \(S_t\) and setting \(B_t = C_t G_{t+1}^\top R_{t+1}^{-1}\) to obtain
Having turned \(S_t\) into a sum of two terms, \(S_t = H_t + B_t S_{t+1} B_t^\top\) with \(H_t = C_t - B_t R_{t+1} B_t^\top\), we can now consider another blocked SVD-based square-root reformulation, which starts with the reformulation of \(H_t\).
We can use the definition of \(R_{t+1} = G_{t+1} C_t G_{t+1}^\top + W_{t+1}\) to get
This form of \(H_t\) fits the Woodbury identity and results in \(H_t^{-1} = G_{t+1}^\top W_{t+1}^{-1} G_{t+1} + C_t^{-1}\), which is amenable to our square-root formulation.
Specifically, \(H_t^{-1} = U_{C_t} N_{H_t}^{-\top} N_{H_t}^{-1} U_{C_t}^\top\), where
By inverting the SVD of \(N_{H_t}^{-1}\) we obtain the SVD of \(H_t\) as \(U_{H_t} = U_{C_t} V_{N_{H_t}^{-1}}\) and \(D_{H_t} = {D_{N_{H_t}^{-1}}}^{-2} = S_{H_t}^2\).
Finally, using \(\eqref{eq:S_t_decomp}\) and \(\eqref{eq:N_H_t_inv}\) we can derive the last blocked square-root decomposition \(S_t = N_{S_t}^\top N_{S_t}\):
Again, we take the SVD of \(N_{S_t}\) and derive the SVD of \(S_t\) as \(U_{S_t} = V_{N_{S_t}}\) and \(D_{S_t} = D_{N_{S_t}}^2 = S_{S_t}^2\).
Example
Listing 8 computes the filtered and smoothed means for our simulated series, and Figure 9 shows the results.
Forward-filtering Backward-sampling
The filtering and smoothing steps in the previous section let us perform more efficient MCMC estimation than would otherwise be possible, owing to the Rao-Blackwellization inherent in both steps.
Forward-filtering backward-sampling (Frühwirth-Schnatter 1994) works by first computing the forward filtered moments, allowing one to draw \(\theta_T\) from \(\left(\theta_T \mid D_T\right) \sim \operatorname{N}\left(m_T, C_T\right)\) and, subsequently, \(\theta_t\) from \(\left(\theta_t \mid \theta_{t+1}, D_T \right) \sim \operatorname{N}\left(h_t, H_t\right)\).
The latter distribution’s moments are easily derived from the filtered and smoothed moments:
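For reference, with \(B_t = C_t G_{t+1}^\top R_{t+1}^{-1}\) as in the smoothing section, these moments take the standard form (the labeled display \(\eqref{eq:ffbs-moments}\) may write them slightly differently):

\[
\begin{aligned}
  h_t &= m_t + B_t \left(\theta_{t+1} - a_{t+1}\right), \\
  H_t &= C_t - B_t R_{t+1} B_t^\top.
\end{aligned}
\]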
Since all the quantities in \(\eqref{eq:ffbs-moments}\) appear in the filtering and smoothing moments, we can use the SVD-based approach described earlier to perform the updates and sampling. We reproduce the relevant subset of calculations in Listing 10.
Quantities besides the state values, \(\theta_t\), can be sampled sequentially (i.e. within the function ffbs_step in Listing 10) or after FFBS, once all \(\theta_t \mid D_T\) have been sampled. Such updates can exploit the conditionally Gaussian form of \(\left(\theta_t \mid \theta_{t+1}, D_T \right)\) to derive Gibbs steps, further Rao-Blackwellize hierarchical quantities, or apply any other means of producing posterior samples conditional on \(\left(\theta_t \mid \theta_{t+1}, D_T \right)\).
In our simulation example, we will further augment our original model by adding the classic conjugate gamma priors to our previously fixed state and observation precision parameters, \(\phi_W\) and \(\phi_V\), respectively:
This classical conjugate prior allows one to derive simple closed-form posteriors for a Gibbs sampler conditional on \(y_t\), \(\theta_t\), and \(\theta_{t-1}\), as follows:
Those posterior computations are implemented in Listings 11 and 12, and they are used to update the shared Theano variables for \(\phi_W\) and \(\phi_V\) within a Gibbs sampling loop in Listing 13.
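As a rough stand-in for those listings, here is a plain NumPy sketch of the two conditional updates, assuming \(V_t = \phi_V^{-1}\), \(W_t = \phi_W^{-1} I\), and \(\operatorname{Gamma}(a, b)\) priors parameterized by shape and rate; the names and exact parameterization in Listings 11 and 12 may differ:

```python
import numpy as np

def sample_precisions(rng, y, theta, F, G, a_V, b_V, a_W, b_W):
    """Gibbs updates for phi_V and phi_W given theta_{0:T} and y_{1:T}.

    `theta` has shape (T + 1, M) and includes theta_0; `y` has shape (T,).
    Assumes V_t = 1/phi_V and W_t = (1/phi_W) * I.
    """
    T, M = y.shape[0], theta.shape[1]
    # Observation residuals y_t - F^T theta_t.
    obs_resid = y - theta[1:] @ F
    phi_V = rng.gamma(a_V + T / 2.0,
                      1.0 / (b_V + 0.5 * np.sum(obs_resid ** 2)))
    # State residuals theta_t - G theta_{t-1}.
    state_resid = theta[1:] - theta[:-1] @ G.T
    phi_W = rng.gamma(a_W + T * M / 2.0,
                      1.0 / (b_W + 0.5 * np.sum(state_resid ** 2)))
    return phi_V, phi_W
```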
Simulation Example
Figure 14 shows the posterior \(\theta_t\) samples and Figure 15 plots the posterior sample traces.
Non-Gaussian Extension
Let’s say we want to model count observations that are driven by a smooth time-varying process. We can assume a negative-binomial observation model with a log-link function–in standard GLM fashion (McCullagh & Nelder 1989)–and connect it to the same state and observation dynamics as the basic DLM in \(\eqref{eq:basic-dlm-state}\) via its mean \(\mu_t = \exp\left(\eta_t\right)\):
The logit-inverse form in \(\eqref{eq:nb-pmf}\) has a Gaussian scale-mixture representation in the Pólya-Gamma distribution (Polson, Scott & Windle 2013). Said scale-mixture is as follows:
where \(\kappa = a - b / 2\) and \(\omega_t \sim \operatorname{PG}\left(b, 0\right)\).
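For convenience, the identity being referenced (Theorem 1 of Polson, Scott & Windle (2013)) can be written as

\[
\frac{\left(e^{\psi}\right)^{a}}{\left(1 + e^{\psi}\right)^{b}}
  = 2^{-b} e^{\kappa \psi} \int_0^\infty e^{-\omega \psi^2 / 2} \, p(\omega) \, d\omega,
\]

where \(p(\omega)\) is the \(\operatorname{PG}\left(b, 0\right)\) density and \(\kappa\) is as defined above.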
When the Gaussian scale-mixture identity of \(\eqref{eq:pg-identity}\) is applied to our observation model in \(\eqref{eq:nb-pmf}\), \(a = y_t\), \(b = r + y_t\), \(\kappa = (y_t - r)/2\), and \(\psi = \eta_t - \log r\).
From the scale mixture formulation, we obtain the following augmented joint observation model density:
with \(V_t = \operatorname{diag}\left( \omega^{-1}_t \right)\).
The augmented observation equation in \(\eqref{eq:nb-virtual-obs}\) recovers our original DLM formulation in \(\eqref{eq:basic-dlm-obs}\) and–by comparison–details the changes needed to use the Pólya-Gamma scale-mixture. Specifically, we need to sample \(\omega_t \sim \operatorname{PG}\left(r + y_t, F_t^\top \theta_t - \log r \right)\) and perform the following substitutions: \(y_t \to y^*_t\) and \(V_t \to V^*_t\).
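A sketch of that augmentation step in plain Python follows; it uses the pypolyagamma package for the \(\operatorname{PG}\) draws and one common definition of the virtual observations, \(y^*_t = \kappa_t / \omega_t + \log r\) with \(\kappa_t = (y_t - r)/2\). The helper name is ours, and the exact definitions in \(\eqref{eq:nb-virtual-obs}\) and the corresponding listings may fold the \(\log r\) offset in differently.

```python
import numpy as np
# Assumed dependency: the pypolyagamma package provides exact PG(b, c) draws.
from pypolyagamma import PyPolyaGamma

pg_rng = PyPolyaGamma(seed=2309)

def pg_virtual_obs(y, eta, r):
    """Draw omega_t ~ PG(r + y_t, eta_t - log r) and form the virtual
    observations y*_t and variances V*_t used by the augmented FFBS step."""
    psi = eta - np.log(r)
    omega = np.array([pg_rng.pgdraw(r + y_t, psi_t)
                      for y_t, psi_t in zip(y, psi)])
    kappa = (y - r) / 2.0
    y_star = kappa / omega + np.log(r)   # substituted for y_t
    V_star = 1.0 / omega                 # substituted for V_t (diagonal)
    return omega, y_star, V_star
```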
Simulation Example
Listing 17 creates a Theano graph for the negative-binomial model defined in \(\eqref{eq:nb-dlm}\).
Listing 18 specifies parameters for a simulation from \(\eqref{eq:nb-dlm}\) and samples a series.
In Figure 19 we plot the sample generated in Listing 18.
Augmented FFBS Sampler
In order to create an FFBS sampler for our Pólya-Gamma DLM in \(\eqref{eq:nb-aug-obs}\), we need to update the filtering code to use time-varying “virtual” observation variances, \(V^*_t\). After this change is made, all Theano graphs that depend on the resulting objects need to be recreated as well. This is done in Listing 22.
Simulation Example
In Listing 24 we sample the initial values and create Theano terms for posterior/updated \(\omega_t\)-related values.
Finally, the sampler steps are defined and executed in Listing 25.
Figure 26 shows the posterior \(\theta_t\) samples, Figure 27 plots the posterior sample traces, and Figure 28 shows \(p_t \mid \theta_t\).
COVID–19 Example
As another example, we’ll use data from the COVID–19 outbreak in New York and concentrate only on the positive test counts. We do not make any assertions about the results; we simply want to apply our implementation to a real-life dataset. If anything, we would like to highlight the potential provided by these flexible and constructive Bayesian frameworks and offer a concrete starting point for motivated statisticians and Python developers.
To start, Listing 29 pulls and formats the COVID–19 data and Figure 30 plots it.
Our example model is a simple Gaussian walk with an unknown evolution variance (again modeled with a gamma prior on the precision \(\phi_W\)) given by the following values:
Again, we set \(r = 1000\), effectively making our negative-binomial observation model approximate a Poisson. Listing 31 sets the initial values in \(\eqref{eq:ny-model-settings}\) and Listing 32 specifies the sampler loop.
Discussion
We’ve implemented a highly extensible Bayesian timeseries framework with respectable model coverage (e.g. ARIMA models, polynomial trends, Fourier seasonality, dynamic regression, etc.), and shown how such a basic class can be extended to non-Gaussian observations. Additionally, we were able to show how an existing model-specialized sampler, i.e. FFBS, can be adapted alongside such an extension using the conditional linearity provided by a Gaussian scale-mixture.
In general, the same scale-mixture and conditional linearity techniques can be employed for other observation distributions (e.g. Beta, Binomial), as well as state-of-the-art shrinkage and sparsity priors (Bhadra, Datta, Polson & Willard 2016; Datta & Dunson 2016).
Although we used Gibbs sampling, these techniques do not necessitate it; one can use Metropolis steps–or any other mathematically sound steps–instead.
Bayesian frameworks like these have considerable flexibility and are amenable to much analytic creativity; unfortunately, we don’t often see this–directly, at least–in practice. There are clearly some roadblocks for the average applied statistics/data science developer when attempting to implement these models straight from published papers. Those roadblocks often involve subtle numerical stability issues and a level of mathematical proficiency that is sometimes non-standard.
We hope that future automations will be able to address more straightforward numerical stability issues by–for instance–automating the SVD-based reformulations used herein. The scale-mixture manipulations are likewise amenable to certain automations but, at the very least, it would serve the Bayesian community–and all who desire to construct non-trivial, principled statistical models–well to develop low-level tools that codify and generate such scale-mixture variations. In this way, even experimentation by knowledgeable developers will be much easier and less error-prone.
Petris, Petrone & Campagnoli, Dynamic Linear Models, Springer (2009). ↩︎
Gamerman & Lopes, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, CRC Press (2006). ↩︎
Pole, West & Harrison, Applied Bayesian Forecasting and Time Series Analysis, CRC Press (1994). ↩︎
Bergstra, Breuleux, Bastien, Lamblin, Pascanu, Desjardins, Turian, Warde-Farley & Bengio, Theano: A CPU and GPU Math Expression Compiler, in: Proceedings of the Python for Scientific Computing Conference (SciPy) (2010). ↩︎
Zhang & Li, Fixed-Interval Smoothing Algorithm Based on Singular Value Decomposition, in: Proceedings of the 1996 IEEE International Conference on Control Applications, 916–921 (1996). ↩︎
Petris, Petrone & Campagnoli, Dynamic Linear Models, Springer (2009). ↩︎
Fr"uhwirth-Schnatter, Data Augmentation and Dynamic Linear Models, Journal of time series analysis, 15(2), 183–202 (1994). link. ↩︎
McCullagh & Nelder, Generalized Linear Models, CRC press (1989). ↩︎
Polson, Scott & Windle, Bayesian Inference for Logistic Models Using Pólya-Gamma Latent Variables, Journal of the American Statistical Association, 108(504), 1339–1349 (2013). link. ↩︎