This case study documents the implementation in Stan (Carpenter et al. 2017) of the Hidden Markov Model (HMM) for unsupervised learning (Baum and Petrie 1966; Baum and Eagon 1967; Baum and Sell 1968; Baum et al. 1970; Baum 1972). Additionally, we present the adaptations needed for the Input-Output Hidden Markov Model (IOHMM), an architecture proposed by Bengio and Frasconi (1994) to map input sequences, sometimes called the control signal, to output sequences. Compared to the HMM, it aims to be especially effective at learning long-term dependencies, that is, when the relationship between inputs and outputs spans long time intervals. Finally, we illustrate the use of HMMs as a component within more complex constructions with a volatility model taken from the econometrics literature. In all cases, we provide a fully Bayesian estimation of the model parameters and inference on hidden quantities, namely the filtered and smoothed posterior distributions of the hidden states and the jointly most probable state path.

*A Tutorial on Hidden Markov Models using Stan* is distributed under the Creative Commons Attribution 4.0 International Public License. Accompanying code is distributed under the GNU General Public License v3.0. See the README file for details. All files are available in the stancon18 GitHub repository.

While HMMs are useful for modeling certain phenomena directly, their utility is significantly increased by embedding them within larger models. In this section, we embed an HMM within a commonly used econometric model and apply it to stock market returns. We provide code for the forward algorithm for this model. The other algorithms discussed above can be applied to this model with minor modifications.

Many financial time series exhibit so-called “volatility clustering”; that is, periods of significant activity tend to occur closely together, suggesting that there is some form of short-term memory to the volatility (standard deviation of returns). Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models are commonly used to capture this phenomenon and have been widely studied in the econometrics literature (Bollerslev 1986; Bollerslev 2010). No less important is the phenomenon of “regime-switching”: the observation that financial markets go through alternating periods of low-volatility booms and high-volatility busts. The Markov Switching GARCH (MS-GARCH) model of Haas *et al.* (2004b; 2004a) uses an HMM to switch between two latent GARCH processes. In an extensive empirical comparison, Ardia *et al.* report that Bayesian estimation significantly improves the performance of MS-GARCH over the more common maximum likelihood estimate (Ardia et al. 2017). We show how this model can be easily and efficiently estimated using Stan.

The standard GARCH(1, 1) model is described in the Stan Manual (Stan Development Team 2017c, Section 10.2). Under the model, the return \(y_t\) at each date is drawn independently from a normal distribution with mean \(\mu\), which we fix to zero, and a time-varying standard deviation \(\sigma_t\), which evolves according to

\[ \begin{aligned} \sigma^2_t &= \alpha_0 + \alpha_1 y_{t-1}^2 + \beta_1 \sigma_{t-1}^2 \\ p(y_t) &= \mathcal{N}(y_t \mid 0, \sigma^2_t). \end{aligned} \]
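As a quick numerical illustration of the variance recursion above (this is a sketch in Python, not the accompanying Stan code; the coefficient values and the initial variance below are hypothetical):

```python
import numpy as np

def garch_volatility(y, alpha0, alpha1, beta1, sigma1_sq):
    """Run the GARCH(1,1) variance recursion over a return series.

    Illustrative sketch only: the parameter values and the
    initialization sigma1_sq are hypothetical, not estimates.
    """
    sigma_sq = np.empty(len(y))
    sigma_sq[0] = sigma1_sq
    for t in range(1, len(y)):
        # sigma^2_t = alpha0 + alpha1 * y_{t-1}^2 + beta1 * sigma^2_{t-1}
        sigma_sq[t] = alpha0 + alpha1 * y[t - 1] ** 2 + beta1 * sigma_sq[t - 1]
    return sigma_sq

# Toy example: constant 1% returns with hypothetical coefficients
y = np.full(5, 0.01)
sig2 = garch_volatility(y, alpha0=1e-5, alpha1=0.1, beta1=0.85, sigma1_sq=1e-4)
```

Because \(\alpha_1 + \beta_1 < 1\) here, the variance sequence mean-reverts rather than exploding, which is the behavior the stationarity constraints discussed below are meant to guarantee.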

In the MS-GARCH model, we have two parallel GARCH processes (with different parameters) and the standard deviation of the return is drawn according to one of the two processes as determined by an HMM.

\[ \begin{aligned} (\sigma^{(1)}_t)^2 &= \alpha^{(1)}_0 + \alpha^{(1)}_1 y_{t-1}^2 + \beta_1^{(1)} \sigma_{t-1}^2 \\ (\sigma^{(2)}_t)^2 &= \alpha^{(2)}_0 + \alpha^{(2)}_1 y_{t-1}^2 + \beta_1^{(2)} \sigma_{t-1}^2 \\ p(z_{t}|z_{t-1} = k) &= \text{Categorical}(p_k) \\ p(y_t|z_t = k) &= \mathcal{N}(y_t | 0,(\sigma_t^{(k)})^2) \end{aligned} \]
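For concreteness, the forward (filtering) recursion for this model can be sketched numerically as follows, following the Haas *et al.* formulation in which each regime carries its own variance recursion. This is illustrative Python rather than the accompanying Stan program, and all parameter values are hypothetical:

```python
import numpy as np

def msgarch_forward(y, alpha0, alpha1, beta1, A, pi0, sigma1_sq):
    """Forward pass: log-likelihood and filtered probabilities p(z_t | y_1:t).

    alpha0, alpha1, beta1, sigma1_sq are per-regime arrays of length K;
    A is the K x K transition matrix and pi0 the initial distribution.
    Hypothetical sketch of the forward algorithm, not the Stan code.
    """
    T, K = len(y), len(pi0)
    sigma_sq = np.array(sigma1_sq, dtype=float)  # one variance per regime
    prior = np.array(pi0, dtype=float)
    filt = np.empty((T, K))
    loglik = 0.0
    for t in range(T):
        if t > 0:
            # each regime's GARCH recursion uses its own lagged variance
            sigma_sq = alpha0 + alpha1 * y[t - 1] ** 2 + beta1 * sigma_sq
            prior = filt[t - 1] @ A              # predict step
        like = np.exp(-0.5 * y[t] ** 2 / sigma_sq) / np.sqrt(2 * np.pi * sigma_sq)
        joint = prior * like
        c = joint.sum()                          # normalizing constant
        filt[t] = joint / c                      # update step
        loglik += np.log(c)
    return loglik, filt

# Sanity check with two identical regimes and a symmetric chain:
# the filtered probabilities should stay at one half throughout.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
ll, filt = msgarch_forward(
    np.full(10, 0.01),
    alpha0=np.array([1e-5, 1e-5]), alpha1=np.array([0.1, 0.1]),
    beta1=np.array([0.85, 0.85]), A=A, pi0=[0.5, 0.5],
    sigma1_sq=np.array([1e-4, 1e-4]))
```

Normalizing the joint probabilities at every step, and accumulating the log of the normalizing constants, keeps the recursion numerically stable and yields the marginal log-likelihood as a by-product; the Stan implementation works on the log scale for the same reason.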

The implementation is a relatively straightforward combination of the standard HMM discussed above and the GARCH model from the Stan Manual. To ensure our model is well identified, we use the `positive_ordered` type for the baseline volatilities \(\alpha^{(1)}_0, \ \alpha^{(2)}_0\). We use weak priors on the GARCH coefficients as well as hard constraints on \((\alpha_1^{(i)}, \beta_1^{(i)})\) to ensure the resulting model is stationary.
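As a side note on stationarity: for a single GARCH(1,1) process, covariance stationarity requires \(\alpha_1 + \beta_1 < 1\), in which case the unconditional variance is \(\alpha_0 / (1 - \alpha_1 - \beta_1)\). The following illustrative check encodes this standard single-process condition (the Stan program enforces its own constraints, which this sketch does not reproduce):

```python
def garch_unconditional_variance(alpha0, alpha1, beta1):
    """Unconditional variance of a covariance-stationary GARCH(1,1) process.

    Illustrative check of the single-process condition alpha1 + beta1 < 1;
    not a reproduction of the constraints in the accompanying Stan code.
    """
    if alpha0 <= 0 or alpha1 < 0 or beta1 < 0:
        raise ValueError("require alpha0 > 0 and alpha1, beta1 >= 0")
    if alpha1 + beta1 >= 1:
        raise ValueError("non-stationary: alpha1 + beta1 must be < 1")
    return alpha0 / (1 - alpha1 - beta1)

# With hypothetical coefficients alpha0 = 1e-5, alpha1 = 0.1, beta1 = 0.85,
# the long-run variance is 1e-5 / 0.05 = 2e-4.
v = garch_unconditional_variance(1e-5, 0.1, 0.85)
```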

We fit this to S&P 500 data from the last 5 years:

```
msgarch_fit <- function(y) {
  rstan_options(auto_write = TRUE)
  options(mc.cores = parallel::detectCores())

  # Flatten the xts/zoo return series into a plain numeric vector
  y <- as.vector(coredata(y))

  stan.data <- list(
    T = length(y),
    y = y
  )

  stan(file = 'stan/hmm_garch.stan',
       data = stan.data, verbose = TRUE,
       iter = 1000, warmup = 500,
       thin = 1, chains = 1,
       cores = 1, seed = 900)
}

fit <- msgarch_fit(SPY.R)
```

Examining the fit, *e.g.* in `ShinyStan` (Stan Development Team 2017b), we see that Stan samples efficiently from this model, with all \(\hat{R}\)-statistics below \(1.1\) and high \(n_{\text{eff}}/n\) ratios. From here, we can perform posterior inference, *e.g.*, examining the posterior means of the conditional volatilities from the two GARCH processes, or the posterior probability of being in the low-volatility state on each date:

The model suggests that most dates are indeed in the low-volatility regime, as we would expect.^{4}

We thank the members of the Stan Development Team for developing `Stan` and for freely sharing their passion and expertise. In particular, we would like to thank Aaron Goodman, Ben Bales, and Bob Carpenter for their active participation in the Stan forum discussions on HMMs with constraints and on HHMMs. Although not strictly related to this work, those discussions proved very valuable for the present tutorial.

We acknowledge the Google Summer of Code (GSOC) program for funding. This tutorial builds on our GSOC project, *Bayesian Hierarchical Hidden Markov Models Applied to Financial Time Series*.


`## CXXFLAGS=-O3 -Wno-unused-variable -Wno-unused-function`

```
## Session info -------------------------------------------------------------
## setting  value
## version R version 3.4.1 (2017-06-30)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate Spanish_Argentina.1252
## tz America/Buenos_Aires
## date 2018-01-31
```

```
## Packages -----------------------------------------------------------------
## package      * version   date       source
## BH 1.65.0-1 2017-08-24 CRAN (R 3.4.1)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.1)
## dichromat 2.0-0 2013-01-24 CRAN (R 3.4.1)
## digest 0.6.12 2017-01-27 CRAN (R 3.4.1)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.1)
## graphics * 3.4.1 2017-06-30 local
## grDevices * 3.4.1 2017-06-30 local
## grid 3.4.1 2017-06-30 local
## gridExtra 2.3 2017-09-09 CRAN (R 3.4.1)
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.1)
## inline 0.3.14 2015-04-13 CRAN (R 3.4.1)
## labeling 0.3 2014-08-23 CRAN (R 3.4.1)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.1)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.1)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.1)
## MASS 7.3-47 2017-02-26 CRAN (R 3.4.1)
## Matrix 1.2-10 2017-05-03 CRAN (R 3.4.1)
## methods * 3.4.1 2017-06-30 local
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.1)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.1)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.1)
## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.1)
## Rcpp 0.12.12 2017-07-15 CRAN (R 3.4.1)
## RcppEigen 0.3.3.3.0 2017-05-01 CRAN (R 3.4.1)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.4.1)
## rlang 0.1.2 2017-08-09 CRAN (R 3.4.1)
## rstan * 2.16.2 2017-07-03 CRAN (R 3.4.1)
## scales 0.5.0 2017-08-24 CRAN (R 3.4.1)
## StanHeaders * 2.16.0-1 2017-07-03 CRAN (R 3.4.1)
## stats * 3.4.1 2017-06-30 local
## stats4 3.4.1 2017-06-30 local
## stringi 1.1.5 2017-04-07 CRAN (R 3.4.1)
## stringr 1.2.0 2017-02-18 CRAN (R 3.4.1)
## tibble 1.3.4 2017-08-22 CRAN (R 3.4.1)
## tools 3.4.1 2017-06-30 local
## utils * 3.4.1 2017-06-30 local
## viridisLite 0.2.0 2017-03-24 CRAN (R 3.4.1)
```

Ardia, David, Keven Bluteau, Kris Boudt, and Leopoldo Catania. 2017. “Forecast Risk with Markov-Switching GARCH Models: A Large-Scale Performance Study.” *SSRN Pre-Print*. doi:10.2139/ssrn.2918413.

Bakis, Raimo. 1976. “Continuous Speech Recognition via Centisecond Acoustic States.” *The Journal of the Acoustical Society of America* 59 (S1). ASA: S97–S97.

Baum, Leonard E., and Ted Petrie. 1966. “Statistical Inference for Probabilistic Functions of Finite State Markov Chains.” *The Annals of Mathematical Statistics* 37 (6): 1554–63.

Baum, Leonard E., and George Sell. 1968. “Growth Transformations for Functions on Manifolds.” *Pacific Journal of Mathematics* 27 (2). Mathematical Sciences Publishers: 211–27.

Baum, Leonard E. 1972. “An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes.” *Inequalities* 3: 1–8.

Baum, Leonard E., and J. A. Eagon. 1967. “An Inequality with Applications to Statistical Estimation for Probabilistic Functions of Markov Processes and to a Model for Ecology.” *Bulletin of the American Mathematical Society* 73 (3). American Mathematical Society (AMS): 360–64. doi:10.1090/s0002-9904-1967-11751-8.

Baum, Leonard E., Ted Petrie, George Soules, and Norman Weiss. 1970. “A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains.” *The Annals of Mathematical Statistics* 41 (1). Institute of Mathematical Statistics: 164–71. doi:10.1214/aoms/1177697196.

Bengio, Yoshua, and Paolo Frasconi. 1994. “An Input Output HMM Architecture.” In *Proceedings of the 7th International Conference on Neural Information Processing Systems (NIPS 1994)*, 427–34.

Betancourt, Michael. 2017. “Identifying Bayesian Mixture Models.” *Stan Case Studies* 4. http://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html.

Bollerslev, Tim. 1986. “Generalized Autoregressive Conditional Heteroskedasticity.” *Journal of Econometrics* 31 (3): 307–27. doi:10.1016/0304-4076(86)90063-1.

———. 2010. “Glossary to ARCH.” In *Volatility and Time Series Econometrics: Essays in Honor of Robert F. Engle*, edited by Tim Bollerslev, Jeffrey Russell, and Mark Watson. Advanced Texts in Econometrics. Oxford University Press. doi:10.1093/acprof:oso/9780199549498.003.0008.

Carpenter, Bob, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” *Journal of Statistical Software* 76 (1). doi:10.18637/jss.v076.i01.

Gelman, Andrew, and Donald B Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” *Statistical Science* 7 (4): 457–72. doi:10.1214/ss/1177011136.

Haas, Markus, Stefan Mittnik, and Marc S. Paolella. 2004a. “A New Approach to Markov-Switching GARCH Models.” *Journal of Financial Econometrics* 2 (1): 493–530. doi:10.1093/jjfinec/nbh020.

———. 2004b. “Mixed Normal Conditional Heteroskedasticity.” *Journal of Financial Econometrics* 2 (1): 211–50. doi:10.1093/jjfinec/nbh009.

Jelinek, Frederick. 1976. “Continuous Speech Recognition by Statistical Methods.” *Proceedings of the IEEE* 64 (4). IEEE: 532–56. doi:10.1109/PROC.1976.10159.

Jordan, Michael I. 2003. “An Introduction to Probabilistic Graphical Models.” In Preparation.

Murphy, Kevin P. 2012. *Machine Learning*. MIT Press Ltd.

Rabiner, Lawrence R. 1990. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.” In *Readings in Speech Recognition*, 267–96. Elsevier. doi:10.1016/b978-0-08-051584-7.50027-9.

Stan Development Team. 2017a. “RStan: The R Interface to Stan.” http://mc-stan.org/.

———. 2017b. “Shinystan: Interactive Visual and Numerical Diagnostics and Posterior Analysis for Bayesian Models.” http://mc-stan.org/.

———. 2017c. *Stan Modeling Language: User’s Guide and Reference Manual: Version 2.17.0.*

Viterbi, A. 1967. “Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm.” *IEEE Transactions on Information Theory* 13 (2). Institute of Electrical and Electronics Engineers (IEEE): 260–69. doi:10.1109/tit.1967.1054010.

Corresponding author: damiano.luis@gmail.com.↩

Both the discrete-time and discrete-state assumptions can be relaxed, though we do not pursue that direction in this paper.↩

The output can be univariate or multivariate depending on the choice of model specification, in which case an observation at a given time index \(t\) is a scalar \(y_t\) or a vector \(\mathbf{y}_t\), respectively. Although we introduce definitions and properties along an example based on a univariate series, we keep the bold notation as a reminder that the equations remain valid in the multivariate context as well.↩

This time window is not the best illustration of this model because the U.S. was in the middle of an extended post-crisis recovery. Still, we do see that the model correctly identifies a number of short-term reversals over this period.↩