Rating: 8.2/10.
Textbook on classical statistical methods for time series, mainly covering autoregressive, moving average, exponential smoothing, and regression-based models. The book is overall fairly easy to read while remaining mathematically rigorous, and it provides worked examples in several statistical packages (JMP, SAS, and R). The examples are quite detailed, maybe overly so: many pages in each chapter just display full datasets as tables of numbers, and I'm not sure what the point of this is. Although examples are given in R, the book does not use any modern time series packages and instead does everything manually, which is good for learning but less practical; for instance, it walks through the steps for identifying ARIMA models by hand rather than using more automated approaches. It also does not cover any machine learning methods for time series, only the classical statistical ones.
Chapter 1. Time series arise in many diverse applications and can exhibit autocorrelation (a point above the mean tends to be followed by more points above the mean), as well as seasonality and a shifting mean, which make the series non-stationary. There can also be atypical events due to either real occurrences or system glitches. The data analysis process often involves first retrieving the data from a warehouse and then cleaning it, eg by imputing missing values.
Chapter 2. A moving average is useful for smoothing data, and a Hanning filter is a weighted moving average with the greatest weight at the center. A moving median is effective at smoothing out outliers, as long as they span only a brief period of time. Stationarity means that the statistical properties of a time series do not change over time; strict stationarity means that the joint distribution of every window is unchanged if shifted by any amount, which implies that the mean and variance are constant. The autocovariance (or, when normalized, the autocorrelation function, ACF) at lag k measures the correlation between points k units apart; eg, if a large value tends to be followed by another large value, there is positive autocorrelation at k = 1. The sample ACF is estimated from the data and plotted against lag; if it does not drop to near zero after a few lags, the series is generally non-stationary. A variogram plots the variance of differences between points k units apart, divided by the variance of differences at lag 1; for a stationary series this stabilizes at a constant value, while if it keeps increasing it indicates non-stationarity.
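As a rough sketch (mine, not the book's, which works these examples by hand and in R/JMP/SAS output), the Hanning filter and sample ACF are easy to compute in Python with numpy and statsmodels on a made-up series:

```python
# Sketch: Hanning filter and sample ACF on a simulated series (toy data, illustrative only).
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))            # random walk: non-stationary

# Hanning filter: 3-point weighted average with weights 1/4, 1/2, 1/4
smoothed = np.convolve(y, [0.25, 0.5, 0.25], mode="valid")

# Sample ACF; for a non-stationary series it decays very slowly
# instead of dropping to near zero after a few lags.
print(acf(y, nlags=20))
```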
A power transform is useful if the data is skewed, and if there is a long-term trend, you might fit a regression model to capture the trend and then subtract it. Another approach is differencing, which can make a series stationary by taking the difference between successive terms to create a new series; this may be applied multiple times or with a seasonal lag. A decomposition approach is to fit a linear model to capture the trend and subtract it, although this may still leave other non-stationary behavior such as cyclic patterns.
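A minimal sketch of these transformations on made-up data (Box-Cox as the power transform, plus regular and seasonal differencing):

```python
# Sketch: power transform and differencing (toy data; not the book's examples).
import numpy as np
import pandas as pd
from scipy.stats import boxcox

rng = np.random.default_rng(1)
y = pd.Series(np.exp(np.linspace(0, 2, 120) + rng.normal(scale=0.1, size=120)))

y_bc, lam = boxcox(y)        # Box-Cox power transform (requires positive data)
d1 = y.diff().dropna()       # first difference to remove a trend
d12 = y.diff(12).dropna()    # seasonal difference at lag 12 (e.g. monthly data)
```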
For evaluation, a common method is the one-step-ahead forecast error, summarized by MSE or sometimes absolute or percentage error. The Box-Pierce test is useful for testing whether a time series is white noise (the autocorrelation function is zero, ie all observations are uncorrelated). The Ljung-Box test is similar but better for small samples. Ways of choosing between models include MSE and R^2 (which are equivalent) and AIC/BIC (which penalize model complexity while scoring accuracy). For monitoring a forecast model, it's common to set a control limit and raise an alert if the one-step-ahead error exceeds it, or to track the cumulative sum of errors and alert when it drifts past some limit, which indicates that small errors have accumulated in one direction and the model has become biased.
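The Box-Pierce and Ljung-Box statistics are a one-liner in statsmodels; a sketch on simulated forecast errors:

```python
# Sketch: testing whether one-step-ahead forecast errors look like white noise.
import numpy as np
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(2)
errors = rng.normal(size=200)   # stand-in for one-step-ahead forecast errors

# Returns both Ljung-Box and Box-Pierce statistics and p-values up to lag 10;
# small p-values suggest the errors are autocorrelated (model is inadequate).
print(acorr_ljungbox(errors, lags=[10], boxpierce=True))
```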
Chapter 3. This chapter begins with a review of least squares multiple linear regression and linear algebra. For hypothesis testing, the t-test can determine whether an individual regression coefficient is nonzero, while a partial F-test assesses whether a group of variables should be nonzero by comparing the residual sums of squares of a full model versus a reduced model. Confidence intervals (CI) may be derived for the mean response, and separately, prediction intervals (PI) for a new point to predict, with the PI being wider than the CI.
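A toy sketch of the CI-vs-PI distinction using statsmodels OLS (my own example, not the book's):

```python
# Sketch: confidence interval for the mean response vs. prediction interval for a new point.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=50)

fit = sm.OLS(y, sm.add_constant(x)).fit()
x_new = np.array([[1.0, 5.0]])                     # intercept column plus x = 5
frame = fit.get_prediction(x_new).summary_frame(alpha=0.05)
print(frame[["mean_ci_lower", "mean_ci_upper"]])   # CI for the mean response at x = 5
print(frame[["obs_ci_lower", "obs_ci_upper"]])     # PI for a new observation (wider)
```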
Residual analysis is useful for identifying deviations from model assumptions, such as the assumption of equal variance. Ordinary residuals are the simplest, whereas studentized residuals are useful for identifying outliers that have a disproportionate influence on the model. Determining which input variables to include can be done with stepwise selection, adding or removing one variable at a time and comparing model fit using AIC or BIC.
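A sketch of flagging outliers with studentized residuals via statsmodels' influence diagnostics (toy data, with an outlier injected at an arbitrary index):

```python
# Sketch: flagging potential outliers with externally studentized residuals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 60)
y = 1.0 + 0.8 * x + rng.normal(size=60)
y[30] += 8.0                                   # inject an outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()
studentized = fit.get_influence().resid_studentized_external
print(np.where(np.abs(studentized) > 3)[0])    # candidate outliers (should flag index 30)
```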
Generalized least squares (GLS) fits a model that allows the error variances to differ and possibly be correlated, with the error covariance specified as an n×n matrix. Weighted least squares (WLS) is the special case where the variances differ but are uncorrelated; it is "weighted" because each observation's weight is the reciprocal of its variance. If the weights are unknown, one procedure, iteratively reweighted least squares (IRLS), alternates between fitting a linear model to predict the variance and using that model to assign the weights.
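A WLS sketch where the weights are the reciprocals of known variances (made-up heteroscedastic data); full GLS would take an n×n covariance matrix instead:

```python
# Sketch: weighted least squares when the error variance grows with x (toy data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = np.linspace(1, 10, 80)
sigma = 0.3 * x                                        # heteroscedastic: sd grows with x
y = 1.0 + 2.0 * x + rng.normal(scale=sigma)

X = sm.add_constant(x)
wls_fit = sm.WLS(y, X, weights=1.0 / sigma**2).fit()   # weight = reciprocal of variance
print(wls_fit.params)

# Full GLS would instead take an n x n error covariance matrix: sm.GLS(y, X, sigma=Sigma)
```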
Discounted least squares is a form of weighted least squares (WLS) in which more recent points are weighted more heavily than older points. The method fits a linear regression and has an updating formula that derives the new fit from the previous fit without refitting the entire model. Trigonometric functions may be used as variables, in addition to the linear time-step variable, to capture cyclic patterns. Linear models assume uncorrelated residuals, but time series often exhibit autocorrelation.
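A sketch of the trigonometric-terms idea, regressing on a linear trend plus sine/cosine columns with an assumed period of 12:

```python
# Sketch: regression on a linear trend plus sine/cosine terms to capture a cycle of period 12.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
t = np.arange(120)
y = 0.05 * t + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(size=120)

X = np.column_stack([t, np.sin(2 * np.pi * t / 12), np.cos(2 * np.pi * t / 12)])
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)    # intercept, trend, and the two harmonic coefficients
```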
The Durbin-Watson test detects whether the sequence of residuals exhibits autocorrelation; if it does, the usual least squares inference is no longer valid (the standard errors are biased and the fit is inefficient). The Cochrane-Orcutt method can be used when the residuals exhibit autocorrelation: it estimates the autocorrelation parameter (phi) and transforms the dependent variable y to remove the autocorrelation before fitting the linear model. An alternative is to use maximum likelihood estimation (MLE) to jointly estimate phi and the regression coefficients (beta), which results in a nonlinear model that requires iterative numerical methods to fit.
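A sketch of both steps: the Durbin-Watson statistic on OLS residuals, then statsmodels' GLSAR, which does an iterative feasible-GLS fit in the same spirit as Cochrane-Orcutt (my choice of tool, not the book's):

```python
# Sketch: detecting autocorrelated residuals and refitting with an AR(1) error model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)
t = np.arange(200)
e = np.zeros(200)
for i in range(1, 200):                    # AR(1) errors with phi = 0.7
    e[i] = 0.7 * e[i - 1] + rng.normal()
y = 1.0 + 0.1 * t + e

X = sm.add_constant(t)
ols_fit = sm.OLS(y, X).fit()
print(durbin_watson(ols_fit.resid))        # well below 2 => positive autocorrelation

# Iterative estimation of the AR(1) parameter and regression coefficients,
# similar in spirit to Cochrane-Orcutt.
glsar_fit = sm.GLSAR(y, X, rho=1).iterative_fit(maxiter=10)
print(glsar_fit.params)
```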
A prediction interval for time series is not just the standard linear regression PI; it also needs to account for the autocorrelation parameter phi, which can be estimated to provide the PI at future points an arbitrary distance into the future. In cases where our model relies on other variables that are also unknown and need to be forecasted, we first determine the PI for the x variables and then combine it with the variance of the regression y given x to derive the PI for the y variable.
Chapter 4. Smoothing is a family of techniques that use averages to reduce noise and reveal the underlying signal. The simple moving average is the most basic method; it takes the average of the last n points, but when the underlying process changes, the average takes some time to catch up. The idea of exponential smoothing is to take an average with exponentially decaying weights on past values, so that recent values count more heavily and the smoother reacts more quickly to changes; you need to choose a smoothing constant (discount factor), usually between 0.1 and 0.4. Exponential smoothing consistently underpredicts if the trend is increasing, such as when predicting the Dow Jones index. One solution is second-order exponential smoothing, which applies exponential smoothing to the first-order smoothed values; the two can be combined so that the prediction is unbiased.
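A sketch with statsmodels' SimpleExpSmoothing, where smoothing_level plays the role of the smoothing constant discussed here:

```python
# Sketch: first-order (simple) exponential smoothing with a fixed smoothing constant.
import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

rng = np.random.default_rng(8)
y = 10 + np.cumsum(rng.normal(scale=0.2, size=100))   # slowly drifting level

fit = SimpleExpSmoothing(y).fit(smoothing_level=0.2, optimized=False)
print(fit.forecast(5))   # flat forecast: it lags behind if there is a real trend
```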
When forecasting with exponential smoothing, we can assume a constant or a linear model and update the model parameters based on an exponentially weighted combination of the current observation and the previous parameter values. From the exponential smoothing equations, the parameters at the previous timestep can be updated with the most recent value to obtain the current parameters of a linear model. The discount factor (lambda) may be chosen by minimizing the MSE of one-step-ahead predictions; alternatively it may be adjusted dynamically based on the prediction error, which is the Trigg-Leach smoother. When calculating a PI, you must be careful that the model assumptions are valid, or the PI may not be realistic; eg, for the constant model, the PI does not get wider when predicting further into the future.
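A sketch of picking the smoothing constant by one-step-ahead MSE with a brute-force grid (my own illustration):

```python
# Sketch: pick the smoothing constant that minimizes one-step-ahead squared error.
import numpy as np
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

rng = np.random.default_rng(9)
y = 10 + np.cumsum(rng.normal(scale=0.2, size=100))

# fittedvalues are the one-step-ahead fitted values, so this is an in-sample MSE grid search.
best = min(
    (np.mean((SimpleExpSmoothing(y).fit(smoothing_level=a, optimized=False)
              .fittedvalues - y) ** 2), a)
    for a in np.arange(0.05, 0.95, 0.05)
)
print(best)   # (one-step-ahead MSE, smoothing constant)
```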
The Holt-Winters method is an exponential smoothing method that accounts for seasonality where the period is known, using a linear model with a seasonality component. The parameters for both linear and seasonality components are updated at every step using exponential smoothing. There is also a multiplicative seasonality variant where seasonality is assumed to have a multiplicative rather than additive effect.
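A Holt-Winters sketch on made-up seasonal data; swapping seasonal="add" for seasonal="mul" gives the multiplicative variant:

```python
# Sketch: Holt-Winters with additive trend and additive seasonality (period 12).
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(10)
t = np.arange(144)
y = 50 + 0.3 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(size=144)

fit = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()
print(fit.forecast(12))   # one full season ahead
```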
Chapter 5. Weak stationarity requires only that the mean be constant and the covariance between points depend only on the lag, not on time; this is weaker than strict stationarity, which constrains the entire joint distribution rather than just the first two moments. Wold's decomposition theorem states that a weakly stationary series can be written as its mean plus an infinite weighted sum of uncorrelated white-noise disturbances (epsilon), with weights usually denoted psi, ie y_t = mu + sum_j psi_j * epsilon_(t-j); it can be thought of as a linear filter applied to white noise. A finite-order moving average process, denoted MA(q), is stationary and has autocorrelation only at short lags: the ACF cuts off to (near) zero after lag q, since only the past q noise terms affect the current value.
AR(1) is a first-order autoregressive process in which the value of y at a timestep depends on the value at the previous step, with coefficient phi, plus noise. The process is stationary as long as |phi| < 1. However, unlike MA(1), the autocorrelation lingers for more than one lag. AR(2) is a process in which y depends on a linear combination of the previous two timesteps, and depending on the coefficients it may or may not be stationary. The Yule-Walker equations give the autocorrelation function, which may be a mixture of exponentially decaying terms or a damped sinusoid, depending on the roots of the characteristic quadratic equation formed from the two autoregressive coefficients. In general, AR(p) is a pth-order autoregressive model with more complex behavior than AR(2).
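A sketch of the AR(2) case: checking stationarity of assumed coefficients and recovering them from simulated data via the Yule-Walker equations:

```python
# Sketch: simulate an AR(2), check stationarity, and estimate phi via Yule-Walker.
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.regression.linear_model import yule_walker

phi1, phi2 = 0.6, 0.2
process = ArmaProcess(ar=[1, -phi1, -phi2], ma=[1])   # AR lag polynomial 1 - phi1*B - phi2*B^2
print(process.isstationary)                           # True for these coefficients

np.random.seed(11)
y = process.generate_sample(nsample=1000)
rho, sigma = yule_walker(y, order=2)
print(rho)                                            # should be close to (0.6, 0.2)
```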
The partial autocorrelation function (PACF) at lag k is the correlation between points k apart after adjusting for the values at the lags in between. An AR(p) model has a PACF that is significant for only the first p lags and zero afterwards, whereas an MA(q) model's PACF does not cut off after q lags but instead decays slowly. The book therefore recommends checking both the ACF and the PACF to determine which model fits best. A finite AR(p) can be written as an infinite MA process, and an invertible MA(q) as an infinite AR process.
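The identification heuristic in miniature: for a simulated AR(2) the PACF cuts off after lag 2 while its ACF decays, and for a simulated MA(2) the ACF cuts off instead (a sketch, coefficients chosen arbitrarily):

```python
# Sketch: PACF cuts off for an AR process; ACF cuts off for an MA process.
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.stattools import acf, pacf

np.random.seed(12)
ar2 = ArmaProcess(ar=[1, -0.6, -0.2], ma=[1]).generate_sample(nsample=2000)
ma2 = ArmaProcess(ar=[1], ma=[1, 0.7, 0.4]).generate_sample(nsample=2000)

print(np.round(pacf(ar2, nlags=5), 2))   # roughly zero beyond lag 2
print(np.round(acf(ma2, nlags=5), 2))    # roughly zero beyond lag 2
```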
The ARMA(p, q) model is a combination of the AR(p) and MA(q) components, and both the ACF and the PACF will exhibit exponential decay; it is stationary if the AR component is stationary. The ARIMA (autoregressive integrated moving average) model, denoted as ARIMA(p, d, q), involves taking the d-th difference of an ARMA(p, q) process to achieve stationarity, where typically d is at most 1 or 2, eg: ARIMA(0, 1, 0) is a random walk, which is non-stationary, but taking the first difference yields white noise.
To identify ARIMA models, first examine stationarity to decide whether differencing is needed, then look at the ACF and PACF to determine which combination of AR and MA components is present. Next, fit the model using a nonlinear optimization package, and test the adequacy of the model by examining the residuals, which should not exhibit autocorrelation as judged by the Box-Pierce test or a similar test. Forecasting can also be done with these models, and formulas give prediction intervals.
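A sketch of the fit-and-check step using statsmodels' automated fitting (the book does the identification by hand instead):

```python
# Sketch: fit a candidate ARIMA model and check that the residuals look like white noise.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

np.random.seed(13)
y = np.cumsum(np.random.normal(size=300))            # random walk: differencing needed

res = ARIMA(y, order=(0, 1, 1)).fit()
print(res.aic)                                       # for comparing candidate models
print(acorr_ljungbox(res.resid, lags=[10]))          # residuals should show no autocorrelation
print(res.get_forecast(steps=5).conf_int())          # forecasts with prediction intervals
```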
For ARIMA models with seasonality, the process can be decomposed into a regular and a seasonal component at a given lag, where the seasonal part is itself an ARIMA(p, d, q) model. Choosing among several plausible ARIMA models involves a combination of examining the ACF and PACF, comparing the AIC and MSE of different fits, looking at the residuals, and applying domain knowledge (eg: whether the process should have a constant mean), so it is a somewhat subjective process.
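For the seasonal case, a sketch with SARIMAX, which takes a separate seasonal order (the orders here are arbitrary choices for illustration):

```python
# Sketch: a seasonal ARIMA, e.g. (1,0,0) x (0,1,1) with period 12.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

np.random.seed(14)
t = np.arange(240)
y = 10 * np.sin(2 * np.pi * t / 12) + np.random.normal(size=240)

res = SARIMAX(y, order=(1, 0, 0), seasonal_order=(0, 1, 1, 12)).fit(disp=False)
print(res.aic)
print(res.forecast(steps=12))
```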
Chapter 6. Up until this chapter, we considered exclusively univariate time series; transfer functions allow us to incorporate data from other time series. A transfer function model relates a time series y to a weighted sum of a different series x, where the transfer function v provides a linear combination of past x values used to construct the current y (x is called the input and y the response). The transfer function includes AR and MA terms, as well as a noise term that is itself an ARIMA model, capturing the remaining autocorrelation in y not explained by x. There may also be a delay term specifying how many steps pass before x affects y.
The cross-correlation function at lag j is analogous to the autocorrelation at lag j, but it measures the correlation between two series rather than a series with itself, and it can be estimated from the sample. When fitting a model, the first recommended step is pre-whitening, which involves fitting an ARIMA model to the x series; this avoids problems that arise when x has too much autocorrelation. Pre-whitening transforms x to resemble white noise and then applies the same filter to the y series. Fitting a transfer function model is a multi-step process: examine the ACF, PACF, and residuals to determine the order of the AR and MA terms and the number of delay steps for the pre-whitening model, then repeat the process for the transfer function and again for the noise component. Finally, the adequacy of the model is checked to ensure the assumptions are met.
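A rough sketch of pre-whitening and the sample cross-correlation, on toy data where y is assumed to respond to x with a two-step delay; note that in statsmodels' ccf the element at index k estimates the correlation between the first series led by k steps and the second series:

```python
# Sketch: pre-whiten x with an AR filter, apply the same filter to y, then compute the CCF.
import numpy as np
from scipy.signal import lfilter
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.stattools import ccf

np.random.seed(15)
n = 500
x = lfilter([1], [1, -0.7], np.random.normal(size=n))              # autocorrelated input
y = np.r_[np.zeros(2), 2.0 * x[:-2]] + np.random.normal(size=n)    # y responds to x with delay 2

ar = AutoReg(x, lags=1, trend="n").fit()        # AR filter for pre-whitening
filt = np.r_[1, -ar.params]                     # (1 - phi*B)
x_pw, y_pw = lfilter(filt, [1], x), lfilter(filt, [1], y)

# ccf(y_pw, x_pw)[k] estimates Corr(y[t+k], x[t]); a spike near index 2 reveals the delay.
print(np.round(ccf(y_pw, x_pw)[:6], 2))
```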
It is possible to extend this method to multiple input series affecting the response y, and formulas are provided for forecasting with this model and generating prediction intervals. Interventions can be modeled similarly to input series: model the intervention as a pulse input at a given time if the effects are expected to be temporary, or as a step input (1 at all steps after a point) if the effects are expected to be long-lasting; the transfer function can then incorporate AR and MA terms for the intervention.
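A simplified sketch of intervention analysis: a step indicator as an exogenous regressor in SARIMAX (a simplification of the book's transfer-function treatment, with a made-up intervention time):

```python
# Sketch: model a permanent level shift at a known time with a step indicator.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

np.random.seed(16)
n, t0 = 200, 120
step = (np.arange(n) >= t0).astype(float)            # 0 before the intervention, 1 after
y = 5.0 + 3.0 * step + np.random.normal(size=n)      # level shifts up by 3 at t0

res = SARIMAX(y, exog=step.reshape(-1, 1), order=(1, 0, 0)).fit(disp=False)
print(res.params)    # the exog coefficient estimates the size of the shift (~3)
```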
Chapter 7. This chapter gives a quick survey of other forecasting methods. ARMA models generalize to multiple time series that are jointly stationary, with the scalar coefficients in the formulas replaced by matrices of coefficients. Since model identification becomes complex, the MA part is often dropped, leaving just the vector autoregressive (VAR) model.
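A VAR sketch on two coupled made-up series, with the lag order chosen by AIC:

```python
# Sketch: fit a VAR to two related series and forecast both jointly.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

np.random.seed(17)
n = 300
e = np.random.normal(size=(n, 2))
data = np.zeros((n, 2))
for t in range(1, n):                                 # two coupled AR(1)-style series
    data[t, 0] = 0.6 * data[t - 1, 0] + 0.2 * data[t - 1, 1] + e[t, 0]
    data[t, 1] = 0.3 * data[t - 1, 0] + 0.5 * data[t - 1, 1] + e[t, 1]

df = pd.DataFrame(data, columns=["y1", "y2"])
res = VAR(df).fit(maxlags=5, ic="aic")                # coefficients are now matrices
print(res.forecast(df.values[-res.k_ar:], steps=5))
```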
In state-space models, each step has a hidden state vector that undergoes a linear transformation with noise to obtain the next state vector, and a different linear transformation of the state vector (plus noise) produces the observed data y. This generalizes many models, including ARIMA and exponential smoothing, so having a unified framework for them is both theoretically and practically useful. The autoregressive conditional heteroscedastic (ARCH) model deals with processes whose variance changes over time: instead of assuming constant variance, the conditional variance itself follows an autoregressive process.
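A sketch of the state-space view: statsmodels' UnobservedComponents with a "local level" specification, where the hidden state is a random-walk level and y observes it with noise (for ARCH/GARCH variance models, the separate arch package is the usual tool):

```python
# Sketch: a local level state-space model (hidden level evolves with noise; y observes it with noise).
import numpy as np
import statsmodels.api as sm

np.random.seed(18)
level = np.cumsum(np.random.normal(scale=0.1, size=300))   # hidden state (random-walk level)
y = level + np.random.normal(scale=0.5, size=300)          # observation = state + noise

res = sm.tsa.UnobservedComponents(y, level="local level").fit(disp=False)
print(res.params)                  # estimated observation and state noise variances
print(res.smoothed_state[0, :5])   # Kalman-smoothed estimate of the hidden level
```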
Direct forecasting of percentiles: instead of providing prediction intervals, model the probability of each percentile bucket and use exponential smoothing to forecast the percentile distribution. Multiple forecasts can be combined using a linear combination whose weights are chosen, via a formula involving the variances and correlations of the forecast errors, to minimize the variance of the result; this is most effective when the forecast errors are negatively correlated.
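A tiny worked sketch of combining two unbiased forecasts with variance-minimizing weights (a Bates-Granger-style formula; the numbers are made up):

```python
# Sketch: optimal linear combination of two unbiased forecasts.
import numpy as np

v1, v2 = 4.0, 9.0            # error variances of the two forecasts
rho = -0.3                   # correlation between the forecast errors
c = rho * np.sqrt(v1 * v2)   # covariance

w = (v2 - c) / (v1 + v2 - 2 * c)                   # weight on forecast 1
combined_var = (v1 * v2 - c**2) / (v1 + v2 - 2 * c)
print(w, combined_var)       # combined variance falls below min(v1, v2), helped by negative correlation
```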
Spectral analysis uses Fourier representation of a time series to analyze it in the frequency domain, which may be useful as a complement to ACF and PACF analysis since it can identify seasonality. Finally, Bayesian methods update the prior distribution of model parameters with new data to obtain a posterior distribution. However, this book only briefly covers very simple Bayesian methods for a constant time series model.
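A spectral sketch: the periodogram of a noisy cycle peaks at the underlying frequency (period 12 assumed here):

```python
# Sketch: the periodogram of a noisy cycle peaks at the underlying frequency (1/12 here).
import numpy as np
from scipy.signal import periodogram

np.random.seed(19)
t = np.arange(600)
y = np.sin(2 * np.pi * t / 12) + np.random.normal(scale=0.5, size=600)

freqs, power = periodogram(y)
print(freqs[np.argmax(power)])   # close to 1/12, i.e. about 0.083
```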