'I think I should understand that better,' Alice said very politely, 'if I had it written down.'
    Through the Looking Glass, Lewis Carroll.

6. RECURSIVE BAYESIAN ESTIMATION, SEVERAL DEPENDENT AND SEVERAL EXPLANATORY VARIABLES

The need for several dependent variables in a single model can arise when there is some interrelationship between them; for example, when one of the explanatory variables affects two dependent variables in the same way. This could happen in wool consumption:

    log(NDC)t = B1 + B2 . log(PCE)t + B3 . log(PW/PS)t-1
    log(TFC)t = B4 + B2 . log(PCE)t

where NDC   = wool consumption
      PCE   = personal consumption expenditure
      PW/PS = price of wool / price of synthetic fibres
      TFC   = total fibre consumption

The assumption is that the income elasticity of wool consumption is the same as the income elasticity of total fibre consumption; the justification for this assumption is given in Solomon, 1980.

At this point in the paper, the Kalman filter is introduced. It will subsequently be shown that recursive Bayesian estimation and Kalman filtering are the same thing; the only difference is the derivation.

6.1 THE KALMAN FILTER

Suppose a vector Yt is related stochastically to a matrix of explanatory variables Xt by a vector of parameters Bt:

OBSERVATION MODEL   Yt = Xt . Bt + vt      where vt is N(0,Vt)     eq. 6.1.1

Suppose that the parameters are not constant, but change in a way described by:

PARAMETER MODEL     Bt = H . Bt-1 + wt     where wt is N(0,Wt)     eq. 6.1.2

where Yt is an m by 1 vector of observations made at time t
      Xt is an m by n matrix of explanatory variables, known at time t
      Bt is an n by 1 vector of parameters at time t
      H  is an n by n matrix describing the relationship of the parameters at time t
         to the parameters at time t-1
      vt, wt are independent random normal vectors, m by 1 and n by 1 respectively,
         with zero mean and variances (known at time t) Vt, Wt

Write the sequence of values of a variable Y1, Y2, ..., Yt as Datat(Y). Clearly, Datat(Y) = [Datat-1(Y), Yt].

Suppose Bt-1 is distributed normally:

    [Bt-1 GIVEN Datat-1(Y), Datat-1(X)] is N(MEt-1, CEt-1)

Then [Bt GIVEN Datat(Y), Datat(X)] is N(MEt, CEt), where MEt and CEt are calculated from the following.

Prediction of Bt given Datat-1(X), Datat-1(Y); Bt is N(MFt, CFt), where

    MFt = H . MEt-1                     eq. 6.1.3
    CFt = H . CEt-1 . HT + Wt           eq. 6.1.4

(From 6.1.2, using the additive property of means and normal variances.)

Prediction of Yt given Datat(X), Datat-1(Y); Yt is N(YFt, Var(YFt)), where

    YFt = Xt . MFt                      eq. 6.1.5
    Var(YFt) = Xt . CFt . XTt + Vt      eq. 6.1.6

(From 6.1.1, using the additive property of means and normal variances.)

Definition of Gt; define Gt by

    Gt = CFt . XTt . (Var(YFt))-1       eq. 6.1.7

Gt is called the gain of the filter.

Correction of Bt given Datat(X), Datat(Y); Bt is N(MEt, CEt), where

    MEt = MFt + Gt . (Yt - YFt)         eq. 6.1.8
    CEt = CFt - Gt . Xt . CFt           eq. 6.1.9

This beautifully elegant result was first obtained by Kalman (1960). A proof of this result can be found in Harrison and Stevens, 1975a.

6.2 AN EXPLANATION OF HOW THE KALMAN FILTER WORKS

It may be thought unnecessary to explain "how the Kalman filter works", as equations 6.1.1 to 6.1.9 plus the proof cited above could be regarded as such an explanation. It is valuable, however, to give an explanation of the Kalman filter which is more intuitively appealing (although this explanation will not be a rigorous proof), so as to gain an intuitive understanding of the process.
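The whole recursion of equations 6.1.3 to 6.1.9 can also be summarized compactly in code. The following is a minimal sketch in modern array notation (numpy); the function name and array shapes are illustrative only, and the variable names follow the symbols of section 6.1.

```python
import numpy as np

def kalman_step(ME_prev, CE_prev, H, W, X, V, Y):
    """One prediction-correction cycle: equations 6.1.3 to 6.1.9."""
    MF = H @ ME_prev                      # eq. 6.1.3: prediction of B_t
    CF = H @ CE_prev @ H.T + W            # eq. 6.1.4: its variance
    YF = X @ MF                           # eq. 6.1.5: prediction of Y_t
    var_YF = X @ CF @ X.T + V             # eq. 6.1.6: its variance
    G = CF @ X.T @ np.linalg.inv(var_YF)  # eq. 6.1.7: the gain
    ME = MF + G @ (Y - YF)                # eq. 6.1.8: corrected mean
    CE = CF - G @ X @ CF                  # eq. 6.1.9: corrected variance
    return ME, CE
```

Starting from the prior N(MEt-1, CEt-1), one call digests the observation Yt and returns the posterior N(MEt, CEt).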
To facilitate this intuitive understanding we shall treat the case of one dependent variable, with H as the identity matrix. The figure on the next page shows the Kalman filter diagrammatically.

The model to be estimated is written

    Yt = Xt . Bt + vt

where Y is the dependent variable, X the vector of explanatory variables, and B the vector of parameters. B is allowed to change with time, both systematically (Bt = H . Bt-1) and stochastically (Bt = Bt-1 + wt), so that

    Bt = H . Bt-1 + wt

We are considering the case where H = I, so

    Bt = Bt-1 + wt

The problem is to estimate Bt. Suppose we have an estimate for Bt-1, with mean MEt-1 and variance CEt-1.

First, we predict Bt; the prediction has mean MFt and variance CFt, where

    MFt = MEt-1          (equation 6.1.3 with H = I)
    CFt = CEt-1 + Wt     (equation 6.1.4 with H = I)

Note that Wt is added to the variance of the parameter; in effect, the precision with which B is known is degraded by the passing of time. These equations follow from the elementary properties of multivariate normal distributions.

Secondly, we forecast Yt as a random variable with mean

    YFt = Xt . MFt

and variance

    Var(YFt) = Xt . CFt . XTt + Vt

This follows from the fact that Yt = Xt . Bt + vt, Bt is N(MFt, CFt), and vt is N(0,Vt).

Finally, we correct our forecast of Bt, using the error in forecasting Y as our guide. Bt is N(MEt, CEt), where

    MEt = MFt + Gt . (Yt - YFt)

where Gt is a matrix called the gain. The (Yt - YFt) term expresses the fact that the size of the correction should be proportional to the size of the error, the factor of proportionality being Gt.

    CEt = CFt - Gt . Xt . CFt
        = (I - Gt . Xt) . CFt

The additional information that the latest observation gives us enables us to reduce the variance attached to B by a factor proportional to X; the proportionality is again represented by the gain matrix Gt. Gt is given by

    Gt = CFt . XTt . (Var(YFt))-1
This expresses the idea that the correction to B should be larger if the variance attached to the previous estimate of B was large, and smaller if the forecast of the dependent variable is uncertain. Gt can roughly be interpreted as the importance of the fact that the forecast of Yt is different from the observed value of Yt.

6.3 THE RELATIONSHIP OF THE KALMAN FILTER TO RECURSIVE BAYESIAN ESTIMATION

The procedure that we have called recursive Bayesian estimation is defined by the three equations 4.3.9, 4.3.10 and 4.3.11, repeated below for convenience.

    Gn = Sn-1 . XOTn . (XOn . Sn-1 . XOTn + 1)-1        eq. 4.3.9
    BEn = BEn-1 - Gn . (XOn . BEn-1 - YOn)              eq. 4.3.10
    Prec(BEn) = Prec(BEn-1) + XOTn . XOn / s2           eq. 4.3.11

Consider a Kalman filter system in which H = I, the identity, and W = 0 (which means that the parameters are assumed not to change, and our information about the parameters does not degrade with time), m = 1 (only one dependent variable) and V = s2. Then equations 6.1.3 and 6.1.4 reduce to

    MFt = MEt-1
    CFt = CEt-1

Equations 6.1.5 and 6.1.6 reduce to

    YFt = Xt . MEt-1
    Var(YFt) = Xt . CEt-1 . XTt + s2

And so equation 6.1.7 becomes

    Gt = CEt-1 . XTt . (Xt . CEt-1 . XTt + s2)-1
       = (CEt-1/s2) . XTt . [Xt . (CEt-1/s2) . XTt + 1]-1

and 6.1.9 becomes

    CEt = CEt-1 - Gt . Xt . CEt-1
    CEt/s2 = CEt-1/s2 - (CEt-1/s2) . XTt . [Xt . (CEt-1/s2) . XTt + 1]-1 . Xt . (CEt-1/s2)

We now invoke the matrix inversion lemma (quoted in 4.2 above), writing CEt-1/s2 for B-1, XTt for A and 1 for C:

    CEt-1/s2 - (CEt-1/s2) . XTt . [Xt . (CEt-1/s2) . XTt + 1]-1 . Xt . (CEt-1/s2) = [(CEt-1/s2)-1 + XTt . Xt]-1

so

    (CEt/s2)-1 = (CEt-1/s2)-1 + XTt . Xt
    (CEt)-1 = (CEt-1)-1 + XTt . Xt / s2

Equation 6.1.8 is

    MEt = MEt-1 + Gt . (Yt - YFt)
        = MEt-1 - Gt . (Xt . MEt-1 - Yt)

Thus the updating equations for the Kalman filter, equations 6.1.7, 6.1.8 and 6.1.9, can be written

    Gt = (CEt-1/s2) . XTt . [Xt . (CEt-1/s2) . XTt + 1]-1    eq. 6.3.1
    MEt = MEt-1 - Gt . (Xt . MEt-1 - Yt)                     eq. 6.3.2
    (CEt)-1 = (CEt-1)-1 + XTt . Xt / s2                      eq. 6.3.3

The equivalence between 4.3.9, 4.3.10 and 4.3.11 on the one hand, and 6.3.1, 6.3.2 and 6.3.3 on the other, is clear, and the equivalences of the terms are shown below.

Table 7  COMPARISON OF TERMS

    4.3.9 - 4.3.11      6.3.1 - 6.3.3
    n                   t
    XOn, YOn            Xt, Yt
    Sn-1                CEt-1/s2
    BEn                 MEt
    Prec(BEn)           (CEt)-1

(C)-1, called the precision in the univariate case, is called the information matrix (Jazwinski, 1970, p.231) in the multivariate case. To avoid confusion, I shall continue to use the term "precision".

Thus, the Kalman filter is a generalization of recursive Bayesian estimation, in which the parameters are allowed to change systematically and stochastically, and in which more than one dependent variable is possible. It is also clear that the Kalman filter is a generalization of ordinary least squares (OLS), as the equivalence between OLS and recursive Bayesian estimation was shown in 3.4. Ho and Lee (1964) give a derivation of the Kalman filter directly from Bayes' theorem.

From now on, the terms "recursive Bayesian estimation" and "Kalman filtering" will be used as equivalents, the choice of term being governed by context. The notation used from now on will mainly be that of the Kalman filter as used in this chapter.

6.4 FORECASTING

The equations 6.1.3 to 6.1.9 can be used for forecasting Y after the last observation of Y. The forecasting procedure consists simply of the following.

1. Forecast Bt by the fact that Bt is N(MFt, CFt), where

    MFt = H . MEt-1
    CFt = H . CEt-1 . HT + Wt

2. Forecast Yt by the fact that Yt is N(YFt, Var(YFt)), where

    YFt = Xt . MFt
    Var(YFt) = Xt . CFt . XTt + Vt

3. As there is no more information about the parameters B (because there are no more observations),

    MEt = MFt
    CEt = CFt

This assumes that X is known precisely. The case where X is a random variable is treated by Harrison and Stevens, 1976, page 213.
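The three steps above can be sketched in code. This is a minimal illustration, assuming H, W and V constant over the forecast horizon and future X known precisely; after the last observation there is no correction step, so ME = MF and CE = CF at each step. The function name and shapes are illustrative, not from the text.

```python
import numpy as np

def forecast(ME, CE, H, W, X_future, V):
    """Forecast mean and variance of Y for each future m-by-n matrix X."""
    means, variances = [], []
    for X in X_future:
        ME = H @ ME                          # eq. 6.1.3, then step 3: MEt = MFt
        CE = H @ CE @ H.T + W                # eq. 6.1.4, then step 3: CEt = CFt
        means.append(X @ ME)                 # eq. 6.1.5
        variances.append(X @ CE @ X.T + V)   # eq. 6.1.6
    return means, variances
```

Note that the forecast variance of Y grows with the horizon, since W is added to CE at every step.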
6.5 THE RELEVANCE OF V AND W

6.5.1 V

As in multiple regression, the vt are residuals, caused by measurement errors in Y and by errors in the specification of the relationship between X and Y (for example, omitted variables or a wrong functional form). The residuals have zero mean, and their variance is denoted by Vt. Note that the variance need not be assumed constant, merely known at time t.

Vt could be estimated by conventional econometric techniques (as suggested by Athans, 1974). This would involve analysing the same data twice. This initial estimation of Vt would require a change in the model (i.e. setting W to zero and H to the identity), and the estimated Vt from this modified model could have any relationship at all to the Vt of the original model. Harrison and Stevens (1971) show that V can be in error by a factor of four either way with very little impact on forecasting performance (8% worse root mean square error for 5-step-ahead forecasts). Harrison and Stevens (1976) suggest that V can be estimated as part of the procedure, by specifying a small number of discrete values V1, V2, ..., Vn, assigning prior probabilities to these, and updating the probabilities as the time series is analysed.

The method used in this paper is to concentrate on the measurement errors in Y, which can be estimated without pre-analysing the data, in two possible ways. The first is a consideration of how the data are prepared (for example, if the data come from consumer surveys, the measurement error depends on the sample size in a well-defined way). The second comes from a consideration of alternative estimates of the same quantities (for example, six different estimates of wool consumption in Belgium in 1970 are available, made at various times by various people using various methods, ranging from 16.0 kt to 20.3 kt). The other component of V, specification error, is small for a well-specified model (by definition).
For example, if we consider Belgium again, Solomon (1980), p. 61, gives the standard error of the estimates as 10.1% (in a log-log model, the standard errors of the estimates are in log form). Compare this with a standard error of measurement of 9%, and it can be seen that there is very little specification error compared to measurement error. This paper will therefore concentrate on estimating the measurement error at the expense of the specification error.

6.5.2 W

The significance of W is much harder to understand. There is no counterpart in conventional econometric methods, as W expresses a stochastic variation in the parameters, whereas conventional econometric models have constant parameters. There are some exceptions to this (for example Hildreth and Houck, 1968, Rouhaianen, 1978, and Griffiths et al., 1979).

The simplest interpretation of W comes directly from equation 6.1.4: it is the increase in the variance attached to the forecasts of the parameters, on account of the fact that the parameters at time t may not be exactly the same as at time t-1. A simple illustration will demonstrate this.

Suppose we have a one-parameter model, and H is the identity. Suppose at time t-1 the estimate of this parameter is distributed normally; Bt-1 is N(1.0, 0.01). Thus the best estimate for Bt-1 is 1, and the standard error of Bt-1 is 0.1. This can be expressed by saying that there is a 0.95 probability that the true value of B lies between 0.8 and 1.2.

Suppose that Wt is 0.0. Then Bt is also N(1, 0.01), and the best forecast for B at time t is also 1, with a 0.95 probability that the true value lies between 0.8 and 1.2. But now suppose that Wt is 0.01. Then Bt is N(1, 0.02), and the best forecast for B at time t is still 1, but the 0.95 limits are now 0.72 and 1.28. So Wt describes how information about B is made irrelevant by the passage of time; how information about B degrades or decays with time.
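This widening of the limits can be computed directly. A minimal sketch (the limits here are mean plus or minus 2 standard deviations, which is the convention the quoted 0.95 intervals follow; the function name is illustrative):

```python
import math

def limits(W, prior_var=0.01, mean=1.0):
    """95% limits for Bt when Bt-1 is N(mean, prior_var) and H = I."""
    sd = math.sqrt(prior_var + W)                 # eq. 6.1.4: CF = CE + W
    return (round(mean - 2 * sd, 2), round(mean + 2 * sd, 2))
```

For example, limits(0.0) gives (0.8, 1.2) and limits(0.01) gives (0.72, 1.28), matching the figures above; one or two entries in the table below differ slightly through rounding.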
A small value for Wt means that information is degraded only a little; a large value means that much of the past information about B is irrelevant. Taking the same example as above, the table below gives the 95% confidence intervals for Bt for various values of Wt, given that Bt-1 is N(1, 0.01).

Table 8

    Wt      95% confidence interval for Bt
    0        0.8  to 1.2
    .01      0.72 to 1.28
    .02      0.66 to 1.34
    .03      0.6  to 1.4
    .05      0.5  to 1.5
    .10      0.34 to 1.66
    .20      0.08 to 1.92
    .30     -0.1  to 2.1

The obvious question is: what are the appropriate values of Wt to use? The answer, of course, is that this depends on the situation to be modelled. During periods of stability (i.e. the parameters are constant) it would be best to have a low Wt, reflecting the fact that older data are still very relevant compared to newer data. During periods of instability (i.e. we think that the parameters could change) a higher Wt would be appropriate, reflecting the fact that older data are not very relevant.

In general, Wt is an n by n matrix (where n is the number of parameters), and so we have the opportunity to specify some parameters as stable and others as unstable. (This gives the Kalman filter a considerable advantage over weighted least squares, where only one weight is specified per time period, and this weight is applied to all the data.) The following example illustrates this.

Consider a simple demand model for some product:

    Qt = B0,t + B1,t . Yt + B2,t . Pt + vt        vt is N(0, Vt)

    (B0,t)   (B0,t-1)   (w0,t)
    (B1,t) = (B1,t-1) + (w1,t)
    (B2,t)   (B2,t-1)   (w2,t)

where Qt is log(consumption of product)
      Yt is log(income of consumers)
      Pt is log(price of product)
      wt is N(0, Wt)

Suppose at time t = r, a competitor introduces a competing product, which rapidly increases its market share until it stabilizes at time t = s.
Then appropriate values for Wt might be, in the period t = 1 to t = r-1:

    ( .0025   0                 0               )
    (  0      .0025 . B1,t-1^2   0               )
    (  0      0                 .0025 . B2,t-1^2 )

Note that there is nothing in the Kalman filter to prevent Wt from being related to past values of the parameters; the only requirement is that Wt be known at time t. The matrix above would give the model the property that if the parameters B1 and B2 were known with zero variance at time t-1, then the forecast 95% confidence limits for time t are plus or minus 10% of the value of the parameter; and if B0 were known with zero variance at time t-1, then the forecast 95% confidence limits are plus or minus 0.1 (this corresponds to a plus or minus 10% confidence limit on product consumption, as the model is linear in logs).

At time t = r, the competitor introduces the competing product. We would expect, a priori, that the price elasticity of our product would change, and that the demand curve would shift as the competitor increased his market share. So for t = r to t = s-1 (the period of instability) Wt could be set to:

    ( .01   0                 0   )
    (  0    .0025 . B1,t-1^2   0   )
    (  0    0                 .04 )

The value of W11 has been increased to reflect the fact that we expect to lose some market share to the new product, and so expect the demand curve to shift. W22 has not been changed, as the existence of the competing product should not much affect consumers' income elasticity for the product. W33 has been increased a lot, however, as we would expect a greater price elasticity for the old product. The value of .04 means that if the price elasticity had been known with zero variance, the new price elasticity is known with 95% confidence limits of plus or minus 0.4 (= 2 . W33^0.5).

When the market has settled down again (when t = s), it might be appropriate to revert to the old W matrix, as used for t less than r. Alternatively, it might be felt that the new competitor exerts a continual unsettling influence on the market (though not as great as when the new product was introduced).
For example, the competitor will advertise and undertake other marketing activities, but this sales pressure is not likely to be constant; the competitor may vary his campaign. So a W matrix like the old one, but with a little more uncertainty on the forecast of B0, might be used:

    ( .005   0                 0               )
    (  0     .0025 . B1,t-1^2   0               )
    (  0     0                 .0025 . B2,t-1^2 )

This example illustrates how W can be used to model the changing structure of the market.

Note that in this example the covariance terms of the W matrix (W12, W23 and W13) are assumed to be zero. This is equivalent to assuming that the passage of time increases the variances of the parameters, but not their covariances: any increase in uncertainty about one parameter caused by the passage of time is not correlated with the increase in uncertainty about any other parameter. This is not a necessary assumption, but in practice (for example Harrison and Stevens, 1976, and Meade, 1979, p. 472) it is frequently made. It is not, of course, the same as assuming that the covariance matrix of the parameter estimates is diagonal.

6.6 THE RELEVANCE OF H

H is a matrix describing the relationship of the parameters at time t, Bt, to the parameters in the previous period, Bt-1:

    Bt = H . Bt-1 + wt        where wt is N(0, Wt)        eq. 6.6.1

Most econometric models specify constant parameters, which corresponds to H being the identity matrix and Wt = 0, so that Bt = Bt-1. But it is quite difficult to estimate models with systematically varying parameters using conventional econometric methods, so perhaps constant parameters are often specified out of necessity rather than by choice.

The most obvious use of equation 6.6.1 is to allow the parameters to execute a random walk:

    Bt = Bt-1 + wt

This was suggested by Cooley and Prescott, 1973b, and by various others. Another way to use equation 6.6.1 is as follows.
Suppose we want to fit a simple time trend to the data, but we want to allow the slope and intercept to fit locally at each point (as opposed to globally over the whole series). Then

    Bt = ( slopet     )
         ( interceptt )

and the forecast of Bt, given Bt-1, is

    ( slopet-1                )
    ( interceptt-1 + slopet-1 )

which means that

    H = ( 1  0 )
        ( 1  1 )

Harrison and Stevens (1976) use this model.

Another way to use equation 6.6.1: suppose we believe that the parameters are declining or increasing, Bt = k . Bt-1, with different k's for the different parameters. Then we simply write the H matrix as

    ( k1  0   0   0  ...     )
    ( 0   k2  0   0  ...     )
    ( 0   0   k3  0  ...     )
    ( .   .   .   .          )
    ( 0   0   0   0  ...  kn )

But H need not be constant; it need only be known at time t. So, for example, the ki can be functions of other variables.

Another use for H is as follows. Suppose we have a model

    Yt = Bt . Xt + vt

and suppose there is some variable Zt that explains the movements in Bt:

    Bt = At . Zt + zt

Substituting, we get

    Yt = At . Zt . Xt + zt . Xt + vt

This could be estimated by ordinary least squares, were it not for the fact that the residuals, zt . Xt + vt, are heteroscedastic; the Kalman filter will, however, cope with this model.

6.7 THE RELEVANCE OF G

The matrix G, defined in 6.1.7 as

    Gt = CFt . XTt . (Var(YFt))-1        eq. 6.1.7

is called the gain of the filter, and is calculated inside the Kalman filter algorithm. If Yt is an m by 1 vector of dependent variables, and Bt is an n by 1 vector of parameters, then G is an n by m matrix which is used for two purposes. There is a strong analogy between G and the smoothing constants used in exponential or double exponential smoothing (see Kendall, 1973); G controls the amount of forecast error to be fed back into revising the parameter estimates.
The big differences are:
- G is a matrix, not a scalar (or vector, as used in higher-order smoothing);
- G is not fixed, but varies according to how much is already known about the parameters, and how accurate the forecasts are expected to be.

6.7.1 G ADJUSTS THE PARAMETER ESTIMATES

G relates the adjustments to be made to the estimates of the n elements of B to the errors (between forecast and actual) in Y, using equation 6.1.8:

    MEt = MFt + Gt . (Yt - YFt)        eq. 6.1.8

The larger the error in forecasting, the larger the correction that will be made to M (the mean of the distribution describing our information about B). From 6.1.7 we can also see that the size of the correction is directly proportional to the precision attached to the forecast of Yt (= [Var(YFt)]-1); this is because it is only by having some degree of precision in our forecasts that we can attach any importance to the fact that the forecast is different from the data. Equation 6.1.7 also tells us that the correction is inversely proportional to the precision with which we can forecast B (the variance attached to the forecast of B is CFt, so the precision is (CFt)-1). This too has a simple meaning: a better forecast of B means greater precision attached to the forecast of B, and so we want to change our estimate of B by less.

The reason for the term XTt in equation 6.1.7 can be seen by a simple dimensional analysis. Consider our model

    Yt = Xt . Bt + vt        where vt is N(0,Vt)        eq. 6.1.1

Suppose the units of Y are kilograms, and the units of X are dollars. Then the units of B must be kilograms per dollar. Now write equation 6.1.8 in units:

    kilograms . dollars-1 = kilograms . dollars-1 + G . (kilograms - kilograms)

So the dimension of G must be dollars-1. Now write 6.1.7 in units:

    dollars-1 = kilograms2 . dollars-2 . dollars . (kilograms2)-1 = dollars-1

So the equation is dimensionally correct. This simple analysis gives an intuitive feel for why Xt appears in equation 6.1.7: it is to scale G to the right units.
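For one dependent variable and one parameter, equation 6.1.7 collapses to a scalar, which makes the proportionalities above easy to see. A minimal sketch (the function name is illustrative):

```python
# Scalar form of eq. 6.1.7: G = CF . X / (X^2 . CF + V).
# The gain grows when the forecast of B is imprecise (large CF) and shrinks
# when the forecast of Y is imprecise (large V), as argued above.
def gain(CF, X, V):
    return CF * X / (X * X * CF + V)
```

For example, gain(1, 1, 1) is 0.5; raising CF to 4 increases the gain to 0.8, while raising V to 10 cuts it to about 0.09.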
6.7.2 G ADJUSTS THE PRECISION ATTACHED TO THE PARAMETER ESTIMATES

The second purpose of G is to describe the increase in the precision of the parameter estimates, using equation 6.1.9:

    CEt = CFt - Gt . Xt . CFt        eq. 6.1.9
        = (I - Gt . Xt) . CFt

where I is the identity matrix. The same dimensional analysis reveals the purpose of the Xt in this equation; it is a scaling factor again. We can also see from this equation that G is the fractional reduction in the variance attached to the parameter estimates (suitably scaled by X). From 6.1.7 we can see that this reduction in variance is directly proportional to the precision attached to the forecast of Yt, because more precise forecasts of Yt mean that we can attach more precision to the parameter estimates. We can also see that the reduction in variance is inversely proportional to the precision with which we could forecast B; if we have more precise forecasts of B, then we want to change them less.

6.8 OTHER BENEFITS OF THE KALMAN FILTER

There are two other benefits of the Kalman filter which may be important in some applications. Firstly, the Kalman filter can help with the multicollinearity problem; secondly, the Kalman filter needs less computer time and storage than ordinary least squares.

6.8.1 KALMAN FILTERING AND THE MULTICOLLINEARITY PROBLEM

The multicollinearity problem is described in Solomon (1980), which concludes that the only answer to the multicollinearity problem is to get more information about the effects of the collinear variables on the dependent variable. The Kalman filter does not provide this extra information, but can help in two ways.

The first way is that when the extra information about the parameters of the model has been collected, it will usually be in the form of estimates of the parameters, and estimates of the uncertainty attached to those parameters.
If the information collected is not in this form, then some way will have to be found to cast it into this form to make it usable by any estimation system (see, for example, 1.2.3 above). This information can then be regarded as a prior distribution of the parameters, which can be fed into the Kalman filter very easily at t = 0, in terms of values for M0 and C0.

The second way that the Kalman filter can help is particularly useful when some of the explanatory variables are tangled up in an extreme (i.e. near perfect or perfect) multicollinearity. With ordinary least squares, when two or more variables are perfectly collinear, the matrix XTX is singular, and so (XTX)-1 does not exist, and so it is not possible to calculate parameter estimates for any of the variables (including variables not involved in the multicollinearity). When the multicollinearity is not quite perfect, the finite accuracy of computers will often make the matrix inversion impossible, or else highly inaccurate (Wampler, 1970); furthermore, most packages give no (or inadequate) warning of this inaccuracy. In an extreme case, the Helsinki School of Economics package, under conditions of severe multicollinearity, can report R2 as being greater than unity, and negative F statistics. More sophisticated packages (such as BMDP) invert the XTX matrix by "sweeping" (explained in the BMDP manual, 1981, p. 671). This gives warning of multicollinearity by displaying the multiple correlation between each explanatory variable and all the others; variables can be excluded if this multiple correlation is very close to 1.0.

The Kalman filter applied to one dependent variable does not need to invert a matrix; the nearest it gets to inversion is in equation 6.1.7, taking the reciprocal of the variance attached to the forecast of the dependent variable. The effect of this is that when the Kalman filter is applied to collinear data, the breakdown described above does not happen.
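A small numerical sketch of this behaviour (hypothetical data: a three-parameter model whose first two regressors are perfectly collinear, run through the single-dependent-variable filter with H = I and W = 0, and a vague prior):

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = np.array([1.0, 0.5, 2.0])            # hypothetical true parameters
ME, CE = np.zeros(3), np.eye(3) * 100.0          # vague prior: M0 = 0, C0 = 100I
V = 1.0

for t in range(50):
    x1, x3 = rng.normal(), rng.normal()
    X = np.array([x1, 2.0 * x1, x3])             # second regressor = 2 x first
    Y = X @ beta_true + rng.normal()
    var_YF = X @ CE @ X + V                      # a scalar: reciprocal, not inversion
    G = CE @ X / var_YF                          # eq. 6.1.7 with m = 1, H = I, W = 0
    ME = ME + G * (Y - X @ ME)                   # eq. 6.1.8
    CE = CE - np.outer(G, X @ CE)                # eq. 6.1.9

# The filter never breaks down; CE simply reports that the collinear
# parameters remain imprecise while the third is well determined.
```

With these hypothetical data the top left corner of CE retains large variances (the collinear parameters are not separately identified), while the variance of the third parameter shrinks to the order of 1/50.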
As an illustration, consider a model in five variables, of which three are multicollinear. When all the data have been digested by the Kalman filter, the information extracted about the parameters is expressed as a probability distribution with mean M and covariance matrix C. Suppose (without loss of generality) that the three multicollinear variables are the first three. Then the top left hand 3 x 3 corner of the matrix C will show large variances and covariances, indicating that our estimates of these three parameters are highly imprecise. The bottom right hand 2 x 2 corner, however, will not be affected by the multicollinearity, and will display variances and covariances reflecting the amount of information in the data. The 3 x 2 strip along the top right, and the 2 x 3 strip along the bottom left, will show little covariance between the three multicollinear variables and the other two variables.

When the Kalman filter is applied to multiple dependent variables, there are similar advantages, as at each stage the only matrix needing to be inverted is the covariance matrix attached to the forecast of the dependent variables. The only way that this matrix could be singular is for the forecast of some linear combination of the dependent variables to be perfect (i.e. zero variance), in which case we are dealing with an identity, which can be eliminated from the set of equations.

6.8.2 THE COMPUTATIONAL ADVANTAGES OF KALMAN FILTERING

In these days of high-speed, low-cost electronic computers, it is sometimes thought unnecessary to worry about computational efficiency. But many researchers have tight computing budgets, and a reduction in the cost per estimation would make it possible to try a richer model, or a wider variety of models. Also, there is an increasing use of micro-computers, which are much slower than mini-computers or main-frames. So computational efficiency is still an advantage.
The computational efficiency of the Kalman filter comes from avoiding the need to invert the moment matrix of the explanatory variables. Matrix inversion is a very time-consuming job, especially for large matrices; the time taken increases as the cube of the order of the matrix. Furthermore, it is advisable (Wampler, 1970) to carry out the most important parts of the calculation in extended precision, which is rather slower than single precision.

A simultaneous equation system of four equations can easily have 20 explanatory variables, and the inversion of a 20 x 20 matrix is no small matter; the Kalman filter would be inverting 4 x 4 matrices. The Helsinki School of Economics package and the London Business School package both use matrix inversion, and the advanced econometric package WYMER (used at the London School of Economics and elsewhere) has no fewer than five different matrix inversion subroutines (used for different purposes); often, if one subroutine reports singularity, another is tried. If a single equation model with 5 explanatory variables is being estimated, then ordinary least squares would need to invert a 6 x 6 matrix (the 5 variables plus a constant); the Kalman filter would need no matrix inversion at all.

Note also that adding more observations to the data set does not require re-inverting the XTX matrix, giving further computational savings.
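The point can be sketched numerically (hypothetical data): for a single dependent variable the recursive form needs only division by the scalar Var(YFt), yet with a vague prior it reproduces, to high accuracy, the batch OLS estimates that require inverting (or solving with) XTX.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))                     # hypothetical data, 40 observations
Y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=40)

# Batch OLS: solve the normal equations (one 3 x 3 solve/inversion).
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Recursive form (H = I, W = 0, V = 1) with a vague prior: no matrix
# inversion at all; each added observation costs O(n^2), with no refit.
ME, CE = np.zeros(3), np.eye(3) * 1e8
for x, y in zip(X, Y):
    G = CE @ x / (x @ CE @ x + 1.0)              # scalar division, eq. 6.1.7
    ME = ME + G * (y - x @ ME)                   # eq. 6.1.8
    CE = CE - np.outer(G, x @ CE)                # eq. 6.1.9
```

The prior variance of 1e8 stands in for "no prior information"; with it, the recursive estimates ME agree with b_ols to within numerical noise.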