FLORIDA WATER RESOURCES RESEARCH CENTER
Publication No. 67

ESTIMATING MISSING VALUES IN MONTHLY RAINFALL SERIES

By
EFI FOUFOULA-GEORGIOU

Research Project Technical Completion Report
Sponsored by South Florida Water Management District

A THESIS PRESENTED TO THE GRADUATE COUNCIL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ENGINEERING

UNIVERSITY OF FLORIDA
1982

ACKNOWLEDGEMENTS

I wish to express my sincere gratitude to all those who contributed towards making this work possible.

I am particularly indebted to the chairman of my supervisory committee, Professor Wayne C. Huber. Through the many constructive discussions along the course of this research, he provided invaluable guidance. It was his technical and moral support that brought this work to completion.

I would like to express my sincere appreciation to the other members of my supervisory committee: Professors J. P. Heaney, D. L. Harris, and M. C. K. Yang, for their helpful suggestions and their thoughtful and critical evaluation of this work.

Special thanks are also given to my fellow students and friends, Khlifa, Dave D., Bob, Terrie, Richard, Dave M., and Mike, for their cheerful help and the pleasant work environment they have created.

Finally, my deepest appreciation and love go to my husband, Tryphon, who has been a constant source of encouragement and inspiration for creative work. Many invaluable discussions with him helped a great deal in gaining an understanding of some problems considered in this thesis.
The research was supported in part by the South Florida Water Management District. Computations were performed at the Northeast Regional Data Center on the University of Florida campus, Gainesville.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1. INTRODUCTION
    Rainfall Records
    Frequency Analysis of Missing Observations in the South Florida Monthly Rainfall Records
    Description of the Chapters
CHAPTER 2. SIMPLIFIED ESTIMATION TECHNIQUES
    Introduction
    Mean Value Method (MV)
    Reciprocal Distance Method (RD)
    Normal Ratio Method (NR)
    Modified Weighted Average Method (MWA)
    Least Squares Method (LS)
CHAPTER 3. UNIVARIATE STOCHASTIC MODELS
    Introduction
    Review of Box-Jenkins Models
        Autoregressive Models
        Moving Average Models
        Mixed Autoregressive-Moving Average Models
        Autoregressive Integrated Moving Average Models
    Transformation of the Original Series
        Transformation to Normality
        Stationarity
    Monthly Rainfall Series
        Normalization and Stationarization
        Modeling of Normalized Series
CHAPTER 4. MULTIVARIATE STOCHASTIC MODELS
    Introduction
    General Multivariate Regression Model
    Multivariate Lag-One Autoregressive Model
    Comments on Multivariate AR(1) Model
        Assumption of Normality and Stationarity
        Cross-Correlation Matrix M1
        Further Simplification
    Higher Order Multivariate Models
CHAPTER 5. ESTIMATION OF MISSING MONTHLY RAINFALL VALUES - A CASE STUDY
    Introduction
    Set Up of the Problem
    Simplified Estimation Techniques
        Techniques Utilized
        Least Squares Methods
        Modified Weighted Average Method
        Comparison of the MV, RD, NR and MWA Methods
    Univariate Model
        Model Fitting
        Proposed Estimation Algorithm
        Application of the Algorithm on the Monthly Rainfall Series
        Results of the Method
        Remarks
    Bivariate Model
        Model Fitting
        Proposed Estimation Algorithm
        Application of the Algorithm on the Monthly Rainfall Series
CHAPTER 6. CONCLUSIONS AND RECOMMENDATIONS
    Summary and Conclusions
    Further Research
APPENDIX A. DEFINITIONS
APPENDIX B. DETERMINATION OF MATRICES A AND B OF THE MULTIVARIATE AR(1) MODEL
APPENDIX C. DATA USED AND STATISTICS
APPENDIX D. COMPUTER PROGRAMS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1.1  Frequency Distribution of the Percent of Missing Values in 213 South Florida Monthly Rainfall Records
5.1  Least Squares Regression Coefficients and Their Significance Levels
5.2  Correction Coefficients for Each Month and for Each Different Percent of Missing Values
5.3  Statistics of the Actual (ACT), Incomplete (INC) and Estimated Series (MV, RD, NR, MWA)
5.4  Bias in the Mean
5.5  Bias in the Standard Deviation
5.6  Bias in the Lag-One and Lag-Two Correlation Coefficients
5.7  Accuracy - Mean and Variance of the Residuals
5.8  Initial Estimates and MLE of the Parameters φ and θ of an ARMA(1,1) Model Fitted to the Monthly Rainfall Series of Station A
5.9  Results of the RAEMVU Applied at the 10% Level of Missing Values. Upper Value is φ1, Lower Value is θ1
5.10 Results of the RAEMVU Applied at the 20% Level of Missing Values. Upper Value is φ1, Lower Value is θ1
5.11 Statistics of the Actual Series (ACT) and the Two Estimated Series (UN10, UN20)
5.12 Bias in the Mean, Standard Deviation and Serial Correlation Coefficient - Univariate Model
5.13 Results of the RAEMVB1 Applied at the 10% Level of Missing Values
5.14 Results of the RAEMVB1 Applied at the 20% Level of Missing Values
5.15 Statistics of the Actual Series (ACT) and the Two Estimated Series (B10 and B20)
5.16 Bias in the Mean, Standard Deviation and Serial Correlation Coefficient - Bivariate Model

LIST OF FIGURES

1.1  Monthly distribution of rainfall in the United States
1.2  Probability density function, f(m), of the percentage of missing values
1.3  Probability density function, f(T), of the interevent size
1.4  Probability density, f(k), and mass function, p(k), of the gap size
2.1  Mean value method without random component
2.2  Mean value method with random component
2.3  Least squares method without random component
2.4  Least squares method with random component
5.1  The four south Florida rainfall stations used in the analysis
5.2  Plot of the monthly means and standard deviations of the rainfall series of Station A
5.3  Autocorrelation function plot of the residual series of an ARMA(1,1) model fitted to the monthly rainfall series of Station A
5.4  Sum of squares of the residuals surface of an ARMA(1,1) model fitted to the monthly rainfall series of Station A
5.5  Recursive algorithm for the estimation of the missing values, univariate model (RAEMVU)
5.6  Recursive algorithm for the estimation of missing values, bivariate model, 1 station to be estimated (RAEMVB1)
5.7  Recursive algorithm for the estimation of missing values, bivariate model, 2 stations to be estimated (RAEMVB2)
Abstract of Thesis Presented to the Graduate Council of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Engineering

ESTIMATION OF MISSING OBSERVATIONS IN MONTHLY RAINFALL SERIES

By
Efstathia Foufoula-Georgiou
December, 1982

Chairman: Wayne C. Huber
Cochairman: James P. Heaney
Major Department: Environmental Engineering Sciences

This study compares and evaluates different methods for the estimation of missing observations in monthly rainfall series. The estimation methods studied reflect three basic ideas:

(1) the use of regional-statistical information in four simple techniques: mean value method (MV), reciprocal distance method (RD), normal ratio method (NR), modified weighted average method (MWA);
(2) the use of a univariate autoregressive moving average (ARMA) model which describes the time correlation of the series;
(3) the use of a multivariate ARMA model which describes the time and space correlation of the series.

An algorithm for the recursive estimation of the missing values in a series by a parallel updating of the univariate or multivariate ARMA model is proposed and demonstrated. All methods are illustrated in a case study using 55 years of monthly rainfall data from four south Florida stations.

CHAPTER 1
INTRODUCTION

Rainfall Records

Rainfall is the source component of the hydrologic cycle. As such it regulates water availability and thus land use, agricultural and urban expansion, maintenance of environmental quality and even population growth and human habitation. As Hamrick (1972) points out, water may be transported for considerable distances from where it fell as rain and may be stored for long periods of time, but with very few exceptions it originates as rainfall. Consequently, the measurement and study of rainfall is in actuality the measurement and study of our potential water supply.
Rainfall studies attempt to derive models, both probabilistic and physical, to describe and forecast the rainfall process. Since the quality of every study is directly related to the quality of the data used, the need for "good quality" rainfall data has been expressed by all hydrologists. "Good quality" here means accurate, long and uninterrupted series of rainfall measurements at a range of different time intervals (e.g., hourly, daily, monthly, and yearly data) and for a dense raingage network. Missing values in the series (due, for example, to failure of the recording instruments or to deletion of a station) are a real handicap to hydrologic data users. The estimation of these missing values is often desirable prior to the use of the data.

For instance, the South Florida Water Management District prepared a magnetic tape with monthly rainfall data for all rainfall stations in south Florida for use in this study (T. MacVicar, SFWMD, personal communication, May, 1982). The data included values for the period of record at each station, ranging from over 100 years (at Key West) to only a few months at several temporary stations. Approximately one month was required to preprocess these data prior to performing routine statistical and time series analyses. The preprocessing included tasks such as manipulations of the magnetic tape, selection of stations with desirable characteristics (e.g., long period of record, proximity to other stations of interest, few missing values) and a major effort at replacement of the missing values that did exist. This effort, in fact, was the motivation for this thesis.

Many different kinds of statistical analyses may be performed on a given data set, e.g., determination of elementary statistical parameters, auto- and cross-correlation analysis, spectral analysis, frequency analysis, fitting time series models. For routine statistics (e.g., calculation of mean, variance and skewness) missing values are seldom a problem.
But for techniques as common as autocorrelation and spectral analysis missing values can cause difficulties. In multivariate analysis missing values result in "wasted information" when only the overlapping period of the series can be used in the analysis, and in inconsistencies (Fiering, 1968, and Chapter 4 of this thesis) when the incomplete series are used. In general, two approaches to the problem of missing observations exist. The first consists of developing methods of analysis that use only the available data, the second in developing methods of estimation of the missing observations followed by application of classical methods of analysis. Monthly rainfall totals are usually calculated as the sum of daily recorded values. Thus, if one or more daily observations are missing the monthly total is not reported for that month. An investigation conducted by the Weather Bureau in 1950 (Paulhus and Kohler, 1952), showed that almost one third of the stations for which monthly and yearly totals were not published had only a few (less than five) days missing. Furthermore, for some of these missing days there was apparently no rainfall in the area as concluded by the rainfall observations at nearby stations. Therefore, in many cases estimation of a few missing daily rainfall values can provide a means for the estimation of the monthly totals. Statisticians have been most concerned with the problem of handling short record multivariate data with missing observations in some or all of the variables, but no explicit and simple solutions have been given, apart from a few special cases in which the missing data follow certain patterns. A review of these methods is given by Afifi and Elashoff (1956). In the time domain, "the analysis of time series, when missing observations occur has not received a great deal of attention" as Marshall (1980, p. 567) comments, and he proposes a method for the estimation of the autocorrelations using only the observed values. 
Jones (1980) attempts to fit an ARMA model to a stationary time series which has missing observations using Akaike's Markovian representation and Kalman's recursive algorithm. In the frequency domain, spectral analysis with randomly missing observations has been examined by Jones (1962), Parzen (1963), Scheinok (1965), Neave (1970) and Bloomfield (1970).

In hydrology, the problem of missing observations has not been studied much, as Salas et al. (1980) state:

    The filling-in or extension of a data series is a topic which has not received a great deal of attention either in this book or elsewhere. Because of its importance, the subject is expected to be paid more attention in the future. (Salas et al., 1980, p. 464)

Simple and "practicable" methods for the estimation of missing rainfall values for large scale application were proposed by Paulhus and Kohler (1952), for the completion of the rainfall data published by the Weather Bureau. The study was initiated after numerous requests from users of the climatological data. Beard (1973) adopted a multisite stochastic generation technique to fill in missing streamflow data, and Kottegoda and Elgy (1977) compared a weighted average scheme and a multivariate method for the estimation of missing data in monthly flow series. Hashino (1977) introduced the "concept of similar storm" for the estimation of missing rainfall sequences.

Although the same methods of estimation can be applied to both rainfall and runoff series, a specific method is not expected to perform equally well when applied to the two different series, due mainly to the different underlying processes. This is true even for rainfall series from different geographical regions, since their distributions may vary greatly, as shown in Fig. 1.1.

This analysis will use monthly rainfall data from four south Florida stations. First, a frequency analysis of the missing observations has been performed and their typical pattern has been identified.
In this work the term "missing observations" is used for a sequence of missing monthly values restricted to fewer than twelve, so that unusual cases of lengthy gaps (a year or more of missing values) are avoided, since they do not reflect the general situation.

Fig. 1.1. Monthly distribution of rainfall in the United States (after Linsley, R.K., Kohler, M.A. and Paulhus, J.L., Hydrology for Engineers, 1975, McGraw-Hill, 2nd edition, p. 90).

Frequency Analysis of Missing Observations in the South Florida Monthly Rainfall Records

An analysis of the monthly rainfall series of 213 stations of the South Florida Water Management District (SFWMD) gave the results shown in Table 1.1. Figure 1.2 shows the probability density function (pdf) plot of the percent m of missing values, f(m), which is defined as the ratio of the probability of occurrence over an interval to the length of that interval (column 4 of Table 1.1). The shape of the pdf f(m) suggests a fit by an exponential distribution

$$f(m) = \lambda e^{-\lambda m} \qquad (1.1)$$

where $\lambda$ is the parameter of the distribution, calculated as the inverse of the expected value of m, E(m):

$$E(m) = \sum_i p(m_i)\, m_i \qquad (1.2)$$

where $p(m_i)$ is the probability of having $m_i$ percent of missing values. The mean value of the percentage of missing values is $\bar{m} = E(m) = 13.663$, and therefore the fitted exponential pdf is

$$f(m) = 0.073\, e^{-0.073 m} \qquad (1.3)$$

which gives an interesting and unexpectedly good fit, as shown by Fig. 1.2 and column 5 of Table 1.1.

Fig. 1.2. Probability density function, f(m), of the percentage of missing values (vertical axis: f(m); horizontal axis: % missing values, m; fitted curve f(m) = 0.073 e^(-0.073m)). Based on 213 stations, mean m = 13.663%.

Table 1.1. Frequency Distribution of the Percent of Missing Values in 213 South Florida Monthly Rainfall Records.

(1)                  (2)            (3)              (4)         (5)
% of Missing Values  % of Stations  Cumulative %     Empirical   Fitted
                                    of Stations      pdf         Exponential
 0-5                 30.52           30.52           0.061       0.061
 5-10                21.12           51.64           0.042       0.042
10-15                14.55           66.19           0.029       0.029
15-20                13.61           79.80           0.027       0.020
20-25                 6.10           85.90           0.012       0.014
25-30                 3.29           89.19           0.007       0.010
30-35                 1.88           91.07           0.004       0.007
35-40                 0.94           92.01           0.002       0.005
40-45                 2.35           94.36           0.005       0.003
45-50                 2.82           97.18           0.006       0.002
50-55                 0.47           97.65           0.001       0.002
55-60                 0.47           98.12           0.001       0.001
60-65                 1.41           99.53           0.003       0.001
65-70                 0.47          100.00           0.001       0.001

The question now arises as to whether the missing values within a record follow a certain pattern. In particular, if the occurrence of a gap is viewed as an "event", then the distribution of the interevent times (sizes of the interevents) and of the durations of the events (sizes of the gaps) may be examined.

The probability distribution of the size of the interevents (number of values between two successive gaps) has been studied for four stations of the SFWMD that are "typical" as far as length of the record, distribution and percent of missing values are concerned. These four stations are:

MRF 6018, Titusville 2W, 1901-1981, 7.5% missing
MRF 6021, Fellsmere 4W, 1911-1979, 9.3% missing
MRF 6029, Ocala, 1900-1981, 4.4% missing
MRF 6005, Plant City, 1892-1981, 8.6% missing

A derived pdf for the four stations combined and the fitted exponential pdf are shown in Fig. 1.3. The mean size of the interevent, $\bar{T}$, is 19.03 months; therefore, the fitted exponential distribution is

$$f(T) = 0.053\, e^{-0.053 T} \qquad (1.4)$$

The probability distribution of the size of the gaps (number of values missing in each gap) has also been studied for the same four stations. These have been treated as discrete distributions, since the size of the gap (k = 1, 2, ..., 11) is small compared to the interevent times. A probability distribution for the four stations combined is then derived, which is also the discrete probability mass function (pmf). This plot is shown in Fig. 1.4 and suggests either a Poisson distribution or a discretized exponential.
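The exponential fits of equations (1.3) and (1.4) can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, and it assumes that column 5 of Table 1.1 was obtained by evaluating the fitted pdf at the midpoint of each 5% class, an assumption that happens to agree with the tabulated values but is not stated in the text.

```python
import math


def exponential_rate(mean_value):
    """Moment estimate of the exponential parameter: lambda = 1 / mean."""
    return 1.0 / mean_value


def exponential_pdf(x, rate):
    """Exponential density f(x) = rate * exp(-rate * x)."""
    return rate * math.exp(-rate * x)


# Means quoted in the text: 13.663 (% missing values) and 19.03 months (interevent size).
rate_m = exponential_rate(13.663)
rate_T = exponential_rate(19.03)

# Empirical pdf (column 4): fraction of stations in a class divided by the class width.
pct_stations = [30.52, 21.12, 14.55, 13.61]          # first four rows of column 2
class_width = 5.0
empirical = [p / 100.0 / class_width for p in pct_stations]

# Fitted pdf (column 5), ASSUMED to be evaluated at the class midpoints 2.5, 7.5, ...
midpoints = [2.5 + class_width * i for i in range(len(pct_stations))]
fitted = [exponential_pdf(m, rate_m) for m in midpoints]

for m, e, f in zip(midpoints, empirical, fitted):
    print(f"midpoint {m:5.1f}%  empirical {e:.3f}  fitted {f:.3f}")
```

Under these assumptions the two rates round to 0.073 and 0.053, matching equations (1.3) and (1.4).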
Fig. 1.3. Probability density function, f(T), of the interevent size (vertical axis: f(T); horizontal axis: months between gaps, T; fitted curve f(T) = 0.053 e^(-0.053T)). Based on four stations.

Fig. 1.4. Probability density, f(k), and mass function, p(k), of the gap size (vertical axis: f(k) and p(k); horizontal axis: gap size, k, in months; plotted series: empirical, Poisson, fitted; fitted curve f(k) = 0.447 e^(-0.447k)). Based on four stations.

The mean value $\bar{k}$ is 2.237, which is also the parameter $\lambda$ of the Poisson distribution. The Poisson distribution

$$f(k) = \frac{\lambda^k e^{-\lambda}}{k!} \qquad (1.5)$$

is nonzero at k = 0 and does not fit the peak of the empirical point very well at k = 1 (it gives a value of 0.24 instead of the actual 0.53). The fitted continuous exponential pdf shown in Fig. 1.4 gives a better fit in general but also implies a nonzero probability for a gap size near zero. To overcome this problem and to discretize the continuous exponential pdf, the area (probability) under the exponential curve between zero and 1.5 is assigned to k = 1, ensuring a zero probability at k = 0. Areas (probabilities) assigned to values of k > 1 are centered around those points. The fitted discretized exponential and the Poisson are also shown in Fig. 1.4.

The distributions of the size of the gaps (k) and of the size of interevents (T) will be used to generate randomly distributed gaps in a complete record. Suppose that we have a complete record and wish to remove m percent of its values at random. If the mean size of the gap ($\bar{k}$) is assumed constant, the mean size of interevent ($\bar{T}$) must vary, decreasing as the percent of missing values increases. Let N denote the total number of values in the record, m the

[Pages missing or unavailable]

where

$$R^2 = \phi_1 \rho_1 + \phi_2 \rho_2 + \cdots + \phi_p \rho_p \qquad (3.8)$$

is called the multiple coefficient of determination and represents the fraction of the variance of the series that has been explained through the regression.
If we denote by $\phi_{kj}$ the jth coefficient in an autoregressive process of order k, then the last coefficient $\phi_{kk}$ of the model is called the partial autocorrelation coefficient. Estimates of the partial autocorrelation coefficients $\phi_{11}, \phi_{22}, \ldots, \phi_{pp}$ may be obtained by fitting to the series autoregressive processes of successively higher order, and solving the corresponding Yule-Walker equations. The partial autocorrelation function $\phi_{kk}$, k = 1, 2, ..., p, may also be obtained recursively by means of Durbin's relations (Durbin, 1960)

$$\phi_{k+1,k+1} = \frac{r_{k+1} - \sum_{j=1}^{k} \phi_{k,j}\, r_{k+1-j}}{1 - \sum_{j=1}^{k} \phi_{k,j}\, r_j}$$
$$\phi_{k+1,j} = \phi_{k,j} - \phi_{k+1,k+1}\, \phi_{k,k-j+1}, \qquad j = 1, 2, \ldots, k \qquad (3.9)$$

It can be shown (Box and Jenkins, 1976, p. 55) that the autocorrelation function of a stationary AR(p) process is a mixture of damped exponentials and damped sine waves, infinite in extent. On the other hand, the partial autocorrelation function $\phi_{kk}$ is nonzero for $k \le p$ and zero for $k > p$. The plot of the autocorrelation and partial autocorrelation functions of the series may be used to identify the kind and the order of the model that may have generated it (identification of the model).

Moving Average Models

In a moving average model the deviation of the current value of the process from the mean is expressed as a finite sum of weighted previous shocks a. Thus a moving average process of order q can be written as

$$z_t = a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q} \qquad (3.10)$$

or

$$z_t = \theta(B)\, a_t \qquad (3.11)$$

where

$$\theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q \qquad (3.12)$$

is the moving average operator of order q. An MA(q) model contains (q+2) parameters, $\mu, \theta_1, \theta_2, \ldots, \theta_q, \sigma_a^2$, to be estimated from the data.

From the definition of stationarity (see Appendix A) it follows that an MA(q) process is always stationary, since $\theta(B)$ is finite and thus converges for $|B| \le 1$. But for an MA(q) process to be invertible, the q moving average coefficients $\theta_1, \theta_2, \ldots, \theta_q$ must be chosen so that $\theta^{-1}(B)$ converges on or within the unit circle; in other words, the characteristic equation $\theta(B) = 0$ must have its roots outside the unit circle.
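For low orders the invertibility condition can be checked directly by solving the characteristic equation. The following is a minimal sketch for q ≤ 2 only, using the sign convention of equation (3.12); the function names are hypothetical and it is not part of the thesis programs.

```python
import cmath


def ma_characteristic_roots(theta1, theta2=0.0):
    """Roots of theta(B) = 1 - theta1*B - theta2*B**2 = 0 (MA(1) or MA(2))."""
    if theta2 == 0.0:
        # First-order case: 1 - theta1*B = 0, so B = 1/theta1 (no root if theta1 = 0).
        return [] if theta1 == 0.0 else [complex(1.0 / theta1)]
    # Quadratic a*B**2 + b*B + c = 0 with a = -theta2, b = -theta1, c = 1.
    a, b, c = -theta2, -theta1, 1.0
    disc = cmath.sqrt(b * b - 4.0 * a * c)
    return [(-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)]


def is_invertible(theta1, theta2=0.0):
    """True when every root of theta(B) = 0 lies outside the unit circle."""
    return all(abs(root) > 1.0 for root in ma_characteristic_roots(theta1, theta2))


print(is_invertible(0.5))        # MA(1) with root B = 2
print(is_invertible(2.0))        # MA(1) with root B = 0.5
print(is_invertible(0.5, 0.3))   # MA(2)
```

For higher orders a general polynomial root finder would be needed; the quadratic formula is used here only to keep the sketch dependency-free.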
By multiplying equation (3.10) by $z_{t-k}$ and taking expected values on both sides we define the autocovariance at lag k:

$$\gamma_k = E\left[(a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q})(a_{t-k} - \theta_1 a_{t-k-1} - \cdots - \theta_q a_{t-k-q})\right] \qquad (3.13)$$

which gives

$$\gamma_0 = (1 + \theta_1^2 + \theta_2^2 + \cdots + \theta_q^2)\, \sigma_a^2 \qquad (3.14)$$

$$\gamma_k = (-\theta_k + \theta_1 \theta_{k+1} + \theta_2 \theta_{k+2} + \cdots + \theta_{q-k} \theta_q)\, \sigma_a^2, \qquad k = 1, 2, \ldots, q \qquad (3.15)$$

$$\gamma_k = 0, \qquad k > q \qquad (3.16)$$

By substituting in equation (3.15) the value of $\sigma_a^2$ from equation (3.14) we obtain a set of q nonlinear equations for $\theta_1, \theta_2, \ldots, \theta_q$ in terms of $\rho_1, \rho_2, \ldots, \rho_q$:

$$\rho_k = \frac{-\theta_k + \theta_1 \theta_{k+1} + \theta_2 \theta_{k+2} + \cdots + \theta_{q-k} \theta_q}{1 + \theta_1^2 + \cdots + \theta_q^2}, \qquad k = 1, 2, \ldots, q \qquad (3.17)$$

These equations are analogous to the Yule-Walker equations for an autoregressive process, but they are not linear and so must be solved iteratively for the estimation of the moving average parameters $\theta$, resulting in estimates that may not have high statistical efficiency. Again it was shown by Wold (1938) that these parameters may need corrections (e.g., to fit better the correlogram as a whole and not only the first q correlation coefficients), and that there may exist several, at most $2^q$, solutions for the parameters of the moving average scheme corresponding to an assigned correlogram $\rho_1, \rho_2, \ldots, \rho_q$. However, only those $\theta$'s are acceptable which satisfy the invertibility conditions.

From equation (3.14) an estimate for the white noise variance $\sigma_a^2$ may be obtained:

$$\hat{\sigma}_a^2 = \frac{\hat{\sigma}_z^2}{1 + \hat{\theta}_1^2 + \hat{\theta}_2^2 + \cdots + \hat{\theta}_q^2} \qquad (3.18)$$

According to the duality principle (see Appendix A) an invertible MA(q) process can be represented as an AR process of infinite order. This implies that the partial autocorrelation function $\phi_{kk}$ of an MA(q) process is infinite in extent. It can be estimated after tedious algebraic manipulations from the Yule-Walker equations by substituting $\rho_k$ as functions of the $\theta$'s for $k \le q$ and $\rho_k = 0$ for $k > q$.
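The cut-off of the autocorrelation function and the infinite extent of the partial autocorrelation function can be illustrated numerically. The sketch below evaluates equation (3.17) for given $\theta$'s and then applies Durbin's relations (3.9) to the resulting autocorrelations. The function names and the choice of an MA(1) example with $\theta_1 = 0.5$ are assumptions made purely for illustration.

```python
def ma_autocorrelations(thetas, max_lag):
    """Theoretical rho_1..rho_max_lag of an MA(q) process, equations (3.16)-(3.17)."""
    q = len(thetas)
    theta = [0.0] + list(thetas)          # theta[k] matches theta_k in the text
    denom = 1.0 + sum(t * t for t in thetas)
    rhos = []
    for k in range(1, max_lag + 1):
        if k > q:
            rhos.append(0.0)              # autocorrelation cuts off after lag q
            continue
        cross = sum(theta[j] * theta[j + k] for j in range(1, q - k + 1))
        rhos.append((-theta[k] + cross) / denom)
    return rhos


def durbin_pacf(rhos):
    """Partial autocorrelations phi_kk for k = 1..len(rhos), via relations (3.9)."""
    r = [1.0] + list(rhos)                # r[k] is the lag-k autocorrelation
    pacf, prev = [], []                   # prev[j-1] holds phi_{k-1,j}
    for k in range(1, len(rhos) + 1):
        if k == 1:
            phi_kk, cur = r[1], [r[1]]
        else:
            num = r[k] - sum(prev[j - 1] * r[k - j] for j in range(1, k))
            den = 1.0 - sum(prev[j - 1] * r[j] for j in range(1, k))
            phi_kk = num / den
            cur = [prev[j - 1] - phi_kk * prev[k - j - 1] for j in range(1, k)]
            cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf


rhos = ma_autocorrelations([0.5], max_lag=5)   # illustrative MA(1), theta_1 = 0.5
print("acf :", [round(x, 4) for x in rhos])
print("pacf:", [round(x, 4) for x in durbin_pacf(rhos)])
```

In this example the non-invertible choice $\theta_1 = 2$ yields the same $\rho_1$ as $\theta_1 = 0.5$, which is the multiplicity of solutions noted by Wold for q = 1.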
So, in contrast to a stationary AR(p) process, the autocorrelation function of an invertible MA(q) process is finite and cuts off after lag q, and the partial autocorrelation function is infinite in extent, dominated by damped exponentials and damped sine waves (Box and Jenkins, 1976).

Mixed Autoregressive-Moving Average Models

In practice, to obtain a parsimonious parameterization, it will sometimes be necessary to include both autoregressive and moving average terms in the model. A mixed autoregressive-moving average process of order (p,q), ARMA(p,q), can be written as

$$z_t = \phi_1 z_{t-1} + \cdots + \phi_p z_{t-p} + a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q} \qquad (3.19)$$

or

$$\phi(B)\, z_t = \theta(B)\, a_t \qquad (3.20)$$

with (p+q+2) parameters, $\mu, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma_a^2$, to be estimated from the data.

An ARMA(p,q) process will be stationary provided that the characteristic equation $\phi(B) = 0$ has all its roots outside the unit circle. Similarly, the roots of $\theta(B) = 0$ must lie outside the unit circle for the process to be invertible.

By multiplying equation (3.19) by $z_{t-k}$ and taking expectations we obtain

$$\gamma_k = \phi_1 \gamma_{k-1} + \cdots + \phi_p \gamma_{k-p} + \gamma_{za}(k) - \theta_1 \gamma_{za}(k-1) - \cdots - \theta_q \gamma_{za}(k-q) \qquad (3.21)$$

where $\gamma_{za}(k)$ is the cross covariance function between z and a, defined by $\gamma_{za}(k) = E[z_{t-k}\, a_t]$. Since $z_{t-k}$ depends only on shocks which have occurred up to time t-k, it follows that

$$\gamma_{za}(k) = 0, \quad k > 0; \qquad \gamma_{za}(k) \ne 0, \quad k \le 0 \qquad (3.22)$$

and (3.21) implies

$$\rho_k = \phi_1 \rho_{k-1} + \phi_2 \rho_{k-2} + \cdots + \phi_p \rho_{k-p}, \qquad k \ge q + 1 \qquad (3.23)$$

or

$$\phi(B)\, \rho_k = 0, \qquad k \ge q + 1 \qquad (3.24)$$

Thus, for the ARMA(p,q) process the first q autocorrelations $\rho_1, \rho_2, \ldots, \rho_q$ depend directly on the choice of the q moving average parameters $\theta$, as well as on the p autoregressive parameters $\phi$, through (3.21). The autocorrelations of higher lags $\rho_k$, $k \ge q + 1$, are determined through the difference equation (3.24) after providing the p starting values $\rho_{q-p+1}, \ldots, \rho_q$. So, the autocorrelation function of an ARMA(p,q) model is infinite in extent, with the first q-p values $\rho_1, \ldots, \rho_{q-p}$ irregular and the others consisting of damped exponentials and/or damped sine waves (Box and Jenkins, 1976; Salas et al., 1980).

Autoregressive Integrated Moving Average Models

An ARMA(p,q) process is stationary if the roots of $\phi(B) = 0$ lie outside the unit circle and "explosive nonstationary" if they lie inside. For example, an explosive nonstationary AR(1) model is $z_t = 2 z_{t-1} + a_t$ (the plot of $z_t$ vs. t is an exponential growth), in which $\phi(B) = 1 - 2B$ has its root B = 0.5 inside the unit circle. The special case of homogeneous nonstationarity arises when one or more of the roots lie on the unit circle. By introducing a generalized autoregressive operator $\varphi(B)$, which has d of its roots on the unit circle, the general model can be written as

$$\varphi(B)\, z_t = \phi(B)\, (1 - B)^d z_t = \theta(B)\, a_t \qquad (3.25)$$

that is

$$\phi(B)\, w_t = \theta(B)\, a_t \qquad (3.26)$$

where

$$w_t = (1 - B)^d z_t = \nabla^d z_t \qquad (3.27)$$

and $\nabla = 1 - B$ is the difference operator. This model corresponds to assuming that the dth difference of the series can be represented by a stationary, invertible ARMA process. By inverting (3.27)

$$z_t = \nabla^{-d} w_t = S^d w_t \qquad (3.28)$$

where S is the infinite summation operator

$$S = 1 + B + B^2 + \cdots = (1 - B)^{-1} = \nabla^{-1} \qquad (3.29)$$

Equation (3.28) implies that the nonstationary process $z_t$ can be obtained by summing or "integrating" the stationary process $w_t$, d times. Therefore, this process is called a simple autoregressive integrated moving average process, ARIMA(p,d,q).

It is also possible to take periodic or seasonal differences at lag s of the series, e.g., the 12th difference of monthly series, introducing the differencing operator $\nabla_s^D$ with the meaning that seasonal differencing $\nabla_s$ is applied D times on the series.
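The difference operators $\nabla$ and $\nabla_s$ and the summation of equation (3.28) can be sketched in a few lines. The function names, the lag-3 "season" and the toy series are hypothetical and chosen only so that the arithmetic is easy to follow; the sketch also makes explicit that summation needs the first values of the original series as starting values.

```python
def difference(series, lag=1):
    """Apply the operator (1 - B**lag) once: w_t = z_t - z_{t-lag}."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def integrate(diffs, initial, lag=1):
    """Invert `difference` by summation, given the first `lag` original values."""
    out = list(initial)
    for w in diffs:
        out.append(out[-lag] + w)         # z_t = z_{t-lag} + w_t
    return out


# Toy series with a repeating pattern of period 3 superimposed on a trend.
z = [3, 5, 4, 6, 8, 7, 9, 11]

w_seasonal = difference(z, lag=3)         # seasonal difference removes the pattern
w_both = difference(w_seasonal, lag=1)    # an ordinary difference removes the trend
print(w_seasonal, w_both)

# Summation recovers z only when the correct starting values are supplied.
print(integrate(w_seasonal, z[:3], lag=3) == z)
print(integrate(w_seasonal, [0, 0, 0], lag=3))
```

Applying `difference` d times with lag 1 corresponds to $\nabla^d$, and D times with lag s to $\nabla_s^D$.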
This periodic ARIMA(P,D,Q) model can be written as

$$\Phi(B^s)\, \nabla_s^D z_t = \Theta(B^s)\, a_t \qquad (3.30)$$

The combination of nonperiodic and periodic models leads to the multiplicative ARIMA(p,d,q) x ARIMA(P,D,Q) model, which can be written as

$$\phi(B)\, \Phi(B^s)\, \nabla^d \nabla_s^D z_t = \theta(B)\, \Theta(B^s)\, a_t \qquad (3.31)$$

After the model has been fitted to the differenced series, an integration should be performed to retrieve the original process. But such an integrated series would lack a mean value, since a constant of integration has been lost through the differencing. This is the reason that the ARIMA models cannot be used for synthetic generation of time series, although they are useful in forecasting the deviations of a process (Box and Jenkins, 1976; Salas et al., 1980).

Transformation of the Original Series

Transformation to Normality

Most probability theory and statistical techniques have been developed for normally distributed variables. Hydrologic variables are usually asymmetrically distributed or bounded by zero (positive variables), and so a transformation to normality is often applied before modeling. Another approach would be to model the original skewed series and then find the probability distribution of the uncorrelated residuals. Care must then be taken to assess the errors of applying methods developed for normal variables to skewed variables, especially when the series are highly skewed, e.g., hourly or daily series.

On the other hand, when transforming the original series into normal, biases in the mean and standard deviation of the generated series may occur. In other words, the statistical properties of the transformed series may be reproduced in the generated series but not in the original series. An alternative for avoiding biases in the moments of the generated series would be to estimate the moments of the transformed series through the derived relationships between the moments of the skewed and normal series.
Matalas (1967) and Fiering and Jackson (1971) describe how to estimate the first two moments of the log-transformed series so as to reproduce those of the original series. Mejia et al. (1974) present another approach in order to preserve the correlation structure of the original series.

However, the most widely used approach is to transform the original skewed series to normal and then model the normal series. Several transformations may be applied to the original series, and the transformed series then tested for normality, e.g., the graph of their cumulative distribution should appear as a straight line when it is plotted on normal probability paper. The transformation finally chosen is the one that gives the best approximation to normality, e.g., the best fit to a straight line.

Another advantage of transforming the series to normal is that the maximum likelihood estimates of the model parameters are essentially the same as the least squares estimates, provided that the residuals are normally distributed (Box and Jenkins, 1976, Ch. 7). This facilitates the calculation of the final estimates, since they are those values that minimize the sum of squares of the residuals.

Box and Cox (1964) showed how a maximum likelihood and a parallel Bayesian analysis can be applied to any type of transformation family to obtain the "best" choice of transformation from that family. They illustrated those methods for the popular power families in which the observation x is replaced by y, where

$$y = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \ne 0 \\ \log x, & \lambda = 0 \end{cases} \qquad (3.32)$$

The fundamental assumption was that for some $\lambda$ the transformed observations y can be treated as independently normally distributed with constant variance $\sigma^2$ and with expectations defined by a linear model

$$E[y] = A\, L \qquad (3.33)$$

where A is a known constant matrix and L is a vector of unknown parameters associated with the transformed observations (Box and Cox, 1964).
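As a rough illustration of equation (3.32), the sketch below applies the power transformation to a small positive, right-skewed sample and compares the sample skewness for a few values of $\lambda$. The sample values and function names are hypothetical; they are not rainfall data from this study, and the skewness coefficient used is the simple moment ratio.

```python
import math


def box_cox(x, lam):
    """Power transformation of equation (3.32) for a positive observation x."""
    if x <= 0:
        raise ValueError("the transformation requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def skewness(values):
    """Moment coefficient of skewness m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample (illustrative only).
sample = [0.5, 1.2, 2.0, 3.1, 4.8, 7.5, 12.0, 20.3]

for lam in (1.0, 0.5, 0.25, 0.0):
    transformed = [box_cox(x, lam) for x in sample]
    print(f"lambda = {lam:4.2f}  skewness = {skewness(transformed):+.3f}")

# For very small lambda the transformed value approaches log x.
print(box_cox(4.8, 1e-8), math.log(4.8))
```

With $\lambda = 1$ the transformation is only a shift, so the skewness equals that of the raw sample; smaller values of $\lambda$ pull in the long right tail.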
This transformation has the advantage over the simple power transformation proposed by Tukey (1957)

$$y = \begin{cases} x^{\lambda}, & \lambda \neq 0 \\ \log x, & \lambda = 0 \end{cases} \qquad (3.34)$$

of being continuous at $\lambda = 0$. Otherwise the two transformations are identical provided, as has been shown by Schlesselman (1971), that the linear model of (3.33) contains a constant term.

Further, Draper and Cox (1969) showed that the value of $\lambda$ obtained from this family of transformations can be useful even in cases where no power transformation can produce normality exactly. Also, John and Draper (1980) suggested an alternative one-parameter family of transformations when the power transformation fails to produce satisfactory distributional properties, as in the case of a symmetric distribution with long tails.

The selection of the exact transformation to normality (zero skewness) is not an easy task. Overtransformation, i.e., transformation of the original data with a large positive (negative) skewness to data with a small negative (positive) skewness, or undertransformation, i.e., transformation of the original data with a large positive (negative) skewness to data with a small positive (negative) skewness, may result in unsatisfactory modeling of the series or in forecasts that are in error. This was the case for the data used by Chatfield and Prothero (1973a), who applied the Box-Jenkins forecasting approach and were dissatisfied with the results, concluding that the Box-Jenkins forecasting procedure is less efficient than other forecasting methods. They applied a log transform to the data which evidently overtransformed the data, as shown by Box and Jenkins (1973), who finally suggested the approximate transformation $y = x^{0.25}$, even though the complicated but precise Box-Cox procedure gave an estimate of $\lambda = 0.37$ (Wilson, 1973). Thus, the selection of the normality transformation greatly affects the forecasts, as Chatfield and Prothero (1973b) experienced with their data. They concluded that:
We have seen that a "small" change in $\lambda$ from 0 to 0.25 has a substantial effect on the resulting forecasts from model A [ARIMA(1,1,1) x ARIMA(1,1,1)$_{12}$] even though the goodness of fit does not seem to be much affected. This reminds us that a model which fits well does not necessarily forecast well. Since small changes in $\lambda$ close to zero produce marked changes in forecasts, it is obviously advisable to avoid "low" values of $\lambda$, since a procedure which depends critically on distinguishing between fourth-root and logarithmic transformation is fraught with peril. On the other hand a "large" change in $\lambda$ from 0.25 to 1 appears to have relatively little effect on forecasts. So we conjecture that Box-Jenkins forecasts are robust to changes in the transformation parameter away from zero. [Chatfield and Prothero (1973b), p. 347]

Stationarity

Most time series occurring in practice exhibit nonstationarity in the form of trends or periodicities. The physical knowledge of the phenomenon being studied and a visual inspection of the plot of the original data may give the first insight into the problem. Usually the series is not long enough, and the detection of trends or cycles only through the plot of the series is ambiguous. Useful tools for the detection of periodicities are the autocorrelation function and the spectral density function of the series (which is the Fourier transform of the autocorrelation function). If a seasonal pattern is present in the series, then the correlogram (plot of the autocorrelation function) will exhibit a sinusoidal appearance and the periodogram (plot of the spectral density function) will show peaks. The period of the sinusoidal function of the correlogram, or the frequency where the peaks occur in the periodogram, can determine the periodic component exactly (Jenkins and Watts, 1968).
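The use of the correlogram and the periodogram for detecting a periodicity can be sketched as follows. This is only an illustrative example on a synthetic series with an assumed 12-month cosine component plus noise; the function names are hypothetical and the raw periodogram is used without any smoothing.

```python
import numpy as np

def autocorrelation(z, max_lag):
    """Sample autocorrelation function for lags 0..max_lag."""
    z = np.asarray(z, dtype=float)
    d = z - z.mean()
    c0 = np.dot(d, d)
    return np.array([np.dot(d[: len(d) - k], d[k:]) / c0
                     for k in range(max_lag + 1)])

def periodogram(z):
    """Raw periodogram; frequencies are in cycles per time step (month)."""
    z = np.asarray(z, dtype=float)
    d = z - z.mean()
    n = len(d)
    power = np.abs(np.fft.rfft(d)) ** 2 / n
    freqs = np.fft.rfftfreq(n, d=1.0)
    return freqs, power

# Synthetic monthly series: 20 years, 12-month cycle plus white noise (assumed data).
rng = np.random.default_rng(1)
t = np.arange(240)
z = 3.0 + 2.0 * np.cos(2.0 * np.pi * t / 12.0) + rng.normal(0.0, 1.0, t.size)

r = autocorrelation(z, 24)          # oscillates with a 12-month period
freqs, power = periodogram(z)
peak_freq = freqs[np.argmax(power)] # expected near 1/12 cycle per month
```

In this sketch the correlogram is positive near lags 12 and 24 and negative near lags 6 and 18, and the periodogram peak falls at the frequency of the annual cycle.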
Another device for the detection of trends and periodicities is to fit some definite mathematical function, such as exponentials, Fourier series or polynomials, to the series and then model the residual series, which is assumed to be stationary. More details on the treatment of nonstationary data as well as on the interpretation of the correlogram and periodogram of a time series can be found in textbooks such as Bendat and Piersol (1958), Jenkins and Watts (1968), Wastler (1969), Yevjevich (1972), and Chatfield (1980).

Apart from the approach of removing the nonstationarity of the original series and modeling the residual series with a stationary ARMA(p,q) model, the original nonstationary series can be modeled directly with a simple or seasonally integrated ARIMA model. Actually, the second approach can be viewed as an extension of the first one, i.e., the nonstationarity is removed through the simple ($\nabla$) or seasonal ($\nabla_s$) differencing. However, the integrated model cannot be used for generation of data, as has already been discussed.

For many hydrologic applications, one is satisfied with second-order or weak stationarity, i.e., stationarity in the mean and variance. Furthermore, weak stationarity and the assumption of normality imply strict stationarity (see Appendix A).

Monthly Rainfall Series

Normalization and Stationarization

Stidd (1953, 1968) suggested that rainfall data have a cube-root normal distribution because they are product functions of three variables: vertical motion in the atmosphere, moisture, and duration time. Synthetic rainfall data generated using processes analogous to those operating in nature showed that the exponent required to normalize the distribution is between 0.5 (square root) and 0.33 (cube root) for different types of rainfall (Stidd, 1970).
The square root transformation has been extensively used for the approximate normalization of monthly rainfall series (see Table C.12 of Appendix C) with satisfactory results: Delleur and Kavvas (1978), Salas et al. (1980, Ch. 5), Roesner and Yevjevich (1966). However, Hinkley (1977) used the exact Box-Cox transformation for monthly rainfall series. Although Asley et al. (1977) have developed an efficient algorithm for the estimation of $\lambda$ along with other parameters in an ARIMA model, it seems that the exact value of $\lambda$ is not more reliable than the approximate one, $\lambda = 0.5$ (Chatfield and Prothero, 1973b). The reasons for this follow.

First, Chatfield and Prothero (1973b) used the Box-Cox procedure to evaluate the exact transformation of their data. They obtained estimates $\lambda = 0.24$ using all the data (77 observations), $\lambda = 0.34$ using the first 60 observations and $\lambda = 0.16$ excluding the first year's data. Therefore, it is logical to infer that even if the complicated Box-Cox procedure is used for the incomplete rainfall record, the missing values may be enough to give a spurious $\lambda$, which is not "more exact" than the value of 0.5 used in practice. Second, the use of either $\lambda = 0.33$ (cube root) or $\lambda = 0.5$ (square root) is not expected to greatly affect the forecasts since, according to Chatfield and Prothero (1973b), the Box-Jenkins forecasts are not too sensitive to changes of $\lambda$ for $\lambda > 0.25$.

Monthly rainfall series are nonstationary. The variation in the mean is obvious since generally the expected monthly rainfall value for January is not the same as that of July. Although the variation of the standard deviation is not so easy to visualize, calculations show that months with a higher mean usually have a higher standard deviation. Thus, each month has its own probability distribution and its own statistical parameters, resulting in monthly series that are nonstationary.
By introducing the concept of circular stationarity as developed by Hannan (1960) and others (see Appendix A for definition), the periodic monthly rainfall series can be considered not as nonstationary but as circularly stationary, since circular stationarity suggests that the probability distribution of rainfall in a particular month is the same for the different years. Then, the monthly rainfall series is composed of a circularly stationary (periodic) component and a stationary random component.

The time-series models currently used in hydrology are fitted to the stationary random component, so the circularly stationary component must be removed before modeling. This last component appears as a sinusoidal component in the autocorrelation function (with a 12-month period) or as a discrete spectral component in the spectrum (peak at the frequency 1/12 cycle per month). Usually several subharmonics of the fundamental 12-month period are needed to describe all the irregularities present in the autocorrelation function and spectral density function, since in nature the periodicity does not follow an ideal cosine function with a 12-month period. The use of a Fourier series approach for the approximation of the periodic component of monthly rainfall and monthly runoff series has been illustrated by Roesner and Yevjevich (1966).

Kavvas and Delleur (1975) investigated three methods of removal of periodicities in the monthly rainfall series: nonseasonal (first-lag) differencing, seasonal differencing (12-month difference), and removal of monthly means. They worked both analytically and empirically using the rescaled (divided by the monthly standard deviation) monthly rainfall square roots for fifteen Indiana watersheds. They concluded that "all the above transformations yield hydrologic series which satisfy the classical second-order weak stationarity conditions.
Both seasonal and nonseasonal differencing reduce the periodicity in the covariance function but distort the original spectrum, thus making it impractical or impossible to fit an ARMA model for generation of synthetic monthly series. The subtraction of monthly means removes the periodicity in the covariance and the amount of nonstationarity introduced is negligible for practical purposes." (Kavvas and Delleur, 1975, p. 349.) In other words, they concluded that the best way of modeling monthly rainfall series is to remove the seasonality (by subtracting the monthly means and dividing by the standard deviations of the normalized series) and then use a stationary ARMA(p,q) model to model the stationary normal residuals.

Modeling of Normalized Series

It is assumed that the nonstationarities due to long-term trends are removed before any operation. Then the appropriate transformation is applied to the data in order to obtain an approximately normal distribution. For monthly rainfall series experience has shown that the best practical transformation is the square root transformation, as has already been discussed. What remains is the modeling of the normalized series with one of the following models: stationary ARMA(p,q), simple nonstationary ARIMA(p,d,q), seasonal nonstationary ARIMA(P,D,Q)$_s$, or multiplicative ARIMA(p,d,q) x (P,D,Q)$_s$ model.

Delleur and Kavvas (1978) fitted different models to the monthly rainfall series of 15 basins in Indiana and compared the results. They studied the models ARIMA(0,0,0), ARIMA(1,0,1), ARIMA(1,1,1), ARIMA(1,1,1)$_{12}$, and ARIMA(1,0,0) x (1,1,1)$_{12}$ on the square-root transformed series. They concluded that from the nonseasonal ARIMA models, ARMA(1,1) "emerged as the most suitable for the generation and forecasting of monthly rainfall series." The goodness-of-fit tests applied on the residuals were the portmanteau lack-of-fit test (see Appendix A) of Box and Pierce (1970) and the cumulative periodogram test (Box and Jenkins, 1976, p. 294).
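The removal of seasonality described above (square root transformation, then subtraction of the monthly means and division by the monthly standard deviations) can be sketched as follows. This is a minimal illustration assuming a complete record that starts in January and covers whole years; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def standardize_monthly(rain):
    """Square-root transform, then standardize month by month.

    rain: 1-D array of monthly totals, length a multiple of 12,
          first value assumed to be a January.
    Returns the standardized series and the 12 monthly means and
    standard deviations of the square roots.
    """
    x = np.sqrt(np.asarray(rain, dtype=float)).reshape(-1, 12)  # rows = years
    mean = x.mean(axis=0)
    std = x.std(axis=0, ddof=1)
    z = (x - mean) / std
    return z.ravel(), mean, std

def destandardize_monthly(z, mean, std):
    """Invert standardize_monthly (valid while the restored roots are nonnegative)."""
    x = np.asarray(z, dtype=float).reshape(-1, 12) * std + mean
    return (x ** 2).ravel()

# Synthetic 55-year record (assumed gamma-distributed totals, not real data).
rng = np.random.default_rng(3)
rain = rng.gamma(shape=2.0, scale=1.5, size=55 * 12)
z, mean, std = standardize_monthly(rain)
restored = destandardize_monthly(z, mean, std)
```

The standardized series z has zero mean and unit variance in every calendar month and is the series to which a stationary ARMA(p,q) model would be fitted.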
The ARMA(1,1) model passed both tests in all cases studied. From the seasonal models, ARIMA(1,0,0) x (1,1,1)$_{12}$ also passed the goodness-of-fit tests in all cases, but they stress that this model "has only limited use in the forecasting of monthly rainfall series since it does not preserve the monthly standard deviations." As far as forecasts are concerned, they showed that "the forecasts by the several models follow each other very closely and the forecasts rapidly tend to the mean of the observed rainfall square roots (which is the forecast of the white noise model)."

CHAPTER 4
MULTIVARIATE STOCHASTIC MODELS

Introduction

For univariate stochastic models the sequence of observations under study is assumed independent of other sequences of observations and so is studied by itself (single or univariate time series). However, in practice there is always an interdependence among such sequences of observations, and their simultaneous study leads to the concept of multivariate statistical analysis. For example, a rainfall series of one station may be better modeled if its correlation with concurrent rainfall series at other nearby stations is incorporated into the model.

Multiple time series can be divided into two groups: (1) multiple time series at several points (e.g., rainfall series at different stations, streamflow series at various points of a river), and (2) multiple series of different kinds at one point (e.g., rainfall and runoff series at the same station). In general, both kinds of multiple time series are studied simultaneously, and their correlation and cross-correlation structure is used for the construction of a model that better describes all these series. The parameters of this so-called multivariate stochastic model are calculated such that the correlation and cross-correlation structure of the multiple measured series are preserved in the multiple series generated by the model.
The multivariate models that will be presented in this chapter have been developed and extensively used for the generation of synthetic series. How these models can be adapted and used for filling in missing values will be discussed in Chapter 5.

General Multivariate Regression Model

The general form of a multivariate regression model is

$$Y = A X + B H \qquad (4.1)$$

where Y is the vector of dependent variables, X the vector of independent variables, A and B matrices of regression coefficients, and H a vector of random components. The vectors Y and X may consist of either the same variable at different points (or at different times) or different variables at the same or different points (or at different times). For convenience and without loss of generality all the variables are assumed second-order stationary and normally distributed with zero mean and unit variance. Transformations to accomplish normality have been discussed in Chapter 3. A random component is superimposed on the model to account for the nondeterministic fluctuations.

In the above model, the dependent and independent variables must be selected carefully so that the most information is extracted from the existing data. A good summary of the methods for the selection of independent variables for use in the model is given in Draper and Smith (1966). Most popular is the stepwise regression procedure, in which the independent variables are ranked as a function of their partial correlation coefficients with the dependent variable and are added to the model, in that order, if they pass a sequential F test.

The parameter matrices A and B are calculated from the existing data in such a way that important statistical characteristics of the historical series are preserved in the generated series. This estimation procedure becomes cumbersome when too many dependent and independent variables are involved in the model, and several simplifications are often made in practice.
On the other hand, restrictions have to be imposed on the form of the data, as we shall see later, to ensure the existence of real solutions for the matrices A and B.

Multivariate Lag-One Autoregressive Model

If only one variable (e.g., rainfall at different stations) is used in the analysis, then the model of equation (4.1) becomes a multivariate autoregressive model. Since in the rest of this chapter we will be dealing only with one variable (rainfall) which has been transformed to normal and second-order stationary, the vectors Y and X are replaced by the vector Z for a notation consistent with the univariate models. Matalas (1967) suggested the multivariate lag-one autoregressive model

$$Z_t = A\,Z_{t-1} + B\,H_t \qquad (4.3)$$

where $Z_t$ is an (m x 1) vector whose ith element $z_{i,t}$ is the observed rainfall value at station i and at time t, and the other variables have been described previously.

Such a model can be used for the simultaneous generation of rainfall series at m different stations. The correlation and cross-correlation of the series is incorporated in the model through the parameters A and B. The matrices A and B are estimated from the historical series so that the means, standard deviations and lag-one autocorrelation coefficients for all the series, as well as the lag-zero and lag-one cross-correlations between pairs of series, are maintained.

Let $M_0$ denote the lag-zero correlation matrix, which is defined as

$$M_0 = E[Z_t Z_t^T] \qquad (4.4)$$

Then a diagonal element of $M_0$ is $E[z_{i,t}\,z_{i,t}] = \rho_{ii}(0) = 1$ (since $Z_t$ is standardized) and an off-diagonal element (i,j) is $E[z_{i,t}\,z_{j,t}] = \rho_{ij}(0)$, which is the lag-zero cross-correlation between series $\{z_i\}$ and $\{z_j\}$. The matrix $M_0$ is symmetric since $\rho_{ij}(0) = \rho_{ji}(0)$ for every i, j.

Let $M_1$ denote the lag-one correlation matrix, defined as

$$M_1 = E[Z_t Z_{t-1}^T] \qquad (4.5)$$

A diagonal element of $M_1$ is $E[z_{i,t}\,z_{i,t-1}] = \rho_{ii}(1)$, which is the lag-one serial correlation coefficient of the series $\{z_i\}$, and an off-diagonal element (i,j) is $E[z_{i,t}\,z_{j,t-1}] = \rho_{ij}(1)$, which is the lag-one cross-correlation between the $\{z_i\}$ and $\{z_j\}$ series, the latter lagged behind the former. Since in general $\rho_{ij}(1) \neq \rho_{ji}(1)$ for $i \neq j$, the matrix $M_1$ is not symmetric.

After some algebraic manipulations (see Appendix B) the coefficient matrices A and B are obtained as solutions to the equations

$$A = M_1 M_0^{-1} \qquad (4.6)$$

$$B B^T = M_0 - M_1 M_0^{-1} M_1^T \qquad (4.7)$$

where $M_0^{-1}$ is the inverse of $M_0$, and $M_1^T$ the transpose of $M_1$. The correlation matrices $M_0$ and $M_1$ are calculated from the data. Then an estimate of the matrix A is given directly by equation (4.6), and an estimate for B is found by solving equation (4.7) using a technique of principal component analysis (Fiering, 1964) or upper triangularization (Young, 1968). For more details on the solution of equation (4.7) see Appendix B.

Comments on Multivariate AR(1) Model

Assumption of Normality and Stationarity

We have assumed that all random variables involved in the model are normal. The assumption of a multivariate normal distribution is convenient but not necessary. It has been shown (Valencia and Schaake, 1973) that the multivariate AR(1) model preserves first- and second-order statistics regardless of the underlying probability distributions. Several studies have been done using the original skewed series directly. Matalas (1967) worked with lognormal series and constructed the generation model so that it preserves the historical statistics of the lognormal process. Mejia et al. (1974) showed a procedure for multivariate generation of mixtures of normal and lognormal variables. Moran (1970) indicated how a multivariate gamma process may be applied, and Kahan (1974) presented a method for the preservation of skewness in a linear bivariate regression model. But in general, the normalization of the series prior to modeling is more convenient, especially when the series have different underlying probability distributions.
In such cases different transformations are applied on the series, and that combination of transformations is kept which yields minimum average skewness. Average skewness is the sum of the skewness of each series divided by the number of series or number of stations used. This operation is called finding the MST (Minimum Skewness Transformation) and results in an approximately multivariate normal distribution (Young and Pisano, 1968).

We have also assumed that all variables are standardized, i.e., have zero mean and unit variance. This assumption is made without loss of generality since the linear transformations are preserved through the model. On the other hand this transformation becomes necessary when modeling periodic series, since by subtracting the periodic means and dividing by the standard deviations we remove almost all of the periodicity. If the data are not standardized, $M_0$ and $M_1$ represent the lag-zero and lag-one covariance matrices (instead of correlation matrices), respectively. If S denotes the diagonal matrix of the standard deviations and $R_0$, $R_1$ the lag-zero and lag-one correlation matrices, then

$$M_0 = S\,R_0\,S \qquad (4.8)$$

and

$$M_1 = S\,R_1\,S \qquad (4.9)$$

When we standardize the data the matrix S is an identity matrix and $M_0$, $M_1$ become the correlation matrices $R_0$ and $R_1$, respectively. Thus, one other advantage of standardization is that we work with correlation matrices whose elements are less than unity, and the computations are likely to be more stable (Pegram and James, 1972).

Cross-Correlation Matrix $M_1$

Notice that the lag-one correlation matrix $M_1$ has been defined as $M_1 = E[Z_t Z_{t-1}^T]$, which contains the lag-one cross-correlations between pairs of series but having the second series lagged behind the first one. Following this definition the lag-minus-one correlation matrix will be

$$M_{-1} = E[Z_{t-1} Z_t^T] \qquad (4.10)$$

and it will contain the lag-one correlations having now the second series lagged ahead of the first one.
It is easy to show that $M_{-1}$ is actually the transpose of $M_1$:

$$M_{-1} = E[Z_{t-1} Z_t^T] = E[(Z_t Z_{t-1}^T)^T] = M_1^T \qquad (4.11)$$

Care then must be taken so that there is a consistency between the equation used to calculate matrix A and the way that the cross-correlation coefficients have been calculated. Such an inconsistency was present in the numerical multisite package (NMP) developed by Young and Pisano (1968) and was first corrected by O'Connell (1973) and completely corrected and improved by Finzi et al. (1974, 1975).

Incomplete Data Sets

In practice, hydrologic series at different stations are unlikely to be concurrent and of equal length. With lag-zero auto- and cross-correlation coefficients calculated from the incomplete data sets, the lag-zero correlation matrix $M_0$ obtained may not be positive semidefinite, and its inverse $M_0^{-1}$, needed for the calculation of matrix A, thus may have elements that are complex numbers. Also, a necessary and sufficient condition for a real solution of matrix B is that $C = M_0 - M_1 M_0^{-1} M_1^T$ is a positive semidefinite matrix (see Appendix B).

When all of the series are concurrent and complete then $M_0$ and C are both positive semidefinite matrices [Valencia and Schaake, 1973], and the generated synthetic series are real numbers. When the series are incomplete there is no guarantee that real solutions for the matrices A and B exist, causing the model of Matalas (1967) to be conditional on $M_0$ and C being positive semidefinite [Slack, 1973].

Several techniques have been proposed which use the incomplete data sets but guarantee the positive semidefiniteness of the correlation matrices. Fiering (1968) suggested a technique that can be used to produce a positive semidefinite correlation matrix $M_0$. If $M_0$ is not positive semidefinite then negative eigenvalues may occur and hence negative variances, since the eigenvalues are variances in the principal component system. In this technique, the eigenvalues of the original correlation matrix are calculated.
If negative eigenvalues are encountered, an adjustment procedure is used to eliminate them, thereby altering the correlation matrix $M_0$ [Fiering, 1968]. A correlation matrix is called consistent if all its eigenvalues are positive. But consistent estimates of the correlation matrices $M_0$ and $M_1$ do not guarantee that C will also be consistent.

Crosby and Maddock (1970) proposed a technique that is suitable only for monotone data (data continuous in collection to the present but having different starting times). This technique produces a consistent estimate of the matrix $M_0$ as well as of the matrix C, and is based on the maximum likelihood technique developed by Anderson (1957).

Valencia and Schaake (1973) developed another technique. They estimate matrices A and B from the equations

$$A = M_1 M_{01}^{-1} \qquad (4.12)$$

$$B B^T = M_{02} - M_1 M_{01}^{-1} M_1^T \qquad (4.13)$$

where $M_{01}$ is the lag-zero correlation matrix $M_0$ computed from the first (N-1) vectors of the data, and $M_{02}$ is computed from the last (N-1) vectors, where N is the number of data points (number of times sampled) in each of the n series.

Further Simplification

Sometimes in practice, the preservation of the lag-zero and lag-one autocorrelations and the lag-zero cross-correlations is enough. In such cases, i.e., when the lag-one cross-correlations are of no interest, a nice simplification due to Matalas (1967, 1974) can be made. He defined matrix A as a diagonal matrix whose diagonal elements are the lag-one autocorrelation coefficients. With A defined as above, the lag-one cross-correlation of the generated series, $\hat{\rho}_{ij}(1)$, can be shown to be the product of the lag-zero cross-correlation $\rho_{ij}(0)$ and the lag-one autocorrelation of the series $\rho_{ii}(1)$, but of course it differs from the actual lag-one cross-correlation $\rho_{ij}(1)$:

$$\hat{\rho}_{ij}(1) = \rho_{ij}(0)\,\rho_{ii}(1) \qquad (4.14)$$

By using $\hat{\rho}_{ij}(1)$ of equation (4.14) in place of the actual $\rho_{ij}(1)$, thus avoiding the actual computation of $\rho_{ij}(1)$ from the data, the desired statistical properties of the series are still preserved.
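The estimation of A and B from equations (4.6) and (4.7), together with the check that C is positive semidefinite, can be sketched as follows. This is a minimal illustration for complete, concurrent series only. The function names are hypothetical, a Cholesky factor is assumed here as one convenient triangular solution for B (it requires C to be strictly positive definite), and the synthetic series are not standardized, so $M_0$ and $M_1$ are covariance matrices in this example.

```python
import numpy as np

def lag_matrices(Z):
    """Lag-zero and lag-one moment matrices of zero-mean series.

    Z: (N, m) array, rows are time steps, columns are stations.
    """
    Z = np.asarray(Z, dtype=float)
    n = Z.shape[0]
    M0 = Z.T @ Z / n
    M1 = Z[1:].T @ Z[:-1] / (n - 1)      # estimate of E[Z_t Z_{t-1}^T]
    return M0, M1

def fit_lag_one(M0, M1):
    """Solve A = M1 M0^-1 and B B^T = M0 - M1 M0^-1 M1^T."""
    A = M1 @ np.linalg.inv(M0)
    C = M0 - A @ M1.T
    C = (C + C.T) / 2.0                  # remove rounding asymmetry
    if np.linalg.eigvalsh(C).min() < -1e-10:
        raise ValueError("C is not positive semidefinite; no real B exists")
    B = np.linalg.cholesky(C)            # lower-triangular factor, B @ B.T == C
    return A, B

# Synthetic two-station example with known parameters (assumed values).
rng = np.random.default_rng(2)
A_true = np.array([[0.5, 0.1], [0.0, 0.3]])
B_true = np.array([[1.0, 0.0], [0.5, 0.8]])
Z = np.zeros((20000, 2))
for t in range(1, len(Z)):
    Z[t] = A_true @ Z[t - 1] + B_true @ rng.standard_normal(2)

M0, M1 = lag_matrices(Z)
A_hat, B_hat = fit_lag_one(M0, M1)
```

Because B is determined only through the product $B B^T$, the comparison with the generating parameters should be made on $B B^T$ rather than on B itself.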
Higher Order Multivariate Models

The order p of a multivariate autoregressive model could be estimated from the plots of the autocorrelation and partial autocorrelation functions of the series (Salas et al., 1980) as an extension of the univariate model identification, which is already a difficult and ambiguous task. However, in practice first- and second-order models are usually adequate and higher-order models should be avoided (Box and Jenkins, 1976).

In any case, the multivariate multilag autoregressive model of order p takes the form

$$Z_t = \sum_{k=1}^{p} A_k Z_{t-k} + B\,H_t \qquad (4.15)$$

and the matrices $A_1, A_2, \ldots, A_p$, B are the solutions of the equations

$$M_i = \sum_{k=1}^{p} A_k M_{i-k}, \qquad i = 1, 2, \ldots, p \qquad (4.16)$$

$$B B^T = M_0 - \sum_{k=1}^{p} A_k M_k^T \qquad (4.17)$$

where $M_k$ is the lag-k correlation matrix. Equation (4.16) is a set of p matrix equations to be solved for the matrices $A_1, A_2, \ldots, A_p$, and matrix B is obtained from (4.17) using techniques already discussed. Here, the assumption of diagonal A matrices becomes even more attractive. For a multivariate second-order AR process the above simplification is illustrated in Salas and Pegram (1977), where the case of periodic (not constant) matrix parameters is also considered.

O'Connell (1974) studied the multivariate ARMA(1,1) model

$$Z_t = A\,Z_{t-1} + B\,H_t - C\,H_{t-1} \qquad (4.18)$$

where A, B, and C are coefficient matrices to be determined from the data. Specifically they are solutions of the system of matrix equations

$$B B^T + C C^T = S, \qquad C B^T = T \qquad (4.19)$$

where S and T are functions of the correlation matrices $M_0$, $M_1$ and $M_2$. Methods for solving this system are proposed by O'Connell (1974). Explicit solutions for higher-order multivariate ARMA models are not available, and Salas et al. (1980) propose an approximate multivariate ARMA(p,q) model.

CHAPTER 5
ESTIMATION OF MISSING MONTHLY RAINFALL VALUES: A CASE STUDY

Introduction

This section compares and evaluates different methods for the estimation of missing values in hydrological time series.
A case study is presented in which four of the simplified methods presented in Chapter 2 have been applied to a set of four concurrent 55-year monthly rainfall series from south Florida and the results compared. Also, a recursive method for the estimation of missing values by the use of a univariate or multivariate stochastic model has been proposed and demonstrated. The theory already presented in Chapters 2, 3 and 4 is supplemented whenever needed.

Set Up of the Problem

The monthly rainfall series of four stations in the South Florida Water Management District (SFWMD) have been used in the analysis. These stations are:

Station A: MRF6038, Moore Haven Lock 1
Station 1: MRF6013, Avon Park
Station 2: MRF6093, Fort Myers WSO AP.
Station 3: MRF6042, Canal Point USDA

For convenience the four stations will sometimes be addressed as A, 1, 2, 3 instead of their SFWMD identification numbers 6038, 6013, 6093 and 6042, respectively. Their locations are shown in the map of Fig. 5.1. Station A in the center is considered as the interpolation station (whose missing values are to be estimated) and the other three stations 1, 2 and 3 as the index stations. Care has been taken so that the three index stations are as close to and as evenly distributed around the interpolation station as possible.

This particular set of four stations was selected because it exhibits many desired and convenient properties: (1) the stations have an overlapping period of 55 years (1927-1981), (2) for this 55-year period the record of the interpolation station (station A) is complete (no missing values), (3) the three index stations have a small percent of missing values for the overlapping period (station 1: 2.7% missing, station 2: complete, and station 3: 1.2% missing values).
The 55-year length of the records is considered long enough to establish the historical statistics (e.g., monthly mean, standard deviation and skewness) and provides a monthly series of a satisfactory length (660 values) for fitting a univariate or multivariate ARMA model.

Fig. 5.1. The four south Florida rainfall stations used in the analysis. A: 6038, Moore Haven Lock 1; 1: 6013, Avon Park; 2: 6093, Fort Myers WSO AP.; 3: 6042, Canal Point USDA.

The completeness of the series of the interpolation station permits the random generation of gaps in the series, corresponding to different percentages of missing values, with the method described in Chapter 1. After the missing values have been estimated by the applied models, the gaps are infilled with the estimated values and the statistics of the new (estimated) series are compared with the statistics of the incomplete series and the statistics of the historical (actual) series. Also, the statistical closeness of the infilled (estimated) values to the hidden (actual) values provides a means for the evaluation and comparison of the methods.

When, for the estimation of a missing value of the interpolation station, the corresponding value of one or more index stations is also missing, the latter is eliminated from the analysis, i.e., only the remaining one or two index stations are used for the estimation. Frequent occurrence of such concurrent gaps in both the interpolation and the index stations would alter the results of the applied method in a way that cannot be easily evaluated (e.g., another parameter such as the probability of having concurrent gaps should be included in the analysis).
A small number of missing values in the selected index stations eliminates the possibility of such simultaneous gaps, and thus the effectiveness of the applied estimation procedures can be judged more efficiently.

The statistical properties (e.g., monthly mean, standard deviation, skewness and coefficient of variation) of the truncated (to the 1927-1981 period) original monthly rainfall series for the four stations are shown in Tables C.1, C.2, C.3 and C.4 of Appendix C. Figure 5.2 shows the plot of the monthly means and standard deviations for station A. From these plots we observe that: (1) the plot of monthly means is in agreement with the typical plot for Florida shown in Fig. 1.1, and (2) months with a high mean usually have a high standard deviation. The only exception seems to be the month of January, which in spite of its low mean exhibits a high standard deviation and therefore a very high coefficient of variation and an unusually high skewness. A closer look at the January rainfall values of station A shows that the unusual properties for that month are due to an extreme value of 21.4 inches of rainfall for January 1979, the other values being between 0.05 and 6.04 inches.

The three index stations 1, 2 and 3 are at distances of 59 miles, 51 miles and 29 miles, respectively, from the interpolation station A.

Fig. 5.2. Plot of the monthly means and standard deviations, station 6038 (1927-1981): (a) monthly means, (b) monthly standard deviations. Vertical axes in inches; horizontal axes are the months J through D.

Simplified Estimation Techniques

Techniques Utilized

From the simplified techniques presented in Chapter 2, the following four are applied for the estimation of missing monthly rainfall values:

(1) the mean value method (MV)
(2) the reciprocal distances method (RD)
(3) the normal ratio method (NR), and
(4) the modified weighted average method (MWA).
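As a rough illustration of two of the methods listed above, the sketch below uses the forms of the reciprocal distances and normal ratio estimators that are common in the literature. The exact definitions used in this study are those of Chapter 2 and may differ, so the weighting (first power of distance) and the function names are assumptions. Index stations whose value is missing (NaN) are dropped, as described in the text; the distances are the ones quoted above.

```python
import numpy as np

# Distances (miles) of index stations 1, 2, 3 from station A, as given in the text.
DIST_MILES = np.array([59.0, 51.0, 29.0])

def reciprocal_distance(x, dist=DIST_MILES):
    """Weighted average of index-station values with weights 1/distance.

    x: concurrent values at the index stations; NaN marks a missing value.
    """
    x = np.asarray(x, dtype=float)
    ok = ~np.isnan(x)
    w = 1.0 / np.asarray(dist, dtype=float)[ok]
    return np.sum(w * x[ok]) / np.sum(w)

def normal_ratio(x, normal_index, normal_target):
    """Average of index values scaled by the ratio of long-term monthly means.

    normal_index: long-term means of the index stations for that month.
    normal_target: long-term mean of the interpolation station for that month.
    """
    x = np.asarray(x, dtype=float)
    ok = ~np.isnan(x)
    ratios = normal_target / np.asarray(normal_index, dtype=float)[ok]
    return np.mean(ratios * x[ok])

# Hypothetical monthly values (inches) at stations 1, 2 and 3.
estimate_rd = reciprocal_distance([2.0, 3.0, 4.0])
estimate_nr = normal_ratio([2.0, 4.0, np.nan], [4.0, 8.0, 5.0], 6.0)
```

With these weights the nearest station (station 3, 29 miles) has the largest influence on the reciprocal distances estimate.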
These methods are all deterministic and are applied directly to the available data, thus permitting a uniform and objective comparison of the results. The mean value plus random component method has not been included in this thesis. The four methods are applied for five different percentages of missing values: 2%, 5%, 10%, 15% and 20%. These percentages cover almost 80% of all cases encountered in practice, as shown in Table 1.1 (i.e., 80% of the stations have below 20% missing values). From the same table it can also be seen that almost 30% of the stations have below 5% missing values. It would therefore be of practical interest to generalize the results for the region below 5% missing values, since a large fraction of the cases in practice fall in this region. The application of the first three methods (MV, RD and NR) is straightforward and needs no further comment. However, some comments on the least squares (LS) method and the modified weighted average (MWA) method are necessary.

Least Squares Method (LS)

The least squares method, although simple in principle, involves an enormous amount of calculation, and for that reason it has been excluded from this study. For example, consider the case in which the interpolation station A is regressed on the three index stations 1, 2 and 3. The estimated values are given by

$y' = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \varepsilon$ (5.1)

where $a$, $b_1$, $b_2$, $b_3$ are the regression coefficients calculated from the available concurrent values of all four variables. There are 12 such regression equations, one for each month.
But if an index station (say, station 3) has a missing value simultaneously with the interpolation station, a new set of 12 regression equations is needed for the estimation, e.g.,

$y' = a' + b_1' x_1 + b_2' x_2 + \varepsilon$ (5.2)

Unless this coincidence of simultaneously missing values is investigated manually so that only the needed least squares regressions are performed (Buck, 1960), all the possible combinations of regressions must be performed. This involves regressions among all four variables $(y; x_1, x_2, x_3)$, among three of them $(y; x_1, x_2)$, $(y; x_1, x_3)$, $(y; x_2, x_3)$, and between pairs of them $(y; x_1)$, $(y; x_2)$, $(y; x_3)$, giving overall 7 sets of 12 regression equations. Because the regression coefficients are different for each percentage of missing values (since their calculation is based only on the existing concurrent values), the 84 (7 x 12) regressions must be repeated for each level of missing values (420 regressions overall for this study).

It could be argued that the same 12 regression equations $(y; x_1, x_2, x_3)$ could be kept and a missing value $x_i$ replaced by its mean $\bar{x}_i$ or by another estimate $x_i'$. In that case equation (5.1) would become

$y' = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \varepsilon$ (5.3)

with the missing $x_i$ replaced by $\bar{x}_i$ or $x_i'$, the regression coefficients $a$, $b_1$, $b_2$, $b_3$ remaining unchanged. This can in fact be done, but then the method tested will not be the "pure" least squares method, since the results will depend on the secondary method used for the estimation of the missing $x_i$ values.

The coefficients $a$, $b_1$, $b_2$ and $b_3$ (equation 5.1) of the regression of the $\{y\}$ series (station A with 2% missing values) on the series $\{x_1\}$, $\{x_2\}$ and $\{x_3\}$ (stations 1, 2 and 3, respectively) are shown in Table 5.1. The same table also shows the values of the squared multiple regression coefficient $R^2$ and the standard deviation of the $\{y\}$ series.
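The count of regressions quoted above can be checked by enumeration. The sketch below lists every non-empty subset of the three index stations as a possible set of regressors and multiplies by the 12 months and the 5 levels of missing values; the variable names are illustrative only.

```python
from itertools import combinations

index_stations = ("x1", "x2", "x3")
months = 12
missing_levels = (2, 5, 10, 15, 20)  # percent of missing values tested

# Every non-empty subset of index stations is a possible set of regressors for y.
regressor_sets = [
    subset
    for size in range(1, len(index_stations) + 1)
    for subset in combinations(index_stations, size)
]

sets_per_level = len(regressor_sets)                  # one set of 12 equations each
regressions_per_level = sets_per_level * months       # monthly equations per level
total_regressions = regressions_per_level * len(missing_levels)
```

With three index stations there are 2^3 - 1 = 7 regressor sets, which gives the 84 regressions per level and 420 overall stated in the text; each additional index station would roughly double the count.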
The numbers in parentheses show the significance level $\alpha$ at which the parameters are significant (the percent probability of being nonzero is $(1 - \alpha) \times 100$).

Table 5.1. Least Squares Regression Coefficients for Equation (5.1) and Their Significance Levels. The standard deviation, s, for each month is also given.

Month  a (inches)        b1                b2                b3                R^2               s (inches)
JAN    0.0059 (0.9692)   0.1271 (0.2790)   0.4994 (0.0005)   0.3377 (0.0017)   0.8046 (0.0001)   [illegible]
FEB    0.1355 (0.5260)   0.2624 (0.0025)   0.0086 (0.9431)   0.5345 (0.0001)   0.7033 (0.0001)   [illegible]
MAR    0.0052 (0.9793)   0.1617 (0.0138)   0.3457 (0.0001)   0.4507 (0.0001)   0.9142 (0.0001)   2.464
APR    0.7388 (0.0273)   0.2405 (0.0458)   0.2813 (0.0156)   0.1919 (0.1132)   0.4936 (0.0001)   [illegible]
MAY    2.1302 (0.0070)   0.4046 (0.0115)   0.0591 (0.7180)   0.2186 (0.1308)   0.2752 (0.0016)   2.583
JUN    1.8765 (0.1505)   0.2192 (0.1576)   0.1108 (0.4034)   0.3339 (0.0133)   0.3351 (0.0002)   [illegible]
JUL    2.8601 (0.0750)   0.0345 (0.7883)   0.3993 (0.0131)   0.1885 (0.1780)   0.2005 (0.0154)   [illegible]
AUG    2.0820 (0.2065)   0.1771 (0.1666)   0.2078 (0.0787)   0.2660 (0.0589)   0.1789 (0.0248)   [illegible]
SEP    0.0108 (0.9916)   0.5102 (0.0003)   0.2113 (0.0893)   0.2450 (0.0190)   0.5669 (0.0001)   [illegible]
OCT    0.6985 (0.0866)   0.3960 (0.0020)   0.2287 (0.0433)   0.4667 (0.0001)   0.7749 (0.0001)   3.073
NOV    0.3167 (0.1290)   0.3009 (0.0030)   0.2473 (0.0804)   0.1063 (0.0069)   0.4575 (0.0001)   1.228
DEC    0.2623 (0.1987)   0.2332 (0.1065)   0.3807 (0.0084)   0.4381 (0.0001)   0.7723 (0.0001)   [illegible]

For example, for January the coefficient $b_1$ is not significant at the 5% significance level ($\alpha = 0.05$) since 0.279 is greater than 0.05, but the $R^2$ coefficient is significant even at the 0.01% significance level ($\alpha = 0.0001$). The significance levels correspond to the t-test for the regression coefficients and to the F-test for the $R^2$ coefficients. The standard deviation, s, of the $\{y\}$ series is also listed since the random component is given by

$\varepsilon = R^2 s$ (5.4)

as has already been discussed in Chapter 2.
It is interesting to note that although the multiple regression coefficient $R^2$ varies from month to month, from as low as 0.18 to as high as 0.91, it is always significant at the 5% significance level. The months of July and August exhibit the lowest (although significant) correlation coefficients, as is expected for Florida. The physical reason for these low correlations is that in the summer most rainfall is convective, whereas in other months there is more cyclonic activity. Rainfall from scattered thunderstorms is simply not as correlated with that of nearby areas as is rainfall from broad cyclonic activity. Thus, on the basis of the regressions shown in Table 5.1, the least squares method would be expected to perform least well in the summer in Florida, but this point is not validated in this thesis.

Modified Weighted Average Method (MWA)

For the modified weighted average method the twelve (3 x 3) covariance matrices of the three index stations have been calculated for each month using equations (2.9) and (2.10), and are shown in Table C.11 (Appendix C). Also, the monthly standard deviations $s_y$ have been estimated from the known $\{y\}$ series, and the monthly standard deviations $s_y'$ have been calculated by equation (2.11) using the calculated covariance matrices. Notice that although the twelve $s_y$ values (calculated from the actual data, and which we want to preserve) are different at different percentages of missing values, the twelve $s_y'$ values (which depend only on the weights $a_i$ and the covariance matrix of the index stations) are calculated only once. The correction coefficients $f$ ($f = s_y / s_y'$) for each month and for each percentage of missing values, which must be applied to matrix A (equation 2.21), are shown in Table 5.2.
From this table it can be seen that if the simple weighted average scheme of equation (2.3) were used for the generation, the standard deviation of November would be overestimated (by a factor of approximately 2) and the standard deviation of the other months would generally be underestimated (e.g., by a factor of approximately 0.5 for the month of January). We also observe that, because $s_y$ changes little between different percentages of missing values, the correction factor $f$ does not vary much either, but tends to be slightly greater the greater the percent of missing values.

Table 5.2. Correction Coefficient, f, for Each Month and for Each Different Percent of Missing Values ($f = s_y / s_y'$).

Month   2%      5%      10%     15%     20%
JAN     1.777   1.777   1.795   1.897   1.872
FEB     1.129   1.142   1.136   1.199   1.188
MAR     1.178   1.207   1.177   1.003   1.009
APR     1.089   0.980   1.061   1.051   1.054
MAY     1.269   1.197   1.212   1.222   1.360
JUN     1.214   1.173   1.192   1.228   1.242
JUL     1.338   1.345   1.386   1.390   1.491
AUG     1.424   1.414   1.425   1.432   1.369
SEP     1.313   1.328   1.325   1.210   1.331
OCT     1.258   1.273   1.218   1.229   1.314
NOV     0.533   0.537   0.509   0.583   0.572
DEC     1.161   1.140   1.169   1.172   1.248

The modified weighted average scheme theoretically preserves the mean and variance of the series, as has been shown in Chapter 2. But this is true for a series that has been generated by the model, not for a series that is a mix of existing values and values generated (estimated) by the model. This illustrates the difference between the two concepts: "generation of data by a model" and "estimation of missing values by a model." A method for generation of data which is considered "good" in the sense that it preserves first and second order statistics is not necessarily "good" for the estimation of missing values. In fact, it may give statistics comparable to the ones given by a simpler estimation technique which does not preserve the statistics, even as a generation scheme.
Theoretically, for a "large" number of missing values, the estimation model operates as a generation model and thus preserves the "desired" statistics, but practically, for this large amount of missing values the "desired" statistics (calculated from the few existing values) are of questionable reliability. Only for augmentation of the time series (extension of the series before the first or after the last point) will the modified weighted average scheme or other schemes that preserve the "desired" statistics be expected to work better than the simple weighted average schemes. One other disadvantage of the modified weighted average scheme as well as of the least squares scheme is that negative values may be generated by the model. Since all hydrological variables are positive, the negative generated values are set equal to zero, thus altering the statistics of the series. This is also true for all methods that involve a random component and is mainly due to "big" negative values taken on by the random deviate. The number of negative values, estimated by the MWA method, which have been set equal to zero in the example that follows were 1, 1, 6, 4, and 9 values for the 2%, 5%, 10%, 15% and 20% levels of missing values, respectively. The effect of the values arbitrarily set to zero cannot be evaluated exactly, but what can be intuitively understood is that a distortion in the distribution is introduced. A transformation that prevents the generation of negative values could be performed on the data before the application of the generation scheme. Such a transformation is, for example, the logarithmic transformation since its inverse applied on a negative value exists, and the mapping of the transformed to the original data and vice versa is one to one (this is not true for the square root transformation). 
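The two points made above, that setting negative estimates to zero shifts the statistics and that a logarithmic transformation avoids negative values altogether, can be illustrated with a few lines. This is a minimal sketch with invented numbers, not the procedure used in this study; it also assumes that all values to be transformed are strictly positive, since months with zero rainfall cannot be log-transformed without some additional treatment.

```python
import math
import statistics

# Hypothetical estimates (inches) from a scheme with a random component.
estimates = [2.4, -0.6, 1.1, 0.3, -0.2, 3.0]

# Negative estimates set equal to zero, as done for the MWA results.
truncated = [max(0.0, v) for v in estimates]
shift_in_mean = statistics.mean(truncated) - statistics.mean(estimates)

def to_log(y):
    """Forward transformation; assumes y > 0."""
    return math.log(y)

def from_log(w):
    """Inverse transformation; positive for every real w, including negative w."""
    return math.exp(w)

# Any value produced in the transformed domain maps back to a positive rainfall.
back_transformed = [from_log(w) for w in (-5.0, -1.0, 0.0, 2.0)]
```

Truncation can only raise the mean of the estimates (here `shift_in_mean` is positive), whereas values generated in the logarithmic domain stay positive after back-transformation, so no truncation is needed.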
Comparison of the MV, RD, NR and MWA Methods

The performance of each method applied for the estimation of the missing values will be evaluated by comparing the estimated series (existing plus estimated values) to the incomplete series (really available in practice) and to the actual series (unknown in practice, but known in this artificial case). The criteria used for the comparison of the methods are the following:

(1) the bias in the mean, as measured (a) by the difference between the mean of the estimated series, $\bar{y}_e$, and the mean of the incomplete series, $\bar{y}_i$ (i = 1, 2, 3, 4, 5 for the five different percentages of missing values), and (b) by the difference between the mean of the estimated series, $\bar{y}_e$, and the mean of the actual series, $\bar{y}_a$;
(2) the bias in the standard deviation, as measured (a) by the ratio of the standard deviation of the estimated series, $s_e$, to the standard deviation of the incomplete series, $s_i$, and (b) by the ratio of the standard deviation of the estimated series, $s_e$, to the standard deviation of the actual series, $s_a$;
(3) the bias in the lag-one and lag-two correlation coefficients, as measured by the difference between the correlation coefficient of the estimated series, $r_e$, and the correlation coefficient of the actual series, $r_a$;
(4) the bias of the estimation model as given by the mean of the residuals, $\bar{y}_r$, i.e., the mean of the differences between the infilled (estimated) and hidden (actual) values (this is also a check to detect a consistent over- or underestimation by the method);
(5) the accuracy as determined by the variance of the residuals (differences between estimated and actual values) of the whole series, $s_r^2$;
(6) the accuracy as determined by the variance of the residuals of only the estimated values, $s_{r,e}^2$; and
(7) the significance of the biases in the mean, standard deviation and correlation coefficients as determined by the appropriate test statistic for each (see Appendix A).
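The criteria listed above can be collected in one routine. The sketch below is an illustration under stated assumptions rather than the code used for this study: the function names are hypothetical, the serial correlation estimator is one common choice and may differ from the one used in the thesis, the divisors N - 2 and N0 - 2 follow the definitions given with Table 5.7, and the short series is invented. The significance tests of criterion (7) are not included.

```python
import statistics

def lag_corr(x, k):
    """Lag-k serial correlation coefficient (one common estimator, assumed here)."""
    m = statistics.fmean(x)
    num = sum((x[t] - m) * (x[t + k] - m) for t in range(len(x) - k))
    den = sum((v - m) ** 2 for v in x)
    return num / den

def compare(actual, estimated, missing_idx):
    """Criteria (1b), (2b), (3), (4), (5) and (6) for one estimated series."""
    n, n0 = len(actual), len(missing_idx)
    resid = [estimated[i] - actual[i] for i in missing_idx]
    ss = sum(r * r for r in resid)
    return {
        "mean_bias": statistics.fmean(estimated) - statistics.fmean(actual),
        "sd_ratio": statistics.stdev(estimated) / statistics.stdev(actual),
        "r1_bias": lag_corr(estimated, 1) - lag_corr(actual, 1),
        "r2_bias": lag_corr(estimated, 2) - lag_corr(actual, 2),
        "resid_mean": sum(resid) / n0,       # mean of the residuals
        "resid_var_all": ss / (n - 2),       # variance over the whole series
        "resid_var_est": ss / (n0 - 2),      # variance over estimated values only
    }

# Invented example: values at positions 1, 4 and 6 were hidden and re-estimated.
actual = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 3.0]
estimated = list(actual)
estimated[1], estimated[4], estimated[6] = 2.5, 4.5, 2.0
result = compare(actual, estimated, [1, 4, 6])
```

Comparing against the incomplete series (criteria 1a and 2a) uses the same calls with the incomplete series in place of `actual` for the mean and standard deviation.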
Table 5.3 presents the statistics of the actual series (ACT), of the incomplete series (INC) and of the series estimated by the mean value method (MV), by the reciprocal distances method (RD), by the normal ratio method (NR) and by the modified weighted average method (MWA). The mean ($\bar{y}$), standard deviation (s), coefficient of variation ($c_v$), coefficient of skewness ($c_s$), and lag-one and lag-two correlation coefficients ($r_1$, $r_2$) have been calculated for each series considered as a whole.

Table 5.3. Statistics of the Actual (ACT), Incomplete (INC) and Estimated Series (MV, RD, NR, MWA).

Series   mean    s       cv       cs      r1      r2
ACT      4.126   3.673   89.040   1.332   0.366   0.134

2% missing values
INC      4.116   3.680   89.397   1.346
MV       4.125   3.663   88.808   1.335   0.371   0.130
RD       4.124   3.674   89.092   1.336   0.367   0.133
NR       4.114   3.666   89.104   1.339   0.368   0.131
MWA      4.113   3.674   89.331   1.342   0.363   0.131

5% missing values
INC      4.113   3.671   89.249   1.341
MV       4.101   3.610   88.040   1.352   0.372   0.139
RD       4.127   3.696   89.550   1.359   0.369   0.133
NR       4.105   3.674   89.501   1.349   0.367   0.131
MWA      4.116   3.720   90.386   1.388   0.364   0.126

10% missing values
INC      4.144   3.705   89.405   1.350
MV       4.134   3.603   87.152   1.346   0.379   0.159
RD       4.150   3.689   88.884   1.301   0.380   0.166
NR       4.120   3.652   88.633   1.321   0.377   0.155
MWA      4.127   3.725   90.244   1.286   0.376   0.162

15% missing values
INC      4.135   3.671   88.767   1.268
MV       4.106   3.513   85.567   1.270   0.399   0.133
RD       4.177   3.628   86.862   1.224   0.372   0.132
NR       4.135   3.591   86.854   1.236   0.379   0.133
MWA      4.134   3.650   88.291   1.248   0.357   0.123

20% missing values
INC      4.082   3.701   90.673   1.404
MV       4.124   3.495   84.749   1.333   0.408   0.160
RD       4.231   3.723   87.993   1.865   0.370   0.156
NR       4.125   3.601   87.307   1.298   0.377   0.152
MWA      4.168   3.741   89.758   1.273   0.354   0.153

Table 5.4. Bias in the Mean.

Difference between the estimated and incomplete series, $\bar{y}_e - \bar{y}_i$ (last column: $\bar{y}_i$)

        INC     MV       RD       NR       MWA      mean (INC)
2%      0.      0.009    0.008    -0.002   -0.003   4.116
5%      0.      -0.012   0.014    -0.008   0.003    4.113
10%     0.      -0.010   0.006    -0.024   -0.017   4.144
15%     0.      -0.029   0.042    0.000    -0.001   4.135
20%     0.      0.042    0.149    0.043    0.086    4.082

Difference between the estimated and actual series, $\bar{y}_e - \bar{y}_a$ ($\bar{y}_a$ = 4.126)

        INC      MV       RD       NR       MWA
2%      -0.010   -0.001   -0.002   -0.012   -0.013
5%      -0.013   -0.025   0.001    -0.021   -0.010
10%     0.018    0.008    0.024    -0.006   0.001
15%     0.009    -0.020   0.051    0.009    0.008
20%     -0.044   -0.002   0.105    -0.001   0.042

Regarding comparison of the means, the following can be concluded from Table 5.4:

(1) the bias in the mean is in all cases not significant at the 5% significance level, as shown by the appropriate t-test;
(2) the bias in the mean of the incomplete series is relatively small but becomes larger the higher the percent of missing values;
(3) at high percent of missing values the NR method gives the least biased mean;
(4) except for the RD method, which consistently overestimates the mean (the bias being larger the higher the percent of missing values), the other methods do not show a consistent over- or underestimation.
Regarding comparison of the variances, the following can be concluded from Table 5.5:

(1) although slight, the bias in the standard deviation is always significant, but this is so because the ratio of variances would have to equal 1.0 exactly to satisfy the F-test (i.e., be unbiased) with as large a number of degrees of freedom as in this study;
(2) the MV method always gives a reduced variance as compared to the variance of the incomplete series and of the actual series, the bias being larger the higher the percent of missing values;
(3) the bias in the standard deviation of the incomplete series is small;
(4) there is no consistent over- or underestimation of the variance by any of the methods (except the MV method);
(5) the MWA method does not give a less biased variance, even at the higher percent of missing values tested, as compared to the RD and NR methods.

Table 5.5. Bias in the Standard Deviation.

Ratio of the estimated to the incomplete series, $s_e / s_i$ (last column: $s_i$)

        INC     MV      RD      NR      MWA     s (INC)
2%      1.      0.995   0.998   0.996   0.998   3.680
5%      1.      0.983   1.007   1.001   1.013   3.671
10%     1.      0.972   0.996   0.986   1.005   3.705
15%     1.      0.957   0.988   0.978   0.994   3.671
20%     1.      0.944   1.006   0.973   1.011   3.701

Ratio of the estimated to the actual series, $s_e / s_a$ ($s_a$ = 3.673)

        INC     MV      RD      NR      MWA
2%      1.002   0.997   1.000   0.998   1.000
5%      0.999   0.983   1.006   1.000   1.013
10%     1.009   0.981   1.004   0.994   1.014
15%     0.999   0.956   0.988   0.978   0.994
20%     1.008   0.952   1.014   0.980   1.019
Regarding comparison of the correlation coefficients, the following can be concluded from Table 5.6:

(1) the bias in the correlation coefficients is in all cases not significant at the 5% significance level, as shown by the appropriate z-test;
(2) the MV method gives the largest bias in the correlation coefficients, the bias increasing the higher the percent of missing values, with a possible effect on the determination of the order of the model;
(3) all methods (except the MWA method) consistently overestimate the serial correlation coefficient of the incomplete series but not the serial correlation of the actual series, and therefore this is not considered a problem;
(4) the RD method seems to give a correlogram that closely follows the correlogram of the actual series.

Table 5.6. Bias in the Lag-One and Lag-Two Correlation Coefficients.

Lag-one: $r_{1,e} - r_{1,a}$ ($r_{1,a}$ = 0.366)

        INC     MV       RD       NR       MWA
2%              0.005    0.001    0.002    -0.003
5%              0.006    0.003    0.001    -0.002
10%             0.013    0.014    0.011    0.010
15%             0.033    0.006    0.013    -0.009
20%             0.042    0.004    0.011    -0.012

Lag-two: $r_{2,e} - r_{2,a}$ ($r_{2,a}$ = 0.134)

        INC     MV       RD       NR       MWA
2%              -0.004   -0.001   -0.003   -0.003
5%              0.005    -0.001   -0.003   -0.008
10%             0.025    0.032    0.021    0.028
15%             -0.001   -0.002   -0.001   -0.011
20%             0.026    0.022    0.018    0.019

Regarding accuracy of the methods, the following can be concluded from Table 5.7:

(1) no method seems to consistently over- or underestimate the missing values at all percent levels, but at high percent levels the missing values are overestimated by all methods;
(2) the NR method is the most accurate method, especially at high percent of missing values (i.e., it gives the smallest mean and variance of the residuals).

Table 5.7. Accuracy: Mean and Variance of the Residuals. $N_0$ = number of missing values; N = total number of values = 660.

Mean of the residuals, $\bar{y}_r = \sum (y_e - y_a) / N_0$

        INC     MV      RD      NR      MWA     N0
2%              0.043   0.061   0.570   0.589   13
5%              0.440   0.034   0.380   0.176   33
10%             0.007   0.156   0.113   0.046   62
15%             0.175   0.338   0.074   0.105   98
20%             0.037   0.502   0.038   0.200   130

Variance of the residuals of the estimated values only, $s_{r,e}^2 = \sum (y_e - y_a)^2 / (N_0 - 2)$

        INC     MV      RD      NR      MWA
2%              5.037   2.874   3.149   4.585
5%              8.610   3.656   3.411   5.340
10%             7.892   4.239   3.484   5.187
15%             7.620   4.630   3.958   5.816
20%             5.224   4.891   3.681   4.898

Variance of the residuals of the whole series, $s_r^2 = \sum (y_e - y_a)^2 / (N - 2)$

        INC     MV      RD      NR      MWA
2%              0.084   0.048   0.053   0.077
5%              0.406   0.172   0.161   0.252
10%             0.720   0.387   0.318   0.473
15%             1.112   0.675   0.577   0.849
20%             1.016   0.951   0.716   0.953

Univariate Model

Model Fitting

Before considering the problem of missing values, the problem of fitting an ARMA(p,q) model to the monthly rainfall series of the south Florida interpolation station will be considered. The observed rainfall series has been normalized using the square root transformation, and the periodicity has been removed by standardization. The reduced series, approximately normal and stationary, is then modeled by an ARMA(p,q) model. The ACF of the reduced series, as shown in Fig. 5.3, implies a white noise process, since almost all the autocorrelation coefficients (except at lag 3 and lag 12) lie inside the 95 percent confidence limits. Of course, it is unsatisfying to accept the white noise process as the "best" model for our series, and an attempt is made to fit an ARMA(1,1) model to the series.

Fig. 5.3. Autocorrelation function of the normalized and standardized monthly rainfall series of station A, with 95% confidence limits.

The selection of an ARMA model and not an AR or MA model is based on the following reasons:

(1) The observed rainfall series contains important observational errors, and so it is assumed to be the sum of two series: the "true" series and the observational error series (signal plus noise).
Therefore, even if the "true" series obeys an AR process, the addition of the observational error series is likely to produce an ARMA model:

AR(p) + white noise = ARMA(p, p)
AR(p) + AR(q) = ARMA(p+q, max(p,q)) (5.5)
AR(p) + MA(q) = ARMA(p, p+q)

The same can be said if the "true" series is an MA process and the observational error series an AR process, but not if the latter is an MA process or a white noise process:

MA(p) + AR(q) = ARMA(q, p+q)
MA(p) + MA(q) = MA(max(p,q)) (5.6)
MA(p) + white noise = MA(p)

(Granger and Morris, 1976; Box and Jenkins, 1976, Appendix A4.4). It is understood that the addition of any observational error series to an ARMA process of the "true" series will again give an ARMA process. For example,

ARMA(p,q) + white noise = ARMA(p,p) if p > q
ARMA(p,q) + white noise = ARMA(p,q) if p < q (5.7)

from which it can also be seen that the addition of an observational error may not always change the order of the model of the "true" process.

(2) One other situation that leads exactly, or approximately, to ARMA models is the case of a variable which would obey a simple model such as AR(1) if it were recorded at an interval of K units of time, but which is actually observed at an interval of M units (Granger and Morris, 1976, p. 251).

All these results suggest that a number of real data situations are likely to give rise to ARMA models; therefore, an ARMA(1,1) model will be fitted to the observed monthly rainfall series of the south Florida interpolation station. The preliminary estimate of $\phi_1$ (equation 3.23) is 0.08163, and the preliminary estimate of $\theta_1$ (equations 3.21 for k = 0, 1, 2) is the solution of the quadratic equation

$0.1656\,\theta_1^2 + 1.0204\,\theta_1 + 0.1656 = 0$ (5.8)

Only the one root, $\theta_1 = -0.1667$, is acceptable, the second lying outside the unit circle. These preliminary estimates of $\phi_1$ and $\theta_1$ now become the initial values for the determination of the maximum likelihood estimates (MLE).
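The choice between the two roots of equation (5.8) can be verified numerically. The sketch below solves the quadratic with the coefficients as printed and keeps the root that satisfies the invertibility condition; it is an arithmetic check only, and the variable names are illustrative.

```python
import math

# Coefficients of equation (5.8): c1*theta**2 + c0*theta + c1 = 0.
c1, c0 = 0.1656, 1.0204

disc = c0 ** 2 - 4.0 * c1 * c1
roots = [(-c0 + sign * math.sqrt(disc)) / (2.0 * c1) for sign in (1.0, -1.0)]

# Invertibility requires |theta| < 1. The product of the two roots is 1,
# so exactly one of them can satisfy the condition.
acceptable = [r for r in roots if abs(r) < 1.0]
```

Because the first and last coefficients of the quadratic are equal, the two roots are reciprocals of each other. With the coefficients as printed, the acceptable root is close to -0.167 and the rejected one close to -6.0.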
In general, the choice of the starting values of $\phi$ and $\theta$ does not significantly affect the parameter estimates (Box and Jenkins, 1976, p. 236), but this was not the case for the south Florida rainfall series under study. In particular, different initial estimates of $\phi_1$ and $\theta_1$ have been tested, and the MLE of the parameters are compared in Table 5.8. The MLE have been calculated using the IMSL subroutine FTMXL, which uses a modified steepest descent algorithm to find the values of $\phi$ and $\theta$ that minimize the sum of squares of the residuals (Box and Jenkins, 1976, p. 504).

Table 5.8. Initial Estimates and MLE of the Parameters $\phi$ and $\theta$ of an ARMA(1,1) Model Fitted to the Rainfall Series of Station A.

         Initial Estimates       Max. Likelihood Estimates
Model    phi        theta        phi        theta
A        0.0816     0.0          -0.0088    -0.0989
B        0.0816     -0.1667      -0.3140    -0.4056
C        0.1        0.0          0.0537     -0.0278
D        0.4        0.5          -0.4064    -0.4939

The drastic changes in parameter values, together with the idea that the process may be a white noise process, suggest a plot of the sum of squares of the residuals for the visual detection of anomalies. The sum of squares grids and contours are shown in Fig. 5.4.

Fig. 5.4. Sum of squares of the residuals, $\sum \hat{a}_t^2$, of an ARMA(1,1) model fitted to the rainfall series of station A.

We observe that there is not a well defined point where the sum of squares becomes a minimum, but rather a line (the contour of value 641) on which the sum of squares has an almost constant value equal to the minimum. In such a case many combinations of parameter values give similar sums of squares of residuals, and a change in the AR parameter can be nearly compensated by a suitable change in the MA parameter. From the comparison of the parameters $\phi$ and $\theta$ (Table 5.8) of the four ARMA(1,1) models one cannot say that they all correspond to the same process.
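A quick numerical check puts the four parameter pairs on a common footing. The sketch below computes, for each model, the first weights of the equivalent infinite moving average form, using the standard ARMA(1,1) relations $\psi_1 = \phi_1 - \theta_1$ and $\psi_j = \phi_1 \psi_{j-1}$. The parameter values are the maximum likelihood estimates of Table 5.8; their signs are an assumption wherever the printed table shows none, chosen so that the leading weights come out at the values 0.090, 0.092, 0.082 and 0.088 reported for the four models.

```python
# Maximum likelihood estimates (phi1, theta1) of Table 5.8, with assumed signs.
models = {
    "A": (-0.0088, -0.0989),
    "B": (-0.3140, -0.4056),
    "C": (0.0537, -0.0278),
    "D": (-0.4064, -0.4939),
}

def psi_weights(phi, theta, n=3):
    """First n weights of z_t = a_t + psi_1*a_{t-1} + psi_2*a_{t-2} + ..."""
    weights = [phi - theta]
    while len(weights) < n:
        weights.append(phi * weights[-1])
    return weights

table = {name: psi_weights(phi, theta) for name, (phi, theta) in models.items()}
spread_psi1 = (max(w[0] for w in table.values())
               - min(w[0] for w in table.values()))
```

Although the individual $\phi_1$ and $\theta_1$ values differ widely, their difference is nearly the same in all four models, which is what one would expect for parameter pairs lying along a flat valley of the sum of squares surface.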
That the four models nevertheless describe the same process can be illustrated by converting them to their "random shock form" (MA($\infty$) processes) or their "invertible form" (AR($\infty$) processes). An ARMA(1,1) process

$(1 - \phi_1 B) z_t = (1 - \theta_1 B) a_t$ (5.9)

can also be written as

$z_t = (1 - \theta_1 B)(1 - \phi_1 B)^{-1} a_t$ (5.10)

which can be expanded in the convergent form

$z_t = [1 + (\phi_1 - \theta_1) B + \phi_1 (\phi_1 - \theta_1) B^2 + \phi_1^2 (\phi_1 - \theta_1) B^3 + \dots] a_t$ (5.11)

provided that the stationarity condition ($|\phi_1| < 1$) is satisfied. Then the four models of Table 5.8 become:

(A): $z_t = a_t + 0.090\,a_{t-1} - 0.001\,a_{t-2} + \dots$
(B): $z_t = a_t + 0.092\,a_{t-1} - 0.029\,a_{t-2} + \dots$ (5.12)
(C): $z_t = a_t + 0.082\,a_{t-1} + 0.004\,a_{t-2} + \dots$
(D): $z_t = a_t + 0.088\,a_{t-1} - 0.036\,a_{t-2} + \dots$

In the same way the ARMA(1,1) model may be written in the "invertible form"

$(1 - \phi_1 B)(1 - \theta_1 B)^{-1} z_t = a_t$ (5.13)

which can be expanded as

$[1 - (\phi_1 - \theta_1) B - \theta_1 (\phi_1 - \theta_1) B^2 - \theta_1^2 (\phi_1 - \theta_1) B^3 - \dots] z_t = a_t$ (5.14)

given that the invertibility condition ($|\theta_1| < 1$) is satisfied. Then the four models become:

(A): $z_t = a_t + 0.090\,z_{t-1} - 0.009\,z_{t-2} + \dots$
(B): $z_t = a_t + 0.092\,z_{t-1} - 0.037\,z_{t-2} + \dots$ (5.15)
(C): $z_t = a_t + 0.082\,z_{t-1} - 0.002\,z_{t-2} + \dots$
(D): $z_t = a_t + 0.088\,z_{t-1} - 0.043\,z_{t-2} + \dots$

From the "random shock form" of the four models (equations 5.12) and from their "invertible form" (equations 5.15) the following remarks can be made:

(1) Although from the comparison of the $\phi$ and $\theta$ coefficients (Table 5.8) of the four ARMA(1,1) models one cannot say that they all correspond to the same process, the comparison of the MA coefficients ($\theta_1, \theta_2, \theta_3, \dots$) of equations (5.12) or the AR coefficients ($\phi_1, \phi_2, \phi_3, \dots$) of equations (5.15) implies that indeed all four models belong to the same process.

(2) Because the nonzero $\phi_2$ (and $\theta_2$) coefficients of the $z_{t-2}$ (and $a_{t-2}$) terms, while small, are of similar magnitude to the coefficients $\phi_1$ (and $\theta_1$), one cannot say that the "truncated" AR(1) or MA(1) model will fully describe the time series; instead, more terms are needed.
On the other hand, we observe that the $\phi_1$ coefficient so obtained (different for each model) is in the range of 0.082 to 0.090 and is greater than the coefficient $\phi_1$ that would have been obtained by a direct fitting of an AR(1) model to the series (the latter would be $\phi_1 = r_1 = 0.0068$).

(3) It should also be noted that all the above models fitted to the series give residuals that pass the portmanteau goodness of fit test. As can be seen from equation (5.12), the impulse response function (i.e., the weights applied on the $a_t$'s when the model is written in the "random shock form") dies off very quickly in all the models, and there is thus no doubt as to the applicability of the portmanteau test (see Appendix A). The values of Q for each model (calculated from equation A.1 using K = 60) are $Q_A$ = 67.80, $Q_B$ = 67.26, $Q_C$ = 67.73 and $Q_D$ = 67.39, all smaller than the $\chi^2$ value at a 5% significance level, which is 76.8 for 58 degrees of freedom (79.1 for 60 degrees of freedom). It can also be seen that the values of Q for all models are almost equal, suggesting an equally good fit of the series by all four models.

One other interesting question that could be asked is, given a specific ARMA(p,q) model, whether or not it could have arisen from some simpler model. "Simplifications are not always possible as conditions on the coefficients of the ARMA model need to be specified for a simpler model to be realizable" (Granger and Morris, 1976, p. 252). At this stage, with coefficients that are so unstable, it is meaningless to test the four ARMA models for simplification. However, this test will be made after a unique and stable model has been obtained through the following proposed algorithm.

Proposed Estimation Algorithm

The problem of estimation of missing values will be combined with the problem of stabilizing the coefficients of the ARMA(1,1) model in a recursive algorithm which will have solved both problems uniquely upon convergence.
The incomplete series ($S_0$) is filled in with some initial estimates of the missing values (these initial estimates can be simply the monthly means or even zeros, as will be shown). Denote this initial series by $S_1$. An ARMA(1,1) model is fitted to the series $S_1$, and its coefficients $\phi_1$ and $\theta_1$ are used to update the first estimates of the missing values. For example, suppose that a gap of size k (k missing values) exists in the series $S_0$:

Series $S_0$: $\dots,\ z_{t-1},\ z_t,$ (k missing values)$,\ z_{t+k+1},\ z_{t+k+2},\ \dots$
Series $S_1$: $\dots,\ z_{t-1},\ z_t,\ z'_{t+1},\ \dots,\ z'_{t+k},\ z_{t+k+1},\ z_{t+k+2},\ \dots$ (5.16)

where $z'_{t+1}, \dots, z'_{t+k}$ are the initial estimates of the missing values. These values are then replaced by the forecasts $\hat{z}_t(1), \dots, \hat{z}_t(k)$ of the model, made at origin t for lead times $\ell = 1, \dots, k$. These forecasts are the minimum mean square error forward forecasts as developed by Box and Jenkins (1976). For an ARMA(1,1) model with coefficients $\phi_1$ and $\theta_1$, the minimum mean square error forecasts $\hat{z}_t(\ell)$ of $z_{t+\ell}$, where $\ell$ is the lead time, are:

$\hat{z}_t(1) = \phi_1 z_t - \theta_1 a_t$, for $\ell = 1$
$\hat{z}_t(\ell) = \phi_1 \hat{z}_t(\ell - 1)$, for $\ell = 2, \dots, k$ (5.17)

from which it can be seen that only the one-step-ahead forecast depends directly on $a_t$, and the forecasts at longer lead times are influenced indirectly (Box and Jenkins, 1976, Ch. 5). The forecasting procedure is repeated for the estimation of all the gaps, and the newly estimated values are used in equations (5.17). These forecasts now become the new estimates of the missing values, and they replace the old estimates, giving the new series $S_2$. An ARMA(1,1) model is then fitted to the new series $S_2$, and the new coefficients $\phi_1$ and $\theta_1$ are found (different from the previous ones). Then the estimated values (forecasts from the previous model) are replaced by the forecasts of the new model, giving the new series $S_3$, etc.
The procedure is repeated until the model and the series stabilize, in the sense that the parameters $\phi_1$ and $\theta_1$ of the model, as well as the estimates of the missing values, do not change between successive iterations by more than a specified tolerance. Schematically the algorithm is presented in Fig. 5.5, where $S_0$ denotes the incomplete series, $M_0$ the method used for the initial estimation, $S_i$ the estimated series at the ith iteration, and $M_i$ the model (i.e., the set of parameters $(\phi_1, \theta_1)_i$) fitted to the series $S_i$. The notation $M_i \approx M_{i+1}$ and $S_i \approx S_{i+1}$ is introduced to denote the stabilization of the model and of the series, respectively, after i iterations. The above algorithm will be referred to as RAEMVU (a recursive algorithm for the estimation of missing values, univariate model).

Fig. 5.5. Recursive algorithm for the estimation of missing values, univariate model (RAEMVU). $S_i$ denotes the series, and $M_i$ the model, $(\phi_1, \theta_1)_i$, at the ith iteration.

Application of the Algorithm on the Monthly Rainfall Series

The proposed recursive algorithm (RAEMVU) has been applied for the estimation of missing monthly rainfall values in the series of the south Florida interpolation station (station 6038). Different levels of percentage of missing values have been tested, and the results for the 10% and 20% levels are presented herein. Tables 5.9 and 5.10 show the results for the 10% and 20% levels of missing values, respectively. The starting series $S_0$ is the incomplete series (with 10% or 20% of the values missing). Four different methods $M_0$ (MV, RD, NR, and zeros) have been applied to the incomplete series $S_0$, providing different starting series $S_1$ for the algorithm. Thus, its dependence on the initial conditions has also been tested.
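The recursion described above can be sketched in a few functions. This is an illustration under stated assumptions, not the implementation used in this study: the model is fitted by a coarse grid search on the conditional sum of squares instead of the IMSL routine FTMXL, the series is assumed to be already normalized and standardized (so that zero is a reasonable starting value), a gap at the very first position is not handled, and all function names and the synthetic data are hypothetical.

```python
import random

def residuals(z, phi, theta):
    """Conditional residuals a_t = z_t - phi*z_{t-1} + theta*a_{t-1}, with a_0 = 0."""
    a = [0.0]
    for t in range(1, len(z)):
        a.append(z[t] - phi * z[t - 1] + theta * a[t - 1])
    return a

def fit_arma11(z, step=0.1):
    """Coarse grid search for (phi, theta) minimizing the sum of squared residuals."""
    grid = [round(-0.9 + step * i, 10) for i in range(int(round(1.8 / step)) + 1)]
    best = min((sum(v * v for v in residuals(z, p, q)), p, q)
               for p in grid for q in grid)
    return best[1], best[2]

def fill_gaps(z, miss, phi, theta):
    """Replace missing values by forecasts (5.17), using updated values as it goes."""
    z = list(z)
    a_prev = 0.0
    for t in range(1, len(z)):
        if t in miss:
            # At lead time 1 a_prev is the residual at the forecast origin; inside
            # a gap it is zero, so this reduces to phi times the previous forecast.
            z[t] = phi * z[t - 1] - theta * a_prev
        a_prev = z[t] - phi * z[t - 1] + theta * a_prev
    return z

def raemvu(series, missing, tol=1e-3, max_iter=50):
    """Alternate model fitting and gap forecasting until both stabilize."""
    miss = set(missing)
    z = [0.0 if i in miss else v for i, v in enumerate(series)]  # zeros as S1
    params = None
    for iteration in range(1, max_iter + 1):
        phi, theta = fit_arma11(z)
        new_z = fill_gaps(z, miss, phi, theta)
        change = max((abs(new_z[i] - z[i]) for i in miss), default=0.0)
        stable = (params is not None and change < tol
                  and abs(phi - params[0]) < tol and abs(theta - params[1]) < tol)
        z, params = new_z, (phi, theta)
        if stable:
            break
    return z, params, iteration

# Synthetic zero-mean ARMA(1,1) series with three artificial gaps.
rng = random.Random(0)
z_full, a_prev, z_prev = [], 0.0, 0.0
for _ in range(120):
    a = rng.gauss(0.0, 1.0)
    z_prev = 0.5 * z_prev + a - 0.2 * a_prev
    a_prev = a
    z_full.append(z_prev)
missing = [10, 11, 40, 75, 76, 77]
z_gappy = [None if i in missing else v for i, v in enumerate(z_full)]
filled, params, n_iter = raemvu(z_gappy, missing)
```

The existing values are never altered; only the positions listed in `missing` are rewritten at each pass. A finer grid or a proper optimizer would be needed for parameter estimates comparable to maximum likelihood values.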
Results of the Method

From Tables 5.9 and 5.10 the following can be concluded:

(1) The algorithm converges very rapidly and independently of the initial estimates, thus suggesting the convenient replacement of the missing values by zeros to start the algorithm.

(2) The greater the percent of missing values, the slower the algorithm converges (6 iterations were needed for the 10% level and 8 for the 20% level to obtain accuracy to the third decimal place), as was expected, since a larger part of the series changes its values at each iteration and thus more iterations are needed to achieve equilibrium.
The research was supported in part by the South Florida Water Management District. Computations were performed at the Northeast Regional Data Center on the University of Florida campus, Gainesville.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1. INTRODUCTION
    Rainfall Records
    Frequency Analysis of Missing Observations in the South Florida Monthly Rainfall Records
    Description of the Chapters

CHAPTER 2. SIMPLIFIED ESTIMATION TECHNIQUES
    Introduction
    Mean Value Method (MV)
    Reciprocal Distance Method (RD)
    Normal Ratio Method (NR)
    Modified Weighted Average Method (MWA)
    Least Squares Method (LS)

CHAPTER 3. UNIVARIATE STOCHASTIC MODELS
    Introduction
    Review of Box-Jenkins Models
        Autoregressive Models
        Moving Average Models
        Mixed Autoregressive-Moving Average Models
        Autoregressive Integrated Moving Average Models
    Transformation of the Original Series
        Transformation to Normality
        Stationarity
    Monthly Rainfall Series
        Normalization and Stationarization
        Modeling of Normalized Series

CHAPTER 4. MULTIVARIATE STOCHASTIC MODELS
    Introduction
    General Multivariate Regression Model
    Multivariate Lag-One Autoregressive Model
    Comments on Multivariate AR(1) Model
        Assumption of Normality and Stationarity
        Cross-Correlation Matrix M1
        Further Simplification
    Higher Order Multivariate Models

CHAPTER 5. ESTIMATION OF MISSING MONTHLY RAINFALL VALUES: A CASE STUDY
    Introduction
    Set Up of the Problem
    Simplified Estimation Techniques
        Techniques Utilized
        Least Squares Methods
        Modified Weighted Average Method
        Comparison of the MV, RD, NR and MWA Methods
    Univariate Model
        Model Fitting
        Proposed Estimation Algorithm
        Application of the Algorithm on the Monthly Rainfall Series
        Results of the Method
        Remarks
    Bivariate Model
        Model Fitting
        Proposed Estimation Algorithm
        Application of the Algorithm on the Monthly Rainfall Series

CHAPTER 6. CONCLUSIONS AND RECOMMENDATIONS
    Summary and Conclusions
    Further Research

APPENDIX A. DEFINITIONS
APPENDIX B. DETERMINATION OF MATRICES A AND B OF THE MULTIVARIATE AR(1) MODEL
APPENDIX C. DATA USED AND STATISTICS
APPENDIX D. COMPUTER PROGRAMS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1.1 Frequency Distribution of the Percent of Missing Values in 213 South Florida Monthly Rainfall Records
5.1 Least Squares Regression Coefficients and Their Significance Levels
5.2 Correction Coefficients for Each Month and for Each Different Percent of Missing Values
5.3 Statistics of the Actual (ACT), Incomplete (INC) and Estimated Series (MV, RD, NR, MWA)
5.4 Bias in the Mean
5.5 Bias in the Standard Deviation
5.6 Bias in the Lag-One and Lag-Two Correlation Coefficients
5.7 Accuracy: Mean and Variance of the Residuals
5.8 Initial Estimates and MLE of the Parameters $\phi$ and $\theta$ of an ARMA(1,1) Model Fitted to the Monthly Rainfall Series of Station A
5.9 Results of the RAEMVU Applied at the 10% Level of Missing Values. Upper Value is $\phi_1$, Lower Value is $\theta_1$
5.10 Results of the RAEMVU Applied at the 20% Level of Missing Values. Upper Value is $\phi_1$, Lower Value is $\theta_1$
5.11 Statistics of the Actual Series (ACT) and the Two Estimated Series (UN10, UN20)
5.12 Bias in the Mean, Standard Deviation and Serial Correlation Coefficient: Univariate Model
5.13 Results of the RAEMVB1 Applied at the 10% Level of Missing Values
5.14 Results of the RAEMVB1 Applied at the 20% Level of Missing Values
5.15 Statistics of the Actual Series (ACT) and the Two Estimated Series (B10 and B20)
5.16 Bias in the Mean, Standard Deviation and Serial Correlation Coefficient: Bivariate Model

LIST OF FIGURES

1.1 Monthly distribution of rainfall in the United States
1.2 Probability density function, f(m), of the percentage of missing values
1.3 Probability density function, f(T), of the interevent size
1.4 Probability density, f(k), and mass function, p(k), of the gap size
2.1 Mean value method without random component
2.2 Mean value method with random component
2.3 Least squares method without random component
2.4 Least squares method with random component
5.1 The four south Florida rainfall stations used in the analysis
5.2 Plot of the monthly means and standard deviations of the rainfall series of Station A
5.3 Autocorrelation function plot of the residual series of an ARMA(1,1) model fitted to the monthly rainfall series of Station A
5.4 Sum of squares of the residuals surface of an ARMA(1,1) model fitted to the monthly rainfall series of Station A
5.5 Recursive algorithm for the estimation of the missing values: univariate model (RAEMVU)
5.6 Recursive algorithm for the estimation of missing values: bivariate model, 1 station to be estimated (RAEMVB1)
5.7 Recursive algorithm for the estimation of missing values: bivariate model, 2 stations to be estimated (RAEMVB2)

Abstract of Thesis Presented to the Graduate Council of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Engineering

ESTIMATION OF MISSING OBSERVATIONS IN MONTHLY RAINFALL SERIES

By Efstathia Foufoula-Georgiou

December, 1982

Chairman: Wayne C. Huber
Cochairman: James P. Heaney
Major Department: Environmental Engineering Sciences

This study compares and evaluates different methods for the estimation of missing observations in monthly rainfall series.
The estimation methods studied reflect three basic ideas:

(1) the use of regional-statistical information in four simple techniques: mean value method (MV), reciprocal distance method (RD), normal ratio method (NR), and modified weighted average method (MWA);

(2) the use of a univariate autoregressive moving average (ARMA) model which describes the time correlation of the series;

(3) the use of a multivariate ARMA model which describes the time and space correlation of the series.

An algorithm for the recursive estimation of the missing values in a series by a parallel updating of the univariate or multivariate ARMA model is proposed and demonstrated. All methods are illustrated in a case study using 55 years of monthly rainfall data from four south Florida stations.

CHAPTER 1
INTRODUCTION

Rainfall Records

Rainfall is the source component of the hydrologic cycle. As such it regulates water availability and thus land use, agricultural and urban expansion, maintenance of environmental quality and even population growth and human habitation. As Hamrick (1972) points out, water may be transported for considerable distances from where it fell as rain and may be stored for long periods of time, but with very few exceptions it originates as rainfall. Consequently, the measurement and study of rainfall is in actuality the measurement and study of our potential water supply.

Rainfall studies attempt to derive models, both probabilistic and physical, to describe and forecast the rainfall process. Since the quality of every study is directly related to the quality of the data used, the need for "good quality" rainfall data has been expressed by all hydrologists. By "good quality" is meant accurate, long and uninterrupted series of rainfall measurements at a range of different time intervals (e.g., hourly, daily, monthly, and yearly data) and for a dense raingage network.
Missing values in the series (due, for example, to failure of the recording instruments or to deletion of a station) are a real handicap to the hydrologic data users. The estimation of these missing values is often desirable prior to the use of the data.

For instance, the South Florida Water Management District prepared a magnetic tape with monthly rainfall data for all rainfall stations in south Florida for use in this study (T. MacVicar, SFWMD, personal communication, May, 1982). The data included values for the period of record at each station, ranging from over 100 years (at Key West) to only a few months at several temporary stations. Approximately one month was required to preprocess these data prior to performing routine statistical and time series analyses. The preprocessing included tasks such as manipulations of the magnetic tape, selection of stations with desirable characteristics (e.g., long period of record, proximity to other stations of interest, few missing values) and a major effort at replacement of missing values that did exist. This effort, in fact, was the motivation for this thesis.

Many different kinds of statistical analyses may be performed on a given data set, e.g., determination of elementary statistical parameters, auto- and cross-correlation analysis, spectral analysis, frequency analysis, and fitting of time series models. For routine statistics (e.g., calculation of mean, variance and skewness) missing values are seldom a problem. But for techniques as common as autocorrelation and spectral analysis, missing values can cause difficulties. In multivariate analysis missing values result in "wasted information" when only the overlapping period of the series can be used in the analysis, and in inconsistencies (Fiering, 1968, and Chapter 4 of this thesis) when the incomplete series are used.

In general, two approaches to the problem of missing observations exist.
The first consists of developing methods of analysis that use only the available data, the second of developing methods of estimation of the missing observations followed by application of classical methods of analysis.

Monthly rainfall totals are usually calculated as the sum of daily recorded values. Thus, if one or more daily observations are missing, the monthly total is not reported for that month. An investigation conducted by the Weather Bureau in 1950 (Paulhus and Kohler, 1952) showed that almost one third of the stations for which monthly and yearly totals were not published had only a few (less than five) days missing. Furthermore, for some of these missing days there was apparently no rainfall in the area, as concluded from the rainfall observations at nearby stations. Therefore, in many cases estimation of a few missing daily rainfall values can provide a means for the estimation of the monthly totals.

Statisticians have been most concerned with the problem of handling short record multivariate data with missing observations in some or all of the variables, but no explicit and simple solutions have been given, apart from a few special cases in which the missing data follow certain patterns. A review of these methods is given by Afifi and Elashoff (1956). In the time domain, "the analysis of time series, when missing observations occur has not received a great deal of attention" as Marshall (1980, p. 567) comments, and he proposes a method for the estimation of the autocorrelations using only the observed values. Jones (1980) attempts to fit an ARMA model to a stationary time series which has missing observations using Akaike's Markovian representation and Kalman's recursive algorithm. In the frequency domain, spectral analysis with randomly missing observations has been examined by Jones (1962), Parzen (1963), Scheinok (1965), Neave (1970) and Bloomfield (1970).

In hydrology, the problem of missing observations has not been studied much, as Salas et al.
(1980) state:

    The filling-in or extension of a data series is a topic which has not received a great deal of attention either in this book or elsewhere. Because of its importance, the subject is expected to be paid more attention in the future. (Salas et al., 1980, p. 464)

Simple and "practicable" methods for the estimation of missing rainfall values for large scale application were proposed by Paulhus and Kohler (1952), for the completion of the rainfall data published by the Weather Bureau. The study was initiated after numerous requests from the climatological data users. Beard (1973) adopted a multisite stochastic generation technique to fill in missing streamflow data, and Kottegoda and Elgy (1977) compared a weighted average scheme and a multivariate method for the estimation of missing data in monthly flow series. Hashino (1977) introduced the "concept of similar storm" for the estimation of missing rainfall sequences. Although the same methods of estimation can be applied to both rainfall and runoff series, a specific method is not expected to perform equally well when applied to the two different series, due mainly to the different underlying processes. This is true even for rainfall series from different geographical regions, since their distributions may vary greatly, as shown in Fig. 1.1.

Fig. 1.1. Monthly distribution of rainfall in the United States.

This analysis will use monthly rainfall data from four south Florida stations. First, a frequency analysis of the missing observations has been performed and their typical pattern has been identified. In this work the term "missing observations" is used for a sequence of missing monthly values restricted to less than twelve, so that unusual cases of lengthy gaps (a year or more of missing values) are avoided, since they do not reflect the general situation.

Frequency Analysis of Missing Observations in the South Florida Monthly Rainfall Records

An analysis of the monthly series of 213 stations of the South Florida Water Management District
(SFWMD) gave the results shown in Table 1.1. Figure 1.2 shows the probability density function (pdf) plot of the percent $m$ of missing values, $f(m)$, which is defined as the ratio of the probability of occurrence over an interval to the length of that interval (column 4 of Table 1.1). The shape of the pdf $f(m)$ suggests the fit of an exponential distribution

$$f(m) = \lambda e^{-\lambda m} \tag{1.1}$$

where $\lambda$ is the parameter of the distribution, calculated as the inverse of the expected value of $m$, $E(m)$:

$$E(m) = \sum_i p(m_i)\, m_i \tag{1.2}$$

where $p(m_i)$ is the probability of having $m_i$ percent of missing values. The mean value of the percentage of missing values is $\bar{m} = E(m) = 13.663$, and therefore the fitted exponential pdf is

$$f(m) = 0.073\, e^{-0.073 m} \tag{1.3}$$

which gives an interesting and unexpectedly good fit, as shown by Fig. 1.2 and column 5 of Table 1.1.

Fig. 1.2. Probability density function, f(m), of the percentage of missing values, with the fitted curve f(m) = 0.073 e^{-0.073m}. Based on 213 stations, $\bar{m}$ = 13.663%.

Table 1.1. Frequency Distribution of the Percent of Missing Values in 213 South Florida Monthly Rainfall Records.

(1) % of          (2) % of     (3) Cumulative     (4) Empirical   (5) Fitted
Missing Values    Stations     % of Stations      pdf             Exponential pdf
 0-5              30.52         30.52             0.061           0.061
 5-10             21.12         51.64             0.042           0.042
10-15             14.55         66.19             0.029           0.029
15-20             13.61         79.80             0.027           0.020
20-25              6.10         85.90             0.012           0.014
25-30              3.29         89.19             0.007           0.010
30-35              1.88         91.07             0.004           0.007
35-40              0.94         92.01             0.002           0.005
40-45              2.35         94.36             0.005           0.003
45-50              2.82         97.18             0.006           0.002
50-55              0.47         97.65             0.001           0.002
55-60              0.47         98.12             0.001           0.001
60-65              1.41         99.53             0.003           0.001
65-70              0.47        100.00             0.001           0.001

The question now arises as to whether the missing values within a record follow a certain pattern. In particular, if the occurrence of a gap is viewed as an "event", then the distribution of the interevent times (sizes of the interevents) and of the durations of the events (sizes of the gaps) may be examined.

The probability distribution of the size of the interevents (number of values between two successive gaps) has been studied for four stations of the SFWMD that are "typical" as far as length of the record and distribution and percent of missing values are concerned. These four stations are:

MRF 6018, Titusville 2W, 1901-1981, 7.5% missing
MRF 6021, Fellsmere 4W, 1911-1979, 9.3% missing
MRF 6029, Ocala, 1900-1981, 4.4% missing
MRF 6005, Plant City, 1892-1981, 8.6% missing

A derived pdf for the four stations combined and the fitted exponential pdf are shown in Fig. 1.3. The mean size of the interevent, $\bar{T}$, is 19.03 months; therefore, the fitted exponential distribution is

$$f(T) = 0.053\, e^{-0.053 T} \tag{1.4}$$

Fig. 1.3. Probability density function, f(T), of the interevent size (months between gaps, T), with the fitted curve f(T) = 0.053 e^{-0.053T}. Based on four stations.

The probability distribution of the size of the gaps (number of values missing in each gap) has also been studied for the same four stations. These have been treated as discrete distributions, since the size of the gap ($k = 1, 2, \ldots, 11$) is small compared to the interevent times. A probability distribution for the four stations combined is then derived, which is also the discrete probability mass function (pmf). This plot is shown in Fig. 1.4 and suggests either a Poisson distribution or a discretized exponential.

Fig. 1.4. Probability density, f(k), and mass function, p(k), of the gap size k (months). Curves shown: empirical, Poisson, and fitted exponential f(k) = 0.447 e^{-0.447k}. Based on four stations.
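The exponential fits of Eqs. (1.3) and (1.4) can be checked with a few lines of arithmetic. The sketch below assumes that the fitted column of Table 1.1 is computed the same way as the empirical one, i.e., as the probability of each class interval divided by the interval length; that reading, and the names `fitted_pdf`, `bins` and `rate_T`, are assumptions made for illustration only.

```python
import math

LAMBDA = 0.073                      # rate of Eq. (1.3), roughly 1 / 13.663

# First four class intervals (% missing values) and empirical pdf from Table 1.1
bins = [(0, 5), (5, 10), (10, 15), (15, 20)]
empirical = [0.061, 0.042, 0.029, 0.027]

def fitted_pdf(a, b, lam=LAMBDA):
    """Exponential probability of the interval [a, b) divided by its length."""
    return (math.exp(-lam * a) - math.exp(-lam * b)) / (b - a)

fitted = [fitted_pdf(a, b) for a, b in bins]
for (a, b), emp, fit in zip(bins, empirical, fitted):
    print(f"{a:2d}-{b:2d}%  empirical {emp:.3f}  fitted {fit:.3f}")

# Rate of the interevent distribution, Eq. (1.4): inverse of the 19.03-month mean
rate_T = 1.0 / 19.03
print(f"interevent rate {rate_T:.3f}")
```

Under this reading the computed values agree with column 5 of Table 1.1 to within rounding for the classes shown.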
The mean value $\bar{k}$ is 2.237, which is also the parameter $\lambda$ of the Poisson distribution. The Poisson distribution

$$f(k) = \frac{e^{-\lambda} \lambda^{k}}{k!} \tag{1.5}$$

is nonzero at $k = 0$ and does not fit the peak of the empirical points very well at $k = 1$ (it gives a value of 0.24 instead of the actual 0.53). The fitted continuous exponential pdf shown in Fig. 1.4 gives a better fit in general but also implies a nonzero probability for a gap size near zero. To overcome this problem and to discretize the continuous exponential pdf, the area (probability) under the exponential curve between zero and 1.5 is assigned to $k = 1$, ensuring a zero probability at $k = 0$. Areas (probabilities) assigned to values of $k > 1$ are centered around those points. The fitted discretized exponential and the Poisson are also shown in Fig. 1.4.

The distributions of the size of the gaps ($k$) and of the size of the interevents ($T$) will be used to generate randomly distributed gaps in a complete record. Suppose that we have a complete record and desire to remove randomly $m$ percent of its values. If the mean size of the gap ($\bar{k}$) is assumed constant, the mean size of the interevent ($\bar{T}$) must vary, decreasing as the percent of missing values increases. Let $N$ denote the total number of values in the record, $m$ the ...

... where (3.8) is called the multiple coefficient of determination and represents the fraction of the variance of the series that has been explained through the regression.

If we denote by $\phi_{kj}$ the $j$th coefficient in an autoregressive process of order $k$, then the last coefficient $\phi_{kk}$ of the model is called the partial autocorrelation coefficient. Estimates of the partial autocorrelation coefficients $\phi_{11}, \ldots, \phi_{pp}$ may be obtained by fitting to the series autoregressive processes of successively higher order and solving the corresponding Yule-Walker equations. The partial autocorrelation function $\phi_{kk}$, $k = 1, 2, \ldots, p$, may also be obtained recursively by means of Durbin's relations (Durbin, 1960)

$$\phi_{k+1,k+1} = \frac{r_{k+1} - \sum_{j=1}^{k} \phi_{k,j}\, r_{k+1-j}}{1 - \sum_{j=1}^{k} \phi_{k,j}\, r_{j}} \tag{3.9}$$

$$\phi_{k+1,j} = \phi_{k,j} - \phi_{k+1,k+1}\, \phi_{k,k-j+1}, \qquad j = 1, 2, \ldots, k$$

It can be shown (Box and Jenkins, 1976, p. 55) that the autocorrelation function of a stationary AR(p) process is a mixture of damped exponentials and damped sine waves,
The partial autocorrelation function kk' k = 1, 2, p may also be obtained recursively by means of Durbin's relations (Durbin, 1960) k k k+l,k+l = [rk + l L k,J' rk+l_J,]/[l L k' r,] j=l j=l ,J J (3.9) k+l,j = k,j k+l,k+l k,kj+l j = 1, 2, .. k It can be shown (Box and Jenkins, 1976, p. 55) that the autocorrelation function of a stationary AR(p) process is a mixture of damped exponential and damped sine waves, PAGE 28 infinite in extent. On the other hand, the partial autocorrelation function kk is nonzero for k < P and zero for k > p. The plot of autocorrelation and partial autocorrelation functions of the series may be used to identify the kind and the order of the model that may have generated it (identification of the model). Moving Average Models In a moving average model the deviation of the current value of the process from the mean is expressed as a finite sum of weighted previous shocks als. Thus a moving average process of order q can be written as: 39 (3.10) or (3.11) where 6 (B) 1 8 B 6 B2 1 2 (3.l2} is the moving average operator of order q. An MA(q} model contains (q+2) parameters, ll, 6 1 62 8 q 0; to be estimated from the data. PAGE 29 40 From the definition of stationarity (see Appendix A) it follows that an MA(q) process is always stationary since 8(B) is finite and thus converges for IBI PAGE 30 41 + .. 8 + k 1 + 8i + + + 8 8 qk q k=l, 2, ... q (3.17) These equations are analogous to the YuleWalker equations for an autoregressive process, but they are not linear and so must be solved iteratively for the estimation of the moving average parameters 8, resulting in estimates that may not have high statistical efficiency. Again it was shown by Wold (1938) that these parameters may need corrections (e.g., to fit better the correlogram as a whole and not only the first q correlation coefficients), and that there may exist several, at most 2 q solutions, for the parameters of the moving average scheme corresponding to an assigned correlogram PI' P 2 .. 
$\rho_q$. However, only those $\theta$'s are acceptable which satisfy the invertibility conditions. From equation (3.14) an estimate for the white noise variance may be obtained:

$$\sigma_a^2 = \frac{\gamma_0}{1 + \theta_1^2 + \cdots + \theta_q^2} \tag{3.18}$$

According to the duality principle (see Appendix A) an invertible MA(q) process can be represented as an AR process of infinite order. This implies that the partial autocorrelation function $\phi_{kk}$ of an MA(q) process is infinite in extent. It can be estimated, after tedious algebraic manipulations, from the Yule-Walker equations by substituting $\rho_k$ as functions of the $\theta$'s for $k \le q$ and $\rho_k = 0$ for $k > q$. So, in contrast to a stationary AR(p) process, the autocorrelation function of an invertible MA(q) process is finite and cuts off after lag $q$, and the partial autocorrelation function is infinite in extent, dominated by damped exponentials and damped sine waves (Box and Jenkins, 1976).

Mixed Autoregressive-Moving Average Models

In practice, to obtain a parsimonious parameterization, it will sometimes be necessary to include both autoregressive and moving average terms in the model. A mixed autoregressive-moving average process of order $(p,q)$, ARMA(p,q), can be written as

$$\tilde{z}_t = \phi_1 \tilde{z}_{t-1} + \cdots + \phi_p \tilde{z}_{t-p} + a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q} \tag{3.19}$$

or

$$\phi(B)\, \tilde{z}_t = \theta(B)\, a_t \tag{3.20}$$

with $(p+q+2)$ parameters, $\mu, \theta_1, \ldots, \theta_q, \phi_1, \ldots, \phi_p, \sigma_a^2$, to be estimated from the data.

An ARMA(p,q) process will be stationary provided that the characteristic equation $\phi(B) = 0$ has all its roots outside the unit circle. Similarly, the roots of $\theta(B) = 0$ must lie outside the unit circle for the process to be invertible.

By multiplying equation (3.19) by $\tilde{z}_{t-k}$ and taking expectations we obtain

$$\gamma_k = \phi_1 \gamma_{k-1} + \cdots + \phi_p \gamma_{k-p} + \gamma_{za}(k) - \theta_1 \gamma_{za}(k-1) - \cdots - \theta_q \gamma_{za}(k-q) \tag{3.21}$$

where $\gamma_{za}(k)$ is the cross covariance function between $\tilde{z}$ and $a$, defined by $\gamma_{za}(k) = E[\tilde{z}_{t-k}\, a_t]$.
Since $\tilde{z}_{t-k}$ depends only on shocks which have occurred up to time $t-k$, it follows that

$$\gamma_{za}(k) = 0, \quad k > 0; \qquad \gamma_{za}(k) \neq 0, \quad k \le 0 \tag{3.22}$$

and (3.21) implies

$$\gamma_k = \phi_1 \gamma_{k-1} + \cdots + \phi_p \gamma_{k-p}, \qquad k \ge q + 1 \tag{3.23}$$

or

$$\rho_k = \phi_1 \rho_{k-1} + \cdots + \phi_p \rho_{k-p}, \qquad k \ge q + 1 \tag{3.24}$$

Thus, for the ARMA(p,q) process the first $q$ autocorrelations $\rho_1, \rho_2, \ldots, \rho_q$ depend directly on the choice of the $q$ moving average parameters $\theta$, as well as on the $p$ autoregressive parameters $\phi$ through (3.21). The autocorrelations of higher lags $\rho_k$, $k \ge q + 1$, are determined through the difference equation (3.24) after providing the $p$ starting values $\rho_{q-p+1}, \ldots, \rho_q$. So, the autocorrelation function of an ARMA(p,q) model is infinite in extent, with the first $q-p$ values $\rho_1, \ldots, \rho_{q-p}$ irregular and the others consisting of damped exponentials and/or damped sine waves (Box and Jenkins, 1976; Salas et al., 1980).

Autoregressive Integrated Moving Average Models

An ARMA(p,q) process is stationary if the roots of $\phi(B) = 0$ lie outside the unit circle and "explosive nonstationary" if they lie inside. For example, an explosive nonstationary AR(1) model is $z_t = 2 z_{t-1} + a_t$ (the plot of $z_t$ vs. $t$ is an exponential growth), in which $\phi(B) = 1 - 2B$ has its root $B = 0.5$ inside the unit circle. The special case of homogeneous nonstationarity is when one or more of the roots lie on the unit circle. By introducing a generalized autoregressive operator $\varphi(B)$ which has $d$ of its roots on the unit circle, the general model can be written as

$$\varphi(B)\, z_t = \phi(B)(1 - B)^d z_t = \theta(B)\, a_t \tag{3.25}$$

that is,

$$\phi(B)\, w_t = \theta(B)\, a_t \tag{3.26}$$

where

$$w_t = \nabla^d z_t \tag{3.27}$$

and $\nabla = 1 - B$ is the difference operator. This model corresponds to assuming that the $d$th difference of the series can be represented by a stationary, invertible ARMA process. By inverting (3.27),

$$z_t = S^d w_t \tag{3.28}$$

where $S$ is the infinite summation operator

$$S = \nabla^{-1} = (1 - B)^{-1} = 1 + B + B^2 + \cdots \tag{3.29}$$

Equation (3.28) implies that the nonstationary process $z_t$ can be obtained by summing or "integrating" the stationary process $w_t$ $d$ times.
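Two points of this passage lend themselves to a short numerical illustration: the root of $\phi(B) = 1 - 2B$ for the explosive AR(1) example, and the fact that differencing and summation undo each other only once a starting level is supplied. The sketch below uses NumPy with a synthetic white-noise series standing in for the stationary process $w_t$; the starting level `z0` and the variable names are hypothetical.

```python
import numpy as np

# Root of phi(B) = 1 - 2B (coefficients listed from the highest power of B)
root = np.roots([-2.0, 1.0])[0]
print(root, abs(root) < 1.0)      # the root lies inside the unit circle

# d = 1: integrate a stationary series, then difference it again
rng = np.random.default_rng(1)
w = rng.normal(size=200)          # stand-in for the stationary series w_t
z0 = 5.0                          # starting level (constant of integration)
z = z0 + np.cumsum(w)             # z_t obtained by summing w_t

w_back = np.diff(z)               # (1 - B) z_t gives back w_t for t >= 2
# Rebuilding z from its differences requires the first value z[0];
# the differences alone carry no information about the level of the series.
z_rebuilt = z[0] + np.concatenate(([0.0], np.cumsum(w_back)))
print(np.allclose(z_rebuilt, z))
```

The need to supply `z[0]` when rebuilding the series is the numerical counterpart of the constant that is lost when a series is differenced.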
Because the process $z_t$ is obtained by integrating the stationary process $w_t$, it is called a simple autoregressive integrated moving average process, ARIMA(p,d,q).

It is also possible to take periodic or seasonal differences at lag $s$ of the series, e.g., the 12th difference of a monthly series, introducing the differencing operator $\nabla_s^D$, with the meaning that the seasonal differencing $\nabla_s$ is applied $D$ times to the series. This periodic ARIMA(P,D,Q)$_s$ model is written as equation (3.30). The combination of nonperiodic and periodic models leads to the multiplicative ARIMA(p,d,q) x ARIMA(P,D,Q)$_s$ model, which is written as equation (3.31).

After the model has been fitted to the differenced series, an integration should be performed to retrieve the original process. But such an integrated series would lack a mean value, since a constant of integration has been lost through the differencing. This is the reason that the ARIMA models cannot be used for synthetic generation of time series, although they are useful in forecasting the deviations of a process (Box and Jenkins, 1976; Salas et al., 1980).

Transformation of the Original Series

Transformation to Normality

Most probability theory and statistical techniques have been developed for normally distributed variables. Hydrologic variables are usually asymmetrically distributed or bounded by zero (positive variables), and so a transformation to normality is often applied before modeling. Another approach would be to model the original skewed series and then find the probability distribution of the uncorrelated residuals. Care must then be taken to assess the errors of applying methods developed for normal variables to skewed variables, especially when the series are highly skewed, e.g., hourly or daily series. On the other hand, when transforming the original series to normal, biases in the mean and standard deviation of the generated series may occur.
In other words, the statistical properties of the transformed series may be reproduced in the generated series but not in the original series. An alternative for avoiding biases in the moments of the generated series would be to estimate the moments of the transformed series through the derived relationships between the moments of the skewed and normal series. Matalas (1967) and Fiering and Jackson (1971) describe how to estimate the first two moments of the log-transformed series so as to reproduce those of the original series. Mejia et al. (1974) present another approach in order to preserve the correlation structure of the original series. However, the most widely used approach is to transform the original skewed series to normal and then model the normal series.

Several transformations may be applied to the original series, and the transformed series then tested for normality; e.g., the graph of their cumulative distribution should appear as a straight line when it is plotted on normal probability paper. The transformation finally chosen is the one that gives the best approximation to normality, e.g., the best fit to a straight line.

Another advantage of transforming the series to normal is that the maximum likelihood estimates of the model parameters are essentially the same as the least squares estimates, provided that the residuals are normally distributed (Box and Jenkins, 1976, Ch. 7). This facilitates the calculation of the final estimates, since they are those values that minimize the sum of squares of the residuals.

Box and Cox (1964) showed how a maximum likelihood and a parallel Bayesian analysis can be applied to any type of transformation family to obtain the "best" choice of transformation from that family.
They illustrated those methods for the popular power families in which the observation $x$ is replaced by $y$, where

$$y = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log x, & \lambda = 0 \end{cases} \tag{3.32}$$

The fundamental assumption was that for some $\lambda$ the transformed observations $y$ can be treated as independently normally distributed with constant variance $\sigma^2$ and with expectations defined by a linear model

$$E[y] = A\,L \tag{3.33}$$

where $A$ is a known constant matrix and $L$ is a vector of unknown parameters associated with the transformed observations (Box and Cox, 1964). This transformation has the advantage over the simple power transformation proposed by Tukey (1957)

$$y = \begin{cases} x^{\lambda}, & \lambda \neq 0 \\ \log x, & \lambda = 0 \end{cases} \tag{3.34}$$

of being continuous at $\lambda = 0$. Otherwise the two transformations are identical provided, as has been shown by Schlesselman (1971), that the linear model of (3.33) contains a constant term. Further, Draper and Cox (1969) showed that the value of $\lambda$ obtained from this family of transformations can be useful even in cases where no power transformation can produce normality exactly. Also, John and Draper (1980) suggested an alternative one-parameter family of transformations for cases where the power transformation fails to produce satisfactory distributional properties, as in the case of a symmetric distribution with long tails.

The selection of the exact transformation to normality (zero skewness) is not an easy task. Overtransformation, i.e., transformation of original data with a large positive (negative) skewness to data with a small negative (positive) skewness, or undertransformation, i.e., transformation of original data with a large positive (negative) skewness to data with a small positive (negative) skewness, may result in unsatisfactory modeling of the series or in forecasts that are in error.
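The effect of the power parameter on skewness can be explored with a small sketch of Eq. (3.32). The data below are synthetic gamma-distributed values used only as a positively skewed stand-in; they are not the thesis rainfall data, and the function names `box_cox` and `skewness` are hypothetical.

```python
import numpy as np

def box_cox(x, lam):
    """Power transformation of Eq. (3.32); x must be strictly positive."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

def skewness(y):
    """Sample skewness coefficient (third central moment over variance^1.5)."""
    d = y - y.mean()
    return float(np.mean(d ** 3) / np.mean(d ** 2) ** 1.5)

# Synthetic positively skewed sample (assumption: gamma variates as a stand-in)
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=50.0, size=2000)

for lam in (1.0, 0.5, 0.25, 0.0):
    print(f"lambda = {lam:4.2f}   skewness = {skewness(box_cox(x, lam)):6.3f}")
```

For this synthetic sample the untransformed data are positively skewed while the logarithm ($\lambda = 0$) leaves a negative skewness, which is an instance of the overtransformation described above; an intermediate $\lambda$ brings the skewness closer to zero.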
Such overtransformation occurred with the data used by Chatfield and Prothero (1973a), who applied the Box-Jenkins forecasting approach, were dissatisfied with the results, and concluded that the Box-Jenkins forecasting procedure is less efficient than other forecasting methods. They applied a log transform to the data which evidently overtransformed the data, as shown by Box and Jenkins (1973), who finally suggested the approximate transformation $y = x^{0.25}$, even though the complicated but precise Box-Cox procedure gave an estimate of $\lambda = 0.37$ [Wilson (1973)]. Thus, the selection of the normality transformation greatly affects the forecasts, as Chatfield and Prothero (1973b) experienced with their data. They concluded that

"We have seen that a "small" change in $\lambda$ from 0 to 0.25 has a substantial effect on the resulting forecasts from model A [ARIMA(1,1,1) × ARIMA(1,1,1)$_{12}$] even though the goodness of fit does not seem to be much affected. This reminds us that a model which fits well does not necessarily forecast well. Since small changes in $\lambda$ close to zero produce marked changes in forecasts, it is obviously advisable to avoid "low" values of $\lambda$, since a procedure which depends critically on distinguishing between fourth-root and logarithmic transformation is fraught with peril. On the other hand a "large" change in $\lambda$ from 0.25 to 1 appears to have relatively little effect on forecasts. So we conjecture that Box-Jenkins forecasts are robust to changes in the transformation parameter away from zero ..." [Chatfield and Prothero (1973b), p. 347]

Stationarity

Most time series occurring in practice exhibit nonstationarity in the form of trends or periodicities. Physical knowledge of the phenomenon being studied and a visual inspection of the plot of the original data may give the first insight into the problem. Usually the series is not long enough, and the detection of trends or cycles only through the plot of the series is ambiguous.
Useful tools for the detection of periodicities are the autocorrelation function and the spectral density function of the series (which is the Fourier transform of the autocorrelation function). If a seasonal pattern is present in the series then the correlogram (plot of the autocorrelation function) will exhibit a sinusoidal appearance and the periodogram (plot of the spectral density function) will show peaks. The period of the sinusoidal function of the correlogram, or the frequency where the peaks occur in the periodogram, can determine the periodic component exactly (Jenkins and Watts, 1968). Another device for the detection of trends and periodicities is to fit some definite mathematical function, such as exponentials, Fourier series or polynomials, to the series and then model the residual series, which is assumed to be stationary. More details on the treatment of nonstationary data as well as on the interpretation of the correlogram and periodogram of a time series can be found in textbooks such as Bendat and Piersol (1958), Jenkins and Watts (1968), Wastler (1969), Yevjevich (1972), and Chatfield (1980). Apart from the approach of removing the nonstationarity of the original series and modeling the residual series with a stationary ARMA(p,q) model, the original nonstationary series can be modeled directly with a simple or seasonally integrated ARIMA model. Actually, the second approach can be viewed as an extension of the first one, i.e., the nonstationarity is removed through the simple ($\nabla$) or seasonal ($\nabla_s$) differencing. However, the integrated model cannot be used for generation of data, as has already been discussed. For many hydrologic applications, one is satisfied with second-order or weak stationarity, i.e., stationarity in the mean and variance.
Furthermore, weak stationarity and the assumption of normality imply strict stationarity (see Appendix A).

Monthly Rainfall Series

Normalization and Stationarization

Stidd (1953, 1968) suggested that rainfall data have a cube-root normal distribution because they are product functions of three variables: vertical motion in the atmosphere, moisture, and duration time. Synthetic rainfall data generated using processes analogous to those operating in nature showed that the exponent required to normalize the distribution is between 0.5 (square root) and 0.33 (cube root) for different types of rainfall (Stidd, 1970). The square root transformation has been extensively used for the approximate normalization of monthly rainfall series (see Table C12 of Appendix C) with satisfactory results: Delleur and Kavvas (1978), Salas et al. (1980, Ch. 5), Roesner and Yevjevich (1966). However, Hinkley (1977) used the exact Box-Cox transformation for monthly rainfall series. Although Ansley et al. (1977) have developed an efficient algorithm for the estimation of $\lambda$ along with other parameters in an ARIMA model, it seems that the exact value of $\lambda$ is not more reliable than the approximate one, $\lambda = 0.5$ (Chatfield and Prothero, 1973b). The reasons for this follow. First, Chatfield and Prothero (1973b) used the Box-Cox procedure to evaluate the exact transformation of their data. They obtained estimates $\hat{\lambda} = 0.24$ using all the data (77 observations), $\hat{\lambda} = 0.34$ using the first 60 observations, and $\hat{\lambda} = 0.16$ excluding the first year's data. Therefore, it is logical to infer that even if the complicated Box-Cox procedure is used for the incomplete rainfall record, the missing values may be enough to give a spurious $\lambda$, which is not "more exact" than the value of 0.5 used in practice.
Second, we may also notice that the use of either $\lambda = 0.33$ (cube root) or $\lambda = 0.5$ (square root) is not expected to greatly affect the forecasts since, according to Chatfield and Prothero (1973b), the Box-Jenkins forecasts are not too sensitive to changes of $\lambda$ for $\lambda > 0.25$.

Monthly rainfall series are nonstationary. The variation in the mean is obvious since generally the expected monthly rainfall value for January is not the same as that of July. Although the variation of the standard deviation is not so easy to visualize, calculations show that months with a higher mean usually have a higher standard deviation. Thus, each month has its own probability distribution and its own statistical parameters, resulting in monthly series that are nonstationary. By introducing the concept of circular stationarity as developed by Hannan (1960) and others (see Appendix A for definition), the periodic monthly rainfall series can be considered not as nonstationary but as circularly stationary, since circular stationarity suggests that the probability distribution of rainfall in a particular month is the same for the different years. Then, the monthly rainfall series is composed of a circularly stationary (periodic) component and a stationary random component. The time-series models currently used in hydrology are fitted to the stationary random component, so the circularly stationary component must be removed before modeling. This last component appears as a sinusoidal component in the autocorrelation function (with a 12-month period) or as a discrete spectral component in the spectrum (peak at the frequency 1/12 cycle per month). Usually several subharmonics of the fundamental 12-month period are needed to describe all the irregularities present in the autocorrelation function and spectral density function, since in nature the periodicity does not follow an ideal cosine function with a 12-month period.
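The appearance of the periodic component in the correlogram and in the spectrum can be illustrated with a short sketch. The series below is synthetic (an idealized 12-month cosine plus white noise, 660 values), and the function names are hypothetical; it is meant only to show the 12-month signature described above, not to represent the rainfall data of this study.

```python
import numpy as np


def autocorrelation(x, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    c0 = np.dot(d, d)
    return np.array([np.dot(d[: d.size - k], d[k:]) / c0
                     for k in range(max_lag + 1)])


def periodogram_peak(x):
    """Frequency (cycles per time step) of the largest periodogram ordinate."""
    x = np.asarray(x, dtype=float)
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0)
    # Skip the zero frequency, which only carries the (removed) mean.
    return float(freqs[1:][np.argmax(power[1:])])


# Synthetic monthly series: idealized annual cycle plus white noise.
rng = np.random.default_rng(1)
months = np.arange(55 * 12)
series = (3.0 + 2.0 * np.cos(2.0 * np.pi * months / 12.0)
          + rng.normal(0.0, 1.0, months.size))

r = autocorrelation(series, 24)   # oscillates with a 12-month period
peak = periodogram_peak(series)   # located at 1/12 cycle per month
```

In this idealized case the correlogram is positive near lags 12 and 24 and negative near lags 6 and 18; real rainfall series would also show the subharmonics mentioned above.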
The use of a Fourier series approach for the approximation of the periodic component of monthly rainfall and monthly runoff series has been illustrated by Roesner and Yevjevich (1966). Kavvas and Delleur (1975) investigated three methods of removal of periodicities in the monthly rainfall series: nonseasonal (first-lag) differencing, seasonal differencing (12-month difference), and removal of monthly means. They worked both analytically and empirically using the rescaled (divided by the monthly standard deviation) monthly rainfall square roots for fifteen Indiana watersheds. They concluded that "all the above transformations yield hydrologic series which satisfy the classical second-order weak stationarity conditions. Both seasonal and nonseasonal differencing reduce the periodicity in the covariance function but distort the original spectrum, thus making it impractical or impossible to fit an ARMA model for generation of synthetic monthly series. The subtraction of monthly means removes the periodicity in the covariance and the amount of nonstationarity introduced is negligible for practical purposes." (Kavvas and Delleur, 1975, p. 349.) In other words, they concluded that the best way of modeling monthly rainfall series is to remove the seasonality (by subtracting the monthly means and dividing by the standard deviations of the normalized series) and then use a stationary ARMA(p,q) model to model the stationary normal residuals.

Modeling of Normalized Series

It is assumed that the nonstationarities due to long-term trends are removed before any operation. Then the appropriate transformation is applied to the data in order to obtain an approximately normal distribution. For monthly rainfall series experience has shown that the best practical transformation is the square root transformation, as has already been discussed.
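The normalization and removal of seasonality described above can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a complete array of non-negative monthly totals arranged as years by months, the square root is used as the normalizing transformation, and the function names and synthetic gamma-distributed data are hypothetical.

```python
import numpy as np


def standardize_by_month(rain):
    """rain: (n_years, 12) array of non-negative monthly totals.

    Returns the standardized square roots together with the monthly means
    and standard deviations of the square-root series.
    """
    roots = np.sqrt(np.asarray(rain, dtype=float))   # approximate normalization
    mean = roots.mean(axis=0)                        # one mean per calendar month
    std = roots.std(axis=0, ddof=1)                  # one std. dev. per month
    return (roots - mean) / std, mean, std


def back_transform(z, mean, std):
    """Invert the standardization and the square root (assumes roots >= 0)."""
    return (z * std + mean) ** 2


# Synthetic example with a wet season in mid-year (hypothetical values).
rng = np.random.default_rng(2)
scale = 1.0 + 2.0 * (1.0 + np.cos(2.0 * np.pi * (np.arange(12) - 6) / 12.0))
rain = rng.gamma(2.0, scale, size=(55, 12))
z, monthly_mean, monthly_std = standardize_by_month(rain)
```

The residual array `z` has zero mean and unit variance for every calendar month and is the series to which a stationary model would be fitted; `back_transform` returns values generated or estimated in the standardized domain to the original units.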
What remains is the modeling of the normalized series with one of the following models: stationary ARMA(p,q), simple nonstationary ARIMA(p,d,q), seasonal nonstationary ARIMA(P,D,Q)$_s$, or multiplicative ARIMA(p,d,q)×(P,D,Q)$_s$ model. Delleur and Kavvas (1978) fitted different models to the monthly rainfall series of 15 basins in Indiana and compared the results. They studied the models ARIMA(0,0,0), ARIMA(1,0,1), ARIMA(1,1,1), ARIMA(1,1,1)$_{12}$ and ARIMA(1,0,0)×(1,1,1)$_{12}$ on the square-root transformed series. They concluded that, of the nonseasonal ARIMA models, ARMA(1,1) "emerged as the most suitable for the generation and forecasting of monthly rainfall series." The goodness-of-fit tests applied to the residuals were the portmanteau lack-of-fit test (see Appendix A) of Box and Pierce (1970) and the cumulative periodogram test (Box and Jenkins, 1976, p. 294). The ARMA(1,1) model passed both tests in all cases studied. Of the seasonal models, ARIMA(1,0,0)×(1,1,1)$_{12}$ also passed the goodness-of-fit tests in all cases, but they stress that this model "has only limited use in the forecasting of monthly rainfall series since it does not preserve the monthly standard deviations." As far as forecasts are concerned, they showed that "the forecasts by the several models follow each other very closely and the forecasts rapidly tend to the mean of the observed rainfall square roots (which is the forecast of the white noise model)."

CHAPTER 4
MULTIVARIATE STOCHASTIC MODELS

Introduction

For univariate stochastic models the sequence of observations under study is assumed independent of other sequences of observations and so is studied by itself (single or univariate time series). However, in practice there is always an interdependence among such sequences of observations, and their simultaneous study leads to the concept of multivariate statistical analysis.
For example, a rainfall series of one station may be better modeled if its correlation with concurrent rainfall series at other nearby stations is incorporated into the model. Multiple time series can be divided into two groups: (1) multiple time series at several points (e.g., rainfall series at different stations, streamflow series at various points of a river), and (2) multiple series of different kinds at one point (e.g., rainfall and runoff series at the same station). In general, both kinds of multiple time series are studied simultaneously, and their correlation and cross-correlation structure is used for the construction of a model that better describes all these series. The parameters of this so-called multivariate stochastic model are calculated such that the correlation and cross-correlation structure of the multiple measured series are preserved in the multiple series generated by the model. The multivariate models that will be presented in this chapter have been developed and extensively used for the generation of synthetic series. How these models can be adapted and used for filling in missing values will be discussed in Chapter 5.

General Multivariate Regression Model

The general form of a multivariate regression model is

$$Y = A X + B H \tag{4.1}$$

where $Y$ is the vector of dependent variables, $X$ the vector of independent variables, $A$ and $B$ matrices of regression coefficients, and $H$ a vector of random components. The vectors $Y$ and $X$ may consist of either the same variable at different points (or at different times) or different variables at the same or different points (or at different times). For convenience and without loss of generality all the variables are assumed second-order stationary and normally distributed with zero mean and unit variance. Transformations to accomplish normality have been discussed in Chapter 3. A random component is superimposed on the model to account for the nondeterministic fluctuations.
In the above model, the dependent and independent variables must be selected carefully so that the most information is extracted from the existing data. A good summary of the methods for the selection of independent variables for use in the model is given in Draper and Smith (1966). Most popular is the stepwise regression procedure in which the independent variables are ranked as a function of their partial correlation coefficients with the dependent variable and are added to the model, in that order, if they pass a sequential F test. The parameter matrices $A$ and $B$ are calculated from the existing data in such a way that important statistical characteristics of the historical series are preserved in the generated series. This estimation procedure becomes cumbersome when too many dependent and independent variables are involved in the model, and several simplifications are often made in practice. On the other hand, restrictions have to be imposed on the form of the data, as we shall see later, to ensure the existence of real solutions for the matrices $A$ and $B$.

Multivariate Lag-One Autoregressive Model

If only one variable (e.g., rainfall at different stations) is used in the analysis then the model of equation (4.1) becomes a multivariate autoregressive model. Since in the rest of this chapter we will be dealing only with one variable (rainfall) which has been transformed to normal and second-order stationary, the vectors $Y$ and $X$ are replaced by the vector $Z$ for a notation consistent with the univariate models. Matalas (1967) suggested the multivariate lag-one autoregressive model

$$Z_t = A Z_{t-1} + B H_t \tag{4.3}$$

where $Z_t$ is an $(m \times 1)$ vector whose $i$th element $z_{i,t}$ is the observed rainfall value at station $i$ and at time $t$, and the other variables have been described previously. Such a model can be used for the simultaneous generation of rainfall series at $m$ different stations.
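To make the generation step concrete, the following is a minimal sketch of simulating from a lag-one model, assuming the usual form $Z_t = A Z_{t-1} + B H_t$ with $H_t$ a vector of independent standard normal variates. The function name, the warm-up length and the demonstration matrices are hypothetical and are not taken from the case study.

```python
import numpy as np


def generate_lag_one(a, b, n_steps, rng, warm_up=100):
    """Generate n_steps vectors from Z_t = A Z_{t-1} + B H_t.

    a, b    : (m, m) coefficient matrices (assumed given)
    warm_up : number of initial steps discarded to forget the zero start
    """
    m = a.shape[0]
    z = np.zeros(m)
    out = np.empty((n_steps, m))
    for t in range(-warm_up, n_steps):
        z = a @ z + b @ rng.standard_normal(m)   # one step of the recursion
        if t >= 0:
            out[t] = z
    return out


# Hypothetical two-station example: independent standardized series,
# each with lag-one autocorrelation 0.5 and unit variance.
a_demo = 0.5 * np.eye(2)
b_demo = np.sqrt(1.0 - 0.5**2) * np.eye(2)
z_demo = generate_lag_one(a_demo, b_demo, 20000, np.random.default_rng(4))
```

With these demonstration matrices the generated series should have variances close to one and lag-one autocorrelations close to 0.5; correlated stations would require a non-diagonal $B$.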
The correlation and cross-correlation of the series are incorporated in the model through the parameters $A$ and $B$. The matrices $A$ and $B$ are estimated from the historical series so that the means, standard deviations and autocorrelation coefficients of lag one for all the series, as well as the cross-correlations of lag zero and lag one between pairs of series, are maintained. Let $M_0$ denote the lag-zero correlation matrix which is defined as

$$M_0 = E[Z_t Z_t^T] \tag{4.4}$$

Then a diagonal element of $M_0$ is $E[z_{i,t} z_{i,t}] = \rho_{ii}(0) = 1$ (since $Z_t$ is standardized) and an off-diagonal element $(i,j)$ is $E[z_{i,t} z_{j,t}] = \rho_{ij}(0)$, which is the lag-zero cross-correlation between series $\{z_i\}$ and $\{z_j\}$. The matrix $M_0$ is symmetric since $\rho_{ij}(0) = \rho_{ji}(0)$ for every $i$, $j$. Let $M_1$ denote the lag-one correlation matrix defined as

$$M_1 = E[Z_t Z_{t-1}^T] \tag{4.5}$$

A diagonal element of $M_1$ is $E[z_{i,t} z_{i,t-1}] = \rho_{ii}(1)$, which is the lag-one serial correlation coefficient of the series $\{z_i\}$, and an off-diagonal element $(i,j)$ is $E[z_{i,t} z_{j,t-1}] = \rho_{ij}(1)$, which is the lag-one cross-correlation between the $\{z_i\}$ and $\{z_j\}$ series, the latter lagged behind the former. Since in general $\rho_{ij}(1) \neq \rho_{ji}(1)$ for $i \neq j$, the matrix $M_1$ is not symmetric. After some algebraic manipulations (see Appendix B) the coefficient matrices $A$ and $B$ are obtained as solutions to the equations

$$A = M_1 M_0^{-1} \tag{4.6}$$

$$B B^T = M_0 - M_1 M_0^{-1} M_1^T \tag{4.7}$$

where $M_0^{-1}$ is the inverse of $M_0$ and $M_1^T$ the transpose of $M_1$. The correlation matrices $M_0$ and $M_1$ are calculated from the data. Then an estimate of the matrix $A$ is given directly by equation (4.6), and an estimate for $B$ is found by solving equation (4.7) by using a technique of principal component analysis (Fiering, 1964) or upper triangularization (Young, 1968). For more details on the solution of equation (4.7) see Appendix B.

Comments on Multivariate AR(1) Model

Assumption of Normality and Stationarity

We have assumed that all random variables involved in the model are normal.
The assumption of a multivariate normal distribution is convenient but not necessary. It has been shown (Valencia and Schaake, 1973) that the multivariate AR(1) model preserves first and second order statistics regardless of the underlying probability distributions. Several studies have been done using the original skewed series directly. Matalas (1967) worked with lognormal series and constructed the generation model so that it preserves the historical statistics of the lognormal process. Mejia et al. (1974) showed a procedure for multivariate generation of mixtures of normal and lognormal variables. Moran (1970) indicated how a multivariate gamma process may be applied, and Kahan (1974) presented a method for the preservation of skewness in a linear bivariate regression model. But in general, the normalization of the series prior to modeling is more convenient, especially when the series have different underlying probability distributions. In such cases different transformations are applied to the series, and the combination of transformations that yields minimum average skewness is kept. Average skewness is the sum of the skewness of each series divided by the number of series or number of stations used. This operation is called finding the MST (Minimum Skewness Transformation) and results in an approximately multivariate normal distribution (Young and Pisano, 1968). We have also assumed that all variables are standardized, i.e., have zero mean and unit variance. This assumption is made without loss of generality since linear transformations are preserved through the model. On the other hand this transformation becomes necessary when modeling periodic series, since by subtracting the periodic means and dividing by the standard deviations we remove almost all of the periodicity. If the data are not standardized, $M_0$ and $M_1$ represent the lag-zero and lag-one covariance matrices (instead of correlation matrices), respectively.
If $S$ denotes the diagonal matrix of the standard deviations and $R_0$, $R_1$ the lag-zero and lag-one correlation matrices, then

$$M_0 = S R_0 S \tag{4.8}$$

and

$$M_1 = S R_1 S \tag{4.9}$$

When we standardize the data the matrix $S$ is an identity matrix and $M_0$, $M_1$ become the correlation matrices $R_0$ and $R_1$ respectively. Thus, one other advantage of standardization is that we work with correlation matrices whose elements do not exceed unity in magnitude, and the computations are likely to be more stable (Pegram and James, 1972).

Cross-Correlation Matrix $M_1$

Notice that the lag-one correlation matrix $M_1$ has been defined as $M_1 = E[Z_t Z_{t-1}^T]$, which contains the lag-one cross-correlations between pairs of series but having the second series lagged behind the first one. Following this definition the lag-minus-one correlation matrix will be

$$M_{-1} = E[Z_{t-1} Z_t^T] \tag{4.10}$$

and it will contain the lag-one correlations having now the second series lagged ahead of the first one. It is easy to show that $M_{-1}$ is actually the transpose of $M_1$:

$$M_{-1} = E[Z_{t-1} Z_t^T] = E[(Z_t Z_{t-1}^T)^T] = M_1^T \tag{4.11}$$

Care then must be taken so that there is consistency between the equation used to calculate matrix $A$ and the way that the cross-correlation coefficients have been calculated. Such an inconsistency was present in the numerical multisite package developed by Young and Pisano (1968) and was first corrected by O'Connell (1973) and completely corrected and improved by Finzi et al. (1974, 1975).

Incomplete Data Sets

In practice, hydrologic series at different stations are unlikely to be concurrent and of equal length. With lag-zero auto- and cross-correlation coefficients calculated from the incomplete data sets, the lag-zero correlation matrix $M_0$ obtained may not be positive semidefinite, and its inverse $M_0^{-1}$, needed for the calculation of matrix $A$, thus may have elements that are complex numbers. Also, a necessary and sufficient condition for a real solution of matrix $B$ is that $C = M_0 - M_1 M_0^{-1} M_1^T$ is a positive semidefinite matrix (see Appendix B). When all of the series are concurrent and complete then $M_0$ and $C$ are both positive semidefinite matrices [Valencia and Schaake, 1973], and the generated synthetic series are real numbers. When the series are incomplete there is no guarantee that real solutions for the matrices $A$ and $B$ exist, causing the model of Matalas (1967) to be conditional on $M_0$ and $C$ being positive semidefinite [Slack, 1973]. Several techniques have been proposed which use the incomplete data sets but guarantee the positive semidefiniteness of the correlation matrices. Fiering (1968) suggested a technique that can be used to produce a positive semidefinite correlation matrix $M_0$. If $M_0$ is not positive semidefinite then negative eigenvalues may occur and hence negative variances, since the eigenvalues are variances in the principal component system. In this technique, the eigenvalues of the original correlation matrix are calculated. If negative eigenvalues are encountered, an adjustment procedure is used to eliminate them, thereby altering the correlation matrix $M_0$ [Fiering, 1968]. A correlation matrix is called consistent if all its eigenvalues are positive. But consistent estimates of the correlation matrices $M_0$ and $M_1$ do not guarantee that $C$ will also be consistent. Crosby and Maddock (1970) proposed a technique that is suitable only for monotone data (data continuous in collection to the present but having different starting times). This technique produces a consistent estimate of the matrix $M_0$ as well as of the matrix $C$, and is based on the maximum likelihood technique developed by Anderson (1957). Valencia and Schaake (1973) developed another technique.
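The existence conditions discussed above can be checked numerically. The sketch below assumes the lag-one relations $A = M_1 M_0^{-1}$ and $B B^T = C = M_0 - M_1 M_0^{-1} M_1^T$, and obtains $B$ from an eigen-decomposition of $C$, in the spirit of the principal component approach but without claiming to reproduce the procedure of Fiering (1964). All function names and the example matrices are hypothetical.

```python
import numpy as np


def is_positive_semidefinite(matrix, tol=1e-10):
    """True if the symmetric part of `matrix` has no eigenvalue below -tol."""
    sym = 0.5 * (matrix + matrix.T)
    return bool(np.linalg.eigvalsh(sym).min() >= -tol)


def lag_one_parameters(m0, m1, tol=1e-10):
    """Return (A, B) with A = M1 M0^-1 and B B^T = M0 - M1 M0^-1 M1^T.

    Raises ValueError when C is not positive semidefinite, i.e. when no
    real matrix B exists.
    """
    a = m1 @ np.linalg.inv(m0)
    c = m0 - a @ m1.T
    c = 0.5 * (c + c.T)                       # remove round-off asymmetry
    values, vectors = np.linalg.eigh(c)
    if values.min() < -tol:
        raise ValueError("C is not positive semidefinite: no real B exists")
    b = vectors @ np.diag(np.sqrt(np.clip(values, 0.0, None)))
    return a, b


# Consistent (hypothetical) two-station example.
m0 = np.array([[1.0, 0.5], [0.5, 1.0]])
m1 = 0.4 * m0
a_hat, b_hat = lag_one_parameters(m0, m1)

# Pairwise correlations that cannot all hold at once, as may happen when
# each coefficient is computed from a different subset of the record.
inconsistent = np.array([[1.0, 0.9, -0.9],
                         [0.9, 1.0, 0.9],
                         [-0.9, 0.9, 1.0]])
```

Here `is_positive_semidefinite(inconsistent)` is false, which is the situation in which an adjustment of the correlation matrix becomes necessary before the model parameters can be computed.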
In the technique of Valencia and Schaake (1973), matrices $A$ and $B$ are estimated from equations (4.12) and (4.13), where $M_{01}$ is the lag-zero correlation matrix $M_0$ computed from the first $(N-1)$ vectors of the data, and $M_{02}$ is computed from the last $(N-1)$ vectors, where $N$ is the number of data points (number of times sampled) in each of the n series.

Further Simplification

Sometimes in practice, the preservation of the lag-zero and lag-one autocorrelations and the lag-zero cross-correlations is enough. In such cases, i.e., when the lag-one cross-correlations are of no interest, a convenient simplification due to Matalas (1967, 1974) can be made. He defined matrix $A$ as a diagonal matrix whose diagonal elements are the lag-one autocorrelation coefficients. With $A$ defined as above, the lag-one cross-correlation of the generated series, $\hat{\rho}_{ij}(1)$, can be shown to be the product of the lag-zero cross-correlation $\rho_{ij}(0)$ and the lag-one autocorrelation of the series $\rho_{ii}(1)$, which of course differs from the actual lag-one cross-correlation $\rho_{ij}(1)$:

$$\hat{\rho}_{ij}(1) = \rho_{ij}(0)\,\rho_{ii}(1) \tag{4.14}$$

By using $\hat{\rho}_{ij}(1)$ of equation (4.14) in place of the actual $\rho_{ij}(1)$, thus avoiding the actual computation of $\rho_{ij}(1)$ from the data, the desired statistical properties of the series are still preserved.

Higher Order Multivariate Models

The order $p$ of a multivariate autoregressive model could be estimated from the plots of the autocorrelation and partial autocorrelation functions of the series (Salas et al., 1980) as an extension of the univariate model identification, which is already a difficult and ambiguous task. However, in practice first and second order models are usually adequate and higher order models should be avoided (Box and Jenkins, 1976). In any case, the multivariate multilag autoregressive model of order $p$ takes the form

$$Z_t = A_1 Z_{t-1} + A_2 Z_{t-2} + \cdots + A_p Z_{t-p} + B H_t \tag{4.15}$$

and the matrices $A_1, A_2, \ldots, A_p$, $B$ are the solutions of the equations

$$M_i = \sum_{k=1}^{p} A_k M_{i-k}, \qquad i = 1, 2, \ldots, p \tag{4.16}$$

$$B B^T = M_0 - \sum_{k=1}^{p} A_k M_k^T \tag{4.17}$$

where $M_i$ is the lag-$i$ correlation matrix.
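As a sketch of how the order-$p$ system might be solved numerically, the code below stacks the $p$ matrix equations into one block linear system. It assumes the standard multivariate Yule-Walker form $M_i = \sum_k A_k M_{i-k}$ with $M_{-j} = M_j^T$ and $B B^T = M_0 - \sum_k A_k M_k^T$; the function name and the test matrices are hypothetical, and this is not presented as the solution method of any of the cited authors.

```python
import numpy as np


def solve_ar_matrices(m_list):
    """m_list = [M0, M1, ..., Mp] with M_k = E[Z_t Z_{t-k}^T].

    Returns ([A1, ..., Ap], BBt) from the block system
    [A1 ... Ap] G = [M1 ... Mp], where block (k, i) of G is M_{i-k}.
    """
    p = len(m_list) - 1
    m = m_list[0].shape[0]

    def lag(j):
        # Negative lags follow from M_{-j} = M_j^T.
        return m_list[j] if j >= 0 else m_list[-j].T

    g = np.block([[lag(i - k) for i in range(1, p + 1)]
                  for k in range(1, p + 1)])
    r = np.hstack([m_list[i] for i in range(1, p + 1)])
    x = np.linalg.solve(g.T, r.T).T               # solves X G = R for X
    a_list = [x[:, k * m:(k + 1) * m] for k in range(p)]
    bbt = m_list[0] - sum(a @ m_list[k + 1].T for k, a in enumerate(a_list))
    return a_list, bbt


# Check on a process that is truly lag-one: fitting p = 2 should give
# A1 equal to the true matrix and A2 equal to zero.
a_true = np.array([[0.5, 0.1], [0.0, 0.3]])
m0 = np.eye(2)
for _ in range(200):                              # stationary M0 for B B^T = I
    m0 = a_true @ m0 @ a_true.T + np.eye(2)
m1 = a_true @ m0
m2 = a_true @ m1
(a1, a2), bbt = solve_ar_matrices([m0, m1, m2])
```

The matrix $B$ itself would then be obtained from `bbt` by one of the factorization techniques already mentioned for the lag-one case.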
Equation (4.16) is a set of $p$ matrix equations to be solved for the matrices $A_1, A_2, \ldots, A_p$, and matrix $B$ is obtained from (4.17) using techniques already discussed. Here, the assumption of diagonal $A$ matrices becomes even more attractive. For a multivariate second-order AR process the above simplification is illustrated in Salas and Pegram (1977), where the case of periodic (not constant) matrix parameters is also considered. O'Connell (1974) studied the multivariate ARMA(1,1) model of equation (4.18), where $A$, $B$, and $C$ are coefficient matrices to be determined from the data. Specifically, they are solutions of the system of matrix equations (4.19), where $S$ and $T$ are functions of the correlation matrices $M_0$, $M_1$ and $M_2$. Methods for solving this system are proposed by O'Connell (1974). Explicit solutions for higher order multivariate ARMA models are not available, and Salas et al. (1980) propose an approximate multivariate ARMA(p,q) model.

CHAPTER 5
ESTIMATION OF MISSING MONTHLY RAINFALL VALUES - A CASE STUDY

Introduction

This section compares and evaluates different methods for the estimation of missing values in hydrological time series. A case study is presented in which four of the simplified methods presented in Chapter 2 have been applied to a set of four concurrent 55-year monthly rainfall series from south Florida and the results compared. Also, a recursive method for the estimation of missing values by the use of a univariate or multivariate stochastic model has been proposed and demonstrated. The theory already presented in Chapters 2, 3 and 4 is supplemented whenever needed.

Set Up of the Problem

The monthly rainfall series of four stations in the South Florida Water Management District (SFWMD) have been used in the analysis. These stations are:

Station A    MRF6038, Moore Haven Lock 1
Station 1    MRF6013, Avon Park
Station 2    MRF6093, Fort Myers WSO AP
Station 3    MRF6042, Canal Point USDA
For convenience the four stations will sometimes be addressed as A, 1, 2, 3 instead of their SFWMD identification numbers 6038, 6013, 6093 and 6042, respectively. Their locations are shown in the map of Fig. 5.1. Station A in the center is considered as the interpolation station (whose missing values are to be estimated) and the other three stations 1, 2 and 3 as the index stations. Care has been taken so that the three index stations are as close and as evenly distributed around the interpolation station as possible. This particular set of four stations was selected because it exhibits many desired and convenient properties:

(1) the stations have an overlapping period of 55 years (1927-1981);
(2) for this 55-year period the record of the interpolation station (station A) is complete (no missing values);
(3) the three index stations have a small percentage of missing values for the overlapping period (station 1: 2.7% missing, station 2: complete, and station 3: 1.2% missing values).

The 55-year length of the records is considered long enough to establish the historical statistics (e.g., monthly mean, standard deviation and skewness) and provides a monthly series of a satisfactory length (660 values) for fitting a univariate or multivariate ARMA model.

[Fig. 5.1. The four south Florida rainfall stations used in the analysis. A: 6038, Moore Haven Lock 1; 1: 6013, Avon Park; 2: 6093, Fort Myers WSO AP; 3: 6042, Canal Point USDA.]

The completeness of the series of the interpolation station permits the random generation of gaps in the series, corresponding to different percentages of missing values, with the method described in Chapter 1.
After the missing values have been estimated by the applied models, the gaps are infilled with the estimated values and the statistics of the new (estimated) series are compared with the statistics of the incomplete series and the statistics of the historical (actual) series. Also, the statistical closeness of the infilled (estimated) values to the hidden (actual) values provides a means for the evaluation and comparison of the methods. When, for the estimation of a missing value of the interpolation station, the corresponding value of one or more index stations is also missing, the latter is eliminated from the analysis, i.e., only the remaining one or two index stations are used for the estimation. Frequent occurrence of such concurrent gaps in both the interpolation and the index stations would alter the results of the applied method in a way that cannot be easily evaluated (e.g., another parameter such as the probability of having concurrent gaps should be included in the analysis). A small number of missing values in the selected index stations greatly reduces the likelihood of such simultaneous gaps, and thus the effectiveness of the applied estimation procedures can be judged more reliably.

The statistical properties (e.g., monthly mean, standard deviation, skewness and coefficient of variation) of the original monthly rainfall series, truncated to the 1927-1981 period, for the four stations are shown in Tables C.1, C.2, C.3 and C.4 of Appendix C. Figure 5.2 shows the plot of the monthly means and standard deviations for station A. From these plots we observe that: (1) the plot of monthly means is in agreement with the typical plot for Florida shown in Fig. 1.1, and (2) months with a high mean usually have a high standard deviation. The only exception seems to be the month of January, which in spite of its low mean exhibits a high standard deviation and therefore a very high coefficient of variation and an unusually high skewness.
A closer look at the January rainfall values of station A shows that the unusual properties for that month are due to an extreme value of 21.4 inches of rainfall for January 1979, the other values being between 0.05 and 6.04 inches. The three index stations 1, 2 and 3 are at distances of 59 miles, 51 miles and 29 miles, respectively, from the interpolation station A.

[Fig. 5.2. Plot of the monthly means and standard deviations, in inches, for station 6038 (1927-1981): (a) monthly means; (b) monthly standard deviations.]

Simplified Estimation Techniques

Techniques Utilized

From the simplified techniques presented in Chapter 2, the following four are applied for the estimation of missing monthly rainfall values:

(1) the mean value method (MV)
(2) the reciprocal distances method (RD)
(3) the normal ratio method (NR) and
(4) the modified weighted average method (MWA).

These methods are all deterministic and are applied directly to the available data, thus permitting a uniform and objective comparison of the results. The mean value plus random component method has not been included in this thesis. The above four methods will be applied for five different percentages of missing values: 2%, 5%, 10%, 15% and 20%. These percentages cover almost 80% of all cases encountered in practice, as has been shown in Table 1.1 (i.e., almost 80% of the stations have below 20% missing values). From the same table it can also be seen that almost 30% of the stations have below 5% missing values. Therefore, it would be of interest and practical use if we could generalize the results for the region of below 5% missing values, since a large fraction of the cases in practice fall in this region. The application of the first three methods (MV, RD, NR methods) is straightforward and no further comments need be made. However, some comments on the least squares (LS) method and the modified weighted average (MWA) method are necessary.
Least Squares Method (LS)

The least squares method, although simple in principle, involves an enormous amount of calculation, and for that reason it has been excluded from this study. For example, consider the case in which the interpolation station A is regressed on the three index stations 1, 2 and 3. The estimated values will be given by

$$y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \varepsilon \qquad (5.1)$$

where $a$, $b_1$, $b_2$, $b_3$ are the regression coefficients calculated from the available concurrent values of all four variables. There are 12 such regression equations, one for each month. But if an index station (say, station 3) has a missing value simultaneously with the interpolation station, a new set of 12 regression equations is needed for the estimation, e.g.,

$$y = a' + b_1' x_1 + b_2' x_2 + \varepsilon' \qquad (5.2)$$

Unless this coincidence of simultaneously missing values is investigated manually so that only the needed least squares regressions are performed (Buck, 1960), all the possible combinations of regressions must be performed. This involves regressions among all four variables $(y; x_1, x_2, x_3)$, among three of them, $(y; x_1, x_2)$, $(y; x_1, x_3)$, $(y; x_2, x_3)$, and between pairs of them, $(y; x_1)$, $(y; x_2)$, $(y; x_3)$, giving overall 7 sets of 12 regression equations. Because the regression coefficients are different for each percentage of missing values (since their calculation is based only on the existing concurrent values), the 84 (7 x 12) regressions must be repeated for each level of missing values (420 regressions overall for the five levels of this study).

It could be argued that the same 12 regression equations $(y; x_1, x_2, x_3)$ could be kept and a missing value $x_i$ replaced by its mean $\bar{x}_i$ or by another estimate $x_i'$. In that case equation (5.1) would become, for a missing $x_3$ say,

$$y = a + b_1 x_1 + b_2 x_2 + b_3 x_3' + \varepsilon \qquad (5.3)$$

the regression coefficients $a$, $b_1$, $b_2$, $b_3$ remaining unchanged.
This in fact can be done, but then the method tested will not be the "pure" least squares method, since the results will depend on the secondary method used for the estimation of the missing $x_i$ values.

The coefficients $a$, $b_1$, $b_2$ and $b_3$ (equation 5.1) of the regression of the {y} series (of station A with 2% missing values) on the series {x1}, {x2} and {x3} (of stations 1, 2 and 3, respectively) are shown in Table 5.1. In the same table the values of the squared multiple correlation coefficient $R^2$ and the standard deviation of the {y} series are also shown. The numbers in parentheses show the significance level $\alpha$ at which the parameters are significant (the probability of being nonzero is $1 - \alpha$). For example, for January the coefficient $b_1$ is not significant at the 5% significance level ($\alpha$ = 0.05) since 0.279 is greater than 0.05, but the $R^2$ coefficient is significant even at the 0.01% significance level ($\alpha$ = 0.0001). The significance levels correspond to the t-test for the regression coefficients and to the F-test for the $R^2$ coefficients. The standard deviation, $s$, of the {y} series is also listed since the random component (equation 5.4) is scaled by $s$, as has already been discussed in Chapter 2.

Table 5.1. Least Squares Regression Coefficients for Equation (5.1) and Their Significance Levels (in parentheses). The standard deviation, s, for each month is also given.

Month   a (inches)        b1                b2                b3                R^2               s (inches)
JAN     0.0059 (0.9692)   0.1271 (0.2790)   0.4994 (0.0005)   0.3377 (0.0017)   0.8046 (0.0001)   3.076
FEB     0.1355 (0.5260)   0.2624 (0.0025)   0.0086 (0.9431)   0.5345 (0.0001)   0.7033 (0.0001)   1.365
MAR     0.0052 (0.9793)   0.1617 (0.0138)   0.3457 (0.0001)   0.4507 (0.0001)   0.9142 (0.0001)   2.464
APR     0.7388 (0.0273)   0.2405 (0.0458)   0.2813 (0.0156)   0.1919 (0.1132)   0.4936 (0.0001)   1.818
MAY     2.1302 (0.0070)   0.4046 (0.0115)   0.0591 (0.7180)   0.2186 (0.1308)   0.2752 (0.0016)   2.583
JUN     1.8765 (0.1505)   0.2192 (0.1576)   0.1108 (0.4034)   0.3339 (0.0133)   0.3351 (0.0002)   3.812
JUL     2.8601 (0.0750)   0.0345 (0.7883)   0.3993 (0.0131)   0.1885 (0.1780)   0.2005 (0.0154)   3.399
AUG     2.0820 (0.2065)   0.1771 (0.1666)   0.2078 (0.0787)   0.2660 (0.0589)   0.1789 (0.0248)   2.938
SEP     0.0108 (0.9916)   0.5102 (0.0003)   0.2113 (0.0893)   0.2450 (0.0190)   0.5669 (0.0001)   4.085
OCT     0.6985 (0.0866)   0.3960 (0.0020)   0.2287 (0.0433)   0.4667 (0.0001)   0.7749 (0.0001)   3.073
NOV     0.3167 (0.1290)   0.3009 (0.0030)   0.2473 (0.0804)   0.1063 (0.0069)   0.4575 (0.0001)   1.228
DEC     0.2623 (0.1987)   0.2332 (0.1065)   0.3807 (0.0084)   0.4381 (0.0001)   0.7723 (0.0001)   1.585

It is interesting to note that although the multiple regression coefficient $R^2$ varies from month to month, from as low as 0.18 to as high as 0.91, it is always significant at the 5% significance level. The months of July and August exhibit the lowest (although significant) correlation coefficients, as is expected for Florida. The physical reason for these low correlations is that in the summer most rainfall is convective, whereas in other months there is more cyclonic activity. Rainfall from scattered thunderstorms is simply not as correlated with that of nearby areas as is rainfall from broad cyclonic activity. Thus, on the basis of the regressions shown in Table 5.1, the least squares method would be expected to perform least well in the summer in Florida, but this point is not validated in this thesis.

Modified Weighted Average Method (MWA)

For the modified weighted average method the twelve (3 x 3) covariance matrices of the three index stations have been calculated for each month using equations (2.9) and (2.10), and are shown in Table C.11 (Appendix C). Also, the monthly standard deviations, $s_y$, have been estimated from the known {y} series, and the monthly standard deviations, $s_y'$, have been calculated by equation (2.11) using the calculated covariance matrices.
Notice that although the twelve $s_y$ values (as calculated from the actual data, and which we want to preserve) are different at different percentages of missing values, the twelve $s_y'$ values (which depend only on the weights $a_i$ and the covariance matrix of the index stations) are calculated only once. The correction coefficients $f$ ($f = s_y / s_y'$) for each month and for each different percentage of missing values, which must be applied on matrix A (equation 2.21), are shown in Table 5.2. From this table it can be seen that if the simple weighted average scheme of equation (2.3) were used for the generation, the standard deviation of November would be overestimated (by a factor of approximately 2) and the standard deviation of all other months would be underestimated (e.g., by a factor of approximately 0.5 for the month of January). We also observe that, due to small changes of $s_y$ for different percentages of missing values, the correction factor $f$ does not vary much either, but tends to be slightly greater the greater the percent of missing values.

Table 5.2. Correction Coefficient, f, for Each Month and for Each Different Percent of Missing Values (f = s_y / s_y').

Month   2%      5%      10%     15%     20%
JAN     1.777   1.777   1.795   1.897   1.872
FEB     1.129   1.142   1.136   1.199   1.188
MAR     1.178   1.207   1.177   1.003   1.009
APR     1.089   0.980   1.061   1.051   1.054
MAY     1.269   1.197   1.212   1.222   1.360
JUN     1.214   1.173   1.192   1.228   1.242
JUL     1.338   1.345   1.386   1.390   1.491
AUG     1.424   1.414   1.425   1.432   1.369
SEP     1.313   1.328   1.325   1.210   1.331
OCT     1.258   1.273   1.218   1.229   1.314
NOV     0.533   0.537   0.509   0.583   0.572
DEC     1.161   1.140   1.169   1.172   1.248

The modified weighted average scheme theoretically preserves the mean and variance of the series, as has been shown in Chapter 2. But this is true for a series that has been generated by the model and not for a series that is a mix of existing values and values generated (estimated) by the model.
This illustrates the difference between the two concepts: "generation of data by a model" and "estimation of missing values by a model." A method for generation of data which is considered "good" in the sense that it preserves first and second order statistics is not necessarily "good" for the estimation of missing values. In fact, it may give statistics comparable to the ones given by a simpler estimation technique which does not preserve the statistics, even as a generation scheme. Theoretically, for a "large" number of missing values, the estimation model operates as a generation model and thus preserves the "desired" statistics, but in practice, for such a large number of missing values, the "desired" statistics (calculated from the few existing values) are of questionable reliability. Only for augmentation of the time series (extension of the series before the first or after the last point) will the modified weighted average scheme, or other schemes that preserve the "desired" statistics, be expected to work better than the simple weighted average schemes.

One other disadvantage of the modified weighted average scheme, as well as of the least squares scheme, is that negative values may be generated by the model. Since all hydrological variables are positive, the negative generated values are set equal to zero, thus altering the statistics of the series. This is also true for all methods that involve a random component and is mainly due to "big" negative values taken on by the random deviate. The numbers of negative values estimated by the MWA method which have been set equal to zero in the example that follows were 1, 1, 6, 4 and 9 for the 2%, 5%, 10%, 15% and 20% levels of missing values, respectively. The effect of the values arbitrarily set to zero cannot be evaluated exactly, but it can be intuitively understood that a distortion in the distribution is introduced.
A transformation that prevents the generation of negative values could be performed on the data before the application of the generation scheme. Such a transformation is, for example, the logarithmic transformation, since its inverse applied on a negative value exists, and the mapping of the transformed to the original data and vice versa is one to one (this is not true for the square root transformation).

Comparison of the MV, RD, NR and MWA Methods

The performance of each method applied for the estimation of the missing values will be evaluated by comparing the estimated series (existing plus estimated values) to the incomplete series (the one really available in practice) and to the actual series (unknown in practice, but known in this artificial case). The criteria that will be used for the comparison of the methods are the following:

(1) the bias in the mean as measured (a) by the difference between the mean of the estimated series, $\bar{y}_e$, and the mean of the incomplete series, $\bar{y}_i$ (i = 1, 2, 3, 4, 5 for the five different percentages of missing values), and (b) by the difference between the mean of the estimated series, $\bar{y}_e$, and the mean of the actual series, $\bar{y}_a$;
(2) the bias in the standard deviation as measured (a) by the ratio of the standard deviation of the estimated series, $s_e$, to the standard deviation of the incomplete series, $s_i$, and (b) by the ratio of the standard deviation of the estimated series, $s_e$, to the standard deviation of the actual series, $s_a$;
(3) the bias in the lag-one and lag-two correlation coefficients as measured by the difference between the correlation coefficient of the estimated series, $r_e$, and the correlation coefficient of the actual series, $r_a$;
(4) the bias of the estimation model as given by the mean of the residuals, $\bar{y}_r$, i.e., the mean of the differences between the infilled (estimated) and hidden (actual) values (this is also a check to detect a consistent over- or underestimation by the method);
(5) the accuracy as determined by the variance of the residuals (differences between estimated and actual values) of the whole series, $s_r^2$;
(6) the accuracy as determined by the variance of the residuals of only the estimated values, $s_{r,e}^2$; and
(7) the significance of the biases in the mean, standard deviation and correlation coefficients as determined by the appropriate test statistic for each (see Appendix A).

Table 5.3 presents the statistics of the actual series (ACT), of the incomplete series (INC) and of the series estimated by the mean value method (MV), by the reciprocal distances method (RD), by the normal ratio method (NR) and by the modified weighted average method (MWA). The mean ($\bar{y}$), standard deviation ($s$), coefficient of variation ($c_v$), coefficient of skewness ($c_s$), and lag-one and lag-two correlation coefficients ($r_1$, $r_2$) of the above series, each considered as a whole, have been calculated.

Table 5.3. Statistics of the Actual (ACT), Incomplete (INC) and Estimated Series (MV, RD, NR, MWA).

Series   mean    s       c_v      c_s     r_1     r_2
ACT      4.126   3.673   89.040   1.332   0.366   0.134

2% missing values
INC      4.116   3.680   89.397   1.346
MV       4.125   3.663   88.808   1.335   0.371   0.130
RD       4.124   3.674   89.092   1.336   0.367   0.133
NR       4.114   3.666   89.104   1.339   0.368   0.131
MWA      4.113   3.674   89.331   1.342   0.363   0.131

5% missing values
INC      4.113   3.671   89.249   1.341
MV       4.101   3.610   88.040   1.352   0.372   0.139
RD       4.127   3.696   89.550   1.359   0.369   0.133
NR       4.105   3.674   89.501   1.349   0.367   0.131
MWA      4.116   3.720   90.386   1.388   0.364   0.126

10% missing values
INC      4.144   3.705   89.405   1.350
MV       4.134   3.603   87.152   1.346   0.379   0.159
RD       4.150   3.689   88.884   1.301   0.380   0.166
NR       4.120   3.652   88.633   1.321   0.377   0.155
MWA      4.127   3.725   90.244   1.286   0.376   0.162

15% missing values
INC      4.135   3.671   88.767   1.268
MV       4.106   3.513   85.567   1.270   0.399   0.133
RD       4.177   3.628   86.862   1.224   0.372   0.132
NR       4.135   3.591   86.854   1.236   0.379   0.133
MWA      4.134   3.650   88.291   1.248   0.357   0.123

20% missing values
INC      4.082   3.701   90.673   1.404
MV       4.124   3.495   84.749   1.333   0.408   0.160
RD       4.231   3.723   87.993   1.865   0.370   0.156
NR       4.125   3.601   87.307   1.298   0.377   0.152
MWA      4.168   3.741   89.758   1.273   0.354   0.153

Table 5.4. Bias in the Mean.

Mean of estimated series minus mean of incomplete series (last column: mean of incomplete series)
Level   INC      MV       RD       NR       MWA      mean_i
2%      0.        0.009    0.008   -0.002   -0.003   4.116
5%      0.       -0.012    0.014   -0.008    0.003   4.113
10%     0.       -0.010    0.006   -0.024   -0.017   4.144
15%     0.       -0.029    0.042    0.000   -0.001   4.135
20%     0.        0.042    0.149    0.043    0.086   4.082

Mean of estimated series minus mean of actual series (mean of actual series = 4.126)
Level   INC      MV       RD       NR       MWA
2%      -0.010   -0.001   -0.002   -0.012   -0.013
5%      -0.013   -0.025    0.001   -0.021   -0.010
10%      0.018    0.008    0.024   -0.006    0.001
15%      0.009   -0.020    0.051    0.009    0.008
20%     -0.044   -0.002    0.105   -0.001    0.042

Regarding comparison of the means, the following can be concluded from Table 5.4:

(1) the bias in the mean is in all cases not significant at the 5% significance level, as shown by the appropriate t-test;
(2) the bias in the mean of the incomplete series is relatively small but becomes larger the higher the percent of missing values;
(3) at high percents of missing values the NR method gives the least biased mean;
(4) except for the RD method, which consistently overestimates the mean (the bias being larger the higher the percent of missing values), the other methods do not show a consistent over- or underestimation.
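As a worked check of criterion (1), the 20% rows of Table 5.4 can be recomputed from the series means of Table 5.3. The short sketch below uses only those published means; the variable and function names are illustrative.

```python
# Series means from Table 5.3 at the 20% level of missing values (inches).
ACTUAL_MEAN = 4.126
INCOMPLETE_MEAN = 4.082
ESTIMATED_MEANS = {"MV": 4.124, "RD": 4.231, "NR": 4.125, "MWA": 4.168}


def mean_bias(estimated, reference):
    """Bias in the mean: estimated-series mean minus reference-series mean."""
    return round(estimated - reference, 3)


bias_vs_incomplete = {m: mean_bias(v, INCOMPLETE_MEAN) for m, v in ESTIMATED_MEANS.items()}
bias_vs_actual = {m: mean_bias(v, ACTUAL_MEAN) for m, v in ESTIMATED_MEANS.items()}

# Method whose mean is closest to the mean of the actual series.
least_biased = min(bias_vs_actual, key=lambda m: abs(bias_vs_actual[m]))

print(bias_vs_incomplete)
print(bias_vs_actual)
print(least_biased)
```

At this level the NR mean differs from the actual mean by only about 0.001 inch, while the RD mean is about 0.1 inch too high, in line with conclusions (3) and (4).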
Regarding comparison of the variances, the following can be concluded from Table 5.5:

(1) although slight, the bias in the standard deviation is always significant, but this is so because, with as large a number of degrees of freedom as in this study, the ratio of variances would have to equal 1.0 almost exactly to satisfy the F-test (i.e., to be judged unbiased);
(2) the MV method always gives a reduced variance as compared to the variance of the incomplete series and of the actual series, the bias being larger the higher the percent of missing values;
(3) the bias in the standard deviation of the incomplete series is small;
(4) there is no consistent over- or underestimation of the variance by any of the methods (except the MV method);
(5) the MWA method does not give a less biased variance, even at the higher percents of missing values tested, as compared to the RD and NR methods.

Table 5.5. Bias in the Standard Deviation.

Ratio s_e / s_i (last column: standard deviation of incomplete series, s_i)
Level   INC     MV      RD      NR      MWA     s_i
2%      1.      0.995   0.998   0.996   0.998   3.680
5%      1.      0.983   1.007   1.001   1.013   3.671
10%     1.      0.972   0.996   0.986   1.005   3.705
15%     1.      0.957   0.988   0.978   0.994   3.671
20%     1.      0.944   1.006   0.973   1.011   3.701

Ratio s_e / s_a (standard deviation of actual series, s_a = 3.673)
Level   INC     MV      RD      NR      MWA
2%      1.002   0.997   1.000   0.998   1.000
5%      0.999   0.983   1.006   1.000   1.013
10%     1.009   0.981   1.004   0.994   1.014
15%     0.999   0.956   0.988   0.978   0.994
20%     1.008   0.952   1.014   0.980   1.019
Regarding comparison of the correlation coefficients, the following can be concluded from Table 5.6:

(1) the bias in the correlation coefficients is in all cases not significant at the 5% significance level, as shown by the appropriate z-test;
(2) the MV method gives the largest bias in the correlation coefficients, the bias increasing with the percent of missing values, with a possible effect on the determination of the order of the model;
(3) all methods (except the MWA method) consistently overestimate the serial correlation coefficient of the incomplete series but not the serial correlation of the actual series, and therefore this is not considered a problem;
(4) the RD method seems to give a correlogram that closely follows the correlogram of the actual series.

Table 5.6. Bias in the Lag-One and Lag-Two Correlation Coefficients.

Lag one, r_{1,e} - r_{1,a} (r_{1,a} = 0.366)
Level   MV       RD       NR       MWA
2%      0.005    0.001    0.002   -0.003
5%      0.006    0.003    0.001   -0.002
10%     0.013    0.014    0.011    0.010
15%     0.033    0.006    0.013   -0.009
20%     0.042    0.004    0.011   -0.012

Lag two, r_{2,e} - r_{2,a} (r_{2,a} = 0.134)
Level   MV       RD       NR       MWA
2%      -0.004   -0.001   -0.003   -0.003
5%       0.005   -0.001   -0.003   -0.008
10%      0.025    0.032    0.021    0.028
15%     -0.001   -0.002   -0.001   -0.011
20%      0.026    0.022    0.018    0.019

Regarding accuracy of the methods, the following can be concluded from Table 5.7:

(1) no method seems to consistently over- or underestimate the missing values at all percent levels, but at high percent levels the missing values are overestimated by all methods;
(2) the NR method is the most accurate method, especially at high percents of missing values (i.e., it gives the smallest mean and variance of the residuals).

Table 5.7. Accuracy: Mean and Variance of the Residuals. N_0 = number of missing values; N = total number of values = 660.

Mean of the residuals, $\bar{y}_r = \sum (y_e - y_a) / N_0$
Level   MV      RD      NR      MWA     N_0
2%      0.043   0.061   0.570   0.589   13
5%      0.440   0.034   0.380   0.176   33
10%     0.007   0.156   0.113   0.046   62
15%     0.175   0.338   0.074   0.105   98
20%     0.037   0.502   0.038   0.200   130

Variance of the residuals of the estimated values only, $s_{r,e}^2 = \sum (y_e - y_a)^2 / (N_0 - 2)$
Level   MV      RD      NR      MWA
2%      5.037   2.874   3.149   4.585
5%      8.610   3.656   3.411   5.340
10%     7.892   4.239   3.484   5.187
15%     7.620   4.630   3.958   5.816
20%     5.224   4.891   3.681   4.898

Variance of the residuals of the whole series, $s_r^2 = \sum (y_e - y_a)^2 / (N - 2)$
Level   MV      RD      NR      MWA
2%      0.084   0.048   0.053   0.077
5%      0.406   0.172   0.161   0.252
10%     0.720   0.387   0.318   0.473
15%     1.112   0.675   0.577   0.849
20%     1.016   0.951   0.716   0.953

Univariate Model

Model Fitting

Before considering the problem of missing values, the problem of fitting an ARMA(p,q) model to the monthly rainfall series of the south Florida interpolation station will be considered.

The observed rainfall series has been normalized using the square root transformation, and the periodicity has been removed by standardization. The reduced series, approximately normal and stationary, is then modeled by an ARMA(p,q) model. The ACF of the reduced series, as shown in Fig. 5.3, implies a white noise process, since almost all the autocorrelation coefficients (except at lag 3 and lag 12) lie inside the 95 percent confidence limits.

Fig. 5.3. Autocorrelation function of the normalized and standardized monthly rainfall series of Station A, with 95% confidence limits.

Of course, it is unsatisfying to accept the white noise process as the "best" model for our series, and an attempt is made to fit an ARMA(1,1) model to the series. The selection of an ARMA model and not an AR or MA model is based on the following reasons:

(1) The observed rainfall series contains important observational errors and so it is assumed to be the sum of two series: the "true" series and the observational error series (signal plus noise).
Therefore, even if the "true" series obeys an AR process, the addition of the observational error series is likely to produce an ARMA model:

$$\begin{aligned} \mathrm{AR}(p) + \text{white noise} &= \mathrm{ARMA}(p, p) \\ \mathrm{AR}(p) + \mathrm{AR}(q) &= \mathrm{ARMA}(p+q, \max(p,q)) \\ \mathrm{AR}(p) + \mathrm{MA}(q) &= \mathrm{ARMA}(p, p+q) \end{aligned} \qquad (5.5)$$

The same can be said if the "true" series is an MA process and the observational error series an AR process, but not if the latter is an MA process or a white noise process:

$$\begin{aligned} \mathrm{MA}(p) + \mathrm{AR}(q) &= \mathrm{ARMA}(q, p+q) \\ \mathrm{MA}(p) + \mathrm{MA}(q) &= \mathrm{MA}(\max(p,q)) \\ \mathrm{MA}(p) + \text{white noise} &= \mathrm{MA}(p) \end{aligned} \qquad (5.6)$$

(Granger and Morris, 1976; Box and Jenkins, 1976, Appendix A4.4). It is understood that the addition of any observational error series to an ARMA process of the "true" series will again give an ARMA process. For example,

$$\mathrm{ARMA}(p,q) + \text{white noise} = \begin{cases} \mathrm{ARMA}(p,p) & \text{if } p > q \\ \mathrm{ARMA}(p,q) & \text{if } p < q \end{cases} \qquad (5.7)$$

from which it can also be seen that the addition of an observational error may not always change the order of the model of the "true" process.

(2) One other situation that leads exactly, or approximately, to ARMA models is the case of a variable which would obey a simple model such as AR(1) if it were recorded at an interval of K units of time, but which is actually observed at an interval of M units (Granger and Morris, 1976, p. 251).

All these results suggest that a number of real data situations are likely to give rise to ARMA models; therefore, an ARMA(1,1) model will be fitted to the observed monthly rainfall series of the south Florida interpolation station.

The preliminary estimate of $\phi_1$ (equation 3.23) is -0.08163, and the preliminary estimate of $\theta_1$ (equations 3.21 for k = 0, 1, 2) is the solution of the quadratic equation

$$0.1656\,\theta_1^2 + 1.0204\,\theta_1 + 0.1656 = 0 \qquad (5.8)$$

Only the one root, $\theta_1 = -0.1667$, is acceptable, the second having modulus greater than one and thus violating the invertibility condition. These preliminary estimates of $\phi_1$ and $\theta_1$ now become the initial values for the determination of the maximum likelihood estimates (MLE).
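The root selection for equation (5.8) can be checked directly. The sketch below solves the quadratic with the coefficients as printed and keeps the root of modulus less than one; the variable names are illustrative. Because the first and last coefficients are equal, the two roots are reciprocals of each other, so exactly one of them can satisfy the invertibility condition.

```python
import math

# Coefficients of equation (5.8): a*theta**2 + b*theta + c = 0.
a, b, c = 0.1656, 1.0204, 0.1656

disc = b * b - 4.0 * a * c
roots = [(-b + sign * math.sqrt(disc)) / (2.0 * a) for sign in (1.0, -1.0)]

# Keep only the root inside the unit circle (invertibility of the MA part).
acceptable = [r for r in roots if abs(r) < 1.0]
theta_1 = acceptable[0]

print(roots)    # one root of modulus about 0.167, the other about 6.0
print(theta_1)
```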
In general, the choice of the starting values of $\phi$ and $\theta$ does not significantly affect the parameter estimates (Box and Jenkins, 1976, p. 236), but this was not the case for the south Florida rainfall series under study. In particular, different initial estimates of $\phi_1$ and $\theta_1$ have been tested, and the MLE of the parameters are compared in Table 5.8. The MLE have been calculated using the IMSL subroutine FTMXL, which uses a modified steepest descent algorithm to find the values of $\phi$ and $\theta$ that minimize the sum of squares of the residuals (Box and Jenkins, 1976, p. 504).

Table 5.8. Initial Estimates and MLE of the Parameters $\phi_1$ and $\theta_1$ of an ARMA(1,1) Model Fitted to the Rainfall Series of Station A.

         Initial Estimates       Max. Likelihood Estimates
Model    phi_1      theta_1      phi_1      theta_1
A        -0.0816     0.0         -0.0088    -0.0989
B        -0.0816    -0.1667      -0.3140    -0.4056
C         0.1        0.0          0.0537    -0.0278
D        -0.4       -0.5         -0.4064    -0.4939

The drastic changes in parameter values, together with the idea that the process may be a white noise process, suggest a plot of the sum of squares of the residuals for the visual detection of anomalies. The sum of squares grids and contours are shown in Fig. 5.4.

Fig. 5.4. Sum of squares of the residuals, $\sum \hat{a}_t^2$, of an ARMA(1,1) model fitted to the rainfall series of station A, plotted over $\phi_1$ and $\theta_1$ (each from -0.5 to 0.5).

We observe that there is not a well defined point where the sum of squares becomes a minimum, but rather a line (the contour of value 641) on which the sum of squares has an almost constant value equal to the minimum. In such a case, many combinations of parameter values give similar sums of squares of the residuals, and a change in the AR parameter can be nearly compensated by a suitable change in the MA parameter.

From the comparison of the parameters $\phi_1$ and $\theta_1$ (Table 5.8) of the four ARMA(1,1) models one cannot say that they all correspond to the same process.
But this can in fact be illustrated by converting the four models to their "random shock form" (MA($\infty$) processes) or their "invertible form" (AR($\infty$) processes). An ARMA(1,1) process

$$z_t - \phi_1 z_{t-1} = a_t - \theta_1 a_{t-1} \qquad (5.9)$$

can also be written as

$$z_t = \frac{1 - \theta_1 B}{1 - \phi_1 B}\, a_t \qquad (5.10)$$

which can be expanded in the convergent form

$$z_t = \left[ 1 + (\phi_1 - \theta_1) B + \phi_1 (\phi_1 - \theta_1) B^2 + \phi_1^2 (\phi_1 - \theta_1) B^3 + \cdots \right] a_t \qquad (5.11)$$

provided that the stationarity condition ($|\phi_1| < 1$) is satisfied. Then the four models of Table 5.8 become:

$$\begin{aligned} \text{(A)}\quad z_t &= a_t + 0.090\, a_{t-1} - 0.001\, a_{t-2} + \cdots \\ \text{(B)}\quad z_t &= a_t + 0.092\, a_{t-1} - 0.029\, a_{t-2} + \cdots \\ \text{(C)}\quad z_t &= a_t + 0.082\, a_{t-1} + 0.004\, a_{t-2} + \cdots \\ \text{(D)}\quad z_t &= a_t + 0.088\, a_{t-1} - 0.036\, a_{t-2} + \cdots \end{aligned} \qquad (5.12)$$

In the same way the ARMA(1,1) model may be written in the "invertible form"

$$\frac{1 - \phi_1 B}{1 - \theta_1 B}\, z_t = a_t \qquad (5.13)$$

which can be expanded as

$$\left[ 1 - (\phi_1 - \theta_1) B - \theta_1 (\phi_1 - \theta_1) B^2 - \theta_1^2 (\phi_1 - \theta_1) B^3 - \cdots \right] z_t = a_t \qquad (5.14)$$

given that the invertibility condition ($|\theta_1| < 1$) is satisfied. Then the four models become:

$$\begin{aligned} \text{(A)}\quad z_t &= a_t + 0.090\, z_{t-1} - 0.009\, z_{t-2} + \cdots \\ \text{(B)}\quad z_t &= a_t + 0.092\, z_{t-1} - 0.037\, z_{t-2} + \cdots \\ \text{(C)}\quad z_t &= a_t + 0.082\, z_{t-1} - 0.002\, z_{t-2} + \cdots \\ \text{(D)}\quad z_t &= a_t + 0.088\, z_{t-1} - 0.043\, z_{t-2} + \cdots \end{aligned} \qquad (5.15)$$

From the "random shock form" of the four models (equations 5.12) and from their "invertible form" (equations 5.15) the following remarks can be made:

(1) Although from the comparison of the $\phi_1$ and $\theta_1$ coefficients (Table 5.8) of the four ARMA(1,1) models one cannot say that they all correspond to the same process, the comparison of the MA coefficients ($\theta_1, \theta_2, \theta_3, \ldots$) of equations (5.12) or of the AR coefficients ($\phi_1, \phi_2, \ldots$) of equations (5.15) implies that indeed all four models belong to the same process.

(2) Because the nonzero $\phi_2$ (and $\theta_2$) coefficients of the $z_{t-2}$ (and $a_{t-2}$) terms, while small, are of similar magnitude to the coefficients $\phi_1$ (and $\theta_1$), one cannot say that the "truncated" AR(1) or MA(1) model will fully describe the time series; instead, more terms are needed.
On the other hand, we observe that the coefficient $\phi_1$ so obtained (different for each model) is in the range of 0.082 to 0.090 and is greater than the coefficient that would have been obtained by a direct fitting of an AR(1) model to the series (the latter would be $\phi_1 = r_1 = 0.0068$).

(3) It should also be noted that all the above models fitted to the series give residuals that pass the portmanteau goodness of fit test. As can be seen from equations (5.12), the impulse response function (i.e., the weights $\psi_j$ applied to the $a_{t-j}$'s when the model is written in the "random shock form") dies off very quickly in all the models, and there is thus no doubt as to the applicability of the portmanteau test (see Appendix A). The values of Q for each model (calculated from equation A.1 using K = 60) are $Q_A$ = 67.80, $Q_B$ = 67.26, $Q_C$ = 67.73 and $Q_D$ = 67.39, all smaller than the $\chi^2$ value with 58 degrees of freedom at a 5% significance level, $\chi^2_{58,5\%}$ = 79.1. It can also be seen that the values of Q for all models are almost equal, suggesting an equally good fit of the series by all four models.

One other interesting question that could be asked is, given a specific ARMA(p,q) model, whether or not this could have arisen from some simpler model. "Simplifications are not always possible as conditions on the coefficients of the ARMA model need to be specified for a simpler model to be realizable" (Granger and Morris, 1976, p. 252). At this stage, with coefficients that are so unstable, it is meaningless to test the four ARMA models for simplification. However, this test will be made after a unique and stable model has been obtained through the following proposed algorithm.

Proposed Estimation Algorithm

The problem of estimation of missing values will be combined with the problem of stabilizing the coefficients of the ARMA(1,1) model in a recursive algorithm which will have solved both problems uniquely upon convergence.
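A rough sketch of such a recursion is given below, assuming that each gap is refilled with minimum mean square error forward forecasts of an ARMA(1,1) model written as z_t = phi*z_{t-1} + a_t - theta*a_{t-1}. It is not the thesis implementation: `fit_arma11` is a hypothetical user-supplied routine returning (phi, theta), the series is assumed to be the reduced (zero-mean) series with its first value observed, the starting residual is set to zero, and the tolerance and iteration cap are arbitrary.

```python
def infill_by_forecast(z, missing, phi, theta):
    """Replace each originally missing value by a forward forecast.

    z       : current series (list of floats), earlier estimates included
    missing : set of indices that were missing in the incomplete series
    """
    z = list(z)
    a_prev = 0.0  # starting residual a_0, assumed zero
    for t in range(1, len(z)):
        # One-step-ahead forecast of z_t made at origin t-1.
        one_step = phi * z[t - 1] - theta * a_prev
        if t in missing:
            z[t] = one_step
        # The residual is zero at infilled points, so inside a gap the
        # forecast at lead l reduces to phi times the forecast at lead l-1.
        a_prev = z[t] - one_step
    return z


def recursive_infill(z_start, missing, fit_arma11, tol=1e-3, max_iter=50):
    """Alternate model fitting and infilling until model and series stabilize."""
    z = list(z_start)
    phi, theta = fit_arma11(z)
    for _ in range(max_iter):
        z_new = infill_by_forecast(z, missing, phi, theta)
        phi_new, theta_new = fit_arma11(z_new)
        change = max([abs(phi_new - phi), abs(theta_new - theta)]
                     + [abs(z_new[t] - z[t]) for t in missing])
        z, phi, theta = z_new, phi_new, theta_new
        if change < tol:
            break
    return z, phi, theta


# Toy illustration with fixed, made-up parameters (not a fitted model).
start = [1.0, 1.0, 0.0, 0.0, 2.0]   # positions 2 and 3 start from zeros
filled = infill_by_forecast(start, {2, 3}, phi=0.5, theta=0.2)
final, phi_hat, theta_hat = recursive_infill(start, {2, 3}, lambda series: (0.5, 0.2))
```

In the toy call the stand-in "fit" always returns the same parameters, so the loop stops after the second pass; with a real estimator the parameters and the infilled values would move together until both changes fall below the tolerance.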
The incomplete series ($S_0$) is filled in with some initial estimates of the missing values (these initial estimates can be simply the monthly means or even zeros, as will be shown). Denote by $S_1$ this initial series. An ARMA(1,1) model is fitted to the series $S_1$, and its coefficients $\phi_1$ and $\theta_1$ are used to update the first estimates of the missing values. For example, suppose that a gap of size k (k missing values) exists in the series $S_0$:

$$\begin{aligned} \text{Series } S_0&:\quad \ldots,\ z_{t-1},\ z_t,\ \underbrace{\;\cdot\ \cdots\ \cdot\;}_{k\ \text{missing}},\ z_{t+k+1},\ z_{t+k+2},\ \ldots \\ \text{Series } S_1&:\quad \ldots,\ z_{t-1},\ z_t,\ z_{t+1},\ \ldots,\ z_{t+k},\ z_{t+k+1},\ z_{t+k+2},\ \ldots \end{aligned} \qquad (5.16)$$

where $z_{t+1}, \ldots, z_{t+k}$ are the initial estimates of the missing values. These values $z_{t+1}, \ldots, z_{t+k}$ are then replaced by the values $\hat{z}_t(1), \ldots, \hat{z}_t(k)$ forecasted by the model at origin t and for lead times $l = 1, \ldots, k$. These forecasts are the minimum mean square error forward forecasts as developed by Box and Jenkins (1976). For an ARMA(1,1) model with coefficients $\phi_1$ and $\theta_1$ the minimum mean square error forecasts $\hat{z}_t(l)$ of $z_{t+l}$, where $l$ is the lead time, are:

$$\begin{aligned} \hat{z}_t(l) &= \phi_1 z_t - \theta_1 a_t, & l &= 1 \\ \hat{z}_t(l) &= \phi_1 \hat{z}_t(l-1), & l &= 2, \ldots, k \end{aligned} \qquad (5.17)$$

from which it can be seen that only the one step ahead forecast depends directly on $a_t$, and the forecasts at longer lead times are influenced indirectly (Box and Jenkins, 1976, Ch. 5). The forecasting procedure is repeated for the estimation of all the gaps, and the newly estimated values are used in equations (5.17). These forecasts now become the new estimates of the missing values, and they replace the old estimates, giving the new series $S_2$. An ARMA(1,1) model is then fitted to the new series, and the new coefficients $\phi_1$ and $\theta_1$ are found (different from the previous ones). Then the estimated values (forecasts from the previous model) are replaced by the forecasts from the new model, giving the new series $S_3$, etc. The procedure is repeated until the model and the series stabilize, in the sense that the parameters $\phi_1$ and $\theta_1$ of the model as well as the estimates of the missing values do not change between successive iterations within a specified tolerance.

Schematically the algorithm is presented in Fig. 5.5, where $S_0$ denotes the incomplete series, $M_0$ the method used for the initial estimation, $S_i$ the estimated series at the ith iteration, and $M_i$ the model (i.e., the set of parameters $\phi_1$ and $\theta_1$) fitted to the series $S_i$. The notation $M_i \equiv M_{i+1}$ and $S_i \equiv S_{i+1}$ is introduced to denote the stabilization of the model and of the series, respectively, after i iterations. The above algorithm will be referred to as RAEMVU (a recursive algorithm for the estimation of missing values, univariate model).

Fig. 5.5. Recursive algorithm for the estimation of missing values, univariate model (RAEMVU). $S_i$ denotes the series, and $M_i$ the model, $(\phi_1, \theta_1)_i$, at the ith iteration.

Application of the Algorithm on the Monthly Rainfall Series

The proposed recursive algorithm (RAEMVU) has been applied for the estimation of missing monthly rainfall values in the series of the south Florida interpolation station (station 6038). Different levels of percentage of missing values have been tested, and the results for the 10% and 20% levels are presented herein. Tables 5.9 and 5.10 show the results for the 10% and 20% levels of missing values, respectively. The starting series $S_0$ is the incomplete series (with 10% or 20% of the values missing). Four different methods (MV, RD, NR, and zeros) have been applied to the incomplete series, $S_0$, providing different starting series, $S_1$, for the algorithm. Thus, its dependence on the initial conditions has also been tested.

Results of the Method

From Tables 5.9 and 5.10 the following can be concluded:

(1) The algorithm converges very rapidly and independently of the initial estimates, thus suggesting the convenient replacement of the missing values by zeros to start the algorithm.
(2) The greater the percent of missing values the slower the algorithm converges (6 iterations were needed for the 10% and 8 for the 20% to obtain accuracy to the third decimal place) as was expected since a larger part of the series is changing its values at each iteration and thus more iterations are needed to achieve equilibrium. PAGE 100 III Table 5.9. Results of the RAEMVU Applied at the 10% Level of Missing Values. Upper Value is PAGE 101 112 Table 5.10. Results of the RAEMVU Applied at the 20% Level of Missing Values. Upper Value is Lower Value is 8 1 MO MV RD NR Zeroes Ml 0.0954 0.5023 0.5021 0.5756 0.0069 0.4173 0.4159 0.2587 M2 0.0738 0.1167 0.1189 0.2926 0.0344 0.0311 0.0289 0.1187 M3 0.0789 0.0369 0.0377 0.0762 0.0276 0.0693 0.0688 0.0458 M4 0.0774 0.0910 0.0908 0.0526 0.0296 0.0125 0.0128 0.0503 M5 0.0778 0.0745 0.0746 0.0863 0.0291 0.0334 0.0333 0.0184 M6 0.0777 0.0786 0.0786 0.0756 0.0292 0.0281 0.0281 0.0319 M7 0.0777 0.0775 0.0775 0.0783 0.0292 0.0295 0.0295 0.0285 M8 0.0777 0.0778 0.0778 0.0776 0.0292 0.0291 0.0291 0.0293 PAGE 102 113 (3) For a specific percent of missing values the algorithm converges to the same point (e.g., same model and same series) independently of the initial estimates of the missing values. (4) For a different percent of missing values the same series converges to a "different" point (e.g., "different" model and "different" series). This was expected since the constant information in the system (existing values) is different in each case, and thus a different model describes it better. Diagnostic checking on the residuals from the two final models is performed using the portemanteau goodness of fit test. Denote the two models (at 10% and 20% levels) by 1 0 d 20 1 th U d t h MU an MU respectlve y, e eno lng t at a univariate model has been fitted to the series. Then = 0.5095 = 0.0777 e = 0.4333 e = 0.0292. 
(5.18)

The values of Q for each model are $Q(M_U^{10}) = 26.54$ and $Q(M_U^{20}) = 30.22$ (calculated by equation A.15 using K = 30), which are both smaller than the $\chi^2$ value with 28 degrees of freedom at the 5% significance level, $\chi^2_{28,5\%} = 41.3$. Notice also that $Q(M_U^{10}) < Q(M_U^{20})$, indicating, as expected, that the final model fitted to the series when 10% of the values were missing has a better fit than the model fitted to the series when 20% of the values were missing.

Also, now that the final ARMA(1,1) model is stable, we can ask the question "can it be simplified to an AR(1) plus white noise?". For an ARMA(1,1) process the simplification condition is given by equation (5.19). [...] of the actual series. The monthly statistics are also shown in Table C.13 (Appendix C).

Table 5.11. Statistics of the Actual Series (ACT) and the Two Estimated Series (UN10, UN20).

        $\bar{y}$   $s$      $c_v$               $r_1$
ACT     4.126       3.673    89.04     1.332     0.366     0.134
UN10    4.105       3.609    87.920    1.354     0.384     0.157
UN20    4.043       3.492    86.381    1.373     0.410     0.160

Table 5.12 shows the bias in the mean, standard deviation and lag-one correlation coefficient, so that the statistical closeness of the estimated series to the actual one can be evaluated. The bias in the mean and in the correlation coefficient is not significant at the 5% significance level; however, the bias in the standard deviation does not pass the stringent F test (requiring exact equality of standard deviations) and thus is significant.

Table 5.12. Bias in the Mean, Standard Deviation and Serial Correlation Coefficient, Univariate Model.

        $\bar{y}_e - \bar{y}_a$    $s_e / s_a$    $r_{1,e} - r_{1,a}$
UN10    -0.021                     0.983          0.018
UN20    -0.083                     0.951          0.044

Remarks

1. The forecasting procedure utilized for the estimation is the minimum mean square forward forecasting procedure of Box and Jenkins (1976). Damsleth (1980) introduced the method of optimal between-forecasts, combining the forward forecasts and back-forecasts into between-forecasts with a minimum mean square error.
He showed that the gain in forecast error by between-forecasting as compared to forward forecasting (or back-forecasting) an ARMA(1,1) model is proportional to $\phi_1^{k+1}$, where k is the size of the gap. Thus the gain rapidly becomes small, unless $|\phi_1|$ is very close to one and the size of the gap is very small. He also showed that the gain from between-forecasting can be substantial when $\theta_1$ is negative. Finally he concluded that "the reduction in forecast error variance by using this between-forecasting method is not very great for stationary series, but may be substantial when the series is nonstationary" (Damsleth, 1980, p. 39). In our case, the use of the more complicated between-forecasting procedure does not seem to be justified. It has been shown that the simple Box-Jenkins forecasts work satisfactorily, in the sense that rapid convergence to a "statistically acceptable" series occurs.

2. It is interesting to note that when the final estimates of the model (parameters of equations 5.18) are provided as initial estimates, the maximum likelihood estimates (calculated by a steepest descent algorithm) are equal to the initial estimates provided. This emphasizes the "uniqueness" of the stable model achieved by the proposed recursive algorithm.

3. It will also be interesting to check the threshold percentage of missing values at which the algorithm starts to diverge. This is expected to happen at some percentage of missing values (probably greater than 50%), when too much information in the system is changing at each iteration. At such high percentages of missing values a more elaborate testing of the final model may also be needed.

Bivariate Model

Model Fitting

The lag-one multivariate autoregressive model of equation (4.3), suggested by Matalas (1967), preserves the lag-zero and lag-one auto- and cross-correlations.
When applied to two stations the model is reduced to the bivariate Markov model:

$\begin{bmatrix} z_{1,t} \\ z_{2,t} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} z_{1,t-1} \\ z_{2,t-1} \end{bmatrix} + \begin{bmatrix} b_{11} & 0 \\ b_{21} & b_{22} \end{bmatrix} \begin{bmatrix} n_{1,t} \\ n_{2,t} \end{bmatrix}$    (5.22)

where the matrix B is a lower triangular matrix as suggested by Young (1968). The above model has been extensively used for the simultaneous generation of hydrologic series at two sites. An attempt will be made herein to show how the above model can be used for the estimation of the missing values in one or both of the time series. A recursive algorithm analogous to the one proposed for the univariate case will be presented.

The special case that will be considered is the estimation of the missing values in the series of station 1, given the complete, concurrent, equal length series of station 2. As has been extensively discussed in Chapter 4, incomplete data sets may result in inconsistent covariance matrices, which in turn produce generated rainfall values that contain complex numbers. Therefore the incomplete series of station 1 is first completed by the use of a simple estimation method (e.g., MV, RD, NR or even replacement of missing values by zeroes), giving the complete series $S_1$. Denote by $S$ the complete and known series of station 2. Then a bivariate AR(1) model is fitted to the series $S_1$ and $S$. Actually the model, as in the univariate case, is fitted to the residual series, i.e., the normalized and standardized series.

The following procedure is followed for the estimation of the parameters (matrices A and B) of the model. The lag-zero and lag-one correlation matrices, $M_0$ and $M_1$, of the residual series are computed:

$M_0 = \begin{bmatrix} 1 & r_{12}(0) \\ r_{12}(0) & 1 \end{bmatrix}, \qquad M_1 = \begin{bmatrix} r_{11}(1) & r_{21}(1) \\ r_{12}(1) & r_{22}(1) \end{bmatrix}$    (5.23)

Then matrix A is given directly by the multiplication of the matrices $M_1$ and $M_0^{-1}$ (equation B.8 of Appendix B), and matrix C is computed from equation (B.13). Matrix B is given by the solution of the equation $BB^T = C$, which in the case of B being a lower triangular matrix reduces to the direct calculation of the elements of B from equations (B.19).
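The parameter calculation just described can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the study: the function name `fit_bivariate_ar1` is hypothetical, and a Cholesky factorization is assumed as the way to obtain a lower triangular B satisfying $BB^T = C$. The input matrices are the values printed for iteration 4 in Table 5.13, taken with the signs as they appear in this text.

```python
import numpy as np

def fit_bivariate_ar1(m0, m1):
    """Return (A, B) of Z_t = A Z_{t-1} + B N_t from the lag-zero (m0)
    and lag-one (m1) correlation matrices of the residual series."""
    m0 = np.asarray(m0, dtype=float)
    m1 = np.asarray(m1, dtype=float)
    a = m1 @ np.linalg.inv(m0)     # A = M1 M0^{-1}, equation (B.8)
    c = m0 - a @ m1.T              # C = M0 - M1 M0^{-1} M1^T, equation (B.13)
    c = (c + c.T) / 2.0            # remove round-off asymmetry
    # Assumption: lower triangular B taken as the Cholesky factor of C.
    # np.linalg.cholesky raises LinAlgError if C is not positive definite.
    b = np.linalg.cholesky(c)
    return a, b

# Correlation matrices as printed for iteration 4 in Table 5.13 (10% level).
m0 = [[1.0, 0.025], [0.025, 1.0]]
m1 = [[0.049, 0.201], [0.068, 0.315]]
a, b = fit_bivariate_ar1(m0, m1)
print(np.round(a, 3))
print(np.round(b, 3))
```

With these inputs the computed elements agree in magnitude, to about three decimal places, with the A and B matrices printed for the same iteration.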
Proposed Estimation Algorithm

An algorithm analogous to the one for the univariate case is also proposed for the bivariate case. After the incomplete series, $S_0$, has been completed with a simple method, a bivariate AR(1) model is fitted to the complete series $S_1$ and $S$ as described earlier. The parameter matrices A and B of the fitted model, $M_1 = (A, B)_1$, are then used to construct new estimates for the "missing" values in the series $S_1$. From equation (5.22) we can write that:

$z_{1,t} = a_{11} z_{1,t-1} + a_{12} z_{2,t-1} + b_{11} n_{1,t}$    (5.24)

$z_{2,t} = a_{21} z_{1,t-1} + a_{22} z_{2,t-1} + b_{21} n_{1,t} + b_{22} n_{2,t}$    (5.25)

Since the second series is complete and known, equation (5.25) is ignored and only equation (5.24) is considered. Following the Box-Jenkins forecasting procedure, the minimum mean square error forecasts $\hat{z}_{1,t}(\ell)$ of $z_{1,t+\ell}$, where $\ell$ is the lead time, are

$\hat{z}_{1,t}(\ell) = a_{11} z_{1,t} + a_{12} z_{2,t}$, for $\ell = 1$
$\hat{z}_{1,t}(\ell) = a_{11} \hat{z}_{1,t}(\ell-1) + a_{12} z_{2,t}(\ell-1)$, for $\ell = 2, 3, \ldots, k$    (5.26)

where k is the number of values missing in each gap. The forecasting procedure is repeated for the estimation of all the gaps, always using the newly estimated values in equations (5.26). These estimates then become the new estimates of the missing values, and they replace the old estimates in the series, giving the new series $S_2$. A bivariate AR(1) model is then fitted to the series $S_2$ and $S$. Denote this new model by $M_2 = (A, B)_2$, which is used in the same way as before to update the estimates. The procedure is repeated until convergence occurs, in the sense that neither the model $M_i$ nor the series $S_i$ after the ith iteration changes between iterations by more than a specified tolerance ($M_i \approx M_{i+1}$ and $S_i \approx S_{i+1}$).

Schematically, the recursive algorithm for the estimation of missing values, bivariate model, 1 station to be estimated (RAEMVB1), is shown in Fig. 5.6. The algorithm can be generalized to the case where a multivariate model of, say, K stations is used to estimate the missing values of L incomplete stations, where L < K. Such a generalized algorithm can be economically written as RAEMVMK.L.
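The iteration described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the program used in the study: the function names, the tolerance, and the synthetic data are hypothetical; both series are assumed to be already normalized and standardized (zero mean); the gaps are started from zeroes; and the known values of station 2 are used wherever equations (5.26) call for them.

```python
import numpy as np

def fit_a(z1, z2):
    """Matrix A = M1 M0^{-1} of the lag-one bivariate model for two zero-mean series."""
    z = np.column_stack([z1, z2])
    n = len(z)
    m0 = z.T @ z / n              # lag-zero cross-product matrix
    m1 = z[1:].T @ z[:-1] / n     # lag-one: z_t against z_{t-1}
    return m1 @ np.linalg.inv(m0)

def raemvb1_sketch(z1, z2, missing, tol=1e-3, max_iter=50):
    """Recursively re-estimate the missing values of z1 given the complete series z2."""
    missing = np.asarray(missing, dtype=bool)
    z2 = np.asarray(z2, dtype=float)
    s = np.where(missing, 0.0, np.asarray(z1, dtype=float))  # gaps start at zero
    a_prev = None
    for iteration in range(1, max_iter + 1):
        a = fit_a(s, z2)                      # model fitted to the current series
        new = s.copy()
        for t in np.flatnonzero(missing):     # increasing t: gaps are filled recursively
            if t > 0:                         # a gap at t = 0 has no predecessor
                new[t] = a[0, 0] * new[t - 1] + a[0, 1] * z2[t - 1]
        series_change = np.max(np.abs(new - s))
        model_change = np.inf if a_prev is None else np.max(np.abs(a - a_prev))
        s, a_prev = new, a
        if series_change < tol and model_change < tol:
            break
    return s, a, iteration

# Synthetic example (hypothetical data, not the rainfall series of the study).
rng = np.random.default_rng(0)
n = 240
z2 = rng.standard_normal(n)
z1 = 0.5 * np.roll(z2, 1) + rng.standard_normal(n)
missing = np.zeros(n, dtype=bool)
missing[rng.choice(np.arange(1, n), size=24, replace=False)] = True
z1_gappy = np.where(missing, np.nan, z1)

filled, a_final, n_iter = raemvb1_sketch(z1_gappy, z2, missing)
print(n_iter, np.round(a_final, 3))
```

Looping over the missing positions in increasing time order reproduces the lead-one and lead-$\ell$ cases of equations (5.26) in one pass, because each estimate uses either an observed value or the estimate just computed for the preceding time step.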
The algorithm for the case of a bivariate model with both records incomplete, i.e., two series to be estimated (RAEMVB2 or, in the general form, RAEMVM2.2), is illustrated in Fig. 5.7. The notation is the same as before, but two subscripts are now used for the series S, the first denoting the station (1 or 2) and the second denoting the iteration i (i = 1, ...). In this case both equations (5.24) and (5.25) would be needed for the estimation of the missing values existing in both series.

[Fig. 5.6. Recursive algorithm for the estimation of missing values, bivariate model, 1 station to be estimated. $S_i$ denotes the series, and $M_i$ the model, $(A, B)_i$, at the ith iteration.]

[Fig. 5.7. Recursive algorithm for the estimation of missing values, bivariate model, 2 stations to be estimated (RAEMVB2). $S_{1,i}$ ($S_{2,i}$) denotes the series of station 1 (2), and $M_i$ the model, $(A, B)_i$, at the ith iteration.]

Application of the Algorithm on the Monthly Rainfall Series

The case study presented herein involves the estimation of the missing values of the rainfall series of station 6038 using a bivariate AR(1) model with the complete rainfall series of station 6038. Thus the RAEMVB1 illustrated in Fig. 5.6 has been used. Again, different percentages of missing values have been tested, and the results for the 10% and 20% missing values are presented in Tables 5.13 and 5.14 respectively. The dependence of the algorithm on the starting values has been tested the same way as for the univariate case, i.e., by providing different initial series estimated by four different methods $M_0$ (MV, RD, NR and zeroes). Tables 5.13 and 5.14 show the cross-correlation matrices $M_0$ and $M_1$ at each iteration i and the model $M_i = (A, B)_i$.
It is interesting to follow the changes of the cross-correlation coefficients at each iteration. Also notice that the autocorrelation coefficient (see equation 5.23) of the first series changes at each iteration (since new estimates of the missing values replace the old ones), but the autocorrelation coefficient of the second series remains unchanged (since the second series is complete and known).

From Tables 5.13 and 5.14 the following conclusions, similar to those of the univariate case, can be drawn:

(1) The algorithm converges rapidly, independently of the starting point (initial series). Thus, initial estimation of the missing values is not needed, and they may as well be replaced by zeroes.

(2) The convergence seems to be less sensitive to the percentage of values missing, since at both the 10% and 20% levels convergence has been achieved in three to four iterations.

(3) For a specific percentage of missing values the algorithm converges to the same point (i.e., same model, same series, and same correlation matrices) independently of the initial estimates of the missing values.

(4) For a different percentage of missing values the same series converges to a "different" point, but this is reasonable and expected since the constant information (existing values in the series) is different in each case, and a different model thus describes it better.

Table 5.13. Results of the RAEMVB1 Applied at the 10% Level of Missing Values.

M0 = MV
 i   M0              M1              A               B
 1   1.     0.330    0.004  0.137    0.046  0.152    0.990  0.
     0.330  1.       0.042  0.315    0.070  0.338    0.286  0.902
 2   1.     0.005    0.038  0.194    0.039  0.194    0.980  0.
     0.005  1.       0.065  0.315    0.067  0.316    0.071  0.944
 3   1.     0.025    0.049  0.202    0.044  0.201    0.978  0.
     0.025  1.       0.069  0.315    0.061  0.314    0.043  0.946
 4   1.     0.025    0.049  0.201    0.044  0.200    0.979  0.
     0.025  1.       0.068  0.315    0.061  0.314    0.042  0.946

M0 = RD
 i   M0              M1              A               B
 1   1.     0.554    0.124  0.249    0.021  0.261    0.968  0.
     0.554  1.       0.201  0.315    0.038  0.294    0.492  0.811
 2   1.     0.026    0.042  0.196    0.037  0.195    0.980  0.
     0.026  1.       0.070  0.315    0.062  0.314    0.039  0.946
 3   1.     0.025    0.048  0.201    0.043  0.200    0.979  0.
     0.025  1.       0.069  0.315    0.061  0.314    0.042  0.946
 4   1.     0.025    0.049  0.201    0.044  0.200    0.979  0.
     0.025  1.       0.068  0.315    0.061  0.314    0.042  0.946

M0 = NR
 i   M0              M1              A               B
 1   1.     0.543    0.126  0.261    0.022  0.273    0.965  0.
     0.543  1.       0.187  0.315    0.022  0.303    0.478  0.819
 2   1.     0.002    0.046  0.199    0.046  0.199    0.979  0.
     0.002  1.       0.069  0.315    0.070  0.316    0.070  0.944
 3   1.     0.026    0.050  0.203    0.045  0.201    0.978  0.
     0.026  1.       0.069  0.315    0.061  0.314    0.042  0.946
 4   1.     0.025    0.049  0.202    0.044  0.200    0.978  0.
     0.025  1.       0.068  0.315    0.061  0.314    0.042  0.946

M0 = zeroes
 i   M0              M1              A               B
 1   1.     0.258    0.463  0.172    0.448  0.057    0.885  0.
     0.258  1.       0.048  0.315    0.036  0.385    0.247  0.915
 2   1.     0.042    0.061  0.225    0.059  0.222    0.973  0.
     0.042  1.       0.081  0.315    0.068  0.313    0.033  0.946
 3   1.     0.029    0.048  0.203    0.042  0.201    0.978  0.
     0.029  1.       0.070  0.315    0.061  0.314    0.038  0.946
 4   1.     0.025    0.049  0.201    0.043  0.200    0.979  0.
     0.025  1.       0.068  0.315    0.061  0.314    0.042  0.946

Table 5.14. Results of the RAEMVB1 Applied at the 20% Level of Missing Values.

M0 = MV
 i   M0              M1              A               B
 1   1.     0.523    0.342  0.251    0.290  0.100    0.936  0.
     0.523  1.       0.257  0.315    0.126  0.249    0.446  0.831
 2   1.     0.025    0.369  0.307    0.377  0.316    0.874  0.
     0.025  1.       0.256  0.315    0.264  0.322    0.253  0.876
 3   1.     0.023    0.389  0.333    0.393  0.337    0.857  0.
     0.012  1.       0.253  0.315    0.257  0.319    0.255  0.877
 4   1.     0.012    0.389  0.332    0.393  0.337    0.858  0.
     0.012  1.       0.253  0.315    0.257  0.319    0.254  0.877

M0 = RD
 i   M0              M1              A               B
 1   1.     0.588    0.320  0.290    0.228  0.156    0.939  0.
     0.588  1.       0.262  0.315    0.117  0.246    0.510  0.795
 2   1.     0.012    0.368  0.315    0.375  0.383    0.872  0.
     0.023  1.       0.257  0.315    0.264  0.321    0.254  0.875
 3   1.     0.012    0.388  0.334    0.392  0.338    0.857  0.
     0.012  1.       0.253  0.315    0.257  0.319    0.255  0.877
 4   1.     0.012    0.388  0.333    0.393  0.337    0.858  0.
     0.012  1.       0.253  0.315    0.257  0.318    0.254  0.877

M0 = NR
 i   M0              M1              A               B
 1   1.     0.611    0.324  0.273    0.252  0.119    0.941  0.
     0.611  1.       0.279  0.315    0.137  0.232    0.534  0.777
 2   1.     0.022    0.372  0.311    0.379  0.320    0.872  0.
     0.022  1.       0.258  0.315    0.265  0.321    0.253  0.875
 3   1.     0.012    0.389  0.333    0.393  0.338    0.857  0.
     0.012  1.       0.253  0.315    0.257  0.319    0.255  0.877
 4   1.     0.012    0.389  0.332    0.393  0.337    0.857  0.
     0.012  1.       0.253  0.315    0.257  0.318    0.254  0.877

M0 = zeroes
 i   M0              M1              A               B
 1   1.     0.321    0.601  0.201    0.599  0.009    0.799  0.
     0.321  1.       0.195  0.315    0.104  0.282    0.253  0.909
 2   1.     0.006    0.423  0.340    0.421  0.337    0.841  0.
     0.006  1.       0.228  0.315    0.226  0.314    0.233  0.892
 3   1.     0.012    0.392  0.332    0.397  0.337    0.856  0.
     0.013  1.       0.249  0.315    0.253  0.319    0.255  0.878
 4   1.     0.013    0.390  0.333    0.394  0.338    0.857  0.
     0.013  1.       0.253  0.315    0.257  0.319    0.255  0.877

The statistical properties of the two final series (from the 10% and 20% missing values) are shown in Table 5.15 together with those of the actual series. The monthly statistics are also shown in Table C.14 (Appendix C). Table 5.16 shows the statistical closeness of the two estimated series to the actual one. Again, the bias in the mean and in the correlation coefficient is not significant at the 5% significance level, but the bias in the standard deviation is.

Table 5.15. Statistics of the Actual Series (ACT) and the Two Estimated Series (B10 and B20).

        $\bar{y}$   $s$      $c_v$               $r_1$
ACT     4.126       3.673    89.04     1.332     0.366     0.134
B10     4.096       3.610    88.132    1.358     0.382     0.162
B20     4.077       3.523    86.421    1.341     0.416     0.165

Table 5.16. Bias in the Mean, Standard Deviation and Serial Correlation Coefficient, Bivariate Model.

        $\bar{y}_e - \bar{y}_a$    $s_e / s_a$    $r_{1,e} - r_{1,a}$
B10     -0.030                     0.983          0.016
B20     -0.049                     0.959          0.050

CHAPTER 6
CONCLUSIONS AND RECOMMENDATIONS

Summary and Conclusions

The objective of this study was to compare and evaluate different methods for the estimation of missing observations in monthly rainfall series.
The estimation methods studied reflect three basic ideas:

(1) the use of regional information in four simple techniques: the mean value method (MV), the reciprocal distance method (RD), the normal ratio method (NR), and the modified weighted average method (MWA);

(2) the use of a univariate stochastic (ARMA) model that describes the time correlation of the series;

(3) the use of a multivariate stochastic (ARMA) model that describes the time and space correlation of the series.

An algorithm for the recursive estimation of the missing values in a time series using the fitted univariate or multivariate ARMA model has been proposed and demonstrated. Apparently, the idea of the recursive estimation of missing values is known (Orchard and Woodbury, 1972; Beale and Little, 1974), as well as the idea of using the fitted model to directly derive the estimates (Brubacher and Wilson, 1976; Damsleth, 1979). However, it appears that a method which combines these two ideas in a recursive estimation of the missing values with parallel updating of the model has not been used before. The proposed algorithm is general and can be used for the estimation of the missing values in any series that can be described by an ARMA model.

On the basis of the data from the four south Florida rainfall stations used in the analysis, the following conclusions can be drawn:

(1) All the simplified estimation techniques give unbiased (overall and monthly) means and correlation coefficients at the 5% significance level, even for as high as 20% missing values.

(2) At high percentages of missing values (greater than 10%) the MV method gives the most biased (although not significantly so) correlation coefficients.

(3) All methods give a slightly biased overall variance but unbiased monthly variances at the 5% significance level, and the MV method gives the most biased variances for all percentages of missing values.
(4) The NR method gives the most accurate and the MV method the least accurate estimates at almost all percentages of missing values.

(5) The proposed recursive algorithm works satisfactorily in both the univariate and the bivariate case. It converges rapidly and independently of the initial estimates, and gives unbiased means and correlation coefficients at the 5% significance level.

(6) The use of a bivariate model as compared to a univariate one did not improve the estimates, except for a slight improvement at 20% missing values. However, the use of a multivariate model based on three or four nearby stations is expected to give much better estimates. The use of three adjacent stations is the main reason for the better performance of the NR method over the more sophisticated univariate and bivariate ARMA models, which use only zero and one additional stations, respectively.

If the purpose of estimation is to calculate the historical statistics of the series (e.g., mean, standard deviation, and autocorrelations), the selection of the method matters little, and the simplest one may be chosen. However, if it is desired to fit an ARMA model to the incomplete series, to be used, say, to construct forecasts, the estimation of the missing values and of the parameters of the model by the proposed recursive algorithm is recommended. In this case the equilibrium state (i.e., final series and parameters of the model) achieved upon convergence is unique, depending only on the existing information in the system (available data) and not on any external information added to the system (by the replacement of the missing values with some derived estimates).

The only assumption made is that the order of the ARMA model to be fitted to the series is known. In practical situations this is seldom a problem, since the order can be determined from the complete part of the series or from a series with similar characteristics.
For example, if an ARMA(1,1) model is known to fit the monthly rainfall series well at a couple of nearby stations, there is little doubt that it will fit the incomplete monthly rainfall series equally well at the station of interest. Upon convergence, the recursive algorithm then gives the "best" estimates of the parameters of the model.

Further Research

Further research should include:

(1) application of the simple estimation techniques to short records, where the biases may be significant for the methods with the poorer performance;

(2) test of the sensitivity of the recursive algorithm to the selection of the model (order of the model) when more than one model fits the data equally well;

(3) derivation of the threshold percentage of missing values above which the algorithm diverges;

(4) application to the estimation of missing values in other hydrological series, e.g., runoff;

(5) trials of different forecasting procedures and determination of the improvements obtained by the "between-forecasting procedure" in cases of a large number of single-value gaps, e.g., use of the average of a backwards and a forwards ARMA model forecast;

(6) application of the concept of "missing values" to the estimation of erroneous values or outliers in a series, to avoid errors when using the data, say, to construct forecasts; and

(7) estimation of values in a series that are affected by unusual circumstances, thereby permitting a measure of the magnitude of the unusual circumstance and the estimation of the effect of similar circumstances in the future (e.g., effect of a drought on water supply).

APPENDIX A
DEFINITIONS

1. Strict stationarity

A stochastic process is said to be strictly stationary if its statistics (e.g., mean, variance, serial correlation) are not affected by a shift in the time origin, that is, if the joint probability distribution associated with n observations $(z_1, z_2, \ldots, z_n)_t$ made at time origin t is the same as that associated with n observations $(z_1, z_2, \ldots, z_n)_{t+k}$ made at time origin t+k. In other words, z(t) is a strictly stationary process when the two processes z(t) and z(t+k) have the same statistics for any k.

2. Weak stationarity

Weak stationarity of order f holds when the moments of the process up to order f depend only on time differences. Usually by weak stationarity we refer to second order stationarity, i.e., a fixed mean and an autocovariance matrix that depends only on time differences (i.e., lags).

3. Gaussian process

If the probability distribution associated with any set of times is a multivariate normal distribution, the process is called a normal or Gaussian process. Since the multivariate normal distribution is fully described by its first and second order moments, it follows that weak stationarity and an assumption of normality imply strict stationarity.

4. Nonstationarity

A stochastic process is said to be nonstationary if its statistical characteristics change with time. A homogeneous nonstationary process of order d is a process for which the dth difference $\nabla^d z_t$ is a stationary process. For example, a first order homogeneous nonstationary process is one that exhibits homogeneity apart from a constant (e.g., a linear trend), and a second order homogeneous nonstationary process is one that exhibits homogeneity apart from a constant and a slope (e.g., a parabolic trend).

5.
Circular stationarity

A stochastic process is said to be circularly stationary with period T if the multivariate probability distribution of T observations $(z_1, z_2, \ldots, z_T)_t$ made at time origin t is the same as that associated with T observations $(z_1, z_2, \ldots, z_T)_{t+Tk}$ made at time origin t + Tk, for k = 1, 2, .... For example, a monthly hydrologic series has a period of 12 months, i.e., T = 12, and circular stationarity suggests that the probability distribution of a value of a particular month is the same for all the years.

6. Stationarity condition

A linear process can always be written in the random shock form:

$\tilde{z}_t = a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots = \psi(B)\, a_t$    (A.1)

where B is the backward shift operator defined by $B z_t = z_{t-1}$, hence $B^m z_t = z_{t-m}$, and

$\psi(B) = 1 + \psi_1 B + \psi_2 B^2 + \cdots$    (A.2)

is the so-called transfer function of the linear system and is the generating function of the $\psi$ weights. For the process to be stationary the $\psi$ weights must satisfy the condition that $\psi(B)$ converges on or within the unit circle, i.e., for all $|B| \le 1$.

7. Invertibility condition

The above model may also be written in the inverted form

$\tilde{z}_t = \pi_1 \tilde{z}_{t-1} + \pi_2 \tilde{z}_{t-2} + \cdots + a_t$    (A.3)

or

$\pi(B)\, \tilde{z}_t = a_t$    (A.4)

where

$\pi(B) = 1 - \sum_{j=1}^{\infty} \pi_j B^j$    (A.5)

is the generating function of the $\pi$ weights. For the process to be invertible the $\pi$ weights must satisfy the condition that $\pi(B)$ converges for all $|B| \le 1$, that is, on or within the unit circle. The invertibility condition is independent of the stationarity condition and is applicable also to the nonstationary linear models. The requirement of invertibility is needed in order to associate the present values of the process with the past values in a reasonable manner, as will be shown below.

8. Duality between AR and MA processes

In a stationary AR(p) process, $a_t$ can be represented as a finite weighted sum of previous $\tilde{z}$'s,

$a_t = \phi(B)\, \tilde{z}_t$    (A.6)

or $\tilde{z}_t$ as an infinite weighted sum of previous a's,

$\tilde{z}_t = \phi^{-1}(B)\, a_t$    (A.7)

Also, in an invertible MA(q) process, $\tilde{z}_t$ can be represented as a finite weighted sum of previous a's,

$\tilde{z}_t = \theta(B)\, a_t$    (A.8)

or $a_t$ as an infinite weighted sum of previous $\tilde{z}$'s,

$a_t = \theta^{-1}(B)\, \tilde{z}_t$    (A.9)

In other words, a finite AR process is equivalent to an infinite MA process, and a finite MA process to an infinite AR process. This principle of duality has further aspects, e.g., there is an inverse relationship between the autocorrelation and partial autocorrelation functions of AR and MA processes.

9. Physical interpretation of stationarity and invertibility

Consider an AR(1) process $(1 - \phi_1 B) z_t = a_t$. For this process to be stationary, the root of the polynomial $1 - \phi_1 B = 0$ must lie outside the unit circle, which implies that $|B| = |\phi_1^{-1}|$ must be greater than one, or $|\phi_1| < 1$. The process can also be written

$z_t = \phi_1 z_{t-1} + a_t$
$z_{t+1} = \phi_1^2 z_{t-1} + \phi_1 a_t + a_{t+1}$    (A.10)
$z_{t+2} = \phi_1^3 z_{t-1} + \phi_1^2 a_t + \phi_1 a_{t+1} + a_{t+2}$, etc.

When $|\phi_1| > 1$ (or $|\phi_1| = 1$) the effect of the past on the present value of the time series increases (or stays the same) as the series moves into the future. Only when $|\phi_1| < 1$ (stationary process) does the effect of the past on the present decrease the further we move into the past, which is a reasonable and acceptable hydrologic fact (Delleur and Kavvas, 1978).

Consider now an MA(1) process $z_t = (1 - \theta_1 B) a_t$. The invertibility condition implies that $|\theta_1| < 1$. The process can also be written in the form:

$a_t = \dfrac{1}{1 - \theta_1 B}\, z_t$    (A.11)

where the polynomial $(1 - \theta_1 B)^{-1}$ can be expanded in an infinite sum of convergent series only if $|\theta_1| < 1$. To illustrate the need for invertibility let us assume that $|\theta_1| > 1$. Then (A.11) can be written as

$a_t = -\dfrac{1}{\theta_1 B} \cdot \dfrac{1}{1 - \theta_1^{-1} B^{-1}}\, z_t$    (A.12)

and since $|\theta_1^{-1}| < 1$, it can be expanded to the form

$a_t = -\left( \dfrac{1}{\theta_1 B} + \dfrac{1}{\theta_1^2 B^2} + \dfrac{1}{\theta_1^3 B^3} + \cdots \right) z_t$    (A.13)

or

$a_t = -\theta_1^{-1} z_{t+1} - \theta_1^{-2} z_{t+2} - \theta_1^{-3} z_{t+3} - \cdots$    (A.14)

which implies that future values are used to generate the present values. It becomes clear that the invertibility condition is required in order to assure hydrologic realizability.

10. The portmanteau lack of fit test

The portmanteau lack of fit test (Box and Jenkins, 1976, Ch. 8) considers the first K autocorrelations $r_k(\hat{a})$, k = 1, 2, ...,
K, of the fitted residual series $\hat{a}$ of an ARIMA(p,d,q) process, to detect inadequacy of the model. It can be shown (Box and Pierce, 1970) that, if the fitted model is appropriate,

$Q = (N - d) \sum_{k=1}^{K} r_k^2(\hat{a})$    (A.15)

is approximately distributed as $\chi^2(K - p - q)$, where K - p - q is the number of degrees of freedom, N is the total length of the series, and (N - d) is the number of observations used to fit the model. The adequacy of the model may be checked by comparing Q with the theoretical chi-square value $\chi^2(K - p - q)$ at a given significance level. If $Q < \chi^2(K - p - q)$, $\hat{a}_t$ is an independent series and so the model is adequate; otherwise the model is inadequate.

For the choice of K, Box and Jenkins suggest it to be "sufficiently large so that the weights $\psi_j$ in the model, written in the form

$\tilde{z}_t = \psi(B)\, a_t$    (A.16)

will be negligibly small after j = K" (Box and Jenkins, 1976, p. 221). The IMSL subroutine FTCMP (IMSL 0007, Ch. F) uses a value of K equal to N/10 + p + q to perform the portmanteau test. Ozaki (1977) points out that "for the application of the portmanteau test, fast dying off of the impulse response function (weights $\psi_j$ of the model) is a necessary condition" (Ozaki, 1977, p. 298). In cases where the impulse response function dies off rather slowly when compared with the length of the series (possibly due to the near-nonstationarity of the model), the applicability of the portmanteau test is doubtful, since the autocorrelations of the residuals may not be reliable at large lags.

11. Cumulative periodogram test

Another method used in the diagnostic checking stage of the Box-Jenkins procedure is the cumulative periodogram checking of the residuals. The normalized (area under the curve equal to one) cumulative periodogram for frequencies f between 0 and 0.5 of the fitted residuals $\hat{a}_t$ is compared with the theoretical cumulative periodogram of a white noise series, which is a straight line joining the points (0, 0) and (0.5, 1).
A periodicity in the residuals at frequency $f_i$ is expected to show up as a deviation from the straight line at this frequency. Kolmogorov-Smirnov probability limits can be drawn on the cumulative periodogram plot to test the significance of such deviations. For a given level of significance $\alpha$, the limit lines are drawn at distances $K_\alpha / \sqrt{N'}$ above and below the theoretical straight line, where N' = (N - 2)/2 for N even and N' = (N - 1)/2 for N odd. Approximate values of $K_\alpha$ for different levels of significance $\alpha$ are:

$\alpha$        0.01    0.05    0.10    0.20    0.25
$K_\alpha$      1.63    1.36    1.22    1.07    1.02

(Box and Jenkins, 1976, p. 297)

So, if more than $\alpha N$ of the plotted points fall outside the probability lines, the residual series may still have some periodicity; otherwise it may be concluded that the residuals are independent. In practice, "because the a's are fitted values and not the true a's, we know that even when the model is correct they will not precisely follow a white noise process" and thus the cumulative periodogram test provides only a "rough guide" to the model inadequacy checking (Box and Jenkins, 1976, p. 297).

12. Akaike Information Criterion (AIC)

The AIC for an ARMA(p,q) model is given by

$\mathrm{AIC}(p,q) = N \log(\hat{\sigma}_a^2) + 2(p + q + 2) + N \log 2\pi + N$

where $\hat{\sigma}_a^2$ is the MLE of the residual variance given by

$\hat{\sigma}_a^2 = \dfrac{1}{N - p - q}\, S(\hat{\phi}, \hat{\theta})$    (A.17)

and $\hat{\phi}$, $\hat{\theta}$ are the vectors of the parameters which minimize the sum of squares of the residuals

$S(\phi, \theta) = \sum_t a_t^2$    (A.18)

For the purpose of comparison of models the definition of AIC can be replaced by

$\mathrm{AIC}(p,q) = N \log(\hat{\sigma}_a^2) + 2(p + q)$    (A.19)

Ozaki (1977) demonstrates that the inherent difficulties associated with the Box-Jenkins procedure (identification, estimation and diagnostic checking) for the selection of the model, when several models fit the data equally well, can be overcome by using the MAICE (minimum AIC estimation) procedure as the only objective criterion for the selection of the "best" approximating model among a set of possible models.
He also points out that the AIC "measures both the fit of a model and the unreliability of a model" (Ozaki, 1977, p. 290).

13. Positive definite (semidefinite) matrix

A real symmetric matrix A is called positive definite (semidefinite) if and only if

$X^T A X > 0 \quad (\ge 0)$    (A.20)

for all vectors $X \ne 0$. The two following theorems hold:

Theorem 1: A matrix A is positive (semi)definite if and only if all its characteristic values (i.e., eigenvalues) are positive (nonnegative).

Theorem 2: A matrix A is positive definite if and only if all the successive (leading) principal minors of A are positive; it is positive semidefinite if and only if all the principal minors of A are nonnegative.

An obvious corollary of the above is that a positive semidefinite matrix is positive definite if and only if it is nonsingular, i.e., none of its characteristic values is zero (Gantmacher, 1977, p. 305).

14. Test for differences in the means of two normal populations

Let $\mu_1$, $\mu_2$ denote the population means of two normal distributions and $\bar{x}_1$, $\bar{x}_2$ the respective sample means. Let us also assume that the variances of the two normal distributions are equal but unknown. The hypothesis $H_0: \mu_1 = \mu_2$ versus $H_a: \mu_1 \ne \mu_2$ is tested by calculating the statistic

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s \sqrt{1/N_1 + 1/N_2}}$    (A.21)

where

$s^2 = \dfrac{(N_1 - 1) s_1^2 + (N_2 - 1) s_2^2}{N_1 + N_2 - 2}$    (A.22)

which has a t distribution with $N_1 + N_2 - 2$ degrees of freedom. $H_0$ is rejected if

$|t| > t_{1-\alpha/2,\, N_1 + N_2 - 2}$    (A.23)

Although the test is based on sample normality, for large samples the Central Limit Theorem enables us to use the test as an approximate test for nonnormal samples. If the two samples are of equal length, $N_1 = N_2 = N$, then equation (A.21) reduces to

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{(s_1^2 + s_2^2)/N}}$    (A.24)

15. Test for equality of variances of two normal distributions

Let $\sigma_1^2$, $\sigma_2^2$ denote the population variances and $s_1^2$, $s_2^2$ the sample variances of two normal distributions. The hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ versus $H_a: \sigma_1^2 \ne \sigma_2^2$ is tested by calculating the statistic

$F_c = s_1^2 / s_2^2$    (A.25)

where $s_1^2$ is the larger sample variance.
$F_c$ is distributed as an F distribution with $N_1 - 1$ and $N_2 - 1$ degrees of freedom, where $N_1$ is the length of the sample having the larger variance and $N_2$ is the length of the sample with the smaller variance. $H_0$ is rejected if

$$F_c > F_{1-\alpha,\; N_1-1,\; N_2-1} \qquad \text{(A.26)}$$

16. Test for correlation coefficients

Let $\rho$ denote the population correlation coefficient and $r$ the sample estimate of $\rho$. If the sample size is moderately large ($N > 25$) then the quantity $W$ is approximately normally distributed with mean $\omega$ and variance $1/(N-3)$, where

$$W = \frac{1}{2} \ln \frac{1 + r}{1 - r} \qquad \text{(A.27)}$$

and

$$\omega = \frac{1}{2} \ln \frac{1 + \rho}{1 - \rho} \qquad \text{(A.28)}$$

To test the hypothesis $H_0: \rho = \rho_0$ against the alternative $H_a: \rho \neq \rho_0$, where $\rho_0$ is the hypothesized value used to compute $\omega$, the quantity

$$z = (W - \omega) \sqrt{N - 3} \qquad \text{(A.29)}$$

can be considered to be normally distributed with zero mean and unit variance. If $|z| > z_{1-\alpha/2}$ (z is the standard normal variable), $H_0$ is rejected (see Haan, 1977, p. 223).

APPENDIX B
DETERMINATION OF MATRICES A AND B OF THE MULTIVARIATE AR(1) MODEL

Determination of matrix A

The multivariate lag-one autoregressive model is written as

$$Z_t = A Z_{t-1} + B N_t \qquad \text{(B.1)}$$

Postmultiplying both sides of equation (B.1) by $Z_{t-1}^T$ and taking expectations, it becomes:

$$E[Z_t Z_{t-1}^T] = A\, E[Z_{t-1} Z_{t-1}^T] + B\, E[N_t Z_{t-1}^T] \qquad \text{(B.2)}$$

By definition

$$M_1 = E[Z_t Z_{t-1}^T] \qquad \text{(B.3)}$$

and

$$M_0 = E[Z_t Z_t^T] \qquad \text{(B.4)}$$

and from the assumption of weak stationarity

$$E[Z_{t-1} Z_{t-1}^T] = E[Z_t Z_t^T] = M_0 \qquad \text{(B.5)}$$

Also, from the independent uncorrelated process $N_t$,

$$E[N_t Z_{t-1}^T] = 0 \qquad \text{(B.6)}$$

so that equation (B.2) becomes

$$M_1 = A M_0 \qquad \text{(B.7)}$$

and solving for the parameter matrix A

$$A = M_1 M_0^{-1} \qquad \text{(B.8)}$$

Determination of matrix B

Postmultiplying equation (B.1) by $Z_t^T$ and taking expectations on both sides, it becomes

$$E[Z_t Z_t^T] = A\, E[Z_{t-1} Z_t^T] + B\, E[N_t Z_t^T] \qquad \text{(B.9)}$$

Because $E[N_t N_t^T] = I$, an identity matrix, and $E[N_t Z_{t-1}^T] = 0$, we have $E[N_t Z_t^T] = B^T$, and equation (B.9) can be written

$$M_0 = A M_1^T + B B^T \qquad \text{(B.10)}$$

By substituting A from equation (B.8) and solving for $B B^T$

$$B B^T = M_0 - M_1 M_0^{-1} M_1^T \qquad \text{(B.11)}$$

Solution of equation $B B^T = C$

The right hand side of equation (B.11) involves the lag-zero and lag-one correlation matrices, which can be estimated from the historical data, and thus is a known quantity C. The problem that remains now is to solve the equation

$$B B^T = C \qquad \text{(B.12)}$$

for B.
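The computation of A and C from equations (B.8) and (B.11) can be sketched in a few lines. The sketch below assumes that the matrices $M_0$ and $M_1$ have already been estimated, uses only the Python standard library, and factors C with the lower-triangular (Cholesky-type) recursion that is one of the solution techniques discussed later in this appendix. All function names and the numerical values are hypothetical, and the factorization as written requires C to be strictly positive definite.

```python
import math


def matmul(X, Y):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def inverse(X):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(X)
    aug = [[float(v) for v in X[i]] + [1.0 if i == j else 0.0 for j in range(n)]
           for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def lower_triangular_root(C):
    """Lower-triangular B with B B^T = C; needs positive definite C."""
    m = len(C)
    B = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = C[i][j] - sum(B[i][k] * B[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("C is not positive definite")
                B[i][j] = math.sqrt(s)
            else:
                B[i][j] = s / B[j][j]
    return B


def ar1_matrices(M0, M1):
    """A = M1 M0^-1 (B.8), C = M0 - M1 M0^-1 M1^T (B.11), B B^T = C (B.12)."""
    A = matmul(M1, inverse(M0))
    AM1T = matmul(A, transpose(M1))
    m = len(M0)
    C = [[M0[i][j] - AM1T[i][j] for j in range(m)] for i in range(m)]
    return A, lower_triangular_root(C), C


# Made-up lag-zero and lag-one correlation matrices for two sites.
M0_demo = [[1.0, 0.5], [0.5, 1.0]]
M1_demo = [[0.4, 0.2], [0.3, 0.5]]
A_demo, B_demo, C_demo = ar1_matrices(M0_demo, M1_demo)
```

A quick sanity check is to multiply `B_demo` by its transpose and confirm that the product reproduces `C_demo`, and that `A_demo` times `M0_demo` reproduces `M1_demo`.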
A necessary and sufficient condition to have a real solution for B is that C must be a positive semidefinite matrix. It can be proven (Valencia and Schaake, 1973) that if the correlation matrices $M_0$ and $M_1$ have been calculated using equal length records for all m sites, then the matrix

$$C = M_0 - M_1 M_0^{-1} M_1^T \qquad \text{(B.13)}$$

is always positive semidefinite and so a real solution for the matrix B exists. But this solution for B is not unique. An infinite number of matrices B exist that satisfy (B.12).

Proof: Let B denote a matrix solution of equation (B.12) and K denote an (m x m) matrix such that $K K^T = I$, where I is an (m x m) identity matrix. A matrix $B_0$ defined as

$$B_0 = B K \qquad \text{(B.14)}$$

may be used in place of B in equation (B.12) since

$$B_0 B_0^T = B K K^T B^T = B B^T = C \qquad \text{(B.15)}$$

There exists more than one matrix K such that $K K^T = I$, and therefore many solutions for matrix B exist, all valid since the elements of B have no physical significance as far as synthetic hydrology is concerned (Matalas, 1967).

Several techniques have been proposed for the solution of equation (B.12). Fiering (1964) and Matalas (1967) suggested the use of principal component analysis and Moran (1970) used canonical correlation analysis. Young (1968) assumed that B is a lower triangular matrix, based on the fact that $C = B B^T$ is a symmetric matrix, and gave a unique recursive solution for the elements of B. Let us examine this case closely:

(1) $C = B B^T$ is symmetric for any B. The (i,j)th element of matrix C is

$$c_{ij} = \sum_{k=1}^{m} b_{ik}\, b'_{kj} \qquad \text{(B.16)}$$

and the across-the-diagonal element is

$$c_{ji} = \sum_{k=1}^{m} b_{jk}\, b'_{ki} \qquad \text{(B.17)}$$

where the prime denotes a transposed element. Thus, $b'_{kj} = b_{jk}$ and $b'_{ki} = b_{ik}$, which implies that $c_{ij} = c_{ji}$ and therefore C is symmetric for any B.

(2) That C is symmetric implies that $m(m+1)/2$ equations are required to specify it, and so $m(m+1)/2$ nonzero elements of matrix B are needed. Thus, since the (m x m) matrix B has $m^2$ elements, there are $m(m-1)/2$ elements that can be set to zero. So the assumption of a lower triangular matrix B is valid.
(3) The assumption of a lower triangular matrix B allows a recursive solution for the coefficients of B. This will be illustrated in the (2x2) case, and the reader is referred to Young and Pisano (1968) for the general case.

$$\begin{bmatrix} b_{11} & 0 \\ b_{21} & b_{22} \end{bmatrix} \begin{bmatrix} b_{11} & b_{21} \\ 0 & b_{22} \end{bmatrix} = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}$$

or

$$\begin{bmatrix} b_{11}^2 & b_{11} b_{21} \\ b_{21} b_{11} & b_{21}^2 + b_{22}^2 \end{bmatrix} = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} \qquad \text{(B.18)}$$

from which

$$b_{11} = \sqrt{c_{11}}, \qquad b_{21} = c_{21}/b_{11}, \qquad b_{22} = \sqrt{c_{22} - b_{21}^2} \qquad \text{(B.19)}$$

with the constraints

$$c_{11} > 0 \quad \text{and} \quad c_{22} - b_{21}^2 \geq 0 \qquad \text{(B.20)}$$

APPENDIX C
DATA USED AND STATISTICS

Table C.1. 55 years of monthly rainfall data for the South Florida Station 6038.

***** STATION 6038. MOORE HAVEN LOCK 1 *****
1"27 0.11 2. 0" 1. 70 2. 02 1. 94 10. 7'P 5. 79 8.61 6."9 4. 12 0.39 O. 39 1"28 O. 42 2.31 2.46 1. 52 4. lCf 8. 12 5.43 11.82 14. 60 O. 47 0.97 O. 31 1"29 O. 82 O. 14 O. 52 1. :55 2. 73 9. 35 8.44 4.93 13. 4:5 1. 71 1. 27 1.39 1"30 0.49 3.23 4. 76 4. 12 11.33 17.8:5 4. 72 11.61 11. 6. 33 O. 45 2. 33 1931 2. 58 O. 76 5.90 3. 44 1. 59 1. 20 2.68 10.34 :5.06 1. 94 o.oe o 35 1"32 1. 97 3. 13 2. 97 1. 76 6.05 4."6 6.25 15.71 :5."9 2. 93 3.29 O. 07 1"33 1.65 O. 1" 3.88 6. "2 3.8Cf 4. 66 5. 36 5. 77 2. 75 5.18 0.Cf2 O. 28 1"34 1. 33 :Z.E9 2. 73 2.22 6.43 4.36 8.48 6. 20 4. 18 5. 54 3.58 O. ;::6 1935 O. 52 1.00 0.03 5. 18 3. 57 :5.84 :5. 09 5. 50 9. 53 1. 42 1. 71 1. 43 1"36 2.23 4."7 1.95 2. 55 5. 41 14. 59 2. 99 :5. 79 11.51 3. 55 O. 58 1. 19 1"37 2. 07 1. 70 4.83 4. 89 4.94 4. 29 13. 79 4. 71 4.48 8.72 5. 47 O. 44 1938 O. 61 O. 57 0.34 O. 21 6.28 7. 40 8.20 2.39 2.23 3.92 1. 52 O. 11 193'P O. 18 0.35 O. 79 3. 08 4.48 3. 61 16.13 10. 42 4.20 3.60 1. 45 1 ':'5 1"40 2. 37 3.07 5. :55 2. 06 3.36 4. 96 7."2 10. 43 14. 13 0.3;Z O. 42 3.91 1"41 :5. 73 3.86 3.68 :to 62 3.30 4. 87 13. 23 b.71 8. 54 ;Z. 92 1. 66 1. 1942 2.80 3. 51 4. :5:5 :5. 64 1. 99 ". :51 4.81 :5.66 4. 16 0.03 0.46 1. 62 1943 O. 35 0.37 2. 72 3. 91 3. 43 5. 02 8.04 8.07 3 C'7 ;Z. 67 1. 69 O. ;20 1944 O. 98 O. 12 2.35 5. 41 1. :52 :5. :50 8.36 :5. 42 9.23 3. 47 0.07 o 27 1"4:5 1. 82 O. 27 O. 17 3.
20 2.22 7.07 ".47 6.86 8.38 4. 9;Z O. 53 o 57 1946 O. 68 O. 76 ;Z. 53 0.27 7.5;Z :5. 74 6.90 4. 49 7.77 1. 16 2.16 O. 90 1947 O. 70 1. 64 9. 73 O. :55 4.80 15.02 6. 43 10.74 10. :57 6. 18 4.33 1. ::i 1 1948 4.16 O. 39 0.62 3. 15 2.24 4. 67 6.00 :3.94 21. 55 ;Z.42 0.57 O. 57 1949 O. 05 0.0:3 O. 46 1. 64 3.13 6. :56 9. 40 12. :51 10. 22 O. 73 0.96 ;: 1 <;1:50 O. 06 O. 72 1. 40 2. 98 :3.29 4. :5:5 7. :5:3 8. 86 2. 77 5. 54 157 1. 45 1951 O. 15 1.99 O. 82 3. :31 4.47 5. 02 11.63 5.03 6.20 7.74 1.36 O. 11 1952 O. 92 5. 02 1 50 2. 2:5 10.74 7. 56 7.05 8.09 6.35 11. 11 O. 19 Ci 46 1953 1.45 2. 57 O. 76 4. 03 2.78 6. :52 9. 13 :5.6:5 14. 16 9.67 C. 55 1. 25 1954 O. 39 1.72 ;Z.24 3. 52 11.96 12. 53 10. 58 !j. 96 6. 48 2.63 1. 19 1. 89 1955 ;Z. 78 1.27 1. :26 1.7:2 3.91 13. 17 5. 80 3. 59 7.07 :2. 55 O.;?S 1. l8 1956 O. 96 1.04 O. 40 1. 1. 13 5. 43 3. 53 4. 67 :5. 18 6. 47 O. 13 0 52 1. 74 3. 73 6. 09 4. 06 5.58 4, 35 6. 59 7. 59 9. :50 1. 20 0.24 7 58 1958 6. 04 0.84 7.03 5.84 4.91 :5.93 8.32 4. 12 3. 09 4. 59 0.47 77 19:59 1.09 1.08 5.82 1.99 6. 07 10.16 :5. 60 O. 12 12. 00 12.36 1. 29 1 02 1960 O. 31 4. 43 1.37 6. 2.77 11.35 11. 11 6.37 11.30 5.99 1. 21 C. 69 1961 2. 71 2. 16 3. 56 2.44 6. 1;Z 7. 17 3. 74 4. 73 2 64 0.66 1. 41 0. 33 1962 O. 88 0.47 3. 57 2. 60 2.33 11.46 :5. 46 7. 71 8. 78 1. 20 4. 03 0 22 1963 0.86 3.64 0.49 0.80 8.82 6. 92 1. 08 6. 06 3. 52 0.05 268 4 20 1964 2. 4. 80 0,61 0.67 2.34 5. 20 4.78 8.89 3.46 2. 74 0 05 O. 72 1965 O. 42 3. 59 3. 16 1. 70 1. 11 10. 16 :5. 57 2. 78 4.71 9. 06 0.34 1. 89 1966 5.47 3. 67 0.42 3. 01 9. 26 10. 93 11. 19 6. 76 2. 62O. 11 o 40 1967 O. 84 1.69 O. 24 O. 14 2. 58 11.;Z7 7.02 3. 74 8. 53 3.37 0.08 1. 95 1968 O. 58 1. 72 1. 03 O. 85 8. 64 10. 73 7.13 4.23 6,81 3. 21 0.21 1969 1.76 2. 28 6.19 o 69 4. 10 10. 09 3. 68 10. 04 8. 49 11. 75 1.46 3. 82 1970 3. 55 2.40 12. 63 O. 02 2.98 8. 74 5. 91 3.46 4.70 o 13 0.28 1971 0.25 O. 51 0.37 O. 14 1. 50 13. 86 7.28 8.29 7. 18 6. 3:5 0 90 120 1972 O. 30 1. 2.24 2.34 7.52 10. 
50 2.77 6.40 0.93 O. 40 2.21 1. 39 1973 2. 72 2. 73 3.34 1. 02 5. 88 10. 48 8.01 58 8.43 1. 38 0.03 1. 52 1974 O. 14 1. 36 0.08 0.97 3.00 14.91 18. 56 7.99 5. 91 1.35 1. 64 1. 71 197:5 O. ;ZO 1. 95 O. 74 1.22 4.89 5.29 7.00 3. 13 11. 11 4. 88 0.27 0 38 1976 0.65 1. 41 1. 59 1.81 4.43 3. 10 9.98 12.31 :5. 74 0.80 1. 88 2. 31 1977 4. 87 1. 38 1. 12 0.20 :5.17 3. 74 6. 19 5. 51 6.29 1. 01 5. 33 4. 74 1978 1. 78 1. 39 2. 64 2. 06 8.:38 43 9.32 2. 67 6.40 2. 23 2. 13 4. 39 1979 21.40 0.23 2.30 0.84 7.64 1.09 1. 45 5.66 17.69 1. 90 1.83 1.96 1980 2. 76 1. 08 2.32 :5.29 2.23 3. 10 7.61 6. 88 1. 47 2 20 0.62 1981 O. 87 1. 52 1. ;Z8 O. 38 2 06 3. 33 3. 70 10. 29 4. 54 O. 24 1. 27 o 15 156 PAGE 146 157 Table C.2. 55 years of monthly rainfall data for the South Florida Station 6013. ***** STATION 6013. AVON PARK ***** 1927 O. 10 1. 87 2.29 1. 0.31 S. 5.39 93 3.9S 3. 80 0.40 1. 71 1928 0.26 1. 14 3. 12 3.66 6.90 13.01 9.66 10.64 2.05 1. 03 O. 35 1929 1. 70 1. 00 1. 35 2. 78 S. 42 61 10. 55 11. 59 2. 40 0.56 2.29 1930 4.00 4.17 6. 59 3.95 7.55 11.37 4.49 7.06 18.iil2 2.42 1. 25 4. 13 1931 3.92 2.36 6. 10 3. 74 S.15 0.37 7.84 2.98 O. 18 1. 47 1932 0.63 O. 14 1. "9 2.08 5.95 ".29 4.68 2. SO 4. 06 4. 50 2.48 0.07 1933 1. 97 2.35 1. 70 5.90 3.66 4. 77 13.78 1.00 11.71 1. 94 3.47 0.27 1934 1. 22 2. SO 3. 5a 4.32 7. 15 10.94 4.13 1.00 3.17 0.11 0.93 1. 00 1935 O. 41 1. 15 O. Sl 6.03 2.87 6.87 1. 00 9.93 11.35 2. 99 1. 05 2.39 l'i136 4.83 8.35 5. 52 1.67 2.59 10.87 1. 00 7.99 9.99 3.97 1.07 2. 14 1937 2.63 5. 13 3.31 4.06 1. 65 1.00 5. 29 6.27 6.47 6.47 5.44 O. 87 1938 1.44 1. 43 1. 45 0.42 3.43 4.64 8. 13 4.24 2. 81 6. 44 2.50 O. 19 1'i139 1.52 1.20 1.34 4.66 7.91 S.22 19.85 6.22 4.63 O. 50 O. 61 1.,40 3.83 3.06 3. 5S 1. 54 5.30 8. 43 11. 76 4.02 9.94 0.68 O. 10 4. 43 l'i141 4.01 3.02 2.92 4. 73 1. 04 9. 52 15.20 3.11 4.S9 2.62 2.49 1. 98 1942 4. 48 4.72 3.86 2.67 6.43 S. 52 8. 76 5. 19 5.37 0.13 0.0 3. 54 1943 1.21 O. 46 4.94 1.69 8.83 76 7.86 10. 02 3. 98 4.35 1. 32 O. 59 1944 1. 
00 1. 00 1.00 5. 73 2.07 7.39 11.17 6.42 3. 39 4.45 0.26 O. 51 1.,45 1. 0.03 0.40 1.61 2.45 14.09 14.49 2. 7'i1 8. 43 5.94 0.49 2. 00 1946 1. 14 2. 11 1.08 0.20 6.03 S.02 9. 88 6.04 8.09 4.74 2.06 1. 31 1947 1.92 3.82 6. 19 4.65 3. 57 12. 77 10. 50 9.30 14.31 2.97 2.65 1.65 1948 4.03 O. 51 0.83 6. 00 2.34 4.39 18.99 6. 72 16.10 6.99 1. 99 1. 50 1949 O. 13 O. 09 0.92 3.30 2.66 6. 74 6. 48 10.12 8. 18 O. 70 1. 79 0.41 0.0 O. 66 1. 46 3.15 2.42 2.09 3.38 5. 90 7.83 7.56 0.32 1.79 1951 0.22 2. 57 0.64 10.35 0.33 6. 98 5.30 a. 72 3.99 5.94 1. 00 O. 90 1952 1. 30 4.61 5. 49 O. 97 5.48 7.39 7.23 a. 46 5. 42 6. 90 1. 60 1. 15 1953 3. 27 2. 58 6.90 7.45 0.83 13. 16 5. 52 11.00 12. 71 6 92 7.44 2. 40 1954 1. 78 1.96 1. 62 4. 71 3.12 la.95 4. 73 6.31 6.20 1. 60 1. 60 1. 97 1955 2. 73 1. 06 1.67 1.31 1.62 5.27 6.65 1. 86 8.93 2. 46 0 56 O. 74 1956 0.26 0.94 1. 54 2.23 1. 95 9. 13 4.70 10.95 6. 70 7. 78 0.22 O. 22 1.,57 2. 14 5.10 4. 77 6. 07 10.91 9.37 12. 74 6.99 7. 08 1. 45 1. 30 2. 12 1958 8.33 3. 50 5. 55 3.43 4.10 6. 77 4.45 6.31 4. 97 2. 75 0.91 3. 96 1.,59 1.23 3.60 7.35 3.06 6.47 15. 17 7.03 8.20 12.06 11.26 1. 73 2. 47 1960 O. 55 6. 54 5. 52 3. 00 2.28 7. 06 13.67 8.07 14.82 3.06 0.28 1. 02 1.,61 2.30 3.22 3.02 2. 06 4.18 .,. 56 4.09 4. 77 2.86 2. 11 0 58 O. 78 1962 1.62 1. 53 3.38 3.30 1.21 10. 90 2. 90 8. 42 7.07 1. 23 2. 68 1. 42 1963 2.35 6. 13 1.22 0.81 13.06 7.28 7.24 6.29 10.10 O. 45 5 28 3. 59 1964 2.97 3.81 2.28 3.24 6.08 9.44 5.28 7.31 0.61 0.77 1.08 1965 1. 08 4.37 6.85 2. 91 1. 44 'iI. 53 13.66 4.75 7. 67 4. 26 1. 19 2. 39 1.,66 5.95 0.0' O. 77 2.98 '.OB 9.68 8.27 B. 98 7.85 2. 02 O. 15 1. 36 1967 0.65 2.Bl O. '1 O. 0 1. 00 1.00 9. 74 9.94 7.15 O. 96 0.36 2. 42 1968 O. 58 1.91 1.29 O. 43 8. 73 16. 73 8. 19 6.32 4.40 3. 94 2.73 o 35 1969 1. 89 1. 80 6.B9 0.97 1. 86 11.92 5.34 8.88 7.84 7. 91 1. 64 4. 35 1.,70 2.99 2.03 5. 0i!3 0.22 3.92 4. 51 14 .,3 5.33 5.B4 2.:;!5 0.54 1. 06 1971 O. 0i!2 2. 52 0.9' 0.49 2.34 6.22 5. 59 B.29 6. 17 7. 11 0.63 1. 
92 0.93 3.47 3. 74 2.24 4.75 B.30 9.67 7.23 O. 36 1.98 4.95 2.80 1973 1. 00 1. 57 3. 06 '.61 2.06 3.64 8.50 10. 71 7. 59 4.43 0.80 1. 00 1974 1.00 1. 26 1. 00 1. 00 1. 00 0i!0. 14 9.64 3. '3 3. 22 O. 36 0.23 2. 20 1975 O. '0 1. 93 1. 98 0.23 5.30 ,. 45 5.90 8. 52 9. 14 6. 23 0.49 O. 28 1976 O. 51 O. 54 2.46 1.59 6.20 7.66 8.84 7. 80 6.29 2. 09 1. 81 1. 91 1977 2.69 1. 66 O. 46 0.26 3.99 4.95 8.27 4.3B 4. 03 1.62 4.39 2.61 1978 2. 96 4.32 2.29 O. 13 '.17 10.0' 13.36 4. 13 2.02 1. 42 0.49 3. 23 197., 6 53 1. 12 .44 1.87 7.76 10.17 4.05 4.92 13. 37 1. 18 1. 23 1. 58 1980 2.42 3.46 1. BO 5.41 3. 15 5.0" 4.60 6. 3.88 4.19 2.68 1. 09 1981 O. 57 4. 16 2. 13 O. 17 2.21 7. 56 6.57 6.49 8,01 O. 61 1. 03 O. 55 Note: 1 indicates missing value. PAGE 147 Table C.3. 1927 0.30 1928 0.2:3 1929 1.09 1930 1. 09 1931 3. 5:3 1932 O. 70 1933 O. 25 1934 O. 70 19:35 0.24 19:30 :3.:3:3 1937 O. 52 1938 2. 20 1939 O. 45 1940 :3. 79 1941 :3. 02 1942 1. 60 1943 O. 74 1944 1.20 1945 2. 19 1946 O. 35 1947 0.83 1948 4. 16 1949 O. 01 1950 O. 0 1951 0.::38 1952 1.28 1953 1. 71 1954 0.::30 1955 2. 68 1956 O. 57 1957 O. 78 1958 6.04 1959 1. 48 !960 O. 46 1961 ::3. 31 1962 0.43 1963 O. 81 1964 2. 88 1965 1.24 1966 3.39 1967 1. 15 1968 O. 40 1969 1. 44 1970 4.36 1971 0.65 1972 O. 77 197::3 3. 14 1974 0.36 1975 O. 20 1976 0.21 1977 3. 1978 2. 48 1979 7. 1980 2. 44 1981 O. 80 55 years on monthly rainfall data for the South Florida Station 6093. **It.* STATION 609:3. FORT /'IVERS WSO AP O. 76 1. 42 O. BO 1. 2:3 B. 04 B. 7B 3. 14 1. 78 0.:30 2.05 O. 51 1. 44 2.61 9.25 12.26 1:3.95 11.78 :3. 22 0.71 O. OB 1. 0:3 O. BB 7.82 8.30 O. 08 5. 15.44 3.42 0.30 2.8B 5.08 0.80 14.01 4. 05 5.97 1:3.7::3 1. 88 O. 13 :3.70 6.64 2. 92 2.58 3. 96 6. 33 7.27 6.44 O. 86 O. 09 O. 5:3 1.93 1. 06 7.03 3. 59 7.91 17.64 0.08 5. :37 0.71 2. 60 3. 93 0.06 6.86 5.02 9.20 4. 51 4.63 2. 08 l. 09 5. 93 O. 75 O. 92 5. 78 11.56 O. 09 3. 55 8.30 1. 59 0.66 1. B1 0.0 3.50 2.30 O. 42 9.:30 9.38 14. 49 0.30 0.83 5. 50 1. 69 1.14 6. 
11 20.25 B.54 7.50 3. 56 5. 39 2. 78 3.68 3. 74 1.38 0.94 10. 75 5. 13 7.00 3.04 5.88 l. 44 0.34 O. 70 0.:33 2. 91 8.24 12. 71 5. 28 5. 12 3. 57 O. 39 0.87 O. 04 B. 42 3. 01 16. 43 7.69 6. 97 12.83 5. Bl 1. BO 4. 00 4. 41 1.73 O. 73 10. 52 3. 50 B.6q 13.02 0.61 O. 1:3 3. B2 6. BB 7. 60 1. 16 7. 12 15.28 7.46 6.09 O. 96 2.4B 3. 35 2.:31 4. 54 :3.:38 11. 15 10.66 9. 18 5.37 0.50 O. 08 O. 71 1. 61 4. 45 5.96 16.06 12.24 B. 59 5.68 :3. 56 2. 37 0.0 3. 76 O. B5 4.00 :3. 73 5.09 5.89 3. 50 5. 77 0.0 O. 68 0.10 O. 21 1.58 11.97 12.41 11.06 5. 71 5. 19 0.03 2.24 0.19 O. 01 0.71 10. 19 5. 78 6. 47 5.21 1. 34 3.:39 2.92 B. 0;>4 2. B2 6. 47 12.84 11. 17 9.40 16.::32 4.97 2.05 0.06 0.8:3 1.57 2.19 5.06 10.08 4. 9B 14.05 ::3. 90 O. 45 O. 07 O. 13 5.50 4.03 7. 5::3 1::3. 32 7.60 12. 70 3. 60 1. 27 O. 08 O. 49 O. 08 4.14 4. 84 6. 83 5. 93 8.::32 ::3. 26 O. 02 1. 90 1. 13 2. 71 2.14 9. 19 11.44 10. 30 3.48 11. 91 1. 14 4.::34 2. 05 O. 78 1. 75 7.95 5. 74 8.::39 12.::35 8. :34 0.75 2.01 O. 68 2. 28 0.41 12.81 9.34 4. 32 15.58 6. 68 1. 07 2. 53 2.13 :3. 49 4.08 4. 78 9. 19 6. 84 10.31 182 2.33 1. 16 O. 32 O. 97 3.23 8. 53 8. 76 4. 29 10 50 2. 15 O. 52 1.06 O. 05 3.50 4.76 4.67 5. :34 8.03 6.00 4. 42 135 3. 68 4. 73 2. 09 7.97 4.85 12.52 9.::39 8. 77 3.19 1.52 1. 26 10.:31 2. 18 6.22 7. :37 10. 92 4. 12 8.89 4. 57 1. 43 1.n 6. 33 1. 75 4. 74 16. 10 6. 17 5. 75 0.89 12. 04 1. 92 3.66 1. 87 3.8::3 2.20 5.20 13.76 5.66 11.93 :3. 01 2. 02 1. B8 3. 58 0.46 4.92 9. 75 9. 82 1:3. 41 2.80 3. 16 1. 12 O. 54 2. 65 1. :37 0.34 12.08 6. 01 10. 89 14.54 5. 44 3.01 4.65 O. 59 0.27 7.58 7. 70 4.06 3.98 7.49 0.05 3.45 3.:30 2. 12 O. 80 0.50 4. 58 2 28 4. 26 9.45 1. 38 0.22 2. 99 2. 91 2.:39 4. 70 7. 78 12 05 6. 57 4.35 4.42 O. 58 1. 06 0.37 3. 03 1. 61 12. 42 8.22 8. 10 4. 18 2. 14 0.18 2. 15 O. 72 0.0 1. 46 7. 41 O. 09 15.86 7. 04 3. 08 0.92 2. 08 O. 65 O. 57 10.32 15.0:3 9. 85 11.44 8.92 7.99 2.88 2.87 4. 74 O. 15 4. 71 10. 63 7. 11 6. 49 16. 60 11.03 O. 22 2.20 19.59 0.0 0.36 7. 47 4. 74 4. 
82 8.::29 1. 19 O. 46 1. 55 O. O. 70 3.77 6. 18 9. 50 8. 06 9.21 0.49 0.16 2. 14 4. 72 0.27 5.20 7:80 9. 72 10.::2::2 2. ::33 2.20 3.95 2.23 3. 69 1.71 0.78 3. 99 9. 57 6. 66 8. 38 O. 10 0.10 O. 91 0.03 O. 11 2.40 20. 10 14. 47 7.70 4.::31 O. 19 1. 46 0.27 1. 47 0.80 2.76 10. 10.81 7. 74 12. 59 3.05 0.49 1.20 0.<;>1 0.90 '.22 10. 59 6. 14 9.95 8.81 1. 96 2.10 0.15 0.09 O. 70 O. 51 8.90 9.00 10. '8 9.21 O. 43 1. '0 3. 36 3. 43 02. 52 6. 75 10. 29 10.90 ,. 18 1. 45 0.04 1. 92 O. 43 3. 12 5.32 8.31 5. 90 14. 79 13.65 O. 39 O. 48 1. 04 3.5<;> 1.52 6. 73 1. 99 7.02 8. 79 4.64 1. '4 3. 15 1. 1. 29 O. 06 3.07 11.79 8.24 10. 73 6.70 O. 40 0.71 158 O. 71 O. 30 1. 31 2. 45 1.83 O. 30 O. 13 O. :31 1. 58 1. :34 O. 72 0.21 1.01 5. 42 0.99 1. BO O. 48 0.32 1. 45 O. 57 1.44 O. 6::3 1.62 2.20 O. 14 O. 71 1. 18 1. 93 O. 85 0.10 3. 55 :3. 30 1. 79 O. 73 O. 5:3 0.85 2.27 1. 06 O. 85 0.29 2. 91 O. 16 3. 95 0.37 0.::30 1. 43 1. 72 O. 89 0.69 1. 68 2. 74 4. 35 5. 16 O. 55 O. 73 PAGE 148 159 Table C.4. 55 years of monthly rainfall data for the South Florida Station 6042. STATION 6042 CANAL POINT USDA 1927 0.33 1. BO 2.37 1.0B 1.54 6.31 7.32 B. 14 3. :U 3.35 0.49 O. 40 1928 O. 19 1.38 3. 4B 1.72 3. 10 5. 42 14. 14. 13 16.45 O. 77 1.24 O. 20 1929 1.34 O. 07 O. 60 2.32 5.43 11.74 11.26 6.31 10. 70 3.08 0.69 1. 08 1930 2. 54 3. 03 4.32 '.25 6.10 16.96 4.08 3.07 5.36 5. 14 0.67 2. 77 1.31 2.05 O. 91 4.27 5. 71 3.05 0.4' 3.33 4.67 5.64 4.43 0.70 4. 62 1932 0.26 2. 3B O. B7 2.67 3.49 11. 26 4.'1 9.91 2.40 4.51 25.09 O. 16 1933 1. 54 0.35 4. 73 6.4W! 1.31 7.62 14. 02 B. 51 8. 16 4.36 1. B4 O. 09 1934 0.25 5.36 2. 77 7.64 6.27 7.96 5.20 B. 14 11.69 1.00 1.00 1. 00 1935 O. 16 2.Bl 0.17 5.45 0.76 6.11 3.98 3.62 11.90 4.44 0.57 1. 22 1930 2.40 5.69 3.27 0.39 6.10 14.29 5.44 B. 59 4.08 2.84 5.0B 1.65 1937 4.30 1. Bl 4.BB 3.36 1. 92 4. 44 14.62 ".37 5.88 6. 50 2.23 0.26 1938 O. 12 0.B4 1.0B 0.45 3.13 6.67 7.28 5. 52 8.45 3. 69 0.97 O. 10 1939 O. 38 0.08 1.26 2.B2 4.29 8.87 6.40 12.26 8.86 5. 
55 0.42 2.32 1940 1.00 1. 00 1.00 0.38 5.61 B.63 8.79 B.22 6.09 1. 20 0.57 4. 76 1941 5. 72 4.03 3. 74 6.68 2.23 3.90 14. 73 4. 78 6.40 4.92 1.72 1. 50 1942 1.34 2. 77 6.36 2.36 4.92 14.11 3.62 4.42 4 .,3 2.06 2.15 2. 47 1943 0.31 O. 45 2.08 1.33 1.86 B.B3 11.73 6.56 5. 10 2.Bl 2.0B O. 38 1944 0.98 0.04 4.17 2. 71 3.98 3. 40 5.66 5.81 4. 73 B. 35 0.30 O. 43 1945 O. 47 0.88 O. 03 O. 70 3.11 10.93 10.83 7.24 13. 71 4.10 0.49 O. 53 1.,46 1.13 O. B4 4. 31 .0 10.60 11.20 8.59 6.98 12.2B 1.54 5.08 2. 13 1947 O. 42 2. 66 8. 52 5. 16 4.46 10.90 11.56 10.66 17.61 9.72 5.28 1. 16 1948 3. 70 0.48 O. 78 5. IB 1. 30 2. 17 7.62 8.41 16. 14 2. 74 0.38 0.34 1949 O. 40 0.80 O. 52 1.94 1. 64 15.69 6.28 12. 16 7.36 1. 94 1. 09 6. 47 1950 0.30 O. 79 3. 04 O. B7 2.14 2. 15 6.71 4.20 3.20 11. 17 1.07 1.25 1951 O. 04 2.06 1.01 5.41 5.68 6.34 9.16 8.68 5.38 10.58 0.98 O. 90 1952 1.68 5.20 O. 92 2.99 3.27 3. 46 8.13 8. 74 4. 90 13. 72 O. 1B O. 07 1953 1. 83 1. 89 2. 69 4. 20 0.84 7.85 14.00 12.24 11. 02 7. 65 2. 10 1.82 1954 O. 35 1. 96 2. 71 7. 57 6.77 12. 78 8.08 8.27 5. 45 2. 95 O. 56 1.60 1955 1.31 2.20 2.08 2.67 1.55 12.93 8.45 7.27 4.46 1. 70 0.27 2. 03 1956 O. 72 1.11 0.03 1.92 3.04 3. 70 7.34 3.08 14. 09 6.16 0.38 O. 50 1957 3. 88 2. 57 2.97 5.73 11.35 5.20 10. B9 4.97 12.68 3.15 0.77 5. 75 1958 8. 73 O. 61 5. 10 4.35 6.33 4.86 7.79 6.60 6.26 6. 07 0.62 6.35 1959 2.20 O. 01 5.73 3.90 10.03 9. 19 12.52 5.29 7.72 9.66 2.18 1. 72 1900 0.05 4.59 0.99 4.33 3.20 6.80 7.83 6. 16 12.89 4. 00 2.01 O. 70 1961 3.67 0.43 4. 17 2.03 8.82 3.21 9.25 10. 79 1. 19 4.55 0.97 O. 20 1962 1.22 O. 3. 05 4.08 2.12 7.01 10.45 5. 14 9. 88 1. 70 2. 19 0.31 1963 0.99 4. 18 0.71 0.09 6. 41 7.68 1. 60 5. 54 3.61 2.58 1.62 6. 09 1964 3. 32 2.06 0.93 3.67 2.05 13. 52 9.02 B. 59 5.65 6. 63 0.45 4. 37 1965 0.97 4. 54 2.20 2.04 4.50 10.25 8. 10 7.22 7.32 13. 24 0.32 1. 13 1966 4.09 2.27 1.01 3.02 5.46 9.81 12.03 5.66 5. 77 6.60 0.31 0.84 1967 O. 66 2.55 1. 00 0.0 1. 36 6.33 7.73 3.48 4.37 3. 45 0.13 1. 
40 1968 0.29 2.27 0.80 0.33 7.26 19. 18 10.35 4. 21 10.55 7.36 1. 77 O. 02 1969 1. 66 1. 76 4. 74 1.87 7.17 9.93 3.36 S.09 5.S2 8.44 2.09 2. 14 1970 3. 13 2.89 14.55 0.0 6.92 3. 10 9. 45 13. 07 2. 19 3. 79 0.17 O. 10 1971 O. 40 1. 12 O. 40 0.16 6. 74 8. 43 5.07 5.40 6.47 8.09 1. 80 1.97 1972 2.33 1." 2.09 4.03 1.00 9.99 1. 00 :2.50 1. 77 1.72 4. 15 2. 42 1973 2.66 1.99 2.00 0.84 5.03 4.62 6.03 4.30 5. 74 3. 38 0.98 1. 77 1974 :2. 12 0.5S O. 22 1.37 6.01 10. 43 6.87 5.S9 7. 14 2. 06 1.60 0.95 1975 0.46 4. 15 1.00 1.09 10.13 7.34 7. 72 4.52 8.95 4.36 0.82 O. 21 1976 0.43 2.11 O. 30 1.79 8.74 7.85 2.07 7. 49 2. 96 0.26 2.26 2. 41 1977 3. 62 O. 46 O. 55 1.11 3.01 5.83 2.06 6.S4 13, 28 1. 39 6.17 6. 59 1978 2.34 1.42 3. 73 2.02 5.69 15. 47 6.22 10.41 8.03 4. 57 2.37 4. 55 1979 1.00 1.00 1.00 1.00 4,65 2.34 2.85 4.09 11.96 3.52 2. 52 2 10 1980 3.06 1. 89 1. 94 5.08 4. 15 5.10 7. 52 5.96 16.08 1. 42 1.59 0, 62 1981 O. 54 1. 62 2.27 O. 16 3.18 7. 16 4.05 13.50 5. 12 0.35 1.97 O. 27 Note: 1 indicates missing value. PAGE 149 Table C.S. VARIABLE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC VARIABLE JAN FEB MAR APR /'lAy JUN JUL AUG SEP OCT NOV DEC VARIABLE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC VAR IABLE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC N :5:5 :5:5 :5:5 !5:5 55 55 !55 55 55 55 N 52 :53 :53 54 :53 !53 53 :55 :5:5 54 :54 N N 53 !53 53 54 54 :55 :54 :5:5 55 :54 54 54 Monthly statistics of stations 6038, 6013, 6093, 6042. ***** STATION 6038 ***** MEAN 1.927 1.878 2. 2. 507 4. 575 7.606 7.:235 7.033 7. 567 3. 747 1.379 1. 457 STANDARD DEVIATION 3. 063 1.368 2. 456 1.818 2. !584 3. 776 3. 358 2. 897 4. 085 3. 073 1.283 1. 555 ""***. STATION 6013 ****')f MEAN 2.093 2.718 2. 987 2.928 4. 192 8.613 8.307 7.258 7.521 3. 500 1.567 1.687 STANDARD DEVIATION 1. 780 1.828 2. 006 2. 209 2. 655 3. 694 3. 664 3. 148 3. 732 2. 488 1. 551 1. 135 ***** STATION 6093 ***** MEAN 1.636 2.039 2.619 1.995 4. 049 9. 105 8. 672 8. 309 8. 553 3. 474 1. 
175 1.399 STANDARD DEVIATION 1. 1.450 3. 206 1.953 2.414 4.082 2. 976 3. 490 3. 988 2. 877 1.048 1.260 ***** STATION 6042 ***** MEAN 1.686 1.960 2. 632 2. 860 4. 626 8. 141 7.861 7.194 7. 802 4.709 1.972 1.818 STANDARD DEVIATION 1.688 1.475 2.501 2. 278 2. 659 4.107 3. 408 2. 853 4. 126 3. 163 3. 489 1.863 SKEWNESS 5.016 O. 664 1.762 O. 674 1.032 O. 646 1. 008 O. 724 1.081 1. 138 1.532 1.975 SKEWNESS 1.344 O. 798 O. 676 O. 864 1.073 1. 129 O. 793 1.487 0.716 O. 798 1 833 o 727 SKEWNESS 1. !531 O. 588 2. 779 1.474 O. 381 O. 777 O. 123 O. 974 O. 407 1.295 O. 881 1. 555 SKEWNESS 1.812 O. 881 2. 388 O. 758 O. 666 O. !530 O. 256 O. 636 O. 660 1. 056 5. 743 1.362 c. v. 159.002 72. 856 94.631 72. 498 56. 482 49. 646 46. 420 41. 193 53 983 62.017 93. 042 106. 686 c. v. 85. 043 67. 259 67.151 75. 460 63.326 42. 892 44. 108 43. 370 49. 620 71 082 98. 982 67. 280 c. v. 97.018 71. 102 122. 418 97. 916 59. 615 44. 835 34.313 41.997 46. 624 82 817 89. 186 90. 105 c. v. 100. 077 75. 242 95.010 79. 661 57. 482 50 446 43 353 39. 655 52. 880 67 163 176.912 102. 457 160 PAGE 150 161 Table C.6. Station 6038monthly statistics of the incomplete and estimated series2% missing values. VARIABLE .JAN FEB MAR APR MAY .JUN .JUL AUG SEP OCT NOV DEC VARIABLE .JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC ** STATION 6038 ( 27. MIS. N MEAN 1.88:5 1.906 2.632 2. 4. 624 7.510 7.308 7.030 7.567 3. 747 1.324 1. STANDARD DEVIATION 3. 076 1. 2.464 1.818 2. :583 3.812 3.399 2. 938 4. 3. 073 1.228 1. :58:5 ***** C. V. 163. 191 71.646 93. 616 72. 498 :55. 8:5:5 50. 762 46. 513 41.798 53. 983 82.017 92. 781 108. 781 .. *** USING THE MEAN VALUE ( 27. MIS. ) .... *** MEAN 1.885 1.906 2.632 2. 507 4. 624 7. 510 7. 308 7. 030 7. 567 3. 747 1.324 1. 457 STANDARD DEVIATION 3. 048 1. 353 2.441 1. 818 2. :559 3. 741 3. 336 2. 883 4. 085 3. 073 1. 217 1. C. V. 161. 666 70. 976 92. 746 72. 498 55. 336 49.813 643 41. 016 53. 993 82.017 91. 911 106. 738 ***** RECIPROCAL DISTANCES METHOD 27. MIS. 
VARIABLE JAN FEB MAR APR MAY .JUN .JUL AUG SEP OCT NOV DEC MEAN 1. 921 1.878 2.598 2. 507 4. :563 7. 7. 282 6. 997 7. :567 3. 747 1. 376 1.460 STANDARD DEVIATION 3. 1.368 2. 453 1. 818 2. :598 3. 80:5 3. 341 2. 891 4. 085 3. 073 1.278 1. :556 C. V. 159.265 72. 821 94. 413 72. 498 56. 929 50. 883 41. 310 53. 983 82.017 92. 825 106. 592 ***** NORMAL RATIO METHOD ( 27. MIS. ) **.** VARIABLE .JAN FEB MAR APR MAY .JUN .JUL AUG SEP OCT NOV DEC MEAN 1.927 1.876 2. 598 2. :507 4. 7. :538 7. 279 6.977 7. 567 3. 747 1.349 1.448 STANDARD DEVIATION 3.064 1.370 2. 453 1. 818 2. 603 3. 7:57 3. 339 2. 896 4. 085 3.073 1. 231 1. ***** MODIFIED WEIGHTED AVERAGE VARIABLE .JAN FEB MAR APR I'1AY .JUN .JUL AUG SEP OCT NOV DEC MEAN 1.871 2. 591 2. :;07 4. 551 7. 573 7.235 6. 963 7. 567 3. 747 1. 349 1.449 STANDARD DEVIATION 3.089 1.377 1.818 2.615 3.812 3. 362 2. 909 4. 085 3. 073 1. 230 1. 556 C. v. 158. 978 73. 020 94. 437 72. 498 57. 080 49. 839 45. 877 41. 516 53. 993 82.017 91. 259 107. 430 27. I'IIS. ) ***** C. V. 158. 165 73. :587 94.897 72. 498 57.459 50. 339 46.473 41. 779 53. 983 92.017 91.241 107. 386 SKEWNESS 5.083 0.643 1. 743 0.674 1. 016 0.717 0.953 o 725 1.081 1. 138 1. 645 1.943 SKEWNESS 5.127 0.649 1.759 O. 074 1.025 o 730 0969 o 738 1 081 1 139 1.659 1.977 SKEWNESS 5 039 0.664 1.766 O. 074 1 011 O. 672 0.987 O. 764 1.081 1. 138 1 527 1.968 SKEWNESS :5.014 0.660 1.766 o 674 1.002 O. 703 O. 993 O. 779 1.081 1. 138 1.566 1.992 SKEWNESS 4.B89 0.643 1.757 O. b74 0.977 0.6804 1. 000 O. 773 1. 081 1. 138 1. 571 1. 980 PAGE 151 162 Table C.7. Station 6038monthly statistics of the incomplete and estimated series5% missing values. VARIABLE .JAN FEB MAR APR MAY .JUN ,JUL AUG SEP OCT NOV DEC VARIABLE ,JAN FEB MAR APR MAY ,JUN ,JUL AUG SEP OCT NOV DEC ..... STAT ION 6038 ( 51. 1115. N 54 52 48 51 51 52 53 54 53 52 52 55 MEAN 1.985 1.925 2.619 2.368 4. 521 7.411 7.355 7. 062 7.474 3. 790 1.344 1.457 STANDARD DEVIATION 3. 076 1.391 2. 526 1.731 2. 437 3. 693 3. 2.917 4. 132 3. 
109 1.238 1. 555 ***** C. V. 163. 191 71. 771 96. 460 73. 093 900 49. 699 45. 626 41. 303 287 82. 021 92. 137 106. 686 iHHHHt USING THE MEAN VALUE ( 51. MIS. ) iHHHHt MEAN 1.885 1.925 2. 619 2. 368 4.521 7. 411 7. 356 7. 062 7. 474 3 790 1.344 1.457 STANDARD DEVIATION 3. 048 1.343 2. 357 1.666 2. 345 3. 580 3. 293 2. 890 4. 055 3. 021 1.203 1. 555 C. V. 161.666 69. 758 89. 985 70. 330 51. 866 48. 299 44. 772 40.919 54.255 79. 711 89. 555 106. 686 5.083 0.616 1.804 o 666 O. 0.719 O. 984 0.696 1. 147 1. 131 1.643 1.975 S!('EWNESS 5.127 o 633 1.922 O. 689 O. 884 O. 738 1. 001 O. 702 1 167 1. 161 1 688 1.975 *** .... RECIPROCAL DISTANCES METHOD 51. MIS. *** .. VARIABLE MEAN STANDARD C. V. S!('EWNESS DEVIATION .JAN 1. 921 3. 159. 265 5 039 FEB 1. 867 1.370 73. 336 O. 683 MAR 2. 580 2 426 94. 058 1 812 APR 2. 429 1.786 73. 539 O. 667 MAY 4. 417 2. 414 54. 646 O. 885 ,JUN 7. 613 3. 776 49. 603 0 653 .JUL 7. 259 3. 332 904 1 036 AUG 7. 039 2. 895 41. 128 0 7"""" ...... SEP 7. 733 4. 294 55. 1.022 OCT 3. 837 3. 070 80 022 1.098 NOV 1. 1.270 92. 366 1.578 DEC 1. 1. 106. 686 1.975 11**** NORMAL RATIO METHOD ( 51. IHS. ) .. **** VARIABLE MEAN STANDARD C. V. S!('EWNESS DEVIATION .JAN 1.927 3. 064 158.978 014 FEB 1.856 1.377 74. 160 O. 683 MAR 2. 557 2. 423 94.741 1.847 APR 2. 403 1. 752 72. 939 O. 660 MAY 4. 434 2. 398 54. 081 O. 878 ,JUN 7. 536 3. 690 48. 961 O. 643 .JUL 7. 223 3.365 46. 590 1.012 AUG 7. 062 2. 890 40.919 O. 702 SEP 7. 691 4. 221 54. 883 1.008 OCT 3.773 3. 041 80. 1. 1'7 NOV 1.335 1. 231 92. 198 1.608 DEC 1. 457 1. 106. 686 1.975 ...... MODIFIED WEIGHTED AVERAGE 51. MIS. .. ..... VARIABLE MEAN STANnARD C. V. SKEWNESS DEVIATION .JAN 1. 3. 089 158. 165 4. 889 FEB 1 849 1. 387 74. 995 O. 657 MAR 2. 561 2. 459 96. 012 1. 756 APR 2. 405 1. 773 73. 744 O. 644 MAY 4. 403 &I 438 55. 374 O. 846 .JUN 7. 584 3.787 49. 934 o 666 .JUL 7. 197 3. 396 47.191 O. 973 AUG 7. 019 2. 907 41. 409 0.725 SEP 7. 796 4. 425 56. 753 1.085 OCT 3. 816 3. 092 81.032 1. 106 NOV 1 349 1. 
221 90. 547 1 624 DEC 1. 457 1. 555 106. 686 1 975 PAGE 152 163 Table C.8. Station 6038monthly statistics of the incomplete and estimated series10% missing value. VARIABLE .JAN FEB MAR APR MAY ,,)UN ,JUL AUG SEP OCT NOV DEC VARIABLE ")AN FEB MAR APR MAY ,,)UN .JlJL AUG SEP OCT NOV ** ...... STATION 6038 ( 10;: MIS. .**** N MEAN STANDARD DEVIATION 50 1.848 3. 108 47 1.924 1. 374 '2 2. 509 2. 463 49 2. '6' 1. 874 48 4.488 :o!.468 '1 7.807 3. 742 50 7.:O!23 3.308 49 7.160 2.940 51 7. 582 4. 124 50 3. 706 2.976 50 1. 315 1. 173 51 1.486 1. 595 USING THE t1EAN VALUE ( 10:1. MEAN 1. 848 1.923 2. 509 2. 566 4. 488 7. 807 7. 223 7. 160 7. 582 3. 706 1.314 STANDARD DEVIATION 2. 961 1. 268 2. 394 1. 767 2. 302 3.601 3. 151 2. 772 3.969 2.835 1. 117 *** ... RECIPROCAL DISTANCES METHOD VARIABLE MEAN STANDARD DEVIATION ")AN 1.872 3.025 FEB 1.876 1.366 MAR 2.610 2.462 APR 2.608 1.941 MAY 4.430 2. 462 .JUN 7. 621 3.808 ,,)UL 7.435 3.313 AUG 7. 158 2.832 SEP 7. 681 4.042 OCT 3. 746 3.032 NOV 1.323 1. 156 DEC 1.445 1. 553 ***** NORMAL RATIO METHOD ( 10:1. VARIABLE MEAN STANDARD DEVIATION ")AN 1.883 3.025 FEB 1. 817 1.341 MAR 2. 590 2.448 APR 2. 5:56 1.870 MAY 4.498 2.432 .JUN 7.632 3.814 ,,)UL 7.263 3. 188 AUG 7. 121 2.800 SEP 7. 624 4.019 OCT 3. 660 2. 963 NOV 1. 347 1. 175 DEC 1.451 1. :5:53 MIS. ***.* C. V. 168. 167 71.410 98. 180 73. 066 54. 985 47. 931 4:5.803 41. 064 :54. 393 80. 318 89. 225 107. 372 C. V. 160. 178 65.926 411 68. 873 51.295 46. 120 43. 632 38.715 52. 341 76. 501 85. 021 10:1. MIS. C. V. 161. 608 72. 829 94. 323 74. 405 5:5. 572 49. 967 44. 566 39. 562 52. 631 80. 926 87. 370 107. 435 SKEWNESS 5.328 O. 705 1.904 O. 631 1. 051 0.675 1.069 O. t.97 1.078 1.261 1.498 1.943 SKEWNESS ***** 5. 572 O. 760 1.955 O. 665 1. 120 O. 700 1. 118 O. 736 1. 117 1. 318 1.568 :5.225 O. 736 1.745 O. 762 O. 999 0.629 O. 861 0.715 1.014 1.200 1.421 2. 007 MIS. ) ***** C. V. 160. 648 5.217 73. 800 O. 803 94. 535 1. 775 73. 141 O. 678 54. 064 0.960 49. 972 O. 647 43. 889 1.046 39. 
326 O. 753 52. 721 1.064 80. 954 1. 192 EJ7. 209 1.382 107. 011 2. 000 ****if MODIFIED WEIGHTED AVERAGE 10:1. MIS. ***** VARIABLE MEAN STANDARD C. V. SKEWNESS DEVIA1ION ")AN 1 908 3. 165 165. 872 4. 645 FEB 1. 841 1.402 76. 132 O. 696 MAR 2. 616 2. 478 94. 722 1.728 APR 2. 596 1.959 75. 448 O. 780 MAY 4. 2. 532 57.217 O. 947 .JUN 7. 3. 935 52. 237 O. 489 ,,)UL 7. 39't 3.359 45. 398 O. 826 AUG 7. 11::. 2. 895 SEP 7. 681 4. 086 "40.689 O. 723 53. 193 1.003 OCT 3. 693 3. 119 84.458 1. 155 NOV 1.299 1. 129 86. 871 1.557 DEC 1. 419 1.568 110.497 1.971 PAGE 153 164 Table C.9. Station 6038monthly statistics of the incomplete and estimated series15% missing values. VARIABLE ,JAN FEB MAR APR MAY ,JVN ,JVL AVG SEP OCT NOV DEC VARIABLE ,JAN FEB MAR APR MAY ,JVN ,JUL AUG SEP OCT NOV DEC **<1* STATION 6038 ( 157MIS. ..... ** N MEAN STANDARD DEVIATION 46 1.927 3. 284 45 1.750 1.449 43 2.448 2. 099 47 2.484 1.857 47 4.674 2. 490 48 7.727 3.854 50 7. 111 3.467 45 7.245 2. 954 47 7.293 3. 766 49 3.656 3. 003 46 1.488 1.343 49 1.470 1.598 USING THE MEAN VALUE ( 157. MEAN 1. 927 1.750 2. 449 2. 484 4. 673 7.728 7. 111 7. 244 7. 293 3. 656 1.488 1. 470 STANDARD DEVIATION 2. 998 1.308 1. 851 1. 714 2. 298 3. 596 3. 303 2. 667 3. 475 2. 831 1.226 1. 506 MIS. ***** C. V. 170.428 82. 791 85. 738 74. 749 53. 272 49.876 48. 762 40.777 51. 632 82. 138 90. 292 108. 686 ) *** .. c. V. 155. 539 74. 733 75. 601 69.007 49. 174 46. 529 46. 450 36. 813 47. 657 77. 431 82. 403 102. 471 4.910 O. 895 1. 171 O. 749 O. 887 O. 592 1. 101 O. 664 O. 673 1. 182 1. 417 1. 975 5. 338 0.984 1.314 O. 808 O. 956 O. 631 1. 152 O. 730 O. 725 1.247 1. 540 2.085 ***** RECIPROCAL DISTANCES METHOD 157. MIS. ***** VARIABLE MEAN STANDARD DEVIATION ,JAN 2. 015 3.019 FEB 1.898 1.443 MAR 2. 672 2. 557 APR 2. 659 1. 993 MAY 4.711 2. 574 JUN 7. 676 3. 765 JUL 7. 159 3. 338 AUG 7.174 2. 745 SEP 7.401 3. 749 OCT 3. 893 3. 118 NOV 1. 402 1.272 DEC 1. 463 1. 537 ..... 
VARIABLE       C.V.   SKEWNESS
JAN         149.847      5.144
FEB          76.039      0.757
MAR          95.699      2.011
APR          74.944      0.696
MAY          54.652      0.775
JUN          49.050      0.578
JUL          46.633      1.079
AUG          38.261      0.734
SEP          50.660      0.714
OCT          80.103      1.068
NOV          90.778      1.534
DEC         105.049      1.979

***** NORMAL RATIO METHOD ( 15% MIS. ) *****

VARIABLE       MEAN   STANDARD       C.V.   SKEWNESS
                     DEVIATION
JAN           2.017      3.029    150.183      5.095
FEB           1.808      1.370     75.774      0.812
MAR           2.621      2.377     90.718      1.575
APR           2.571      1.878     73.030      0.625
MAY           4.750      2.599     54.711      0.794
JUN           7.585      3.732     49.208      0.639
JUL           7.129      3.362     47.161      1.086
AUG           7.130      2.739     38.419      0.757
SEP           7.314      3.718     50.843      0.697
OCT           3.815      2.989     78.337      1.009
NOV           1.412      1.278     90.467      1.507
DEC           1.463      1.538    105.085      1.994

***** MODIFIED WEIGHTED AVERAGE ( 15% MIS. ) *****

VARIABLE       MEAN   STANDARD       C.V.   SKEWNESS
                     DEVIATION
JAN           2.152      3.093    143.705      4.701
FEB           1.853      1.467     79.163      0.844
MAR           2.621      2.547     97.169      1.981
APR           2.490      1.969     79.056      0.564
MAY           4.760      2.707     56.857      0.797
JUN           7.557      3.871     51.220      0.529
JUL           7.053      3.369     47.772      1.113
AUG           7.094      2.835     39.965      0.700
SEP           7.315      3.857     52.732      0.766
OCT           3.865      3.203     82.859      1.120
NOV           1.413      1.248     88.354      1.623
DEC           1.431      1.553    108.479      1.956

Table C.10. Station 6038 - monthly statistics of the incomplete and estimated series - 20% missing values.

***** STATION 6038 ( 20% MIS. ) *****

VARIABLE       MEAN   STANDARD       C.V.   SKEWNESS
                     DEVIATION
JAN           1.856      3.240    174.615      5.048
FEB           1.909      1.436        213      0.656
MAR           2.412      2.112     87.561      1.144
APR              2.      1.862     72.393      0.701
MAY           4.797      2.769        730      0.912
JUN           7.306      3.900     53.376      0.818
JUL           7.306      3.720     50.916      0.887
AUG           7.023      2.826     40.233      0.383
SEP           7.528      4.142     55.024      1.291
OCT           3.841      3.210     83.576      1.180
NOV           1.364      1.317     96.582      1.603
DEC           1.573      1.703    108.218      1.765

N (eleven legible values, in source order): 47 45 48 43 43 44 42 43 45 42 43

***** USING THE MEAN VALUE ( 20% MIS. ) *****

VARIABLE       MEAN   STANDARD       C.V.   SKEWNESS
                     DEVIATION
JAN           1.856      2.991    161.109      5.434
FEB           1.909      1.296     67.886      0.720
MAR           2.411      1.906     79.048      1.257
APR           2.572      1.738     67.547      0.748
MAY           4.797      2.442     50.905      1.023
JUN           7.307      3.439     47.067      0.917
JUL           7.307      3.319     45.430      0.983
AUG           7.022      2.462     35.061      0.435
SEP              7.      3.653     48.524      1.448
OCT           3.841      2.898     75.446     1 0:96
NOV           1.363      1.148     84.213      1.820
DEC           1.573      1.501     95.482      1.981

***** RECIPROCAL DISTANCES METHOD ( 20% MIS. ) *****

VARIABLE       MEAN   STANDARD       C.V.   SKEWNESS
                     DEVIATION
JAN           1.930      3.032    157.096      5.151
FEB           1.870      1.343     71.811      0.686
MAR           2.638         2.     95.978      2.121
APR           2.520      1.868     74.132      0.665
MAY              4.      2.636     56.673      0.916
JUN              7.      3.629     48.200      0.662
JUL           7.643      3.501     45.811      0.669
AUG           7.060         2.                 0.400
SEP           7.839      4.065     51.860      0.950
OCT           3.840      3.065     79.805      1.128
NOV           1.710      2.417    141.338      4.585
DEC           1.540      1.594    103.496      1.811

***** NORMAL RATIO METHOD ( 20% MIS. ) *****

VARIABLE       MEAN   STANDARD       C.V.   SKEWNESS
                     DEVIATION
JAN           1.952      3.037    155.594      5.110
FEB           1.828      1.340     73.293      0.780
MAR           2.580      2.344     90.853      1.626
APR           2.530      1.870     73.914      0.665
MAY           4.671      2.658        899      0.859
JUN           7.181      3.591     50.001      0.894
JUL           7.383      3.404     46.102      0.873
AUG           6.950      2.570     36.975      0.428
SEP           7.684      3.930     51.143      1.097
OCT           3.779      3.001     79.420      1.232
NOV           1.459      1.418     97.228      1.850
DEC           1.499      1.543    102.983      1.948

***** MODIFIED WEIGHTED AVERAGE ( 20% MIS. ) *****

VARIABLE       MEAN   STANDARD       C.V.   SKEWNESS
                     DEVIATION
JAN           2.027      3.143    155.040      4.605
FEB           1.923      1.371     75.175      0.674
MAR           2.591      2.525     97.451      2.107
APR             501      1.996     75.377      0.644
MAY           4.705      2.770     58.864      0.779
JUN           7.283      3.690     50.666      0.732
JUL           7.572      3.610     47.677      0.736
AUG           6.941      2.593     37.356      0.517
SEP           7.929      4.323     55.215      0.814
OCT           3.738      3.188     85.288      1.043
NOV           1.510      1.655    109.608      :;;1.
DEC           1.494      1.648    110.308      1.751

Table C.11. Lag-zero covariance matrices of the monthly rainfall series of stations 6013, 6093, 6042. All matrices are symmetric.

JAN:    3.169                      FEB:    3.342
        2.393   2.518                      1.813   2.101
        2.343   1.723   2.848              1.496   1.505   2.175

MAR:    4.022                      APR:    4.881
        3.501  10.275                      1.959   3.814
        2.677   6.921   6.254              3.545   2.378   5.190

MAY:    7.047                      JUN:   13.647
        3.610   5.826                      8.235  16.666
        3.539   2.871   7.071              7.373   7.138  16.865

JUL:   13.425                      AUG:    9.907
        3.655   8.854                      0.628  12.177
        3.378   2.546  11.615              1.630   0.260   8.138

SEP:   13.928                      OCT:    6.189
        9.080  15.902                      5.032   8.278
        5.913   6.443  17.022              4.516   5.896  10.004

NOV:    2.406                      DEC:    1.289
        0.822   1.098                      1.045   1.588
        1.395   0.547  12.174              1.145   1.510   3.471

Table C.12. Normality transformations applied on the monthly rainfall data of Station 6038.

***** STATION 6038 ( NO TRANSFORMATION ) *****

VARIABLE       MEAN   STANDARD   SKEWNESS       C.V.
                     DEVIATION
JAN           1.927      3.063      5.016
FEB           1.878      1.368      0.664     72.856
MAR                        456      1.762     94.631
APR           2.507      1.818      0.674         n.
MAY           4.575      2.584      1.032     56.482
JUN           7.606      3.776      0.646
JUL                      3.358      1.008     46.420
AUG           7.033         2.      0.724     41.193
SEP           7.567      4.085      1.081     53.983
OCT           3.747      3.073      1.138     82.017
NOV           1.379         1.      1.532     93.042
DEC           1.457      1.555      1.975    106.686

***** STATION 6038 ( LOGARITHMIC TRANSFORMATION ) *****

VARIABLE       MEAN   STANDARD   SKEWNESS       C.V.
                     DEVIATION
JAN           0.009      0.532      0.266        555
FEB           0.103      0.466      1.224    452.424
MAR           0.188      0.521      0.936    276.239
APR           0.218      0.502      1.549    230.355
MAY           0.593      0.251      0.164     42.276
JUN           0.822      0.246      0.875     29.918
JUL           0.809      0.227      1.034     28.124
AUG           0.810      0.184      0.226     22.643
SEP           0.814      0.253      0.645     31.143
OCT           0.385      0.488      1.330    126.596
NOV           0.088      0.519      0.755    588.139
DEC           0.068      0.479      0.173    706.420

***** STATION 6038 ( POWER = 0.25 ) *****

VARIABLE       MEAN   STANDARD   SKEWNESS       C.V.
                     DEVIATION
JAN           1.040      0.314      0.738     30.141
FEB           1.096      0.256      0.540     23.382
MAR           1.161      0.312      0.108     26.870
APR           1.175      0.283      0.698     24.109
MAY           1.421      0.203      0.143     14.281
JUN           1.620      0.218      0.355     13.449
JUL           1.606      0.199      0.456     12.387
AUG           1.603      0.168      0.024     10.457
SEP           1.614      0.227      0.140     14.051
OCT           1.292      0.316      0.334     24.436
NOV           0.990      0.268      0.130     27.100
DEC           0.998      0.270      0.364     27.061

***** STATION 6038 ( POWER = 0.35 ) *****

VARIABLE       MEAN   STANDARD   SKEWNESS       C.V.
                     DEVIATION
JAN           1.083      0.461      1.219         42
FEB           1.154      0.361      0.327     31.285
MAR           1.257      0.458      0.163     36.456
APR           1.274      0.407      0.443     31.909
MAY           1.644      0.328      0.264     19.933
JUN           1.975      0.366      0.180     18.530
JUL           1.949      0.332      0.238     17.051
AUG           1.942         0.      0.121     14.592
SEP           1.965      0.382      0.040     19.458
OCT           1.456      0.480      0.060     32.927
NOV           1.006      0.370      0.110     36.724
DEC           1.017      0.383      0.581     37.667

***** STATION 6038 ( SQUARE ROOT ) *****

VARIABLE       MEAN   STANDARD   SKEWNESS       C.V.
                     DEVIATION
JAN           1.178      0.740      2.043     62.830
FEB           1.265      0.533      0.048     42.115
MAR           1.442      0.724      0.540     50.192
APR           1.459      0.620      0.117     42.466
MAY           2.059      0.584      0.445     28.373
JUN           2.671      0.693      0.053     25.944
JUL           2.618      0.625      0.073     23.873
AUG           2.598      0.539      0.265     20.760
SEP           2.654      0.729      0.295     27.446
OCT           1.768      0.795      0.282     44.953
NOV           1.051      0.529      0.461     50.341
DEC           1.067      0.570      0.908     53.473

Table C.13. Statistics of the estimated series - univariate model.

***** UNIVARIATE MODEL ( 10% MIS. ) *****

VARIABLE       MEAN   STANDARD   SKEWNESS       C.V.
                     DEVIATION
JAN           1.854      2.980      5.462    160.709
FEB           1.891      1.278      0.815     67.608
MAR           2.503      2.402      1.943     95.956
APR           2.499      1.805      0.701     72.242
MAY           4.504      2.305      1.095     51.181
JUN           7.809      3.601      0.697     46.120
JUL           7.185      3.161      1.141     43.995
AUG           7.067      2.809      0.780     39.757
SEP           7.563      3.975      1.127     52.553
OCT           3.594      2.871      1.369     79.965
NOV           1.314      1.132      1.515     86.162
DEC           1.475      1.539      2.019    104.304

***** UNIVARIATE MODEL ( 20% MIS. ) *****

VARIABLE       MEAN   STANDARD   SKEWNESS       C.V.
                     DEVIATION
JAN           1.777      2.997      5.480    168.=23
FEB           1.846      1.303      0.854     70.585
MAR           2.334      1.914      1.364     81.989
APR           2.523      1.743      0.828     69.053
MAY           4.713      2.449      1.119     51.9'9
JUN           7.199      3.446      1.009     47.865
JUL           7.216      3.325      1.062     46.072
AUG           6.961      2.465      0.510     35.406
SEP           7.420      3.659      1.531     49.316
OCT           3.719      2.910      1.408     78.235
NOV           1.302      1.153      1.954     88.580
DEC           1.498      1.509      2.101    100.727

Table C.14. Statistics of the estimated series - bivariate model.

***** BIVARIATE MODEL ( 10% MIS. ) *****

VARIABLE       MEAN   STANDARD   SKEWNESS       C.V.
                     DEVIATION
JAN           1.825      2.972      5.532    162.862
FEB           1.869      1.287      0.841     68.866
MAR           2.483      2.398      1.976     96.608
APR           2.534      1.781      0.698     70.275
MAY

APPENDIX D
COMPUTER PROGRAMS

RAEMVU (Recursive Algorithm for the Estimation of Missing Values - Univariate Model)

Input

The program inputs the time series; the parameters of the normality transformation to be performed (power transformation); the number of gaps (not necessarily the number of missing values, unless all the gaps are single values); and, for each gap, the starting and ending point (counting starts from the first value in the series). For the first iteration the missing values in the original series (usually indicated by a code or by a negative value) are initialized to zeros or to some other desired initial estimates.

Program Description

The main program reads the input data and then calls subroutine ARMA repeatedly (each call corresponds to one iteration). Subroutine ARMA performs the following calculations each time it is called:

(1) The input series is transformed to normal (using the selected transformation) and stationary (by subtracting the monthly means and dividing by the monthly standard deviations).
(2) The mean, variance, autocovariance function (ACVF), autocorrelation function (ACF) and partial autocorrelation function (PACF) of the transformed series are computed by calling the IMSL subroutine FTAUTO.
(3) Preliminary estimates of the p AR parameters and the q MA parameters are computed by calling the IMSL subroutines FTARPS and FTMPS in turn.
(4) Maximum likelihood estimates (MLE) of the AR and MA parameters are computed and the residual series is calculated by calling the IMSL subroutine FTMXL.
(5) The mean, variance, ACVF, ACF and PACF of the residual series are computed by calling the IMSL subroutine FTAUTO.
(6) The parameters of the fitted model (MLE) are used to estimate the missing values in all the gaps by the Box-Jenkins minimum mean square error forecasting procedure.
(7) The inverse normality and stationarity transformations are performed on the series, and the estimated complete series is output.

The estimated series (output from the first call) now becomes the input series for the second call, and the above seven steps are repeated. Subroutine ARMA is called as many times as needed until the parameter estimates and the missing-value estimates stabilize. The program is set up for five calls (more can easily be added as needed), and a stabilization check on the parameters is provided so that the iterations stop when the two parameters remain constant to the second decimal place.

The computation and printing of the ACVF, ACF and PACF of the transformed and residual series (steps 2 and 5) are not necessary and can be eliminated from the program without any problem. However, their inclusion permits checking the goodness of fit of the model at each iteration through diagnostic checks on the residuals.

A listing of the program in FORTRAN follows.

RAEMVB (Recursive Algorithm for the Estimation of Missing Values - Bivariate Model)

The special case in which only one series is incomplete and the other is complete is considered here. However, the program can easily be modified to handle the case in which both series are incomplete.

Input

The program inputs the two time series, the parameters of the normality transformation to be performed on each series, the number of gaps, and the position of each gap in the incomplete series. The missing values in the incomplete series are initialized to zeros or to some other values.

Program Description

The main program reads the input data and then calls subroutine BIVAR repeatedly (each call corresponds to one iteration).
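The call-per-iteration scheme shared by RAEMVU and RAEMVB can be illustrated with a short Python sketch. It is not a translation of the FORTRAN listings: it assumes a plain AR(1) model fitted from the lag-one autocorrelation in place of the IMSL model fits, it omits the power (normality) transformation, and the function names `fit_and_fill` and `recursive_fill` are hypothetical. Gap positions are 0-based here, whereas the programs count from 1.

```python
import math


def fit_and_fill(series, gaps, period=12):
    """One call: standardize by calendar month, fit AR(1), forecast the gaps.

    `series` holds monthly values with the gaps already set to initial
    guesses; `gaps` is a list of (start, end) index pairs, inclusive.
    Returns the re-estimated series and the AR(1) parameter.
    """
    n = len(series)
    # Monthly means and standard deviations (stationarity transformation).
    means, stds = [], []
    for m in range(period):
        vals = series[m::period]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / (len(vals) - 1)
        means.append(mu)
        stds.append(math.sqrt(var))
    z = [(series[t] - means[t % period]) / stds[t % period] for t in range(n)]

    # Assumption: the lag-one autocorrelation serves as the AR(1) parameter.
    zbar = sum(z) / n
    c0 = sum((v - zbar) ** 2 for v in z)
    c1 = sum((z[t] - zbar) * (z[t - 1] - zbar) for t in range(1, n))
    phi = c1 / c0

    # Minimum mean square error forecast through each gap.
    for start, end in gaps:
        for t in range(start, end + 1):
            z[t] = phi * z[t - 1] if t > 0 else 0.0

    # Inverse stationarity transformation.
    filled = [z[t] * stds[t % period] + means[t % period] for t in range(n)]
    return filled, phi


def recursive_fill(series, gaps, max_calls=5, tol=0.001):
    """Repeat fit_and_fill until the parameter stabilizes or max_calls is hit."""
    current = list(series)
    for start, end in gaps:
        for t in range(start, end + 1):
            current[t] = 0.0  # initialize the missing values to zero
    phi_prev = None
    for call in range(1, max_calls + 1):
        current, phi = fit_and_fill(current, gaps)
        if phi_prev is not None and abs(phi - phi_prev) <= tol:
            break
        phi_prev = phi
    return current, phi, call
```

Calling `recursive_fill(data, [(30, 31), (100, 100)])` on a list of monthly values returns the completed series, the last parameter estimate, and the number of calls made. Observed values pass through unchanged because each call standardizes and then back-transforms them with the same monthly statistics.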
Subroutine BIVAR performs the following each time it is called:

(1) The two input series are transformed to normal and stationary by calling subroutine STAT.
(2) The lag-zero and lag-one autocovariances and cross-covariances of the two series are computed by calling the IMSL subroutine FTCRXY.
(3) The parameter matrices A and B are calculated. Inversion and multiplication of matrices are performed by the IMSL subroutines LINV2F, VMULFF and VMULFP.
(4) The parameter matrices A and B are used to estimate the missing values of the incomplete series.
(5) The inverse normality and stationarity transformations are performed on the two series, and the estimated complete series is output.

The estimated series (output from the first call) now becomes the input series for the second call, and the above five steps are repeated until the matrices A and B stabilize. The program provides no check for stabilization (eight values would have to be checked simultaneously); instead the subroutine is called a fixed number of times.

A listing of the computer program in FORTRAN follows.

C
C     PROGRAM RAEMVU
C
C     RECURSIVE ALGORITHM FOR THE ESTIMATION OF
C     MISSING VALUES - UNIVARIATE MODEL
C
      DIMENSION RAIN(60,12),NYEAR(60)
      DIMENSION VRAIN(800),E1RAIN(60,12),VRAIN1(800),
     1 E2RAIN(60,12),VRAIN2(800),
     2 E3RAIN(60,12),VRAIN3(800),
     3 E4RAIN(60,12),VRAIN4(800),
     4 E5RAIN(60,12),VRAIN5(800),
     5 E6RAIN(60,12),VRAIN6(800)
      DIMENSION LI(200),LG(200),ISI(200),IEI(200)
      COMMON/A/ ID,NYEAR
      COMMON/B/ N,C,P
      COMMON/C/ NG,ISG(200),IEG(200)
C
C     READ INPUT PARAMETERS
C
C     HEADER .. TITLE
C     N ....... NUMBER OF YEARS
C     NG ...... NUMBER OF GAPS
C     LI ...... LENGTH OF INTEREVENT
C     LG ...... LENGTH OF GAP
C     C,P ..... PARAMETERS OF THE TRANSFORMATION
C     TRANSFORMED SERIES Y=(X+C)**P
C
      READ(5,10) HEADER
   10 FORMAT(20A4)
      READ(5,20) C,P
   20 FORMAT(2F5.2)
      READ(5,30) N,NG
   30 FORMAT(2(I4/))
      DO 5 I=1,NG
    5 READ(5,40) LI(I),LG(I)
   40 FORMAT(2I4)
      READ(10,110) (ID,NYEAR(I),(RAIN(I,J),J=1,12),I=1,N)
  110 FORMAT(A4,I3,1X,12F6.2)
C
C     FROM THE INPUT VARIABLES LI AND LG TWO ARRAYS OF
C     LENGTH NG ARE COMPUTED. THEN THE STARTING POINT
C     OF THE KTH GAP IS ISG(K) AND THE ENDING POINT
C     IS IEG(K).
C
      ISI(1)=1
      IEI(1)=LI(1)
      ISG(1)=IEI(1)+1
      IEG(1)=ISG(1)+LG(1)-1
      DO 11 I=2,NG
      ISI(I)=IEG(I-1)+1
      IEI(I)=ISI(I)+LI(I)-1
      ISG(I)=IEI(I)+1
      IEG(I)=ISG(I)+LG(I)-1
   11 CONTINUE
C
      WRITE(6,15) HEADER
   15 FORMAT(20A4,///)
C
C     PRINT THE POSITION OF THE GAPS FOR CHECKING
C
      WRITE(6,100) (I,ISG(I),IEG(I),I=1,NG)
  100 FORMAT(3I6)
C
C     INITIALIZE THE MISSING VALUES TO ZERO
C
      DO 50 I=1,N
      DO 50 J=1,12
      IF(RAIN(I,J).EQ.-1.) RAIN(I,J)=0.
   50 CONTINUE
C
C     SUBROUTINE ARMA IS CALLED TO FIT AN ARMA(P,Q) MODEL
C     TO THE INPUT SERIES. THE PARAMETERS OF THE MODEL
C     ARE USED FOR THE ESTIMATION OF THE MISSING VALUES.
C
      CALL ARMA(RAIN,VRAIN,E1RAIN,VRAIN1,PHI1,THETA1)
      CALL ARMA(E1RAIN,VRAIN1,E2RAIN,VRAIN2,PHI2,THETA2)
      CALL ARMA(E2RAIN,VRAIN2,E3RAIN,VRAIN3,PHI3,THETA3)
      IF((PHI3-PHI2).LE.0.001.AND.(THETA3-THETA2).LE.0.001) GO TO 999
      CALL ARMA(E3RAIN,VRAIN3,E4RAIN,VRAIN4,PHI4,THETA4)
      IF((PHI4-PHI3).LE.0.001.AND.(THETA4-THETA3).LE.0.001) GO TO 999
      CALL ARMA(E4RAIN,VRAIN4,E5RAIN,VRAIN5,PHI5,THETA5)
      IF((PHI5-PHI4).LE.0.001.AND.(THETA5-THETA4).LE.0.001) GO TO 999
      CALL ARMA(E5RAIN,VRAIN5,E6RAIN,VRAIN6,PHI6,THETA6)
  999 STOP
      END
C
C     SUBROUTINE ARMA
C
C     SUBROUTINE ARMA FITS AN ARMA(P,Q) MODEL TO THE INPUT
C     SERIES EACH TIME IT IS CALLED. THE MISSING VALUES ARE
C     ESTIMATED BY THE BOX-JENKINS FORECASTING PROCEDURE
C     AND THE ESTIMATED SERIES IS SAVED TO BE THE INPUT
C     SERIES TO THE NEXT CALL.
C
      SUBROUTINE ARMA(RAIN,VRAIN,ERAIN,EVRAIN,PHI1,THETA1)
      REAL MEAN(13),MP,LP
      DIMENSION ERAIN(60,12),EVRAIN(800),Z(800)
      DIMENSION RAIN(60,12),NYEAR(60),IND(8),PHI
      ...
      STD(I)=((STD(I)-MEAN(I)**2*FLOAT(N))/(FLOAT(N)-1.))**0.5
   36 CONTINUE
      MEAN(13)=0.
STD(13)0. DO 38 11, N MEAN(13)MEAN(13)+YTOTAL(I)/FLOAT(N) STD(13)STDC13l+YTOTALCI)**2 CONTINUE STDC 13 )( CSTDe 13)MEANC 13)**2*FLOATCN) ) I CFLOATCNll. ) )**0. S NOW STANDARDIZE THE MONTHLY SERIES DO 42 Il,N DO 42 "'1,101 RAINCI,"')(RAINCI,"')MEAN("'l)/STD("') CONTINUE STORE THE MA TR IX SER IES IN A VECTOR SER I ES DO 30 11, N DO 30 "'1, 12 K"'+( 11 )*12 VRAINCK)RAIN(I,"') CONTINUE NN=N*12 COMPUTE AC. PAC. AND ACV OF THE SERIES USING SUBROUTINE FTAUTO L=30 CALL FTAUTOCVRAIN.NN,L.L,7,AMEAN.ACV(1),ACV(2).AC(2). PACV(2),WKAREA) SET AC AND PACV OF LAG ZERO TO ONE AC (1I'"'l. PACV( 1 )1. WRITE(6,130) AMEAN,ACV(1) FORMATCIHt.III, 15X. 'STANDARDIZED TRANSFORMED SERIES'. II. 1 5X. 'MEAN ......... F15. 7. II, 2 SX, 'VARIANCE ..... '.FI5. 7.111) WRlTE(6, 135) FORMATCSX, 'LAG', 12X. 'AC', 12X. 'PACV'.1.2X,4'( ''I/) SSQ=O. LP1=L+l DO 40 1=1. LP 1 IM1'"'Il WRITE(6.140) IM1.ACCI),PACV(l) SSQ=SSQ+ACCI)**2 CONTINUE SSQ=SSQl FORMATC3X. IS.2FI5. 7) WRITEC6, 145) SSQ FORMAT PAGE 165 C THE RESIDUALS C ITHETA"l IPHI=l DO '0 I 1 1. 11 DO 12=1.11 SUMSQICI1.I2'=ETA**2 00 60 I3=3.NN ETAITEMPCI3lPHICI2'*TEMP(I31'+THETA(Il)*ETA ETA=ETAI SUMSQICI1. I2)"SUMSQ1(I1. I2'+ETAl**2 60 CONTINUE IFCSUMSQl(II. 12). IPHI GO TO '5 ITHETAIl IPHI'"'I2 55 CONTINUE 50 CONTINUE C C WRITE OUT THE SUM OF SGUARES SURFACE OF THE RESIDUALS C 160 WRITE(6,160) FORMAT(lHl.III.50X. 'TABLE 2',11.15X. 'SUM OF SQUARES OF THE'. 'RESIDUALS OF THE STANDARDIZED TRANSFORMED SERIES'. 111.'2 X. PHI ) 165 170 C WRITE(6,16" (PHI(Il.Il,l1) FORMAT(5X. 'THETA',2X, 11<3X, F'. 2, lXlIl WR ITE (6, 170) PAGE 166 C C C C DSEEO=1234:57. DO CALL QONMLCDSEED,NN, Z) 00 20 Il,NO I1=ISO C 1) I2:IEQ(1) K=I2Il+1 IFCK.QT. 1) 00 TO :51 EVRAINCI1)PHI1*VRAINCI11)THETA1*Z(Ill) GO TO 20 !51 EVRAINCll)PHI1*VRAIN(Ill)THETAl*Z(Ill) DO 31 L=2,K 31 EVRAINCIl+Ll):PHIl*EVRAIN(Il+L2) 20 CONTINUE C APPLY THE INVERSE TRANSFORMATIONS ON THE SERIES. 
C
      PP=1/P
      DO 61 I=1,N
      K1=(I-1)*12+1
      K2=I*12
      DO 71 L=K1,K2
      J=L-(I-1)*12
      ERAIN(I,J)=(EVRAIN(L)*STD(J)+MEAN(J))**PP
   71 CONTINUE
   61 CONTINUE
C
      RETURN
      END
//GO.SYSIN DD *
***** STATION 6038 UNIVARIATE MODEL *****
0. 5
  55
  25
   1   4
  27   1
   5   3
  11   5
   3   2
  63   3
   1   3
  10   4
   2   1
  19   2
  83   1
  11   1
  14   2
  36   2
  31   1
  33   7
  49   2
  19   7
  21   2
  39   1
  11   2
   2   1
  25   1
   2   2
  30   2
//GO.FT10F001 DD DSN=UF.80063401.57.C60381,DISP=(OLD,KEEP)

C
C     PROGRAM RAEMVB
C
C     RECURSIVE ALGORITHM FOR THE ESTIMATION OF
C     MISSING VALUES - BIVARIATE MODEL
C
      DIMENSION RAIN1(60,12),VR1(800),RAIN2(60,12),VR2(800),
     1 E1R1(60,12),V1R1(800),E1R2(60,12),V1R2(800),A1(2,2),
     2 B1(2,2),M01(2,2),M11(2,2)
      DIMENSION E2R1(60,12),V2R1(800),E2R2(60,12),V2R2(800),
     1 A2(2,2),B2(2,2),M02(2,2),M12(2,2)
      DIMENSION E3R1(60,12),V3R1(800),E3R2(60,12),V3R2(800),
     1 A3(2,2),B3(2,2),M03(2,2),M13(2,2)
      DIMENSION E4R1(60,12),V4R1(800),E4R2(60,12),V4R2(800),
     1 A4(2,2),B4(2,2),M04(2,2),M14(2,2)
      DIMENSION E5R1(60,12),V5R1(800),E5R2(60,12),V5R2(800),
     1 A5(2,2),B5(2,2),M05(2,2),M15(2,2)
      DIMENSION LI(200),LG(200),ISI(200),IEI(200)
      COMMON/A/ ID1,ID2,NYEAR(60)
      COMMON/B/ N,C,P
      COMMON/C/ NG,ISG(200),IEG(200)
C
C     READ INPUT PARAMETERS
C
C     HEADER .. TITLE
C     N ....... NUMBER OF YEARS
C     NG ...... NUMBER OF GAPS
C     LI ...... LENGTH OF INTEREVENT
C     LG ...... LENGTH OF GAP
C     C,P ..... PARAMETERS OF THE TRANSFORMATION
C     TRANSFORMED SERIES Y=(X+C)**P
C
      READ(5,10) HEADER
   10 FORMAT(20A4)
      READ(5,20) C,P
   20 FORMAT(2F5.2)
      READ(5,30) N,NG
   30 FORMAT(2(I4/))
      DO 5 I=1,NG
    5 READ(5,40) LI(I),LG(I)
   40 FORMAT(2I4)
      READ(10,100) (ID1,NYEAR(I),(RAIN1(I,J),J=1,12),I=1,N)
      READ(11,100) (ID2,NYEAR(I),(RAIN2(I,J),J=1,12),I=1,N)
  100 FORMAT(A4,I3,1X,12F6.2)
C
C     FROM THE INPUT VARIABLES LI AND LG TWO ARRAYS OF
C     LENGTH NG ARE COMPUTED. THEN THE STARTING POINT
C     OF THE KTH GAP IS ISG(K) AND THE ENDING POINT
C     IS IEG(K).
C
      ISI(1)=1
      IEI(1)=LI(1)
      ISG(1)=IEI(1)+1
      IEG(1)=ISG(1)+LG(1)-1
      DO 11 I=2,NG
      ISI(I)=IEG(I-1)+1
      IEI(I)=ISI(I)+LI(I)-1
      ISG(I)=IEI(I)+1
      IEG(I)=ISG(I)+LG(I)-1
   11 CONTINUE
C
      WRITE(6,15) HEADER
   15 FORMAT(20A4,///)
C
C     PRINT THE POSITIONS OF THE GAPS FOR A CHECK
C
      WRITE(6,101) (I,ISG(I),IEG(I),I=1,NG)
  101 FORMAT(3I6)
C
C     INITIALIZE THE MISSING VALUES OF THE INCOMPLETE SERIES
C
      DO 60 I=1,N
      DO 60 J=1,12
      IF(RAIN1(I,J).EQ.-1.) RAIN1(I,J)=0.
   60 CONTINUE
C
C     PRINT OUT THE SERIES WHICH IS TO BE ESTIMATED
C
      WRITE(6,66)
      WRITE(6,102) (ID1,NYEAR(I),(RAIN1(I,J),J=1,12),I=1,N)
  102 FORMAT(1X,A4,I3,1X,12F6.2)
   66 FORMAT(1H1)
C
C     SUBROUTINE BIVAR IS CALLED TO FIT A BIVARIATE AR(1)
C     MODEL TO THE TWO INPUT SERIES. IT ALSO ESTIMATES
C     THE MISSING VALUES OF THE ONE SERIES AND SAVES
C     IT TO BE INPUT TO THE NEXT CALL
C
      CALL BIVAR(RAIN1,VR1,RAIN2,VR2,E1R1,V1R1,E1R2,V1R2,
     1 A1,B1,M01,M11)
      CALL BIVAR(E1R1,V1R1,E1R2,V1R2,E2R1,V2R1,E2R2,V2R2,
     1 A2,B2,M02,M12)
      CALL BIVAR(E2R1,V2R1,E2R2,V2R2,E3R1,V3R1,E3R2,V3R2,
     1 A3,B3,M03,M13)
      CALL BIVAR(E3R1,V3R1,E3R2,V3R2,E4R1,V4R1,E4R2,V4R2,
     1 A4,B4,M04,M14)
      CALL BIVAR(E4R1,V4R1,E4R2,V4R2,E5R1,V5R1,E5R2,V5R2,
     1 A5,B5,M05,M15)
      STOP
      END
C
C     SUBROUTINE BIVAR
C
C     SUBROUTINE BIVAR FITS A BIVARIATE AR(1) MODEL TO
C     THE TWO INPUT SERIES EACH TIME IT IS CALLED. IT
C     ALSO ESTIMATES THE MISSING VALUES OF THE ONE
C     SERIES AND THE ESTIMATED SERIES IS SAVED TO
C     BE INPUT TO THE NEXT CALL
C
      SUBROUTINE BIVAR(RAIN1,VR1,RAIN2,VR2,ERAIN1,EVR1,ERAIN2,
     1 EVR2,A,B,M0,M1)
      DIMENSION WKAREA(200)
      DIMENSION RAIN1(60,12),VR1(800),RAIN2(60,12),VR2(800)
      DIMENSION ERAIN1(60,12),EVR1(800),ERAIN2(60,12),EVR2(800)
      DIMENSION XM1(12),XM2(12),STD1(12),STD2(12)
      DIMENSION A(2,2),B(2,2),C(2,2),D(2,2)
      REAL M0(2,2),M1(2,2),M0INV(2,2)
      COMMON/A/ ID1,ID2,NYEAR(60)
      COMMON/B/ N,C,P
      COMMON/C/ NG,ISG(200),IEG(200)
      COMMON/D/ XM1,XM2,STD1,STD2,X1,X2,ST1,ST2
C
C     CALL SUBROUTINE STAT TO NORMALIZE AND STANDARDIZE
C     THE SERIES AND COMPUTE THE STATISTICS
C
      CALL STAT(RAIN1,XM1,STD1,VR1,X1,ST1)
      CALL STAT(RAIN2,XM2,STD2,VR2,X2,ST2)
C
C     CALL THE IMSL SUBROUTINE FTCRXY TO COMPUTE AUTO-
C     AND CROSS-COVARIANCES OF THE SERIES
C
      CALL FTCRXY(VR1,VR2,N,X1,X2,0,N,C120,IER)
      CALL FTCRXY(VR1,VR1,N,X1,X1,1,N,C111,IER)
      CALL FTCRXY(VR2,VR2,N,X2,X2,1,N,C221,IER)
      CALL FTCRXY(VR1,VR2,N,X1,X2,1,N,C121,IER)
      CALL FTCRXY(VR2,VR1,N,X2,X1,1,N,C211,IER)
      M0(1,1)=1.
      M0(2,2)=1.
      M0(1,2)=C120/(ST1*ST2)
      M0(2,1)=M0(1,2)
      M1(1,1)=C111/(ST1*ST1)
      M1(2,2)=C221/(ST2*ST2)
      M1(1,2)=C121/(ST1*ST2)
      M1(2,1)=C211/(ST1*ST2)
      WRITE(6,66)
   66 FORMAT(1H1)
C
C     PRINT OUT THE CORRELATION MATRICES M0 AND M1
C
      WRITE(6,110) ((M0(I,J),J=1,2),I=1,2)
      WRITE(6,111) ((M1(I,J),J=1,2),I=1,2)
  110 FORMAT(5X,'CORRELATION MATRIX M0',//,(5X,2F10.3)/)
  111 FORMAT(5X,'CORRELATION MATRIX M1',//,(5X,2F10.3)/)
C
C     CALCULATE THE PARAMETER MATRICES A AND B
C
      CALL LINV2F(M0,2,2,M0INV,0,WKAREA,IER)
      CALL VMULFF(M1,M0INV,2,2,2,2,2,A,2,IER)
      CALL VMULFP(A,M1,2,2,2,2,2,D,2,IER)
      DO 10 I=1,2
      DO 10 J=1,2
   10 C(I,J)=M0(I,J)-D(I,J)
      B(1,1)=C(1,1)**0.5
      B(2,1)=C(1,2)/B(1,1)
      B(2,2)=(C(2,2)-C(1,2)**2/C(1,1))**0.5
      B(1,2)=0.
C
C     PRINT OUT THE MATRICES A AND B
C
      WRITE(6,140) ((A(I,J),J=1,2),I=1,2)
      WRITE(6,141) ((B(I,J),J=1,2),I=1,2)
  140 FORMAT(5X,'COEFFICIENT MATRIX: A',//,(5X,2F10.3)/)
  141 FORMAT(5X,'COEFFICIENT MATRIX: B',//,(5X,2F10.3)/)
C
      NN=N*12
      DO 15 I=1,NN
      EVR2(I)=VR2(I)
   15 EVR1(I)=VR1(I)
C
C     ESTIMATE THE GAPS OF THE INCOMPLETE SERIES
C
      DO 20 I=1,NG
      I1=ISG(I)
      I2=IEG(I)
      K=I2-I1+1
      DO 40 L=1,K
   40 EVR1(I1+L-1)=A(2,1)*EVR1(I1+L-2)+A(2,2)*EVR2(I1+L-2)
   20 CONTINUE
C
C     PERFORM INVERSE TRANSFORMATIONS
C
      PP=1/P
      DO 50 I=1,N
      DO 50 J=1,12
      L=J+(I-1)*12
      ERAIN1(I,J)=(EVR1(L)*STD1(J)+XM1(J))**PP
      ERAIN2(I,J)=(EVR2(L)*STD2(J)+XM2(J))**PP
   50 CONTINUE
C
C     PRINT OUT THE ESTIMATED SERIES
C
      WRITE(6,66)
      WRITE(6,101) (ID1,NYEAR(I),(ERAIN1(I,J),J=1,12),I=1,N)
  101 FORMAT(1X,A4,I3,1X,12F6.2)
      RETURN
      END
C
C     SUBROUTINE STAT
C
C     SUBROUTINE STAT TRANSFORMS THE ORIGINAL SERIES
C     TO NORMAL AND STATIONARY AND COMPUTES THE
C     STATISTICS OF THE TRANSFORMED SERIES.
C
      SUBROUTINE STAT(RAIN,XM,STD,VRAIN,X,ST)
      DIMENSION RAIN(60,12),VRAIN(800),XM(12),STD(12)
      COMMON/A/ ID1,ID2,NYEAR(60)
      COMMON/B/ N,C,P
      COMMON/C/ NG,ISG(200),IEG(200)
      DO 10 I=1,N
      DO 10 J=1,12
      RAIN(I,J)=(RAIN(I,J)+C)**P
   10 CONTINUE
C
C     COMPUTE MONTHLY MEANS AND STANDARD DEVIATIONS OF
C     THE NORMALIZED SERIES
C
      DO 20 J=1,12
      XM(J)=0.
      STD(J)=0.
      DO 25 I=1,N
      XM(J)=XM(J)+RAIN(I,J)/FLOAT(N)
      STD(J)=STD(J)+RAIN(I,J)**2
   25 CONTINUE
   20 CONTINUE
      DO 30 I=1,12
      STD(I)=((STD(I)-XM(I)**2*FLOAT(N))/(FLOAT(N)-1.))**0.5
   30 CONTINUE
C
C     NOW STANDARDIZE THE SERIES
C
      DO 40 I=1,N
      DO 40 J=1,12
      RAIN(I,J)=(RAIN(I,J)-XM(J))/STD(J)
   40 CONTINUE
C
C     COMPUTE MEAN AND STD OF THE WHOLE SERIES
C
      NN=N*12
      IC=0
      DO 50 I=1,N
      DO 50 J=1,12
      IC=IC+1
      VRAIN(IC)=RAIN(I,J)
   50 CONTINUE
C
      X=0.
      ST=0.
      DO 60 I=1,NN
      X=X+VRAIN(I)
      ST=ST+VRAIN(I)**2
   60 CONTINUE
      X=X/FLOAT
      ...

REFERENCES

Afifi, A.A., and Elashoff, R.M., 1966, "Missing observations in multivariate statistics I: Review of the literature," J. Am. Stat. Assoc., 61:595-604.
Anderson, D.G., 1979, "Satellite versus conventional methods in hydrology," in Satellite Hydrology, American Water Resources Association, Minneapolis.
Anderson, T.W., 1957, "Maximum likelihood estimates for a multivariate normal distribution when some observations are missing," J. Am. Stat. Assoc., 52:200-203.
Ansley, G.F., Spivey, W.A., and Wrobleski, W.J., 1977, "A class of transformations for Box-Jenkins's seasonal modelling," Appl. Stat., 26:173-178.
Beale, E.M.L., and Little, R.J.M., 1975, "Missing values in multivariate analysis," J. R. Stat. Soc., B37:129-145.
Beard, L.R., 1973, "Hydrologic data fill-in and network design," in Design of Water Resources Projects with Inadequate Data, Proc. of the Madrid Symposium, June, 1973.
Bendat, J.S., and Piersol, A.G., 1967, Measurement and Analysis of Random Data, John Wiley & Sons, New York, 3rd printing.
Bloomfield, P., 1970, "Spectral analysis with randomly missing observations," J. R. Stat. Soc., B32:369-380.
Box, G.E.P., and Cox, D.R., 1964, "An analysis of transformation (with discussion)," J. R. Stat. Soc., B26:211-252.
Box, G.E.P., and Jenkins, G.M., 1973, "Some comments on a paper by Chatfield and Prothero and on a review by Kendall (with discussion)," J. R. Stat. Soc., A135:337-345.
Box, G.E.P., and Jenkins, G.M., 1976, Time Series Analysis: Forecasting and Control, Holden-Day, San Francisco, Revised ed.
Box, G.E.P., and Pierce, D.A., 1970, "Distribution of residual autocorrelations in autoregressive-integrated moving average time series models," J. Am. Stat. Assoc., 64:1509-1526.
Brubacher, S.R., and Tunnicliffe Wilson, G., 1976, "Interpolating time series with application to the estimation of holiday effects on electricity demand," Appl. Stat., 25:107-116.
Buck, S.F., 1960, "A method of estimation of missing values in multivariate data suitable for use with an electronic computer," J. R. Stat. Soc., B22:302-307.
Chatfield, C., 1980, The Analysis of Time Series: An Introduction, Chapman and Hall, London, 2nd ed.
Chatfield, C., and Prothero, D.L., 1973a, "Box-Jenkins seasonal forecasting: Problems in a case study (with discussion)," J. R. Stat. Soc., A136:295-336.
Chatfield, C., and Prothero, D.L., 1973b, "Reply by Dr. Chatfield and Dr. Prothero on the paper 'Some comments on a paper by Chatfield and Prothero and on a review by Kendall' by Box, G.E.P., and Jenkins, G.M.," J. R. Stat. Soc., A136:347-352.
Crosby, D.S., and Maddoc, T., 1970, "Estimating coefficients of a flow generator for monotone samples of data," Water Resour. Res., 6(4):1079-1086.
Damsleth, E., 1980, "Interpolating missing values in a time series," Scand. J. Stat., 7:33-39.
Dean, J.D., and Snyder, W.M., 1977, "Temporally and areally distributed rainfall," J. of the Irrigation and Drainage Div., ASCE, 103(IR2):221-229.
Delleur, J.W., and Kavvas, M.L., 1978, "Stochastic models for monthly rainfall forecasting and synthetic generation," J. Appl. Meteor., 17(10):1528-1536.
Draper, N.R., and Cox, D.R., 1969, "On distributions and their transformation to normality," J. R. Stat. Soc., B31:472-476.
Draper, N.R., and Smith, H., 1966, Applied Regression Analysis, John Wiley & Sons, New York.
Durbin, J., 1960, "The fitting of time series models," Rev. Int. Inst. Stat., 28:233.
Fiering, M.B., 1964, "Multivariate technique for synthetic hydrology," J. Hydraul. Div., ASCE, 90(HY5):43-60.
Fiering, M.B., 1968, "Schemes for handling inconsistent matrices," Water Resour. Res., 4(2):291-297.
Fiering, M.B., and Jackson, B.B., 1971, "Synthetic Hydrology," Monograph No. 1, American Geophysical Union, Washington, D.C.
Finzi, G., Todini, E., and Wallis, J.R., 1977, "SPUMA: Simulation package using Matalas algorithm," in Mathematical Models for Surface Water Hydrology, Ed. by Ciriani, T.A., Maione, U., and Wallis, J.R., John Wiley & Sons, London.
Finzi, G., Todini, E., and Wallis, J.R., 1975, "Comment upon multivariate synthetic hydrology," Water Resour. Res., 11(6):844-850.
Gantmacher, F.R., 1977, The Theory of Matrices, Vol. I, Chelsea Publ. Company, New York.
Granger, C.W.J., and Morris, M.J., 1976, "Time series modelling and interpretation," J. R. Stat. Soc., A139:246-257.
Haan, C.T., 1977, Statistical Methods in Hydrology, Iowa State Univ. Press, Ames.
Hamrick, R.L., 1972, "South Florida's 'unmanaged' resource," In Depth Report, Central and South Florida Flood Control District, 1:1-12.
Hannan, E.J., 1960, Time Series Analysis, Chapman and Hall, London.
Hashino, M., 1977, "A similar storm method on filling data voids," in Modeling Hydrologic Processes, Ed. by Morel-Seytoux, H., Salas, J.D., Sanders, T.G., and Smith, R.E., Water Resour. Res. Publications, Fort Collins, Colorado.
Hinkley, D., 1977, "On quick choice of power transformation," Appl. Stat., 26(1):67-70.
IMSL LIB-0007, 1979, Reference Manual, Edition 7, Revised.
Jenkins, G.M., and Watts, D.G., 1969, Spectral Analysis and its Applications, Holden-Day, San Francisco, 2nd printing.
John, J.A., and Draper, N.R., 1980, "An alternative family of transformations," Appl. Stat., 29(2):190-197.
Jones, R.H., 1962, "Spectral analysis with regularly missed observations," Ann. Math. Stat., 32:455-61.
Kahan, J.P., 1974, "A method for maintaining cross and serial correlations and the coefficient of skewness under generation in a linear bivariate regression model," Water Resour. Res., 10(6):1245-1248.
Kavvas, M., and Delleur, J., 1975, "Removal of periodicities by differencing and monthly mean subtraction," J. Hydrol., 26:335-353.
Kottegoda, N.T., and Elgy, J., 1977, "Infilling missing flow data," in Modeling Hydrologic Processes, Ed. by Morel-Seytoux, H., Salas, J.D., Sanders, T.G., and Smith, R.E., Water Resour. Res. Publications, Fort Collins, Colorado.
Linsley, R.K., Jr., Kohler, M.A., and Paulhus, J.L.H., 1978, Hydrology for Engineers, McGraw-Hill Book Co., New York, 2nd ed.
Marshall, R.J., 1980, "Autocorrelation estimation of time series with randomly missing observations," Biometrika, 67(3):567-570.
Matalas, N.C., 1967, "Mathematical assessment of synthetic hydrology," Water Resour. Res., 3(4):937-945.
Matalas, N.C., 1978, "Generation of multivariate synthetic flows," in Mathematical Models for Surface Water Hydrology, Ed. by Ciriani, T.A., Maione, U., and Wallis, J.R., John Wiley & Sons, London.
Mejia, J.M., Rodriguez-Iturbe, I., and Cordova, J.R., 1974, "Multivariate generation of mixtures of normal and log-normal variables," Water Resour. Res., 10(4):691-693.
Moran, P.A.P., 1970, "Simulation and evaluation of complex water systems operations," Water Resour. Res., 6(6):1737-1742.
Neave, H.R., 1970, "Spectral analysis with initially scarce data," Biometrika, 57:111-122.
O'Connell, P.E., 1973, "Multivariate synthetic hydrology: a correction," J. Hydr. Div., ASCE, Tech. notes, 9(HY12):2391-2396.
O'Connell, P.E., 1974, "Stochastic modelling of long-term persistence in streamflow sequences," Ph.D. thesis, University of London, London, England.
Orchard, T., and Woodbury, M.A., 1972, "A missing information principle: Theory and applications," in Proc. 6th Berkeley Symp. Math. Statist. Prob., Vol. 1:697-715.
Ozaki, T., 1977, "On the order determination of ARIMA models," Appl. Stat., 26:290-301.
Parzen, E., 1963, "On spectral analysis with missing observations and amplitude modulation," Sankhya, A25:383-392.
Paulhus, J.L.H., and Kohler, M.A., 1952, "Interpolation of missing precipitation records," Mon. Weather Review, 80:129-133.
Pegram, G.G.S., and James, W., 1972, "Multilag multivariate autoregressive model for the generation of operational hydrology," Water Resour. Res., 8(4):1074-1076.
Roesner, L.A., and Yevjevich, V., 1966, "Mathematical models for time series of monthly precipitation and monthly runoff," Hydrology paper No. 15, Colorado State University, Fort Collins, Colorado.
Salas, J.D., Delleur, J.W., Yevjevich, V., and Lane, W.L., 1980, Applied Modeling of Hydrologic Time Series, Water Resour. Res. Publ., Fort Collins, Colorado.
Salas, J.D., and Pegram, G.G.S., 1977, "A seasonal multivariate multilag autoregressive model in hydrology," in Modeling Hydrologic Processes, Ed. by Morel-Seytoux, H., Salas, J.D., Sanders, T.G., and Smith, R.E., Water Resour. Publications, Fort Collins, Colorado.
Slack, J.R., 1973, "I would if I could (self-denial by conditional models)," Water Resour. Res., 9(1):247-249.
Scheinok, P.A., 1965, "Spectral analysis with randomly missed observations: The binomial case," Ann. Math. Stat., 36:971-977.
Schlesselman, J., 1971, "Power families: A note on the Box and Cox transformation," J. R. Stat. Soc., B33:307-311.
Shearman, R.J., and Salter, P.M., 1975, "An objective rainfall interpolation and mapping technique," Hydrological Sciences Bulletin, 20(3):353-363.
Stidd, C.K., 1953, "Cube-root-normal precipitation distributions," Trans. Amer. Geophys. Union, 34:31-35.
Stidd, C.J., 1968, "A three parameter distribution for precipitation data with a straight-line plotting method," Proc. 1st Statist. Meteorol. Conf., Amer. Meteor. Soc., Hartford, Connecticut, pp. 158-162.
Stidd, C.K., 1970, "The nth root normal distribution of precipitation," Water Resour. Res., 6(4):1095-1103.
Tukey, J.W., 1957, "On the comparative anatomy of transformation," Ann. of Math. Stat., 28:602-632.
Valencia, D.R., and Schaake, J.C., Jr., 1973, "Disaggregation processes in stochastic hydrology," Water Resour. Res., 9(3):580-585.
Wastler, T.A., 1969, Spectral Analysis, Applications in Water Pollution Control, U.S. Dept. of the Interior, Federal Water Pol. Control Adm., Washington, D.C.
Wei, T.C., and McGuiness, J.L., 1973, "Reciprocal distance squared method, a computer technique for estimating areal precipitation," ARS NC8, U.S. Dept. of Agriculture, Washington, D.C.
Wilson, G.T., 1973, "Contribution to discussion of 'Box-Jenkins seasonal forecasting: Problems in a case study' by C. Chatfield and D.L. Prothero," J. R. Stat. Soc., A136:315-319.
Wold, H.O., 1938, A Study of the Analysis of Stationary Time Series, Almquist and Wicksell, Uppsala, 2nd ed., 1954.
Yevjevich, V.M., 1972, "Structural analysis of hydrologic time series," Hydrol. paper No. 56, Colorado State University, Fort Collins, Colorado.
Young, G.K., 1968, "Discussion of 'Mathematical assessment of synthetic hydrology' by N.C. Matalas," Water Resour. Res., 4(3):681-682.
Young, G.K., and Pisano, W.C., 1968, "Operational hydrology using residuals," J. Hydr. Div., ASCE, 94(HY4):909-923.
Yule, G.U., 1927, "On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers," in Statistical Papers of George Udny Yule, selected by Stuart, A., and Kendall, M., Hafner Publ. Co., New York, 1971.