Estimating missing values in monthly rainfall series


Material Information

Title:
Estimating missing values in monthly rainfall series
Series Title:
Florida Water Resources Research Center Publication Number 67
Physical Description:
Book
Creator:
Foufoula-Georgiou, Efi
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Notes

Abstract:
This study compares and evaluates different methods for the estimation of missing observations in monthly rainfall series. The estimation methods studied reflect three basic ideas: (1) the use of regional-statistical information in four simple techniques: - mean value method (MV), - reciprocal distance method (RD), - normal ratio method (NR), - modified weighted average method (MWA); (2) the use of a univariate autoregressive moving average (ARMA) model which describes the time correlation of the series; (3) the use of a multivariate ARMA model which describes the time and space correlation of the series. An algorithm for the recursive estimation of the missing values in a series by a parallel updating of the univariate or multivariate ARMA model is proposed and demonstrated. All methods are illustrated in a case study using 55 years of monthly rainfall data from four south Florida stations.

Record Information

Source Institution:
University of Florida Institutional Repository
Holding Location:
University of Florida
Rights Management:
All rights reserved by the source institution and holding location.
System ID:
AA00001543:00001






WATER RESOURCES RESEARCH CENTER

Publication No. 67
ESTIMATING MISSING VALUES IN MONTHLY RAINFALL SERIES
By
EFI FOUFOULA-GEORGIOU
A Thesis Presented to the Graduate Council of
the University of Florida
in Partial Fulfillment of the Requirements for the
Degree of Master of Engineering
University of Florida
Gainesville














UNIVERSITY OF FLORIDA












ESTIMATING MISSING VALUES IN MONTHLY RAINFALL SERIES


By





EFI FOUFOULA-GEORGIOU


Publication No. 67


FLORIDA WATER RESOURCES RESEARCH CENTER


Research Project Technical Completion Report






Sponsored by


South Florida Water Management District






A THESIS PRESENTED TO THE GRADUATE COUNCIL OF
THE UNIVERSITY OF FLORIDA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF MASTER OF ENGINEERING






UNIVERSITY OF FLORIDA


1982
















ACKNOWLEDGEMENTS


I wish to express my sincere gratitude to all those who

contributed towards making this work possible.

I am particularly indebted to the chairman of my

supervisory committee, Professor Wayne C. Huber. Through

the many constructive discussions along the course of this

research, he provided invaluable guidance. It was his
technical and moral support that brought this work to
completion.

I would like to express my sincere appreciation to the

other members of my supervisory committee: Professors J. P.

Heaney, D. L. Harris, and M. C. K. Yang, for their helpful

suggestions and their thoughtful and critical evaluation of

this work.

Special thanks are also given to my fellow students and

friends, Khlifa, Dave D., Bob, Terrie, Richard, Dave M., and

Mike, for their cheerful help and the pleasant environment

for work they have created.

Finally my deepest appreciation and love go to my

husband, Tryphon, who has been a constant source of

encouragement and inspiration for creative work. Many

invaluable discussions with him helped a great deal in









gaining an understanding of some problems considered in this

thesis.

The research was supported in part by the South Florida

Water Management District.

Computations were performed at the Northeast Regional

Data Center on the University of Florida campus,

Gainesville.



















TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1. INTRODUCTION
    Rainfall Records
    Frequency Analysis of Missing Observations in the South Florida
        Monthly Rainfall Records
    Description of the Chapters

CHAPTER 2. SIMPLIFIED ESTIMATION TECHNIQUES
    Introduction
    Mean Value Method (MV)
    Reciprocal Distance Method (RD)
    Normal Ratio Method (NR)
    Modified Weighted Average Method (MWA)
    Least Squares Method (LS)

CHAPTER 3. UNIVARIATE STOCHASTIC MODELS
    Introduction
    Review of Box-Jenkins Models
        Autoregressive Models
        Moving Average Models
        Mixed Autoregressive-Moving Average Models
        Autoregressive Integrated Moving Average Models
    Transformation of the Original Series
        Transformation to Normality
        Stationarity
    Monthly Rainfall Series
        Normalization and Stationarization
        Modeling of Normalized Series

CHAPTER 4. MULTIVARIATE STOCHASTIC MODELS
    Introduction
    General Multivariate Regression Model
    Multivariate Lag-One Autoregressive Model
    Comments on Multivariate AR(1) Model
        Assumption of Normality and Stationarity
        Cross-Correlation Matrix M1
        Further Simplification
    Higher Order Multivariate Models

CHAPTER 5. ESTIMATION OF MISSING MONTHLY RAINFALL VALUES--A CASE STUDY
    Introduction
    Set Up of the Problem
    Simplified Estimation Techniques
        Techniques Utilized
        Least Squares Methods
        Modified Weighted Average Method
        Comparison of the MV, RD, NR and MWA Methods
    Univariate Model
        Model Fitting
        Proposed Estimation Algorithm
        Application of the Algorithm on the Monthly Rainfall Series
        Results of the Method
        Remarks
    Bivariate Model
        Model Fitting
        Proposed Estimation Algorithm
        Application of the Algorithm on the Monthly Rainfall Series

CHAPTER 6. CONCLUSIONS AND RECOMMENDATIONS
    Summary and Conclusions
    Further Research

APPENDIX A. DEFINITIONS
APPENDIX B. DETERMINATION OF MATRICES A AND B OF THE
    MULTIVARIATE AR(1) MODEL
APPENDIX C. DATA USED AND STATISTICS
APPENDIX D. COMPUTER PROGRAMS
REFERENCES
BIOGRAPHICAL SKETCH

















LIST OF TABLES

1.1   Frequency Distribution of the Percent of Missing Values in
      213 South Florida Monthly Rainfall Records

5.1   Least Squares Regression Coefficients and Their Significance
      Levels

5.2   Correction Coefficients for Each Month and for Each Different
      Percent of Missing Values

5.3   Statistics of the Actual (ACT), Incomplete (INC) and Estimated
      Series (MV, RD, NR, MWA)

5.4   Bias in the Mean

5.5   Bias in the Standard Deviation

5.6   Bias in the Lag-One and Lag-Two Correlation Coefficients

5.7   Accuracy: Mean and Variance of the Residuals

5.8   Initial Estimates and MLE of the Parameters φ and θ of an
      ARMA(1,1) Model Fitted to the Monthly Rainfall Series of
      Station A

5.9   Results of the RAEMV-U Applied at the 10% Level of Missing
      Values. Upper Value is φ_1, Lower Value is θ_1

5.10  Results of the RAEMV-U Applied at the 20% Level of Missing
      Values. Upper Value is φ_1, Lower Value is θ_1

5.11  Statistics of the Actual Series (ACT) and the Two Estimated
      Series (UN10, UN20)

5.12  Bias in the Mean, Standard Deviation and Serial Correlation
      Coefficient--Univariate Model










5.13  Results of the RAEMV-B1 Applied at the 10% Level of Missing
      Values

5.14  Results of the RAEMV-B1 Applied at the 20% Level of Missing
      Values

5.15  Statistics of the Actual Series (ACT) and the Two Estimated
      Series (B10 and B20)

5.16  Bias in the Mean, Standard Deviation and Serial Correlation
      Coefficient--Bivariate Model
















LIST OF FIGURES

1.1   Monthly distribution of rainfall in the United States

1.2   Probability density function, f(m), of the percentage of
      missing values

1.3   Probability density function, f(T), of the interevent size

1.4   Probability density, f(k), and mass function, p(k), of the
      gap size

2.1   Mean value method without random component

2.2   Mean value method with random component

2.3   Least squares method without random component

2.4   Least squares method with random component

5.1   The four south Florida rainfall stations used in the analysis

5.2   Plot of the monthly means and standard deviations of the
      rainfall series of Station A

5.3   Autocorrelation function plot of the residual series of an
      ARMA(1,1) model fitted to the monthly rainfall series of
      Station A

5.4   Sum of squares of the residuals surface of an ARMA(1,1) model
      fitted to the monthly rainfall series of Station A

5.5   Recursive algorithm for the estimation of the missing
      values--univariate model (RAEMV-U)

5.6   Recursive algorithm for the estimation of missing
      values--bivariate model--1 station to be estimated (RAEMV-B1)

5.7   Recursive algorithm for the estimation of missing
      values--bivariate model--2 stations to be estimated (RAEMV-B2)















Abstract of Thesis Presented to the Graduate
Council of the University of Florida in Partial
Fulfillment of the Requirements for the Degree of
Master of Engineering


ESTIMATION OF MISSING OBSERVATIONS
IN MONTHLY RAINFALL SERIES

By

Efstathia Foufoula-Georgiou

December, 1982

Chairman: Wayne C. Huber
Cochairman: James P. Heaney
Major Department: Environmental Engineering Sciences

This study compares and evaluates different methods for

the estimation of missing observations in monthly rainfall

series. The estimation methods studied reflect three basic

ideas:

(1) the use of regional-statistical information in four

simple techniques:

mean value method (MV),

reciprocal distance method (RD),

normal ratio method (NR),

modified weighted average method (MWA);

(2) the use of a univariate autoregressive moving

average (ARMA) model which describes the time

correlation of the series;









(3) the use of a multivariate ARMA model which

describes the time and space correlation of

the series.

An algorithm for the recursive estimation of the missing

values in a series by a parallel updating of the univariate

or multivariate ARMA model is proposed and demonstrated.

All methods are illustrated in a case study using 55 years

of monthly rainfall data from four south Florida stations.

























CHAPTER 1

INTRODUCTION



Rainfall Records

Rainfall is the source component of the hydrologic

cycle. As such it regulates water availability and thus

land use, agricultural and urban expansion, maintenance of

environmental quality and even population growth and human

habitation. As Hamrick (1972) points out, water may be

transported for considerable distances from where it fell as

rain and may be stored for long periods of time, but with

very few exceptions it originates as rainfall.

Consequently, the measurement and study of rainfall is in

actuality the measurement and study of our potential water

supply.

Rainfall studies attempt to derive models, both

probabilistic and physical, to describe and forecast the

rainfall process. Since the quality of every study is

immediately related to the quality of the data used, the

need for "good quality" rainfall data has been expressed by

all hydrologists. By "good quality" is meant accurate, long

and uninterrupted series of rainfall measurements at a range

of different time intervals (e.g., hourly, daily, monthly,

and yearly data) and for a dense raingage network. Missing









values in the series (due, for example, to failure of the
recording instruments or to deletion of a station) are a real
handicap to the hydrologic data users.

these missing values is often desirable prior to the use of

the data.

For instance, the South Florida Water Management

District prepared a magnetic tape with monthly rainfall data

for all rainfall stations in south Florida for use in this

study (T. MacVicar, SFWMD, personal communication, May,

1982). The data included values for the period of record at

each station, ranging from over 100 years (at Key West) to

only a few months at several temporary stations.

Approximately one month was required to preprocess these

data prior to performing routine statistical and time series

analyses. The preprocessing included tasks such as

manipulations of the magnetic tape, selection of stations

with desirable characteristics (e.g., long period of record,

proximity to other stations of interest, few missing values)

and a major effort at replacement of missing values that did

exist. This effort, in fact, was the motivation for this

thesis.

Many different kinds of statistical analyses may be

performed on a given data set, e.g., determination of

elementary statistical parameters, auto- and cross-

correlation analysis, spectral analysis, frequency analysis,

fitting time series models. For routine statistics (e.g.,

calculation of mean, variance and skewness) missing values









are seldom a problem. But for techniques as common as

autocorrelation and spectral analysis missing values can

cause difficulties. In multivariate analysis missing values

result in "wasted information" when only the overlapping

period of the series can be used in the analysis, and in

inconsistencies (Fiering, 1968, and Chapter 4 of this

thesis) when the incomplete series are used.

In general, two approaches to the problem of missing

observations exist. The first consists of developing
methods of analysis that use only the available data; the
second, of developing methods for estimating the missing
observations, followed by application of classical methods of
analysis.

Monthly rainfall totals are usually calculated as the

sum of daily recorded values. Thus, if one or more daily

observations are missing the monthly total is not reported

for that month. An investigation conducted by the Weather

Bureau in 1950 (Paulhus and Kohler, 1952), showed that

almost one third of the stations for which monthly and

yearly totals were not published had only a few (less than

five) days missing. Furthermore, for some of these missing

days there was apparently no rainfall in the area as

concluded by the rainfall observations at nearby stations.

Therefore, in many cases estimation of a few missing daily

rainfall values can provide a means for the estimation of

the monthly totals.










Statisticians have been most concerned with the problem

of handling short record multivariate data with missing

observations in some or all of the variables, but no

explicit and simple solutions have been given, apart from a

few special cases in which the missing data follow certain

patterns. A review of these methods is given by Afifi and

Elashoff (1956). In the time domain, "the analysis of time

series, when missing observations occur has not received a

great deal of attention" as Marshall (1980, p. 567)

comments, and he proposes a method for the estimation of the

autocorrelations using only the observed values. Jones

(1980) attempts to fit an ARMA model to a stationary time

series which has missing observations using Akaike's

Markovian representation and Kalman's recursive algorithm.

In the frequency domain, spectral analysis with randomly

missing observations has been examined by Jones (1962),

Parzen (1963), Scheinok (1965), Neave (1970) and Bloomfield

(1970).

In hydrology, the problem of missing observations has

not been studied much as Salas et al. (1980) state:

The filling-in or extension of a data series is a
topic which has not received a great deal of
attention either in this book or elsewhere.
Because of its importance, the subject is expected
to be paid more attention in the future. (Salas
et al., 1980, p. 464)

Simple and "practicable" methods for the estimation of

missing rainfall values for large scale application were

proposed by Paulhus and Kohler (1952), for the completion of

the rainfall data published by the Weather Bureau. The









study was initiated after numerous requests of the

climatological data users. Beard (1973) adopted a multisite

stochastic generation technique to fill-in missing

streamflow data, and Kottegoda and Elgy (1977) compared a

weighted average scheme and a multivariate method for the

estimation of missing data in monthly flow series. Hashino

(1977) introduced the "concept of similar storm" for the

estimation of missing rainfall sequences. Although the same

methods of estimation can be applied to both rainfall and

runoff series, a specific method is not expected to perform

equally well when applied to the two different series due

mainly to the different underlying processes. This is true

even for rainfall series from different geographical

regions, since their distributions may vary greatly as shown

in Fig. 1.1.

This analysis will use monthly rainfall data from four

south Florida stations. First, a frequency analysis of the

missing observations has been performed and their typical

pattern has been identified. In this work the term "missing

observations" is used for a sequence of missing monthly

values restricted to less than twelve, so that unusual cases

of lengthy gaps (a year or more of missing values) are
avoided, since they do not reflect the general situation.



Frequency Analysis of Missing Observations in the
South Florida Monthly Rainfall Records

An analysis of the monthly rainfall series of

213 stations of the South Florida Water Management District




















[Figure 1.1 omitted: map of monthly rainfall distributions]

Fig. 1.1. Monthly distribution of rainfall in the United States
(after Linsley, R.K., Kohler, M.A., and Paulhus, J.L.,
Hydrology for Engineers, 2nd ed., McGraw-Hill, 1975, p. 90).









(SFWMD) gave the results shown in Table 1.1. Figure 1.2

shows the probability density function (pdf) plot of the

percent m of missing values, f(m), which is defined as the

ratio of the probability of occurrence over an interval to

the length of that interval (column 4 of Table 1.1). The

shape of the pdf f(m) suggests the fit by an exponential

distribution



f(m) = λ e^(-λm)    (1.1)

where λ is the parameter of the distribution, calculated as
the inverse of the expected value of m, E(m):

E(m) = Σ p(m_i) m_i    (1.2)

where p(m_i) is the probability of having m_i percent of
missing values. The mean percentage of missing values is
m̄ = E(m) = 13.663, so λ = 1/13.663 = 0.073 and the fitted
exponential pdf is

f(m) = 0.073 e^(-0.073m)    (1.3)



which gives an unexpectedly good fit, as shown by Fig. 1.2
and column 5 of Table 1.1.
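The moment fit above is easy to reproduce numerically. The sketch below (a rough check, not the thesis computation: it uses the grouped Table 1.1 interval midpoints rather than the ungrouped station percentages, so the mean differs slightly from 13.663) estimates λ as the reciprocal of the mean percent missing and evaluates the fitted exponential density at each interval midpoint:

```python
import math

# Percent-missing intervals (Table 1.1) represented by their midpoints,
# and the percent of the 213 stations falling in each interval.
midpoints = [2.5 + 5.0 * i for i in range(14)]          # 2.5, 7.5, ..., 67.5
pct_stations = [30.52, 21.12, 14.55, 13.61, 6.10, 3.29, 1.88,
                0.94, 2.35, 2.82, 0.47, 0.47, 1.41, 0.47]

# Interval probabilities and the method-of-moments estimate of lambda.
probs = [p / 100.0 for p in pct_stations]
mean_m = sum(p * m for p, m in zip(probs, midpoints))   # E(m), eq. (1.2)
lam = 1.0 / mean_m                                      # lambda = 1/E(m)

# Fitted exponential pdf, eq. (1.1), evaluated at the interval midpoints;
# the first few values track column 5 of Table 1.1 closely.
fitted = [lam * math.exp(-lam * m) for m in midpoints]
print(round(mean_m, 2), round(lam, 3))
print([round(f, 3) for f in fitted[:4]])
```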

The question now arises as to whether the missing

values within a record follow a certain pattern. In















[Figure 1.2 omitted: plot of the empirical pdf and the fitted curve
f(m) = 0.073 e^(-0.073m)]

Fig. 1.2. Probability density function, f(m), of the percentage of
missing values. Based on 213 stations, m̄ = 13.663%.










Table 1.1. Frequency Distribution of the Percent of Missing
           Values in 213 South Florida Monthly Rainfall
           Records.

     1            2           3            4           5
  % of         % of       Cumulative   Empirical   Fitted
  Missing      Stations   % of         pdf         Exponential
  Values                  Stations                 pdf

   0-5         30.52       30.52       0.061       0.061
   5-10        21.12       51.64       0.042       0.042
  10-15        14.55       66.19       0.029       0.029
  15-20        13.61       79.80       0.027       0.020
  20-25         6.10       85.90       0.012       0.014
  25-30         3.29       89.19       0.007       0.010
  30-35         1.88       91.07       0.004       0.007
  35-40         0.94       92.01       0.002       0.005
  40-45         2.35       94.36       0.005       0.003
  45-50         2.82       97.18       0.006       0.002
  50-55         0.47       97.65       0.001       0.002
  55-60         0.47       98.12       0.001       0.001
  60-65         1.41       99.53       0.003       0.001
  65-70         0.47      100.00       0.001       0.001









particular, if the occurrence of a gap is viewed as an

"event" then the distribution of the interevent times (sizes

of the interevents) and of the durations of the events

(sizes of the gaps) may be examined.

The probability distribution of the size of the

interevents (number of values between two successive gaps)

has been studied for four "typical" stations of the SFWMD,

as far as length of record, distribution, and percent of
missing values are concerned. These four stations are:

MRF 6018, Titusville 2W, 1901-1981, 7.5% missing
MRF 6021, Fellsmere 4W, 1911-1979, 9.3% missing
MRF 6029, Ocala, 1900-1981, 4.4% missing
MRF 6005, Plant City, 1892-1981, 8.6% missing

A derived pdf for the four stations combined and the fitted

exponential pdf are shown in Fig. 1.3. The mean size of the

interevent, T, is 19.03 months; therefore, the fitted

exponential distribution is



f(T) = 0.053 e^(-0.053T)    (1.4)



The probability distribution of the size of the gaps
(number of values missing in each gap) has also been studied
for the same four stations. These have been treated as
discrete distributions since the size of the gap (k = 1, 2,
..., 11) is small compared to the interevent times. A

probability distribution for the four stations combined is

then derived, which is also the discrete probability mass

function (pmf). This plot is shown in Fig. 1.4 and suggests

either a Poisson distribution or a discretized exponential.























[Figure 1.3 omitted: plot of the empirical pdf and the fitted curve
f(T) = 0.053 e^(-0.053T)]

Fig. 1.3. Probability density function, f(T), of the
interevent size. Based on four stations.




















[Figure 1.4 omitted: plot of the empirical mass function, the fitted
Poisson, and the fitted curve f(k) = 0.447 e^(-0.447k)]

Fig. 1.4. Probability density, f(k), and mass function,
p(k), of the gap size. Based on four stations.









The mean value k̄ is 2.237, which is also the parameter λ of
the Poisson distribution. The Poisson distribution




f(k) = λ^k e^(-λ) / k!    (1.5)


is nonzero at k = 0 and does not fit the peak of the

empirical point very well at k = 1 (it gives a value of 0.24

instead of the actual 0.53). The fitted continuous

exponential pdf shown in Fig. 1.4 gives a better fit in

general but also implies a nonzero probability for a gap

size near zero. To overcome this problem and to discretize

the continuous exponential pdf, the area (probability) under

the exponential curve between zero and 1.5 is assigned to

k = 1, ensuring a zero probability at k = 0. Areas

(probabilities) assigned to values of k > 1 are centered

around those points. The fitted discretized exponential and

the Poisson are also shown in Fig. 1.4.
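The discretization described above translates into a few lines of code. This sketch (assuming only the mean gap size k̄ = 2.237 from the text) assigns the exponential mass on [0, 1.5] to k = 1, forcing p(0) = 0, centers the mass for larger k, and compares the result with the Poisson alternative at the peak:

```python
import math

k_bar = 2.237                 # mean gap size (months), from the four stations
lam = 1.0 / k_bar             # parameter of the fitted exponential, ~0.447

def expo_cdf(x):
    """CDF of the fitted exponential distribution."""
    return 1.0 - math.exp(-lam * x)

# Discretized exponential: all mass below 1.5 goes to k = 1 (so p(0) = 0);
# each larger k receives the mass centered around it.
p_disc = {1: expo_cdf(1.5)}
for k in range(2, 12):
    p_disc[k] = expo_cdf(k + 0.5) - expo_cdf(k - 0.5)

# Poisson alternative with the same mean, for comparison at k = 1.
def pois(k):
    return k_bar ** k * math.exp(-k_bar) / math.factorial(k)

print(round(p_disc[1], 2))    # 0.49: much closer to the empirical 0.53
print(round(pois(1), 2))      # 0.24, the Poisson value noted in the text
```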

The distributions of the size of the gaps (k) and of

the size of interevents (T) will be used to generate

randomly distributed gaps in a complete record. Suppose

that we have a complete record and desire to remove randomly

m percent missing values. If the mean size of the gap (k)

is assumed constant, the mean size of interevent (T) must

vary, decreasing as the percent of missing values increases.

Let N denote the total number of values in the record, m the






[Pages missing or unavailable in the source]









where



R² = φ_1 ρ_1 + φ_2 ρ_2 + ... + φ_p ρ_p    (3.8)


is called the multiple coefficient of determination and

represents the fraction of the variance of the series that

has been explained through the regression.

If we denote by φ_kj the jth coefficient in an auto-

regressive process of order k, then the last coefficient

φ_kk of the model is called the partial autocorrelation
coefficient. Estimates of the partial autocorrelation

coefficients φ_11, φ_22, ..., φ_pp may be obtained by fitting

to the series autoregressive processes of successively

higher order, and solving the corresponding Yule-Walker

equations. The partial autocorrelation function φ_kk, k = 1,

2, ..., p, may also be obtained recursively by means of

Durbin's relations (Durbin, 1960):


k k
k+,k+ [rk+l k,j rk+l-jV l k,j r]
j=1 j=1
(3.9)



k+l,j = k,j k+,k+l k,k-j+l j = 1, 2, .., k
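Durbin's relations translate directly into a short recursion. The sketch below (assuming the sample autocorrelations r_1, r_2, ... have already been computed from the series) returns the partial autocorrelations φ_11, ..., φ_pp:

```python
def partial_autocorr(r, p):
    """Partial autocorrelations via Durbin's recursion, eq. (3.9).

    r : sample autocorrelations [r1, r2, ..., rp]
    p : highest order required
    Returns [phi_11, phi_22, ..., phi_pp].
    """
    phi = {}          # phi[(k, j)] = j-th coefficient of the fitted AR(k)
    pacf = []
    for k in range(p):
        if k == 0:
            phi_kk = r[0]                      # phi_11 = r1
        else:
            num = r[k] - sum(phi[(k, j)] * r[k - j] for j in range(1, k + 1))
            den = 1.0 - sum(phi[(k, j)] * r[j - 1] for j in range(1, k + 1))
            phi_kk = num / den
        phi[(k + 1, k + 1)] = phi_kk
        for j in range(1, k + 1):              # update lower-order coefficients
            phi[(k + 1, j)] = phi[(k, j)] - phi_kk * phi[(k, k - j + 1)]
        pacf.append(phi_kk)
    return pacf
```

For an AR(1) series with r_k = ρ^k, the recursion gives φ_11 = ρ and φ_kk = 0 for k > 1, matching the cutoff property of the partial autocorrelation function described in the text.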


It can be shown (Box and Jenkins, 1976, p. 55) that the

autocorrelation function of a stationary AR(p) process is a

mixture of damped exponential and damped sine waves,









infinite in extent. On the other hand, the partial auto-

correlation function kk is nonzero for k < p and zero for

k > p. The plot of autocorrelation and partial autocorre-

lation functions of the series may be used to identify the

kind and the order of the model that may have generated

it (identification of the model).



Moving Average Models

In a moving average model the deviation of the current

value of the process from the mean is expressed as a finite

sum of weighted previous shocks a_t. Thus a moving average

process of order q can be written as



z_t = a_t - θ_1 a_{t-1} - θ_2 a_{t-2} - ... - θ_q a_{t-q}    (3.10)


or



z_t = θ(B) a_t    (3.11)



where



θ(B) = 1 - θ_1 B - θ_2 B² - ... - θ_q B^q    (3.12)



is the moving average operator of order q. An MA(q) model

contains (q+2) parameters, μ, θ_1, θ_2, ..., θ_q, σ_a², to be

estimated from the data.









From the definition of stationarity (see Appendix A)

it follows that an MA(q) process is always stationary, since

θ(B) is finite and thus converges for |B| < 1. But for an

MA(q) process to be invertible the q moving average

coefficients θ_1, θ_2, ..., θ_q must be chosen so that θ^(-1)(B)

converges on or within the unit circle; in other words, the

characteristic equation θ(B) = 0 must have its roots out-

side the unit circle.

By multiplying equation (3.10) by z_{t-k} and taking

expected values on both sides we define the autocovariance

at lag k:



γ_k = E[(a_t - θ_1 a_{t-1} - ... - θ_q a_{t-q})
        (a_{t-k} - θ_1 a_{t-k-1} - ... - θ_q a_{t-k-q})]    (3.13)


which gives

γ_0 = (1 + θ_1² + θ_2² + ... + θ_q²) σ_a²,    k = 0    (3.14)


γ_k = (-θ_k + θ_1 θ_{k+1} + θ_2 θ_{k+2} + ... + θ_{q-k} θ_q) σ_a²,
                                    k = 1, 2, ..., q    (3.15)


γ_k = 0,    k > q    (3.16)


By substituting in equation (3.15) the value of σ_a² from

equation (3.14) we obtain a set of q nonlinear equations for

θ_1, θ_2, ..., θ_q in terms of ρ_1, ρ_2, ..., ρ_q:



ρ_k = (-θ_k + θ_1 θ_{k+1} + θ_2 θ_{k+2} + ... + θ_{q-k} θ_q)
      / (1 + θ_1² + ... + θ_q²),    k = 1, 2, ..., q    (3.17)



These equations are analogous to the Yule-Walker equa-

tions for an autoregressive process, but they are not linear

and so must be solved iteratively for the estimation of the

moving average parameters θ, resulting in estimates that

may not have high statistical efficiency. Again it was

shown by Wold (1938) that these parameters may need correc-

tions (e.g., to fit better the correlogram as a whole and not

only the first q correlation coefficients), and that there

may exist several, at most 2^q, solutions for the parameters

of the moving average scheme corresponding to an assigned

correlogram ρ_1, ρ_2, ..., ρ_q. However, only those θ's are

acceptable which satisfy the invertibility conditions.
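Equation (3.17) is straightforward in the forward direction, which is what the iterative solution repeatedly evaluates: guess the θ's, compute the implied correlogram, and adjust. A minimal sketch (the MA parameter values are illustrative, not from the thesis):

```python
def ma_autocorr(theta):
    """Autocorrelations rho_1, ..., rho_q of an MA(q) process, eq. (3.17).

    theta : [theta_1, ..., theta_q]
    """
    q = len(theta)
    denom = 1.0 + sum(t * t for t in theta)   # gamma_0 / sigma_a^2, eq. (3.14)
    rho = []
    for k in range(1, q + 1):
        # -theta_k + sum_{i=1}^{q-k} theta_i * theta_{i+k}, eq. (3.15)
        num = -theta[k - 1] + sum(theta[j] * theta[j + k] for j in range(q - k))
        rho.append(num / denom)
    return rho

# MA(1) with theta_1 = 0.5: rho_1 = -0.5 / 1.25 = -0.4, zero beyond lag 1.
print(ma_autocorr([0.5]))
```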

From equation (3.14) an estimate for the white noise

variance σ_a² may be obtained:


σ_a² = σ_z² / (1 + θ_1² + θ_2² + ... + θ_q²)    (3.18)


According to the duality principle (see Appendix A) an

invertible MA(q) process can be represented as an AR process

of infinite order. This implies that the partial autocorre-

lation function (kk of an MA(q) process is infinite in extent.

It can be estimated after tedious algebraic manipulations









from the Yule-Walker equations by substituting ρ_k as

functions of the θ's for k ≤ q and ρ_k = 0 for k > q. So, in

contrast to a stationary AR(p) process, the autocorrelation

function of an invertible MA(q) process is finite and cuts

off after lag q, and the partial autocorrelation function is

infinite in extent, dominated by damped exponentials and

damped sine waves (Box and Jenkins, 1976).



Mixed Autoregressive-Moving Average Models

In practice, to obtain a parsimonious parameterization,

it will sometimes be necessary to include both autoregressive

and moving average terms in the model. A mixed autoregres-

sive-moving average process of order (p,q), ARMA(p,q), can

be written as



z_t = φ_1 z_{t-1} + ... + φ_p z_{t-p} + a_t - θ_1 a_{t-1} - ... - θ_q a_{t-q}
                                                        (3.19)

or


φ(B) z_t = θ(B) a_t    (3.20)


with (p+q+2) parameters, μ, φ_1, ..., φ_p, θ_1, ..., θ_q, σ_a², to

be estimated from the data.

An ARMA(p,q) process will be stationary provided that

the characteristic equation φ(B) = 0 has all its roots out-

side the unit circle. Similarly, the roots of θ(B) = 0 must

lie outside the unit circle for the process to be invertible.
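Both root conditions are easy to check numerically. A sketch using numpy (the ARMA(1,1) coefficients φ_1 = 0.6, θ_1 = 0.3 are illustrative, not the fitted thesis values): form each characteristic polynomial 1 - c_1 B - c_2 B² - ... and verify that all its roots lie outside the unit circle.

```python
import numpy as np

def roots_outside_unit_circle(coeffs):
    """True if all roots of 1 - c1*B - c2*B^2 - ... lie outside |B| = 1.

    coeffs : [c1, c2, ...] -- the phi's (stationarity check)
             or the theta's (invertibility check).
    """
    # numpy.roots expects coefficients from highest power down: -cq, ..., -c1, 1
    poly = [-c for c in reversed(coeffs)] + [1.0]
    return all(abs(r) > 1.0 for r in np.roots(poly))

phi, theta = [0.6], [0.3]
print(roots_outside_unit_circle(phi))    # stationary
print(roots_outside_unit_circle(theta))  # invertible
print(roots_outside_unit_circle([2.0]))  # explosive AR(1): root at B = 0.5
```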









By multiplying equation (3.19) by z_{t-k} and taking

expectations we obtain



γ_k = φ_1 γ_{k-1} + ... + φ_p γ_{k-p} + γ_za(k) - θ_1 γ_za(k-1) - ...
      - θ_q γ_za(k-q)    (3.21)


where γ_za(k) is the cross covariance function between z and

a, defined by γ_za(k) = E[z_{t-k} a_t]. Since z_{t-k} depends only

on shocks which have occurred up to time t-k, it follows

that



γ_za(k) = 0,    k > 0
                        (3.22)
γ_za(k) ≠ 0,    k ≤ 0



and (3.21) implies



ρ_k = φ_1 ρ_{k-1} + φ_2 ρ_{k-2} + ... + φ_p ρ_{k-p},    k ≥ q + 1    (3.23)

or


φ(B) ρ_k = 0,    k ≥ q + 1    (3.24)



Thus, for the ARMA(p,q) process the first q autocorre-

lations ρ_1, ρ_2, ..., ρ_q depend directly on the choice of

the q moving average parameters θ, as well as on the p auto-

regressive parameters φ, through (3.21). The autocorrela-

tions of higher lags ρ_k, k ≥ q + 1, are determined through the

difference equation (3.24) after providing the p starting

values ρ_{q-p+1}, ..., ρ_q. So, the autocorrelation function

of an ARMA(p,q) model is infinite in extent, with the

first q-p values irregular and the others

consisting of damped exponentials and/or damped sine waves

(Box and Jenkins, 1976; Salas et al., 1980).
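The difference equation (3.24) makes the ACF of a fitted model easy to tabulate. A sketch for the ARMA(1,1) case (illustrative values φ = 0.8, θ = 0.4, not the fitted thesis parameters), using the standard ARMA(1,1) closed form for the starting value ρ_1 and then recursing with ρ_k = φ ρ_{k-1}:

```python
def arma11_acf(phi, theta, nlags):
    """ACF of an ARMA(1,1) process.

    rho_1 is the standard ARMA(1,1) closed form; higher lags follow
    rho_k = phi * rho_{k-1}, i.e. eq. (3.23) with p = q = 1.
    """
    rho1 = (1 - phi * theta) * (phi - theta) / (1 + theta ** 2 - 2 * phi * theta)
    acf = [rho1]
    for _ in range(nlags - 1):
        acf.append(phi * acf[-1])    # damped exponential decay, eq. (3.24)
    return acf

acf = arma11_acf(0.8, 0.4, 5)
```

Past lag 1 the sequence decays geometrically at rate φ, which is the "damped exponential" behavior described above.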



Autoregressive Integrated Moving Average Models

An ARMA(p,q) process is stationary if the roots of

φ(B) = 0 lie outside the unit circle and "explosive non-

stationary" if they lie inside. For example, an explosive

nonstationary AR(1) model is z_t = 2 z_{t-1} + a_t (the plot

of z_t vs. t is an exponential growth), in which φ(B) = 1 - 2B

has its root B = 0.5 inside the unit circle. The special

case of homogeneous nonstationarity is when one or more of

the roots lie on the unit circle. By introducing a general-

ized autoregressive operator φ*(B), which has d of its roots

on the unit circle, the general model can be written as



φ*(B) z_t = φ(B) (1-B)^d z_t = θ(B) a_t    (3.25)



that is



φ(B) w_t = θ(B) a_t    (3.26)



where


w_t = ∇^d z_t    (3.27)


and ∇ = 1 - B is the difference operator. This model corre-

sponds to assuming that the dth difference of the series

can be represented by a stationary, invertible ARMA process.

By inverting (3.27),



z_t = ∇^(-d) w_t = S^d w_t    (3.28)



where S is the infinite summation operator



S = 1 + B + B² + ... = (1-B)^(-1) = ∇^(-1)    (3.29)



Equation (3.28) implies that the nonstationary process z_t

can be obtained by summing or "integrating" the stationary

process w_t, d times. Therefore, this process is called a

simple autoregressive integrated moving average process,

ARIMA(p,d,q).
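The operators ∇ and S are mirror images, which a few lines of code make concrete. A sketch (pure Python; the series values are arbitrary illustrations, and the helper names are invented here):

```python
def difference(z, d=1):
    """Apply the difference operator (1 - B) d times: w_t = z_t - z_{t-1}."""
    w = list(z)
    for _ in range(d):
        w = [w[t] - w[t - 1] for t in range(1, len(w))]
    return w

def integrate(w, starts):
    """Invert d-th differencing given the d lost 'constants of integration'.

    starts : the d leading values of each successively less-differenced
             series -- exactly the information that differencing discards.
    """
    z = list(w)
    for s in reversed(starts):
        out = [s]
        for v in z:
            out.append(out[-1] + v)   # running sum: the summation operator S
        z = out
    return z

z = [3.0, 5.0, 4.0, 7.0, 11.0]
w = difference(z, d=2)
assert integrate(w, starts=[3.0, 2.0]) == z   # round trip recovers the series
```

Note that `integrate` needs the leading values as extra input; this is the "lost constant of integration" discussed below, which is why ARIMA models forecast deviations rather than generate synthetic series.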

It is also possible to take periodic or seasonal dif-

ferences at lag s of the series, e.g., the 12th difference

of monthly series, introducing the differencing operator

∇_s, with the meaning that seasonal differencing ∇_s is applied

D times on the series. This periodic ARIMA(P,D,Q) model

can be written as



Φ(B^s) ∇_s^D z_t = Θ(B^s) a_t    (3.30)









The combination of nonperiodic and periodic models leads to

the multiplicative ARIMA(p,d,q) x ARIMA(P,D,Q) model which

can be written as



φ(B) Φ(B^s) ∇^d ∇_s^D z_t = θ(B) Θ(B^s) a_t    (3.31)

After the model has been fitted to the differenced

series, an integration should be performed to retrieve the

original process. But such an integrated series would lack

a mean value since a constant of integration has been lost

through the differencing. This is the reason that the ARIMA

models cannot be used for synthetic generation of time

series, although they are useful in forecasting the devia-

tions of a process (Box and Jenkins, 1976; Salas et al., 1980).



Transformation of the Original Series



Transformation to Normality

Most probability theory and statistical techniques have

been developed for normally distributed variables. Hydro-

logic variables are usually asymmetrically distributed or

bounded by zero (positive variables), and so a transforma-

tion to normality is often applied before modeling. Another

approach would be to model the original skewed series and

then find the probability distribution of the uncorrelated

residuals. Care must then be taken to assess the errors of

applying methods developed for normal variables to skewed









variables, especially when the series are highly skewed,

e.g., hourly or daily series. On the other hand, when trans-

forming the original series into normal, biases in the mean

and standard deviation of the generated series may occur.

In other words, the statistical properties of the trans-

formed series may be reproduced in the generated but not

in the original series. An alternative for avoiding biases

in the moments of the generated series would be to estimate

the moments of the transformed series through the derived

relationships between the moments of the skewed and normal

series. Matalas (1967) and Fiering and Jackson (1971)

describe how to estimate the first two moments of the log-

transformed series so as to reproduce the ones of the

original series. Mejia et al. (1974) present another

approach in order to preserve the correlation structure of

the original series.

However, the most widely used approach is to transform

the original skewed series to normal and then model the

normal series. Several transformations may be applied to

the original series, and the transformed series then

tested for normality, e.g., the graph of their cumulative

distribution should appear as a straight line when it is

plotted on normal probability paper. The transformation

will be finally chosen that gives the best approximation to

normality, e.g., the best fit to a straight line.
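A related, simpler screening (a sketch with assumed synthetic data, not the thesis procedure verbatim) is to try several candidate power transformations and keep the one whose transformed series has sample skewness closest to zero, as a proxy for the best approximation to normality:

```python
import numpy as np

# Pick the power transformation that minimizes |skewness| of the
# transformed series (zero skewness being a necessary condition
# for normality).
def skewness(x):
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return np.mean(((x - m) / s) ** 3)

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=2000)   # positively skewed data

candidates = {0.25: x ** 0.25, 0.33: x ** 0.33, 0.5: x ** 0.5}
best = min(candidates, key=lambda lam: abs(skewness(candidates[lam])))
print(best)
```

The probability-paper test described above remains the graphical counterpart of this numerical criterion.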

Another advantage of transforming the series to normal

is that the maximum likelihood estimates of the model










parameters are essentially the same as the least squares

estimates, provided that the residuals are normally dis-

tributed (Box and Jenkins, 1976, Ch. 7). This facilitates

the calculation of the final estimates since they are those

values that minimize the sum of squares of the residuals.

Box and Cox (1964) showed how a maximum likelihood and

a parallel Bayesian analysis can be applied to any type of

transformation family to obtain the "best" choice of trans-

formation from that family. They illustrated those methods

for the popular power families in which the observation x is

replaced by y, where

y = (x^λ - 1)/λ ,   λ ≠ 0
  = log x ,         λ = 0                     (3.32)


The fundamental assumption was that for some λ the trans-

formed observations y can be treated as independently

normally distributed with constant variance σ² and with

expectations defined by a linear model



E[y] = A L (3.33)



where A is a known constant matrix and L is a vector of

unknown parameters associated with the transformed observa-

tions (Box and Cox, 1964).

This transformation has the advantage over the simple

power transformation proposed by Tukey (1957)









y = x^λ ,     λ ≠ 0
  = log x ,   λ = 0                           (3.34)

of being continuous at X=0. Otherwise the two transforma-

tions are identical provided, as has been shown by

Schlesselman (1971), that the linear model of (3.33) con-

tains a constant term.
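The continuity of the Box-Cox family at λ = 0 can be seen directly: (x^λ − 1)/λ tends to log x as λ → 0, whereas Tukey's x^λ tends to 1. A minimal sketch (example value assumed):

```python
import math

# Box-Cox: y = (x**lam - 1)/lam approaches log(x) as lam -> 0, so
# defining y = log(x) at lam = 0 makes the family continuous.
def box_cox(x, lam):
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

x = 7.3
print(abs(box_cox(x, 1e-8) - box_cox(x, 0.0)) < 1e-6)   # True

# Tukey's form jumps: x**lam -> 1 as lam -> 0, not log(x).
print(abs(x ** 1e-8 - math.log(x)) > 0.5)               # True
```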

Further, Draper and Cox (1969) showed that the value

of λ obtained from this family of transformations can be
useful even in cases where no power transformation can

produce normality exactly. Also, John and Draper (1980)

suggested an alternative one-parameter family of transfor-

mations when the power transformation fails to produce

satisfactory distributional properties as in the case of

a symmetric distribution with long tails.

The selection of the exact transformation to normality

(zero skewness) is not an easy task, and over-transforma-

tion, i.e., transformation of the original data with a

large positive (negative) skewness to data with a small

negative (positive) skewness, or under-transformation, i.e.,

transformation of the original data with a large positive

(negative) skewness to data with a small positive (negative)

skewness, may result in unsatisfactory modeling of the series

or in forecasts that are in error. This was the case for

the data used by Chatfield and Prothero (1973a), who applied

the Box-Jenkins forecasting approach and were dissatisfied

with the results, concluding that the Box-Jenkins forecast-

ing procedure is less efficient than other forecasting









methods. They applied a log transform to the data which

evidently over-transformed the data, as shown by Box and

Jenkins (1973) who finally suggested the approximate trans-

formation y = x^0.25, even though the complicated but precise

Box-Cox procedure gave an estimate of λ = 0.37 [Wilson

(1973)].

Thus, the selection of the normality transformation

greatly affects the forecasts, as Chatfield and Prothero

(1973b) experienced with their data. They concluded

that

. . . We have seen that a "small" change in λ
from 0 to 0.25 has a substantial effect on the
resulting forecasts from model A [ARIMA(1,1,1) x
ARIMA(1,1,1)12] even though the goodness of fit
does not seem to be much affected. This reminds
us that a model which fits well does not neces-
sarily forecast well. Since small changes in λ
close to zero produce marked changes in forecasts,
it is obviously advisable to avoid "low" values
of λ, since a procedure which depends critically
on distinguishing between fourth-root and
logarithmic transformation is fraught with peril.
On the other hand a "large" change in λ from 0.25
to 1 appears to have relatively little effect on
forecasts. So we conjecture that Box-Jenkins
forecasts are robust to changes in the transfor-
mation parameter away from zero. . . . [Chatfield
and Prothero (1973b), p. 347]



Stationarity

Most time series occurring in practice exhibit non-

stationarity in the form of trends or periodicities. The

physical knowledge of the phenomenon being studied and a

visual inspection of the plot of the original data may give

the first insight into the problem. Usually the length

of the series is not long enough, and the detection of









trends or cycles only through the plot of the series is

ambiguous. Useful tools for the detection of periodicities

are the autocorrelation function and the spectral density

function of the series (which is the Fourier transform of

the autocorrelation function). If a seasonal pattern is

present in the series then the correlogram (plot of the

autocorrelation function) will exhibit a sinusoidal appear-

ance and the periodogram (plot of the spectral density

function) will show peaks. The period of the sinusoidal

function of the correlogram, or the frequency where the

peaks occur in the periodogram, can determine the periodic

component exactly (Jenkins and Watts, 1968). Another device

for the detection of trends and periodicities is to fit

some definite mathematical function, such as exponentials,

Fourier series or polynomials to the series and then model

the residual series, which is assumed to be stationary.

More details on the treatment of nonstationary data as well

as on the interpretation of the correlogram and periodogram

of a time series can be found in textbooks such as Bendat

and Piersol (1958), Jenkins and Watts (1968), Wastler (1969),

Yevjevich (1972), and Chatfield (1980).

Apart from the approach of removing the nonstationarity

of the original series and modeling the residual series

with a stationary ARMA(p,q) model, the original nonsta-

tionary series can be modeled directly with a simple or

seasonally integrated ARIMA model. Actually, the second

approach can be viewed as an extension of the first one,









i.e., the nonstationarity is removed through the simple (∇)

or seasonal (∇_s) differencing. However, the integrated

model cannot be used for generation of data, as has already

been discussed.

For many hydrologic applications, one is satisfied

with second order or weak stationarity, i.e., stationarity

in the mean and variance. Furthermore, weak stationarity

and the assumption of normality imply strict stationarity

(see Appendix A).



Monthly Rainfall Series



Normalization and Stationarization

Stidd (1953, 1968) suggested that rainfall data have

a cube root normal distribution because they are product

functions of three variables: vertical motion in the

atmosphere, moisture, and duration time. Synthetic rainfall

data generated using processes analogous to those operating

in nature showed that the exponent required to normalize

the distribution is between 0.5 (square root) and 0.33

(cube root) for different types of rainfall (Stidd, 1970).

The square root transformation has been extensively

used for the approximate normalization of monthly rainfall

series (see Table C12 of Appendix C) with satisfactory

results: Delleur and Kavvas (1978), Salas et al. (1980,

Ch. 5), and Roesner and Yevjevich (1966). However, Hinkley (1977)

used the exact Box-Cox transformation for monthly rainfall









series. Although Ansley et al. (1977) have developed an

efficient algorithm for the estimation of X along with other

parameters in an ARIMA model, it seems that the exact value

of λ is not more reliable than the approximate one λ = 0.5

(Chatfield and Prothero, 1973b). The reasons for this

follow.

First, Chatfield and Prothero (1973b) used the Box-Cox

procedure to evaluate the exact transformation of their

data. They obtained estimates λ = 0.24 using all the data

(77 observations), λ = 0.34 using the first 60 observations

and λ = 0.16 excluding the first year's data. Therefore,

it is logical to infer that even if the complicated Box-Cox

procedure for the incomplete rainfall record is used, the

missing values may be enough to give a spurious λ, which is

not "more exact" than the value of 0.5 used in practice.

Second, we may also notice that the use of either

λ = 0.33 (cube root) or λ = 0.5 (square root) is not

expected to greatly affect the forecasts since, according to

Chatfield and Prothero (1973b), the Box-Jenkins forecasts

are not too sensitive to changes of λ for λ > 0.25.

Monthly rainfall series are nonstationary. The

variation in the mean is obvious since generally the

expected monthly rainfall value for January is not the same

as that of July. Although the variation of the standard

deviation is not so easy to visualize, calculations show

that months with higher mean usually have higher standard

deviation. Thus, each month has its own probability










distribution and its own statistical parameters resulting in

monthly series that are nonstationary.

By introducing the concept of circular stationarity

as developed by Hannan (1960) and others (see Appendix A

for definition), the periodic monthly rainfall series can

be considered not as nonstationary but circular stationary,

since circular stationarity suggests that the probability

distribution of rainfall in a particular month is the same

for the different years. Then, the monthly rainfall series

is composed of a circularly stationary (periodic) component

and a stationary random component.

The time-series models currently used in hydrology are

fitted to the stationary random component, so the circularly

stationary component must be removed before modeling. This

last component appears as a sinusoidal component in the

autocorrelation function (with a 12-month period) or as a

discrete spectral component in the spectrum (peak at the

frequency 1/12 cycle per month). Usually several subhar-

monics of the fundamental 12-month period are needed to

describe all the irregularities present in the autocorre-

lation function and spectral density function, since in

nature the periodicity does not follow an ideal cosine

function with a 12-month period. The use of a Fourier

series approach for the approximation of the periodic

component of monthly rainfall and monthly runoff series has

been illustrated by Roesner and Yevjevich (1966).









Kavvas and Delleur (1975) investigated three methods

of removal of periodicities in the monthly rainfall series:

nonseasonal (first-lag) differencing, seasonal differencing

(12-month difference), and removal of monthly means. They

worked both analytically and empirically using the rescaled

(divided by the monthly standard deviation) monthly rainfall

square roots for fifteen Indiana watersheds. They concluded

that "all the above transformations yield hydrologic series

which satisfy the classical second-order weak stationarity

conditions. Both seasonal and nonseasonal differencing

reduce the periodicity in the covariance function but

distort the original spectrum, thus making it impractical

or impossible to fit an ARMA model for generation of

synthetic monthly series. The subtraction of monthly

means removes the periodicity in the covariance and the

amount of nonstationarity introduced is negligible for

practical purposes." (Kavvas and Delleur, 1975, p. 349.) In

other words, they concluded that the best way for modeling

monthly rainfall series is to remove the seasonality (by sub-

tracting the monthly means and dividing by the standard

deviations of the normalized series) and then use a station-

ary ARMA(p,q) model to model the stationary normal residuals.
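The standardization step they recommend can be sketched as follows (synthetic data assumed; the thesis applies this to square-root-transformed rainfall): arrange the monthly series as a (years x 12) array, subtract each calendar month's mean, and divide by each month's standard deviation.

```python
import numpy as np

# Remove the periodic (circularly stationary) component by
# standardizing each calendar month separately.
rng = np.random.default_rng(1)
monthly_mean = np.linspace(2.0, 8.0, 12)        # assumed seasonal means
monthly_std = np.linspace(0.5, 2.0, 12)         # assumed seasonal stds
y = monthly_mean + monthly_std * rng.standard_normal((55, 12))

z = (y - y.mean(axis=0)) / y.std(axis=0)        # deseasonalized residuals

# Each calendar month now has sample mean 0 and variance 1.
print(np.allclose(z.mean(axis=0), 0.0))
print(np.allclose(z.std(axis=0), 1.0))
```

The residual series z, read row by row, is the (approximately) stationary input to the ARMA(p,q) fit.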



Modeling of Normalized Series

It is assumed that the nonstationarities due to long-

term trends are removed before any operation. Then the

appropriate transformation is applied to the data in










order to obtain an approximately normal distribution. For

monthly rainfall series experience has shown that the best

practical transformation is the square root transformation,

as has already been discussed. What remains is the modeling

of the normalized series with one of the following models:

stationary ARMA(p,q), simple nonstationary ARIMA(p,d,q),

seasonal nonstationary ARIMA(P,D,Q)s or multiplicative

ARIMA(p,d,q)x(P,D,Q) model.

Delleur and Kavvas (1978) fitted different models to

the monthly rainfall series of 15 basins in Indiana and

compared the results. They studied the models: ARIMA

(0,0,0), ARIMA(1,0,1), ARIMA(1,1,1), ARIMA(1,1,1)12,

and ARIMA(1,0,0)x(1,1,1)12 on the square-root trans-

formed series. They concluded that from the nonseasonal

ARIMA models, ARMA(1,1) "emerged as the most suitable for

the generation and forecasting of monthly rainfall series."

The goodness-of-fit tests applied on the residuals were

the portmanteau lack-of-fit test (see Appendix A) of Box

and Pierce (1970) and the cumulative periodogram test (Box

and Jenkins, 1976, p. 294). The ARMA(1,1) model passed both

tests in all cases studied. From the seasonal models,

ARIMA(1,0,0)x(1,1,1)12 also passed the goodness-of-fit tests

in all cases, but they stress that this model "has only

limited use in the forecasting of monthly rainfall series

since it does not preserve the monthly standard deviations."

As far as forecasts are concerned, they showed that "the

forecasts by the several models follow each other very









closely and the forecasts rapidly tend to the mean of the

observed rainfall square roots (which is the forecast of the

white noise model)."















CHAPTER 4

MULTIVARIATE STOCHASTIC MODELS



Introduction

For univariate stochastic models the sequence of

observations under study is assumed independent of other

sequences of observations and so is studied by itself

(single or univariate time series). However, in practice

there is always an interdependence among such sequences of

observations, and their simultaneous study leads to the

concept of multivariate statistical analysis. For example,

a rainfall series of one station may be better modeled if

its correlation with concurrent rainfall series at other

nearby stations is incorporated into the model. Multiple

time series can be divided into two groups: (1) multiple

time series at several points (e.g., rainfall series at

different stations, streamflow series at various points of

a river), and (2) multiple series of different kinds at one

point (e.g., rainfall and runoff series at the same station).

In general, both kinds of multiple time series are studied

simultaneously, and their correlation and cross-correlation

structure is used for the construction of a model that

better describes all these series. The parameters of this

so called multivariate stochastic model are calculated such









that the correlation and cross-correlation structure of the

multiple measured series are preserved in the multiple

series generated by the model.

The multivariate models that will be presented in this

chapter have been developed and extensively used for the

generation of synthetic series. How these models can be

adapted and used for filling in missing values will be

discussed in chapter 5.



General Multivariate Regression Model

The general form of a multivariate regression model is


Y = AX + B H (4.1)


where Y is the vector of dependent variables, X the vector

of independent variables, A and B matrices of regression

coefficients, and H a vector of random components. The

vectors Y and X may consist of either the same variable at

different points (or at different times) or different

variables at the same or different points (or at different

times).

For convenience and without loss of generality all the

variables are assumed second order stationary and normally

distributed with zero mean and unit variance. Transforma-

tions to accomplish normality have been discussed in Chapter

3. A random component is superimposed on the model to

account for the nondeterministic fluctuations.

In the above model, the dependent and independent

variables must be selected carefully so that the most









information is extracted from the existing data. A good

summary of the methods for the selection of independent

variables for use in the model is given in Draper and Smith

(1966). Most popular is the stepwise regression procedure

in which the independent variables are ranked as a function

of their partial correlation coefficients with the dependent

variable and are added to the model, in that order, if they

pass a sequential F test.

The parameter matrices A and B are calculated from

the existing data in such a way that important statistical

characteristics of the historical series are preserved in

the generated series. This estimation procedure becomes

cumbersome when too many dependent and independent variables

are involved in the model, and several simplifications are

often made in practice. On the other hand, restrictions

have to be imposed on the form of the data, as we shall see

later, to ensure the existence of real solutions for the

matrices A and B.



Multivariate Lag-One Autoregressive Model

If only one variable (e.g., rainfall at different

stations) is used in the analysis then the model of equa-

tion (4.1) becomes a multivariate autoregressive model.

Since in the rest of this chapter we will be dealing only

with one variable (rainfall) which has been transformed to

normal and second order stationary, the vectors Y and X are

replaced by the vector Z for a notation consistent with the









univariate models. Matalas (1967) suggested the multivari-

ate lag-one autoregressive model



Z_t = A Z_{t-1} + B H_t          (4.3)



where Z_t is an (m x 1) vector whose ith element z_{i,t} is the

observed rainfall value at station i and at time t, and the

other variables have been described previously.

Such a model can be used for the simultaneous genera-

tion of rainfall series at m different stations. The

correlation and cross-correlation of the series is incor-

porated in the model through the parameters A and B.

The matrices A and B are estimated from the historical

series so that the means, standard deviations and auto-

correlation coefficients of lag-one for all the series, as

well as the cross-correlations of lag-zero and lag-one

between pairs of series are maintained.

Let M0 denote the lag-zero correlation matrix which

is defined as



M_0 = E[Z_t Z_t^T]          (4.4)



Then a diagonal element of M_0 is E[z_{i,t} z_{i,t}] = ρ_ii(0) = 1

(since Z_t is standardized) and an off-diagonal element (i,j)

is E[z_{i,t} z_{j,t}] = ρ_ij(0), which is the lag-zero cross-corre-

lation between series {z_i} and {z_j}. The matrix M_0 is

symmetric since ρ_ij(0) = ρ_ji(0) for every i, j.









Let M1 denote the lag-one correlation matrix defined

as


M_1 = E[Z_t Z_{t-1}^T]          (4.5)



A diagonal element of M_1 is E[z_{i,t} z_{i,t-1}] = ρ_ii(1), which

is the lag-one serial correlation coefficient of the

series {z_i}, and an off-diagonal element (i,j) is

E[z_{i,t} z_{j,t-1}] = ρ_ij(1), which is the lag-one cross-corre-

lation between the {z_i} and {z_j} series, the latter lagged

behind the former. Since in general ρ_ij(1) ≠ ρ_ji(1) for

i ≠ j, the matrix M_1 is not symmetric.

After some algebraic manipulations (see Appendix B) the

coefficient matrices A and B are obtained as solutions to

the equations


A = M_1 M_0^-1          (4.6)


B B^T = M_0 - M_1 M_0^-1 M_1^T          (4.7)


where M_0^-1 is the inverse of M_0, and M_1^T the transpose of M_1.

The correlation matrices M0 and M1 are calculated from the

data. Then an estimate of the matrix A is given directly

by equation (4.6), and an estimate for B is found by solving

equation (4.7) by using a technique of principal component

analysis (Fiering, 1964) or upper triangularization (Young,

1968). For more details on the solution of equation (4.7)

see Appendix B.
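Equations (4.6) and (4.7) translate directly into a few lines of linear algebra. The sketch below (correlation matrices assumed for illustration; the thesis computes them from data) uses a Cholesky factorization as one valid square root of B B^T, in place of the principal-component or triangularization techniques cited above:

```python
import numpy as np

# Estimate A and B of the multivariate AR(1) model from assumed
# lag-zero and lag-one correlation matrices via (4.6) and (4.7).
M0 = np.array([[1.0, 0.6, 0.4],
               [0.6, 1.0, 0.5],
               [0.4, 0.5, 1.0]])
M1 = np.array([[0.5, 0.3, 0.2],
               [0.25, 0.4, 0.2],
               [0.15, 0.2, 0.45]])

A = M1 @ np.linalg.inv(M0)                 # (4.6): A = M1 M0^-1
C = M0 - M1 @ np.linalg.inv(M0) @ M1.T     # (4.7): B B^T = C
B = np.linalg.cholesky(C)                  # requires C positive definite

print(np.allclose(B @ B.T, C))             # True
print(np.allclose(A @ M0, M1))             # (4.6) rearranged: True
```

The Cholesky step is exactly where the positive semidefiniteness requirement discussed below becomes binding: it fails if C is not positive definite.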









Comments on Multivariate AR(1) Model



Assumption of Normality and Stationarity

We have assumed that all random variables involved in

the model are normal. The assumption of a multivariate

normal distribution is convenient but not necessary. It has

been shown (Valencia and Schaake, 1973) that the multivari-

ate AR(1) model preserves first and second order statistics

regardless of the underlying probability distributions.

Several studies have been done using directly the

original skewed series. Matalas (1967) worked with log-

normal series and constructed the generation model so that

it preserves the historical statistics of the log-normal

process. Mejia et al. (1974) showed a procedure for multi-

variate generation of mixtures of normal and log-normal

variables. Moran (1970) indicated how a multivariate gamma

process may be applied, and Kahan (1974) presented a method

for the preservation of skewness in a linear bivariate

regression model. But in general, the normalization of the

series prior to modeling is more convenient, especially when

the series have different underlying probability distribu-

tions. In such cases different transformations are applied

on the series, and that combination of transformations is

kept which yields minimum average skewness. Average skew-

ness is the sum of the skewness of each series divided by

the number of series or number of stations used. This

operation is called finding the MST (Minimum Skewness









Transformation) and results in an approximately multivariate

normal distribution (Young and Pisano, 1968).

We have also assumed that all variables are standard-

ized, e.g., have zero mean and unit variance. This assump-

tion is made without loss of generality since the linear

transformations are preserved through the model. On the

other hand this transformation becomes necessary when

modeling periodic series since by subtracting the periodic

means and dividing by the standard deviations we remove

almost all of the periodicity.

If the data are not standardized, M0 and M1 represent

the lag-zero and lag-one covariance matrices (instead of

correlation matrices), respectively. If S denotes the

diagonal matrix of the standard deviations and R0, R1 the

lag-zero and lag-one correlation matrices then



M_0 = S R_0 S          (4.8)


and


M_1 = S R_1 S          (4.9)



When we standardize the data the matrix S is an identity

matrix and M_0, M_1 become the correlation matrices R_0 and R_1

respectively. Thus, one other advantage of standardization

is that we work with correlation matrices whose elements are

less than unity and the computations are likely to be more

stable (Pegram and James, 1972).









Cross-Correlation Matrix M1

Notice that the lag-one correlation matrix M_1 has been

defined as M_1 = E[Z_t Z_{t-1}^T], which contains the lag-one

cross-correlations between pairs of series but having the

second series lagged behind the first one. Following this

definition the lag-minus-one correlation matrix will be



M_{-1} = E[Z_{t-1} Z_t^T]          (4.10)


and it will contain the lag-one correlations having now the

second series lagged ahead of the first one. It is easy to

show that M_{-1} is actually the transpose of M_1:



M_{-1} = E[Z_{t-1} Z_t^T] = E[(Z_t Z_{t-1}^T)^T] = M_1^T          (4.11)



Care then must be taken so that there is a consistency

between the equation used to calculate matrix A and the way

that the cross-correlation coefficients have been calculated.

Such an inconsistency was present in the numerical multisite

package (NMP) developed by Young and Pisano (1968) and was

first corrected by O'Connell (1973) and completely corrected

and improved by Finzi et al. (1974, 1975).



Incomplete Data Sets

In practice, hydrologic series at different stations

are unlikely to be concurrent and of equal length. With

lag-zero auto- and cross-correlation coefficients calculated










from the incomplete data sets, the lag-zero correlation

matrix M_0 obtained may not be positive semidefinite, and

its inverse M_0^-1, needed for the calculation of matrix A,

thus may have elements that are complex numbers. Also, a

necessary and sufficient condition for a real solution of

matrix B is that C = M_0 - M_1 M_0^-1 M_1^T is a positive semi-

definite matrix (see Appendix B).

When all of the series are concurrent and complete

then M_0 and C are both positive semidefinite matrices [Valencia and

Schaake, 1973], and the generated synthetic series are real

numbers. When the series are incomplete there is no

guarantee that real solutions for the matrices A and B exist

causing the model of Matalas (1967) to be conditional on M0

and C being positive semidefinite [Slack, 1973].

Several techniques have been proposed which use the

incomplete data sets but guarantee the positive semidefinite-

ness of the correlation matrices. Fiering (1968) suggested

a technique that can be used to produce a positive semi-

definite correlation matrix M0. If M0 is not positive

semidefinite then negative eigenvalues may occur and hence

negative variables, since the eigenvalues are variances in

the principal component system. In this technique, the

eigenvalues of the original correlation matrix are calcu-

lated. If negative eigenvalues are encountered, an adjust-

ment procedure is used to eliminate them (thereby altering

the correlation matrix, M0 [Fiering, 1968]).
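A Fiering-style adjustment can be sketched as follows (details assumed, not the 1968 algorithm verbatim): clip the negative eigenvalues to zero, rebuild the matrix, and rescale the diagonal back to ones.

```python
import numpy as np

# Make an inconsistent "correlation" matrix positive semidefinite by
# eliminating its negative eigenvalues (negative variances in the
# principal component system).
def make_psd(M):
    vals, vecs = np.linalg.eigh(M)
    vals = np.clip(vals, 0.0, None)              # drop negative variances
    M_adj = vecs @ np.diag(vals) @ vecs.T
    d = np.sqrt(np.diag(M_adj))
    return M_adj / np.outer(d, d)                # restore unit diagonal

# An inconsistent matrix (not realizable as a correlation matrix):
M_bad = np.array([[1.0, 0.9, -0.9],
                  [0.9, 1.0, 0.9],
                  [-0.9, 0.9, 1.0]])
print(np.linalg.eigvalsh(M_bad).min() < 0)       # True: inconsistent

M_fix = make_psd(M_bad)
print(np.linalg.eigvalsh(M_fix).min() >= -1e-12) # True: now PSD
```

As the text notes, the price of the adjustment is that the entries of M_0 are altered away from their sample estimates.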










A correlation matrix is called consistent if all its

eigenvalues are positive. But consistent estimates of the

correlation matrices M0 and M1 do not guarantee that C will

also be consistent.

Crosby and Maddock (1970) proposed a technique that

is suitable only for monotone data (data continuous in

collection to the present but having different starting

times). This technique produces a consistent estimate of

the matrix M0 as well as of the matrix C, and is based on

the maximum likelihood technique developed by Anderson

(1957).

Valencia and Schaake (1973) developed another tech-

nique. They estimate matrices A and B from the equations


A = M_1 M_01^-1          (4.12)


B B^T = M_02 - M_1 M_01^-1 M_1^T          (4.13)



where M01 is the lag-zero correlation matrix M0 computed

from the first (N-1) vectors of the data, and M02 is com-

puted from the last (N-1) vectors, where N is the number of

data points (number of times sampled) in each of the n

series.



Further Simplification

Sometimes in practice, the preservation of the lag-

zero and lag-one autocorrelations and the lag-zero









cross-correlations is enough. In such cases, i.e., when the

lag-one cross-correlations are of no interest, a nice

simplification can be made due to Matalas (1967, 1974). He

defined matrix A as a diagonal matrix whose diagonal ele-

ments are the lag-one auto-correlation coefficients. With

A defined as above, the lag-one cross-correlation of the

generated series, ρ'_ij(1), can be shown to be the product

of the lag-zero cross-correlation ρ_ij(0) and the lag-one

auto-correlation of the series ρ_ii(1), but of course dif-

ferent from the actual lag-one cross-correlation ρ_ij(1):



ρ'_ij(1) = ρ_ij(0) ρ_ii(1)          (4.14)


By using ρ'_ij(1) of equation (4.14) in place of the actual

ρ_ij(1), thus avoiding the actual computation of ρ_ij(1) from

the data, the desired statistical properties of the series

are still preserved.
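Equation (4.14) follows from the model itself: with A diagonal, the implied lag-one matrix is M_1 = A M_0, so each entry is ρ_ii(1) ρ_ij(0). A short sketch with assumed values:

```python
import numpy as np

# With A diagonal (entries = lag-one autocorrelations), the AR(1)
# model implies M1 = A @ M0, so entry (i,j) is rho_ii(1) * rho_ij(0),
# which is eq. (4.14).
M0 = np.array([[1.0, 0.7, 0.3],
               [0.7, 1.0, 0.5],
               [0.3, 0.5, 1.0]])
rho1 = np.array([0.6, 0.4, 0.5])      # assumed lag-one autocorrelations
A = np.diag(rho1)

M1_implied = A @ M0
for i in range(3):
    for j in range(3):
        assert np.isclose(M1_implied[i, j], rho1[i] * M0[i, j])
print(round(M1_implied[0, 1], 2))     # 0.6 * 0.7 = 0.42
```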



Higher Order Multivariate Models

The order p of a multivariate autoregressive model

could be estimated from the plots of the autocorrelation

and partial autocorrelation functions of the series (Salas

et al., 1980) as an extension of the univariate model

identification, which is already a difficult and ambiguous

task. However, in practice first and second order models

are usually adequate and higher order models should be

avoided (Box and Jenkins, 1976).








In any case, the multivariate multilag autoregressive

model of order p takes the form


Z_t = Σ_{k=1}^p A_k Z_{t-k} + B H_t          (4.15)


and the matrices A_1, A_2, ..., A_p, B are the solutions of the

equations


M_i = Σ_{k=1}^p A_k M_{i-k} ,   i = 1, 2, ..., p          (4.16)


B B^T = M_0 - Σ_{k=1}^p A_k M_k^T          (4.17)


where M_k is the lag-k correlation matrix. Equation (4.16) is

a set of p matrix equations to be solved for the matrices

A_1, A_2, ..., A_p, and matrix B is obtained from (4.17) using

techniques already discussed. Here, the assumption of diag-

onal A matrices becomes even more attractive. For a multi-

variate second-order AR process the above simplification is

illustrated in Salas and Pegram (1977) where the case of

periodic (not constant) matrix parameters is also considered.

O'Connell (1974) studied the multivariate ARMA(1,1)

model



Z_t = A Z_{t-1} + B H_t - C H_{t-1}          (4.18)



where A, B, and C are coefficient matrices to be determined










from the data. Specifically they are solutions of the

system of matrix equations


B B^T + C C^T = S
                                              (4.19)
C B^T = T



where S and T are functions of the correlation matrices

M0, M1 and M2. Methods for solving this system are proposed

by O'Connell (1974).

Explicit solutions for higher order multivariate ARMA

models are not available and Salas et al. (1980) propose an

approximate multivariate ARMA(p,q) model.
















CHAPTER 5

ESTIMATION OF MISSING MONTHLY RAINFALL VALUES--
A CASE STUDY




Introduction

This section compares and evaluates different methods

for the estimation of missing values in hydrological time

series. A case study is presented in which four of the

simplified methods presented in Chapter 2 have been applied

to a set of four concurrent 55 year monthly rainfall series

from south Florida and the results compared. Also a

recursive method for the estimation of missing values by the

use of a univariate or multivariate stochastic model has

been proposed and demonstrated. The theory already

presented in Chapters 2, 3 and 4 is supplemented whenever

needed.



Set Up of the Problem

The monthly rainfall series of four stations in the

South Florida Water Management District (SFWMD) have been

used in the analysis. These stations are:

Station A : MRF6038, Moore Haven Lock 1
Station 1 : MRF6013, Avon Park
Station 2 : MRF6093, Fort Myers WSO Ap.
Station 3 : MRF6042, Canal Point USDA.









For convenience the four stations will sometimes be

addressed as A, 1, 2, 3 instead of their SFWMD

identification numbers 6038, 6013, 6093 and 6042,

respectively. Their locations are shown in the map of

Fig. 5.1. Station A in the center is considered as the

interpolation station (whose missing values are to be

estimated) and the other three stations 1, 2 and 3 as the

index stations. Care has been taken so that the three index

stations are as close and as evenly distributed around the

interpolation station as possible.

This particular set of four stations was selected

because it exhibits many desired and convenient properties:

(1) the stations have an overlapping period of 55 years

(1927-1981),

(2) for this 55 year period the record of the

interpolation station (station A) is complete (no

missing values),

(3) the three index stations have a small percent of

missing values for the overlapping period (sta-

tion 1: 2.7% missing, station 2: complete, and

station 3: 1.2% missing values).

The 55 year length of the records is considered long

enough to establish the historical statistics (e.g., monthly

mean, standard deviation and skewness) and provides a

monthly series of a satisfactory length (660 values) for

fitting a univariate or multivariate ARMA model.






















Fig. 5.1. The four south Florida rainfall stations
used in the analysis.
A: 6038, Moore Haven Lock 1
1: 6013, Avon Park
2: 6093, Fort Myers WSO AP.
3: 6042, Canal Point USDA











The completeness of the series of the interpolation

station permits the random generation of gaps in the series,

corresponding to different percentages of missing values,

with the method described in Chapter 1. After the missing

values have been estimated by the applied models, the gaps

are in-filled with the estimated values and the statistics

of the new (estimated) series are compared with the

statistics of the incomplete series and the statistics of

the historical (actual) series. Also the statistical

closeness of the in-filled (estimated) values to the hidden

(actual) values provides a means for the evaluation and

comparison of the methods.
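This artificial-gap experiment can be sketched as follows. The thesis's own gap-generation scheme is described in Chapter 1; the uniform random selection below is a generic stand-in for it, and the series is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1927)  # arbitrary seed

def make_gaps(series, fraction):
    """Randomly hide a fraction of a complete series, returning the
    incomplete series (np.nan at the gaps), the gap positions, and the
    hidden (actual) values for later comparison."""
    n = len(series)
    idx = rng.choice(n, size=int(round(fraction * n)), replace=False)
    incomplete = np.asarray(series, dtype=float).copy()
    hidden = incomplete[idx].copy()
    incomplete[idx] = np.nan
    return incomplete, idx, hidden

# 660 synthetic "monthly" values and the five gap levels of the study
series = rng.gamma(shape=2.0, scale=2.0, size=660)
levels = [0.02, 0.05, 0.10, 0.15, 0.20]
gap_counts = [len(make_gaps(series, p)[1]) for p in levels]
```

The realized gap counts of the thesis (Table 5.7) differ slightly from round(fraction × 660) because its Chapter 1 scheme is not this simple uniform draw.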

When, for the estimation of a missing value of the

interpolation station, the corresponding value of one or

more index stations is also missing the latter is eliminated

from the analysis, i.e., only the remaining one or two index

stations are used for the estimation. Frequent occurrence

of such concurrent gaps in both the interpolation and the

index stations would alter the results of the applied method

in a way that cannot be easily evaluated (e.g., another

parameter such as the probability of having concurrent gaps

should be included in the analysis). A small number of

missing values in the selected index stations eliminates the

possibility of such simultaneous gaps, and thus the

effectiveness of the applied estimation procedures can be

judged more efficiently.









The statistical properties (e.g., monthly mean,

standard deviation, skewness and coefficient of variation)

of the truncated (to the 1927-1981 period) original monthly

rainfall series for the four stations are shown on

Tables C.1, C.2, C.3 and C.4 of Appendix C. Figure 5.2

shows the plot of the monthly means and standard deviations

for station A. From these plots we observe that: (1) the

plot of monthly means is in agreement with the typical plot

for Florida shown in Fig. 1.1, and (2) months with a high

mean usually have a high standard deviation. The only

exception seems to be the month of January which in spite of

its low mean exhibits a high standard deviation and

therefore a very high coefficient of variation and an

unusually high skewness. A closer look at the January

rainfall values of station A shows that the unusual

properties for that month are due to an extreme value of

21.4 inches of rainfall for January 1979, the other values

being between 0.05 and 6.04 inches.

The three index stations 1, 2 and 3 are at distances

59 miles, 51 miles and 29 miles respectively from the

interpolation station A.



Simplified Estimation Techniques



Techniques Utilized

From the simplified techniques presented in Chapter 2,

the following four are applied for the estimation of missing










Fig. 5.2. Plot of the monthly means and standard deviations--
station 6038 (1927-1981).
(a) monthly means (inches); (b) monthly standard deviations (inches).









monthly rainfall values:

(1) the mean value method (MV)

(2) the reciprocal distances method (RD)

(3) the normal ratio method (NR), and

(4) the modified weighted average method (MWA).

These methods are all deterministic and are applied directly

on the available data permitting thus a uniform and

objective comparison of the results. The mean value plus

random component method has not been included in this

thesis.
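The first three estimators can be sketched in their standard textbook forms (the exact expressions are those of Chapter 2, which is not reproduced in this section, so the forms below are assumptions; MWA is omitted here since it needs the covariance machinery discussed later):

```python
import numpy as np

def mean_value(index_obs):
    """MV: simple average of the concurrent index-station values."""
    return float(np.mean(index_obs))

def reciprocal_distance(index_obs, distances, power=2):
    """RD: average weighted by reciprocal distance, w_i = 1/d_i**power."""
    w = 1.0 / np.asarray(distances, dtype=float) ** power
    return float(np.sum(w * np.asarray(index_obs)) / np.sum(w))

def normal_ratio(index_obs, index_normals, target_normal):
    """NR: each index value is scaled by the ratio of the target
    station's normal (long-term mean) rainfall to the index station's
    normal, then the scaled values are averaged."""
    obs = np.asarray(index_obs, dtype=float)
    ratios = target_normal / np.asarray(index_normals, dtype=float)
    return float(np.mean(ratios * obs))

# Hypothetical month at station A, using the distances quoted in the
# text (59, 51 and 29 miles); observations and normals are invented
obs = [3.2, 4.1, 2.8]
est_mv = mean_value(obs)
est_rd = reciprocal_distance(obs, [59.0, 51.0, 29.0])
est_nr = normal_ratio(obs, index_normals=[3.0, 3.5, 2.5], target_normal=3.1)
```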

The above four methods will be applied for five

different percentages of missing values: 2%, 5%, 10%, 15%

and 20%. These percentages cover almost 80% of all cases

encountered in practice as has been shown in Table 1.1

(i.e., 80% of the stations have below 20% missing values).

From the same table it can also be seen that almost 30% of

the stations have below 5% missing values. Therefore, it

would be of interest and practical use if we could

generalize the results for the region of below 5% missing

values since a large fraction of the cases in practice fall

in this region.

The application of the first three methods (MV, RD, NR

methods) is straightforward and no further comments need be

made. However, some comments on the least squares (LS)

method and the modified weighted average (MWA) method are

necessary.









Least Squares Method (LS)

The least squares method although simple in principle

involves an enormous amount of calculations, and for that

reason it has been excluded from this study. For example,

consider the case in which the interpolation station A is

regressed on the three index stations 1, 2 and 3. The

estimated values will be given by:



y' = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + ε                      (5.1)



where a, b_1, b_2, b_3 are the regression coefficients

calculated from the available concurrent values of all the

four variables. There are 12 such regression equations, one

for each month. But if it happens that an index station

(say, station 3) has a missing value simultaneously with the

interpolation station, a new set of 12 regression equations

is needed for the estimation, i.e.,

y' = a' + b'_1 x_1 + b'_2 x_2 + ε                             (5.2)



Unless this coincidence of simultaneously missing values is

investigated manually so that only the needed least squares

regressions are performed (Buck, 1960), all the possible

combinations of regressions must be performed.

This involves regressions among all the four variables

(y; x_1, x_2, x_3), among three of them (y; x_1, x_2),

(y; x_1, x_3), (y; x_2, x_3) and between pairs of them (y; x_1),









(y; x_2), (y; x_3), giving overall 7 sets of 12 regression

equations. Because the regression coefficients are

different for each percentage of missing values (since their

calculation is based only on the existing concurrent values)

the 84 (7 x 12) regressions must be repeated for each level

of missing values (420 regressions overall for this study).
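A single one of these regressions amounts to an ordinary least-squares fit of equation (5.1) on the concurrent values; the 12 monthly fits can be sketched as below, using synthetic data with a known relation (all names and numbers here are illustrative, not the thesis data):

```python
import numpy as np

def monthly_regressions(y, x, months):
    """Fit equation (5.1), y = a + b1*x1 + b2*x2 + b3*x3, separately
    for each calendar month, using only rows where all four series are
    present.  y: (n,), x: (n, 3), months: (n,) values in 1..12."""
    coeffs = {}
    for m in range(1, 13):
        rows = (months == m) & ~np.isnan(y) & ~np.isnan(x).any(axis=1)
        X = np.column_stack([np.ones(rows.sum()), x[rows]])
        beta, *_ = np.linalg.lstsq(X, y[rows], rcond=None)
        coeffs[m] = beta  # [a, b1, b2, b3]
    return coeffs

# Synthetic demonstration: 55 years x 12 months with a known relation
rng = np.random.default_rng(0)
months = np.tile(np.arange(1, 13), 55)
x = rng.gamma(2.0, 2.0, size=(660, 3))
y = 0.5 + x @ np.array([0.2, 0.3, 0.4]) + rng.normal(0.0, 0.1, 660)
coeffs = monthly_regressions(y, x, months)
```

Handling the 7 combinations of available index stations would mean calling such a fit once per combination, which is exactly the bookkeeping burden described above.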

It could be argued that the same 12 regression

equations (y; x_1, x_2, x_3) could be kept and a missing value

x_i replaced by its mean x̄_i or by another estimate x'_i. In

that case equation 5.1 would become



y' = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + ε ,                    (5.3)



the coefficients of regression a, b_1, b_2, b_3 remaining

unchanged. This in fact can be done, but then the method

tested will not be the "pure" least squares method since the

results will depend on the secondary method used for the

estimation of the missing x_i values.
The coefficients a, b_1, b_2 and b_3 (equation 5.1) of the

regression of the {y} series (of station A with 2% missing

values) on the series {x_1}, {x_2} and {x_3} (of stations 1, 2

and 3 respectively) are shown in Table 5.1. In the same

table the values of the squared multiple regression

coefficient R² and the standard deviation of the {y} series

are also shown. The numbers in parentheses show the

significance level α at which the parameters are significant

(the percent probability of being nonzero is (1-α)·100). For









Table 5.1. Least Squares Regression Coefficients for
Equation (5.1) and Their Significance Levels.
The standard deviation, s, for each month is
also given.


         a        b_1      b_2      b_3      R²       s
                                                   inches

JAN     0.0059   0.1271   0.4994   0.3377   0.8046
       (0.9692) (0.2790) (0.0005) (0.0017) (0.0001)

FEB     0.1355   0.2624   0.0086   0.5345   0.7033
       (0.5260) (0.0025) (0.9431) (0.0001) (0.0001)

MAR     0.0052   0.1617   0.3457   0.4507   0.9142   2.464
       (0.9793) (0.0138) (0.0001) (0.0001) (0.0001)

APR     0.7388   0.2405   0.2813   0.1919   0.4936
       (0.0273) (0.0458) (0.0156) (0.1132) (0.0001)

MAY     2.1302   0.4046  -0.0591   0.2186   0.2752   2.583
       (0.0070) (0.0115) (0.7180) (0.1308) (0.0016)

JUN     1.8765   0.2192   0.1108   0.3339   0.3351
       (0.1505) (0.1576) (0.4034) (0.0133) (0.0002)

JUL     2.8601  -0.0345   0.3993   0.1885   0.2005
       (0.0750) (0.7883) (0.0131) (0.1780) (0.0154)

AUG     2.0820   0.1771   0.2078   0.2660   0.1789
       (0.2065) (0.1666) (0.0787) (0.0589) (0.0248)

SEP     0.0108   0.5102   0.2113   0.2450   0.5669
       (0.9916) (0.0003) (0.0893) (0.0190) (0.0001)

OCT    -0.6985   0.3960   0.2287   0.4667   0.7749   3.073
       (0.0866) (0.0020) (0.0433) (0.0001) (0.0001)

NOV     0.3167   0.3009   0.2473   0.1063   0.4575   1.228
       (0.1290) (0.0030) (0.0804) (0.0069) (0.0001)

DEC    -0.2623   0.2332   0.3807   0.4381   0.7723
       (0.1987) (0.1065) (0.0084) (0.0001) (0.0001)









example, for January the coefficient b_1 is not significant

at the 5% significance level (α = 0.05) since 0.279 is

greater than 0.05, but the R² coefficient is significant

even at the 0.01% significance level (α = 0.0001). The

significance levels correspond to the "t-test" for the

regression coefficients and to the "F-test" for the R²

coefficients. The standard deviation, s, of the {y} series

is also listed since the random component is given by



ε = (1 - R²)^{1/2} s                                          (5.4)



as has already been discussed in Chapter 2.

It is interesting to note that although the multiple

regression coefficient R² varies for each month from as low

as 0.18 to as high as 0.91 it is always significant at the

5% significance level. The months of July and August

exhibit the lowest (although significant) correlation

coefficients as is expected for Florida. The physical

reason for these low correlations is that in the summer most

rainfall is convective, whereas in other months there is

more cyclonic activity. Rainfall from scattered

thunderstorms is simply not as correlated with that of

nearby areas as is rainfall from broad cyclonic activity.

Thus, on the basis of the regressions shown in Table 5.1,

the least squares method would be expected to perform least

well in the summer in Florida, but this point is not

validated in this thesis.









Modified Weighted Average Method (MWA)

For the modified weighted average method the twelve

(3x3) covariance matrices of the three index stations have

been calculated for each month using equation (2.9) and

(2.10), and are shown in Table C.11 (appendix C). Also the

monthly standard deviations, s have been estimated from

the known {y} series, and the monthly standard deviations,

s' have been calculated by equation (2.11) using the
y
calculated covariance matrices. Notice that although the

twelve s values (as calculated from the actual data and

which we want to preserve) are different at different

percentages of missing values, the twelve s' values (that
y
depend only on the weights a. and the covariance matrix of

the index stations) are calculated only once. The

correction coefficients f (f = s /s') for each month and for
y y
each different percentage of missing values which must be

applied on matrix A (equation 2.21) are shown in Table 5.2.

From this table it can be seen that if the simple

weighted average scheme of equation (2.3) were used for the

generation, the standard deviation of November would be

overestimated (by a factor of approximately 2) and the

standard deviation of all other months would be under-

estimated (e.g., by a factor of approximately 0.5 for the

month of January). We also observe that due to small

changes of s_y for different percentages of missing values,

the correction factor f does not vary much either, but tends









Table 5.2. Correction Coefficient, f, for Each Month and
for Each Different Percent of Missing Values
(f = s_y/s'_y).


2% 5% 10% 15% 20%


JAN 1.777 1.777 1.795 1.897 1.872


FEB 1.129 1.142 1.136 1.199 1.188


MAR 1.178 1.207 1.177 1.003 1.009


APR 1.089 0.980 1.061 1.051 1.054


MAY 1.269 1.197 1.212 1.222 1.360


JUN 1.214 1.173 1.192 1.228 1.242


JUL 1.338 1.345 1.386 1.390 1.491


AUG 1.424 1.414 1.425 1.432 1.369


SEP 1.313 1.328 1.325 1.210 1.331


OCT 1.258 1.273 1.218 1.229 1.314


NOV 0.533 0.537 0.509 0.583 0.572


DEC 1.161 1.140 1.169 1.172 1.248










to be slightly greater the greater the percent of missing

values.
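The correction factor can be sketched as below, assuming that equation (2.11) reduces to Var(y') = aᵀ C a for a weighted-average estimate y' = Σ a_i x_i with weights a_i and index-station covariance matrix C (the exact form is given in Chapter 2; the weights and covariances here are invented):

```python
import numpy as np

def correction_factor(weights, cov_index, s_y):
    """f = s_y / s'_y for one month.  s'_y is the standard deviation
    implied by a weighted-average estimate y' = sum_i a_i x_i, for
    which Var(y') = a^T C a with C the covariance matrix of the index
    stations (a sketch of what equation (2.11) computes)."""
    a = np.asarray(weights, dtype=float)
    s_prime = float(np.sqrt(a @ cov_index @ a))
    return s_y / s_prime

# Hypothetical January figures for the three index stations
C = np.array([[4.0, 1.2, 0.8],
              [1.2, 3.0, 0.9],
              [0.8, 0.9, 2.5]])
a = np.array([1 / 3, 1 / 3, 1 / 3])   # e.g. equal weights
f = correction_factor(a, C, s_y=2.6)
```

A factor f > 1 means a plain weighted average would underestimate the month's standard deviation, which is the situation Table 5.2 shows for every month except November.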

The modified weighted average scheme theoretically

preserves the mean and variance of the series as has been

shown in Chapter 2. But this is true for a series that has

been generated by the model and not for a series that is a

mix of existing values and values generated (estimated) by

the model. This illustrates the difference between the two

concepts: "generation of data by a model" and "estimation

of missing values by a model." A method for generation of

data which is considered "good" in the sense that it

preserves first and second order statistics is not

necessarily "good" for the estimation of missing values. In

fact, it may give statistics comparable to the ones given

from a simpler estimation technique which does not preserve

the statistics, even as a generation scheme. Theoretically,

for a "large" number of missing values, the estimation model

operates as a generation model and thus preserves the

"desired" statistics, but practically, for this large amount

of missing values the "desired" statistics (calculated from

the few existing values) are of questionable reliability.

Only for augmentation of the time series (extension of the

series before the first or after the last point) will the

modified weighted average scheme or other schemes that

preserve the "desired" statistics be expected to work better

than the simple weighted average schemes.









One other disadvantage of the modified weighted average

scheme as well as of the least squares scheme is that

negative values may be generated by the model. Since all

hydrological variables are positive, the negative generated

values are set equal to zero, thus altering the statistics

of the series. This is also true for all methods that

involve a random component and is mainly due to "big"

negative values taken on by the random deviate.

The number of negative values, estimated by the MWA

method, which have been set equal to zero in the example

that follows were 1, 1, 6, 4, and 9 values for the 2%, 5%,

10%, 15% and 20% levels of missing values, respectively.

The effect of the values arbitrarily set to zero cannot

be evaluated exactly, but what can be intuitively understood

is that a distortion in the distribution is introduced. A

transformation that prevents the generation of negative

values could be performed on the data before the application

of the generation scheme. Such a transformation is, for

example, the logarithmic transformation since its inverse

applied on a negative value exists, and the mapping of the

transformed to the original data and vice versa is one to

one (this is not true for the square root transformation).
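The logarithmic guard can be sketched as follows; estimating on the transformed scale and inverting with the exponential can never produce a negative rainfall value. The small offset handling zero rainfall is our addition, not the thesis's:

```python
import numpy as np

# Log-transform guard against negative estimates: estimate on the
# transformed scale, then invert.  A small offset (our choice) makes
# zero rainfall transformable.
eps = 0.01
x = np.array([0.0, 0.4, 2.1, 5.7])     # hypothetical rainfall, inches
w = np.log(x + eps)                    # transform
w_est = w.mean()                       # any estimate on the log scale
x_est = np.exp(w_est) - eps            # inverse exists for any w_est
```

By contrast, squaring a negative estimate obtained on the square-root scale silently maps it to a positive value, so that mapping is not one to one, as noted above.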



Comparison of the MV, RD, NR and MWA Methods

The performance of each method applied for the

estimation of the missing values will be evaluated by

comparing the estimated series (existing plus estimated









values) to the incomplete series (really available in

practice) and to the actual series (unknown in practice, but

known in this artificial case). The criteria that will be

used for the comparison of the method will be the following:

(1) the bias in the mean as measured (a) by the

difference between the mean of the estimated

series, ȳ_e, and the mean of the incomplete series,

ȳ_i (i = 1, 2, 3, 4, 5 for the five different

percentages of missing values), and (b) by the

difference between the mean of the estimated

series, ȳ_e, and the mean of the actual series, ȳ_a;

(2) the bias in the standard deviation as measured (a)

by the ratio of the standard deviation of the

estimated series, s to the standard deviation of

the incomplete series, s. and (b) by the ratio of

the standard deviation of the estimated series, se,

to the standard deviation of the actual series, sa;

(3) the bias in the lag-one and lag-two correlation

coefficients as measured by the difference of the

correlation coefficient of the estimated series,

r_e, to the correlation coefficient of the actual

series, r_a;

(4) the bias of the estimation model as given by the

mean of the residuals, ȳ_r, i.e., the mean of the

differences between the in-filled (estimated) and

hidden (actual) values (this is also a check to









detect a consistent over- or under-estimation of

the method);

(5) the accuracy as determined by the variance of the

residuals (differences between estimated and actual

values) of the whole series, s_r²;

(6) the accuracy as determined by the variance of the

residuals of only the estimated values, s_{r,e}²; and
(7) the significance of the biases in the mean,

standard deviation and correlation coefficients as

determined by the appropriate test statistic for

each (see appendix A).
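Most of these criteria are one-line computations; a sketch (the significance tests of criterion (7) are omitted, and the sanity check uses a "perfect" estimator so every bias comes out zero):

```python
import numpy as np

def lag_corr(z, k):
    """Lag-k serial correlation coefficient of a series."""
    zm = np.asarray(z, dtype=float) - np.mean(z)
    return float(np.sum(zm[:-k] * zm[k:]) / np.sum(zm * zm))

def evaluation_criteria(estimated, actual, incomplete, gaps):
    """Criteria (1)-(6): bias in the mean and standard deviation
    (against the incomplete and the actual series), bias in the lag-1
    and lag-2 correlations, and mean/variance of the residuals at the
    in-filled points."""
    resid = estimated[gaps] - actual[gaps]
    avail = ~np.isnan(incomplete)
    return {
        "mean_bias_inc": estimated.mean() - incomplete[avail].mean(),
        "mean_bias_act": estimated.mean() - actual.mean(),
        "sd_ratio_inc": estimated.std(ddof=1) / incomplete[avail].std(ddof=1),
        "sd_ratio_act": estimated.std(ddof=1) / actual.std(ddof=1),
        "r1_bias": lag_corr(estimated, 1) - lag_corr(actual, 1),
        "r2_bias": lag_corr(estimated, 2) - lag_corr(actual, 2),
        "resid_mean": resid.mean(),
        "resid_var": resid.var(ddof=1),
    }

# Sanity demonstration with a "perfect" estimator: every bias is zero
rng = np.random.default_rng(5)
actual = rng.gamma(2.0, 2.0, size=120)
gaps = np.array([4, 17, 30, 55, 80])
incomplete = actual.copy()
incomplete[gaps] = np.nan
crit = evaluation_criteria(actual.copy(), actual, incomplete, gaps)
```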

Table 5.3 presents the statistics of the actual series

(ACT), of the incomplete series (INC) and of the estimated

series by the mean value method (MV), by the reciprocal

distances method (RD), by the normal ratio method (NR) and

by the modified weighted average method (MWA). The mean

(ȳ), standard deviation (s), coefficient of variation (c_v),

coefficient of skewness (c_s), and lag-one and lag-two

correlation coefficients (r_1, r_2) of the above series

considered as a whole have then been calculated.

Regarding comparison of the means, the following can be

concluded from Table 5.4:

(1) the bias in the mean in all cases is not

significant at the 5% significance level as shown

by the appropriate t-test;










Table 5.3. Statistics of the Actual (ACT), Incomplete (INC)
and Estimated Series (MV, RD, NR, MWA).


          ȳ        s       c_v      c_s      r_1     r_2


ACT 4.126 3.673 89.040 1.332 0.366 0.134

2% missing values

INC 4.116 3.680 89.397 1.346 -- --


MV 4.125 3.663 88.808 1.335 0.371 0.130


RD 4.124 3.674 89.092 1.336 0.367 0.133


NR 4.114 3.666 89.104 1.339 0.368 0.131


MWA 4.113 3.674 89.331 1.342 0.363 0.131

5% missing values

INC 4.113 3.671 89.249 1.341 -- --


MV 4.101 3.610 88.040 1.352 0.372 0.139


RD 4.127 3.696 89.550 1.359 0.369 0.133


NR 4.105 3.674 89.501 1.349 0.367 0.131


MWA 4.116 3.720 90.386 1.388 0.364 0.126

10% missing values

INC 4.144 3.705 89.405 1.350 -- --


MV 4.134 3.603 87.152 1.346 0.379 0.159


continued









Table 5.3. Continued.


          ȳ        s       c_v      c_s      r_1     r_2


ACT     4.126    3.673   89.040   1.332   0.366   0.134

10% missing values (continued)

RD      4.150    3.689   88.884   1.301   0.380   0.166

NR      4.120    3.652   88.633   1.321   0.377   0.155

MWA     4.127    3.725   90.244   1.286   0.376   0.162

15% missing values

INC     4.135    3.671   88.767   1.268    --      --

MV      4.106    3.513   85.567   1.270   0.399   0.133

RD      4.177    3.628   86.862   1.224   0.372   0.132

NR      4.135    3.591   86.854   1.236   0.379   0.133

MWA     4.134    3.650   88.291   1.248   0.357   0.123

20% missing values

INC     4.082    3.701   90.673   1.404    --      --

MV      4.124    3.495   84.749   1.333   0.408   0.160

RD      4.231    3.723   87.993   1.865   0.370   0.156

NR      4.125    3.601   87.307   1.298   0.377   0.152

MWA     4.168    3.741   89.758   1.273   0.354   0.153









Table 5.4. Bias in the Mean



INC MV RD NR MWA


(ȳ_e - ȳ_i)                                              ȳ_i

2% 0. 0.009 0.008 0.002 0.003 4.116


5% 0. -0.012 0.014 -0.008 0.003 4.113


10% 0. -0.010 0.006 -0.024 -0.017 4.144


15% 0. -0.089 0.042 0.000 -0.001 4.135


20% 0. 0.042 0.149 0.043 0.086 4.082


(ȳ_e - ȳ_a)                                              ȳ_a


2% -0.010 -0.001 -0.002 -0.012 -0.013 4.126


5% -0.013 -0.025 0.001 -0.021 -0.010


10% 0.018 0.008 0.024 -0.006 0.001


15% 0.009 -0.020 0.051 0.009 0.008


20% -0.044 -0.002 0.105 -0.001 0.042










(2) the bias in the mean of the incomplete series is

relatively small but becomes larger the higher the

percent of missing values;

(3) at high percent of missing values the NR method

gives the least biased mean;

(4) except for the RD method which consistently

overestimates the mean (the bias being larger the

higher the percent of missing values), the other

methods do not show a consistent over- or

underestimation.

Regarding comparison of the variances the following can

be concluded from Table 5.5:

(1) Although slight, the bias in the standard deviation

is always significant, but this is so because the

ratio of variances would have to equal 1.0 exactly

to satisfy the F-test (i.e., be unbiased) with as

large a number of degrees of freedom as in this

study;

(2) the MV method always gives a reduced variance as

compared to the variance of the incomplete series

and of the actual series, the bias being larger the

higher the percent of missing values;

(3) the bias in the standard deviation of the

incomplete series is small;

(4) there is no consistent over or under-estimation of

the variance by any of the methods (except the MV

method);










Table 5.5. Bias in the Standard Deviation



INC MV RD NR MWA


s_e/s_i                                                  s_i


2% 1. 0.995 0.998 0.996 0.998 3.680


5% 1. 0.983 1.007 1.001 1.013 3.671


10% 1. 0.972 0.996 0.986 1.005 3.705


15% 1. 0.957 0.988 0.978 0.994 3.671


20% 1. 0.944 1.006 0.973 1.011 3.701


s_e/s_a                                                  s_a


2% 1.002 0.997 1.000 0.998 1.000 3.673


5% 0.999 0.983 1.006 1.000 1.013


10% 1.009 0.981 1.004 0.994 1.014


15% 0.999 0.956 0.988 0.978 0.994


20% 1.008 0.952 1.014 0.980 1.019










(5) the MWA method does not give a less biased variance

even at the higher percent of missing values

tested, as compared to the RD and NR methods.

Regarding comparison of the correlation coefficients

the following can be concluded from Table 5.6:

(1) the bias in the correlation coefficients is in all

cases not significant at the 5% significance level

as shown by the appropriate z-test;

(2) the MV method gives the largest bias in the

correlation coefficients, the bias increasing the

higher the percent of missing values, with a

possible effect on the determination of the order

of the model;

(3) all methods (except the MWA method) consistently

overestimate the serial correlation coefficient of

the incomplete series but not the serial

correlation of the actual series; this is therefore

not considered a problem;

(4) the RD method seems to give a correlogram that

closely follows the correlogram of the actual

series.

Regarding accuracy of the methods the following can be

concluded from Table 5.7:

(1) no method seems to consistently over or

underestimate the missing values at all percent

levels, but at high percent levels the missing

values are overestimated by all methods;










Table 5.6. Bias in the Lag-One and Lag-Two Correlation
Coefficients.



INC MV RD NR MWA


(r_{1,e} - r_{1,a})                                      r_{1,a}


2% 0.005 0.001 0.002 -0.003 0.366


5% -- 0.006 0.003 0.001 -0.002


10% -- 0.013 0.014 0.011 0.010


15% -- 0.033 0.006 0.013 -0.009


20% -- 0.042 0.004 0.011 -0.012


(r_{2,e} - r_{2,a})                                      r_{2,a}


2% -0.004 -0.001 -0.003 -0.003 0.134


5% -- 0.005 -0.001 -0.003 -0.008


10% -- 0.025 0.032 0.021 0.028


15% -- -0.001 -0.002 -0.001 -0.011


20% -- 0.026 0.022 0.018 0.019










Table 5.7. Accuracy--Mean and Variance of the Residuals.
N_o = number of missing values
N = total number of values = 660.


        INC      MV       RD       NR      MWA      N_o

ȳ_r = Σ(y_e - y_a)/N_o

 2%     --    -0.043   -0.061   -0.570   -0.589     13

 5%     --    -0.440    0.034   -0.380   -0.176     33

10%     --     0.007    0.156   -0.113   -0.046     62

15%     --    -0.175    0.338    0.074    0.105     98

20%     --     0.037    0.502    0.038    0.200    130

s_{r,e}² = Σ(y_e - y_a)²/(N_o - 2)

 2%     --     5.037    2.874    3.149    4.585

 5%     --     8.610    3.656    3.411    5.340

10%     --     7.892    4.239    3.484    5.187

15%     --     7.620    4.630    3.958    5.816

20%     --     5.224    4.891    3.681    4.898


Table 5.7. Continued.


INC MV RD NR MWA

s_r² = Σ(y_e - y_a)²/(N - 2)


2% -- 0.084 0.048 0.053 0.077


5% -- 0.406 0.172 0.161 0.252


10% -- 0.720 0.387 0.318 0.473


15% -- 1.112 0.675 0.577 0.849


20% -- 1.016 0.951 0.716 0.953









(2) the NR method is the most accurate method,

especially at high percents of missing values

(i.e., it gives the smallest mean and variance of

the residuals).



Univariate Model



Model Fitting

Before considering the problem of missing values the

problem of fitting an ARMA(p,q) model to the monthly

rainfall series of the south Florida interpolation station

will be considered.

The observed rainfall series has been normalized using

the square root transformation and the periodicity has been

removed by standardization. The reduced series,

approximately normal and stationary, is then modeled by an

ARMA(p,q) model. The ACF of the reduced series, as shown in

Fig. 5.3, implies a white noise process since almost all the

autocorrelation coefficients (except at lag-3 and lag-12)

lie inside the 95 percent confidence limits.
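The reduction of the series and the ACF with its 95 percent limits can be sketched as follows (the series below is a synthetic stand-in for the station A record):

```python
import numpy as np

def normalize(series, months):
    """Square-root transform, then remove the periodicity by
    standardizing each calendar month with its own mean and standard
    deviation (the reduction applied to the station A series)."""
    z = np.sqrt(np.asarray(series, dtype=float))
    out = np.empty_like(z)
    for m in range(1, 13):
        sel = months == m
        out[sel] = (z[sel] - z[sel].mean()) / z[sel].std(ddof=1)
    return out

def acf(z, max_lag):
    """Autocorrelation function, with the 95% confidence limit
    +/- 1.96/sqrt(n) used to judge whiteness."""
    zm = z - z.mean()
    denom = np.sum(zm * zm)
    r = np.array([np.sum(zm[:-k] * zm[k:]) / denom
                  for k in range(1, max_lag + 1)])
    return r, 1.96 / np.sqrt(len(z))

# Synthetic stand-in for the 660-value monthly series
rng = np.random.default_rng(7)
months = np.tile(np.arange(1, 13), 55)
reduced = normalize(rng.gamma(2.0, 2.0, size=660), months)
r, ci = acf(reduced, max_lag=24)
```

An autocorrelation coefficient outside ±ci at only a couple of lags, as in Fig. 5.3, is what motivates the white-noise reading of the reduced series.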

Of course, it is unsatisfying to accept the white noise

process as the "best" model for our series and an attempt is

made to fit an ARMA(1,1) model to the series. The selection

of an ARMA model and not an AR or MA model is based on the

following reasons:

(1) The observed rainfall series contains important

observational errors and so it is assumed to be the sum














Fig. 5.3. Autocorrelation function of the normalized and
standardized monthly rainfall series of
Station A.









of two series: the "true" series and the observational

error series (signal plus noise). Therefore, even if

the "true" series obeys an AR process, the addition of

the observational error series is likely to produce an

ARMA model:

AR(p) + white noise = ARMA(p,p)

AR(p) + AR(q) = ARMA(p+q, max(p,q)) (5.5)

AR(p) + MA(q) = ARMA(p, p+q)



The same can be said if the "true" series is an MA

process and the observational error series an AR

process but not if the latter is an MA process or a

white noise process:



MA(p) + AR(q) = ARMA(q,p+q)

MA(p) + MA(q) = MA(max(p,q)) (5.6)

MA(p) + white noise = MA(p)



(Granger and Morris, 1976; Box and Jenkins, 1976,

Appendix A4.4).



It is understood, that the addition of any

observational series to an ARMA process of the "true"

series will give again an ARMA process. For example,



ARMA(p,q) + white noise = ARMA(p,p) if p ≥ q                  (5.7)
                        = ARMA(p,q) if p < q







from which it can also be seen that the addition of an

observational error may not always change the order of

the model of the "true" process.

(2) One other situation that leads exactly, or

approximately, to ARMA models is the case of a variable

which obeys a simple model such as AR(1) if it were

recorded at an interval of K units of time but which is

actually observed at an interval of M units (Granger

and Morris, 1976, p. 251).

All these results suggest that a number of real data

situations are all likely to give rise to ARMA models;

therefore, an ARMA(1,1) model will be fitted to the observed

monthly rainfall series of the south Florida interpolation

station. The preliminary estimate of φ_1 (equation 3.23) is

-0.08163, and the preliminary estimate of θ_1 (equa-

tions 3.21 for k = 0, 1, 2) is the solution of the quadratic

equation



0.1656 θ_1² + 1.0204 θ_1 + 0.1656 = 0                         (5.8)


Only the root θ_1 = -0.1667 is acceptable, the second

lying outside the unit circle. These preliminary estimates

of φ_1 and θ_1 now become the initial values for the

determination of the maximum likelihood estimates (MLE). In

general, the choice of the starting values of φ and θ does

not significantly affect the parameter estimates (Box and

Jenkins, 1976, p. 236), but this was not the case for the







Fig. 5.4. Sum of squares of the residuals, Σ(â_t)²,
of an ARMA(1,1) model fitted to the
rainfall series of station A.







Table 5.8. Initial Estimates and MLE of the Parameters φ
and θ of an ARMA(1,1) model fitted to the
rainfall series of station A.


           Initial Estimates     Max. Likelihood Estimates
Model        φ_1       θ_1          φ_1        θ_1

A -0.0816 0.0 -0.0088 -0.0989

B -0.0816 -0.1667 -0.3140 -0.4056

C 0.1 0.0 0.0537 -0.0278

D -0.4 -0.5 -0.4064 -0.4939





south Florida rainfall series under study. In particular

different initial estimates of φ_1 and θ_1 have been tested

and the MLE of the parameters are compared in Table 5.8.

The MLE have been calculated using the IMSL subroutine FTMXL

which uses a modified steepest descent algorithm to find the

values of φ and θ that minimize the sum of squares of the

residuals (Box and Jenkins, 1976, p. 504).
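The preliminary root selection of equation (5.8) can be reproduced directly; solving the quadratic and keeping the root inside the unit circle recovers the value quoted above:

```python
import math

# Equation (5.8): 0.1656*theta1**2 + 1.0204*theta1 + 0.1656 = 0
a, b, c = 0.1656, 1.0204, 0.1656
disc = math.sqrt(b * b - 4 * a * c)
roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]

# Keep the root satisfying the invertibility condition |theta1| < 1;
# the other root lies outside the unit circle
theta1 = next(r for r in roots if abs(r) < 1)
```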

The drastic changes in parameter values together with

the idea that the process may be a white noise process

suggest a plot of the sum of squares of the residuals for

the visual detection of anomalies. The sum of squares grids

and contours are shown in Fig. 5.4. We observe that there

is not a well defined point where the sum of squares becomes

a minimum but rather a line (contour of the value 641) on

which the sum of squares has an almost constant value equal

to the minimum. In such case combinations of parameter

values give similar sum of squares of residuals and a change







in the AR parameter can be nearly compensated by a suitable

change in the MA parameter.

From the comparison of the parameters φ and θ

(Table 5.8) of the four ARMA(1,1) models one cannot say that

they all correspond to the same process. But this can in

fact be illustrated by converting the four models to their

"random shock form" (MA(∞) processes) or their "invertible

form" (AR(∞) processes).

An ARMA(1,1) process



(1 - φ_1 B) z_t = (1 - θ_1 B) a_t                             (5.9)



can be also written as


z_t = (1 - θ_1 B)(1 - φ_1 B)^{-1} a_t                         (5.10)



which can be expanded in the convergent form



z_t = [1 + (φ_1-θ_1)B + φ_1(φ_1-θ_1)B² + φ_1²(φ_1-θ_1)B³ + ...] a_t   (5.11)



provided that the stationarity condition (|φ_1| < 1) is

satisfied. Then the four models of Table 5.8 become:







(A): z_t = a_t + 0.090 a_{t-1} - 0.001 a_{t-2} + ...

(B): z_t = a_t + 0.092 a_{t-1} - 0.029 a_{t-2} + ...          (5.12)

(C): z_t = a_t + 0.082 a_{t-1} + 0.004 a_{t-2} + ...

(D): z_t = a_t + 0.088 a_{t-1} - 0.036 a_{t-2} + ...

In the same way the ARMA(1,1) model may be written in the
"invertible form"

(1 - φ_1 B)(1 - θ_1 B)^{-1} z_t = a_t                         (5.13)

which can be expanded as

[1 - (φ_1-θ_1)B - θ_1(φ_1-θ_1)B² - θ_1²(φ_1-θ_1)B³ - ...] z_t = a_t   (5.14)

given that the invertibility condition (|θ_1| < 1) is
satisfied. Then the four models become:

(A): z_t = a_t + 0.090 z_{t-1} - 0.009 z_{t-2} + ...

(B): z_t = a_t + 0.092 z_{t-1} - 0.037 z_{t-2} + ...          (5.15)

(C): z_t = a_t + 0.082 z_{t-1} - 0.002 z_{t-2} + ...

(D): z_t = a_t + 0.088 z_{t-1} - 0.043 z_{t-2} + ...
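The expansion coefficients quoted for the four models follow directly from each model's φ_1 and θ_1 in Table 5.8; a sketch reproducing model A's leading weights (the function names are ours, after the Box-Jenkins ψ- and π-weight terminology):

```python
def ma_weights(phi, theta, n):
    """Coefficients of the "random shock form" (equation 5.11):
    the weight on a_{t-j} is phi**(j-1) * (phi - theta), j >= 1."""
    return [phi ** (j - 1) * (phi - theta) for j in range(1, n + 1)]

def ar_weights(phi, theta, n):
    """Coefficients of the "invertible form" (equation 5.14):
    the weight on z_{t-j} is theta**(j-1) * (phi - theta), j >= 1."""
    return [theta ** (j - 1) * (phi - theta) for j in range(1, n + 1)]

# Model A of Table 5.8
phi, theta = -0.0088, -0.0989
ma = ma_weights(phi, theta, 3)
ar = ar_weights(phi, theta, 3)
```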


From the "random shock" form of the four models (equations 5.12) and from their "invertible form" (equations 5.15) the following remarks can be made:






(1) Although from the comparison of the φ and θ coefficients (Table 5.8) of the four ARMA(1,1) models one cannot say that they all correspond to the same process, the comparison of the MA coefficients (θ1', θ2', θ3', ...) of equations (5.12) or of the AR coefficients (φ1', φ2', φ3', ...) of equations (5.15) implies that all four models do indeed belong to the same process.

(2) Because the nonzero φ2' (and θ2') coefficients of the zt-2 (and at-2) terms, while small, are of similar magnitude to the coefficients φ1' (and θ1'), one cannot say that a "truncated" AR(1) or MA(1) model will fully describe the time series; more terms are needed. On the other hand, we observe that the φ1' coefficient so obtained (different for each model) is in the range of 0.082 to 0.090 and is greater than the coefficient φ1 that would have been obtained by a direct fitting of an AR(1) model to the series (the latter would be φ1 = r1 = 0.0068).

(3) It should also be noted that all of the above models fitted to the series give residuals that pass the portmanteau goodness of fit test. As can be seen from equations (5.12), the impulse response function (i.e., the weights applied to the at's when the model is written in the "random shock form") dies off very quickly in all the models, so there is no doubt as to the applicability of the portmanteau test






(see Appendix A). The values of Q for each model (calculated from equation A.1 using K = 60) are QA = 67.80, QB = 67.26, QC = 67.73 and QD = 67.39, all smaller than the chi-square value with 58 degrees of freedom at a 5% significance level, χ²(58, 5%) = 79.1. It can also be seen that the values of Q for all the models are almost equal, suggesting an equally good fit of the series by all four models.
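The portmanteau statistic quoted above can be computed as sketched below. Equation (A.1) is not reproduced in this excerpt, so the sketch assumes the Box-Pierce form Q = n Σ r_k²(a) over K lags, and applies it to synthetic white-noise residuals rather than the actual model residuals.

```python
import random

def portmanteau_q(residuals, max_lag):
    """Box-Pierce portmanteau statistic Q = n * sum of squared residual
    autocorrelations r_k(a), k = 1..max_lag (assumed form of eq. A.1)."""
    n = len(residuals)
    mean = sum(residuals) / n
    c0 = sum((a - mean) ** 2 for a in residuals) / n
    q = 0.0
    for k in range(1, max_lag + 1):
        ck = sum((residuals[t] - mean) * (residuals[t - k] - mean)
                 for t in range(k, n)) / n
        q += (ck / c0) ** 2
    return n * q

# Synthetic white-noise "residuals" of the length of 55 years of
# monthly data, with K = 60 lags as in the text.
random.seed(1)
residuals = [random.gauss(0.0, 1.0) for _ in range(660)]
q = portmanteau_q(residuals, max_lag=60)
# An adequate model is one whose Q falls below the critical value; the
# text quotes chi-square(58, 5%) = 79.1 for an ARMA(1,1) with K = 60.
print(round(q, 2))
```

Note that K minus the number of fitted parameters gives the degrees of freedom, which is why 58 rather than 60 appears in the critical value.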

One other interesting question that could be asked is whether a given ARMA(p,q) model could have arisen from some simpler model. "Simplifications are not always possible as conditions on the coefficients of the ARMA model need to be specified for a simpler model to be realizable" (Granger and Morris, 1976, p. 252). At this stage, with coefficients that are so unstable, it is meaningless to test the four ARMA models for simplification. However, this test will be made after a unique and stable model has been obtained through the following proposed algorithm.



Proposed Estimation Algorithm

The problem of estimating the missing values will be combined with the problem of stabilizing the coefficients of the ARMA(1,1) model in a recursive algorithm which, upon convergence, will have solved both problems uniquely.

The incomplete series (S0) is filled in with some initial estimates of the missing values (these initial estimates can be simply the monthly means or even zeroes, as will be shown). Denote by S1 this initial series. An ARMA(1,1) model is fitted to the series S1, and its coefficients φ1 and θ1 are used to update the first estimates of the missing values. For example, suppose that a gap of size k (k missing values) exists in the series S0:



Series S0: ... zt-1  zt  < k missing values >  zt+k+1  zt+k+2 ...
Series S1: ... zt-1  zt  z't+1 ... z't+k  zt+k+1  zt+k+2 ...        (5.16)

where z't+1, ..., z't+k are the initial estimates of the missing values. These values are then replaced by the forecasts ẑt(1), ..., ẑt(k) made by the model at origin t for lead times ℓ = 1, ..., k.

These forecasts are the minimum mean square error forward forecasts as developed by Box and Jenkins (1976). For an ARMA(1,1) model with coefficients φ1 and θ1, the minimum mean square error forecasts ẑt(ℓ) of zt+ℓ, where ℓ is the lead time, are:

ẑt(1) = φ1 zt - θ1 at              ℓ = 1                          (5.17)
ẑt(ℓ) = φ1 ẑt(ℓ-1)                 ℓ = 2, ..., k



from which it can be seen that only the one-step-ahead forecast depends directly on at; the forecasts at longer lead times are influenced only indirectly (Box and Jenkins, 1976, Ch. 5). The forecasting procedure is repeated for the







estimation of all the gaps, and the newly estimated values

are used in equations (5.17). These forecasts now become

the new estimates of the missing values and they replace the

old estimates, giving the new series S2. An ARMA(1,1) model is then fitted to the new series S2, and its new coefficients φ1 and θ1 are found (different from the previous ones). Then the estimated values (forecasts from the previous model) are replaced by the forecasts from the new model, giving the new series S3, and so on. The procedure is repeated until the model and the series stabilize, in the sense that the parameters φ1 and θ1 of the model, as well as the estimates of the missing values, do not change between successive iterations by more than a specified tolerance.

Schematically the algorithm is presented in Fig. 5.5, where S0 denotes the incomplete series, M0 the method used for the initial estimation, Si the estimated series at the ith iteration, and Mi the model (i.e., the set of parameters (φ1, θ1)i) fitted to the series Si. The notation Mi → Mi+1 and Si → Si+1 is introduced to denote the stabilization of the model and the series, respectively, after i iterations. The above algorithm will be referred to as RAEMV-U (a recursive algorithm for the estimation of missing values--univariate model).
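The structure of RAEMV-U can be sketched in code. The thesis refits an ARMA(1,1) by maximum likelihood at each iteration; to stay self-contained, this sketch refits only an AR(1) (with the moment estimate φ̂ = r1) and drops the moving average term from the forecasts of equation (5.17). It therefore illustrates the recursion itself (fill, fit, re-forecast, check stabilization), not the full estimator, and the function and variable names are the sketch's own.

```python
import random

def lag1_autocorr(z):
    """Sample lag-one autocorrelation r1 of a series."""
    n = len(z)
    m = sum(z) / n
    c0 = sum((x - m) ** 2 for x in z)
    c1 = sum((z[t] - m) * (z[t - 1] - m) for t in range(1, n))
    return c1 / c0

def raemv_u_sketch(series, gaps, tol=1e-3, max_iter=50):
    """series: list with None at the (sorted, nonzero) indices in `gaps`.
    Returns (phi_hat, filled) once neither the parameter nor the gap
    estimates change by more than tol between successive iterations."""
    z = [0.0 if x is None else x for x in series]  # S1: start from zeros
    phi = 0.0
    for _ in range(max_iter):
        new_phi = lag1_autocorr(z)                 # refit the model
        new_z = list(z)
        for i in gaps:                             # re-forecast each gap,
            new_z[i] = new_phi * new_z[i - 1]      # chained as in eq. (5.17)
        if (abs(new_phi - phi) < tol and
                all(abs(new_z[i] - z[i]) < tol for i in gaps)):
            return new_phi, new_z
        phi, z = new_phi, new_z
    return phi, z

# Hypothetical usage on a synthetic AR(1) series with phi = 0.5:
random.seed(2)
x, demo = 0.0, []
for _ in range(400):
    x = 0.5 * x + random.gauss(0.0, 1.0)
    demo.append(x)
gaps = [60, 61, 200]
incomplete = [None if i in gaps else v for i, v in enumerate(demo)]
phi_hat, filled = raemv_u_sketch(incomplete, gaps)
```

Because only the gap positions change between iterations, the fitted parameter settles quickly, mirroring the rapid convergence reported for the full algorithm below.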



Application of the Algorithm on the Monthly Rainfall Series

The proposed recursive algorithm (RAEMV-U) has been

applied for the estimation of missing monthly rainfall







Fig. 5.5. Recursive algorithm for the estimation of missing values--univariate model (RAEMV-U). Si denotes the series, and Mi the model (φ1, θ1)i, at the ith iteration.







values in the series of the south Florida interpolation station (station 6038). Different percentages of missing values have been tested; Tables 5.9 and 5.10 present the results for the 10% and 20% levels, respectively. The starting series S0 is the incomplete series (with 10% or 20% of the values missing). Four different methods M0 (MV, RD, NR, and zeros) have been applied to the incomplete series S0, providing different starting series S1 for the algorithm; thus, the algorithm's dependence on the initial conditions has also been tested.



Results of the Method

From Tables 5.9 and 5.10 the following can be

concluded:

(1) The algorithm converges very rapidly and independently of the initial estimates, which suggests that the missing values can conveniently be replaced by zeros to start the algorithm.

(2) The greater the percentage of missing values, the more slowly the algorithm converges (6 iterations were needed at the 10% level and 8 at the 20% level to obtain accuracy to the third decimal place). This was expected, since a larger part of the series changes its values at each iteration, and thus more iterations are needed to reach equilibrium.




Full Text

PAGE 1

WATER IiRESOURCES researc center Publication No. 67 ESTIMATING MISSING VALUES IN MONTHLY RAINFALL SERIES By EFI FOUFOULA-GEORGIOU A Thesis Presented to the Graduate Council of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Engineering University of Florida Gai nesvi 11 e UNIVERSITY OF FLORIDA

PAGE 2

ESTIMATING MISSING VALUES IN MONTHLY RAINFALL SERIES By EFI FOUFOULA-GEORGIOU Publication No. 67 FLORIDA WATER RESOURCES RESEARCH CENTER Research Project Technical Completion Report Sponsored by South Florida Water Management District A THESIS PRESENTED TO THE GRADUATE COUNCIL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ENGINEERING UNIVERSITY OF FLORIDA 1982

PAGE 3

ACKNOWLEDGEHENTS I wish to express my sincere gratitude to all those who contributed towards making this work possible. I am particularly indebted to the chairman of my supervisory committee, Professor Wayne C. Huber. Through the many constructive discussions along the course of this research, he provided an invaluable guidance. It was his technical and moral support that brought this work into completion. I would like to express my sincere appreciation to the other members of my supervisory committee: Professors J. P. Heaney, D. L. Harris, and M. C. K. Yang, for their helpful suggestions and their thoughtful and critical evaluation of this work. Special thanks are also given to my fellow students and friends, Khlifa, Dave D., Bob, Terrie, Richard, Dave M., and Mike, for their cheerful help and the pleasant environment for work they have created. Finally my deepest appreciation and love go to my husband, Tryphon, who has been a constant source of encouragement and inspiration for creative work. Many invaluable discussions with him helped a great deal in ii

PAGE 4

gaining an understanding of some problems considered in this thesis. The research was supported in part by the South Florida Water Management District. Computations were performed at the Northeast Regional Data Center on the University of Florida campus, Gainesville. iii

PAGE 5

TABLE OF CONTENTS ACKNOWLEDGEMENTS ii LIST OF TABLES vii LIST OF FIGURES ix ABSTRACT xi CHAPTER 1. INTRODUCTION 1 Rainfall Records 1 Frequency Analysis of Missing Observations in the South Florida Monthly Rainfall Records . . . . .. 5 Description of the Chapters 15 CHAPTER 2. SIMPLIFIED ESTIMATION TECHNIQUES Introduction Mean Value Method (MV) Reciprocal Distance Method (RD) Normal Ratio Method (NR) Modified Weighted Average Method (MWA) Least Squares Method (LS) CHAPTER 3. UNIVARIATE STOCHASTIC MODELS Introduction Review of Box-Jenkins Models 17 17 17 20 21 22 27 32 32 34 Autoregressive Models 35 Moving Average Models 39 Mixed Autoregressive-Moving Average Models. 42 Autoregressive Integrated Moving Average Models 44 Transformation of the Original Series 46 Transformation to Normality Stationarity iv 46 50

PAGE 6

Monthly Rainfall Series 52 CHAPTER 4. Normalization and Stationarization Modeling of Normalized Series MULTIVARIATE STOCHASTIC MODELS 52 55 58 Introduction . 58 General Multivariate Regression Model .. 59 Multivariate Lag-One Autoregressive Model 60 Comments on Multivariate AR(I) Model .. 63 Assumption of Normality and Stationarity .. 63 Cross-Correlation Matrix Ml .. 65 Further Simplification 66 Higher Order Multivariate Models 68 CHAPTER 5. ESTIMATION OF MISSING MONTHLY RAINFALL VALUES--A CASE STUDY 71 Introduction .. . 71 71 75 Set Up of the Problem . Simplified Estimation Techniques . Techniques Utilized . Least Squares Methods Modified Weighted Average Method Comparison of the MV, RD, NR and MWA Methods . 75 78 82 85 Univariate Model 97 Model Fitting .. 97 Proposed Estimation Algorithm 106 Application of the Algorithm on the Monthly Rainfall Series Results of the Method .. Remarks 108 110 106 Bivariate Model 117 Model Fitting .. .. 11 7 CHAPTER 6. Proposed Estimation Algorithm Application of the Algorithm on the Monthly Rainfall Series CONCLUSIONS AND RECOMMENDATIONS Summary and Conclusions Further Research v 119 121 131 131 134

PAGE 7

APPENDIX A. DEFINITIONS. APPENDIX B. DETERMINATION OF MATRICES A AND B OF THE MULTIVARIATE AR(l) MODEL APPENDIX C. DATA USED AND STATISTICS APPENDIX D. COMPUTER PROGRAMS REFERENCES BIOGRAPHICAL SKETCH vi 136 150 156 169 182 188

PAGE 8

Table 1.1 LIST OF TABLES Frequency Distribution of the Percent of Missing Values in 213 South Florida Monthly Rainfall Records 5.1 Least Squares Regression Coefficients and 9 Their Significance Levels 80 5.2 Correction Coefficients for Each Month and for Each Different Percent of Missing Values 83 5.3 Statistics of the Actual (ACT), Incomplete (INC) and Estimated Series (MV, RD, NR, MWA) 88 5.4 Bias in the Mean 90 5.5 Bias in the Standard Deviation 92 5.6 Bias in the Lag-One and Lag-Two Correlation Coefficients 94 5.7 Accuracy Mean and Variance of the Residuals 95 5.8 Initial Estimates and MLE of the parameters cp and 8 of an ARMA(l,l) Model Fitted b:::> the Monthly Rainfall Series of Station A 102 5.9 Results of the RAEMV-U Applied at the 10% Level of Missing Values. Upper Value is CP1' Lower Value is 8 1 111 5.10 Results of the RAEMV-U Applied at the 20% Level of Missing Values. Upper Value is CP1' Lower Value is 8 1 112 5.11 Statistics of the Actual Series (ACT) and the Two Estimated Series (UN10, UN20) 115 5.12 Bias in the Mean, Standard Deviation and Serial Correlation Coefficient--Univariate Model . . . . . 116 vii

PAGE 9

Table Page 5.13 Results of the RAEMV-B1 Applied at the 10% Level of Missing Values . . . 125 5.14 Results of the RAEMV-B1 Applied at the 20% Level of Missing Values . . 127 5.15 Statistics of the Actual Series (ACT) and the Two Estimated Series (B10 and B20). 129 5.16 Bias in the Mean, Standard Deviation and Serial Correlation Coefficient--Bivariate Model 130 viii

PAGE 10

LIST OF FIGURES Figure 1.1 Monthly distribution of rainfall in the United States .. .... 6 1.2 Probability density function, f (m) of the percentage of missing values . . 8 1.3 Probability density function, f ( T) of the interevent size . . . . 11 1.4 Probability density, f(k), and mass function, p(k), of the gap size. . .. 12 2.1 Mean value method without random component 19 2.2 Mean value method with random component. 19 2.3 Least squares method without random component 30 2.4 Least squares method with random component 30 5.1 The four south Florida rainfall stations used in the analysis . 73 5.2 Plot of the monthly means and standard devia-tions of the rainfall series of Station A 76 5.3 Autocorrelation function plot of the residual series of an ARMA(l,l) model fitted to the monthly rainfall series of Station A . 98 5.4 Sum of squares of the residuals surface of an model fitted to the monthly rainfall series of Station A . .. 101 5.5 Recursive algorithm for the estimation of the missing values--univariate model (RAEMV-U) ... 109 5.6 Recursive algorithm for the estimation of missing values--bivariate model--1 station to be estimated (RAEMV-B1) .......... 122 ix

PAGE 11

Figure 5.7 Recursive algorithm for the estimation of missing values--bivariate model--2 stations to be estimated (RAEMV-B2) . . .. 123 x

PAGE 12

Abstract of Thesis Presented to the Graduate Council of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Engineering ESTIMATION OF MISSING OBSERVATIONS IN MONTHLY RAINFALL SERIES By Efstathia Foufoula-Georgiou December, 1982 Chairman: Wayne C. Huber Cochairman: James P. Heaney Major Department: Environmental Engineering Sciences This study compares and evaluates different methods for the estimation of missing observations in monthly rainfall series. The estimation methods studied reflect three basic ideas: (1) the use of regional-statistical information in four simple techniques: -mean value method (MV), -reciprocal distance method (RD), -normal ratio method (NR) -modified weighted average method (MWA)i (2) the use of a univariate autoregressive moving average (ARMA) model which describes the time correlation of the series; xi

PAGE 13

(3) the use of a multivariate ARMA model which describes the time and space correlation of the series. An algorithm for the recursive estimation of the missing values in a series by a parallel updating of the univariate or multivariate ARMA model is proposed and demonstrated. All methods are illustrated in a case study using 55 years of monthly rainfall data from four south Florida stations. xii ;/,1 I Chairman

PAGE 14

CHAPTER 1 INTRODUCTION Rainfall Records Rainfall is the source component of the hydrologic cycle. As such it regulates water availability and thus land use, agricultural and urban expansion, maintenance of environmental quality and even population growth and human habitation. As Hamrick (1972) points out, water may be transported for considerable distances from where it fell as rain and may be stored for long periods of time, but with very few exceptions it originates as rainfall. Consequently, the measurement and study of rainfall is in actuality the measurement and study of our potential water supply. Rainfall studies attempt to derive models, both probabilistic and physical, to describe and forecast the rainfall process. Since the quality of every study is immediately related to the quality of the data used, the need for "good quality" rainfall data has been expressed by all hydrologists. By "good quality" is meant accurate, long and uninterrupted series of rainfall measurements at a range of different time intervals (e.g., hourly, daily, monthly, and yearly data) and for a dense raingage network. Missing 1

PAGE 15

2 values in the series (due, for example, to failure of the recording instruments or to deletion of a station) is a real handicap to the hydrologic data users; The estimation of these missing values is often desirable prior to the use of the data. For instance, the South Florida Water Management District prepared a magnetic tape with monthly rainfall data for all rainfall stations in south Florida for use in this study (T. MacVicar, SFWMD, personal communication, May, 1982). The data included values for the period of record at each station, ranging from over 100 years (at Key West) to only a few months at several temporary stations. Approximately one month was required to preprocess these data prior to performing routine statistical and time series analyses. The preprocessing included tasks such as manipulations of the magnetic tape, selection of stations with desirable characteristics (e.g., long period of record, proximity to other stations of interest, few missing values) and a major effort at replacement of missing values that did exist. This effort, in fact, was the motivation for this thesis. Many different kinds of statistical analyses may be performed on a given data set, e.g., determination of elementary statistical parameters, auto-and crosscorrelation analysis, spectral analysis, frequency analysis, fitting time series models. For routine statistics (e.g., calculation of mean, variance and skewness) missing values

PAGE 16

3 are seldom a problem. But for techniques as common as autocorrelation and spectral analysis missing values can cause difficulties. In multivariate analysis missing values result in "wasted information" when only the overlapping period of the series can be used in the analysis, and in inconsistencies (Fiering, 1968, and Chapter 4 of this thesis) when the incomplete series are used. In general, two approaches to the problem of missing observations exist. The first consists of developing methods of analysis that use only the available data, the second in developing methods of estimation of the observations followed by application of classical methods of analysis. Monthly rainfall totals are usually calculated as the sum of daily recorded values. Thus, if one or more daily observations are missing the monthly total is not reported for that month. An investigation conducted by the Weather Bureau in 1950 (Paulhus and Kohler, 1952), showed that almost one third of the stations for which monthly and yearly totals were not published had only a few (less than five) days missing. Furthermore, for some of these missing days there was apparently no rainfall in the area as concluded by the rainfall observations at nearby stations. Therefore, in many cases estimation of a few missing daily rainfall values can provide a means for the estimation of the monthly totals.

PAGE 17

4 Statisticians have been most concerned with the problem of handling short record multivariate data with missing observations in some or all of the variables, but no explicit and simple solutions have been given, apart from a few special cases in which the missing data follow certain patterns. A review of these methods is given by Afifi and Elashoff (1956). In the time domain, "the analysis of time series, when missing observations occur has not received a great deal of attention" as Marshall (1980, p. 567) comments, and he proposes a method for the estimation of the autocorrelations using only the observed values. Jones (1980) attempts to fit an ARMA model to a stationary time series which has missing observations using Akaike's Markovian representation and Kalman's recursive algorithm. In the frequency domain, spectral analysis with randomly missing observations has been examined by Jones (1962), Parzen (1963), Scheinok (1965), Neave (1970) and Bloomfield (1970) In hydrology, the problem of missing observations has not been studied much as Salas et al. (1980) state: The filling-in or extension of a data series is a topic which has not received a great deal of attention either in this book or elsewhere. Because of its importance, the subject is expected to be paid more attention in the future. (Salas et al., 1980, p. 464) Simple and "practicable" methods for the estimation of missing rainfall values for large scale application were proposed by Paulhus and Kohler (1952), for the completion of the rainfall data published by the Weather Bureau. The

PAGE 18

5 study was initiated after numerous requests of the climatological data users. Beard (1973) adopted a multisite stochastic generation technique to fill-in missing streamflow data, and Kottegoda and Elgy (1977) compared a weighted average scheme and a multivariate method for the estimation of missing data in monthly flow series. Hashino (1977) introduced the "concept of similar storm" for the estimation of missing rainfall sequences. Although the same methods of estimation can be applied to both rainfall and runoff series, a specific method is not expected to perform equally well when applied to the two different series due mainly to the different underlying processes. This is true even for rainfall series from different geographical regions, since their distributions may vary greatly as shown in Fig. 1.1. This analysis will use monthly rainfall data from four south Florida stations. First, a frequency analysis of the missing observations has been performed and their typical pattern has been identified. In this work the term "missing observations" is used for a sequence of missing monthly values restricted to less than twelve, so that unusual cases of lengthy gaps (a year or more of missing values) is avoided since they do not reflect the general situation. Frequency Analysis of Missing Observations in the .. South Florida Monthly Rainfall Records An analysis of the monthly series of 213 stations of the South Florida Water Management District

PAGE 19

J. \: L J -----n-l!4.i ? I I: J' : 0'., 0-_ 1= ,: JMM ," 'f r/. __ ,lj ".","." t'1' _on\o\ Yo. ,-a Jill .. J S Se" r'''"Ci,co,C''i' .... ?: ,: II O"Io,,'d., J AI .. J IW [I POlO, r""H 5 : ", Fig. 1.1. Monthly distribution of rainfall in the United States (after Linsley R.K., Kohler M.A. and Paulhus J.L., Hydrology for Engineers, 1975, McGraw-Hill, 2nd. edition p. 90) 0'\

PAGE 20

(SF\vMD) gave the results shown on Table 1.1. Figure 1. 2 shows the probability density function (pdf) plot of the percent m of missing values, f(m), which is defined as the ratio of the probability of occurrence over an interval to the length of that interval (column 4 of Table 1.1). The shape of the pdf f(m) suggests the fit by an exponential distribution 7 f (m) = -Am Ae (1. 1) where A is the parameter of the distribution calculated as the inverse of the expected value of m, E(m)i E(m) = L:p (m.) m. 1 1 (1. 2) where p(m.) is the probability of having m. percent of 1 1 missing values. The mean value of the percentage of missing values is m = E(m) = 13.663, and therefore the fitted exponential pdf is f(m) = 0.073 -0.073m e which gives an interesting and unexpectedly good fit as shown by Fig. 1.2 and column 5 of Table 1.1 The question now arises as to whether the missing values within a record follow a certain pattern. In (1. 3)

PAGE 21

f (rn) 0.07 0.01 0.08 fern) = 0.073 e-0.073rn 0.04 0.03 0.02 0:01 0.00 o 10 20 30 40 80 60 10 % missing values, rn Fig. 1.2. Probability density function, fern), of the percentage of missing values. Based on 213 stations, m = 13.663%. 8

PAGE 22

9 Table 1.1. Frequency Distribution of the Percent of Missing Values in 213 South Florida Monthly Rainfall Records. 1 2 3 4 5 % of % of Cumulative Empirical Fitted Missing Stations % of Stations pdf Exponential Values pdf 0-5 30.52 30.52 0.061 0.061 5-10 21.12 51. 64 0.042 0.042 10-15 14.55 66.19 0.029 0.029 15-20 13.61 79.80 0.027 0.020 20-25 6.10 85.90 0.012 0.014 25-30 3.29 89.10 0.007 0.010 30-35 1.88 91. 70 0.004 0.007 35-40 0.94 92.01 0.002 0.005 40-45 2.35 94.36 0.005 0.003 45-50 2.82 97.18 0.006 0.002 50-55 0.47 97.65 0.001 0.002 55-60 0.47 98.12 0.001 0.001 60-65 1. 41 99.53 0.003 0.001 65-70 0.47 100.00 0.001 0.001

PAGE 23

10 particular, if the occurrence of a gap is viewed as an "event" then the distribution of the interevent times (sizes of the interevents) and of the durations of the events (sizes of the gaps) may be examined. The probability distribution of the size of the interevents (number of values between two successive gaps) has been studied for four "typical" stations of the SFWMD, as far as length of the record, distribution and percent of missing values is concerned. These four stations are: MRF 6018, Titusville 2W, 1901-1981, 7.5% missing MRF 6021, Fellsmere 4W, 1911-1979, 9.3% missing MRF 6029, Ocala, 1900-1981, 4.4% missing MRF 6005, Plant City, 1892-1981, 8.6% missing A derived pdf for the four stations combined and the fitted exponential pdf are shown in Fig. 1.3. The mean size of the inter event T, is 19.03 months; therefore, the fitted exponential distribution is f(T) = 0.053 -0.053T e (1. 4) Also, the probability distribution of the size of the gaps (number of values missing in each gap) has also been studied for the same four stations. These have been treated as discrete distributions since the size of the gap (k = 1, 2, ., 11) is small as compared to the interevent times. A probability distribution for the four stations combined is then derived, which is also the discrete probability mass function (pmf). This plot is shown in Fig. 1.4 and suggests either a Poisson distribution or a discretized exponential.

PAGE 24

f (T) 0..05 0.041 0.03 0.02 0.01 f (T) = 0.053 -0.053T e 11 o. 00 o 20 410 60 80 100 120 months between gaps,T Fig. 1.3. Probability density function, f(T), of the interevent size. Based on four stations.

PAGE 25

f(k) and p(k) 0.6 0.4 0.3 o 0.2 0 0.1 0 0.0 0 2 3 4 f(k) = IS 6 0.447 -0.447k e *.empirical o poisson -fitted 7 8 9 10 II gap size, k (months) 12 Fig. 1.4. Probability density, f(k), and mass function, p(k), of the gap size. Based on four stations.

PAGE 26

13 The mean value k is 2.237, which is also the parameter A of the Poisson distribution. The Poisson distribution e->" >..k f(k) = (1.5) k! is nonzero at k = 0 and does not fit the peak of the empirical point very well at k = 1 (it gives a value of 0.24 instead of the actual 0.53). The fitted continuous exponential pdf shown in Fig. 1.4 gives a better fit in general but also implies a nonzero probability for a gap size near zero. To overcome this problem and to discretize the continuous exponential pdf, the area (probability) under the exponential curve between zero and 1.5 is assigned to k = 1, ensuring a zero probability at k = O. Areas (probabilities) assigned to values of k > 1 are centered around those points. The fitted discretized exponential and the Poisson are also shown in Fig. 1.4. The distributions of the size of the gaps (k) and of the size of interevents (T) will be used to generate randomly distributed gaps in a complete record. Suppose that we have a complete record and desire to remove randomly m percent missing values. If the mean size of the gap (k) is assumed constant, the mean size of interevent (T) must vary, decreasing as the percent of missing values increases. Let N denote the total number of values in the record, m the

PAGE 27

where (3.8 ) is called the multiple coefficient of determination and represents the fraction of the variance of the series that has been explained through the regression. If we denote by kj the jth coefficient in an autoregressive process of order k, then the last coefficient kk of the model is called the partial autocorrelation coefficient. Estimates of the partial autocorrelation 38 coefficients ll' pp may be obtained by fitting to the series autoregressive processes of successively higher order, and solving the corresponding Yule-Walker equations. The partial autocorrelation function kk' k = 1, 2, p may also be obtained recursively by means of Durbin's relations (Durbin, 1960) k k k+l,k+l = [rk + l L k,J' rk+l_J,]/[l L k' r,] j=l j=l ,J J (3.9) k+l,j = k,j k+l,k+l k,k-j+l j = 1, 2, .. k It can be shown (Box and Jenkins, 1976, p. 55) that the autocorrelation function of a stationary AR(p) process is a mixture of damped exponential and damped sine waves,

PAGE 28

infinite in extent. On the other hand, the partial auto-correlation function kk is nonzero for k < P and zero for k > p. The plot of autocorrelation and partial autocorre-lation functions of the series may be used to identify the kind and the order of the model that may have generated it (identification of the model). Moving Average Models In a moving average model the deviation of the current value of the process from the mean is expressed as a finite sum of weighted previous shocks als. Thus a moving average process of order q can be written as: 39 (3.10) or (3.11) where 6 (B) 1 8 B 6 B2 -1 2 (3.l2} is the moving average operator of order q. An MA(q} model contains (q+2) parameters, ll, 6 1 62 8 q 0; to be estimated from the data.

PAGE 29

40 From the definition of stationarity (see Appendix A) it follows that an MA(q) process is always stationary since 8(B) is finite and thus converges for IBI q (3.13) (3.14) (3.15) (3. 16) By substituting in equation (3.15) the value of 02 from a equation (3.14) we obtain a set of q nonlinear equations for

PAGE 30

41 + .. -8 + k 1 + 8i + + + 8 8 q-k q k=l, 2, ... q (3.17) These equations are analogous to the Yule-Walker equa-tions for an autoregressive process, but they are not linear and so must be solved iteratively for the estimation of the moving average parameters 8, resulting in estimates that may not have high statistical efficiency. Again it was shown by Wold (1938) that these parameters may need correc-tions (e.g., to fit better the correlogram as a whole and not only the first q correlation coefficients), and that there may exist several, at most 2 q solutions, for the parameters of the moving average scheme corresponding to an assigned correlogram PI' P 2 .. P q However, only those 8's are acceptable which satisfy the invertibility conditions. From equation (3.14) an estimate for the white noise variance may be obtained ... + 8 2 q (3.18) According to the duality principle (see Appendix A) an invertible MA(q) process can be represented as an AR process of infinite order. This implies that the partial autocorre-lation function kk of an MA(q) process is infinite in extent. It can be estimated after tedious algebraic manipulations

PAGE 31

from the Yule-Walker equations by substituting P k as functions of 8's for k < q and Pk = 0 for k > q. So, in contrast to a stationary AR(p) process, the autocorrelation function of an invertible MA(q) process is finite and cuts 42 off after lag q, and the partial autocorrelation function is infinite in extent, dominated by damped exponentials and damped sine waves (Box and Jenkins, 1976). Mixed Autoregressive-Moving Average Models In practice, to obtain a parsimonious parameterization, it will sometimes be necessary to include both autoregressive and moving average terms in the model. A mixed autoregres-sive-moving average process of order (p,q), ARMA(p,q), can be written as Zt = lZt-l + + pZt_p + at -8 l a t l -8 q a t q (3.19) or CB) (3.20) with Cp+q+2) parameters, ll, 8 1 .. 8 q l' p' to be estimated from the data. An ARMA(p,q) process will be stationary provided that the characteristic equation (B) = 0 has all its roots out-side the unit circle. Similarly, the roots of 8(B) = 0 must lie outside the unit circle for the process to be invertible.


By multiplying equation (3.19) by z_{t-k} and taking expectations we obtain

γ_k = φ_1 γ_{k-1} + ... + φ_p γ_{k-p} + γ_{za}(k) - θ_1 γ_{za}(k-1) - ... - θ_q γ_{za}(k-q)   (3.21)

where γ_{za}(k) is the cross covariance function between z and a, defined by γ_{za}(k) = E[z_{t-k} a_t]. Since z_{t-k} depends only on shocks which have occurred up to time t-k, it follows that

γ_{za}(k) = 0,  k > 0;   γ_{za}(k) ≠ 0,  k ≤ 0   (3.22)

and (3.21) implies

γ_k = φ_1 γ_{k-1} + ... + φ_p γ_{k-p},  k ≥ q + 1   (3.23)

or

φ(B) ρ_k = 0,  k ≥ q + 1   (3.24)

Thus, for the ARMA(p,q) process the first q autocorrelations ρ_1, ρ_2, ..., ρ_q depend directly on the choice of the q moving average parameters θ, as well as on the p autoregressive parameters, through (3.21). The autocorrelations of higher lags, ρ_k, k ≥ q + 1, are determined through the difference equation (3.24) after providing the p starting


values ρ_{q-p+1}, ..., ρ_q. So, the autocorrelation function of an ARMA(p,q) model is infinite in extent, with the first q-p values ρ_1, ..., ρ_{q-p} irregular and the others consisting of damped exponentials and/or damped sine waves (Box and Jenkins, 1976; Salas et al., 1980).

Autoregressive Integrated Moving Average Models

An ARMA(p,q) process is stationary if the roots of φ(B) = 0 lie outside the unit circle and "explosive nonstationary" if they lie inside. For example, an explosive nonstationary AR(1) model is z_t = 2 z_{t-1} + a_t (the plot of z_t vs. t is an exponential growth) in which φ(B) = 1 - 2B has its root B = 0.5 inside the unit circle. The special case of homogeneous nonstationarity is when one or more of the roots lie on the unit circle. By introducing a generalized autoregressive operator ϕ(B) = φ(B)(1 - B)^d which has d of its roots on the unit circle, the general model can be written as

ϕ(B) z_t = θ(B) a_t   (3.25)

that is

φ(B)(1 - B)^d z_t = θ(B) a_t   (3.26)

where

w_t = ∇^d z_t   (3.27)


and ∇ = 1 - B is the difference operator. This model corresponds to assuming that the dth difference of the series can be represented by a stationary, invertible ARMA process. By inverting (3.27)

z_t = S^d w_t   (3.28)

where S is the infinite summation operator

S = 1 + B + B^2 + ... = (1 - B)^{-1} = ∇^{-1}   (3.29)

Equation (3.28) implies that the nonstationary process z_t can be obtained by summing or "integrating" the stationary process w_t, d times. Therefore, this process is called a simple autoregressive integrated moving average process, ARIMA(p,d,q). It is also possible to take periodic or seasonal differences at lag s of the series, e.g., the 12th difference of monthly series, introducing the differencing operator ∇_s^D with the meaning that seasonal differencing ∇_s is applied D times on the series. This periodic ARIMA(P,D,Q)_s model can be written as

Φ(B^s) ∇_s^D z_t = Θ(B^s) a_t   (3.30)
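The simple and seasonal difference operators translate directly into code. A minimal sketch on a synthetic series (not the thesis data):

```python
import numpy as np

z = np.arange(1, 25, dtype=float) ** 2   # synthetic nonstationary series

# Simple difference: (del z)_t = z_t - z_{t-1}
w1 = z[1:] - z[:-1]

# Seasonal difference at lag s = 12: (del_s z)_t = z_t - z_{t-12}
s = 12
w12 = z[s:] - z[:-s]

# d-th difference: apply the simple difference d times (here d = 2)
w2 = np.diff(z, n=2)

print(w1[:3], w12[:3], w2[:3])
```

Because the synthetic series grows quadratically, its second difference is a constant, illustrating how differencing removes polynomial nonstationarity.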


The combination of nonperiodic and periodic models leads to the multiplicative ARIMA(p,d,q) x ARIMA(P,D,Q)_s model, which can be written as

φ(B) Φ(B^s) ∇^d ∇_s^D z_t = θ(B) Θ(B^s) a_t   (3.31)

After the model has been fitted to the differenced series an integration should be performed to retrieve the original process. But such an integrated series would lack a mean value, since a constant of integration has been lost through the differencing. This is the reason that the ARIMA models cannot be used for synthetic generation of time series, although they are useful in forecasting the deviations of a process (Box and Jenkins, 1976; Salas et al., 1980).

Transformation of the Original Series

Transformation to Normality

Most probability theory and statistical techniques have been developed for normally distributed variables. Hydrologic variables are usually asymmetrically distributed or bounded by zero (positive variables), and so a transformation to normality is often applied before modeling. Another approach would be to model the original skewed series and then find the probability distribution of the uncorrelated residuals. Care must then be taken to assess the errors of applying methods developed for normal variables to skewed


variables, especially when the series are highly skewed, e.g., hourly or daily series. On the other hand, when transforming the original series into normal, biases in the mean and standard deviation of the generated series may occur. In other words, the statistical properties of the transformed series may be reproduced in the generated but not in the original series. An alternative for avoiding biases in the moments of the generated series would be to estimate the moments of the transformed series through the derived relationships between the moments of the skewed and normal series. Matalas (1967) and Fiering and Jackson (1971) describe how to estimate the first two moments of the log-transformed series so as to reproduce those of the original series. Mejia et al. (1974) present another approach in order to preserve the correlation structure of the original series. However, the most widely used approach is to transform the original skewed series to normal and then model the normal series. Several transformations may be applied to the original series, and the transformed series then tested for normality; e.g., the graph of their cumulative distribution should appear as a straight line when plotted on normal probability paper. The transformation is finally chosen that gives the best approximation to normality, e.g., the best fit to a straight line. Another advantage of transforming the series to normal is that the maximum likelihood estimates of the model


parameters are essentially the same as the least squares estimates, provided that the residuals are normally distributed (Box and Jenkins, 1976, Ch. 7). This facilitates the calculation of the final estimates, since they are those values that minimize the sum of squares of the residuals. Box and Cox (1964) showed how a maximum likelihood and a parallel Bayesian analysis can be applied to any type of transformation family to obtain the "best" choice of transformation from that family. They illustrated those methods for the popular power families in which the observation x is replaced by y, where

y = (x^λ - 1)/λ,  λ ≠ 0
y = log x,        λ = 0   (3.32)

The fundamental assumption was that for some λ the transformed observations y can be treated as independently normally distributed with constant variance σ^2 and with expectations defined by a linear model

E[y] = A τ   (3.33)

where A is a known constant matrix and τ is a vector of unknown parameters associated with the transformed observations (Box and Cox, 1964). This transformation has the advantage over the simple power transformation proposed by Tukey (1957)


y = x^λ,    λ ≠ 0
y = log x,  λ = 0   (3.34)

of being continuous at λ = 0. Otherwise the two transformations are identical provided, as has been shown by Schlesselman (1971), that the linear model of (3.33) contains a constant term. Further, Draper and Cox (1969) showed that the value of λ obtained from this family of transformations can be useful even in cases where no power transformation can produce normality exactly. Also, John and Draper (1980) suggested an alternative one-parameter family of transformations for when the power transformation fails to produce satisfactory distributional properties, as in the case of a symmetric distribution with long tails. The selection of the exact transformation to normality (zero skewness) is not an easy task, and over-transformation, i.e., transformation of the original data with a large positive (negative) skewness to data with a small negative (positive) skewness, or under-transformation, i.e., transformation of the original data with a large positive (negative) skewness to data with a small positive (negative) skewness, may result in unsatisfactory modeling of the series or in forecasts that are in error. This was the case for the data used by Chatfield and Prothero (1973a), who applied the Box-Jenkins forecasting approach and were dissatisfied with the results, concluding that the Box-Jenkins forecasting procedure is less efficient than other forecasting


methods. They applied a log transform to the data which evidently over-transformed the data, as shown by Box and Jenkins (1973), who finally suggested the approximate transformation y = x^0.25, even though the complicated but precise Box-Cox procedure gave an estimate of λ = 0.37 (Wilson, 1973). Thus, the selection of the normality transformation greatly affects the forecasts, as Chatfield and Prothero (1973b) experienced with their data. They concluded that

We have seen that a "small" change in λ from 0 to 0.25 has a substantial effect on the resulting forecasts from model A [ARIMA(1,1,1) x ARIMA(1,1,1)_12] even though the goodness of fit does not seem to be much affected. This reminds us that a model which fits well does not necessarily forecast well. Since small changes in λ close to zero produce marked changes in forecasts, it is obviously advisable to avoid "low" values of λ, since a procedure which depends critically on distinguishing between fourth-root and logarithmic transformation is fraught with peril. On the other hand a "large" change in λ from 0.25 to 1 appears to have relatively little effect on forecasts. So we conjecture that Box-Jenkins forecasts are robust to changes in the transformation parameter away from zero ... [Chatfield and Prothero (1973b), p. 347]

Stationarity

Most time series occurring in practice exhibit nonstationarity in the form of trends or periodicities. The physical knowledge of the phenomenon being studied and a visual inspection of the plot of the original data may give the first insight into the problem. Usually the length of the series is not long enough, and the detection of


trends or cycles only through the plot of the series is ambiguous. Useful tools for the detection of periodicities are the autocorrelation function and the spectral density function of the series (which is the Fourier transform of the autocorrelation function). If a seasonal pattern is present in the series then the correlogram (plot of the autocorrelation function) will exhibit a sinusoidal appearance and the periodogram (plot of the spectral density function) will show peaks. The period of the sinusoidal function of the correlogram, or the frequency where the peaks occur in the periodogram, can determine the periodic component exactly (Jenkins and Watts, 1968). Another device for the detection of trends and periodicities is to fit some definite mathematical function, such as exponentials, Fourier series or polynomials, to the series and then model the residual series, which is assumed to be stationary. More details on the treatment of nonstationary data, as well as on the interpretation of the correlogram and periodogram of a time series, can be found in textbooks such as Bendat and Piersol (1958), Jenkins and Watts (1968), Wastler (1969), Yevjevich (1972), and Chatfield (1980). Apart from the approach of removing the nonstationarity of the original series and modeling the residual series with a stationary ARMA(p,q) model, the original nonstationary series can be modeled directly with a simple or seasonally integrated ARIMA model. Actually, the second approach can be viewed as an extension of the first one,


i.e., the nonstationarity is removed through the simple (∇) or seasonal (∇_s) differencing. However, the integrated model cannot be used for generation of data, as has already been discussed. For many hydrologic applications, one is satisfied with second order or weak stationarity, i.e., stationarity in the mean and variance. Furthermore, weak stationarity and the assumption of normality imply strict stationarity (see Appendix A).

Monthly Rainfall Series

Normalization and Stationarization

Stidd (1953, 1968) suggested that rainfall data have a cube root normal distribution because they are product functions of three variables: vertical motion in the atmosphere, moisture, and duration time. Synthetic rainfall data generated using processes analogous to those operating in nature showed that the exponent required to normalize the distribution is between 0.5 (square root) and 0.33 (cube root) for different types of rainfall (Stidd, 1970). The square root transformation has been extensively used for the approximate normalization of monthly rainfall series (see Table C12 of Appendix C) with satisfactory results: Delleur and Kavvas (1978), Salas et al. (1980), Ch. 5, Roesner and Yevjevich (1966). However, Hinkley (1977) used the exact Box-Cox transformation for monthly rainfall


series. Although Asley et al. (1977) have developed an efficient algorithm for the estimation of λ along with the other parameters in an ARIMA model, it seems that the exact value of λ is not more reliable than the approximate one, λ = 0.5 (Chatfield and Prothero, 1973b). The reasons for this follow. First, Chatfield and Prothero (1973b) used the Box-Cox procedure to evaluate the exact transformation of their data. They obtained estimates λ = 0.24 using all the data (77 observations), λ = 0.34 using the first 60 observations, and λ = 0.16 excluding the first year's data. Therefore, it is logical to infer that even if the complicated Box-Cox procedure for the incomplete rainfall record is used, the missing values may be enough to give a spurious λ, which is not "more exact" than the value of 0.5 used in practice. Second, we may also notice that the use of either λ = 0.33 (cube root) or λ = 0.5 (square root) is not expected to greatly affect the forecasts since, according to Chatfield and Prothero (1973b), the Box-Jenkins forecasts are not too sensitive to changes of λ for λ > 0.25.

Monthly rainfall series are nonstationary. The variation in the mean is obvious, since generally the expected monthly rainfall value for January is not the same as that of July. Although the variation of the standard deviation is not so easy to visualize, calculations show that months with higher mean usually have higher standard deviation. Thus, each month has its own probability


distribution and its own statistical parameters, resulting in monthly series that are nonstationary. By introducing the concept of circular stationarity as developed by Hannan (1960) and others (see Appendix A for definition), the periodic monthly rainfall series can be considered not as nonstationary but circularly stationary, since circular stationarity suggests that the probability distribution of rainfall in a particular month is the same for the different years. Then, the monthly rainfall series is composed of a circularly stationary (periodic) component and a stationary random component. The time-series models currently used in hydrology are fitted to the stationary random component, so the circularly stationary component must be removed before modeling. This last component appears as a sinusoidal component in the autocorrelation function (with a 12-month period) or as a discrete spectral component in the spectrum (peak at the frequency 1/12 cycle per month). Usually several subharmonics of the fundamental 12-month period are needed to describe all the irregularities present in the autocorrelation function and spectral density function, since in nature the periodicity does not follow an ideal cosine function with a 12-month period. The use of a Fourier series approach for the approximation of the periodic component of monthly rainfall and monthly runoff series has been illustrated by Roesner and Yevjevich (1966).
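The 12-month sinusoidal signature in the correlogram can be demonstrated on a synthetic "monthly" series (illustrative only, not the thesis data):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(240)                       # 20 "years" of monthly values
z = 10 + 5 * np.cos(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)

def sample_acf(x, max_lag):
    """Sample autocorrelation function r_1, ..., r_max_lag."""
    x = x - x.mean()
    c0 = np.dot(x, x) / x.size
    return np.array([np.dot(x[:-k], x[k:]) / x.size / c0
                     for k in range(1, max_lag + 1)])

r = sample_acf(z, 24)
# A seasonal series yields a sinusoidal correlogram: large positive
# autocorrelation near lags 12 and 24, strongly negative near lag 6
print(np.round(r[[5, 11, 23]], 3))
```

The large spike at lag 12 identifies the 12-month periodic component that must be removed (or differenced out) before a stationary model is fitted.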


Kavvas and Delleur (1975) investigated three methods of removal of periodicities in the monthly rainfall series: nonseasonal (first-lag) differencing, seasonal differencing (12-month difference), and removal of monthly means. They worked both analytically and empirically, using the rescaled (divided by the monthly standard deviation) monthly rainfall square roots for fifteen Indiana watersheds. They concluded that "all the above transformations yield hydrologic series which satisfy the classical second-order weak stationarity conditions. Both seasonal and nonseasonal differencing reduce the periodicity in the covariance function but distort the original spectrum, thus making it impractical or impossible to fit an ARMA model for generation of synthetic monthly series. The subtraction of monthly means removes the periodicity in the covariance and the amount of nonstationarity introduced is negligible for practical purposes." (Kavvas and Delleur, 1975, p. 349.) In other words, they concluded that the best way of modeling monthly rainfall series is to remove the seasonality (by subtracting the monthly means and dividing by the standard deviations of the normalized series) and then use a stationary ARMA(p,q) model to model the stationary normal residuals.

Modeling of Normalized Series

It is assumed that the nonstationarities due to long-term trends are removed before any operation. Then the appropriate transformation is applied to the data in


order to obtain an approximately normal distribution. For monthly rainfall series experience has shown that the best practical transformation is the square root transformation, as has already been discussed. What remains is the modeling of the normalized series with one of the following models: stationary ARMA(p,q), simple nonstationary ARIMA(p,d,q), seasonal nonstationary ARIMA(P,D,Q)_s, or multiplicative ARIMA(p,d,q)x(P,D,Q)_s model. Delleur and Kavvas (1978) fitted different models to the monthly rainfall series of 15 basins in Indiana and compared the results. They studied the models ARIMA(1,0,0), ARIMA(1,0,1), ARIMA(1,1,1), ARIMA(1,1,1)_12 and ARIMA(1,0,0)x(1,1,1)_12 on the square-root transformed series. They concluded that from the nonseasonal ARIMA models, ARMA(1,1) "emerged as the most suitable for the generation and forecasting of monthly rainfall series." The goodness-of-fit tests applied on the residuals were the portmanteau lack of fit test (see Appendix A) of Box and Pierce (1970) and the cumulative periodogram test (Box and Jenkins, 1976, p. 294). The ARMA(1,1) model passed both tests in all cases studied. From the seasonal models, ARIMA(1,0,0)x(1,1,1)_12 also passed the goodness-of-fit tests in all cases, but they stress that this model "has only limited use in the forecasting of monthly rainfall series since it does not preserve the monthly standard deviations." As far as forecasts are concerned, they showed that "the forecasts by the several models follow each other very


closely, and the forecasts rapidly tend to the mean of the observed rainfall square roots (which is the forecast of the white noise model)."
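The deseasonalization recommended by Kavvas and Delleur (subtract each month's mean, divide by its standard deviation) can be sketched as follows; the data here are synthetic stand-ins for normalized rainfall square roots, with hypothetical seasonal parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
years, months = 30, 12
# Hypothetical seasonal mean and standard deviation patterns
mu = 2.0 + np.sin(2 * np.pi * np.arange(months) / 12)
sigma = 0.5 + 0.2 * np.cos(2 * np.pi * np.arange(months) / 12)
x = mu + sigma * rng.normal(size=(years, months))   # rows: years, cols: months

# Subtract each month's sample mean and divide by its sample std deviation
m = x.mean(axis=0)
s = x.std(axis=0, ddof=1)
z = (x - m) / s

# The residual series is second-order stationary by construction:
print(np.round(z.mean(axis=0), 3))         # all (essentially) zero
print(np.round(z.std(axis=0, ddof=1), 3))  # all one
```

The flattened array z.ravel() is then the stationary residual series to which a stationary ARMA(p,q) model would be fitted.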


CHAPTER 4
MULTIVARIATE STOCHASTIC MODELS

Introduction

For univariate stochastic models the sequence of observations under study is assumed independent of other sequences of observations and so is studied by itself (single or univariate time series). However, in practice there is always an interdependence among such sequences of observations, and their simultaneous study leads to the concept of multivariate statistical analysis. For example, a rainfall series of one station may be better modeled if its correlation with concurrent rainfall series at other nearby stations is incorporated into the model. Multiple time series can be divided into two groups: (1) multiple time series at several points (e.g., rainfall series at different stations, streamflow series at various points of a river), and (2) multiple series of different kinds at one point (e.g., rainfall and runoff series at the same station). In general, both kinds of multiple time series are studied simultaneously, and their correlation and cross-correlation structure is used for the construction of a model that better describes all these series. The parameters of this so-called multivariate stochastic model are calculated such


that the correlation and cross-correlation structure of the multiple measured series are preserved in the multiple series generated by the model. The multivariate models that will be presented in this chapter have been developed and extensively used for the generation of synthetic series. How these models can be adapted and used for filling in missing values will be discussed in Chapter 5.

General Multivariate Regression Model

The general form of a multivariate regression model is

Y = A X + B H   (4.1)

where Y is the vector of dependent variables, X the vector of independent variables, A and B matrices of regression coefficients, and H a vector of random components. The vectors Y and X may consist of either the same variable at different points (or at different times) or different variables at the same or different points (or at different times). For convenience and without loss of generality all the variables are assumed second order stationary and normally distributed with zero mean and unit variance. Transformations to accomplish normality have been discussed in Chapter 3. A random component is superimposed on the model to account for the nondeterministic fluctuations. In the above model, the dependent and independent variables must be selected carefully so that the most


information is extracted from the existing data. A good summary of the methods for the selection of independent variables for use in the model is given in Draper and Smith (1966). Most popular is the stepwise regression procedure, in which the independent variables are ranked as a function of their partial correlation coefficients with the dependent variable and are added to the model, in that order, if they pass a sequential F test. The parameter matrices A and B are calculated from the existing data in such a way that important statistical characteristics of the historical series are preserved in the generated series. This estimation procedure becomes cumbersome when too many dependent and independent variables are involved in the model, and several simplifications are often made in practice. On the other hand, restrictions have to be imposed on the form of the data, as we shall see later, to ensure the existence of real solutions for the matrices A and B.

Multivariate Lag-One Autoregressive Model

If only one variable (e.g., rainfall at different stations) is used in the analysis then the model of equation (4.1) becomes a multivariate autoregressive model. Since in the rest of this chapter we will be dealing only with one variable (rainfall) which has been transformed to normal and second order stationary, the vectors Y and X are replaced by the vector Z for a notation consistent with the


univariate models. Matalas (1967) suggested the multivariate lag-one autoregressive model

Z_t = A Z_{t-1} + B ε_t   (4.3)

where Z_t is an (mx1) vector whose ith element z_{i,t} is the observed rainfall value at station i and at time t, and the other variables have been described previously. Such a model can be used for the simultaneous generation of rainfall series at m different stations. The correlation and cross-correlation of the series is incorporated in the model through the parameters A and B. The matrices A and B are estimated from the historical series so that the means, standard deviations and autocorrelation coefficients of lag-one for all the series, as well as the cross-correlations of lag-zero and lag-one between pairs of series, are maintained. Let M_0 denote the lag-zero correlation matrix, which is defined as

M_0 = E[Z_t Z_t^T]   (4.4)

Then a diagonal element of M_0 is E[z_{i,t} z_{i,t}] = ρ_ii(0) = 1 (since Z_t is standardized) and an off-diagonal element (i,j) is E[z_{i,t} z_{j,t}] = ρ_ij(0), which is the lag-zero cross-correlation between series {z_i} and {z_j}. The matrix M_0 is symmetric since ρ_ij(0) = ρ_ji(0) for every i, j.


Let M_1 denote the lag-one correlation matrix, defined as

M_1 = E[Z_t Z_{t-1}^T]   (4.5)

A diagonal element of M_1 is E[z_{i,t} z_{i,t-1}] = ρ_ii(1), which is the lag-one serial correlation coefficient of the series {z_i}, and an off-diagonal element (i,j) is E[z_{i,t} z_{j,t-1}] = ρ_ij(1), which is the lag-one cross-correlation between the {z_i} and {z_j} series, the latter lagged behind the former. Since in general ρ_ij(1) ≠ ρ_ji(1) for i ≠ j, the matrix M_1 is not symmetric. After some algebraic manipulations (see Appendix B) the coefficient matrices A and B are obtained as solutions to the equations

A = M_1 M_0^{-1}   (4.6)

B B^T = M_0 - M_1 M_0^{-1} M_1^T   (4.7)

where M_0^{-1} is the inverse of M_0 and M_1^T the transpose of M_1. The correlation matrices M_0 and M_1 are calculated from the data. Then an estimate of the matrix A is given directly by equation (4.6), and an estimate for B is found by solving equation (4.7) using a technique of principal component analysis (Fiering, 1964) or upper triangularization (Young, 1968). For more details on the solution of equation (4.7) see Appendix B.
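Equations (4.6) and (4.7) can be sketched in a few lines of Python. The data below are synthetic stand-ins for standardized series, and B is obtained by a Cholesky factorization, which is just one admissible square root of B B^T (the thesis mentions principal component analysis and upper triangularization instead):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5000, 3                       # three synthetic "stations"
e = rng.normal(size=(n, m))
z = np.empty_like(e)
z[0] = e[0]
for t in range(1, n):                # generate AR(1)-type series to work with
    z[t] = 0.5 * z[t - 1] + e[t]
z = (z - z.mean(axis=0)) / z.std(axis=0)   # standardize each station

# Lag-zero and lag-one correlation matrices: M0 = E[Z_t Z_t'], M1 = E[Z_t Z_{t-1}']
M0 = z.T @ z / n
M1 = z[1:].T @ z[:-1] / (n - 1)

A = M1 @ np.linalg.inv(M0)                  # equation (4.6)
BBT = M0 - M1 @ np.linalg.inv(M0) @ M1.T    # equation (4.7): B B' = ...
B = np.linalg.cholesky(BBT)                 # one admissible solution for B

print(np.round(A, 2))   # diagonal entries near 0.5 for this synthetic example
```

A generated vector is then Z_t = A Z_{t-1} + B ε_t with ε_t standard normal; any B with B B^T equal to the right-hand side of (4.7) reproduces the required second-order statistics.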


Comments on the Multivariate AR(1) Model

Assumption of Normality and Stationarity

We have assumed that all random variables involved in the model are normal. The assumption of a multivariate normal distribution is convenient but not necessary. It has been shown (Valencia and Schaake, 1973) that the multivariate AR(1) model preserves first and second order statistics regardless of the underlying probability distributions. Several studies have been done using directly the original skewed series. Matalas (1967) worked with log-normal series and constructed the generation model so that it preserves the historical statistics of the log-normal process. Mejia et al. (1974) showed a procedure for multivariate generation of mixtures of normal and log-normal variables. Moran (1970) indicated how a multivariate gamma process may be applied, and Kahan (1974) presented a method for the preservation of skewness in a linear bivariate regression model. But in general, the normalization of the series prior to modeling is more convenient, especially when the series have different underlying probability distributions. In such cases different transformations are applied on the series, and that combination of transformations is kept which yields minimum average skewness. Average skewness is the sum of the skewness of each series divided by the number of series or number of stations used. This operation is called finding the MST (Minimum Skewness


Transformation) and results in an approximately multivariate normal distribution (Young and Pisano, 1968). We have also assumed that all variables are standardized, i.e., have zero mean and unit variance. This assumption is made without loss of generality since the linear transformations are preserved through the model. On the other hand this transformation becomes necessary when modeling periodic series, since by subtracting the periodic means and dividing by the standard deviations we remove almost all of the periodicity. If the data are not standardized, M_0 and M_1 represent the lag-zero and lag-one covariance matrices (instead of correlation matrices), respectively. If S denotes the diagonal matrix of the standard deviations and R_0, R_1 the lag-zero and lag-one correlation matrices, then

M_0 = S R_0 S   (4.8)

and

M_1 = S R_1 S   (4.9)

When we standardize the data the matrix S is an identity matrix and M_0, M_1 become the correlation matrices R_0 and R_1, respectively. Thus, one other advantage of standardization is that we work with correlation matrices whose elements are less than unity, and the computations are likely to be more stable (Pegram and James, 1972).
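Relations (4.8) and (4.9) are one-line identities in code; a small numeric check with hypothetical standard deviations and correlations:

```python
import numpy as np

std = np.array([2.0, 0.5, 3.0])           # hypothetical station std deviations
R0 = np.array([[1.0, 0.4, 0.2],
               [0.4, 1.0, 0.3],
               [0.2, 0.3, 1.0]])          # lag-zero correlation matrix

S = np.diag(std)
M0 = S @ R0 @ S                           # equation (4.8): covariance matrix

# Standardizing (equivalently, applying S^{-1} on both sides) recovers R0
back = np.diag(1 / std) @ M0 @ np.diag(1 / std)
print(np.allclose(back, R0))
```

Element (i,j) of M_0 is simply s_i s_j ρ_ij(0), so setting S = I makes the covariance and correlation forms coincide, as the text states.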


Cross-Correlation Matrix M_1

Notice that the lag-one correlation matrix M_1 has been defined as M_1 = E[Z_t Z_{t-1}^T], which contains the lag-one cross-correlations between pairs of series but having the second series lagged behind the first one. Following this definition, the lag-minus-one correlation matrix will be

M_{-1} = E[Z_t Z_{t+1}^T]   (4.10)

and it will contain the lag-one correlations having now the second series lagged ahead of the first one. It is easy to show that M_{-1} is actually the transpose of M_1:

M_{-1} = E[Z_t Z_{t+1}^T] = E[(Z_{t+1} Z_t^T)^T] = (E[Z_{t+1} Z_t^T])^T = M_1^T   (4.11)

Care then must be taken so that there is a consistency between the equation used to calculate matrix A and the way that the cross-correlation coefficients have been calculated. Such an inconsistency was present in the numerical multisite package developed by Young and Pisano (1968) and was first corrected by O'Connell (1973) and completely corrected and improved by Finzi et al. (1974, 1975).

Incomplete Data Sets

In practice, hydrologic series at different stations are unlikely to be concurrent and of equal length. With lag-zero auto- and cross-correlation coefficients calculated


from the incomplete data sets, the lag-zero correlation matrix M_0 obtained need not be positive semidefinite, and its inverse M_0^{-1}, needed for the calculation of matrix A, thus may have elements that are complex numbers. Also, a necessary and sufficient condition for a real solution of matrix B is that C = M_0 - M_1 M_0^{-1} M_1^T is a positive semidefinite matrix (see Appendix B). When all of the series are concurrent and complete then M_0 and C are both positive semidefinite matrices [Valencia and Schaake, 1973], and the generated synthetic series are real numbers. When the series are incomplete there is no guarantee that real solutions for the matrices A and B exist, causing the model of Matalas (1967) to be conditional on M_0 and C being positive semidefinite [Slack, 1973]. Several techniques have been proposed which use the incomplete data sets but guarantee the positive semidefiniteness of the correlation matrices. Fiering (1968) suggested a technique that can be used to produce a positive semidefinite correlation matrix M_0. If M_0 is not positive semidefinite then negative eigenvalues may occur, and hence negative variances, since the eigenvalues are variances in the principal component system. In this technique, the eigenvalues of the original correlation matrix are calculated. If negative eigenvalues are encountered, an adjustment procedure is used to eliminate them (thereby altering the correlation matrix M_0 [Fiering, 1968]).
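The eigenvalue check underlying this kind of technique is easy to demonstrate. The matrix below is a hypothetical inconsistent "correlation" matrix of the sort pairwise estimation from non-concurrent records can produce, and the repair shown (clipping negative eigenvalues) is only in the spirit of Fiering's procedure, not its exact algorithm:

```python
import numpy as np

# A pairwise-estimated "correlation" matrix that is not positive semidefinite
M0 = np.array([[1.0, 0.9, 0.1],
               [0.9, 1.0, 0.9],
               [0.1, 0.9, 1.0]])

eig = np.linalg.eigvalsh(M0)      # ascending; the smallest one is negative
print(np.round(eig, 3))

# Simple repair: clip negative eigenvalues to zero and rebuild the matrix
# (the actual adjustment procedure differs in its details)
vals, vecs = np.linalg.eigh(M0)
M0_fixed = vecs @ np.diag(np.clip(vals, 0, None)) @ vecs.T
print(np.round(np.linalg.eigvalsh(M0_fixed), 3))   # all nonnegative now
```

After such an adjustment the diagonal is no longer exactly unity, which is why the resulting matrix must be interpreted as an altered correlation matrix, as the text notes.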


A correlation matrix is called consistent if all its eigenvalues are positive. But consistent estimates of the correlation matrices M_0 and M_1 do not guarantee that C will also be consistent. Crosby and Maddock (1970) proposed a technique that is suitable only for monotone data (data continuous in collection to the present but having different starting times). This technique produces a consistent estimate of the matrix M_0 as well as of the matrix C, and is based on the maximum likelihood technique developed by Anderson (1957). Valencia and Schaake (1973) developed another technique. They estimate matrices A and B from equations (4.12) and (4.13), where M_01 is the lag-zero correlation matrix M_0 computed from the first (N-1) vectors of the data, and M_02 is computed from the last (N-1) vectors, where N is the number of data points (number of times sampled) in each of the n series.

Further Simplification

Sometimes in practice, the preservation of the lag-zero and lag-one autocorrelations and the lag-zero


cross-correlations is enough. In such cases, i.e., when the lag-one cross-correlations are of no interest, a nice simplification due to Matalas (1967, 1974) can be made. He defined matrix A as a diagonal matrix whose diagonal elements are the lag-one auto-correlation coefficients. With A defined as above, the lag-one cross-correlation of the generated series, ρ*_ij(1), can be shown to be the product of the lag-zero cross-correlation ρ_ij(0) and the lag-one auto-correlation of the series ρ_ii(1), but of course different than the actual lag-one cross-correlation ρ_ij(1):

ρ*_ij(1) = ρ_ij(0) ρ_ii(1)   (4.14)

By using ρ*_ij(1) of equation (4.14) in place of the actual ρ_ij(1), thus avoiding the actual computation of ρ_ij(1) from the data, the desired statistical properties of the series are still preserved.

Higher Order Multivariate Models

The order p of a multivariate autoregressive model could be estimated from the plots of the autocorrelation and partial autocorrelation functions of the series (Salas et al., 1980) as an extension of the univariate model identification, which is already a difficult and ambiguous task. However, in practice first and second order models are usually adequate, and higher order models should be avoided (Box and Jenkins, 1976).
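Since the generated AR(1) series satisfy M_1 = A M_0, a diagonal A immediately yields equation (4.14). A small numeric check with hypothetical correlation values:

```python
import numpy as np

R0 = np.array([[1.0, 0.6],
               [0.6, 1.0]])             # hypothetical lag-zero correlations
r1 = np.array([0.5, 0.3])               # lag-one autocorrelations per station

A = np.diag(r1)                          # Matalas' diagonal simplification
M1_implied = A @ R0                      # lag-one matrix of the generated series

# Off-diagonal element (i, j): rho*_ij(1) = rho_ii(1) * rho_ij(0), eq. (4.14)
print(M1_implied[0, 1], r1[0] * R0[0, 1])
```

The diagonal of the implied M_1 reproduces the lag-one autocorrelations exactly, while the off-diagonals take the product form of (4.14) rather than the measured lag-one cross-correlations.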


In any case, the multivariate multilag autoregressive model of order p takes the form

Z_t = Σ_{k=1..p} A_k Z_{t-k} + B ε_t    (4.15)

and the matrices A_1, A_2, ..., A_p, B are the solutions of the equations

M_i = Σ_{k=1..p} A_k M_{i-k},   i = 1, 2, ..., p    (4.16)

B B' = M_0 - Σ_{k=1..p} A_k M_k'    (4.17)

where M_i is the lag-i correlation matrix. Equation (4.16) is a set of p matrix equations to be solved for the matrices A_1, A_2, ..., A_p, and matrix B is obtained from (4.17) using techniques already discussed. Here, the assumption of diagonal A matrices becomes even more attractive. For a multivariate second-order AR process the above simplification is illustrated in Salas and Pegram (1977), where the case of periodic (not constant) matrix parameters is also considered.

O'Connell (1974) studied the multivariate ARMA(1,1) model

Z_t = A Z_{t-1} + B ε_t + C ε_{t-1}    (4.18)

where A, B, and C are coefficient matrices to be determined


from the data. Specifically, they are solutions of the system of matrix equations

(4.19)

where S and T are functions of the correlation matrices M_0, M_1 and M_2. Methods for solving this system are proposed by O'Connell (1974). Explicit solutions for higher order multivariate ARMA models are not available, and Salas et al. (1980) propose an approximate multivariate ARMA(p,q) model.
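For the first-order case, the moment equations reduce to A = M_1 M_0^{-1} and B B' = M_0 - A M_1'. A minimal numerical sketch, using illustrative (made-up) two-station correlation matrices rather than the thesis data:

```python
import numpy as np

# Illustrative lag-zero and lag-one correlation matrices.
M0 = np.array([[1.00, 0.60],
               [0.60, 1.00]])
M1 = np.array([[0.30, 0.20],
               [0.25, 0.35]])

# First-order case of the moment equations: A = M1 M0^{-1}.
A = M1 @ np.linalg.inv(M0)

# B obtained from B B' = M0 - A M1'.
BBt = M0 - A @ M1.T

# B via a Cholesky-type decomposition; this requires BBt to be
# positive definite, i.e., a "consistent" set of correlation
# matrices in the text's sense.
B = np.linalg.cholesky(BBt)
```

If BBt fails to be positive definite, the Cholesky step raises an error, which is exactly the consistency problem discussed in the text.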


CHAPTER 5
ESTIMATION OF MISSING MONTHLY RAINFALL VALUES: A CASE STUDY

Introduction

This section compares and evaluates different methods for the estimation of missing values in hydrological time series. A case study is presented in which four of the simplified methods presented in Chapter 2 have been applied to a set of four concurrent 55-year monthly rainfall series from south Florida and the results compared. Also, a recursive method for the estimation of missing values by the use of a univariate or multivariate stochastic model has been proposed and demonstrated. The theory already presented in Chapters 2, 3 and 4 is supplemented whenever needed.

Set Up of the Problem

The monthly rainfall series of four stations in the South Florida Water Management District (SFWMD) have been used in the analysis. These stations are:

Station A: MRF6038, Moore Haven Lock 1
Station 1: MRF6013, Avon Park
Station 2: MRF6093, Fort Myers WSO AP
Station 3: MRF6042, Canal Point USDA


For convenience the four stations will sometimes be addressed as A, 1, 2, 3 instead of their SFWMD identification numbers 6038, 6013, 6093 and 6042, respectively. Their locations are shown in the map of Fig. 5.1. Station A in the center is considered as the interpolation station (whose missing values are to be estimated) and the other three stations 1, 2 and 3 as the index stations. Care has been taken so that the three index stations are as close to and as evenly distributed around the interpolation station as possible. This particular set of four stations was selected because it exhibits many desired and convenient properties: (1) the stations have an overlapping period of 55 years (1927-1981); (2) for this 55-year period the record of the interpolation station (station A) is complete (no missing values); (3) the three index stations have a small percentage of missing values for the overlapping period (station 1: 2.7% missing, station 2: complete, and station 3: 1.2% missing values). The 55-year length of the records is considered long enough to establish the historical statistics (e.g., monthly mean, standard deviation and skewness) and provides a monthly series of a satisfactory length (660 values) for fitting a univariate or multivariate ARMA model.


[Figure] Fig. 5.1. The four south Florida rainfall stations used in the analysis. A: 6038, Moore Haven Lock 1; 1: 6013, Avon Park; 2: 6093, Fort Myers WSO AP; 3: 6042, Canal Point USDA.


The completeness of the series of the interpolation station permits the random generation of gaps in the series, corresponding to different percentages of missing values, with the method described in Chapter 1. After the missing values have been estimated by the applied models, the gaps are in-filled with the estimated values, and the statistics of the new (estimated) series are compared with the statistics of the incomplete series and the statistics of the historical (actual) series. Also, the statistical closeness of the in-filled (estimated) values to the hidden (actual) values provides a means for the evaluation and comparison of the methods.

When, for the estimation of a missing value of the interpolation station, the corresponding value of one or more index stations is also missing, the latter is eliminated from the analysis, i.e., only the remaining one or two index stations are used for the estimation. Frequent occurrence of such concurrent gaps in both the interpolation and the index stations would alter the results of the applied method in a way that cannot be easily evaluated (e.g., another parameter, such as the probability of having concurrent gaps, would have to be included in the analysis). A small number of missing values in the selected index stations eliminates the possibility of such simultaneous gaps, and thus the effectiveness of the applied estimation procedures can be judged more efficiently.
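The experimental set-up described above — randomly hiding values from the complete series, in-filling them, and comparing the resulting statistics — can be sketched as follows. The series and the trivial mean in-fill are stand-ins for illustration, not the thesis data or estimation methods:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a complete 660-value monthly series
# (55 years x 12 months); gamma-distributed like rainfall totals.
y_actual = rng.gamma(shape=2.0, scale=2.0, size=660)

def make_gaps(series, pct_missing, rng):
    """Randomly hide a given percentage of values (set them to NaN)."""
    y = series.copy()
    n_gaps = int(round(len(y) * pct_missing / 100.0))
    idx = rng.choice(len(y), size=n_gaps, replace=False)
    y[idx] = np.nan
    return y, idx

y_incomplete, gap_idx = make_gaps(y_actual, 10, rng)

# In-fill the gaps (here a trivial mean in-fill stands in for the
# MV, RD, NR or MWA estimators), then compare statistics.
y_estimated = np.where(np.isnan(y_incomplete),
                       np.nanmean(y_incomplete), y_incomplete)

bias_mean = y_estimated.mean() - y_actual.mean()
resid_var = np.sum((y_estimated[gap_idx] - y_actual[gap_idx]) ** 2) \
            / (len(gap_idx) - 2)
```

The residual variance over only the hidden values corresponds to the accuracy criterion s²_r,e used later in the chapter.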


The statistical properties (e.g., monthly mean, standard deviation, skewness and coefficient of variation) of the truncated (to the 1927-1981 period) original monthly rainfall series for the four stations are shown in Tables C.1, C.2, C.3 and C.4 of Appendix C. Figure 5.2 shows the plot of the monthly means and standard deviations for station A. From these plots we observe that: (1) the plot of monthly means is in agreement with the typical plot for Florida shown in Fig. 1.1, and (2) months with a high mean usually have a high standard deviation. The only exception seems to be the month of January which, in spite of its low mean, exhibits a high standard deviation and therefore a very high coefficient of variation and an unusually high skewness. A closer look at the January rainfall values of station A shows that the unusual properties for that month are due to an extreme value of 21.4 inches of rainfall for January 1979, the other values being between 0.05 and 6.04 inches. The three index stations 1, 2 and 3 are at distances of 59 miles, 51 miles and 29 miles respectively from the interpolation station A.

Simplified Estimation Techniques

Techniques Utilized

From the simplified techniques presented in Chapter 2, the following four are applied for the estimation of missing


[Figure] Fig. 5.2. Plot of the monthly means and standard deviations, station 6038 (1927-1981). (a) monthly means (inches); (b) monthly standard deviations (inches).


monthly rainfall values: (1) the mean value method (MV), (2) the reciprocal distances method (RD), (3) the normal ratio method (NR), and (4) the modified weighted average method (MWA). These methods are all deterministic and are applied directly on the available data, thus permitting a uniform and objective comparison of the results. The mean value plus random component method has not been included in this thesis.

The above four methods will be applied for five different percentages of missing values: 2%, 5%, 10%, 15% and 20%. These percentages cover almost 80% of all cases encountered in practice, as has been shown in Table 1.1 (i.e., 80% of the stations have below 20% missing values). From the same table it can also be seen that almost 30% of the stations have below 5% missing values. Therefore, it would be of interest and practical use if we could generalize the results for the region of below 5% missing values, since a large fraction of the cases in practice fall in this region.

The application of the first three methods (MV, RD, NR) is straightforward and no further comments need be made. However, some comments on the least squares (LS) method and the modified weighted average (MWA) method are necessary.
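Minimal sketches of the MV, RD and NR estimators for a single missing month are given below. The distances are those quoted for the index stations later in the text (59, 51 and 29 miles); the concurrent values and the station normals are made-up, and the inverse-distance power is a modelling choice, not a value fixed by the thesis:

```python
import numpy as np

def mean_value(x_neighbors):
    """MV: arithmetic mean of the concurrent index-station values."""
    return np.mean(x_neighbors)

def reciprocal_distance(x_neighbors, distances, power=1):
    """RD: weighted average with weights proportional to an inverse
    power of the distance to each index station."""
    w = 1.0 / np.asarray(distances, dtype=float) ** power
    return np.sum(w * np.asarray(x_neighbors)) / np.sum(w)

def normal_ratio(x_neighbors, normals, normal_target):
    """NR: each index value scaled by the ratio of the target
    station's normal (long-term mean) to that index station's normal,
    then averaged."""
    x = np.asarray(x_neighbors, dtype=float)
    return np.mean(normal_target * x / np.asarray(normals, dtype=float))

# Hypothetical concurrent values at the three index stations.
x = np.array([3.2, 4.1, 2.8])
est_mv = mean_value(x)
est_rd = reciprocal_distance(x, [59.0, 51.0, 29.0])
est_nr = normal_ratio(x, normals=[3.0, 3.5, 2.9], normal_target=3.1)
```

Because RD is a convex combination of the index values, its estimate always lies between the smallest and largest concurrent value, unlike NR, which can fall outside that range.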


Least Squares Method (LS)

The least squares method, although simple in principle, involves an enormous amount of calculations, and for that reason it has been excluded from this study. For example, consider the case in which the interpolation station A is regressed on the three index stations 1, 2 and 3. The estimated values will be given by:

y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + ε    (5.1)

where a, b_1, b_2, b_3 are the regression coefficients calculated from the available concurrent values of all four variables. There are 12 such regression equations, one for each month. But if it happens that an index station (say, station 3) has a missing value simultaneously with the interpolation station, a new set of 12 regression equations is needed for the estimation, e.g.,

y = a' + b'_1 x_1 + b'_2 x_2 + ε    (5.2)

Unless this coincidence of simultaneously missing values is investigated manually so that only the needed least squares regressions are performed (Buck, 1960), all the possible combinations of regressions must otherwise be performed. This involves regressions among all four variables (y; x_1, x_2, x_3), among three of them (y; x_1, x_2), (y; x_1, x_3), (y; x_2, x_3), and between pairs of them (y; x_1),


(y; x_2), (y; x_3), giving overall 7 sets of 12 regression equations. Because the regression coefficients are different for each percentage of missing values (since their calculation is based only on the existing concurrent values), the 84 (7 x 12) regressions must be repeated for each level of missing values (420 regressions overall for this study).

It could be argued that the same 12 regression equations (y; x_1, x_2, x_3) could be kept and a missing value x_i replaced by its mean x̄_i or by another estimate x'_i. In that case (with, say, x_3 missing) equation (5.1) would become

y = a + b_1 x_1 + b_2 x_2 + b_3 x'_3 + ε    (5.3)

the coefficients of regression a, b_1, b_2, b_3 remaining unchanged. This in fact can be done, but then the method tested will not be the "pure" least squares method, since the results will depend on the secondary method used for the estimation of the missing x_i values.

The coefficients a, b_1, b_2 and b_3 (equation 5.1) of the regression of the {y} series (of station A with 2% missing values) on the series {x_1}, {x_2} and {x_3} (of stations 1, 2 and 3 respectively) are shown in Table 5.1. In the same table the values of the squared multiple regression coefficient R² and the standard deviation of the {y} series are also shown. The numbers in parentheses show the significance level α at which the parameters are significant (the percent probability of being nonzero is (1-α)). For


Table 5.1. Least Squares Regression Coefficients for Equation (5.1) and Their Significance Levels (in parentheses). The standard deviation, s, for each month is also given.

Month   a (inches)         b_1                b_2                b_3                R²                 s (inches)
JAN      0.0059 (0.9692)    0.1271 (0.2790)    0.4994 (0.0005)    0.3377 (0.0017)    0.8046 (0.0001)    3.076
FEB      0.1355 (0.5260)    0.2624 (0.0025)    0.0086 (0.9431)    0.5345 (0.0001)    0.7033 (0.0001)    1.365
MAR      0.0052 (0.9793)    0.1617 (0.0138)    0.3457 (0.0001)    0.4507 (0.0001)    0.9142 (0.0001)    2.464
APR      0.7388 (0.0273)    0.2405 (0.0458)    0.2813 (0.0156)    0.1919 (0.1132)    0.4936 (0.0001)    1.818
MAY      2.1302 (0.0070)    0.4046 (0.0115)   -0.0591 (0.7180)    0.2186 (0.1308)    0.2752 (0.0016)    2.583
JUN      1.8765 (0.1505)    0.2192 (0.1576)    0.1108 (0.4034)    0.3339 (0.0133)    0.3351 (0.0002)    3.812
JUL      2.8601 (0.0750)   -0.0345 (0.7883)    0.3993 (0.0131)    0.1885 (0.1780)    0.2005 (0.0154)    3.399
AUG      2.0820 (0.2065)    0.1771 (0.1666)    0.2078 (0.0787)    0.2660 (0.0589)    0.1789 (0.0248)    2.938
SEP      0.0108 (0.9916)    0.5102 (0.0003)    0.2113 (0.0893)    0.2450 (0.0190)    0.5669 (0.0001)    4.085
OCT     -0.6985 (0.0866)    0.3960 (0.0020)    0.2287 (0.0433)    0.4667 (0.0001)    0.7749 (0.0001)    3.073
NOV      0.3167 (0.1290)    0.3009 (0.0030)    0.2473 (0.0804)    0.1063 (0.0069)    0.4575 (0.0001)    1.228
DEC     -0.2623 (0.1987)    0.2332 (0.1065)    0.3807 (0.0084)    0.4381 (0.0001)    0.7723 (0.0001)    1.585


example, for January the coefficient b_1 is not significant at the 5% significance level (α = 0.05), since 0.279 is greater than 0.05, but the R² coefficient is significant even at the 0.01% significance level (α = 0.0001). The significance levels correspond to the "t-test" for the regression coefficients and to the "F-test" for the R² coefficients. The standard deviation, s, of the {y} series is also listed, since it enters the random component through equation (5.4), as has already been discussed in Chapter 2.

It is interesting to note that, although the multiple regression coefficient R² varies for each month from as low as 0.18 to as high as 0.91, it is always significant at the 5% significance level. The months of July and August exhibit the lowest (although significant) correlation coefficients, as is expected for Florida. The physical reason for these low correlations is that in the summer most rainfall is convective, whereas in other months there is more cyclonic activity. Rainfall from scattered thunderstorms is simply not as correlated with that of nearby areas as is rainfall from broad cyclonic activity. Thus, on the basis of the regressions shown in Table 5.1, the least squares method would be expected to perform least well in the summer in Florida, but this point is not validated in this thesis.
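A regression of the form of equation (5.1), with its R² value, can be sketched as follows for one month. The data here are synthetic stand-ins generated for illustration, not the thesis records, and the true coefficients used to generate them are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "January" data: y = interpolation station,
# X = the three index stations (55 concurrent years).
n = 55
X = rng.gamma(2.0, 2.0, size=(n, 3))
y = 0.1 + X @ np.array([0.3, 0.25, 0.4]) + rng.normal(0.0, 0.5, n)

# One regression per month, equation (5.1):
# y = a + b1 x1 + b2 x2 + b3 x3 + eps.
G = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(G, y, rcond=None)
a, b1, b2, b3 = coef

# Squared multiple regression coefficient R^2.
resid = y - G @ coef
R2 = 1.0 - resid.var() / y.var()
```

With 7 variable subsets, 12 months, and 5 missing-value levels, this single fit would have to be repeated 420 times, which is the computational burden the text objects to.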


Modified Weighted Average Method (MWA)

For the modified weighted average method the twelve (3x3) covariance matrices of the three index stations have been calculated for each month using equations (2.9) and (2.10), and are shown in Table C.11 (Appendix C). Also, the monthly standard deviations, s_y, have been estimated from the known {y} series, and the monthly standard deviations, s'_y, have been calculated by equation (2.11) using the calculated covariance matrices. Notice that although the twelve s_y values (as calculated from the actual data and which we want to preserve) are different at different percentages of missing values, the twelve s'_y values (which depend only on the weights a_i and the covariance matrix of the index stations) are calculated only once. The correction coefficients f (f = s_y/s'_y) for each month and for each different percentage of missing values, which must be applied on matrix A (equation 2.21), are shown in Table 5.2. From this table it can be seen that if the simple weighted average scheme of equation (2.3) were used for the generation, the standard deviation of November would be overestimated (by a factor of approximately 2) and the standard deviation of all other months would be underestimated (e.g., by a factor of approximately 0.5 for the month of January). We also observe that, due to small changes of s_y for different percentages of missing values, the correction factor f does not vary much either, but tends


Table 5.2. Correction Coefficient, f, for Each Month and for Each Different Percent of Missing Values (f = s_y/s'_y).

Month   2%      5%      10%     15%     20%
JAN     1.777   1.777   1.795   1.897   1.872
FEB     1.129   1.142   1.136   1.199   1.188
MAR     1.178   1.207   1.177   1.003   1.009
APR     1.089   0.980   1.061   1.051   1.054
MAY     1.269   1.197   1.212   1.222   1.360
JUN     1.214   1.173   1.192   1.228   1.242
JUL     1.338   1.345   1.386   1.390   1.491
AUG     1.424   1.414   1.425   1.432   1.369
SEP     1.313   1.328   1.325   1.210   1.331
OCT     1.258   1.273   1.218   1.229   1.314
NOV     0.533   0.537   0.509   0.583   0.572
DEC     1.161   1.140   1.169   1.172   1.248


to be slightly greater the greater the percent of missing values.

The modified weighted average scheme theoretically preserves the mean and variance of the series, as has been shown in Chapter 2. But this is true for a series that has been generated by the model, and not for a series that is a mix of existing values and values generated (estimated) by the model. This illustrates the difference between the two concepts: "generation of data by a model" and "estimation of missing values by a model." A method for generation of data which is considered "good" in the sense that it preserves first and second order statistics is not necessarily "good" for the estimation of missing values. In fact, it may give statistics comparable to the ones given by a simpler estimation technique which does not preserve the statistics, even as a generation scheme. Theoretically, for a "large" number of missing values, the estimation model operates as a generation model and thus preserves the "desired" statistics; but practically, for this large amount of missing values the "desired" statistics (calculated from the few existing values) are of questionable reliability. Only for augmentation of the time series (extension of the series before the first or after the last point) will the modified weighted average scheme, or other schemes that preserve the "desired" statistics, be expected to work better than the simple weighted average schemes.
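The correction factor f = s_y/s'_y can be computed as sketched below. This assumes that equation (2.11) has the usual quadratic form s'_y² = a'Ca for weights a and index-station covariance matrix C; the weights, covariance matrix and observed standard deviation are all illustrative numbers, not the thesis values:

```python
import numpy as np

# Illustrative weights a_i applied to the three index stations and a
# made-up covariance matrix C of those stations for one month.
a = np.array([0.2, 0.3, 0.5])
C = np.array([[4.0, 1.5, 1.2],
              [1.5, 5.0, 1.8],
              [1.2, 1.8, 3.5]])

# Standard deviation of the simple weighted-average estimate
# (quadratic form assumed for equation 2.11).
s_y_prime = np.sqrt(a @ C @ a)

# Correction factor f = s_y / s_y' rescaling the weights so the
# estimates reproduce the observed monthly standard deviation s_y.
s_y = 2.1   # observed monthly standard deviation (illustrative)
f = s_y / s_y_prime
```

A factor f > 1, as for most months in Table 5.2, means the plain weighted average would underestimate the month's variability and must be inflated.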


One other disadvantage of the modified weighted average scheme, as well as of the least squares scheme, is that negative values may be generated by the model. Since all hydrological variables are positive, the negative generated values are set equal to zero, thus altering the statistics of the series. This is also true for all methods that involve a random component and is mainly due to "big" negative values taken on by the random deviate. The numbers of negative values estimated by the MWA method, which have been set equal to zero in the example that follows, were 1, 1, 6, 4, and 9 values for the 2%, 5%, 10%, 15% and 20% levels of missing values, respectively. The effect of the values arbitrarily set to zero cannot be evaluated exactly, but what can be intuitively understood is that a distortion in the distribution is introduced. A transformation that prevents the generation of negative values could be performed on the data before the application of the generation scheme. Such a transformation is, for example, the logarithmic transformation, since its inverse applied on a negative value exists, and the mapping of the transformed to the original data and vice versa is one to one (this is not true for the square root transformation).

Comparison of the MV, RD, NR and MWA Methods

The performance of each method applied for the estimation of the missing values will be evaluated by comparing the estimated series (existing plus estimated


values) to the incomplete series (really available in practice) and to the actual series (unknown in practice, but known in this artificial case). The criteria that will be used for the comparison of the methods will be the following:

(1) the bias in the mean, as measured (a) by the difference between the mean of the estimated series, ȳ_e, and the mean of the incomplete series, ȳ_i (i = 1, 2, 3, 4, 5 for the five different percentages of missing values), and (b) by the difference between the mean of the estimated series, ȳ_e, and the mean of the actual series, ȳ_a;

(2) the bias in the standard deviation, as measured (a) by the ratio of the standard deviation of the estimated series, s_e, to the standard deviation of the incomplete series, s_i, and (b) by the ratio of the standard deviation of the estimated series, s_e, to the standard deviation of the actual series, s_a;

(3) the bias in the lag-one and lag-two correlation coefficients, as measured by the difference of the correlation coefficient of the estimated series, r_e, from the correlation coefficient of the actual series, r_a;

(4) the bias of the estimation model, as given by the mean of the residuals, ȳ_r, i.e., the mean of the differences between the in-filled (estimated) and hidden (actual) values (this is also a check to


detect a consistent over- or under-estimation of the method);

(5) the accuracy, as determined by the variance of the residuals (differences between estimated and actual values) of the whole series, s²_r;

(6) the accuracy, as determined by the variance of the residuals of only the estimated values, s²_r,e; and

(7) the significance of the biases in the mean, standard deviation and correlation coefficients, as determined by the appropriate test statistic for each (see Appendix A).

Table 5.3 presents the statistics of the actual series (ACT), of the incomplete series (INC), and of the series estimated by the mean value method (MV), by the reciprocal distances method (RD), by the normal ratio method (NR), and by the modified weighted average method (MWA). The mean (ȳ), standard deviation (s), coefficient of variation (c_v), coefficient of skewness (c_s), and lag-one and lag-two correlation coefficients (r_1, r_2) of the above series considered as a whole have then been calculated.

Regarding comparison of the means, the following can be concluded from Table 5.4:

(1) the bias in the mean in all cases is not significant at the 5% significance level, as shown by the appropriate t-test;


Table 5.3. Statistics of the Actual (ACT), Incomplete (INC) and Estimated Series (MV, RD, NR, MWA).

        ȳ       s       c_v     c_s     r_1     r_2
ACT     4.126   3.673   89.040  1.332   0.366   0.134

2% missing values
INC     4.116   3.680   89.397  1.346
MV      4.125   3.663   88.808  1.335   0.371   0.130
RD      4.124   3.674   89.092  1.336   0.367   0.133
NR      4.114   3.666   89.104  1.339   0.368   0.131
MWA     4.113   3.674   89.331  1.342   0.363   0.131

5% missing values
INC     4.113   3.671   89.249  1.341
MV      4.101   3.610   88.040  1.352   0.372   0.139
RD      4.127   3.696   89.550  1.359   0.369   0.133
NR      4.105   3.674   89.501  1.349   0.367   0.131
MWA     4.116   3.720   90.386  1.388   0.364   0.126

10% missing values
INC     4.144   3.705   89.405  1.350
MV      4.134   3.603   87.152  1.346   0.379   0.159

(continued)


Table 5.3. Continued.

        ȳ       s       c_v     c_s     r_1     r_2
ACT     4.126   3.673   89.040  1.332   0.366   0.134

RD      4.150   3.689   88.884  1.301   0.380   0.166
NR      4.120   3.652   88.633  1.321   0.377   0.155
MWA     4.127   3.725   90.244  1.286   0.376   0.162

15% missing values
INC     4.135   3.671   88.767  1.268
MV      4.106   3.513   85.567  1.270   0.399   0.133
RD      4.177   3.688   86.862  1.224   0.372   0.132
NR      4.135   3.691   86.854  1.236   0.379   0.133
MWA     4.134   3.650   88.291  1.248   0.357   0.123

20% missing values
INC     4.082   3.701   90.673  1.404
MV      4.124   3.495   84.749  1.333   0.408   0.160
RD      4.231   3.723   87.993  1.865   0.370   0.156
NR      4.125   3.601   87.307  1.298   0.377   0.152
MWA     4.168   3.741   89.758  1.273   0.354   0.153


Table 5.4. Bias in the Mean.

(ȳ_e - ȳ_i)
        INC     MV      RD      NR      MWA     ȳ_i
2%      0.      0.009   0.008   0.002   0.003   4.116
5%      0.      -0.012  0.014   -0.008  0.003   4.113
10%     0.      -0.010  0.006   -0.024  -0.017  4.144
15%     0.      -0.089  0.042   0.000   -0.001  4.135
20%     0.      0.042   0.149   0.043   0.086   4.082

(ȳ_e - ȳ_a)                                     ȳ_a
2%      -0.010  -0.001  -0.002  -0.012  -0.013  4.126
5%      -0.013  -0.025  0.001   -0.021  -0.010
10%     0.018   0.008   0.024   -0.006  0.001
15%     0.009   -0.020  0.051   0.009   0.008
20%     -0.044  -0.002  0.105   -0.001  0.042


(2) the bias in the mean of the incomplete series is relatively small, but becomes larger the higher the percent of missing values;

(3) at high percents of missing values the NR method gives the least biased mean;

(4) except for the RD method, which consistently overestimates the mean (the bias being larger the higher the percent of missing values), the other methods do not show a consistent over- or underestimation.

Regarding comparison of the variances, the following can be concluded from Table 5.5:

(1) although slight, the bias in the standard deviation is always significant; but this is so because the ratio of variances would have to equal 1.0 exactly to satisfy the F-test (i.e., be unbiased) with as large a number of degrees of freedom as in this study;

(2) the MV method always gives a reduced variance as compared to the variance of the incomplete series and of the actual series, the bias being larger the higher the percent of missing values;

(3) the bias in the standard deviation of the incomplete series is small;

(4) there is no consistent over- or under-estimation of the variance by any of the methods (except the MV method);


Table 5.5. Bias in the Standard Deviation.

(s_e/s_i)
        INC     MV      RD      NR      MWA     s_i
2%      1.      0.995   0.998   0.996   0.998   3.680
5%      1.      0.983   1.007   1.001   1.013   3.671
10%     1.      0.972   0.996   0.986   1.005   3.705
15%     1.      0.957   0.988   0.978   0.994   3.671
20%     1.      0.944   1.006   0.973   1.011   3.701

(s_e/s_a)                                       s_a
2%      1.002   0.997   1.000   0.998   1.000   3.673
5%      0.999   0.983   1.006   1.000   1.013
10%     1.009   0.981   1.004   0.994   1.014
15%     0.999   0.956   0.988   0.978   0.994
20%     1.008   0.952   1.014   0.980   1.019


(5) the MWA method does not give a less biased variance, even at the higher percents of missing values tested, as compared to the RD and NR methods.

Regarding comparison of the correlation coefficients, the following can be concluded from Table 5.6:

(1) the bias in the correlation coefficients is in all cases not significant at the 5% significance level, as shown by the appropriate z-test;

(2) the MV method gives the largest bias in the correlation coefficients, the bias increasing the higher the percent of missing values, with a possible effect on the determination of the order of the model;

(3) all methods (except the MWA method) consistently overestimate the serial correlation coefficient of the incomplete series, but not the serial correlation of the actual series, and therefore this is not considered a problem;

(4) the RD method seems to give a correlogram that closely follows the correlogram of the actual series.

Regarding accuracy of the methods, the following can be concluded from Table 5.7:

(1) no method seems to consistently over- or underestimate the missing values at all percent levels, but at high percent levels the missing values are overestimated by all methods;


Table 5.6. Bias in the Lag-One and Lag-Two Correlation Coefficients.

(r_1,e - r_1,a)
        MV      RD      NR      MWA
2%      0.005   0.001   0.002   -0.003
5%      0.006   0.003   0.001   -0.002
10%     0.013   0.014   0.011   0.010
15%     0.033   0.006   0.013   -0.009
20%     0.042   0.004   0.011   -0.012        r_1,a = 0.366

(r_2,e - r_2,a)
2%      -0.004  -0.001  -0.003  -0.003
5%      0.005   -0.001  -0.003  -0.008
10%     0.025   0.032   0.021   0.028
15%     -0.001  -0.002  -0.001  -0.011
20%     0.026   0.022   0.018   0.019         r_2,a = 0.134


Table 5.7. Accuracy: Mean and Variance of the Residuals. N = number of missing values; N_0 = total number of values = 660.

ȳ_r = Σ(y_e - y_a)/N
        MV      RD      NR      MWA     N
2%      -0.043  -0.061  -0.570  -0.589  13
5%      -0.440  0.034   -0.380  -0.176  33
10%     0.007   0.156   -0.113  -0.046  62
15%     -0.175  0.338   0.074   0.105   98
20%     0.037   0.502   0.038   0.200   130

s²_r,e = Σ(y_e - y_a)²/(N - 2)
        MV      RD      NR      MWA
2%      5.037   2.874   3.149   4.585
5%      8.610   3.656   3.411   5.340
10%     7.892   4.239   3.484   5.187
15%     7.620   4.630   3.958   5.816
20%     5.224   4.891   3.681   4.898


Table 5.7. Continued.

s²_r = Σ(y_e - y_a)²/(N_0 - 2)
        MV      RD      NR      MWA
2%      0.084   0.048   0.053   0.077
5%      0.406   0.172   0.161   0.252
10%     0.720   0.387   0.318   0.473
15%     1.112   0.675   0.577   0.849
20%     1.016   0.951   0.716   0.953


(2) the NR method is the most accurate method, especially at high percents of missing values (i.e., it gives the smallest mean and variance of the residuals).

Univariate Model

Model Fitting

Before considering the problem of missing values, the problem of fitting an ARMA(p,q) model to the monthly rainfall series of the south Florida interpolation station will be considered.

The observed rainfall series has been normalized using the square root transformation, and the periodicity has been removed by standardization. The reduced series, approximately normal and stationary, is then modeled by an ARMA(p,q) model. The ACF of the reduced series, as shown in Fig. 5.3, implies a white noise process, since almost all the autocorrelation coefficients (except at lag-3 and lag-12) lie inside the 95 percent confidence limits. Of course, it is unsatisfying to accept the white noise process as the "best" model for our series, and an attempt is made to fit an ARMA(1,1) model to the series. The selection of an ARMA model, and not an AR or MA model, is based on the following reasons:

(1) The observed rainfall series contains important observational errors and so it is assumed to be the sum


[Figure] Fig. 5.3. Autocorrelation function of the normalized and standardized monthly rainfall series of station A, with 95% confidence limits.
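The normalization, deseasonalization and ACF computation described above can be sketched as follows. The gamma-distributed series is a synthetic stand-in for the 660 monthly values, and the 95% limits are the usual ±1.96/√N white-noise bounds:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the 660-value monthly rainfall series.
x = rng.gamma(2.0, 2.0, size=660)

# Normalize with the square root transformation ...
z = np.sqrt(x)

# ... and remove the periodicity by standardizing each calendar month.
z = z.reshape(55, 12)
z = (z - z.mean(axis=0)) / z.std(axis=0, ddof=1)
z = z.ravel()

def acf(series, max_lag):
    """Sample autocorrelation function of a (near) zero-mean series."""
    s = series - series.mean()
    denom = np.sum(s * s)
    return np.array([np.sum(s[:-k] * s[k:]) / denom
                     for k in range(1, max_lag + 1)])

r = acf(z, 24)

# Approximate 95% confidence limits for a white noise process.
ci = 1.96 / np.sqrt(len(z))
```

If nearly all of r falls within ±ci, the reduced series is indistinguishable from white noise at the 5% level, which is the situation Fig. 5.3 shows for the actual data.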


of two series: the "true" series and the observational error series (signal plus noise). Therefore, even if the "true" series obeys an AR process, the addition of the observational error series is likely to produce an ARMA model:

AR(p) + white noise = ARMA(p,p)
AR(p) + AR(q) = ARMA(p+q, max(p,q))    (5.5)
AR(p) + MA(q) = ARMA(p, p+q)

The same can be said if the "true" series is an MA process and the observational error series an AR process, but not if the latter is an MA process or a white noise process:

MA(p) + AR(q) = ARMA(q, p+q)
MA(p) + MA(q) = MA(max(p,q))    (5.6)
MA(p) + white noise = MA(p)

(Granger and Morris, 1976; Box and Jenkins, 1976, Appendix A4.4). It is understood that the addition of any observational series to an ARMA process of the "true" series will again give an ARMA process. For example,

ARMA(p,q) + white noise = ARMA(p,p) if p > q
                        = ARMA(p,q) if p < q    (5.7)


from which it can also be seen that the addition of an observational error may not always change the order of the model of the "true" process.

(2) One other situation that leads exactly, or approximately, to ARMA models is the case of a variable which would obey a simple model such as AR(1) if it were recorded at an interval of K units of time, but which is actually observed at an interval of M units (Granger and Morris, 1976, p. 251).

All these results suggest that a number of real data situations are likely to give rise to ARMA models; therefore, an ARMA(1,1) model will be fitted to the observed monthly rainfall series of the south Florida interpolation station.

The preliminary estimate of φ_1 (equation 3.23) is -0.08163, and the preliminary estimate of θ_1 (equations 3.21 for k = 0, 1, 2) is the solution of the quadratic equation

0.1656 θ_1² + 1.0204 θ_1 + 0.1656 = 0    (5.8)

Only the one root, θ_1 = -0.1667, is acceptable, the second lying outside the unit circle. These preliminary estimates of φ_1 and θ_1 now become the initial values for the determination of the maximum likelihood estimates (MLE). In general, the choice of the starting values of φ and θ does not significantly affect the parameter estimates (Box and Jenkins, 1976, p. 236), but this was not the case for the


[Figure] Fig. 5.4. Sum of squares of the residuals, Σ(â_t²), of an ARMA(1,1) model fitted to the rainfall series of station A (contours over the (φ, θ) plane).
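The preliminary estimate of θ_1 as the invertible root of the quadratic (5.8) can be reproduced directly:

```python
import numpy as np

# Preliminary moment estimate of phi_1 reported in the text.
phi1 = -0.08163

# Equation (5.8): 0.1656 theta^2 + 1.0204 theta + 0.1656 = 0.
roots = np.roots([0.1656, 1.0204, 0.1656])

# Keep only the root inside the unit circle (invertibility);
# the other root is its reciprocal-like counterpart, |root| > 1.
theta1 = roots[np.abs(roots) < 1][0]
```

This reproduces the text's acceptable root θ_1 = -0.1667 and discards the second root, which lies outside the unit circle.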


Table 5.8. Initial Estimates and MLE of the Parameters φ and θ of an ARMA(1,1) Model Fitted to the Rainfall Series of Station A.

        Initial Estimates       Max. Likelihood Estimates
Model   φ        θ              φ        θ
A       -0.0816  0.0            -0.0088  -0.0989
B       -0.0816  -0.1667        -0.3140  -0.4056
C        0.1      0.0            0.0537  -0.0278
D       -0.4     -0.5           -0.4064  -0.4939

south Florida rainfall series under study. In particular, different initial estimates of φ_1 and θ_1 have been tested, and the MLE of the parameters are compared in Table 5.8. The MLE have been calculated using the IMSL subroutine FTMXL, which uses a modified steepest descent algorithm to find the values of φ and θ that minimize the sum of squares of the residuals (Box and Jenkins, 1976, p. 504). The drastic changes in parameter values, together with the idea that the process may be a white noise process, suggest a plot of the sum of squares of the residuals for the visual detection of anomalies. The sum of squares grids and contours are shown in Fig. 5.4. We observe that there is not a well defined point where the sum of squares becomes a minimum, but rather a line (the contour of the value 641) on which the sum of squares has an almost constant value equal to the minimum. In such a case, combinations of parameter values give similar sums of squares of residuals, and a change


in the AR parameter can be nearly compensated by a suitable change in the MA parameter.

From the comparison of the parameters φ and θ (Table 5.8) of the four ARMA(1,1) models, one cannot say that they all correspond to the same process. But this can in fact be illustrated by converting the four models to their "random shock form" (MA(∞) processes) or their "invertible form" (AR(∞) processes). An ARMA(1,1) process

z_t - φ_1 z_{t-1} = a_t - θ_1 a_{t-1}    (5.9)

can also be written as

z_t = [(1 - θ_1 B)/(1 - φ_1 B)] a_t    (5.10)

which can be expanded in the convergent form

z_t = [1 + (φ_1 - θ_1)B + φ_1(φ_1 - θ_1)B² + φ_1²(φ_1 - θ_1)B³ + ...] a_t    (5.11)

provided that the stationarity condition (|φ_1| < 1) is satisfied. Then the four models of Table 5.8 become:


(A) Zₜ = aₜ + 0.090aₜ₋₁ − 0.001aₜ₋₂ + ...
(B) Zₜ = aₜ + 0.092aₜ₋₁ − 0.029aₜ₋₂ + ...    (5.12)
(C) Zₜ = aₜ + 0.082aₜ₋₁ + 0.004aₜ₋₂ + ...
(D) Zₜ = aₜ + 0.088aₜ₋₁ − 0.036aₜ₋₂ + ...

In the same way the ARMA(1,1) model may be written in the "invertible form"

(1 − θ₁B)⁻¹(1 − φ₁B)Zₜ = aₜ    (5.13)

which can be expanded as

[1 − (φ₁ − θ₁)B − θ₁(φ₁ − θ₁)B² − θ₁²(φ₁ − θ₁)B³ − ...]Zₜ = aₜ    (5.14)

given that the invertibility condition (|θ₁| < 1) is satisfied. Then the four models become:

(A) Zₜ = aₜ + 0.090Zₜ₋₁ − 0.009Zₜ₋₂ + ...
(B) Zₜ = aₜ + 0.092Zₜ₋₁ − 0.037Zₜ₋₂ + ...    (5.15)
(C) Zₜ = aₜ + 0.082Zₜ₋₁ − 0.002Zₜ₋₂ + ...
(D) Zₜ = aₜ + 0.088Zₜ₋₁ − 0.043Zₜ₋₂ + ...

From the "random shock" form of the four models (equations 5.12) and from their "invertible form" (equations 5.15) the following remarks can be made:

(1) Although from the comparison of the φ₁ and θ₁ coefficients (Table 5.8) of the four ARMA(1,1) models one cannot say that they all correspond to the same process, the comparison of the MA coefficients (ψ₁, ψ₂, ψ₃, ...) of equations (5.12) or the AR coefficients (π₁, π₂, ...) of equations (5.15) implies that indeed all four models belong to the same process.

(2) Because the nonzero π₂ (and ψ₂) coefficients of the Zₜ₋₂ (and aₜ₋₂) terms, while small, are of similar magnitude to the coefficients π₁ (and ψ₁), one cannot say that the "truncated" AR(1) or MA(1) model will fully describe the time series; instead more terms are needed. On the other hand, we observe that the coefficient so obtained (different for each model) is in the range of 0.082 to 0.090 and is greater than the coefficient that would have been obtained by a direct fitting of an AR(1) model to the series (the latter would be φ₁ = r₁ = 0.0068).

(3) It should also be noted that all the above models fitted to the series give residuals that pass the portmanteau goodness of fit test. As can be seen from equation (5.12), the impulse response function (e.g., the weights ψⱼ applied on the aⱼ's when the model is written in the "random shock form") dies off very quickly in all the models, and there is thus no doubt as to the application of the portmanteau test


(see Appendix A). The values of Q for each model (calculated from equation A.1 using K = 60) are: QA = 67.80, QB = 67.26, QC = 67.73 and QD = 67.39, all smaller than the χ² value with 58 degrees of freedom at a 5% significance level: χ²(58, 5%) = 79.1. It can also be seen that the values of Q for all models are almost equal, suggesting an equally good fit of the series by all four models.

One other interesting question that could be asked is, given a specific ARMA(p,q) model, whether or not it could have arisen from some simpler model. "Simplifications are not always possible as conditions on the coefficients of the ARMA model need to be specified for a simpler model to be realizable" (Granger and Morris, 1976, p. 252). At this stage, with coefficients that are so unstable, it is meaningless to test the four ARMA models for simplification. However, this test will be made after a unique and stable model has been obtained through the following proposed algorithm.

Proposed Estimation Algorithm

The problem of estimation of missing values will be combined with the problem of stabilizing the coefficients of the ARMA(1,1) model in a recursive algorithm which will have solved both problems uniquely upon convergence. The incomplete series (S₀) is filled in with some initial estimates of the missing values (these initial


estimates can simply be the monthly means or even zeroes, as will be shown). Denote by S₁ this initial series. An ARMA(1,1) model is fitted to the series and its coefficients φ₁ and θ₁ are used to update the first estimates of the missing values. For example, suppose that a gap of size k (k missing values) exists in the series S₀:

...  Zₜ₋₁  Zₜ  |  Zₜ₊₁  ...  Zₜ₊ₖ  |  Zₜ₊ₖ₊₁  Zₜ₊ₖ₊₂  ...    (5.16)

where Zₜ₊₁, ..., Zₜ₊ₖ are the initial estimates of the missing values. These values Zₜ₊₁, ..., Zₜ₊ₖ are then replaced by the values Ẑₜ(1), ..., Ẑₜ(k) forecasted by the model, at origin t and for lead times l = 1, ..., k. These forecasts are the minimum mean square error forward forecasts as developed by Box and Jenkins (1976). For an ARMA(1,1) model with coefficients φ₁ and θ₁ the minimum mean square error forecasts Ẑₜ(l) of Zₜ₊ₗ, where l is the lead time, are:

Ẑₜ(1) = φ₁Zₜ − θ₁aₜ,        l = 1
Ẑₜ(l) = φ₁Ẑₜ(l−1),          l = 2, ..., k    (5.17)

from which it can be seen that only the one-step-ahead forecast depends directly on aₜ, and the forecasts at longer lead times are influenced indirectly (Box and Jenkins, 1976, Ch. 5). The forecasting procedure is repeated for the


estimation of all the gaps, and the newly estimated values are used in equations (5.17). These forecasts now become the new estimates of the missing values, and they replace the old estimates, giving the new series S₂. An ARMA(1,1) model is then fitted to the new series and the new coefficients φ₁ and θ₁ are found (different from the previous ones). Then the estimated values (forecasts from the previous model) are replaced by the forecasts from the new model, giving the new series S₃, etc. The procedure is repeated until the model and the series stabilize, in the sense that the parameters φ₁ and θ₁ of the model as well as the estimates of the missing values do not change between successive iterations within a specified tolerance. Schematically the algorithm is presented in Fig. 5.5, where S₀ denotes the incomplete series, M₀ the method used for the initial estimation, Sᵢ the estimated series at the ith iteration, and Mᵢ the model (e.g., the set of parameters φ₁ and θ₁) fitted to the series Sᵢ. The notation Mᵢ ≈ Mᵢ₊₁ and Sᵢ ≈ Sᵢ₊₁ is introduced to denote the stabilization of the model and series respectively after i iterations. The above algorithm will be addressed as RAEMV-U (a recursive algorithm for the estimation of missing values--univariate model).

Application of the Algorithm on the Monthly Rainfall Series

The proposed recursive algorithm (RAEMV-U) has been applied for the estimation of missing monthly rainfall


Fig. 5.5. Recursive algorithm for the estimation of missing values--univariate model (RAEMV-U). Sᵢ denotes the series, and Mᵢ the model, (φ,θ)ᵢ, at the ith iteration.


values in the series of the south Florida interpolation station (station 6038). Different levels of percentage of missing values have been tested, and the results for the 10% and 20% levels are presented herein. Tables 5.9 and 5.10 show the results for the 10% and 20% levels of missing values respectively. The starting series S₀ is the incomplete series (with 10% or 20% of the values missing). Four different methods (MV, RD, NR, and zeroes) have been applied to the incomplete series S₀, providing different starting series for the algorithm. Thus, its dependence on the initial conditions has also been tested.

Results of the Method

From Tables 5.9 and 5.10 the following can be concluded:

(1) The algorithm converges very rapidly and independently of the initial estimates, thus suggesting the convenient replacement of the missing values by zeros to start the algorithm.

(2) The greater the percent of missing values the slower the algorithm converges (6 iterations were needed for the 10% level and 8 for the 20% level to obtain accuracy to the third decimal place), as was expected, since a larger part of the series changes its values at each iteration and thus more iterations are needed to achieve equilibrium.
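The loop just described can be sketched in a few lines of Python. This is a hedged illustration, not the thesis implementation: the steepest-descent maximum likelihood fit is replaced here by a crude grid search over (φ₁, θ₁) minimizing the conditional sum of squares of the residuals, the series is assumed already deseasonalized to zero mean, and the gap updates use the forward forecasts of equation (5.17). All function names are invented for this sketch.

```python
import numpy as np

def css_residuals(z, phi, theta):
    # Residuals of a zero-mean ARMA(1,1): a_t = z_t - phi*z_{t-1} + theta*a_{t-1}
    a = np.zeros_like(z)
    for t in range(1, len(z)):
        a[t] = z[t] - phi * z[t - 1] + theta * a[t - 1]
    return a

def fit_arma11(z, grid=np.linspace(-0.9, 0.9, 13)):
    # Crude grid-search stand-in for the steepest-descent MLE used in the thesis.
    best = (np.inf, 0.0, 0.0)
    for phi in grid:
        for theta in grid:
            s = float(np.sum(css_residuals(z, phi, theta) ** 2))
            if s < best[0]:
                best = (s, phi, theta)
    return best[1], best[2]

def gaps(missing):
    # Group sorted missing indices into [start, length] runs.
    runs = []
    for i in sorted(missing):
        if runs and i == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return runs

def raemv_u(z, missing, tol=1e-3, max_iter=20):
    # RAEMV-U sketch: fill gaps, refit, re-forecast until (phi, theta) stabilize.
    # Assumes the series does not begin with a gap.
    z = np.asarray(z, dtype=float).copy()
    z[list(missing)] = 0.0                     # zeros suffice as starting estimates
    phi = theta = np.inf
    for _ in range(max_iter):
        phi_new, theta_new = fit_arma11(z)
        a = css_residuals(z, phi_new, theta_new)
        for start, k in gaps(missing):
            # eq. (5.17): one-step forecast from origin start-1, then phi * previous
            zhat = phi_new * z[start - 1] - theta_new * a[start - 1]
            for j in range(k):
                z[start + j] = zhat
                zhat = phi_new * zhat
        if abs(phi_new - phi) < tol and abs(theta_new - theta) < tol:
            break
        phi, theta = phi_new, theta_new
    return z, phi_new, theta_new
```

With a coarse grid the parameters "stabilize" as soon as the same grid point is selected twice; the thesis instead iterates full maximum likelihood fits to a tolerance in the third decimal place.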


Table 5.9. Results of the RAEMV-U Applied at the 10% Level of Missing Values. Upper value is φ₁; lower value is θ₁.

Table 5.10. Results of the RAEMV-U Applied at the 20% Level of Missing Values. Upper value is φ₁; lower value is θ₁.

      M₀ =   MV        RD        NR        Zeroes
M1         0.0954    0.5023    0.5021    0.5756
          -0.0069    0.4173    0.4159    0.2587
M2         0.0738    0.1167    0.1189    0.2926
          -0.0344   -0.0311   -0.0289    0.1187
M3         0.0789    0.0369    0.0377    0.0762
          -0.0276   -0.0693   -0.0688   -0.0458
M4         0.0774    0.0910    0.0908    0.0526
          -0.0296   -0.0125   -0.0128   -0.0503
M5         0.0778    0.0745    0.0746    0.0863
          -0.0291   -0.0334   -0.0333   -0.0184
M6         0.0777    0.0786    0.0786    0.0756
          -0.0292   -0.0281   -0.0281   -0.0319
M7         0.0777    0.0775    0.0775    0.0783
          -0.0292   -0.0295   -0.0295   -0.0285
M8         0.0777    0.0778    0.0778    0.0776
          -0.0292   -0.0291   -0.0291   -0.0293


(3) For a specific percent of missing values the algorithm converges to the same point (e.g., same model and same series) independently of the initial estimates of the missing values.

(4) For a different percent of missing values the same series converges to a "different" point (e.g., "different" model and "different" series). This was expected, since the constant information in the system (the existing values) is different in each case, and thus a different model describes it better.

Diagnostic checking on the residuals from the two final models is performed using the portmanteau goodness of fit test. Denote the two models (at the 10% and 20% levels) by M-U10 and M-U20 respectively, the U denoting that a univariate model has been fitted to the series. Then

M-U10: φ₁ = 0.5095, θ₁ = 0.4333
M-U20: φ₁ = 0.0777, θ₁ = −0.0292    (5.18)

The values of Q for each model are Q(M-U10) = 26.54 and Q(M-U20) = 30.22 (calculated by equation A.1 using K = 30), which are both smaller than the χ² value with 28 degrees of freedom at a 5% significance level: χ²(28, 5%) = 41.3. Notice also that Q(M-U10) < Q(M-U20), indicating that the final model fitted to the series when 10% of the values were missing fits better than the model fitted when 20% of the values were missing, as expected.
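The portmanteau statistic quoted above can be reproduced as follows. Equation (A.1) is not shown in this excerpt, so the Box-Pierce form Q = n Σ r_k²(a), summed over lags k = 1, ..., K of the residual autocorrelations, is assumed here:

```python
import numpy as np

def portmanteau_q(residuals, K=30):
    # Box-Pierce portmanteau statistic: Q = n * sum_{k=1..K} r_k(a)^2
    a = np.asarray(residuals, dtype=float)
    a = a - a.mean()
    n = len(a)
    c0 = float(np.sum(a * a))
    q = 0.0
    for k in range(1, K + 1):
        rk = float(np.sum(a[k:] * a[:-k])) / c0   # lag-k residual autocorrelation
        q += rk * rk
    return n * q
```

For an adequate ARMA(p,q) model, Q is approximately chi-square distributed with K − p − q degrees of freedom, which is how the quoted comparison Q(M-U10) = 26.54 < 41.3 with 30 − 1 − 1 = 28 degrees of freedom is made.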


Also, now that the final ARMA(1,1) model is stable, we can ask the question "can it be simplified to an AR(1) plus white noise?" For an ARMA(1,1) process the simplification condition is (Granger and Morris, 1976)

1 > θ₁/φ₁ ≥ 0    (5.19)

Substituting the parameter values of (5.18) for the two models (equations 5.20) gives

M-U10: 0.716 > 0
M-U20: 0.994 > −0.375 ≯ 0    (5.21)

Although the first model barely satisfies the condition for simplification, the second model does not, implying that an AR(1) process cannot describe the series as well as an ARMA(1,1) process. This result justifies the selection of the ARMA(1,1) model for this rainfall series.

The statistical properties of the two final series (from the 10% and 20% missing values) have also been computed and are shown in Table 5.11 together with the ones


of the actual series. The monthly statistics are also shown in Table C.13 (appendix C).

Table 5.11. Statistics of the Actual Series (ACT) and the Two Estimated Series (UN10, UN20).

         ȳ        s        cv (%)    skew     r₁       r₂
ACT     4.126    3.673    89.04     1.332    0.366    0.134
UN10    4.105    3.609    87.920    1.354    0.384    0.157
UN20    4.043    3.492    86.381    1.373    0.410    0.160

Table 5.12 shows the bias in the mean, standard deviation, and lag-one correlation coefficient, so that the statistical closeness of the estimated series to the actual one can be evaluated. The bias in the mean and correlation coefficient is not significant at the 5% significance level; however, the bias in the standard deviation does not pass the stringent F-test (requiring exact equality of standard deviations) and thus is significant.
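The bias measures reported in Table 5.12 can be computed directly from an estimated and an actual series; a minimal sketch (the subscripts e and a denote estimated and actual, and the function names are ours):

```python
import numpy as np

def lag1_corr(x):
    # Lag-one serial correlation coefficient r1 of a series.
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    return float(np.sum(xm[1:] * xm[:-1]) / np.sum(xm * xm))

def bias_stats(estimated, actual):
    # Bias measures of Table 5.12: mean difference, std-dev ratio, r1 difference.
    e, a = (np.asarray(s, dtype=float) for s in (estimated, actual))
    return (e.mean() - a.mean(),
            e.std(ddof=1) / a.std(ddof=1),
            lag1_corr(e) - lag1_corr(a))
```

Applied to identical series the three measures are 0, 1, and 0; the tabulated values measure how far each estimated series departs from that ideal.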


Table 5.12. Bias in the Mean, Standard Deviation and Serial Correlation Coefficient--Univariate Model.

        ȳe − ȳa    se/sa    r1,e − r1,a
UN10    -0.021     0.983      0.018
UN20    -0.083     0.951      0.044

Remarks

1. The forecasting procedure utilized for the estimation is the minimum mean square forward forecasting procedure of Box and Jenkins (1976). Damsleth (1980) introduced the method of optimal between-forecasts, combining the forward forecasts and backforecasts into between-forecasts with a minimum mean square error. He showed that the gain in forecast error by between-forecasting as compared to forward forecasting (or back-forecasting) an ARMA(1,1) model is proportional to φ₁ᵏ⁺¹, where k is the size of the gap. Thus the gain rapidly becomes small, unless |φ₁| is very close to one and the size of the gap is very small. He also showed that the gain from between-forecasting can be substantial when θ is negative. Finally he concluded that "the reduction in forecast error variance by using this between-forecasting method is not very great for stationary series, but may be substantial when the


series is non-stationary" (Damsleth, 1980, p. 39). In our case, the use of the more complicated between-forecasting procedure does not seem to be justified. It has been shown that the simple Box-Jenkins forecasts work satisfactorily, in the sense that rapid convergence to a "statistically acceptable" series occurs.

2. It is interesting to note that when the final estimates of the model (parameters of equations 5.18) are provided as initial estimates, the maximum likelihood estimates (calculated by a steepest descent algorithm) are equal to the initial estimates provided. This emphasizes the "uniqueness" of the stable model achieved by the proposed recursive algorithm.

3. It will also be interesting to check the threshold level of percent of missing values at which the algorithm starts to diverge. This is expected to happen at some level of percent of missing values (probably greater than 50%), when too much information in the system is changing at each iteration. At such high percents of missing values a more elaborate testing of the final model may also be needed.

Bivariate Model

Model Fitting

The lag-one multivariate autoregressive model of equation (4.3), suggested by Matalas (1967), preserves the


lag-zero and lag-one auto- and cross-correlations. When applied to two stations the model reduces to the bivariate Markov model:

Zₜ = AZₜ₋₁ + Bηₜ    (5.22)

where the matrix B is a lower triangular matrix, as suggested by Young (1968). The above model has been extensively used for the simultaneous generation of hydrologic series at two sites. An attempt will be made herein to show how the above model can be used for the estimation of the missing values in one or both of the time series. A recursive algorithm analogous to the one proposed for the univariate case will be presented. The special case that will be considered is the estimation of the missing values in the series of station 1, given the complete, concurrent, equal-length series of station 2. As has been extensively discussed in Chapter 4, incomplete data sets may result in inconsistent covariance matrices, resulting in generated rainfall values that contain complex numbers. Therefore the incomplete series of station 1 is first completed by the use of a simple estimation method (e.g., MV, RD, NR, or even replacement of missing values by zeroes), giving the complete series S₁. Denote by S the complete and known series of station 2.


Then a bivariate AR(1) model is fitted to the series S₁ and S. Actually the model, as in the univariate case, is fitted to the residual series, e.g., the normalized and standardized series. The following procedure is followed for the estimation of the parameters (matrices A and B) of the model. The lag-zero and lag-one correlation matrices, M₀ and M₁, of the residual series are computed:

M₀ = | 1        r₁₂(0) |        M₁ = | r₁₁(1)   r₁₂(1) |    (5.23)
     | r₂₁(0)   1      |             | r₂₁(1)   r₂₂(1) |

Then matrix A is given directly by the multiplication of the matrices M₁ and M₀⁻¹ (equation B.8 of appendix B), and matrix C is computed from equation (B.13). Matrix B is given from the solution of the equation BBᵀ = C, which in the case of B being a lower triangular matrix reduces to the direct calculation of the elements of B from equations (B.19).

Proposed Estimation Algorithm

An algorithm analogous to the one for the univariate case is also proposed for the bivariate case. After the incomplete series S₀ has been completed with a simple method, a bivariate AR(1) model is fitted to the complete series S₁ and S, as described earlier. The parameter matrices A and B of the fitted model M₁ = (A,B)₁ are then used to construct new estimates for the "missing" values in the series S₁. From equation (5.22) we can write that:


Z₁,ₜ = a₁₁Z₁,ₜ₋₁ + a₁₂Z₂,ₜ₋₁ + b₁₁η₁,ₜ    (5.24)
Z₂,ₜ = a₂₁Z₁,ₜ₋₁ + a₂₂Z₂,ₜ₋₁ + b₂₁η₁,ₜ + b₂₂η₂,ₜ    (5.25)

Since the second series is complete and known, equation (5.25) is ignored and only equation (5.24) is considered. Following the Box-Jenkins forecasting procedure, the minimum mean square error forecasts Ẑ₁,ₜ(l) of Z₁,ₜ₊ₗ, where l is the lead time, are

Ẑ₁,ₜ(1) = a₁₁Z₁,ₜ + a₁₂Z₂,ₜ,                    l = 1
Ẑ₁,ₜ(l) = a₁₁Ẑ₁,ₜ(l−1) + a₁₂Z₂,ₜ₊ₗ₋₁,           l = 2, 3, ..., k    (5.26)

where k is the number of values missing in each gap. The forecasting procedure is repeated for the estimation of all the gaps, always using the newly estimated values in equations (5.26). These forecasts then become the new estimates of the missing values, and they replace the old estimates in the series, giving the new series S₂. A bivariate AR(1) model is then fitted to the series S₂ and S; denote this new model by M₂ = (A,B)₂, which is used in the same way as before to update the estimates. The procedure is repeated until convergence occurs, in the sense that neither the model Mᵢ nor the series Sᵢ at the ith iteration changes between iterations within a specified tolerance (Mᵢ ≈ Mᵢ₊₁ and Sᵢ ≈ Sᵢ₊₁).
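The parameter-estimation step can be sketched with numpy. Equations (B.8), (B.13), and (B.19) are not reproduced in this excerpt, so the standard relations for the Matalas model are assumed here: A = M₁M₀⁻¹, C = M₀ − AM₁ᵀ, and B the lower-triangular Cholesky factor of C. With the converged correlation matrices of Table 5.13 (M₀ = MV, i = 4) these relations reproduce the tabulated (A, B) to within rounding:

```python
import numpy as np

def matalas_ab(M0, M1):
    # Parameters of Z_t = A Z_{t-1} + B eta_t from lag-0/lag-1 correlation matrices.
    A = M1 @ np.linalg.inv(M0)       # equation B.8
    C = M0 - A @ M1.T                # equation B.13 (assumed form)
    B = np.linalg.cholesky(C)        # lower triangular, so B @ B.T == C
    return A, B

# Converged matrices (i = 4, M0 = MV) from Table 5.13:
M0 = np.array([[1.0, 0.025], [0.025, 1.0]])
M1 = np.array([[0.049, 0.201], [0.068, 0.315]])
A, B = matalas_ab(M0, M1)
# Compare with Table 5.13: A ≈ [[0.044, 0.200], [0.061, 0.314]],
#                          B ≈ [[0.979, 0.   ], [-0.042, 0.946]]
```

The Cholesky factorization is what the "direct calculation of the elements of B" amounts to when B is restricted to lower-triangular form.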


Schematically the recursive algorithm for the estimation of missing values--bivariate model--1 station to be estimated (RAEMV-B1) is shown in Fig. 5.6. The algorithm can be generalized to the case where a multivariate model of, say, K stations is used to estimate the missing values of L incomplete stations, where L ≤ K. Such a generalized algorithm can be economically written as RAEMV-MK.L. The algorithm for the case of a bivariate model with both records incomplete, e.g., two series to be estimated (RAEMV-B2 or, in the general form, RAEMV-M2.2), is illustrated in Fig. 5.7. The notation is the same as before, but two subscripts are now used for the series S, the first denoting the station (1 or 2) and the second denoting the iteration i (i = 1, ...). In this case both equations (5.24) and (5.25) would be needed for the estimation of missing values existing in both series.
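The gap-updating step of RAEMV-B1 follows directly from equation (5.26); a sketch under the text's assumption that station 2 is complete, so its actual values are used at every lead time and only the station-1 series is forecast recursively (the noise term b₁₁η₁,ₜ drops out of the minimum mean square error forecast):

```python
import numpy as np

def fill_gap_bivariate(z1, z2, start, k, A):
    # Replace z1[start:start+k] with the forecasts of eq. (5.26),
    # forecast origin t = start - 1; z2 is the complete station-2 series.
    a11, a12 = A[0, 0], A[0, 1]
    zhat = a11 * z1[start - 1] + a12 * z2[start - 1]   # lead time l = 1
    for j in range(k):
        z1[start + j] = zhat
        zhat = a11 * zhat + a12 * z2[start + j]        # leads l = 2, ..., k
    return z1
```

In the RAEMV-B2 case, both rows of A (and of B, through the residuals) would be needed, since equations (5.24) and (5.25) must be forecast jointly.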


Fig. 5.6. Recursive algorithm for the estimation of missing values--bivariate model--1 station to be estimated (RAEMV-B1). Sᵢ denotes the series, and Mᵢ the model, (A,B)ᵢ, at the ith iteration.


Fig. 5.7. Recursive algorithm for the estimation of missing values--bivariate model--2 stations to be estimated (RAEMV-B2). S₁,ᵢ (S₂,ᵢ) denotes the series of station 1 (station 2), and Mᵢ the model, (A,B)ᵢ, at the ith iteration.


way as for the univariate case, e.g., by providing different initial series estimated by four different methods M₀ (MV, RD, NR, and zeroes). Tables 5.13 and 5.14 show the cross-correlation matrices M₀ and M₁ at each iteration i and the model Mᵢ = (A,B)ᵢ. It is interesting to follow the changes of the cross-correlation coefficients at each iteration. Also notice that the autocorrelation coefficient (see equation 5.23) of the first series changes at each iteration (since new estimates of the missing values replace the old ones), but the autocorrelation coefficient of the second series remains unchanged (since the second series is complete and known).

From Tables 5.13 and 5.14 the following conclusions, similar to the univariate case, can be drawn:

(1) The algorithm converges rapidly, independently of the starting point (initial series). Thus, initial estimation of the missing values is not needed, and they may as well be replaced by zeroes.

(2) The convergence seems to be less sensitive to the percent of values missing, since at both the 10% and 20% levels convergence has been achieved in three to four iterations.

(3) For a specific percent of missing values the algorithm converges to the same point (e.g., same model, same series, and same correlation matrices) independently of the initial estimates of the missing values.


Table 5.13. Results of the RAEMV-B1 Applied at the 10% Level of Missing Values. (Two rows per iteration i: the rows of M₀, M₁, A, and B.)

M₀ = MV
i    M₀                M₁                A                 B
1    1.      0.330     0.004   0.137    -0.046   0.152    0.990   0.
     0.330   1.        0.042   0.315    -0.070   0.338    0.286   0.902
2    1.     -0.005     0.038   0.194     0.039   0.194    0.980   0.
    -0.005   1.        0.065   0.315     0.067   0.316   -0.071   0.944
3    1.      0.025     0.049   0.202     0.044   0.201    0.978   0.
     0.025   1.        0.069   0.315     0.061   0.314   -0.043   0.946
4    1.      0.025     0.049   0.201     0.044   0.200    0.979   0.
     0.025   1.        0.068   0.315     0.061   0.314   -0.042   0.946

M₀ = RD
1    1.      0.554     0.124   0.249    -0.021   0.261    0.968   0.
     0.554   1.        0.201   0.315     0.038   0.294    0.492   0.811
2    1.      0.026     0.042   0.196     0.037   0.195    0.980   0.
     0.026   1.        0.070   0.315     0.062   0.314   -0.039   0.946
3    1.      0.025     0.048   0.201     0.043   0.200    0.979   0.
     0.025   1.        0.069   0.315     0.061   0.314   -0.042   0.946
4    1.      0.025     0.049   0.201     0.044   0.200    0.979   0.
     0.025   1.        0.068   0.315     0.061   0.314   -0.042   0.946

M₀ = NR
1    1.      0.543     0.126   0.261    -0.022   0.273    0.965   0.
     0.543   1.        0.187   0.315     0.022   0.303    0.478   0.819
2    1.     -0.002     0.046   0.199     0.046   0.199    0.979   0.
    -0.002   1.        0.069   0.315     0.070   0.316   -0.070   0.944
3    1.      0.026     0.050   0.203     0.045   0.201    0.978   0.
     0.026   1.        0.069   0.315     0.061   0.314   -0.042   0.946
4    1.      0.025     0.049   0.202     0.044   0.200    0.978   0.
     0.025   1.        0.068   0.315     0.061   0.314   -0.042   0.946

M₀ = zeroes
1    1.      0.258     0.463   0.172     0.448   0.057    0.885   0.
     0.258   1.        0.048   0.315    -0.036   0.385    0.247   0.915
2    1.      0.042     0.061   0.225     0.059   0.222    0.973   0.
     0.042   1.        0.081   0.315     0.068   0.313   -0.033   0.946
3    1.      0.029     0.048   0.203     0.042   0.201    0.978   0.
     0.029   1.        0.070   0.315     0.061   0.314   -0.038   0.946
4    1.      0.025     0.049   0.201     0.043   0.200    0.979   0.
     0.025   1.        0.068   0.315     0.061   0.314   -0.042   0.946


Table 5.14. Results of the RAEMV-B1 Applied at the 20% Level of Missing Values. (Two rows per iteration i: the rows of M₀, M₁, A, and B.)

M₀ = MV
i    M₀                M₁                A                 B
1    1.      0.523     0.342   0.251     0.290   0.100    0.936   0.
     0.523   1.        0.257   0.315     0.126   0.249    0.446   0.831
2    1.     -0.025     0.369   0.307     0.377   0.316    0.874   0.
    -0.025   1.        0.256   0.315     0.264   0.322   -0.253   0.876
3    1.     -0.023     0.389   0.333     0.393   0.337    0.857   0.
    -0.012   1.        0.253   0.315     0.257   0.319   -0.255   0.877
4    1.     -0.012     0.389   0.332     0.393   0.337    0.858   0.
    -0.012   1.        0.253   0.315     0.257   0.319   -0.254   0.877

M₀ = RD
1    1.      0.588     0.320   0.290     0.228   0.156    0.939   0.
     0.588   1.        0.262   0.315     0.117   0.246    0.510   0.795
2    1.     -0.012     0.368   0.315     0.375   0.383    0.872   0.
    -0.023   1.        0.257   0.315     0.264   0.321   -0.254   0.875
3    1.     -0.012     0.388   0.334     0.392   0.338    0.857   0.
    -0.012   1.        0.253   0.315     0.257   0.319   -0.255   0.877
4    1.     -0.012     0.388   0.333     0.393   0.337    0.858   0.
    -0.012   1.        0.253   0.315     0.257   0.318   -0.254   0.877

M₀ = NR
1    1.      0.611     0.324   0.273     0.252   0.119    0.941   0.
     0.611   1.        0.279   0.315     0.137   0.232    0.534   0.777
2    1.     -0.022     0.372   0.311     0.379   0.320    0.872   0.
    -0.022   1.        0.258   0.315     0.265   0.321   -0.253   0.875
3    1.     -0.012     0.389   0.333     0.393   0.338    0.857   0.
    -0.012   1.        0.253   0.315     0.257   0.319   -0.255   0.877
4    1.     -0.012     0.389   0.332     0.393   0.337    0.857   0.
    -0.012   1.        0.253   0.315     0.257   0.318   -0.254   0.877

M₀ = zeroes
1    1.      0.321     0.601   0.201     0.599   0.009    0.799   0.
     0.321   1.        0.195   0.315     0.104   0.282    0.253   0.909
2    1.      0.006     0.423   0.340     0.421   0.337    0.841   0.
     0.006   1.        0.228   0.315     0.226   0.314   -0.233   0.892
3    1.     -0.012     0.392   0.332     0.397   0.337    0.856   0.
    -0.013   1.        0.249   0.315     0.253   0.319   -0.255   0.878
4    1.     -0.013     0.390   0.333     0.394   0.338    0.857   0.
    -0.013   1.        0.253   0.315     0.257   0.319   -0.255   0.877


(4) For a different percent of missing values the same series converges to a "different" point, but this is reasonable and expected, since the constant information (the existing values in the series) is different in each case, and a different model thus describes it better.

The statistical properties of the two final series (from the 10% and 20% missing values) are shown in Table 5.15 together with the ones of the actual series. The monthly statistics are also shown in Table C.14 (appendix C). Table 5.16 shows the statistical closeness of the two estimated series to the actual one. Again, the bias in the mean and correlation coefficient is not significant at the 5% significance level, but the bias in the standard deviation is.

Table 5.15. Statistics of the Actual Series (ACT) and the Two Estimated Series (B10 and B20).

        ȳ        s        cv (%)    skew     r₁       r₂
ACT    4.126    3.673    89.04     1.332    0.366    0.134
B10    4.096    3.610    88.132    1.358    0.382    0.162
B20    4.077    3.523    86.421    1.341    0.416    0.165


Table 5.16. Bias in the Mean, Standard Deviation and Serial Correlation Coefficient--Bivariate Model.

       ȳe − ȳa    se/sa    r1,e − r1,a
B10    -0.030     0.983      0.016
B20    -0.049     0.959      0.050


CHAPTER 6
CONCLUSIONS AND RECOMMENDATIONS

Summary and Conclusions

The objective of this study was to compare and evaluate different methods for the estimation of missing observations in monthly rainfall series. The estimation methods studied reflect three basic ideas:

(1) the use of regional information in four simple techniques:
- mean value method (MV)
- reciprocal distance method (RD)
- normal ratio method (NR)
- modified weighted average method (MWA);

(2) the use of a univariate stochastic (ARMA) model that describes the time correlation of the series;

(3) the use of a multivariate stochastic (ARMA) model that describes the time and space correlation of the series.

An algorithm for the recursive estimation of the missing values in a time series using the fitted univariate or multivariate ARMA model has been proposed and demonstrated. Apparently, the idea of the recursive estimation of missing values is known (Orchard and Woodbury, 1972; Beale and


Little, 1974), as well as the idea of using the fitted model to directly derive the estimates (Brubacher and Wilson, 1976; Damsleth, 1979). However, it appears that a method which combines the above two ideas simultaneously, in a recursive estimation of the missing values with parallel updating of the model, has not been used before. The proposed algorithm is general and can be used for the estimation of the missing values in any series that can be described by an ARMA model.

On the basis of the data from the four south Florida rainfall stations used in the analysis, the following conclusions can be drawn:

(1) All the simplified estimation techniques give unbiased (overall and monthly) means and correlation coefficients at the 5% significance level, even for as high as 20% missing values.

(2) At high percentages of missing values (greater than 10%) the MV method gives the most biased (although not significantly so) correlation coefficients.

(3) All methods give a slightly biased overall variance but unbiased monthly variances at the 5% significance level, and the MV method gives the most biased variances for all percentages of missing values.

(4) The NR method gives the most and the MV the least accurate estimates, at almost all levels of percent missing values.


(5) The proposed recursive algorithm works satisfactorily in both the univariate and bivariate cases. It converges rapidly and independently of the initial estimates, and gives unbiased means and correlation coefficients at the 5% significance level.

(6) The use of a bivariate model as compared to a univariate one did not improve the estimates, except for a slight improvement at 20% missing values. However, the use of a multivariate model based on three or four nearby stations is expected to give much better estimates. The use of three adjacent stations is the main reason for the better performance of the NR method over the more sophisticated univariate and bivariate ARMA models, which use only zero and one additional stations.

If the purpose of estimation is to calculate the historical statistics of the series (e.g., mean, standard deviation, and autocorrelations), the selection of the method matters little, and the simplest one may be chosen. However, if it is desired to fit an ARMA model to the incomplete series, to be used, say, to construct forecasts, the estimation of the missing values and the parameters of the model by the proposed recursive algorithm is recommended. In this case the equilibrium state (i.e., final series and parameters of the model) achieved upon convergence is unique, depending only on the existing information in the


system (available data) and not on any external information added to the system (by the replacement of the missing values with some derived estimates). The only assumption made is that the order of the ARMA model to be fitted to the series is known. In practical situations this is seldom a problem, since the order can be determined from the complete part of the series or from a series with similar characteristics. For example, if an ARMA(1,1) model is known to fit the monthly rainfall series well at a couple of nearby stations, there is little doubt that it will fit the incomplete monthly rainfall series equally well at the station of interest. Upon convergence, the recursive algorithm then gives the "best" estimates of the parameters of the model.

Further Research

Further research should include:

(1) application of the simple estimation techniques in short records, where the biases may be significant for the methods with the poorer performance;

(2) testing of the sensitivity of the recursive algorithm to the selection of the model (order of the model) when more than one model fits the data equally well;

(3) derivation of the threshold percent of missing values after which the algorithm diverges;

(4) application to the estimation of missing values in other hydrological series, e.g., runoff;


(5) trials of different forecasting procedures and determination of the improvements obtained by the "between-forecasting procedure" in cases of a large number of single-value gaps, e.g., use of the average of a backwards and a forwards ARMA model forecast;

(6) application of the concept of "missing values" for the estimation of erroneous values or outliers in a series, to avoid errors when using the data, say, to construct forecasts; and

(7) estimation of values in a series that are affected by unusual circumstances, thereby permitting a measure of the magnitude of the unusual circumstance and the estimation of the effect of similar circumstances in the future (e.g., effect of a drought on water supply).


APPENDIX A
DEFINITIONS

1. Strict stationarity

A stochastic process is said to be strictly stationary if its statistics (e.g., mean, variance, serial correlation) are not affected by a shift in the time origin, that is, if the joint probability distribution associated with n observations (z₁, z₂, ..., zₙ)ₜ made at time origin t is the same as that associated with n observations (z₁, z₂, ..., zₙ)ₜ₊ₖ made at time origin t+k. In other words, z(t) is a strictly stationary process when the two processes z(t) and z(t+k) have the same statistics for any k.

2. Weak stationarity

Weak stationarity of order f is when the moments of the process up to order f depend only on time differences. Usually by weak stationarity we refer to second order stationarity, e.g., fixed mean and an autocovariance matrix that depends only on time differences (i.e., lags).

3. Gaussian process

If the probability distribution associated with any set of times is a multivariate normal distribution, the process


is called a normal or Gaussian process. Since the multivariate normal distribution is fully described by its first and second order moments, it follows that weak stationarity and an assumption of normality imply strict stationarity.

4. Non-stationarity

A stochastic process is said to be nonstationary if its statistical characteristics change with time. A homogeneous nonstationary process of order d is a process for which the dth difference ∇ᵈZₜ is a stationary process. For example, a first order homogeneous nonstationary process is one that exhibits homogeneity apart from a constant level (e.g., a linear trend), and a second order nonstationary process is one that exhibits homogeneity apart from a constant level and slope (e.g., a parabolic trend).

5. Circular stationarity

A stochastic process is said to be circularly stationary with period T if the multivariate probability distribution of T observations (z₁, z₂, ..., z_T)ₜ made at time origin t is the same as that associated with T observations (z₁, z₂, ..., z_T)ₜ₊Tₖ made at time origin t+Tk, for k = 1, 2, .... For example, a monthly hydrologic series has a period of 12 months, i.e., T = 12, and circular stationarity suggests that the


probability distribution of a value of a particular month is the same for all the years.

6. Stationarity condition

A linear process can always be written in the random shock form

    z_t = ψ(B) a_t        (A.1)

where B is the backward shift operator defined by B z_t = z_{t-1}, and hence B^m z_t = z_{t-m};

    ψ(B) = 1 + ψ_1 B + ψ_2 B^2 + ...        (A.2)

is the so-called transfer function of the linear system and the generating function of the ψ weights. For the process to be stationary, the ψ weights must satisfy the condition that ψ(B) converges on or within the unit circle, i.e., for all |B| ≤ 1.

7. Invertibility condition

The above model may also be written in the inverted form

    a_t = ψ^{-1}(B) z_t        (A.3)

or

    π(B) z_t = a_t        (A.4)


where π(B) is the generating function of the π weights. For the process to be invertible, the π weights must satisfy the condition that π(B) converges on or within the unit circle, that is, for all |B| ≤ 1. The invertibility condition is independent of the stationarity condition and is applicable also to the nonstationary linear models. The requirement of invertibility is needed in order to associate the present values of the process with the past values in a reasonable manner, as will be shown below.

8. Duality between AR and MA processes

In a stationary AR(p) process, a_t can be represented as a finite weighted sum of previous z̃'s (where z̃_t denotes the deviation of z_t from its mean),

    a_t = φ(B) z̃_t = z̃_t - φ_1 z̃_{t-1} - ... - φ_p z̃_{t-p}        (A.6)

or z̃_t as an infinite weighted sum of previous a's,

    z̃_t = φ^{-1}(B) a_t        (A.7)

Also, in an invertible MA(q) process, z̃_t can be represented as a finite weighted sum of previous a's,

    z̃_t = θ(B) a_t = a_t - θ_1 a_{t-1} - ... - θ_q a_{t-q}        (A.8)


or a_t as an infinite weighted sum of previous z̃'s,

    a_t = θ^{-1}(B) z̃_t        (A.9)

In other words, a finite AR process is equivalent to an infinite MA process, and a finite MA process to an infinite AR process. This principle of duality has further aspects; e.g., there is an inverse relationship between the autocorrelation and partial autocorrelation functions of AR and MA processes.

9. Physical interpretation of stationarity and invertibility

Consider an AR(1) process (1 - φ_1 B) z_t = a_t. For this process to be stationary, the root of the polynomial 1 - φ_1 B = 0 must lie outside the unit circle, which implies that |B| = |1/φ_1| must be greater than one, or |φ_1| < 1. The process can also be written

    z_t     = φ_1 z_{t-1} + a_t
    z_{t+1} = φ_1^2 z_{t-1} + φ_1 a_t + a_{t+1}        (A.10)
    z_{t+2} = φ_1^3 z_{t-1} + φ_1^2 a_t + φ_1 a_{t+1} + a_{t+2}
    etc.

When |φ_1| > 1 (or |φ_1| = 1) the effect of the past on the present value of the time series increases (or stays the


same) as the series moves into the future. Only when |φ_1| < 1 (stationary process) does the effect of the past on the present decrease the further we move into the past, which is a reasonable and acceptable hydrologic fact (Delleur and Kavvas, 1978).

Consider now an MA(1) process z_t = (1 - θ_1 B) a_t. The invertibility condition implies that |θ_1| < 1. The process can also be written in the form

    a_t = (1 - θ_1 B)^{-1} z_t        (A.11)

where the polynomial (1 - θ_1 B)^{-1} can be expanded in an infinite convergent series only if |θ_1| < 1. To illustrate the need for invertibility, let us assume that |θ_1| > 1. Then (A.11) can be written as

    a_t = - (θ_1 B)^{-1} [1 - (θ_1 B)^{-1}]^{-1} z_t        (A.12)

and since |1/θ_1| < 1, it can be expanded to the form

    a_t = - (1/(θ_1 B) + 1/(θ_1^2 B^2) + 1/(θ_1^3 B^3) + ...) z_t        (A.13)

or


    a_t = - (z_{t+1}/θ_1 + z_{t+2}/θ_1^2 + z_{t+3}/θ_1^3 + ...)        (A.14)

which implies that future values are used to generate the present values. It becomes clear that the invertibility condition is required in order to assure hydrologic realizability.

10. The portmanteau lack of fit test

The portmanteau lack of fit test (Box and Jenkins, 1976, Ch. 8) considers the first K autocorrelations r_k(â), k = 1, 2, ..., K, of the fitted residual series â of an ARIMA(p,d,q) process, to detect inadequacy of the model. It can be shown (Box and Pierce, 1970) that, if the fitted model is appropriate,

    Q = (N - d) Σ_{k=1}^{K} r_k^2(â)        (A.15)

is approximately distributed as χ²(K-p-q), where K-p-q is the number of degrees of freedom, N is the total length of the series, and (N-d) is the number of observations used to fit the model. The adequacy of the model may be checked by comparing Q with the theoretical chi-square value χ²(K-p-q) at a given significance level. If Q < χ²(K-p-q), then a_t is an independent series and the model is adequate; otherwise the model is inadequate.


For the choice of K, Box and Jenkins suggest it to be "sufficiently large so that the weights ψ_j in the model, written in the form

    z̃_t = ψ(B) a_t = Σ_{j=0}^{∞} ψ_j a_{t-j}        (A.16)

will be negligibly small after j = K" (Box and Jenkins, 1976, p. 221). The IMSL subroutine FTCMP (IMSL 0007, Ch. F) uses a value of K equal to N/10 + p + q to perform the portmanteau test.

Ozaki (1977) points out that "for the application of the portmanteau test, fast dying off of the impulse response function (the ψ_j weights of the model) is a necessary condition" (Ozaki, 1977, p. 298). In cases where the impulse response function dies off rather slowly (possibly due to the near-nonstationarity of the model) when compared with the length of the series, the applicability of the portmanteau test is doubtful, since the autocorrelations of the residuals may not be reliable at large lags.
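As a concrete illustration, the Q statistic of equation (A.15) can be computed directly from the fitted residuals. The following is a minimal sketch using numpy and scipy; the function name and the white-noise check at the end are illustrative, not part of the original study.

```python
import numpy as np
from scipy.stats import chi2

def portmanteau_test(residuals, K, p, q, alpha=0.05):
    """Box-Pierce portmanteau lack-of-fit test, equation (A.15).

    residuals -- the fitted residual series a_t (length N - d)
    K         -- number of residual autocorrelations considered
    """
    a = np.asarray(residuals, dtype=float)
    a = a - a.mean()
    n = len(a)                                   # this is (N - d)
    c0 = np.dot(a, a) / n                        # lag-zero autocovariance
    r = np.array([np.dot(a[:n - k], a[k:]) / n / c0 for k in range(1, K + 1)])
    Q = n * np.sum(r ** 2)                       # Q = (N - d) * sum of r_k^2
    crit = chi2.ppf(1.0 - alpha, K - p - q)      # chi-square critical value
    return Q, crit, bool(Q < crit)               # True -> model judged adequate

# residuals of an adequate model behave as white noise and should usually pass
rng = np.random.default_rng(0)
Q, crit, adequate = portmanteau_test(rng.standard_normal(500), K=20, p=1, q=0)
```

With α = 0.05 the model is rejected whenever Q exceeds the upper 5% point of χ² with K - p - q degrees of freedom.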


white noise series, which is a straight line joining the points (0, 0) and (0.5, 1). A periodicity in the residuals at frequency f_i is expected to show up as a deviation from the straight line at this frequency. Kolmogorov-Smirnov probability limits can be drawn on the cumulative periodogram plot to test the significance of such deviations. For a given level of significance α, the limit lines are drawn at distances ±K_α/√N' above and below the theoretical straight line, where N' = (N-2)/2 for N even and N' = (N-1)/2 for N odd. Approximate values of K_α for different levels of significance α are:

    α     0.01   0.05   0.10   0.20   0.25
    K_α   1.63   1.36   1.22   1.07   1.02

(Box and Jenkins, 1976, p. 297)

So, if more than αN' of the plotted points fall outside the probability limit lines, the residual series may still contain some periodicity; otherwise it may be concluded that the residuals are independent. In practice, "because the a's are fitted values and not the true a's, we know that even when the model is correct they will not precisely follow a white noise process," and thus the cumulative periodogram test provides only a "rough guide" in model inadequacy checking (Box and Jenkins, 1976, p. 297).
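The normalized cumulative periodogram and its Kolmogorov-Smirnov limits can be sketched as follows. This is a minimal numpy illustration; the helper names are assumptions, not code from the study.

```python
import numpy as np

def cumulative_periodogram(residuals):
    """Normalized cumulative periodogram of a residual series.

    Returns frequencies f_j in (0, 0.5] and the cumulative periodogram
    C(f_j); for white noise C(f) should follow the straight line from
    (0, 0) to (0.5, 1), i.e., C(f) = 2f.
    """
    a = np.asarray(residuals, dtype=float)
    a = a - a.mean()
    N = len(a)
    spec = np.abs(np.fft.rfft(a)) ** 2      # periodogram at frequencies j/N
    I = spec[1:]                            # drop the zero frequency
    freqs = np.arange(1, len(spec)) / N
    C = np.cumsum(I) / np.sum(I)            # normalized: total area equals one
    return freqs, C

def ks_limits(N, K_alpha=1.36):
    """Half-width K_alpha / sqrt(N') of the KS limit lines (alpha = 0.05)."""
    N_prime = (N - 2) / 2 if N % 2 == 0 else (N - 1) / 2
    return K_alpha / np.sqrt(N_prime)

rng = np.random.default_rng(1)
f, C = cumulative_periodogram(rng.standard_normal(240))
band = ks_limits(240)
# maximum deviation of the white-noise periodogram from the line C(f) = 2f
max_dev = np.max(np.abs(C - 2.0 * f))
```

Plotting C(f) against f together with the two lines 2f ± K_α/√N' reproduces the diagnostic plot described above.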


12. Akaike Information Criterion (AIC)

The AIC for an ARMA(p,q) model is given by

    AIC(p,q) = N log(σ̂_a²) + 2(p+q+2) + N log 2π + N

where σ̂_a² is the MLE of the residual variance given by

    σ̂_a² = S(φ̂, θ̂) / (N - p - q)        (A.17)

and φ̂, θ̂ are the vectors of the parameters φ, θ which minimize the sum of squares of the residuals a_t,

    S(φ, θ) = Σ_t a_t²        (A.18)

For the purpose of comparison of models the definition of AIC can be replaced by

    AIC(p,q) = N log(σ̂_a²) + 2(p+q)        (A.19)

Ozaki (1977) demonstrates that the inherent difficulties associated with the Box-Jenkins procedure (identification, estimation and diagnostic checking) for the selection of the model, when several models fit the data equally well, can be overcome by using the MAICE (minimum AIC estimation) procedure as the only objective criterion for the selection of the "best" approximating model among a set of possible


models. He also points out that the AIC "measures both the fit of a model and the unreliability of a model" (Ozaki, 1977, p. 290).

13. Positive definite (semidefinite) matrix

A real symmetric matrix A is called positive definite (semidefinite) if and only if

    x^T A x > 0   (≥ 0)        (A.20)

for all vectors x ≠ 0. The two following theorems hold:

Theorem 1: A matrix A is positive (semi-) definite if and only if all its characteristic values (i.e., eigenvalues) are positive (non-negative).

Theorem 2: A matrix A is positive (semi-) definite if and only if all the successive principal minors of A are positive (non-negative).

An obvious corollary of the above is that a positive semidefinite matrix is positive definite if and only if it is nonsingular, i.e., none of its characteristic values is zero (Gantmacher, 1977, p. 305).
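Theorem 1 gives a direct numerical check: compute the eigenvalues of the symmetric matrix and inspect their signs. A minimal numpy sketch, where the tolerance and function name are illustrative choices:

```python
import numpy as np

def definiteness(A, tol=1e-10):
    """Classify a real symmetric matrix via its eigenvalues (Theorem 1)."""
    A = np.asarray(A, dtype=float)
    eig = np.linalg.eigvalsh(A)            # eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig > -tol):
        return "positive semidefinite"
    return "indefinite"

# a covariance-type matrix is positive definite; a singular Gram matrix
# (here a rank-1 outer product) is only positive semidefinite
pd_case  = definiteness([[2.0, 1.0], [1.0, 2.0]])         # eigenvalues 1 and 3
psd_case = definiteness(np.outer([1.0, 2.0], [1.0, 2.0])) # eigenvalues 0 and 5
```

The second example also illustrates the corollary: the rank-1 matrix is singular (one zero eigenvalue) and hence semidefinite but not definite.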


14. Test for differences in the means of two normal populations

Let μ_1, μ_2 denote the population means of two normal distributions and x̄_1, x̄_2 the respective sample means. Let us also assume that the variances of the two normal distributions are equal but unknown. The hypothesis H_0: μ_1 = μ_2 versus H_a: μ_1 ≠ μ_2 is tested by calculating the statistic

    t = (x̄_1 - x̄_2) / (s √(1/N_1 + 1/N_2))        (A.21)

where

    s² = [(N_1 - 1) s_1² + (N_2 - 1) s_2²] / (N_1 + N_2 - 2)        (A.22)

which has a t distribution with N_1 + N_2 - 2 degrees of freedom. The H_0 is rejected if

    |t| > t_{1-α/2}(N_1 + N_2 - 2)        (A.23)

Although the test is based on sample normality, for large samples the Central Limit Theorem enables us to use it as an approximate test for nonnormal samples. If the two samples are of equal length, N_1 = N_2 = N, then equation (A.21) reduces to


    t = (x̄_1 - x̄_2) / √((s_1² + s_2²)/N)        (A.24)

15. Test for equality of variances of two normal distributions

Let σ_1², σ_2² denote the population variances and s_1², s_2² the sample variances of two normal distributions. The hypothesis H_0: σ_1² = σ_2² versus H_a: σ_1² ≠ σ_2² is tested by calculating the statistic

    F_c = s_1² / s_2²        (A.25)

where s_1² is the larger sample variance. F_c is distributed as an F distribution with N_1 - 1 and N_2 - 1 degrees of freedom, where N_1 is the length of the sample having the larger variance and N_2 is the length of the sample with the smaller variance. H_0 is rejected if

    F_c > F_{1-α}(N_1 - 1, N_2 - 1)        (A.26)

16. Test for significance of correlation coefficients

Let ρ denote the population correlation coefficient and r the sample estimate of ρ. If the sample size is moderately large (N > 25), then the quantity W is


approximately normally distributed with mean W̄ and variance 1/(N-3), where

    W = (1/2) ln[(1 + r)/(1 - r)]        (A.27)

and

    W̄ = (1/2) ln[(1 + ρ)/(1 - ρ)]        (A.28)

To test the hypothesis H_0: ρ = ρ_0 against the alternative H_a: ρ ≠ ρ_0, the quantity

    z = (W - W̄) √(N - 3)        (A.29)

with W̄ evaluated at ρ = ρ_0, can be considered to be normally distributed with zero mean and unit variance. If |z| > z_{1-α/2} (z being the standard normal variable), H_0 is rejected (see Haan, 1977, p. 223).
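The three tests of sections 14-16 translate directly into code. The sketch below uses scipy for the t, F, and normal quantiles; the function names and the synthetic data are illustrative only, and each function returns True when H_0 is rejected at level α.

```python
import numpy as np
from scipy import stats

def mean_difference_test(x1, x2, alpha=0.05):
    """Two-sample t test with pooled variance, equations (A.21)-(A.23)."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    s2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    t = (x1.mean() - x2.mean()) / np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
    return abs(t) > stats.t.ppf(1 - alpha / 2, n1 + n2 - 2)

def variance_ratio_test(x1, x2, alpha=0.05):
    """F test for equality of variances, equations (A.25)-(A.26)."""
    v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
    (vl, nl), (vs, ns) = sorted([(v1, len(x1)), (v2, len(x2))], reverse=True)
    Fc = vl / vs                    # larger sample variance in the numerator
    return Fc > stats.f.ppf(1 - alpha, nl - 1, ns - 1)

def correlation_test(r, rho0, N, alpha=0.05):
    """Fisher z-transform test of H0: rho = rho0, equations (A.27)-(A.29)."""
    W = 0.5 * np.log((1 + r) / (1 - r))
    W_bar = 0.5 * np.log((1 + rho0) / (1 - rho0))
    z = (W - W_bar) * np.sqrt(N - 3)
    return abs(z) > stats.norm.ppf(1 - alpha / 2)

rng = np.random.default_rng(2)
a, b = rng.normal(0, 1, 200), rng.normal(2, 1, 200)
reject_mean = mean_difference_test(a, b)   # means differ by two sigma
equal_var = variance_ratio_test(a, b)      # generated with equal variances
reject_corr = correlation_test(r=0.6, rho0=0.0, N=100)
```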


APPENDIX B
DETERMINATION OF MATRICES A AND B OF THE MULTIVARIATE AR(1) MODEL

Determination of matrix A

The multivariate lag-one autoregressive model is written as

    Z_t = A Z_{t-1} + B N_t        (B.1)

Post-multiplying both sides of equation (B.1) by Z^T_{t-1} and taking expectations, it becomes

    E[Z_t Z^T_{t-1}] = A E[Z_{t-1} Z^T_{t-1}] + B E[N_t Z^T_{t-1}]        (B.2)

By definition,

    M_0 = E[Z_t Z^T_t]        (B.3)

and

    M_1 = E[Z_t Z^T_{t-1}]        (B.4)


and from the assumption of weak stationarity

    E[Z_{t-1} Z^T_{t-1}] = M_0        (B.5)

Also, since N_t is an independent (uncorrelated) process,

    E[N_t Z^T_{t-1}] = 0        (B.6)

so that equation (B.2) becomes

    M_1 = A M_0        (B.7)

and solving for the parameter matrix A,

    A = M_1 M_0^{-1}        (B.8)

Determination of matrix B

Post-multiplying equation (B.1) by Z^T_t and taking expectations on both sides, it becomes

    M_0 = A M_1^T + B E[N_t Z^T_t]        (B.9)


Because E[N_t N^T_t] = I, an identity matrix, and E[N_t Z^T_{t-1}] = 0, so that E[N_t Z^T_t] = B^T, equation (B.9) can be written

    M_0 = A M_1^T + B B^T        (B.10)

By substituting A from equation (B.8) and solving for B B^T,

    B B^T = M_0 - M_1 M_0^{-1} M_1^T        (B.11)

Solution of equation B B^T = C

The right-hand side of equation (B.11) involves the lag-zero and lag-one correlation matrices, which can be estimated from the historical data, and thus is a known quantity C. The problem that remains now is to solve the equation

    B B^T = C        (B.12)

for B. A necessary and sufficient condition for a real solution for B to exist is that C be a positive semidefinite matrix. It can be proven (Valencia and Schaake, 1973) that if the correlation matrices M_0 and M_1 have been calculated using equal-length records for all m sites, then the matrix

    C = M_0 - M_1 M_0^{-1} M_1^T        (B.13)


is always positive semidefinite, and so a real solution for the matrix B exists. But this solution for B is not unique: an infinite number of matrices B exist that satisfy (B.12).

Proof: Let B denote a matrix solution of equation (B.12) and K denote an (m×m) matrix such that K K^T = I, where I is the (m×m) identity matrix. A matrix B_0 defined as

    B_0 = B K        (B.14)

may be used in place of B in equation (B.12), since

    B_0 B_0^T = B K K^T B^T = B B^T = C        (B.15)

There exists more than one matrix K such that K K^T = I, and therefore many solutions for matrix B exist, all valid since the elements of B have no physical significance as far as synthetic hydrology is concerned (Matalas, 1967).

Several techniques have been proposed for the solution of equation (B.12). Fiering (1964) and Matalas (1967) suggested the use of principal component analysis, and Moran (1970) used canonical correlation analysis. Young (1968) assumed that B is a lower triangular matrix, based on the fact that C = B B^T is a symmetric matrix, and gave a unique recursive solution for the elements of B. Let us examine this case closely:


(1) C = B B^T is symmetric for any B. The (i,j)th element of matrix C is

    c_ij = Σ_k b_ik b'_kj        (B.16)

and the element across the diagonal is

    c_ji = Σ_k b_jk b'_ki        (B.17)

where the prime denotes an element of the transposed matrix. Thus b'_kj = b_jk and b'_ki = b_ik, which implies that c_ij = c_ji, and therefore C is symmetric for any B.

(2) That C is symmetric implies that m(m+1)/2 equations are required to specify it, and so m(m+1)/2 nonzero elements of matrix B are needed. Thus, since the (m×m) matrix B has m² elements, there are m(m-1)/2 elements that can be set to zero. So the assumption of a lower triangular matrix B is valid.

(3) The assumption of a lower triangular matrix B allows a recursive solution for the coefficients of B. This will be illustrated in the (2×2) case; the reader is referred to Young and Pisano (1968) for the general case.


    | b11   0  | | b11  b21 |   | c11  c12 |
    | b21  b22 | |  0   b22 | = | c21  c22 |

or

    | b11^2       b11 b21       |   | c11  c12 |        (B.18)
    | b21 b11     b21^2 + b22^2 | = | c21  c22 |

from which

    b11 = √c11
    b21 = c21 / b11        (B.19)
    b22 = √(c22 - b21^2)

with the constraints

    c11 > 0   and   c22 - b21^2 ≥ 0        (B.20)
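Equations (B.8) and (B.13), together with the lower-triangular solution of B B^T = C, can be sketched with numpy, where the Cholesky factorization plays the role of Young's recursive solution. The estimator and simulation below are illustrative, assuming a standardized series and a positive definite C; none of the names come from the original study.

```python
import numpy as np

def fit_multivariate_ar1(Z):
    """Estimate A and a lower triangular B of Z_t = A Z_{t-1} + B N_t.

    Z is an (n x m) array: n time steps at m sites, assumed already
    standardized. M0 and M1 are the lag-zero and lag-one moment matrices.
    """
    Z = np.asarray(Z, dtype=float)
    n = len(Z)
    M0 = Z.T @ Z / n                          # lag-zero moments, (B.3)
    M1 = Z[1:].T @ Z[:-1] / (n - 1)           # lag-one moments, (B.4)
    A = M1 @ np.linalg.inv(M0)                # equation (B.8)
    C = M0 - M1 @ np.linalg.inv(M0) @ M1.T    # equation (B.13)
    B = np.linalg.cholesky(C)                 # lower triangular, B B^T = C
    return A, B

# synthetic check: simulate a bivariate AR(1) and recover A and B B^T
rng = np.random.default_rng(3)
A_true = np.array([[0.5, 0.2], [0.1, 0.4]])
B_true = np.array([[1.0, 0.0], [0.3, 0.9]])
Z = np.zeros((5000, 2))
for t in range(1, 5000):
    Z[t] = A_true @ Z[t - 1] + B_true @ rng.standard_normal(2)
A_hat, B_hat = fit_multivariate_ar1(Z)
```

Because B is not unique, only B B^T (and not B itself) is meaningfully compared with the generating matrix: post-multiplying B_hat by any orthogonal K would reproduce the same C.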


APPENDIX C
DATA USED AND STATISTICS

Table C.1. 55 years of monthly rainfall data for the South Florida Station 6038 (Moore Haven Lock 1).

[Monthly rainfall values for 1927-1981, twelve monthly columns per year; the numeric entries are not legibly recoverable from the scanned original.]


Table C.2. 55 years of monthly rainfall data for the South Florida Station 6013 (Avon Park).

[Monthly rainfall values for 1927-1981, twelve monthly columns per year; the numeric entries are not legibly recoverable from the scanned original.]

Note: -1 indicates missing value.


Table C.3. 55 years of monthly rainfall data for the South Florida Station 6093 (Fort Myers WSO AP).

[Monthly rainfall values for 1927-1981, twelve monthly columns per year; the numeric entries are not legibly recoverable from the scanned original.]


Table C.4. 55 years of monthly rainfall data for the South Florida Station 6042 (Canal Point USDA).

[Monthly rainfall values for 1927-1981, twelve monthly columns per year; the numeric entries are not legibly recoverable from the scanned original.]

Note: -1 indicates missing value.


Table C.5. Monthly statistics of stations 6038, 6013, 6093, 6042. The number of observations N is 55 for every month at station 6038 and ranges from 52 to 55 at the other stations; the individual monthly counts are not fully legible in the scanned original.

***** STATION 6038 *****
            JAN     FEB     MAR     APR     MAY     JUN     JUL     AUG     SEP     OCT     NOV     DEC
MEAN       1.927   1.878   2.595   2.507   4.575   7.606   7.235   7.033   7.567   3.747   1.379   1.457
ST. DEV.   3.063   1.368   2.456   1.818   2.584   3.776   3.358   2.897   4.085   3.073   1.283   1.555
SKEWNESS   5.016   0.664   1.762   0.674   1.032   0.646   1.008   0.724   1.081   1.138   1.532   1.975
C.V.     159.002  72.856  94.631  72.498  56.482  49.646  46.420  41.193  53.983  82.017  93.042 106.686

***** STATION 6013 *****
            JAN     FEB     MAR     APR     MAY     JUN     JUL     AUG     SEP     OCT     NOV     DEC
MEAN       2.093   2.718   2.987   2.928   4.192   8.613   8.307   7.258   7.521   3.500   1.567   1.687
ST. DEV.   1.780   1.828   2.006   2.209   2.655   3.694   3.664   3.148   3.732   2.488   1.551   1.135
SKEWNESS   1.344   0.798   0.676   0.864   1.073   1.129   0.793   1.487   0.716   0.798   1.833   0.727
C.V.      85.043  67.259  67.151  75.460  63.326  42.892  44.108  43.370  49.620  71.082  98.982  67.280

***** STATION 6093 *****
            JAN     FEB     MAR     APR     MAY     JUN     JUL     AUG     SEP     OCT     NOV     DEC
MEAN       1.636   2.039   2.619   1.995   4.049   9.105   8.672   8.309   8.553   3.474   1.175   1.399
ST. DEV.   1.587   1.450   3.206   1.953   2.414   4.082   2.976   3.490   3.988   2.877   1.048   1.260
SKEWNESS   1.531   0.588   2.779   1.474   0.381   0.777   0.123   0.974   0.407   1.295   0.881   1.555
C.V.      97.018  71.102 122.418  97.916  59.615  44.835  34.313  41.997  46.624  82.817  89.186  90.105

***** STATION 6042 *****
            JAN     FEB     MAR     APR     MAY     JUN     JUL     AUG     SEP     OCT     NOV     DEC
MEAN       1.686   1.960   2.632   2.860   4.626   8.141   7.861   7.194   7.802   4.709   1.972   1.818
ST. DEV.   1.688   1.475   2.501   2.278   2.659   4.107   3.408   2.853   4.126   3.163   3.489   1.863
SKEWNESS   1.812   0.881   2.388   0.758   0.666   0.530   0.256   0.636   0.660   1.056   5.743   1.362
C.V.     100.077  75.242  95.010  79.661  57.482  50.446  43.353  39.655  52.880  67.163 176.912 102.457
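The per-month statistics tabulated above (mean, standard deviation, skewness, and coefficient of variation) can be computed as in this short sketch. The function name and the sample values are illustrative, not taken from the thesis; Python is used here only for illustration, since the thesis programs are in FORTRAN.

```python
def monthly_stats(values):
    """Mean, standard deviation (n - 1 denominator), moment coefficient of
    skewness, and coefficient of variation (percent) of one month's record."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    skew = sum((v - mean) ** 3 for v in values) / n / std ** 3
    cv = 100.0 * std / mean
    return mean, std, skew, cv

# Illustrative January sample (five years of January rainfall):
mean, std, skew, cv = monthly_stats([1.2, 0.4, 3.1, 0.9, 2.5])
```

Different texts use slightly different small-sample skewness corrections; the simple moment form above is enough to reproduce the sign and rough size of the tabulated coefficients.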


Table C.6. Station 6038 -- monthly statistics of the incomplete and estimated series -- 2% missing values. For each month, the number of observations, mean, standard deviation, coefficient of variation, and skewness are tabulated for the incomplete series and for the series estimated by the mean value, reciprocal distances, normal ratio, and modified weighted average methods. [Tabulated values not reliably legible in the scanned original.]


Table C.7. Station 6038 -- monthly statistics of the incomplete and estimated series -- 5% missing values. For each month, the number of observations, mean, standard deviation, coefficient of variation, and skewness are tabulated for the incomplete series and for the series estimated by the mean value, reciprocal distances, normal ratio, and modified weighted average methods. [Tabulated values not reliably legible in the scanned original.]


Table C.8. Station 6038 -- monthly statistics of the incomplete and estimated series -- 10% missing values. For each month, the number of observations, mean, standard deviation, coefficient of variation, and skewness are tabulated for the incomplete series and for the series estimated by the mean value, reciprocal distances, normal ratio, and modified weighted average methods. [Tabulated values not reliably legible in the scanned original.]


Table C.9. Station 6038 -- monthly statistics of the incomplete and estimated series -- 15% missing values. For each month, the number of observations, mean, standard deviation, coefficient of variation, and skewness are tabulated for the incomplete series and for the series estimated by the mean value, reciprocal distances, normal ratio, and modified weighted average methods. [Tabulated values not reliably legible in the scanned original.]


Table C.10. Station 6038 -- monthly statistics of the incomplete and estimated series -- 20% missing values. For each month, the number of observations, mean, standard deviation, coefficient of variation, and skewness are tabulated for the incomplete series and for the series estimated by the mean value, reciprocal distances, normal ratio, and modified weighted average methods. [Tabulated values not reliably legible in the scanned original.]
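The regional methods compared in Tables C.6 through C.10 all estimate a missing value from the simultaneous observations at neighboring stations. The sketch below shows the standard textbook forms of the reciprocal distance and normal ratio estimators; the distances, station means, and sample values are illustrative, and the thesis's mean value and modified weighted average variants are not reproduced here.

```python
def reciprocal_distance(obs, dists, power=2):
    """Estimate a missing value as a 1/d**power weighted average of the
    simultaneous observations `obs` at the surrounding stations."""
    w = [1.0 / d ** power for d in dists]
    return sum(wi * oi for wi, oi in zip(w, obs)) / sum(w)

def normal_ratio(obs, means, target_mean):
    """Normal ratio method: scale each surrounding observation by the ratio
    of the target station's long-term mean to that station's mean, then
    average the scaled values."""
    return sum(target_mean / mi * oi for mi, oi in zip(means, obs)) / len(obs)

# Illustrative numbers (not from the thesis data):
est_rd = reciprocal_distance(obs=[2.0, 3.0, 2.5], dists=[10.0, 20.0, 40.0])
est_nr = normal_ratio(obs=[2.0, 3.0, 2.5], means=[2.1, 2.9, 2.6],
                      target_mean=2.4)
```

Note how the reciprocal distance estimate is pulled toward the nearest station's observation, while the normal ratio estimate corrects for systematic differences in station means.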


Table C.11. Lag-zero covariance matrices of the monthly rainfall series of stations 6013, 6093, 6042 (rows and columns in that order). All matrices are symmetric; only the lower triangle is printed.

JAN:    3.169
        2.393    2.518
        2.343    1.723    2.848

FEB:    3.342
        1.813    2.101
        1.496    1.505    2.175

MAR:    4.022
        3.501   10.275
        2.677    6.921    6.254

APR:    4.881
        1.959    3.814
        3.545    2.378    5.190

MAY:    7.047
        3.610    5.826
        3.539    2.871    7.071

JUN:   13.647
        8.235   16.666
        7.373    7.138   16.865

JUL:   13.425
        3.655    8.854
        3.378    2.546   11.615

AUG:    9.907
       -0.628   12.177
        1.630    0.260    8.138

SEP:   13.928
        9.080   15.902
        5.913    6.443   17.022

OCT:    6.189
        5.032    8.278
        4.516    5.896   10.004

NOV:    2.406
        0.822    1.098
        1.395    0.547   12.174

DEC:    1.289
        1.045    1.588
        1.145    1.510    3.471
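Each matrix in Table C.11 is the lag-zero covariance matrix of one calendar month's values at the three stations; its diagonal entries are the monthly variances. A minimal sketch of the computation, with short illustrative station records in place of the 55-year series:

```python
def lag_zero_cov(series):
    """Covariance matrix (n - 1 denominator) of several equal-length
    series, e.g. one calendar month's values at each station."""
    n = len(series[0])
    means = [sum(s) / n for s in series]
    return [[sum((si[k] - mi) * (sj[k] - mj) for k in range(n)) / (n - 1)
             for sj, mj in zip(series, means)]
            for si, mi in zip(series, means)]

# Three short illustrative station records for one month:
m = lag_zero_cov([[1.0, 2.0, 3.0],
                  [2.0, 2.5, 3.0],
                  [0.5, 1.0, 2.4]])
# The result is symmetric, as noted in the table caption.
```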


Table C.12. Normality transformations applied on the monthly rainfall data of Station 6038. Monthly mean, standard deviation, skewness, and coefficient of variation are tabulated for the untransformed series and for the logarithmic, power-0.25, power-0.35, and square root transformations. [Tabulated values not reliably legible in the scanned original.]
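The transformations compared in Table C.12 are all of the form Y = (X + C)**P, with the logarithm as the limiting case as P approaches zero. The sketch below applies such a power transformation and checks its effect on the skewness coefficient; the constants C = 0.5 and P = 0.25 and the sample values are illustrative, not the thesis's fitted parameters.

```python
def power_transform(x, c, p):
    """Normality transformation Y = (X + C)**P, as in Table C.12."""
    return [(v + c) ** p for v in x]

def skewness(x):
    """Moment coefficient of skewness (n - 1 denominator for the variance)."""
    n = len(x)
    mean = sum(x) / n
    std = (sum((v - mean) ** 2 for v in x) / (n - 1)) ** 0.5
    return sum((v - mean) ** 3 for v in x) / n / std ** 3

# A strongly right-skewed sample: the fourth-root transform shortens the
# long upper tail and pulls the skewness coefficient toward zero.
raw = [0.1, 0.2, 0.3, 0.5, 0.8, 1.5, 4.0, 9.0]
transformed = power_transform(raw, c=0.5, p=0.25)
```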


Table C.13. Statistics of the estimated series -- univariate model.

***** UNIVARIATE MODEL ( 10% MIS. ) *****
VARIABLE     MEAN   ST. DEV.  SKEWNESS      C.V.
JAN         1.854     2.980     5.462    160.709
FEB         1.891     1.278     0.815     67.608
MAR         2.503     2.402     1.943     95.956
APR         2.499     1.805     0.701     72.242
MAY         4.504     2.305     1.095     51.181
JUN         7.809     3.601     0.697     46.120
JUL         7.185     3.161     1.141     43.995
AUG         7.067     2.809     0.780     39.757
SEP         7.563     3.975     1.127     52.553
OCT         3.594     2.871     1.369     79.965
NOV         1.314     1.132     1.515     86.162
DEC         1.475     1.539     2.019    104.304

***** UNIVARIATE MODEL ( 20% MIS. ) *****
VARIABLE     MEAN   ST. DEV.  SKEWNESS      C.V.
JAN         1.777     2.997     5.480    168.66
FEB         1.846     1.303     0.854     70.585
MAR         2.334     1.914     1.364     81.989
APR         2.523     1.743     0.828     69.053
MAY         4.713     2.449     1.119     51.96
JUN         7.199     3.446     1.009     47.865
JUL         7.216     3.325     1.062     46.072
AUG         6.961     2.465     0.510     35.406
SEP         7.420     3.659     1.531     49.316
OCT         3.719     2.910     1.408     78.235
NOV         1.302     1.153     1.954     88.580
DEC         1.498     1.509     2.101    100.727

Table C.14. Statistics of the estimated series -- bivariate model.

***** BIVARIATE MODEL ( 10% MIS. ) *****
VARIABLE     MEAN   ST. DEV.  SKEWNESS      C.V.
JAN         1.825     2.972     5.532    162.862
FEB         1.869     1.287     0.841     68.866
MAR         2.483     2.398     1.976     96.608
APR         2.534     1.781     0.698     70.275
[The remaining rows of Table C.14, from MAY onward, are cut off at the page break in the scanned original.]

APPENDIX D
COMPUTER PROGRAMS

RAEMV-U (Recursive Algorithm for the Estimation of Missing Values -- Univariate Model)

Input

The program inputs the time series; the parameters of the normality transformation to be performed (a power transformation); the number of gaps (not necessarily the number of missing values, unless all the gaps are single values); and, for each gap, its starting and ending points (counting from the first value in the series). For the first iteration, the missing values in the original series (usually indicated by a code or by a negative value) are initialized to zeros or to some other desired initial estimates.

Program Description

The main program reads the input data and then repeatedly calls subroutine ARMA (each call corresponds to one iteration). Subroutine ARMA performs the following calculations each time it is called:


(1) The input series is transformed to normal (using the selected transformation) and stationary (by subtracting the monthly means and dividing by the monthly standard deviations).

(2) The mean, variance, autocovariance function (ACVF), autocorrelation function (ACF), and partial autocorrelation function (PACF) of the transformed series are computed by calling the IMSL subroutine FTAUTO.

(3) Preliminary estimates of the p AR parameters and the q MA parameters are computed by calling the IMSL subroutines FTARPS and FTMPS in turn.

(4) Maximum likelihood estimates (MLE) of the AR and MA parameters are computed and the residual series is calculated by calling the IMSL subroutine FTMXL.

(5) The mean, variance, ACVF, ACF, and PACF of the residual series are computed by calling the IMSL subroutine FTAUTO.

(6) The parameters of the fitted model (MLE) are used to estimate the missing values in all the gaps by the Box-Jenkins minimum mean square error forecasting procedure.

(7) The inverse normality and stationarity transformations are performed on the series, and the estimated complete series is output.

The estimated series (the output from the first call) now becomes the input series for the second call, and the above seven steps are repeated. Subroutine ARMA is called as many times as needed, until the parameter estimates and the missing-value estimates stabilize. The program is initialized to five calls (more can easily be added as needed), and a stabilization check on the parameters is provided so that the iterations stop when the two parameters remain constant to the second decimal place. The computation and printing of the ACVF, ACF, and PACF of the transformed and residual series (steps 2 and 5) are


not necessary and can be eliminated from the program without any problem. Their inclusion, however, permits checking the goodness of the fitted model at each iteration through diagnostic checks applied to the residuals. A listing of the program in FORTRAN follows.

RAEMV-B (Recursive Algorithm for the Estimation of Missing Values -- Bivariate Model)

Only the special case in which one series is incomplete and the other complete is considered here; the program can easily be modified to handle the case in which both series are incomplete.

Input

The program inputs the two time series, the parameters of the normality transformation to be performed on each series, the number of gaps, and the position of each gap in the incomplete series. The missing values in the incomplete series are initialized to zeros or to some other values.

Program description

The main program reads the input data and then repeatedly calls subroutine BIVAR (each call corresponds to one iteration). Subroutine BIVAR performs the following steps each time it is called:


(1) The two input series are transformed to normal and stationary by calling subroutine STAT.

(2) The lag-zero and lag-one autocovariances and cross-covariances of the two series are computed by calling the IMSL subroutine FTCRXY.

(3) The parameter matrices A and B are calculated. Inversion and multiplication of matrices are performed by the IMSL subroutines LINV2F, VMULFF, and VMULFP.

(4) The parameter matrices A and B are used to estimate the missing values of the incomplete series.

(5) The inverse normality and stationarity transformations are performed on the two series, and the estimated complete series is output.

The estimated series (the output from the first call) now becomes the input series for the second call, and the above five steps are repeated until the matrices A and B stabilize. No stabilization check is provided by the program (eight values would have to be checked simultaneously); instead, the subroutine is called a fixed number of times. A listing of the computer program in FORTRAN follows.
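Steps (2) through (4) amount to a method-of-moments fit of the bivariate AR(1) model z(t) = A z(t-1) + B e(t): the coefficient matrix is A = M1 * inv(M0), where M0 and M1 are the lag-zero and lag-one correlation matrices, and each missing standardized value is then projected forward from the previous state vector. A sketch with hand-coded 2x2 algebra follows; the matrices below are illustrative, and the thesis programs perform this step with the IMSL routines named above.

```python
def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul2(m, n):
    """Product of two 2x2 matrices."""
    return [[sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Illustrative lag-zero (M0) and lag-one (M1) correlation matrices:
m0 = [[1.0, 0.6], [0.6, 1.0]]
m1 = [[0.3, 0.2], [0.25, 0.35]]
a = mul2(m1, inv2(m0))          # A = M1 * inv(M0)

# Recursive gap estimation: the standardized missing value at time t is
# predicted from the full state vector [incomplete, complete] at t - 1.
z_prev = [0.8, -0.2]
z1_hat = a[0][0] * z_prev[0] + a[0][1] * z_prev[1]
```

Because the complete series is always observed, each one-step prediction mixes the incomplete series' own persistence with the concurrent information carried by the complete station.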


[FORTRAN listing, RAEMV-U main program: declares the series arrays; reads the title card, the transformation parameters C and P, the number of years N, the number of gaps NG, and the interevent and gap lengths LI and LG; computes the starting point ISG(K) and ending point IEG(K) of each gap and prints them for checking; and initializes the missing values (coded -1) to zero. The listing is too degraded in the scan to reproduce in full.]


[FORTRAN listing, continued: the main program calls subroutine ARMA up to six times, feeding each call the estimated series from the previous one, and stops early when successive PHI and THETA estimates agree to within 0.001; the header and declarations of SUBROUTINE ARMA(RAIN,VRAIN,ERAIN,EVRAIN,PHI1,THETA1) follow. Too degraded in the scan to reproduce in full.]
PAGE 164

[FORTRAN listing, subroutine ARMA continued: computes monthly means and standard deviations, standardizes the monthly series, stores the matrix series in a vector, and computes the mean, variance, ACVF, ACF, and PACF of the standardized transformed series with the IMSL subroutine FTAUTO. Too degraded in the scan to reproduce in full.]
PAGE 165

[FORTRAN listing, subroutine ARMA continued: evaluates and prints the sum-of-squares surface of the residuals over a grid of PHI and THETA values, locating the minimizing parameter pair. Too degraded in the scan to reproduce in full.]
PAGE 166

[FORTRAN listing, subroutine ARMA continued: generates a normal residual series; estimates each gap recursively, the first missing value as PHI1*VRAIN(I1-1) - THETA1*Z(I1-1) and each later one as PHI1 times the previous estimate; and applies the inverse standardization and normality transformations. Sample job control cards and gap-position input data for station 6038 follow. Too degraded in the scan to reproduce in full.]
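For the ARMA(1,1) form implemented in RAEMV-U, the Box-Jenkins minimum mean square error forecasts that fill a gap reduce to a simple recursion: the first missing value is forecast as phi*x(t-1) - theta*z(t-1) from the last observed value and residual, and every later value in the gap as phi times the preceding estimate, since the MA term vanishes beyond lead one. A sketch of that recursion; the parameter values and series below are illustrative:

```python
def forecast_gap(x, z, start, length, phi, theta):
    """Minimum mean square error forecasts of an ARMA(1,1) process over the
    gap x[start : start + length]; z holds the residual series."""
    est = list(x)
    # One-step-ahead forecast uses the last observed value and residual...
    est[start] = phi * est[start - 1] - theta * z[start - 1]
    # ...and the MA contribution drops out for lead times greater than one.
    for k in range(1, length):
        est[start + k] = phi * est[start + k - 1]
    return est

x = [0.5, 0.9, 0.4, None, None, 0.7]   # standardized series with a gap
z = [0.1, 0.3, -0.2, 0.0, 0.0, 0.0]    # residuals (zero inside the gap)
filled = forecast_gap(x, z, start=3, length=2, phi=0.6, theta=0.4)
```

The listing itself draws the residuals z from a random normal generator at each iteration; here they are supplied directly to keep the sketch deterministic.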


[FORTRAN listing, RAEMV-B main program: reads the two series, the transformation parameters, and the gap positions; computes the starting and ending point of each gap; initializes the missing values of the incomplete series to zero; prints the series to be estimated; and calls subroutine BIVAR five times. Too degraded in the scan to reproduce in full.]


[FORTRAN listing, subroutine BIVAR: normalizes and standardizes the two series by calling subroutine STAT; computes the lag-zero and lag-one auto- and cross-covariances with the IMSL subroutine FTCRXY; and assembles and prints the correlation matrices M0 and M1. Too degraded in the scan to reproduce in full.]


[FORTRAN listing, subroutine BIVAR continued: computes the parameter matrices A = M1*inv(M0) and B (using the IMSL subroutines LINV2F, VMULFF, and VMULFP) and prints them; estimates the gaps of the incomplete series recursively; applies the inverse transformations; and prints the estimated series. The header of subroutine STAT, which normalizes each series as RAIN(I,J) = (RAIN(I,J)+C)**P, follows. Too degraded in the scan to reproduce in full.]


      STD(J)=0.
      DO 25 I=1,N
      XM(J)=XM(J)+RAIN(I,J)/FLOAT(N)
      STD(J)=STD(J)+RAIN(I,J)**2
   25 CONTINUE
   20 CONTINUE
      DO 30 I=1,12
      STD(I)=((STD(I)-XM(I)**2*FLOAT(N))/(FLOAT(N)-1.))**0.5
   30 CONTINUE
C
C     NOW, STANDARDIZE THE SERIES
C
      DO 40 I=1,N
      DO 40 J=1,12
      RAIN(I,J)=(RAIN(I,J)-XM(J))/STD(J)
   40 CONTINUE
C
C     COMPUTE MEAN AND STD OF THE WHOLE SERIES
C
      NN=N*12
      IC=0
      DO 50 I=1,N
      DO 50 J=1,12
      IC=IC+1
      VRAIN(IC)=RAIN(I,J)
   50 CONTINUE
      X=0.
      ST=0.
      DO 60 I=1,NN
      X=X+VRAIN(I)
      ST=ST+VRAIN(I)**2
   60 CONTINUE
      X=X/FLOAT(NN)
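For readers without access to the IMSL routines, the following is a minimal NumPy sketch of the pipeline the FORTRAN listings above implement: month-by-month normalization (as in subroutine STAT), the lag-0 and lag-1 correlation matrices M0 and M1, the Matalas (1967) parameter matrices A and B, and the recursive gap-filling step. The function names and the synthetic data are illustrative, not from the thesis, and the sketch updates the incomplete series with the first row of A and drops the noise term, i.e. it fills each gap with the conditional expectation.

```python
import numpy as np

def normalize(rain, c, p):
    """Power-transform a (years x 12) rainfall matrix, then standardize
    each calendar month by its own mean and sample std (ddof=1), as
    subroutine STAT does.  Returns the flattened series plus the monthly
    statistics needed for the inverse transformation."""
    y = (rain + c) ** p
    xm = y.mean(axis=0)
    std = y.std(axis=0, ddof=1)
    return ((y - xm) / std).ravel(), xm, std

def lag_correlations(z1, z2):
    """Lag-0 (M0) and lag-1 (M1) correlation matrices of two
    standardized series: M1[i, j] estimates E[z_i(t) z_j(t-1)]."""
    z = np.vstack([z1, z2])
    n = z.shape[1]
    m0 = z @ z.T / n
    m1 = z[:, 1:] @ z[:, :-1].T / (n - 1)
    return m0, m1

def matalas_parameters(m0, m1):
    """Bivariate AR(1) model z(t) = A z(t-1) + B e(t):
    A = M1 M0^{-1}, and B is lower triangular with B B' = M0 - A M1',
    matching the closed-form 2x2 expressions in the listing above."""
    a = m1 @ np.linalg.inv(m0)
    cm = m0 - a @ m1.T
    b = np.zeros((2, 2))
    b[0, 0] = np.sqrt(cm[0, 0])
    b[1, 0] = cm[0, 1] / b[0, 0]
    b[1, 1] = np.sqrt(cm[1, 1] - cm[0, 1] ** 2 / cm[0, 0])
    return a, b

def fill_gaps(z1, z2, a, gaps):
    """Fill each gap (start, end) of series 1, inclusive indices, by the
    recursion used in the listing (noise term omitted), sweeping forward
    so already-filled values feed the next step."""
    e1 = z1.copy()
    for i1, i2 in gaps:
        for t in range(i1, i2 + 1):
            e1[t] = a[0, 0] * e1[t - 1] + a[0, 1] * z2[t - 1]
    return e1
```

The inverse transformation of the listing (multiply back by the monthly std, add the monthly mean, and raise to 1/p) recovers estimated rainfall from the filled standardized series.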

REFERENCES

Afifi, A.A., and Elashoff, R.M., 1966, "Missing observations in multivariate statistics I: Review of the literature," J. Am. Stat. Assoc., 61:595-604.

Anderson, D.G., 1979, "Satellite versus conventional methods in hydrology," in Satellite Hydrology, American Water Resources Association, Minneapolis.

Anderson, T.W., 1957, "Maximum likelihood estimates for a multivariate normal distribution when some observations are missing," J. Am. Stat. Assoc., 52:200-203.

Ansley, C.F., Spivey, W.A., and Wrobleski, W.J., 1977, "A class of transformations for Box-Jenkins seasonal modelling," Appl. Stat., 26:173-178.

Beale, E.M.L., and Little, R.J.A., 1975, "Missing values in multivariate analysis," J. R. Stat. Soc., B37:129-145.

Beard, L.R., 1973, "Hydrologic data fill-in and network design," in Design of Water Resources Projects with Inadequate Data, Proc. of the Madrid Symposium, June 1973.

Bendat, J.S., and Piersol, A.G., 1967, Measurement and Analysis of Random Data, John Wiley & Sons, New York, 3rd printing.

Bloomfield, P., 1970, "Spectral analysis with randomly missing observations," J. R. Stat. Soc., B32:369-380.

Box, G.E.P., and Cox, D.R., 1964, "An analysis of transformations (with discussion)," J. R. Stat. Soc., B26:211-252.

Box, G.E.P., and Jenkins, G.M., 1973, "Some comments on a paper by Chatfield and Prothero and on a review by Kendall (with discussion)," J. R. Stat. Soc., A136:337-345.

Box, G.E.P., and Jenkins, G.M., 1976, Time Series Analysis: Forecasting and Control, Holden-Day, San Francisco, revised ed.


Box, G.E.P., and Pierce, D.A., 1970, "Distribution of residual autocorrelations in autoregressive-integrated moving average time series models," J. Am. Stat. Assoc., 65:1509-1526.

Brubacher, S.R., and Tunnicliffe Wilson, G., 1976, "Interpolating time series with application to the estimation of holiday effects on electricity demand," Appl. Stat., 25:107-116.

Buck, S.F., 1960, "A method of estimation of missing values in multivariate data suitable for use with an electronic computer," J. R. Stat. Soc., B22:302-307.

Chatfield, C., 1980, The Analysis of Time Series: An Introduction, Chapman and Hall, London, 2nd ed.

Chatfield, C., and Prothero, D.L., 1973a, "Box-Jenkins seasonal forecasting: Problems in a case study (with discussion)," J. R. Stat. Soc., A136:295-336.

Chatfield, C., and Prothero, D.L., 1973b, "Reply by Dr. Chatfield and Dr. Prothero on the paper 'Some comments on a paper by Chatfield and Prothero and on a review by Kendall' by Box, G.E.P., and Jenkins, G.M.," J. R. Stat. Soc., A136:347-352.

Crosby, D.S., and Maddock, T., 1970, "Estimating coefficients of a flow generator for monotone samples of data," Water Resour. Res., 6(4):1079-1086.

Damsleth, E., 1980, "Interpolating missing values in a time series," Scand. J. Stat., 7:33-39.

Dean, J.D., and Snyder, W.M., 1977, "Temporally and areally distributed rainfall," J. of the Irrigation and Drainage Div., ASCE, 103(IR2):221-229.

Delleur, J.W., and Kavvas, M.L., 1978, "Stochastic models for monthly rainfall forecasting and synthetic generation," J. Appl. Meteor., 17(10):1528-1536.

Draper, N.R., and Cox, D.R., 1969, "On distributions and their transformation to normality," J. R. Stat. Soc., B31:472-476.

Draper, N.R., and Smith, H., 1966, Applied Regression Analysis, John Wiley & Sons, New York.

Durbin, J., 1960, "The fitting of time series models," Rev. Int. Inst. Stat., 28:233-244.

Fiering, M.B., 1964, "Multivariate technique for synthetic hydrology," J. Hydraul. Div., ASCE, 90(HY5):43-60.


Fiering, M.B., 1968, "Schemes for handling inconsistent matrices," Water Resour. Res., 4(2):291-297.

Fiering, M.B., and Jackson, B.B., 1971, Synthetic Hydrology, Monograph No. 1, American Geophysical Union, Washington, D.C.

Finzi, G., Todini, E., and Wallis, J.R., 1975, "Comment upon multivariate synthetic hydrology," Water Resour. Res., 11(6):844-850.

Finzi, G., Todini, E., and Wallis, J.R., 1977, "SPUMA: Simulation package using Matalas algorithm," in Mathematical Models for Surface Water Hydrology, Ed. by Ciriani, T.A., Maione, U., and Wallis, J.R., John Wiley & Sons, London.

Gantmacher, F.R., 1977, The Theory of Matrices, Vol. I, Chelsea Publ. Company, New York.

Granger, C.W.J., and Morris, M.J., 1976, "Time series modelling and interpretation," J. R. Stat. Soc., A139:246-257.

Haan, C.T., 1977, Statistical Methods in Hydrology, Iowa State Univ. Press, Ames.

Hamrick, R.L., 1972, "South Florida's 'unmanaged' resource," In Depth Report, Central and South Florida Flood Control District, 1:1-12.

Hannan, E.J., 1960, Time Series Analysis, Chapman and Hall, London.

Hashino, M., 1977, "A similar storm method on filling data voids," in Modeling Hydrologic Processes, Ed. by Morel-Seytoux, H., Salas, J.D., Sanders, T.G., and Smith, R.E., Water Resour. Publications, Fort Collins, Colorado.

Hinkley, D., 1977, "On quick choice of power transformation," Appl. Stat., 26(1):67-70.

IMSL LIB-0007, 1979, Reference Manual, Edition 7, revised.

Jenkins, G.M., and Watts, D.G., 1969, Spectral Analysis and Its Applications, Holden-Day, San Francisco, 2nd printing.

John, J.A., and Draper, N.R., 1980, "An alternative family of transformations," Appl. Stat., 29(2):190-197.

Jones, R.H., 1962, "Spectral analysis with regularly missed observations," Ann. Math. Stat., 32:455-461.


Kahan, J.P., 1974, "A method for maintaining cross and serial correlations and the coefficient of skewness under generation in a linear bivariate regression model," Water Resour. Res., 10(6):1245-1248.

Kavvas, M., and Delleur, J., 1975, "Removal of periodicities by differencing and monthly mean subtraction," J. Hydrol., 26:335-353.

Kottegoda, N.T., and Elgy, J., 1977, "Infilling missing flow data," in Modeling Hydrologic Processes, Ed. by Morel-Seytoux, H., Salas, J.D., Sanders, T.G., and Smith, R.E., Water Resour. Publications, Fort Collins, Colorado.

Linsley, R.K., Jr., Kohler, M.A., and Paulhus, J.L.H., 1978, Hydrology for Engineers, McGraw-Hill Book Co., New York, 2nd ed.

Marshall, R.J., 1980, "Autocorrelation estimation of time series with randomly missing observations," Biometrika, 67(3):567-570.

Matalas, N.C., 1967, "Mathematical assessment of synthetic hydrology," Water Resour. Res., 3(4):937-945.

Matalas, N.C., 1978, "Generation of multivariate synthetic flows," in Mathematical Models for Surface Water Hydrology, Ed. by Ciriani, T.A., Maione, U., and Wallis, J.R., John Wiley & Sons, London.

Mejia, J.M., Rodriguez-Iturbe, I., and Cordova, J.R., 1974, "Multivariate generation of mixtures of normal and log-normal variables," Water Resour. Res., 10(4):691-693.

Moran, P.A.P., 1970, "Simulation and evaluation of complex water systems operations," Water Resour. Res., 6(6):1737-1742.

Neave, H.R., 1970, "Spectral analysis with initially scarce data," Biometrika, 57:111-122.

O'Connell, P.E., 1973, "Multivariate synthetic hydrology: A correction," J. Hydr. Div., ASCE, Tech. notes, 99(HY12):2391-2396.

O'Connell, P.E., 1974, "Stochastic modelling of long-term persistence in streamflow sequences," Ph.D. thesis, University of London, London, England.


Orchard, T., and Woodbury, M.A., 1972, "A missing information principle: Theory and applications," in Proc. 6th Berkeley Symp. Math. Statist. Prob., Vol. 1:697-715.

Ozaki, T., 1977, "On the order determination of ARIMA models," Appl. Stat., 26:290-301.

Parzen, E., 1963, "On spectral analysis with missing observations and amplitude modulation," Sankhya, A25:383-392.

Paulhus, J.L.H., and Kohler, M.A., 1952, "Interpolation of missing precipitation records," Mon. Weather Review, 80:129-133.

Pegram, G.G.S., and James, W., 1972, "Multilag multivariate autoregressive model for the generation of operational hydrology," Water Resour. Res., 8(4):1074-1076.

Roesner, L.A., and Yevjevich, V., 1966, "Mathematical models for time series of monthly precipitation and monthly runoff," Hydrology Paper No. 15, Colorado State University, Fort Collins, Colorado.

Salas, J.D., Delleur, J.W., Yevjevich, V., and Lane, W.L., 1980, Applied Modeling of Hydrologic Time Series, Water Resour. Publications, Fort Collins, Colorado.

Salas, J.D., and Pegram, G.G.S., 1977, "A seasonal multivariate multilag autoregressive model in hydrology," in Modeling Hydrologic Processes, Ed. by Morel-Seytoux, H., Salas, J.D., Sanders, T.G., and Smith, R.E., Water Resour. Publications, Fort Collins, Colorado.

Scheinok, P.A., 1965, "Spectral analysis with randomly missed observations: The binomial case," Ann. Math. Stat., 36:971-977.

Schlesselman, J., 1971, "Power families: A note on the Box and Cox transformation," J. R. Stat. Soc., B33:307-311.

Shearman, R.J., and Salter, P.M., 1975, "An objective rainfall interpolation and mapping technique," Hydrological Sciences Bulletin, 20(3):353-363.

Slack, J.R., 1973, "I would if I could (self-denial by conditional models)," Water Resour. Res., 9(1):247-249.

Stidd, C.K., 1953, "Cube-root-normal precipitation distributions," Trans. Amer. Geophys. Union, 34:31-35.


Stidd, C.K., 1968, "A three parameter distribution for precipitation data with a straight-line plotting method," Proc. 1st Statist. Meteorol. Conf., Amer. Meteor. Soc., Hartford, Connecticut, pp. 158-162.

Stidd, C.K., 1970, "The nth root normal distribution of precipitation," Water Resour. Res., 6(4):1095-1103.

Tukey, J.W., 1957, "On the comparative anatomy of transformations," Ann. of Math. Stat., 28:602-632.

Valencia, D.R., and Schaake, J.C., Jr., 1973, "Disaggregation processes in stochastic hydrology," Water Resour. Res., 9(3):580-585.

Wastler, T.A., 1969, Spectral Analysis: Applications in Water Pollution Control, U.S. Dept. of the Interior, Federal Water Pol. Control Adm., Washington, D.C.

Wei, T.C., and McGuinness, J.L., 1973, "Reciprocal distance squared method, a computer technique for estimating areal precipitation," ARS NC-8, U.S. Dept. of Agriculture, Washington, D.C.

Wilson, G.T., 1973, "Contribution to discussion of 'Box-Jenkins seasonal forecasting: Problems in a case study' by C. Chatfield and D.L. Prothero," J. R. Stat. Soc., A136:315-319.

Wold, H.O., 1938, A Study in the Analysis of Stationary Time Series, Almquist and Wicksell, Uppsala, 2nd ed., 1954.

Yevjevich, V.M., 1972, "Structural analysis of hydrologic time series," Hydrology Paper No. 56, Colorado State University, Fort Collins, Colorado.

Young, G.K., 1968, "Discussion of 'Mathematical assessment of synthetic hydrology' by N.C. Matalas," Water Resour. Res., 4(3):681-682.

Young, G.K., and Pisano, W.C., 1968, "Operational hydrology using residuals," J. Hydr. Div., ASCE, 94(HY4):909-923.

Yule, G.U., 1927, "On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers," in Statistical Papers of George Udny Yule, selected by Stuart, A., and Kendall, M., Hafner Publ. Co., New York, 1971.