Toward stochastic models for air pollution


Material Information

Title:
Toward stochastic models for air pollution
Physical Description:
viii, 125 leaves : ill. ; 28 cm.
Language:
English
Creator:
Chadbourne, John Frederick, 1948-
Publication Date:

Subjects

Subjects / Keywords:
Air -- Pollution -- Mathematical models   ( lcsh )
Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis--University of Florida.
Bibliography:
Includes bibliographical references (leaves 121-124).
Statement of Responsibility:
by John Frederick Chadbourne, Jr.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 000208388
notis - AAX5192
oclc - 04112715
System ID:
AA00003916:00001




Full Text









TOWARD STOCHASTIC MODELS
FOR AIR POLLUTION











By

JOHN FREDERICK CHADBOURNE, JR.


A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL
OF THE UNIVERSITY OF FLORIDA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF DOCTOR OF PHILOSOPHY








UNIVERSITY OF FLORIDA


1977


























Dedicated with love to my family,
Debby, Julie and Chipper.














ACKNOWLEDGEMENTS


I wish to thank my committee members, Doctors Dale A. Lundgren, J. E.

Singley, R. Carl Stoufer, and especially my chairman, Dr. Paul Urone,

and my co-chairman, Dr. Allen E. Durling, for their efforts in directing

this dissertation. Thanks also to Dr. Wayne Huber for all of his useful

suggestions. Mr. Steven T. Gladin deserves my appreciation for his

guidance in statistics and computer technology.


In the final stages, this dissertation was greatly helped by the encour-

agement and cooperation of Mr. J. Alfred Schueler. The manuscript was

typed by Ms. Brenda Lanza. I sincerely appreciate their efforts.

















TABLE OF CONTENTS

                                                                   Page

ACKNOWLEDGEMENTS                                                    iii

LIST OF TABLES                                                        v

LIST OF FIGURES                                                      vi

ABSTRACT                                                            vii

CHAPTER I      INTRODUCTION                                           1

CHAPTER II     BACKGROUND AND PROBLEM STATEMENT                       5

CHAPTER III    PATTERN RECOGNITION TO IDENTIFY
               STATIONARY INTERVALS                                  15

CHAPTER IV     AN APPLICATION TO URBAN OZONE DATA

CHAPTER V      SUMMARY AND CONCLUSIONS

APPENDIX I     AIR POLLUTION STATISTICS

APPENDIX II    PRINCIPAL COMPONENTS

APPENDIX III   METEOROLOGICAL CHARACTERISTICS OF DAYS GROUPED ON
               THE BASIS OF SURFACE OZONE CONCENTRATIONS AS
               OBSERVED IN TAMPA, FLORIDA

REFERENCES

BIOGRAPHICAL SKETCH














LIST OF TABLES

Table 3.1    Ozone Concentration for Ten Example Days With Mean and
             Standard Deviation by Hour

Table 3.2    Factor Scores for Ten Day Example

Table 3.3    Ten Day Cluster Example Scatter Summary

Table 4.1    Contraventions of the Oxidant Standard Observed in
             Hillsborough and Pinellas Counties, Florida in 1974

Table 4.2    Mean and Six Principal Components, Station 109, 1974 and 1975

Table 4.3    Reconstruction of April 19, 1974, Station 109

Table 4.4    Days Represented in Group SE: Day Type at Station and
             Number of Hours Exceeding 159 µg/m³

Table 4.5    Average Lognormal Parameters for Summary Groups and
             Subgroup Membership

Table 4.6    Cross Reference Summary Subgroups from Figures 4.2
             Through 4.5

Table AI.1   Probability Density Functions Commonly Used in Air
             Pollution Modeling

Table AI.2   Confidence Limits for Ambient Ozone Calibration














LIST OF FIGURES
                                                                   Page

Fig. 3.1    Mean and Principal Components for Ten Day Example        29

Fig. 3.2    Dendrograms for Ten Day Example                          32

Fig. 3.3    Three Dimensional Display of Factor Scores for
            Ten Day Example                                          36

Fig. 3.4    Lognormal Plots of Each Day in Ten Day Example           37

Fig. 3.5    A Schematic Diagram for Empirical Modeling Using
            Cluster Analysis                                         40

Fig. 4.1    Mean and Principal Components for Hillsborough and
            Pinellas County Ozone Data 1973 Through 1975             47

Fig. 4.2    Dendrogram for Station 63 in 1974                        54

Fig. 4.3    Dendrogram for Station 94 in 1974                        55

Fig. 4.4    Dendrogram for Station 109 in 1974                       56

Fig. 4.5    Dendrogram for Station STP in 1974                       57

Fig. 4.6    Clustering Steps for the Twelve Days in Group 109-13     60

Fig. 4.7    Dendrogram for All Four Stations in 1974 Based on
            Mean Factor Scores                                       62

Fig. 4.8    Lognormal Graphs for Summary Groups                      68

Fig. 4.9    Mean and Principal Components of Subgroups               70

Fig. AI.1   Lognormal Distribution with m_g = 100, σ_g = 2.0 and
            Associated Order Statistics for Highest of 365
            Observations                                             95

Fig. AI.2   β1, β2 Plane for Pearson System                          97

Fig. AI.3   Larsen's Model                                          103

Fig. AI.4   Example Three Stage Calibration                         109














Abstract of Dissertation Presented to the Graduate Council
of the University of Florida in Partial Fulfillment
of the Requirements for the Degree of Doctor of Philosophy


TOWARD STOCHASTIC MODELS FOR AIR POLLUTION

By

John Frederick Chadbourne, Jr.

December, 1977

Chairman: Paul Urone
Major Department: Environmental Engineering Sciences

Mathematical models to describe air pollution are necessary planning

tools in the air quality management process. They provide a means of

predicting and controlling pollution. Classical models are necessary

in the absence of reliable ambient measurements. These models have

become increasingly sophisticated, but they remain inadequate to des-

cribe many of the phenomena which actually occur in the atmosphere.

Statistical methods applied to reliable air quality influence parameters

can advance the science of modeling, leading to stochastic models to

simulate and better understand air pollution. Ultimately, these tech-

niques will provide capabilities for real-time analysis and control of

air pollution.

Statistical models for air pollution have generally been limited to

specification of distribution functions or analysis of orthogonal com-

ponents because dynamic statistical models are complicated under the

non-stationary conditions encountered in the atmosphere. This disser-

tation investigates the application of pattern recognition techniques












to identify quasi-stationary conditions with respect to both space and

time. Systems models approximating these quasi-stationary groupings

can be constructed when adequate influence data become available. An

analysis of mesoscale meteorological conditions was used to illustrate

how influence parameters can be identified.

Photochemical oxidants are a national concern. The primary consti-

tuent is ozone which can be accurately measured using a chemilumines-

cent technique. During 1974, the Tampa, Florida, area experienced many

elevated oxidant periods which were documented on well-calibrated con-

tinuous monitors. These data were therefore selected to determine

conditions under which stochastic air pollution models could be devel-

oped.

The multivariate approach developed in this dissertation for iden-

tifying similar ozone waveforms is a powerful analytical tool. The

method is not restricted to air pollution research, but extensive ap-

plications in atmospheric studies are appropriate. Perhaps the most useful

aspect of the method is the ability to identify conditions which facil-

itate model development. The most frequently encountered condition in

this study was designated the non-episode type. For these days, the

oxidant standard was rarely exceeded; most sampling sites recorded the

same type of waveform; and many consecutive days reported this condi-

tion. These days would be relatively easy to model since mesoscale

parameters would generally be adequate, many examples of these condi-

tions are available and variations between days in this group are small.

The next most frequently occurring condition was called an episode day













since the oxidant standard was often exceeded, but ordinarily for only

one or two hours in each day. Spatial differences become increasingly

important for this group indicating a need for microscale data in some

instances. More variance was encountered and many different waveforms

were represented. A third type of day was designated super episode

since the oxidant standard was usually exceeded and elevated levels

persisted for several hours. Super episode conditions often followed

frontal passage under high pressure conditions when solar insolation

was very high. Many of these days exhibited rapid change in concentra-

tion and radically differing waveforms. Spatial differences were also

significant on many of these occasions. Consequently, super episode

conditions would be most difficult to model and would require a dense

network providing microscale influence data and frequent sampling.

There were, however, a number of frequently repeated spatially homo-

geneous conditions within the super episode group. These would be good

candidates for stochastic modeling.















CHAPTER I INTRODUCTION


There are many approaches to predicting air quality in today's society.

This is natural since the problem is indeed complicated. As discussed

in Gifford (16,p.84), the most general equation describing pollutant

transport is a hyperbolic partial differential equation. There is

actually no known closed-form solution for this equation. Consequently,

different methods of finding approximate solutions have evolved. The

two basic classes of methods are called a priori (before the fact) and

a posteriori (after the fact). A priori methods concentrate on a know-

ledge of the physics involved to make appropriate assumptions and to

establish boundary conditions. Similarity variables, separation of

variables and integral transform methods are then used to obtain ana-

lytical solutions. A posteriori methods operate on observed variables

as they change with time to develop functional relationships between

influence parameters (inputs) and the pollutant modeled (output). Both

types of models are useful tools for understanding the causes and meth-

ods of controlling air pollution.


There are many kinds of a priori models. The simplest type uses a

single analytical equation for which various combinations of "steady"

input parameters yield concentration as a function of distance from

the sources. They are divided into two types called long-term and

short-term models. Both use the same equation, but the manner of













treating the output concentration differs. Long-term models weight

the result from each set of conditions by an empirically determined

frequency of occurrence for that condition. The sum of all such con-

centrations represents a long term average pollutant level as a func-

tion of space. Isopleth contour maps may then be plotted. Short-term

models provide down-wind steady-state concentrations. The averaging

time for these results is assumed to be on the order of 10 minutes,

and longer averaging times (30 min., 1 hour, etc.) are estimated empir-

ically by considering meander in the unsteady wind. Sophisticated a

priori models use numerical integration techniques. Forward difference,

central difference, backward difference and combinations are used with

a moving (Lagrangian) coordinate system, as well as the more common

fixed (Eulerian) coordinate system. Generally, long-term models pro-

vide better estimates since averages are easier to predict than specific

conditions. In either case, the model performance is evaluated by com-

paring model projections with observed air quality under the assumed

conditions. The practice is known as calibration. Millions of dollars

have been spent to develop a priori models in recent years.
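To make the short-term/long-term distinction concrete, the following is a
minimal Python sketch, not drawn from this dissertation: one steady-state
Gaussian-plume estimate for each "steady" condition, and a long-term average
formed by weighting each result by an empirical frequency of occurrence. The
emission rate, stack height, wind speeds, dispersion coefficients and
frequencies are hypothetical illustration values.

    # A minimal sketch (not the author's model): a steady-state Gaussian plume
    # estimate for one "short-term" condition, and a "long-term" average formed
    # by weighting several such conditions by their observed frequency.
    # All numerical values below are hypothetical illustration values.
    import math

    def plume_centerline(Q, u, sigma_y, sigma_z, H):
        """Ground-level centerline concentration (g/m3) with ground reflection."""
        return (Q / (math.pi * u * sigma_y * sigma_z)) * math.exp(-H**2 / (2.0 * sigma_z**2))

    Q, H = 10.0, 50.0                      # emission rate (g/s), effective stack height (m)

    # (wind speed m/s, sigma_y m, sigma_z m, frequency of occurrence) at one receptor
    conditions = [(2.0,  60.0,  30.0, 0.20),
                  (4.0, 100.0,  60.0, 0.50),
                  (6.0, 150.0, 100.0, 0.30)]

    # short-term estimate for each steady condition
    short_term = [plume_centerline(Q, u, sy, sz, H) for u, sy, sz, _ in conditions]

    # long-term average: weight each steady result by its empirical frequency
    long_term = sum(c * f for c, (_, _, _, f) in zip(short_term, conditions))

    print("short-term concentrations:", ["%.2e" % c for c in short_term])
    print("frequency-weighted long-term average: %.2e g/m3" % long_term)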


The two types of a posteriori models are loosely referred to as static

and dynamic. Static models focus on the underlying set of distributions

which characterize observed air quality. Most popular of these is the

lognormal family. Two kinds of questions are answered by such models.

The first is the empirical probability of observing a value exceeding

some fixed limit, presumably an air quality standard, during the year













sampled. The second issue to be resolved is the probability that this

limit would be exceeded in subsequent years. This type of model is

directly related to air quality standards and it therefore has been

used and abused extensively.
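As an illustration of this static, distribution-based approach, the sketch
below (Python, with made-up data) fits a two-parameter lognormal through its
geometric mean and standard geometric deviation and estimates the probability
of exceeding a fixed limit. It is a generic example, not the specific
procedure developed in Appendix I.

    # A minimal sketch of the "static" distributional approach: fit a lognormal
    # to a set of (hypothetical) daily maxima and estimate the probability of
    # exceeding a fixed standard.
    import math

    def lognormal_exceedance(values, standard):
        logs = [math.log(v) for v in values if v > 0]
        n = len(logs)
        mu = sum(logs) / n                                          # ln(geometric mean)
        sd = math.sqrt(sum((x - mu) ** 2 for x in logs) / (n - 1))  # ln(std. geometric dev.)
        z = (math.log(standard) - mu) / sd
        return 0.5 * math.erfc(z / math.sqrt(2.0))                  # P(C > standard)

    # hypothetical daily 1-hour ozone maxima (ug/m3)
    sample = [40, 55, 62, 75, 80, 95, 110, 120, 130, 150, 170, 190]
    print("P(max > 160 ug/m3) =", round(lognormal_exceedance(sample, 160.0), 3))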


Dynamic statistical (stochastic) models have not been thoroughly inves-

tigated. There are many reasons which explain this historical lack of

interest in stochastic systems for air pollution. A mathematical basis

in terms of the physical process provides a framework in which to

understand pollutant dispersion. Consequently, there is an inherent

bias toward physical models. The theorist will resort to empirical

techniques only after he has thoroughly exhausted his analytical ave-

nues. Empiricists ultimately attempt to relate conclusions to physical

laws to facilitate interpretation of results. Naturally, a priori

models are developed first. Attention turns to empirical models when

the physical models become inadequate or when additional viewpoints

become necessary. Moreover, in air pollution, the inaccuracy of field

measurements becomes extremely important since trace concentrations

must be studied. Surface chemistry, molecular interaction and inter-

ference become serious problems which undermine the validity of data.

Empirical models are limited by the axiom "garbage in yields garbage

out." Consequently, these models cannot really be developed until

reliable aerometric data are available.


In recent years it has become increasingly clear that the dispersion of

air pollutants is a stochastic phenomenon. Realistically, a simple













analytical solution cannot provide reliable short-term solutions under

conditions normally encountered in the atmosphere. Supporters of a

priori models have argued that stochastic models would be applicable

only to the region for which they were developed. However, the same

factors which contribute to regional applicability tend to violate the

assumptions required in physical models. Consequently, as reliable

continuous air quality data become more available, interest in stochas-

tic models will increase.


Perhaps the most troublesome aspect of air pollution modeling is the

nonstationary character of the atmospheric transport process. Simply

stated, this means that the system being modeled changes with time.

Systems engineers have coped with this kind of problem by developing

adaptive filters. However, a systematic basis for adjusting the filter

coefficients is required. In this dissertation, the nonstationary

character exhibited by ozone data reported in Hillsborough and Pinellas

counties in Florida is investigated. These data were selected due to

the national importance of the photochemical oxidant problem and the

availability of rigorously verified data. The objective is to classify

reported episodes into quasi-stationary intervals using techniques from

pattern recognition. These constitute a basis for multiple stochastic

model development and also prove to be a useful diagnostic tool for

understanding photochemical air pollution.
















CHAPTER II

BACKGROUND AND PROBLEM STATEMENT


The study of air pollution as a function of space and time has long

been a human concern. Pollutant dispersion is described by differen-

tial equations. Mathematical techniques for solving these equations

have historically been adapted from the study of heat transport. This

is primarily due to the comprehensive analytical solutions developed

for various boundary conditions and input functions. Atmospheric sys-

tems contain random variables and random functions of these variables.

Differential equations for random functions are called stochastic pro-

cesses and they generally do not have analytical solutions. An ap-

proach, which avoids steady-state analytical solutions, is numerical

integration. The most advanced air pollution models in today's society

use this technique.


Regardless of the methods used to model a process, comparison with

measured results is essential. Empirical analysis of these measured

values can provide insight concerning the functional relationships

between influence parameters and pollutant concentrations in addition

to validating models. Thorough statistical analysis of observed con-

centrations, viewed as non-stationary dynamic systems, leads to the

development of stochastic dispersion models. This system theory ap-

proach to air pollution is investigated in this dissertation.













Brownian motion provides a starting point for understanding atmospheric

stochastic processes. In a paper by Einstein (12) the irregular motion

of colloidal particles based on collisions with surrounding fluid mole-

cules was described. A thorough treatment of Brownian motion is given

in Fuchs (14), showing that the velocity history of a colloidal parti-

cle is a random (stochastic) process. The equation describing Brownian

motion is

    dU(t)/dt = [A]U(t) + [B]W(t)                                      (2.1)

where

    d/dt = total derivative with respect to time

    U(t) = particle velocity vector at time t (a state variable)

    [A] = -β[I], the system matrix, with:

        β = 6πμa/m  (particle mobility)

        a = radius of spherical particle

        μ = fluid viscosity

        m = particle mass

    [I] = identity matrix

    [B] = distribution matrix relating input variables
          to state variables

    W(t) = input vector of white noise
           representing molecular collisions.


Equations of this form were first treated by Langevin (18), who hypo-

thesized two additive components. The first component, ([A]U(t)), is

the systematic dynamic friction (drag). The second component, ([B]W(t)),

represents the random force exerted by molecular collisions. Brownian













motion is also known as the Wiener process after Norbert Wiener (48),

who formulated much of the present theory of time series.


Integrating Langevin's equation yields

    U(t) = e^(At) U(0) + e^(At) ∫₀ᵗ e^(-Aλ) [B] W(λ) dλ               (2.2)

where

    e^(At) = exp([A]t), the matrix exponential

    U(0) = velocity vector at time zero; and

    τ = λ - t, the lag parameter.




From statistical mechanics, the equipartition of energy for each of the
u1, u2, u3 (x, y or z) velocity components of the Brownian particle gives

    m⟨u_i²⟩ = kT                                                      (2.3)

where

    ⟨u_i²⟩ = ensemble average squared velocity component

    k = Boltzmann's constant

    T = absolute temperature.


Since ⟨u_i²(t)⟩ is not a function of time, the process is stationary, having

a time invariant autocorrelation function. This autocorrelation function,

defined as

    R(τ) = ⟨u₁(t) u₁(t+τ)⟩ / ⟨u₁²(t)⟩                                 (2.4)

can easily be determined from Equation 2.2.


For Brownian motion, the ensemble average of all terms containing the

input vector W(t) is zero. Thus only the drag component enters into

this equation, resulting in

    R(τ) = e^(Aτ) = e^(-βτ)                                           (2.5)


This is the autocorrelation function of a Markov process, characterized

in Arnold (2, p.27) by the property "if the state of a system at the

present is known, information from past times has no effect on our

knowledge of the probable development of the system in the future."

In other words, the system has no memory.
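A minimal numerical sketch of Equation 2.1 for a single velocity component
illustrates this memoryless behavior: the sample autocorrelation of the
simulated series decays approximately as exp(-βτ), the form given in
Equation 2.5. The values of β, the time step and the noise scale used below
are arbitrary.

    # A minimal numerical sketch of the Langevin equation for one component:
    # du/dt = -beta*u + white noise, integrated with a simple forward step.
    import numpy as np

    rng = np.random.default_rng(0)
    beta, dt, nsteps = 5.0, 0.001, 100_000
    u = np.zeros(nsteps)
    for k in range(1, nsteps):
        u[k] = u[k-1] - beta * u[k-1] * dt + rng.normal(0.0, np.sqrt(dt))

    def autocorr(x, lag):
        x = x - x.mean()
        return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

    for lag in (20, 100, 200):                   # tau = lag * dt
        tau = lag * dt
        print("tau=%.3f  sample R=%.3f  exp(-beta*tau)=%.3f"
              % (tau, autocorr(u, lag), np.exp(-beta * tau)))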


The work of Taylor (45) provides some very important results in the

field of turbulent diffusion theory. Taylor's theorem relates the

expected particle displacement of a diffusing cloud to the velocity

autocorrelation function. For an initially concentrated cloud (instan-

taneous point source) Taylor's theorem is


    dσ_x²/dt = 2⟨u₁²⟩ ∫₀ᵗ R(τ) dτ                                     (2.6)

Here σ_x is the radius of inertia, a measure of cloud size. Substituting

the autocorrelation for Brownian motion and integrating this equation

yields

    σ_x² = (2⟨u₁²⟩/β²) [βt - (1 - e^(-βt))]                           (2.7)


For unit density spheres with radii smaller than 10 micrometers, the

relaxation time (β⁻¹) is less than 0.01 seconds. Thus, for time on

the order of seconds,













    σ_x² = 2⟨u₁²⟩ t / β = 2Dt                                         (2.8)

where

    D = kT / (6πμa)   (the diffusivity constant).



The equipartition of energy principle from statistical mechanics and

the integrated Langevin's equation, together with Taylor's theorem,

yield Einstein's conclusion that Brownian motion can be simply defined

using a diffusivity constant.
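The following short numeric check, using illustrative air and particle
properties rather than values from the text, evaluates the relaxation time
1/β = m/(6πμa) and the diffusivity D = kT/(6πμa) for unit-density spheres;
it is consistent with the statement preceding Equation 2.8 that 1/β is less
than 0.01 seconds for radii under 10 micrometers.

    # A minimal numeric check of the quantities used above, for a unit-density
    # sphere in air at room conditions (illustrative property values).
    import math

    k   = 1.381e-23         # Boltzmann's constant, J/K
    T   = 293.0             # absolute temperature, K
    mu  = 1.8e-5            # viscosity of air, Pa*s
    rho = 1000.0            # unit-density particle, kg/m3

    for a in (1e-6, 1e-5):  # particle radius: 1 um and 10 um
        m = (4.0 / 3.0) * math.pi * a**3 * rho
        drag = 6.0 * math.pi * mu * a
        beta = drag / m
        D = k * T / drag
        print("a=%.0e m  1/beta=%.2e s  D=%.2e m2/s" % (a, 1.0 / beta, D))
    # For a = 10 um the relaxation time is on the order of 1e-3 s, consistent
    # with the statement that 1/beta is less than 0.01 seconds.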


The classical diffusion equation representing Brownian motion is an

instantaneous point source with no bulk fluid motion. Thus,


    ∂c/∂t = D∇²c = D (∂²c/∂x² + ∂²c/∂y² + ∂²c/∂z²)                    (2.9)

where

    ∂/∂t = partial derivative with respect to time

    ∇² = the Laplacian operator

    c = concentration of diffusing substance

    D = a constant diffusivity.


Analytical solutions for this equation, and for augmented equations

representing atmospheric conditions, are abundant in the literature.

Two particularly readable and comprehensive summaries are Slade (41)

and Csanady (7). For the isotropic conditions (σx = σy = σz) exhibited

in Brownian motion, the analytical solution is













    c(r,t) = Q (2πσ²)^(-3/2) exp[-r²/(2σ²)]                           (2.10)

This has the form of the Gaussian, or normal, probability density function

where

    Q = source strength

    r² = x² + y² + z², the squared radius of the diffusing cloud

    σ² = 2Dt.


The diffusivity required to satisfy equation 2.10 is the same diffu-

sivity derived for the stochastic formulation of Brownian motion given

in equation (2.8). Thus, the stochastic and analytical approaches

yield the same results for the special conditions of Brownian motion.
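As a small worked illustration of Equation 2.10, the sketch below evaluates
the spreading cloud at its center and at one standard radius. The source
strength is arbitrary and the diffusivity is the illustrative Brownian value
from the preceding check.

    # A minimal sketch evaluating the instantaneous-point-source solution,
    # c(r,t) = Q (2*pi*sigma^2)^(-3/2) exp(-r^2/(2*sigma^2)), sigma^2 = 2*D*t.
    import math

    def point_source(Q, D, r, t):
        s2 = 2.0 * D * t                   # sigma^2 = 2Dt
        return Q * (2.0 * math.pi * s2) ** -1.5 * math.exp(-r**2 / (2.0 * s2))

    Q, D = 1.0, 1.2e-11                    # illustrative source strength and diffusivity (m2/s)
    for t in (1.0, 10.0, 100.0):
        s = math.sqrt(2.0 * D * t)
        print("t=%5.0f s  sigma=%.2e m  c(0,t)=%.3e  c(sigma,t)=%.3e"
              % (t, s, point_source(Q, D, 0.0, t), point_source(Q, D, s, t)))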


For n chemically reactive constituents, the continuity equation is

    ∂Cᵢ/∂t + ∂(u₁Cᵢ)/∂x + ∂(u₂Cᵢ)/∂y + ∂(u₃Cᵢ)/∂z
        = D∇²Cᵢ + Rᵢ(C₁,...,Cₙ,T) + Sᵢ(x,y,z,t)                       (2.11)

where

    i = 1,2,...,n

    n = number of pollutants involved

    Cᵢ = concentration of constituent i

    (u₁,u₂,u₃) = (x,y,z) components of wind velocity

    Rᵢ = rate of formation of constituent i by chemical reaction

    Sᵢ = rate of emission of constituent i from sources.

In practice, only n-1 equations are required since the remaining con-

stituent can be determined by difference.












In modeling air pollution, one can usually assume that the pollutants

are merely tracers of atmospheric activity. Under this assumption

equation 2.11 can be solved alone without considering the coupled

Navier-Stokes (27) and energy equations. Now, each wind velocity com-

ponent consists of a deterministic and a stochastic component (uᵢ = ūᵢ + uᵢ′).

Similarly, each Ci consists of an ensemble-average concentration and a

random component (Cᵢ = ⟨Cᵢ⟩ + Cᵢ′). Ordinarily, molecular diffusion is

negligible compared to turbulent dispersion. Introducing these con-

siderations and taking the expected value of equation 2.11 yields the

"K-theory" coupled partial differential equations presented in earlier

dispersion modeling studies; for example, see Pasquill (30) and Slade

(41). The general form of this equation is


    ∂⟨Cᵢ⟩/∂t + ∂(ū₁⟨Cᵢ⟩)/∂x + ∂(ū₂⟨Cᵢ⟩)/∂y + ∂(ū₃⟨Cᵢ⟩)/∂z =

        ∂/∂x [Kxx ∂⟨Cᵢ⟩/∂x] + ∂/∂y [Kyy ∂⟨Cᵢ⟩/∂y] + ∂/∂z [Kzz ∂⟨Cᵢ⟩/∂z] +

        Rᵢ(⟨C₁⟩,...,⟨Cₙ⟩,T) + Sᵢ(x,y,z,t)                             (2.12)

where

    Kaa = the multiplicative stochastic (eddy) diffusivity components, and

    Kab = 0.0 is assumed for a ≠ b.


Analytical solutions for Ci(x,y,z,t) are derived by making assumptions

in the above equation and applying classical mathematics [Carslaw and

Jaeger (6)].













Unfortunately, dispersion modeling is complicated by "eddy diffusion"

when random motions of the transporting fluid enter the equation.

Equation 2.12 is analogous to equation 2.9 with "eddy diffusivities"

(Ks) in place of molecular diffusivity (D). This difference can be

interpreted in terms of the "scale of turbulence." For Brownian motion

the length scale is the mean free path and the time scale (t_L) is β⁻¹.

For turbulent flow there is no single time or length scale. Turbulent

flow in the atmosphere is characterized by eddies of vastly different

sizes. Large eddies extract energy from the bulk flow. Smaller eddies

feed on the energy of larger eddies resulting in an "energy cascade."

The eddy energy comes from the bulk fluid energy augmented by solar

insolation, surface heat and topographic factors. These eddies move

groups of particles together, so the motion of neighboring particles

is not independent. This correlation of motion over many size scales

causes unpredictable differences such that the observed concentrations

are random variables. Further, eddy diffusivities are not constant in

space or time. Under these conditions, there is no analytical solution

to equation 2.12. Many researchers have turned to numerical integration

to overcome this limitation [Reynolds et al. (36) and (37); California

Air Resources Board (CARB)( 5 ); and Randerson (35)].


The status of recent air pollution modeling is summarized in NATO/CCMS

(26). The advantages and limitations of both classical and statistical

models are presented in a complementary context. Advanced classical

models require extensive specification of wind fields, diffusivities













and constituent concentrations as functions of space and time. Problems

with these models include large computer time and storage requirements,

stability of integration schemes, "artificial diffusion," and cumulative

errors (CARB (5)). In some situations, simple models yield better

results than the advanced models, as demonstrated by Gifford in NATO/

CCMS (15,p.XVI). Statistical air pollution models have not been as

extensively developed as classical models. Very few dynamic statistical

models have been formulated. The primary reasons for this are the limited

availability of reliable air quality data from dense sampling networks; a

lack of regional applicability of the models developed; mathematical

complications; and most importantly, the non-stationary character of

air pollution. The chronology of the NATO reports clearly illustrates

the increasing emphasis on statistical aspects in modern air quality

simulation.


A stationary process has an autocorrelation function which does not

change with time. Second order stationarity implies that the mean

and variance of repeated experiments are constant. Air pollutant dis-

tributions are not stationary, primarily due to changes in meteorological

conditions. This non-stationary character of atmospheric turbulence

was treated by Pasquill (30). He devised "stability classes" to des-

cribe turbulence types. This stationarity concept is closely coupled

to the duration of an experiment and the averaging time of samples.

Although diurnal variation denies stationarity with respect to 24 con-

secutive hour-long experiments, day-long experiments may be stationary













with respect to one another. Similarly, major changes in meteorology,

such as storms, or frontal passage, disrupt stationary conditions.
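The day-by-day comparison implied here can be sketched as follows. The
hourly data are random placeholders, and agreement of the daily means and
variances within sampling error is taken as the working test of second-order
stationarity.

    # A minimal sketch of the second-order stationarity idea: treat each day of
    # hourly data as one repetition of an experiment and compare daily means
    # and variances; large day-to-day changes signal non-stationary conditions.
    import numpy as np

    rng = np.random.default_rng(1)
    hourly = rng.lognormal(mean=3.5, sigma=0.5, size=10 * 24)   # 10 days x 24 hours
    days = hourly.reshape(10, 24)

    daily_mean = days.mean(axis=1)
    daily_var  = days.var(axis=1, ddof=1)
    for d, (m, v) in enumerate(zip(daily_mean, daily_var), start=1):
        print("day %2d  mean=%6.1f  variance=%8.1f" % (d, m, v))
    # Days whose mean and variance agree within sampling error may be treated
    # as stationary with respect to one another; the rest need separate handling.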


As pointed out by Pasquill (30,p.17), it is this non-stationary nature

of atmospheric processes which limits the use of time-series techniques.

Selecting subgroups of data, which are stationary with respect to one

another, provides a means of developing dynamic statistical models for

air pollution. Stochastic models offer some very attractive qualities.

Since their random dynamic nature is analogous to observed atmospheric

conditions, they should prove superior to the steady-state Gaussian

models popularized in the recent past.














CHAPTER III

PATTERN RECOGNITION TO IDENTIFY STATIONARY INTERVALS


Selecting reproducible conditions in air pollution can be viewed as a

problem in pattern recognition. Many methods which accomplish this

objective can be formulated. A good method must operate on readily

available data to delineate similar conditions at reasonable cost. One

could begin by plotting the observed concentration as a function of time

and studying the graph to pick piecewise-similar days. Unfortunately,

conditions in the atmosphere usually change in rapid and random fashion,

as when fronts pass or reacting contaminants interact. Consequently,

randomly repeated intervals occur which are not ordinarily sequential.

An additional complication concerns differences in waveforms occurring

at the same time, but at different spatial locations. In order to apply

stochastic modeling techniques to air pollution, methods for identifying

similar conditions randomly distributed through space and time are nec-

essary.


Statistical studies of time behavior traditionally deal with time cor-

relation. The correlation of a signal with itself is known as the auto-

correlation function (ρ(τ)). Tau (τ) is a time variable representing

lag or lead from zero time toward past or future time. For example,

ρ(0), the autocorrelation at zero lag, is computed from the covariance

of a sequence of numbers. For a zero mean sequence, this is the sum of













all squares divided by the number of observations. At each lag (for

example 10 hours) a new covariance can be estimated by summing the pro-

duct of all paired observations τ (ten) hours apart. Dividing each of

these covariances by the covariance at zero lag yields the sample auto-

correlation function. By analogy two signals can be related by com-

puting cross products as one signal is slid past the other, yielding

the crosscorrelation function. These two functions are used by statis-

ticians to characterize time behavior. When the relationship between

the autocorrelation and crosscorrelation functions does not change the

system is called time invariant or stationary. However, when this rela-

tionship changes with time, as it does in air pollution studies, multiple

inputs and non-stationary methods are essential.
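A minimal sketch of the sample autocorrelation computation just described is
given below, using a synthetic hourly series with a diurnal cycle; a
cross-correlation between two signals is formed the same way from lagged
cross products.

    # Sample autocorrelation: for a (centered) sequence, the lag-tau value is
    # the lagged covariance divided by the covariance at zero lag.
    import numpy as np

    def sample_autocorr(x, max_lag):
        x = np.asarray(x, dtype=float) - np.mean(x)
        c0 = np.dot(x, x) / len(x)                    # covariance at zero lag
        out = []
        for lag in range(1, max_lag + 1):
            c = np.dot(x[:-lag], x[lag:]) / len(x)    # covariance at this lag
            out.append(c / c0)
        return out

    hours = np.arange(240)                            # ten days of hourly values
    ozone = (50 + 40 * np.sin(2 * np.pi * hours / 24)
             + np.random.default_rng(2).normal(0, 5, hours.size))
    print([round(r, 2) for r in sample_autocorr(ozone, 12)])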


Engineers are usually exposed to the concept of frequency response in

the study of system control. The way a system reacts to changes in

input is described in terms of transfer functions, having poles, zeros

and transforms (Laplace and Fourier) relating time to frequency.

Theoretically, the Fourier transform of a signal and the Fourier trans-

form of the autocorrelation function are directly related through the

power spectral density function as discussed in Wiener (48). Consequently,

systems having the same correlation functions should exhibit the same

frequency response. Identifying intervals having the same frequency

response specifies a quasi-stationary system. This approach, using

very efficient algorithms to compute sliding fast Fourier transforms

(FFT) as in Rabiner and Gold (34,p.382), is a powerful analytical












technique. It provides a way to compare any interval of time with any

other interval based on the system response. However, this direct

approach to identifying stationarity is expensive and does not take

advantage of the known physical processes involved.
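The frequency-domain comparison can be sketched with the periodogram of one
24-hour interval; sliding this window along the record and comparing
successive spectra is the comparison referred to above. The series is
synthetic and the 24-hour window length is an assumption for illustration.

    # Periodogram (squared FFT magnitude) of one interval as an estimate of its
    # power spectral density; similar spectra suggest a quasi-stationary system.
    import numpy as np

    rng = np.random.default_rng(3)
    hours = np.arange(10 * 24)
    series = 50 + 40 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, hours.size)

    window = series[:24] - series[:24].mean()        # one 24-hour interval
    spectrum = np.abs(np.fft.rfft(window)) ** 2      # periodogram of the interval
    freqs = np.fft.rfftfreq(window.size, d=1.0)      # cycles per hour

    dominant = freqs[np.argmax(spectrum[1:]) + 1]    # skip the zero-frequency bin
    print("dominant frequency: %.3f cycles/hour (period %.1f h)"
          % (dominant, 1.0 / dominant))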


One of the most significant factors in atmospheric studies is solar

insolation. The sun is a primary driving force behind meteorology.

Wind speed, wind direction and atmospheric turbulence are related to

the intensity and duration of incident radiation as well as other major

physical phenomena. Since the period of rotation of the earth is one

day, it is reasonable to divide a continuous record into 24 hour-long

intervals. Longer periods can be selected based on frontal passage

or storm cycles as appropriate. These periods can be further sub-

divided into wind direction or turbulence types as is currently prac-

ticed in dispersion modeling. Differences between days may then be

viewed as multivariate problems. The hour of day, for example, can be

a variable and each day would then be represented as a 24-space vector.

In this way, the relationship between time and other variables can be

analyzed using classical multivariate techniques. A multivariate

approach to identify similar patterns of air pollution is presented

in this chapter.


A fundamental concept in multivariate analysis is the covariance matrix.

For a collection of days having 24 hourly averages, the covariance

matrix has 24 rows and 24 columns and the values in the upper right












triangle are equal to those in the lower left triangle. Each diagonal

element is the sum of squares for a particular hour divided by one less

than the total number of days. The square root of each such covariance

is the standard deviation for that particular hour. The off-diagonal

values represent correlations between variables and are cross products

of pairs representing both hours summed over all days and divided by

the number of days. Thus, the total variance of the matrix is the sum

of its diagonal elements and the intercorrelation between each hour and

all other hours is represented within the matrix. Often the mean for

each hour is first subtracted from the raw data, resulting in a covar-

iance matrix about the mean.
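In matrix terms, the computation described in this paragraph is simply the
following (placeholder data; note that the divisor n-1 is used for every
element here, whereas the text divides the off-diagonal terms by the number
of days):

    # The 24 x 24 covariance matrix about the mean: rows are days, columns are
    # the 24 hourly averages; hourly variances appear on the diagonal.
    import numpy as np

    rng = np.random.default_rng(4)
    days = rng.lognormal(3.5, 0.5, size=(30, 24))     # 30 days x 24 hourly values

    hourly_mean = days.mean(axis=0)                   # the "average day"
    centered = days - hourly_mean                     # deviations about the mean
    cov = centered.T @ centered / (days.shape[0] - 1) # 24 x 24, symmetric

    print(cov.shape, "total variance (trace) = %.1f" % np.trace(cov))
    print("std. deviation of hour 0 = %.1f" % np.sqrt(cov[0, 0]))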


By mathematically rearranging the covariance matrix, a concise repre-

sentation of each original day can be computed using only a few new

variables rather than the original 24. This powerful technique, dis-

cussed in detail in Morrison (25) and in Harmon (17), is known as factor

analysis. The first step, called principal component analysis, finds

new variables made up of proportionate weights of each of the original

variables. The contributions of each hour are assigned in a way which

accounts for as much of the difference between individual days and the

"average day" as possible. The second component, or weighting function,

is constructed perpendicular to the first component in a way which

accounts for as much of the remaining variance as possible. Subsequent

components can be constructed which are perpendicular to the previous

components until, finally, all of the variation is accounted for. The











analyst can then select the minimum number of components to recover any

desired percentage of the original information. Factor scores can then

be computed for each day. These factor scores determine the amount of

each component to be added to the average day (mean vector) to recon-

struct the original hourly observations. The factor weights are added

to the mean vector because the covariance matrix, from which the eigen-

vectors were derived, was calculated about the 24 hourly mean values.

The object is to reduce the dimensions of the problem by projecting the

original information into a set of orthogonal functions. These ortho-

gonal functions (eigenvectors) are ranked in order of their relative

importance. Thus, the original data can be represented as a few inde-

pendent variables while retaining most of the information. An applica-

tion of principal components to air pollution appears in Peterson (32)

and a particularly readable description of principal component analysis

can be found in Stidd (44).

This procedure can be succinctly summarized in the shorthand of matrix

algebra. A linear model of the form

    [X] = [F][E]ᵀ + M + [ε]                                           (3.1)

is to be constructed where:

    [X] = estimate matrix reconstructing the original data

    [F] = factor score matrix representing each day

    [E] = factor loading matrix of weighting functions for each
          independent factor

    T = transpose operator

    [ε] = error matrix, or difference between observed
          and predicted values; and











M = mean vector or average day

The first step is to find the eigenvalues (see Appendix II) of the

singular system

    det [ [X]ᵀ[X] - λ[I] ] = 0                                        (3.2)

where:

    det [·] = determinant of matrix [·]

    [X]ᵀ[X] = normalized covariance matrix [cov]

    [I] = identity matrix; and

    λ = eigenvalues (24 in this application).


Note that the sum of the eigenvalues over all 24 variables equals the

total variance in the covariance matrix. Only the first few eigen-

values are retained so that their sum represents the desired percentage

of this total variance. Eigenvectors are computed for each eigenvalue

retained so that

    [cov] Eᵢ = λᵢ Eᵢ                                                  (3.3)

The factor loading matrix ([El) has a 24-valued weighting function

representing each column of the matrix. These are the independent

factors described above. Factor scores indicating the contribution

of each factor required to reconstruct each day are computed from

    [F] = [E]ᵀ [cov]⁻¹ [X]                                            (3.4)

where:

[F] = factor scores, a column vector for each
day; and

[X] = observed concentration minus the mean
for each hour











These factor scores have zero mean and unit variance computed over all

days. The example which completes this chapter illustrates these com-

putations.
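A minimal sketch of Equations 3.2 through 3.4 using a standard symmetric
eigensolver is given below. Here the eigenvectors are kept unit-norm and the
scores carry the variance; the dissertation instead scales the loading
vectors so that unit-variance factor scores reproduce the data, which only
moves the same scale factor from one matrix to the other. The data are
placeholders.

    # Eigendecomposition of the covariance matrix, retention of the leading
    # components, factor scores, and reconstruction of the original data.
    import numpy as np

    rng = np.random.default_rng(4)
    days = rng.lognormal(3.5, 0.5, size=(30, 24))       # 30 days x 24 hours

    mean_day = days.mean(axis=0)                        # M, the average day
    X = days - mean_day                                 # deviations from the mean
    cov = X.T @ X / (len(days) - 1)                     # 24 x 24 covariance matrix

    lam, E = np.linalg.eigh(cov)                        # eigenvalues and eigenvectors
    order = np.argsort(lam)[::-1]                       # largest roots first
    lam, E = lam[order], E[:, order]

    k = 3                                               # components retained
    print("variance retained: %.1f%%" % (100 * lam[:k].sum() / lam.sum()))

    scores = X @ E[:, :k]                               # principal component scores
    unit_scores = scores / np.sqrt(lam[:k])             # unit-variance factor scores
    reconstructed = mean_day + scores @ E[:, :k].T      # Equation 3.1 with k factors

    print("unit-variance scores, day 1:", np.round(unit_scores[0], 2))
    print("max reconstruction error with %d factors: %.2f"
          % (k, float(np.abs(reconstructed - days).max())))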


Pattern recognition techniques for multivariate systems are known as

cluster analysis. An excellent introduction to this subject is Duran

and Odell (11). Specific details of various schemes may be found by

pursuing their extensive reference list. There are basically two cate-

gories of cluster methods. Divisive schemes successively partition

cases into smaller clusters. They begin with all data in one group.

Agglomerative methods pool subsets of cases beginning with each case

in a group by itself. The cluster analysis scheme used in this work

is a hierarchical (agglomerative) method due to Ward (47).


The cluster concept implies an optimality criterion which defines the

manner in which groups should be combined. This objective function

can be stated in terms of the homogeneity within a cluster and the

disparity between two clusters. These measures depend on the distance

between groups. The euclidean distance between two vectors Fi and Fj

is

    d_e(Fᵢ, Fⱼ) = [ Σ_{k=1}^{f} (F_{ki} - F_{kj})² ]^(1/2)

                = [ (Fᵢ - Fⱼ)ᵀ (Fᵢ - Fⱼ) ]^(1/2)                      (3.5)


Note that Fs are employed signifying distance between vectors of factor

scores.











This real valued function is a metric because

    d(Fᵢ, Fⱼ) ≥ 0 for all Fᵢ and Fⱼ

    d(Fᵢ, Fⱼ) = 0 if and only if Fᵢ = Fⱼ

    d(Fᵢ, Fⱼ) = d(Fⱼ, Fᵢ)

    d(Fᵢ, Fⱼ) ≤ d(Fᵢ, Fₖ) + d(Fₖ, Fⱼ).


The statistical distance between two clusters can similarly be defined.

For [F] = F₁,...,F_{n₁} and [Y] = Y₁,...,Y_{n₂}, where Fᵢ and Yᵢ are f-

variable members of two clusters, the euclidean distance is

    d₁₂² = [n₁n₂/(n₁+n₂)] (F̄ - Ȳ)ᵀ (F̄ - Ȳ)                            (3.6)

where F̄ = Σ_{i=1}^{n₁} Fᵢ/n₁, Ȳ = Σ_{i=1}^{n₂} Yᵢ/n₂, and n₁ and n₂ are the number of

members in groups [F] and [Y].


This intercluster distance is the increase in the within group sum of

squares (WGSS) when cluster [F] is combined with [Y] to form a new

group. Ward's method calculates this distance for all possible com-

binations of groups and combines the two groups with the smallest dis-

tance (dxy). Thus, the objective function minimizes the WGSS. This

technique favors agglomeration into small, closely related groups and

is particularly appropriate in this application.
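The objective function can be sketched directly from Equation 3.6. The two
example clusters below use the Mahalanobis scaled scores of Table 3.2 (the
two April days and the December 7 / November 16 pair); the printed value is
the increase in WGSS if those particular groups were merged and is not one of
the merges tabulated later in Table 3.3.

    # Intercluster distance of Equation 3.6: the increase in the within-group
    # sum of squares (WGSS) when two clusters are merged.
    import numpy as np

    def ward_distance(F, Y):
        """F, Y: arrays of factor-score vectors (members x variables)."""
        F, Y = np.atleast_2d(F), np.atleast_2d(Y)
        n1, n2 = len(F), len(Y)
        diff = F.mean(axis=0) - Y.mean(axis=0)
        return (n1 * n2 / (n1 + n2)) * float(diff @ diff)   # d12 squared

    a = np.array([[239.0, 97.0, -6.0], [236.0, 107.0, 13.6]])    # 4/28 and 4/20
    b = np.array([[-170.0, 15.0, 38.6], [-176.0, 17.0, 17.3]])   # 12/7 and 11/16
    print("increase in WGSS if merged: %.0f" % ward_distance(a, b))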


Euclidean distance implies orthogonal geometry. This occurs if each

variable is independent of all the other variables. When the variables

are highly correlated, a generalized euclidean distance must be used.











The Mahalanobis (21) metric,

    d_m(Xᵢ, Xⱼ) = (Xᵢ - Xⱼ)ᵀ [W]⁻¹ (Xᵢ - Xⱼ)                          (3.7)

accounts for such interaction by including the within scatter matrix

([W]). The matrix [W] is the covariance matrix computed for all vectors contained

in the cluster. Mahalanobis distance has many advantages including the

fact that certain objective functions become equivalent. It is also

invariant under any nonsingular linear transformation, thus eliminating

scaling problems.


The properties of the Mahalanobis metric, coupled with a theorem in

Bryan (4), lead to an elegant method of clustering utilizing the

euclidean metric. Bryan (4) shows that Mahalanobis distance is pro-

portional to euclidean distance when the original vectors are trans-

formed according to [E][X], provided [E][cov][E]ᵀ = [I],              (3.8)

where [E] is the modal matrix described in Appendix II. This is

exactly the problem solved in the principal component analysis. We

now require, in addition, that the eigenvectors be rotated so that each

has equal variance. An equivalent means of accomplishing this result

in the transformed matrix [F] is to scale the factor scores according

to [F]Λ^(1/2), where Λ is the diagonal matrix of retained eigenvalues. Standard factor scores have unit variance and the sum of

squares of each eigenvector is equal to the associated eigenvalue.

Proper scaling results in factor scores which retain the variance in

the original covariance matrix. Thus, the original geometry is re-

stored, imparting a larger range to factor scores associated with the

most important principal components. In this way, the relative impact












of each component is included in the factor score matrix.
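A two-day sketch of this scaling, using the unit variance scores and
eigenvalues of the ten-day example, is shown below: multiplying by the square
roots of the eigenvalues approximately reproduces the Mahalanobis scaled
scores of Table 3.2 (small differences come from the rounding of the
tabulated unit scores), after which the ordinary euclidean metric carries the
Mahalanobis weighting.

    # Scaling unit-variance factor scores by the square roots of the eigenvalues
    # so that euclidean distance on the scaled scores acts as a Mahalanobis-type
    # distance for clustering. Values taken from Table 3.2.
    import numpy as np

    eigvals = np.array([27390.0, 7141.0, 810.0])
    unit_scores = np.array([[ 1.44,  1.14, -0.21],    # 4/28
                            [-1.03,  0.18,  1.36]])   # 12/7

    scaled = unit_scores * np.sqrt(eigvals)           # Mahalanobis-scaled scores
    print(np.round(scaled, 1))                        # compare with Table 3.2
    print("euclidean distance, scaled scores: %.1f"
          % np.linalg.norm(scaled[0] - scaled[1]))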


When two groups are combined, the degree of scatter within the group

and the distance between the new group and other groups are of interest.

For properly scaled orthogonal geometries as described above, these two

measures are the trace of the scatter matrix and the euclidean metric.

The scatter matrix is computed from
    [S_f] = Σ_{i=1}^{n} (Fᵢ - F̄)(Fᵢ - F̄)ᵀ                             (3.9)

where:

[Sf]= scatter matrix

n = number of members within the group

Fi = a vector of factor scores representing
a single case

F = the mean factor score vector of all members
of the group combined.

The trace of this matrix is the sum of the diagonal elements and repre-

sents the total variance of the group members about the group mean

vector. This mean vector represents the new group in computing the

distance to each other group and conceptually is the hyper-centroid of

the multi-dimensional sphere which includes all of the group members.

The cluster scheme used in this work finds the distance between all

groups; combines the two closest groups; computes the trace of the new

scatter matrix and then computes the distance between the new group

and the other groups. This process is continued until all of the ori-

ginal days have been combined into a single group. To construct the

dendrogram, the total scatter at each cluster step is computed by adding











the scatter for the new group and subtracting the scatter of the two

groups which were combined to form this new group. Initially, the total

scatter is zero when each day stands alone. When two days are combined,

the total scatter becomes the previous scatter plus the scatter computed

on combining these two days. This recognizes the fact that the scatter

between a day and itself is zero. The relative scatter at each step is

computed by dividing the total scatter accumulated at that step by the

total scatter when all of the cases have been combined to form one group.

All of the original cases are then arranged in cluster order so that

successive groups which have the smallest distance between them appear

next to one another. This procedure begins with the largest groups and

systematically arranges the members of each subgroup according to their

similarity. The dendrogram displays the membership of each group from

individual cases to a single composite group with the relative scatter

represented above each group. The reader is referred to Shannon (39)

and Shannon and Brezonik (40) for an environmental application of

cluster analysis.
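The clustering loop described in this section can be sketched as follows. It
follows the mechanics given in the text (closest pair by euclidean distance
between group mean vectors, with the cumulative scatter updated by adding the
new group's trace and subtracting those of the two merged groups); it is not
the author's program, and because it starts from the rounded scores of
Table 3.2 its intermediate values will not reproduce Table 3.3 exactly.

    # Agglomerative clustering with scatter bookkeeping, as described in the text.
    import numpy as np
    from itertools import combinations

    def scatter_trace(members):
        m = np.mean(members, axis=0)
        return float(sum(np.dot(x - m, x - m) for x in members))

    def cluster(scores):
        groups = [{"members": [row], "scatter": 0.0} for row in scores]
        cumulative, history = 0.0, []
        while len(groups) > 1:
            # find the pair of groups whose mean vectors are closest
            i, j = min(combinations(range(len(groups)), 2),
                       key=lambda ij: np.linalg.norm(
                           np.mean(groups[ij[0]]["members"], axis=0) -
                           np.mean(groups[ij[1]]["members"], axis=0)))
            merged = groups[i]["members"] + groups[j]["members"]
            new_scatter = scatter_trace(merged)
            cumulative += new_scatter - groups[i]["scatter"] - groups[j]["scatter"]
            history.append(cumulative)
            groups = [g for k, g in enumerate(groups) if k not in (i, j)]
            groups.append({"members": merged, "scatter": new_scatter})
        total = history[-1]
        return [round(100 * c / total, 2) for c in history]   # % relative scatter

    scores = np.array([[239, 97, -6.0], [236, 107, 13.6], [-170, 15, 38.6],
                       [-176, 17, 17.3], [-187, 61, -37.7], [-160, 66, -17.1],
                       [55, -94, 3.4], [70, -125, -10.7], [63, -79, -42.3],
                       [31, -65, 40.7]])
    print(cluster(scores))   # relative scatter at each of the nine steps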


An example using ten days from one sampling station is now presented to

illustrate the use of principal components and cluster analysis to char-

acterize and group days which are most similar. Table 3.1 displays the

reported ozone concentration for the ten sample days, selected specifi-

cally to illustrate the cluster example. Start hour zero represents the

one hour average from midnight until 1:00 AM and this convention applies

to all 24 hours. The example was constructed to show a clear separation

between three basic types of days.











First a covariance matrix about the mean was computed. Symbolically,

    X̄_h = Σ_{d=1}^{10} X_{dh} / 10        for h from 0 to 23          (3.10)

    σ_{h,h'} = Σ_{d=1}^{10} (X_{dh} - X̄_h)(X_{dh'} - X̄_{h'}) / 10     (3.11)

    σ_h² = Σ_{d=1}^{10} (X_{dh} - X̄_h)² / 9                           (3.12)

where

    X̄_h = mean for each hour averaged over ten days

    X_{dh} = observed concentration for each hour on each day

    σ_{h,h'} = covariance between any two hours averaged over ten days

    σ_h² = variance for hour h.

Note that σ_{0,0} is the variance for starting hour zero about the mean.

The square root of this value is the standard deviation of hour zero.

In Table 3.1, we find the mean for hour zero was 32 µg/m³ and the

standard deviation was 30 µg/m³. Thus, the variance contributed by

hour zero in this example is (30)² = 900 (µg/m³)².


The total variance of the covariance matrix is the sum of the variances

contributed by each of the 24 hours. Symbolically,

    σ_T² = Σ over all 24 hours of σ_h²                                (3.13)

where

    σ_T² = total variance of the matrix (the trace).

















TABLE 3.1

Ozone Concentration for Ten Example Days With Mean and Standard Deviation by Hour

[tabulated hourly values not legible in this digitization]











Summing the squared standard deviations in Table 3.1 yields

36351 (µg/m³)², which is approximately equal to the full precision vari-

ance of 36569 (µg/m³)² computed for the ten example days. Off-diagonal

elements are computed using Equation 3.11, which resulted in

659 (µg/m³)² for one pair of hours in this example.


A real symmetric covariance matrix, having 24 rows and 24 columns, also

has 24 roots which satisfy its characteristic equation (see Appendix II).

These roots, called eigenvalues, satisfy Equation 3.2. For the example

in Table 3.1, the three largest roots were found to be 27390, 7141 and

810 as in Dixon (9,p. 255). Since the total variance was 36569, these

roots represent 74.9%, 19.5% and 2.2% of the variance respectively. Col-

lectively, they account for 96.6% of the differences between each of the

ten days and the average day. Applying equation j.3 and solving for

E, results in the first principal component shown in Column No. 1 of

Figure 3.1. This vector of 24 values is plotted opposite the number

75 and above the mean vector in Figure 3.1. The variance of these 24

values about the origin is 27390 which equals the magnitude of its

associated latent root. Similarly, eigenvectors for the second and

third roots can be computed and are shown in Figure 3.1. Each of the

ten days can now be approximated by adding or subtracting various amounts

of each of these three components to the mean vector. These various

amounts are the factor scores computed from Equation 3.4. In Table 3.2,

these three values for each day in the example appear in the three

columns under unit variance scores. Choosing April 28, starting hour
























[FIGURE 3.1  Mean and Principal Components for Ten Day Example]










zero is reconstructed to be

    32 + (1.44)(15.1) + (1.14)(23.2) + (-0.21)(-8.8) = 82 µg/m³.

Hour 12 is represented by

    110 + (1.44)(50.9) + (1.14)(-6.7) + (-0.21)(7.3) = 174 µg/m³.

Similarly, all of the original data can be reconstructed to satisfy the

model given in Equation 3.1. The principal components also indicate the

correlation between hours found in these ten days. Looking at Figure 3.1,

we see that the contribution from hours nine through thirteen is approx-

imately constant for both the first and second components with the mean

accounting for differences during this interval. Similarly, hours zero

through three contribute approximately the same amount on any given day

and the period from hour 20 through midnight is important in defining

the second component.
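The reconstruction arithmetic above can be checked in a few lines, using the
hour-zero and hour-twelve means, the component values quoted in the text, and
the April 28 unit variance scores from Table 3.2:

    # Reconstruction of two hours of April 28 as mean plus the weighted sum of
    # the three principal components (Equation 3.1 with three factors).
    scores = (1.44, 1.14, -0.21)                 # April 28, factors 1-3
    cases = {"hour 0":  (32.0,  (15.1, 23.2, -8.8)),
             "hour 12": (110.0, (50.9, -6.7,  7.3))}
    for hour, (mean, components) in cases.items():
        value = mean + sum(s * c for s, c in zip(scores, components))
        print("%s: %.0f ug/m3" % (hour, value))  # 82 and 174, as in the text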


Factor scores are independent of one another since they represent the

proportions of independent components which make up each day. There-

fore, they can be represented geometrically along perpendicular axes.

The unit variance scores weight each factor equally. In the ten day

example factor one actually represents 75% of the difference while the

second factor represents only 20%. In order to achieve the correct

grouping when these scores are clustered, the total variance in each

component needs to be transferred to the factor scores. As discussed

above, the easiest way to do this is to multiply each unit variance

score by the square root of the applicable eigenvalue (165.5, 84.5

and 28.5, respectively). This scaling produces zero mean factor scores

which have variance equal to the original contribution of each











component. Thus the sum of squares of factor scores computed over all

ten days equals the proportionate variance and the associated eigen-

vectors each have unit variance. These factor scores are shown in

Table 3.2 under Mahalanobis scaled scores. They are called Mahalanobis

scores because they yield a measure of Mahalanobis distance using the

Euclidean metric.


To illustrate the clustering method both unit variance and Mahalanobis

scaled scores have been grouped and are displayed in Figure 3.2. The

numerical example now presented applies to the Mahalanobis clustering

since this provides a distinct grouping. The computation is the same

for unit variance scores and can be directly extended to as many dimen-

sions desired. The first step is to compute the euclidean distance

between all groups. For ten cases, this is a collection of 45 numbers.

The smallest of these represents the distance between December 7 and

November 16 and was computed to be


    [(-170-(-176))² + (15-17)² + (38.6-17.3)²]^(1/2) = 22.22.

The largest number was 267.75 representing the distance between April 28

and May 2. Referring to Table 3.2, this value was computed from


    [(239-31)² + (97-(-65))² + (-6-40.7)²]^(1/2) = 267.75.

The first step in clustering was therefore to combine December 7 and

November 16. Since the mean factor scores for these two days are F1 = -173,

F2 = 16 and F3 = 28, the trace of the scatter matrix at this step is computed

from

    (-170-(-173))² + (-176-(-173))² + (15-16)² + (17-16)² +
    (38.6-28)² + (17.3-28)² = 245.
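The two quoted distances are easy to verify from the Mahalanobis scaled
scores in Table 3.2:

    # Euclidean distances between Mahalanobis scaled score vectors (Table 3.2).
    import math

    dec7  = (-170.0,  15.0, 38.6)
    nov16 = (-176.0,  17.0, 17.3)
    apr28 = ( 239.0,  97.0, -6.0)
    may2  = (  31.0, -65.0, 40.7)

    def euclid(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    print("Dec 7  - Nov 16: %.2f" % euclid(dec7, nov16))   # smallest, 22.22
    print("Apr 28 - May 2 : %.2f" % euclid(apr28, may2))   # largest, 267.75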












TABLE 3.2

Factor Scores for Ten Day Example

             Unit Variance Scores        Mahalanobis Scaled Scores
Date           F1      F2      F3           F1      F2      F3
4/28          1.44    1.14   -0.21          239      97    -6.0
4/20          1.43    1.27    0.48          236     107    13.6
12/7         -1.03    0.18    1.36         -170      15    38.6
11/16        -1.06    0.20    0.61         -176      17    17.3
11/30        -1.13    0.73   -1.32         -187      61   -37.7
12/14        -0.97    0.78   -0.60         -160      66   -17.1
5/28          0.33   -1.11    0.12           55     -94     3.4
6/12          0.42   -1.48   -0.37           70    -125   -10.7
8/19          0.38   -0.93   -1.48           63     -79   -42.3
5/2           0.19   -0.77    1.43           31     -65    40.7
variance      1.0     1.0     1.0         27390    7141     810


FIGURE 3.2

Dendrograms for Ten Day Example

[Two dendrograms, a unit variance cluster and a Mahalanobis cluster, plotting
relative scatter (0.0 to 1.0) against day of year, 1974, for the ten example
days; asterisks mark the first and second groupings.]











Since this is the first step, the total cumulative scatter is zero plus

245. Only the distance from the new group to other groups needs to be

recomputed. On the second iteration, the smallest overall distance was

found to be 22.24 between April 20 and April 28. When these two days

were combined, the total scatter of the scatter matrix was determined

to be 247. Thus the total cumulative scatter after completing the second

step was 245 + 247 = 492. Similarly, pairs of days were combined for

steps three and four.


Step six illustrates the procedure when two groups, both containing more

than one day, are combined. The total cumulative scatter after comple-

tion of step five was 3389. When December 14 and November 30 were com-

bined, the total variance of the scatter matrix for these two days was

585. The trace of the scatter matrix for the four days combined was

calculated as the sum of 12 squared values and was equal to 12736. Thus,

the total cumulative scatter on completion of step six was 15297. We

see this under step six in Table 3.3.


On completion of clustering the cumulative scatter was 318069 as shown

in Table 3.3. This is derived from the total variance of the Mahalanobis

factor score matrix shown in Table 3.2. Notice that the mean of each

set of ten factor scores in Table 3.2 is zero. Thus, the scatter is

equal to the sum of squares. Ten zero mean values having a variance

of 27390 (Mahalanobis Fl) have a sum of squares equal to 246510. This

is apparent from Equation 3.12, since nine times 27390 equals 246510.

Similarly, the sum of squares for F2 and F3 are 64269 and 7290. The












total sum of squares for the Mahalanobis scores in Table 3.2 is therefore

246510 + 64269 + 7290 = 318069.

This represents 96.6% of the total variance shown on Page 28. Thus,

Mahalanobis scores preserve the original variance. This is not true of

the unit variance scores, which result in incorrect scaling and clustering

errors.




Table 3.3

Ten Day Cluster Example Scatter Summary

Step   % Relative   Cumulative       Previous      + New       - Member   - Member
No.    Scatter      Scatter          Scatter         Group         A          B

 1       0.08            243   =          0    +      243   -        0   -       0
 2       0.16            495   =        243    +      252   -        0   -       0
 3       0.34           1080   =        495    +      585   -        0   -       0
 4       0.56           1768   =       1080    +      688   -        0   -       0
 5       1.07           3389   =       1768    +     2309   -        0   -     688
 6       4.81          15297   =       3389    +    12736   -      585   -     243
 7       6.15          19568   =      15297    +     6580   -        0   -    2309
 8      47.40         150778   =      19568    +   150526   -     6580   -   12736
 9     100.00         318069   =     150778    +   318069   -   150526   -     252

Note: Member A and Member B are the two groups being combined to
form the New Group.











When all clustering steps have been completed, the relative scatter is cal-

culated by dividing each cumulative scatter by the total scatter in the

factor matrix. This is also equivalent to the total variance. A dendrogram

can be constructed from the relative scatter by arranging the days so

that nearest neighbors are adjacent. The nine steps and associated

relative scatter are shown in Table 3.3. Clustering for the unit vari-

ance scores is done analogously, with the total sum of squares on the

ninth step equal to 27. The dendrogram is shown in Figure 3.2. Although

the groupings for both unit variance and Mahalanobis scores are the same

in this example, significant differences in measures of closeness are

clear. These differences can cause grouping errors, especially when

differences between cases are not as pronounced as those given in this

example. In practice, tabulating the scatter values is not necessary,

since the dendogram graphically shows how to group each set of data.


Figure 3.3 is a visual aid to understanding cluster analysis. The fac-

tor scores representing each of the ten days have been plotted in three

dimensions. Clustering can be accomplished in multiple dimensions, but

more than three dimensions becomes difficult to visualize. Clearly the

April dates are set apart from the other eight days. The November and

December days were grouped and set apart from the May, June and August

days. The relative scatter and closeness between groups are clearly

illustrated in Figures 3.2 and 3.3 and in Table 3.3.


Lognormal plots (see Appendix I, p. 105 ) of these ten days are in Figure

3.4. The same grouping illustrated above can be postulated. However,



























[FIGURE 3.3  Three Dimensional Display of Factor Scores for Ten Day Example:
unit variance scores and Mahalanobis scores plotted on the F1, F2, F3 axes]


[FIGURE 3.4  Lognormal Plots of Each Day in Ten Day Example
(ozone concentration, µg/m³)]

grouping on geometric mean or standard geometric deviation does not

necessarily produce the same clusters derived from orthogonal components.


In this chapter I have briefly described the necessary mathematical concepts

and presented an example to illustrate how information can be condensed and

utilized to identify similarities and differences among days. This

example is extended in Chapter IV and applied to ozone data reported in

south Florida. Before proceeding to this analysis, a discussion of

empirical modeling is next presented with a flow chart illustrating

steps in the procedure.


A posteriori models require extensive measurement of ambient conditions

over time. The sampling network provides these data. Ideally classical

modeling techniques would be used to determine the optimal number, loca-

tion and sampling frequency for the network. Performance is ordinarily

evaluated after considerable data have been collected. These data

should be used to confirm or investigate preliminary classical model

results and to suggest desirable changes in the sampling procedures.

The ultimate application for stochastic models is for real time analysis

utilizing microprocessors to predict future air quality based on past

and present conditions. These techniques could also be used to initiate

control strategies and signal a need for more intensive sampling during

critical periods. Stochastic models need to be developed before real

time control strategies can be formulated.








To build empirical simulation models, one must observe enough repeated

experiments to develop a reliable relationship between inputs and out-

puts. When conditions change often and in random fashion, as in the

atmosphere, these experiments need to be grouped for similarity in order

to construct meaningful models. If all conditions were used indiscrim-

inately, an "average model" would be derived which actually would not

represent any of the observed conditions. Figure 3.5 illustrates the steps

necessary to analyze the data in preparation for stochastic modeling. The con-

ventions used are the same as those for critical path analysis. Each

arrow represents a task or operation to be completed. The numbered

points mark the initiation and completion of each task. Successive

operations cannot begin until all of the preceding tasks have been com-

pleted. The first step, proceeding from point zero to point 1, is col-

lecting the necessary data. It includes all of the laboratory, field

and administrative (auditing, etc.) quality assurance procedures neces-

sary to insure valid data. Selecting data of interest concerns the

frequency of elevated pollutant levels, duration of conditions, accuracy

of data, availability of support data, sampling network density and

other considerations. For multivariate problems, covariance matrices

can be computed after screening the data to exclude anomalies. At the

same time, preliminary summaries, such as tabulating all days which ex-

ceed standards, can be performed. The covariance matrices are used to

estimate missing observations and to compute principal components.

After these tasks are completed, factor scores for each day can be

found to represent the original data in condensed form. The principal

component results are used to further characterize the data as soon as





















[FIGURE 3.5: Flow chart of the steps necessary to analyze data in preparation
for stochastic modeling, drawn with critical path conventions; each arrow is
a task and the numbered points mark the initiation and completion of each
task.]











they become available. Thus, the dashed arrow between points four and

two indicates that these results will be used to proceed from point two

to point six (preliminary characterization) but that this task can begin

before principal components and missing observations have been computed.

Using the properly scaled factor scores, groupings at each spatial lo-

cation can be computed which characterize the waveforms as a function

of time independent of space. Using the preliminary characteristic

results, these clusters can then be summarized. The number of dif-

ferent types of days and the frequency of occurrence as well as the

relative similarity can be determined from the dendograms. Once these

groups have been defined, their characteristics can be further analyzed

as shown in Figure 3.5 at point seven. Appropriate distributions can be

fit for each group (see Appendix I). These indicate the probability of

observing high concentrations in each group. However, they do not pro-

vide the functional relationships in time that stochastic models repre-

sent. Principal components for each subgroup can be computed and plot-

ted along with the group-average waveform. This visual display of group

characteristics shows the kinds of models needed and the periods during

each day which are of particular interest. The total variance for each

group indicates the magnitude of differences between members of each

group and is an indicator of the ease with which a stochastic model

could be constructed. Average factor scores for each group at each site

can be computed and clustered to locate similar conditions in time which

have occurred at different spatial locations.
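
Although Figure 3.5 presents these tasks as a flow chart, the essential
computational chain for a single station can also be outlined in code. The
sketch below is schematic only: it assumes complete data for one station,
generates synthetic hourly values in place of real observations, and uses
the Ward linkage from scipy as a convenient stand-in for the clustering
routine described earlier. None of the numbers it produces correspond to the
Tampa data.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def principal_components(days):
        """Eigenvalues and eigenvectors of the hour-by-hour covariance matrix,
        ordered from largest eigenvalue to smallest."""
        eigval, eigvec = np.linalg.eigh(np.cov(days, rowvar=False))
        order = np.argsort(eigval)[::-1]
        return eigval[order], eigvec[:, order]

    def factor_scores(days, eigvec, keep):
        """Projection of each day's deviation from the mean day onto the retained
        components (the standard scores rescaled by the square roots of the
        eigenvalues)."""
        return (days - days.mean(axis=0)) @ eigvec[:, :keep]

    # Synthetic stand-in for one station-year of hourly data (days x 24 hours).
    rng = np.random.default_rng(1)
    days = np.abs(rng.normal(60.0, 25.0, size=(200, 24)))

    eigval, eigvec = principal_components(days)
    fraction = np.cumsum(eigval) / eigval.sum()
    keep = int(np.argmax(fraction >= 0.90) + 1)          # retain 90% of the variance
    scores = factor_scores(days, eigvec, keep)

    # Cluster the days on their factor scores and summarize each group.
    tree = linkage(scores, method="ward")
    labels = fcluster(tree, t=12, criterion="maxclust")  # 12 groups, chosen arbitrarily
    for g in np.unique(labels):
        member_days = days[labels == g]
        print(f"group {g}: {len(member_days)} days, "
              f"mean daily maximum {member_days.max(axis=1).mean():.0f}")

The group-average factor scores could then be passed through the same linkage
routine a second time to seek the spatial similarities just described.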











Clustering the average factor scores shows spatial as well as temporal

similarity. Utilizing the summarized characteristics available at this

point results in major group types. In this study, three types were

identified and named non-episode, episode and super episode conditions.

The subgroups for each major type were analyzed and displayed in cluster

order to illustrate differences between day types. The kind of day

observed at all sampling sites was tabulated for every day which had a

super episode condition. This is indicated as date comparison in Figure

3.5. A choice of the most interesting, potentially modelable conditions

can be derived from this analysis. The next step in model development

is identifying influence parameters. By contrasting conditions on super

episode days against non-episode or episode days, meteorological or air

quality parameters which seem to affect the system can be found. Having

selected interesting influence data, formal procedures such as cluster

analysis or discriminant analysis can be used to determine the relative

importance of each factor. The conditions which are to be modeled and

input data can then be selected. If sufficient influence data are

available and these conditions have been repeated often enough, models

can be formulated. These models can be constructed from the raw data.

Alternatively, weighted average data reconstructed from the factor

scores could be used to smooth random variations before models are

constructed. The inputs and system response which characterize various

kinds of days can then be used to evaluate classical models and air

pollution potential.














CHAPTER IV

AN APPLICATION TO URBAN OZONE DATA


Photochemical oxidant levels are a pervasive problem throughout the

continental United States. Literally hundreds of journal articles and

millions of dollars have been directed toward this problem during the

past few years. Initially, hydrocarbon emission control was suggested

as an answer to the urban oxidant problem. Even with these emission con-

trols, ozone concentrations remain very high. Even rural levels fre-

quently exceed established standards during spring and summer months.

Reynolds et al. (37) and Roth et al. (38) have developed a comprehensive

emission inventory and a sophisticated oxidant model based on numerical

integration. Under certain conditions in the Los Angeles basin, the

model performs well. This model is presently being modified for appli-

cation in Tampa, Florida, which historically has had oxidant episodes

lasting several days with levels reaching twice the standard. Further

study is under way nationally with a particularly comprehensive program

in Houston, Texas, sponsored by the Houston Chamber of Commerce under

Feldcamp (13).


Oxidants are a secondary pollutant, since they are not emitted directly

from sources. Precursor compounds such as reactive hydrocarbons and

nitrogen oxides interact in complicated chemical reactions which are

not clearly understood. Intensive solar insolation appears to be a













necessary ingredient, or at least a contributing factor. A recent

study reported by Spicer et al. (43) concluded regional scale transport

and high pressure systems must be considered. Spicer et al. (43) also

concluded that oxidant sources were ground based, since the ozone con-

centration decreased with altitude.


The Hillsborough County Environmental Protection Commission has rou-

tinely collected ozone data using chemiluminescent monitors since 1973.

A 90% confidence interval for the three-stage calibration procedure

was typically ±10 ppb as shown in Appendix I. During 1974, there were

82 days and a total of 589 recorded station-hours exceeding 159 ug/m3

in the data set. These data are summarized in Table 4.1. A brief

review of plotted diurnal waveforms revealed a bizarre collection of

cases which did not follow the typical curves given in USDHEW AP-63

(46). Since oxidants are a national concern and since an extensive

set of reliable data was readily available, the Tampa area ozone data

were selected for analysis.


The HCEPC maintained three ozone monitors during the study period.

Station 63, located on Davis Islands at the Coast Guard basin, is an

urban site which historically has reported some of Tampa's poorest air

quality. A second urban ozone sampler, designated Station 94, is lo-

cated at the intersection of Dale Mabry Highway and Cypress Avenue.

It is less than 210 yards south of the heavily traveled Interstate

Highway 4. The station at the fire department on Gunn Highway was

numbered 109 and is approximately five miles northwest of the urban


















TABLE 4.1

Contraventions of the Oxidant Standard
Observed in Hillsborough and Pinellas Counties, Florida in 1974

[Entries give, by date and sampling station (63, 94, 109, STP), the total
hours exceeding 159 µg/m³ and the day type: SE = super episode, E = episode,
NE = non-episode, O = omitted; a blank entry indicates no hours exceeding
159 µg/m³. Sampling stations are described on page 44.]


center. It is a semi-urban site on the periphery between urban Tampa

and the agricultural areas north of town. Consequently, at this site

both urban and rural effects can be observed depending on meteorology.

The remaining sampling site, designated STP, was located in downtown

Saint Petersburg.


The first step was to obtain a copy of the State of Florida Air Quality

Data Handling System (AQDHS) data tape. All ozone data from Hills-

borough and Pinellas counties, through September, 1975, were stripped

from this tape to direct access files. These data were then analyzed

to locate erroneous values and days with more than 50% missing values.

A number of errors were found, checked with the local agency and cor-

rected. Days with too little data were removed. The Pearson Correla-

tion program described in Nie et al. (28) was then used to compute a

correlation matrix for all remaining data at each station. This routine

was used because it accommodates missing observations and because it

employs the double precision arithmetic required for large numbers of

cases. A covariance matrix was formed from the correlation matrix using

the standard deviation for each variable (hour of day) from the Pearson

Correlation program. Principal components for each covariance matrix

were computed using the BMD08M routine from Dixon (9). This routine

was selected because a very flexible input specification is provided.

Figure 4.1 is a plot of the mean and first four principal components

of the ozone data at each station based on all data available. Prin-

cipal components are discussed in Chapter III and in Appendix II.



















































































[FIGURE 4.1: Mean and first four principal components of the hourly ozone
data at each of the four sampling stations, based on all available data;
vertical axes are ozone concentration (µg/m³), horizontal axes are hour of
day.]

If all of the principal components were retained, 100% of the total

variance would remain and only the geometric representation of the data

would be changed. However, when the original variables are intercor-

related, most of the information can be represented with relatively

few factors. The criterion used in this study was retention of 90%

of the total variance. In this application, greater than 90% of the

total variance at each station was accounted for by six components.

Thus, each day in the original data set was reduced from 24 hourly

observations to a set of six factor scores while retaining 90% of the

information contained in 24 values. This achieves computational effi-

ciency and orthogonal structure as described and illustrated in the

example of Chapter III.
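
The retention criterion amounts to a cumulative sum. As a check, the
fragment below applies it to the six Station 109 eigenvalues and the total
variance quoted in the next paragraph; only the arithmetic is shown, and the
code is not part of the analysis programs used in this study.

    # First six eigenvalues of the Station 109 covariance matrix and the total
    # variance of all 24 hourly variables, as reported below for that station.
    eigenvalues = [13614, 4882, 2616, 1240, 880, 525]
    total_variance = 25561                       # (ug/m3)^2

    cumulative = 0.0
    for k, value in enumerate(eigenvalues, start=1):
        cumulative += value
        print(f"component {k}: {100.0 * value / total_variance:5.1f}%  "
              f"(cumulative {100.0 * cumulative / total_variance:5.1f}%)")
    # Six components were retained, which satisfied the 90% criterion at every station.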


At this point, the results of the principal component analysis can

best be presented by example. Scanning Table 4.1, April 19 looks

interesting, with all sites exceeding 159 µg/m³ for at least 5 hours.

The mean and principal components for Site 109, which are plotted in

Figure 4.1, are listed in Table 4.2. The total variance for the 551

days at this site was 25561 (µg/m³)². These numbers are shown at the

top of Figure 4.1 above Site 109. The first six eigenvalues for this

site, arranged from largest to smallest, are

13614 4882 2616 1240 880 and 525.

Dividing each of the eigenvalues by the total variance, we see that

component one represents 53% of the variance. Similarly, components

two, three and four represent 19, ten and five percent of the variance




















TABLE 4.2

Mean and Six Principal Components
Station 109, 1974 and 1975

[For each hour of the day, the table lists the mean ozone concentration and
the corresponding elements of the first six principal components for
Station 109.]












respectively. These numbers also appear in Figure 4.1 to the right of

the component which they represent. The mean for each hour in the day

is shown at the bottom of this figure. Standard factor scores were

computed using Equation 3.4. For April 19 at Site 109, these were

3.862 1.101 -1.476 0.883 -0.101 and -1.418.

These factor scores have a zero mean and unity variance when computed

over all days represented. They represent the contribution of each

corresponding orthogonal component which must be added to the Site 109

mean vector to reconstruct the hourly values for April 19. For example,

the first hour in this day can be reconstructed using the top line of

Table 4.2 and the six values above. Explicitly,

26.4 + (3.862)(13.4) + (1.101)(17.2) + (-1.476)(-1.8) +

(0.883)(-10.4) + (-0.101)(-2.2) + (-1.418)(-8.6) = 103 µg/m³


This value (103 µg/m³) appears in Table 4.3 as the first entry under

reconstructed values. Similarly, the second hour of April 19 at Site

109 was reconstructed as 100 µg/m³ using the same factor scores above

and values from the second line in Table 4.2. Table 4.3 summarizes

the reconstruction of each of the 24 hours of April 19 at Site 109.

The average difference between the observed concentration and the

reconstructed concentration was 0.4 µg/m³. Since April 19 was an

unusual day with respect to the mean day, reconstruction would be

much better for a day selected at random. Nonetheless, reconstruc-

tion within 16% of the observed value was achieved for every hour.

In summary, most of the ozone information for each station-day can












TABLE 4.3

Reconstruction of April 19, 1974, Station 109

[For each of the 24 hours, the table lists the observed ozone concentration
and the concentration reconstructed from the mean vector and the six factor
scores.]












be represented by only six values having orthogonal geometry, rather

than the 24 intercorrelated hourly values.
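
The reconstruction above is simply the mean vector plus a weighted sum of
the components, with the factor scores as weights. The fragment below
repeats the hour-one calculation using the numbers already given for
April 19; it is a check on the worked example, not a general routine.

    # First-hour entries: the Station 109 mean and the top line of Table 4.2,
    # together with the six standard factor scores for April 19.
    mean_hour_one = 26.4
    component_hour_one = [13.4, 17.2, -1.8, -10.4, -2.2, -8.6]
    scores_april_19 = [3.862, 1.101, -1.476, 0.883, -0.101, -1.418]

    reconstructed = mean_hour_one + sum(
        score * loading for score, loading in zip(scores_april_19, component_hour_one))
    print(f"reconstructed first hour: {reconstructed:.0f} ug/m3")   # prints 103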


All missing observations were reconstructed using a multiple regression

routine. Each missing hour was computed based on the distribution over

all cases for that specific hour. The weighting function used was the

ozone concentration for all hours which were measured. This procedure

was necessary, since factor scores cannot be computed unless all of

the original variables (hours) are available. The factor scores were

computed and scaled as described in Chapter III to achieve orthogonal

Mahalanobis structure. This is accomplished by multiplying each of the

standard factor scores by the square root of the corresponding eigen-

value. As an illustration, for April 19 at Site 109, the properly

scaled factor scores are

450.7 76.9 -75.5 31.1 -3.0 and -32.5.

These values were computed as 3.862√13614, 1.101√4882, and so on. Computed

over all cases, these factor scores have zero mean, and variance equal

to the eigenvalue for each specific factor. Thus, the total variance

of the factor score matrix would remain 25561 (µg/m³)² if all 24 factors

were retained. Further, the variance of each factor relative to the

other factors has been restored. The standard factor scores weight each

factor equally, when actually factor one accounts for over half of the

total difference of a specific day compared with the mean day. Using

this same procedure, factor scores for each of the four sampling sites

were computed based on the principal components for that specific site












alone. In this way, the simplicity and economy of the euclidean metric

can be used to achieve Mahalanobis clustering. The Mahalanobis metric

would also provide this same result, but at considerably greater ex-

pense.
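
The scaling itself is a single multiplication per factor. The lines below
reproduce the April 19 values from the standard scores and eigenvalues given
above; again this is only a check of the arithmetic.

    import math

    # Standard factor scores for April 19 at Site 109 and the corresponding
    # eigenvalues; multiplying each score by the square root of its eigenvalue
    # restores the relative weight of that factor before euclidean clustering.
    standard_scores = [3.862, 1.101, -1.476, 0.883, -0.101, -1.418]
    eigenvalues = [13614, 4882, 2616, 1240, 880, 525]

    scaled_scores = [s * math.sqrt(e) for s, e in zip(standard_scores, eigenvalues)]
    print([round(value, 1) for value in scaled_scores])
    # [450.7, 76.9, -75.5, 31.1, -3.0, -32.5]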


Ward's method, utilizing the euclidean metric, was applied to all avail-

able days for 1974. A six-space clustering was constructed using the

factor scores as variables. The dendograms in Figures 4.2 through 4.5

detail the macrostructure of the separation between cases. The number

of groups was determined subjectively by computing average factor scores

for the first four components. The within-group scatter distance was

used to make an initial separation into approximately 35 clusters. The

difference between the average factor scores for separate as well as

combined groupings was then used to achieve a manageable number of sub-

groups. Each of the three numbers in each block of these dendograms

is the average factor score for all members of that group. The tick

marks along the horizontal axis represent cases which recorded at least

one hour exceeding 159 µg/m³. April 19 at Station 109 was a member of

Group 109-13. This notation is used throughout the remainder of the

text to signify the station and subgroup corresponding to Figures 4.2

through 4.5. Group 109-13 means Station 109 Group 13 and, in this case,

represents a set of twelve days. As shown in Figure 4.4, most days in

this group exceeded the oxidant standard. The average factor scores

for the first three principal components in Group 109-13 were 316, 90

and -42. These three numbers appear in the block above Group 13 in









































[FIGURES 4.2 THROUGH 4.5: Dendograms of the 1974 clustering for each of the
four sampling stations. The vertical axis is relative scatter; days are
arranged in cluster order along the horizontal axis, with subgroup numbers
below the axis, average factor scores in the block above each subgroup, and
tick marks denoting days exceeding 159 µg/m³.]












Figure 4.4. Days which had large positive values for the first princi-

pal component generally clustered at the extreme right in Figures 4.2

through 4.5. These days also tended to exceed the oxidant standard.

Thus, the first component, which accounts for over 50% of the total

variance, also tends to separate high oxidant days from low level days.


Figures 4.2 through 4.5 display the group structure for each day repre-

sented in 1974. All of the days are arranged in cluster order along

the abscissa. The dates are not shown as their number is too large to

be conveniently displayed. The subgroup numbers appear sequentially

below the horizontal axis. The rectangle above each group number indi-

cates both the number of days in the group and the relative scatter

among members of that group.


Returning to Group 109-13 in Figure 4.4, the rectangle above Group 13

has relative scatter equal to 48.9% of the total scatter, 1.612×10⁶ (µg/m³)²,

computed for all 269 days at Station 109 for 1974. The width of

the rectangle is 12/269 or 4.46% of all days represented. The horizon-

tal line above Group 13 is therefore 4.46% of the length of the entire

horizontal line. Figure 4.6 displays the clustering of the twelve days

in Group 109-13. The first four steps for this group combined pairs of

days ending when April 11 joined with April 21 at a relative scatter of

8.4%. These were steps numbered 34, 50, 74 and 128 in the 268 sequen-

tial steps required to cluster Station 109. At Step 154, the fifth

step taken for Group 109-13, April 27 and May 13 were combined with

April 20 and April 28 to form a new group at a relative scatter of 12.6%.












This process continued and at Step 247, Group 109-13 was defined to

include 12 specific days. Note that we can easily subdivide this group

into finer subgroups, which are more similar to one another, or add

additional groups, such as Groups 14 and 15, while increasing the scat-

ter. The procedure is the same as illustrated by example in Chapter

III, except that the clustering was done in six space to recover at

least 90% of the original information.


Once the members of a particular group are defined, the arithmetic aver-

age for each factor can be computed. This is useful to understand how

factors on the average account for differences between groups. The

first three factors for Group 109-13, as stated previously, are 316, 90

and -42 averaged over the 12 days represented. In contrast, Group

109-10 representing 69 days has averaged scores of -126, 14 and -2. As

indicated by the tick marks in Figure 4.4, most of the Group 13 days

exceeded the ozone standard while none of the Group 10 days reached

160 µg/m³. Note that each factor averages to zero over all 269 days.

Distinct separations between the groups displayed in Figures 4.2 through

4.5 are apparent when the group-averaged factor scores are presented.


Periods having similar waveforms lasting a few days or longer appear as

chronological sequences within a group. Using Group 109-13 once again

as an example, all except two of the days in this group occurred in

April. As shown in Figure 4.6, these April dates were 10, 11, 18, 19,

20, 21, 25, 26, 27 and 28. Seven of these ten days exceeded 159 µg/m³.

The sequence of days from the 18th through the 28th represents a






















[FIGURE 4.6: Clustering Steps for the Twelve Days in Group 109-13; relative
scatter (0.0 to 1.0, vertical axis) versus day of year, 1974.]











quasi-stationary period at Site 109 with a change in conditions between

the 21st and 25th. This period represents a sequential set of days

during which each day had a similar waveform. If sufficient influence

parameters are available with fine enough temporal resolution, a digital

simulation relating observed ozone concentration to input parameters can

be developed for this group. Traditional transfer functions, derived

empirically, can be used to develop a stochastic model representing

conditions on April 18 through April 28, and differences in these influ-

ence parameters on April 22, 23 and 24 could provide insight into causes of

elevated oxidant levels for Group 109-13.
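
As an indication of what such an empirical transfer function might look
like, the sketch below fits a first-order relationship between hourly ozone
and a single influence parameter by ordinary least squares. Everything in it
is hypothetical: the solar-radiation series and the ozone series are
synthetic, and a real model for Group 109-13 would draw on measured influence
data and very likely a richer functional form.

    import numpy as np

    rng = np.random.default_rng(2)
    hours = np.arange(24 * 11)                               # eleven days, hourly
    solar = np.clip(np.sin((hours % 24 - 6) / 12.0 * np.pi), 0.0, None)

    # Synthetic "observed" ozone driven by the solar series plus noise.
    ozone = np.empty(len(hours))
    ozone[0] = 40.0
    for t in range(1, len(hours)):
        ozone[t] = 0.7 * ozone[t - 1] + 60.0 * solar[t] + rng.normal(0.0, 5.0)

    # Fit ozone[t] = a*ozone[t-1] + b*solar[t] + c by least squares.
    design = np.column_stack([ozone[:-1], solar[1:], np.ones(len(hours) - 1)])
    (a, b, c), *_ = np.linalg.lstsq(design, ozone[1:], rcond=None)
    print(f"fitted coefficients: a = {a:.2f}, b = {b:.1f}, c = {c:.1f}")

    residuals = ozone[1:] - design @ np.array([a, b, c])
    print(f"residual standard deviation: {residuals.std():.1f}")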


In addition to temporal stationarity, spatial homogeneity is highly

desirable. If periods having similar waveforms at all spatial locations

were defined, mesoscale influence parameters could be used to construct

the model. Otherwise microscale data, collected at each sampling site,

would be necessary. Spatio-temporal homogeneity was investigated using

the cluster algorithm. A four-space grouping based on the average fac-

tor scores for each subgroup at each station appears in Figure 4.7.

The complete tree structure has been diagramed with the sampling site

and subgroup number at the left margin. A total of 66 station groups

were analyzed. This dendogram resolved three basic group types denoted

NE (non-episode), E (episode) and SE (super episode). Group NE con-

tained 15,192 station hours, of which only two occurrences exceeding

159 pg/m3 were found. All twelve months of the year were represented

at all four sampling sites. A total of 140 values exceeding 159 pg/m3































[FIGURE 4.7: Clustering of the average factor scores for the 66 station
subgroups. The complete tree structure is shown with sampling site and
subgroup number at the left margin; brackets mark the three summary types
NE (non-episode), E (episode) and SE (super episode).]












were found in the 10,104 station hours represented in Group E. The

2,544 station hours in Group SE included 402 such values. The SE, E,

and NE groupings were based on spatial location as well as temporal

waveform. In percent of total hours represented, Group SE had 16% over

159 pg/m3 while Groups E and NE had 1.4% and 0.013% respectively. An

example of a spatially homogeneous day is April 19 which was an SE day

at all four sampling stations. As shown in Figure 4.6, April 19 is

also a member of Group 109-13.


To further investigate spatial homogeneity, a date matching routine was

developed. Table 4.4 was made from a computer output for Group SE.

All dates for which at least one site belonged to this group are listed

chronologically at the left. Under each of the four stations, the group

membership is summarized. M and O are for missing and omitted data

respectively. In all, 52 days were represented. Only five of these

days (denoted *) did not include at least one station-hour exceeding

159 ug/m3, and virtually all of the extensive occurrences of elevated

levels were included in this group. Spatial homogeneity within Group

SE is apparent in Table 4.4. On nine days, denoted + in the table, all

of the stations represented recorded type SE days. On these occasions,

macroscale phenomena created elevated oxidant levels throughout the

region. Presumably, site specific differences played a small role in

oxidant formation, so macroscale influence parameters could be selected

to construct simulation models representing these days. Table 4.4 also

shows occasions when spatial differences were significant. For example,















TABLE 4.4

Days Represented in Group SE:
Day Type at Each Station and Number of Hours Exceeding 159 µg/m³

[For each of the 52 dates, listed chronologically, the table gives the day
type at stations 63, 94, 109 and STP, with the number of hours exceeding
159 µg/m³ in parentheses. SE = super episode, E = episode, NE = non-episode,
M = missing day, O = omitted day; + marks dates on which all reporting sites
were in Group SE, and * marks dates on which no site exceeded 159 µg/m³.]











scanning Table 4.4 for type NE days, one finds five occasions where a

super episode occurred at at least one site while a low ozone day occur-

red at another. July 23 is a striking example when persistent elevated

ozone was reported at all three Tampa locations while the St. Petersburg

site remained low. Clearly, there are occasions when site specific

influence data would be required to explain spatial differences. How-

ever, most of the days having at least one SE-type station day also had

type SE or type E conditions at the other sampling sites. The degree

of spatial homogeneity exhibited on days chosen for modeling affects

the degree of spatial resolution required in the influence parameters.
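
The date-matching idea can be sketched in a few lines. The records below are
a small illustrative subset patterned on Table 4.4 (April 18 and 19 were
type SE at all four stations, while on July 23 the St. Petersburg site stayed
low); the routine simply tabulates the day type reported at each station for
each date and flags the spatially homogeneous super episode dates.

    from collections import defaultdict

    # (date, station, day type) records; an illustrative subset, not the full
    # 1974 tabulation of Table 4.4.
    records = [
        ("4-18", "63", "SE"), ("4-18", "94", "SE"), ("4-18", "109", "SE"), ("4-18", "STP", "SE"),
        ("4-19", "63", "SE"), ("4-19", "94", "SE"), ("4-19", "109", "SE"), ("4-19", "STP", "SE"),
        ("7-23", "63", "SE"), ("7-23", "94", "SE"), ("7-23", "109", "SE"), ("7-23", "STP", "NE"),
    ]

    by_date = defaultdict(dict)
    for date, station, day_type in records:
        by_date[date][station] = day_type

    for date, stations in sorted(by_date.items()):
        all_super = set(stations.values()) == {"SE"}
        marker = "+" if all_super else " "
        print(f"{marker} {date}: {stations}")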


Coupling spatial homogeneity with temporal homogeneity provides a means

of selecting periods which should be most easily modeled. Intuitively,

these would be periods where the entire region experienced similar air

quality which lasted a relatively long time. Returning to Table 4.4,

a scan for consecutive dates reveals one interval lasting eight days

from April 24 through May 1. Other intervals in Table 4.4 lasting from

two to five days can also be found which represent quasi-stationary

high-oxidant episodes. Of the nine days previously mentioned which

exhibited SE conditions at all sites, only October 19 does not occur

during one of these quasi-stationary periods. If sufficient influence

parameter data were available, these intervals would be good candidates

for simulation modeling. The scope of this dissertation does not

include construction of digital models, since the primary objective is

demonstrating a method of defining quasi-stationary intervals for which












stochastic models could be developed. Consequently, the remainder of

this chapter concerns statistical and meteorological descriptions which

characterize the subgroups defined through pattern recognition.


Lognormal parameters were calculated for each station day. The average

geometric mean and geometric standard deviations were also calculated.

Group membership and average lognormal parameters appear in Table 4.5,

including the number of station days represented. The overall groups

in Table 4.5 refer to group numbers opposite the brackets in Figure 4.7.

Membership in each group from 1 through 14 is shown in both Table 4.5

and Figure 4.7. These distributions have been plotted in Figure 4.8.

Individual curves can be identified by locating the left-most line at

the lower right-hand corner of each graph. The subgroup order begins

with this curve and lists the group from left to right. Group 1 had

slightly greater slope than Group 2, but a very similar geometric mean

near 27 µg/m³. Group E had an overall geometric mean of 49 at slope

2.63. Groups 3, 4 and 5 appear together with an average geometric mean

around 38 µg/m³. Their geometric standard deviations ranged from 2.77 to 3.98.

Groups 6 and 7 were close together and averaged 63 µg/m³. These two

curves had the lowest slope of any groups, indicating elevated levels

throughout the day but few values reaching 160 µg/m³. (Note: the

highest of 24 values occurs with a frequency of 2.8%.) Group 8, with a geo-

metric mean of 76 and a slope of 2.1, was more closely aligned with Groups

6 and 7, as would be expected from the dendogram structure in Figure

4.7. The super episode group had an averaged geometric mean of 72. All















TABLE 4.5

Average Lognormal Parameters for
Summary Groups and Subgroup Membership

Group        Station     Average       Average
No.          Days        Geo. Mean     Geo. Dev.
                         (µg/m³)

 1            246          26.8          2.70
 2                         27.3          2.49
 3                         25.6          3.98
 4                         39.6          3.07
 5                         51.9          2.77
 6                         60.4          1.77
 7                         71.3          1.71
 8                         75.9          2.11
 9                         51.5          3.61
10                        102.4          1.64
11                        100.3          2.19
12 (NE)                    27.2          2.57
13 (E)                     48.9          2.63
14 (SE)                    72.0          2.86

Notes: 1) Group numbers 1 through 14 are shown opposite the brackets
          in Figure 4.7.
       2) Each group is made up of the station subgroups bracketed
          together in Figure 4.7; 63-1 means Station 63 Subgroup No. 1.
       3) The average geometric mean is the average geometric mean for
          all station days in the group.










[FIGURE 4.8: Lognormal plots of ozone concentration (µg/m³) for the summary
groups and subgroups of Table 4.5.]












three of the subgroups which made up Group SE predict daily contraven-

tions of the oxidant standard. Group 9 had a geometric mean of only

52, but a slope of 3.6. This indicates relatively low levels during

certain times of the day with very high excursions at other times.

Group 9 consisted of 63 days and represented nearly 60% of all of the

SE type cases. Group 11 included a total of nine days and was limited

to Station 63 and the site in St. Petersburg. With a geometric mean of

100 at a slope of 2.19, all of the members of this group would also be

expected to exceed the air quality standard. Group 10, at 102 and 1.64

respectively, predicts contraventions of 160 µg/m³ on the average but

includes some days which statistically might not exceed this limit.


Figure 4.9 is a plot of the mean and first two principal components for

the subgroups listed in Figure 4.7. A cross reference table is provided

as a guide to locate groups in Figure 4.9. All groups shown in Figure

4.7 or Figures 4.2 through 4.5 can be found in Table 4.6. The plots

are arranged in cluster order left to right and top to bottom. The

number of days in each group appears at the upper left corner in each

figure. The number in the upper right corner is the total variance of

all days in the group about the mean vector. The number at the left of

each principal component is the percent of the total variance explained

by that eigenvector. Figure 4.9 is very useful for visualizing differ-

ences between observed waveforms. Averaging smooths abrupt changes

which confound the analysis of individual station days. Thus, the mean

vector for each subgroup is easier to interpret than the waveform of a

















[FIGURE 4.9a: Mean and Principal Components of Subgroups, Type SE Days.]
[FIGURE 4.9b: Mean and Principal Components of Subgroups, Type SE Days, Continued.]
[FIGURE 4.9c: Mean and Principal Components of Subgroups, Type NE Days.]
[FIGURE 4.9d: Mean and Principal Components of Subgroups, Type NE Days, Continued.]
[FIGURE 4.9e: Mean and Principal Components of Subgroups, Type NE Days, Continued.]
[FIGURE 4.9f: Mean and Principal Components of Subgroups, Type E Days.]
[FIGURE 4.9g: Mean and Principal Components of Subgroups, Type E Days, Continued.]
[FIGURE 4.9h: Mean and Principal Components of Subgroups, Type E Days, Continued.]
[FIGURE 4.9i: Mean and Principal Components of Subgroups, Type E Days, Continued.]
[FIGURE 4.9j: Mean and Principal Components of Subgroups, Type E Days, Continued.]
[FIGURE 4.9k: Mean and Principal Components of Subgroups, Type E Days, Continued.]

Notes for each panel: 1) the total variance for the group appears in the
upper right corner; 2) the percent of total variance for each component is
the number to the right of that component; 3) the bottom curve is the mean
vector; 4) the numbers below each set of graphs indicate station and
subgroup number; the horizontal axis is the starting hour.











specific day. Differences between individual members of each group can

be visualized in terms of the first and second principal components.

Large positive or negative peaks in these curves indicate times during

the day when differences between group members occurred. These compo-

nents probably have physical significance in terms of meteorological or

other influence parameters. Many of the components display a periodic

nature having 24-hour, 12-hour and 6-hour cycles. In addition, both

mean and component vectors seem to be repeated among various subgroups.

It is also apparent that many of the subgroups exhibit unusual waveforms

which are not characteristic of the pattern expected from classical

photochemical considerations. Figure 4.9 thus forms a basis for fur-

ther investigation of oxidant phenomena. A choice of subgroups to model

and possible characteristics of the influence parameters become much

clearer after studying Figure 4.9 and Tables 4.1 and 4.5.


To illustrate the use of Figure 4.9, an example using Group 109-13 is

now presented. To locate this group, we find Station 109-13 as the

first element of the fifth column in Table 4.6. The entry opposite

109-13 is 4.9b, which is page 71, and Group 109-13 appears in the middle

at the bottom of that page. The mean vector was computed to be 75 µg/m³

from midnight until about 4:00 AM, followed by a drop in concentration

to 50 µg/m³ at about 5:30 AM. The ozone level then climbs linearly to

reach a maximum of 160 µg/m³ around 12:00 noon. From noon through

5:00 PM, the average ozone concentration for these 12 days remained

near 160 µg/m³ and dropped to about 110 µg/m³ by 11:00 PM in












approximately linear fashion. As shown in Figure 4.7, Group STP-17 is

most similar to Group 109-13 and the relative scatter was about 2.5%

when these two groups were combined. Since these graphs were plotted

in cluster order, Group STP-17 appears immediately to the left of Group

109-13 in Figure 4.9b. The similarity between average waveforms is

easily recognized.


The first and second principal components for Groups 109-13 and STP-17

have a relatively smooth form. This indicates that differences between

individual days occur over a span of a few hours rather than during a

single hour. The exception is the peak at about 7:00 AM in the first

two components for Group STP-17. These peaks tell us that much of the

variation between days for this group occurred during the concentration

drop from 100 µg/m³ to 50 µg/m³ at about 7:00 AM. Thus, some days

exhibited less change in concentration at 7:00 than the average day,

while other days had an even sharper change than the mean day. However,

over all 24 hours, days in both of these groups tended to have either higher or

lower morning values than the mean day. This is evident in the first

component which represents over 40% of the variation about the mean.

A gradual change in these weighting functions from 20 to near zero oc-

curs from midnight until 3:00 PM. This means that more variation

occurred in the early morning concentrations than in the early afternoon

values.


The total variances about the mean vector over all 24 hours were 11334

and 12640 (µg/m³)² for Groups 109-13 and STP-17, computed over 12 and 16












days respectively. These values appear at the top of each graph in

Figure 4.9b. These low values confirm the relative similarity among

days in these two groups. These two groups would be good candidates

for simulation modeling of certain super episode days. Similarly Group

94-11, having a total variance of 24726 (µg/m³)² for 22 days, is a

group for which models could be constructed. In contrast, Group 63-17

(Figure 4.9b) shows that some days tend to be unique with large changes

in concentration over short time periods. Turning to members of the NE

group, the total variance about the mean is much lower since days tend

to be very similar to one another. These days would be easiest to

model, but are also of less interest because the oxidant standard was

rarely exceeded. Models to represent these days may be useful to under-

stand underlying physicochemical and meteorological conditions which

restrict or reduce oxidant generation. Naturally, a large collection

of very similar days is much easier to model than a few unusual days.

Figure 4.9 reveals those groups which would be easiest to simulate as

well as graphically presenting the basic characteristics of all of the

days represented in 1974.


The objective in Chapter IV is to illustrate the methods presented in

Chapter III through an application to real data. Figure 4.9 provides

an effective means of breaking the analysis of the oxidant problem in

South Florida into a collection of smaller problems. Groups of similar

days can be further studied using a large number of other mathematical

techniques. In addition to studying the interaction of influence











parameters within a group, Figure 4.9 provides an approach to studying

differences between groups. Contrasting difference in influence para-

meters,using powerful statistical tools such as stepwise discriminate

analysis, could be used to identify those parameters having the greatest

impact on oxidant levels. Once these are identified, the time evolution

of ozone as a function of these parameters can be studied, provided the

patterns of interest have been observed sufficiently often. The tech-

nique put forth in this dissertation has wide application in other

fields as well as in air pollution. It provides a systematic method

of condensing useful information and comparing it. The result is a better

understanding of the problem as well as possibilities for potentially suc-

cessful approaches to further analysis.
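
One concrete form such an analysis could take is an ordinary linear
discriminant fitted to daily summaries of candidate influence parameters,
used here as a simpler, non-stepwise stand-in for the stepwise procedure
mentioned above. The sketch below is entirely hypothetical: the three
features and their values are invented, and the coefficient ranking is only
suggestive.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Hypothetical daily influence parameters labeled by day type.
    rng = np.random.default_rng(3)
    features, labels = [], []
    for day_type, temp, wind, solar in [("NE", 24, 18, 0.4),
                                        ("E", 28, 12, 0.7),
                                        ("SE", 31, 6, 0.9)]:
        for _ in range(40):
            features.append([rng.normal(temp, 2.0),
                             rng.normal(wind, 3.0),
                             rng.normal(solar, 0.1)])
            labels.append(day_type)

    X = np.array(features)
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize so coefficients compare
    lda = LinearDiscriminantAnalysis().fit(X, labels)

    for name, weight in zip(["maximum temperature", "wind speed", "solar index"],
                            np.abs(lda.coef_).mean(axis=0)):
        print(f"{name}: {weight:.2f}")
    print("training accuracy:", lda.score(X, labels))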















Table 4.6

Cross Reference Summary
Subgroups from Figures 4.2 Through 4.5

Group    Figure      Group    Figure      Group    Figure
Name     No.         Name     No.         Name     No.

63- 1    4.9c        94- 5    4.9g        109-13   4.9b
63- 2    4.9c        94- 6    4.9g        109-14   4.9a
63- 3    4.9d        94- 7    4.9d        109-15   4.9a
63- 4    4.9f        94- 8    4.9c        STP- 1   4.9c
63- 5    4.9f        94- 9    4.9e        STP- 2   4.9d
63- 6    4.9h        94-10    4.9e        STP- 3   4.9e
63- 7    4.9h        94-11    4.9a        STP- 4   4.9d
63- 8    4.9g        94-12    4.9k        STP- 5   4.9d
63- 9    4.9i        94-13    4.9i        STP- 6   4.9c
63-10    4.9e        94-14    4.9h        STP- 7   4.9j
63-11    4.9i        109- 1   4.9k        STP- 8   4.9i
63-12    4.9e        109- 2   4.9i        STP- 9   4.9h
63-13    4.9k        109- 3   4.9j        STP-10   4.9f
63-14    4.9k        109- 4   4.9k        STP-11   4.9j
63-15    4.9a        109- 5   4.9g        STP-12   4.9i
63-16    4.9b        109- 6   4.9a        STP-13   4.9h
63-17    4.9b        109- 7   4.9a        STP-14   4.9f
63-18    4.9b        109- 8   4.9g        STP-15   4.9g
94- 1    4.9j        109- 9   4.9h        STP-16   4.9f
94- 2    4.9j        109-10   4.9d        STP-17   4.9b
94- 3    4.9j        109-11   4.9c        STP-18   none
94- 4    4.9f        109-12   4.9k        STP-19   4.9b

Note: 1) 63-1 means Station 63 Subgroup No. 1.
      2) 4.9a to 4.9k refer to pages 70 to 80.













CHAPTER V

SUMMARY AND CONCLUSIONS


Atmospheric processes are inherently random and they change randomly

over time. Statisticians call this kind of behavior a non-stationary

stochastic system. Modeling air pollution is difficult because the

response to influence parameters changes with time. Classical a priori

models treat this problem by assigning stability types and making assump-

tions which allow approximate solutions to the pollutant transport

equation. Stochastic models are better suited to simulating air quality

because they represent empirical relationships between observed para-

meters emphasizing their random nature. These models can provide the

greatly needed means to better understand and control air quality.

Atmospheric surveillance is becoming more intensive and more sophis-

ticated. Naturally, stochastic models for air pollution will become

more popular when highly reliable and extensive sets of data become

available. However, the non-stationary behavior of these systems must

be recognized and treated in order to develop meaningful stochastic

models.


This dissertation described some limitations to modeling air pollution

and set forth a method for segregating data having similar attributes.

Although a number of methods are possible, the most important principle

is to recognize the need for many models. A single model to represent












all of the observed conditions is not possible. Therefore, the first

step in building stochastic air quality models is to find sets of condi-

tions which could be modeled. To qualify, the same conditions must have

occurred often enough so that repeated experiments in the set are not

too different from one another. Otherwise, the "system" being modeled

may actually represent multiple systems. Note that rapidly changing

conditions require frequent sampling and that significant spatial dif-

ferences require a dense sampling network. Hourly averages at four

sampling sites may be sufficient to represent some conditions but gros-

sly inadequate for others if very rapid changes occurred. After choosing

the appropriate conditions, reliable influence data can be analyzed to

construct a model. Similarly, models representing particular conditions

could be formulated to include most of the observed data.


The approach to treating non-stationary data presented in this work

utilized an assumption so that multivariate methods could be applied. Each hour

in a day was considered as a distinct variable since ozone is related

to the intensity of incident radiation. This assumption was not neces-

sary but greatly simplified formulating the pattern recognition problem.

A generalized grouping method could be developed to seek intervals having

similar system response. The length of these intervals does not need to

be constant, and could be initialized by passage of a front for example.

However, to develop stochastic air pollution models, one must recognize

the need for repeated observations of similar conditions and methods to

determine when these have occurred.











Atmospheric modeling is a process of simulating a waveform and comparing

it with an observed waveform which represents the same condition. The

simulated results are subtracted from the observed results to yield a

new waveform called the residuals. If the residuals are small compared

with the original data, the model represents those conditions well.

Further treating the residuals can lead to better models until ulti-

mately what is left is that purely random component known as white noise.

The models used for each successive step can be deterministic, as in

classical modeling, or stochastic. Either type of model represents a

relationship between inputs and outputs. Both deterministic and sto-

chastic models have characteristic frequency response curves relating

these inputs and outputs. Classical models develop this relationship

from physical principles while stochastic models rely on empirical

relationships. But the final test of each model is its ability to

reconstruct the observed data and yield a purely random residual. Models

fail when the influence data are inadequate and when functional coeffi-

cients are in error. It really doesn't matter how the model was devel-

oped, so long as the correct influence parameters and coefficients have

been determined and can be relied upon to correctly simulate the condi-

tions represented. Pattern recognition used on observed data can provide

test data to compare with a priori models. A priori models can be used

to suggest influence parameters and to make assumptions in applying

pattern recognition tools. Both techniques should be used together.
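
The residual test is easy to state in code. In the sketch below, a synthetic
observed series and a synthetic simulated series stand in for real model
output; the residuals are their difference, and the lag-one autocorrelation
of the residuals serves as a crude check on whether what remains is close to
white noise.

    import numpy as np

    def lag_one_autocorrelation(x):
        """Correlation between the series and itself shifted by one step."""
        x = np.asarray(x, dtype=float) - np.mean(x)
        return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

    rng = np.random.default_rng(4)
    hours = np.linspace(0.0, 2.0 * np.pi, 24)
    observed = 60.0 + 40.0 * np.sin(hours) + rng.normal(0.0, 8.0, 24)
    simulated = 60.0 + 40.0 * np.sin(hours)          # stand-in for model output

    residuals = observed - simulated
    print(f"residual mean {residuals.mean():.1f}, "
          f"lag-one autocorrelation {lag_one_autocorrelation(residuals):.2f}")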











A pattern recognition approach to dealing with the non-stationary waveforms found in air pollution was presented for ozone data reported at four sampling sites in Tampa and Saint Petersburg, Florida. These data were selected because the oxidant problem is a national concern and accurate data were available. Each day was reduced from 24 hourly observations to a set of six values while retaining over 90% of the available information. These six values represented the proportionate contributions of six independent pseudo-variables, each of which was constructed from the original 24 hours. With proper scaling, the Euclidean metric was used to measure the distance between days, and the most similar days were combined into common groups. Each station was analyzed separately, and the results were displayed graphically as a dendrogram. Days which experienced elevated oxidant levels tended to be closely associated in the dendrogram; similarly, days with low ozone levels were found together. The average factor scores from each group were used to determine spatially similar groups. This analysis resulted in the super episode (SE), episode (E), and non-episode (NE) classification, which was used to contrast meteorological conditions. As described in Appendix III, intense solar insolation following the passage of a cold front and the associated high pressure system accounts for many super episode days. A clear explanation of the differences between E and NE type days was not possible from the limited meteorological data available. The mean and first two principal components for each group at each station were plotted in cluster order. This graphical display demonstrated the existence of a number of distinct types of days which appear to require basically different models.
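The sketch below outlines the same sequence of steps under assumed choices (synthetic data, six retained components, Ward linkage, an eight-group cut, and one plausible scaling of the factor scores); it is an illustration of the procedure, not a reproduction of the computations performed in this study.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Stand-in for the standardized day-by-hour matrix (365 days x 24 hours).
rng = np.random.default_rng(2)
z = rng.standard_normal((365, 24))

# Principal components from the eigendecomposition of the covariance matrix.
cov = np.cov(z, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)           # returned in ascending order
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

k = 6                                          # components retained per day
scores = z @ eigvec[:, :k]                     # six factor scores per day
print("variance retained:", eigval[:k].sum() / eigval.sum())

# Scale the scores, measure Euclidean distance between days, and merge the
# most similar days; the dendrogram shows which days cluster together.
scaled = scores / np.sqrt(eigval[:k])
links = linkage(scaled, method="ward", metric="euclidean")
groups = fcluster(links, t=8, criterion="maxclust")
dendrogram(links, no_plot=True)                # pass no_plot=False to draw
print("days per group:", np.bincount(groups)[1:])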















Average lognormal parameters for groups of days were computed and the distributions were plotted. These also demonstrated distinct differences between types of days and showed that there is no single underlying distribution representing all observed conditions.
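As an illustration of how such group parameters can be obtained, the following sketch computes the geometric mean and geometric standard deviation for two invented groups of concentrations; the group labels and values are synthetic, not data from this study.

import numpy as np

def lognormal_params(conc):
    """Geometric mean and geometric standard deviation of positive data."""
    logs = np.log(conc)
    return np.exp(logs.mean()), np.exp(logs.std(ddof=1))

# Synthetic hourly ozone concentrations (ug/m3) for two invented day groups.
rng = np.random.default_rng(3)
groups = {
    "episode":     rng.lognormal(mean=4.6, sigma=0.4, size=500),
    "non-episode": rng.lognormal(mean=3.3, sigma=0.6, size=500),
}

for name, conc in groups.items():
    gm, gsd = lognormal_params(conc)
    # Different (gm, gsd) pairs per group are what rule out a single
    # underlying distribution for all observed conditions.
    print(f"{name:12s} geometric mean = {gm:6.1f} ug/m3   GSD = {gsd:4.2f}")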


In many cases the observed ozone behavior departed from what classical photochemical oxidant theory would predict.


Pattern recognition provides a method of handling the randomly changing, non-stationary conditions encountered in atmospheric studies. This application clearly illustrated the usefulness of pattern recognition as a diagnostic tool in air pollution research.


A number of significant conclusions can be drawn from the results of this work. All of them focus on the problem of stationarity. Fitting confidence bounds on the population mean or maximum and computing order statistics from a sample require a single stationary distribution, yet it is very apparent that the density function of air quality data changes with time. Consequently, predicting the probability that a prescribed level was exceeded, or will be exceeded in the future, cannot be done with certainty. Influence parameters can be identified and functional relationships can be developed, but there is no guarantee that the same influence parameters will be important under another set of conditions. It is also possible that influence parameters considered to be insignificant may, under some circumstances, exercise a predominant influence on air quality. Building models to describe air pollution, given this lack of constraints, is indeed complicated, and a single model capable of including all of these possible conditions is difficult to imagine. Classifying and analyzing observed air quality provides an alternative approach which divides the problem into manageable parts. If the system changed in a gradual, predictable way, forecasting could be accomplished using adaptive filters with coefficients that change predictably. Unfortunately, conditions in the atmosphere can change abruptly, requiring a new set of influence parameters and different functional relationships. But a comprehensive history of repeated events can be used to determine which model is appropriate. Real-time analysis of atmospheric processes can be developed only after extensive collection and analysis of data. Classifying conditions and rearranging them into chronological order produces sequences of events, and there are certain patterns in these sequences which can be used to predict when an air pollution episode is likely to occur. Mathematical models to simulate each kind of day, together with models which predict which type of model is most probable, are the basis for real-time prediction and stochastic simulation of air quality data.
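One simple way to make the last point concrete, under the assumption that a first-order model of the day-type sequence is adequate, is sketched below: transition probabilities estimated from a classified sequence of days (the sequence shown is invented) serve as the "model which predicts which type of model is most probable."

import numpy as np

# Invented chronological sequence of classified day types:
# 0 = non-episode (NE), 1 = episode (E), 2 = super episode (SE).
labels = ["NE", "E", "SE"]
sequence = np.array([0, 0, 1, 2, 2, 1, 0, 0, 0, 1, 1, 2, 0, 0, 1])

# Estimate first-order transition probabilities from successive day pairs.
counts = np.zeros((3, 3))
for a, b in zip(sequence[:-1], sequence[1:]):
    counts[a, b] += 1
transition = counts / counts.sum(axis=1, keepdims=True)

today = sequence[-1]                        # today's classified day type
tomorrow = int(transition[today].argmax())  # most probable type for tomorrow
print(np.round(transition, 2))
print(f"today: {labels[today]}  most probable tomorrow: {labels[tomorrow]}")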



