TOWARD STOCHASTIC MODELS FOR AIR POLLUTION

By

JOHN FREDERICK CHADBOURNE, JR.

A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1977

Dedicated with love to my family, Debby, Julie and Chipper.

ACKNOWLEDGEMENTS

I wish to thank my committee members, Doctors Dale A. Lundgren, J. E. Singley, R. Carl Stoufer, and especially my chairman, Dr. Paul Urone, and my cochairman, Dr. Allen E. Durling, for their efforts in directing this dissertation. Thanks also to Dr. Wayne Huber for all of his useful suggestions. Mr. Steven T. Gladin deserves my appreciation for his guidance in statistics and computer technology. In the final stages, this dissertation was greatly helped by the encouragement and cooperation of Mr. J. Alfred Schueler. The manuscript was typed by Ms. Brenda Lanza. I sincerely appreciate their efforts.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER I      INTRODUCTION
CHAPTER II     BACKGROUND AND PROBLEM STATEMENT
CHAPTER III    PATTERN RECOGNITION TO IDENTIFY STATIONARY INTERVALS
CHAPTER IV     AN APPLICATION TO URBAN OZONE DATA
CHAPTER V      SUMMARY AND CONCLUSIONS
APPENDIX I     AIR POLLUTION STATISTICS
APPENDIX II    PRINCIPAL COMPONENTS
APPENDIX III   METEOROLOGICAL CHARACTERISTICS OF DAYS GROUPED ON THE BASIS OF SURFACE OZONE CONCENTRATIONS AS OBSERVED IN TAMPA, FLORIDA
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table 3.1   Ozone Concentration for Ten Example Days With Mean and Standard Deviation by Hour
Table 3.2   Factor Scores for Ten Day Example
Table 3.3   Ten Day Cluster Example Scatter Summary
Table 4.1   Contraventions of the Oxidant Standard Observed in Hillsborough and Pinellas Counties, Florida in 1974
Table 4.2   Mean and Six Principal Components, Station 109, 1974 and 1975
Table 4.3   Reconstruction of April 19, 1974, Station 109
Table 4.4   Days Represented in Group SE: Day Type at Station and Number of Hours Exceeding 159 µg/m³
Table 4.5   Average Lognormal Parameters for Summary Groups and Subgroup Membership Cross Reference
Table 4.6   Summary Subgroups from Figures 4.2 Through 4.5
Table AI.1  Probability Density Functions Commonly Used in Air Pollution Modeling
Table AI.2  Confidence Limits for Ambient Ozone Calibration

LIST OF FIGURES

Fig. 3.1   Mean and Principal Components for Ten Day Example
Fig. 3.2   Dendograms for Ten Day Example
Fig. 3.3   Three Dimensional Display of Factor Scores for Ten Day Example
Fig. 3.4   Lognormal Plots of Each Day in Ten Day Example
Fig. 3.5   A Schematic Diagram for Empirical Modeling Using Cluster Analysis
Fig. 4.1   Mean and Principal Components for Hillsborough and Pinellas County Ozone Data 1973 Through 1975
Fig. 4.2   Dendogram for Station 63 in 1974
Fig. 4.3   Dendogram for Station 94 in 1974
Fig. 4.4   Dendogram for Station 109 in 1974
Fig. 4.5   Dendogram for Station STP in 1974
Fig. 4.6   Clustering Steps for the Twelve Days in Group 10913
Fig. 4.7   Dendogram for All Four Stations in 1974 Based on Mean Factor Scores
Fig. 4.8   Lognormal Graphs for Summary Groups
Fig. 4.9   Mean and Principal Components of Subgroups
Fig. AI.1  Lognormal Distribution with mg = 100, σg = 2.0 and Associated Order Statistics for Highest of 365 Observations
Fig. AI.2  β₁, β₂ Plane for Pearson System
Fig. AI.3  Larsen's Model
Fig. AI.4  Example Three Stage Calibration

Abstract of Dissertation Presented to the Graduate Council of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

TOWARD STOCHASTIC MODELS FOR AIR POLLUTION

By John Frederick Chadbourne, Jr.
December, 1977
Chairman: Paul Urone
Major Department: Environmental Engineering Sciences

Mathematical models to describe air pollution are necessary planning tools in the air quality management process. They provide a means of predicting and controlling pollution.
Classical models are necessary in the absence of reliable ambient measurements. These models have become increasingly sophisticated, but they remain inadequate to describe many of the phenomena which actually occur in the atmosphere. Statistical methods applied to reliable air quality influence parameters can advance the science of modeling, leading to stochastic models to simulate and better understand air pollution. Ultimately, these techniques will provide capabilities for real-time analysis and control of air pollution.

Statistical models for air pollution have generally been limited to specification of distribution functions or analysis of orthogonal components because dynamic statistical models are complicated under the nonstationary conditions encountered in the atmosphere. This dissertation investigates the application of pattern recognition techniques to identify quasi-stationary conditions with respect to both space and time. Systems models approximating these quasi-stationary groupings can be constructed when adequate influence data become available. An analysis of mesoscale meteorological conditions was used to illustrate how influence parameters can be identified.

Photochemical oxidants are a national concern. The primary constituent is ozone, which can be accurately measured using a chemiluminescent technique. During 1974, the Tampa, Florida, area experienced many elevated oxidant periods which were documented on well-calibrated continuous monitors. These data were therefore selected to determine conditions under which stochastic air pollution models could be developed.

The multivariate approach developed in this dissertation for identifying similar ozone waveforms is a powerful analytical tool. The method is not restricted to air pollution research, but extensive applications in atmospheric studies are appropriate. Perhaps the most useful aspect of the method is the ability to identify conditions which facilitate model development. The most frequently encountered condition in this study was designated the nonepisode type. For these days, the oxidant standard was rarely exceeded; most sampling sites recorded the same type of waveform; and many consecutive days reported this condition. These days would be relatively easy to model since mesoscale parameters would generally be adequate, many examples of these conditions are available, and variations between days in this group are small. The next most frequently occurring condition was called an episode day since the oxidant standard was often exceeded, but ordinarily for only one or two hours in each day. Spatial differences become increasingly important for this group, indicating a need for microscale data in some instances. More variance was encountered and many different waveforms were represented. A third type of day was designated super episode since the oxidant standard was usually exceeded and elevated levels persisted for several hours. Super episode conditions often followed frontal passage under high pressure conditions when solar insolation was very high. Many of these days exhibited rapid changes in concentration and radically differing waveforms. Spatial differences were also significant on many of these occasions. Consequently, super episode conditions would be most difficult to model and would require a dense network providing microscale influence data and frequent sampling. There were, however, a number of frequently repeated spatially homogeneous conditions within the super episode group. These would be good candidates for stochastic modeling.

CHAPTER I
INTRODUCTION

There are many approaches to predicting air quality in today's society. This is natural since the problem is indeed complicated. As discussed in Gifford (16, p.84), the most general equation describing pollutant transport is a hyperbolic partial differential equation. There is no known closed-form solution for this equation.
Consequently, different methods of finding approximate solutions have evolved. The two basic classes of methods are called a priori (before the fact) and a posteriori (after the fact). A priori methods concentrate on a knowledge of the physics involved to make appropriate assumptions and to establish boundary conditions. Similarity variables, separation of variables and integral transform methods are then used to obtain analytical solutions. A posteriori methods operate on observed variables as they change with time to develop functional relationships between influence parameters (inputs) and the pollutant modeled (output). Both types of models are useful tools for understanding the causes and methods of controlling air pollution.

There are many kinds of a priori models. The simplest type uses a single analytical equation for which various combinations of "steady" input parameters yield concentration as a function of distance from the sources. These models are divided into two types called long-term and short-term models. Both use the same equation, but the manner of treating the output concentration differs. Long-term models weight the result from each set of conditions by an empirically determined frequency of occurrence for that condition. The sum of all such concentrations represents a long-term average pollutant level as a function of space. Isopleth contour maps may then be plotted. Short-term models provide downwind steady-state concentrations. The averaging time for these results is assumed to be on the order of 10 minutes, and longer averaging times (30 min., 1 hour, etc.) are estimated empirically by considering meander in the unsteady wind. Sophisticated a priori models use numerical integration techniques. Forward difference, central difference, backward difference and combinations are used with a moving (Lagrangian) coordinate system, as well as the more common fixed (Eulerian) coordinate system.
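For concreteness, a minimal short-term calculation of this single-equation type might look like the following sketch. This is an illustration, not a model from this work: the Gaussian plume formula is the standard steady-state form, the dispersion-coefficient curves are the Briggs rural fits for neutral stability used only as plausible stand-ins, and the source parameters are invented.

```python
import math

def plume_concentration(q, u, x, y, z, h):
    """Steady-state Gaussian plume with ground reflection.

    q: source strength (g/s), u: wind speed (m/s), h: effective stack
    height (m); (x, y, z) in meters, x downwind.  Sigma formulas are the
    Briggs rural curves for neutral stability (illustrative choice)."""
    sy = 0.08 * x / math.sqrt(1.0 + 0.0001 * x)   # lateral spread, m
    sz = 0.06 * x / math.sqrt(1.0 + 0.0015 * x)   # vertical spread, m
    lateral = math.exp(-y ** 2 / (2.0 * sy ** 2))
    # image source at -h reflects the plume off the ground
    vertical = (math.exp(-(z - h) ** 2 / (2.0 * sz ** 2))
                + math.exp(-(z + h) ** 2 / (2.0 * sz ** 2)))
    return q / (2.0 * math.pi * u * sy * sz) * lateral * vertical

# ground-level centerline concentration 1 km downwind of a 50 m stack
c = plume_concentration(q=100.0, u=4.0, x=1000.0, y=0.0, z=0.0, h=50.0)
```

A long-term model of the kind described above would evaluate this same equation for each stability/wind combination and weight the results by observed frequencies.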
Generally, long-term models provide better estimates since averages are easier to predict than specific conditions. In either case, model performance is evaluated by comparing model projections with observed air quality under the assumed conditions. This practice is known as calibration. Millions of dollars have been spent to develop a priori models in recent years.

The two types of a posteriori models are loosely referred to as static and dynamic. Static models focus on the underlying set of distributions which characterize observed air quality. Most popular of these is the lognormal family. Two kinds of questions are answered by such models. The first is the empirical probability of observing a value exceeding some fixed limit, presumably an air quality standard, during the year sampled. The second issue to be resolved is the probability that this limit would be exceeded in subsequent years. This type of model is directly related to air quality standards, and it therefore has been used and abused extensively.

Dynamic statistical (stochastic) models have not been thoroughly investigated. There are many reasons which explain this historical lack of interest in stochastic systems for air pollution. A mathematical basis in terms of the physical process provides a framework in which to understand pollutant dispersion. Consequently, there is an inherent bias toward physical models. The theorist will resort to empirical techniques only after he has thoroughly exhausted his analytical avenues. Empiricists ultimately attempt to relate conclusions to physical laws to facilitate interpretation of results. Naturally, a priori models are developed first. Attention turns to empirical models when the physical models become inadequate or when additional viewpoints become necessary. Moreover, in air pollution, the inaccuracy of field measurements becomes extremely important since trace concentrations must be studied.
Surface chemistry, molecular interaction and interference become serious problems which undermine the validity of data. Empirical models are limited by the axiom "garbage in yields garbage out." Consequently, these models cannot really be developed until reliable aerometric data are available.

In recent years it has become increasingly clear that the dispersion of air pollutants is a stochastic phenomenon. Realistically, a simple analytical solution cannot provide reliable short-term solutions under conditions normally encountered in the atmosphere. Supporters of a priori models have argued that stochastic models would be applicable only to the region for which they were developed. However, the same factors which contribute to regional applicability tend to violate the assumptions required in physical models. Consequently, as reliable continuous air quality data become more available, interest in stochastic models will increase.

Perhaps the most troublesome aspect of air pollution modeling is the nonstationary character of the atmospheric transport process. Simply stated, this means that the system being modeled changes with time. Systems engineers have coped with this kind of problem by developing adaptive filters. However, a systematic basis for adjusting the filter coefficients is required. In this dissertation, the nonstationary character exhibited by ozone data reported in Hillsborough and Pinellas counties in Florida is investigated. These data were selected due to the national importance of the photochemical oxidant problem and the availability of rigorously verified data. The objective is to classify reported episodes into quasi-stationary intervals using techniques from pattern recognition. These constitute a basis for multiple stochastic model development and also prove to be a useful diagnostic tool for understanding photochemical air pollution.
CHAPTER II
BACKGROUND AND PROBLEM STATEMENT

The study of air pollution as a function of space and time has long been a human concern. Pollutant dispersion is described by differential equations. Mathematical techniques for solving these equations have historically been adapted from the study of heat transport. This is primarily due to the comprehensive analytical solutions developed for various boundary conditions and input functions. Atmospheric systems contain random variables and random functions of these variables. Differential equations for random functions are called stochastic processes, and they generally do not have analytical solutions. An approach which avoids steady-state analytical solutions is numerical integration. The most advanced air pollution models in use today employ this technique.

Regardless of the methods used to model a process, comparison with measured results is essential. Empirical analysis of these measured values can provide insight concerning the functional relationships between influence parameters and pollutant concentrations in addition to validating models. Thorough statistical analysis of observed concentrations, viewed as nonstationary dynamic systems, leads to the development of stochastic dispersion models. This system theory approach to air pollution is investigated in this dissertation.

Brownian motion provides a starting point for understanding atmospheric stochastic processes. In a paper by Einstein (12) the irregular motion of colloidal particles based on collisions with surrounding fluid molecules was described. A thorough treatment of Brownian motion is given in Fuchs (14), showing that the velocity history of a colloidal particle is a random (stochastic) process.
The equation describing Brownian motion is

    dU(t)/dt = U̇(t) = [A]U(t) + [B]W(t)                              (2.1)

where
    d/dt = total derivative with respect to time
    U(t) = particle velocity vector at time t (a state variable)
    [A]  = −β[I], the system matrix, with:
           β = 6πμa/m (particle mobility)
           a = radius of the spherical particle
           μ = fluid viscosity
           m = particle mass
    [I]  = identity matrix
    [B]  = distribution matrix relating input variables to state variables
    W(t) = input vector of white noise representing molecular collisions.

Equations of this form were first treated by Langevin (18), who hypothesized two additive components. The first component, [A]U(t), is the systematic dynamic friction (drag). The second component, [B]W(t), represents the random force exerted by molecular collisions. Brownian motion is also known as the Wiener process after Norbert Wiener (48), who formulated much of the present theory of time series. Integrating Langevin's equation yields

    U(t) = exp(At)U(0) + ∫₀ᵗ exp(A(t−λ)) B W(λ) dλ                    (2.2)

where
    exp(At) = the matrix exponential
    U(0)    = velocity vector at time zero; and
    λ       = the lag parameter of integration.

From statistical mechanics, the ensemble average squared velocity of each u₁, u₂, u₃ (x, y or z) component of the Brownian particle is

    ⟨uᵢ²⟩ = kT/m                                                      (2.3)

where
    ⟨uᵢ²⟩ = ensemble average squared velocity component
    k     = Boltzmann's constant
    T     = absolute temperature.

Since ⟨uᵢ²(t)⟩ is not a function of time, the process is stationary, having a time-invariant autocorrelation function. This autocorrelation function, defined as

    R(τ) = ⟨u₁(t) u₁(t+τ)⟩ / ⟨u₁(t) u₁(t)⟩                            (2.4)

can easily be determined from equation 2.2. For Brownian motion, the ensemble average of all terms containing the input vector W(t) is zero.
Thus only the drag component enters into this equation, resulting in

    R(τ) = e^(−βτ)                                                    (2.5)

This is the autocorrelation function of a Markov process, characterized in Arnold (2, p.27) by the property "if the state of a system at the present is known, information from past times has no effect on our knowledge of the probable development of the system in the future." In other words, the system has no memory.

The work of Taylor (45) provides some very important results in the field of turbulent diffusion theory. Taylor's theorem relates the expected particle displacement of a diffusing cloud to the velocity autocorrelation function. For an initially concentrated cloud (instantaneous point source) Taylor's theorem is

    dσx²/dt = 2⟨u₁²⟩ ∫₀ᵗ R(τ) dτ                                      (2.6)

Here σx is the radius of inertia, a measure of cloud size. Substituting the autocorrelation for Brownian motion and integrating this equation yields

    σx² = 2⟨u₁²⟩ [t/β + (e^(−βt) − 1)/β²]                             (2.7)

For unit density spheres with radii smaller than 10 micrometers, the relaxation time (β⁻¹) is less than 0.01 seconds. Thus, for times on the order of seconds,

    σx² = 2⟨u₁²⟩t/β = 2Dt                                             (2.8)

where
    D = kT/(6πμa)  (the diffusivity constant).

The equipartition of energy principle from statistical mechanics and the integrated Langevin equation, together with Taylor's theorem, yield Einstein's conclusion that Brownian motion can be simply defined using a diffusivity constant. The classical diffusion equation representing Brownian motion is an instantaneous point source with no bulk fluid motion. Thus,

    ∂c/∂t = D∇²c = D(∂²c/∂x² + ∂²c/∂y² + ∂²c/∂z²)                     (2.9)

where
    ∂/∂t = partial derivative with respect to time
    ∇²   = the Laplacian operator
    c    = concentration of diffusing substance
    D    = a constant diffusivity.

Analytical solutions for this equation, and for augmented equations representing atmospheric conditions, are abundant in the literature. Two particularly readable and comprehensive summaries are Slade (41) and Csanady (7).
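Equation 2.8 is easy to check numerically. The sketch below is an illustration added here, not part of the original work: it integrates Langevin's equation 2.1 for an ensemble of particles with a simple Euler-Maruyama scheme, using arbitrary nondimensional parameters (kT/m = 1), and compares the displacement variance with 2Dt.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, kT_m = 10.0, 1.0          # mobility (1/s) and kT/m; illustrative values
D = kT_m / beta                 # Einstein diffusivity D = kT/(m*beta), eq. 2.8
dt, steps, n = 1e-3, 5000, 20000
u = rng.normal(0.0, np.sqrt(kT_m), n)   # start in thermal equilibrium (eq. 2.3)
x = np.zeros(n)
for _ in range(steps):
    # Langevin equation 2.1: drag term plus white-noise molecular forcing
    u += -beta * u * dt + np.sqrt(2.0 * beta * kT_m * dt) * rng.normal(size=n)
    x += u * dt
t = steps * dt
# for t >> 1/beta the displacement variance approaches 2*D*t (eq. 2.8)
```

With β = 10 s⁻¹ and t = 5 s, the exact result from equation 2.7 differs from 2Dt by only a few percent, and the simulated ensemble variance lands close to both.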
For the isotropic conditions (σx = σy = σz) exhibited in Brownian motion, the analytical solution is

    c(r,t) = Q(2πσ²)^(−3/2) exp[−r²/(2σ²)]                            (2.10)

This is the density function for the Gaussian, or normal, probability law, where
    Q  = source strength
    r² = x² + y² + z², the squared radius of the diffusing cloud
    σ² = 2Dt.

The diffusivity required to satisfy equation 2.10 is the same diffusivity derived for the stochastic formulation of Brownian motion given in equation 2.8. Thus, the stochastic and analytical approaches yield the same results for the special conditions of Brownian motion.

For n chemically reactive constituents, the continuity equation is

    ∂Cᵢ/∂t + ∂(u₁Cᵢ)/∂x + ∂(u₂Cᵢ)/∂y + ∂(u₃Cᵢ)/∂z
        = D∇²Cᵢ + Rᵢ(C₁,...,Cₙ,T) + Sᵢ(x,y,z,t)                       (2.11)

where
    i          = 1, 2, ..., n
    n          = number of pollutants involved
    Cᵢ         = concentration of constituent i
    (u₁,u₂,u₃) = (x,y,z) components of wind velocity
    Rᵢ         = rate of formation of constituent i by chemical reaction
    Sᵢ         = rate of emission of constituent i from sources.

In practice, only n−1 equations are required since the remaining constituent can be determined by difference.

In modeling air pollution, one can usually assume that the pollutants are merely tracers of atmospheric activity. Under this assumption equation 2.11 can be solved alone without considering the coupled Navier-Stokes (27) and energy equations. Now, each wind velocity component consists of a deterministic and a stochastic component (uᵢ = ūᵢ + uᵢ′). Similarly, each Cᵢ consists of an ensemble-average concentration and a random component (Cᵢ = ⟨Cᵢ⟩ + Cᵢ′), and molecular diffusion is negligible compared to turbulent dispersion. Introducing these considerations and taking the expected value of equation 2.11 yields the "K-theory" coupled partial differential equations presented in earlier dispersion modeling studies; for example, see Pasquill (30) and Slade (41). The general form of this equation is

    ∂⟨Cᵢ⟩/∂t + ū₁ ∂⟨Cᵢ⟩/∂x + ū₂ ∂⟨Cᵢ⟩/∂y + ū₃ ∂⟨Cᵢ⟩/∂z
        = ∂/∂x(Kxx ∂⟨Cᵢ⟩/∂x) + ∂/∂y(Kyy ∂⟨Cᵢ⟩/∂y) + ∂/∂z(Kzz ∂⟨Cᵢ⟩/∂z)
          + Rᵢ(⟨C₁⟩,...,⟨Cₙ⟩,T) + Sᵢ                                  (2.12)

where the Kaa are the eddy diffusivity components and Kab = 0.0 is assumed for a ≠ b.
Analytical solutions for Cᵢ(x,y,z,t) are derived by making assumptions in the above equation and applying classical mathematics [Carslaw and Jaeger (6)]. Unfortunately, dispersion modeling is complicated by "eddy diffusion" when random motions of the transporting fluid enter the equation. Equation 2.12 is analogous to equation 2.9 with "eddy diffusivities" (Ks) in place of molecular diffusivity (D). This difference can be interpreted in terms of the "scale of turbulence." For Brownian motion the length scale is the mean free path and the time scale (t_L) is β⁻¹. For turbulent flow there is no single time or length scale. Turbulent flow in the atmosphere is characterized by eddies of vastly different sizes. Large eddies extract energy from the bulk flow. Smaller eddies feed on the energy of larger eddies, resulting in an "energy cascade." The eddy energy comes from the bulk fluid energy augmented by solar insolation, surface heat and topographic factors. These eddies move groups of particles together, so the motion of neighboring particles is not independent. This correlation of motion over many size scales causes unpredictable differences such that the observed concentrations are random variables. Further, eddy diffusivities are not constant in space or time. Under these conditions, there is no analytical solution to equation 2.12. Many researchers have turned to numerical integration to overcome this limitation [Reynolds et al. (36) and (37); California Air Resources Board (CARB) (5); and Randerson (35)].

The status of recent air pollution modeling is summarized in NATO/CCMS (26). The advantages and limitations of both classical and statistical models are presented in a complementary context. Advanced classical models require extensive specification of wind fields, diffusivities and constituent concentrations as functions of space and time.
Problems with these models include large computer time and storage requirements, stability of integration schemes, "artificial diffusion," and cumulative errors (CARB (5)). In some situations, simple models yield better results than the advanced models, as demonstrated by Gifford in NATO/CCMS (15, p.XVI). Statistical air pollution models have not been as extensively developed as classical models. Very few dynamic statistical models have been formulated. The primary reasons for this are the limited availability of reliable air quality data from dense sampling networks; a lack of regional applicability of the models developed; mathematical complications; and, most importantly, the nonstationary character of air pollution. The chronology of the NATO reports clearly illustrates the increasing emphasis on statistical aspects in modern air quality simulation.

A stationary process has an autocorrelation function which does not change with time. Second order stationarity implies that the mean and variance of repeated experiments are constant. Air pollutant distributions are not stationary, primarily due to changes in meteorological conditions. This nonstationary character of atmospheric turbulence was treated by Pasquill (30). He devised "stability classes" to describe turbulence types. This stationarity concept is closely coupled to the duration of an experiment and the averaging time of samples. Although diurnal variation denies stationarity with respect to 24 consecutive hour-long experiments, day-long experiments may be stationary with respect to one another. Similarly, major changes in meteorology, such as storms or frontal passage, disrupt stationary conditions. As pointed out by Pasquill (30, p.17), it is this nonstationary nature of atmospheric processes which limits the use of time-series techniques. Selecting subgroups of data which are stationary with respect to one another provides a means of developing dynamic statistical models for air pollution.
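The day-long grouping idea can be illustrated crudely in a few lines: treat each day as a separate experiment and compare means and variances across days. The sketch below is only an illustration, not a formal stationarity test; the tolerance and the synthetic records are invented for the example.

```python
import numpy as np

def daily_stationarity_summary(hourly, rel_tol=0.5):
    """Split an hourly record into day-long segments and compare their
    means and variances: a crude check of second-order stationarity.
    rel_tol is an arbitrary illustrative threshold, not a formal test."""
    days = np.asarray(hourly, dtype=float).reshape(-1, 24)
    means, variances = days.mean(axis=1), days.var(axis=1)
    spread = lambda v: (v.max() - v.min()) / max(abs(v.mean()), 1e-12)
    return {
        "day_means": means,
        "day_variances": variances,
        "quasi_stationary": bool(spread(means) < rel_tol
                                 and spread(variances) < rel_tol),
    }

hours = np.arange(72)
steady = 50 + 5 * np.sin(2 * np.pi * hours / 24)   # repeating diurnal wave
trending = steady + 2.0 * hours                    # strong day-to-day trend
```

For the repeating diurnal wave, every day has the same mean and variance, so the days are stationary with respect to one another despite the within-day cycle; the trending record fails the same comparison.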
Stochastic models offer some very attractive qualities. Since their random dynamic nature is analogous to observed atmospheric conditions, they should prove superior to the steady-state Gaussian models popularized in the recent past.

CHAPTER III
PATTERN RECOGNITION TO IDENTIFY STATIONARY INTERVALS

Selecting reproducible conditions in air pollution can be viewed as a problem in pattern recognition. Many methods which accomplish this objective can be formulated. A good method must operate on readily available data to delineate similar conditions at reasonable cost. One could begin by plotting the observed concentration as a function of time and studying the graph to pick piecewise-similar days. Unfortunately, conditions in the atmosphere usually change in rapid and random fashion, as when fronts pass or reacting contaminants interact. Consequently, randomly repeated intervals occur which are not ordinarily sequential. An additional complication concerns differences in waveforms occurring at the same time, but at different spatial locations. In order to apply stochastic modeling techniques to air pollution, methods for identifying similar conditions randomly distributed through space and time are necessary.

Statistical studies of time behavior traditionally deal with time correlation. The correlation of a signal with itself is known as the autocorrelation function (ρ(τ)). Tau (τ) is a time offset representing lag or lead from zero time toward past or future time. For example, ρ(0), the autocorrelation at zero lag, is computed from the covariance of a sequence of numbers. For a zero-mean sequence, this is the sum of all squares divided by the number of observations. At each lag (for example, 10 hours) a new covariance can be estimated by summing the product of all paired observations τ (ten) hours apart. Dividing each of these covariances by the covariance at zero lag yields the sample autocorrelation function.
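The recipe just described translates directly into code. The sketch below follows the text's procedure (zero-mean the series, form lagged covariances, normalize by the lag-0 covariance); the sinusoidal record is a synthetic illustration.

```python
import numpy as np

def sample_autocorrelation(series, max_lag):
    """Sample autocorrelation as described above: covariances of the
    zero-mean series at each lag, normalized by the lag-0 covariance."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    c0 = np.sum(x * x) / n                         # covariance at zero lag
    lagged = [np.sum(x[:n - k] * x[k:]) / n for k in range(max_lag + 1)]
    return np.array(lagged) / c0

# a pure 24-hour cycle is strongly correlated with itself one day later
# and anticorrelated with itself half a day later
t = np.arange(24 * 10)
rho = sample_autocorrelation(np.sin(2 * np.pi * t / 24), 24)
```

Note that this biased estimator (dividing by n rather than by the number of overlapping pairs) shrinks the values slightly at large lags, which is why ρ(24) comes out just under 1 for a perfectly periodic signal.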
By analogy, two signals can be related by computing cross products as one signal is slid past the other, yielding the cross-correlation function. These two functions are used by statisticians to characterize time behavior. When the relationship between the autocorrelation and cross-correlation functions does not change, the system is called time invariant or stationary. However, when this relationship changes with time, as it does in air pollution studies, multiple inputs and nonstationary methods are essential.

Engineers are usually exposed to the concept of frequency response in the study of system control. The way a system reacts to changes in input is described in terms of transfer functions, having poles, zeros and transforms (Laplace and Fourier) relating time to frequency. Theoretically, the Fourier transform of a signal and the Fourier transform of the autocorrelation function are directly related through the power spectral density function, as discussed in Wiener (48). Consequently, systems having the same correlation functions should exhibit the same frequency response. Identifying intervals having the same frequency response specifies a quasi-stationary system. This approach, using very efficient algorithms to compute sliding fast Fourier transforms (FFT) as in Rabiner and Gold (34, p.382), is a powerful analytical technique. It provides a way to compare any interval of time with any other interval based on the system response. However, this direct approach to identifying stationarity is expensive and does not take advantage of the known physical processes involved.

One of the most significant factors in atmospheric studies is solar insolation. The sun is a primary driving force behind meteorology. Wind speed, wind direction and atmospheric turbulence are related to the intensity and duration of incident radiation as well as other major physical phenomena.
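The dominance of the solar cycle shows up immediately in the spectral view described above: the squared magnitude of the Fourier transform estimates the power spectral density, and for an hourly record driven by insolation the spectrum peaks at one cycle per 24 hours. A minimal illustration on a synthetic record:

```python
import numpy as np

hours = np.arange(24 * 30)                          # 30 days of hourly values
signal = 40 + 20 * np.sin(2 * np.pi * hours / 24)   # synthetic diurnal waveform
x = signal - signal.mean()
psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)          # periodogram estimate of PSD
freqs = np.fft.rfftfreq(len(x), d=1.0)              # cycles per hour
peak = freqs[np.argmax(psd)]                        # dominant frequency, 1/24
```

A sliding version of this computation, applied window by window, is the expensive direct approach to stationarity mentioned in the text.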
Since the period of rotation of the earth is one day, it is reasonable to divide a continuous record into 24 hour-long intervals. Longer periods can be selected based on frontal passage or storm cycles as appropriate. These periods can be further subdivided into wind direction or turbulence types as is currently practiced in dispersion modeling. Differences between days may then be viewed as multivariate problems. The hour of day, for example, can be a variable, and each day would then be represented as a 24-space vector. In this way, the relationship between time and other variables can be analyzed using classical multivariate techniques.

A multivariate approach to identify similar patterns of air pollution is presented in this chapter. A fundamental concept in multivariate analysis is the covariance matrix. For a collection of days having 24 hourly averages, the covariance matrix has 24 rows and 24 columns, and the values in the upper right triangle are equal to those in the lower left triangle. Each diagonal element is the sum of squares for a particular hour divided by one less than the total number of days. The square root of each such covariance is the standard deviation for that particular hour. The off-diagonal values represent covariances between pairs of hours: cross products of the two hours' values, summed over all days and divided by one less than the number of days. Thus, the total variance of the matrix is the sum of its diagonal elements, and the intercorrelation between each hour and all other hours is represented within the matrix. Often the mean for each hour is first subtracted from the raw data, resulting in a covariance matrix about the mean.

By mathematically rearranging the covariance matrix, a concise representation of each original day can be computed using only a few new variables rather than the original 24. This powerful technique, discussed in detail in Morrison (25) and in Harmon (17), is known as factor analysis.
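The covariance matrix and the component decomposition it enables can be sketched with numpy. This is an illustrative sketch on synthetic data (the day count, the two underlying waveforms, and the factor structure are invented for the example), and it omits the dissertation's additional rescaling of factor scores to unit variance.

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic data: 40 "days" x 24 hourly averages built from two waveforms
hours = np.arange(24)
shape1 = np.sin(np.pi * hours / 23)                # midday bulge
shape2 = np.cos(2 * np.pi * hours / 24)            # morning/evening contrast
days = (60 + 30 * rng.normal(size=(40, 1)) * shape1
           + 10 * rng.normal(size=(40, 1)) * shape2)

mean_day = days.mean(axis=0)                       # the "average day"
centered = days - mean_day
cov = centered.T @ centered / (len(days) - 1)      # 24 x 24 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)             # eigh: cov is symmetric
order = np.argsort(eigvals)[::-1]                  # rank components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals[:2].sum() / eigvals.sum()      # variance captured by 2 factors
scores = centered @ eigvecs[:, :2]                 # factor scores for each day
reconstructed = mean_day + scores @ eigvecs[:, :2].T
```

Because the synthetic days really are built from two waveforms, two components recover essentially all of the variance, and adding the scored components back to the mean day reproduces the original observations, exactly the reconstruction described in the text.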
The first step, called principal component analysis, finds new variables made up of proportionate weights of each of the original variables. The contributions of each hour are assigned in a way which accounts for as much of the difference between individual days and the "average day" as possible. The second component, or weighting function, is constructed perpendicular to the first component in a way which accounts for as much of the remaining variance as possible. Subsequent components can be constructed which are perpendicular to the previous components until, finally, all of the variation is accounted for. The analyst can then select the minimum number of components to recover any desired percentage of the original information. Factor scores can then be computed for each day. These factor scores determine the amount of each component to be added to the average day (mean vector) to reconstruct the original hourly observations. The factor weights are added to the mean vector because the covariance matrix, from which the eigenvectors were derived, was calculated about the 24 hourly mean values.

The object is to reduce the dimensions of the problem by projecting the original information onto a set of orthogonal functions. These orthogonal functions (eigenvectors) are ranked in order of their relative importance. Thus, the original data can be represented by a few independent variables while retaining most of the information. An application of principal components to air pollution appears in Peterson (32), and a particularly readable description of principal component analysis can be found in Stidd (44). This procedure can be succinctly summarized in the shorthand of matrix algebra.
A linear model of the form

    [X] = [F][E]^T + M + [e]                                          3.1

is to be constructed where:

    [X] = estimate matrix reconstructing the original data
    [F] = factor score matrix representing each day
    [E] = factor loading matrix of weighting functions for each independent factor
    T   = transpose operator
    [e] = error matrix, or difference between observed and predicted values; and
    M   = mean vector, or average day.

The first step is to find the eigenvalues (see Appendix II) of the singular system

    det [[cov] - λ[I]] = 0                                            3.2

where:

    det [*] = determinant of matrix [*]
    [cov]   = normalized covariance matrix
    [I]     = identity matrix; and
    λ       = eigenvalues (24 in this application).

Note that the sum of the eigenvalues over all 24 variables equals the total variance in the covariance matrix. Only the first few eigenvalues are retained so that their sum represents the desired percentage of this total variance. Eigenvectors are computed for each eigenvalue retained so that

    [cov] Ei = λi Ei                                                  3.3

The factor loading matrix [E] has a 24-valued weighting function in each column of the matrix. These are the independent factors described above. Factor scores indicating the contribution of each factor required to reconstruct each day are computed from

    [F] = [E]^T [cov]^-1 [X]                                          3.4

where:

    [F] = factor scores, a column vector for each day; and
    [X] = observed concentration minus the mean for each hour.

These factor scores have zero mean and unit variance computed over all days. The example which completes this chapter illustrates these computations.

Pattern recognition techniques for multivariate systems are known as cluster analysis. An excellent introduction to this subject is Duran and Odell (11). Specific details of various schemes may be found by pursuing their extensive reference list. There are basically two categories of cluster methods. Divisive schemes successively partition cases into smaller clusters; they begin with all data in one group.
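Equations 3.2 through 3.4 amount to a standard eigendecomposition of the covariance matrix; the score formula reduces to projecting each day's deviation-from-mean vector onto the unit eigenvectors and dividing by the square root of the associated eigenvalue. A minimal sketch in modern notation (function and variable names are mine):

```python
import numpy as np

def principal_factors(cov, n_keep):
    """Eigenvalues and eigenvectors of a covariance matrix (Eqs. 3.2-3.3),
    ranked largest first, with only the first n_keep retained."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix: real roots
    order = np.argsort(eigvals)[::-1]        # rank by relative importance
    return eigvals[order][:n_keep], eigvecs[:, order][:, :n_keep]

def factor_scores(deviations, eigvals, eigvecs):
    """Unit-variance factor scores for each day (Eq. 3.4): project each
    day's deviations onto each retained component and scale so the scores
    have zero mean and unit variance over all days."""
    return (deviations @ eigvecs) / np.sqrt(eigvals)
```

When all components are retained, adding the score-weighted components back to the mean vector reproduces the original data exactly, which is the content of Equation 3.1 with [e] = 0.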
Agglomerative methods pool subsets of cases, beginning with each case in a group by itself. The cluster analysis scheme used in this work is a hierarchical (agglomerative) method due to Ward (47). The cluster concept implies an optimality criterion which defines the manner in which groups should be combined. This objective function can be stated in terms of the homogeneity within a cluster and the disparity between two clusters. These measures depend on the distance between groups. The euclidean distance between two vectors Fi and Fj is

    d(Fi, Fj) = [sum over k of (Fki - Fkj)^2]^1/2 = [(Fi - Fj)^T (Fi - Fj)]^1/2      3.5

Note that Fs are employed, signifying distance between vectors of factor scores. This real-valued function is a metric because

    d(Fi, Fj) ≥ 0 for all Fi and Fj
    d(Fi, Fj) = 0 if and only if Fi = Fj
    d(Fi, Fj) = d(Fj, Fi)
    d(Fi, Fj) ≤ d(Fi, Fk) + d(Fk, Fj).

The statistical distance between two clusters can similarly be defined. For [F] = F1,...,Fn1 and [Y] = Y1,...,Yn2, where Fi and Yi are f-variable members of two clusters, the intercluster distance is

    d = [n1 n2 / (n1 + n2)] (F̄ - Ȳ)^T (F̄ - Ȳ)                                       3.6

where F̄ = (sum of Fi)/n1 and Ȳ = (sum of Yi)/n2 are the cluster mean vectors, and n1 and n2 are the number of members in groups [F] and [Y]. This intercluster distance is the increase in the within-group sum of squares (WGSS) when cluster [F] is combined with [Y] to form a new group. Ward's method calculates this distance for all possible combinations of groups and combines the two groups with the smallest distance. Thus, the objective function minimizes the WGSS. This technique favors agglomeration into small, closely related groups and is particularly appropriate in this application.

Euclidean distance implies orthogonal geometry. This occurs if each variable is independent of all the other variables. When the variables are highly correlated, a generalized euclidean distance must be used. The Mahalanobis (21) metric,

    dm(Xi, Xj) = (Xi - Xj)^T [W]^-1 (Xi - Xj)                                        3.7

accounts for such interaction by including the within-scatter matrix [W].
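Ward's criterion in Equation 3.6, the increase in the within-group sum of squares when two clusters are merged, can be checked directly against that definition. A minimal sketch (not Ward's full algorithm, and the function name is mine):

```python
import numpy as np

def ward_distance(F, Y):
    """Increase in within-group sum of squares when clusters F and Y are
    merged (Eq. 3.6): n1*n2/(n1+n2) times the squared euclidean distance
    between the two cluster mean vectors."""
    F, Y = np.atleast_2d(F), np.atleast_2d(Y)
    n1, n2 = len(F), len(Y)
    diff = F.mean(axis=0) - Y.mean(axis=0)
    return n1 * n2 / (n1 + n2) * (diff @ diff)
```

The identity claimed in the text can be verified numerically: this quantity equals the WGSS of the pooled cluster minus the WGSS of each member cluster.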
The [W] is the covariance matrix computed for all vectors contained in the cluster. Mahalanobis distance has many advantages, including the fact that certain objective functions become equivalent. It is also invariant under any nonsingular linear transformation, thus eliminating scaling problems. The properties of the Mahalanobis metric, coupled with a theorem in Bryan (4), lead to an elegant method of clustering utilizing the euclidean metric. Bryan (4) shows that Mahalanobis distance is proportional to euclidean distance when the original vectors are transformed according to [E][X], provided

    [E][cov][E]^T = [I],                                              3.8

where [E] is the modal matrix described in Appendix II. This is exactly the problem solved in the principal component analysis. We now require, in addition, that the eigenvectors be rotated so that each has equal variance. An equivalent means of accomplishing this result in the transformed matrix [F] is to scale the factor scores by the square root of the associated eigenvalues. Standard factor scores have unit variance, and the sum of squares of each eigenvector is equal to the associated eigenvalue. Proper scaling results in factor scores which retain the variance in the original covariance matrix. Thus, the original geometry is restored, imparting a larger range to factor scores associated with the most important principal components. In this way, the relative impact of each component is included in the factor score matrix.

When two groups are combined, the degree of scatter within the group and the distance between the new group and other groups are of interest. For properly scaled orthogonal geometries as described above, these two measures are the trace of the scatter matrix and the euclidean metric. The scatter matrix is computed from

    [Sf] = sum from i=1 to n of (Fi - F̄)(Fi - F̄)^T                    3.9

where:

    [Sf] = scatter matrix
    n    = number of members within the group
    Fi   = a vector of factor scores representing a single case
    F̄    = the mean factor score vector of all members of the group combined.
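The scaling described above, and the trace of the scatter matrix in Equation 3.9, can each be sketched in a line or two (illustrative names; scores are assumed to be arranged as a cases-by-factors array):

```python
import numpy as np

def mahalanobis_scores(unit_scores, eigvals):
    """Scale unit-variance factor scores by the square root of each
    eigenvalue, so the scores retain the original variance and euclidean
    distance between them measures Mahalanobis distance."""
    return np.asarray(unit_scores) * np.sqrt(eigvals)

def scatter_trace(F):
    """Trace of the scatter matrix (Eq. 3.9): total variance of the group
    members about the group mean vector."""
    F = np.atleast_2d(F)
    return ((F - F.mean(axis=0)) ** 2).sum()
```

After scaling, the variance of each column of scores equals the associated eigenvalue, which is exactly the property exploited in the ten-day example below.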
The trace of this matrix is the sum of the diagonal elements and represents the total variance of the group members about the group mean vector. This mean vector represents the new group in computing the distance to each other group and conceptually is the hypercentroid of the multidimensional sphere which includes all of the group members.

The cluster scheme used in this work finds the distance between all groups; combines the two closest groups; computes the trace of the new scatter matrix; and then computes the distance between the new group and the other groups. This process is continued until all of the original days have been combined into a single group. To construct the dendrogram, the total scatter at each cluster step is computed by adding the scatter for the new group and subtracting the scatter of the two groups which were combined to form this new group. Initially, the total scatter is zero when each day stands alone. When two days are combined, the total scatter becomes the previous scatter plus the scatter computed on combining these two days. This recognizes the fact that the scatter between a day and itself is zero. The relative scatter at each step is computed by dividing the total scatter accumulated at that step by the total scatter when all of the cases have been combined to form one group.

All of the original cases are then arranged in cluster order so that successive groups which have the smallest distance between them appear next to one another. This procedure begins with the largest groups and systematically arranges the members of each subgroup according to their similarity. The dendrogram displays the membership of each group, from individual cases to a single composite group, with the relative scatter represented above each group. The reader is referred to Shannon (39) and Shannon and Brezonik (40) for an environmental application of cluster analysis.
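The scheme just described, which merges the closest pair, accumulates scatter, and repeats, can be sketched as a naive agglomeration loop. This is an illustration of the idea only, with no attempt at the efficiency of production clustering routines, and all names are mine:

```python
import numpy as np

def ward_cluster(points):
    """Naive hierarchical (agglomerative) clustering by Ward's method.

    points: (n_cases, n_features) array of (scaled) factor scores.
    Returns one entry per merge step: the sorted member list of the new
    group and the relative scatter accumulated after that merge, as
    plotted above each group on the dendrogram.
    """
    clusters = [[i] for i in range(len(points))]
    cumulative, steps = 0.0, []

    def wgss(members):
        G = points[members]
        return ((G - G.mean(axis=0)) ** 2).sum()

    while len(clusters) > 1:
        # find the pair whose merger least increases the within-group SS
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                inc = (wgss(clusters[i] + clusters[j])
                       - wgss(clusters[i]) - wgss(clusters[j]))
                if best is None or inc < best[0]:
                    best = (inc, i, j)
        inc, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        cumulative += inc          # add new-group scatter, members cancel
        steps.append((sorted(merged), cumulative))

    total = cumulative             # scatter when all cases form one group
    return [(members, cum / total) for members, cum in steps]
```

Because the per-step increments telescope, the cumulative scatter after the final merge equals the total sum of squares of all cases about the grand mean, so the last relative scatter is always 1.0, matching the top of the dendrogram.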
An example using ten days from one sampling station is now presented to illustrate the use of principal components and cluster analysis to characterize and group days which are most similar. Table 3.1 displays the reported ozone concentration for the ten sample days, selected specifically to illustrate the cluster example. Start hour zero represents the one-hour average from midnight until 1:00 AM, and this convention applies to all 24 hours. The example was constructed to show a clear separation between three basic types of days.

First, a covariance matrix about the mean was computed. Symbolically,

    X̄h = (1/10) sum from d=1 to 10 of Xdh,  for h from 0 to 23        3.10

    σ²(h,h') = (1/9) sum from d=1 to 10 of (Xdh - X̄h)(Xdh' - X̄h')     3.11

    σh² = (1/9) sum from d=1 to 10 of (Xdh - X̄h)²                     3.12

where:

    X̄h       = mean for each hour averaged over ten days
    Xdh      = observed concentration for each hour on each day
    σ²(h,h') = covariance between any two hours averaged over ten days
    σh²      = variance for hour h.

Note that σ²(0,0) is the variance for starting hour zero about the mean. The square root of this value is the standard deviation of hour zero. In Table 3.1, we find the mean for hour zero was 32 µg/m³ and the standard deviation was 30 µg/m³. Thus, the variance contributed by hour zero in this example is (30)² = 900 (µg/m³)². The total variance of the covariance matrix is the sum of the variance contributed by each of the 24 hours. Symbolically,

    σT² = sum over all 24 hours of σh²                                3.13

where σT² is the total variance of the matrix (the trace).
[Table 3.1: Ozone Concentration for Ten Example Days With Mean and Standard Deviation by Hour]

Summing the squared standard deviations in Table 3.1 yields 36351 (µg/m³)², which is approximately equal to the full-precision variance of 36569 (µg/m³)² computed for the ten example days. Off-diagonal elements, such as the covariance between hours zero and one, are computed using Equation 3.11, which resulted in 659 (µg/m³)² in this example.

A real symmetric covariance matrix having 24 rows and 24 columns also has 24 roots which satisfy its characteristic equation (see Appendix II). These roots, called eigenvalues, satisfy Equation 3.2. For the example in Table 3.1, the three largest roots were found to be 27390, 7141 and 810, as in Dixon (9, p. 255). Since the total variance was 36569, these roots represent 74.9%, 19.5% and 2.2% of the variance, respectively. Collectively, they account for 96.6% of the differences between each of the ten days and the average day. Applying Equation 3.3 and solving for E results in the first principal component, shown in Column No. 1 of Figure 3.1. This vector of 24 values is plotted opposite the number 75 and above the mean vector in Figure 3.1. The variance of these 24 values about the origin is 27390, which equals the magnitude of its associated latent root. Similarly, eigenvectors for the second and third roots can be computed and are shown in Figure 3.1.
Each of the ten days can now be approximated by adding or subtracting various amounts of each of these three components to the mean vector. These various amounts are the factor scores computed from Equation 3.4. In Table 3.2, these three values for each day in the example appear in the three columns under unit variance scores.

[Figure 3.1: Mean and first three principal components of the diurnal ozone concentration for the ten day example]

Choosing April 28, starting hour zero is reconstructed to be 32 + (1.44)(15.1) + (1.14)(23.2) + (0.21)(8.8) = 82 µg/m³. Hour 12 is represented by 110 + (1.44)(50.9) + (1.14)(-6.7) + (0.21)(-7.3) = 174 µg/m³. Similarly, all of the original data can be reconstructed to satisfy the model given in Equation 3.1. The principal components also indicate the correlation between hours found in these ten days. Looking at Figure 3.1, we see that the contribution from hours nine through thirteen is approximately constant for both the first and second components, with the mean accounting for differences during this interval. Similarly, hours zero through three contribute approximately the same amount on any given day, and the period from hour 20 through midnight is important in defining the second component.

Factor scores are independent of one another since they represent the proportions of independent components which make up each day. Therefore, they can be represented geometrically along perpendicular axes. The unit variance scores weight each factor equally. In the ten day example, factor one actually represents 75% of the difference while the second factor represents only 20%. In order to achieve the correct grouping when these scores are clustered, the total variance in each component needs to be transferred to the factor scores.
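The reconstruction arithmetic for April 28 can be checked directly. Note one caveat: the scanned copy of this dissertation lost the signs of some component entries, so the negative values at hour 12 below are an assumption chosen so that the arithmetic reproduces the totals quoted in the text.

```python
import numpy as np

# Mean and the three retained component values at start hours 0 and 12,
# taken from the text (signs at hour 12 assumed, as noted above).
mean = np.array([32.0, 110.0])
components = np.array([[15.1, 23.2, 8.8],     # hour 0
                       [50.9, -6.7, -7.3]])   # hour 12
scores = np.array([1.44, 1.14, 0.21])         # April 28 factor scores

reconstructed = mean + components @ scores    # Eq. 3.1 applied to two hours
```

Rounding the two entries of `reconstructed` gives the 82 and 174 µg/m³ quoted for hours 0 and 12.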
As discussed above, the easiest way to do this is to multiply each unit variance score by the square root of the applicable eigenvalue (165.5, 84.5 and 28.5, respectively). This scaling produces zero-mean factor scores which have variance equal to the original contribution of each component. Thus, the sum of squares of factor scores computed over all ten days equals the proportionate variance, and the associated eigenvectors each have unit variance. These factor scores are shown in Table 3.2 under Mahalanobis scaled scores. They are called Mahalanobis scores because they yield a measure of Mahalanobis distance using the euclidean metric.

To illustrate the clustering method, both unit variance and Mahalanobis scaled scores have been grouped and are displayed in Figure 3.2. The numerical example now presented applies to the Mahalanobis clustering, since this provides a distinct grouping. The computation is the same for unit variance scores and can be directly extended to as many dimensions as desired. The first step is to compute the euclidean distance between all groups. For ten cases, this is a collection of 45 numbers. The smallest of these represents the distance between December 7 and November 16 and was computed to be

    [(170 - 176)² + (15 - 17)² + (38.6 - 17.3)²]^1/2 = 22.22.

The largest number was 267.75, representing the distance between April 28 and May 2. Referring to Table 3.2, this value was computed from

    [(239 - 31)² + (97 - (-65))² + (6 - (-40.7))²]^1/2 = 267.75.

The first step in clustering was therefore to combine December 7 and November 16. Since the mean factor scores for these two days are F1 = 173, F2 = 16 and F3 = 28, the trace of the scatter matrix at this step is computed from

    (170 - 173)² + (176 - 173)² + (15 - 16)² + (17 - 16)² + (38.6 - 28)² + (17.3 - 28)² = 245.
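The smallest inter-case distance can be verified from the scaled scores for December 7 and November 16 as quoted in the text:

```python
import numpy as np

dec7 = np.array([170.0, 15.0, 38.6])    # Mahalanobis scaled scores, Dec 7
nov16 = np.array([176.0, 17.0, 17.3])   # Mahalanobis scaled scores, Nov 16

# Euclidean metric of Eq. 3.5 applied to the factor score vectors
distance = np.sqrt(((dec7 - nov16) ** 2).sum())
```

Rounded to two places, `distance` reproduces the 22.22 quoted above.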
TABLE 3.2 — Factor Scores for Ten Day Example

[Unit variance scores (F1, F2, F3) and Mahalanobis scaled scores (F1, F2, F3) for the ten example days (4/28, 4/20, 12/7, 11/16, 11/30, 12/14, 5/28, 6/12, 8/19, 5/2), with the variance of each column in the last row: 1.0 for each unit variance column and 27390, 7141 and 810 for the Mahalanobis columns.]

FIGURE 3.2 — Dendrograms for Ten Day Example

[Unit variance and Mahalanobis clusters, plotted as relative scatter (0.0 to 1.0) against day of year, 1974; asterisks mark the first and second groupings.]

Since this is the first step, the total cumulative scatter is zero plus 245. Only the distance from the new group to other groups needs to be recomputed. On the second iteration, the smallest overall distance was found to be 22.24, between April 20 and April 28. When these two days were combined, the trace of the scatter matrix was determined to be 247. Thus, the total cumulative scatter after completing the second step was 245 + 247 = 492. Similarly, pairs of days were combined for steps three and four.

Step six illustrates the procedure when two groups, both containing more than one day, are combined. The total cumulative scatter after completion of step five was 3389. When December 14 and November 30 were combined, the total variance of the scatter matrix for these two days was 585. The trace of the scatter matrix for the four days combined was calculated as the sum of 12 squared values and was equal to 12736. Thus, the total cumulative scatter on completion of step six was 15297. We see this under step six in Table 3.3. On completion of clustering, the cumulative scatter was 318069, as shown in Table 3.3.
This is derived from the total variance of the Mahalanobis factor score matrix shown in Table 3.2. Notice that the mean of each set of ten factor scores in Table 3.2 is zero. Thus, the scatter is equal to the sum of squares. Ten zero-mean values having a variance of 27390 (Mahalanobis F1) have a sum of squares equal to 246510. This is apparent from Equation 3.12, since nine times 27390 equals 246510. Similarly, the sums of squares for F2 and F3 are 64269 and 7290. The total sum of squares for the Mahalanobis scores in Table 3.2 is therefore 246510 + 64269 + 7290 = 318069. This represents 96.6% of the total variance shown on Page 28. Thus, Mahalanobis scores preserve the original variance. This is not true of the unit variance scores, which result in incorrect scaling and clustering errors.

Table 3.3 — Ten Day Cluster Example Scatter Summary

    Step  Cumulative     Previous    New        Member    Member    % Relative
    No.   Scatter      = Scatter   + Group    - A       - B        Scatter
    1        243      =      0    +    243   -      0  -      0      0.08
    2        495      =    243    +    252   -      0  -      0      0.16
    3       1080      =    495    +    585   -      0  -      0      0.34
    4       1768      =   1080    +    688   -      0  -      0      0.56
    5       3389      =   1768    +   2309   -      0  -    688      1.07
    6      15297      =   3389    +  12736   -    585  -    243      4.81
    7      19568      =  15297    +   6580   -      0  -   2309      6.15
    8     150778      =  19568    + 150526   -   6580  -  12736     47.40
    9     318069      = 150778    + 318069   - 150526  -    252    100.00

    Note: Member A and Member B are the two groups being combined to form the New Group.

When all clustering steps have been completed, the relative scatter is calculated by dividing each cumulative scatter by the total scatter in the factor matrix. This is also equivalent to the total variance. A dendrogram can be constructed from the relative scatter by arranging the days so that nearest neighbors are adjacent. The nine steps and associated relative scatter are shown in Table 3.3. Clustering for the unit variance scores is done analogously, with the total sum of squares on the ninth step equal to 27. The dendrogram is shown in Figure 3.2.
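The bookkeeping in Table 3.3 follows a single update rule: cumulative scatter equals the previous scatter, plus the scatter of the new group, minus the scatter of each member group being combined. Step six illustrates it numerically:

```python
# Step six of Table 3.3: groups with scatter 585 and 243 are merged into a
# new four-day group whose own scatter is 12736.
previous_scatter = 3389
new_group, member_a, member_b = 12736, 585, 243

cumulative = previous_scatter + new_group - member_a - member_b

# Relative scatter: cumulative divided by the total (318069) in percent.
relative_percent = round(100 * cumulative / 318069, 2)
```

This reproduces the cumulative scatter of 15297 and relative scatter of 4.81% tabulated for step six.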
Although the groupings for both unit variance and Mahalanobis scores are the same in this example, significant differences in measures of closeness are clear. These differences can cause grouping errors, especially when differences between cases are not as pronounced as those given in this example. In practice, tabulating the scatter values is not necessary, since the dendrogram graphically shows how to group each set of data.

Figure 3.3 is a visual aid to understanding cluster analysis. The factor scores representing each of the ten days have been plotted in three dimensions. Clustering can be accomplished in multiple dimensions, but more than three dimensions becomes difficult to visualize. Clearly, the April dates are set apart from the other eight days. The November and December days were grouped and set apart from the May, June and August days. The relative scatter and closeness between groups are clearly illustrated in Figures 3.2 and 3.3 and in Table 3.3.

[Figure 3.3: Three Dimensional Display of Factor Scores for Ten Day Example — unit variance scores and Mahalanobis scores plotted on the F1, F2 and F3 axes]

Lognormal plots (see Appendix I, p. 105) of these ten days are in Figure 3.4. The same grouping illustrated above can be postulated. However, grouping on geometric mean or standard geometric deviation does not necessarily produce the same clusters derived from orthogonal components.

[Figure 3.4: Lognormal plots of ozone concentration for the ten example days]

In this chapter I have briefly described the necessary mathematical concepts and presented an example to illustrate how information can be condensed and utilized to identify similarities and differences among days.
This example is extended in Chapter IV and applied to ozone data reported in south Florida. Before proceeding to this analysis, a discussion of empirical modeling is next presented, with a flow chart illustrating the steps in the procedure.

A posteriori models require extensive measurement of ambient conditions over time. The sampling network provides these data. Ideally, classical modeling techniques would be used to determine the optimal number, location and sampling frequency for the network. Performance is ordinarily evaluated after considerable data have been collected. These data should be used to confirm or investigate preliminary classical model results and to suggest desirable changes in the sampling procedures. The ultimate application for stochastic models is real-time analysis utilizing microprocessors to predict future air quality based on past and present conditions. These techniques could also be used to initiate control strategies and signal a need for more intensive sampling during critical periods. Stochastic models need to be developed before real-time control strategies can be formulated.

To build empirical simulation models, one must observe enough repeated experiments to develop a reliable relationship between inputs and outputs. When conditions change often and in random fashion, as in the atmosphere, these experiments need to be grouped for similarity in order to construct meaningful models. If all conditions were used indiscriminately, an "average model" would be derived which actually would not represent any of the observed conditions.

Figure 3.5 illustrates the steps necessary to analyze data in preparation for stochastic modeling. The conventions used are the same as those for critical path analysis. Each arrow represents a task or operation to be completed. The numbered points mark the initiation and completion of each task. Successive operations cannot begin until all of the preceding tasks have been completed.
The first step, proceeding from point zero to point 1, is collecting the necessary data. It includes all of the laboratory, field and administrative (auditing, etc.) quality assurance procedures necessary to ensure valid data. Selecting data of interest concerns the frequency of elevated pollutant levels, duration of conditions, accuracy of data, availability of support data, sampling network density and other considerations. For multivariate problems, covariance matrices can be computed after screening the data to exclude anomalies. At the same time, preliminary summaries, such as tabulating all days which exceed standards, can be performed. The covariance matrices are used to estimate missing observations and to compute principal components. After these tasks are completed, factor scores for each day can be found to represent the original data in condensed form. The principal component results are used to further characterize the data as soon as they become available.

[Figure 3.5: Flow chart of the data analysis steps preceding stochastic modeling]

Thus, the dashed arrow between points four and two indicates that these results will be used to proceed from point two to point six (preliminary characterization), but that this task can begin before principal components and missing observations have been computed. Using the properly scaled factor scores, groupings at each spatial location can be computed which characterize the waveforms as a function of time, independent of space. Using the preliminary characteristic results, these clusters can then be summarized. The number of different types of days and the frequency of occurrence, as well as the relative similarity, can be determined from the dendrograms. Once these groups have been defined, their characteristics can be further analyzed, as shown in Figure 3.5 at point seven. Appropriate distributions can be fit for each group (see Appendix I).
These indicate the probability of observing high concentrations in each group. However, they do not provide the functional relationships in time that stochastic models represent. Principal components for each subgroup can be computed and plotted along with the group-average waveform. This visual display of group characteristics shows the kinds of models needed and the periods during each day which are of particular interest. The total variance for each group indicates the magnitude of differences between members of each group and is an indicator of the ease with which a stochastic model could be constructed.

Average factor scores for each group at each site can be computed and clustered to locate similar conditions in time which have occurred at different spatial locations. Clustering the average factor scores shows spatial as well as temporal similarity. Utilizing the characteristics summarized to this point, major group types can be identified. In this study, three types were identified and named nonepisode, episode and super episode conditions. The subgroups for each major type were analyzed and displayed in cluster order to illustrate differences between day types. The kind of day observed at all sampling sites was tabulated for every day which had a super episode condition. This is indicated as date comparison in Figure 3.5. A choice of the most interesting, potentially modelable conditions can be derived from this analysis.

The next step in model development is identifying influence parameters. By contrasting conditions on super episode days against nonepisode or episode days, meteorological or air quality parameters which seem to affect the system can be found. Having selected interesting influence data, formal procedures such as cluster analysis or discriminant analysis can be used to determine the relative importance of each factor. The conditions which are to be modeled and the input data can then be selected.
If sufficient influence data are available and these conditions have been repeated often enough, models can be formulated. These models can be constructed from the raw data. Alternatively, weighted average data reconstructed from the factor scores could be used to smooth random variations before models are constructed. The inputs and system response which characterize various kinds of days can then be used to evaluate classical models and air pollution potential.

CHAPTER IV
AN APPLICATION TO URBAN OZONE DATA

Photochemical oxidant levels are a pervasive problem throughout the continental United States. Literally hundreds of journal articles and millions of dollars have been directed toward this problem during the past few years. Initially, hydrocarbon emission control was suggested as an answer to the urban oxidant problem. Despite these emission controls, ozone concentrations remain very high. Even rural levels frequently exceed established standards during spring and summer months. Reynolds et al. (37) and Roth et al. (38) have developed a comprehensive emission inventory and a sophisticated oxidant model based on numerical integration. Under certain conditions in the Los Angeles basin, the model performs well. This model is presently being modified for application in Tampa, Florida, which historically has had oxidant episodes lasting several days with levels reaching twice the standard. Further study is under way nationally, with a particularly comprehensive program in Houston, Texas, sponsored by the Houston Chamber of Commerce under Feldcamp (13).

Oxidants are a secondary pollutant, since they are not emitted directly from sources. Precursor compounds such as reactive hydrocarbons and nitrogen oxides interact in complicated chemical reactions which are not clearly understood. Intensive solar insolation appears to be a necessary ingredient, or at least a contributing factor. A recent study reported by Spicer et al.
(43) concluded that regional scale transport and high pressure systems must be considered. Spicer et al. (43) also concluded that oxidant sources were ground based, since the ozone concentration decreased with altitude.

The Hillsborough County Environmental Protection Commission has routinely collected ozone data using chemiluminescent monitors since 1973. A 90% confidence interval for the three-stage calibration procedure was typically ±10 ppb, as shown in Appendix I. During 1974, there were 82 days and a total of 589 recorded station-hours exceeding 159 µg/m³ in the data set. These data are summarized in Table 4.1. A brief review of plotted diurnal waveforms revealed a bizarre collection of cases which did not follow the typical curves given in USDHEW AP-63 (46). Since oxidants are a national concern and since an extensive set of reliable data was readily available, the Tampa area ozone data were selected for analysis.

The HCEPC maintained three ozone monitors during the study period. Station 63, located on Davis Islands at the Coast Guard basin, is an urban site which historically has reported some of Tampa's poorest air quality. A second urban ozone sampler, designated Station 94, is located at the intersection of Dale Mabry Highway and Cypress Avenue. It is less than 210 yards south of the heavily traveled Interstate Highway 4.
The station at the fire department on Gunn Highway was numbered 109 and is approximately five miles northwest of the urban center. It is a semi-urban site on the periphery between urban Tampa and the agricultural areas north of town. Consequently, at this site both urban and rural effects can be observed, depending on meteorology. The remaining sampling site, designated STP, was located in downtown Saint Petersburg.

[Table 4.1: Contraventions of the Oxidant Standard Observed in Hillsborough and Pinellas Counties, Florida in 1974. For each date and sampling station (63, 94, 109, STP), the entry gives the total hours that day exceeded 159 µg/m³, classified as SE = super episode, E = episode or NE = nonepisode; O = omitted; a blank indicates no hours exceeding 159 µg/m³. The sampling stations are described on Page 44.]

The first step was to obtain a copy of the State of Florida Air Quality Data Handling System (AQDHS) data tape. All ozone data from Hillsborough and Pinellas counties, through September, 1975, were stripped from this tape to direct access files.
These data were then analyzed to locate erroneous values and days with more than 50% missing values. A number of errors were found, checked with the local agency and corrected. Days with too little data were removed. The Pearson Correlation program described in Nie et al. (28) was then used to compute a correlation matrix for all remaining data at each station. This routine was used because it accommodates missing observations and because it employs the double precision arithmetic required for large numbers of cases. A covariance matrix was formed from the correlation matrix using the standard deviation for each variable (hour of day) from the Pearson Correlation program. Principal components for each covariance matrix were computed using the BMD08M routine from Dixon (9). This routine was selected because a very flexible input specification is provided. Figure 4.1 is a plot of the mean and first four principal components of the ozone data at each station, based on all data available. Principal components are discussed in Chapter III and in Appendix II.

[Figure 4.1: Mean and first four principal components of the normalized ozone concentration at each station]

If all of the principal components were retained, 100% of the total variance would remain and only the geometric representation of the data would be changed. However, when the original variables are intercorrelated, most of the information can be represented with relatively few factors. The criterion used in this study was retention of 90% of the total variance. In this application, greater than 90% of the total variance at each station was accounted for by six components. Thus, each day in the original data set was reduced from 24 hourly observations to a set of six factor scores while retaining 90% of the information contained in the 24 values. This achieves computational efficiency and orthogonal structure, as described and illustrated in the example of Chapter III.
At this point, the results of the principal component analysis can best be presented by example. Scanning Table 4.1, April 19 looks interesting, with all sites exceeding 159 µg/m³ for at least 5 hours. The mean and principal components for Site 109, which are plotted in Figure 4.1, are presented in Table 4.2. The total variance for the 551 days at this site was 25561 (µg/m³)². This number is shown at the top of Figure 4.1 above Site 109. The first six eigenvalues for this site, arranged from largest to smallest, are 13614, 4882, 2616, 1240, 880, and 525. Dividing each of the eigenvalues by the total variance, we see that component one represents 53% of the variance. Similarly, components two, three, and four represent 19, 10, and 5 percent of the variance, respectively. These numbers also appear in Figure 4.1 to the right of the component which they represent. The mean for each hour in the day is shown at the bottom of this figure.

[Table 4.2: Mean and six principal components, Station 109; not legible in the digitized copy.]

Standard factor scores were computed using Equation 3.4. For April 19 at Site 109, these were 3.862, 1.101, 1.476, 0.883, 0.101, and 1.418. These factor scores have a zero mean and unit variance when computed over all days represented. They represent the contribution of each corresponding orthogonal component which must be added to the Site 109 mean vector to reconstruct the hourly values for April 19.
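Equation 3.4 is defined in Chapter III and not repeated here; a standard factor score consistent with the stated zero-mean, unit-variance property is the projection of the mean-removed day onto an eigenvector, divided by the square root of that component's eigenvalue. A sketch on synthetic data (the data and the six-component choice are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic station record: rows are days, columns the 24 hourly values.
data = rng.normal(60.0, 15.0, size=(300, 24)) + 30.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, 24))

mean_day = data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def standard_scores(day, n_components=6):
    """Project one day (minus the mean day) onto the leading
    eigenvectors and divide each projection by the square root of its
    eigenvalue, giving scores with zero mean and unit variance over
    all days."""
    projection = eigvecs[:, :n_components].T @ (day - mean_day)
    return projection / np.sqrt(eigvals[:n_components])

scores = np.array([standard_scores(day) for day in data])
```

Computed over every day in the record, each score column has zero mean and unit variance, matching the property claimed in the text.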
For example, the first hour in this day can be reconstructed using the top line of Table 4.2 and the six values above. Explicitly,

26.4 + (3.862)(13.4) + (1.101)(17.2) + (1.476)(1.8) + (0.883)(-10.4) + (0.101)(2.2) + (1.418)(8.6) = 103 µg/m³

This value (103 µg/m³) appears in Table 4.3 as the first entry under reconstructed values. Similarly, the second hour of April 19 at Site 109 was reconstructed as 100 µg/m³ using the same factor scores above and values from the second line in Table 4.2. Table 4.3 summarizes the reconstruction of each of the 24 hours in April 19 at Site 109. The average difference between the observed concentration and the reconstructed concentration was 0.4 µg/m³. Since April 19 was an unusual day with respect to the mean day, reconstruction would be much better for a day selected at random. Nonetheless, reconstruction within 16% of the observed value was achieved for every hour.

[Table 4.3: Reconstruction of April 19, 1974, Station 109; not legible in the digitized copy.]

In summary, most of the ozone information for each station-day can be represented by only six values having orthogonal geometry, rather than the 24 intercorrelated hourly values. All missing observations were reconstructed using a multiple regression routine. Each missing hour was computed based on the distribution over all cases for that specific hour. The weighting function used was the ozone concentration for all hours which were measured. This procedure was necessary, since factor scores cannot be computed unless all of the original variables (hours) are available. The factor scores were computed and scaled as described in Chapter III to achieve orthogonal Mahalanobis structure.
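The hour-by-hour arithmetic above amounts to: reconstructed value = mean-day value plus the sum of each standard score times the corresponding tabulated loading (eigenvector scaled by the square root of its eigenvalue). A sketch on synthetic data; truncating to six components guarantees the reconstruction error can never exceed the day's total deviation from the mean day:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic record with a low-rank "signal" plus noise, so that a few
# components carry most of the information.
signal = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 24))
data = 60.0 + 20.0 * signal + rng.normal(scale=3.0, size=(200, 24))

mean_day = data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 6
# Loadings analogous to the columns of Table 4.2: eigenvectors scaled
# by sqrt(eigenvalue), so a standard score times a loading has
# concentration units.
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])

day = data[0]
scores = loadings.T @ (day - mean_day) / eigvals[:k]  # standard scores

# Mean day plus score-weighted loadings, hour by hour, exactly as in
# the worked arithmetic for the first hour of April 19 above.
reconstructed = mean_day + loadings @ scores
```

Retaining all 24 components would reproduce each day exactly; the six-component version trades a small reconstruction error for the computational economy described in the text.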
This is accomplished by multiplying each of the standard factor scores by the square root of the corresponding eigenvalue. As an illustration, for April 19 at Site 109, the properly scaled factor scores are 450.7, 76.9, 75.5, 31.1, 3.0, and 32.5. These values were computed from 3.862√13614, 1.101√4882, etc. Computed over all cases, these factor scores have zero mean and variance equal to the eigenvalue for each specific factor. Thus, the total variance of the factor score matrix would remain 25561 (µg/m³)² if all 24 factors were retained. Further, the variance of each factor relative to the other factors has been restored. The standard factor scores weight each factor equally, when actually factor one accounts for over half of the total difference of a specific day compared with the mean day. Using this same procedure, factor scores for each of the four sampling sites were computed based on the principal components for that specific site alone. In this way, the simplicity and economy of the Euclidean metric can be used to achieve Mahalanobis clustering. The Mahalanobis metric would also provide this same result, but at considerably greater expense.

Ward's method, utilizing the Euclidean metric, was applied to all available days for 1974. A six-space clustering was constructed using the factor scores as variables. The dendrograms in Figures 4.2 through 4.5 detail the macrostructure of the separation between cases. The number of groups was determined subjectively by computing average factor scores for the first four components. The within-group scatter distance was used to make an initial separation into approximately 35 clusters. The difference between the average factor scores for separate as well as combined groupings was then used to achieve a manageable number of subgroups. Each of the three numbers in each block of these dendrograms is the average factor score for all members of that group.
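Two distance identities lie behind this choice of scaling and can be checked numerically: Euclidean distance on the full set of scaled scores equals Euclidean distance on the original variables (the projection onto all eigenvectors is a pure rotation), while Euclidean distance on the standard, unit-variance scores reproduces the Mahalanobis distance. A small check on synthetic three-variable data (the covariance matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
cov_true = np.array([[4.0, 2.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
data = rng.multivariate_normal([0.0, 0.0, 0.0], cov_true, size=500)

mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # all components retained here

centered = data - mean
standard = centered @ eigvecs / np.sqrt(eigvals)  # unit-variance scores
scaled = standard * np.sqrt(eigvals)              # variance = eigenvalue

a, b = 10, 20  # any two cases

# Euclidean distance on the fully retained scaled scores equals
# Euclidean distance on the original variables.
d_scaled = np.linalg.norm(scaled[a] - scaled[b])
d_original = np.linalg.norm(data[a] - data[b])

# Euclidean distance on the standard scores reproduces the
# Mahalanobis distance with respect to the sample covariance.
diff = data[a] - data[b]
d_mahalanobis = np.sqrt(diff @ np.linalg.inv(cov) @ diff)
d_standard = np.linalg.norm(standard[a] - standard[b])
```

The variance of each scaled-score column equals its eigenvalue, which is the "restored relative variance" property the text relies on when clustering with the Euclidean metric.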
The tick marks along the horizontal axis represent cases which recorded at least one hour exceeding 159 µg/m³. April 19 at Station 109 was a member of Group 10913. This notation is used throughout the remainder of the text to signify the station and subgroup corresponding to Figures 4.2 through 4.5. Group 10913 means Station 109, Group 13 and, in this case, represents a set of twelve days. As shown in Figure 4.4, most days in this group exceeded the oxidant standard. The average factor scores for the first three principal components in Group 10913 were 316, 90 and 42. These three numbers appear in the block above Group 13 in Figure 4.4.

[Figures 4.2 through 4.5: Dendrograms of relative scatter for the four sampling stations; not legible in the digitized copy.]

Days which had large positive values for the first principal component generally clustered at the extreme right in Figures 4.2 through 4.5. These days also tended to exceed the oxidant standard. Thus, the first component, which accounts for over 50% of the total variance, also tends to separate high-oxidant days from low-level days. Figures 4.2 through 4.5 display the group structure for each day represented in 1974. All of the days are arranged in cluster order along the abscissa. The dates are not shown, as their number is too large to be conveniently displayed. The subgroup numbers appear sequentially below the horizontal axis.
The rectangle above each group number indicates both the number of days in the group and the relative scatter among members of that group. Returning to Group 10913 in Figure 4.4, the rectangle above Group 13 has relative scatter equal to 48.9% of the total scatter (1.612 × 10⁶ (µg/m³)²) computed for all 269 days at Station 109 for 1974. The width of the rectangle is 12/269, or 4.46%, of all days represented. The horizontal line above Group 13 is therefore 4.46% of the length of the entire horizontal line. Figure 4.6 displays the clustering of the twelve days in Group 10913. The first four steps for this group combined pairs of days, ending when April 11 joined with April 21 at a relative scatter of 8.4%. These were steps numbered 34, 50, 74 and 128 in the 268 sequential steps required to cluster Station 109. At Step 154, the fifth step taken for Group 10913, April 27 and May 13 were combined with April 20 and April 28 to form a new group at a relative scatter of 12.6%. This process continued, and at Step 247, Group 10913 was defined to include 12 specific days. Note that we can easily subdivide this group into finer subgroups, which are more similar to one another, or add additional groups, such as Groups 14 and 15, while increasing the scatter. The procedure is the same as illustrated by example in Chapter III, except that the clustering was done in six-space to recover at least 90% of the original information. Once the members of a particular group are defined, the arithmetic average for each factor can be computed. This is useful to understand how factors on the average account for differences between groups. The first three factors for Group 10913, as stated previously, are 316, 90 and 42 averaged over the 12 days represented. In contrast, Group 10910, representing 69 days, has averaged scores of 126, 14 and 2. As indicated by the tick marks in Figure 4.4, most of the Group 13 days exceeded the ozone standard, while none of the Group 10 days reached 160 µg/m³.
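Ward's method merges, at each step, the two groups whose union produces the smallest increase in within-group scatter (error sum of squares); summing the increases over all steps recovers the total scatter about the grand mean, which is the denominator of the relative-scatter percentages quoted above. A minimal sketch (illustrative two-dimensional points, not the routine used in the study):

```python
import numpy as np

def ward_steps(points):
    """Naive Ward's-method agglomeration: at each step, merge the two
    groups whose union gives the smallest increase in within-group
    scatter (error sum of squares); return the list of increases."""
    groups = [{"n": 1, "mean": p.astype(float)} for p in points]
    increases = []
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                gi, gj = groups[i], groups[j]
                d = gi["mean"] - gj["mean"]
                # Ward's criterion: ESS increase for merging groups i, j.
                inc = gi["n"] * gj["n"] / (gi["n"] + gj["n"]) * (d @ d)
                if best is None or inc < best[0]:
                    best = (inc, i, j)
        inc, i, j = best
        gi, gj = groups[i], groups[j]
        merged = {"n": gi["n"] + gj["n"],
                  "mean": (gi["n"] * gi["mean"] + gj["n"] * gj["mean"]) / (gi["n"] + gj["n"])}
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
        increases.append(inc)
    return increases

# Two tight clusters of "days" in a two-dimensional factor-score space.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
increases = ward_steps(points)
```

The within-cluster merges cost almost nothing, while the final merge across the two clusters is expensive; this is the jump in relative scatter that motivates cutting the dendrogram into subgroups.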
Note that each factor is zero averaged over all 269 days. Distinct separations between the groups displayed in Figures 4.2 through 4.5 are apparent when the group-averaged factor scores are presented. Periods having similar waveforms lasting a few days or longer appear as chronological sequences within a group. Using Group 10913 once again as an example, all except two of the days in this group occurred in April. As shown in Figure 4.6, these April dates were 10, 11, 18, 19, 20, 21, 25, 26, 27 and 28. Seven of these ten days exceeded 159 µg/m³.

[Figure 4.6: Clustering steps for the twelve days in Group 10913 (relative scatter versus day of year, 1974); not legible in the digitized copy.]

The sequence of days from the 18th through the 28th represents a quasi-stationary period at Site 109, with a change in conditions between the 21st and 25th. This period represents a sequential set of days during which each day had a similar waveform. If sufficient influence parameters are available with fine enough temporal resolution, a digital simulation relating observed ozone concentration to input parameters can be developed for this group. Traditional transfer functions, derived empirically, can be used to develop a stochastic model representing conditions on April 18 through April 28, and differences in these influence parameters on April 22, 23 and 24 could provide insight into causes of elevated oxidant levels for Group 10913. In addition to temporal stationarity, spatial homogeneity is highly desirable. If periods having similar waveforms at all spatial locations were defined, mesoscale influence parameters could be used to construct the model. Otherwise, microscale data, collected at each sampling site, would be necessary. Spatiotemporal homogeneity was investigated using the cluster algorithm. A four-space grouping based on the average factor scores for each subgroup at each station appears in Figure 4.7.
The complete tree structure has been diagrammed with the sampling site and subgroup number at the left margin. A total of 66 station groups were analyzed. This dendrogram resolved three basic group types, denoted NE (non-episode), E (episode) and SE (super episode). Group NE contained 15,192 station hours, of which only two occurrences exceeding 159 µg/m³ were found. All twelve months of the year were represented at all four sampling sites. A total of 140 values exceeding 159 µg/m³ were found in the 10,104 station hours represented in Group E. The 2,544 station hours in Group SE included 402 such values. The SE, E, and NE groupings were based on spatial location as well as temporal waveform. In percent of total hours represented, Group SE had 16% over 159 µg/m³, while Groups E and NE had 1.4% and 0.013% respectively.

[Figure 4.7: Dendrogram grouping the station subgroups into types NE, E, and SE, with station and subgroup number at the left margin; not legible in the digitized copy.]

An example of a spatially homogeneous day is April 19, which was an SE day at all four sampling stations. As shown in Figure 4.6, April 19 is also a member of Group 10913. To further investigate spatial homogeneity, a date-matching routine was developed. Table 4.4 was made from a computer output for Group SE. All dates for which at least one site belonged to this group are listed chronologically at the left. Under each of the four stations, the group membership is summarized. M and O are for missing and omitted data respectively. In all, 52 days were represented. Only five of these days (denoted *) did not include at least one station-hour exceeding 159 µg/m³, and virtually all of the extensive occurrences of elevated levels were included in this group. Spatial homogeneity within Group SE is apparent in Table 4.4. On nine days, denoted + in the table, all of the stations represented recorded type SE days.
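The quoted percentages follow directly from the station-hour and exceedance counts given above; as an arithmetic check:

```python
# Station-hour and exceedance counts for the three summary groups, as
# reported in the text; the percentages quoted there follow directly.
counts = {
    "NE": (15192, 2),    # (station hours, hours above 159 ug/m3)
    "E": (10104, 140),
    "SE": (2544, 402),
}
rates = {name: 100.0 * above / hours for name, (hours, above) in counts.items()}
```

This reproduces roughly 16% for Group SE, 1.4% for Group E, and 0.013% for Group NE.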
On these occasions, macroscale phenomena created elevated oxidant levels throughout the region. Presumably, site-specific differences played a small role in oxidant formation, so macroscale influence parameters could be selected to construct simulation models representing these days. Table 4.4 also shows occasions when spatial differences were significant.

TABLE 4.4. Days Represented in Group SE: Day Type at Each Station and Number of Hours Exceeding 159 µg/m³

[Table body not legible in the digitized copy. Each row lists a 1974 date followed by the day type at Stations 63, 94, 109, and STP, with the number of hours exceeding 159 µg/m³ in parentheses.]

Notes: 1) + denotes all sites in Group SE; * denotes no site exceeding 159 µg/m³. 2) SE = super episode, E = episode, NE = non-episode. 3) M = missing day; O = omitted day. 4) Integers following the group type are total hours exceeding 159 µg/m³.
For example, scanning Table 4.4 for type NE days, one finds five occasions where a super episode occurred at at least one site while a low-ozone day occurred at another. July 23 is a striking example, when persistent elevated ozone was reported at all three Tampa locations while the St. Petersburg site remained low. Clearly, there are occasions when site-specific influence data would be required to explain spatial differences. However, most of the days having at least one SE-type station-day also had type SE or type E conditions at the other sampling sites. The degree of spatial homogeneity exhibited on days chosen for modeling affects the degree of spatial resolution required in the influence parameters. Coupling spatial homogeneity with temporal homogeneity provides a means of selecting periods which should be most easily modeled. Intuitively, these would be periods where the entire region experienced similar air quality which lasted a relatively long time. Returning to Table 4.4, a scan for consecutive dates reveals one interval lasting eight days, from April 24 through May 1. Other intervals in Table 4.4 lasting from two to five days can also be found which represent quasi-stationary high-oxidant episodes. Of the nine days previously mentioned which exhibited SE conditions at all sites, only October 19 does not occur during one of these quasi-stationary periods. If sufficient influence parameter data were available, these intervals would be good candidates for simulation modeling. The scope of this dissertation does not include construction of digital models, since the primary objective is demonstrating a method of defining quasi-stationary intervals for which stochastic models could be developed. Consequently, the remainder of this chapter concerns statistical and meteorological descriptions which characterize the subgroups defined through pattern recognition. Lognormal parameters were calculated for each station-day.
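The scan for consecutive dates can be expressed as a small routine. The sample below uses an illustrative subset of the Group SE dates listed in Table 4.4 (late April through May 1, plus the isolated October 19) and recovers the eight-day interval mentioned above:

```python
from datetime import date, timedelta

def consecutive_runs(dates):
    """Split a chronologically sorted list of dates into maximal runs
    of consecutive calendar days, as in the scan of Table 4.4 for
    quasi-stationary episodes."""
    runs, current = [], [dates[0]]
    for d in dates[1:]:
        if d - current[-1] == timedelta(days=1):
            current.append(d)
        else:
            runs.append(current)
            current = [d]
    runs.append(current)
    return runs

# Illustrative subset of the 1974 Group SE dates from Table 4.4.
sample = [date(1974, 4, 24), date(1974, 4, 25), date(1974, 4, 26),
          date(1974, 4, 27), date(1974, 4, 28), date(1974, 4, 29),
          date(1974, 4, 30), date(1974, 5, 1), date(1974, 10, 19)]
runs = consecutive_runs(sample)
longest = max(runs, key=len)
```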
The average geometric mean and geometric standard deviations were also calculated. Group membership and average lognormal parameters appear in Table 4.5, including the number of station-days represented. The overall groups in Table 4.5 refer to group numbers opposite the brackets in Figure 4.7. Membership in each group from 1 through 14 is shown in both Table 4.5 and Figure 4.7. These distributions have been plotted in Figure 4.8. Individual curves can be identified by locating the leftmost line at the lower right-hand corner of each graph. The subgroup order begins with this curve and lists the groups from left to right. Group 1 had slightly greater slope than Group 2, but a very similar geometric mean, near 27 µg/m³. Group E had an overall geometric mean of 49 at slope 2.63. Groups 3, 4 and 5 appear together with an average geometric mean around 38 µg/m³. Their standard deviations ranged from 2.77 to 3.98. Groups 6 and 7 were close together and averaged 63 µg/m³. These two curves had the lowest slope of any groups, indicating elevated levels throughout the day but few values reaching 160 µg/m³. (Note: the highest of 24 values occurs with a frequency of 2.8%.) Group 8, with a geometric mean of 76 and slope of 2.1, was more closely aligned with Groups 6 and 7, as would be expected from the dendrogram structure in Figure 4.7. The super episode group had an averaged geometric mean of 72.

TABLE 4.5. Average Lognormal Parameters for Summary Groups and Subgroup Membership

[Table body only partly legible in the digitized copy. For each summary group (1 through 14, with 12 = NE, 13 = E, 14 = SE), the table lists the member subgroups by station, the number of station-days, the average geometric mean, and the average geometric standard deviation. Legible overall values include: Group NE, 27.2 at 2.57; Group E, 48.9 at 2.63; Group SE, 72.0 at 2.86.]

Notes: 1) Group numbers 1 through 14 are shown in Figure 4.7. 2) Group members show the station and subgroup number represented. 3) The average geometric mean is the average geometric mean for all station-days in the group.
[Figure 4.8: Log-probability plots of ozone concentration for the summary groups; not legible in the digitized copy.]

All three of the subgroups which made up Group SE predict daily contraventions of the oxidant standard. Group 9 had a geometric mean of only 52, but a slope of 3.6. This indicates relatively low levels during certain times of the day with very high excursions at other times. Group 9 consisted of 63 days and represented nearly 60% of all of the SE-type cases. Group 11 included a total of nine days and was limited to Station 63 and the site in St. Petersburg. With a geometric mean of 100 at a slope of 2.19, all of the members of this group would also be expected to exceed the air quality standard. Group 10, at 102 and 1.64 respectively, predicts contraventions of 160 µg/m³ on the average but includes some days which statistically might not exceed this limit. Figure 4.9 is a plot of the mean and first two principal components for the subgroups listed in Figure 4.7. A cross-reference table is provided as a guide to locate groups in Figure 4.9. All groups shown in Figure 4.7 or Figures 4.2 through 4.5 can be found in Table 4.6. The plots are arranged in cluster order, left to right and top to bottom.
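The lognormal reasoning above can be made concrete. Using the text's plotting frequency of 2.8% for the highest of 24 hourly values, the expected daily peak under a lognormal fit is the geometric mean times the geometric standard deviation raised to the corresponding normal quantile. A sketch (the 2.8% figure and the Group 10 parameters are taken from the text; the perfect-lognormal assumption is the author's):

```python
import math
from statistics import NormalDist

def lognormal_params(hourly):
    """Geometric mean and geometric standard deviation of one
    station-day of positive hourly concentrations."""
    logs = [math.log(v) for v in hourly]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / (len(logs) - 1))
    return math.exp(mu), math.exp(sigma)

def expected_peak(geo_mean, geo_dev, frequency=0.028):
    """Concentration exceeded with the given frequency under a
    lognormal fit; 2.8% is the frequency the text assigns to the
    highest of 24 hourly values."""
    z = NormalDist().inv_cdf(1.0 - frequency)
    return geo_mean * geo_dev ** z

# Group 10 reported a geometric mean of 102 ug/m3 at a geometric
# standard deviation of 1.64; its expected daily peak lies above the
# 160 ug/m3 standard, consistent with the text.
peak_group_10 = expected_peak(102.0, 1.64)
```

The same calculation applied to Groups 9 and 11 likewise predicts peaks above the standard, which is why all three SE subgroups "predict daily contraventions."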
The number of days in each group appears at the upper left corner in each figure. The number in the upper right corner is the total variance of all days in the group about the mean vector. The number at the left of each principal component is the percent of the total variance explained by that eigenvector. Figure 4.9 is very useful for visualizing differences between observed waveforms. Averaging smooths abrupt changes which confound the analysis of individual station-days. Thus, the mean vector for each subgroup is easier to interpret than the waveform of a specific day.

[Figure 4.9 a–k: Mean and principal components of subgroups, arranged by day type (a–b, type SE; c–e, type NE; f–k, type E). In each panel, the total variance for the group appears in the upper right corner and the number of days in the upper left; the percent of total variance appears to the right of each component; the bottom curve is the mean vector; the station and subgroup number appear below each set of graphs. Panels not legible in the digitized copy.]
Differences between individual members of each group can be visualized in terms of the first and second principal components.
Large positive or negative peaks in these curves indicate times during the day when differences between group members occurred. These components probably have physical significance in terms of meteorological or other influence parameters. Many of the components display a periodic nature, having 24-hour, 12-hour and 6-hour cycles. In addition, both mean and component vectors seem to be repeated among various subgroups. It is also apparent that many of the subgroups exhibit unusual waveforms which are not characteristic of the pattern expected from classical photochemical considerations. Figure 4.9 thus forms a basis for further investigation of oxidant phenomena. A choice of subgroups to model and possible characteristics of the influence parameters become much clearer after studying Figure 4.9 and Tables 4.1 and 4.5.

To illustrate the use of Figure 4.9, an example using Group 10913 is now presented. To locate this group, we find Station 10913 as the first element of the fifth column in Table 4.6. The entry opposite 10913 is 4.9b, which is page 71, and we find Group 10913 in the middle at the bottom of this page. The mean vector was computed to be 75 µg/m³ from midnight until about 4:00 AM, followed by a drop in concentration to 50 µg/m³ about 5:30 AM. The ozone level then climbs linearly to reach a maximum of 160 µg/m³ around 12:00 noon. From noon through 5:00 PM, the average ozone concentration for these 12 days remained near 160 µg/m³, then dropped to about 110 µg/m³ by 11:00 PM in approximately linear fashion. As shown in Figure 4.7, Group STP17 is most similar to Group 10913, and the relative scatter was about 2.5% when these two groups were combined. Since these graphs were plotted in cluster order, Group STP17 appears immediately to the left of Group 10913 in Figure 4.9b. The similarity between average waveforms is easily recognized. The first and second principal components for Groups 10913 and STP17 have a relatively smooth form.
This indicates that differences between individual days occur over a span of a few hours rather than during a single hour. The exception is the peak at about 7:00 AM in the first two components for Group STP17. These peaks tell us that much of the variation between days for this group occurred during the concentration drop from 100 µg/m³ to 50 µg/m³ at about 7:00 AM. Thus, some days exhibited less change in concentration at 7:00 than the average day, while other days had an even sharper change than the mean day. However, over all 24 hours, both of these groups tended to have either higher or lower morning values than the mean day. This is evident in the first component, which represents over 40% of the variation about the mean. A gradual change in these weighting functions, from 20 to near zero, occurs from midnight until 3:00 PM. This means that more variation occurred in the early morning concentrations than in the early afternoon values. The total variances about the mean vector over all 24 hours were 11334 and 12640 (µg/m³)² for Groups 10913 and STP17, computed over 12 and 16 days respectively. These values appear at the top of each graph in Figure 4.9b. These low values confirm the relative similarity among days in these two groups. These two groups would be good candidates for simulation modeling of certain super episode days. Similarly, Group 9411, having a total variance of 24726 (µg/m³)² for 22 days, is a group for which models could be constructed. In contrast, Group 6317 (Figure 4.9b) shows that some days tend to be unique, with large changes in concentration over short time periods. Turning to members of the NE group, the total variance about the mean is much lower, since days tend to be very similar to one another. These days would be easiest to model, but are also of less interest because the oxidant standard was rarely exceeded.
Models to represent these days may be useful to understand underlying physicochemical and meteorological conditions which restrict or reduce oxidant generation. Naturally, a large collection of very similar days is much easier to model than a few unusual days. Figure 4.9 reveals those groups which would be easiest to simulate, as well as graphically presenting the basic characteristics of all of the days represented in 1974. The objective in Chapter IV is to illustrate the methods presented in Chapter III through an application to real data. Figure 4.9 provides an effective means of breaking the analysis of the oxidant problem in South Florida into a collection of smaller problems. Groups of similar days can be further studied using a large number of other mathematical techniques. In addition to studying the interaction of influence parameters within a group, Figure 4.9 provides an approach to studying differences between groups. Contrasting differences in influence parameters, using powerful statistical tools such as stepwise discriminant analysis, could be used to identify those parameters having the greatest impact on oxidant levels. Once these are identified, the time evolution of ozone as a function of these parameters can be studied, provided the patterns of interest have been observed sufficiently often. The technique put forth in this dissertation has wide application in other fields as well as in air pollution. It provides a systematic method of condensing useful information and comparing it. A better understanding of the problem, as well as possibilities for potentially successful approaches to further analysis, results.

Table 4.6  Cross Reference Summary of Subgroups from Figures 4.2 Through 4.5

Group Name  Figure No.    Group Name  Figure No.    Group Name  Figure No.
63-1        4.9c          94-5        4.9g          109-13      4.9b
63-2        4.9c          94-6        4.9g          109-14      4.9a
63-3        4.9d          94-7        4.9d          109-15      4.9a
63-4        4.9f          94-8        4.9c          STP-1       4.9c
63-5        4.9f          94-9        4.9e          STP-2       4.9d
63-6        4.9h          94-10       4.9e          STP-3       4.9e
63-7        4.9h          94-11       4.9a          STP-4       4.9d
63-8        4.9g          94-12       4.9k          STP-5       4.9d
63-9        4.9i          94-13       4.9i          STP-6       4.9c
63-10       4.9e          94-14       4.9h          STP-7       4.9j
63-11       4.9i          109-1       4.9k          STP-8       4.9i
63-12       4.9e          109-2       4.9i          STP-9       4.9h
63-13       4.9k          109-3       4.9j          STP-10      4.9f
63-14       4.9k          109-4       4.9k          STP-11      4.9j
63-15       4.9a          109-5       4.9g          STP-12      4.9i
63-16       4.9b          109-6       4.9a          STP-13      4.9h
63-17       4.9b          109-7       4.9a          STP-14      4.9f
63-18       4.9b          109-8       4.9g          STP-15      4.9g
94-1        4.9j          109-9       4.9h          STP-16      4.9f
94-2        4.9j          109-10      4.9d          STP-17      4.9b
94-3        4.9j          109-11      4.9c          STP-18      none
94-4        4.9f          109-12      4.9k          STP-19      4.9b

Note: 1) 63-1 means Station 63 Subgroup No. 1.  2) 4.9a to 4.9k refer to pages 70 to 80.

CHAPTER V
SUMMARY AND CONCLUSIONS

Atmospheric processes are inherently random and change randomly over time. Statisticians call this kind of behavior a nonstationary stochastic system. Modeling air pollution is difficult because the response to influence parameters changes with time. Classical a priori models treat this problem by assigning stability types and making assumptions which allow approximate solutions to the pollutant transport equation. Stochastic models are better suited to simulating air quality because they represent empirical relationships between observed parameters, emphasizing their random nature. These models can provide the greatly needed means to better understand and control air quality. Atmospheric surveillance is becoming more intensive and more sophisticated. Naturally, stochastic models for air pollution will become more popular when highly reliable and extensive sets of data become available. However, the nonstationary behavior of these systems must be recognized and treated in order to develop meaningful stochastic models. This dissertation described some limitations to modeling air pollution and set forth a method for segregating data having similar attributes. Although a number of methods are possible, the most important principle is to recognize the need for many models. A single model to represent all of the observed conditions is not possible.
Therefore, the first step in building stochastic air quality models is to find sets of conditions which could be modeled. To qualify, the same conditions must have occurred often enough so that repeated experiments in the set are not too different from one another. Otherwise, the "system" being modeled may actually represent multiple systems. Note that rapidly changing conditions require frequent sampling and that significant spatial differences require a dense sampling network. Hourly averages at four sampling sites may be sufficient to represent some conditions but grossly inadequate for others if very rapid changes occurred. After choosing the appropriate conditions, reliable influence data can be analyzed to construct a model. Similarly, models representing particular conditions could be formulated to include most of the observed data. The approach to treating nonstationary data presented in this work utilized an assumption so that multivariate methods applied. Each hour in a day was considered as a distinct variable, since ozone is related to the intensity of incident radiation. This assumption was not necessary but greatly simplified formulating the pattern recognition problem. A generalized grouping method could be developed to seek intervals having similar system response. The length of these intervals need not be constant and could be initialized, for example, by the passage of a front. However, to develop stochastic air pollution models, one must recognize the need for repeated observations of similar conditions and methods to determine when these have occurred. Atmospheric modeling is a process of simulating a waveform and comparing it with an observed waveform which represents the same condition. The simulated results are subtracted from the observed results to yield a new waveform called the residuals. If the residuals are small compared with the original data, the model represents those conditions well.
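The residual step just described can be sketched directly. This example assumes synthetic hourly data and a simple sinusoidal stand-in for a deterministic model; the lag-1 autocorrelation of the residuals serves as a crude whiteness check, with values near zero consistent with white noise.

```python
import numpy as np

# Ten synthetic days of hourly "observations": a 24-hour cycle plus noise.
rng = np.random.default_rng(2)
t = np.arange(240)
observed = 100 + 40 * np.sin(2 * np.pi * t / 24) + 5 * rng.standard_normal(240)

# A deterministic "model" of the same condition (the noiseless cycle).
simulated = 100 + 40 * np.sin(2 * np.pi * t / 24)

# Residual waveform: observed minus simulated.
residual = observed - simulated
r = residual - residual.mean()
lag1 = (r[:-1] @ r[1:]) / (r @ r)   # near zero if the residual is white
print(round(lag1, 3))
```

If the residual still showed structure (a large lag-1 value, say), a second model could be fit to it, repeating until only white noise remains.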
Further treating the residuals can lead to better models until ultimately what is left is that purely random component known as white noise. The models used for each successive step can be deterministic, as in classical modeling, or stochastic. Either type of model represents a relationship between inputs and outputs. Both deterministic and stochastic models have characteristic frequency response curves relating these inputs and outputs. Classical models develop this relationship from physical principles, while stochastic models rely on empirical relationships. But the final test of each model is its ability to reconstruct the observed data and yield a purely random residual. Models fail when the influence data are inadequate and when functional coefficients are in error. It really does not matter how the model was developed, so long as the correct influence parameters and coefficients have been determined and can be relied upon to correctly simulate the conditions represented. Pattern recognition used on observed data can provide test data to compare with a priori models. A priori models can be used to suggest influence parameters and to make assumptions in applying pattern recognition tools. Both techniques should be used together. A pattern recognition approach to dealing with the nonstationary waveforms found in air pollution was presented for ozone data reported at four sampling sites in Tampa and Saint Petersburg, Florida. These data were selected because the oxidant problem is a national concern and accurate data were available. Each day was reduced from 24 hourly observations to a set of six values while retaining over 90% of the available information. These six values represented proportionate contributions of six independent pseudo-variables, each of which was made up from the original 24 hours. With proper scaling, the Euclidean metric was used to measure the distance between days, and the most similar days were combined into common groups.
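The reduction from 24 hourly values to six scores, followed by Euclidean comparison, can be sketched with a singular value decomposition, one standard route to principal components; the dissertation's own computation may differ in scaling. The data matrix below is synthetic and uncorrelated, so its first six components retain far less variance than the 90% achieved on the real, highly correlated ozone records.

```python
import numpy as np

# Synthetic stand-in for 60 days x 24 hourly ozone observations.
rng = np.random.default_rng(3)
X = 120 + 30 * rng.standard_normal((60, 24))

# Center by subtracting the mean day, then decompose.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Six pseudo-variable scores per day (projection onto the first six PCs).
scores = Xc @ Vt[:6].T
explained = (s[:6] ** 2).sum() / (s ** 2).sum()   # fraction of variance kept

# Euclidean distance between two days in the reduced 6-D space:
d = float(np.linalg.norm(scores[0] - scores[1]))
print(scores.shape)   # (60, 6)
```

Feeding such pairwise distances to an agglomerative clustering routine, merging the closest days first, yields the dendrogram groupings discussed next.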
Each station was analyzed separately and the results were displayed graphically as a dendrogram. Days which experienced elevated oxidant levels tended to be closely associated in the dendrogram. Similarly, days with low ozone levels were found together. The average factor scores from each group were used to determine spatially similar groups. This analysis resulted in the super episode (SE), episode (E), and nonepisode (NE) classification which was used to contrast meteorological conditions. As described in Appendix III, intense solar insolation following passage of a cold front and the associated high pressure system accounts for many super episode days. A clear explanation of differences between E and NE type days was not possible from the limited meteorological data available. The mean and first two principal components for each group at each station were plotted in cluster order. This graphical display demonstrated the existence of a number of distinct types of days which appear to require basically different models. Average lognormal parameters for groups of days were computed and the distributions were plotted. These also demonstrated distinct differences between types of days and showed that there is no single underlying distribution representing all observed conditions. In many cases ozone behavior was found to be unusual with respect to classical photochemical oxidant theory. Pattern recognition provides a method of handling the randomly changing nonstationary conditions encountered in atmospheric studies. This application clearly illustrated the usefulness of pattern recognition as a diagnostic tool in air pollution research. A number of significant conclusions can be drawn from the results of this work. All of them focus on the problem of stationarity. Fitting confidence bounds on the population mean or maximum and computing order statistics from a sample require a single stationary distribution.
It is very apparent that the density function of air quality data changes with time. Consequently, predicting the probability that a prescribed level was exceeded or will be exceeded in the future cannot be done with certainty. Influence parameters can be identified and functional relationships can be developed, but there is no guarantee that these same influence parameters will be important under another set of conditions. It is also possible that influence parameters which are considered to be insignificant may operate under some circumstances to exercise a predominant influence on air quality. Building models to describe air pollution, given this lack of constraints, is indeed complicated, and a single model capable of including all of these possible conditions is difficult to imagine. Classifying and analyzing observed air quality provides an alternative approach which divides the problem into manageable parts. If the system changed in a gradual, predictable way, forecasting could be accomplished using adaptive filters with coefficients that change predictably. Unfortunately, conditions in the atmosphere can change abruptly, requiring a new set of influence parameters and different functional relationships. But a comprehensive history of repeated events can be used to determine which model is appropriate. Real-time analysis of atmospheric processes can be developed after extensive collection and analysis of data have transpired. Classifying conditions and rearranging them into chronological order produces sequences of events. There are certain patterns to these sequences which can be used to predict when an air pollution episode is likely to occur. Mathematical models to simulate each kind of day, and other models which predict which type of model is most probable, are the basis for real-time prediction and stochastic simulation of air quality data.
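The closing idea, that sequences of classified days carry predictive patterns, can be sketched as a first-order transition matrix over the SE/E/NE classes used in Chapter IV. The day labels below are invented for illustration, not taken from the 1974 record.

```python
import numpy as np

# Hypothetical chronological sequence of day classifications.
labels = ["NE", "NE", "E", "SE", "SE", "E", "NE", "NE", "E", "E", "SE", "NE"]
states = ["SE", "E", "NE"]
idx = {s: i for i, s in enumerate(states)}

# Count day-to-day transitions, then normalize each row to probabilities.
counts = np.zeros((3, 3))
for today, tomorrow in zip(labels, labels[1:]):
    counts[idx[today], idx[tomorrow]] += 1
trans = counts / counts.sum(axis=1, keepdims=True)   # row-stochastic matrix

# Estimated distribution of tomorrow's day type given today is nonepisode:
print(trans[idx["NE"]])
```

With a long enough history, such a matrix (or a higher-order variant) supplies the "which model is most probable tomorrow" component of the real-time scheme described above.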