MONTE CARLO SIMULATIONS FOR ESTIMATING
UNCERTAINTY IN GIS-DERIVED BASE FLOOD ELEVATIONS
ARISING FROM POSITIONAL ERRORS IN VECTOR DATA
By
SAMUEL HANCHETT HENDERSON
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
2000
Copyright 2000
by
Samuel Hanchett Henderson
ACKNOWLEDGMENTS
I would like to express my sincere appreciation for all those who have helped me
through this master's thesis. Particular thanks go to my advisor, Dr. Bon Dewitt, for
selflessly providing an open door and constructive criticism. His observations
contributed a great deal to this research. I also wish to thank my other committee
members, Dr. Scot Smith and Dr. Timothy Fik, for their useful comments and
suggestions.
Thanks go to my fellow graduate students who provided support and comic relief,
and to 3001, Inc. for providing the data and means to carry out this study. I am indebted
to Sandra Russ and Joan Jones for their warm humor and for keeping me on track and in
line. Finally, I would like to thank my parents, William and Ann Henderson, for their
love, support, and encouragement.
TABLE OF CONTENTS
ACKNOWLEDGMENTS

ABSTRACT

CHAPTERS

1 INTRODUCTION

2 BACKGROUND
    Spatial Data Quality
    Positional Data Quality
        Error Types
        Accuracy and Precision
        Accuracy Assessment
    Error Propagation
        Monte Carlo Simulation
    Related Studies
        Raster-Based Studies
        Vector-Based Studies
        Hunter and Goodchild's Vector Uncertainty Model

3 EROSION HAZARDS STUDY
    Data Layers
    Erosion Hazards Analysis

4 METHODS
    Creating Data Realizations
    Deriving Uncertainty Estimates
    Application to Erosion Hazards Study

5 RESULTS AND DISCUSSION
    Sample Results
    Study-Wide Results
    High Uncertainty Cases
    Usefulness
    Isolating Error Contributors

6 CONCLUSION

APPENDIX CODE
    Fema.aml
    Pes.aml
    Pes main.aml
    Pes main.menu
    Pes add.menu
    Randgen.c
    Fat2rrt.c

LIST OF REFERENCES

BIOGRAPHICAL SKETCH
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
MONTE CARLO SIMULATIONS FOR ESTIMATING
UNCERTAINTY IN GIS-DERIVED BASE FLOOD ELEVATIONS
ARISING FROM POSITIONAL ERRORS IN VECTOR DATA
By
Samuel Hanchett Henderson
December 2000
Chairman: Bon A. Dewitt, Ph.D.
Major Department: Civil and Coastal Engineering
Geographic Information Systems (GIS) are increasingly being used to aid in
spatially referenced decision making. GIS-derived data are frequently presented without
measures of reliability, leading to false certainty in the quality of data. This thesis
proposes software implementing Monte Carlo simulations for estimating uncertainty
arising from positional error propagation in vector-based GIS. A generic application, the
Positional Error Simulator (PES), was developed in ARC/INFO to simulate error in
source data sets, carry out spatial analysis, and summarize uncertainty in derived values.
This error model was applied to a predictive erosion hazards study conducted for
the Federal Emergency Management Agency. This study used GIS to produce estimates
of Base Flood Elevation (BFE) for coastal structures at 10-year intervals, 60 years into
the future. The analysis is carried out on three data layers: coastal structures, flood zones
derived from Flood Insurance Rate Maps, and the erosion hazard area consisting of the
current and 60-year projected erosion reference feature. Two representations of BFE are
investigated: a zonal-based nominal elevation, and a smoothed or interpolated
representation.
Traditional Monte Carlo simulations were applied to the Erosion Hazards Study
in order to investigate the nature of error propagation in complex GIS analysis. The PES
application was used to simulate error in source data sets and summarize variability in
reported BFE attributes. Resulting errors are presented at the study-wide and
high-uncertainty levels, and discussed in reference to error propagation through data layer
interactions and spatial functions.
It was shown that 90% of nominal BFEs returned errors of 1 ft or less and 95% of
interpolated BFEs had less than 0.5 ft of error. Nominal and interpolated BFE errors were
as high as 2.5 ft and 1.25 ft, respectively. Additional simulations were performed to consider
error in individual source coverages, while holding others errorfree. These tests allowed
the identification of source layer contributions to output uncertainty levels. It was shown
that the structure layer contributed the most to output uncertainty. Basic metrics were
introduced to compare observed levels of error and expected levels based on individual
layer contributions. These data showed that most errors accumulated as expected.
Simulation results can be used to express data quality in metadata statements, to
appropriately represent spatial data, and to validate data collection and processing
methods through a suitability framework. A major consideration is the time and
computer storage requirements for implementing simulations. Care must be taken in the
interpretation of output uncertainty estimates as the value of summary statistics are
largely based on unpredictable realized error distributions.
CHAPTER 1
INTRODUCTION
Geographic Information Systems are increasingly being used to aid in spatially
referenced decision making. GIS encompasses users from federal governments to
environmental researchers, and the results of GIS analyses are used to support decision
making in fields from emergency vehicle dispatch to environmental modeling. GIS
provides efficient and largely automatic modeling of geographic phenomena by
allowing numerous datasets to be assembled, stored, combined and queried.
By generically handling spatial data, supporting an array of data formats, and not
accounting for basic accuracy information, GIS is a hotbed of error propagation.
Furthermore, the digital mapping environment has led to false certainty in data quality
(Thapa and Bossler, 1992). Because basic data quality statements such as scale and
resolution are neither retained nor used as limiting factors to the types of operations that
may be carried out, GIS products may be applied to uses for which they were never
intended (Goodchild, 1993). Lanter and Veregin (1992) have clearly summarized the
problem.
A GIS provides a means of deriving new information without
simultaneously providing a mechanism for establishing its reliability. In
such applications input data quality is often not ascertained, functions are
applied to these data without regard for the accuracy of derived products,
and these products are presented without an associated estimate of their
reliability or an indication of the types of error they may contain. (Lanter
and Veregin, 1992, p. 825)
The uncertainty problem is a major topic in the research field. Since the
computerization of geographic analysis, researchers have been concerned with the
presence, quantification, and visualization of errors in spatial databases. In the last
decade, the National Center for Geographic Information and Analysis (NCGIA) and the
University Consortium for Geographic Information Science (UCGIS) have highlighted
spatial data quality as a key research initiative. The Federal Geographic Data Committee
(FGDC) has required data quality information to accompany all spatial data in
accordance with the Spatial Data Transfer Standard and the Content Standard for Digital
Geospatial Metadata (FGDC, 1998a).
There are few practical tools for dealing with uncertainty in GIS (Veregin, 1989a;
Canters, 1997; Hunter and Goodchild, 1999a). In search of automated solutions, many
researchers have concluded that traditional mathematical error propagation techniques
cannot be applied to complex GIS analysis and have focused on computer simulation
methods as a practical alternative. Despite significant advances, comprehensive tools
have not been implemented in popular GIS software.
This thesis provides a method for estimating uncertainty in GISderived values
arising from positional errors in vector data. Similar error analyses have been conducted
on simple GIS operations such as the determination of area and length, point-in-polygon
overlay, and buffer operations. There has been no application of a vector-based
simulation model to a substantial GIS analysis consisting of chains of multiple functions
and data layers.
This thesis investigates attributes derived in an Erosion Hazards Study conducted
by the Federal Emergency Management Agency. This study (discussed in depth in
Chapter 3) derived two measures for the Base Flood Elevation of coastal structures: a
zonal-based nominal elevation and a smoothed or interpolated representation. In
addition, a model was implemented to estimate these variables at 10-year intervals over a
60-year period. This study is a complex GIS analysis in which error propagation is
expected to play a major role in the quality of derived values.
The goal of this thesis is to apply traditional Monte Carlo simulations to the
Erosion Hazards Study in order to investigate the nature of error propagation in GIS. The
underlying hope of this research is that the algorithms introduced and problems revealed
will lead to new directions for future research. To address these goals, the following
questions are investigated.
1. What levels of uncertainty exist in derived Base Flood Elevations?
2. What are the potential uses and limitations of the proposed error model and the
resulting uncertainty data?
3. Which data layer contributes the most to output uncertainty?
4. How do errors from individual inputs accumulate during spatial analysis?
The proposed error model is formalized into a generic application, the Positional
Error Simulator (PES), for the simulation of vector data and estimation of uncertainty in
GIS-derived information. This proposed software is a substantial improvement on a
similar model developed by Hunter and Goodchild in 1996. Although the theories and
algorithms presented in this thesis can be applied in any GIS, they were specifically
implemented in ARC/INFO. This choice is supported by the popularity of the software
and the availability of the Arc Macro Language (AML). ARC/INFO terminology will be
used throughout this document.
This thesis is divided into six chapters and one appendix. Chapter 2 provides
background on spatial data quality and highlights the importance of positional data
quality. Mathematical and simulationbased methods for error propagation are
introduced. A brief review of related research is presented, concluding with a description
of Hunter and Goodchild's uncertainty model. Chapter 3 discusses the algorithms used to
derive base flood elevations for the Erosion Hazards Study. Chapter 4 presents the
proposed error model, the practical software written for ARC/INFO, and describes how
the model was applied to evaluate error propagation in the Erosion Hazards Study.
Chapter 5 reports the results of the case study error analysis and reviews practical uses
for the uncertainty information. A discussion of the results, and overall evaluation of the
model is also provided. Chapter 6 summarizes work done and conclusions made. The
appendix contains AML and C code written for this study.
CHAPTER 2
BACKGROUND
Before the introduction of a positional error model written for this study, some
background information is presented. A general discussion of spatial data quality is
offered, highlighting the importance of positional information and clarifying terms
associated with error description. Next, the concept of error propagation is introduced,
drawing distinctions between traditional variance propagation and simulation methods.
Lastly, a review of related research is presented, including a description of a similar
vector-based error model developed by Hunter and Goodchild.
Spatial Data Quality
The term error is used loosely in GIS literature. From discussions of image
classification accuracy to topological problems such as overshoots, dangles, and slivers,
errors may exist in virtually every aspect of a spatial database. Fundamentally, GIS data
are made up of position and attribute information. The positional or spatial component
defines where on the earth's surface features exist while the attribute or thematic
component describes what the objects represent. Data quality refers to how well position
and attribute information portray realworld entities. The differences between the
physical truth and the digital representation, whether they are positional or attribute in
nature, are generically termed errors.
A great deal of research has focused on describing, quantifying, and representing
these differences, resulting in an abundance of error descriptors. Error, reliability,
uncertainty, accuracy, precision, resolution, and scale are often used interchangeably and
sometimes incorrectly in reference to spatial data quality (Goodchild, 1993).
All spatially referenced data are subject to errors introduced during the data
collection process. The magnitude of error is largely dependent on the data collection
method, and to a lesser degree on the refinement of the processing methods (Goodchild,
1993). Thapa and Burtch (1991) recognize primary and secondary methods of data
collection in GIS. Primary methods include those that directly measure the position of
features, including surveying, GPS, photogrammetry, and remote sensing. Secondary
methods involve obtaining geospatial data from existing documents through digitizing or
scanning. GIS databases are often composed of data from both primary and secondary
sources, and therefore are subject to varying types and degrees of error.
Data quality has formally been described by various government organizations
including the Federal Geographic Data Committee. The Content Standard for Digital
Geospatial Metadata (FGDC, 1998a) and the Spatial Data Transfer Standard (USGS,
1997) refer to five components of spatial data quality: lineage, positional accuracy,
attribute accuracy, logical consistency, and completeness. Lineage refers to the
documentation of data collection and conversion methods leading to a data set's final
digital representation. Positional accuracy describes the spatial extent of uncertainty in a
digital data set and may be divided into horizontal and vertical components. Attribute
accuracy consists of numerical estimates of variability in object properties. Logical
consistency refers to the topological correctness of the dataset. Completeness describes
rules used in the creation of the data set such as sampling methods, omissions, and
definitions. A final spatial data quality component can be identified as temporal
accuracy. Although it is not listed as an independent component, the standards recognize
its importance and call for its inclusion in each of the other quality statements.
The FGDC requires all government funded spatial data products to be
accompanied by adequate metadata in accordance with the CSDGM. In addition, data
users in the private sector are increasingly requiring metadata documents alongside GIS
products. To produce these documents, users must provide estimates for the quality of
GIS layers and derived products.
Although position and attribute components are often handled independently, both
must correctly reflect the geographic truth in order to faithfully represent spatial data.
The full functionality of GIS is only realized through the generic handling of multiple
thematic layers over a common spatial extent. This requires all data to be referenced to a
common coordinate system or positional measurement scale. Positional data quality is
therefore a fundamental aspect of data quality.
Positional Data Quality
Positional Error is the difference between a measured value and the true value.
Because no measurements are exact, the true value is only a theoretical quantity, and
similarly, all representations of error are estimates. The mean of a group of repeated
observations is often used as an estimate of the true value.
Error Types
Spatial data are subject to three types of errors: random, systematic, and gross
blunders. Random errors are the perturbations in data primarily due to the inexactness of
the measurement process, arising through variability in human observations and
equipment. The variation in a human's vision while reading an instrument may cause
fluctuations in the data around the hypothetically true value. Random errors follow the
laws of probability; they are more likely to be small than large, and just as likely to be
positive or negative (Wolf and Ghilani, 1997). Systematic errors are usually the result of
instrument miscalibrations causing all values to be shifted to one side, creating an
imbalance in the data. Blunders are simply mistakes in the observation or recording of a
measurement.
Although all data entering a GIS are subject to all three types of errors this
research assumes that the data have been adequately processed to identify and remove
systematic errors and blunders. Data derived by secondary methods (i.e. scanning,
digitizing) further complicate the detection and removal of nonrandom errors. Assuming
only the presence of random errors in derived data is fundamentally incorrect but
practically unavoidable without advanced models for error propagation in data
conversion techniques. Stanislawski, Dewitt, and Shrestha (1996) did propose such a
model for evaluating error propagation through digitizing trials and subsequent
transformations using least squares. They were able to quantify absolute and relative
accuracies and effectively model systematic errors. Generally, however, the lineage
information needed to track systematic errors through various conversions is not
available, nor is such tracking computationally feasible. Random errors cannot be removed from a
database. Further discussion of error in this thesis will refer only to random errors.
Accuracy and Precision
The concept of accuracy is closely related to quality although it takes on a formal
definition in regards to positional quality. Accuracy is the nearness of a measured
quantity to its true value (Wolf and Ghilani, 1997). Positional accuracy specifically
refers to the nearness of observed values to the feature's true position in the same
coordinate system (Drummond, 1995). Positional accuracy is often quantified by root
mean square error or simply RMSE (Anderson and Mikhail, 1998; Drummond, 1995).
RMSE values are calculated for each coordinate component as follows:

RMSE_X = \sqrt{\sum_{i=1}^{n} (X_{meas,i} - X_{true,i})^2 / n}
RMSE_Y = \sqrt{\sum_{i=1}^{n} (Y_{meas,i} - Y_{true,i})^2 / n}
True coordinates are simply values attained through a higher accuracy method.
Combining the X and Y components into a single error descriptor gives the following
equation known as circular, horizontal, or radial RMSE (FGDC, 1998c).
RMSE_{circular} = \sqrt{RMSE_X^2 + RMSE_Y^2}
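As a concrete illustration of the component and circular RMSE computations above, a small sketch follows; the checkpoint coordinates are hypothetical and not taken from the study.

```python
import math

def rmse_components(measured, true):
    """Compute RMSE_X, RMSE_Y, and circular RMSE from checkpoint pairs.

    measured, true: equal-length lists of (x, y) coordinate tuples, where
    'true' positions come from an independent, higher-accuracy source.
    """
    n = len(measured)
    sq_dx = sum((xm - xt) ** 2 for (xm, _), (xt, _) in zip(measured, true))
    sq_dy = sum((ym - yt) ** 2 for (_, ym), (_, yt) in zip(measured, true))
    rmse_x = math.sqrt(sq_dx / n)
    rmse_y = math.sqrt(sq_dy / n)
    rmse_circular = math.sqrt(rmse_x ** 2 + rmse_y ** 2)
    return rmse_x, rmse_y, rmse_circular

# Hypothetical checkpoints: digitized positions vs. higher-accuracy survey
measured = [(100.3, 200.1), (150.2, 249.8), (199.7, 300.4)]
true = [(100.0, 200.0), (150.0, 250.0), (200.0, 300.0)]
rx, ry, rc = rmse_components(measured, true)
```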
The term precision has two different although related usages in the spatial data
field. Most data quality research refers to precision as the number of significant figures
in a measurement (Goodchild, 1993). For example the number 154.234 is more precise
than the number 154.23. This is often referred to as arithmetic or reporting precision.
In terms of measurement theory, precision is the degree of refinement or
consistency in a measurement process (Wolf and Ghilani, 1997). It describes the
closeness of repeated observations to one another, regardless of their relation to the true
value. Although a group of observations may be very precise, systematic effects may
have caused all measurements to be of poor accuracy (Wolf and Ghilani, 1997). Standard
deviation is typically used to measure precision (Drummond, 1995) and is calculated by
\sigma = \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)}

where \bar{x} is the most probable or mean value, x_i is the ith measurement, and n is the number
of observations.
Accuracy Assessment
The Spatial Data Transfer Standard (SDTS) identifies four methods for assessing
positional accuracy of a digital dataset (USGS, 1997). The first, by deductive estimate,
calls for practical estimates of errors in source data and assumptions made about error
propagation. Secondly, internal evidence such as redundancy checks or adjustment
statistics may be used. Third, comparison to source calls for the visual comparison of
derived data with source data, if such a comparison is possible. The fourth and preferred test
of positional accuracy is by comparison to an independent source of higher accuracy.
Traditional accuracy standards such as the National Map Accuracy Standard
(NMAS), the American Society for Photogrammetry and Remote Sensing (ASPRS)
Accuracy Standards for LargeScale Maps, and the FGDC's Geospatial Positioning
Accuracy Standards, are formal methods for implementing the higher accuracy tests.
These standards call for independent treatment of horizontal and vertical accuracy
assessment. Only the horizontal component will be discussed here.
Although the current version of the National Map Accuracy Standards (NMAS)
was developed in 1947, it is still widely used. NMAS calls for the testing of well-defined
points by a source of higher accuracy. In order to meet the specification, 90% of test
points must be within a certain distance, determined by map scale, of the higher accuracy
location. For map scales larger than 1:20,000, the allowable error tolerance is 1/30 of an
inch at map scale. For scales smaller than 1:20,000, error must be within 1/50 of an inch
(USGS, 1999).
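Because NMAS tolerances are stated at map scale, they translate into ground distances through the scale denominator. A minimal sketch of that conversion (the function name and example values are illustrative):

```python
def nmas_ground_tolerance_ft(scale_denominator):
    """Ground-distance tolerance in feet implied by NMAS for a map scale.

    Scales larger than 1:20,000 (smaller denominators) allow 1/30 inch at
    map scale; 1:20,000 and smaller scales allow 1/50 inch.
    """
    inches_at_map_scale = 1 / 30 if scale_denominator < 20000 else 1 / 50
    # map inches -> ground inches -> ground feet
    return inches_at_map_scale * scale_denominator / 12

# Familiar USGS 1:24,000 quadrangle case: 1/50 in x 24,000 = 480 in = 40 ft
tolerance_ft = nmas_ground_tolerance_ft(24000)
```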
The ASPRS accuracy standards also link accuracy to map scale. Unlike the
NMAS, the ASPRS standards provide numerous class levels and acceptable error
tolerances for map scales larger than 1:20,000 (ASPRS, 1990). In addition, the standards
require a minimum of 20 checkpoints.
Part 3 of the Geospatial Positioning Accuracy Standards is called the National
Standard for Spatial Data Accuracy (NSSDA) (FGDC, 1998c). The NSSDA is the
current standard for digital data. The standard recognizes that in a digital environment,
map scales can easily be modified and quickly lose their original meaning. To describe
accuracy of digital data, the NSSDA requires the reporting of RMSE at the 95%
confidence level. The standard also requires at least 20 well-defined checkpoints.
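The NSSDA reporting statistic is obtained by scaling RMSE to the 95% confidence level; when RMSE_X and RMSE_Y are approximately equal, the multiplier given in the standard for the horizontal case is 1.7308 (FGDC, 1998c). A minimal sketch:

```python
def nssda_horizontal_accuracy(rmse_r):
    """Horizontal accuracy at the 95% confidence level per NSSDA.

    Applies the standard's 1.7308 factor to radial (circular) RMSE,
    assuming RMSE_X and RMSE_Y are approximately equal.
    """
    return 1.7308 * rmse_r

# A radial RMSE of 2.0 ft reports as roughly 3.46 ft at 95% confidence
accuracy_95 = nssda_horizontal_accuracy(2.0)
```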
Although positional accuracy standards can be used as a basis for data quality
statements, data producers usually lack the budget and time required to collect a
sufficient number of check points. Furthermore, the abstract nature of many data sets
precludes the identification of well-defined points. Rather, data quality statements are
usually based on deductive estimates. The user is therefore required to make assumptions
about source errors and the accumulation of the errors during data conversion and
processing.
Error Propagation
Error propagation is the technique used to evaluate errors in computed quantities
based on an understanding of how these errors accumulate through computational
functions (Anderson and Mikhail, 1998). The distribution of original errors through a
computational process will result in an augmentation of error in the result (Wolf and
Ghilani, 1997). Standard deviation is traditionally used to describe error in observed
quantities and output values derived through functions of those quantities (Anderson and
Mikhail, 1998).
Formulas can be developed for the mathematical evaluation of error propagation,
formally known as variance propagation. The special law of propagation of variances
can be used to evaluate error transmitted through a single function Z. With a, b, c, ..., n
representing the observed variables with known random and independent errors
\sigma_a, \sigma_b, \sigma_c, ..., \sigma_n, the error in Z can be derived from partial
derivatives (Anderson and Mikhail, 1998).

\sigma_Z^2 = \left(\frac{\partial Z}{\partial a}\right)^2 \sigma_a^2 + \left(\frac{\partial Z}{\partial b}\right)^2 \sigma_b^2 + \left(\frac{\partial Z}{\partial c}\right)^2 \sigma_c^2 + \cdots + \left(\frac{\partial Z}{\partial n}\right)^2 \sigma_n^2
To propagate error through multiple functions and to handle correlated variables,
the general law of propagation of variances must be used (Wolf and Ghilani, 1997;
Anderson and Mikhail, 1998). This is accomplished by computing a variance-covariance
matrix,

\Sigma_{ZZ} = A \Sigma A^T

where \Sigma_{ZZ} is the covariance matrix for the function, A is the coefficient matrix of partial
derivatives known as the Jacobian matrix, and \Sigma is the covariance matrix for the
measurements. For nonlinear functions, the partial derivatives in the Jacobian matrix
can be estimated by a first-order Taylor series expansion (Wolf and Ghilani, 1997).
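As a numerical sketch of the general law, consider a hypothetical pair of functions of two measured rectangle sides a and b: the area Z1 = ab and the perimeter Z2 = 2(a + b). All values are illustrative.

```python
import numpy as np

# Measured sides (hypothetical) and their standard deviations
a, b = 30.0, 20.0
sigma_a, sigma_b = 0.05, 0.04

# Jacobian of (Z1, Z2) = (a*b, 2*(a + b)) with respect to (a, b)
A = np.array([[b, a],
              [2.0, 2.0]])

# Covariance matrix of the measurements (uncorrelated here, so diagonal)
Sigma = np.diag([sigma_a ** 2, sigma_b ** 2])

# General law of propagation of variances: Sigma_zz = A Sigma A^T
Sigma_zz = A @ Sigma @ A.T

sigma_area = float(np.sqrt(Sigma_zz[0, 0]))       # std. dev. of the area
sigma_perimeter = float(np.sqrt(Sigma_zz[1, 1]))  # std. dev. of the perimeter
```

The off-diagonal terms of Sigma_zz give the covariance between the two derived quantities, which the special law alone cannot provide.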
Openshaw (1989) recognizes that, in theory, existing mathematical error
propagation techniques can be used to evaluate error in GIS derived products. However,
in practice, he notes that the complexity of spatial operations and the combination of
these functions makes formulas very difficult, if not impossible, to develop. For complex
spatial analysis, Openshaw (1989), Goodchild (1995), Drummond (1995), and Kiiveri
(1997) recognize that the propagation of error can only be practically modeled through
Monte Carlo simulations.
Monte Carlo Simulation
Monte Carlo methods are used to simulate realworld situations using random
number generators. Chou (1969) identifies two requirements for applying the method: a
model of reality and a mechanism to simulate the model. Applied to evaluating error
propagation, the model is the expected probability distribution of error, and the
simulation mechanism is a random number generator. Goodchild (1995) states Monte
Carlo methods can be used to estimate uncertainty in GIS outputs by simulating inputs
with distorted datasets or realizations based on their estimated uncertainties. Although
these realizations are fictitious, they are derived from the expected probability
distribution and therefore are equally possible representations. By replacing observed
data with appropriately distributed random data, and allowing each simulation to progress
through an identical series of functions, the variability in results can be summarized to
assess the reliability of GIS output.
Openshaw (1989) has outlined a generic Monte Carlo approach for evaluating
uncertainty in GIS derived products.
1. Decide what levels and types of error exist in source data.
2. Replace the observed data (position and/or attribute) by a set of random
variables drawn from appropriate probability distributions.
3. Apply a sequence of GIS operations to the Step 2 data.
4. Save the results.
5. Repeat Steps 2 to Step 4, N times.
6. Compute summary statistics.
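The six steps above can be sketched as a simulation loop. The "analysis" here is deliberately simple, the area of a polygon whose vertices carry independent, normally distributed positional error; all names and values are illustrative.

```python
import math
import random

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    s = sum(vertices[i][0] * vertices[(i + 1) % n][1]
            - vertices[(i + 1) % n][0] * vertices[i][1]
            for i in range(n))
    return abs(s) / 2.0

def monte_carlo_area(vertices, sigma, n_iter=100, seed=1):
    """Steps 2-6: perturb positions, rerun the analysis, summarize results."""
    rng = random.Random(seed)
    areas = []
    for _ in range(n_iter):
        # Step 2: replace observed coordinates with random realizations
        realization = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma))
                       for x, y in vertices]
        # Steps 3-4: apply the GIS operation and save the result
        areas.append(polygon_area(realization))
    # Step 6: summary statistics over the N saved results
    mean = sum(areas) / n_iter
    std = math.sqrt(sum((v - mean) ** 2 for v in areas) / (n_iter - 1))
    return mean, std

# A 100 x 100 square with 1-unit positional error at each vertex
square = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
mean_area, std_area = monte_carlo_area(square, sigma=1.0)
```

The standard deviation of the simulated areas serves as the uncertainty estimate for the derived value, exactly the role the PES application plays for BFE attributes.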
Openshaw realizes that, in many cases, the magnitude and/or distribution of errors
in source data may not be known. However, reasonable assumptions about input errors
will still indicate plausible boundaries for output data quality. He also states that a key
problem is determining the appropriate value of N. If too few iterations are used, the
validity of summary statistics may be in question. On the other hand, increasing N may
result in additional computational time without adding statistical significance. Openshaw
recommends a minimum of 20 to 30 iterations to validate statistical summaries.
The simulation method can be universally applied to virtually any data type and
analysis, regardless of its level of complexity (Canters, 1997; Openshaw, 1989; Goodchild,
1995). In addition, the concept is logical and easily grasped by novice users.
Furthermore, the method does not require the use of complex mathematics.
The major cost to the simulation method is the increase in required computer
processing time. As most GIS production schedules are strained to carry out an analysis
one time, repeating the analysis N times is costly and in many cases not feasible. The
primary advantage of mathematical models over simulations is that they can be applied
on-the-fly and can be applied to new situations of the same function (Heuvelink,
Burrough, and Stein, 1989).
There are numerous applications for the simulation-derived data. The results may
accompany each database at the record level and be used as estimates in other analyses. The
results can be summarized for the inclusion in metadata documents as data quality
statements. The results can be used to correctly represent the quality of GISderived data
by limiting the precision of reported attributes. Also, Openshaw (1989) recognizes that the
contribution to output errors from different inputs may be isolated by holding some
inputs error-free. This information would prove invaluable to project planners and will
be discussed in detail in Chapter 5.
Related Studies
A common goal among researchers in the data quality field is to establish models
that describe types and magnitudes of errors in source data, track errors through spatial
functions, and report uncertainty information alongside GIS outputs (Goodchild, 1989,
1993; Guptill, 1989). This goal remains an ideal, as few methods exist for the automatic
handling of uncertainty information. When organizing the large body of literature
dealing with data quality and GIS, broad distinctions can be made based on data type
(raster or vector), component type (positional or attribute) and model type (simulation or
variance propagation). Goodchild (1995) and Lanter and Veregin (1992) have stated that
examining spatial and thematic accuracy independently is not desirable. However, due to
the complexities of the actual operations in use, the derivation of all-inclusive error
models is largely implausible. Although this research is only concerned with positional
error propagated through spatial functions in a vector environment, many attribute and
grid-based studies have contributed to the derivation of vector-based models. In order to
place this research among others, a brief review of related work follows.
Raster-Based Studies
The majority of research has concentrated on the propagation of errors in the
raster environment. It is generally recognized that error propagation in raster or field data
is somewhat easier than in vector or object-based data (Goodchild, 1989). In many of
these studies, the complexities of error propagation are reduced by holding grid cell
positions fixed and only considering errors in the values or attributes of grid cells,
allowing the accumulation of errors during raster overlay to be assessed by traditional
mathematical error propagation on a cell-by-cell basis.
Research dealing with error propagation in the raster environment falls into two
fundamental categories: thematic and elevation studies. In thematic work,
Heuvelink, Burrough, and Stein (1989) applied standard mathematical error propagation
techniques to evaluate simple arithmetical operations in a raster environment. By only
considering basic operations and applying them on a cell-by-cell basis, output errors were
modeled with Taylor series expansions and traditional variance propagation. Fisher
(1991a) used simulations to find errors in monetary costs associated with soil map-unit
inclusions. Lanter and Veregin (1992) propagated errors through raster overlay using
variations of the traditional PCC (points correctly classified) index. Goodchild, Guoqing,
and Shiren (1992) proposed an error model for thematic data based on stochastic
simulations and used it to evaluate uncertainty in area and overlay operations on land
cover maps. Veregin conducted several studies on error modeling in a raster
environment, concentrating on overlay (1989a and 1995) and buffer operations
(1994, 1996). In 1994 he used Monte Carlo simulation as a basis for creating formal
mathematical models for error propagation. He argues that simulation modeling is too
computationally intensive to be practically applied in GIS and that simulation studies
have only led to anecdotal observations. Based on simulation results, he defines models
that propagate error through the buffer operation. He argues that these models can then
be applied to other buffer operations to achieve uncertainty information in real time,
without performing time-consuming simulations. Both Canters (1997) and Yuan (1997)
use Monte Carlo simulations to evaluate uncertainty in area estimation based on errors in
image classification. Arbia, Griffith, and Haining (1998) also use simulations to model
error in basic raster overlay functions.
Much of the work in raster uncertainty propagation has focused on errors arising
in data derived from digital elevation models (DEMs). Hunter and Goodchild have
produced a considerable amount of research in this area. In 1994, they used simulations
to evaluate uncertainty in areas burnt by forest fires in Portugal. These areas were
primarily identified by a slope and aspect analysis of the DEM. In 1995b, Hunter and
Goodchild reviewed various methods for evaluating the uncertainty of an elevation value
interpolated from a DEM. In 1997, Hunter and Goodchild applied their model to slope and
aspect measures. Ehlschlaeger, working with Goodchild (1994), also described
uncertainty in elevations derived from DEM interpolation. Fisher (1991b) used Monte
Carlo simulations to evaluate uncertainty in viewshed areas. Lee, Snyder, and Fisher
(1992) used simulations to model data errors in floodplain feature extraction. Wechsler
(1999) built a practical set of tools in the ArcView GIS environment to evaluate basic
DEM uncertainties including slope and aspect calculations through Monte Carlo
simulations. Davis and Keller (1997) combined simulations and fuzzy models to
evaluate slope stability.
Vector-Based Studies
Of particular relevance to this research are the few models proposed to deal with
uncertainty in vector data. The epsilon band model is a general approach to describe
uncertainty around lines. Although some differences in definition exist, the model
fundamentally states that a line's digital representation will fall within some distance
epsilon from the true line, the likelihood of which is described by a probability density
function (pdf). Dunn, Harrison, and White (1990) identify three common pdfs in use
with the epsilon model: rectangular, bell-shaped, and bimodal. Blakemore (1984) used
the epsilon approach to assess point-in-polygon overlay errors by dividing the polygons
into zones described as definitely in, possibly in, possibly out, and definitely out. Dunn,
Harrison, and White applied the model to digitizing trials in order to assess uncertainty in
the areas of polygons. Goodchild (1995) criticized the epsilon band as not meeting the
requirements of a formal error model because it provides no means for generating
distorted versions of the data.
Realizing that the majority of digital vector data has entered the GIS through a
digitization process, it is logical to assume that the positional uncertainty present in
source data will follow the magnitudes and distributions revealed by repeated digitizing
trials. Studies of digitizing error are therefore important to describe the positional
uncertainty present in source data, leading to more plausible models of error propagation.
Bolstad, Gessler, and Lillesand (1990a and 1990b) used linear models to estimate
errors in the digitization process due to map media, point type, and registration effects.
In their study, four operators repeatedly digitized the same map. Differences between the
digitized points and their means revealed a normal bell-shaped distribution. Maffini, Arno,
and Bitterlich (1989) investigated error through a series of digitizing trials. They used the
results of these tests to make general statements about the nature of error, similar to the
epsilon model. Ehlers and Shi (1996) used traditional variance propagation models to
describe positional uncertainty in points collected through digitizing. Their vector model
of digitizing error is based on the assumption of normally distributed independent errors
in the coordinates of points. Shi (1998) further developed his statistical model to
incorporate uncertainty from one- to N-dimensional space, allowing for the inclusion of
spatiotemporal relationships. Leung and Yan (1998) provide a similar set of pointbased
models for describing positional uncertainty in vector databases. Like Shi, they also
assume a circular normal distribution around points and line vertices to describe error in
points, lines, and polygons. They apply their model to evaluate uncertainty arising in
basic spatial functions such as length and area.
A few authors disagree that such a pointbased error model can be applied to lines
and polygons. Goodchild (1993) argues that, with regard to digitizing errors, positional
uncertainty in a line cannot be adequately represented by independent errors in line
vertices. Rather, he argues, the digitizing operation produces vertices whose error is
highly correlated. Also, misregistration errors will tend to produce uniform shifts that
can only be described by assuming some degree of correlation in the digitized line.
Openshaw (1989) also argues that lines are subject to correlated errors.
Emmi and Horton (1995) used Monte Carlo simulations to evaluate error
propagation in an assessment of seismic risk. They considered errors in the position of
ground shaking intensity boundaries. However, in order to create the random perturbations
of vector data they did not draw on rigorous statistical theory. Instead they adopted a
software-friendly solution of densifying arcs, applying a weed tolerance to shift arcs, and
generalizing arcs. This in effect randomized the position of arcs and points, but did so in
a manner that cannot guarantee normally distributed data.
Hunter and Goodchild's Vector Uncertainty Model
Practical research on vector-based error propagation utilizing simulation
techniques has come from Hunter and Goodchild. In 1996 Hunter and Goodchild
proposed a model that creates distorted vector datasets based on estimated positional
errors. The algorithms in this work are largely based on their previously developed
model for raster data. These random datasets or realizations can then be used in a
simulation model, which in turn can assess error propagation through spatial analyses and
provide an estimate for uncertainty in end products. The model makes two assumptions:
1) that the error at line vertices behaves in a circular normal distribution, and 2) that the
error's x and y components are independent. The model produces realized vector files
through the use of two normally distributed error grids: one each for the x and y components.
By overlaying the error grids with the original vector data, a distorted but equally likely
version is created.
Hunter and Goodchild apply the model to assess uncertainty in polygon areas,
point-in-polygon overlay, and vector-to-raster conversion (1999a). After estimating
positional errors present in the polygon vertices based on source map scale and
digitization trials, 20 realizations of the polygon coverage were created. To represent
uncertainty in reported areas, they calculated the mean and standard deviation for each
polygon.
The vector uncertainty model of Hunter and Goodchild is available as AML and
C code on the author's website. The software produces data realizations from
ARC/INFO coverages based on estimated input errors. Because the positional errors are
generated and stored in the ARC/INFO GRID format, the algorithm has several
limitations. The user is required to choose a grid-cell spacing as input to the model. Any
points, lines, and nodes falling within a single grid cell will have identical error
distributions applied. If the grid spacing is too large, the realizations of certain features
will be completely autocorrelated. If the spacing is too small, the algorithm slows
considerably as x,y coordinate shifts are unnecessarily calculated for cells having no
corresponding vector data. Furthermore, the application does not allow for any spatial
analysis to be carried out, nor does it automate the calculation of uncertainty in derived
attributes.
This thesis proposes a similar but improved vector uncertainty model, discussed
in depth in Chapter 4. The proposed model avoids the use of GRID by generating and
applying random errors in C. The choice of grid-cell spacing is avoided, ensuring all
points, vertices, and nodes have independent random error. In addition, the proposed
model facilitates the application of any generic analysis and the automatic calculation and
linkage of uncertainty information to GIS outputs.
CHAPTER 3
EROSION HAZARDS STUDY
3001, Inc. recently completed a substantial GIS analysis as an integral part of the
national Evaluation of Erosion Hazards study conducted by the Federal Emergency
Management Agency (FEMA). The goals of the project were to evaluate the "economic
impact analysis of erosion on coastal communities and on the NFIP [National Flood
Insurance Program]" (FEMA, 1998, p. 1). This study compiled over 10 thematic layers
for 27 counties in 18 coastal and Great Lakes states. 3001, Inc. collected GPS data on
sampled coastal structures and attached over 100 descriptive attributes by overlaying
assessment, parcel, flood zone, and erosion hazard layers. Using historical erosion rates,
the study predicted relative positions of data layers over a 60-year period.
Data Layers
This analysis focuses on three fundamental data layers from the Erosion Hazards
Study: Structures, the Erosion Hazard Area, and Flood Zones. Figure 1 shows an aerial
photograph over a sample study area in Dare County, North Carolina. The three data
layers are overlaid on the image and depicted individually in Figure 2.
Figure 1: Sample source data layers in Dare County, North Carolina.
Figure 2: Erosion Hazards Study source data layers: a) Structures; b) Erosion Hazard
Area; c) Flood Zones of Base Flood Elevation (BFE).
The Structure data set (Figure 2a) is a point coverage of sampled coastal
buildings, collected with real-time differentially corrected GPS (3001, 1998a). The GPS
observation was usually made from the street, from which a bearing and offset were
applied to shift the point to the building. Although numerous structure attributes were
collected for the study such as house number, address, and structure condition, only the
x,y coordinates of the structure are of interest to this research.
The Erosion Hazard Area (EHA) (Figure 2b) is a polygon that represents "an area
where erosion or avulsion is likely to result in damage to or loss of buildings and
infrastructure within a 60-year period" (FEMA, 1998, p. 1). The EHA is bordered on the
oceanward side by the Current Erosion Reference Feature (CERF). The CERF line is a
primary indicator of erosion that represents the "receding edge of a bluff or eroding
frontal dune or the normal high-water line, or the seaward line of permanent vegetation"
(3001, 1998b, p. 1). The landward extent of the EHA is the Projected Erosion Reference
Feature (PERF) line, representing the estimated 60-year position of the ERF based on
historical erosion rates. It is expected that the CERF will migrate to the position of the
PERF over the 60-year period.
Figure 2c shows the Flood Zone layer derived from FEMA's Flood Insurance
Rate Maps (FIRM). FIRM maps identify degrees of risk in flood-prone areas. This data
set is a polygon coverage delineating zones of varying Base Flood Elevation (BFE). BFE
is an estimate of the water height in the event of a 100-year flood, decreasing in the
landward direction from the coastline. For example, if a person were standing at a point
with an elevation of 10 feet, in a zone with a BFE of 15, the person would be
standing in five or more feet of water during a 100-year flood (3001, 1998c). All BFE
values are in feet relative to the vertical datum of NGVD29.
Erosion Hazards Analysis
The portion of the Erosion Hazards Study of relevance to this research involves
the attachment of BFE values to coastal structures. Finding the BFE of each structure is a
simple case of vector overlay; however, two additional analyses were carried out which
complicated the study. The first task involved interpolating the location of flood zones
over the 60-year span. The second task was an attempt to represent the stepped nature of
polygon BFEs as a more realistic smooth line.
Just as the CERF is expected to progress into the PERF over the 60-year span, the
flood zone polygons are also expected to move accordingly. In order to estimate the
BFEs of structures 60 years into the future, flood zone polygons needed to be shifted
landward, the magnitude and direction of which being controlled by the erosion hazard
progression. Ideally, this was to be accomplished by digitally migrating the flood zone
polygons in 10-year intervals, resulting in six additional datasets (10, 20, 30, 40, 50, and
60 yr), allowing each structure's BFE to be determined over the 60-year period by overlay.
However, the algorithms involved in such a nonuniform migration of flood zone
polygons proved to be too complex and generated numerous topological problems.
Project members agreed that instead of shifting polygon zones landward, the same effect
could be accomplished by migrating the structure points oceanward. This migration would
also follow the erosion hazard progression, although in the opposite direction. Although
the structures do not in reality move toward the ocean, the relative temporal-spatial
relationships between the structures and flood zones could be realized while allowing
complex polygon layers to remain static.
To facilitate this relative structure migration, a transect was defined as a line
through the structure perpendicular to the CERF. This was accomplished in ARC/INFO
with the NEAR algorithm, which finds the position on the line nearest to the structure,
therefore resulting in the perpendicular point of intersection. Each structure was assigned
an EHA width, defined as the transect length falling within the EHA (Figure 3). It is
expected that a structure's erosion risk will progress along the transect over the
width of the EHA. Once the EHA width is found, and the azimuth of the transect line is
known, projected structure positions are calculated along the transect such that the 60-
year structure is the distance of the EHA width from the original position, and the 10-year
interval positions are equally spread in between (Figure 4). Figure 5 shows how the
geometry of the EHA controls the magnitude and direction of the progression, and in
turn, the projected structure positions. By overlaying each structure with the flood zone
polygons, the structure's BFE can be determined over the 60-year period. During the
design phase of this algorithm, tests were conducted that verified this method returned the
same result as shifting flood zone boundaries landward.
Figure 3: Structure transects and EHA width.
Figure 4: Projected structure positions for years 0, 10, 20, 30, 40, 50, and 60.
Figure 5: Projected structure positions from Glynn County, Georgia.
The representation of BFE as both integer and polygon-based values incorrectly
models its true nature. A sample profile along a structure's transect (Figure 6) reveals
that at flood zone boundaries, the BFE will jump to a new level. In reality, however, it is
expected that the BFE will gradually decrease as it progresses landward. A model was
introduced to smooth the stepped nature of BFE into a more realistic representation,
referred to as interpolated BFE. To distinguish between the two measures of BFE, the
original, zone-based values will be referred to as nominal BFE (Figure 6).
Figure 6: Profile along structure transect showing nominal and interpolated BFEs.
The calculation of interpolated BFEs was accomplished in ARC/INFO using a
Triangulated Irregular Network (TIN). The coded polygon BFE was assumed to
represent the elevation at the zone's center. Bordering polygon arcs were coded with the
average of adjacent polygon BFE values and converted into a TIN with the ARCTIN
function. Figure 7 shows the conversion of original BFE polygon codes to coded arc
elevations (treated as contours) resulting in a representation of interpolated BFE.
Because the TIN model is a continuous surface, it allowed the derivation of interpolated
BFE anywhere in the study area.
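The arc-coding and interpolation idea can be sketched as follows. This is an illustrative simplification, not the ARC/INFO TIN implementation: it assumes that along a transect between two coded arcs the TIN surface varies approximately linearly, and the function names are hypothetical.

```c
/* Sketch of the interpolated-BFE idea.  Bordering arcs are coded with
 * the average of the adjacent polygon BFEs; between two coded arcs the
 * surface is treated as a linear blend of the two arc elevations. */

/* Elevation assigned to an arc separating two flood zones. */
double arc_elevation(double bfe_left, double bfe_right)
{
    return 0.5 * (bfe_left + bfe_right);
}

/* Linear interpolation between two arc elevations; t in [0, 1] is the
 * normalized position along the transect between the arcs. */
double interpolated_bfe(double e_near, double e_far, double t)
{
    return e_near + t * (e_far - e_near);
}
```

For example, an arc between a BFE-10 zone and a BFE-13 zone would be coded 11.5, and a structure halfway between that arc and a 14.5 arc would receive an interpolated BFE of 13.0.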
Figure 7: Interpolated BFE TIN surface and arc elevations.
By overlaying the TIN with projected structure positions at each 10-year interval,
interpolated BFE values were derived for each structure over the 60-year period. Figure
8 shows the resulting spatial relationships of all output data. Once the projected structure
positions are calculated, nominal and interpolated BFE values are attached based on
spatial overlay with the flood zones and TIN surface, respectively. The script fema.aml
provided in the Appendix was written to carry out the Erosion Hazards analysis. Given
the three source coverages representing Structures, Flood Zones, and the Erosion Hazard
Area, the AML calculates nominal and interpolated BFEs for each structure over the 60-
year span at 10-year intervals. Table 1 gives sample output for a single structure. The
next chapter describes the methods used to estimate uncertainty in these values.
Figure 8: Spatial distribution of output data.
Table 1: Sample structure output data.
Year              0       10      20      30      40      50      60
Nominal BFE       10      10      10      10      10      13      13
Interpolated BFE  10.977  11.088  11.200  11.315  11.429  11.617  11.925
CHAPTER 4
METHODS
In order to estimate the uncertainty in nominal and interpolated BFEs, an error
model was developed and formalized into an application referred to as the Positional
Error Simulator (PES). PES is a menu-driven application for NT ARC/INFO version 7.2
written primarily in the Arc Macro Language (AML). Although this application will be
applied specifically to the Erosion Hazards Study, it is a generic error model designed to
work with any set of vector coverages and any analysis driven by AML. Various
component scripts were written in AML, C, and Avenue to create the PES model. AML
and C source code is provided in the Appendix. All code and a brief tutorial are provided
on the author's website at http://www.surv.ufl.edu/sam or by email at
samhhenderson@hotmail.com.
This research adopts Goodchild's definition of an error model as "a stochastic
process capable of generating a population of distorted versions of the same reality"
(Goodchild, Guoqing, and Shiren, 1992, p. 87). Like Hunter and Goodchild's vector
uncertainty model, the PES model allows for the creation of random data sets. However,
the proposed algorithm is applied directly to the vector data, avoiding the choice of grid-
cell spacing and ensuring the creation of normally distributed random data. The PES
model also improves on Hunter and Goodchild's model by applying a user-defined
analysis to the perturbed data and summarizing variability in derived values.
This chapter describes the methods used for estimating uncertainty in GIS-derived
values due to positional error propagation. The first section describes algorithms in the
PES model for creating data realizations, carrying out the simulations, and statistically
summarizing uncertainty in outputs. The second section describes how this application
was applied to the Erosion Hazards Study.
Creating Data Realizations
The PES model assumes only independent random measurement error in the
position of the points, nodes, and vertices that make up vector-based digital data. Like
the models used by Hunter and Goodchild (1996), Leung and Yan (1998), and Shi
(1998), the same uncertainty existing around point features is extended to all vector data
by applying identical algorithms to the nodes and vertices of line and polygon data. Error
is introduced into each of N realizations, producing a population of hypothetical
observations. At a single point, node, or vertex, the collection of all realizations (N) will
return a circular normal distribution centered on the input coordinates.
In order to apply this model, the true nature of the digital representation must be
overlooked. The digitization models mentioned in Chapter 2 have concluded that
repeated digitization trials will generally reveal pointbased circular normal distributions.
Assuming the absence of systematic errors, by repeating the digitization trials, the mean
coordinates can be calculated and used as the most probable estimate of the true value.
However, it is highly unlikely that any GIS user has the time or budget to digitize a map a
sufficient number of times to attain most probable coordinates. Rather, a map is digitized
once, the digital version becoming one observation of a population of possible
representations. Therefore, it is incorrect to assume that by simply digitizing a line once,
one can adequately describe the error around each point or vertex by centering on it a
probability distribution. In some cases, as in surveyed parcel locations, redundant
measurements are taken and adjustments made to ensure the coordinates represented are in
fact the most probable values. However, the vast majority of GIS data has not been
collected through a rigorous surveying process. It must be assumed that the majority of
digital data are single observations of the real world. To apply the error model, this fact
must be overlooked. Although the consequences of this assumption could be substantial,
they cannot be avoided without knowledge of the actual errors existing in spatial data.
Data realizations are created in ARC/INFO through the use of generate files. A
generate file is an ASCII representation of an ARC/INFO coverage. Generate files
consist of point, vertex, node, and label identifiers, and x,y coordinate pairs. The C
program randgen.c (see Appendix) was written to create N error-perturbed generate files
based on a user-specified 1-sigma (σ) error level.
Each point, node, or vertex in the database is simulated by N possible realizations.
For each realization, a random coordinate (XR, YR) is computed from a random azimuth
and distance applied to the original position (XO, YO). Azimuth (α) is a random variable
generated by the C function rand() whose possible values are uniformly distributed
integers from 0 to 359. Theoretically, a truly random azimuth should not be restricted to
integer values. However, the 360 possible azimuths derived from this method prove to be
sufficient in the realistic application of such an error model. Distance (d) is a normally
distributed random variable calculated with the Box-Muller algorithm and scaled by the
user's input error tolerance. The random coordinates are computed as follows.

XR = XO + d·sin(α)        YR = YO + d·cos(α)
Figure 9 shows 100 point realizations created with the proposed algorithm. Assuming a
1σ error of 10 ft, approximately 68 points should fall within a 10 ft radius of the original
point. Applying standard multipliers for probable errors, 95 points should fall within
about 20 ft and 99 within about 26 ft. The cross hair in the figure shows the calculated
mean coordinates.
Figure 9: 100 point realizations generated by PES model.
Figure 10: One realization of the point-based model applied to lines and polygons.
The point-based approach above is similarly applied to the nodes and vertices of
line and polygon datasets (Figure 10). Additional functions are added to the algorithm to
retain topological relations of line and polygon data such as connectivity and adjacency.
Polygon labels are not perturbed. The original labels are copied from the source dataset
and put into the realized coverage to retain polygon topology.
Once N realized generate files are created, the feature identifier numbers are used
to relate the original attribute information (not stored in the generate files) to the realized
datasets. The end result is N copies of the original coverage complete with the original
attribute information but perturbed by positional error.
The model requires a priori knowledge of existing positional errors in source data.
This is accomplished by estimating the radius of a circle (centered on the input point)
within which 68% of the error-introduced points will fall. Four methods for deriving
such an error estimate were outlined in Chapter 2: deductive estimate, internal evidence,
comparison to source, and independent source of higher accuracy. Where rigorous
accuracy assessment tests have been conducted, these results should be used as inputs to
the simulation. When higher accuracy data are not available, estimates of input data
quality must be made.
The application of this point model to the nodes and vertices of lines and
polygons may result in the creation of topological inconsistencies or unwanted overlaps.
Hunter and Goodchild refer to these overlaps as rips (1995a) and fractures (1999a)
(Figure 11).
Figure 11: Rips in realized line data.
Rips may arise when estimated positional accuracies are greater than the distances
between line vertices. In some cases, this situation will arise as the digital data are
essentially misrepresenting the real world. For example, when stream-mode digitizing,
the operator records points at equal intervals regardless of whether or not they represent
actual changes in direction. In this case, rips can be avoided by removing extraneous
details or generalizing the data prior to creating realizations. This technique is shown in
Figure 12.
Figure 12: Application of feature generalization to remove rips.
In other cases, when estimated accuracies are large compared to line segment lengths and
line vertices do represent actual feature changes, spatial autocorrelation must be present.
To avoid rips, Hunter and Goodchild (1995a and 1996) proposed the introduction of
autocorrelation into the randomization process. By design, their method would adjust
random coordinates until a specific level of autocorrelation is reached. Figure 13 shows a
sample application of this technique.
Figure 13: Introduction of autocorrelation to avoid rips.
However, since the 1996 paper, Hunter and Goodchild have not continued development
of this technique (Hunter, 2000). In 1999b, they overcame the problem of rips by
adjusting random coordinates until no overlaps exist, without constraining the results to a
specific level of autocorrelation.
When errors are spatially autocorrelated, the realizations produced cannot be
represented by a circular normal distribution, and the assumption that only random error
exists cannot be made. Lacking practical methods to ascertain actual degrees and extent
of spatial autocorrelation, the PES model assumes only the presence of independent
random error. Therefore, data sets consisting of details more refined than error estimates
must be generalized before applying the error model.
Deriving Uncertainty Estimates
After randomization, each of N simulated data sets is copied into a unique
workspace based on the realization number. To carry out the analysis, a user-defined
AML is then run in each workspace. This produces N equally likely GIS results.
Outputs may be in the form of new or modified attributes attached to one or more
source coverages. The PES model refers to these attributes as items of interest. Typical
items may include x,y coordinates, length, area, and perimeter. Drawing on the work of
Hunter and Goodchild (1995a, 1996, 1999a, and 1999b), the proposed model also
summarizes variability in reported attributes with the mean and standard deviation
statistics. Mean values are calculated in the model to reveal error distributions not
centered on the original GIS values.
Once all realizations are created and the analysis is performed, coverage attribute
tables are exported as ASCII files to facilitate the calculation of summary statistics. The
C program fat2rrt.c (see Appendix) was written to sort attribute ASCII files into record
realization tables (RRT). An RRT table is created for every record or object in the
database. Each row of the RRT table lists a simulated output from the model. There are
therefore N rows per RRT table. Once RRT tables are created, the files are converted
back to INFO tables and passed through the STATISTICS function to calculate mean and
standard deviation. No mechanism is included for describing uncertainty in character
fields. Figure 14 diagrams how uncertainty estimates are derived from data realizations
in the case of polygon area queries.
ID   AREA   AREA MEAN   AREA STDEV
1    1000   1067        49
2    1000   1130        115
3    1000   1063        131
4    1000   990         90
Figure 14: Calculation of summary statistics.
The PES model assumes only normally distributed error in derived attributes.
When the resulting error distributions are nonnormal it may be inappropriate to apply
typical assumptions made about standard deviations, and the term should be regarded not
as a probabilistic function but as a simple indicator of error. Once the outputs are
summarized from all realizations, mean and standard deviation statistics are attached to
the attribute tables of the original data sets. This allows for the direct link of the original
GIS output and uncertainty information.
Application to Erosion Hazards Study
This research focuses on 10 sites from the Erosion Hazards study spanning three
counties: Dare, North Carolina; Glynn, Georgia; and Sussex, Delaware. These 10 sites
consist of 216 structures. The PES application was used to estimate uncertainty in
nominal and interpolated BFEs calculated for each structure over the 60-year period at
10-year intervals. Table 2 lists the estimated positional accuracies of data layers used as
input to the PES model.
Table 2: Erosion Hazards Study estimated input errors.
Coverage 1σ Error
Structures 15ft
Erosion Hazard Area (EHA) 5ft
Flood Zones 5ft
Metadata produced by 3001, Inc., states that each structure point is within 15ft of
the structure centroid, without suggesting a particular level of confidence. The metadata
also recognizes that because the structure is, in reality, 3-dimensional and, in the digital
domain, 2-dimensional, some abstraction is needed to represent the structure as a
point. Due to the difficulties of estimating the centroid, and based on the author's
personal experience with the structure data, it is more plausible to assume that the
structure falls within 15ft of the centroid at the 1σ level (i.e., 68% of the time). The flood
zone and EHA layers were stream-mode digitized from various-scale documents. In the
case of the PERF and flood zones, it is difficult to assess the accuracy of a data layer that
represents features not actually existing on the ground. Instead, a reasonable estimate of
5ft was used for these layers based on comparisons of the CERF to higher accuracy
imagery.
Before carrying out the simulations, it was necessary to remove extraneous detail
in the arcs of the EHA and ZONE coverages resulting from stream-mode digitizing. By
generalizing and/or cleaning the data at three times the error tolerance, it is highly
unlikely that the algorithm will produce overlaps or rips in realized data. The EHA and
ZONE coverages were cleaned and generalized at 15ft using the default ARC/INFO
POINTREMOVE option, which is based on the Douglas-Peucker algorithm.
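As a rough illustration of what the Douglas-Peucker algorithm does (this is a minimal Python sketch of the general technique, not the ARC/INFO implementation), vertices are discarded whenever they deviate from the anchor-floater trend line by less than the tolerance:

```python
import math

def douglas_peucker(points, tolerance):
    """Recursively simplify a polyline: interior vertices closer than
    `tolerance` to the segment joining the endpoints are dropped."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    seg_len = math.hypot(x2 - x1, y2 - y1)

    def dist(p):
        # Perpendicular distance from vertex p to the endpoint segment.
        if seg_len == 0:
            return math.hypot(p[0] - x1, p[1] - y1)
        return abs((x2 - x1) * (y1 - p[1]) - (x1 - p[0]) * (y2 - y1)) / seg_len

    index, dmax = max(((i, dist(p)) for i, p in enumerate(points[1:-1], 1)),
                      key=lambda t: t[1])
    if dmax <= tolerance:
        return [points[0], points[-1]]   # drop all interior vertices
    # Keep the farthest vertex and recurse on the two halves.
    left = douglas_peucker(points[:index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right

# Hypothetical digitized arc simplified with a 1.0-unit tolerance:
arc = [(0.0, 0.0), (1.0, 0.1), (2.0, -0.1), (3.0, 5.0), (4.0, 6.0),
       (5.0, 7.0), (6.0, 8.1), (7.0, 9.0), (8.0, 9.0), (9.0, 9.0)]
simplified = douglas_peucker(arc, 1.0)   # 5 of the 10 vertices survive
```

Generalizing at three times the input error tolerance, as above, keeps vertices far enough apart that perturbation is unlikely to fold adjacent arc segments over one another.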
The PES application was used to simulate the Erosion Hazards Study 100 times.
This produced 100 equally possible GIS outputs in the form of nominal and interpolated
BFEs for years 0, 10, 20, 30, 40, 50, and 60. For each of the attributes, the mean and
standard deviation were calculated as key measures of the resulting attribute uncertainty.
CHAPTER 5
RESULTS AND DISCUSSION
Sample Results
As an introduction to the simulation output, the results from structure #1 are first
presented. Figure 15 shows the source data layers around structure #1 and the projected
structures created by the initial run of the analysis. Figure 16 shows the initial output and
the 100 realizations of flood zones, EHA polygons, and projected structure locations.
The initial GIS output and calculated uncertainty measures are presented in Table 3. The
legend associated with Figure 16 is adopted for the remainder of this thesis.
[Figure 16 legend: original 60-year structures; 60-year structure realizations (years 0-60); original flood zone; original EHA (CERF and PERF); flood zone realizations; EHA realizations.]
Figure 16: Structure #1 initial GIS output and 100 data realizations.
Table 3: Structure #1 attribute values and estimated uncertainty.

                     Year    Value     Mean     Standard Deviation
Nominal BFE             0    10        10       0
                       10    10        10.18    0.72
                       20    13        11.56    1.51
                       30    13        12.88    0.59
                       40    13        13       0
                       50    13        13       0
                       60    13        13       0
Interpolated BFE        0    11.727    11.74    0.16
                       10    12.005    12.02    0.16
                       20    12.282    12.3     0.17
                       30    12.56     12.57    0.17
                       40    12.838    12.85    0.17
                       50    13.116    13.13    0.18
                       60    13.394    13.4     0.18
Structure #1 progresses from a nominal BFE of 10ft at years 0 and 10 to 13ft for
years 20 through 60. The largest nominal BFE error, 1.51ft, occurs at year 20. The
remainder of this thesis uses the terms error and uncertainty interchangeably to refer
to the standard deviation calculated by the PES model. The initial year-20 structure falls
very close to the border of the two flood zone polygons and is therefore subject to
point-in-polygon effects when data layers are perturbed. Figure 17 shows that 48 of the
year-20 structure realizations return a 10ft BFE, while the remaining 52 return 13ft.
The same result is presented as a histogram in Figure 18.
Figure 17: Structure #1 year-20 point-in-polygon errors.
Figure 18: Structure #1 distribution of year 20 realized nominal BFEs.
As discussed in the previous chapter, application of the standard deviation
function to non-normally distributed data may be misleading. In the case of nominal
BFE, the resulting errors are not normally distributed and the probabilistic statements
usually associated with the standard deviation statistic cannot be made. For example, it
cannot properly be stated that 68% of the time the structure returned a BFE within plus
or minus 1.5ft of the mean BFE of 11.56ft. In addition, standard multipliers cannot be
applied to attain error estimates at the 95% or 99% confidence levels. In reality, the
structure will only return one of two possible values: 10 or 13ft. Furthermore, the
statistic may underestimate the amount of error possible in nominal BFEs: although the
calculated error is 1.5ft, the possible values actually span 3ft, and the structure is almost
equally likely to return either result. However, because the resulting distributions of
possible field values may not be known, the statistic is useful as a generic indicator of
error levels. Care must therefore be taken in the interpretation of results. The errors of
0.72 and 0.59ft, produced at years 10 and 30 respectively, are also the result of
point-in-polygon effects, although less pronounced. All other years return zero error in
nominal BFEs.
The interpolated representations of BFE returned smaller errors. Point-in-polygon
effects do not occur. Rather, errors in interpolated BFEs are the result of uncertainty in
structure positions and variations in the TIN surface derived from errors in flood zone
arcs. For visualization purposes, the 100 TINs were converted to 1ft contours as shown
in Figure 19. The variability in contour locations closely resembles the uncertainty in the
parent flood zone boundaries. Any magnification of the uncertainty in these contours is
attributable to uncertainties in the derivation of the TIN model. As shown in Table 3,
errors in the interpolated BFEs increase slightly with each 10-year interval. Figure 20
shows a histogram of the realized interpolated BFE values for structure #1 at year 60
(corresponding to Figure 19). The returned interpolated BFEs approach a normal
distribution, indicating that the standard deviation statistic does provide a useful estimate
of error. Other interpolated BFE error distributions varied from normal to skewed to
bimodal.
Figure 19: Structure #1 year-60 realizations with interpolated 1ft contours.
Figure 20: Structure #1 distribution of year 60 realized interpolated BFEs.
Uncertainty in the nominal and interpolated BFEs is largely affected by the
dispersion of projected structure realizations. The geometry of the EHA and the
structure's proximity to the CERF determine the projected structure dispersion, or spread.
Figure 21 shows examples of increasing (a) and decreasing (b) projected structure
spreads. Large spreads increase the likelihood of error propagation into BFE values. In
Figure 21a, the calculation of the NEAR point (refer to Chapter 3) on the CERF returns
equally possible locations on different line segments, causing realized transects to point
in two different directions. In Figure 21b, realized structures share a common near point
(the vertex of the CERF line segments), causing projected structures to converge with
each 10-year interval.
Figure 21: Increasing and decreasing spread of projected structure realizations.
In order to compare values, a measure of circular error (similar to RMSE_CIRCULAR)
was calculated for each 10-year interval based on the standard deviations of the x and y
coordinates:

σc = √(σx² + σy²)

When σx = σy, σc provides the radius of a circle (centered on the mean coordinates) within
which approximately 63% of realized points fall (Mikhail and Ackerman, 1976). When
σx and σy are not equal, the resulting radial probability is variable. Furthermore, when
structure realizations do not follow the circular normal distribution, the circular error
loses its probabilistic definition and is retained only as a basis for comparison. Circular
errors were calculated for projected structures for each 10-year interval. At year 0, the
circular error is expected to approximate the input uncertainty of 15ft. The PES
algorithm produces a circular normal distribution such that 68% of realized structures
should fall within the estimated input error (15ft for structures). Since σc is at the 63%
level, calculated values at year 0 should be slightly less than 15ft. In the case of structure
#1, the circular error increases slightly with each 10-year interval (Table 4). Table 4 also
shows calculated circular errors corresponding to structures #185 (increasing spread)
and #53 (decreasing spread) from Figure 21.
Table 4: Sample circular errors of projected structures.
Year 0 10 20 30 40 50 60
Structure #1 14.8 15.0 15.2 15.4 15.7 16.0 16.3
Structure #185 15.1 18.6 22.2 26.1 30.3 34.6 39.1
Structure #53 14.9 13.9 12.9 12.2 11.6 11.3 11.2
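The circular error in Table 4 is straightforward to compute from the realized coordinates of a projected structure. The following Python sketch combines the x and y standard deviations as defined above; the helper name and coordinate values are hypothetical, not PES routines or study data:

```python
import math
import statistics

def circular_error(xs, ys):
    """Combine coordinate standard deviations into a circular error,
    sigma_c = sqrt(sigma_x**2 + sigma_y**2). When the two are equal,
    this is the radius (about the mean position) expected to contain
    roughly 63% of realizations."""
    return math.hypot(statistics.stdev(xs), statistics.stdev(ys))

# Hypothetical realized coordinates (in feet) for one projected structure:
xs = [100.0, 110.0, 95.0, 105.0, 90.0]
ys = [200.0, 195.0, 205.0, 210.0, 190.0]
sigma_c = circular_error(xs, ys)   # ~11.2ft for these values
```

In the study, this quantity is computed once per 10-year interval from the N realized structure positions, giving the per-year spreads reported in Table 4.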
Study-Wide Results
Initial output from the PES model is the mean and standard deviation for each
item of interest linked to the structure record. However, a general idea of the uncertainty
present in these attributes, across all structures, is desirable. This section will report on
the studywide uncertainty levels present in nominal and interpolated BFEs.
Histograms were created for each 10-year interval over the 60-year period based
on the 216 observations of nominal BFE error. Figure 22 shows the results for year 0.
About 82% of structures returned zero error in the BFE calculation. About 93.5% of
structure BFE errors fell between 0 and 0.5ft, and all BFE errors were less than 1.5ft.
Figure 22: Study-wide histogram of nominal BFE error at year 0.
Histograms were combined to show the cumulative percentages of structures falling
within nominal BFE error ranges for each 10year interval (Figure 23).
Figure 23: Study-wide nominal BFE uncertainty.
In general, nominal BFE error increases with each 10-year interval. This is expected, as
the likelihood of point-in-polygon effects increases with structure spread. Overall, 80% of
nominal BFE errors are within 0.5ft, 90% within 1ft, and all within 2.5ft.
Similarly, histograms were created for interpolated BFE errors and were combined
into a plot of cumulative percentages (Figure 24). For clarity, the figure omits the lower
80% of the cumulative range. Below an error of 0.25ft, the plots for all years fall to 0%;
in other words, no structures were free from interpolated BFE errors. All projected
structures returned an interpolated BFE error of less than 1.25ft, and about 96% of
structures had an error of 0.5ft or less. As observed with nominal BFEs, the interpolated
BFE errors also increase with each 10-year interval.
Figure 24: Study-wide interpolated BFE uncertainty.
Figure 25 is a histogram of calculated circular errors for year 60. Of the 216
year-60 structures, 70 did not exhibit circular errors greater than 15ft, attributable to the
estimated uncertainty in year-0 structures. Another 108 structures displayed minor
structure spreads between 15 and 20ft. Approximately 10% of the year-60 structures
spread significantly, at 30ft or more. In large structure spread areas, it is highly likely
that uncertainty will propagate into nominal and interpolated BFE values.
Figure 25: Study-wide structure spread at year 60.
High Uncertainty Cases
Nominal BFEs will contain the most uncertainty when there is a high probability
that perturbations in the structure points and zone polygons will return different elevations
during polygon overlay. This is increasingly likely in large structure spread areas. The
highest nominal BFE error in the database was 2.5ft, occurring at structure #132 in year
40, where 60% of the realizations return a BFE of 15ft while the remaining 40%
return a BFE of 20ft (Figure 26). There are two primary reasons for such high
uncertainty at this point. First, due to the initial projected structure's proximity to the
zone border, the realized structures are evenly dispersed into neighboring zones. The
second factor is the large step across the zone boundary from a BFE of 15 to 20ft.
Figure 26: Structure #132, year-20 realizations.
High uncertainty in interpolated BFEs can be attributed to large structure spreads,
which make it more likely for projected structures to fall in different elevation ranges. The
largest interpolated BFE error, 1.25ft, occurred at structure #26 in year 60 (Figure 27).
In the figure, nominal BFE values are shown in black while coded arc elevations for the
interpolated TIN are shown in white. The projected structure realizations are sent in two
different directions. At year 60, the majority of structures fall in the center of a BFE zone
of 12ft. Interpolated BFE values for this lower group are clustered around 12.0ft. The
upper group of year-60 structures returned interpolated BFEs varying only slightly from
15.9ft. The histogram in Figure 28 shows the distribution of realized interpolated BFEs
for year 60. Although each cluster of year-60 structures independently revealed relatively
precise measurements of interpolated BFE, taken together, the interpolated BFE
error is 1.25ft. The 10% of circular errors over 30ft (as shown in Figure 25) were caused
by similar cases of structure spread.
Figure 27: Structure #26 realizations exhibiting large year-60 interpolated BFE error.
Figure 28: Structure #26 distribution of year 60 realized interpolated BFEs.
As previously observed with nominal BFEs, the error distribution shown in
Figure 28 is not normal, and the use of the standard deviation as a summary statistic
disguises the actual nature of error at this point. Although the calculated standard
deviation is 1.25ft, interpolated BFEs as far apart as 4ft are actually likely to be returned.
Usefulness
The previous section has shown that simulation methods can be applied to
complex GIS analyses to derive estimates of uncertainty in reported values. In the
Erosion Hazards Analysis, 90% of nominal BFEs returned errors of 1ft or less and 90%
of interpolated BFEs had less than 0.5ft of error. Certain structures in the analysis
revealed large and unanticipated errors: nominal and interpolated errors were as high as
2.5 and 1.25ft respectively. However, the realized error distribution must also be
considered when evaluating the usefulness of estimated uncertainty measures. When
output errors are not normally distributed, the mean and standard deviation statistics fail
to adequately describe the nature of uncertainty in derived attributes.
A distinction must be drawn between absolute uncertainty and uncertainty
existing within the GIS model. Output from the PES application reports only variability
arising from estimated uncertainties in the positions of input data. In reality, there are
numerous other sources of error also propagating into final values. Furthermore, the
models themselves may be misrepresenting the real world. For example, although the
majority of interpolated BFEs were found to contain less than 0.25ft of error, these values
are just smoothed representations of, and derived directly from, nominal BFEs. It is
incorrect to assume that interpolated values are more accurate than nominal values.
The largest encumbrance of the proposed error model is the required computer
processing time. The Erosion Hazards error analysis was performed on 600MHz
Pentium III computers with 128MB of memory running Windows NT and ARC/INFO
version 7.2.1. The original analysis, carried out by fema.aml, took from 5 to 10 minutes
depending on the site. Running the PES model with 100 iterations took from 8 to 18
hours per site. As expected, the time required for the error analysis is expanded N times,
plus the time for creating data realizations and summarizing output. Although the time
added for perturbing data sets was minor, the ARC/INFO STATISTICS function, which
operated on INFO tables, was considerably slower than desired. The storage required for
the PES realizations and summary statistics is approximately 4N times the original disk
space, though this will vary by data type and complexity. Although time and storage
restrictions are considerable, these constraints are becoming less problematic with
advances in computer hardware.
Simulation results can improve the usefulness of data quality statements now
required in metadata standards. Where no other basis for comparison exists, as in the
Erosion Hazards Study, the data producer usually estimates the uncertainty in derived
values. Although estimation may be adequate in many analyses, complex layer
interactions may produce large and unexpected errors. In these cases, a realistic
representation of output error can only be attained through simulations.
Uncertainty data can be used to appropriately represent data variability by
limiting the reporting precision of output values. For example, in the Erosion Hazards
Analysis, the TINSPOT function returned interpolated BFE values to one-thousandth of
a foot. After simulating the data and quantifying the variability in elevations, it might be
more appropriate to display the data to only hundredths or tenths of a foot.
Subsequent users of the data then have a rough idea of the data's inherent accuracy
simply from the number of significant figures.
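One way to act on this is to derive the number of reported decimal places directly from the simulated standard deviation. The helper below is a hypothetical Python sketch of that idea, not part of the PES application:

```python
import math

def round_to_uncertainty(value, sigma):
    """Report a value only to the decimal place supported by its
    simulated standard deviation, so that trailing digits do not
    imply spurious precision."""
    if sigma <= 0:
        return value
    # Number of meaningful decimal places implied by sigma:
    decimals = max(0, -math.floor(math.log10(sigma)))
    return round(value, decimals)

# A TINSPOT-style BFE of 12.282ft with ~0.17ft of simulated error is
# better reported to tenths of a foot:
reported = round_to_uncertainty(12.282, 0.17)   # 12.3
```

With a 0.017ft error the same value would be reported as 12.28ft, so the displayed precision tracks the layer's demonstrated uncertainty.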
As proposed by Veregin (1994), simulation modeling can be used as the basis for
mathematical error propagation models. By developing descriptive models based on
simulation output, error propagation can be assessed in similar systems, without
performing time-consuming simulations. The application of the PES model to the
Erosion Hazards Study has led to several observations that support this technique. For
example, output uncertainty was largely driven by projected structure spread values.
Structure spreads could be predicted by variables such as distance to the CERF and the
concavity or convexity of the CERF line. Another model parameter could be the distance
to zone boundaries, where larger errors are more likely. In addition, these parameters
could be introduced as dummy variables.
The error model can also be used during the design phases of a GIS project. By
applying such a model in a pilot study, expected levels of output uncertainty can be
estimated and data collection and/or processing methods can be revised to more directly
meet project requirements. In addition, the error analysis can be used in a suitability
framework: by using a trial-and-error approach, input error levels and algorithms can
be modified until required accuracy levels are reached. Heuvelink, Burrough, and Stein
(1989, p. 13) recognize that this information can be "potentially of great value for
resource managers who need to get reliable information for realistic budgets". This
research specifically looks into one such application proposed by Openshaw (1989): the
isolation of output uncertainty derived from individual data layers.
Openshaw (1989) states that by considering error in a single source layer and
holding all other layers errorfree, the contribution of that layer to output uncertainty
levels can be identified. The goal of such a study is to determine if the improvement of a
single data layer could significantly reduce uncertainty in GIS outputs. Such information
could prove invaluable for determining appropriate data collection techniques. If levels
of output uncertainty are unacceptable, this approach could identify data layers whose
improvement would most benefit study results. Conversely, if errors in outputs are
negligible, the project may benefit from a degradation of input data quality by
substituting more cost-effective data collection methods.
Isolating Error Contributors
To isolate the degree of output uncertainty contributed by the presence of
positional error in each source coverage, the 100 simulations were repeated three times,
each time considering positional error in only one of the source coverages. The original
error analysis, considering errors in all three layers, will be referred to as the complete
analysis. Results from the complete model were compared with the individual layer
analyses, termed onlyeha, onlystru, and onlyzone.
When only considering positional error in the structures layer, all uncertainty in
nominal and interpolated BFEs is derived from the structure spread. BFE errors will
magnify with the structure's proximity to zone boundaries and in areas of high structure
spread.
In the onlyeha analysis, uncertainty in nominal BFEs depends on shifts in the
EHA polygon and the resulting projected structure spreads. Even though the structures
layer is not perturbed, shifts in the CERF line of the EHA polygon will cause the NEAR
algorithm to produce variable projected structure positions (Figure 29). Because the flood
zones are held error-free, uncertainty in the interpolated BFEs results only from this
slight structure spread.
Figure 29: Structure #1 onlyeha realizations.
In the onlyzone analysis, the structures and the EHA are error-free, producing no
variability in the positions of projected structures. Any uncertainty in nominal and
interpolated BFEs is caused by the shifting of flood zone boundaries.
Study-wide average BFE errors were calculated for the four runs of the error analysis
(complete, onlyzone, onlyeha, and onlystru). A plot of average error against year (Figure
30) reveals that error in nominal BFEs is almost completely explained by positional errors
in the structures layer. The inclusion of zone and EHA errors contributes minimal
amounts of uncertainty to final BFEs.
Figure 30: Average nominal BFE uncertainty for isolated runs.
Similarly, study-wide averages of interpolated BFE errors were calculated for
each run (Figure 31). Interpolated BFE uncertainty is also explained predominantly by
uncertainty in the structures layer. The effects from the zone and EHA layers are smaller
but significant. When only error in structures is considered, variability in interpolated
BFEs is attributable only to structure spread; because the zones are held fixed, the 100
realized TIN surfaces are exactly the same throughout the analysis. When only the EHA
layer is considered, the structure spread is again the only contributor to interpolated BFE
errors. When only zones are considered, the TIN surfaces are themselves shifted while
structures remain fixed.
Figure 31: Average interpolated BFE uncertainty for isolated runs.
To compare structure spread values, study-wide averages of circular error were
calculated for each run (Figure 32). In the complete case, circular errors progress from
15ft at year 0 (as expected) to approximately 21ft at year 60. When only structure errors
are considered, circular errors follow the complete case very closely, differing only in
that the complete run contains an additional level of uncertainty due to variability
introduced by perturbation of the EHA. Considering only errors in the EHA, structure
spread progresses from no error at year 0 to about 8ft at year 60. The onlyzone run does
not alter the positions of projected structures.
Figure 32: Average structure spreads for isolated runs.
Assuming only the presence of uncorrelated random error in derived uncertainty
estimates, the square root of the sum of squared errors from the individual layers can
approximate how uncertainty is propagated through interaction effects during the
complete analysis:

σEXPECTED ≈ √(σ²ONLYSTRU + σ²ONLYEHA + σ²ONLYZONE)

Although the actual degree of correlation between layers is unknown, comparing the
expected combination of errors against the observed or complete value allows the
identification of cases where errors accumulate in an unexpected manner. To facilitate
this comparison, the parameter Δ was calculated as follows:

Δ = σCOMPLETE − σEXPECTED
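Numerically, these two formulas amount to a root-sum-square followed by a difference. The minimal Python sketch below (function names hypothetical, not part of the PES application) is checked against the structure #1 values from Table 5:

```python
import math

def expected_sigma(only_stru, only_eha, only_zone):
    """Root-sum-square combination of errors from the three isolated
    runs, assuming uncorrelated random error between layers."""
    return math.sqrt(only_stru**2 + only_eha**2 + only_zone**2)

def delta(complete, expected):
    """Positive when layer interactions magnify output error beyond
    the uncorrelated expectation; negative when they compensate."""
    return complete - expected

# Structure #1, year-60 interpolated BFE errors (values from Table 5):
exp_sigma = expected_sigma(0.15, 0.05, 0.03)   # ~0.16ft
d = delta(0.18, exp_sigma)                     # ~ +0.02ft
```

The same two-step calculation yields the Δ values tabulated for structures #79 and #102 below.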
Using the results from structure #1, as shown in Table 5, the complete run returned
an error of 0.18ft for the interpolated BFE at year 60. Applying the formulas above,
σEXPECTED = 0.16ft and Δ = +0.02ft. Because Δ is positive, the actual interaction of data
layers returns error levels slightly larger than expected.
Table 5: Structure #1, year-60 interpolated BFE errors for complete and isolated runs.
σCOMPLETE     0.18
σONLYEHA      0.05
σONLYSTRU     0.15
σONLYZONE     0.03
σEXPECTED     0.16
Δ            +0.02
Δ values were calculated for all nominal and interpolated BFEs. A histogram of Δ
values for the interpolated BFE at year 60 is presented in Figure 33. Approximately 90%
of structure errors accumulate as expected, as their Δ values are reasonably close to zero
(+/− 0.05ft). The normally distributed nature of the Δ values indicates that it is equally
likely for errors to accumulate or compensate during the actual interaction of data layers.
Years 0 through 50 produced similar distributions of Δ values for nominal and
interpolated BFEs.
Figure 33: Histogram of Δ values for interpolated BFE at year 60.
Positive Δ values reveal cases where expected errors are less than those observed
in the complete run. In these cases, interaction effects magnify output uncertainty.
Conversely, negative Δ values occur when the expected error is greater than the observed
level, revealing situations where layer interactions diminish output uncertainty. In these
cases, errors from the different layers tend to compensate.
Using the metrics introduced above, structure #79 exhibits the largest case of
error expansion. Based on the results given in Table 6, the expected error is 0.44ft.
However, the layer interactions of the complete model produced an observed value of
0.65ft, for a Δ value of +0.21ft. Figure 34 shows the projected structure positions created
for the isolated runs of structure #79. The onlyzone analysis does not produce variability
in structure positions. The onlyeha analysis (Figure 34a) returns projected structures
falling within the 14ft BFE zone and revealing little structure spread; the calculated error
is 0.05ft. When structure errors are considered in the onlystru model (Figure 34b), the
data reveal a large erosion hazard progression, projecting a single structure into
neighboring zones and magnifying the resulting error. When all layers are randomized in
the complete model (Figure 34c), several projected structures exhibit the extended
progression, returning a final uncertainty of 0.65ft, considerably larger than the expected
value of 0.44ft.
Table 6: Structure #79, year-60 interpolated BFE errors for complete and isolated runs.
σCOMPLETE     0.65
σONLYEHA      0.16
σONLYSTRU     0.41
σONLYZONE     0.05
σEXPECTED     0.44
Δ            +0.21
a) Onlyeha b) Onlystru c) Complete
Figure 34: Structure #79 comparison of complete and isolated realizations.
Structure #102 exhibits the opposite effect: the interaction of data layers in the
complete model tends to compensate the errors expected from the individual coverages.
The results for interpolated BFE errors in year 60 are shown in Table 7 and Figure 35.
The onlyzone run (Figure 35a) produces a large uncertainty of 1.07ft; the derived TINs
produce large variability since the year-60 structure is very near the zone boundary. The
onlyeha run (Figure 35b) returns very little error, as the zones are held fixed and the
structure spread is not considerable. In the onlystru run (Figure 35c), a large structure
spread produces 0.17ft of error. The expected accumulation of errors produces a value of
1.08ft, primarily due to uncertainty from the onlyzone model. However, the observed
model revealed an error of only 0.89ft, resulting in a Δ of −0.19ft. In the complete case
(Figure 35d), the combination of structure spread and zone shifts tends to cancel part of
the expected uncertainty.
Table 7: Structure #102, year-60 interpolated BFE errors for complete and isolated runs.
σCOMPLETE     0.89
σONLYEHA      0.07
σONLYSTRU     0.17
σONLYZONE     1.07
σEXPECTED     1.08
Δ            −0.19
Figure 35: Structure #102 complete and isolated outputs.
Despite these unusual cases of error expansion and compensation, approximately
90% of BFE errors accumulated as expected. Therefore, refining any of the input data
layers will improve the precision of computed quantities for the majority of structures.
However, the conclusion sought is which data layer's improvement would most benefit
the study's results. Although sophisticated models could be developed to reveal the
actual parameters of expected error contributions, the simple comparison of average
errors, together with the fact that errors do accumulate as expected, shows that the
structures layer contributes the most to output uncertainty. Study results could best be
improved by increasing the accuracy of structure locations, and to a lesser but still
significant degree by improving the zone and EHA layers. However, practical limitations
of the object's representation must also be considered. Although structure locations
could be collected to centimeter-level accuracy if desired, there is still the unavoidable
problem of representing a 2-dimensional building footprint as a single point. Arguably,
the consequent reduction in output uncertainty could be meaningless.
CHAPTER 6
CONCLUSION
The Erosion Hazards Study was a comprehensive GIS analysis that predicted
Base Flood Elevations of coastal structures in the event of a 100-year flood 60 years into
the future. In order to estimate uncertainty in reported BFEs, the Positional Error
Simulator (PES) application was written for ARC/INFO to implement Monte Carlo
simulations. The PES application simulates error in data sets, carries out spatial analyses,
and summarizes observed variability with basic statistics. The model assumes normally
distributed, independent positional errors in the points, nodes, and vertices that make up
vector data.
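The perturbation at the heart of the model can be sketched in a few lines. The Python fragment below is an illustrative restatement of the assumption just described (the actual PES application is written in ARC/INFO AML); the function name and arc coordinates are hypothetical:

```python
import random

def perturb_vertices(vertices, sigma, rng=random):
    """Create one data realization by adding independent, normally
    distributed positional error (standard deviation `sigma`, in the
    coordinate units) to every vertex of a vector feature."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for x, y in vertices]

# One realization of a flood zone arc with a 5ft (1-sigma) input error:
arc = [(0.0, 0.0), (100.0, 10.0), (200.0, 0.0)]
realization = perturb_vertices(arc, sigma=5.0)
```

Repeating this perturbation N times over every input coverage, re-running the analysis on each set of realized layers, and summarizing the outputs is the complete Monte Carlo loop the PES application implements.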
Although many researchers have noted that the error model can be applied
to complex analyses, few tests have been presented. Applying the PES model to
the Erosion Hazards Study has shown that simulations can indeed be used in a
substantial spatial analysis. Furthermore, simulations are perhaps the only practical
method for obtaining realistic uncertainty estimates. Openshaw (1989, p. 265) recognizes
"what many applications seem to need is not precise estimates of error but some
confidence that the error and uncertainty levels are not so high as to render in doubt the
validity of the results." The proposed PES software provides a practical method for
estimating the variability of GIS outputs due to positional errors, allowing GIS users to
evaluate the data's fitness for use.
Application of the PES error model to the Erosion Hazards Study allows the
original research questions to be addressed.
1. What levels and distributions of error exist in derived Base Flood Elevations?
Summary statistics showed that 90% of nominal BFEs returned errors of 1 ft or
less and 95% of interpolated BFEs had less than 0.5 ft of error. Nominal BFE errors were
largely due to point-in-polygon effects during polygon overlay. Consequently, the
resulting distributions were highly non-normal, usually returning only two possible
elevations. Interpolated BFE errors arose during the conversion of polygon zones to the
contours needed for the derivation of TIN surfaces. Observed distributions varied from
normal to skewed to bimodal. Several cases of high uncertainty were presented and
shown to arise in areas of high structure spread. Nominal and interpolated BFE errors
were as high as 2.5 and 1.25 ft, respectively.
2. What are the potential uses and limitations of the proposed error model and the
resulting uncertainty data?
The calculated uncertainty information has multiple uses. It can form the basis
for the reliable quality statements required in metadata documents. The mean and
standard deviation measures can be reported at the record level, attaching individual
measures of data quality to geographic objects. These values can then be summarized to
produce overall data quality statements.
The variability in reported values can be used to limit the reported precision of
GIS attributes. GIS functions will always produce data stored beyond their inherent
precision. Displaying an appropriate number of significant figures will allow users to
quickly assess data quality and curb inappropriate data usage.
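This idea can be sketched concretely. The rule below (a common convention, assumed here for illustration; the thesis does not prescribe a specific rounding rule) limits a reported value to the digit position of the first significant figure of its simulated standard deviation:

```python
import math

def round_to_uncertainty(value, stdev):
    """Round a value so its last reported digit matches the first
    significant digit of its standard deviation."""
    if stdev <= 0:
        return value
    # digit position of the first significant figure of stdev
    digits = -int(math.floor(math.log10(stdev)))
    return round(value, digits)

# A BFE of 7.3268 ft with a simulated 0.12 ft standard deviation
# is meaningfully reported only to the nearest 0.1 ft.
print(round_to_uncertainty(7.3268, 0.12))  # 7.3
```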
Simulations not only provide quantitative estimates of uncertainty but also allow
for the visualization of error. Arguably, visual output is more valuable than quantitative
summary statistics. Seeing hypothetical data paths allows the user to identify how errors
occur and how they impact the study. Furthermore, the visualization of error may lead to
the identification of alternative, less error-prone processing methods.
Simulation output can be used to develop formal models of error propagation
(Veregin, 1994). By making general observations about the nature of error, parameters
can be introduced to model the situations in which errors arise. These models can then be
applied to similar functions to develop error estimates without performing
time-consuming simulations.
The proposed simulation method and uncertainty data can also be used in a
suitability framework. Using a trial-and-error approach, input errors and processing
algorithms can be varied until desired levels of data quality are reached. This usage of
the error model is a valuable resource for project members who must determine
appropriate and cost-effective methods of data collection and processing. A related
application, the isolation of maximum error contributors, was investigated in depth and is
discussed below under research question 3.
Several limitations of the proposed error model and the resulting uncertainty data
have also been revealed. The assumption of independent random error cannot accurately
be made for all data sets. Currently, the model requires the use of generalization
techniques to prevent rips in linework. Several data collection and conversion processes
will lead to correlated errors, which could better be modeled by accounting for systematic
effects in realized data.
Another limitation of the proposed error model is the generic application of the
standard deviation as a summary statistic. Care must be taken in the interpretation of
output uncertainty estimates, as the value of summary statistics depends largely on
unpredictable realized error distributions. When errors in GIS-derived values are not
normally distributed, probabilistic statements about the nature of error cannot be made.
In the case of nominal BFEs, most uncertainty derived from point-in-polygon effects
returning two possible elevations. Clearly non-normal, these errors might be better
represented by a confusion matrix. Many errors in interpolated BFEs approximated the
normal distribution, while others were highly skewed. In rare cases, such as structure
#26, realizations of interpolated BFE returned two clusters of normally distributed data;
here, the uncertainty might better be represented by bimodal statistics. The error model
would benefit from more useful summary statistics based on the observed distributions.
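One way such distribution-aware statistics might be chosen is sketched below. This is a crude, hypothetical heuristic (not a method from the thesis): split the sorted realizations at the largest gap and flag bimodality when the within-cluster spread is much smaller than the overall spread.

```python
import statistics

def looks_bimodal(values, ratio=0.5):
    """Illustrative two-cluster check: split the sorted sample at its
    largest gap and flag bimodality when the larger within-cluster
    standard deviation falls well below the overall standard deviation."""
    v = sorted(values)
    gaps = [b - a for a, b in zip(v, v[1:])]
    k = gaps.index(max(gaps)) + 1          # split index at the largest gap
    lo, hi = v[:k], v[k:]
    if len(lo) < 2 or len(hi) < 2:
        return False
    within = max(statistics.pstdev(lo), statistics.pstdev(hi))
    return within < ratio * statistics.pstdev(v)

# Two tight elevation clusters, e.g. realizations near 6 ft vs 8 ft
print(looks_bimodal([6.0, 6.1, 5.9, 6.0, 8.0, 8.1, 7.9, 8.0]))  # True
```

When such a check fires, reporting the two cluster means and spreads would be more informative than a single standard deviation.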
The error model suffers from significant increases in processing time and data
storage requirements. There is therefore a need to streamline algorithms and data storage
methods. Handling summary statistics within ARC/INFO incurred additional, avoidable
processing time in the STATISTICS routine and in the storage of tabular data in the
INFO format. Coding the summary routines in C would speed up the model
considerably. In addition, storing realization and summary tables in ASCII format would
reduce disk space requirements.
The reported error levels are due only to the estimated positional accuracies of the
source layers. Other factors, such as attribute accuracy and model misspecification error,
must also be considered to fully express the uncertainty in reported values. In the
Erosion Hazards Study, simulating attribute uncertainty would involve coding polygons
with different BFE values, requiring knowledge of the expected probability distribution
and magnitude of attribute error.
3. Which data layer contributes the most to output uncertainty?
By running the simulation multiple times, each time considering positional errors
in only one layer, the uncertainty arising from each source layer was calculated.
Study-wide averages showed that error in the structure layer contributed most to output
uncertainty.
4. How do errors from individual inputs accumulate during spatial analysis?
The isolated studies allowed an investigation into the accumulation of errors
during spatial analysis. It was assumed that the total error could be approximated by the
square root of the sum of the squared errors from the individual coverages. Comparing
expected errors with those actually observed revealed that errors did accumulate as
expected in approximately 90% of cases. The remaining observations revealed instances
of error expansion (observed values higher than expected) and error compensation
(observed values lower than expected).
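The root-sum-of-squares assumption above can be stated concretely (a small sketch with hypothetical per-layer error values, not figures from the study):

```python
import math

def expected_total_error(layer_errors):
    """Approximate total error as the square root of the sum of the
    squared per-layer errors (the accumulation rule assumed above)."""
    return math.sqrt(sum(e * e for e in layer_errors))

# Hypothetical isolated-layer errors (ft) for structures, zones, EHA
expected = expected_total_error([0.30, 0.20, 0.10])
print(round(expected, 3))  # 0.374

# Error expansion: observed > expected; compensation: observed < expected
observed = 0.45
print("expansion" if observed > expected else "compensation")
```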
Addressing the above research questions has revealed that error propagation in
GIS is a complex problem. Currently, this problem can only practically be addressed
with Monte Carlo simulations. Additional methods are needed for handling correlation in
input data and presenting meaningful uncertainty information based on observed error
distributions.
APPENDIX
CODE
Fema.aml
/*
/*
/* PROGRAM: FEMA.AML
/*
/* This program carries out the reduced Erosion Hazards
/* Analysis by attaching nominal and interpolated BFEs to
/* structures and their projected 60-year positions.
/*
/* This program is based on a similar application written by
/* the author for 3001, Inc.
/*
/*
/* INPUT: Program assumes 3 source coverages exist in current
/* workspace: stru, zone, eha
/*
/* OUTPUT: Stru coverage with positions of projected structures
/* nominal and interpolated BFEs at 10-year intervals
/*
/* AUTHOR: Sam Henderson
/*
/* DATE: 4/2000
/*
/* CALLS:
/*
/* NOTES:
/*
/*
&severity &error &routine BAILOUT
precision double
&call EHA_TO_CERF_PERF_BOX
&call DEFINE_TRANSECT
&call CALC_EHA_WIDTH
&call CREATE_PROJECTED_STRUCTURES
&call CREATE_BFE_SURFACE
&call OVERLAY
&call EXIT
&return
/*
&routine EHA_TO_CERF_PERF_BOX
/*
ae
&call SMALLTOL
ec eha
ef arc
sel type = 'cerf'
put cerf
sel type = 'perf'
put perf
sel type = ''
put box
save all yes
quit
build cerf line
build perf line
build box poly
&return
/*
&routine DEFINE_TRANSECT
/*
/* STEP 1: Calculate structure transect line through
/* structure perpendicular to the CERF using proximity analysis (NEAR)
/* first, addxy to structures cover
addxy stru
tables
sel stru.pat
alter x-coord x0;;;;;;;;
alter y-coord y0;;;;;;;;
quit
/* tack on coords of near point
near stru cerf line 100000 # location
tables
sel stru.pat
alter x-coord x_near;;;;;;;;
alter y-coord y_near;;;;;;;;
alter distance disterf;;;;;;;
quit
dropitem stru.pat stru.pat cerf#
/* store vars for transect from and to coordinates
ae
&call SMALLTOL
ec stru
ef point
sel all
cursor open
&s i = 0
&do &while %:edit.aml$next%
&s id%i% = %:edit.stru-id%
&s fromx%i% = %:edit.x0%
&s fromy%i% = %:edit.y0%
&s tox%i% = %:edit.x_near%
&s toy%i% = %:edit.y_near%
&s i = [calc %i% + 1]
cursor next
&end
&s numis = [calc %i% - 1]
/* Create line cover for each one, extend and put into transect coverage
create tran cerf
&do i = 0 &to %numis%
&s linecov = [scratchname prefix xxfema dir]
create %linecov% cerf
coordinate key
ef arc
createattributes
add
2,[value fromx%i%],[value fromy%i%]
2,[value tox%i%],[value toy%i%]
9
coordinate mouse
cal [entryname [show ec]]-id = [value id%i%]
/* extend to meet edges
get box /* brings in box arcs
nsel
extend 10000
put tran
&if %i% gt 0 &then
y;
removeedit all yes
&end
coord mouse
save all yes
quit
build tran line
&return
/*
&routine CALC_EHA_WIDTH
/*
/* trim transects to erosion hazard area and find width.
/* overlay to set vars for lengths
identity tran eha tranout line .001
ae
&call SMALLTOL
ec tranout
ef arc
sel ineha = 'y'
cursor open
&do &while %:edit.aml$next%
&s id = %:edit.tranout-id%
&s width%id% = %:edit.length%
cursor next
&end
sel ineha = 'n'
cursor open
&do &while %:edit.aml$next%
&s id = %:edit.tranout-id%
&if not [variable width%id%] &then /* only set if not already defined
&s width%id% = 0
cursor next
&end
save all yes
quit
/* attach widths from vars to structures
ae
&call SMALLTOL
ec stru
ef point
additem ehawidth 8 18 f 5
sel all
cursor open
&do &while %:edit.aml$next%
&s id = %:edit.stru-id%
&s :edit.ehawidth = [value width%id%]
cursor next
&end
save all yes
quit
&return
/*
&routine CREATE_PROJECTED_STRUCTURES
/*
ae
&call SMALLTOL
ec stru
ef point
sel all
cursor open
&s num = 0
&do &while %:edit.aml$next%
&s num = [calc %num% + 1]
&s width = %:edit.ehawidth%
&do i = 1 &to 6
&s dist_yr%i%0 = [calc [calc %width% / 6] * %i%]
&s id = %:edit.stru-id%
&s link%num% = %id%
&s x1 = %:edit.x0%
&s y1 = %:edit.y0%
&s x2 = %:edit.x_near%
&s y2 = %:edit.y_near%
/* With these two points, INVERSE for dist, az, new point xy.
&s distance = [invdistance %x1%,%y1%,%x2%,%y2%] /* not used
&s polar = [radang [invangle %x1%,%y1%,%x2%,%y2%]]
&s az = [calc 360 - [calc %polar% - 90]]
&if %az% gt 360 &then
&s az = [calc %az% - 360]
/* move to new point down line.
&s progdist = [value dist_yr%i%0]
&s dep = [calc %progdist% * [sin [angrad %az%]]]
&s lat = [calc %progdist% * [cos [angrad %az%]]]
&s newx%num%_%i%0 = [calc %x1% + %dep%]
&s newy%num%_%i%0 = [calc %y1% + %lat%]
&end
cursor next
&end
&s totalnums = %num%
save all yes
quit
/* now, read var coords and create points.
copy stru pstr
ae
&call SMALLTOL
ec pstr
ef point
additem year 4 5 i
additem x 8 18 f5
additem y 8 18 f5
sel all /* just structures
cal year = 0
coord key
&do num = 1 &to %totalnums%
&do i = 1 &to 6
add
1, [value newx%num%_%i%0], [value newy%num%_%i%0]
cal year = %i%0
cal x = [value newx%num%_%i%0]
cal y = [value newy%num%_%i%0]
cal pstr-id = [value link%num%]
&end
&end
coord mouse
save all yes
quit
&return
/*
&routine CREATE_BFE_SURFACE
/*
build zone line
/* Add items.
&do item &list left_bfe right_bfe bfe_avg
additem zone.aat zone.aat %item% 8 12 f 3
&end
/* relate to get c bfe or p bfe on the aat for left and right poly
relate add
left
zone.pat
info
lpoly#
zone#
linear
rw;;;
relate add
right
zone.pat
info
rpoly#
zone#
linear
rw;;;;
/* populate left and right bfes through relates
arcedit
&do type &list left right
ec zone
ef arc
sel all
cal %type%_bfe = %type%//bfe
&end
save all yes
quit
/* Calculate bfe averages on arcs.
cursor aatcur declare zone arc rw
cursor aatcur open /*autoselects all
&do &while %:aatcur.aml$next%
/* Normal case. (interior)
&if %:aatcur.left_bfe% gt 0 and %:aatcur.right_bfe% gt 0 &then
&s :aatcur.bfe_avg = [calc %:aatcur.left_bfe% / 2 + %:aatcur.right_bfe% / 2]
/* Outside/border case. Left poly -9999
&if %:aatcur.left_bfe% le 0 and %:aatcur.right_bfe% gt 0 &then
&s :aatcur.bfe_avg = %:aatcur.right_bfe%
/* Outside/border case. Right poly -9999
&if %:aatcur.left_bfe% gt 0 and %:aatcur.right_bfe% le 0 &then
&s :aatcur.bfe_avg = %:aatcur.left_bfe%
/* both negative case (nodata)
&if %:aatcur.left_bfe% le 0 and %:aatcur.right_bfe% le 0 &then
&s :aatcur.bfe_avg = 666
/*go to next record
cursor aatcur next
&end
/*close & remove cursor
cursor aatcur close
cursor aatcur remove
/* If any nodata arcs exist, delete them (666)
arcedit
ec zone
ef arc
sel bfe_avg = 666
&if [show num sel] gt 0 &then
delete
save all yes
quit
build zone poly
/* Convert arc cover to tin.
arctin zone bfetin line bfe_avg .001
relate drop
left;;
relate drop
right;;
&return
/*
&routine OVERLAY
/*
/* attach interpolated bfe to projected structures
tinspot bfetin pstr intbfe
/* attach nominal flood zone and bfe info to projected structures
identity pstr zone pstr2 point .001
/* relate all new items back to structures cover
&do year &list 0 10 20 30 40 50 60
additem stru.pat stru.pat bfe%year% 3 4 i
&end
&do year &list 0 10 20 30 40 50 60
additem stru.pat stru.pat intbfe%year% 8 12 f3
&end
&do year &list 10 20 30 40 50 60
additem stru.pat stru.pat x%year% 8 18 f 5
additem stru.pat stru.pat y%year% 8 18 f 5
&end
/* make temp coverages for each year from projects structures
ae
&call SMALLTOL
ec pstr2
ef point
&do year &list 0 10 20 30 40 50 60
sel year = %year%
put pstr2_%year%
&end
save all yes
quit
/* relate items back to structures
&do year &list 0 10 20 30 40 50 60
relate add
relate%year%
pstr2_%year%.pat
info
stru-id
pstr2_%year%-id
linear
rw;;;
&end
tables
sel stru.pat
&do year &list 0 10 20 30 40 50 60
&do item &list bfe intbfe
cal %item%%year% = relate%year%//%item%
&end
&end
&do year &list 10 20 30 40 50 60
&do item &list x y
cal %item%%year% = relate%year%//%item%
&end
&end
quit
pullitems stru.pat stru.pat area,perimeter,stru#,stru-id,x0,y0,~
x_near,y_near,disterf,ehawidth,x10,y10,x20,y20,x30,y30,x40,y40,~
x50,y50,x60,y60,bfe0,bfe10,bfe20,bfe30,bfe40,bfe50,bfe60,intbfe0,~
intbfe10,intbfe20,intbfe30,intbfe40,intbfe50,intbfe60
kill pstr all
kill box all
kill tran all
dropitem tranout.aat tranout.aat tran-id tran# eha# eha-id area perimeter
rename tranout tran
dropitem pstr2.pat pstr2.pat pstr# pstr-id
rename pstr2 pstr
&do cov &list [listfile pstr2* cover]
kill %cov% all
&end
kill cerf all
kill perf all
&do cov &list [listfile xx* cover]
kill %cov% all
&end
&return
/*
&routine SMALLTOL
/*
weedtol .1
grain .1
editdistance .1
nodesnap first .1
&return
/*
&routine USAGE
/*
&type USAGE: FEMA.AML
&return &inform
/*
&routine EXIT
/*
/* Clean up and exit menu
/* Delete global vars
&return
/*
&routine BAILOUT
/*
&severity &error &ignore
&call exit
&return &warning FEMA.AML is bailing out
Pes.aml
/*
/* Positional Error Simulator
/*
/*
/* PROGRAM: PES.AML
/* "Positional Error Simulator"
/* This program is a product of research conducted by
/* Sam Henderson, University of Florida Geomatics Program.
/* Read the online documents & tutorials before using
/* http://www.surv.ufl.edu/sam
/*
/* INPUT: Names, paths of input coverages
/* 1sigma error radius for each coverage
/* Items of interest for each coverage
/* Number of iterations (realizations) (N)
/* Name of AML with arguments
/* Location of output directory
/*
/* OUTPUT: Original covers with mean & standard deviation for items
/* of interest.
/*
/* AUTHOR: Sam Henderson
/*
/* DATE: 01/2000 - 4/2000
/*
/* CALLS: pes_main.menu, pes_main.aml, pes_add.aml, pes_add.menu
/* randgen.exe, fat2rrt.exe
/* (Expects NT system variable "PESCODE" = "x:\dir" that
/* points to location of .exe files)
/*
/* NOTES:
/* 1. Tested only on Windows NT 4.0, ARC/INFO 7.2.1
/* 2. AML and MENU files should be located within the search
/* directories specified by &amlpath and &menupath.
/* 3. An NT system variable should be set to the path of the
/* EXE and APR files: Control Panel - System - Environment,
/* and set a User variable of pescode equal to the value
/* "x:\code" where x:\code is the full path to the directory that
/* contains the EXE and APR files.
/* (Do not include a backslash after the path)
/* 4. Items of interest must be numeric
/* 5. Items of interest: do not include static items (stdev = 0);
/* value must vary with each realization
/* 6. Input covers must be clean
/* 7. Number of items in any coverage limited by list function (1024 chars)
/* 8. Avoid numbers and periods "." in input cover names
/* 9. Coverages must be poly, line, or point only (
/* multiple feature types not supported)
/* 10. Max records in any feature attribute table: 1000
/* 11. Max width for FAT unload including commas: 1000
/* 12. Max number of arcs, points, nodes, and vertices is 5000
/* 13. All point coverages will have xcoord, ycoord added/updated
/* 14. Polygon coverages: must have 1 label per polygon
/* 15. Polygon coverages: labels are cut and pasted into realizations
/* (may cause problem with small polys, relative to error)
/* 16. User defined AML string should contain AML name and arguments
/* (no ".aml")
/* 17. User defined AML should be simplified to assume covers in current
/* workspace (no absolute paths)
/* 18. Coverids in "input" are original, those in "input+" and other
/* output directories are modified
/* 19. Clean process assumes rectangular coordinate system
/* (don't use with geographic)
/* 20. Error is at onesigma level
/* (68% of points/vertices fall within error radius)
/* 21. Line and Polygon coverages: if any two vertices or nodes are within
/* 4 times the estimated error, there is a chance of creating new
/* records in realized coverages. This will cause scrambling of the
/* RRTables and invalid summary statistics. To avoid this problem data
/* should be cleaned, generalized or error estimates reduced.
/*
&severity &error &routine BAILOUT
/* set up environment
&call SET_ENV
/* get user input
&call GET_USER_INPUT
/* create output workspaces
&call CREATE_WORKSPACES
/* Query coverages and set vars, write file for ArcView
&call DESCRIBECOVERS
&call WRITE_AVFILE
/* randomize coverages, create coverages with error introduced
&call INTRODUCE_ERROR
/* run user aml on simulated data
&call RUN_USER_AML
/* reorganize all simulated data into single workspace
&call PERSIM_TO_ALLSIMS
/* export all attribute tables to ASCII files
&call UNLOAD_TABLES
/* reorganize per-cover to per-record (feature) txt files
&call FAT2RRT
/* convert RRT files into info tables
&call RRT2INFO
/* Run stats on rrt info files
&call STATS
/* Add record id to summary tables
&call ADD_RECID
/* join summary data to input data
&call JOIN_STATS
/* rebuild topology
&call REBUILD
/* clean up and exit
&call EXIT
&return
/*
&routine SET_ENV
/*
/* Check program
&if [show program] ne ARC &then
&return Run [before %AML$FILE% .aml] from ARC.
/* clean up any old vars
&dv .pes*
/* Start log file
&s logfile = [scratchname prefix xxpeslog file]
&watch %logfile%
&type start time: [date full]
/* Store current environment.
&s curdisplay = [show display]
display 0
&term 9999 &mouse
&s .pes$homews = [show workspace]
&if not [null [show &amlpath]] &then
&s curamlpath [show &amlpath]
&if not [null [show &menupath]] &then
&s curmenupath [show &menupath]
/* Sets arctools paths.
&run [joinfile [joinfile $ATHOME lib sub] setpaths file]
/* search for all PES amls and menus in PATH>...
&return
/*
&routine GET_USER_INPUT
/*
/* create file to hold input data list
&if not [variable .pes$mainfile] &then &do
&s .pes$mainfile = [scratchname]
&s fileunit = [open %.pes$mainfile% openstat write]
&s closestat = [close %fileunit%]
&end
/* open pes main menu to get user input
&run pes_main init # # modal
&if %.pes_main$cancel% &then
&call BAILOUT
/* read main menu file to get coverage name, error, items of interest
&s menufile = [open %.pes$mainfile% readstat read]
&s i = 1
&do &while %readstat% = 0
&s theline = [read %menufile% readstat]
&if not [null %theline%] &then &do
&s .PES$cover%i% = [extract 1 [unquote %theline%]]
&s .PES$error%i% = [extract 2 [unquote %theline%]]
&s .PES$dangle%i% = [extract 3 [unquote %theline%]]
&s .PES$fuzzy%i% = [extract 4 [unquote %theline%]]
&s .PES$genweed%i% = [extract 5 [unquote %theline%]]
&s sofar = [value .PES$cover%i%],[value .PES$error%i%],[value ~
.PES$dangle%i%],[value .PES$fuzzy%i%],[value .PES$genweed%i%],
&s .PES$ioilist%i% = [after [unquote %theline%] %sofar%]
&s i = [calc %i% + 1]
&end
&end
&s closestat = [close %menufile%]
&s .pes$numcovs = [calc %i% - 1]
/* get number of iterations
&s .PES$iterations = %.pes_main$iterations%
&s .PES$Oiterations = [calc %.PES$iterations% - 1]
/* check that aml exists
&s .PES$aml = %.pes_main$aml%
&s .PES$amllocation
&if not [null %.PES$aml%] &then &do
&s theaml = [unquote %.PES$aml%]
&if [token [unquote %.PES$aml%] count] gt 1 &then /*args exist
&s theaml = [extract 1 [unquote %.PES$aml%]]
&run [joinfile [joinfile $ATHOME misclib sub] searchpaths file] ~
aml %theaml%.aml .PES$amllocation
&if [null %.PES$amllocation%] &then
&call BAILOUT
&end
/* system var PESCODE must be set before running this aml: "x:\dir"
&s .PES$randgen [joinfile %PESCODE% randgen.exe]
&s .PES$fat2rrt [joinfile %PESCODE% fat2rrt.exe]
&s .PES$aprfile [joinfile %PESCODE% pes.apr]
/* set var for output directory
&s .pes$outdir = %.pes_main$outdir%
&if not [exists %.pes$outdir% dir] &then
&sys mkdir %.pes$outdir%
/* Send user choices to screen.
&type 
&type Input Data Summary:
&type 
&type Number of coverages: %.PES$numcovs%
&do i = 1 &to %.PES$numcovs%
&type 
&type Cover: [value .PES$cover%i%]
&type Error: [value .PES$error%i%]
&type Items: [value .PES$ioilist%i%]
&type Dangle: [value .PES$dangle%i%]
&type Fuzzy: [value .PES$fuzzy%i%]
&type Genweed: [value .PES$genweed%i%]
&end
&type
&type 
&type Processing Options:
&type 
&type Iterations: %.PES$iterations%
&type AML String: %.PES$aml%
&type 
&return
/*
&routine CREATE WORKSPACES
/*
/* set up some vars
&s .PES$inputws [joinfile %.PES$outdir% input sub]
&s .PES$tempws [joinfile %.PES$outdir% temp sub]
&s .PES$input+ws [joinfile %.PES$outdir% input+ sub]
&s .PES$realizationsws [joinfile %.PES$outdir% realizations sub]
&s .PES$realeachws [joinfile %.PES$realizationsws% each sub]
&s .PES$realallws [joinfile %.PES$realizationsws% all sub]
&s .PES$tablesws [joinfile %.PES$outdir% tables sub]
&s .PES$fattabsws [joinfile %.PES$tablesws% fat sub]
&s .PES$rrttabsws [joinfile %.PES$tablesws% rrt sub]
&s .PES$sumtabsws [joinfile %.PES$tablesws% sum sub]
/* Make output workspaces
cw %.PES$inputws%
cw %.PES$tempws%
cw %.PES$input+ws%
&sys mkdir %.PES$realizationsws%
&sys mkdir %.PES$realeachws%
cw %.PES$realallws%
&sys mkdir %.PES$tablesws%
cw %.PES$fattabsws%
cw %.PES$rrttabsws%
cw %.PES$sumtabsws%
/* create workspaces for each iteration
&do num = 0 &to %.PES$Oiterations%
cw [joinfile %.PES$realeachws% real%num% sub]
&end
/* Copy coverages to input and temp workspaces
&do i = 1 &to %.PES$numcovs%
&s ename = [entryname [value .PES$cover%i%]]
copy [value .PES$cover%i%] [joinfile %.PES$inputws% %ename%]
copy [value .PES$cover%i%] [joinfile %.PES$tempws% %ename%]
&end
/* redirect all cover vars to points to those in "tempws"
&do i = 1 &to %.PES$numcovs%
&s ename = [entryname [value .PES$cover%i%]]
&s .PES$cover%i% [joinfile %.PES$tempws% %ename% sub]
&end
&return
/*
&routine DESCRIBECOVERS
/*
/* Determine if point, line, or poly
&do i = 1 &to %.PES$numcovs%
&describe [value .PES$cover%i%]
&s .PES$cover%i%type = poly
&s .PES$cover%i%fat = pat
&s .PES$cover%i%numpolys = %DSC$POLYGONS%
&if %DSC$PAT_BYTES% le 0 &then &do
&s .PES$cover%i%type = line
&s .PES$cover%i%fat = aat
&s .PES$cover%i%numarcs = %DSC$ARCS%
&end
&if not %DSC$QTOPOLOGY% and %DSC$XAT_BYTES% gt 0 &then &do /* no polys
&if %DSC$AAT_BYTES% gt 0 &then &do /* arcs exist
&s .PES$cover%i%type = line
&s .PES$cover%i%fat = aat
&s .PES$cover%i%numarcs = %DSC$ARCS%
&end
&else &do /* Only points.
&s .PES$cover%i%type = point
&s .PES$cover%i%fat = pat
&s .PES$cover%i%numpoints = %DSC$POINTS%
&end
&end
&end
&type Coverage: [value .PES$cover%i%]
&type is type: [upcase [value .PES$cover%i%type]]
/* check if poly coverage has arc attributes,keep them
&s .PES$cover%i%pataat .FALSE.
&if [value .PES$cover%i%type] = poly &then &do
&if [exists [value .PES$cover%i%].aat info] &then &do
&s numitems = [listitem [value .PES$cover%i%].aat info xxpestmp.txt]
&if %numitems% gt 7 &then /* arc items exist
&s .PES$cover%i%pataat .TRUE.
&s .PES$cover%i%numarcs = %DSC$ARCS%
&end
&end
&end
&return
/*
&routine WRITE_AVFILE
/*
&s avfileunit = [open [joinfile %.PES$outdir% avfile.txt] openstat write]
&s writestat = [write %avfileunit% %.PES$numcovs%]
&do i = 1 &to %.PES$numcovs%
&s cov = [entryname [value .PES$cover%i%]]
&s type = [value .PES$cover%i%type]
&s writestat = [write %avfileunit% [quote %cov%,%type%]]
&end
&s writestat = [write %avfileunit% %.PES$iterations%]
&s closestat = [close %avfileunit%]
/* copy pes.apr project from PESCODE sysvar to output
&s copystat [copy %.PES$aprfile% [joinfile %.PES$outdir% pes.apr]]
&return
/*
&routine INTRODUCE_ERROR
/*
/* Pass each coverage with error (not 0) through randomization
&do i = 1 &to %.PES$numcovs%
/* Set some short variables for coverage name, path
&s type = [value .PES$cover%i%type]
&s fat = [value .PES$cover%i%fat]
&s covfull = [value .PES$cover%i%]
&s covname = [entryname [value .PES$cover%i%]]
&s error = [value .PES$error%i%]
&s dangle = [value .PES$dangle%i%]
&s fuzzy = [value .PES$fuzzy%i%]
&s genweed = [value .PES$genweed%i%]
/* clean and/or generalize coverage
&if %type% ne point &then &do
&if %genweed% ne 0 &then &do
&s tmpname = [scratchname prefix xxpes dir full]
copy %covfull% %tmpname%
kill %covfull% all
generalize %tmpname% %covfull% %genweed%
build %covfull% %type%
kill %tmpname% all
kill [joinfile %.PES$inputws% %covname%] all
copy %covfull% [joinfile %.PES$inputws% %covname%]
&end
&if %dangle% ne 0 &then &do
&s tmpname = [scratchname prefix xxpes dir full]
copy %covfull% %tmpname%
kill %covfull% all
clean %tmpname% %covfull% %dangle% %fuzzy% %type%
kill %tmpname% all
kill [joinfile %.PES$inputws% %covname%] all
copy %covfull% [joinfile %.PES$inputws% %covname%]
&end
/*update numarcs
&describe %covfull%
&s .PES$cover%i%numarcs = %DSC$ARCS%
&s .PES$cover%i%numpolys = %DSC$POLYGONS%
&end
/* build features
&select %type%
&when point
build %covfull% point
&when line
&do
build %covfull% node
build %covfull% line
&end
&when poly
&do
createlabels %covfull%
build %covfull% node
build %covfull% line
build %covfull% poly
&end
&end
/* store original userids in new item
&workspace %.PES$tempws%
tables
&if not [iteminfo %covname%.%fat% info pes_origid exists] &then
additem %covname%.%fat% pes_origid 4 5 b
sel %covname%.%fat%
cal pes_origid = %covname%-id
quit
/* Make coverid unique (=$recno)
ae
ec %covfull%
&select %type%
&when point
ef point
&when line,poly
ef line
&end
sel all
cal %covname%-id = %covname%#
save all yes
quit
/* copy covers with unique userids to input+
copy %covfull% [joinfile %.PES$input+ws% %covname% sub]
/* Store original polygon labels to build pat later
&if %type% = poly &then &do
ae
ec %covfull%
ef label
sel all
&s polylabels [scratchname prefix xxpes dir full]
put %polylabels%
save all yes
quit
&end
/* Update describe elements after build.
&describe %covfull%
/* Store number of features (records
&select %type%
&when point
&s .PES$cover%i%recs = %DSC$POINTS%
&when poly /* Exclude universe poly
&s .PES$cover%i%recs = [calc %DSC$POLYGONS% - 1]
&when line
&s .PES$cover%i%recs = %DSC$ARCS%
&end
/* Ungenerate coverage
&if %type% = poly &then
ungenerate line %covfull% %covfull%.gen
&else
ungenerate %type% %covfull% %covfull%.gen
/* copy randgen program from path defined by PESCODE sys var
&s program = [entryname %.PES$randgen%]
&s copystat [copy %.PES$randgen% [joinfile %.PES$tempws% %program%]]
&workspace %.PES$tempws%
&if %type% = line or %type% = poly &then &do
tables
sel %covname%.aat
&s formatfile = [scratchname prefix xxpes file]
unload %covname%.aat %covname%-id fnode# tnode# columnar %formatfile%
quit
&data %program%
line
%covname%.aat
%DSC$ARCS%
%DSC$MAX_NODE%
%covname%.gen
%error%
%.PES$iterations%
&end
&end
&if %type% = point &then &do
&data %program%
point
%covname%.gen
%DSC$POINTS%
%error%
%.PES$iterations%
&end
&end
&workspace %.PES$homews%
/* Generate coverages
&do num = 0 &to %.PES$Oiterations%
generate [joinfile %.PES$realeachws% ~
[joinfile real%num% %covname% sub] sub]
input [joinfile %.PES$tempws% %num%x%covname%.gen]
&if %type% = point &then
points
&else
lines
quit
&s realcov = [joinfile %.PES$realeachws% ~
[joinfile real%num% %covname% sub] sub]
&if %type% = line or %type% = poly &then
clean %realcov% # .00001 .00001 line
&if %type% = point &then &do
build %realcov% point
addxy %realcov% /* standard item now, like area,perimeter,length.
&end
&if %type% = poly &then
build %realcov% poly
/* Check same number of features after randomization, bailout if not
&describe %realcov%
&select %type%
&when poly
&do
&if %DSC$POLYGONS% ne [value .PES$cover%i%numpolys] &then &do
&type;&type [value .PES$cover%i%numpolys] polys became %DSC$POLYGONS%
&call BAILOUT
&end
&if [value .PES$cover%i%pataat] &then &do
&if %DSC$ARCS% ne [value .PES$cover%i%numarcs] &then &do
&type;&type [value .PES$cover%i%numarcs] arcs became %DSC$ARCS%
&call BAILOUT
&end
&end
&end
&when line
&do
&if %DSC$ARCS% ne [value .PES$cover%i%numarcs] &then &do
&type;&type [value .PES$cover%i%numarcs] arcs became %DSC$ARCS%
&call BAILOUT
&end
&end
&end
&end
/* link original attributes to simulated data
&if %type% = point or %type% = line &then &do
&do num = 0 &to %.PES$Oiterations%
&s realcov = [joinfile %.PES$realeachws% ~
[joinfile real%num% %covname% sub] sub]
joinitem %realcov%.%fat% %covfull%.%fat% ~
%realcov%.%fat% %covname%-id %covname%-id
&end
&end
&if %type% = poly &then &do
&do num = 0 &to %.PES$Oiterations%
&s realcov = [joinfile %.PES$realeachws% ~
[joinfile real%num% %covname% sub] sub]
joinitem %realcov%.pat %covfull%.pat %realcov%.pat %covname%-id ~
%covname%-id /* values scrambled but gets items
ae
ec %realcov%
ef label
get %polylabels%
save all yes
quit
build %realcov% poly
/* get arc attributes too.
&if [variable .pes$cover%i%pataat] &then &do
&if [value .PES$cover%i%pataat] &then
joinitem %realcov%.aat %covfull%.aat %realcov%.aat %covname%-id ~
%covname%-id
&end
&end
kill %polylabels% all
&end
&end /* end of cover loop
&return
/*
&routine RUN_USER_AML
/*
/* Run user AML on simulated data
&if not [null %.PES$aml%] &then &do
/* first run on source data in input directory
&workspace %.PES$inputws%
&run [unquote %.PES$aml%]
/* run on input+ data
&workspace %.PES$input+ws%
&run [unquote %.PES$aml%]
/* run on simulated data
&do i = 0 &to %.PES$Oiterations%
&workspace [joinfile %.PES$realeachws% real%i% sub]
&run [unquote %.PES$aml%]
&end
&end
&return
/*
&routine PERSIM_TO_ALLSIMS
/*
/* copy all simulated data into single workspace
&do w = 0 &to %.PES$Oiterations%
&do i = 1 &to %.PES$numcovs%
&s ename = [entryname [value .PES$cover%i%]]
&s inwork = [joinfile %.PES$realeachws% real%w% sub]
copy [joinfile %inwork% %ename% sub] ~
[joinfile %.PES$realallws% %ename%%w% sub]
&end
&end
&return
/*
