
Material Information

Title:
Examination of statistical relationships between highway crashes and highway geometric and operational characteristics of two-lane urban highways
Physical Description:
xii, 129 leaves : ill. ; 29 cm.
Language:
English
Creator:
Aruldhas, Jacob, 1965-
Publication Date:
1998

Subjects

Subjects / Keywords:
Highway engineering -- Statistical methods   ( lcsh )
Low-volume roads -- Research   ( lcsh )
Civil Engineering thesis, Ph. D   ( lcsh )
Dissertations, Academic -- Civil Engineering -- UF   ( lcsh )
Genre:
bibliography   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1998.
Bibliography:
Includes bibliographical references (leaves 125-128).
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Jacob Aruldhas.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 029218113
oclc - 41186490
System ID:
AA00022866:00001

Table of Contents
    Title Page (page i)
    Dedication (page ii)
    Acknowledgement (page iii)
    Table of Contents (pages iv-v)
    List of Tables (page vi)
    List of Figures (page vii)
    Legend (pages viii-ix)
    Abstract (pages x-xii)
    Chapter 1. Introduction (pages 1-6)
    Chapter 2. Literature review (pages 7-19)
    Chapter 3. Data organization (pages 20-30)
    Chapter 4. Crash distribution (pages 31-39)
    Chapter 5. Explanatory variables (pages 40-46)
    Chapter 6. Modeling strategies and the base model (pages 47-62)
    Chapter 7. Statistical modeling (pages 63-91)
    Chapter 8. Results and discussion (pages 92-112)
    Chapter 9. For further studies (pages 113-121)
    Chapter 10. Conclusions (pages 122-124)
    References (pages 125-128)
    Biographical sketch (pages 129-131)
Full Text









EXAMINATION OF STATISTICAL RELATIONSHIPS
BETWEEN HIGHWAY CRASHES AND HIGHWAY GEOMETRIC
AND OPERATIONAL CHARACTERISTICS
OF TWO-LANE URBAN HIGHWAYS
















By

JACOB ARULDHAS


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1998

























to my mother Mrs. Daisy Aruldhas who happily sacrificed
her life and everything she had to bring up her three
children, Paul Aruldhas, Dorothy John and myself in the
absence of my father, Professor M. Aruldhas, now with the
Lord, who was taken into glory when I was only four years of
age.













ACKNOWLEDGMENTS


It is with great pleasure that I thank my advisor,
Professor Joseph A. Wattleworth, for giving me the
opportunity to work on his research project, which led to the
study that I present in this dissertation. Dr. Wattleworth
served as the chairman of my committee and played the key
role in directing the studies and analysis towards obtaining
practically applicable results.
I thank Dr. Mang Tia, Dr. David Bloomquist, and
Dr. Ralph Ellis, of the Department of Civil Engineering,
University of Florida, for serving on my committee and for
giving valuable guidance concerning the objectives of the
study.
My sincere thanks to Dr. Geoffrey Wining, at the
Department of Statistics, University of Florida for all the
analytical guidance he has given in issues related to
statistical modeling.
I express my gratitude to Dr. Paul Thompson, Dr. Fazil
Najafi, Dr. Mang Tia and Dr. David Bloomquist at the
Department of Civil Engineering for advice, encouragement and
extended help during times of need.
Special thanks to my wife, Emi and my children, Eden,
Emily and Elizabeth for all the trouble they took when I had
to neglect them for the sake of this study.
And, I thank God for His wonderful presence in my life
that I have enjoyed most of the time. The joy of the Lord has
always been my strength. His peace that he has given me has
enabled me to overcome all barriers that came my way.











TABLE OF CONTENTS


                                                         Page

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . .  iii

LIST OF TABLES  . . . . . . . . . . . . . . . . . . . .  vi

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . .  vii

LEGEND  . . . . . . . . . . . . . . . . . . . . . . . .  viii

ABSTRACT  . . . . . . . . . . . . . . . . . . . . . . .  x

CHAPTERS

1  INTRODUCTION . . . . . . . . . . . . . . . . . . . .  1

      Background  . . . . . . . . . . . . . . . . . . .  1
      Scope . . . . . . . . . . . . . . . . . . . . . .  2
      Objectives  . . . . . . . . . . . . . . . . . . .  3
      Layout of this Report . . . . . . . . . . . . . .  4

2  LITERATURE REVIEW  . . . . . . . . . . . . . . . . .  7

      Lane Width  . . . . . . . . . . . . . . . . . . .  7
      Shoulder Width  . . . . . . . . . . . . . . . . .  13
      Shoulder Type . . . . . . . . . . . . . . . . . .  15
      Speed Limit . . . . . . . . . . . . . . . . . . .  17

3  DATA ORGANIZATION  . . . . . . . . . . . . . . . . .  20

      Highway Geometric Data
      Highway Accident Data
      Highway Classification
      Two-lane Urban Undivided Highways
      Data Statistics
      Visual Display of Data

4  CRASH DISTRIBUTION . . . . . . . . . . . . . . . . .  31

      Crash Classification
      Crash Frequency and Crash Rate
      Frequency Distribution
      Poisson Distribution
      Negative Binomial Distribution
      Rejection of Poisson Distribution
      Conclusion

5  EXPLANATORY VARIABLES  . . . . . . . . . . . . . . .  40

      The Highway Section
      Longitudinal Parameters
      Operational Parameters
      Cross Sectional Parameters

6  MODELING STRATEGIES & THE BASE MODEL . . . . . . . .  47

      Generalized Linear Models
      Model Statistics
      Variable Selection Procedure
      Model Performance Criteria
      Basic Assumptions and Base Model

7  STATISTICAL MODELING . . . . . . . . . . . . . . . .  63

      Representing Longitudinal Factors
      Representing Operational Factors
      Representing Cross Sectional Factors
      Identifying Significant Interactions
      The Final Model

8  RESULTS AND DISCUSSION . . . . . . . . . . . . . . .  92

      All Crashes . . . . . . . . . . . . . . . . . . .  93
      Crashes with Property Damages Only  . . . . . . .  95
      Injury Crashes  . . . . . . . . . . . . . . . . .  98
      Fatal Crashes . . . . . . . . . . . . . . . . . .  100
      A Brief Overview of the Models  . . . . . . . . .  102

9  FOR FURTHER STUDIES  . . . . . . . . . . . . . . . .  113

10 CONCLUSIONS  . . . . . . . . . . . . . . . . . . . .  122

      Conclusions . . . . . . . . . . . . . . . . . . .  123
      Limitations . . . . . . . . . . . . . . . . . . .  124

REFERENCES  . . . . . . . . . . . . . . . . . . . . . .  125

BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . .  129



LIST OF TABLES


Table page

3.1 Classification of Highways ... ............ 24
3.2 Data Statistics ....... .................. 26

7.1 Models for representing Section Length ....... 65
7.2 Models for representing Longitudinal Factors 69
7.3 Models for representing AADT ... ........... 71
7.4 Models for representing Speed Limit ....... 75
7.5 Models for representing Lane Width ......... 80
7.6 Models for representing Cross Sectional Factors 83
7.7 Models representing Second Degree Interactions 86
7.8 Observed vs. Predicted Values .. .......... 88















LIST OF FIGURES


Figure                                                   page

3.1 Unsorted & Sorted Plots of Continuous Variables
3.2 Unsorted & Sorted Plots of Categorical Variables
3.3 Unsorted & Sorted Plots of Logical Variables

4.1 Distribution of Actual Crash Data . . . . . . . . .  33
4.2 Distribution of Poisson Data  . . . . . . . . . . .  37
4.3 Distribution of Negative Binomial Data  . . . . . .  38

5.1 Pairwise Plot of Longitudinal Factors . . . . . . .  43
5.2 Pairwise Plot of Operational Factors  . . . . . . .  44
5.3 Pairwise Plot of Cross Sectional Factors  . . . . .  46

6.1 Standard Plots of the Base Model  . . . . . . . . .  61

7.1 Plots from diagnostic study

8.1 Crash Frequencies vs. Section Length
8.2 Crash Frequencies vs. Intersections
8.3 Crash Frequencies vs. AADT
8.4 Crash Frequencies vs. Speed Limit
8.5 Crash Frequencies vs. On-Street Parking
8.6 Crash Frequencies vs. Pavement Width
8.7 Crash Frequencies vs. Unpaved Shoulder Width

9.1 Total Crash Frequency vs. Section Length  . . . . .  114
9.2 Total Crash Frequency vs. Lane Width  . . . . . . .  116
9.3 Total Crash Frequency vs. Paved Shoulder  . . . . .  118
9.4 Total Crash Frequency vs. Unpaved Shoulder  . . . .  119
9.5 Total Crash Frequency vs. Raised Curb . . . . . . .  121












LEGENDS


Variables Representing Crash Frequencies:

acc       Total number of accidents
inj       Number of crashes that involve injuries
pdo       Number of crashes with property damage only
fat       Number of crashes that result in fatalities


Variables Representing Explanatory Terms:

slen      Length of highway section in 1/1000th of a mile
adt       Average Annual Daily Traffic
its       Number of intersections in the section
lw        Lane width
lwc       Lane width redefined as categorical variable
ops       Paved shoulder width
opsc      Paved shoulder width redefined as categorical variable
oups      Unpaved shoulder width
oupsc     Unpaved shoulder width redefined as categorical variable
oc        Variable to represent presence or absence of curb
fr        Coefficient of friction
spd       Speed limit


Statistical Terms:

nb        Negative binomial distribution
θ         Overdispersion factor in nb distribution
GLM       Generalized Linear Models
α         Expected value of y-intercept of regression model
β         Expected value of regressor coefficient
Df        Degrees of Freedom
AIC       Akaike's Information Criterion
D         Deviance
Resid     Residual
SS        Sum of Squares
n         Number of observations
MAD       Mean Absolute Deviation
Std Err   Standard Error
          Modeled as function of
          Represents the product term of two variables
Obs       Observed / actual value
Pred      Predicted value














Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy.


EXAMINATION OF STATISTICAL RELATIONSHIPS
BETWEEN HIGHWAY CRASHES AND HIGHWAY GEOMETRIC
AND OPERATIONAL CHARACTERISTICS
OF TWO-LANE URBAN HIGHWAYS

By

JACOB ARULDHAS

May 1998


Chairman: Dr. Joseph A Wattleworth
Major Department: Department of Civil Engineering

The accuracy and reliability of crash prediction models depend on
the validity of the assumptions made in the analysis, the definition
of the response variable and the proper representation of
influencing factors in the relationships established between crash
frequencies and the geometric and operational characteristics of
highways. Several crash prediction models have been developed by
investigators in past years. These models have helped highway
engineers design highway sections with improved safety and economy.
They have also been used in estimating the overall benefits of
highway improvement programs. Several studies reported in the
literature developed accident prediction models to understand the
effect of cross sectional parameters on crash occurrence.

In this study, crashes that occurred during the years 1988 through
1991 in the State of Florida are analyzed using statistical methods.
The analysis of this large number of observations produced some
important results. The accepted belief concerning the distribution
of crash data was found to have several inconsistencies, including
violation of basic assumptions. A fair portion of the analysis time
was used to find the best distribution function to represent crash
frequency.

Crash rate, a function of crash frequency, section length and
traffic volume, is generally considered as the response variable in
crash modeling studies. Crash rate was found to have a very weak
relationship with the explanatory variables, as indicated by high
p-values. Crash frequency was found to be the ideal response
variable, with section length and traffic volume defined as
explanatory variables.

The use of certain transformation functions, as recommended by most
past studies, was found to deteriorate model quality. Several
experiments were done to find the best transformation function to
represent the model parameters. The relationship between cross
sectional variables and crash frequency was found to be nonlinear.

The results of this analysis can be used to estimate the
contribution of each design parameter to the expected crash
frequency during any given time period for a two-lane urban highway
section. The effect of each variable on total, PDO (Property Damage
Only), injury and fatal crashes can be estimated using the models.
The procedure used in the analysis can be used to study other types
of highways.














CHAPTER 1
INTRODUCTION


Background

Crash analysis has become a much easier task in the

recent past since the state authorities have started taking

solid steps in accident data collection and management. Why

do crashes occur? How often do they happen? Can they be

measured? If they can be measured, can they be predicted? If

they cannot be measured, are they completely random? Many

attempts have been made by investigators from various

professional backgrounds to find answers to such questions.

Studies by investigators in highway engineering, transportation
planning, statistics and mathematics on the relationship between
highway geometric parameters and safety have resulted in valuable
conclusions.

The Poisson distribution is widely used to represent the crash
rate. The most commonly used predictors are lane width, shoulder
width and traffic volume. Parameters that are believed to affect
highway crashes include lane width, shoulder width, median width,
surface characteristics and slopes, drainage condition, horizontal
curvature, vertical curvature, sight distance, presence of narrow
bridges, type of access control, lighting conditions and width of
clear roadside recovery area.

Though a number of studies have been done in this area, several
assumptions and concepts concerning the relationships established
were not validated. A close examination of these models showed that
they lack uniformity. Inconsistencies were found in the distribution
assumed, the parameters considered for analysis, the functions used
for variable transformation, the variables used in the final
prediction model, and the coefficients of the parameters.



Scope

The scope of this study includes review of studies done

in the past that are relevant to highway safety and highway

crash modeling. Information obtained from the review of

literature is used as guidelines in the initial stages of

analysis.

The second stage of the study is the preparation of

data. The crash data used in the analysis consists of all

reported accidents that occurred during the period of 1989-

1992 on two-lane urban highways within the State of Florida.

Most of the parameters that are directly or indirectly

related to highway design are included in the database.









Basic assumptions concerning crash distributions should

be validated based on actual data. One entire chapter is

dedicated to identifying the appropriate distribution

function for representing crash frequency. Other issues

related to defining crash rate as the dependent variable are

also reviewed.

The regression analysis and the search for a model are usually
exhaustive. To avoid a large number of random searches, an analysis
strategy has been developed. As part of the strategy, certain norms
are also developed for testing the models. All assumptions made
based on the literature review and known concepts are validated or
rejected based on statistical inference. The models reviewed are
further diagnosed to find desirable design features.



Objectives

The primary objective of this study is to find the effect of
highway geometric and highway operational parameters on crash
frequency. The effects of these parameters on total, PDO (Property
Damage Only), injury and fatal crashes are estimated.

There are some intermediate objectives that will

address some key issues and form the foundation of the

analysis. Finding the best distribution function to









represent crash distribution, identifying the best form of

independent variables and identifying the interactions among

independent variables are intermediate objectives.

Another intermediate objective of the study is to find

out whether uniformity of highway design, represented by

section length, contributes to crash rate. If it does, then

crash frequency should be considered as the dependent

variable and section length as one of the independent

variables.
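
To make the distinction between crash rate and crash frequency concrete, the sketch below computes a crash rate from a crash frequency. It assumes the conventional exposure-based definition (crashes per million vehicle miles), which is not stated in this chapter; the function and argument names are hypothetical, and section length is taken in thousandths of a mile as in the legend.

```python
def crash_rate_per_mvm(crashes, aadt, slen_thousandths, years=1.0):
    """Crashes per million vehicle miles (assumed conventional definition).

    crashes          -- crash frequency on the section over the period
    aadt             -- average annual daily traffic (vehicles per day)
    slen_thousandths -- section length in 1/1000ths of a mile
    years            -- length of the observation period in years
    """
    miles = slen_thousandths / 1000.0
    exposure = aadt * 365.0 * years * miles  # vehicle miles travelled
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return crashes * 1_000_000.0 / exposure


# Hypothetical example: 10 crashes in one year on a 1-mile section
# carrying 10,000 vehicles per day.
print(crash_rate_per_mvm(10, 10_000, 1000))
```

Because the rate divides by section length, modeling crash rate builds in the assumption that frequency is proportional to length. Treating length as an explanatory variable of crash frequency leaves that relationship to be estimated from the data.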



Layout of this Report

The results of studies done in the past are briefly

discussed in the 2nd Chapter titled "Literature Review."

Most of the references are made to studies done in the

United States or Canada. A few studies done in other

developed countries are also included in the review of

literature.

The assumptions concerning distribution of crash

frequency are evaluated and the results are discussed in

Chapter 4. Three experiments are done on crash frequency to

test the validity of accepted assumptions and to identify

the ideal distribution function.
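
One simple way to probe the Poisson assumption is to compare the sample variance of the crash counts with their mean, since the two are equal under a Poisson distribution. The sketch below is only an illustration of that idea, not a reproduction of the three experiments in Chapter 4. The counts are hypothetical, and the relation variance = mean + mean^2/theta used for the moment estimate is one common negative binomial parameterization, assumed here.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean (close to 1 for Poisson data)."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean of counts is zero")
    return variance(counts) / m


def moment_theta(counts):
    """Method-of-moments theta, assuming variance = mean + mean**2 / theta.

    Returns None when the data are not overdispersed (variance <= mean).
    """
    m = mean(counts)
    v = variance(counts)
    if v <= m:
        return None
    return m * m / (v - m)


# Hypothetical crash frequencies for 15 highway sections.
example_counts = [0, 0, 1, 0, 3, 0, 7, 2, 0, 12, 1, 0, 4, 0, 9]
print(dispersion_index(example_counts))
print(moment_theta(example_counts))
```

A dispersion index well above 1 suggests overdispersion, which is the situation in which a negative binomial distribution is usually considered in place of the Poisson.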

All available explanatory variables, their ranges and

limits are discussed in Chapter 5. Data statistics like









mean, median, mode and quartiles are calculated and

graphical methods are used to visualize the raw data from

various perspectives.

Chapter 6 gives an introduction to the Generalized

Linear Models and briefly discusses the parameters that are

used for variable transformation and model selection.

Statistical methods used for organized regression analysis

and norms developed for testing the developed models are

discussed in this chapter. The information gathered from

literature is used to develop the base model.

Chapter 7 consists of the summaries of all stages of

regression analysis done starting from the base model to the

final model. Each independent variable is analyzed

separately and an intermediate model is selected at the

conclusion of each stage of analysis. The order in which

the variables are analyzed is prioritized based on the

relative importance of each parameter in the model as

suggested by the base model. This chapter is concluded with

the selection of the final model.

Chapter 8 presents the results. The final model selected in
Chapter 7 was developed from 75% of the data; the other 25% of the
data were used for model testing at each stage. Once the final model
is identified, the analysis data and test data are combined and the
final model is updated using all data. Three other models are
developed using the same procedure to represent injury crashes, PDO
crashes and fatal crashes respectively. These models, along with all
the model parameters, are displayed in this chapter.
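
The 75%/25% division of the data can be sketched as follows. The text does not say how sections were assigned to the two sets, so the random shuffle, the seed and the function name are assumptions made purely for illustration.

```python
import random


def split_sections(records, analysis_fraction=0.75, seed=0):
    """Split records into an analysis set and a test set.

    A seeded shuffle is assumed here; the dissertation does not state
    how the 75%/25% assignment was actually made.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * analysis_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical example with 100 section identifiers.
analysis_set, test_set = split_sections(range(100))
print(len(analysis_set), len(test_set))
```

Candidate models would be fitted on the analysis set and compared on the test set; once a model is chosen, the two sets are pooled for the final fit, as described above.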

Crash frequencies are computed at different levels of

each independent variable. The calculated crash frequencies

are plotted and displayed in this chapter. These figures

can be used to understand the prevailing trends in the

relationships established. While preparing the plot for one

variable, all other variables are held constant at their

median values.

Chapter 9 presents topics recommended for future studies.
Categorical representation of the cross sectional variables revealed
trends, which are displayed in this chapter. Lane width, paved
shoulder width and unpaved shoulder width, when treated as
categorical variables, showed that there are specific values of
these parameters at which the crash frequency is minimum.

The conclusions and limitations of this study are

briefly discussed in chapter 10 and the listing of all

literature reviewed is given in References.














CHAPTER 2
REVIEW OF LITERATURE


The results of studies done in the past on safety

analysis are categorized by parameters and discussed briefly

in the following sections.



Lane Width

The Florida Green Book [1] specifies on page III-25 that "Traffic
lanes should be 12 feet in width, but shall not be less than 10 feet
in width. Streets and highways with significant truck traffic should
have 12 feet wide traffic lanes."

Lane width was found to affect crash rates on rural two-lane
highways, particularly run-off-road, opposite direction and
sideswipe crash rates. Jorgensen [2] reviewed fifteen studies,
conducted before 1978, that dealt with the effect of lane width on
safety. Eight of these studies showed that crash rates decreased as
the lane width increased for rural two-lane highways. Another study
showed that crash rates for highways with 12 feet wide lanes did not
differ significantly from those for highways with 11 feet wide
lanes. Two other studies [2] found no relationship between crash
rates and lane width for rural two-lane highways. Three studies on
two-lane urban arterials could not find relationships between
roadway width and crashes.

The following is a summary of the most important findings of the
studies on two-lane rural highways that were reviewed by
Jorgensen [2]:

* Gupta and Jain [3] used multiple linear regression analyses to
  investigate the effects of roadway width on crash rates.
  Increasing roadway width was found to reduce multiple-vehicle
  crashes. The roadway width was found to have no effect on crash
  rates at AADT higher than 3000 vehicles per day.

* Dart and Mann [4] found that crash rates decreased as lane width
  increased up to 11 feet, then remained relatively constant.

* Cope [5] showed from a before and after study a significant
  decrease in crash rates when widening lanes from 9 feet to
  12 feet, especially at high crash sections.

* Shah [6] found a definitive relationship between pavement width
  and crash rate. The results showed that 22 feet to 24 feet wide
  pavements had fewer crashes than narrower and wider pavements.

* Shannon and Stanley [7] studied the relationship of construction
  cost, maintenance cost and crash costs as related to paved width.
  The analysis revealed a general tendency for crash rates to
  decline as pavement width increased.

For two-lane urban arterial streets, Gupta and Jain [3], Head [8]
and Mulinazzi [9] could not find a relationship between crash rates
and lane width.

Silyanov [10] evaluated international studies on two-lane two-way
highways and found that crash rates decreased as pavement width
increased for pavement widths between 13 feet and 30 feet. On wide
pavements, the crash reduction due to improvement was lower than
that on narrow pavements. Based on several international studies,
Choueeiri et al. [11] concluded that a significant decrease in crash
rates could be expected by increasing pavement width up to about
25 feet.

A study in Australia by Mclean [12] showed that the most safety
effective lane width is about 3.4 meters (11 feet). McCarthy [13]
showed from a before and after study that widening lanes on 17 sites
from 2.7 meters and 3.0 meters (9 feet and 10 feet) to 3.4 meters
and 3.7 meters (11 feet and 12 feet) resulted in a 22% reduction in
crash rate. However, Choueeiri et al. [11] reported results from a
previous study showing, contrary to expectation, that crash severity
increases as pavement width increases. They suggested that the
reason for this might be the higher operating speed on the sections
that have wider pavements.









Zegeer [14] found that the only crashes that can be expected to
decrease with lane widening were run-off-the-road (ROR) crashes and
opposite-direction (OD) crashes. He also found that only
property-damage crashes and injury crashes decreased as lane width
increased, with no change in fatality rate. Very little additional
benefit was realized by widening a lane beyond 11 feet.

In that study, an economic analysis was conducted to determine the
expected cost effectiveness of lane widening. Savings due to crash
reductions were the only benefits included in the analysis. It was
concluded that an 11-foot-wide lane may be optimal for rural
two-lane roadways.

Zegeer and Deacon [15] reviewed 30 studies performed until the
mid-1980s and concluded that no satisfactory quantitative model
relating crash rates to lane width and shoulder width could be
found. Therefore, they calibrated a new model that estimates the
most likely relationships of crashes with lane width, shoulder width
and shoulder type on two-lane rural highways. This model was derived
using data obtained from four previous studies.

\[ AR = 4.15\,(0.8907)^{L}\,(0.9562)^{S}\,(1.0026)^{LS}\,(0.9403)^{P}\,(1.004)^{LP} \]

where,

L = lane width in feet;

S = shoulder width in feet (including stabilized and unstabilized
components);

P = width in feet of stabilized component of shoulder
(0 <= P <= S), P = 0 for unstabilized shoulders and P = S for
full-width stabilization; and

AR = number of ROR and OD crashes per million vehicle miles.

The authors recognized that many assumptions were made in the
development of the above model. They considered the model as a first
approximation of the effect of lane and shoulder conditions on crash
rates. No attempt was made to determine the pavement widths that
should be used under various traffic conditions or roadway classes.
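
The Zegeer and Deacon relationship above can be evaluated directly. The following is a minimal sketch with the coefficients transcribed from the equation as printed here (the base of the P term is taken as 0.9403); the function and argument names are hypothetical.

```python
def zegeer_deacon_rate(lane_width_ft, shoulder_width_ft, stabilized_width_ft=0.0):
    """ROR and OD crashes per million vehicle miles from the model above.

    lane_width_ft       -- L, lane width in feet
    shoulder_width_ft   -- S, total shoulder width in feet
    stabilized_width_ft -- P, stabilized part of the shoulder (0 <= P <= S)
    """
    L, S, P = lane_width_ft, shoulder_width_ft, stabilized_width_ft
    if not 0.0 <= P <= S:
        raise ValueError("stabilized width must satisfy 0 <= P <= S")
    return (4.15
            * 0.8907 ** L
            * 0.9562 ** S
            * 1.0026 ** (L * S)   # lane-shoulder interaction term
            * 0.9403 ** P
            * 1.004 ** (L * P))   # lane-stabilization interaction term


# Compare 10-foot and 12-foot lanes with a 6-foot unstabilized shoulder.
print(zegeer_deacon_rate(10, 6), zegeer_deacon_rate(12, 6))
```

The interaction exponents LS and LP mean that the estimated benefit of widening a lane depends on the shoulder that is already present, which is why the three inputs are evaluated together rather than as separate factors.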

Later, Zegeer et al. [16] developed another model to quantify the
benefits of shoulder and lane improvements based on data selected
from seven states. Only two-lane roadway sites were selected. The
crash types that appeared to be highly correlated with lane width,
shoulder width and shoulder type were single vehicle (fixed object,
rollover and run-off-the-road crashes), head-on, and sideswipe
(opposite-direction and same-direction) crashes. Using regression
analysis, the following model was derived.

\[ AO = 0.0019\,(AADT)^{0.8824}\,(0.8786)^{W}\,(0.9192)^{PA}\,(0.9316)^{UP}\,(1.2365)^{H}\,(0.8822)^{TER1}\,(1.3221)^{TER2} \]

where,

AO = the number of related crashes per mile (single vehicle,
head-on and sideswipe);

AADT = average annual daily traffic;

W = lane width in feet;

PA = average paved shoulder width in feet;

UP = unpaved shoulder width in feet;

H = median roadside (or hazard) rating;

TER1 = 1 if flat, 0 otherwise; and

TER2 = 1 if mountainous, 0 otherwise.

The above study indicates that as the amount of lane widening
increases, the percentage reduction in related crashes also
increases. For lane widths between 8 and 12 feet, the first foot of
lane widening corresponds to a 12% reduction in related crashes, two
feet to a 23% reduction, three feet to a 32% reduction and four feet
to a 40% reduction.
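
These percentages follow from the lane-width factor of the model, (0.8786)^W, with all other variables held fixed: widening a lane by k feet multiplies the related crashes by 0.8786^k. The short sketch below reproduces the arithmetic; the function name is hypothetical.

```python
LANE_WIDTH_BASE = 0.8786  # base of the W term in the model above


def related_crash_reduction(widening_ft):
    """Fractional reduction in related crashes for a given lane widening."""
    return 1.0 - LANE_WIDTH_BASE ** widening_ft


# Lane widths of 8 to 12 feet allow at most 4 feet of widening.
for feet in range(1, 5):
    print(feet, round(100 * related_crash_reduction(feet)))
```

Each additional foot yields a smaller absolute gain than the one before it, because the factor is applied multiplicatively.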

The above model only applies to two-lane rural highways with lane
widths of 8 to 12 feet, shoulder widths of zero to 12 feet (paved or
unpaved) and traffic volumes of 100 to 10,000 (AADT). This model was
used to develop an informational guide [17] that enables estimation
of safety benefits of various roadway and roadside improvements.

Goldstine [18] conducted a before and after analysis on twenty-five
projects covering 152 miles of road to examine the effect of road
and shoulder widening on crash rates in New Mexico. Reductions of
38% to 53% in crash rate were observed. The study supported the TRB
Special Report 214 in its recommendation that the higher the AADT,
the wider the road should be. However, the study recommended using
even greater minimum widths.

More recently, Garber and Joshua [19] developed a logistic
regression model to describe the probability of truck involvement in
crashes as a linear logistic function of traffic and highway
variables. For undivided two and four lane highways, the most
significant variables were the slope change rate, lane width and, to
a lesser extent, shoulder width. The model derived for these types
of highways is given below.

P = 1 /(l + eOX)

:3 13.648 1.164*LW -.9095*SW -.1969*SCR + .0501*SW*SCR

where,

SW = shoulder width in feet,

SCR = slope change rate, the rate at which the longitudinal

slope changes

LW = lane width in feet, and

P = probability of large truck crash involvement.



Shoulder Width

The FDOT Green Book [1] specifies that "The width of all shoulders
should, ideally, be at least 10 feet in width. Where economical or
practical constraints are severe, it is permissible, but not
desirable, to reduce the shoulder width. Outside shoulders shall be
provided on all streets and highways with open drainage and should
be at least 6 feet wide. Facilities with a heavy total traffic
volume or a significant volume of truck traffic should have outside
shoulders at least 8 feet wide."

Previous studies that investigated the effect of shoulder width on
safety dealt with two-way rural highways. Zegeer [14] reviewed some
of these studies and found that there was a lack of correlation
between shoulder width and crash rate on two-lane rural highways
with AADT less than 2000 vehicles per day. Wide shoulders appeared
to be most beneficial where AADT is between 3000 and 5000. In
general, shoulders 4-7 feet wide were preferred to wider ones,
although some studies suggested that shoulders as wide as 10-12 feet
are the safest.

Crillio and Council [20] concluded from reviewing several studies
that increasing shoulder width up to 1.8 meters (6 feet) on
facilities with AADT greater than 1000 improved safety. However, the
benefits of increasing shoulder width above 1.8 meters (6 feet) were
not clear.

A study in Oregon [21] concluded that total crashes increased with
increasing shoulder width except for roads that have AADT between
3600 and 5500. Shoulders wider than 8 feet experienced significantly
higher crash rates than shoulders less than 8 feet wide.









Hiembach et al. [22] concluded that highway sections that have
paved shoulders are associated with lower crash rates than identical
sections that do not have shoulders.

Rogness et al. [22] compared crash frequency for the time before
and after shoulder widening. They found that the addition of
full-width paved shoulders to a two-lane roadway was effective in
reducing the total number of crashes. For AADT less than 3000, they
recommended a paved shoulder in place of an additional travel lane.
Adding paved shoulders reduced crash rate by 55% for AADT between
1000 and 3000, 21.4% for AADT between 3000 and 5000, and 0% for AADT
between 5000 and 7000.

Zegeer [24] found that ROR and OD crashes decreased as shoulder
width increased up to 9 feet for two-lane rural highways. For 10-12
feet wide shoulders, there was a slight increase in these crash
rates.



Shoulder Type

The possibility of a vehicle skidding out of control or

turning over is expected to increase when the shoulder is

soft or is covered with loose gravel, sand or mud. The FDOT

Green Book' (page V-3) specifies that shoulders "should be

capable of providing a safe path for vehicles traveling at

the roadway speed." It also specifies that "the shoulder









should be designed and constructed to provide a firm and

uniform surface, capable of supporting vehicles in distress."

Turner et al. 25 compared the crash experience on three

types of undivided highways: two-lane with unpaved shoulder,

two-lane with paved shoulder and four-lane with unpaved

shoulder. Two-lane roadway with paved shoulder was found to

be the safest and two-lane with no shoulder was found to be

the least safe.

In general, shoulder paving or stabilization is

desirable if conducted properly. Zegeer reported that the

effectiveness of shoulder stabilization depends on the need

for improvement from a safety standpoint. Based on crash

data from Ohio, they found that shoulder stabilization can

reduce crashes by 38% and injury and fatality crashes by 46%.

In another study using crash data from North Carolina, they

found that for two-lane rural highways, unpaved shoulders

resulted in higher crash rate and severity than paved

shoulders.

Foody and Long27 performed a series of analyses of

variance (ANOVA), which revealed that the differences in crash

rate between stabilized shoulders and paved shoulders were

not significant. However, the crash rate of sections having

these two shoulder types was significantly less than that of

sections that have unstabilized shoulders.









Speed Limit

The FDOT Green Book' recommends that "the design speed

should not be less than the expected posted or legal speed

limit. A design speed 5 to 10 mph greater than the posted

speed limit will compensate for a slight (and generally not

enforceable) overrunning of the speed limit by many drivers."

Jackobsberg and Danchik28 investigated the effect of

speed limit on the safety of Maryland roads. They could not

find any first order linear relationships between crashes and

physical characteristics of highways including speed limits.

Fieldwick and Brown29 compared the crash rates and speed

limits at 21 counties. Speed limits in those counties varied

between 80 km/hr (50 mph) and 120 km/hr (75 mph). Using

regression analysis, they showed that safety is sensitive to

speed limit. For example, their results suggested that

reducing rural speed limit from 100 km/hr (62 mph) to 90

km/hr (56 mph) could reduce fatalities and injuries by 11%

and 15%, respectively. The authors admitted that these

figures might include other factors (safety measures employed

by counties that use lower speed limits) not investigated in

this study. In addition, the study did not differentiate

between highway classes (freeways, two-way two-lane, etc.).

Therefore, their results should be viewed with caution.

Fieldwick and De Beer30 analyzed a monthly crash time

series between January 1972 and December 1985. The results









showed that a reduction in the urban speed limit from 60

km/hr (37 mph) to 50 km/hr (31 mph) would reduce fatal and

injury crashes by 12.3% and 14.3%, respectively.

In Texas, speed-zoning procedures rely primarily on the

85th percentile speed of traffic on a facility. Ullman and

Dudek31 investigated the argument that speed zoning below

85th percentile may be beneficial to drivers in rapidly

developing areas. Spot speed, speed profile and crash data

were collected before and after the speed limits at six urban fringe

highway sites in Texas were reduced from 55 mph (the 85th

percentile speed) to 45 mph. No changes were observed in

speeds, speed distributions, speed changing activities or

crash rates at the sites. They concluded that the lower

speed zones were not effective in improving safety at the

investigated sites.

Garber et al. 32 investigated the effect of the design

speed and the posted speed limit on safety. The types of

highways included in the study were urban interstates, rural

interstates, urban arterials, rural arterials and major rural

collectors. Thirty-six different locations in Virginia were

selected for the study. They found that the average speeds

on these highways depend on design speeds. An attempt was

made to correlate crash rates with average speed for the

different types of highway. No strong correlation was found









between crash rates and average speed for any given type of

highway.

They also found that drivers tend to travel at higher

speeds on highways with better geometric characteristics

regardless of the posted speed limit. The speed variance was

found to be a function of the difference between the design

speed and the posted speed limit. Results of regression

analysis showed that the speed variance was at a minimum when

this difference was between 5 and 10 mph. The regression

analysis also showed that crash rates increase with

increasing speed variance for all classes of roads.















CHAPTER 3
DATA ORGANIZATION


Highway Geometric Data

The Florida Department of Transportation gathers and

maintains information pertaining to all highways and streets

in the State of Florida. This database is known as the

Roadway Characteristic Inventory (RCI). Each record or line

of information in the RCI database represents one highway

section.

Some of the relevant items in these records are

location code representing the begin and end points of the

highway, lane width, paved shoulder width and unpaved

shoulder width, shoulder type, traffic volume, speed limit,

number of intersections, presence of raised curb and friction

factor. The information available about each highway section

in the RCI data is broadly classified based on location,

highway type and general characteristics of the highway.



Location:

The location code is the first nine digits of each

record, designed to geographically identify the highway

section. The location code includes county number, highway









number and mile point. Two location codes are assigned to

each highway section, one representing the beginning and the

other representing the end of the section. The length of the

section may be calculated as the difference between the

begin-mile point and the end-mile point.



Highway type:

Several numeric codes are used to represent the highway

type. Access-control, number of lanes, presence of median

and number of directions in which traffic moves are some of

them. The highway type is recognized based on this

information. For example, if the number of lanes is 2,

presence of median is 0, number of directions in which

traffic flows is 2, access control is 0, and type of location

is 1, that record represents a two-lane, urban, undivided

highway section.
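
The example rule above can be expressed as a simple check. The following Python sketch is illustrative only: the field names are hypothetical, and only the single code combination quoted in the text is taken from the source. The meaning of other code values is not assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HighwayTypeCodes:
    """Numeric type codes of one RCI record (field names are hypothetical)."""
    lanes: int
    median: int          # 0 in the example given in the text
    directions: int
    access_control: int  # 0 in the example given in the text
    location_type: int   # 1 in the example given in the text


def is_two_lane_urban_undivided(rec: HighwayTypeCodes) -> bool:
    """Return True for the code combination the text gives as an example."""
    return (
        rec.lanes == 2
        and rec.median == 0
        and rec.directions == 2
        and rec.access_control == 0
        and rec.location_type == 1
    )


example = HighwayTypeCodes(lanes=2, median=0, directions=2,
                           access_control=0, location_type=1)
four_lane = HighwayTypeCodes(lanes=4, median=0, directions=2,
                             access_control=0, location_type=1)
print(is_two_lane_urban_undivided(example))    # True
print(is_two_lane_urban_undivided(four_lane))  # False
```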



General characteristics:

The characteristics of a highway section that play an

important role in this study include cross sectional design

features like lane width, shoulder width, shoulder type, and

operational parameters like speed limit, presence of on-

street parking and traffic volume.









Highway Accident Data

The Florida Department of Transportation also maintains

another database that consists of all measurable and

representable information pertaining to each highway accident

that has been reported. The information available in the

crash data base may be broadly classified based on location

and crash characteristics.



Location:

The location code in the crash database is identical to

the location code in the RCI data. The only

difference is that in the crash data, only one location code

is required to represent the spot where the crash has

occurred. The location code common to both databases helps

to link the crash incident to the highway section in which it

has occurred.



Crash characteristics:

Crash characteristics include details about types of

crash severity, times of occurrence and weather conditions.

A subroutine was developed to merge the crash data into the

RCI data. While merging, the computer program reads the

first record in the crash data and remembers the location

code. A search is performed on the RCI data to find the









record whose begin and end location codes bracket the

crash location. When this condition is met, all built-in

crash parameters are updated in the RCI database based on the

information obtained from the crash data. At the end of

merging, the resulting database will contain exactly the same

number of records as the RCI data, regardless of the

number of records in the accident data.
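
The merging step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the subroutine used in the study: the record layouts and field names are hypothetical, only a crash count is updated, and a crash that falls exactly on a shared boundary is assigned to the first matching section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    """One RCI record; field names are hypothetical."""
    county: int
    highway: int
    begin_mp: float  # begin mile point
    end_mp: float    # end mile point
    crashes: int = 0

    @property
    def length(self) -> float:
        """Section length as the difference between the two mile points."""
        return self.end_mp - self.begin_mp


@dataclass(frozen=True)
class Crash:
    """One crash record, located by a single mile point."""
    county: int
    highway: int
    mile_point: float


def merge_crashes(sections: List[Section], crashes: List[Crash]) -> List[Section]:
    """Add each crash to the section whose mile-point range contains it."""
    for crash in crashes:
        for sec in sections:
            same_road = (sec.county, sec.highway) == (crash.county, crash.highway)
            if same_road and sec.begin_mp <= crash.mile_point <= sec.end_mp:
                sec.crashes += 1
                break  # each crash is assigned to one section only
    return sections


sections = [Section(26, 10, 0.0, 1.2), Section(26, 10, 1.2, 2.0)]
crashes = [Crash(26, 10, 0.5), Crash(26, 10, 1.5),
           Crash(26, 10, 1.7), Crash(99, 1, 0.3)]  # last crash matches no section
merged = merge_crashes(sections, crashes)
print([s.crashes for s in merged])  # [1, 2]
```

As in the text, the output has one record per highway section no matter how many crash records are read.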



Highway Classification

The highway sections in the rural areas are

operationally different from similar sections in the urban

areas. High level of pedestrian activities, large number of

access points, high traffic volumes, restricted shoulders and

the absence of a safe recovery area are characteristics of

urban highways. Therefore, urban highways and rural highways

are analyzed separately.

Two highway sections with similar geometric and traffic

parameters in the same location could still be different from

each other based on the type of access. Highways with full

access control fall under the category of freeways. The

other two types are partially access-controlled highways and

highways with no access control. Further, highways are also

classified based on the presence of median and based on the

number of lanes.












TABLE 3.1 Classification of Highways

 #   Code   Location   Access Control   Median      # Lanes

 1   uu2    urban      no control       undivided   2
 2   uu4    urban      no control       undivided   4
 3   ud4    urban      partial          divided     4
 4   uf4    urban      full             divided     4
 5   uf6    urban      full             divided     6
 6   ru2    rural      no control       undivided   2
 7   ru4    rural      no control       undivided   4
 8   rd4    rural      partial          divided     4
 9   rf4    rural      full             divided     4
10   rf6    rural      full             divided     6


Table 3.1 shows the code used to represent various

types of highways. The effect of geometric and operational

parameters on crash frequency depends on the highway type.

For example, the effect of lane width on two-lane highways

could be much different from that on six-lane highways.

Therefore, each highway type needs to be analyzed separately

and then, if required, the models developed could be examined

for similar behavior patterns.



Two-lane Urban Undivided Highways

Highways that come under the category of two-lane,

urban, undivided highways are considered for this study.









About 2500 highway sections in the State of Florida belong to

this category. The important features of 2-lane urban

highways are section length, AADT, presence of on-street

parking, number of intersections, number of railway

crossings, lane width, paved shoulder width, unpaved shoulder

width, presence of curb and coefficient of friction.

The crashes that occur at an intersection are not

included in the study since such crashes are dependent more

on the design features of the intersection and the type of

control used than on the characteristics of the highway

section. Similarly, crashes that occur on a horizontal

curve depend more on the features of the curve

than on the longitudinal features. Therefore, highway sections

with sharp horizontal curvature are not included in the

study.

Highway sections that pass through railroad crossings

and narrow bridges are also excluded from this study. The

number of highway sections available for this study after

removing sections with sharp curves, railroad crossings, and

narrow bridges is about 2000. Seventy-five percent of these

highway sections are used for analysis and modeling. The

remaining twenty-five percent are used for

testing the models.
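
The division into modeling and testing sets can be sketched as below. The source does not say how the sections were assigned to each set; a seeded random shuffle is assumed here, and the function name and section identifiers are hypothetical.

```python
import random


def split_sections(section_ids, train_fraction=0.75, seed=0):
    """Shuffle section ids and split them into modeling and testing sets."""
    ids = list(section_ids)
    random.Random(seed).shuffle(ids)  # assumption: random assignment
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]


modeling, testing = split_sections(range(2000))
print(len(modeling), len(testing))  # 1500 500
```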










Data Statistics

The minimum, maximum, mean, median, and quartile values

of each parameter considered in the study are shown in Table

3.2. These values are used to find the range at which the

majority of the data lies. Section length is measured in

thousandths of a mile. Traffic volume is expressed as Annual

Average Daily Traffic. All cross sectional parameters are

measured in feet and speed limit is measured in mph.


TABLE 3.2 Data Statistics

 #   Parameter           Code   Min.   1st Q   Median   Mean    3rd Q   Max.

 1   Section Length      slen   10     62      144.5    247.2   330     1933
 2   Intersections       its    0      0       1        1.76    2       24.5
 3   Traffic Volume      adt    913    7442    10890    12090   15580   38680
 4   Speed Limit         spd    25     35      45       41.78   45      55
 5   Lane Width          lw     9      12      12       11.90   12      15
 6   Paved Shoulder      ops    0      0       0        1.21    2       12
 7   Unpaved Shoulder    oups   0      4       6        5.87    8       12
 8   Total Shoulder      tosh   0      6       8        7.08    8       14
 9   Outside Curb        oc     0      0       0        0.10    0       1
10   On-street Parking   pk     0      0       0        0.07    0       1
11   Friction            fr     0      0       0        0.44    1       1
12   Total Crashes       acc    0      0       1        3.36    4       47
13   Property Damage     pdo    0      0       0        1.35    2       26
14   Injury              inj    0      0       1        1.95    2       32
15   Fatality            fat    0      0       0        0.06    0       3
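
The six statistics reported for each parameter can be computed as in the following sketch. The input values are made up for illustration, and the quartile convention (linear interpolation between order statistics) is an assumption; the source does not state which convention was used.

```python
import statistics


def summarize(values):
    """Return min, 1st quartile, median, mean, 3rd quartile and max."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": q2,
        "mean": statistics.mean(data),
        "q3": q3,
        "max": data[-1],
    }


crashes_per_section = [0, 0, 1, 4, 10]  # made-up values, not thesis data
print(summarize(crashes_per_section))
```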









Visual Display of Data

The parameters that are generally used to express the

variable statistics are shown in the previous table. Though

these values help to find the range in which the majority of the

data lie, they do not give any information on the

distribution. To get the full picture, a series of plots is

prepared and shown in the following pages.

Two figures are used to display each variable. The

first figure shows the plot of the variable in the order in

which it exists in the database. The x-axis represents the

observation number and the y-axis represents the value of the

variable for each observation. This plot looks like a

scatter plot and gives an idea of the level at which more

observations are concentrated.

The plot on the right-hand side of each pair shows the

parameter in another order. The observations are sorted in

increasing order of value. The sorted plot helps to identify

regions where sufficient observations are not available. It

also helps to identify the variables that are categorical in

nature. Plots prepared to display continuous variables,

categorical variables, and logical variables are shown in

Figures 3.1, 3.2 and 3.3, respectively.
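
The idea behind the sorted plot can be sketched without any plotting library. In the sketch below, the observations are made up, and the `max_levels` threshold used to flag a categorical variable is a hypothetical choice, not a rule taken from the study.

```python
def sorted_view(values):
    """Return the values in increasing order, as used for the sorted plot."""
    return sorted(values)


def looks_categorical(values, max_levels=10):
    """Flag a variable whose sorted plot would show only a few flat steps."""
    return len(set(values)) <= max_levels


speed_limit = [45, 35, 45, 25, 55, 35, 45]          # made-up observations
section_length = [62, 330, 144, 10, 1933, 247, 95]  # made-up observations

print(sorted_view(speed_limit))
print(looks_categorical(speed_limit, max_levels=4))     # True: 4 distinct levels
print(looks_categorical(section_length, max_levels=4))  # False: 7 distinct values
```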














[Figure panels, each plotted against observation number (0 to 2000): Number of Crashes, Section Length, Number of Intersections and Traffic Volume, with the sorted version of each alongside]

FIGURE 3.1 Unsorted & Sorted Plots of Continuous Variables


[Figure panels, each plotted against observation number (0 to 2000): Traffic Speed, Lane Width, Paved Shoulder and Unpaved Shoulder, with the sorted version of each alongside]

FIGURE 3.2 Unsorted & Sorted Plots of Categorical Variables


[Figure panels, each plotted against observation number (0 to 2000): On-Street Parking and Raised Curb, with the sorted version of each alongside]

FIGURE 3.3 Unsorted & Sorted Plots of Logical Variables




Observations in the sorted plots that are detached from

the rest take extreme values compared with most of the

other observations.

These observations will be considered carefully during

analysis. If inconsistencies are observed in the models at

any stage due to these observations, all attempts will be

made to find measures of rectifying such problems. If no

means are available to rectify such situations, these points

will be eliminated and the behavior of the corresponding

variable at such values will be considered as unpredictable.
















CHAPTER 4
CRASH DISTRIBUTION


Crash Classification

The total number of crashes that occur in a highway

section is generally classified based on severity and type.

The classification based on crash severity and the

codes used to represent each class are given in the following

listing.



# Crash Severity Code
1 Property Damage PDO
2 Injury Crashes Inj
3 Fatal Crashes Fat
4 All Crashes Acc



Crash Frequency and Crash Rate

Crash frequency is the total number of crashes that

occur at a highway section during a given period of time,

regardless of length of the section, AADT, or duration of

observation. The period of observation is usually taken as

one year and the length of section is generally limited to

one mile. Crash rate is a function of crash frequency,

section length and the average annual daily traffic. For any









given highway section, crash rate is defined as the number of

crashes per one million vehicle miles.

Crash rate = f (crash frequency, section length, AADT)
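
The source gives the crash rate only as a function of its inputs. A commonly used form of crashes per million vehicle miles is sketched below; the exact expression, the function name and the example values are assumptions, not taken from the study.

```python
def crash_rate_per_mvm(crashes, aadt, length_miles, years=1.0):
    """Crashes per one million vehicle miles of travel.

    Exposure is taken as 365 * years * AADT * section length (vehicle miles).
    """
    vehicle_miles = 365.0 * years * aadt * length_miles
    return crashes * 1_000_000 / vehicle_miles


# Illustrative values: 4 crashes in one year on a 0.5-mile section, AADT 10000.
rate = crash_rate_per_mvm(crashes=4, aadt=10000, length_miles=0.5)
print(round(rate, 3))  # 2.192
```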

Crash rate is generally used in regression analysis for

developing crash prediction models. When crash rate is

defined as the response variable, it is assumed that the

probability for a crash to occur does not depend on the

traffic volume or on the uniformity of design. When number

of crashes per section is considered as the dependent

variable and when section length and AADT are treated as

independent variables, the effects of traffic volume and

uniformity in highway design on the number of crashes are

also taken into consideration while modeling.

In this study, crash frequency is considered as the

response variable regardless of the section length or the

AADT. Section length and AADT are treated as independent

variables that can influence the occurrence of number of

crashes in a given highway section.



Frequency Distribution

Highway sections can be grouped into classes based

on the number of crashes that occur in each section. A

frequency distribution table can be constructed using the

counts in each class or scores in each interval. The shape








of the frequency distribution can be seen from a plot of the

number of sections in each class. The resulting plot is

known as a histogram.
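
A frequency distribution table of this kind can be built as in the sketch below, where each crash class is the number of crashes per section. The input values are made up, and the text histogram is only a stand-in for the plotted figure.

```python
from collections import Counter


def frequency_table(crash_counts):
    """Map each crash class (crashes per section) to the number of sections."""
    counts = Counter(crash_counts)
    return {k: counts.get(k, 0) for k in range(max(crash_counts) + 1)}


crashes_per_section = [0, 0, 0, 1, 1, 2, 5]  # made-up values, not thesis data
table = frequency_table(crashes_per_section)
for crash_class, n_sections in table.items():
    print(f"{crash_class:>2} {'#' * n_sections}")  # one '#' per highway section
```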

Figure 4.1 shows the histogram of all crashes in two-

lane urban highway sections in the State of Florida. The

plot on the left side shows a scatter plot obtained by

plotting crash frequency of each highway section. The plot

on the right side shows the histogram prepared from the

crash distribution table.

According to the histogram, the crash frequency

distribution is single sided, with a large number of highway

sections having no crashes. The number of highway sections

decreases as the number of crashes per section increases.

Beyond fifteen crashes per section, the number of highway

sections approaches zero and the curve tends to become

asymptotic to the x-axis.



[Figure panels: scatter plot of crash frequency against Observation Number (0 to 2000); histogram of the number of sections against Crash Class (0 to 20)]

FIGURE 4.1 Distribution of Actual Crash Data









Three important observations from the histogram plot are

listed below.

1. There are more than 600 highway sections that have zero

crashes.

2. About 50 highway sections have more than 15 crashes.

3. The distribution is single sided.



Poisson Distribution

The Poisson distribution is widely used to model count

data. Rarely occurring events are generally represented

using the Poisson distribution. The shape of the distribution

depends only on one parameter, the mean value of the data.

In other words, the mean determines the shape of the

distribution.

The Poisson distribution has generally been accepted as the

standard distribution function to represent the crash

frequencies. The Poisson distribution models the probability

of y 'events' or incidents according to the Poisson process

with the probability given by the following expression.



$$ p(y, \mu) = \frac{e^{-\mu} \mu^{y}}{y!} $$

Where,

$y = 0, 1, 2, 3, \ldots$

$\mu$ : mean value of the sample









The variance of the distribution is assumed to be equal to

the mean of the distribution.
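
The Poisson probability expression and its mean-variance equality can be checked numerically. The sketch below uses an illustrative mean of 2.0, which is not a value from the study, and truncates the sums at 50 terms, beyond which the probabilities are negligible for this mean.

```python
import math


def poisson_pmf(y: int, mu: float) -> float:
    """Probability of exactly y events for a Poisson variable with mean mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


mu = 2.0  # illustrative mean, not a thesis value
probs = [poisson_pmf(y, mu) for y in range(50)]
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(round(sum(probs), 6), round(mean, 6), round(variance, 6))  # 1.0 2.0 2.0
```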



Negative Binomial Distribution

The negative binomial distribution is similar to the

Poisson distribution but unlike the Poisson distribution, it

allows the variance to be much larger than the mean. The

mean and variance of the negative binomial model can be

written as follows.



$$ E(Y \mid x) = \mu(x) $$

$$ V(Y \mid x) = \mu(x) + \alpha [\mu(x)]^{2} $$

Where,

$\alpha$ is referred to as the dispersion parameter.



Rejection of Poisson Distribution

The assumption that crash frequency is distributed

according to a Poisson distribution is rejected based on the

results from three experiments.



Violation of Mean-Variance Equality:

From the observed values of crashes, the mean, standard

deviation, and variance are calculated. The values obtained

are listed below.









#   Parameter            Value
1   Range                0 to 60
2   Mean                 3.2
3   Standard Deviation   6.1
4   Variance             37.2


Poisson Distribution assumes that the variance is equal

to the mean. The mean value of crash frequency is 3.2, while

the variance is 37.2 ( >> mean ). Therefore the basic

assumption used to develop the Poisson model is violated.
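
The size of the violation can be expressed as a variance-to-mean ratio, which would be close to 1 for Poisson data. The sketch below uses the mean and variance reported above; the function name is hypothetical.

```python
def dispersion_ratio(mean: float, variance: float) -> float:
    """Variance-to-mean ratio; equals 1 for a Poisson variable."""
    return variance / mean


ratio = dispersion_ratio(mean=3.2, variance=37.2)  # values reported above
print(round(ratio, 3))  # 11.625
```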



Over-dispersion Coefficient exceeds 1:

A test for over-dispersion was performed using the

outputs from the procedure that estimates the negative

binomial regression in the statistical analysis package,

LIMDEP by Green [2]. If the over-dispersion factor exceeds

1, the distribution is assumed to be negative binomial. The

over-dispersion factor estimated by the regression analysis

was 1.49.



Disagreement in Shape of Distribution:

The mean number of crashes for two lane highways is

calculated from the crash data. All statistics obtained from

the actual crash frequencies may be used to generate

theoretical frequencies to follow any assumed distribution.









A vector of length n is generated randomly using a

Poisson distributed random number generator. The number of

elements (n) is made equal to the number of highway

sections. The value of parameters used to drive the random

number generator is obtained from the actual crash data. The

distribution of the resulting vector is expected to look like

the distribution of actual crash data.

To compare this theoretical crash data with the actual

crash data, a scatter plot and a histogram plot are prepared

using the randomly generated crash data. The plots thus

obtained are shown in Figure 4.2.















[Figure panels: scatter plot of simulated crash frequency against Observation Number (0 to 2000); histogram of the number of sections against Crash Class (0 to 20)]

FIGURE 4.2 Distribution of Poisson Data









Three important observations from the histogram plot are

listed below.

1. The number of highway sections that had no crashes

during the observation period is less than 100.

2. There are no highway sections with crash frequency

greater than 10.

3. The distribution is double sided with short tails.



None of these observations agree with the observations

made from the actual crash frequency. A similar procedure is used to

generate another vector of random numbers that follow a

negative binomial distribution based on the actual crash

statistics. The scatter plot and histogram plot are shown in

Figure 4.3.






[Figure panels: scatter plot of simulated crash frequency against Observation Number (0 to 2000); histogram of the number of sections against Crash Class (0 to 20)]

FIGURE 4.3 Distribution of Negative Binomial Data









Three important observations from the histogram plot are

listed below.

1. About 600 highway sections had zero crashes.

2. About 50 highway sections experienced more than 15

crashes.

3. The distribution is single sided.
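
The simulation comparison can be sketched with the Python standard library alone. Several inputs are assumptions: n = 2000 approximates the number of sections, the mean of 3.2 and the dispersion value of 1.49 are the figures reported in this chapter, and the generators (Knuth's method for the Poisson, a gamma-Poisson mixture for the negative binomial) are stand-ins for whatever random number generators the study used.

```python
import math
import random


def poisson_draw(rng: random.Random, mu: float) -> int:
    """Draw one Poisson variate with mean mu (Knuth's multiplication method)."""
    limit = math.exp(-mu)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def negative_binomial_draw(rng: random.Random, mu: float, alpha: float) -> int:
    """Draw one variate with mean mu and variance mu + alpha * mu**2.

    Gamma-Poisson mixture: the Poisson mean is gamma distributed with
    shape 1/alpha and scale alpha * mu.
    """
    rate = rng.gammavariate(1.0 / alpha, alpha * mu)
    return poisson_draw(rng, rate)


rng = random.Random(1998)          # fixed seed so the run is repeatable
n, mu, alpha = 2000, 3.2, 1.49     # assumed inputs, see the note above
poisson_sample = [poisson_draw(rng, mu) for _ in range(n)]
nb_sample = [negative_binomial_draw(rng, mu, alpha) for _ in range(n)]

for name, sample in (("Poisson", poisson_sample), ("Neg. binomial", nb_sample)):
    zeros = sum(1 for x in sample if x == 0)
    above_15 = sum(1 for x in sample if x > 15)
    print(f"{name}: zero-crash sections = {zeros}, sections above 15 = {above_15}")
```

Under these assumptions the expected number of zero-crash sections is about 2000 x exp(-3.2), roughly 80, for the Poisson case and about 2000 x (1 + 1.49 x 3.2)^(-1/1.49), roughly 620, for the negative binomial case.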



Conclusion

All these observations agree with the observations made

from the actual crash distribution. Based on the results

obtained from all the three experiments described in the

previous sections, it can be concluded that the

distribution of total crashes that occur on two-lane urban

highways follows a negative binomial distribution.

The distributions of PDO crashes, injury crashes and

fatal crashes were also checked using the same procedure. The

following results were obtained.



# Crash Type Distribution
1 PDO crashes Negative Binomial Distribution
2 Injury crashes Negative Binomial Distribution
3 Fatal Crashes Poisson Distribution


The fatal crashes were of very rare occurrence and there were

no signs of over-dispersion.














CHAPTER 5
EXPLANATORY VARIABLES


This chapter gives an introduction to all the variables

that are believed to contribute to the occurrence of crashes.

Such variables that may be able to explain the occurrence of

crashes are termed explanatory variables. The explanatory

variables are classified into longitudinal factors,

operational factors and cross sectional factors.



* Longitudinal factors include section length, number of

intersections, level crossings and narrow bridges.

* Operational factors include traffic volume, speed limit

and on-street parking conditions.

* Cross sectional factors include lane width, shoulder

width, paved shoulder width and unpaved shoulder width.



The Highway Section

A highway section is defined as a uniform stretch of

roadway for which the operational factors and cross sectional

factors remain unchanged. The length of a highway section

usually ranges between 0.5 and 2.5 miles. Since

intersections are not considered as constraints in









determining highway section boundaries, a highway section may

consist of several intersections.

Changes in geometry, speed limits, parking regulations

or traffic volumes result in a highway being

divided into several smaller highway sections. Therefore,

a longer highway section implies design consistency while

several short highway sections imply irregularities in

design.

There is a possibility for irregularities in design to

contribute to crashes. Therefore each highway section is

considered as one observation in this study rather than

considering sections of one mile length. The section length

will be considered as one of the explanatory variables.



Longitudinal Parameters

The section length is the most important longitudinal

parameter of the highway section. Other factors include

number of intersections, number of railway crossings and

number of narrow bridges.

The crashes that occur at an intersection depend more

on the design aspects and operational features of the

intersection than on the features of the section. Since this

study is focused on modeling crashes as a function of

the highway features, the crashes that occur at the









intersections are not included as part of the response

variable.

Even though the crashes that occur at intersections are

excluded, there is a possibility for mid-block crash

frequency to be influenced by the presence of intersections.

Therefore 'number of intersections' is also considered as one

of the explanatory variables in this analysis.

Figure 5.1 shows the pairwise plot of all longitudinal

parameters with crash frequency as the first variable. A

pairwise plot is prepared by plotting all variables on a two-

dimensional surface. Each plot represents two variables.

The plots give a general idea of how well the variables are

related to each other. The response variable is shown as the

first variable in each plot.

The plots may be explained using the following

examples. In Figure 5.1, the plot corresponding to 1st row

and 3rd column was prepared by plotting number of

intersections on the x-axis and crash frequency on the y-

axis. The plot corresponding to 1st column and 2nd row was

prepared by plotting crash frequency on the x-axis and

section length on the y-axis.

The points in the pairwise plot are not completely

random. Therefore, some relationship could be expected

between the variables. Since the points are also spread out,








it can be expected that the behavior of the variables under

consideration is also influenced by other variables.


[Figure: pairwise scatter plots of acc, slen, its and oc]

acc    total crash frequency
slen   section length (1/100th of mile)
its    number of intersections
oc     presence of curb

FIGURE 5.1 Pairwise Plot of Longitudinal Factors








Operational Parameters

Traffic volume, traffic speed and on-street parking are

the important highway operational parameters. Figure 5.2

shows the pairwise plot of all operational parameters with

crash frequency as the first variable.




[Figure: pairwise scatter plots of acc, adt, spd and pk]

acc   total crash frequency
adt   AADT
spd   speed limit (mph)
pk    on-street parking

FIGURE 5.2 Pairwise Plot of Operational Factors









Since traffic volume changes with time it is difficult

to measure the volume and record it on an ongoing basis.

Besides, it is practically impossible to know the traffic

volume or density at the time of the accident. Therefore a

representative variable, average annual daily traffic (AADT)

is used as the variable to represent traffic volume.

Similarly traffic speed also changes with time.

Therefore another representative variable, speed limit, is used to

represent this factor. Speed limit is a function of several

geometric parameters, pavement conditions and sight distance.

Speed limit when defined as an explanatory variable

represents the effect of these factors on highway safety.

On-street parking is another important operational

parameter. It is represented using a logical variable that

takes the value zero if on-street parking is prohibited and the

value one if on-street parking is permitted for that highway

section.



Cross Sectional Parameters

The cross sectional factors are lane width, shoulder

width, median width, and safe recovery area width. Since the

type of highway considered for this study is undivided, the

median width is zero. The safe recovery area is usually zero

for urban highways. The shoulder could either be paved,









unpaved, or a combination of both. Since a paved shoulder

can functionally contribute to the width of lane, paved and

unpaved shoulders are considered as two different parameters.

Figure 5.3 shows the pairwise plot of all cross sectional

parameters with crash frequency as the first variable.


[Figure: pairwise scatter plots of acc, lw, ops and oups]

acc    total crash frequency
lw     width of lane
ops    width of paved shoulder
oups   width of unpaved shoulder

FIGURE 5.3 Pairwise Plot of Cross Sectional Parameters














CHAPTER 6
MODELING STRATEGIES & THE BASE MODEL


This chapter gives an introduction to the statistical

methods used in the study. Criteria used for accepting or

rejecting models and for preferring one model over other

models are also discussed briefly in this chapter.

Assumptions made based on the insights obtained from

literature review, visualization of actual data and based on

known statistical concepts are used to develop the base

model. This model and all relevant model parameters are

displayed and discussed briefly in this chapter.

There is a need to validate or reject these assumptions

based on statistical inference. The next chapter deals with

improving this model step by step while all assumptions used

in the base model are evaluated in stages.



Generalized Linear Models

In ordinary linear regression analysis, the errors are
assumed to be normally distributed. The properties of least
squares estimates are therefore stronger when the errors
actually follow a normal distribution than when they do not.
In practice the errors are often not normally distributed,
and when the response variable represents rare events the
errors are seldom normal. In such situations, models
developed using ordinary linear regression become highly
unreliable even though a very good fit can be attained
through sophisticated modeling.

The generalized linear model (GLM) introduced by
Nelder and Wedderburn (1972) is a generalized approach to
linear models that accommodates a wide range of error
distribution families. A generalized linear model is
specified by three components: random, systematic, and link.
The random component identifies the probability
distribution of the response variable. Since crash
frequency represents counts, it is discrete in nature and
follows a discrete count distribution. The systematic
component specifies a linear function of explanatory
variables that is used as a predictor. Section length,
traffic volume, speed limit, lane width and shoulder width
are examples of explanatory variables that can be expressed
in the linear form given below.



\[ \text{systematic component} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots \]

where,
$x_1, x_2, x_3, \dots$ are the independent variables or functions of
independent variables,
$\beta_0$ is the y-intercept and
$\beta_1, \beta_2, \beta_3, \dots$ are coefficients of the independent variables.



The link component of the Generalized Linear Model

links expected values of observations to explanatory

variables through a specified function. In Poisson and

negative binomial distributions, the link function could be

natural log.
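
As a brief illustration of the link component, the log link can be written out as follows. The symbol $\mu$ for the expected value of the response is notation assumed here for illustration only.

```latex
\log(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots
\quad\Longleftrightarrow\quad
\mu = \exp\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots\right)
```

Under this link a unit increase in $x_j$ multiplies the expected value by $\exp(\beta_j)$, and the predicted value is always positive, which suits count data.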



Model Statistics

The definition of some important terms used to express

the model characteristics and reliability are given in the

following sections.



Regression Coefficient ($\beta$):

The parameters $\beta_0$ and $\beta_1, \dots, \beta_n$ are called regression
coefficients. $\beta_0$ is the y-intercept of the regression model
and $\beta_1, \dots, \beta_n$ represent the change in expected value of the
response variable per unit change in each independent
variable.



Sum of Squares (SS):

Sum of squares of the model is a measure of the

variability in the response variable that has been explained









by the model. Sum of squares of individual parameters

represents the portion of model sum of squares that has been

contributed by that parameter.



F-test:

The value of $\beta$ depends on the units used to represent
the corresponding parameter. For example, the $\beta$ obtained
when expressing section length in miles will be one thousand
times the value of $\beta$ obtained when section length is
expressed in one-thousandths of a mile. The magnitude of
$\beta$ itself is therefore not a clear indication of its
significance.

The F-test can be used to find the relative importance

of one term with respect to another. If a significant

amount of extra variance can be eliminated (explained) by

including the term, its presence is justified.



t-test:

The t-test is similar to the F-test except that the
t-test takes the direction (sign) of the coefficient into
consideration. The t-value may be written as follows.

\[ t = \frac{\beta_j}{s \sqrt{c_{jj}}} \]

where,
$\beta_j$ is the coefficient of the jth term,
$s$ is the residual standard error, so that $s \sqrt{c_{jj}}$ is the
standard error of $\beta_j$, and
$c_{jj}$ is the jth diagonal element of the $(X'X)^{-1}$ matrix.



p-value:

The p-value measures the level at which the t-statistic
is significant. A p-value of .10 suggests that the
parameter it represents is significant at a confidence level
of 90%. The generally accepted confidence level is 95%,
which corresponds to a p-value of .05.



Likelihood function:

The likelihood function of a given data n, is the

probability of n for that sampling model, treated as a

function of the unknown parameters [37, page 40]. The

maximum likelihood (ML) estimates are parameter values under

which the observed data would have had the highest

probability of occurrence.



Deviance:

The deviance of an ordinary least squares model is a
function of its log-likelihood and the log-likelihood of the
corresponding saturated model. It is calculated as twice
the difference between the two log-likelihoods.









The deviance of a generalized linear model is similar

to the residual sum of squares of a linear model. It is

the weighted residual sum of squares of the model. The

residual degrees of freedom is used to calibrate the

deviance.



Variable Selection Procedure

The value of AIC can be used as the criterion for
preferring one model over another. AIC is short for Akaike
Information Criterion. For generalized linear models, AIC is
defined as a function of deviance ($D$), degrees of freedom
($p$), and an estimate of the dispersion parameter ($\phi$).

\[ AIC = D + 2 p \phi \]

A decrease in the value of any of the three quantities
results in a reduction in the AIC value. In all model
selection routines, AIC is used as the criterion for ranking
candidate models, and the model with the minimum AIC value
is accepted. It is similar to Mallows' Cp criterion, which
penalizes the use of additional regressors to attain the
expected quality of fit.
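
As a rough check of the AIC definition, the sketch below evaluates AIC from a deviance, a parameter count and a dispersion estimate, and backs out the dispersion implied by the base model statistics reported later in this chapter (residual deviance 1590.912, AIC 1614.967). Taking p = 11 (the intercept plus ten terms) is an assumption made for illustration, as are the function names.

```python
def aic(deviance, n_params, dispersion):
    """AIC as defined in the text: deviance + 2 * p * dispersion."""
    return deviance + 2.0 * n_params * dispersion


def implied_dispersion(aic_value, deviance, n_params):
    """Solve the AIC definition for the dispersion estimate."""
    return (aic_value - deviance) / (2.0 * n_params)


# Base model statistics from section V of the base model output.
# n_params = 11 is an assumption: intercept plus ten regressors.
phi = implied_dispersion(aic_value=1614.967, deviance=1590.912, n_params=11)
print(round(phi, 4))
print(round(aic(1590.912, 11, phi), 3))
```

Under these assumptions the implied dispersion estimate is about 1.09, and substituting it back into the definition reproduces the reported AIC.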



Sequential Variable Selection:

Stepwise variable selection is a search routine that
assists in finding a subset of explanatory variables that
could be included in a multiple regression model. Variables
are added to or deleted from the existing model on the basis
of a predefined criterion that measures the relative
improvement of the model with respect to each variable.
For a given number of variables, the number of models
that can be generated from all possible combinations of all
or part of the variables is very large; k candidate
variables give 2^k possible subsets. An exhaustive search
would therefore require examining a large number of models.
Stepwise variable selection reduces the number of models
that need to be examined while limiting the risk of missing
the best combination of variables.
To reduce the probability of losing effective
combinations of variables, certain strategies are adopted in
identifying a path that would lead to the best model. The
three general sequential algorithms are discussed briefly in
the following sections.



Forward Selection:

In forward selection, the initial model contains only

the constant term that represents the y-intercept. A set of

models is developed as the second stage in which each model

contains exactly one term other than the intercept. The

model with lowest value of AIC is selected and this model









forms the base to find the next regressor. The process stops

when adding another regressor is not capable of bringing down

the AIC value any further. In this method, a regressor once

selected is never considered for elimination. The parameters

in the final model depend on the order in which variables

enter the model.
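
The forward selection procedure described above can be sketched as follows. This is a minimal illustration and not the routine used in the study: the `score` callback stands in for fitting a model and returning its AIC, and the toy scoring function and its numbers are entirely hypothetical.

```python
def forward_selection(candidates, score):
    """Greedy forward selection.

    candidates: list of variable names that may enter the model.
    score: function taking a list of variables and returning an
           AIC-like value (lower is better). Hypothetical stand-in
           for fitting a model and reading off its AIC.
    """
    selected = []                  # start from the intercept-only model
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        # Try adding each remaining variable to the current model.
        trials = [(score(selected + [v]), v) for v in remaining]
        trial_score, trial_var = min(trials)
        if trial_score >= best:    # no candidate lowers the criterion
            break
        selected.append(trial_var)  # once selected, never removed
        remaining.remove(trial_var)
        best = trial_score
    return selected, best


# Toy criterion (hypothetical numbers): each variable lowers the
# deviance by a fixed gain, and each term costs a penalty of 2.
GAINS = {"slen": 40, "adt": 20, "lw": 0.5}

def toy_aic(variables):
    return 100 - sum(GAINS[v] for v in variables) + 2 * len(variables)


selected, best = forward_selection(["lw", "adt", "slen"], toy_aic)
print(selected, best)
```

In the toy example the weak variable is never admitted because its gain is smaller than the penalty for an extra term, which mirrors the stopping rule described in the text.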



Backward Selection:

In backward selection, the first model is developed
using all the regressors. A series of analyses follows to
identify the regressor whose removal reduces the AIC value
the most. The regressor thus identified is eliminated from
the current model and the procedure is repeated to find the
next regressor. This procedure is stopped when eliminating
another term cannot reduce the AIC value any further.
In this method, a regressor once rejected is not
reconsidered for inclusion in the model. Therefore the
final model selected by this process depends on the order in
which parameters are rejected.



Stepwise Regression:

Stepwise regression is a modification of forward
selection. At each stage of selection, all regressors
currently in the model are re-evaluated to justify their
presence alongside the new variable that was added.
Therefore a regressor that entered the model at one stage
may be eliminated at another stage. The procedure is
terminated when no regressor can improve the AIC value
either by leaving the model or by entering it. The stepwise
regression methodology is used in the analysis.

Stepwise model selection procedure is the generally

preferred methodology for generalized linear models. The

procedure starts with an arbitrary model that has been fit

previously. The initial model is improved in stages by

adding terms to or deleting terms from the current model.

Each addition or deletion is justified by the reduction in

the AIC statistic.



Model Performance Criteria

At the model building stage, the conditions under which
the models should perform are not known. The models
developed through regression analysis need to be cross
validated for reliability in application.

Fitting Sample and Testing Sample:

Prior to regression analysis, the data is split to form

two data sets. The larger set forms the fitting sample and









the smaller set forms the testing sample. All appropriate

candidate models are developed and their coefficients are

computed using the fitting sample. The testing sample could

then be used to estimate the performance of fitted models.

The following procedure is used to obtain an unbiased
split of the data in the ratio 75:25. A random number in the
range of 0 to 1 is generated. If the value of the random
number is less than or equal to .75, the first record is
included in the fitting sample. If the value is greater than
.75, the first record is included in the testing sample.
This process is repeated for each observation.

Total number of observations = 1934

Observations in fitting sample = 1466

Observations in testing sample = 468
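
The random split described above can be sketched as follows. The function name, the fixed seed and the use of integer record identifiers are assumptions for illustration; the exact sample sizes reported in the text depend on the random numbers actually drawn in the study and will not be reproduced by this sketch.

```python
import random


def split_fit_test(records, fit_fraction=0.75, seed=0):
    """Assign each record to the fitting or testing sample at random.

    A uniform random number in [0, 1) is drawn per record; values
    <= fit_fraction go to the fitting sample, the rest to testing.
    The seed is a hypothetical choice made only for repeatability.
    """
    rng = random.Random(seed)
    fitting, testing = [], []
    for rec in records:
        if rng.random() <= fit_fraction:
            fitting.append(rec)
        else:
            testing.append(rec)
    return fitting, testing


# 1934 is the total number of observations reported in the text.
fitting, testing = split_fit_test(range(1934))
print(len(fitting), len(testing))
```

Because each record is assigned independently, the realized split is only approximately 75:25, which is consistent with the reported counts of 1466 and 468.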

All acceptable models developed at each stage of
analysis using the fitting sample can be compared or ranked
based on the quality of prediction on the testing sample.
The performance of two or more models can be compared by the
relative accuracy of prediction. Two norms are developed to
automate computations and to develop a performance table.
The procedure used to develop the norms is shown in the
following sections.









Norm including all observations:

The prediction errors from the estimated response for
each model can be used to generate norms, which form the
criteria for accepting, rejecting, or ranking the candidate
models. The Mean Absolute Deviation in prediction can be
used as the criterion for comparing the relative performance
of any two models.

\[ MAD = \frac{1}{n} \sum_{i=1}^{n} \left| \text{observed}_i - \text{predicted}_i \right| \]

where MAD is the mean absolute deviation, and
n stands for the number of observations in the fitting sample.



Norm excluding outliers:

Mean Absolute Deviation could be highly influenced by a

few observations which may be outliers for a particular

model. To nullify the effect of such observations, up to 5%

of the observations with worst prediction error are excluded

from the computation. The absolute values of errors are

sorted in increasing order and the last 5% of observations

are rejected from the calculation. The norms are included

along with other model parameters in all performance tables

discussed in the next chapter.
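
The two norms can be sketched as follows. The function names and the small data set are hypothetical; the 5% cut-off follows the description above, with the number of excluded observations rounded down so that at most 5% are dropped.

```python
def mean_absolute_deviation(observed, predicted):
    """Norm including all observations."""
    errors = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(errors) / len(errors)


def trimmed_mad(observed, predicted, drop_percent=5):
    """Norm excluding the worst drop_percent of absolute errors."""
    errors = sorted(abs(o - p) for o, p in zip(observed, predicted))
    n_drop = len(errors) * drop_percent // 100   # rounds down
    kept = errors[:len(errors) - n_drop]
    return sum(kept) / len(kept)


# Hypothetical data: 19 predictions off by 1 and one off by 21.
observed = [0] * 20
predicted = [1] * 19 + [21]
print(mean_absolute_deviation(observed, predicted))
print(trimmed_mad(observed, predicted))
```

In this toy case the single large error doubles the first norm, while the second norm is unaffected by it.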

Basic Assumptions and the Base Model

A regression model is developed using all regressors as
independent variables and crash frequency as the response
variable. The variable transformations used are based on the
information obtained from the literature review. Traffic
volume and section length are entered with a natural log
transformation. All other variables are represented in the
natural scale. This assumption is based on earlier studies
by a few analysts. Further experiments are carried out on
these parameters to see if any other transformation can
represent them better than the default functions. Some
important results of these experiments are discussed in the
next chapter.
Information about variable interactions is not clearly
known at this time, and interactions are assumed to be
nonexistent for now. All predictors are assumed to be
continuous, though some of them show a categorical nature
which will be explored at a later stage. The variables
included in the development of this model are listed below.



 #   Parameter                    Code   Transformation
 1.  Section Length               slen   log
 2.  Number of intersections      its    none
 3.  Ave. Annual Daily Traffic    adt    log
 4.  Posted Speed Limit           spd    none
 5.  On-Street-Parking            pk     none
 6.  Lane Width                   lw     none
 7.  Outside Paved Shoulder       ops    none
 8.  Outside Unpaved Shoulder     oups   none
 9.  Outside Curb                 oc     none
10.  Friction Factor              fr     none


The Model Parameters:

The model parameters and standard model plots are shown

in the following six sections.

I. The Model:

acc ~ log(slen) + its + log(adt) + spd + pk + lw + ops + oups + oc + fr
theta = 1.51256, family = negative binomial, link = log

II. Model Coefficients:
Parameter            Value        Std Err        t value
(Intercept)   -10.321376409   0.710073553   -14.53564405
log(slen)       0.764342626   0.036520748    20.92899676
its             0.064613833   0.011073166     5.83517269
log(adt)        0.856658961   0.056969923    15.03703913
spd            -0.017071908   0.004398058    -3.88169259
pk              0.563012796   0.142831156     3.94180662
lw              0.002467123   0.033606293     0.07341253
ops            -0.070158874   0.015603061    -4.49648148
oups           -0.021864803   0.012393907    -1.76415740
oc              0.200213092   0.117361492     1.70595217
fr              0.043057361   0.061632129     0.69861875

III. F Statistics:
Parameter     Df   Sum of Sq    Mean Sq    F Value      Pr(F)
log(slen)      1     785.511   785.5110   662.7017   0.0000000
its            1      74.929    74.9289    63.2142   0.0000000
log(adt)       1     200.624   200.6236   169.2575   0.0000000
spd            1      39.317    39.3174    33.1704   0.0000000
pk             1      14.772    14.7720    12.4625   0.0004281
lw             1       0.042     0.0417     0.0351   0.8513137
ops            1      16.273    16.2726    13.7285   0.0002191
oups           1       6.407     6.4074     5.4056   0.0202091
oc             1       2.694     2.6937     2.2725   0.1319030
fr             1       0.488     0.4881     0.4118   0.5211776
Residuals   1455    1724.635     1.1853


IV. Analysis of Deviance Table:
Variable     Df   Deviance   Resid. Df   Resid. Dev    Pr(Chi)
NULL                              1465     3105.902
log(slen)     1   1139.127        1464     1966.775   0.0000000
its           1     72.707        1463     1894.068   0.0000000
log(adt)      1    217.330        1462     1676.738   0.0000000
spd           1     42.244        1461     1634.493   0.0000000
pk            1     15.901        1460     1618.593   0.0000668
lw            1      0.003        1459     1618.590   0.9575715
ops           1     18.196        1458     1600.394   0.0000199
oups          1      6.506        1457     1593.888   0.0107488
oc            1      2.477        1456     1591.410   0.1154974
fr            1      0.498        1455     1590.912   0.4803596


V. Model Statistics:
Null Deviance: 3105.902 on 1465 degrees of freedom
Residual Deviance: 1590.912 on 1455 degrees of freedom
Theta: 1.51257, Standard Error: 0.10623
2 x log-likelihood: 8830.80398
AIC: 1614.967, MAD: 2.711362


Crash frequency is modeled as a function of section
length (slen), number of intersections in the section
(its), AADT (adt), speed limit (spd), parking regulations
(pk), lane width (lw), outside paved shoulder width (ops),
outside unpaved shoulder width (oups), presence of outside
curb (oc) and the coefficient of friction of the pavement
surface (fr). The distribution assumed is negative
binomial, and the link function is natural log.
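
To show how the base model turns a set of section characteristics into an expected crash frequency, the sketch below applies the log link to the coefficients reported in section II. The function name and the example section are hypothetical, and the input values are placeholders: the units in which each variable was coded when the model was fit are not restated here, so the predicted number itself should not be read as a result of the study.

```python
import math

# Coefficients copied from section II of the base model output.
BASE_COEF = {
    "intercept": -10.321376409,
    "log_slen": 0.764342626,
    "its": 0.064613833,
    "log_adt": 0.856658961,
    "spd": -0.017071908,
    "pk": 0.563012796,
    "lw": 0.002467123,
    "ops": -0.070158874,
    "oups": -0.021864803,
    "oc": 0.200213092,
    "fr": 0.043057361,
}


def predict_crashes(slen, its, adt, spd, pk, lw, ops, oups, oc, fr,
                    coef=BASE_COEF):
    """Expected crash frequency: exp of the linear predictor (log link)."""
    eta = (coef["intercept"]
           + coef["log_slen"] * math.log(slen)
           + coef["its"] * its
           + coef["log_adt"] * math.log(adt)
           + coef["spd"] * spd
           + coef["pk"] * pk
           + coef["lw"] * lw
           + coef["ops"] * ops
           + coef["oups"] * oups
           + coef["oc"] * oc
           + coef["fr"] * fr)
    return math.exp(eta)


# Hypothetical section; every value below is an illustrative placeholder.
section = dict(slen=1.0, its=4, adt=15000, spd=35, pk=0,
               lw=12, ops=4, oups=2, oc=1, fr=35)
base = predict_crashes(**section)
doubled = predict_crashes(**{**section, "adt": 2 * section["adt"]})
ratio = doubled / base
print(ratio)
```

Because traffic volume enters in log form, doubling AADT multiplies the expected crash frequency by 2 raised to the power of its coefficient, regardless of the other inputs.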













VI. Model Plots:

[Figure 6.1: standard diagnostic plots of the base model; recoverable axis labels are "Fitted", "Predicted" and "Quantiles of Standard Normal"; image not reproduced]

FIGURE 6.1 Standard Plots of the Base Model.


The variable coefficients ($\beta$'s), the standard error of
each term and the t-statistics are shown in section II.
Section III shows the Sum of Squares contributed by each
variable and the associated F-statistics. The p-values of
most of the terms are less than .05, which shows
significance at a confidence level greater than 95%. The
analysis of deviance is shown in section IV, all important
model statistics are shown in section V and the standard
model plots are given in section VI.
Lane width, a very important parameter, was declared
insignificant by the criteria used in eliminating terms
during stepwise regression. The coefficient of speed limit
suggests that as speed limit increases, crash frequency
decreases. This model is studied in detail, and the various
stages of improvement that it goes through before reaching
the final model are discussed in the next chapter.














CHAPTER 7
STATISTICAL MODELING


This chapter deals with various stages of regression

analysis and related issues. The objective of this chapter

is to identify the best way of representing each variable in

the crash prediction model while validating basic

assumptions. Assumptions made about variable transformation

and interactions are reviewed in this chapter.

The analysis starts from the base model presented in
the previous chapter. Each variable in the base model is
examined individually and compared with all possible and
reasonably explainable forms of representation in the model.
A model in which the variable under consideration, in
whatever form, is not significant at the 95% confidence
level is rejected.
The models that survive this test are compared in
stages to find the best model for each stage. While
examining the transformation function of one variable, it is
assumed that the other variables are represented correctly.
Since this assumption can affect the outcome for the first
few variables that are analyzed, the final models are
subjected to cross checking for confirmation. This error is
smallest when the order used for analyzing variables is
based on their importance in the model.
The independent variables and the sum of squares values
as represented in the base model are listed below. The sum
of squares value of each variable in a model is a measure of
its relative contribution with respect to other variables in
explaining the response variable. The base model shows that
the section length (sum of squares = 785.5) contributes more
than double that of all other variables put together. If
section length is considered as the first variable for
analysis, the error induced by incorrect representation of
other variables can be minimized.


Listing: The Base Model
Parameter         Code        Sum of Sq    Mean Sq    F Value      Pr(F)
Section Length    log(slen)     785.511   785.5110   662.7017   0.0000000
Intersections     its            74.929    74.9289    63.2142   0.0000000
Traffic Volume    log(adt)      200.624   200.6236   169.2575   0.0000000
Speed Limit       spd            39.317    39.3174    33.1704   0.0000000
Parking           pk             14.772    14.7720    12.4625   0.0004281
Lane Width        lw              0.042     0.0417     0.0351   0.8513137
Paved Shoulder    ops            16.273    16.2726    13.7285   0.0002191
Unpaved Shldr     oups            6.407     6.4074     5.4056   0.0202091
Raised Curb       oc              2.694     2.6937     2.2725   0.1319030
Friction Factor   fr              0.488     0.4881     0.4118   0.5211776









Representing Longitudinal Factors

In the base model, section length was assumed to have a

natural log transformation. Some of the functions generally

used to transform continuous variables in statistical

modeling are natural log, square root and square.

A series of models are developed from the base model

utilizing these transformation functions to represent section

length and intersections. The models that survived the

confidence level test and stepwise regression are listed

below and the corresponding model parameters are shown in the

table that follows.


 #   Model   Section Length   Rejected by stepwise
 1.  SL1     log(slen)        lw, fr
 2.  SL2     slen             lw, fr, oups
 3.  SL3     sqrt(slen)       lw, fr
 4.  SL4     slen^2           lw, fr


TABLE 7.1 Models for representing Section Length
Model Parameter         SL1       SL2       SL3       SL4
Null Deviance       3104.35   2762.75   3075.06   2374.54
Residual Deviance   1590.80   1559.78   1578.20   1548.76
Theta                 1.511      1.23     1.485     .9521
Standard Error         .106      .078      .103     .0554
2*log likelihood    8830.28   8710.03   8830.74   8517.54
AIC                 1610.45   1576.89   1597.70   1563.62
Prediction Error       2.71     2.677     2.691     2.691
Error on 95% data     1.924     1.889     1.907     1.893









Listing: The Model SL2
Parameter        Code       Sum of Sq    Mean Sq    F Value        Pr(F)
Section Length   slen         820.515   820.5146   772.2600   0.00000000
Intersections    its           60.324    60.3240    56.7763   0.00000000
Traffic Volume   log(adt)     204.467   204.4671   192.4424   0.00000000
Speed Limit      spd           36.342    36.3424    34.2051   0.00000001
Parking          pk            10.350    10.3499     9.7412   0.00183721
Paved Shoulder   ops            9.857     9.8567     9.2770   0.00236208
Raised Curb      oc             5.118     5.1177     4.8167   0.02834245


Model SL4, with the square function on section length,
was found to be the best model based on the values of Null
Deviance, Residual Deviance, Standard Error, log likelihood,
and AIC. Model SL2, which represents section length without
any transformation, was found to be the second best. Models
SL1 and SL3, which represent the log and square root
transformations, are rejected since their model statistics
are inferior to those of SL2 and SL4.
The untransformed and square transformed models are
compared further. Though the square transformation gave
better model parameters, its ability to predict crash
frequencies on the testing sample was found to be inferior
(mean error of 2.691 crashes/section) to that of the
untransformed model (mean error of 2.677 crashes/section).
Therefore the untransformed form of section length is
preferred over the square transformed form. The square
transformation will be considered further at other stages.









Interaction between Section Length and Intersections:

The model over-predicted the response variable for long
highway sections with large numbers of intersections. As
section length increases, crash frequency increases. For a
given section length, it is reasonable to expect the crash
frequency to increase as the number of intersections
increases. At the same time, longer sections may be able to
accommodate more intersections than shorter sections within
a defined range of safety.
The effect of intersections on crash frequency cannot
be completely independent of section length. A product term
of intersections and section length is introduced in the
model to represent the combined effect of these parameters
on crash frequency. The coefficient of the product term was
found to be negative, as expected. This term applies a
corrective measure against over-prediction on long sections
with a large number of intersections. The p-value of the
product term showed significance at a confidence level above
95%.
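
The role of the product term can be made explicit by writing out the part of the linear predictor that involves the two longitudinal variables. The symbols $\mu$, $\beta_s$, $\beta_i$ and $\beta_{si}$ are notation assumed here for illustration; the remaining terms of the model are abbreviated by dots.

```latex
\log(\mu) = \dots + \beta_s \, slen + \beta_i \, its + \beta_{si} \, (slen \cdot its)

\frac{\partial \log(\mu)}{\partial \, its} = \beta_i + \beta_{si} \, slen
```

With $\beta_i > 0$ and $\beta_{si} < 0$, each additional intersection still raises the predicted crash frequency on short sections, but the increase shrinks as section length grows, which is the corrective effect described above.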

The crash frequencies predicted by the resulting model
show great improvement. The plots prepared for the
diagnostic study of model SL2 are shown in Figure 7.1. The
improvement in prediction for the observations listed in the
previous section is shown in the following listing.
























[Figure 7.1: diagnostic plots of model SL2; image not reproduced]









 #   Index   Section   # Inter-   Actual   Predicted   Predicted
             length    sections            (Before)    (After)
 1     276     1310      19          36        105         51
 2     641     1871      17.5        15         61         29
 3     644     1430      23.5        44        178         21
 4    1257     1672      12.5        33        110         40
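
The size of the improvement in this listing can be summarized with the mean absolute error before and after the product term is added. The sketch below uses only the four rows of the listing; the variable and function names are chosen for illustration.

```python
# (index, actual, predicted before, predicted after) from the listing.
ROWS = [
    (276, 36, 105, 51),
    (641, 15, 61, 29),
    (644, 44, 178, 21),
    (1257, 33, 110, 40),
]


def mean_abs_error(actual, predicted):
    """Mean absolute difference between actual and predicted counts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [row[1] for row in ROWS]
mad_before = mean_abs_error(actual, [row[2] for row in ROWS])
mad_after = mean_abs_error(actual, [row[3] for row in ROWS])
print(mad_before, mad_after)
```

For these four sections the mean absolute error falls from 81.5 to 14.75 crashes per section.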


When compared with the values predicted by the non-

interactive model, the errors for interactive model are

substantially lower. In the presence of the interactive term,

the best way of representing section length, intersections

and the product term is not known. Therefore some more

models are considered to find the best form of representing

longitudinal factors. The models that were acceptable for

performance comparison are shown in the following listing.

The model parameters are displayed in Table 7.2.

 #   Model   Characteristics
 1   LF1     slen, slen^2
 2   LF2     slen:its
 3   LF3     slen, slen^2, slen:its

TABLE 7.2 Models representing Longitudinal Factors
Model Parameters        LF1        LF2       LF3
Null Deviance       3060.64   2949.865   3063.98
Residual Deviance   1582.36    1570.81   1581.54
Theta                  1.47      1.379     1.475
Standard Error         .102       .092      .102
2*log likelihood    8820.56    8784.60   8822.77
AIC                 1604.10    1590.21   1605.45
Pred Error Norm1      2.689     2.6735    2.6875
Pred Error Norm2     1.9067      1.889     1.905











Listing: The Model LF2
Parameter        Code       Sum of Sq    Mean Sq    F Value        Pr(F)
Section Length   slen         779.733   779.7332   697.0394   0.00000000
Intersections    its           59.487    59.4875    53.1786   0.00000000
Traffic Volume   log(adt)     205.715   205.7146   183.8977   0.00000000
Speed Limit      spd           34.970    34.9698    31.2611   0.00000003
Parking          pk            11.724    11.7238    10.4805   0.00123358
Paved Shoulder   ops           11.628    11.6276    10.3945   0.00129194
Raised Curb      oc             5.580     5.5802     4.9884   0.02566882
Product          slen:its      79.273    79.2732    70.8659   0.0000000


Among the listed models, LF2 gave the best results in

terms of model statistics and prediction errors. According

to this model, the interactive term is able to yield higher

quality than the introduction of a square term in the model.




Representing Operational Factors

In all previous analyses, AADT was assumed to follow

natural log transformation. This assumption was based on the

finding from a few recent studies. In this section, this

assumption is re-evaluated by comparing the parameters of the

current model with that of several other models obtained by

assuming various transformation functions for AADT including

the untransformed form. All acceptable models that resulted

from this analysis are listed below. The corresponding model

parameters are shown in Table 7.3.










AADT Model   Characteristics
ADT1         log(adt)
ADT2         adt
ADT3         sqrt(adt)
ADT4         adt, sq(adt)


TABLE 7.3 Models for representing AADT
Model Parameter         ADT1      ADT2      ADT3      ADT4
Null Deviance       2949.865   2918.41   2945.30   2941.56
Residual Deviance    1570.81   1570.65   1570.60   1570.86
Theta                  1.379    1.3517    1.3754    1.3722
Standard Error         .0921     .0898      .092     .0917
2*log likelihood     8784.60   8770.86   8782.76   8780.90
AIC                  1590.21   1590.05   1590.01   1592.43
Pred Error Norm1      2.6735    2.6659     2.667    2.6703
Pred Error Norm2       1.889    1.8831    1.8880    1.8883


Listing: The Model ADT2
Parameter        Code       Sum of Sq    Mean Sq    F Value        Pr(F)
Section Length   slen         778.889   778.8891   691.0159   0.00000000
Intersections    its           61.090    61.0904    54.1983   0.00000000
Traffic Volume   adt          208.089   208.0886   184.6124   0.00000000
Speed Limit      spd           36.750    36.7503    32.6042   0.00000001
Parking          pk            10.422    10.4216     9.2459   0.00240237
Paved Shoulder   ops            9.941     9.9407     8.8192   0.00302953
Raised Curb      oc             3.597     3.5966     3.1908   0.07426068
Product          slen:its      74.114    74.1145    65.7530   0.00000000


Among the four best models that were accepted for
further comparison, ADT2 gave the best results. ADT2
represents the model corresponding to the untransformed form
of AADT. Both norms indicating relative quality of
prediction, and most of the model parameters, are superior
for ADT2 compared to those of the other three models; only
the residual deviance and AIC of ADT3 are marginally lower
(Table 7.3). The interaction of AADT with other parameters,
if any, will be discussed in later sections.



Can Coefficient of Speed Limit be Negative?

In most of the models developed in the past, the
modeling process started with speed limit as one of the
regressors, but the final model did not contain speed limit
as one of the predictors. None of the models presented in
the literature review contain speed limit. When a higher
speed limit is expected to result in higher crash
frequencies, any result contradicting that expectation looks
unacceptable and can lead to the forced removal of the
variable from the model.
As speed limit increases, can the crash frequency
decrease? A few models were displayed in the previous
sections. In all these models, speed limit was found to be
a very significant parameter, but its coefficient was
consistently negative. When the p-value of this variable is
close to 0, its importance in the model is undeniable even
though its credibility looks suspicious.









Speed limit is not a truly independent variable. Higher

speed limit is associated with higher design standards.

Higher design standard is associated with better physical

highway features. Examples of measurable features are wider

pavement and wider shoulder. Features, which are difficult

to measure include better pavement conditions, drainage

conditions, sight distance and access control.

According to these models, crash frequency decreases as
speed limit increases. If this relationship holds
completely, then most of the efforts to increase safety
should focus on attaining higher highway design standards
that would call for higher speed limits. Model SPD1, given
in Table 7.4, represents the model obtained by treating
speed limit as a continuous variable; the coefficient
assigned to speed limit through regression analysis is
negative.



Categorical Treatment of Speed Limit:

As discussed in the previous section, higher speed
limit might be associated with higher safety and lower crash
frequency. The extent to which this concept holds can be
better understood by treating speed limit as a categorical
variable. The following listing shows a method used to
redefine speed limit as a categorical variable.











 #   spd   spdc   spd0   spd1   spd2   spd3   spd4   spd5   spd6
 1    25     0      1      0      0      0      0      0      0
 2    30     1      0      1      0      0      0      0      0
 3    35     2      0      0      1      0      0      0      0
 4    40     3      0      0      0      1      0      0      0
 5    45     4      0      0      0      0      1      0      0
 6    50     5      0      0      0      0      0      1      0
 7    55     6      0      0      0      0      0      0      1
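
The coding in this listing can be generated mechanically. The sketch below maps a posted speed limit to the category code spdc and the indicator variables spd0 through spd6; the function name is hypothetical and the list of levels is taken from the listing.

```python
# Posted speed limits in the order used by the listing.
SPEED_LEVELS = [25, 30, 35, 40, 45, 50, 55]


def encode_speed(spd):
    """Return (spdc, indicators); indicators[k] corresponds to spd0..spd6."""
    spdc = SPEED_LEVELS.index(spd)   # raises ValueError for an unlisted limit
    indicators = [1 if k == spdc else 0 for k in range(len(SPEED_LEVELS))]
    return spdc, indicators


spdc, indicators = encode_speed(40)
print(spdc, indicators)
```

When such indicators are used in a model that also contains an intercept, one of them is normally left out as the reference level, since the full set always sums to one.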



Model SPD2 in Table 7.4 represents the model obtained

by giving categorical treatment to speed limit. Though the

model parameters did not improve, the same trend was

observed. As speed limit increased, crash frequency

decreased.

The categorical treatment of speed limit also displayed

some quadratic trend. Therefore the square term was added

to see if the continuous term for representing speed limit

could still be used without losing the behavior pattern at

higher speed limits. Model SPD3 represents the model

resulting from including the quadratic term of speed limit.




# Model Speed Limit Characteristics
1 SPD1 spd speed limit continuous
2 SPD2 spdc speed limit is categorical
3 SPD3 spd^2 square transformed










TABLE 7.4 Models for representing Speed Limit
Model Parameter         SPD1       SPD2      SPD3
Null Deviance       2918.412    2948.45   2977.26
Residual Deviance   1570.648    1571.23   1570.31
Theta                 1.3517      1.376     1.402
Standard Error         .0898      .0920      .094
2*log likelihood    8770.856    8783.55   8797.05
AIC                 1590.052   1601.532   1613.75
Pred Error Norm1      2.6659     2.6668    2.6725
Pred Error Norm2      1.8831      1.886    1.8923


Listing: The Model SPD1 (=ADT2)
Parameter        Code       Sum of Sq    Mean Sq    F Value        Pr(F)
Section Length   slen         778.889   778.8891   691.0159   0.00000000
Intersections    its           61.090    61.0904    54.1983   0.00000000
Traffic Volume   adt          208.089   208.0886   184.6124   0.00000000
Speed Limit      spd           36.750    36.7503    32.6042   0.00000001
Parking          pk            10.422    10.4216     9.2459   0.00240237
Paved Shoulder   ops            9.941     9.9407     8.8192   0.00302953
Raised Curb      oc             3.597     3.5966     3.1908   0.07426068
Product          slen:its      74.114    74.1145    65.7530   0.00000000


A stepwise regression on SPD3 rejected the quadratic
term of speed limit from the model. Therefore SPD3 is not
considered in the selection process. Between SPD1 and SPD2,
SPD1 has better model parameters and better prediction
quality. All parameters of SPD1 are superior to those of
SPD2, which indicates that defining speed limit as a
continuous variable is the better of the two options.
Though SPD2, the categorical model, is not selected as
the best model, it has given some very powerful results
which help to accept the results of SPD1 with more
confidence: "higher speed limits are associated with safer
highway sections."



Representing Cross Sectional Factors


Lane width is generally regarded as the most important
parameter among all cross sectional variables. The next
most significant parameter is paved shoulder width and the
least is unpaved shoulder width. In terms of safety, this
belief need not be true. Though the lane provides the
primary function, which is moving the traffic, the shoulder
plays a major role in emergency situations.
On rural highways, some clear area is provided beyond
the unpaved shoulder. This area is called the safe recovery
area. The safe recovery area gives vehicles in danger a
much better chance of recovering without a crash. For urban
highways, this provision is usually absent due to
unavailability of adequate right-of-way.









The Lane Width Problem:

Past studies have consistently concluded that lane width is an
important parameter with a significant influence on crash frequency.
Moreover, it is reasonable and logical to expect lane width to have a
large effect on safety. Yet in all models the p-value of lane width
was high, and stepwise regression procedures consistently rejected
lane width and prevented it from becoming one of the predictors.

In the following sections, some methods are used to identify the
behavior of lane width. If lane width does not affect crash frequency,
that would be a strange result. If lane width does affect crash
frequency, there must be an underlying behavior pattern that prevents
it from staying in the prediction models.



Categorical Treatment of Lane width:

The value of lane width ranges from 9 feet to 15 feet. The sorted plot
of lane width [Figure 4.2] shows that it is discrete in nature and
assumes only integer values. New indicator variables were defined to
treat lane width as a categorical variable. The values assumed by
these new variables at the various levels of lane width are shown in
the following listing.









#   Lane width   lw0   lw1   lw2   lw3   lw4   lw5
1    9 feet       1     0     0     0     0     0
2   10 feet       1     0     0     0     0     0
3   11 feet       0     1     0     0     0     0
4   12 feet       0     0     1     0     0     0
5   13 feet       0     0     0     1     0     0
6   14 feet       0     0     0     0     1     0
7   15 feet       0     0     0     0     0     1
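
The indicator coding in the listing above can be sketched in a few lines of Python. This is only an illustration of the coding scheme; the function and dictionary names are hypothetical and are not taken from the study, and the software actually used for the analysis is not assumed here.

```python
# Hypothetical helper: maps an integer lane width in feet to the six
# indicator variables lw0..lw5 shown in the listing. The 9 ft and 10 ft
# widths share category 0, as in the listing.
LANE_WIDTH_CATEGORY = {9: 0, 10: 0, 11: 1, 12: 2, 13: 3, 14: 4, 15: 5}


def lane_width_indicators(width_ft):
    """Return {'lw0': 0 or 1, ..., 'lw5': 0 or 1} for one lane width."""
    if width_ft not in LANE_WIDTH_CATEGORY:
        raise ValueError(f"lane width {width_ft} ft is outside the 9-15 ft range")
    category = LANE_WIDTH_CATEGORY[width_ft]
    return {f"lw{k}": int(k == category) for k in range(6)}


if __name__ == "__main__":
    for width in range(9, 16):
        print(width, lane_width_indicators(width))
```

Exactly one indicator is set for each width, so the six variables together carry the same information as a single six-level categorical variable.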



The seven discrete values of lane width (9, 10, 11, 12, 13, 14, and 15
feet) were put into six categories. The first two values share a
category because there are few highway sections with 9-foot lanes.
The model obtained from the categorical treatment of lane width is
LW2.

In model LW2, the variable lwc, which represents lane width defined as
a categorical variable, became significant with a low p-value. The
model parameters are shown in Table 7.5. They are not superior to
those of the other forms representing lane width, so this model is
also rejected.

The coefficients of model LW2 revealed a characteristic behavior
pattern. As lane width increases, the crash frequency decreases
initially, but as lane width increases further, the crash frequency
increases instead of decreasing. A straight line fitted to this trend
is closer to a horizontal line (slope of 0) than to an inclined line.
This behavior prevented lane width from being a significant parameter
in the prediction model as a continuous variable. Since the
relationship is nonlinear, a square term is included to capture the
behavior of lane width. The resulting model, LW3, is also summarized
in Table 7.5.

The effect of lane width on safety could be greatly influenced by the
availability of paved shoulder width. A product term representing the
interaction between lane width and paved shoulder width was rejected
based on AIC.
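
Why a U-shaped trend defeats a linear term but not a square term can be illustrated with a small sketch. The numbers below are invented for illustration and are not data from the study. The fit is ordinary least squares rather than the regression model used in this chapter, and centering the square term at 12 feet is an assumption made only to keep the arithmetic simple.

```python
def ols_slope(x, y):
    """Least-squares slope of y regressed on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Invented, symmetric U-shaped response over lane widths 9..15 ft.
lane_width = [9, 10, 11, 12, 13, 14, 15]
response = [2 + 0.5 * (w - 12) ** 2 for w in lane_width]

# Linear term: the falling and rising halves cancel, so the slope is zero.
slope_linear = ols_slope(lane_width, response)

# Square term (centered at 12 ft, an assumption): the trend is recovered.
squared = [(w - 12) ** 2 for w in lane_width]
slope_squared = ols_slope(squared, response)

if __name__ == "__main__":
    print("slope on lane width:        ", slope_linear)
    print("slope on squared lane width:", slope_squared)
```

With a flat fitted line, a significance test on the linear coefficient has nothing to detect, which is consistent with the stepwise procedure rejecting lane width as a continuous variable.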



Introducing Pavement Width:

On two-lane highways, the boundary between the lane and the paved
shoulder is just a solid white line. This line matters less on a
two-lane highway than on a multi-lane highway, since all vehicles have
direct access to the paved shoulder. On highway sections with more
than one lane in each direction, only vehicles in the outermost lane
have direct access to the paved shoulder.

Pavement width is defined as the sum of lane width and paved shoulder
width. For two-lane highways, pavement width may be considered the
effective lane width, since vehicles can use the paved shoulder
without any restrictions.

Pavement width can be modeled as a single parameter instead of using
lane width, paved shoulder width, and the product term that represents
their interaction. The model thus obtained, LW4, is compared with the
other models to find the best prediction model in the group. The
following listing gives a brief description of the models considered,
and Table 7.5 shows the model parameters of all models discussed.


Model   Lane Width      Rejected by Stepwise
LW1     lw              lw
LW2     lwc             oups
LW3     lw, lw2         oups, oc
LW4     pw = lw+ops     oc


TABLE 7.5 Models for representing Lane width
Model Parameter      LW1        LW2        LW3        LW4
Null Deviance        2918.41    2954.86    2926.45    2917.64
Residual Deviance    1570.65    1574.12    1571.28    1569.89
Theta                1.3517     1.3819     1.3583     1.3511
Standard Error       .0898      .0928      .0905      .0897
2*log likelihood     8770.86    8783.48    8773.79    8771.27
AIC                  1590.052   1602.28    1597.22    1589.29
Pred Error Norm1     2.666      2.668      2.665      2.663
Pred Error Norm2     1.8832     1.8832     1.882      1.880


Though LW2, the categorical model, was successful in revealing the
behavior of lane width, the prediction error did not improve; it
became slightly worse, from 2.666 to 2.668. Model LW3, representing
the square form of lane width, is a slight improvement. Model LW4,
which represents pavement width, gave the best results. Therefore LW4
is selected as the best of these models.
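
The comparison above can be checked against Table 7.5 with a short sketch. The values are copied from the table; the selection rule (smallest value wins for AIC and for both prediction error norms) is an assumption about how the comparison was made, not a procedure stated in the text, and the names used are hypothetical.

```python
# Values copied from Table 7.5: (AIC, Pred Error Norm1, Pred Error Norm2).
TABLE_7_5 = {
    "LW1": (1590.052, 2.666, 1.8832),
    "LW2": (1602.28, 2.668, 1.8832),
    "LW3": (1597.22, 2.665, 1.882),
    "LW4": (1589.29, 2.663, 1.880),
}

CRITERIA = {"AIC": 0, "Pred Error Norm1": 1, "Pred Error Norm2": 2}


def best_by(criterion):
    """Return the model with the smallest value of the named criterion."""
    column = CRITERIA[criterion]
    return min(TABLE_7_5, key=lambda name: TABLE_7_5[name][column])


if __name__ == "__main__":
    for criterion in CRITERIA:
        print(f"lowest {criterion}: {best_by(criterion)}")
```

Under this rule all three criteria point to the same model, LW4, which matches the selection made in the text.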


Parameter Listing: The Model LW4

Parameter        Code       Sum of Sq   Mean Sq    F Value    Pr(F)
Section Length   slen       779.363     779.3627   690.0949   0.00000000
Intersections    its         61.340      61.3403    54.3145   0.00000000
Traffic Volume   adt        207.298     207.2979   183.5541   0.00000000
Speed Limit      spd         36.681      36.6808    32.4794   0.00000001
Parking          pk          10.020      10.0201     8.8724   0.00294298
Pavement Width   pw           9.443       9.4431     8.3615   0.00388939
Raised Curb      oc           3.731       3.7308     3.3035   0.06933926
Product          slen:its    75.747      75.7475    67.0714   0.00000000


Analysis of Shoulder:

The parameters ops, oups, and oc, which represent outside paved
shoulder width, outside unpaved shoulder width, and presence of raised
curb respectively, survived the AIC criterion and found a place in the
model. The coefficients of these parameters suggest that increasing
paved or unpaved shoulder width reduces crash frequency and that the
presence of a raised curb increases crash frequency. Though these
parameters were retained by the AIC criterion, their standard errors
and p-values are very high.

Paved and unpaved shoulders help to increase the safety of any highway
section. All the models developed in the past support this argument.
But to what extent is a shoulder capable of reducing crashes? The
answer to this question is revealed through the analysis in the
following sections.



Categorical treatment of shoulder:

A sorted plot of shoulder widths is shown in Figure 4.2. The pattern
in the plots suggests that the values of shoulder width are discrete,
ranging from 0 to 12. If an indicator variable were assigned to each
value of shoulder width, the degrees of freedom would increase by 24.
To reduce the degrees of freedom, values in specific ranges are placed
in the same category. The following listing shows how the shoulder
width parameters can be redefined to reduce the total degrees of
freedom to 6.



#   opsc/    ops/     opsc0/    opsc1/    opsc2/    opsc3/
    oupsc    oups     oupsc0    oupsc1    oupsc2    oupsc3
1   0        0        1         0         0         0
2   3        2-4      0         1         0         0
3   6        5-7      0         0         1         0
4   9        8-12     0         0         0         1
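
The grouping in the listing above can be sketched as a small lookup. The ranges and category labels are copied from the listing; the function names are hypothetical. The listing assigns no category to a width of 1, so the sketch raises an error for it rather than guessing.

```python
# Ranges copied from the listing: (lowest width, highest width, label).
SHOULDER_BINS = [(0, 0, 0), (2, 4, 3), (5, 7, 6), (8, 12, 9)]


def shoulder_category(width):
    """Return the opsc/oupsc label (0, 3, 6 or 9) for a shoulder width."""
    for low, high, label in SHOULDER_BINS:
        if low <= width <= high:
            return label
    # No category is listed for this width (for example 1), so none is guessed.
    raise ValueError(f"no category listed for shoulder width {width}")


def shoulder_indicators(width, prefix="opsc"):
    """Return the four indicators, e.g. opsc0..opsc3, for one width."""
    label = shoulder_category(width)
    return {
        f"{prefix}{k}": int(bin_label == label)
        for k, (_, _, bin_label) in enumerate(SHOULDER_BINS)
    }


if __name__ == "__main__":
    for width in (0, 3, 6, 10):
        print(width, shoulder_indicators(width, prefix="oupsc"))
```

Each shoulder type then contributes four levels, or three degrees of freedom, which gives the total of 6 for paved and unpaved shoulders together.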









It was observed that both paved and unpaved shoulders have a strong
influence on the model. The estimates of standard error and p-value
for these parameters were found to be low. The models discussed above
are listed below, and the model parameters are shown in Table 7.6.

The paved shoulder showed a behavior similar to that of lane width.
The model statistics are inferior to those of the model in which
pavement width is used. Though this model is not preferred, the
categorical treatment offers some important results, which are
discussed in the final chapter.


Model   Cross Section
SH1     pw
SH2     lw, opsc, oups
SH3     lw, opsc, oupsc
SH4     pw, oupsc


Rejected by Stepwise
oups
lw, oups
lw, oc


TABLE 7.6 Models representing Cross Sectional Factors
Model Parameter      SH1        SH2        SH3        SH4
Null Deviance        2917.64    2929.60    2952.74    2949.70
Residual Deviance    1569.89    1570.36    1573.40    1571.97
Theta                1.3511     1.3608     1.3798     1.377
Standard Error       .0897      .0906      .0925      .0922
2*log likelihood     8771.27    8776.10    8783.27    8783.35
AIC                  1589.29    1594.12    1597.19    1595.745
Pred Error Norm1     2.6633     2.6661     2.6643     2.6617
Pred Error Norm2     1.8832     1.8836     1.8812     1.8784











Parameter Listing: The Model SH4

Parameter        Code       Sum of Sq   Mean Sq    F Value    Pr(F)
Section Length   slen       782.203     782.2030   694.9421   0.00000000
Intersections    its         63.645      63.6448    56.5448   0.00000000
Traffic Volume   adt        207.663     207.6626   184.4962   0.00000000
Speed Limit      spd         37.945      37.9447    33.7116   0.00000001
Parking          pk           9.600       9.6001     8.5291   0.00354894
Pavement Width   pw          10.081      10.0806     8.9560   0.00281215
Unpaved Shldr    oupsc       12.403       4.1343     3.6731   0.01182596
Product          slen:its    76.999      76.9993    68.4094   0.00000000


The unpaved shoulder, when given categorical treatment, showed a
pattern different from that of both lane width and paved shoulder
width. Moreover, this model is a significant improvement over the
former model, though the degrees of freedom increased by 3. Therefore
SH4, in which pavement width represents lane width and paved shoulder
width and unpaved shoulder width is expressed as a categorical
variable, is considered the best model.
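
The three degrees of freedom carried by the categorical unpaved shoulder term can be read back from the SH4 parameter listing. The sketch below assumes the listing follows the usual analysis-of-variance layout, in which Mean Sq is Sum of Sq divided by the degrees of freedom of the term and F Value is Mean Sq divided by a common residual mean square; the text does not state this layout explicitly. The values are copied from three rows of the listing.

```python
# (Sum of Sq, Mean Sq, F Value) copied from the SH4 parameter listing.
SH4_ROWS = {
    "slen": (782.203, 782.2030, 694.9421),
    "its": (63.645, 63.6448, 56.5448),
    "oupsc": (12.403, 4.1343, 3.6731),
}


def term_df(sum_sq, mean_sq):
    """Degrees of freedom implied by Sum of Sq / Mean Sq (assumed layout)."""
    return round(sum_sq / mean_sq)


def implied_residual_ms(mean_sq, f_value):
    """Residual mean square implied by Mean Sq / F Value (assumed layout)."""
    return mean_sq / f_value


DF = {code: term_df(row[0], row[1]) for code, row in SH4_ROWS.items()}
RESIDUAL_MS = {code: implied_residual_ms(row[1], row[2]) for code, row in SH4_ROWS.items()}

if __name__ == "__main__":
    for code in SH4_ROWS:
        print(f"{code:6s} df={DF[code]}  implied residual MS={RESIDUAL_MS[code]:.4f}")
```

The continuous terms come out with one degree of freedom and oupsc with three, and all rows imply nearly the same residual mean square, which supports reading the listing this way.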



Identifying Significant Interactions

The previous sections evaluated the transformation functions and
identified the best way of representing each parameter in the model.
Variables that are assumed to be independent need not be truly
independent. Strong interactions among variables can be identified
and measured using their product terms.
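
The set of candidate product terms can be enumerated mechanically. The sketch below pairs the main-effect codes that appear in the SH4 parameter listing; the function name is hypothetical. It counts one product term per pair of variables, whereas a categorical variable such as oupsc would contribute several coefficients per term in a fitted model.

```python
from itertools import combinations

# Main-effect codes taken from the SH4 parameter listing.
MAIN_EFFECTS = ["slen", "its", "adt", "spd", "pk", "pw", "oupsc"]


def second_level_interactions(codes):
    """Return every unordered pair of codes as an 'a:b' product-term label."""
    return [f"{a}:{b}" for a, b in combinations(codes, 2)]


TERMS = second_level_interactions(MAIN_EFFECTS)

if __name__ == "__main__":
    print(len(TERMS), "candidate second level product terms")
    for term in TERMS:
        print(term)
```

Seven main effects give 21 candidate pairs, which shows how quickly the model grows once all second level interactions are admitted.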










A large model is developed from the current model by allowing all
possible second level interactions. Interactions at level three and
above are neglected because the resulting terms are more complex and
harder to explain. The resulting model, INT2, is not better than any
of the simpler models discussed in the earlier sections, but it could
help identify some powerful interactions. Since this model has a
large number of parameters, it is able to give a good fit to the
present data, but it has a very high prediction error. Besides, most
of the second level interaction parameters are unexplainable.


Parameter Listing: The Model INT3

Parameter        Code        Sum of Sq   Mean Sq    F Value    Pr(F)
Section Length   slen        799.720     799.7196   712.9671   0.00000000
Intersections    its          62.956      62.9564    56.1270   0.00000000
Traffic Volume   adt         214.338     214.3380   191.0869   0.00000000
Speed Limit      spd          43.483      43.4826    38.7657   0.00000000
Parking          pk            7.774       7.7741     6.9308   0.00856236
Pavement Width   pw           11.006      11.0057     9.8118   0.00176869
Unpaved Shldr    oupsc        13.405       4.4683     3.9836   0.00771230
Product          slen:its     81.607      81.6067    72.7541   0.00000000
Product          slen:spd     15.395      15.3945    13.7245   0.00021963
Product          its:spd       3.214       3.2137     2.8651   0.09073489
Product          adt:oupsc    21.650       7.2166     6.4337   0.00025089
Product          spd:pk       18.935      18.9347    16.8806   0.00004203









Even though INT2 cannot be preferred over the other models, it gives
some powerful insights into a few important interaction terms. Such
terms are identified by screening this full model through the stepwise
filter. The resulting smaller model, INT3, has a considerably lower
prediction error than models INT1 and INT2.



#   Model   Characteristics
1   INT1    best model from previous section
2   INT2    all second degree interactions
3   INT3    model selected by stepwise regression
4   INT4    spd:its removed manually


TABLE 7.7 Models representing Second Degree Interactions
Model Parameter      INT1       INT2       INT3       INT4
Null Deviance        2949.70    3177.45    3077.12    3070.09
Residual Deviance    1571.97    1579.94    1573.31    1573.64
Theta                1.377      1.5755     1.4884     1.4816
Standard Error       .0922      .1108      .1025      .1018
2*log likelihood     8783.35    8870.79    8836.49    8833.24
AIC                  1595.745   1673.16    1610.23    1608.36
Pred Error Norm1     2.6617     2.6748     2.6545     2.6531
Pred Error Norm2     1.8784     1.8964     1.8746     1.8735


Model INT3 is further checked for interactions that are very weak and
unexplainable. Such variables are removed to see whether their
removal improves the prediction error. The interaction between speed
limit and number of intersections is very weak and has a high p-value
(0.0907). Removing this term (INT4) further improved the prediction
quality. The important models in this series of analyses are given in
the model listing that accompanies Table 7.7, and the model parameters
are shown in Table 7.7.


Parameter Listing: The Model INT4

Parameter        Code        Sum of Sq   Mean Sq    F Value    Pr(F)
Section Length   slen        799.117     799.1168   712.8898   0.000000000
Intersections    its          65.146      65.1457    58.1163   0.000000000
Traffic Volume   adt         214.570     214.5702   191.4175   0.000000000
Speed Limit      spd          43.456      43.4564    38.7674   0.000000001
Parking          pk            7.954       7.9544     7.0961   0.007810734
Pavement Width   pw           10.908      10.9084     9.7314   0.001847196
Unpaved Shldr    oupsc        13.497       4.4990     4.0135   0.007400247
Product          slen:its     81.778      81.7782    72.9541   0.000000000
Product          slen:spd     15.911      15.9113    14.1945   0.000171459
Product          adt:oupsc    20.814       6.9379     6.1893   0.000354112
Product          spd:pk       18.252      18.2516    16.2822   0.000057421


The Final Model

The previous sections of this chapter presented a series of regression
models in stages. Each stage was supported by an improvement in model
parameters and justified by a corresponding reduction in the mean
prediction error. The final model selected from this series of
analyses is INT4.









Table 7.8 Observed vs. Predicted Values
[Table body not legible in the source. Columns: section number (#), actual crash frequency, predicted crash frequency, absolute error.]