EXAMINATION OF STATISTICAL RELATIONSHIPS BETWEEN HIGHWAY CRASHES AND HIGHWAY GEOMETRIC AND OPERATIONAL CHARACTERISTICS OF TWO-LANE URBAN HIGHWAYS

By JACOB ARULDHAS

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 1998

to my mother Mrs. Daisy Aruldhas who happily sacrificed her life and everything she had to bring up her three children, Paul Aruldhas, Dorothy John and myself in the absence of my father, Professor M. Aruldhas, now with the Lord, who was taken into glory when I was only four years of age.

ACKNOWLEDGMENTS

It is with great pleasure that I thank my advisor, Professor Joseph A. Wattleworth, for giving me the opportunity to work on his research project, which led to the study that I present in this dissertation. Dr. Wattleworth served as the chairman of my committee and played the key role in directing the studies and analysis towards obtaining practically applicable results. I thank Dr. Mang Tia, Dr. David Bloomquist, and Dr. Ralph Ellis of the Department of Civil Engineering, University of Florida, for serving on my committee and for giving valuable guidance concerning the objectives of the study. My sincere thanks to Dr. Geoffrey Wining of the Department of Statistics, University of Florida, for all the analytical guidance he has given on issues related to statistical modeling. I express my gratitude to Dr. Paul Thompson, Dr. Fazil Najafi, Dr. Mang Tia and Dr. David Bloomquist of the Department of Civil Engineering for advice, encouragement and extended help during times of need. Special thanks to my wife, Emi, and my children, Eden, Emily and Elizabeth, for all the trouble they took when I had to neglect them for the sake of this study. And I thank God for His wonderful presence in my life that I have enjoyed most of the time. The joy of the Lord has always been my strength.
His peace that He has given me has enabled me to overcome all barriers that came my way.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LEGEND
ABSTRACT

CHAPTERS

1 INTRODUCTION
    Background
    Scope
    Objectives
    Layout of this Report
2 LITERATURE REVIEW
    Lane Width
    Shoulder Width
    Shoulder Type
    Speed Limit
3 DATA ORGANIZATION
    Highway Geometric Data
    Highway Accident Data
    Highway Classification
    Two-lane Urban Undivided Highways
    Data Statistics
    Visual Display of Data
4 CRASH DISTRIBUTION
    Crash Classification
    Crash Frequency and Crash Rate
    Frequency Distribution
    Poisson Distribution
    Negative Binomial Distribution
    Rejection of Poisson Distribution
    Conclusion
5 EXPLANATORY VARIABLES
    The Highway Section
    Longitudinal Parameters
    Operational Parameters
    Cross Sectional Parameters
6 MODELING STRATEGIES & THE BASE MODEL
    Generalized Linear Models
    Model Statistics
    Variable Selection Procedure
    Model Performance Criteria
    Basic Assumptions and Base Model
7 STATISTICAL MODELING
    Representing Longitudinal Factors
    Representing Operational Factors
    Representing Cross Sectional Factors
    Identifying Significant Interactions
    The Final Model
8 RESULTS AND DISCUSSION
    All Crashes
    Crashes with Property Damages Only
    Injury Crashes
    Fatal Crashes
    A Brief Overview of the Models
9 FOR FURTHER STUDIES
10 CONCLUSIONS
    Conclusions
    Limitations

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
3.1 Classification of Highways
3.2 Data Statistics
7.1 Models for representing Section Length
7.2 Models for representing Longitudinal Factors
7.3 Models for representing AADT
7.4 Models for representing Speed Limit
7.5 Models for representing Lane Width
7.6 Models for representing Cross Sectional Factors
7.7 Models representing Second Degree Interactions
7.8 Observed vs. Predicted Values

LIST OF FIGURES

Figure
3.1 Unsorted & Sorted Plots of Continuous Variables
3.2 Unsorted & Sorted Plots of Categorical Variables
3.3 Unsorted & Sorted Plots of Logical Variables
4.1 Distribution of Actual Crash Data
4.2 Distribution of Poisson Data
4.3 Distribution of Negative Binomial Data
5.1 Pairwise Plot of Longitudinal Factors
5.2 Pairwise Plot of Operational Factors
5.3 Pairwise Plot of Cross Sectional Factors
6.1 Standard Plots of the Base Model
7.1 Plots from diagnostic study
8.1 Crash Frequencies vs. Section Length
8.2 Crash Frequencies vs. Intersections
8.3 Crash Frequencies vs. AADT
8.4 Crash Frequencies vs. Speed Limit
8.5 Crash Frequencies vs. On-Street Parking
8.6 Crash Frequencies vs. Pavement Width
8.7 Crash Frequencies vs. Unpaved Shoulder Width
9.1 Total Crash Frequency vs. Section Length
9.2 Total Crash Frequency vs. Lane Width
9.3 Total Crash Frequency vs. Paved Shoulder
9.4 Total Crash Frequency vs. Unpaved Shoulder
9.5 Total Crash Frequency vs. Raised Curb

LEGENDS

Variables Representing Crash Frequencies:
acc     Total number of accidents
inj     Number of crashes that involve injuries
pdo     Number of crashes with Property Damage Only
fat     Number of crashes that result in fatalities

Variables Representing Explanatory Terms:
slen    Length of highway section in 1/1000th of a mile
adt     Average Annual Daily Traffic
its     Number of intersections in the section
lw      Lane width
lwc     Lane width redefined as categorical variable
ops     Paved shoulder width
opsc    Paved shoulder redefined as categorical variable
oups    Unpaved shoulder width
oupsc   Unpaved shoulder redefined as categorical variable
oc      Variable to represent presence or absence of curb
fr      Coefficient of friction
spd     Speed limit

Statistical Terms:
nb       Negative binomial distribution
θ        Overdispersion factor in nb distribution
GLM      Generalized Linear Models
α        Expected value of y-intercept of regression model
β        Expected value of regressor coefficient
Df       Degrees of Freedom
AIC      Akaike's Information Criteria
D        Deviance
Resid    Residual
SS       Sum of Squares
n        Number of observations
MAD      Mean Absolute Deviation
Std Err  Standard Error
Obs      Observed / actual value
Pred     Predicted value
Modeled as function of
Represents the product term of two variables

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy.

EXAMINATION OF STATISTICAL RELATIONSHIPS BETWEEN HIGHWAY CRASHES AND HIGHWAY GEOMETRIC AND OPERATIONAL CHARACTERISTICS OF TWO-LANE URBAN HIGHWAYS

By JACOB ARULDHAS

May 1998

Chairman: Dr.
Joseph A. Wattleworth
Major Department: Department of Civil Engineering

The accuracy and reliability of crash prediction models depend on the validity of the assumptions made in the analysis, the definition of response variables and the proper representation of influencing factors in the relationships established between crash frequencies, geometric characteristics and operational characteristics of highways. Several crash prediction models have been developed by investigators in past years. These models have helped highway engineers design highway sections with improved safety and economy. They have also been used in estimating the overall benefits of highway improvement programs. Several studies reported in the literature show that accident prediction models were developed to understand the effect of cross sectional parameters on crash occurrence.

In this study, crashes that occurred during the years 1988 through 1991 in the State of Florida are analyzed using statistical methods. The analysis of this large number of observations has produced some important results. The accepted belief concerning the distribution of crash data was found to have several inconsistencies, including violation of basic assumptions. A fair portion of the analysis time was used to find the best distribution function for representing crash frequency. Crash rate, a function of crash frequency, section length and traffic volume, is generally considered as the response variable in crash modeling studies. Crash rate was found to have a very weak relationship with explanatory variables, as indicated by high p-values. Crash frequency was found to be the ideal response variable, with section length and traffic volume defined as explanatory variables. The use of certain transformation functions, as recommended by most past studies, was found to degrade model quality.
Several experiments were done to find the best transformation function to represent the model parameters. The relationship between cross sectional variables and crash frequency was found to be nonlinear.

The results of this analysis can be used to estimate the contribution of each design parameter to the expected crash frequency during any given time period for a two-lane urban highway section. The effect of each variable on total, PDO (Property Damage Only), injury and fatal crashes can be estimated using the models. The procedure used in the analysis can be used to study other types of highways.

CHAPTER 1
INTRODUCTION

Background

Crash analysis has become a much easier task in the recent past, since the state authorities have started taking solid steps in accident data collection and management. Why do crashes occur? How often do they happen? Can they be measured? If they can be measured, can they be predicted? If they cannot be measured, are they completely random? Investigators from various professional backgrounds, including highway engineering, transportation planning, statistics and mathematics, have made many attempts to answer such questions. Their studies of the relationship between highway geometric parameters and safety have resulted in valuable conclusions. The Poisson distribution is widely used to represent the crash rate. The most commonly used predictors are lane width, shoulder width and traffic volume. Some of the parameters that are believed to affect highway crashes are lane width, shoulder width, median width, surface characteristics and slopes, drainage condition, horizontal curvature, vertical curvature, sight distance, presence of narrow bridges, type of access control, lighting conditions and width of clear roadside recovery area.
Though a number of studies have been done in this area, several assumptions and concepts concerning the relationships established were not validated. A close observation of these models showed that they lack any kind of uniformity. Inconsistencies were found in the distribution assumed, the parameters considered for analysis, the functions used for variable transformation, the variables used in the final prediction model, and the parameter coefficients.

Scope

The scope of this study includes a review of past studies that are relevant to highway safety and highway crash modeling. Information obtained from the review of literature is used as a guideline in the initial stages of analysis. The second stage of the study is the preparation of data. The crash data used in the analysis consist of all reported accidents that occurred during the period 1989 through 1992 on two-lane urban highways within the State of Florida. Most of the parameters that are directly or indirectly related to highway design are included in the database.

Basic assumptions concerning crash distributions should be validated based on actual data. One entire chapter is dedicated to identifying the appropriate distribution function for representing crash frequency. Other issues related to defining crash rate as the dependent variable are also reviewed.

The regression analysis and the search for a model are usually exhaustive. To avoid a large number of random searches, an analysis strategy has been developed. As part of the strategy, certain norms are also developed for testing the models. All assumptions made based on the literature review and known concepts are validated or rejected based on statistical inference. The models reviewed are further diagnosed to find desirable design features.

Objectives

The primary objective of this study is to find the effect of highway geometric and highway operational parameters on crash frequency.
The effects of these parameters on total, PDO (Property Damage Only), injury and fatal crashes are estimated. There are some intermediate objectives that address key issues and form the foundation of the analysis. These are finding the best distribution function to represent the crash distribution, identifying the best form of the independent variables and identifying the interactions among independent variables. Another intermediate objective of the study is to find out whether uniformity of highway design, represented by section length, contributes to crash rate. If it does, then crash frequency should be considered as the dependent variable and section length as one of the independent variables.

Layout of this Report

The results of past studies are briefly discussed in Chapter 2, titled "Literature Review." Most of the references are to studies done in the United States or Canada. A few studies done in other developed countries are also included in the review of literature.

The assumptions concerning the distribution of crash frequency are evaluated and the results are discussed in Chapter 4. Three experiments are done on crash frequency to test the validity of accepted assumptions and to identify the ideal distribution function.

All available explanatory variables, their ranges and limits are discussed in Chapter 5. Data statistics like mean, median, mode and quartiles are calculated, and graphical methods are used to visualize the raw data from various perspectives.

Chapter 6 gives an introduction to Generalized Linear Models and briefly discusses the parameters that are used for variable transformation and model selection. Statistical methods used for organized regression analysis and the norms developed for testing the models are discussed in this chapter. The information gathered from the literature is used to develop the base model.
Chapter 7 consists of summaries of all stages of the regression analysis, from the base model to the final model. Each independent variable is analyzed separately, and an intermediate model is selected at the conclusion of each stage of analysis. The order in which the variables are analyzed is prioritized based on the relative importance of each parameter, as suggested by the base model. This chapter concludes with the selection of the final model.

Chapter 8 presents the results. The final model selected in Chapter 7 was developed from 75% of the data. The other 25% of the data was used for model testing at each stage. Once the final model is identified, the analysis data and test data are combined and the final model is updated using all data. Three other models are developed using the same procedure to represent injury crashes, PDO crashes and fatal crashes, respectively. These models, along with all the model parameters, are displayed in this chapter. Crash frequencies are computed at different levels of each independent variable. The calculated crash frequencies are plotted and displayed in this chapter. These figures can be used to understand the prevailing trends in the relationships established. While preparing the plot for one variable, all other variables are held constant at their median values.

Chapter 9 displays trends revealed by the categorical representation of cross sectional variables, which are recommended for future studies. Lane width, paved shoulder width and unpaved shoulder width, when treated as categorical variables, showed that there are specific values of these parameters at which the crash frequency is at a minimum.

The conclusions and limitations of this study are briefly discussed in Chapter 10, and the listing of all literature reviewed is given in References.
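The 75%/25% division of the data into analysis and test sets, described above for Chapter 8, can be illustrated with a minimal sketch. It assumes a simple seeded random split; the text does not say how the sections were actually assigned to each set. The function names, the sample records and the seed are hypothetical. MAD is the Mean Absolute Deviation listed in the legend.

```python
import random


def split_sections(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and return (analysis, test) lists.

    Assumption: a seeded random split; the dissertation does not state
    how sections were assigned to the analysis and test sets.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mean_absolute_deviation(observed, predicted):
    """Mean absolute difference between observed and predicted counts."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical highway sections: slen in 1/1000th of a mile, acc = crash count.
sections = [{"slen": 100 * (i + 1), "acc": i % 5} for i in range(20)]
analysis, test = split_sections(sections)
```

A model would be fitted on `analysis`, its predictions compared against the `acc` values in `test` with `mean_absolute_deviation`, and the final model then refitted on the combined data.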
CHAPTER 2
REVIEW OF LITERATURE

The results of past studies on safety analysis are categorized by parameter and discussed briefly in the following sections.

Lane Width

The Florida Green Book [1] specifies on page III-25 that "Traffic lanes should be 12 feet in width, but shall not be less than 10 feet in width. Streets and highways with significant truck traffic should have 12 feet wide traffic lanes." Lane width was found to affect crash rates on rural two-lane highways, particularly run-off-road, opposite-direction and sideswipe crash rates.

Jorgensen [2] reviewed fifteen studies, conducted before 1978, that dealt with the effect of lane width on safety. Eight of these studies showed that crash rates decreased as the lane width increased for rural two-lane highways. Another study showed that crash rates for highways with 12 feet wide lanes did not differ significantly from those for highways with 11 feet wide lanes. Two other studies [2] found no relationship between crash rates and lane width for rural two-lane highways. Three studies on two-lane urban arterials could not find relationships between roadway width and crashes. The following is a summary of the most important findings of the studies on two-lane rural highways that were reviewed by Jorgensen [2]:

* Gupta and Jain [3] used multiple linear regression analyses to investigate the effects of roadway width on crash rates. Increasing roadway width was found to reduce multiple-vehicle crashes. The roadway width was found to have no effect on crash rates at AADT higher than 3000 vehicles per day.
* Dart and Mann [4] found that crash rates decreased as lane width increased up to 11 feet, then remained relatively constant.
* Cope [5] showed from a before-and-after study a significant decrease in crash rates when widening lanes from 9 feet to 12 feet, especially at high-crash sections.
* Shah [6] found a definitive relationship between pavement width and crash rate. The results showed that 22 feet to 24 feet wide pavements had fewer crashes than narrower and wider pavements.
* Shannon and Stanley [7] studied the relationship of construction cost, maintenance cost and crash costs to paved width. The analysis revealed a general tendency for crash rates to decline as pavement width increased.

For two-lane urban arterial streets, Gupta and Jain [3], Head [8] and Mulinazzi [9] could not find a relationship between crash rates and lane width.

Silyanov [10] evaluated international studies on two-lane two-way highways and found that crash rates decreased as pavement width increased for pavement widths between 13 feet and 30 feet. On wide pavements, the crash reduction due to improvement was lower than that on narrow pavements. Based on several international studies, Choueeiri et al. [11] concluded that a significant decrease in crash rates could be expected by increasing pavement width up to about 25 feet. A study in Australia by McLean [12] showed that the most safety-effective lane width is about 3.4 meters (11 feet). McCarthy [13] showed from a before-and-after study that widening lanes on 17 sites from 2.7 meters and 3.0 meters (9 feet and 10 feet) to 3.4 meters and 3.7 meters (11 feet and 12 feet) resulted in a reduction in crash rate of 22%. However, Choueeiri et al. [11] reported results from a previous study which show, contrary to expectation, that crash severity increases as pavement width increases. They suggested that the reason for this might be the higher operating speed on the sections that have wider pavements.

Zegeer [14] found that the only crashes that can be expected to decrease with lane widening were run-off-the-road (ROR) crashes and opposite-direction (OD) crashes. He also found that only property-damage crashes and injury crashes decreased as lane width increased, with no change in fatality rate. Very little additional benefit was realized by widening a lane beyond 11 feet.
In that study, an economic analysis was conducted to determine the expected cost effectiveness of lane widening. Savings due to crash reductions were the only benefits included in the analysis. It was concluded that an 11 feet wide lane may be optimal for rural two-lane roadways.

Zegeer and Deacon [15] reviewed 30 studies performed until the mid-1980s and concluded that no satisfactory quantitative model relating crash rates to lane width and shoulder width could be found. Therefore, they calibrated a new model that estimates the most likely relationships of crashes with lane width, shoulder width and shoulder type on two-lane rural highways. This model was derived using data obtained from four previous studies.

$$AR = 4.15\,(0.8907)^{L}\,(0.9562)^{S}\,(1.0026)^{LS}\,(0.9403)^{P}\,(1.004)^{LP}$$

where
L = lane width in feet;
S = shoulder width in feet (including stabilized and unstabilized components);
P = width in feet of the stabilized component of the shoulder (0 <= P <= S), with P = 0 for unstabilized shoulders and P = S for full-width stabilization; and
AR = number of ROR and OD crashes per million vehicle miles.

The authors recognized that many assumptions were made in the development of the above model. They considered the model a first approximation of the effect of lane and shoulder conditions on crash rates. No attempt was made to determine the pavement widths that should be used under various traffic conditions or roadway classes.

Later, Zegeer et al. [16] developed another model to quantify the benefits of shoulder and lane improvements based on data selected from seven states. Only two-lane roadway sites were selected. The crash types that appeared to be highly correlated with lane width, shoulder width and shoulder type were single-vehicle (fixed object, rollover and run-off-the-road), head-on, and sideswipe (opposite-direction and same-direction) crashes. Using regression analysis, the following model was derived.
$$AO = 0.0019\,(AADT)^{0.8824}\,(0.8786)^{W}\,(0.9192)^{PA}\,(0.9316)^{UP}\,(1.2365)^{H}\,(0.8822)^{TER1}\,(1.3221)^{TER2}$$

where
TER1 = 1 if flat, 0 otherwise;
TER2 = 1 if mountainous, 0 otherwise;
PA = average paved shoulder width in feet;
UP = unpaved shoulder width in feet;
H = median roadside (or hazard) rating;
W = lane width in feet;
AO = the number of related crashes per mile (single-vehicle, head-on and sideswipe); and
AADT = average annual daily traffic.

The above study indicates that as the amount of lane widening increases, the percentage reduction in related crashes also increases. For lane widths between 8 and 12 feet, the first foot of lane widening corresponds to a 12% reduction in related crashes, two feet to a 23% reduction, three feet to a 32% reduction and four feet to a 40% reduction. The above model applies only to two-lane rural highways with lane widths of 8 to 12 feet, shoulder widths of zero to 12 feet (paved or unpaved) and traffic volumes of 100 to 10,000 (AADT). This model was used to develop an informational guide [17] that enables estimation of the safety benefits of various roadway and roadside improvements.

Goldstine [18] conducted a before-and-after analysis on twenty-five projects covering 152 miles of road to examine the effect of road and shoulder widening on crash rates in New Mexico. Reductions of 38% to 53% in crash rate were observed. The study supported the TRB Special Report 214 in its recommendation that the higher the AADT, the wider the road should be. However, the study recommended using even greater minimum widths.

More recently, Garber and Joshua [19] developed a logistic regression model to describe the probability of truck involvement in crashes as a linear logistic function of traffic and highway variables. For undivided two-lane and four-lane highways, the most significant variables were the slope change rate, lane width and, to a lesser extent, shoulder width. The model derived for these types of highways is given below.
$$P = \frac{1}{1 + e^{\beta X}}, \qquad \beta X = 13.648 - 1.164\,LW - 0.9095\,SW - 0.1969\,SCR + 0.0501\,SW \cdot SCR$$

where
SW = shoulder width in feet;
SCR = slope change rate, the rate at which the longitudinal slope changes;
LW = lane width in feet; and
P = probability of large truck crash involvement.

Shoulder Width

The FDOT Green Book [1] specifies that "The width of all shoulders should, ideally, be at least 10 feet in width. Where economical or practical constraints are severe, it is permissible, but not desirable, to reduce the shoulder width. Outside shoulders shall be provided on all streets and highways with open drainage and should be at least 6 feet wide. Facilities with a heavy total traffic volume or a significant volume of truck traffic should have outside shoulders at least 8 feet wide."

Previous studies that investigated the effect of shoulder width on safety dealt with two-way rural highways. Zegeer [14] reviewed some of these studies and found a lack of correlation between shoulder width and crash rate on two-lane rural highways with AADT less than 2000 vehicles per day. Wide shoulders appeared to be most beneficial where AADT is between 3000 and 5000. In general, shoulders 4 to 7 feet wide were preferred to wider ones, although some studies suggested that shoulders as wide as 10 to 12 feet are the safest.

Crillio and Council [20] concluded from reviewing several studies that increasing shoulder width up to 1.8 meters (6 feet) on facilities with AADT greater than 1000 improved safety. However, the benefits of increasing shoulder width above 1.8 meters (6 feet) were not clear. A study in Oregon [21] concluded that total crashes increased with increasing shoulder width, except for roads that have AADT between 3600 and 5500. Shoulders wider than 8 feet experienced significantly higher crash rates than shoulders less than 8 feet wide.

Hiembach et al. [22] concluded that highway sections that have paved shoulders are associated with lower crash rates than identical sections that do not have shoulders. Rogness et al. [22] compared crash frequency for the time before and after shoulder widening. They found that the addition of full-width paved shoulders to a two-lane roadway was effective in reducing the total number of crashes. For AADT less than 3000, they recommended a paved shoulder in place of an additional travel lane. Adding paved shoulders reduced crash rate by 55% for AADT between 1000 and 3000, by 21.4% for AADT between 3000 and 5000, and by 0% for AADT between 5000 and 7000.

Zegeer [24] found that ROR and OD crashes decreased as shoulder width increased up to 9 feet for two-lane rural highways. For 10 to 12 feet wide shoulders, there was a slight increase in these crash rates.

Shoulder Type

The possibility of a vehicle skidding out of control or turning over is expected to increase when the shoulder is soft or is covered with loose gravel, sand or mud. The FDOT Green Book [1] (page V-3) specifies that shoulders "should be capable of providing a safe path for vehicles traveling at the roadway speed." It also specifies that "the shoulder should be designed and constructed to provide a firm and uniform surface, capable of supporting vehicles in distress."

Turner et al. [25] compared the crash experience on three types of undivided highways: two-lane with unpaved shoulder, two-lane with paved shoulder and four-lane with unpaved shoulder. The two-lane roadway with paved shoulder was found to be the safest, and the two-lane roadway with no shoulder was found to be the least safe. In general, shoulder paving or stabilization is desirable if conducted properly.

Zegeer reported that the effectiveness of shoulder stabilization depends on the need for improvement from a safety standpoint. Based on crash data from Ohio, they found that shoulder stabilization can reduce crashes by 38% and injury and fatality crashes by 46%.
In another study using crash data from North Carolina, they found that for two-lane rural highways, unpaved shoulders resulted in a higher crash rate and severity than paved shoulders. Foody and Long [27] performed a series of analyses of variance (ANOVA), which revealed that the differences in crash rate between stabilized shoulders and paved shoulders were not significant. However, the crash rate of sections having these two shoulder types was significantly lower than that of sections that have unstabilized shoulders.

Speed Limit

The FDOT Green Book [1] recommends that "the design speed should not be less than the expected posted or legal speed limit. A design speed 5 to 10 mph greater than the posted speed limit will compensate for a slight (and generally not enforceable) overrunning of the speed limit by many drivers."

Jackobsberg and Danchik [28] investigated the effect of speed limit on the safety of Maryland roads. They could not find any first-order linear relationships between crashes and physical characteristics of highways, including speed limits.

Fieldwick and Brown [29] compared the crash rates and speed limits at 21 counties. Speed limits in those counties varied between 80 km/h (50 mph) and 120 km/h (75 mph). Using regression analysis, they showed that safety is sensitive to speed limit. For example, their results suggested that reducing the rural speed limit from 100 km/h (62 mph) to 90 km/h (56 mph) could reduce fatalities and injuries by 11% and 15%, respectively. The authors admitted that these figures might include other factors (safety measures employed by counties that use lower speed limits) not investigated in their study. In addition, the study did not differentiate between highway classes (freeways, two-way two-lane, etc.). Therefore, their results should be viewed with caution.

Fieldwick and De Beer [30] analyzed a monthly crash time series between January 1972 and December 1985.
The results showed that a reduction in the urban speed limit from 60 km/h (37 mph) to 50 km/h (31 mph) would reduce fatal and injury crashes by 12.3% and 14.3%, respectively.

In Texas, speed-zoning procedures rely primarily on the 85th percentile speed of traffic on a facility. Ullman and Dudek [31] investigated the argument that speed zoning below the 85th percentile may be beneficial to drivers in rapidly developing areas. Spot speed, speed profile and crash data were collected before and after the speed limits at six urban fringe highway sites in Texas were reduced from 55 mph (the 85th percentile speed) to 45 mph. No changes were observed in speeds, speed distributions, speed-changing activities or crash rates at the sites. They concluded that the lower speed zones were not effective in improving safety at the investigated sites.

Garber et al. [32] investigated the effect of the design speed and the posted speed limit on safety. The types of highways included in the study were urban interstates, rural interstates, urban arterials, rural arterials and major rural collectors. Thirty-six different locations in Virginia were selected for the study. They found that the average speeds on these highways depend on design speeds. An attempt was made to correlate crash rates with average speed for the different types of highway. No strong correlation was found between crash rates and average speed for any given type of highway. They also found that drivers tend to travel at higher speeds on highways with better geometric characteristics, regardless of the posted speed limit. The speed variance was found to be a function of the difference between the design speed and the posted speed limit. Results of regression analysis showed that the speed variance was at a minimum when this difference was between 5 and 10 mph. The regression analysis also showed that crash rates increase with increasing speed variance for all classes of roads.
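The lane-widening reductions quoted earlier in this chapter for the Zegeer et al. [16] model follow from the lane-width factor of that model alone. As a minimal sketch, assuming every other term of the model is held fixed, widening a lane by n feet multiplies the predicted number of related crashes by 0.8786 raised to the power n. The function and constant names below are illustrative and do not come from the cited study.

```python
# Base of the lane-width term in the Zegeer et al. [16] model, as quoted
# in this chapter. The model applies to lane widths of 8 to 12 feet.
LANE_WIDTH_BASE = 0.8786


def percent_reduction(base, feet_of_widening):
    """Percent drop in predicted related crashes for a given widening.

    Assumes all other model terms (AADT, shoulders, hazard rating,
    terrain) stay unchanged, so only base ** feet_of_widening matters.
    """
    return 100.0 * (1.0 - base ** feet_of_widening)


# Reductions for one to four feet of widening, rounded to whole percent.
reductions = [round(percent_reduction(LANE_WIDTH_BASE, n)) for n in (1, 2, 3, 4)]
print(reductions)  # [12, 23, 32, 40]
```

Because the model is multiplicative, each additional foot removes about 12% of the crashes that remain, so the cumulative reduction grows more slowly than the widening itself.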
CHAPTER 3
DATA ORGANIZATION

Highway Geometric Data

The Florida Department of Transportation gathers and maintains information pertaining to all highways and streets in the State of Florida. This database is known as the RCI data (Roadway Characteristic Inventory). Each record, or line of information, in the RCI database represents one highway section. Some of the relevant items in these records are the location codes representing the begin and end points of the highway section, lane width, paved shoulder width and unpaved shoulder width, shoulder type, traffic volume, speed limit, number of intersections, presence of raised curb and friction factor. The information available about each highway section in the RCI data is broadly classified based on location, highway type and general characteristics of the highway.

Location: The location code is the first nine digits of each record, designed to geographically identify the highway section. The location code includes county number, highway number and mile point. Two location codes are assigned to each highway section, one representing the beginning and the other representing the end of the section. The length of the section may be calculated as the difference between the begin mile point and the end mile point.

Highway type: Several numeric codes are used to represent the highway type. Access control, number of lanes, presence of median and number of directions in which traffic moves are some of them. The highway type is recognized based on this information. For example, if the number of lanes is 2, presence of median is 0, number of directions in which traffic flows is 2, access control is 0, and type of location is 1, that record represents a two-lane, urban, undivided highway section.
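The way a highway type is recognized from these codes can be sketched in Python. This is an illustration only: the field names (`begin_mp`, `end_mp`, `lanes`, `median`, `directions`, `access_control`, `location_type`) are hypothetical and are not the actual RCI column names, and only the single code combination quoted in the example above is encoded.

```python
from dataclasses import dataclass


@dataclass
class RciRecord:
    # Hypothetical field names; the real RCI record layout is not given here.
    begin_mp: float       # begin mile point
    end_mp: float         # end mile point
    lanes: int
    median: int
    directions: int
    access_control: int
    location_type: int


def section_length(rec: RciRecord) -> float:
    """Section length as the difference between end and begin mile points."""
    return rec.end_mp - rec.begin_mp


def is_two_lane_urban_undivided(rec: RciRecord) -> bool:
    """True only for the code combination quoted in the text."""
    return (rec.lanes == 2 and rec.median == 0 and rec.directions == 2
            and rec.access_control == 0 and rec.location_type == 1)


# Made-up record for illustration.
rec = RciRecord(begin_mp=1.250, end_mp=1.500, lanes=2, median=0,
                directions=2, access_control=0, location_type=1)
print(is_two_lane_urban_undivided(rec), round(section_length(rec), 3))
```

Other highway types would need their own code combinations, which are not listed in this chapter.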
General characteristics: The characteristics of a highway section that play an important role in this study include cross sectional design features like lane width, shoulder width and shoulder type, and operational parameters like speed limit, presence of on-street parking and traffic volume.

Highway Accident Data

The Florida Department of Transportation also maintains another database that consists of all measurable and representable information pertaining to each highway accident that has been reported. The information available in the crash database may be broadly classified based on location and crash characteristics.

Location: The location code in the crash database is identical in form to the location code in the RCI data. The only difference is that in the crash data, only one location code is required to represent the spot where the crash occurred. The location code common to both databases helps to link the crash incident to the highway section in which it occurred.

Crash characteristics: Crash characteristics include details about crash type, severity, time of occurrence and weather conditions.

A subroutine was developed to merge the crash data into the RCI data. While merging, the computer program reads the first record in the crash data and stores its location code. A search is performed on the RCI data to find the record whose begin and end location codes bracket the crash location. When this condition is met, all built-in crash parameters of that record are updated in the RCI database based on the information obtained from the crash data. The process is repeated for each crash record. At the end of merging, the resulting database contains exactly the same number of records as the RCI data, regardless of the number of records in the accident data.

Highway Classification

The highway sections in the rural areas are operationally different from similar sections in the urban areas.
A high level of pedestrian activity, a large number of access points, high traffic volumes, restricted shoulders and the absence of a safe recovery area are characteristics of urban highways. Therefore, urban highways and rural highways are analyzed separately. Two highway sections with similar geometric and traffic parameters in the same location could still be different from each other based on the type of access. Highways with full access control fall under the category of freeways. The other two types are partially access-controlled highways and highways with no access control. Further, highways are also classified based on the presence of a median and on the number of lanes.

TABLE 3.1 Classification of Highways

#    Code   Location   Access Control   Median      # Lanes
1    uu2    urban      no control       undivided   2
2    uu4    urban      no control       undivided   4
3    ud4    urban      partial          divided     4
4    uf4    urban      full             divided     4
5    uf6    urban      full             divided     6
6    ru2    rural      no control       undivided   2
7    ru4    rural      no control       undivided   4
8    rd4    rural      partial          divided     4
9    rf4    rural      full             divided     4
10   rf6    rural      full             divided     6

Table 3.1 shows the code used to represent various types of highways. The effect of geometric and operational parameters on crash frequency depends on the highway type. For example, the effect of lane width on two-lane highways could be much different from that on six-lane highways. Therefore, each highway type needs to be analyzed separately and then, if required, the models developed could be examined for similar behavior patterns.

Two-lane Urban Undivided Highways

Highways that come under the category of two-lane, urban, undivided highways are considered for this study. About 2500 highway sections in the State of Florida belong to this category. The important features of two-lane urban highways are section length, AADT, presence of on-street parking, number of intersections, number of railway crossings, lane width, paved shoulder width, unpaved shoulder width, presence of curb and coefficient of friction.
The crashes that occur at an intersection are not included in the study since such crashes depend more on the design features of the intersection and the type of control used than on the characteristics of the highway section. Similarly, the crashes that occur at a horizontal curve depend more on the features of the curve than on the longitudinal features. Therefore, highway sections with sharp horizontal curvature are not included in the study. Highway sections that pass through railroad crossings and narrow bridges are also excluded from this study. The number of highway sections available for this study after removing sections with sharp curves, railroad crossings, and narrow bridges is about 2000. Seventy-five percent of these highway sections are used for analysis and modeling. The remaining twenty-five percent are used for testing the models.

Data Statistics

The minimum, maximum, mean, median, and quartile values of each parameter considered in the study are shown in Table 3.2. These values are used to find the range in which the majority of the data lies. Section length is measured in one thousandth of a mile. Traffic volume is expressed in Annual Average Daily Traffic. All cross sectional parameters are measured in feet and speed limit is measured in mph.

TABLE 3.2 Data Statistics

#    Parameter          Code   Min.   1st Q   Median   Mean     3rd Q   Max.
1    Section Length     slen   10     62      144.5    247.2    330     1933
2    Intersections      its    0      0       1        1.76     2       24.5
3    Traffic Volume     adt    913    7442    10890    12090    15580   38680
4    Speed Limit        spd    25     35      45       41.78    45      55
5    Lane Width         lw     9      12      12       11.90    12      15
6    Paved Shoulder     ops    0      0       0        1.21     2       12
7    Unpaved Shoulder   oups   0      4       6        5.87     8       12
8    Total Shoulder     tosh   0      6       8        7.08     8       14
9    Outside Curb       oc     0      0       0        0.10     0       1
10   On-street Parking  pk     0      0       0        0.07     0       1
11   Friction           fr     0      0       0        0.44     1       1
12   Total Crashes      acc    0      0       1        3.36     4       47
13   Property Damage    pdo    0      0       0        1.35     2       26
14   Injury             inj    0      0       1        1.95     2       32
15   Fatality           fat    0      0       0        0.06     0       3

Visual Display of Data

The parameters that are generally used to express the variable statistics are shown in the previous table. Though these values help to find the range in which the majority of the data lie, they do not give any information on its distribution. To get the full picture, a series of plots is prepared and shown in the following pages. Two plots are used to display each variable. The first plot shows the variable in the order in which it exists in the database. The x-axis represents the observation number and the y-axis represents the value of the variable for each observation. This plot looks like a scatter plot and gives an idea of the level at which most observations are concentrated. The plot on the right-hand side of each pair shows the same variable sorted in increasing order of value. The sorted plot helps to identify regions where sufficient observations are not available. It also helps to identify the variables that are categorical in nature. Plots prepared to display continuous variables, categorical variables, and logical variables are shown in Figures 3.1, 3.2 and 3.3, respectively.
[Figure 3.1 panels: Number of Crashes, Section Length, Number of Intersections, and Traffic Volume, each shown unsorted and sorted.]
FIGURE 3.1 Unsorted & Sorted Plots of Continuous Variables

[Figure 3.2 panels: Traffic Speed, Lane Width, Paved Shoulder, and Unpaved Shoulder, each shown unsorted and sorted.]
FIGURE 3.2 Unsorted & Sorted Plots of Categorical Variables

[Figure 3.3 panels: On-Street Parking and Raised Curb, each shown unsorted and sorted.]
FIGURE 3.3 Unsorted & Sorted Plots of Logical Variables

Observations at the ends of the sorted plots that are detached from the rest represent extreme values compared with most of the other observations. These observations will be considered carefully during analysis. If inconsistencies are observed in the models at any stage due to these observations, all attempts will be made to rectify such problems. If no means are available to rectify such situations, these points will be eliminated and the behavior of the corresponding variable at such values will be considered unpredictable.

CHAPTER 4
CRASH DISTRIBUTION

Crash Classification

The total number of crashes that occur in a highway section is generally classified based on severity and type.
The crash classification based on crash severity and the codes used to represent each class are given in the following listing.

#   Crash Severity    Code
1   Property Damage   PDO
2   Injury Crashes    Inj
3   Fatal Crashes     Fat
4   All Crashes       Acc

Crash Frequency and Crash Rate

Crash frequency is the total number of crashes that occur at a highway section during a given period of time, regardless of length of the section, AADT, or duration of observation. The period of observation is usually taken as one year and the length of section is generally limited to one mile. Crash rate is a function of crash frequency, section length and the average annual daily traffic. For any given highway section, crash rate is defined as the number of crashes per one million vehicle miles.

Crash rate = f (crash frequency, section length, AADT)

Crash rate is generally used in regression analysis for developing crash prediction models. When crash rate is defined as the response variable, it is assumed that the probability for a crash to occur does not depend on the traffic volume or on the uniformity of design. When the number of crashes per section is considered as the dependent variable and section length and AADT are treated as independent variables, the effects of traffic volume and uniformity in highway design on the number of crashes are also taken into consideration while modeling. In this study, crash frequency is considered as the response variable regardless of the section length or the AADT. Section length and AADT are treated as independent variables that can influence the number of crashes in a given highway section.

Frequency Distribution

The highway sections can be grouped into classes based on the number of crashes that occur in each section. A frequency distribution table can be constructed using the counts in each class or scores in each interval. The shape of the frequency distribution can be seen from the plot of crash frequency against each class.
The resulting plot is known as a histogram. Figure 4.1 shows the histogram of all crashes in two-lane urban highway sections in the State of Florida. The plot on the left side is a scatter plot of the crash frequency of each highway section. The plot on the right side shows the histogram prepared from the crash distribution table. According to the histogram, the crash frequency distribution is single sided, with a large number of highway sections with no crashes. The number of highway sections decreases as the number of crashes per section increases. Beyond fifteen crashes per section, the number of highway sections gets close to zero and the curve becomes asymptotic to the x-axis.

[Figure 4.1 panels: crash frequency against Observation Number (left) and histogram by Crash Class (right).]
FIGURE 4.1 Distribution of Actual Crash Data

Three important observations from the histogram plot are listed below.
1. There are more than 600 highway sections that have zero crashes.
2. About 50 highway sections have more than 15 crashes.
3. The distribution is single sided.

Poisson Distribution

The Poisson distribution is widely used to model count data. Rarely occurring events are generally represented using the Poisson distribution. The shape of the distribution depends on only one parameter, the mean value of the data. The Poisson distribution has generally been accepted as the standard distribution function to represent crash frequencies. It models the probability of y 'events' or incidents according to the Poisson process, with the probability given by the following expression.

$$p(y, \mu) = \frac{e^{-\mu}\,\mu^{y}}{y!}$$

where y = 0, 1, 2, 3, ... and $\mu$ is the mean value of the sample.

The variance of the distribution is assumed to be equal to the mean of the distribution.
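The Poisson probability and its mean-variance equality can be checked numerically. The following is a minimal Python sketch using only the standard library; the mean of 3.0 and the truncation at 60 classes are illustrative assumptions and are not values taken from the study.

```python
import math


def poisson_pmf(y: int, mu: float) -> float:
    """Probability of exactly y events when the mean is mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


mu = 3.0            # illustrative mean, not a value from the study
ys = range(0, 60)   # truncated support; the tail beyond this is negligible for mu = 3
probs = [poisson_pmf(y, mu) for y in ys]

# Mean and variance computed directly from the probabilities.
mean = sum(y * p for y, p in zip(ys, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(ys, probs))
print(f"P(0) = {probs[0]:.4f}, mean = {mean:.4f}, variance = {variance:.4f}")
```

Both computed moments come out equal to the chosen mean, which is the property that the crash data are later tested against.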
Negative Binomial Distribution

The negative binomial distribution is similar to the Poisson distribution but, unlike the Poisson distribution, it allows the variance to be much larger than the mean. The mean and variance of the negative binomial model can be written as follows.

$$E(Y \mid x) = \mu(x)$$
$$V(Y \mid x) = \mu(x) + \alpha\,[\mu(x)]^{2}$$

where $\alpha$ is referred to as the dispersion parameter.

Rejection of Poisson Distribution

The assumption that crash frequency is distributed according to a Poisson distribution is rejected based on the results from three experiments.

Violation of Mean-Variance Equality: From the observed values of crashes, the mean, standard deviation, and variance are calculated. The values obtained are listed below.

#   Parameter            Value
1   Range                0 to 60
2   Mean                 3.2
3   Standard Deviation   6.1
4   Variance             37.2

The Poisson distribution assumes that the variance is equal to the mean. The mean value of crash frequency is 3.2, while the variance is 37.2 (much greater than the mean). Therefore the basic assumption used to develop the Poisson model is violated.

Overdispersion Coefficient exceeds 1: A test for overdispersion was performed using the outputs from the procedure that estimates the negative binomial regression in the statistical analysis package LIMDEP by Green [2]. If the overdispersion factor exceeds 1, the distribution is assumed to be negative binomial. The overdispersion factor estimated by the regression analysis was 1.49.

Disagreement in Shape of Distribution: The mean number of crashes for two-lane highways is calculated from the crash data. Statistics obtained from the actual crash frequencies may be used to generate theoretical frequencies that follow any assumed distribution. A vector of length "n" is generated randomly using a Poisson distributed random number generator. The number of elements (n) is made equal to the number of highway sections. The value of the parameter used to drive the random number generator is obtained from the actual crash data.
If crashes were Poisson distributed, the distribution of the resulting vector would be expected to look like the distribution of the actual crash data. To compare this theoretical crash data with the actual crash data, a scatter plot and a histogram are prepared using the randomly generated crash data. The plots thus obtained are shown in Figure 4.2.

[Figure 4.2 panels: generated crash frequency against Observation Number (left) and histogram by Crash Class (right).]
FIGURE 4.2 Distribution of Poisson Data

Three important observations from the histogram plot are listed below.
1. The number of highway sections that had no crashes during the observation period is less than 100.
2. There are no highway sections with crash frequency greater than 10.
3. The distribution is double sided with short tails.

None of these observations agree with the observations made from the actual frequency. A similar procedure is used to generate another vector of random numbers that follow a negative binomial distribution based on the actual crash statistics. The scatter plot and histogram are shown in Figure 4.3.

[Figure 4.3 panels: generated crash frequency against Observation Number (left) and histogram by Crash Class (right).]
FIGURE 4.3 Distribution of Negative Binomial Data

Three important observations from the histogram plot are listed below.
1. About 600 highway sections had zero crashes.
2. About 50 highway sections experienced more than 15 crashes.
3. The distribution is single sided.

Conclusion

All these observations agree with the observations made from the actual crash distribution. Based on the results obtained from the three experiments described in the previous sections, it can be concluded that the total crashes that occur on two-lane urban highways follow a negative binomial distribution. The distributions of PDO crashes, injury crashes and fatal crashes are also checked using the same procedure. The following results were obtained.
#   Crash Type       Distribution
1   PDO crashes      Negative Binomial Distribution
2   Injury crashes   Negative Binomial Distribution
3   Fatal crashes    Poisson Distribution

The fatal crashes were of very rare occurrence and there were no signs of overdispersion.

CHAPTER 5
EXPLANATORY VARIABLES

This chapter gives an introduction to all the variables that are believed to contribute to the occurrence of crashes. Such variables that may be able to explain the occurrence of crashes are termed explanatory variables. The explanatory variables are classified into longitudinal factors, operational factors and cross sectional factors.

* Longitudinal factors include section length, number of intersections, level crossings and narrow bridges.
* Operational factors include traffic volume, speed limit and on-street parking conditions.
* Cross sectional factors include lane width, shoulder width, paved shoulder width and unpaved shoulder width.

The Highway Section

A highway section is defined as a uniform stretch of roadway for which the operational factors and cross sectional factors remain unchanged. The length of a highway section usually ranges between 0.5 and 2.5 miles. Since intersections are not considered as constraints in determining highway section boundaries, a highway section may contain several intersections. Changes in geometry, speed limits, parking regulations or traffic volumes result in a highway section being divided into several smaller highway sections. Therefore a longer highway section implies design consistency while several short highway sections imply irregularities in design. There is a possibility for irregularities in design to contribute to crashes. Therefore each highway section is considered as one observation in this study rather than considering sections of one mile length. The section length will be considered as one of the explanatory variables.
Longitudinal Parameters

The section length is the most important longitudinal parameter of the highway section. Other factors include number of intersections, number of railway crossings and number of narrow bridges. The crashes that occur at an intersection depend more on the design aspects and operational features of the intersection than on the features of the section. Since this study is focused on modeling crashes as a function of the highway features, the crashes that occur at the intersections are not included as part of the response variable. Even though the crashes that occur at intersections are excluded, there is a possibility for mid-block crash frequency to be influenced by the presence of intersections. Therefore 'number of intersections' is also considered as one of the explanatory variables in this analysis.

Figure 5.1 shows the pairwise plot of all longitudinal parameters with crash frequency as the first variable. A pairwise plot is prepared by plotting every pair of variables against each other in a grid of two-dimensional plots. Each plot represents two variables. The plots give a general idea of how well the variables are related to each other. The response variable is shown as the first variable in the grid. The plots may be explained using the following examples. In Figure 5.1, the plot corresponding to the 1st row and 3rd column was prepared by plotting number of intersections on the x-axis and crash frequency on the y-axis. The plot corresponding to the 1st column and 2nd row was prepared by plotting crash frequency on the x-axis and section length on the y-axis. The points in the pairwise plot are not completely random. Therefore, some relationship could be expected between the variables. Since the points are also spread out, it can be expected that the behavior of the variables under consideration is also influenced by other variables.
[Figure 5.1: pairwise plots of acc, slen, its, and oc.]
acc    total crash frequency
slen   section length (1/100th of mile)
its    number of intersections
oc     presence of curb
FIGURE 5.1 Pairwise Plot of Longitudinal Factors

Operational Parameters

Traffic volume, traffic speed and on-street parking are the important highway operational parameters. Figure 5.2 shows the pairwise plot of all operational parameters with crash frequency as the first variable.

[Figure 5.2: pairwise plots of acc, adt, spd, and pk.]
acc    total crash frequency
adt    AADT
spd    speed limit (mph)
pk     on-street parking
FIGURE 5.2 Pairwise Plot of Operational Factors

Since traffic volume changes with time, it is difficult to measure the volume and record it on an ongoing basis. Besides, it is practically impossible to know the traffic volume or density at the time of the accident. Therefore a representative variable, average annual daily traffic (AADT), is used to represent traffic volume. Similarly, traffic speed also changes with time. Therefore another representative variable, speed limit, is used to represent this factor. Speed limit is a function of several geometric parameters, pavement conditions and sight distance. Speed limit, when defined as an explanatory variable, represents the effect of these factors on highway safety. On-street parking is another important operational parameter. It is represented using a logical variable that takes the value zero if on-street parking is prohibited and the value one if on-street parking is permitted for that highway section.

Cross Sectional Parameters

The cross sectional factors are lane width, shoulder width, median width, and safe recovery area width. Since the type of highway considered for this study is undivided, the median width is zero. The safe recovery area is usually zero for urban highways. The shoulder could either be paved, unpaved, or a combination of both.
Since a paved shoulder can functionally contribute to the width of the lane, paved and unpaved shoulders are considered as two different parameters. Figure 5.3 shows the pairwise plot of all cross sectional parameters with crash frequency as the first variable.

[Figure 5.3: pairwise plots of acc, lw, ops, and oups.]
acc    total crash frequency
lw     width of lane
ops    width of paved shoulder
oups   width of unpaved shoulder
FIGURE 5.3 Pairwise Plot of Cross Sectional Parameters

CHAPTER 6
MODELING STRATEGIES & THE BASE MODEL

This chapter gives an introduction to the statistical methods used in the study. Criteria used for accepting or rejecting models and for preferring one model over other models are also discussed briefly in this chapter. Assumptions based on the insights obtained from the literature review, visualization of actual data and known statistical concepts are used to develop the base model. This model and all relevant model parameters are displayed and discussed briefly in this chapter. There is a need to validate or reject these assumptions based on statistical inference. The next chapter deals with improving this model step by step while all assumptions used in the base model are evaluated in stages.

Generalized Linear Models

In ordinary linear regression analysis, the errors are assumed to be distributed normally. Therefore the properties of least squares estimates are stronger when the errors actually follow a normal distribution than when they do not. Most of the time the errors are not normally distributed. In cases where response variables are of rare occurrence, the errors are seldom normal. In such situations, the models developed using linear models become highly unreliable even though a very good fit can be attained through sophisticated modeling.
The generalized linear models (GLM) introduced by Nelder and Wedderburn (1972) are a generalized approach to linear models in which a wide range of different types of error distribution families is accommodated. Generalized linear models are specified by three components: random, systematic, and link. The random component identifies the probability distribution of the response variable. Since crash frequency represents counts, it is discrete in nature and follows a discrete distribution. The systematic component specifies a linear function of explanatory variables that is used as a predictor. Section length, traffic volume, speed limit, lane width and shoulder width are examples of explanatory variables that can be expressed in the linear form given below.

$$\text{systematic component} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots$$

where $x_1, x_2, x_3, \dots$ are the independent variables or functions of independent variables, $\beta_0$ is the y-intercept and $\beta_1, \beta_2, \beta_3, \dots$ are coefficients of the independent variables.

The link component of the generalized linear model links expected values of observations to explanatory variables through a specified function. In Poisson and negative binomial models, the link function could be the natural log.

Model Statistics

The definitions of some important terms used to express the model characteristics and reliability are given in the following sections.

Regression Coefficient ($\beta$): The parameters $\beta_0$ and $\beta_1, \dots, \beta_n$ are called regression coefficients. $\beta_0$ is the y-intercept of the regression model and $\beta_1, \dots, \beta_n$ represent the change in expected value of the response variable per unit change in each independent variable.

Sum of Squares (SS): Sum of squares of the model is a measure of the variability in the response variable that has been explained by the model. Sum of squares of individual parameters represents the portion of model sum of squares that has been contributed by that parameter.
F-test: The value of $\beta$ depends on the units used to represent the corresponding parameter. For example, the $\beta$ obtained when expressing section length in miles will be one thousand times the value of $\beta$ obtained when section length is expressed in one-thousandths of a mile. The magnitude of $\beta$ itself is therefore not a clear indication of its significance. The F-test can be used to find the relative importance of one term with respect to another. If a significant amount of extra variance can be eliminated (explained) by including the term, its presence is justified.

t-test: The t-test is similar to the F-test except that the t-test takes the direction of the coefficient into consideration. The t-value may be written as follows.

$$t = \frac{\beta_j}{s\sqrt{c_{jj}}}$$

where $\beta_j$ is the estimated coefficient of the jth term, s is the estimated standard deviation of the errors, and $c_{jj}$ is the jth diagonal element of the $(X'X)^{-1}$ matrix, so that $s\sqrt{c_{jj}}$ is the standard error of $\beta_j$.

p-value: The p-value measures the level at which the t-statistic is significant. A p-value of .10 suggests that the parameter it represents is significant at a confidence level of 90%. The generally accepted confidence level is 95%, which corresponds to a p-value of .05.

Likelihood function: The likelihood function of a given data n is the probability of n for that sampling model, treated as a function of the unknown parameters [37, page 40]. The maximum likelihood (ML) estimates are the parameter values under which the observed data would have had the highest probability of occurrence.

Deviance: The deviance of a model is a function of its log-likelihood and the log-likelihood of the corresponding saturated model. It is calculated by finding the difference in log-likelihood and multiplying it by 2. The deviance of a generalized linear model is similar to the residual sum of squares of a linear model. It is the weighted residual sum of squares of the model. The residual degrees of freedom is used to calibrate the deviance.
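For a Poisson model, the deviance takes the standard closed form D = 2 Σ [y ln(y/μ) − (y − μ)], with y ln(y/μ) taken as zero when y = 0. The sketch below, in Python with the standard library only, computes this form and checks it against twice the log-likelihood difference between the saturated and fitted models. The observed counts and fitted means are made up for illustration and are not data from the study.

```python
import math


def poisson_loglik(observed, means):
    """Poisson log-likelihood; lgamma(y + 1) equals ln(y!)."""
    total = 0.0
    for y, mu in zip(observed, means):
        # y * ln(mu) is taken as 0 when y == 0 (covers the saturated case mu == 0)
        log_term = y * math.log(mu) if y > 0 else 0.0
        total += log_term - mu - math.lgamma(y + 1)
    return total


def poisson_deviance(observed, fitted):
    """Closed-form Poisson deviance: 2 * sum(y * ln(y / mu) - (y - mu))."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += term - (y - mu)
    return 2.0 * total


# Made-up counts and fitted means, for illustration only.
observed = [0, 1, 4, 2, 0, 7]
fitted = [0.8, 1.5, 3.2, 2.0, 0.4, 5.1]

deviance = poisson_deviance(observed, fitted)
# The saturated model reproduces each observation exactly (mu_i = y_i).
gap = 2.0 * (poisson_loglik(observed, observed) - poisson_loglik(observed, fitted))
print(f"deviance = {deviance:.4f}, 2 x log-likelihood gap = {gap:.4f}")
```

The two printed values agree, which is the sense in which the deviance is twice the difference in log-likelihood.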
Variable Selection Procedure

The value of AIC can be used as the criterion to prefer one model over another model. AIC is short for Akaike Information Criterion. For generalized models, AIC is defined as a function of deviance (D), degrees of freedom (p), and an estimate of the dispersion parameter ($\phi$).

$$\mathrm{AIC} = D + 2p\phi$$

A decrease in the value of any of the three parameters results in a reduction in the AIC value. In all model selection routines, AIC is used as the criterion for ranking candidate models, from which the model corresponding to the minimum AIC value is accepted. It is similar to the Mallows' Cp criterion, which penalizes the use of a larger number of regressors to attain the expected quality of fit.

Sequential Variable Selection: Stepwise variable selection is a search routine that assists in finding a subset of explanatory variables that could be included in a multiple regression model. According to this concept, variables can be added to or deleted from the existing model on the basis of a predefined criterion which measures the relative improvement of the model with respect to each variable. For a given number of variables, the number of models that can be generated considering all possible combinations of all or part of the variables is very large. An exhaustive search will result in examining a large number of models. Stepwise variable selection is a technique used to reduce the number of models that need to be examined while limiting the risk of missing the best combination of variables. To reduce the probability of losing effective combinations of variables, certain strategies are adopted in identifying a path which would lead to the best model. The three general sequential algorithms are discussed briefly in the following sections.

Forward Selection: In forward selection, the initial model contains only the constant term that represents the y-intercept.
A set of models is developed as the second stage, in which each model contains exactly one term other than the intercept. The model with the lowest value of AIC is selected and this model forms the base to find the next regressor. The process stops when adding another regressor cannot bring down the AIC value any further. In this method, a regressor once selected is never considered for elimination. The parameters in the final model depend on the order in which variables enter the model.

Backward Selection: In backward selection, the first model is developed using all the regressors. A series of analyses follows to identify the regressor whose removal reduces the AIC value the most. The regressor thus identified is eliminated from the current model and the procedure is repeated to find the next regressor. This procedure is stopped when eliminating another term cannot reduce the AIC value any further. In this method, a regressor once rejected will not be reconsidered for inclusion in the model. Therefore the final model selected by this process depends on the order in which parameters are rejected.

Stepwise Regression: Stepwise regression is a modification of forward selection. In each stage of selection, all regressors currently in the model are further evaluated to justify their existence in the presence of the new variable that was added. Therefore a regressor that entered the model at one stage may be eliminated at another stage. The procedure is terminated when no regressor can bring about an improvement in the AIC value either by leaving the model or by entering the model. Stepwise regression methodology is used in the analysis. The stepwise model selection procedure is the generally preferred methodology for generalized linear models. The procedure starts with an arbitrary model that has been fit previously. The initial model is improved in stages by adding terms to or deleting terms from the current model.
Each addition or deletion is justified by the reduction in the AIC statistic.

Model Performance Criteria

At the model building stage, the conditions under which the models should perform are not known. The models developed through regression analysis need to be cross-validated for reliability in application.

Fitting Sample and Testing Sample: Prior to regression analysis, the data are split into two data sets. The larger set forms the fitting sample and the smaller set forms the testing sample. All appropriate candidate models are developed, and their coefficients are computed, using the fitting sample. The testing sample is then used to estimate the performance of the fitted models.

The following procedure is used to obtain an unbiased split of the data in the ratio 75:25. A random number in the range 0 to 1 is generated. If the value of the random number is less than or equal to .75, the first record is included in the fitting sample. If the value is greater than .75, the first record is included in the testing sample. This process is repeated for each observation.

Total number of observations   = 1934
Observations in fitting sample = 1466
Observations in testing sample = 468

All acceptable models developed at each stage of analysis using the fitting sample can be compared or ranked based on the quality of their predictions on the testing sample. The performance of two or more models can be compared by the relative accuracy of prediction. Two norms are developed to automate the computations and to build a performance table. The procedures used to develop the norms are shown in the following sections.

Norm including all observations: The prediction errors of each model can be used to generate norms that form the criteria for accepting, rejecting, or ranking the candidate models.
Mean Absolute Deviation in prediction can be used as the criterion for comparing the relative performance of any two models.

$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n} \left|\,\mathrm{observed}_i - \mathrm{predicted}_i\,\right|$

where MAD is the mean absolute deviation and n stands for the number of observations in the fitting sample.

Norm excluding outliers: The Mean Absolute Deviation can be highly influenced by a few observations that may be outliers for a particular model. To nullify the effect of such observations, up to 5% of the observations with the worst prediction errors are excluded from the computation. The absolute values of the errors are sorted in increasing order and the last 5% of observations are dropped from the calculation. The norms are included along with other model parameters in all performance tables discussed in the next chapter.

Basic Assumptions and the Base Model

A regression model is developed using all regressors as independent variables and crash frequency as the response variable. The variable transformations used are based on information obtained from the literature review. Traffic volume and section length are assumed to follow a natural log transformation. All other variables are represented in the natural scale. This assumption is based on studies done by a few analysts at earlier stages. These parameters are examined further to see whether any other transformation can represent them better than the default functions. Some important results of such studies are also discussed in the next chapter. Information about variable interactions is not clearly known at this time, and interactions are assumed to be nonexistent for now. All predictors are assumed to be continuous, though some of them show a categorical nature that will be explored at a later stage. The variables included in the development of this model are listed below.

 #   Parameter                   Code   Transformation
 1.  Section Length              slen   log
 2.  Number of intersections     its    none
 3.  Ave. Annual Daily Traffic   adt    log
 4.  Posted Speed Limit          spd    none
 5.  On-Street Parking           pk     none
 6.  Lane Width                  lw     none
 7.  Outside Paved Shoulder      ops    none
 8.  Outside Unpaved Shoulder    oups   none
 9.  Outside Curb                oc     none
10.  Friction Factor             fr     none

The Model Parameters: The model parameters and standard model plots are shown in the following six sections.

I. The Model:

acc ~ log(slen) + its + log(adt) + spd + pk + lw + ops + oups + oc + fr
theta = 1.51256, family = negative binomial, link = log

II. Model Coefficients:

Parameter     Value          Std Err       t value
(Intercept)   10.321376409   0.710073553   14.53564405
log(slen)      0.764342626   0.036520748   20.92899676
its            0.064613833   0.011073166    5.83517269
log(adt)       0.856658961   0.056969923   15.03703913
spd            0.017071908   0.004398058    3.88169259
pk             0.563012796   0.142831156    3.94180662
lw             0.002467123   0.033606293    0.07341253
ops            0.070158874   0.015603061    4.49648148
oups           0.021864803   0.012393907    1.76415740
oc             0.200213092   0.117361492    1.70595217
fr             0.043057361   0.061632129    0.69861875

III. F Statistics:

Parameter   Df     Sum of Sq   Mean Sq    F Value    Pr(F)
log(slen)   1       785.511    785.5110   662.7017   0.0000000
its         1        74.929     74.9289    63.2142   0.0000000
log(adt)    1       200.624    200.6236   169.2575   0.0000000
spd         1        39.317     39.3174    33.1704   0.0000000
pk          1        14.772     14.7720    12.4625   0.0004281
lw          1         0.042      0.0417     0.0351   0.8513137
ops         1        16.273     16.2726    13.7285   0.0002191
oups        1         6.407      6.4074     5.4056   0.0202091
oc          1         2.694      2.6937     2.2725   0.1319030
fr          1         0.488      0.4881     0.4118   0.5211776
Residuals   1455   1724.635      1.1853

IV. Analysis of Deviance Table:

Variable    Df   Deviance   Resid. Df   Resid. Dev   Pr(Chi)
NULL                        1465        3105.902
log(slen)   1    1139.127   1464        1966.775     0.0000000
its         1      72.707   1463        1894.068     0.0000000
log(adt)    1     217.330   1462        1676.738     0.0000000
spd         1      42.244   1461        1634.493     0.0000000
pk          1      15.901   1460        1618.593     0.0000668
lw          1       0.003   1459        1618.590     0.9575715
ops         1      18.196   1458        1600.394     0.0000199
oups        1       6.506   1457        1593.888     0.0107488
oc          1       2.477   1456        1591.410     0.1154974
fr          1       0.498   1455        1590.912     0.4803596

V. Model Statistics:

Null Deviance: 3105.902 on 1465 degrees of freedom
Residual Deviance: 1590.912 on 1455 degrees of freedom
Theta: 1.51257, Standard Error: 0.10623
2 x log-likelihood: 8830.80398
AIC: 1614.967, MAD: 2.711362

VI. Model Plots:

[Four diagnostic panels; recoverable axis labels: Fitted, Predicted, Quantiles of Standard Normal]

FIGURE 6.1 Standard Plots of the Base Model.

Crash frequency is modeled as a function of section length (slen), number of intersections in the section (its), AADT (adt), speed limit (spd), parking regulations (pk), lane width (lw), outside paved shoulder width (ops), outside unpaved shoulder width (oups), presence of outside curb (oc), and the coefficient of friction of the pavement surface (fr). The distribution assumed is negative binomial, and the link function is the natural log.

The variable coefficients (β's), the standard error of each term, and the t-statistics are shown in section II. Section III shows the sum of squares contributed by each variable and the associated F-statistics. The p-values of most of the terms are less than .05, which indicates significance at a confidence level greater than 95%. The analysis of deviance is shown in section IV, all important model statistics are shown in section V, and the standard model plots are given in section VI. Lane width, a very important parameter, was declared insignificant by the criterion used for eliminating terms during stepwise regression. The coefficient of speed limit suggests that crash frequency decreases as speed limit increases. This model is studied in detail, and the stages of improvement it goes through before reaching the final model are discussed in the next chapter.

CHAPTER 7
STATISTICAL MODELING

This chapter deals with various stages of regression analysis and related issues.
The objective of this chapter is to identify the best way of representing each variable in the crash prediction model while validating the basic assumptions. Assumptions made about variable transformations and interactions are reviewed in this chapter. The analysis starts from the base model presented in the previous chapter. Each variable in the base model is examined individually and compared across all possible and reasonably explainable forms of representation in the model. A model in which any form of the variable under consideration is not significant at the 95% confidence level is rejected. The models that survive this test are compared in stages to find the best model for each stage.

While examining the transformation function of one variable, it is assumed that the other variables are represented correctly. Since this assumption can affect the outcome for the first few variables that are analyzed, the final models are cross-checked for confirmation. This error is smallest when the order in which variables are analyzed is based on their importance in the model. The independent variables and their sum of squares values in the base model are listed below. The sum of squares value of each variable in a model is a measure of its contribution, relative to the other variables, in explaining the response variable. The base model shows that section length (sum of squares = 785.5) contributes more than double the contribution of all other variables put together. If section length is considered as the first variable for analysis, the error induced by incorrect representation of the other variables can be minimized.
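The ordering argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the procedure used in the study: it copies the sum of squares values from the base model F statistics and uses two hypothetical helper names (`analysis_order`, `share_vs_rest`) to rank the variables and to compare section length against all other variables combined.

```python
# Sum-of-squares values copied from the base-model F-statistics listing.
base_model_sum_sq = {
    "log(slen)": 785.511,
    "its": 74.929,
    "log(adt)": 200.624,
    "spd": 39.317,
    "pk": 14.772,
    "lw": 0.042,
    "ops": 16.273,
    "oups": 6.407,
    "oc": 2.694,
    "fr": 0.488,
}


def analysis_order(sum_sq):
    """Return variable codes sorted by decreasing sum of squares (illustrative helper)."""
    return sorted(sum_sq, key=sum_sq.get, reverse=True)


def share_vs_rest(sum_sq, name):
    """Ratio of one variable's sum of squares to the total of all other variables."""
    rest = sum(value for key, value in sum_sq.items() if key != name)
    return sum_sq[name] / rest


order = analysis_order(base_model_sum_sq)
ratio = share_vs_rest(base_model_sum_sq, "log(slen)")
print(order)
print(round(ratio, 2))
```

A ratio above 2 corresponds to the statement that section length contributes more than double the combined contribution of the remaining variables. The sorted order is only a suggestion of importance; the chapter groups its analysis by longitudinal, operational, and cross sectional factors rather than following this ranking strictly.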
Listing: The Base Model

Parameter         Code        Sum of Sq   Mean Sq    F Value    Pr(F)
Section Length    log(slen)   785.511     785.5110   662.7017   0.0000000
Intersections     its          74.929      74.9289    63.2142   0.0000000
Traffic Volume    log(adt)    200.624     200.6236   169.2575   0.0000000
Speed Limit       spd          39.317      39.3174    33.1704   0.0000000
Parking           pk           14.772      14.7720    12.4625   0.0004281
Lane Width        lw            0.042       0.0417     0.0351   0.8513137
Paved Shoulder    ops          16.273      16.2726    13.7285   0.0002191
Unpaved Shldr     oups          6.407       6.4074     5.4056   0.0202091
Raised Curb       oc            2.694       2.6937     2.2725   0.1319030
Friction Factor   fr            0.488       0.4881     0.4118   0.5211776

Representing Longitudinal Factors

In the base model, section length was assumed to have a natural log transformation. Some of the functions generally used to transform continuous variables in statistical modeling are the natural log, square root, and square. A series of models is developed from the base model using these transformation functions to represent section length and intersections. The models that survived the confidence level test and stepwise regression are listed below, and the corresponding model parameters are shown in the table that follows.

#   Model   Section Length   Rejected by stepwise
1.  SL1     log(slen)        lw, fr
2.  SL2     slen             lw, fr, oups
3.  SL3     sqrt(slen)       lw, fr
4.  SL4     slen^2           lw, fr

TABLE 7.1 Models for representing Section Length

Model Parameter      SL1       SL2       SL3       SL4
Null Deviance        3104.35   2762.75   3075.06   2374.54
Residual Deviance    1590.80   1559.78   1578.20   1548.76
Theta                1.511     1.23      1.485     .9521
Standard Error       .106      .078      .103      .0554
2*log likelihood     8830.28   8710.03   8830.74   8517.54
AIC                  1610.45   1576.89   1597.70   1563.62
Prediction Error     2.71      2.677     2.691     2.691
Error on 95% data    1.924     1.889     1.907     1.893

Listing: The Model SL2

Parameter        Code       Sum of Sq   Mean Sq    F Value    Pr(F)
Section Length   slen       820.515     820.5146   772.2600   0.00000000
Intersections    its         60.324      60.3240    56.7763   0.00000000
Traffic Volume   log(adt)   204.467     204.4671   192.4424   0.00000000
Speed Limit      spd         36.342      36.3424    34.2051   0.00000001
Parking          pk          10.350      10.3499     9.7412   0.00183721
Paved Shoulder   ops          9.857       9.8567     9.2770   0.00236208
Raised Curb      oc           5.118       5.1177     4.8167   0.02834245

Model SL4, with the square function on section length, was found to be the best model based on the values of Null Deviance, Residual Deviance, Standard Error, log likelihood, and AIC. Model SL2, which represents section length without any transformation, was found to be the second best. Models SL1 and SL3, which represent the log and square root transformations, are rejected since their model statistics are inferior to those of SL2 and SL4. The untransformed and square transformed models are compared further. Though the square transformation gave better model parameters, its ability to predict crash frequencies on the testing sample was found to be inferior (mean error of 2.691 crashes/section) to that of the untransformed model (mean error of 2.677 crashes/section). Therefore the untransformed form of section length is preferred over the square transformed form. The square transformation will be considered further at other stages.
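The two prediction error rows of Table 7.1 correspond to the norms defined in the previous chapter: the mean absolute deviation over all observations, and the same quantity after the worst 5% of absolute errors are dropped. The following is a minimal sketch of both norms; the function names and the toy data are hypothetical, and rounding the 5% count down is an assumption consistent with "up to 5%".

```python
def mean_absolute_deviation(observed, predicted):
    """Norm 1: mean of |observed - predicted| over all observations."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    errors = [abs(o - p) for o, p in zip(observed, predicted)]
    return sum(errors) / len(errors)


def trimmed_mean_absolute_deviation(observed, predicted, drop_percent=5):
    """Norm 2: same as Norm 1 after dropping the largest errors.

    The absolute errors are sorted in increasing order and the last
    drop_percent of them (rounded down, an assumption) are excluded.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    errors = sorted(abs(o - p) for o, p in zip(observed, predicted))
    n_drop = len(errors) * drop_percent // 100
    kept = errors[: len(errors) - n_drop]
    return sum(kept) / len(kept)


# Toy data (made up for illustration): 19 small errors and one large one.
toy_observed = [0] * 20
toy_predicted = [1] * 19 + [21]
norm1 = mean_absolute_deviation(toy_observed, toy_predicted)
norm2 = trimmed_mean_absolute_deviation(toy_observed, toy_predicted)
print(norm1, norm2)
```

In the toy data a single badly predicted section doubles the first norm, while the second norm is unaffected, which is the reason for reporting both.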
Interaction between Section Length and Intersections: The model overpredicted the response variable for long highway sections with large numbers of intersections. As section length increases, crash frequency increases. For a given section length, it is reasonable to expect the crash frequency to increase as the number of intersections increases. At the same time, a longer section may be able to accommodate more intersections than a shorter section within a defined range of safety. The effect of intersections on crash frequency cannot be completely independent of section length. A product term of intersections and section length is introduced in the model to represent the combined effect of these parameters on crash frequency. The coefficient of the product term was found to be negative, as expected. This term applies a correction against overprediction on long sections with a large number of intersections. The p-value of the product term showed significance at a confidence level above 95%. The crash frequencies predicted by the resulting model show great improvement. The plots prepared for the diagnostic study of model SL2 are shown in Figure 7.1. The improvement in prediction for the observations listed in the previous section is shown in the following listing.

[Figure 7.1: diagnostic plots of model SL2]

                                                      Predicted
#   Index   Section length   # Intersections   Actual   Before   After
1    276    1310             19                36       105      51
2    641    1871             17.5              15        61      29
3    644    1430             23.5              44       178      21
4   1257    1672             12.5              33       110      40

Compared with the values predicted by the non-interactive model, the errors of the interactive model are substantially lower. In the presence of the interactive term, the best way of representing section length, intersections, and the product term is not known. Therefore some more models are considered to find the best form of representing longitudinal factors.
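The corrective role of the product term can be made explicit with a short derivation. This is a sketch under the assumption that the interactive model keeps the log link of the base model; the symbols $\beta_0,\dots,\beta_3$ are generic placeholders rather than fitted values reported in the text, and the remaining regressors are collected in "other terms".

```latex
\[
\log \mu = \beta_0 + \beta_1\,\mathit{slen} + \beta_2\,\mathit{its}
         + \beta_3\,(\mathit{slen}\times\mathit{its}) + \text{(other terms)}
\]
\[
\frac{\partial \log \mu}{\partial\,\mathit{its}} = \beta_2 + \beta_3\,\mathit{slen}
\]
```

With $\beta_3 < 0$, the proportional increase in expected crash frequency $\mu$ per additional intersection shrinks as section length grows, which is the correction against overprediction on long sections with many intersections.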
The models that were acceptable for performance comparison are shown in the following listing. The model parameters are displayed in Table 7.2.

#   Model   Characteristics
1   LF1     slen, slen^2
2   LF2     slen:its
3   LF3     slen, slen^2, slen:its

TABLE 7.2 Models representing Longitudinal Factors

Model Parameter      LF1       LF2        LF3
Null Deviance        3060.64   2949.865   3063.98
Residual Deviance    1582.36   1570.81    1581.54
Theta                1.47      1.379      1.475
Standard Error       .102      .092       .102
2*log likelihood     8820.56   8784.60    8822.77
AIC                  1604.10   1590.21    1605.45
Pred Error Norm1     2.689     2.6735     2.6875
Pred Error Norm2     1.9067    1.889      1.905

Listing: The Model LF2

Parameter        Code       Sum of Sq   Mean Sq    F Value    Pr(F)
Section Length   slen       779.733     779.7332   697.0394   0.00000000
Intersections    its         59.487      59.4875    53.1786   0.00000000
Traffic Volume   log(adt)   205.715     205.7146   183.8977   0.00000000
Speed Limit      spd         34.970      34.9698    31.2611   0.00000003
Parking          pk          11.724      11.7238    10.4805   0.00123358
Paved Shoulder   ops         11.628      11.6276    10.3945   0.00129194
Raised Curb      oc           5.580       5.5802     4.9884   0.02566882
Product          slen:its    79.273      79.2732    70.8659   0.0000000

Among the listed models, LF2 gave the best results in terms of model statistics and prediction errors. According to this model, the interactive term yields a better model than the introduction of a square term.

Representing Operational Factors

In all previous analyses, AADT was assumed to follow a natural log transformation. This assumption was based on the findings of a few recent studies. In this section, the assumption is reevaluated by comparing the parameters of the current model with those of several other models obtained by assuming various transformation functions for AADT, including the untransformed form. All acceptable models that resulted from this analysis are listed below. The corresponding model parameters are shown in Table 7.3.
Model   AADT Characteristics
ADT1    log(adt)
ADT2    adt
ADT3    sqrt(adt)
ADT4    adt, sq(adt)

TABLE 7.3 Models for representing AADT

Model Parameter      ADT1       ADT2      ADT3      ADT4
Null Deviance        2949.865   2918.41   2945.30   2941.56
Residual Deviance    1570.81    1570.65   1570.60   1570.86
Theta                1.379      1.3517    1.3754    1.3722
Standard Error       .0921      .0898     .092      .0917
2*log likelihood     8784.60    8770.86   8782.76   8780.90
AIC                  1590.21    1590.05   1590.01   1592.43
Pred Error Norm1     2.6735     2.6659    2.667     2.6703
Pred Error Norm2     1.889      1.8831    1.8880    1.8883

Listing: The Model ADT2

Parameter        Code       Sum of Sq   Mean Sq    F Value    Pr(F)
Section Length   slen       778.889     778.8891   691.0159   0.00000000
Intersections    its         61.090      61.0904    54.1983   0.00000000
Traffic Volume   adt        208.089     208.0886   184.6124   0.00000000
Speed Limit      spd         36.750      36.7503    32.6042   0.00000001
Parking          pk          10.422      10.4216     9.2459   0.00240237
Paved Shoulder   ops          9.941       9.9407     8.8192   0.00302953
Raised Curb      oc           3.597       3.5966     3.1908   0.07426068
Product          slen:its    74.114      74.1145    65.7530   0.00000000

Among the four models that were accepted for further comparison, ADT2 gave the best results. ADT2 is the model corresponding to the untransformed form of AADT. Both norms indicating the relative quality of prediction are lowest for ADT2, and its model parameters are comparable to or better than those of the other three models (ADT3 has a marginally lower AIC and residual deviance). The interaction of AADT with other parameters, if any, will be discussed in later sections.

Can Coefficient of Speed Limit be Negative?

In most of the models developed in the past, the modeling process started with speed limit as one of the regressors, but the final model did not contain speed limit as one of the predictors. None of the models presented in the literature review contains speed limit. When a higher speed limit is expected to result in higher crash frequencies, any result contradicting that expectation looks unacceptable and can lead to the forced removal of the variable from the model. As speed limit increases, can the crash frequency decrease?
A few models were displayed in the previous sections. In all these models, speed limit was found to be a very significant parameter, but its coefficient was consistently negative. When the p-value of this variable is close to 0, its importance in the model is undeniable, though its credibility looks suspicious.

Speed limit is not a truly independent variable. A higher speed limit is associated with higher design standards, and a higher design standard is associated with better physical highway features. Examples of measurable features are wider pavement and wider shoulders. Features that are difficult to measure include better pavement conditions, drainage conditions, sight distance, and access control. According to these models, crash frequency decreases as speed limit increases. If this is completely true, then most efforts to increase safety should focus on attaining higher highway design standards that would call for higher speed limits. Model SPD1, given in Table 7.4, is the model obtained by treating speed limit as a continuous variable; the coefficient assigned to speed limit through regression analysis is negative.

Categorical Treatment of Speed Limit: As discussed in the previous section, a higher speed limit might be associated with higher safety and lower crash frequency. The extent to which this concept holds can be better understood by treating speed limit as a categorical variable. The following listing shows the method used to redefine speed limit as a categorical variable.

#   spd   spdc   spd0   spd1   spd2   spd3   spd4   spd5   spd6
1   25    0      1      0      0      0      0      0      0
2   30    1      0      1      0      0      0      0      0
3   35    2      0      0      1      0      0      0      0
4   40    3      0      0      0      1      0      0      0
5   45    4      0      0      0      0      1      0      0
6   50    5      0      0      0      0      0      1      0
7   55    6      0      0      0      0      0      0      1

Model SPD2 in Table 7.4 is the model obtained by giving categorical treatment to speed limit. Though the model parameters did not improve, the same trend was observed. As speed limit increased, crash frequency decreased.
The categorical treatment of speed limit also displayed some quadratic trend. Therefore a square term was added to see whether a continuous representation of speed limit could still be used without losing the behavior pattern at higher speed limits. Model SPD3 is the model resulting from including the quadratic term of speed limit.

#   Model   Speed Limit   Characteristics
1   SPD1    spd           speed limit continuous
2   SPD2    spdc          speed limit is categorical
3   SPD3    spd^2         square transformed

TABLE 7.4 Models for representing Speed Limit

Model Parameter      SPD1       SPD2       SPD3
Null Deviance        2918.412   2948.45    2977.26
Residual Deviance    1570.648   1571.23    1570.31
Theta                1.3517     1.376      1.402
Standard Error       .0898      .0920      .094
2*log likelihood     8770.856   8783.55    8797.05
AIC                  1590.052   1601.532   1613.75
Pred Error Norm1     2.6659     2.6668     2.6725
Pred Error Norm2     1.8831     1.886      1.8923

Listing: The Model SPD1 (= ADT2)

Parameter        Code       Sum of Sq   Mean Sq    F Value    Pr(F)
Section Length   slen       778.889     778.8891   691.0159   0.00000000
Intersections    its         61.090      61.0904    54.1983   0.00000000
Traffic Volume   adt        208.089     208.0886   184.6124   0.00000000
Speed Limit      spd         36.750      36.7503    32.6042   0.00000001
Parking          pk          10.422      10.4216     9.2459   0.00240237
Paved Shoulder   ops          9.941       9.9407     8.8192   0.00302953
Raised Curb      oc           3.597       3.5966     3.1908   0.07426068
Product          slen:its    74.114      74.1145    65.7530   0.00000000

A stepwise regression on SPD3 rejected the quadratic term of speed limit from the model. Therefore SPD3 is not considered in the selection process. Of SPD1 and SPD2, SPD1 has better model parameters and better prediction quality. All parameters of SPD1 are superior to those of SPD2, which indicates that defining speed limit as a continuous variable is the better of the two options. Though SPD2, the categorical model, is not selected as the best model, it has given some very powerful results that help in accepting the results of SPD1 with more confidence: "higher speed limits are associated with safer highway sections."
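The indicator coding used for model SPD2 can be sketched in a few lines. This is an illustrative reconstruction of the coding listing for speed limit, not the original S-PLUS code; the function name `encode_speed_limit` is hypothetical, and the seven levels are taken from that listing.

```python
# Posted speed limits (mph) that appear in the coding listing, in order.
SPEED_LEVELS = [25, 30, 35, 40, 45, 50, 55]


def encode_speed_limit(spd):
    """Map a posted speed limit to its category code and indicator variables.

    Returns (spdc, indicators), where spdc runs from 0 to 6 and indicators
    is a dict with keys spd0..spd6, exactly one of which equals 1.
    Raises ValueError for a speed limit that is not one of the listed levels.
    """
    spdc = SPEED_LEVELS.index(spd)
    indicators = {f"spd{k}": int(k == spdc) for k in range(len(SPEED_LEVELS))}
    return spdc, indicators


spdc_40, indicators_40 = encode_speed_limit(40)
print(spdc_40, indicators_40)
```

For a 40 mph section the code is 3 and only spd3 is set, matching row 4 of the listing. When such indicators enter a regression with an intercept, one level is normally left out as the reference category; which level served as the reference in SPD2 is not stated in the text.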
Representing Cross Sectional Factors

Lane width is generally regarded as the most important parameter among all cross sectional variables. The next most significant parameter is paved shoulder width, and the least significant is unpaved shoulder width. In terms of safety, this belief need not be true. Though the lane provides the primary function, which is moving the traffic, the shoulder plays a major role in emergency situations. On rural highways, some clear area is provided beyond the unpaved shoulder. This area is called the safe recovery area. The safe recovery area gives vehicles in danger a very high chance of avoiding a serious outcome. For urban highways, this provision is usually absent due to the unavailability of adequate right-of-way.

The Lane Width Problem: All studies done in the past have concluded that lane width is an important parameter with a significant influence on crash frequency. Moreover, it is reasonable and logical to expect lane width to have a large effect on safety. In all models, the p-value of lane width was high, and the stepwise regression procedures consistently rejected lane width and prevented it from becoming one of the predictors. In the following sections, some methods are used to identify the behavior of lane width. If lane width does not affect crash frequency, it will be a strange result. If lane width does affect crash frequency, there must be an underlying behavior pattern that prevents it from staying in the prediction models.

Categorical Treatment of Lane Width: The value of lane width ranges from 9 feet to 15 feet. The sorted plot of lane width [Figure 4.2] shows that it is discrete in nature and assumes only integer values. New indicator variables were defined to treat lane width as a categorical variable. The values assumed by these new variables for the various levels of lane width are shown in the following listing.
#   Lane width   lw0   lw1   lw2   lw3   lw4   lw5
1    9 feet      1     0     0     0     0     0
2   10 feet      1     0     0     0     0     0
3   11 feet      0     1     0     0     0     0
4   12 feet      0     0     1     0     0     0
5   13 feet      0     0     0     1     0     0
6   14 feet      0     0     0     0     1     0
7   15 feet      0     0     0     0     0     1

The seven discrete values of lane width (9, 10, 11, 12, 13, 14, and 15 feet) were put into six categories. The first two discrete values are assigned to the same category since there are few highway sections with 9-foot-wide lanes. The model obtained from the categorical treatment of lane width is LW2. In model LW2, the variable lwc, which represents lane width defined as a categorical variable, has become significant with a low p-value. The model parameters are shown in Table 7.5. The model parameters are not superior to those of the other forms of representing lane width. Therefore this model is also rejected.

The coefficients of model LW2 revealed a typical behavior pattern. As lane width increases, the crash frequency decreases initially, but as lane width is increased further, the crash frequency increases instead of decreasing. A straight-line fit to this trend is closer to a horizontal line (slope 0) than to an inclined line. This behavior prevented lane width from being a significant parameter in the prediction model as a continuous variable. Since the relationship is nonlinear, a square term is included to capture the behavior of lane width. The resulting model, LW3, is also shown in Table 7.5.

The effect of lane width on safety could be greatly influenced by the available paved shoulder width. A product term representing the interaction between lane width and paved shoulder width was rejected based on AIC.

Introducing Pavement Width: On two-lane highways, the boundary between the lane and the paved shoulder is just a solid white line. Unlike on multilane highways, this line has little importance on a two-lane highway since all vehicles have direct access to the paved shoulder.
In highway sections with more than one lane in each direction, only vehicles in the outermost lane have direct access to the paved shoulder. Pavement width is defined as the sum of lane width and paved shoulder width. For two-lane highways, pavement width may be considered the effective lane width since vehicles can use the paved shoulder without any restrictions. Pavement width could be modeled as a single parameter instead of using lane width, paved shoulder width, and the product term to represent interactions. The model thus obtained, LW4, is compared with the other models to find the best prediction model in the group. The following listing gives a brief description of the models considered, and Table 7.5 shows the model parameters of all models discussed.

Model   Lane Width
LW1     lw
LW2     lwc
LW3     lw, lw^2
LW4     pw = lw + ops

Rejected by Stepwise: lw oups oups, oc oc

TABLE 7.5 Models for representing Lane Width

Model Parameter      LW1        LW2       LW3       LW4
Null Deviance        2918.41    2954.86   2926.45   2917.64
Residual Deviance    1570.65    1574.12   1571.28   1569.89
Theta                1.3517     1.3819    1.3583    1.3511
Standard Error       .0898      .0928     .0905     .0897
2*log likelihood     8770.86    8783.48   8773.79   8771.27
AIC                  1590.052   1602.28   1597.22   1589.29
Pred Error Norm1     2.666      2.668     2.665     2.663
Pred Error Norm2     1.8832     1.8832    1.882     1.880

Though LW2, the categorical model, was successful in revealing the behavior of lane width, the prediction error did not improve; it became slightly worse, from 2.666 to 2.668. Model LW3, representing the square form of lane width, is a slight improvement. Model LW4, which represents pavement width, gave the best results. Therefore LW4 is selected as the best of all these models.
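The two derived variables used in this comparison, the lane width category of LW2 and the pavement width of LW4, can be sketched as follows. The function names are hypothetical, and the sketch only restates the definitions given in the text: 9 and 10 feet share the first category, and pavement width is the sum of lane width and outside paved shoulder width.

```python
def lane_width_category(lw):
    """Category index 0..5 for an integer lane width of 9 to 15 feet.

    Widths of 9 and 10 feet share category 0; 11 feet maps to 1,
    and so on up to 15 feet, which maps to 5.
    """
    if lw not in range(9, 16):
        raise ValueError("lane width must be an integer from 9 to 15 feet")
    return max(lw - 10, 0)


def pavement_width(lw, ops):
    """Pavement width pw = lane width + outside paved shoulder width (feet)."""
    return lw + ops


categories = [lane_width_category(width) for width in range(9, 16)]
print(categories)
print(pavement_width(12, 4))
```

Replacing lw and ops with the single variable pw removes one regressor and avoids a separate product term, at the cost of assuming that a foot of lane and a foot of paved shoulder have the same effect on crash frequency.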
Listing: The Model LW4

Parameter        Code       Sum of Sq   Mean Sq    F Value    Pr(F)
Section Length   slen       779.363     779.3627   690.0949   0.00000000
Intersections    its         61.340      61.3403    54.3145   0.00000000
Traffic Volume   adt        207.298     207.2979   183.5541   0.00000000
Speed Limit      spd         36.681      36.6808    32.4794   0.00000001
Parking          pk          10.020      10.0201     8.8724   0.00294298
Pavement Width   pw           9.443       9.4431     8.3615   0.00388939
Raised Curb      oc           3.731       3.7308     3.3035   0.06933926
Product          slen:its    75.747      75.7475    67.0714   0.00000000

Analysis of Shoulder: The parameters ops, oups, and oc, which represent outside paved shoulder width, outside unpaved shoulder width, and presence of raised curb respectively, have survived the AIC criterion and found a place in the model. The coefficients of these parameters suggest that increasing the paved or unpaved shoulder width reduces crash frequency and that the presence of a raised curb increases crash frequency. Though these parameters were qualified by the AIC criterion, their standard errors and p-values are very high. Paved and unpaved shoulders help to increase the safety of any highway section. All the models that were developed in the past support this argument. But to what extent is a shoulder capable of reducing crashes efficiently? The answer to this question is revealed through the analysis shown in the following sections.

Categorical treatment of shoulder: A sorted plot of shoulder widths is shown in Figure 4.2. The pattern seen in the plots suggests that the values of shoulder width are discrete. The values range from 0 to 12. If an indicator variable is assigned to represent each value of shoulder width, the degrees of freedom will increase by 24. To reduce the degrees of freedom, values in specific ranges are included in the same category. The following listing shows how the shoulder width parameters can be redefined to reduce the total degrees of freedom to 6.
#   opsc/   ops/    opsc0/   opsc1/   opsc2/   opsc3/
    oupsc   oups    oupsc0   oupsc1   oupsc2   oupsc3
1   0       0       1        0        0        0
2   3       2-4     0        1        0        0
3   6       5-7     0        0        1        0
4   9       8-12    0        0        0        1

It was observed that both paved and unpaved shoulders have a strong influence on the model. Estimates of the standard error and p-value for these parameters were found to be low. The models discussed above are listed below, and the model parameters are shown in Table 7.6. The paved shoulder showed a behavior similar to that of lane width. The model statistics are inferior to those of the model in which pavement width is used. Though this model is not preferred, the categorical treatment has some important results to offer, which will be discussed in the final chapter.

Model   Cross Section
SH1     pw
SH2     lw, opsc, oups
SH3     lw, opsc, oupsc
SH4     pw, oupsc

Rejected by Stepwise: oups lw, oups lw, oc

TABLE 7.6 Models representing Cross Sectional Factors

Model Parameter      SH1       SH2       SH3       SH4
Null Deviance        2917.64   2929.60   2952.74   2949.70
Residual Deviance    1569.89   1570.36   1573.40   1571.97
Theta                1.3511    1.3608    1.3798    1.377
Standard Error       .0897     .0906     .0925     .0922
2*log likelihood     8771.27   8776.10   8783.27   8783.35
AIC                  1589.29   1594.12   1597.19   1595.745
Pred Error Norm1     2.6633    2.6661    2.6643    2.6617
Pred Error Norm2     1.8832    1.8836    1.8812    1.8784

Listing: The Model SH4

Parameter        Code       Sum of Sq   Mean Sq    F Value    Pr(F)
Section Length   slen       782.203     782.2030   694.9421   0.00000000
Intersections    its         63.645      63.6448    56.5448   0.00000000
Traffic Volume   adt        207.663     207.6626   184.4962   0.00000000
Speed Limit      spd         37.945      37.9447    33.7116   0.00000001
Parking          pk           9.600       9.6001     8.5291   0.00354894
Pavement Width   pw          10.081      10.0806     8.9560   0.00281215
Unpaved Shldr    oupsc       12.403       4.1343     3.6731   0.01182596
Product          slen:its    76.999      76.9993    68.4094   0.00000000

The unpaved shoulder, when given categorical treatment, showed a pattern different from that of both lane width and paved shoulder width.
Besides, this model is a significant improvement over the former model, though the degrees of freedom increased by 3. Therefore SH4 is considered the best model; in it, pavement width represents lane width and paved shoulder width, and unpaved shoulder width is expressed as a categorical variable.

Identifying Significant Interactions

The previous sections evaluated the transformation functions and identified the best way of representing each parameter in the model. The variables that are assumed to be independent need not be truly independent. The presence of powerful interactions among variables can be identified and measured using their product terms. A large model is developed from the current model by allowing all possible second level interactions. Interactions at level three and above are neglected because of the increased complexity and the difficulty of explaining the resulting terms. The resulting model, INT2, is not better than any of the simpler models discussed in the earlier sections, but it could lead toward identifying some powerful interactions. Since this model has a large number of parameters, it is able to give a good fit to the present data. But this model has a very high prediction error. Besides, most of the second level interactive parameters are unexplainable.
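The size of the all-interactions model can be appreciated by enumerating the candidate product terms. The following is a minimal sketch, assuming that the second level interactions are formed from the seven main effects of model SH4; the text does not state exactly which terms were crossed, and the function name is hypothetical.

```python
from itertools import combinations

# Main effects of model SH4 (assumed starting point for the interaction search).
main_effects = ["slen", "its", "adt", "spd", "pk", "pw", "oupsc"]


def second_level_interactions(terms):
    """Return every pairwise product term in 'a:b' notation."""
    return [f"{a}:{b}" for a, b in combinations(terms, 2)]


candidates = second_level_interactions(main_effects)
print(len(candidates))
print(candidates)
```

Seven main effects give 21 candidate product terms, and a categorical variable such as oupsc contributes several coefficients per product term, which illustrates why a model of this size can fit the present data well and still predict poorly.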
Even though this model cannot be preferred over any of the other models, it gives useful insight into a few important interaction terms. These terms are identified by screening the full model through the stepwise filter. The resulting smaller model, INT3, has a considerably lower prediction error than models INT1 and INT2.

Listing: The Model INT3

Parameter         Code         Sum of Sq    Mean Sq     F Value     Pr(F)
Section Length    slen         799.720      799.7196    712.9671    0.00000000
Intersections     its           62.956       62.9564     56.1270    0.00000000
Traffic Volume    adt          214.338      214.3380    191.0869    0.00000000
Speed Limit       spd           43.483       43.4826     38.7657    0.00000000
Parking           pk             7.774        7.7741      6.9308    0.00856236
Pavement Width    pw            11.006       11.0057      9.8118    0.00176869
Unpaved Shldr     oupsc         13.405        4.4683      3.9836    0.00771230
Product           slen:its      81.607       81.6067     72.7541    0.00000000
Product           slen:spd      15.395       15.3945     13.7245    0.00021963
Product           its:spd        3.214        3.2137      2.8651    0.09073489
Product           adt:oupsc     21.650        7.2166      6.4337    0.00025089
Product           spd:pk        18.935       18.9347     16.8806    0.00004203

#   Model   Characteristics
1   INT1    best model from previous section
2   INT2    all second degree interactions
3   INT3    model selected by stepwise regression
4   INT4    spd:its removed manually

TABLE 7.7 Models representing Second Degree Interactions

Model Parameter       INT1       INT2       INT3       INT4
Null Deviance         2949.70    3177.45    3077.12    3070.09
Residual Deviance     1571.97    1579.94    1573.31    1573.64
Theta                 1.377      1.5755     1.4884     1.4816
Standard Error        .0922      .1108      .1025      .1018
2*log likelihood      8783.35    8870.79    8836.49    8833.24
AIC                   1595.745   1673.16    1610.23    1608.36
Pred Error Norm1      2.6617     2.6748     2.6545     2.6531
Pred Error Norm2      1.8784     1.8964     1.8746     1.8735

Model INT3 is further checked for interactions that are very weak and unexplainable. Such terms are removed, and the model is checked to see whether the removal improves the prediction error. The interaction between speed limit and number of intersections is very weak and has a very high p-value (0.0907).
Removing this term (INT4) has further improved the prediction quality. The important models in this series of analyses are listed with their characteristics, and the model parameters are shown in Table 7.7.

Listing: The Model INT4

Parameter         Code         Sum of Sq    Mean Sq     F Value     Pr(F)
Section Length    slen         799.117      799.1168    712.8898    0.000000000
Intersections     its           65.146       65.1457     58.1163    0.000000000
Traffic Volume    adt          214.570      214.5702    191.4175    0.000000000
Speed Limit       spd           43.456       43.4564     38.7674    0.000000001
Parking           pk             7.954        7.9544      7.0961    0.007810734
Pavement Width    pw            10.908       10.9084      9.7314    0.001847196
Unpaved Shldr     oupsc         13.497        4.4990      4.0135    0.007400247
Product           slen:its      81.778       81.7782     72.9541    0.000000000
Product           slen:spd      15.911       15.9113     14.1945    0.000171459
Product           adt:oupsc     20.814        6.9379      6.1893    0.000354112
Product           spd:pk        18.252       18.2516     16.2822    0.000057421

The Final Model

The previous sections of this chapter presented a series of regression models in stages. Each stage of improvement was supported by improvement in the model parameters and justified by a corresponding reduction in the mean prediction error. The final model selected from this series of analyses is INT4.

TABLE 7.8 Observed vs. Predicted Values

[Table body not recoverable from the scanned source. Columns, repeated in three side-by-side blocks: section number (#), Crash Frequency (Actual, Predicted), Absolute Error.]
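The Absolute Error column of Table 7.8 can be reproduced from the actual and predicted crash frequencies, as in the minimal sketch below. The sample values are hypothetical and are not taken from Table 7.8. The two summary measures (mean absolute error and root mean square error) are generic choices assumed here for illustration; they are not claimed to match the definitions of Pred Error Norm1 and Norm2 used in this chapter.

```python
import math


def absolute_errors(actual, predicted):
    """Per-section absolute error |actual - predicted|."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [abs(a - p) for a, p in zip(actual, predicted)]


def mean_absolute_error(actual, predicted):
    """Average of the per-section absolute errors."""
    errors = absolute_errors(actual, predicted)
    return sum(errors) / len(errors)


def root_mean_square_error(actual, predicted):
    """Square root of the average squared per-section error."""
    errors = absolute_errors(actual, predicted)
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical crash counts for four highway sections (illustration only).
actual = [3, 0, 1, 6]
predicted = [4, 1, 1, 2]

errors = absolute_errors(actual, predicted)
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_square_error(actual, predicted)
print(errors, mae, round(rmse, 4))
```

A summary of this kind allows competing models to be compared on the same set of sections, which is how the prediction error is used to choose among models in this chapter.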