Generating Disaggregate Population Characteristics for Input to Travel-Demand Models

Permanent Link: http://ufdc.ufl.edu/UFE0043554/00001

Material Information

Title: Generating Disaggregate Population Characteristics for Input to Travel-Demand Models
Physical Description: 1 online resource (124 p.)
Language: english
Creator: Ma, Lu
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2011

Subjects

Subjects / Keywords: fitness -- population -- synthesis -- validation
Civil and Coastal Engineering -- Dissertations, Academic -- UF
Genre: Civil Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The past several years have seen tremendous developments in disaggregate travel-demand models. Applying such models for prediction and policy evaluation requires, as inputs, detailed information on the socio-economic-mobility characteristics of the population. Synthesis methods are used first to generate the population for a base year (the current or census year), which in turn serves as an input for generating the target-year (forecast-year) population. The state-of-the-practice approach to population synthesis uses the Iterative Proportional Fitting (IPF) method. Although this approach has been applied several times, several issues remain. First, the number of controls used in the synthesis of the base-year population has been limited. In particular, most practical applications control only for household-level attributes (e.g., household size and dwelling-unit type) and not for person-level attributes such as age and gender. Thus, the synthesized base-year population may not truly match the observed person-level distributions. This would affect the accuracy of the target-year population, because the synthesized base-year population is an input to its generation. Second, documentation of the validation of the synthesis procedure, especially in the context of a "target" year population, is limited. The broad focus of this dissertation is to contribute towards synthetic population generation by addressing these issues. To generate a synthetic population as input to disaggregate travel-demand models, this dissertation proposes a new framework for synthetic population generation based on a fitness-based synthesis (FBS) method in which multi-level attributes (household level, person level, etc.) can be controlled simultaneously.
In the simulation, several socioeconomic variables (such as household size, income, gender, and age) at the census-tract level are chosen as control information, and the 5% sample from the corresponding PUMA (Public Use Microdata Area) forms the seed data. Empirical results indicate that the controlled attributes of the synthetic population match those of the true population almost perfectly. Furthermore, this dissertation proposes a validation approach in which a set of households is converted into a distribution of household types, and introduces several criteria for measuring the difference between the true and synthetic populations. As expected, the synthetic population generated under household- and person-level controls is much more similar to the true population than the one generated using only household-level controls. This dissertation also compares the FBS method with other population synthesizers using the proposed validation criteria; the resulting synthetic populations are compared on the basis of their differences from the true population. Although the generation of the synthetic target-year population is methodologically similar to that of the base year, two factors affect the quality of target-year synthetic populations exclusively. The first is the seed data, namely the synthetic population of the base year. The second is the set of controlled attributes used in target-year synthesis: unlike base-year controls, target-year controls come from population projection models and are therefore inaccurate. To examine these factors, target-year synthetic populations are generated with the proposed framework and with IPF separately. In a back-casting analysis, target-year (1990) populations for twelve census tracts in Florida are generated.
For each census tract, three different base-year populations are synthesized using different controls, and for each base-year population two different methods are applied to generate target-year populations. The results indicate that using more accurate base-year populations as seed data is more likely to yield more accurate target-year populations. A similar analysis is then repeated with inaccurate controls. Comparison with the populations generated from the true controls indicates that the additional error introduced into the synthetic populations by inaccurate controls is linearly related to the amount of error in the controls. Using the proposed method, this dissertation also assesses travel-demand models applied to synthetic populations, using the NHTS (National Household Travel Survey) dataset. First, two trip-generation models are estimated using different specifications of explanatory variables. One is more disaggregate, since it involves many household characteristics, while the other uses only a few characteristics and can therefore be considered an aggregate model. Second, two different synthetic populations are generated using different controls. Then the two models are applied to the two populations separately. In sum, this dissertation develops a fitness-based synthesis methodology that can be applied to synthesize populations by controlling several attributes at both the household and person levels simultaneously. To assess the similarity between the true and synthetic populations, this dissertation proposes a validation approach as well as several validation measures. The procedure was applied to synthesize both base-year and target-year populations for twelve census tracts in Florida. The analysis indicates that the proposed approach yields synthetic populations that match the true distributions rather closely. Further, the results highlight the improvements that can be achieved by controlling for both household- and person-level attributes.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Lu Ma.
Thesis: Thesis (Ph.D.)--University of Florida, 2011.
Local: Adviser: Srinivasan, Sivaramakrishnan.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2011
System ID: UFE0043554:00001

Full Text

GENERATING DISAGGREGATE POPULATION CHARACTERISTICS FOR INPUT TO TRAVEL-DEMAND MODELS

By

LU MA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2011

© 2011 Lu Ma

To my parents

ACKNOWLEDGMENTS

First of all, I would like to thank my supervisor, Dr. Siva Srinivasan, who has always inspired me in my studies. From him I have received not only knowledge but also a way of thinking and a serious attitude toward research, which will benefit my future career and life. I also want to thank Dr. Trevor Park, Dr. Lily Elefteriadou, Dr. Yafeng Yin, Dr. Scott Washburn, and Dr. Ruth Steiner for their valuable suggestions and comments on this study.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 CONCEPTUAL FRAMEWORK
    Conceptual Overview of the Generation of Synthetic Population
    Synthesizing the Base-Year Population
    Synthesizing the Target-Year Population
    Application on Travel Demand Models

3 LITERATURE REVIEW
    Generation of Synthetic Populations
        Iterative Proportional Fitting (IPF)
        Methodology for Single-Level Controls
        Methodology for Multi-Level Controls
        Applications of Population Synthesis
    Validation Methods
    Summary

4 FITNESS BASED SYNTHESIS METHODOLOGY
    Framework
        Initial Household Sets
        Fitness Functions
        Selection Mechanism
    Some Properties
    Conceptual Comparison with IPF Based Methods
    Summary

5 BASE-YEAR POPULATION SYNTHESIS: COMPARISON AND VALIDATION
    Dataset
        Pre-Treatment of Seed Data
        Control Tables
    Validation Method
        Defining Household Types
        Measures of Dissimilarity between True- and Synthesized-Populations
        Comparison with Other Methods
    Summary

6 TARGET YEAR POPULATION SYNTHESIS: APPLICATION AND VALIDATION
    Analysis Framework
    Dataset
    Results
        Impact of Accuracy of the Base-Year Population
        Impact of Target-Year Control Tables and Methods
        Impact of Inaccurate Control Tables
    Summary

7 ASSESSMENT OF TRAVEL DEMAND MODELS APPLIED TO SYNTHETIC POPULATIONS
    Dataset
    Trip Generation Models
    Population Synthesis
    Assessment of Linear Regression Based Trip Generation Models
    Summary

8 SUMMARY AND CONCLUSIONS

APPENDIX: NUMERICAL ILLUSTRATION OF THE FITNESS BASED SYNTHESIS PROCEDURE
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
4-1 Feasible value of fitness functions and corresponding operations
5-1 Characteristics for defining household types
5-2 Aggregate comparisons of the true- and synthesized-populations for 22 artificial census tracts
5-3 Validation results of population from three population synthesizers
5-4 Number of iterations for generating population of 22 artificial census tracts
6-1 Characteristics of the twelve census tracts in 1990 and 2000
6-2 Control tables for base year population synthesis
6-3 Number of iterations for generating population of 12 census tracts by method FBS2 in base year (2000)
6-4 Accuracy of target year synthetic populations
6-5 Differences among synthesized base-year populations
6-6 Difference between true controlled tables and erroneous tables
7-1 Frequency distribution of household HBNWSR trip rates
7-2 Aggregate model
7-3 Disaggregate model
7-4 Control tables for population I
7-5 Control tables for population II
7-6 Total number of trips for households with different life cycle characteristic
7-7 Average number of trips for households with different life cycle characteristic
7-8 Average number of trips for households with different income characteristic
7-9 Distribution of household size and location for population I and true population

LIST OF FIGURES

Figure
2-1 Conceptual framework of the population synthesis procedure
4-1 Flowchart of the Fitness Based Synthesis method
5-1 Examples of control tables
5-2 Scatter plots of three different synthetic populations against true population on artificial census A6
6-1 Marginal tables for assessing the target year populations
6-2 Impact of base year populations on the accuracy of target year population
6-3 Impact of target-year controls and data-fusion methodology on the accuracy of target year population
6-4 Impact of inaccurate control tables on the change of accuracy of target year populations
A-1 Control tables
A-2 Seed data
A-3 HT tables for each of the households in the seed data
A-4 The representation of synthetic population as the structure of control tables

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

GENERATING DISAGGREGATE POPULATION CHARACTERISTICS FOR INPUT TO TRAVEL-DEMAND MODELS

By

Lu Ma

December 2011

Chair: Sivaramakrishnan Srinivasan
Major: Civil Engineering

The past several years have seen tremendous developments in disaggregate travel-demand models. Applying such models for prediction and policy evaluation requires, as inputs, detailed information on the socio-economic-mobility characteristics of the population. Synthesis methods are used first to generate the population for a base year (the current or census year), which in turn serves as an input for generating the target-year (forecast-year) population. The state-of-the-practice approach to population synthesis uses the Iterative Proportional Fitting (IPF) method. Although this approach has been applied several times, several issues remain. First, the number of controls used in the synthesis of the base-year population has been limited. In particular, most practical applications control only for household-level attributes (e.g., household size and dwelling-unit type) and not for person-level attributes such as age and gender. Thus, the synthesized base-year population may not truly match the observed person-level distributions. This would affect the accuracy of the target-year population, because the synthesized base-year population is an input to its generation. Second, documentation of the validation of the synthesis procedure, especially in the context of a target-year population, is limited. The broad focus of this dissertation is to contribute towards synthetic population generation by addressing these issues.

To generate a synthetic population as input to disaggregate travel-demand models, this dissertation proposes a new framework for synthetic population generation based on a fitness-based synthesis (FBS) method in which multi-level attributes (household level, person level, etc.) can be controlled simultaneously. In the simulation, several socioeconomic variables (such as household size, income, gender, and age) at the census-tract level are chosen as control information, and the 5% sample from the corresponding PUMA (Public Use Microdata Area) forms the seed data. Empirical results indicate that the controlled attributes of the synthetic population match those of the true population almost perfectly. Furthermore, this dissertation proposes a validation approach in which a set of households is converted into a distribution of household types, and introduces several criteria for measuring the difference between the true and synthetic populations. As expected, the synthetic population generated under household- and person-level controls is much more similar to the true population than the one generated using only household-level controls. This dissertation also compares the FBS method with other population synthesizers using the proposed validation criteria; the resulting synthetic populations are compared on the basis of their differences from the true population.

Although the generation of the synthetic target-year population is methodologically similar to that of the base year, two factors affect the quality of target-year synthetic populations exclusively. The first is the seed data, namely the synthetic population of the base year. The second is the set of controlled attributes used in target-year synthesis: unlike base-year controls, target-year controls come from population projection models and are therefore inaccurate. To examine these factors, target-year synthetic populations are generated with the proposed framework and with IPF separately. In a back-casting analysis, target-year (1990) populations for twelve census tracts in Florida are generated. For each census tract, three different base-year populations are synthesized using different controls, and for each base-year population two different methods are applied to generate target-year populations. The results indicate that using more accurate base-year populations as seed data is more likely to yield more accurate target-year populations. A similar analysis is then repeated with inaccurate controls. Comparison with the populations generated from the true controls indicates that the additional error introduced into the synthetic populations by inaccurate controls is linearly related to the amount of error in the controls.

Using the proposed method, this dissertation also assesses travel-demand models applied to synthetic populations, using the NHTS (National Household Travel Survey) dataset. First, two trip-generation models are estimated using different specifications of explanatory variables. One is more disaggregate, since it involves many household characteristics, while the other uses only a few characteristics and can therefore be considered an aggregate model. Second, two different synthetic populations are generated using different controls. Then the two models are applied to the two populations separately.

In sum, this dissertation develops a fitness-based synthesis methodology that can be applied to synthesize populations by controlling several attributes at both the household and person levels simultaneously. To assess the similarity between the true and synthetic populations, this dissertation proposes a validation approach as well as several validation measures. The procedure was applied to synthesize both base-year and target-year populations for twelve census tracts in Florida. The analysis indicates that the proposed approach yields synthetic populations that match the true distributions rather closely. Further, the results highlight the improvements that can be achieved by controlling for both household- and person-level attributes.
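For orientation, the IPF baseline that the abstract contrasts with the proposed method can be sketched in a few lines. The sketch below is only an illustration under stated assumptions, not the procedure used in this dissertation: it assumes two household-level controls (a size category and an income category), builds the seed table by counting seed households per cell, and draws households with probability proportional to the fitted cell value. All function names, the toy seed households, and the control totals are hypothetical.

```python
import random


def seed_table(households, n_rows, n_cols):
    """Count seed households in each (row category, column category) cell."""
    table = [[0.0] * n_cols for _ in range(n_rows)]
    for _, r, c in households:
        table[r][c] += 1.0
    return table


def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Rescale a 2-D seed table until its margins match the control totals."""
    table = [list(row) for row in seed]
    for _ in range(max_iter):
        # Row step: scale every row to its control total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Column step: scale every column to its control total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # The column step disturbs the row sums; stop once they agree again.
        gap = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if gap < tol:
            break
    return table


def draw_households(households, fitted, counts, n, rng):
    """Draw n seed households, weighting each by its cell's fitted value
    shared equally among the seed households in that cell."""
    weights = [fitted[r][c] / counts[r][c] for _, r, c in households]
    return rng.choices(households, weights=weights, k=n)


# Hypothetical seed data: (household id, size category, income category).
seed_households = [("h1", 0, 0), ("h2", 0, 1), ("h3", 1, 0),
                   ("h4", 1, 1), ("h5", 1, 1)]
counts = seed_table(seed_households, 2, 2)
# Hypothetical control totals for a tract of 100 households.
fitted = ipf(counts, row_targets=[40.0, 60.0], col_targets=[30.0, 70.0])
synthetic = draw_households(seed_households, fitted, counts, 100,
                            random.Random(0))
```

Note that both controls in this sketch are household-level attributes; person-level margins such as age and gender do not enter the fitted table at all, which is the limitation the abstract attributes to most practical applications.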


CHAPTER 1
INTRODUCTION

The past several years have seen tremendous developments in disaggregate travel-demand models and activity-based transportation planning models (e.g., Bhat and Koppelman, 1993). This interest is motivated by several factors such as (1) reduction of aggregation errors, (2) ensuring sensitivity to demographic shifts like the ageing of the population, (3) capturing differential response of travelers to policy actions, and (4) addressing special travel needs of certain population groups. The application of such models (e.g., Bhat et al., 2004) for predictions and policy evaluations requires as inputs detailed information on the socio-economic-mobility characteristics of the population (examples of these characteristics include household-level attributes such as size and composition, income, dwelling-unit type, and auto ownership, and person-level attributes such as age, gender, and employment status). However, systematic procedures to forecast the attributes required by such disaggregate travel-demand models have been under development only recently, and experience with the empirical application of such models is limited. Therefore, developments in the area of population synthesis methods are important for furthering disaggregate-modeling efforts and for their adoption as state of the practice.

The methodology for producing such a disaggregate population is generally referred to as Synthetic Population Generation (SPG). The classical SPG framework (Beckman et al., 1996) comprises two major inputs: (1) the marginal distributions (control tables) of certain attributes (e.g., household size and income) for a certain target area (census tract), and (2) a sample of households (seed data) with detailed attribute values, often from a larger area (such as the Public Use Microdata Area or PUMA,


discussed in more detail later on). The Iterative Proportional Fitting (IPF) technique (see Deming and Stephan, 1940; Ireland and Kullback, 1968) is employed to generate the multivariate distribution of all attributes of interest. Using the cell values of this joint distribution as selection probabilities, households are drawn from the seed data, thereby generating the synthetic population of interest.

An important shortcoming of the classical SPG procedure is that the population is generated to match controls at only one level (the household level in almost all practical applications). Information about person-level attributes (such as age and gender) is ignored, even though such marginal controls are available. Therefore, there is potential for improving the accuracy of the synthesized population by employing methods that can handle multi-level controls. Within the last couple of years, there has been growing interest in population synthesis using multi-level controls, and a primary objective of this dissertation is to contribute towards this end. Specifically, a fitness-based approach is developed to synthesize populations that fit controls at multiple levels simultaneously.

A second important issue is that of validation. Populations have to be synthesized only because the true values are unknown. However, the lack of the true population also makes it difficult to validate the methods developed to synthesize the population. In this dissertation, rigorous validation procedures are described and applied in the context of base-year population synthesis. The proposed validation metrics are also used to compare our fitness-based methodology to other approaches that have been recently developed.

Furthermore, unlike the case of base-year population synthesis, the documentation of results on target-year population synthesis is limited (Bowman, 2004


and Bowman and Rousseau, 2008). Although, conceptually, the application of the approach for target-year synthesis is similar to its application for base-year synthesis, there are three important issues of concern. First, the target-year synthesis uses the base-year synthesized population as seed data. Thus, the methodology and controls used in the base-year synthesis affect the accuracy of the base-year population and, in turn, the target-year population. Second, one can expect significantly fewer control tables to be available for the target-year synthesis than for the base-year synthesis. In this situation, there might be benefits to using approaches that control for both person- and household-level information, as opposed to methods that control for only household-level information, so as to take advantage of all the minimal data available. Third, the target-year control tables are projections, in contrast to base-year control tables, which are derived from census counts. It has been well documented (Smith and Shahidullah, 1995) that there are significant errors in these projected aggregate distributions of population characteristics. Therefore, examining the effects of errors in control tables is of interest.

Finally, it is important to remember that these procedures ultimately provide data that are in turn fed into disaggregate travel-demand models. Disaggregate models capture the travel behavior of the fundamental decision-making units and include several explanatory variables (including socio-economic and mobility characteristics). Consequently, one may expect such models to provide more accurate predictions of travel characteristics than aggregate models, which include fewer explanatory variables. However, this depends on the accuracy of the socio-economic-mobility characteristics of the synthesized population. Specifically, if the synthesized population


is an inaccurate representation of the true population, gains from a disaggregate model could be offset by the errors in the synthesized population. In light of the above discussion, predictions from aggregate trip-generation models will be compared with those from disaggregate trip-generation models (with synthetic populations as inputs) to assess the true value of synthetic populations and disaggregate travel-demand models.

In summary, the goals of this dissertation are to: (1) develop a procedure for population synthesis that allows for multi-level controls and demonstrate its benefits over conventional, IPF-based methods, (2) develop a systematic framework for validating synthetic population generators and apply it in the context of the new procedure and two other major methods, (3) synthesize target-year populations (using back-casting methods), assess the value of the new synthesis procedure, and quantify the impacts of erroneous controls, and (4) apply the synthetic populations to trip-generation models and assess the overall accuracy of the trip-rate predictions.

The rest of the dissertation is arranged as follows. Chapter 2 presents the conceptual framework of population synthesis. Chapter 3 gives a thorough review of population synthesis and validation methods. Chapter 4 introduces the framework of fitness-based population synthesis. Chapter 5 applies the proposed fitness-based approach to base-year population synthesis and introduces several validation methods. Chapter 6 synthesizes target-year populations and analyzes their accuracy. Chapter 7 applies the synthetic populations to trip-generation models. Chapter 8 closes the dissertation with a summary and conclusions.


CHAPTER 2
CONCEPTUAL FRAMEWORK

This chapter presents an overview of the methods currently available for population synthesis. The first section presents a conceptual overview of the overall synthesis procedure. Although the intent is generally to generate a population for a target year, the synthesis procedure begins with generating a population for a base year. The second section discusses methods for synthesizing the base-year population, whereas the third section describes the methods for target-year population synthesis. At the end of this chapter, the synthetic population is discussed in terms of inputs for travel-demand models.

Conceptual Overview of the Generation of Synthetic Population

A conceptual overview of the SPG procedure is presented in Figure 2-1. In an SPG procedure, the base-year population is generated first. The base year is defined as the survey year and is usually the most recent census year (currently, this would be year 2000). Once generated, the base-year population serves as an input to the SPG for a target year. A target year is defined as any year beyond the base year and may or may not be a year for which the decennial census has been planned. That is, if the base year is 2000, years 2003, 2010, and 2025 would all qualify as target years.

The synthesis of the base-year population is performed by data-fusion techniques. Broadly, aggregate control tables (often at the census-tract level) are fused with disaggregate data on population characteristics (seed data) available for a sample of households in the area (often at the PUMA level) to which the census tract belongs.


The result is a synthetic population for the base year comprising households drawn (with replacement) from the seed data of the corresponding PUMA in such a way that the aggregate characteristics of the synthetic population match the aggregate control tables.

Given the base-year synthetic population, there are two approaches for generating the target-year population. The first approach uses a data-fusion technique similar to the one used in base-year population synthesis, with the synthetic base-year population serving as seed data. The second approach, called the evolution approach, involves growing each base-year household over time to determine its characteristics in the target year. This involves modeling complex phenomena such as household formation, dissolution, and migration. Once target-year populations are generated, they can be used as inputs to travel-demand models.

Synthesizing the Base Year Population

The state-of-the-practice approach to base-year population synthesis involves fusing aggregate control tables with disaggregate seed data. Control tables are one-way or multi-way marginal distributions. Each of these tables corresponds to the joint distribution of a subset of the required population attributes. Typically, these distribution tables are available from the census SF1 and SF3 files at the spatial resolution of census block groups or census tracts. The population is synthesized at the spatial resolution of the control tables (this is referred to as the synthesis area in the rest of this document).

The seed dataset comprises a sample of population records, with each household/person characterized by all the attributes of interest. The location of these households is typically known only at a more aggregate spatial scale (in contrast to the


finer spatial resolution of the control tables). Typically, such household-level information is obtained from the US census Public Use Microdata Samples (PUMS), and the location is defined in terms of the Public Use Microdata Areas (PUMAs).

The state-of-the-practice data-fusion procedure involves two major steps. First, a joint multi-way distribution of all attributes of interest is generated using the Iterative Proportional Fitting (IPF) procedure (conceptually, the procedure is analogous to the Fratar balancing technique; a detailed algorithm for the IPF procedure is available in Beckman et al., 1996). The IPF procedure ensures that, when the multi-way distribution is appropriately aggregated, the results match the marginal distributions provided by the control tables (the extent of matching depends on the tolerance used). The result of this iterative procedure is a multi-way distribution table that provides the number of households of each type in the synthesis area. In the second step, individual household records are drawn from the seed dataset using Monte Carlo simulation so as to satisfy the joint multi-way distribution.

This methodology has been applied to support travel-demand modeling in several areas such as Portland Metro, San Francisco, New York, Columbus, Atlanta, Sacramento, Bay Area, and Denver. Bradley and Bowman (2006) and Bowman (2004) provide a general overview of these applications. The Sacramento application is available in Bowman and Bradley (2006), and the Atlanta application and validation results are presented in Bowman and Rousseau (2008). All the applications discussed thus far control for only household-level attributes. Several studies (e.g., Guo and Bhat, 2007; Arentze et al., 2007; Ye et al., 2009; Auld and Mohammadian, 2010) provide extensions to incorporate both


household- and person-level controls in the IPF-based population-synthesis procedure. Detailed descriptions of these procedures are presented in Chapter 3.

Synthesizing the Target Year Population

The data-fusion approach for the synthesis of the target-year population is conceptually similar to the one used for generating the base-year population. Once again, aggregate control tables and disaggregate seed data are the inputs. The control tables represent the aggregate socio-economic-mobility characteristics of the synthesis area in the target year. There are two key differences between the control tables used in the base-year synthesis and those used in the target-year synthesis. First, for the target year, the number of controls available is limited (and often multi-dimensional controls may not be available). In contrast, the base year would have several (and multi-dimensional) controls from the Census data. Second, the control tables for the target year may not even be available at the synthesis-area level and may have to be derived from more aggregate spatial units (such as the county).

The structure of the seed data for the target-year population synthesis is the same as the one for the base year, because the synthesized base-year population is taken as the seed data. The reader will note that the seed data for the base year are at the PUMA level but from the same year, in contrast to the seed data for the target year, which are from the same census tract but from the base year.

The methodology used for the target-year population synthesis is predominantly the same as the one used in the base year. However, some of the attributes of interest may not be directly synthesized owing to a lack of control data. For these, cross-sectional models can be used. A classic example of an attribute that is forecast in such a manner is automobile ownership [see, for example, the Oregon2 Model (Hunt et al.,


2004) or the SACOG model (Bowman and Bradley, 2007)]. Typically, the US census does not provide projections of aggregate auto-ownership levels for any future year for use in a data-fusion approach. However, it is possible to develop cross-sectional models of auto ownership (as a function of household characteristics, land-use patterns, transportation system characteristics, etc.) using data from local household travel surveys or the PUMS. Thus, once the appropriate socio-economic characteristics for a forecast year have been determined using data-fusion techniques, the cross-sectional model can be applied to each household to generate the auto-ownership levels.

In the evolution method, each household in the base-year synthetic population database is evolved, or aged, through time to determine its characteristics for any future year. This involves the development of a system of models that describe the common demographic/economic transitions that take place over the life cycle of a household. These transitions include processes such as ageing, births, deaths, formation (marriage) and dissolution (divorce) of households, employment and education choices, children moving out of the household, automobile ownership decisions, and emigration from or immigration to the study region. Some of the currently available model systems that adopt such an approach include MIDAS (Goulias and Kitamura, 1996), MASTER (Mackett, 1990), CEMSELTS (Eluru et al., 2008), DEMOS (Sundararajan and Goulias, 2003), and the HA module of the Oregon2 model system (Hunt et al., 2003). Such methods are appealing as they try to simulate the real processes households go through and model behavioral decisions made at different stages of the life cycle. However, as identified by Eluru et al. (2008), limited theoretical knowledge of the complex socio-economic evolution processes and the minimal availability of relevant data at the


household level limit our ability to specify and estimate good models of household evolution.

Application on Travel Demand Models

As mentioned before, synthetic populations are required as input for disaggregate travel-demand models and activity-based transportation planning models (e.g., Bhat and Koppelman, 1993). This is the ultimate goal of generating synthetic populations. Synthetic populations undoubtedly carry more disaggregate information than aggregate data, even though such information is inaccurate to some degree. The accuracy of travel-demand forecasting therefore depends on the accuracy of the socio-economic-mobility characteristics of the synthesized population. Specifically, if the synthesized population is an inaccurate representation of the true population, gains from a disaggregate model could be offset by the errors in the synthesized population.

The details of the proposed fitness-based synthesis framework for data fusion are presented in Chapter 4 of this dissertation. The procedure is applied and validated in the context of base-year synthesis, as discussed in Chapter 5. The analysis of the population generation procedure in the context of the target year is described in Chapter 6. In Chapter 7, synthetic populations are applied to and assessed with trip-generation models.


Figure 2-1. Conceptual framework of the population synthesis procedure


CHAPTER 3
LITERATURE REVIEW

This chapter presents a literature review of population synthesis procedures based on data-fusion methods. The first section summarizes past research and applications of population synthesis and reviews the methods for single-level controls. Then the ideas and current methods for controlling multi-level characteristics are described in more detail. At the end of this chapter, validation methods for population synthesis are reviewed.

Generation of Synthetic Populations

Broadly, the procedure for synthesizing populations involves selecting a set of households from the seed data in such a way that the controls are satisfied. A classic framework adopts IPF as the kernel of the data-fusion technique. Including IPF, two major steps are involved in the procedure of population synthesis. First, the sample frequencies of each cell of the household contingency table are projected to match the marginal counts; IPF (Deming and Stephan, 1940; Wong, 1992) is used to project the seed data to match the control data. Second, the number of households to be generated of each type (each combination of household characteristics) is determined by the projected cell counts. In fact, almost all current approaches involve IPF in their first step, and this dissertation briefly introduces the IPF method through a two-way contingency table.

Iterative Proportional Fitting (IPF)

In a two-way contingency table, let the cell count in the $i$th row and $j$th column of the seed data be $n_{ij}$, and let the marginal controls for the $i$th row and $j$th column be $m_{i\cdot}$ and $m_{\cdot j}$, respectively. Then the estimated cell counts can be calculated by an iterative fitting procedure. During each iteration $k$, the estimated cell counts are updated by a row fitting and a column fitting. Let $I$ be the number of rows in the contingency table. With the initial value of each cell count $p_{ij}^{(0)} = n_{ij}$, a row fitting is implemented as

\[ p_{ij}^{(2k-1)} = p_{ij}^{(2k-2)} \, \frac{m_{i\cdot}}{p_{i\cdot}^{(2k-2)}}, \qquad i = 1, \dots, I \tag{3-1} \]

where $p_{ij}^{(2k-1)}$ is the cell count in iteration $2k-1$ and $p_{i\cdot}^{(2k-2)} = \sum_{j} p_{ij}^{(2k-2)}$ is the row sum of cell counts in iteration $2k-2$. Then, the column fitting can be implemented in the same manner,

\[ p_{ij}^{(2k)} = p_{ij}^{(2k-1)} \, \frac{m_{\cdot j}}{p_{\cdot j}^{(2k-1)}}, \qquad j = 1, \dots, J \tag{3-2} \]

where $p_{\cdot j}^{(2k-1)} = \sum_{i} p_{ij}^{(2k-1)}$ is the column sum of cell counts in iteration $2k-1$ and $J$ is the number of columns in the contingency table. Repeating this procedure until convergence, the estimated cell counts match the marginal row distribution and the marginal column distribution simultaneously. Moreover, the IPF procedure inherently maximizes the entropy, or minimizes the discrimination information, of the joint distribution of the contingency table (see Ireland and Kullback, 1968; Ruschendorf, 1995) and also maximizes the likelihood function under a log-linear model (see Agresti, 2002).

The above procedure can be readily extended to multi-way contingency tables. In the case of more than two attributes, cell counts are projected to fit the marginal distribution for each dimension of the contingency table.
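
The row fitting and column fitting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this dissertation: the function name `ipf_two_way`, the tolerance, the iteration cap, and the small example table are all assumptions made for the sketch, and the row and column controls are assumed to share the same grand total. A row or column whose current sum is zero is simply left at zero here, which is also an assumption of the sketch.

```python
def ipf_two_way(seed, row_controls, col_controls, tol=1e-9, max_iter=1000):
    """Fit a two-way table to row and column controls by alternate scaling.

    seed          -- list of rows holding the seed cell counts
    row_controls  -- target row totals
    col_controls  -- target column totals (same grand total as row_controls)
    """
    table = [[float(v) for v in row] for row in seed]
    n_rows, n_cols = len(table), len(table[0])

    for _ in range(max_iter):
        # Row fitting: scale every row so its sum equals the row control.
        for i in range(n_rows):
            row_sum = sum(table[i])
            # Assumption of this sketch: an all-zero row stays at zero.
            factor = row_controls[i] / row_sum if row_sum > 0 else 0.0
            table[i] = [v * factor for v in table[i]]

        # Column fitting: scale every column so its sum equals the column control.
        for j in range(n_cols):
            col_sum = sum(table[i][j] for i in range(n_rows))
            factor = col_controls[j] / col_sum if col_sum > 0 else 0.0
            for i in range(n_rows):
                table[i][j] *= factor

        # Columns match exactly after the column fitting, so test the rows.
        max_err = max(abs(sum(table[i]) - row_controls[i]) for i in range(n_rows))
        if max_err < tol:
            break
    return table


# Hypothetical 2x2 example: seed counts and tract-level controls.
fitted = ipf_two_way([[1, 2], [3, 4]], row_controls=[40, 60], col_controls=[30, 70])
```

Because each step only rescales whole rows or whole columns, the cross-product (odds) ratio of the seed table is carried over unchanged to the fitted table, which is the property that lets IPF retain the correlation structure of the seed data.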


As mentioned before, IPF has attractive properties (e.g., it maximizes entropy and preserves the odds ratios of the seed data), and it is the foundation of almost all current research on population synthesis. However, when the number of controls is large, it is no surprise that some of the cells are zero. In particular, when the seed sample is small, some types of households are missing because of sampling variation. Also, some combinations of household characteristics are rare or do not exist in reality. For example, the chance of a single-person household with 6 or more vehicles, or of a single-person household consisting only of a 5-year-old child, is very small, and the household size of a family household cannot be one. The latter situation is also referred to as fixed zeros (Fienberg, 1970b), while the former is called sampling zeros. More specifically, the seed cell counts $n_{ij}$ are not all positive, and if a large number of cells in the contingency table have $n_{ij} = 0$, the situation is called the zero-cell problem (Beckman et al., 1996; Auld et al., 2008). Ireland and Kullback (1968) show that the iterative procedure is convergent; however, the situation in which the values of $n_{ij}$ are not all positive is avoided (Fienberg, 1970a). A contingency table with some zero entries is referred to as an incomplete contingency table (see Yvonne et al., 1969). In fact, IPF on incomplete contingency tables always converges under regularity conditions, but more slowly than on the related complete contingency tables (Fienberg, 1972). Besides the situation in which $n_{ij} = 0$, it is also possible that an entire seed marginal total is zero ($n_{i\cdot} = 0$ or $n_{\cdot j} = 0$), in which case the problem has no feasible solution because the required types of households do not exist in the seed data. Furthermore, in such a situation IPF cannot proceed because the denominator is zero, and all the related objective functions, e.g., discrimination information


and entropy, become infinite. Another zero-value situation during IPF is that a marginal control total is zero ($m_{i\cdot} = 0$ or $m_{\cdot j} = 0$). This situation also causes numerical issues under the original IPF procedure, but these issues can be avoided by making minor changes to the procedure. One of the suggested changes is to define zero divided by zero as zero (Fienberg, 1970b). A more detailed discussion of zero-cell situations can be found in Fienberg (1970b).

Other than IPF-based approaches, another branch of methods for generating synthetic populations is combinatorial optimization (Williamson et al., 1998; Voas and Williamson, 2000). In contrast to IPF, combinatorial optimization methods do not require an initial distribution; instead, weights are associated with each household in the sample. These weights are readjusted to yield the best fit to the control tables. The relative merits of the two methods have long been debated. For example, Ryan et al. (2009) state that combinatorial optimization methods are superior to IPF-based methods, whereas Pritchard and Miller (2009) point out several weaknesses of the combinatorial optimization method. Intuitively, IPF can copy the relationships among controlled characteristics and hence produce a more accurate population if the seed data are highly correlated with the true population. For example, if the area from which the seed sample comes is geographically related to the area of the true population, it is reasonable to assume a high correlation between the populations of the two areas. In this dissertation, the household sample is from the area (e.g., PUMA) that contains the analysis area (e.g., census tract), and we assume that the information contained in the seed data correlates with the synthetic populations. Even though the newly proposed framework does not rely on IPF, the


fundamental mechanism requires a high correlation among characteristics between the seed data and the true population.

Note that the projected cell counts may not be integers; an empirical remedy is to adopt the rounded cell counts. Consequently, some types of households are slightly overestimated and some are slightly underestimated. Alternatively, instead of using the rounded cell counts, a household can be selected from the seed data based on the probability of each cell (Beckman et al., 1996), and the selection probability may be updated after each new household is selected (Guo and Bhat, 2007).

Methodology for Single Level Controls

Traditional population synthesizers focus only on controls at a single level, in which only household-level characteristics, such as household size, household income, and vehicle ownership, are controlled. Since the work of Beckman et al. (1996), several studies have contributed to this area. Auld et al. (2008) develop a routine to automatically aggregate categories with zero cells into adjacent categories; Zhang and Mohammadian (2008) synthesize the population of the New York Metropolitan Statistical Area using a two-stage population synthesis procedure; Simpson and Tranmer (2005) illustrate that using seed data from larger areas containing the analysis area can improve accuracy to some extent, and they also developed an SPSS-based routine for IPF implementation; and Moeckel et al. (2003) use IPF to generate disaggregate populations for Netanya, Israel, and Dortmund, Germany.

Beckman et al. (1996) provide a comprehensive framework for generating synthetic populations. In this research, an IPF procedure is performed to predict the joint distribution among several demographic characteristics (e.g., age of householder, family


income). The marginal control data come from the census Summary Files. After IPF, the synthetic population of households is constructed from census PUMS data. Beckman et al. (1996) assign a selection probability to each household in the PUMS data. The probability is computed based on the distance between the selected household and the households characterized by a cell in the multi-way contingency table. Basically, if a sample household in the PUMS is similar to the most desirable household in the contingency table, a higher probability is assigned to this household. Beckman et al. (1996) also mention the zero-cell issue of IPF.

Frick and Axhausen (2004) generate the population for Switzerland using a method similar to that of Beckman et al. (1996). In this research, two major steps are conducted, and each of them includes several IPFs. The first step estimates a multi-way contingency table for an area of low spatial resolution, and the second step combines it with the marginal distributions of the areas of high spatial resolution (small areas) contained in it to generate the population for each small area. The purpose of the two steps of IPF is to correct the correlation between the high- and low-resolution areas.

Moeckel et al. (2003) use IPF to generate disaggregate populations for Netanya, Israel, and Dortmund, Germany. During the Monte Carlo sampling step, the location/address of each household is selected based on the densities of different residential locations. This consideration enriches the generated disaggregate population with spatial location information, which is required by activity-based transportation microsimulation models.

As mentioned before, when the number of attributes is large, the number of cells of the household contingency table grows exponentially, and hence the computational


resources required may be substantial. Pritchard and Miller (2009) suggest a list-based data structure to address this issue: in a large multi-dimensional contingency table, many cells correspond to households that do not exist or are rare, and hence such a sparse matrix can be stored as a list of the households that actually exist. With this specification of contingency tables, IPF adjusts only the existing households during the procedure. However, the zero-cell issue still exists if some types of households are missing.

Several methods have been proposed for the zero-cell issue. Beckman et al. (1996) simply replace the zero entry with an arbitrarily small value (e.g., 0.01). Beckman et al. (1996) showed that this idea brings no benefit in accuracy, and it may also introduce bias during prediction. Another major method aggregates the zero-cell categories with adjacent categories (e.g., Auld et al., 2008). In fact, most current SPGs adopt one of the two methods, or a combination of them, for dealing with the zero-cell problem (Auld et al., 2008).

Other than the zero-cell issue related to IPF, a downside of this state-of-the-practice procedure is that it requires all the control tables to belong to the same universe. Therefore, it is not possible to apply this procedure directly to synthesize populations by simultaneously controlling for household- and person-level attributes. However, to achieve more accurate populations, it would be desirable to control for a wide range of attributes. In the last few years, there have been several efforts to modify the traditional approach to deal with multi-level controls.

Methodology for Multi Level Controls

In fact, person-level attributes are correlated with household-level attributes. For example, a two-person family household is more likely to have one male and one female


as its members than other possible household structures. Therefore, introducing person-level attributes can benefit the accuracy of the household-level attributes of the synthetic population. Several research studies have been conducted towards this end, and Müller and Axhausen (2011) give a detailed review of population synthesizers that can control multi-level attributes. Among these studies, Arentze et al. (2007) adopt a two-step procedure, with each step involving IPF. The other three methods to be discussed (Guo and Bhat, 2007; Ye et al., 2009; Auld and Mohammadian, 2010) all begin by generating multi-dimensional distributions for household- and person-level attributes independently using IPF techniques.

Guo and Bhat (2007) present a methodology in which the household-level joint distribution and the person-level distribution are generated by IPF separately. Based on the joint distribution of household-level controlled variables, a household is randomly selected from the seed data with a specified probability. Then the desirability of the selected household is checked against two requirements. First, adding the selected household must not make the number of such households exceed a pre-specified threshold value (e.g., 120% of the number of households of this kind in the multi-dimensional contingency table from IPF). Second, for each of the persons in the selected household, the number of persons of that type after adding this household must remain below a pre-specified threshold value. If a selected household meets both requirements, a copy of this household with all its persons is added to the synthetic population. Otherwise, the selected household is ignored.
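
The two acceptance requirements can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the algorithm of Guo and Bhat (2007) as published: the function name `draw_with_thresholds`, the record layout (a dict with `type` and `persons`), the choice of selection weights (IPF target divided by the number of seed households of that type), the single `slack` factor of 1.2 applied to both levels, and the cap on the number of draws are all hypothetical.

```python
import random
from collections import Counter


def draw_with_thresholds(seed, hh_target, person_target, n_wanted,
                         slack=1.2, max_draws=10_000, rng=None):
    """Draw households with replacement, rejecting draws that break a threshold.

    seed          -- list of dicts: {"type": hh_type, "persons": [person_type, ...]}
    hh_target     -- IPF cell count per household type
    person_target -- IPF cell count per person type
    """
    rng = rng or random.Random(0)
    seed_count = Counter(h["type"] for h in seed)
    # Hypothetical weighting: spread each type's target over its seed records.
    weights = [hh_target.get(h["type"], 0.0) / seed_count[h["type"]] for h in seed]

    hh_count, person_count, population = Counter(), Counter(), []
    for _ in range(max_draws):
        if len(population) >= n_wanted:
            break
        hh = rng.choices(seed, weights=weights, k=1)[0]

        # Requirement 1: the household type must stay within its threshold.
        if hh_count[hh["type"]] + 1 > slack * hh_target.get(hh["type"], 0.0):
            continue

        # Requirement 2: every person type in the household must stay within
        # its threshold after the household is added.
        added = Counter(hh["persons"])
        if any(person_count[p] + k > slack * person_target.get(p, 0.0)
               for p, k in added.items()):
            continue

        population.append(hh)
        hh_count[hh["type"]] += 1
        person_count.update(added)
    return population


# Hypothetical seed sample and IPF targets.
seed = [
    {"type": "1p", "persons": ["adult_m"]},
    {"type": "1p", "persons": ["adult_f"]},
    {"type": "2p", "persons": ["adult_m", "adult_f"]},
]
hh_target = {"1p": 10, "2p": 10}
person_target = {"adult_m": 15, "adult_f": 15}
population = draw_with_thresholds(seed, hh_target, person_target, n_wanted=20)
```

In a sketch like this, the person-level check cannot reject anything during the first draws because all person counts start far below their thresholds, so person-level control only becomes active late in the selection.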


Therefore, one issue with this method is how to choose the threshold values for each type of household and person. Furthermore, this method does not control household-level and person-level characteristics throughout the whole selection procedure: at the beginning of selection, no person type exceeds its prespecified threshold, so the second requirement cannot be violated by any household.

Arentze et al. (2007) use a different method to deal with the two-level controls, proposing a two-step IPF algorithm. The first IPF converts person-level attributes (e.g., age and work status) to the household level. Through the interaction between age, gender, and household size, a household age composition variable is defined based on the combination of the three variables. For example, a one-male household whose age falls in a given category and a two-adult household with given age categories for its two members can be considered different household types. In Arentze et al. (2007), age and work status are combined with household demographic variables separately. The two new variables (based on the age and work status of each person in the household) are called household age composition and household work composition. During the second IPF, the newly generated variables, together with other household-level variables, serve as controls in a regular population synthesis. This method takes some person-level characteristics into account, but each person-level characteristic needs a specially designed structure linking person-level attributes to household-level demographic attributes. Furthermore, since not all marginal counts are available for the first IPF, several additional constraints are assumed.


Ye et al. (2009) proposed an algorithm that can make the synthetic population closely match the true population in terms of household-level and person-level marginal controls under certain conditions. In this research, the household-level and person-level joint distributions are estimated individually by IPF, and household types and person types are then defined according to the cells of the household-level and person-level multi-dimensional tables. At the beginning of data fitting, each household in the seed data is assigned an initial weight (usually one). The weights are iteratively updated until the weighted seed data match the two multi-dimensional tables generated by IPF. The final weight can be considered as the number of copies of that household needed to construct the synthetic population. Unfortunately, these weights are usually not integers and hence cannot truly represent the number of households of each type. A simple solution is to round the weights to the nearest integers. However, for household types with very small weights (e.g., weights less than one), rounding introduces some bias. Another method uses Monte Carlo simulation to select households from the seed data. Similar to Beckman et al. (1996), the selection probability for each household is calculated from the weights.

Auld and Mohammadian (2010) consider the joint multi-level controls in the household selection stage. In the first stage, the household-level and person-level multi-dimensional tables are generated separately by IPF. Then, during the household selection stage, the selection probability is computed from the household-level and person-level contingency tables together. In the classic selection probability (e.g., Beckman et al., 1996; Guo and Bhat, 2007) associated with each household, only


household-level information is considered. That is, a household of a type with large counts in the joint table is assigned a higher probability. Auld and Mohammadian (2010) treat person-level information in a similar manner, and the desirability of each person within a particular household is taken into account through a new version of the selection probability.

Another study, by Pritchard and Miller (2009), conceptually illustrates a mechanism in which a household is formed by combining the pre-generated household-level and person-level populations. For example, a married-couple household must be constructed from one male person and one female person, and these two people must also meet some age criteria, e.g., the age difference must not exceed some value. It is a complex mechanism, at least because many rules such as age difference and gender consistency must be considered, and some of the rules are subjectively formulated. Moreover, due to the lack of relationship information for non-family households, errors are expected to be large for such households. Unfortunately, this study does not provide a detailed numerical example of the method.

Other than the IPF-based research, Ryan et al. (2010) design a protocol to link generated individual records to household records by specifying the relationship between household members' characteristics. Instead of IPF, they use a combinatorial optimization method (Williamson et al., 1998; Voas and Williamson, 2000) to generate the populations before linking persons to households.

Lee and Fu (2011) proposed a cross-entropy optimization model for population synthesis. The cross entropy is defined similarly to the discrimination information


(Ireland and Kullback, 1968), which is also the entropy of the joint household-level distribution. Lee and Fu (2011) in effect formulate entropy for household distributions defined on multi-level characteristics. More specifically, cells in the traditional household-level contingency table are further divided by person-level characteristics. Because IPF is itself a numerical approach to solving the entropy problem, the cross-entropy method faces the same numerical issues as IPF. Namely, if a type of household (in terms of the control tables) is missing from the seed data, the entropy goes to infinity.

In sum, several studies have contributed to multi-level controls. As reviewed, the methodologies for simultaneously controlling household-level and person-level attributes vary widely. Starting from different perspectives and ideas, these methods use person-level information to obtain more accurate synthetic populations. However, several issues remain. For example, the stopping criterion is subjective if threshold values are assigned, and there is no guarantee that the synthetic population matches the true marginal controls even if the weighted household sample (Ye et al., 2009) matches them perfectly. Furthermore, the household and person characteristics considered in past population synthesizers are limited. As a result, if a large set of control variables (e.g., 20 household-level controls and 10 person-level controls) is involved, IPF may not be easy to implement, since the high-dimensional contingency table faces the zero-cell problem as well as the convergence problem of the IPF procedure.

To this end, this dissertation proposes a new framework to solve or sidestep these issues. As the reader will see shortly, the proposed method can


simultaneously control multi-level characteristics (e.g., household level, person level, and family level), and the synthetic population can match these controls almost perfectly. In the newly developed framework, the IPF procedure is optional, and hence the zero-cell and convergence problems can be avoided. Furthermore, the new framework also allows partial IPF when synthesizing populations. More specifically, the joint distribution of some attributes can be generated through IPF while other attributes keep their original data format. A more detailed discussion is presented in Chapter 4.

Applications of Population Synthesis

Since the work of Beckman et al. (1996), this methodology has been applied to support travel-demand modeling in several areas such as Portland Metro, San Francisco, New York, Columbus, Atlanta, Sacramento, the Bay Area, and Denver. Bowman and Bradley (2006) and Bowman (2004) provide a general overview of these applications. The Sacramento application is described in Bowman and Bradley (2006), and the Atlanta application and validation results are presented in Bowman and Rousseau (2008).

Because transportation planning must consider the future, the final objective of SPG is to generate the synthetic population for a future year, sometimes called a target year. Bowman and Rousseau (2008) describe the population synthesizer of the Atlanta Regional Commission (ARC), Georgia, as well as its validation. The base-year SPG of ARC is similar to many population synthesizers. Specifically, the marginal controls are defined per Transportation Analysis Zone (TAZ) from the census summary file and CTPP, with PUMS data serving as seed data. Once the base-year synthetic population has been generated, it is used as seed data for a forecast year. The marginal controls for the forecast year in ARC come from the land use


forecast. Thus the SPG of ARC connects land-use information with the travel-demand model. The base-year population synthesis of MORPC (Mid-Ohio Regional Planning Commission) is designed to have the same data structure as the data available for the target year (Bowman, 2004).

However, in most situations, the control information for the target year is limited and usually at the person level, such as the gender and age distribution for an area. Thus the data structures for the target year and the base year differ so much that an additional adjustment procedure may be required. For example, at SFCTA (San Francisco County Transportation Authority) the available socioeconomic forecasts cannot be used directly to generate the target-year population. More specifically, SFCTA has socioeconomic characteristics at the person level, whereas the base-year population synthesis is based on household attributes. An additional procedure is used at SFCTA to adjust the corresponding base-year control variables. The population synthesis of ARC faces a similar issue due to the inconsistent categorization of control variables between the base-year and target-year population syntheses. To solve this problem, ARC aggregates the control variables in both the base-year and target-year population syntheses so that the control variables of the two synthetic populations match closely.

Validation Methods

To evaluate the performance of population synthesis, we need to measure the accuracy of the synthetic population, e.g., the difference between the synthetic population and the true population. Since SPG is motivated by the situation in which the true population is not available, there is no true population for validation. Beckman et al. (1996) use the


sample in a Public Use Microdata Area (PUMA) as several artificial census tracts, and a 5% sample from these artificial census tracts serves as seed data. The synthetic population can then be compared with the true population.

Other than comparing the true and synthetic populations, Auld et al. (2008) verify their SPG against a baseline population from another method. Instead of measuring the distance between the true population and the synthetic population, this research uses the weighted average absolute percent difference between the synthetic population and the population from the baseline SPG as the validation criterion.

Another commonly used validation method does not require the true population or a baseline SPG. For the purpose of validation, some attributes from the control data are treated as validation variables, and the SPG is run without controlling for them. The distance between the validation variables in the synthetic population and the corresponding variables in the true population is then measurable and hence can be used as a criterion for validation and evaluation (e.g., Auld et al., 2008).

Bowman and Rousseau (2008) validate the performance of their SPG by back-casting to 1990. The seed data for back-casting come from the base-year (2000) SPG, and the control data are the land-use forecast data. With the SPG for 1990, validation is carried out by comparing the synthetic population of 1990 with the population from the 1990 census. They assume the land-use information is accurate enough, so that the validation concerns only the SPG itself. However, if this assumption is false, the effects of the SPG and of the land-use information may be confounded with each other.
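A distance of the kind used in these validation approaches can be sketched as follows. The exact definition of the weighted average absolute percent difference in Auld et al. (2008) is not reproduced in this chapter, so the form below (each category's absolute percent difference weighted by its share of the reference total) is an assumption, as are the function name and the example counts.

```python
from typing import Dict


def weighted_abs_pct_diff(synthetic: Dict[str, float], reference: Dict[str, float]) -> float:
    """Weighted average absolute percent difference over the reference categories.

    Assumed form: sum_k w_k * |s_k - r_k| / r_k with w_k = r_k / sum(r).
    """
    total = sum(reference.values())
    score = 0.0
    for category, ref in reference.items():
        if ref == 0:
            continue  # percent difference is undefined; the weight is zero anyway
        weight = ref / total
        score += weight * abs(synthetic.get(category, 0.0) - ref) / ref
    return score


# Hypothetical held-out validation variable: vehicles per household.
reference = {"0 vehicles": 40, "1 vehicle": 100, "2+ vehicles": 60}
synthetic = {"0 vehicles": 38, "1 vehicle": 105, "2+ vehicles": 57}
error = weighted_abs_pct_diff(synthetic, reference)
```

The same function applies whether the reference counts come from a true population, a baseline synthesizer, or a held-out control table; only the interpretation of the result changes.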


Summary

Synthesis methods are used to first generate the population for a base year (current year/census year), and this, in turn, is used as an input to generate the target-year (forecast-year) population. The state-of-the-practice approach to population synthesis involves the use of the Iterative Proportional Fitting (IPF) method. While there have been several applications of this approach, the following issues still remain. First, the number of controls used in the synthesis of the base-year population has been limited. In particular, most practical applications control only for household-level attributes (e.g., household size and dwelling-unit type) and not for person-level attributes such as age and gender. Thus, the synthesized base-year population may not truly match the observed person-level distributions. This would affect the accuracy of the target-year population, as the synthesized base-year population is used as an input to generate the target-year population. Second, documentation of the validation of the synthesis procedure, especially in the context of a "target" year population, is limited. The broad focus of this research is to contribute to the synthetic population generation (SPG) literature by addressing these issues.
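For reference, the IPF procedure named in this summary and the zero-cell behavior discussed in this chapter can be illustrated with a minimal two-dimensional sketch. The function name `ipf_2d`, the fixed iteration count, and the 2 x 2 tables are assumptions chosen for illustration; practical synthesizers work with higher-dimensional tables and a convergence test.

```python
from typing import List


def ipf_2d(seed: List[List[float]], row_targets: List[float],
           col_targets: List[float], iterations: int = 50) -> List[List[float]]:
    """Scale a 2-D seed table alternately to the row and column targets."""
    table = [[float(v) for v in row] for row in seed]
    for _ in range(iterations):
        # Row step: rescale each row so that it sums to its target.
        for r, target in enumerate(row_targets):
            total = sum(table[r])
            if total > 0:
                table[r] = [v * target / total for v in table[r]]
        # Column step: rescale each column so that it sums to its target.
        for c, target in enumerate(col_targets):
            total = sum(row[c] for row in table)
            if total > 0:
                for row in table:
                    row[c] *= target / total
    return table


# Hypothetical example: household size (rows) by tenure (columns).
fitted = ipf_2d([[1, 2], [3, 4]], row_targets=[40, 60], col_targets=[50, 50])

# Same targets, but the seed has an empty cell that scaling can never fill.
sparse = ipf_2d([[1, 2], [3, 0]], row_targets=[40, 60], col_targets=[50, 50])
```

In the first call the fitted table matches both sets of margins while keeping the odds ratio of the seed. In the second call the empty cell stays at zero, so the second row can place households only in the first column, whose target (50) is smaller than the row target (60); the two sets of margins therefore cannot both be matched.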


CHAPTER 4
FITNESS BASED SYNTHESIS METHODOLOGY

This chapter presents the fitness-based synthesis (FBS) methodology for generating a population with multi-level characteristics controlled. The first section of this chapter presents the framework of the proposed method, and the following sections discuss several aspects of the method.

Framework

Broadly, the procedure involves selecting a set of households (with replacement) from the seed data in such a way that the controls are satisfied. This study is also an extension of our previous work (Srinivasan et al., 2008; Srinivasan and Ma, 2009). The fitness-based synthesis methodology provides a mechanism to synthesize a population under multi-level control tables. The proposed method works by iteratively selecting households (with replacement) from the seed data until the control tables (at multiple levels) are matched. Further, during the iterative procedure, households already selected may be removed if removing them reduces the matching error against the control tables. Figure 4-1 gives a flowchart of the fitness-based synthesis method.

Generally, three major components define an FBS framework. The first component is the initial household set. The initial population can be any reasonable population. For example, the empty set or the entire seed data could serve as the starting point for the whole procedure. There are several considerations in the choice of initial household set, which are discussed shortly. Specifically, the analysis in Chapter 5 uses the entire data set as the starting point, whereas Chapters 6 and 7 use the empty set.


The second component is the metric used to calculate the fitness of each household. Broadly, the fitness measures the marginal contribution of adding or removing the household toward matching the controls. In the procedure, two fitness values, referred to as Type I fitness and Type II fitness, are defined for each household. More specifically, the Type I fitness computes the contribution if the household is added to the current selection, whereas the Type II fitness computes the contribution if the household is removed from the current selection. After the two types of fitness values are calculated for all households, all households with positive fitness values are considered candidates. Rigorously speaking, a household is eligible as a candidate if it has positive Type I fitness, or if it has positive Type II fitness and at least one such household is already in the current selection.

There are many possible choices of fitness function: any function that measures the difference between two distributions or vectors could serve. This research uses a squared-error-based fitness measure, under which the Type I and Type II fitness values cannot be positive simultaneously, a convenient property. Other properties of the squared-error-based fitness function are explained shortly.

The third component is the mechanism for assigning a probability to each candidate household. In this research, an equal probability is assigned to each candidate household in every iteration. A household is then selected at random, and adding or removing this household is guaranteed to improve the overall match against the control tables. The three components are discussed in detail below.


Initial Household Sets

This dissertation suggests three different initial household sets: (1) the empty set, (2) the entire seed data set, and (3) a randomly selected sample of the seed data. Other than these three types, readers can design their own initial household sets for particular reasons; for example, the population from another population synthesizer could even be used. For the three proposed types of initial household sets, the empirical results show that they perform similarly in the context of validation. Another consideration is the running time of the procedure. The empty set is the simplest specification of the initial household set, but its running time is probably the longest. This is because, at the beginning of the procedure, if the empty set serves as the initial household set, almost all households have positive Type I fitness, and hence the procedure is identical to randomly selecting households from the seed data without checking the fitness. Therefore, using a randomly selected sample of the seed data is computationally efficient.

Fitness Functions

Two fitness values for each household in the seed data can be computed by equations (4-1) and (4-2). The two fitness functions are called Type I fitness and Type II fitness, respectively.

$F_{I}^{i,n} = \sum_{j=1}^{J} \sum_{k} \left[ (R_{jk}^{n-1})^2 - (R_{jk}^{n-1} - HT_{jk}^{i})^2 \right]$   (4-1)

$F_{II}^{i,n} = \sum_{j=1}^{J} \sum_{k} \left[ (R_{jk}^{n-1})^2 - (R_{jk}^{n-1} + HT_{jk}^{i})^2 \right]$   (4-2)

where $R_{jk}^{n-1} = T_{jk} - CT_{jk}^{n-1}$


Conceptually, $F_{I}^{i,n}$ calculates the reduction in the sum of squared errors of the control tables if the $i$th household is selected for adding at the $n$th iteration, while $F_{II}^{i,n}$ is the corresponding reduction if the $i$th household is selected for removing at the $n$th iteration. In the above formulas, $j$ is an index representing the control (and the corresponding count) tables and $J$ is the total number of control or count tables. Here count tables have the same structure as control tables, and they are used to aggregate the currently selected households in the structure of the corresponding control tables. For example, $j = 1$ could represent the joint distribution of household size against tenure; $j = 2$ could represent the joint distribution of age against gender; and so on. For each control (count) table $j$, $k$ is an index representing the different cells in that table. For example, in table $j = 1$ (e.g., household size against tenure), $k$ could take values from 1 through 14, representing the 14 different cells (7 categories for household size multiplied by the 2 categories for tenure). Therefore, for this table, $k = 1$ represents the first cell (1 person / own household), $k = 2$ represents the second cell (2 person / own household), and so on.

$T_{jk}$ represents the value of cell $k$ in control table $j$; it is also the target number of households of a particular type to be synthesized. $CT_{jk}^{n-1}$ represents the value of cell $k$ in count table $j$ after iteration $n-1$. At initialization ($n = 1$), all values of the count tables are set according to the initial household set. After each operation (adding or removing), the values of the cells in the count tables are updated based on the type of the household drawn. Based on the above definitions, $R_{jk}^{n-1} = T_{jk} - CT_{jk}^{n-1}$ is the number of households/persons required to satisfy the target for cell $k$ in control table $j$ after iteration $n-1$. This is calculated as the difference between the corresponding cell


values of the control and the count tables. $HT_{jk}^{i}$ is the contribution of the $i$th household in the seed data to the $k$th cell in control table $j$. $R_{jk}^{n-1} - HT_{jk}^{i}$ is the number of households required to achieve the target in cell $k$ of control table $j$ if household $i$ is added, and $R_{jk}^{n-1} + HT_{jk}^{i}$ is the number required if household $i$ is removed. With the terms $R_{jk}^{n-1} - HT_{jk}^{i}$ and $R_{jk}^{n-1} + HT_{jk}^{i}$, functions can be constructed to calculate the overall fitness of the household. The one adopted in this study is presented in equations (4-1) and (4-2). A comparison of the performance of the algorithm under different functional forms for the fitness calculation is an area of future study. In addition, it is useful to note that the present algorithm assumes that all control tables are equally important. If this is not the case (for example, if matching the household-size distribution is more important than matching the ethnicity distribution), weights can be added to reflect the relative importance of the different tables.

Selection Mechanism

We believe the seed data are the best available sample to represent the true population. Under this assumption, all households in the candidate set should have equal selection probabilities. In addition, some household surveys provide a sampling weight for each household; instead of equal probabilities, the selection probabilities can then be assigned based on the sampling weights.

Some Properties

It can be proved that, for fixed $i$ and $n$, $F_{I}^{i,n}$ and $F_{II}^{i,n}$ cannot be positive simultaneously. Combining equations (4-1) and (4-2), it is easy to show that


$F_{I}^{i,n} + F_{II}^{i,n} = -2 \sum_{j=1}^{J} \sum_{k} (HT_{jk}^{i})^2$   (4-3)

We know that $\sum_{j} \sum_{k} (HT_{jk}^{i})^2 \geq 1$, because any household contributes one to each household-level control table and at least one to each person-level control table. So $F_{I}^{i,n} + F_{II}^{i,n} \leq -2$ for fixed $i$ and $n$. Therefore, for a given type of household, at most one of the two operations, adding or removing, can reduce the overall matching error of the control tables. More specifically, Table 4-1 lists the feasible values of $F_{I}^{i,n}$ and $F_{II}^{i,n}$ and the corresponding operations. All households in the seed data with positive Type I fitness or positive Type II fitness can then be considered candidates for selection. In addition, households with positive Type II fitness are ignored if no such household is in the current selection.

The fitness-based mechanism forces the matching error to decrease in every iteration and to approach zero eventually if the seed data contain sufficiently varied types of households. A natural termination criterion for the algorithm is that no household meets the requirement for being a candidate. It is necessary to prove that this criterion is eventually reached, namely that the number of iterations is bounded above. Let the sequence of selected households be indexed by $n = 1, \dots, N$, and denote the operation applied to household $n$ by $\delta_n$, where $\delta_n = 1$ if the household was added and $\delta_n = -1$ if the household was removed. Then we have,


$\sum_{j} \sum_{k} (R_{jk}^{N})^2 = \sum_{j} \sum_{k} \Big( R_{jk}^{0} - \sum_{n=1}^{N} \delta_n HT_{jk}^{i_n} \Big)^2 = \sum_{j} \sum_{k} (R_{jk}^{0})^2 - \sum_{n=1}^{N} F_n$   (4-4)

where $i_n$ is the household selected at the $n$th iteration, $\delta_n$ is $1$ for an addition and $-1$ for a removal, and $F_n = \max(F_{I}^{i_n,n}, F_{II}^{i_n,n})$, namely the positive fitness of the household selected at the $n$th iteration. We already know $F_n > 0$ for all $n = 1, \dots, N$. Because all the calculations are based on basic mathematical operations on integers (addition, subtraction, and multiplication, with no division), $F_n \geq 1$ for all $n = 1, \dots, N$. Therefore,

$N \leq \sum_{n=1}^{N} F_n = \sum_{j} \sum_{k} (R_{jk}^{0})^2 - \sum_{j} \sum_{k} (R_{jk}^{N})^2 \leq \sum_{j} \sum_{k} (R_{jk}^{0})^2$   (4-5)

So we have found an upper bound on the number of iterations, which justifies the proposed natural termination criterion. This upper bound is very conservative, since at the beginning of the algorithm the fitness value of each household is very large.

Empirical results show that the synthetic population can match the control tables almost perfectly. Namely, the number of households still required in each category of each control table is around zero. However, it is difficult to calculate the fitting errors before running the procedure. A geometrical explanation of the fitting errors is now presented. At the end of the procedure, no household meets the requirement for being a candidate, so most households have non-positive Type I and non-positive Type II fitness values. The following illustration ignores households that have positive Type II fitness values but do not exist in the current selection.


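Assume the fitting error for each category is $e_{jk}$ and the two types of fitness at the end of the procedure are $F_{I}^{i}$ and $F_{II}^{i}$ for the $i$th household. Then we have,

$F_{I}^{i} = \sum_{j} \sum_{k} \left[ e_{jk}^2 - (e_{jk} - HT_{jk}^{i})^2 \right] = 2 \sum_{j} \sum_{k} e_{jk} HT_{jk}^{i} - \sum_{j} \sum_{k} (HT_{jk}^{i})^2$   (4-6)

$F_{II}^{i} = \sum_{j} \sum_{k} \left[ e_{jk}^2 - (e_{jk} + HT_{jk}^{i})^2 \right] = -2 \sum_{j} \sum_{k} e_{jk} HT_{jk}^{i} - \sum_{j} \sum_{k} (HT_{jk}^{i})^2$   (4-7)

By the stopping criterion, $F_{I}^{i} \leq 0$ and $F_{II}^{i} \leq 0$ for all $i$. Therefore, for all $i$,

$2 \sum_{j} \sum_{k} e_{jk} HT_{jk}^{i} - \sum_{j} \sum_{k} (HT_{jk}^{i})^2 \leq 0, \quad -2 \sum_{j} \sum_{k} e_{jk} HT_{jk}^{i} - \sum_{j} \sum_{k} (HT_{jk}^{i})^2 \leq 0$   (4-8)

$\sum_{j} \sum_{k} e_{jk} HT_{jk}^{i} \leq \frac{1}{2} \sum_{j} \sum_{k} (HT_{jk}^{i})^2, \quad \sum_{j} \sum_{k} e_{jk} HT_{jk}^{i} \geq -\frac{1}{2} \sum_{j} \sum_{k} (HT_{jk}^{i})^2$   (4-9)

or,

$\Big| \sum_{j} \sum_{k} e_{jk} HT_{jk}^{i} \Big| \leq \frac{1}{2} \sum_{j} \sum_{k} (HT_{jk}^{i})^2$   (4-10)

Strictly speaking, the number of effective inequalities in (4-10) is the number of distinct households in the seed data, since $HT_{jk}^{i}$ is the contribution of the $i$th household in table $j$ at category $k$.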


For a fixed household $i$, $\sum_{j} \sum_{k} e_{jk} HT_{jk}^{i}$ is a linear combination of the $e_{jk}$, with the $HT_{jk}^{i}$ as the corresponding coefficients. So the vector of $e_{jk}$ lies between the two hyperplanes $\sum_{j} \sum_{k} e_{jk} HT_{jk}^{i} = \pm \frac{1}{2} \sum_{j} \sum_{k} (HT_{jk}^{i})^2$ in the Euclidean space whose dimension is the total number of cells in all control tables. Therefore, the system of inequalities (4-10) represents the intersection of these regions over all distinct households in the seed data. When the number of distinct households is large, the vector of $e_{jk}$ is expected to be bounded in a small region around the origin. In other words, diverse seed data are the foundation for small fitting errors.

The proposed method can also be considered a heuristic solution search for the following integer program (IP) problem.

Min (discrepancy between the distributions of the true and synthetic populations)
Subject to (constraints given by the matching errors with the control tables)

The above IP problem tries to minimize the distributional discrepancy between the true population and the synthetic population, which can be specified by some discrepancy measure, e.g., entropy or the Hellinger distance. The constraints are specified by the matching errors with the control tables. The integer decision variables are the numbers of times each household in the seed data is selected. In the FBS method, the random selection mechanism in each iteration tries to transfer the distributional information from the seed data and hence reduce the distributional discrepancy. The fitness functions ensure that the control tables are satisfied at the end of the procedure. For practical purposes, when the number of decision variables is huge, it is difficult to solve


the IP problem. However, the proposed FBS method provides a heuristic search for solving the IP problem, and it can easily handle problems with a huge number of decision variables.

Conceptual Comparison with IPF-Based Methods

In fact, this random selection mechanism is analogous to IPF. As mentioned before, IPF preserves the odds ratios between different attributes and also minimizes the discrimination information subject to the controls. The random selection mechanism likewise inherits the relationships among different characteristics, even at the person level, as well as the interaction between the household level and the person level. Because of the presence of constraints (control tables), the candidate set for selection is only part of the seed data; otherwise the discrimination information could reach zero for a sufficiently large population. So the FBS procedure naturally tries to find an integer solution that minimizes the discrimination information under the control tables. There is no zero-cell problem, and hence the issues of infinite entropy and infinite discrimination information are avoided.

Under the new mechanism, controlled attributes from different universes (e.g., household level, person level, family level, and so on) can be readily matched by incorporating these control tables into the fitness functions. For the same reason, a huge number of attributes can be controlled simultaneously through the new approach, while IPF or entropy-based methods face the zero-cell problem if there are too many attributes and categories. The new approach has a highly integrated framework: it directly selects households during the iterations, and hence the resulting numbers of each type of household are all integers. Empirical results show that the synthetic population can match the control tables almost perfectly. In contrast, the conventional methods need additional


steps to convert the results from IPF to appropriate integers, and hence the controlled information may not be perfectly satisfied.

Summary

The Fitness Based Synthesis (FBS) approach is introduced in this chapter. This method is designed to select households in such a way that the marginal control tables are matched. The proposed framework contains three major components, namely the initial household set, the fitness functions, and the selection mechanism. The initial household set is prepared before the procedure, and a proper choice of initial household set can reduce the running time of the procedure. The fitness function measures the marginal contribution of each household to reducing the matching errors. In this study, we choose a fitness function based on the sum of squared errors; it can be replaced by other criteria. During the selection process, this study assigns equal probability to the candidate households, since we believe that the households in the seed data are equally important. However, if there is information indicating that some types of households are more important to the area of interest, the selection probabilities can be adjusted accordingly.

The proposed method is not an IPF-based population synthesizer, and hence it has no related zero-cell issue. It also has no convergence issue under the squared-error-based fitness function adopted in this dissertation, because the number of iterations is proved to be bounded. Furthermore, under squared-error-based fitness functions, the synthetic population can match the marginal control tables almost perfectly if the seed data contain sufficiently varied types of households.


Table 4-1. Feasible values of fitness functions and corresponding operations

Type I fitness    Type II fitness    Operation
> 0               < 0                Adding
< 0               > 0                Removing
= 0               < 0                Nothing
< 0               = 0                Nothing
< 0               < 0                Nothing
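As a worked illustration of the first two rows of Table 4-1, consider a hypothetical single control table with two cells. The symbols are notation assumed here for the example: $R_k$ is the number of households still required in cell $k$, and $HT_k$ is the contribution of a household to cell $k$. The fitness values are taken as the reduction in the sum of squared errors from adding (Type I) or removing (Type II) the household.

```latex
% Assumed state: R = (R_1, R_2) = (2, -1), i.e. cell 1 is short by two, cell 2 is over by one.

% Household A contributes HT = (1, 0):
F_I    = \left[2^2 - (2-1)^2\right] + \left[(-1)^2 - (-1-0)^2\right] = 3 + 0 = 3 > 0
F_{II} = \left[2^2 - (2+1)^2\right] + \left[(-1)^2 - (-1+0)^2\right] = -5 + 0 = -5 < 0
% Row 1 of the table: adding A reduces the squared error (from 5 to 2).

% Household B contributes HT = (0, 1):
F_I    = \left[2^2 - (2-0)^2\right] + \left[(-1)^2 - (-1-1)^2\right] = 0 - 3 = -3 < 0
F_{II} = \left[2^2 - (2+0)^2\right] + \left[(-1)^2 - (-1+1)^2\right] = 0 + 1 = 1 > 0
% Row 2 of the table: removing B (if a copy is in the current selection) reduces the error (from 5 to 4).

% In both cases F_I + F_{II} = -2 = -2 \sum_k HT_k^2, so the two values cannot both be positive.
```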


Figure 4-1. Flowchart of the Fitness Based Synthesis method. [Flowchart steps: (1) start from the initial household set; (2) calculate the Type I fitness and Type II fitness for each household in the seed data; (3) count the candidate households in the seed data: if the number is 0, END; if it is greater than 0, randomly select a household among all candidate households, add or remove it according to which type of fitness is positive, update the selection of households accordingly, and return to step (2).]
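The loop in Figure 4-1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: it starts from the empty set, uses the squared-error fitness (reduction in the sum of squared errors), and gives every candidate equal probability. The names `fitness` and `fbs`, the dictionary layout keyed by (table, cell), and the three-household seed are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Cell = Tuple[int, int]        # (table index, cell index)
Household = Dict[Cell, int]   # contribution of one seed household to each cell


def fitness(residual: Dict[Cell, int], hh: Household) -> Tuple[int, int]:
    """Return (Type I, Type II) fitness: error reduction from adding / removing hh."""
    f_add = sum(residual[c] ** 2 - (residual[c] - v) ** 2 for c, v in hh.items())
    f_rem = sum(residual[c] ** 2 - (residual[c] + v) ** 2 for c, v in hh.items())
    return f_add, f_rem


def fbs(seed: List[Household], controls: Dict[Cell, int],
        rng: random.Random) -> Tuple[List[int], Dict[Cell, int], int]:
    """Add or remove households until no candidate remains (empty initial set)."""
    residual = dict(controls)                      # still required = target - count
    counts = [0] * len(seed)                       # selected copies of each household
    bound = sum(r * r for r in residual.values())  # initial sum of squared errors
    steps = 0
    while True:
        candidates = []
        for i, hh in enumerate(seed):
            f_add, f_rem = fitness(residual, hh)
            if f_add > 0:
                candidates.append((i, +1))
            elif f_rem > 0 and counts[i] > 0:      # removal needs an existing copy
                candidates.append((i, -1))
        if not candidates:                         # natural termination criterion
            break
        i, op = rng.choice(candidates)             # equal probability per candidate
        counts[i] += op
        for c, v in seed[i].items():
            residual[c] -= op * v
        steps += 1
        assert steps <= bound                      # each step cuts the error by >= 1
    return counts, residual, steps


# Hypothetical example. Table 0: household size (cell 0 = 1 person, cell 1 = 2 persons).
# Table 1: gender of persons (cell 0 = male, cell 1 = female).
seed = [
    {(0, 0): 1, (1, 0): 1},              # 1-person household, one male
    {(0, 0): 1, (1, 1): 1},              # 1-person household, one female
    {(0, 1): 1, (1, 0): 1, (1, 1): 1},   # 2-person household, one male, one female
]
controls = {(0, 0): 3, (0, 1): 2, (1, 0): 4, (1, 1): 3}
counts, residual, steps = fbs(seed, controls, random.Random(42))
```

The sketch recomputes every fitness value in each iteration for clarity rather than speed. The remaining residuals depend on the random draws and on the variety of households in the seed, so a run may or may not reach an exact match.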


CHAPTER 5
BASE-YEAR POPULATION SYNTHESIS: COMPARISON AND VALIDATION

To evaluate the quality of synthetic populations, it is necessary to measure their accuracy, e.g., the difference between the synthetic population and the true population. However, the reason for generating synthetic populations is the lack of a true population. So it is difficult to compare the synthetic population with the true population in practice, but sometimes a designed or artificial true population makes such an analysis possible. Beckman et al. (1996) use several PUMS samples as artificial census tracts, and a 5% sample from these artificial census tracts serves as seed data. It is then feasible to compare synthetic populations with the designed true population.

As mentioned before, the true population is unknown in the population synthesis procedure. To obtain a true population for validation purposes, this study designs several artificial census tracts, following the idea of artificial census tracts in Beckman et al. (1996). In this chapter, PUMS data from 22 PUMAs in Florida serve as the artificial census tracts and are treated as the true populations of these tracts. The control tables are generated by aggregating attributes of these artificial true populations. The seed data for synthesis are drawn randomly from these PUMAs. The designed control tables and the designed seed data are then fed to the Fitness Based Synthesis framework described in Chapter 4 to generate the populations of these artificial census tracts.

With the artificial true populations and the synthetic populations, all the information needed for validation is available at this point. Before validation, the list-based household sets are converted to frequency-based household sets. More specifically, PUMS data list households row by row, and each

PAGE 54

household is defined by household characteristics as well as the characteristics of all its members. Households in different rows may have the same specification of attributes, especially when the sample size is large. If a household type is defined by a particular combination of household characteristics and member characteristics, a set of households can also be described by a list of frequencies of household types. Based on this definition, two sets of households can be compared by comparing the two lists of frequencies. Conceptually, validation is the same as measuring the distance between two vectors or the similarity between two distributions.

Following this validation idea, the analysis below compares synthetic populations from several methods, including the conventional IPF method (Beckman et al. 1996) and the recently developed IPU method (Ye et al. 2009), with the true population. This is accomplished by comparing the actual number of households of each type in the true population with the corresponding number in the synthesized populations.

Dataset

Before validation, synthetic populations of the 22 artificial census tracts are generated. As illustrated in Figure 2-1, base-year population generation requires two types of input data: seed data and control tables. The seed data come from 5% random samples of the true populations of the 22 artificial census tracts, and five control tables (an example is given in Figure 5-1) are adopted for the validation analysis.

Pre-Treatment of Seed Data

The seed data for the population synthesis come from the PUMS. As the PUMS data represent only a 5% sample of the overall population, it is possible that there are certain types of households (especially rare households) which are represented in the

PAGE 55

tract-level control tables but are not present in the PUMS data from the corresponding PUMA. For example, a control table may indicate that a few four-person, one-car households are present in a census tract, yet the 5% PUMS data from the PUMA to which this tract belongs may not contain any such households. The pre-treatment procedure simply augments the PUMS data for each PUMA by adding such missing household types from other PUMAs. Our current procedure ensures that each PUMA has at least one household that satisfies each cell (independently) in each of the control tables. One household of the missing type is borrowed (arbitrarily, in the current implementation) from some other PUMA to satisfy this requirement. Overall, the pre-treatment procedure broadly ensures consistency between the seed data and the control tables, so that it is always possible to find a household in the seed data that contributes to each cell of the control tables.

Note that pre-treating the seed data is not necessary for the new method: without it, the resulting synthetic population leaves the missing cells unsatisfied but is still the best population obtainable from the original seed data. However, if some categories of the control tables have no samples in the seed data, IPF cannot find a feasible solution because some constraints cannot be satisfied. Since the following analysis also includes two IPF-related population synthesizers, the seed data are pre-treated so that they can serve all three population synthesizers.

Control Tables

Five control tables are considered here. Figure 5-1 provides an example of these tables, namely the control tables for the artificial census tract with ID A6. The five control tables consist of two tables (household size by tenure and number of

PAGE 56

vehicles by tenure) at the household level, two tables (age by gender, and ethnicity) at the person level, and one table (age by gender) at the group-quarter level.

Validation Method

Essentially, validating a synthetic population means measuring the difference or similarity between the synthetic population and the true population. This raises two questions: how should a population be defined comprehensively, and how should the difference between two populations be measured under that definition?

Defining Household Types

As mentioned before, the validation is based on frequency-based household sets; that is, list-based household sets are converted into distributions of household types. This study defines a household type by its household-level attributes together with the person-level attributes of each member. Defining household types through controlled attributes alone is not recommended: population synthesis can force the generated population to match the controlled distributions, so the measured difference can be close to zero, which makes the population look accurate while the intrinsic errors remain hidden.

Conventionally, a household is characterized only by household-level attributes, and hence the maximum possible number of types is the product of the numbers of categories of the attributes, even though some of these types are not practically feasible (such as a one-person, seven-car household). However, if household types are defined by a combination of household- and person-level attributes, it is difficult to predefine all types of households. We do not know which types of households exist in reality, and enumerating all possible types yields a huge number, most of which are likely to be rare. This study therefore counts all types of

PAGE 57

households observed in the true populations of the 22 artificial census tracts as the set of possible household types. The household and person attributes considered for defining household types are listed in Table 5-1; they yield 8529 household types. This definition captures the interaction between household-level and person-level attributes and can be regarded as operating at the household composite level. For instance, a household type under this classification scheme could be a two-person family household with two cars in which one person is a white male in the age category 35-44 years and the other is a white female in the age category 25-34 years. More precisely, two households are counted as the same type only if they have identical household-level attributes and their members can be matched one to one with identical person-level attributes.

Measures of Dissimilarity between True and Synthesized Populations

Once the household types have been defined and the frequency of each type has been determined for the true and synthesized populations, measures are needed to determine the extent to which the two are dissimilar. Let $T_i$ and $S_i$ be the number of households of the $i$-th type in the true population and the synthetic population, respectively, and let $N$ be the total number of household types (e.g., $N = 8529$ for this study). Several measures can be defined.

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| T_i - S_i \right| \qquad (5\text{-}1)$$
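The type definition and the MAE can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the record layout (a dictionary with "attrs" and "persons" keys), the attribute names, and the function names are hypothetical and are not taken from the dissertation. Sorting the person records makes the type key independent of the order in which household members are listed, which mirrors the one-to-one matching of members described above.

```python
from collections import Counter

def household_type(household):
    """Hashable key: household-level attributes plus the (order-free)
    collection of person-level attribute tuples."""
    hh_attrs = tuple(sorted(household["attrs"].items()))
    persons = tuple(sorted(tuple(sorted(p.items()))
                           for p in household["persons"]))
    return hh_attrs, persons

def type_frequencies(households):
    """Convert a list-based household set into a frequency-based one."""
    return Counter(household_type(h) for h in households)

def mean_absolute_error(true_counts, synth_counts, types):
    """Average |T_i - S_i| over the given household types."""
    return sum(abs(true_counts[t] - synth_counts[t]) for t in types) / len(types)

man = {"age": "35-44", "gender": "Male", "ethnicity": "White"}
woman = {"age": "25-34", "gender": "Female", "ethnicity": "White"}
couple = {"attrs": {"size": 2, "tenure": "Own", "vehicles": 2},
          "persons": [man, woman]}
couple_reordered = {"attrs": {"size": 2, "tenure": "Own", "vehicles": 2},
                    "persons": [woman, man]}
single = {"attrs": {"size": 1, "tenure": "Rent", "vehicles": 1},
          "persons": [{"age": "18-24", "gender": "Female", "ethnicity": "Black"}]}

true_counts = type_frequencies([couple, couple_reordered, single])
synth_counts = type_frequencies([couple, single, single])
mae = mean_absolute_error(true_counts, synth_counts, list(true_counts))
```

Here the two couple records collapse into one type, so the true population has two types with counts 2 and 1, the synthetic one has counts 1 and 2, and the MAE over these two types is 1.0.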

PAGE 58

Equation (5-1) defines the mean absolute error (MAE) as the average absolute value of the difference between the actual number of households and the synthesized number of households across the household types.

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( T_i - S_i \right)^2 \qquad (5\text{-}2)$$

In Equation (5-2), the mean squared error (MSE) is the average of the squared differences between the true counts $T_i$ and the synthesized counts $S_i$ across all $N$ household types.

$$H(T, S) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{N} \left( \sqrt{t_i} - \sqrt{s_i} \right)^2} \qquad (5\text{-}3)$$

where $t_i = T_i / \sum_{j=1}^{N} T_j$ and $s_i = S_i / \sum_{j=1}^{N} S_j$.

An alternative distance measure is the Hellinger distance (see Simpson, 1987 and Karlis and Xekalaki, 1998 for more information), presented in Equation (5-3). The Hellinger distance measures the distance between two probability distributions or non-negative vectors. If the distributions of the two populations match perfectly, the Hellinger distance is zero, and if the distributions are perfectly disjoint, the distance is 1. Therefore, smaller values of the Hellinger distance indicate a closer match of the synthetic population with the true population. Unlike MSE and MAE, the Hellinger distance is intended to capture the distributional difference between two populations; it is not sensitive to the size of the population and is also not sensitive to outliers.

Another measure is the proportion of household types for which the absolute difference $|T_i - S_i|$ exceeds a predefined threshold. The following analysis uses a threshold of 5% of the type size, denoted as Threshold 5%.
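The remaining measures can be sketched in Python as follows. The function names are hypothetical, the inputs are assumed to be two equal-length lists of type counts in the same type order, and the threshold measure assumes that "5% of type size" refers to 5% of the true count of each type; the text does not spell this out, so that reading is an assumption.

```python
import math

def mean_squared_error(true_counts, synth_counts):
    """Average squared difference across household types."""
    return sum((t - s) ** 2
               for t, s in zip(true_counts, synth_counts)) / len(true_counts)

def hellinger_distance(true_counts, synth_counts):
    """Hellinger distance between the two normalized type distributions."""
    t_total, s_total = sum(true_counts), sum(synth_counts)
    squared = sum((math.sqrt(t / t_total) - math.sqrt(s / s_total)) ** 2
                  for t, s in zip(true_counts, synth_counts))
    return math.sqrt(squared / 2)

def share_over_threshold(true_counts, synth_counts, rel=0.05):
    """Share of types with |T_i - S_i| above rel times the true count
    (assumed reading of 'Threshold 5%')."""
    over = sum(1 for t, s in zip(true_counts, synth_counts)
               if abs(t - s) > rel * t)
    return over / len(true_counts)

same_shape = hellinger_distance([1, 2, 3], [2, 4, 6])   # proportional counts
disjoint = hellinger_distance([1, 0], [0, 1])           # no overlap
mse = mean_squared_error([3, 0, 1], [1, 0, 1])
share = share_over_threshold([100, 10, 0], [104, 12, 0])
```

The first two calls illustrate the properties stated above: doubling every count leaves the Hellinger distance at zero (insensitivity to population size), while two populations with no household types in common are at distance 1.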

PAGE 59

Comparison with Other Methods

This study compares three different synthetic populations over the 22 artificial census tracts. Besides the proposed FBS approach, the conventional IPF approach (Beckman et al. 1996) and the IPU approach (Ye et al. 2009) are also implemented on these artificial census tracts. All three methods use the same PUMS data. The FBS and IPU approaches use all five control tables (an example is given in Figure 5-1), whereas the IPF method uses only the household-level controls. Table 5-2 summarizes the numbers of households, males, and females in the generated populations. Compared with the true populations, all three methods generate populations whose household totals are close to those of the true populations. However, the IPF approach cannot match the gender distribution since the person-level controls are not included. Because the control tables are true information about the census tracts, matching the controlled information (here, the numbers of households, males, and females) is an important property of a population synthesizer. Nevertheless, a more comprehensive comparison should be based on uncontrolled information, i.e., the validation.

Table 5-3 presents the validation of the populations from the three methods. For each of the 22 artificial census tracts, four different measures are applied to the three synthetic populations. Comparing the errors across the three methods, the FBS and IPU approaches generate populations closer to the true population than the conventional IPF method, because these two methods have smaller errors for most census tracts. Since the former two methods control person-level characteristics, they are expected to be more accurate. In terms of the validation criteria, neither FBS nor IPU is clearly superior. More specifically, FBS is better in some cases while IPU is

PAGE 60

better in other cases. However, IPU cannot provide a solution for some cases because of convergence issues.

Generally, three important aspects affect the accuracy of synthetic populations. The most fundamental factor is the seed data: the more similar the seed data are to the true population, the more accurate the synthetic population is expected to be. The second factor is the population synthesizer itself. A good synthesizer should preserve the information in the seed data while matching the control information. The last factor is the relationship between the controlled information and the information of interest (which is usually not controlled but is used for validation). In other words, if the controlled attributes are highly correlated with the uncontrolled attributes, the synthetic population is likely to have less error when validated against the uncontrolled attributes.

In addition, Table 5-4 gives the number of iterations of the FBS procedure for the 22 artificial census tracts. In this example, the number of adding iterations tends to increase with the tract-level population size, while the number of removing iterations shows no such pattern. Recall that the initial household set is the seed data (1542 households in this analysis). Therefore, for smaller census tracts (e.g., A1 and A2), most of the iterations remove households. Irrespective of tract size, it is interesting that the number of iterations is correlated with the Hellinger distance; e.g., although A22 is a large tract, its numbers of adding and removing iterations are smaller than those of A20 and A21. Intuitively, a smaller Hellinger distance reflects greater similarity between the synthetic population and the true population, and hence the selection procedure can fit the control tables more easily. In other words, if the seed

PAGE 61

data are closely related to the true population, fewer iterations are needed, with other factors fixed. Moreover, the total number of iterations is comparable to the tract size in this example.

Figure 5-2 shows in more detail the difference between the synthetic population and the true population for census tract A6. Each point represents a household type; the x-axis is the size of this type in the true population and the y-axis is the size of this type in the synthetic population. Therefore, the points of a perfect synthetic population (zero validation error) lie on the 45-degree line, and the validation error can be roughly judged from how closely the point cloud concentrates around that line.

Summary

This study provides a more comprehensive validation concept in which a set of households is first converted to a household-type distribution and the synthetic population is then validated by measuring the difference between two household-type distributions. This difference can be evaluated as the distance between two vectors or the similarity of two distributions, and several criteria are introduced for this purpose. Using the proposed FBS method, synthetic populations for 22 artificial census tracts are generated. As an important objective, this dissertation compares the proposed method with other synthesis methods, so synthetic populations for the same 22 census tracts are also generated by the conventional IPF method and the recently developed IPU method. According to the validation, the FBS and IPU approaches generate populations closer to the true populations than the conventional IPF method.

There are several avenues for future research. First, it is useful to test different fitness functions under the new mechanism. Second, the quality of a synthetic population

PAGE 62

is affected by several aspects, e.g., the seed data, so the impact of the seed data and of other aspects can also be analyzed.

PAGE 63

Table 5-1. Characteristics for defining household types

Level             Attribute            Categories
Household level   Household size       1, 2, 3, 4, 5, 6, 7+
                  Tenure               Rent, Own
                  Number of vehicles   0, 1, 2, 3, 4, 5+
Person level      Ethnicity            White, Black, Other, Multiple Race
                  Gender               Male, Female
                  Age                  0-5, 6-15, 16-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65-74, Over 75

PAGE 64

Table 5-2. Aggregate comparisons of the true and synthesized populations for 22 artificial census tracts

          Households(a)                  Male population                Female population
Case ID   True   FBS    IPF    IPU      True   FBS    IPF    IPU      True   FBS    IPF    IPU
A1        225    231    225    213      251    248    218    228      249    252    291    260
A2        460    455    462    439      469    469    506    435      532    533    503    513
A3        616    617    612    585      715    714    733    668      785    785    756    790
A4        801    806    801    761      1006   1008   974    958      994    991    1030   970
A5        1014   1015   1017   1015     1197   1198   1215   1165     1305   1304   1308   1314
A6        1400   1391   1404   1389     1440   1442   1463   1399     1560   1562   1546   1548
A7        1794   1783   1794   1790     1881   1882   1868   1880     2032   2034   2038   2045
A8        1673   1665   1672   1661     2131   2137   2046   2100     2062   2057   2151   2067
A9        1489   1514   1487   NA(b)    2237   2236   2250   NA       2485   2482   2480   NA
A10       1965   1967   1968   1926     2479   2480   2398   2430     2493   2492   2599   2466
A11       2002   2010   1997   1978     2403   2402   2411   2367     2617   2615   2616   2612
A12       2214   2217   2216   2195     2439   2438   2446   2411     2707   2708   2685   2700
A13       2588   2576   2589   2585     2718   2723   2756   2701     3094   3093   3065   3117
A14       2363   2365   2363   2352     2961   2960   2869   2928     3013   3014   3100   3021
A15       2474   2479   2479   2454     3137   3138   3084   3105     3056   3054   3137   3056
A16       3110   3133   3109   NA       3192   3187   3142   NA       3464   3465   3515   NA
A17       3335   3361   3334   NA       3443   3438   3373   NA       3713   3714   3752   NA
A18       2979   2986   2973   2966     3676   3676   3608   3647     3798   3798   3852   3821
A19       3342   3340   3343   NA       3866   3869   3837   NA       3795   3792   3843   NA
A20       2982   3022   2983   NA       3598   3597   3703   NA       4102   4096   4004   NA
A21       3654   3644   3654   3656     3790   3792   3829   3775     4124   4125   4074   4153
A22       3763   3763   3765   3730     4413   4414   4419   4377     4454   4453   4466   4434

(a) Group-quarter populations are counted as households.
(b) NA means that the IPU method (Ye et al. 2009) does not converge and hence provides no feasible solution.

PAGE 65

Table 5-3. Validation results of populations from three population synthesizers

          MSE                      MAE                      Hellinger                Threshold 5%
Case ID   FBS     IPF     IPU      FBS     IPF     IPU      FBS     IPF     IPU      FBS     IPF     IPU
A1        0.059   0.049   0.059    0.040   0.037   0.040    0.683   0.626   0.718    0.033   0.032   0.034
A2        0.124   0.142   0.127    0.072   0.076   0.072    0.581   0.622   0.601    0.055   0.058   0.058
A3        0.157   0.167   0.130    0.090   0.091   0.087    0.544   0.532   0.537    0.071   0.069   0.074
A4        0.252   0.308   0.227    0.120   0.128   0.118    0.540   0.575   0.562    0.092   0.090   0.095
A5        0.264   0.303   0.223    0.137   0.141   0.134    0.487   0.491   0.476    0.103   0.098   0.108
A6        0.576   0.669   0.725    0.190   0.199   0.190    0.461   0.481   0.467    0.113   0.120   0.115
A7        0.938   0.935   0.714    0.252   0.252   0.232    0.480   0.483   0.443    0.137   0.146   0.146
A8        0.672   0.787   0.549    0.232   0.240   0.221    0.482   0.487   0.464    0.148   0.147   0.154
A9        0.630   0.632   NA(a)    0.256   0.254   NA       0.641   0.646   NA       0.165   0.169   NA
A10       0.788   0.918   0.713    0.261   0.271   0.228    0.460   0.472   0.442    0.161   0.159   0.166
A11       0.668   0.750   0.761    0.247   0.256   0.252    0.418   0.439   0.441    0.154   0.157   0.155
A12       0.658   0.891   0.680    0.256   0.273   0.252    0.395   0.414   0.387    0.163   0.158   0.165
A13       1.284   1.625   0.987    0.266   0.302   0.247    0.324   0.367   0.311    0.139   0.149   0.144
A14       0.978   1.150   0.840    0.309   0.318   0.297    0.448   0.461   0.431    0.180   0.177   0.183
A15       1.452   1.462   1.069    0.342   0.343   0.319    0.459   0.459   0.438    0.188   0.190   0.197
A16       1.925   2.006   NA       0.401   0.392   NA       0.429   0.408   NA       0.201   0.206   NA
A17       2.168   2.109   NA       0.429   0.407   NA       0.428   0.402   NA       0.211   0.215   NA
A18       1.508   1.610   1.214    0.381   0.382   0.366    0.436   0.431   0.416    0.201   0.202   0.208
A19       2.941   3.112   NA       0.447   0.443   NA       0.422   0.425   NA       0.211   0.217   NA
A20       2.700   2.100   NA       0.461   0.448   NA       0.513   0.531   NA       0.222   0.242   NA
A21       3.616   3.163   3.480    0.515   0.487   0.467    0.447   0.426   0.408    0.207   0.220   0.214
A22       2.169   2.074   1.681    0.470   0.465   0.444    0.404   0.406   0.388    0.230   0.232   0.238

(a) NA means that the IPU method does not converge in that situation.

PAGE 66

Table 5-4. Number of iterations for generating populations of 22 artificial census tracts

Case ID   Generated households   Adding iterations   Removing iterations   Total iterations
A1        231                    127                 1438                  1565
A2        455                    260                 1347                  1607
A3        617                    203                 1128                  1331
A4        806                    257                 993                   1250
A5        1015                   284                 811                   1095
A6        1391                   927                 1078                  2005
A7        1783                   1360                1119                  2479
A8        1665                   733                 610                   1343
A9        1514                   1345                1373                  2718
A10       1967                   1155                730                   1885
A11       2010                   1127                659                   1786
A12       2217                   1094                419                   1513
A13       2576                   2067                1033                  3100
A14       2365                   1631                808                   2439
A15       2479                   1830                893                   2723
A16       3133                   2863                1272                  4135
A17       3361                   3216                1397                  4613
A18       2986                   2570                1126                  3696
A19       3340                   3133                1335                  4468
A20       3022                   4274                2794                  7068
A21       3644                   4285                2183                  6468
A22       3763                   3327                1106                  4433
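The iteration counts in Table 5-4 can be cross-checked against the size of the initial household set. The short Python sketch below uses four rows copied from the table and the seed size of 1542 households given in the text. The accounting identity it tests (generated households = seed size + adding iterations - removing iterations) is not stated in the text; it is an inference from each iteration adding or removing exactly one household, and the function name is hypothetical.

```python
SEED_SIZE = 1542  # households in the initial set (the seed data)

# (case, generated households, adding, removing, total), from Table 5-4
rows = [
    ("A1", 231, 127, 1438, 1565),
    ("A6", 1391, 927, 1078, 2005),
    ("A20", 3022, 4274, 2794, 7068),
    ("A22", 3763, 3327, 1106, 4433),
]

def check_row(generated, adding, removing, total, seed_size=SEED_SIZE):
    """True if the row is consistent with one household changed per iteration."""
    return (adding + removing == total
            and seed_size + adding - removing == generated)

results = {case: check_row(*values) for case, *values in rows}
```

For A1, for example, 1542 + 127 - 1438 = 231, which matches the number of generated households and shows why small tracts are dominated by removing iterations.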

PAGE 67

Household Size by Tenure (Household Level)
        1     2     3     4     5     6     7+
Own     165   253   98    82    36    12    3
Rent    278   192   94    53    12    6     8

Number of Vehicles by Tenure (Household Level)
        0     1     2     3     4     5+
Own     25    209   301   90    19    5
Rent    85    338   166   38    14    2

Age by Gender (Person Level)
        0-5   6-15  16-17 18-24 25-34 35-44 45-54 55-64 65-74 Over 75
Male    95    164   31    351   218   171   191   102   65    52
Female  106   163   32    367   223   184   198   120   77    90

Ethnicity (Person Level)
White   Black   Other   Multiple Race
1910    952     88      50

Age by Gender (Group Quarter Level)
        0-17  18-64 Over 65
Male    0     54    5
Female  0     41    8

Figure 5-1. Examples of control tables

PAGE 68

Figure 5-2. Scatter plots of three different synthetic populations against the true population for artificial census tract A6. A) FBS method. B) IPF method. C) IPU method. (Panel A)

PAGE 69

Figure 5-2. Continued. (Panel B)

PAGE 70

Figure 5-2. Continued. (Panel C)

PAGE 71

CHAPTER 6
TARGET-YEAR POPULATION SYNTHESIS: APPLICATION AND VALIDATION

This study contributes by presenting an empirical assessment of target-year populations synthesized with different base-year populations, data-fusion methods, and control tables. Twelve synthetic populations were synthesized for each of 12 census tracts in Florida. The empirical results indicate the value of synthesizing more accurate base-year populations by accommodating multi-level controls. The impact of the data-fusion methodology applied in the target-year context is more modest, possibly because fewer control tables are available in the target year. Finally, errors in the target-year control tables significantly reduce the accuracy of the synthesized populations. The magnitude of the overall error in the synthesized population appears to be linearly related to the magnitude of the input errors introduced via the control tables. Overall, efforts to accurately synthesize base-year populations and to obtain accurate target-year controls can help synthesize good target-year populations.

Unlike the case of base-year population synthesis, the documentation of results on target-year population synthesis is limited (Bowman 2004 and Bowman and Rousseau 2008). Although, conceptually, the application of the data-fusion approach for target-year synthesis is similar to its application for base-year synthesis, there are three important issues of concern. First, the target-year synthesis uses the base-year synthesized population as seed data. Thus, the methodology and controls used in the base-year synthesis affect the accuracy of the base-year population and, in turn, the target-year population. Second, one can expect significantly fewer control tables to be available for the target-year synthesis than for the base-year synthesis. In this situation, there might be benefits to using approaches that control for both person- and

PAGE 72

household-level information, as opposed to methods that control for only household-level information, so as to take advantage of all of the limited data available. Third, the target-year control tables are projections, in contrast to base-year control tables, which are derived from census counts. It has been well documented (Smith and Shahidullah 1995 and Stoto 1983) that there are significant errors in these projected aggregate distributions of population characteristics. In addition, these errors are inherent and cannot be avoided. Therefore, examining the effects of errors in control tables is of interest.

In light of the above discussion, the intent of this chapter is to contribute to our understanding of target-year population synthesis by addressing the following questions: (1) What is the effect of the accuracy of the base-year population (which serves as the seed data for target-year synthesis) on the accuracy of the target-year population? (2) What is the value of controlling both household- and person-level information in the target year versus only household-level controls? (3) How do errors in the projections of target-year controls affect the accuracy of the synthesized population?

Analysis Framework

A total of twelve synthetic populations were generated for a target year for each of several census tracts to address the three fundamental research questions of this study. As already discussed, the synthesis of a target-year population begins with the synthesis of base-year populations. In this case, three different base-year populations were generated for each census tract, varying in the number of controls and in the data-fusion method. The first base-year population was generated using only household-level controls and IPF as the data-fusion methodology (this population is referred to as B IPF in the rest of this document). The other two base-year populations

PAGE 73

were synthesized using the FBS approach with both household- and person-level controls. These are referred to as B FBS1 and B FBS2, with the latter having more controls than the former. Thus, given the differences in the number of controlled attributes, one may expect the following order for the accuracy of the synthesized base-year populations: B FBS2 > B FBS1 > B IPF.

The second research question relates to the target-year controls. To address this, the target-year populations were synthesized using two different approaches: IPF with only household-level controls and FBS with both household- and person-level controls. Each of the three base-year populations was used with each of the two target-year data-fusion methods, giving a total of six target-year populations. These are referred to as T IPF B IPF (i.e., target-year IPF and base-year IPF), T IPF B FBS1, T IPF B FBS2, T FBS B IPF, T FBS B FBS1, and T FBS B FBS2, reflecting the target-year synthesis methodology and the base-year population used as seed data.

The six populations described above were synthesized using the true tract-level control tables. In order to assess the impact of erroneous target-year controls, an approximate control table was generated for each target-year controlled attribute by replacing the true distribution with the distribution of the same attribute in the county to which the tract belongs. Six additional populations were synthesized using these approximate control tables, and these are referred to as T* IPF B IPF, T* IPF B FBS1, T* IPF B FBS2, T* FBS B IPF, T* FBS B FBS1, and T* FBS B FBS2. The asterisk (*) indicates that approximate controls were used for the target year.
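The twelve-population design is a full cross of three factors: control-table accuracy (true or approximate), target-year method (IPF or FBS), and base-year population (B IPF, B FBS1, or B FBS2). The short Python sketch below enumerates the labels used in the text; the function and variable names are hypothetical and serve only as illustration.

```python
from itertools import product

BASE_YEAR = ["B IPF", "B FBS1", "B FBS2"]   # seed populations
TARGET_YEAR = ["T IPF", "T FBS"]            # target-year methods

def population_labels():
    """All 2 x 2 x 3 = 12 labels; '*' marks approximate target-year controls."""
    labels = []
    for approximate, target, base in product([False, True],
                                             TARGET_YEAR, BASE_YEAR):
        prefix = target.replace("T", "T*", 1) if approximate else target
        labels.append(f"{prefix} {base}")
    return labels

labels = population_labels()
```

The first six labels correspond to the populations synthesized with true tract-level controls and the last six to those synthesized with county-based approximate controls.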

PAGE 74

Once the populations were synthesized, they were compared in terms of their ability to accurately replicate several marginal tables available for the target year. The error specific to each marginal table is calculated according to Equation (6-1) from the true value of each category in the table and the corresponding value of the synthesized marginal table. The synthesized marginal tables were obtained by simply aggregating the synthesized population along the appropriate dimensions. The error measure can be interpreted as the proportion of the synthesized households/persons misclassified in the cells of the corresponding marginal table.

Data Set

Data from 12 census tracts and their corresponding PUMAs and counties in Florida were used for this analysis. These census tracts and the PUMAs and counties to which they belong are identified in Table 6-1. Data were collected for the years 1990 and 2000. The reader will note that there are wide variations in the populations and in the changes in population between the two years. Further, these census tracts were chosen to represent some of the major urban regions in Florida where advanced travel-demand models are likely to be needed or developed. Finally, for all these census tracts, the boundaries did not change between 1990 and 2000.

The year 2000 was used as the base year in this analysis and the year 1990 was set as the target year. Thus, we adopt a back-casting approach as opposed to a forecasting approach. The primary reason for this was that the PUMA-level data required for base-year population synthesis were available for 2000 and not for 1990.

PAGE 75

Table 6-2 identifies eleven base-year (2000) control tables (eight two-dimensional tables and three one-dimensional tables) used in this study. These distribution tables were obtained from the US Census SF1 and SF3 files. These controls cover most of the important socio-economic-mobility attributes commonly used in travel modeling. Each of these tables corresponds to the joint distribution of a subset of the required population attributes. All 11 tables were controlled for in the synthesis of the B FBS2 (base-year) population. For the synthesis of the other two base-year populations, only a subset of these controls was used, with only household-level controls being used for the B IPF population. Table 6-2 also identifies the controls used for the synthesis of each of the base-year populations. The seed data for base-year synthesis come from the US Census 5% PUMS (Public Use Microdata Sample).

For the target-year synthesis, distributions of household size and dwelling-unit type were used as controls for the IPF procedure. Person-level controls for age and gender were used in addition to the two household-level controls in the FBS procedure. The true tract-level tables were obtained from the US Census SF1 tables of 1990. The approximate control tables were obtained from the counties of the respective census tracts from the US Census data of 1990. In generating the approximate control tables for the target year, we assume that the total population (persons and households) is still accurately known at the tract level; only the distribution is borrowed from the county level. In addition to the controls for the target year, several other marginal tables are also available, which are used to assess the accuracy of the synthesized populations.
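The construction of an approximate control table can be sketched as follows: the tract total is kept and only the category shares are taken from the county. The function name and the numbers in the example are hypothetical (they are not data from the study), and the sketch leaves the resulting cell values unrounded because the text does not say how fractional values were handled.

```python
def approximate_control(tract_table, county_table):
    """Tract-level total distributed according to the county-level shares."""
    tract_total = sum(tract_table.values())
    county_total = sum(county_table.values())
    return {category: tract_total * count / county_total
            for category, count in county_table.items()}

# Hypothetical household-size marginals (made-up numbers, for illustration).
tract = {"1": 50, "2": 30, "3+": 20}        # true tract table, total 100
county = {"1": 400, "2": 400, "3+": 200}    # county table, total 1000
approx = approximate_control(tract, county)
```

In this made-up case the approximate table is 40, 40, and 20 households, so the tract total of 100 is preserved while the shares follow the county.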

PAGE 76

Results

Table 6-3 gives the number of iterations for base-year population synthesis by the FBS2 method. In this analysis, the number of adding iterations is roughly twice the number of removing iterations, and both are linearly related to the size of the generated population.

For each of the 12 synthetic populations, and for each of the twelve census tracts analyzed, error measures were calculated for each of the marginal tables. These are identified in Figure 6-1. Further, this figure also indicates whether any or all of the attributes of the different marginal tables were controlled for in synthesizing either the base-year or the target-year populations. The errors are then averaged across the 12 census tracts for each of the 12 synthetic populations and each of the marginal tables. These results are summarized in Table 6-4. In the rest of this section, these results are systematically analyzed to address the three key questions of interest.

Impact of Accuracy of the Base-Year Population

Table 6-5 compares the three base-year synthetic populations pairwise using the error measures defined before. Broadly, the numbers indicate that the three base-year populations are significantly different in terms of accurately replicating various base-year control tables. Further, the errors of B IPF relative to B FBS2 are greater than the errors of B FBS1 relative to B FBS2, reflecting the effects of increased controls in base-year synthesis. The B FBS2 population was synthesized controlling for all the tables mentioned in Table 6-5 and hence can be expected to replicate the true distributions of all these tables with great accuracy.

PAGE 77

Figure 6-2 includes two sets of graphs which compare the accuracy of the populations synthesized with different base-year populations but with the same target-year controls and data-fusion methodology. The top graph is for the cases in which the target-year synthesis was undertaken using the IPF procedure and the bottom graph is for the target-year synthesis with the FBS methodology. All of these are for the case in which true tract-level controls were used (similar trends were observed for the case of approximated controls and hence these are not presented graphically here). The reader will note that the graphs derive their values from Table 6-4. In general, we observe that the errors are smallest for populations synthesized using B FBS2 as the seed data (see the circles) and largest for the populations synthesized using B IPF as the seed data (see the triangles). The differences are particularly striking for marginal tables such as P52, which has attributes that are not controlled for in the target year. This indicates that if the base-year populations are synthesized controlling for as many attributes as possible, then the corresponding target-year populations are also more accurate, irrespective of the target-year controls and data-fusion methodology employed.

Impact of Target-Year Control Tables and Methods

Figure 6-3 includes three sets of graphs which compare the accuracy of the populations synthesized with the same base-year populations but with different target-year controls and data-fusion methodology. Each graph compares the population synthesized with both household and person controls using the FBS methodology against the population synthesized with only household controls using the IPF methodology. The top graph is for the cases in which the base-year synthesis was undertaken using the IPF procedure, the middle graph is when the base-year population

PAGE 78

is B-FBS, and the bottom graph represents the base-year population B-FBS2. All of these are for the case in which true tract-level controls were used (similar trends were observed for the case of approximated controls, and hence these are not presented graphically here). Again, the graphs derive their values from Table 6-4. For each fixed base-year population, the two target-year populations perform similarly in terms of accuracy, with the FBS approach providing slightly better accuracy. This relatively small improvement is as expected, because the FBS essentially controls only for age and gender over and above the IPF target-year controls. Further, with gender being practically equally distributed, the real difference between the two methods is the control for age in the FBS approach. Consistent with this discussion, the reader will note significant differences in the error for control table P12, which is the two-dimensional joint table of gender and age. Since gender and age are controlled in the FBS method but not in IPF during target-year synthesis, the population synthesized with FBS performs systematically better than the one synthesized with IPF for these attributes.

Impact of Inaccurate Control Tables

The final research question examines the effect of inaccuracies in the control tables on the accuracy of the synthetic populations. As shown in Table 6-4, the errors increase significantly when the approximate, county-level distributions are used as controls instead of the true controls. This holds irrespective of the base-year synthesis methodology and the target-year synthesis method. Therefore, it is important to be cognizant of the errors in the target-year controls, even when using multi-level population synthesis methods and more accurate base-year synthetic populations as seed data.

It is also of interest to assess how the error in the control tables translates into errors in the synthetic populations. Table 6-6 presents the errors between the true and approximate (i.e., county-level) control tables for the twelve census tracts. These errors are calculated using the procedures previously described. The table also presents the averages of these errors across the different control tables. Specifically, Average1 is calculated across all four control tables and, hence, may be interpreted as the input error (or discrepancy) introduced in populations employing the FBS for target-year synthesis. Average2 is calculated across the two household-level control tables and, hence, may be interpreted as the input error (or discrepancy) introduced in populations employing the IPF for target-year synthesis.

Figure 6-4 plots the input error (discrepancy) against the loss of accuracy for each census tract and for each of the six types of synthetic populations. The loss of accuracy is calculated as follows. First, for each marginal table, the difference in error between the population synthesized with the true controls and the one synthesized with the approximate controls is calculated (for each base-year population and target-year synthesis approach). This difference in error is then averaged across all marginal tables and is defined as the loss of accuracy for the census tract. In general, the loss of accuracy is greater with greater input errors, and this relationship appears to be linear.

Summary

The application of disaggregate models for predictions and policy evaluations requires as inputs detailed information on the socio-economic characteristics of the future-year population. Although the IPF-based procedure is the most widely used, it is limited by the need to restrict all controls to the same universe. More recently, new methods have been developed to incorporate multi-level controls in population
synthesis. However, there is limited documentation of the application of IPF and other methods in the context of target-year synthesis. This study contributes by presenting an empirical assessment of target-year populations synthesized with different base-year populations, data-fusion methods, and control tables. Twelve synthetic populations were synthesized for 12 census tracts in Florida. The year 2000 was taken as the base year and 1990 as the target year.

The empirical results indicate the value of synthesizing more accurate base-year populations by accommodating multi-level controls. Target-year populations synthesized with more accurate base-year populations as seed data are shown to be more accurate. The impact of the data-fusion methodology applied in the target-year context is more modest. This is because few control tables are available in the target year and, hence, person-level controls might not contain significantly more information than household-level controls. Nonetheless, the target-year populations synthesized with multi-level controls and the FBS methodology do perform better than those synthesized with only household-level controls and IPF. Finally, errors in the target-year control tables significantly reduce the accuracy of the synthesized populations. The magnitude of the overall error in the synthesized population appears to be linearly related to the magnitude of the input errors introduced via the control tables.

Table 6-1. Characteristics of the twelve census tracts in 1990 and 2000

                                      Households             Population          Group quarters population
Case Tract ID  PUMA  County         2000   1990  % Change   2000   1990  % Change   2000   1990  % Change
1    0012      701   Leon            474    491      3.59   1030   1094      6.21      0      0        NA
2    0273.09   2601  Pinellas        643    240     62.67   1606    617     61.58     55     11     80.00
3    0215.03   2003  Seminole        593    556      6.24   1630   1561      4.23    130    112     13.85
4    0202      300   Okaloosa        711    612     13.92   1799   1592     11.51      0      0        NA
5    0101.24   4016  Miami-Dade      581    429     26.16   2257   1290     42.84     87      0    100.00
6    0142.02   1104  Duval          1992   1797      9.79   3770   3683      2.31     30      0    100.00
7    0016      3502  Palm Beach     1606   1515      5.67   3875   3423     11.66      0     34        NA
8    0219.02   2001  Seminole       1862   1857      0.27   4513   4469      0.97     14     25     78.57
9    0019.06   3502  Palm Beach     4170   2274     45.47   7728   4260     44.88    342      0    100.00
10   0168.02   1106  Duval          3529   2203     37.57   8145   5409     33.59      0      0        NA
11   9801      600   Jefferson      3128   2747     12.18   8894   7634     14.17   1034    205     80.17
12   0054.02   4011  Miami-Dade     3720   3572      3.98   9426   8855      6.06     12      0    100.00

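Table 6-2. Control tables for base-year population synthesis

Each entry lists the control table, the base-year populations in which it is controlled, its universe, and the attribute and categories of each dimension.

H15  Controlled in: B-IPF, B-FBS, B-FBS2. Universe: Households.
     Dimension 1: TENURE (Own, Rent). Dimension 2: HHSIZE (1, 2, 3, 4, 5, 6, 7+).
H32  Controlled in: B-FBS2. Universe: Households.
     Dimension 1: TENURE (Own, Rent). Dimension 2: DUTYPE (Single Family, Multi Family).
H44  Controlled in: B-IPF, B-FBS, B-FBS2. Universe: Households.
     Dimension 1: TENURE (Own, Rent). Dimension 2: NUMAUTO (0, 1, 2, 3, 4, 5+).
P26  Controlled in: B-FBS2. Universe: Households.
     Dimension 1: HHSTRUCT (Family, Non-Family). Dimension 2: HHSIZE (1, 2, 3, 4, 5, 6, 7+).
P34  Controlled in: B-FBS2. Universe: Families.
     Dimension 1: HHSTRUCT (Married couple, Other family). Dimension 2: CHAGE a (None, Only <6 years, Only >=6 years, Both <6 years and >=6 years).
P48  Controlled in: B-FBS2. Universe: Families.
     Dimension 1: HHSTRUCT (Married couple, Other family). Dimension 2: NUMWORK b (0, 1, 2, 3+).
P52  Controlled in: B-FBS2. Universe: Households.
     Dimension 1: INCOME (<30K, 30-50K, 50-75K, 75-125K, more than 125K). Dimension 2: NA.
P7   Controlled in: B-FBS, B-FBS2. Universe: Total Population.
     Dimension 1: ETHNICITY (White, Black, Other, and Multiple Race). Dimension 2: NA.
P12  Controlled in: B-FBS, B-FBS2. Universe: Total Population.
     Dimension 1: GENDER (Male, Female). Dimension 2: AGE (0-5, 6-15, 16-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65-74, over 75).
P21  Controlled in: B-FBS2. Universe: Total Population.
     Dimension 1: CITIZEN (Native, Naturalized, Non-Citizen). Dimension 2: NA.
P47  Controlled in: B-FBS2. Universe: Population >=16 years.
     Dimension 1: GENDER (Male, Female). Dimension 2: WRKHOURS c (0, 1-14, 15-35, more than 35).

a Age distribution of "own children" in the household
b Number of workers (more than 0 hours per week in 1999)
c Hours of work per week in 1999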

Table 6-3. Number of iterations for generating the populations of 12 census tracts by method FBS2 in the base year (2000)

Case ID   Generated households   Adding iterations   Removing iterations   Total iterations
1                          498                1180                   682               1862
2                          700                1350                   650               2000
3                          721                1280                   559               1839
4                          729                1403                   674               2077
5                          682                1465                   783               2248
6                         2036                3925                  1889               5814
7                         1629                2836                  1207               4043
8                         1909                3496                  1587               5083
9                         4517                9349                  4832              14181
10                        2515                5049                  2534               7583
11                        4185                6363                  2178               8541
12                        3747                7002                  3255              10257
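
The statement that adding iterations are roughly twice the removing iterations can be checked directly against Table 6-3. The sketch below is illustrative only: the rows are transcribed from the table, while the function name `iteration_summary` and the two consistency checks are assumptions introduced here, not part of the FBS2 procedure. In the transcribed values, adding plus removing iterations equals the total, and adding minus removing iterations equals the number of generated households.

```python
# Rows transcribed from Table 6-3:
# (case ID, generated households, adding, removing, total iterations)
TABLE_6_3 = [
    (1, 498, 1180, 682, 1862),
    (2, 700, 1350, 650, 2000),
    (3, 721, 1280, 559, 1839),
    (4, 729, 1403, 674, 2077),
    (5, 682, 1465, 783, 2248),
    (6, 2036, 3925, 1889, 5814),
    (7, 1629, 2836, 1207, 4043),
    (8, 1909, 3496, 1587, 5083),
    (9, 4517, 9349, 4832, 14181),
    (10, 2515, 5049, 2534, 7583),
    (11, 4185, 6363, 2178, 8541),
    (12, 3747, 7002, 3255, 10257),
]


def iteration_summary(rows):
    """Hypothetical helper: add/remove ratio and two consistency flags per case."""
    summary = []
    for case, households, adding, removing, total in rows:
        summary.append({
            "case": case,
            "ratio": adding / removing,                 # "roughly twice" in the text
            "total_ok": adding + removing == total,     # total = adding + removing
            "net_ok": adding - removing == households,  # net additions = households
        })
    return summary


summary = iteration_summary(TABLE_6_3)
for row in summary:
    print(f"case {row['case']:>2}: add/remove ratio = {row['ratio']:.2f}")
```

The ratios fall between about 1.7 and 2.9 across the twelve cases, which is the spread behind "roughly twice".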

Table 6-4. Accuracy of target-year synthetic populations

Population        H15    P26    P12    H32    H44    P52    P21    P47
T-FBS-B-FBS2     0.14   0.05   0.09   0.09   0.29   0.17   0.12   0.16
T-FBS-B-FBS      0.16   0.06   0.11   0.15   0.29   0.25   0.16   0.13
T-IPF-B-FBS2     0.14   0.05   0.21   0.10   0.27   0.17   0.11   0.16
T-IPF-B-FBS      0.18   0.04   0.22   0.17   0.30   0.29   0.15   0.14
T-IPF-B-IPF      0.19   0.05   0.25   0.19   0.33   0.36   0.15   0.17
T-FBS-B-IPF      0.18   0.06   0.10   0.18   0.30   0.32   0.15   0.15
T*-FBS-B-FBS2    0.34   0.22   0.28   0.41   0.41   0.20   0.13   0.25
T*-FBS-B-FBS     0.27   0.23   0.26   0.41   0.32   0.32   0.14   0.20
T*-IPF-B-FBS2    0.33   0.24   0.28   0.40   0.37   0.22   0.17   0.21
T*-IPF-B-FBS     0.26   0.24   0.26   0.42   0.31   0.33   0.18   0.16
T*-IPF-B-IPF     0.28   0.24   0.29   0.43   0.34   0.40   0.16   0.20
T*-FBS-B-IPF     0.26   0.23   0.26   0.41   0.31   0.40   0.16   0.22
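
The loss-of-accuracy calculation described in the text (the difference in error between approximate and true controls, averaged across marginal tables) can be illustrated with the values in Table 6-4. This is a sketch under stated assumptions: the dissertation computes the measure per census tract, whereas Table 6-4 is already averaged over the twelve tracts, so the numbers below are tract-averaged analogues rather than the points plotted in Figure 6-4. It also assumes the T* rows correspond to the approximate controls, and the function name `loss_of_accuracy` is hypothetical.

```python
# Errors transcribed from Table 6-4, in the column order
# H15, P26, P12, H32, H44, P52, P21, P47.
TRUE_CONTROLS = {
    "T-FBS-B-FBS2": [0.14, 0.05, 0.09, 0.09, 0.29, 0.17, 0.12, 0.16],
    "T-FBS-B-FBS":  [0.16, 0.06, 0.11, 0.15, 0.29, 0.25, 0.16, 0.13],
    "T-IPF-B-FBS2": [0.14, 0.05, 0.21, 0.10, 0.27, 0.17, 0.11, 0.16],
    "T-IPF-B-FBS":  [0.18, 0.04, 0.22, 0.17, 0.30, 0.29, 0.15, 0.14],
    "T-IPF-B-IPF":  [0.19, 0.05, 0.25, 0.19, 0.33, 0.36, 0.15, 0.17],
    "T-FBS-B-IPF":  [0.18, 0.06, 0.10, 0.18, 0.30, 0.32, 0.15, 0.15],
}
# Assumed to be the same populations synthesized with approximate controls (T* rows).
APPROX_CONTROLS = {
    "T-FBS-B-FBS2": [0.34, 0.22, 0.28, 0.41, 0.41, 0.20, 0.13, 0.25],
    "T-FBS-B-FBS":  [0.27, 0.23, 0.26, 0.41, 0.32, 0.32, 0.14, 0.20],
    "T-IPF-B-FBS2": [0.33, 0.24, 0.28, 0.40, 0.37, 0.22, 0.17, 0.21],
    "T-IPF-B-FBS":  [0.26, 0.24, 0.26, 0.42, 0.31, 0.33, 0.18, 0.16],
    "T-IPF-B-IPF":  [0.28, 0.24, 0.29, 0.43, 0.34, 0.40, 0.16, 0.20],
    "T-FBS-B-IPF":  [0.26, 0.23, 0.26, 0.41, 0.31, 0.40, 0.16, 0.22],
}


def loss_of_accuracy(true_errors, approx_errors):
    """Mean increase in error across marginal tables (hypothetical helper)."""
    diffs = [a - t for t, a in zip(true_errors, approx_errors)]
    return sum(diffs) / len(diffs)


loss = {
    name: loss_of_accuracy(TRUE_CONTROLS[name], APPROX_CONTROLS[name])
    for name in TRUE_CONTROLS
}
for name, value in loss.items():
    print(f"{name}: {value:.3f}")
```

Every population type shows a positive average loss, consistent with the observation that errors increase when county-level distributions replace the true controls.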

Table 6-5. Differences among synthesized base-year populations

Comparison               H15    P26    P34    P7     P12    H32    H44    P48    P52    P21    P47
Between IPF and FBS     0.03   0.05   0.15   0.26   0.17   0.05   0.02   0.14   0.09   0.04   0.10
Between IPF and FBS2    0.03   0.05   0.19   0.26   0.17   0.20   0.02   0.23   0.28   0.09   0.18
Between FBS and FBS2    0.02   0.05   0.14   0.00   0.01   0.22   0.02   0.20   0.24   0.07   0.13

Table 6-6. Difference between true controlled tables and erroneous tables

Case ID   Household size   Dwelling type   Age    Gender   Average1 a   Average2 b
1                   0.27            0.09   0.20     0.01         0.14         0.18
2                   0.45            0.61   0.41     0.04         0.38         0.53
3                   0.07            0.22   0.19     0.06         0.14         0.15
4                   0.07            0.35   0.24     0.03         0.17         0.21
5                   0.25            0.77   0.18     0.05         0.31         0.51
6                   0.31            0.69   0.17     0.05         0.31         0.50
7                   0.25            0.13   0.19     0.04         0.15         0.19
8                   0.23            0.18   0.14     0.04         0.15         0.21
9                   0.26            0.71   0.53     0.03         0.38         0.49
10                  0.20            0.25   0.21     0.02         0.17         0.23
11                  0.04            0.04   0.04     0.01         0.03         0.04
12                  0.17            0.68   0.29     0.03         0.29         0.43

a Average1 is the average of all four attributes listed in the table
b Average2 is the average of household size and dwelling type
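
As a worked illustration of the two input-error measures, the sketch below recomputes Average1 and Average2 from the per-attribute errors in Table 6-6. The rows are transcribed from the table; the helper name `input_errors` is an assumption made for this example. Because the printed attribute errors are themselves rounded to two decimals, the recomputed averages can differ from the printed ones in the last digit.

```python
# Rows transcribed from Table 6-6:
# (case ID, household size, dwelling type, age, gender, Average1, Average2)
TABLE_6_6 = [
    (1, 0.27, 0.09, 0.20, 0.01, 0.14, 0.18),
    (2, 0.45, 0.61, 0.41, 0.04, 0.38, 0.53),
    (3, 0.07, 0.22, 0.19, 0.06, 0.14, 0.15),
    (4, 0.07, 0.35, 0.24, 0.03, 0.17, 0.21),
    (5, 0.25, 0.77, 0.18, 0.05, 0.31, 0.51),
    (6, 0.31, 0.69, 0.17, 0.05, 0.31, 0.50),
    (7, 0.25, 0.13, 0.19, 0.04, 0.15, 0.19),
    (8, 0.23, 0.18, 0.14, 0.04, 0.15, 0.21),
    (9, 0.26, 0.71, 0.53, 0.03, 0.38, 0.49),
    (10, 0.20, 0.25, 0.21, 0.02, 0.17, 0.23),
    (11, 0.04, 0.04, 0.04, 0.01, 0.03, 0.04),
    (12, 0.17, 0.68, 0.29, 0.03, 0.29, 0.43),
]


def input_errors(hhsize, dwelling, age, gender):
    """Hypothetical helper returning (Average1, Average2) for one tract.

    Average1: mean over all four control tables (input error for FBS).
    Average2: mean over the two household-level tables (input error for IPF).
    """
    average1 = (hhsize + dwelling + age + gender) / 4
    average2 = (hhsize + dwelling) / 2
    return average1, average2


recomputed = {}
for case, hhsize, dwelling, age, gender, pub1, pub2 in TABLE_6_6:
    avg1, avg2 = input_errors(hhsize, dwelling, age, gender)
    recomputed[case] = (avg1, avg2, pub1, pub2)
    print(f"case {case:>2}: Average1 = {avg1:.4f} (printed {pub1:.2f}), "
          f"Average2 = {avg2:.4f} (printed {pub2:.2f})")
```

For case 1, for example, Average1 is (0.27 + 0.09 + 0.20 + 0.01) / 4 = 0.1425 and Average2 is (0.27 + 0.09) / 2 = 0.18.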

Figure 6-1. Marginal tables for assessing the target-year populations. [Figure not reproduced. Columns: H15, P26, P34, P7, P12, H32, H44, P48, P52, P21, P47. Rows: Base Year IPF (BIPF), Base Year NEW (BNEW), Base Year NEW2 (BNEW2), Target Year IPF (TIPF), Target Year NEW (TNEW). Legend: Fully Controlled Tables, Partially Controlled Tables.]

Figure 6-2. Impact of base-year populations on the accuracy of the target-year population. A) Target Year IPF. B) Target Year FBS. [Graphs not reproduced. Horizontal axis in each panel: H15, P26, P12, H32, H44, P52, P21, P47.]

Figure 6-3. Impact of target-year controls and data-fusion methodology on the accuracy of the target-year population. A) Base Year IPF. B) Base Year FBS. C) Base Year FBS2. [Graphs not reproduced. Horizontal axis in each panel: H15, P26, P12, H32, H44, P52, P21, P47.]

Figure 6-4. Impact of inaccurate control tables on the change of accuracy of target-year populations. [Scatter plots not reproduced. Panels: T-IPF-B-IPF, T-FBS-B-IPF, T-IPF-B-FBS, T-FBS-B-FBS, T-IPF-B-FBS2, T-FBS-B-FBS2. Horizontal axis: Discrepancy. Vertical axis: Loss of accuracy. Points in each panel are labeled 1 through 12.]

CHAPTER 7
ASSESSMENT OF TRAVEL DEMAND MODELS APPLIED TO SYNTHETIC POPULATIONS

Disaggregate models capture the travel behavior of the fundamental decision-making units and include several explanatory variables (including socio-economic and mobility characteristics). Consequently, one may expect such models to provide more accurate predictions of travel characteristics than aggregate models, which include fewer explanatory variables. At the same time, for use in forecasting, disaggregate models require more inputs than aggregate models (because the former include more explanatory variables). To generate such detailed inputs, population-synthesis procedures such as the one presented in this report are used. Thus, the accuracy of the socio-economic-mobility characteristics of the synthesized population (i.e., the inputs to the disaggregate models) is of particular interest. Specifically, if the synthesized population is an inaccurate representation of the true population, the gains from a disaggregate model could be offset by the errors in the synthesized population.

In light of the above discussion, the intent of this chapter is to compare the predictions from aggregate trip-generation models with those from disaggregate trip-generation models. Further, for each of the two types of models, the predictions when the model is applied to the true population are compared to the predictions when the model is applied to a synthesized population. The scope of this comparative analysis is limited to linear-regression-based trip-generation models. The extension of such an analysis to the evaluation of nonlinear models, such as the multinomial logit for mode choice, is identified as an important avenue for future study.

Dataset

Data from the National Household Travel Survey (NHTS) 2009 Add-Ons of Florida are used to estimate the aggregate and disaggregate trip-generation models used in this analysis. The data were preprocessed to eliminate those households for which travel records were not available for all persons in the household. Additional cleaning was performed to remove cases with missing values for the explanatory variables of interest. The final estimation sample comprised 12,577 households.

Using the above dataset, linear regression models are estimated for the trip purpose of Home-Based Non-Work, Non-School, and Non-Religion (HBNWSR). Frequency distributions of the household HBNWSR trip rates are presented in Table 7-1. This table is interpreted as follows: 2,058 out of 12,577, or 16.36%, of all households in the estimation sample did not have any HBNWSR trips (see the first row in Table 7-1).

Trip-Generation Models

In this study, two trip-generation models are estimated. One model uses household size, vehicle ownership, and household tenure type as explanatory variables, while the other adds four explanatory variables: number of workers, urban/rural location, household income, and life-cycle characteristics. Because the latter model uses more explanatory variables than the former, the former is treated as the aggregate model and the latter as the disaggregate model.

The trip-generation models relate the total number of HBNWSR trips made by a household to the socio-economic characteristics of the household. For the aggregate model, household size, vehicle ownership, and household tenure type were taken as
the explanatory variables, as these are the most common variables used in trip-generation models. The estimation of the aggregate model is presented in Table 7-2. The disaggregate model includes four additional explanatory variables, and its estimation is presented in Table 7-3.

As a general note on the models estimated, linear regression was chosen for its simplicity and popularity in practical use. We recognize that this is not the best econometric structure for modeling integer variables within a finite range (i.e., the number of trips in our context). An examination of methods such as Poisson regression and ordered probit is identified as an avenue for future research.

Based on the estimation results of the two models, the adjusted rho-square of the disaggregate model (0.2111) is larger than that of the aggregate model (0.1796). By this measure, the disaggregate model is superior and can be expected to give more precise forecasts of HBNWSR trips.

Population Synthesis

One of the objectives of this research is to compare the predictions of a trip-generation model when applied to a true population with the predictions when the same model is applied to synthesized versions of the same population. The Miami-Dade sample (2,540 persons from 1,191 households) is extracted from the NHTS 2009 Add-Ons of Florida. This sample is treated as the true population, as all the socio-economic characteristics are known for each person and household. The same population was also synthesized using the proposed FBS method. The control tables are generated by aggregating the household and population characteristics of the Miami-Dade sample of the NHTS 2009 Add-Ons of Florida (instead of using the US Census SF tables). Then a
10% random sample of the NHTS 2009 Add-Ons of Florida is used as the seed data (instead of the PUMS data as described in Chapter 5). It is useful to note that the NHTS does not include the institutionalized population, and hence the population in group quarters is not a part of this analysis.

Two populations were synthesized. The first (called Population I) was synthesized using six control tables (Table 7-4). Four of these tables are two-dimensional and the other two are one-dimensional. Five are household-level tables (i.e., the universe is all households) and the remaining one is a person-level table. Overall, the structures of these tables are largely consistent with those presented in Chapter 5. The second population (called Population II) was synthesized using just four one-dimensional control tables (household size, dwelling type, gender, and age), as opposed to Population I with more controls.

Assessment of Linear Regression Based Trip Generation Models

The aggregate and disaggregate models developed for household HBNWSR trips were presented in Table 7-2 and Table 7-3. These models are applied to three populations. One is the true population (2,540 persons and 1,191 households of the Miami-Dade sample), and the other two are the synthesized populations: Population I, which has more control tables, and Population II, which has fewer control tables. This chapter presents the results of an analysis of the predictive accuracy of each model when applied to each population. The results are examined in terms of the total number of trips by life-cycle category and by income category.

In this study, each of the aggregate and disaggregate models is applied to each of the three populations to predict the number of trips for each household. Then the
predicted trips are aggregated by categories of life-cycle and household-income characteristics. The six predictions of travel volumes obtained (two models applied to three populations) are compared with the actual travel volumes as reported in the travel surveys.

Table 7-6 presents the total number of trips for each life-cycle category for the three populations. First, the observed number of trips of the true population is 5,068, which is close to the corresponding totals of the six predictions (5014.8, 4980.7, 5149.6, 5145.0, 5131.9, and 5162.0). In fact, it is usually difficult to assess synthetic populations by the total number of predicted trips (aggregated over all households) because, as shown here, different populations and different models can provide similar results. The aggregation process lets household-level errors offset one another, which makes the total appear accurate. So, in this study the total number of trips is further divided into categories to observe more disaggregate errors. The absolute difference between predicted and true trips, in either the number of trips or the average number of trips, is adopted as the criterion of accuracy.

Table 7-6 provides the number of trips for each life-cycle category for Population I. With this breakdown, evident errors are observed in the life-cycle categories for both the aggregate model and the disaggregate model. The overall errors indicate that the disaggregate model is slightly better. In Table 7-6, significant errors are also observed for both models under Population II, but here the aggregate model is more accurate in terms of overall error. A superficial explanation is that, even though the disaggregate model is more accurate, when it is applied to a population synthesized with few controls, the compromised data accuracy can make its predictions worse. However,
several factors can lead to such results, and a detailed discussion is presented shortly. Table 7-6 also shows the application of the two models to the true population. The disaggregate model is better than the aggregate model, as expected: its errors are uniformly smaller.

Because the life-cycle attribute is not controlled in either Population I or Population II, the number of households in each life-cycle category differs from that of the true population. So, it is necessary to examine the average number of trips in each life-cycle category. Table 7-7 provides the average number of trips of the six predictions for the five life-cycle categories. In that comparison, the disaggregate model is superior in most categories under all three populations.

We now turn to the comparison by household-income category. Unlike the life-cycle attribute, which is not controlled in either synthetic population, household income is controlled in Population I but not in Population II. The comparison is therefore again performed in terms of the average number of trips. The corresponding results are presented in Table 7-8. Within each income category, evident errors occur for both the aggregate model and the disaggregate model across the three populations. Even when applied to the true population, the disaggregate model is not uniformly better than the aggregate model: in the last income category, the disaggregate model is off by 0.32 from the true average trips, while the aggregate model is off by only 0.06. Recall that the models are estimated on the whole Florida dataset while the population is from Miami-Dade, so some discrepancy is expected between the estimation sample and the forecasting sample.

Another notable error is in the first income category of Table 7-8 under Population I. The difference between the average true trips and the average trips estimated by the disaggregate model is 0.42 under Population I, while the corresponding error for the true population is very small (0.08). On inspecting the populations, it is found that the household-size and household-location variables show major discrepancies. Table 7-9 presents the distribution of household size and household location for Population I and the true population within the first income category. In the true population, there are more households with size greater than two and more households living in urban areas. In Table 7-3, household size and urban household location have a positive impact on the number of trips made by households. That is why the number of trips made by the first income category of Population I is underestimated.

Summary

Two linear-regression-based trip-generation models are estimated using the Florida 2009 NHTS Add-Ons. The one with more explanatory variables is referred to as the disaggregate model, while the one with fewer explanatory variables is referred to as the aggregate model. These trip-generation models are applied to three populations. One is the observed sample of Miami-Dade County from the Florida 2009 NHTS Add-Ons, and the other two are synthetic populations of the same region. By comparing the forecast number of trips with the observed number of trips (also referred to as true trips), the performance of the two models and three populations is examined in terms of the number of trips and the average number of trips within categories of the life-cycle and household-income attributes. The results indicate that, when categorized by the life-cycle attribute, the disaggregate model provides
smaller errors in average trips for most categories across the three populations. When categorized by household income, the disaggregate model gives relatively large errors in average trips for the first income category when applied to synthetic populations.

Conceptually, disaggregate models give more precise predictions of the trips made by each household and hence should provide more precise forecasts of average trips by category. However, several other factors affect such aggregate predictions, especially when the models are applied to synthetic populations. First, after aggregation of trips, the prediction errors of individual households offset one another, and hence the advantage of disaggregate models is masked. Second, when applied to synthetic populations, the disaggregate model could give less precise predictions because of inaccurate distributions of some attributes that serve as explanatory variables. Also, the estimated trip-generation models may not fully match the populations to which they are applied, and some errors are expected when the behavior of some group of households differs from that of such households in the data used to estimate the models.

Table 7-1. Frequency distribution of household HBNWSR trip rates

# Trips   Freq.       %     # Trips   Freq.       %     # Trips   Freq.       %
0          2058   16.36     5           365    2.90     10          424    3.37
1           671    5.34     6          1374   10.92     11           82    0.65
2          2693   21.41     7           228    1.81     12          268    2.13
3           530    4.21     8           915    7.28     13           47    0.37
4          2428   19.31     9           158    1.26     14+         336    2.67

Total     12577     100
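
The reading of Table 7-1 given in the text (2,058 of 12,577 households, or 16.36%, made no HBNWSR trips) can be reproduced for every row. The sketch below is illustrative: the frequencies are transcribed from the table, and the names `FREQUENCIES` and `percent_shares` are assumptions introduced for this example.

```python
# Household counts by number of HBNWSR trips, transcribed from Table 7-1.
FREQUENCIES = {
    "0": 2058, "1": 671, "2": 2693, "3": 530, "4": 2428,
    "5": 365, "6": 1374, "7": 228, "8": 915, "9": 158,
    "10": 424, "11": 82, "12": 268, "13": 47, "14+": 336,
}


def percent_shares(frequencies):
    """Hypothetical helper: share of households in each trip-count bin, in percent."""
    total = sum(frequencies.values())
    return {trips: round(100 * count / total, 2) for trips, count in frequencies.items()}


total_households = sum(FREQUENCIES.values())
shares = percent_shares(FREQUENCIES)
print(f"total households: {total_households}")
for trips, share in shares.items():
    print(f"{trips:>3} trips: {share:5.2f}%")
```

The frequencies sum to the 12,577 households of the estimation sample, and the recomputed shares agree with the printed percentage column.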

Table 7-2. Aggregate model

Explanatory variable              Coefficient   Std. error   t value
Intercept
  One                                   1.623        0.153    10.604
Household size
  One a
  Two                                   1.947        0.086    22.590
  Three                                 3.000        0.124    24.262
  Four and more                         5.033        0.121    41.462
Tenure
  Owned household
  Rented household                      0.584        0.108     5.421
Household vehicle ownership
  Zero
  One                                   0.731        0.157     4.655
  Two                                   0.698        0.168     4.148
  Three and more                        0.783        0.179     4.382

Number of cases                         12577
Rho-square                             0.1801
Adjusted rho-square                    0.1796

a Reference categories

Table 7-3. Disaggregate model

Explanatory variable              Coefficient   Std. error   t value
Intercept
  One                                   1.144        0.166     6.900
Household size
  One a
  Two                                   1.796        0.088    20.435
  Three                                 3.112        0.161    19.347
  Four and more                         5.414        0.199    27.259
Tenure
  Owned household
  Rented household                      0.300        0.107     2.792
Household vehicle ownership
  Zero
  One                                   0.721        0.156     4.610
  Two                                   0.809        0.171     4.720
  Three and more                        0.939        0.185     5.076
Employment status
  Number of workers                     0.393        0.054     7.277
Household location
  Household in urban area
  Household in rural area               0.534        0.075     7.148
Household income
  <$25,000
  $25,000-$44,999                       0.410        0.090     4.539
  $45,000-$64,999                       0.599        0.102     5.902
  $65,000-$99,999                       0.763        0.103     7.373
  >$100,000                             1.072        0.111     9.680
Life-cycle characteristics
  No children
  Youngest child 0-5                    1.129        0.190     5.935
  Youngest child 6-15                   1.033        0.173     5.974
  Youngest child 16-21                  0.420        0.197     2.134
  Retired and no children               0.548        0.089     6.165

Number of cases                         12577
Rho-square                             0.2122
Adjusted rho-square                    0.2111

a Reference categories
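
The adjusted rho-square values used to compare the two models can be related to the unadjusted values reported in Tables 7-2 and 7-3. The sketch below assumes that the reported adjustment follows the standard adjusted R-squared formula, 1 - (1 - R^2)(n - 1)/(n - k - 1), and that k counts the estimated slope coefficients shown in each table (7 for the aggregate model, 17 for the disaggregate model). Both are assumptions made for illustration; the dissertation does not state the formula in this section.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Standard adjusted R-squared; assumed here to match the reported 'adjusted rho-square'."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


N_CASES = 12577  # estimation sample size from Tables 7-2 and 7-3

# Slope coefficients counted from the tables (reference categories excluded):
# aggregate: 3 household size + 1 tenure + 3 vehicle ownership = 7
# disaggregate: those 7 + 1 workers + 1 location + 4 income + 4 life cycle = 17
aggregate_adj = adjusted_r_squared(0.1801, N_CASES, 7)
disaggregate_adj = adjusted_r_squared(0.2122, N_CASES, 17)

print(f"aggregate:    {aggregate_adj:.4f}")
print(f"disaggregate: {disaggregate_adj:.4f}")
```

Under these assumptions the formula returns 0.1796 and 0.2111 to four decimals, matching the printed values, so the penalty for the ten extra variables is small relative to the gain in fit at this sample size.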

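Table 7-4. Control tables for Population I

Each entry lists the universe and the attribute and categories of each dimension.

1  Universe: Households. Dimension 1: TENURE (Own, Rent). Dimension 2: HHSIZE (1, 2, 3, 4+).
2  Universe: Households. Dimension 1: TENURE (Own, Rent). Dimension 2: DUTYPE (Single Family, Multi Family).
3  Universe: Households. Dimension 1: TENURE (Own, Rent). Dimension 2: NUMAUTO (0, 1, 2, 3+).
4  Universe: Households. Dimension 1: NWORKER (0, 1, 2, 3+). Dimension 2: NA.
5  Universe: Households. Dimension 1: HHINCOME (<25K, 25-45K, 45-65K, 65-100K, more). Dimension 2: NA.
6  Universe: Person. Dimension 1: AGE (0-5, 6-15, 16-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65-74, over 75). Dimension 2: GENDER (Male, Female).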

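Table 7-5. Control tables for Population II

Each entry lists the universe and the attribute and categories of each dimension.

1  Universe: Households. Dimension 1: NA. Dimension 2: HHSIZE (1, 2, 3, 4+).
2  Universe: Households. Dimension 1: NA. Dimension 2: DUTYPE (Single Family, Multi Family).
3  Universe: Person. Dimension 1: AGE (0-5, 6-15, 16-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65-74, over 75). Dimension 2: NA.
4  Universe: Person. Dimension 1: NA. Dimension 2: GENDER (Male, Female).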

Table 7-6. Total number of trips for households with different life-cycle characteristics

Columns True, Aggregate, and Disaggregate give the number of trips (observed, and predicted by each model). The last two columns give the difference between the true number of trips and each model's prediction.

Under Population I
Life cycle                 Frequency     True   Aggregate   Disaggregate   Diff. aggregate   Diff. disaggregate
No children                    361.0   1129.0      1308.9         1146.4             179.9                 17.4
Youngest child 0-5             109.0    733.0       717.9          603.7              15.1                129.3
Youngest child 6-15            121.0    982.0       766.4          881.2             215.6                100.8
Youngest child 16-21            51.0    324.0       315.8          321.1               8.2                  2.9
Retired and no children        548.0   1900.0      1905.8         2028.3               5.8                128.3
Total                         1190.0   5068.0      5014.8         4980.7             424.6                378.6

Under Population II
Life cycle                 Frequency     True   Aggregate   Disaggregate   Diff. aggregate   Diff. disaggregate
No children                    321.0   1129.0      1154.1          995.6              25.1                133.4
Youngest child 0-5             110.0    733.0       724.4          603.2               8.6                129.8
Youngest child 6-15            147.0    982.0       936.7         1069.7              45.3                 87.7
Youngest child 16-21            47.0    324.0       297.3          305.2              26.7                 18.8
Retired and no children        566.0   1900.0      2037.0         2171.3             137.0                271.3
Total                         1191.0   5068.0      5149.6         5145.0             242.8                641.1

Under true population
Life cycle                 Frequency     True   Aggregate   Disaggregate   Diff. aggregate   Diff. disaggregate
No children                    363.0   1129.0      1243.4         1116.0             114.4                 13.0
Youngest child 0-5             137.0    733.0       912.3          771.7             179.3                 38.7
Youngest child 6-15            134.0    982.0       863.7          996.9             118.3                 14.9
Youngest child 16-21            53.0    324.0       305.3          314.0              18.7                 10.0
Retired and no children        504.0   1900.0      1807.2         1963.4              92.8                 63.4
Total                         1191.0   5068.0      5131.9         5162.0             523.6                140.0
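
The point that aggregation lets category-level errors offset one another can be seen numerically in the Population I block of Table 7-6. The sketch below is illustrative: the values are transcribed from the table, and the helper name `error_summary` is an assumption. It contrasts the net error in total trips with the sum of absolute errors over the five life-cycle categories. Note that the rounded disaggregate entries sum to 378.7, while the table prints 378.6, presumably because the source summed unrounded values.

```python
# Population I block of Table 7-6, categories in table order:
# no children, youngest 0-5, youngest 6-15, youngest 16-21, retired and no children.
TRUE_TRIPS = [1129.0, 733.0, 982.0, 324.0, 1900.0]
AGGREGATE_PRED = [1308.9, 717.9, 766.4, 315.8, 1905.8]
DISAGGREGATE_PRED = [1146.4, 603.7, 881.2, 321.1, 2028.3]


def error_summary(true_trips, predicted_trips):
    """Hypothetical helper: (net error in the total, sum of absolute category errors)."""
    net_error = sum(true_trips) - sum(predicted_trips)
    absolute_error = sum(abs(t - p) for t, p in zip(true_trips, predicted_trips))
    return net_error, absolute_error


agg_net, agg_abs = error_summary(TRUE_TRIPS, AGGREGATE_PRED)
dis_net, dis_abs = error_summary(TRUE_TRIPS, DISAGGREGATE_PRED)

print(f"aggregate model:    net error {agg_net:6.1f}, sum of absolute errors {agg_abs:6.1f}")
print(f"disaggregate model: net error {dis_net:6.1f}, sum of absolute errors {dis_abs:6.1f}")
```

For the aggregate model, the total is off by only about 53 trips, while the category-level absolute errors add up to about 425 trips, which is why the comparison in this chapter is carried out by category rather than on totals.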

Table 7-7. Average number of trips for households with different life-cycle characteristics

Columns True, Aggregate, and Disaggregate give the average number of trips (observed, and predicted by each model). The last two columns give the difference between the true average and each model's prediction.

Under Population I
Life cycle                 Frequency   True   Aggregate   Disaggregate   Diff. aggregate   Diff. disaggregate
No children                    361.0   3.11        3.63           3.18              0.52                 0.07
Youngest child 0-5             109.0   5.35        6.59           5.54              1.24                 0.19
Youngest child 6-15            121.0   7.33        6.33           7.28              0.99                 0.05
Youngest child 16-21            51.0   6.11        6.19           6.30              0.08                 0.18
Retired and no children        548.0   3.77        3.48           3.70              0.29                 0.07

Under Population II
Life cycle                 Frequency   True   Aggregate   Disaggregate   Diff. aggregate   Diff. disaggregate
No children                    321.0   3.11        3.60           3.10              0.49                 0.01
Youngest child 0-5             110.0   5.35        6.59           5.48              1.24                 0.13
Youngest child 6-15            147.0   7.33        6.37           7.28              0.96                 0.05
Youngest child 16-21            47.0   6.11        6.32           6.49              0.21                 0.38
Retired and no children        566.0   3.77        3.60           3.84              0.17                 0.07

Under true population
Life cycle                 Frequency   True   Aggregate   Disaggregate   Diff. aggregate   Diff. disaggregate
No children                    363.0   3.11        3.43           3.07              0.32                 0.04
Youngest child 0-5             137.0   5.35        6.66           5.63              1.31                 0.28
Youngest child 6-15            134.0   7.33        6.45           7.44              0.88                 0.11
Youngest child 16-21            53.0   6.11        5.76           5.92              0.35                 0.19
Retired and no children        504.0   3.77        3.59           3.90              0.18                 0.13
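
The averages in Table 7-7 appear to follow from Table 7-6 by dividing each trip total by a household count. The sketch below reproduces the Population I block under one assumption that is inferred from the printed numbers rather than stated in the text: predicted totals are divided by the household frequencies of the synthetic population, while true totals are divided by the frequencies of the true population. The variable and function names are hypothetical.

```python
# Trip totals and household counts transcribed from Table 7-6, categories in table order:
# no children, youngest 0-5, youngest 6-15, youngest 16-21, retired and no children.
TRUE_TRIPS = [1129.0, 733.0, 982.0, 324.0, 1900.0]
TRUE_HOUSEHOLDS = [363, 137, 134, 53, 504]    # frequencies under the true population
POP1_HOUSEHOLDS = [361, 109, 121, 51, 548]    # frequencies under Population I
POP1_AGGREGATE = [1308.9, 717.9, 766.4, 315.8, 1905.8]
POP1_DISAGGREGATE = [1146.4, 603.7, 881.2, 321.1, 2028.3]


def average_trips(trip_totals, household_counts):
    """Average trips per household in each category, rounded to two decimals."""
    return [round(t / h, 2) for t, h in zip(trip_totals, household_counts)]


true_avg = average_trips(TRUE_TRIPS, TRUE_HOUSEHOLDS)
agg_avg = average_trips(POP1_AGGREGATE, POP1_HOUSEHOLDS)
dis_avg = average_trips(POP1_DISAGGREGATE, POP1_HOUSEHOLDS)

print("true:        ", true_avg)
print("aggregate:   ", agg_avg)
print("disaggregate:", dis_avg)
```

Normalizing by each population's own household count removes the effect of the uncontrolled life-cycle distribution, which is the reason the text gives for switching from totals to averages.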

106 Table 78. Average number of t rips for households with different income characteristic Household Income (U nder Population I ) Frequency Number of Trips Difference between True and True Aggregate Model Disaggregate Model Aggregated Model Disaggregated Model <$25,000 349 3.16 3.07 2.74 0.09 0.42 $25,000 $44,999 209 4.35 4.21 4.33 0.14 0.01 $45,000 $64,999 169 4.01 4.50 4.50 0.49 0.48 $65,000 $99,999 221 5.29 4.87 5.00 0.43 0.29 >$100,000 242 4.98 5.07 5.18 0.09 0.20 Household Income ( U nder Population II ) Frequency Number of Trips Difference between True and True Aggregate Model Disaggregate Model Aggregated Model Disaggregated Model <$25,000 293 3.16 3.20 2.88 0.05 0.28 $25,000 $44,999 291 4.35 4.23 4.24 0.12 0.11 $45,000 $64,999 196 4.01 4.53 4.63 0.52 0.62 $65,000 $99,999 203 5.29 5.03 5.15 0.27 0.15 >$100,000 208 4.98 5.15 5.37 0.17 0.39 Household Income ( U nder true population ) Frequency Number of Trips Difference between True and True Aggregate Model Disaggregate Model Aggregated Model Disaggregated Model <$25,000 349 3.16 3.43 3.24 0.27 0.08 $25,000 $44,999 210 4.35 4.19 4.22 0.15 0.13 $45,000 $64,999 169 4.01 4.42 4.42 0.41 0.41 $65,000 $99,999 221 5.29 4.91 5.05 0.38 0.24 >$100,000 242 4.98 5.04 5.30 0.06 0.32

Table 7-9. Distribution of household size and location for Population I and the true population

                     Household size                Location
Population           1      2      3      4+       Urban   Rural
Population I         199    110    22     18       282     67
True population      147    144    30     28       341     8

CHAPTER 8
SUMMARY AND CONCLUSIONS

Motivated by the growing interest in disaggregate travel-demand models, this dissertation proposed a fitness-based procedure for generating synthetic populations, which provide the detailed inputs required by disaggregate travel-demand models and activity-based transportation planning models. All major steps of population synthesis are analyzed comprehensively around the proposed method: base-year population synthesis, target-year population synthesis, and application to travel-demand models.

Traditional population synthesis is based on the IPF procedure, which is the most widely used method in current applications. However, IPF requires all controls to be defined over the same universe, and issues such as the zero-cell problem also limit its use in complex situations. More recently, new methods have been developed to incorporate multi-level controls in population synthesis. However, documentation of the application of IPF and other methods in the context of target-year synthesis is limited. This research contributes towards synthetic population generation by addressing these issues.

In Chapter 4, the Fitness Based Synthesis (FBS) approach is proposed. This method is designed to select households in such a way that the marginal control tables are matched. The proposed framework contains three major components: the initial household set, the fitness functions, and the selection mechanism. The initial household set is prepared before the procedure begins, and a proper choice of the initial household set can reduce the running time of the procedure. This study introduces two types of fitness functions, which measure the marginal contribution of adding or removing each household towards reducing the matching errors. In this study, the fitness functions are based on the sum-of-squared-errors criterion, which can be replaced by other criteria. The proposed method is not an IPF-based population synthesizer, and hence it has no zero-cell issue. It also has no convergence issue under the squared-error-based fitness functions adopted in this dissertation, because the number of iterations is proved to be bounded. Furthermore, under squared-error-based fitness functions, the synthetic population can match the marginal control tables almost perfectly if there is a sufficient variety of household types.

In Chapter 5, synthetic populations for 22 artificial census tracts are generated with the proposed FBS method. An important objective of this dissertation is to compare the proposed method with other methods, so synthetic populations for the same 22 census tracts are also generated with the conventional IPF method and the recently developed IPU method. According to the validation results, the FBS and IPU approaches generate populations closer to the true population than the conventional IPF method does.

In Chapter 6, this study also contributes by presenting an empirical assessment of target-year populations synthesized with different base-year populations, data-fusion methods, and control tables. Twelve synthetic populations were synthesized for twelve census tracts in Florida. The year 2000 was taken as the base year and 1990 as the target year. The empirical results indicate the value of synthesizing more accurate base-year populations by accommodating multi-level controls. Target-year populations synthesized with more accurate base-year populations as seed data are shown to be more accurate. In addition, target-year populations synthesized with multi-level controls and the FBS methodology perform better than those synthesized with only household-level controls and IPF. Finally, errors in the target-year control tables significantly reduce the accuracy of the synthesized populations. The magnitude of the overall error in the synthesized population appears to be linearly related to the magnitude of the input errors introduced via the control tables.

In Chapter 7, two linear-regression-based trip-generation models are estimated using the Florida 2009 NHTS Add-On data. The model with more explanatory variables is referred to as the disaggregate model, and the one with fewer explanatory variables is referred to as the aggregate model. These trip-generation models are applied to three populations: the Miami-Dade County sample from the Florida 2009 NHTS Add-On data and two synthetic populations of the same region. Disaggregate models predict the trips made by each household more precisely and hence should provide more precise forecasts of average trips for different household categories. However, several other factors affect such aggregate predictions, especially when the models are applied to synthetic populations. First, once trips are aggregated, the prediction errors of individual households offset one another, which masks the advantage of disaggregate models. Second, when applied to a synthetic population, a disaggregate model could give less precise predictions because the distributions of some attributes used as explanatory variables are inaccurate. Also, the adopted trip-generation models may not hold equally well across the populations to which they are applied, and some errors are expected when the behavior of some group of households differs from that of such households in the data used to estimate the models.

There are several avenues for future research. First, it would be useful to test different fitness functions under the FBS mechanism. Second, the quality of a synthetic population is affected by several factors, such as the seed data, so the impact of the seed data and other factors can also be analyzed. Third, the analysis of the application of trip-generation models can be extended to the evaluation of nonlinear models such as the multinomial logit model for mode choice. In addition, an examination of the application of methods such as Poisson regression and ordered probit for household trips is also identified as an avenue for future research.

APPENDIX A
NUMERICAL ILLUSTRATION OF THE FITNESS BASED SYNTHESIS PROCEDURE

Chapter 4 outlines the procedure for population synthesis, which involves selecting a set of households from the PUMS data in such a way that the tract-level controls are satisfied. A numerical illustration of this procedure is presented here.

For simplicity, we assume that there are two control tables (Figure A-1) for a hypothetical census tract (synthesis area). The first table ($T_{1k}$) is a two-dimensional, household-level table representing the joint distribution of household size (limited to either 1 or 2 persons, again for simplicity) and tenure. The second table ($T_{2k}$) is a one-dimensional, person-level table representing the distribution of gender. The intent of the population-synthesis procedure is to generate households and persons that satisfy the distributions present in these control tables.

The seed data are presented in Figure A-2. There are two tables. The household-level table presents the tenure and household size of each household. The person-level table presents the gender of each person in each of the households in the seed data. There are five households and eight persons in this dataset. The reader will note that the seed data contain at least one household of each of the four types (i.e., Own, 1-person; Own, 2-person; Rent, 1-person; and Rent, 2-person) represented in the control table. Similarly, there are both males and females in the seed data.

Population synthesis is an iterative procedure in which one household is selected to be added or removed in each iteration. The selection of households is based on the two types of fitness functions. These fitness functions are calculated for each household in the seed data in each iteration, based on the values of the control tables (T), the current selection, and the contribution of the corresponding household towards satisfying the controls.

Figure A-3 presents the values in the two HT tables for the five households in the seed data. As already defined, the HT tables give the contribution of each household towards satisfying the different controls. Household 1 (HH ID = 1) comprises a single female living in a rental house (Figure A-2). Thus, selecting this household would contribute one single-person, rental household (as indicated in the first HT table in Figure A-3) and one female (as indicated in the second HT table) to the population. Similarly, selecting the third household from the seed data would contribute one two-person, rental household (as indicated in the first HT table for the third household) and two males (as indicated in the second HT table of the third household) to the population.

Once all the tables have been defined, the two fitness values can be calculated for each of the households in the seed data. The Type I and Type II fitness values of household $i$ in iteration $n$, $F^{I}_{in}$ and $F^{II}_{in}$, are calculated using the following formulas:

$$F^{I}_{in} = \sum_{j=1}^{J} \sum_{k=1}^{K_j} \left[ \left(R^{n-1}_{jk}\right)^2 - \left(R^{n-1}_{jk} - HT^{i}_{jk}\right)^2 \right] \qquad (1)$$

$$F^{II}_{in} = \sum_{j=1}^{J} \sum_{k=1}^{K_j} \left[ \left(R^{n-1}_{jk}\right)^2 - \left(R^{n-1}_{jk} + HT^{i}_{jk}\right)^2 \right] \qquad (2)$$

where $R^{n-1}_{jk} = T_{jk} - CT^{n-1}_{jk}$, and $CT^{n-1}_{jk}$ is the value of cell $k$ of table $j$ obtained by aggregating the households selected at the end of iteration $n-1$.

In the current example, $J = 2$ (there are two control tables), $K_1 = 4$ (four cells in the first control table), and $K_2 = 2$ (two cells in the second control table). The entire seed data set is chosen as the initial household set.

The algorithm begins with the calculation of the fitness values for each of the five households. A household is then selected from the candidate set. The fitness values are then recalculated and the iterations continue. A household from the seed data can be selected into the population of the census tract multiple times. The algorithm terminates when no household qualifies as a candidate.

The numerical calculations corresponding to the application of the algorithm to the example problem are presented in Table A-1. In the first iteration, all households have a positive Type I fitness value and a negative Type II fitness value. Hence the candidate set consists of all five households. By simulation, household 2 is selected and added to the current selection. After this household is added, the fitness values of all households are updated accordingly. The results for the following iterations are also presented in Table A-1. In iteration 8, all fitness values of the five households are negative; hence no household is eligible as a candidate and the procedure ends. Note that in iteration 7 only household 2 has a positive Type II fitness value, which means that removing this household from the current selection improves the overall fit to the control tables.

At the end of the procedure, the household counts from iteration 7 constitute the synthetic population. Aggregating the synthetic population into the structure of the control tables yields tables that are exactly the same as the control tables (refer to Figure A-1 and Figure A-4). This indicates a perfect fit of the synthesized population to the controls. When the number of households is large and there are several control tables, such a perfect fit may not be possible.
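The iterations described above can be sketched in a few lines of Python. This is a minimal illustration, not the dissertation's implementation. It assumes the squared-error fitness forms, in which the Type I (Type II) value is the reduction in the sum of squared residuals obtained by adding (removing) one copy of a household, with each residual equal to the control total minus the current aggregated count. The names `fitness` and `run_fbs`, the flattened cell ordering, the uniform random choice among candidates, and the rule that only households currently in the selection may be removed are all assumptions made for this sketch.

```python
import random

# Cells are flattened in the order
# [Own/1-person, Own/2-person, Rent/1-person, Rent/2-person, Male, Female].
T = [1, 5, 2, 2, 11, 6]      # control totals (Figure A-1)
HT = [                       # household contributions (Figure A-3)
    [0, 0, 1, 0, 0, 1],      # HH 1: rent, 1 person, one female
    [1, 0, 0, 0, 1, 0],      # HH 2: own, 1 person, one male
    [0, 0, 0, 1, 2, 0],      # HH 3: rent, 2 persons, two males
    [0, 1, 0, 0, 1, 1],      # HH 4: own, 2 persons, one male and one female
    [0, 1, 0, 0, 2, 0],      # HH 5: own, 2 persons, two males
]


def aggregate(counts):
    """Cell totals implied by selecting counts[i] copies of household i."""
    return [sum(c * row[k] for c, row in zip(counts, HT)) for k in range(len(T))]


def fitness(counts):
    """Type I (add) and Type II (remove) fitness of every seed household."""
    residual = [t - a for t, a in zip(T, aggregate(counts))]
    type1 = [sum(r**2 - (r - h) ** 2 for r, h in zip(residual, row)) for row in HT]
    type2 = [sum(r**2 - (r + h) ** 2 for r, h in zip(residual, row)) for row in HT]
    return type1, type2


def run_fbs(counts, rng):
    """Add or remove one randomly chosen candidate per iteration until none remain."""
    counts = list(counts)
    while True:
        type1, type2 = fitness(counts)
        moves = [(i, +1) for i, f in enumerate(type1) if f > 0]
        # Assumption: only households currently in the selection can be removed.
        moves += [(i, -1) for i, f in enumerate(type2) if f > 0 and counts[i] > 0]
        if not moves:
            return counts
        i, delta = rng.choice(moves)
        counts[i] += delta


# Replay the selections recorded in Table A-1: (household ID, +1 add / -1 remove).
table_a1_steps = [(2, +1), (4, +1), (1, +1), (4, +1), (3, +1), (4, +1), (2, -1)]
replayed = [1, 1, 1, 1, 1]
for hh_id, delta in table_a1_steps:
    replayed[hh_id - 1] += delta

print(fitness([1, 1, 1, 1, 1]))         # fitness values at the start of iteration 1
print(replayed, aggregate(replayed))    # final counts and their aggregated cells
print(run_fbs([1, 1, 1, 1, 1], random.Random(0)))
```

Because every accepted move lowers the integer-valued sum of squared residuals by at least one, the loop in `run_fbs` must stop after a finite number of iterations. A random run need not follow the same path as Table A-1 and may end with different counts.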

Table A-1. Intermediate results during generation of the example population

Household ID            1     2     3     4     5

Initial counts          1     1     1     1     1

Iteration 1
  Fitness I             8     8    17    21    21
  Fitness II          -12   -12   -27   -27   -31
  Candidate set         X     X     X     X     X
  Selection                   X
  Counts                1     2     1     1     1

Iteration 2
  Fitness I             8     4    13    19    17
  Fitness II          -12    -8   -23   -25   -27
  Candidate set         X     X     X     X     X
  Selection                               X
  Counts                1     2     1     2     1

Iteration 3
  Fitness I             6     2     9    13    11
  Fitness II          -10    -6   -19   -19   -21
  Candidate set         X     X     X     X     X
  Selection             X
  Counts                2     2     1     2     1

Iteration 4
  Fitness I             2     2     9    11    11
  Fitness II           -6    -6   -19   -17   -21
  Candidate set         X     X     X     X     X
  Selection                               X
  Counts                2     2     1     3     1

Iteration 5
  Fitness I             0     0     5     5     5
  Fitness II           -4    -4   -15   -11   -15
  Candidate set                     X     X     X
  Selection                         X
  Counts                2     2     2     3     1

Iteration 6
  Fitness I             0    -4    -5     1    -3
  Fitness II           -4     0    -5    -7    -7
  Candidate set                           X
  Selection                               X
  Counts                2     2     2     4     1

Iteration 7
  Fitness I            -2    -6    -9    -5    -9
  Fitness II           -2     2    -1    -1    -1
  Candidate set               X
  Selection                   X
  Counts                2     1     2     4     1

Iteration 8
  Fitness I            -2    -2    -5    -3    -5
  Fitness II           -2    -2    -5    -3    -5

Figure A-1. Control tables

T1k      HHSize = 1   HHSize = 2   Total        T2k      Total
Own      1            5            6            Male     11
Rent     2            2            4            Female   6
Total    3            7            10           Total    17

Figure A-2. Seed data

HH ID (i)   Tenure   HH Size        HH ID (i)   Person ID   Gender
1           Rent     1              1           1           Female
2           Own      1              2           1           Male
3           Rent     2              3           1           Male
4           Own      2              3           2           Male
5           Own      2              4           1           Male
                                    4           2           Female
                                    5           1           Male
                                    5           2           Male

Figure A-3. HT tables for each of the households in the seed data

HH ID (i) = 1
HTi1k    HHSize = 1   HHSize = 2        HTi2k    Total
Own      0            0                 Male     0
Rent     1            0                 Female   1

HH ID (i) = 2
HTi1k    HHSize = 1   HHSize = 2        HTi2k    Total
Own      1            0                 Male     1
Rent     0            0                 Female   0

HH ID (i) = 3
HTi1k    HHSize = 1   HHSize = 2        HTi2k    Total
Own      0            0                 Male     2
Rent     0            1                 Female   0

HH ID (i) = 4
HTi1k    HHSize = 1   HHSize = 2        HTi2k    Total
Own      0            1                 Male     1
Rent     0            0                 Female   1

HH ID (i) = 5
HTi1k    HHSize = 1   HHSize = 2        HTi2k    Total
Own      0            1                 Male     2
Rent     0            0                 Female   0
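The HT tables in Figure A-3 follow mechanically from the seed records in Figure A-2: each household contributes a 1 to the household-level cell matching its tenure and size, and its person-level table counts its members by gender. The short Python sketch below shows one way to derive them. The data layout and the function name `ht_tables` are hypothetical choices made for illustration only.

```python
# Seed data transcribed from Figure A-2.
households = {          # HH ID -> (tenure, household size)
    1: ("Rent", 1),
    2: ("Own", 1),
    3: ("Rent", 2),
    4: ("Own", 2),
    5: ("Own", 2),
}
persons = [             # (HH ID, gender), one entry per person
    (1, "Female"),
    (2, "Male"),
    (3, "Male"), (3, "Male"),
    (4, "Male"), (4, "Female"),
    (5, "Male"), (5, "Male"),
]

# Cell orderings assumed for this sketch.
HH_CELLS = [("Own", 1), ("Own", 2), ("Rent", 1), ("Rent", 2)]
PERSON_CELLS = ["Male", "Female"]


def ht_tables(hh_id):
    """Return the (household-level, person-level) contributions of one seed household."""
    ht1 = [1 if households[hh_id] == cell else 0 for cell in HH_CELLS]
    genders = [gender for hh, gender in persons if hh == hh_id]
    ht2 = [genders.count(cell) for cell in PERSON_CELLS]
    return ht1, ht2


for hh_id in households:
    print(hh_id, ht_tables(hh_id))
```

For household 3, for example, the household-level vector has its single 1 in the Rent, 2-person cell and the person-level vector records two males, matching the third panel of Figure A-3.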

A1k      HHSize = 1   HHSize = 2   Total        A2k      Total
Own      1            5            6            Male     11
Rent     2            2            4            Female   6
Total    3            7            10           Total    17

Figure A-4. The representation of the synthetic population in the structure of the control tables


BIOGRAPHICAL SKETCH

Lu Ma received his bachelor's and master's degrees in civil engineering from Tsinghua University, Beijing, China, in 2004 and 2007, respectively. He also earned a master's degree in statistics from the University of Florida in 2010. In August 2007, he began his Ph.D. studies in the Department of Civil and Coastal Engineering at the University of Florida. Lu Ma has published one journal paper and co-authored four conference papers and technical reports. During his Ph.D. studies, he was awarded the Short Stay Fellowship from Utrecht University, the Netherlands, where he spent three months as a visiting researcher in 2010. He was also awarded the ITS Florida Annual Anne Brewer Scholarship in 2010.