LINEAR DISCRIMINATION WITH STRATEGICALLY MISSING VALUES

By JUHENG ZHANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2011

© 2011 Juheng Zhang

To my parents

ACKNOWLEDGMENTS

Working on a dissertation can be a lonely journey, and it is impossible without the support and help of many people. My sincere gratitude goes to my advisors, my committee members, my family, and the Department of Information Systems and Operations Management for their help and support.

This is a great opportunity to express my respect to my advisor, Professor Gary J. Koehler. This dissertation would not have been possible without his expert guidance. His suggestions and comments are always perceptive, appropriate, and helpful. I have been amazed by his broad knowledge, his views on research, his work ethic, and many other things. Professor Gary J. Koehler inspires me to be a good scholar. He has provided me with kind guidance and support throughout my graduate studies.

I am pleased to thank my co-advisor, Professor Haldun Aytug, for his guidance and assistance. His suggestions helped me improve my presentations and clarify my arguments. The discussions with Professor Haldun Aytug are always helpful. I would like to thank Dr. Selwyn Piramuthu, who helped me start my very first research project. It is always a pleasure to talk with him about research topics, academia, or simply life. I am pleased to thank Professor Steven Shugan for his kindness and support. In addition, I want to thank Professor Praveen Pathak and the other faculty members of the ISOM Department at the University of Florida.

Finally, but most importantly, I want to thank my family for their support and love. This dissertation is dedicated to my parents, who gave me my life and the drive and confidence to tackle challenges head on.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
2 LITERATURE REVIEW OF METHODS FOR HANDLING MISSING VALUES
    Conventional Methods of Handling Missing Data
    Methods Handling Missing Data in Classification Problems
3 LITERATURE REVIEW OF LDFS, SLT, AND SVMS
    Linear Discriminant Functions
    Statistical Learning Theory
    Support Vector Machines
4 STRATEGIC LEARNING
5 THE PRINCIPAL'S PROBLEM
6 LARGE TRAINING SAMPLE CASE
7 EXPERIMENTAL STUDY
    Experiment Design
    Expected Outcomes Part One (The Same Method Case)
        Controlled Factors
        Strategic Game
        Different Default Vectors
    Experiment Outcomes Part Two (The Randomized Case)
    Experiment Outcomes Part Three (The Random Case)
8 SUMMARY, CONCLUSIONS AND FUTURE RESEARCH
    Theoretical Work
    Empirical Work
    Application on the Trust and Reputation Systems

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1 Summary of notations (for the case of one principal)
2-1 Complete data set
2-2 Data set with missing values
2-3 Data subset (The Deletion Method)
2-4 New data set with imputed values (The Average Method)
2-5 New data set with imputed values (The Similarity Method)
2-6 New data set with imputed values (The Regression Method)
2-7 Comparison of method results
7-1 Summary of experimental designs
7-2 Two-way analysis for dependent variable: misclassification rate
7-3 Two-way analysis for dependent variables: TotMisc, PosMisc, NegMisc
7-4 Simple summary of statistics
7-5 Simple summary of statistics in percentage
7-6 Tukey's Studentized Range (HSD) test for misclassification rate (OL=High)
7-7 Summary of Tukey's test results for dependent variables: TotMisc, PosMisc, NegMisc
7-8 Tukey's Studentized Range (HSD) test for TotMisc of agents' methods (OL=High, Principal's Method="D method")
7-9 Strategy analysis of principal and agents (OL="High", TrainSize=TestSize=1000)
7-10 Strategy analysis of principal and agents (OL="Low", TrainSize=TestSize=1000)
7-11 Comparison of chosen default vector of D method (OL=High)
7-12 Comparison of chosen default vector of D method (OL=High) %
7-13 Comparison of chosen default vector of D method (OL=Low)
7-14 Comparison of chosen default vector of D method (OL=Low) %
7-15 Comparison of chosen default vector of DNeg method (OL=High)
7-16 Comparison of chosen default vector of DNeg method (OL=High) %
7-17 Comparison of chosen default vector of DNeg method (OL=Low)
7-18 Comparison of chosen default vector of DNeg method (OL=Low) %
7-19 Two-way analysis for dependent variable: misclassification rate
7-20 The comparison results of Tukey's range test
7-21 The comparison results of Tukey's range test for strategic rates
7-22 Comparison of chosen default vector of D method (OL=High)
7-23 Comparison of chosen default vector of D method (OL=High) %
7-24 Comparison of chosen default vector of D method (OL=Low)
7-25 Comparison of chosen default vector of D method (OL=Low) %
7-26 Comparison of chosen default vector of DNeg method (OL=High)
7-27 Comparison of chosen default vector of DNeg method (OL=High) %
7-28 Comparison of chosen default vector of DNeg method (OL=Low)
7-29 Comparison of chosen default vector of DNeg method (OL=Low) %
7-30 Two-way analysis for dependent variable: misclassification rate
7-31 Statistics of randomly missing holes
7-32 Statistics of randomly missing holes %
7-33 Statistics of randomly missing holes (D)
7-34 Statistics of randomly missing holes (D) %
7-35 Statistics of randomly missing holes (DNeg)
7-36 Statistics of randomly missing holes (DNeg) %
7-37 Statistics of randomly missing holes (D kw)
7-38 Statistics of randomly missing holes (D kw) %

LIST OF FIGURES

4-1 Strategically hidden data example
5-1 $D_i(w,b)$ for a support vector $x_i$
5-2 $D_i(w,b)$ for a nonsupport vector $x_i$
5-3 $D_i(w,b)$ where $w_2 = 0$ for a nonsupport vector $x_i$
5-4 $D_y(w,b)$
5-5 $D_i(w,b)$, the set of default vectors incenting $x_i$ to reveal everything
5-6 $D_y(w,b)$
5-7 $D(w,b)$
5-8 Empty D set
6-1 Volumes
6-2 Balanced \ unbalanced data set
6-3 Average Method
6-4 Average Method in the balanced case
6-5 Similarity Method
6-6 Special case one of Similarity Method
6-7 Special case two of Similarity Method
6-8 Example of optimal solution of convex problem
7-1 Comparison of chosen default vector of D method (OL=High)
7-2 Comparison of chosen default vector of D method (OL=Low)
7-3 Comparison of chosen default vector of DNeg method (OL=High)
7-4 Comparison of chosen default vector of DNeg method (OL=Low)
7-5 Comparison of chosen default vector of D method (OL=High)
7-6 Comparison of chosen default vector of D method (OL=Low)
7-7 Comparison of chosen default vector of DNeg method (OL=High)
7-8 Comparison of chosen default vector of DNeg method (OL=Low)
7-9 Comparison of the impact of missing holes
7-10 Comparison of missing holes for D method
7-11 Comparison of missing holes of DNeg method
7-12 Comparison of chosen default vector of D method when data are randomly missing

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

LINEAR DISCRIMINATION WITH STRATEGICALLY MISSING VALUES

By Juheng Zhang

August 2011

Chair: Gary J.
Koehler
Cochair: Haldun Aytug
Major: Business Administration

This study analyzes a problem in which a decision maker must estimate missing values that agents have hidden strategically, so that further analysis can be carried out as if the data were complete. Data can be missing for different reasons. When data are provided by intelligent agents, information is most often hidden strategically to receive a favorable classification (or potential transaction) from the decision maker. Anticipating such strategic moves, we find a set of default vectors that the decision maker can use to replace missing values, such that she minimizes her misclassification rate and at the same time incents agents to publish information. In theoretical and empirical studies, the performance of this set of default vectors is compared to some common statistical methods for handling missing data. Empirical results show that the default vectors chosen from this set dominate other methods in terms of misclassification rates.

CHAPTER 1 INTRODUCTION

In many empirical studies, information is missing for various reasons. For example, applicants filling out forms might omit data knowingly or inadvertently when they apply for a credit card, a school, or a job. In online transactions, sellers may leave some fields blank, such as demographic data, product quality, or past transactions. In survey data, some questions are not applicable to a fraction of respondents, such as a question that asks unmarried people to rate the quality of their marriage. Sometimes respondents do not know the answer, such as when a question asks for the birth date of an acquaintance. In other cases, trained interviewers might neglect to ask some questions. In these and many other cases, observations are taken with values missing.
Researchers must then decide how to deal with observations containing missing values, since most methods of analysis require observations to have an entry for each attribute or variable. Missing values are a common problem in many research fields, such as economics (Poirier and Ruud 1983), marketing (Kaufman 1988), health science (Berk 1987, Bernhard et al. 1998, Breteler and Rombouts 1988, Laird 1988), statistics (Little 1988), psychology (Brown 1983, Schafer and Graham 2002), and education (Johnson 1989).

One might say that missing data result from poor data design or collection procedures and would not be a problem if data acquisition were well planned and executed. The fact is that, even with careful design and conscientious data collection, data may be incomplete (DeSarbo et al. 1986). Events beyond the control of researchers lead to missing values. Subjects of longitudinal studies may die or move away during the period of an experiment; individuals may refuse to reveal sensitive information such as salary or health care history; and some respondents overlook or forget to answer some of the questions when data are collected through questionnaires. After data collection, inappropriate values are edited out; times of life events can be missing due to censoring by the date of the interview; data from panel surveys suffer from attrition; and some records can be inadvertently lost when data are collated from multiple records.

Missing data can lead to a number of problems (Roth 1994). First of all, missing data decrease statistical power. Statistical analysis studies relationships between variables in a data set, and statistical power measures how well such analysis discovers those relationships. Schmidt et al. (1976) showed that a high level of statistical power requires a large amount of data. When data are missing, sample size decreases dramatically if only observations with complete data are used.
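As a rough illustration of how quickly complete-case analysis shrinks a sample, the sketch below assumes that every entry of an observation is missing independently with the same probability. The function names, the ten-variable setting, and the two percent rate are illustrative assumptions, not values drawn from the data sets studied here. Under these assumptions the expected share of observations lost is 1 - (1 - p)^n, which is about 0.183 for p = 0.02 and n = 10.

```python
import random


def complete_case_fraction(p_missing: float, n_vars: int) -> float:
    """Expected share of observations with no missing entry, assuming each
    of the n_vars entries is missing independently with prob. p_missing."""
    return (1.0 - p_missing) ** n_vars


def simulate_listwise_loss(n_rows: int, n_vars: int, p_missing: float,
                           seed: int = 0) -> float:
    """Monte Carlo share of observations that complete-case analysis drops."""
    rng = random.Random(seed)
    dropped = 0
    for _ in range(n_rows):
        # An observation is dropped as soon as any one entry is missing.
        if any(rng.random() < p_missing for _ in range(n_vars)):
            dropped += 1
    return dropped / n_rows


# Illustrative setting (an assumption): 10 variables, 2% missing per variable.
expected_loss = 1.0 - complete_case_fraction(0.02, 10)
simulated_loss = simulate_listwise_loss(20000, 10, 0.02)
print(f"expected share of observations lost:  {expected_loss:.3f}")
print(f"simulated share of observations lost: {simulated_loss:.3f}")
```

The closed form and the simulation should agree closely; the point is only that a small per-variable missing rate compounds across variables when whole observations are discarded.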
Empirical studies (Kim and Curry 1977) found that if two percent of data are missing randomly in a data set, then eighteen percent of the total data can be lost when observations having a missing value are removed. Increases in the amount of missing data in a data set diminish both sample size and statistical power. Secondly, lost data can bias statistical estimates. Parameters can be overestimated (or underestimated) if lost data concentrate on one side of a distribution. For example, a standard deviation can be underestimated if the missing data of one variable are those with high (or low) values. Eventually, one must decide what to do about missing values.

A number of methods for dealing with missing data have been proposed, including ones in statistics (Allison 2001, Buck 1960, Haitovsky 1968, Little 1992, Ragel and Crémilleux 1998), machine learning (Loh and Vanichsetakul 1988, Saar-Tsechansky and Provost 2007), and marketing (Hand and Henley 1997, Malhotra 1987). Such methods depend on the nature of the missing values, are designed for different problems, and are conducted under certain assumptions about data distribution, dependence of variables, etc. Some statistical methods, such as Average, Similarity, and Regression, have gained popularity, and we see variants of these methods in different research fields. In Chapter 2, we survey methods for dealing with missing values and discuss their advantages, limitations, and assumptions.

Conventional methods for dealing with missing data commonly assume that data are missing randomly. Researchers often try to make the case that observations having missing values on a particular variable are not different from those having observed measurements. How the data came to be missing is generally ignored by these methods. However, data are often missing for strategic reasons. There are many examples where information is hidden intelligently.
A well-known example is that online sellers choose to publish information favorable to themselves and their products and services while concealing information that might diminish the probability of potential sales. Sellers can even hide past transactions by registering a new online identity when an old one has attained a poor reputation (Ba 2001, Baron et al. 2002). Limited information disclosure is also common in financial markets (Healy and Palepu 2001, Hirshleifer and Teoh 2003). For example, companies may conceal important information in their annual financial reports, and entrepreneurs typically have better information than investors about the value of business investment opportunities and incentives (Hirshleifer and Teoh 2003). Other examples may be found in job recruiting (Goldberg 2010), insurance approval (Insure.com 2010), school application (Braun et al. 2010), second-hand markets (Akerlof 1970), etc.

In such cases, if the missing data mechanism is not considered, then subsequent analyses can be biased. For instance, buyers may end up selecting sellers incorrectly because the data on which the buyers' decision was based were incomplete. This may result in a monetary loss in subsequent transactions. Investors might make bad investments because of incomplete financial disclosures. In asset allocation, Bouchaud et al. (1997, p. 1) highlight this very clearly: "When the available statistical information is imperfect, it is dangerous to follow standard optimization procedures to construct an optimal portfolio, which usually leads to a strong concentration of the weights on very few assets." Recruiters might hire an unqualified employee while rejecting intelligent and diligent ones. If data are missing strategically, the missing data mechanism needs to be considered before any analysis on the data set can be undertaken.

Throughout we call the person (or persons collectively) who makes decisions with data a principal.
The principal may also be a person running an experiment or studying a phenomenon of interest. Observations in a data set may be made on inanimate objects that do not act strategically (for example, items produced by a company). In contrast, observations may be collected on individuals or institutions that may act strategically for some purpose. Throughout, such intelligent entities will be called agents. We are interested in strategically missing values in observations made on agents.

Such strategic behavior may arise for different reasons. Perhaps agents have obtained information about the principal prior to interaction and use it to predict the principal's decision and to game the system (i.e., to act strategically in their own self-interest). Or they have built knowledge about the principal through past interactions and try to provide what the principal looks for and to be cooperative. When making decisions, some principals anticipate agents' gaming behavior, while others do not. We call a principal who makes her decisions without taking agents' strategic behavior into consideration a naive principal, and a principal who does take it into consideration an intelligent (or diligent) principal.

In some situations, multiple principals or agents may be involved. For example, on eBay, several buyers (principals) may be interested in a product sold by different sellers (agents). These buyers collect agents' information on eBay and through other sources, and then decide from whom to purchase. In this situation, we have multiple principals who make decisions on the same group of agents.

In this study, we are interested in situations where principals wish to classify agents into one of a set of classes (usually just two classes). For example, a buyer (the principal) gathers data on sellers (agents) and then decides with which of them she will transact business (the classes here might be trustworthy or not trustworthy).
A recruiter (the principal) must decide which applicants (agents) to hire based on the information they provide. Here the classes might be good potential employee or not. Since agents may engage in strategic activity to confuse a principal, it behooves the principal to anticipate such activity in inducing and using a classification mechanism.

There are many methods for determining classification functions. Such functions are often called discriminant functions (Lachenbruch and Goldstein 1979). Methods include decision tree methods (Quinlan 1986), neural network methods (Anderson and Rosenfeld 1988), Bayesian methods (Neal 1996), etc. Among these methods, linear discriminant functions (LDFs) are often used since they are simple to understand and easy to apply. They have been used in many different fields. Fisher (1936) originally proposed using LDF induction methods to find a linear combination of attributes that separates two or more groups.

In two-group linear discriminant analysis, one class is labeled positive and the other negative. For any observation of $n$ attributes, $x \in \mathbb{R}^n$, the principal uses an induced classifier $(w, b)$, where $w \in \mathbb{R}^n$ and $b$ is a scalar, to determine the class of $x$. If $w'x + b \geq 0$, then $x$ is classified into the positive class; otherwise, it is classified into the negative class. The classification rule $(w, b): \mathbb{R}^n \to \{-1, 1\}$ is learned from a collection of examples. The process of determining a discriminant function from known examples of positive and negative cases is called supervised learning. The collection of given examples is called a training sample, denoted as $S = \{(x_i, y_i),\ i = 1, \dots, \ell\}$, where $\ell$ is the number of observations, the $i$th observation $x_i \in \mathbb{R}^n$ consists of a number of attributes (or variables) $x_{ik}$, $k = 1, \dots, n$, and $y_i \in \{-1, 1\}$ denotes the class label.
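The decision rule and the training sample just described can be sketched in a few lines. This is a minimal illustration rather than the induction method developed later: the weights, the offset, and the five labeled observations are hypothetical, and assigning points that lie exactly on the hyperplane to the positive class is an assumption of the sketch.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (attribute vector x_i, label y_i)


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 according to the sign of w'x + b.

    Assumption: a score of exactly zero is assigned to the positive class.
    """
    score = sum(wk * xk for wk, xk in zip(w, x)) + b
    return 1 if score >= 0 else -1


def training_mistakes(w: Sequence[float], b: float,
                      sample: List[Example]) -> int:
    """Count the training examples whose predicted label differs from y_i."""
    return sum(1 for x, y in sample if classify(w, b, x) != y)


# Hypothetical training sample S with n = 2 attributes and five examples.
S: List[Example] = [
    ([2.0, 1.0], 1),
    ([1.5, 2.5], 1),
    ([-1.0, -0.5], -1),
    ([0.5, -2.0], -1),
    ([-0.2, 0.1], 1),   # a positive example on the wrong side of the line
]

w, b = [1.0, 1.0], 0.0  # a hypothetical classifier (w, b)
labels = [classify(w, b, x) for x, _ in S]
mistakes = training_mistakes(w, b, S)
print(labels, mistakes)
```

With these illustrative numbers the last example is misclassified, so a linear rule need not reproduce every training label.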
The collection of examples used for testing the discriminant function is known as a test data set (which may be a holdout sample from the training sample). For example, in the buyer-seller setting, the characteristics of a seller are the attributes making up $x$, such as price, location, membership date, positive feedback percentage, and feedback score. The classes may simply be {trustworthy, untrustworthy}. A buyer determines her hypothesis $(w, b)$ to discriminate sellers, such that a trustworthy seller satisfies $w'x + b \geq 0$ and is grouped into the $+1$ class, and an untrustworthy seller is grouped into the $-1$ class. Although many methods have been proposed to handle missing data when using linear discriminant functions, to the best of our knowledge, none of these methods deals with data omitted strategically. We give a detailed discussion of LDFs in Chapter 3.

The performance of supervised learning methods is generally measured by two types of errors: the empirical error and the generalization error. The empirical error (induction error) measures how often the induced classifier errs on the training sample. The generalization error measures how the induced classifier performs on unseen data and is usually not known at the time of induction. Typically a learning method induces a classifier so as to minimize either the empirical error or (ideally) a combination of empirical and generalization error. Many learning methods only minimize empirical error. A class of methods called Support Vector Machines (SVMs) applies a fundamental risk minimization principle and considers both types of errors. This risk minimization principle is the main focus of Statistical Learning Theory (SLT). Typically, risk, as discussed in SLT, is a combination of empirical error and generalization error. SVMs are designed to minimize this risk by trading off minimization of empirical error against generalization error.
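The trade-off can be made concrete with the standard soft-margin SVM objective, $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$, where the slack $\xi_i = \max\{0, 1 - y_i(w'x_i + b)\}$ and the positive parameter $C$ follow the notation summarized in Table 1-1. The sketch below only evaluates this objective for two hypothetical classifiers on a hypothetical sample; that the objective takes exactly this textbook form is an assumption of the sketch, and no optimization is performed.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (attribute vector x_i, label y_i)


def slack(w: Sequence[float], b: float, x: Sequence[float], y: int) -> float:
    """Slack of one example: max(0, 1 - y (w'x + b))."""
    score = sum(wk * xk for wk, xk in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w: Sequence[float], b: float,
                          sample: List[Example], C: float) -> float:
    """Textbook soft-margin objective: 0.5 * ||w||^2 + C * sum of slacks."""
    margin_term = 0.5 * sum(wk * wk for wk in w)
    slack_term = sum(slack(w, b, x, y) for x, y in sample)
    return margin_term + C * slack_term


# Hypothetical sample and two hypothetical candidate classifiers.
S: List[Example] = [
    ([2.0, 1.0], 1), ([1.5, 2.5], 1),
    ([-1.0, -0.5], -1), ([0.5, -2.0], -1), ([-0.2, 0.1], 1),
]
wide = ([0.5, 0.5], 0.0)    # smaller norm, wider margin, more slack
narrow = ([1.0, 1.0], 0.0)  # larger norm, narrower margin, less slack

for C in (1.0, 10.0):
    obj_wide = soft_margin_objective(wide[0], wide[1], S, C)
    obj_narrow = soft_margin_objective(narrow[0], narrow[1], S, C)
    print(f"C={C}: wide={obj_wide:.2f}, narrow={obj_narrow:.2f}")
```

With a small $C$ the wider-margin classifier has the lower objective, and with a large $C$ the ranking reverses, which is the sense in which $C$ trades empirical error against the margin term.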
We adapt SVM models to formulate our principal's problem and use this powerful tool to find a principal's classifier. A review of SVMs and SLT is given in Chapter 3. Like other linear discriminant methods, SVMs generally do not handle missing values.

Conventional discriminant methods use the observations in a data set as given, meaning that they do not consider strategically hidden data. However, manipulated data bias results, insufficient information increases error rates in outcomes, and hidden facts distort decisions. If principals do not examine data, agents acting strategically stand to gain. Hardly ever have conventional discriminant methods considered data generation and data validity. We survey a few such methods in Chapter 4.

In this study, we examine the principal's classification problem when agents hide information strategically. In our framework, discussed in detail in Chapter 4, a diligent principal anticipates agents' strategic behavior in publishing attribute values and minimizes her classification risk when making classification decisions. In this work, we question how data come to be missing, we study how attribute values can be induced, and we investigate optimal strategies for a principal when she faces strategic agents who hide data.

The study is structured as follows. In Chapter 2, we discuss common methods for handling missing data and explain the assumptions, advantages, and limitations of each method. In Chapter 3, we give an overview of linear discriminant functions, statistical learning theory, and support vector machines. In Chapter 4, we discuss strategic learning, which sets a general context for the problems we study, and we introduce the framework used in this study. In Chapter 5, we model a strategic agent's data hiding problem and provide a theoretical analysis of a missing data method that a principal can use to deal with such hiding.
In Chapter 6, we investigate the performance of our method and of several common statistical methods for handling missing data in the ideal case where the principal's training sample size goes to infinity. In Chapter 7, we conduct an empirical study to measure and compare the performance of common statistical methods for handling missing data and of our method. Finally, in Chapter 8, we conclude the study and discuss future work. Notation used throughout this study is summarized in Table 1-1.

Table 1-1. Summary of notations (for the case of one principal)

Agent
$X$: Instance space of observations.
$S$: Training sample.
$\ell$: Total number of examples in the training sample.
$i$: Example (or agent $i$) index, $i = 1, \dots, \ell$.
$n$: Total number of attributes.
$k$: Attribute index, $k = 1, \dots, n$.
$x_i$: $n$-dimensional vector of attributes of the $i$th example (or agent $i$), $x_i \in X \subseteq \mathbb{R}^n$.
$y_i$: True label of agent $i$, $y_i \in \{-1, 1\}$.
$z_i$: Pair of agent $i$'s vector and label, $z_i = (x_i, y_i)$.
$(\cdot)_i$: Agent $i$'s cost vector of publishing attributes.
$(\cdot)_i$: Agent $i$'s 0-1 vector of attributes. The $k$th entry is 1 if agent $i$ reveals the $k$th attribute and 0 if agent $i$ does not.
$V_i$: Opportunity cost of agent $i$.
$I_i$: Indicator of whether agent $i$ takes the opportunity, $I_i \in \{0, 1\}$; $I_i = 0$ if the agent does, otherwise $I_i = 1$.
$M$: A sufficiently large number.

Principal
$(w, b)$: Classification rule of the principal.
$H(w, b)$: Hyperplane of classification, $H(w, b) = \{x : w'x + b = 0\}$.
$d$: Default vector chosen by the principal, $d \in \mathbb{R}^n$.
$C$: Positive trade-off parameter between generalization and empirical error.
$\xi_i$: Slack margin of agent $i$, $\xi_i = \max\{0,\ 1 - y_i(w'x_i + b)\}$.

Infinite case (here $X$ is a hypersphere instance space centered at the origin with radius $r$)
$X^{+}$: Subset of $X$ such that $w'x + b \geq 1$.
$X^{-}$: Subset of $X$ such that $w'x + b \leq -1$.
$X^{0}$: Subset of $X$ such that $-1 < w'x + b < 1$.
$V^{+}$: Volume of $X^{+}$.
$V^{-}$: Volume of $X^{-}$.
$V^{0}$: Volume of $X^{0}$.
$r^{+}$: Radius of the hypersphere $\{x \in X : w'x + b = 1\}$.
$h^{+}$: Height of the hyperspherical cap $X^{+}$.
$r^{-}$: Radius of the hypersphere $\{x \in X : w'x + b = -1\}$.
$h^{-}$: Height of the hyperspherical cap $X^{-}$.
$p^{+}$: Pole point of the $w$ ray in $X^{+}$.
$p^{-}$: Pole point of the $w$ ray in $X^{-}$.
$u$: Intersection point of the $w$ ray with $w'x + b = 1$.
$l$: Intersection point of the $w$ ray with $w'x + b = -1$.

CHAPTER 2 LITERATURE REVIEW OF METHODS FOR HANDLING MISSING VALUES

Principals usually have to contend with the problem of missing data. Sometimes data are missing because of careless data collection or inappropriate data design; in other cases, missing information in a data set results from factors not under the control of the experimenters. Information can be too sensitive to be revealed, such as personal financial status or health records, or data may be hidden on purpose in the hope of a better outcome or classification. For example, online sellers may hide their geographical data if they think that their geographical region may lead to fewer potential transactions. Insufficient data can result in inaccurate decisions. For data that are missing at random, different methods have been proposed for handling the problem. However, few studies have considered the possibility that the data are hidden strategically. In this chapter we review the existing methods for handling nonstrategically missing data. We leave a review of techniques dealing with strategic data to Chapter 4.

Conventional Methods of Handling Missing Data

Many methods have been proposed for handling missing data. When data are missing, one possible solution is to collect new observations in similar circumstances and obtain new values for the missing data. For example, researchers can send out questionnaires to similar populations again, and chemists may redo their experiments under similar conditions to collect the data again. However, such solutions may be economically infeasible or impractical.
Another possible solution is to delete observations having missing values on any attribute or variable and to perform the data analysis on the remaining subset. This method is known as Listwise Deletion. Its limitation is that it can discard a large number of observations. When a data set is sparse, deletion can yield a subset that is very small or even empty, and it can cause large standard errors because little information is utilized. Pairwise Deletion, also called available case analysis, differs from Listwise Deletion. Instead of deleting every observation with missing values, the Pairwise Deletion Method discards only the observations that have missing data on the pair of variables currently being used to calculate a covariance.

Another possibility is to replace the missing data with an estimated value, so that the analysis can proceed as if the data set had no missing values. A category of methods utilizes all available data and substitutes some reasonable guess (an imputation) for each missing value. Hence, these methods are called imputation methods, and they are very common. The main advantage of imputation is that it retains a great amount of information and removes the need to delete existing data. Under certain assumptions, some imputation methods produce good and consistent results.

There are many different ways to impute missing values. Some methods are general purpose (i.e., independent of the type of function to be induced), while others are specific to certain problems. In the statistics field, a few imputation methods, such as the Average Method, the Regression Method, and the Maximum Likelihood Method, have gained widespread acceptance. These methods normally consider attributes having continuous values. They have become conventional methods for dealing with missing data, and we see adaptations of these methods in different fields. We refer readers to survey papers (Donders et al. 2006, Schafer and Graham 2002) for detailed discussion of these conventional methods, although we discuss some of them below.

There are also some imputation techniques unique to classification problems. Several papers (Chan and Dunn 1972, Chan et al. 1976, Jackson 1968) summarize different linear discriminant methods for handling missing data and compare the performance of these methods (we briefly discuss these later in Chapter 4). An overview of tree-based techniques for the missing data problem can be found in the survey paper by Liu et al. (1997).

Before moving to specific imputation methods, we first explain common assumptions used by most of these methods. Rubin (1976) distinguishes different assumptions about missing data. The first assumption, missing completely at random (MCAR), refers to the case where the probability that a variable $x_k$ has missing values is unrelated to the values of the other variables $x_{k'}$ ($k' \neq k$) and to the value of $x_k$ itself. Under the MCAR assumption, after deleting all observations and variables with missing values from a data set, the remaining subset can be regarded as a simple random sample from the original data set. Another assumption, called missing at random (MAR), is that whether a variable $x_k$ has missing values is unrelated to the value of $x_k$ itself once the other variables $x_{k'}$ are controlled. Omissions are randomly distributed within one or more subsamples defined by $x_{k'}$. For a given $x_{k'}$, the probability of $x_k$ having missing values does not depend on the values of $x_k$ but only on $x_{k'}$.

For example, suppose we study the relation between two variables describing used products on eBay, product quality ($x_k$) and market price range ($x_{k'}$). If MCAR is satisfied, then the probability of missing data on product quality depends neither on the values of market price range nor on the values of product quality. Missing data on product quality in observations are completely random.
Instead, if MAR is satisfied, then the probability of missing data on product quality depends on the other variable, market price range. But within each market price range, this probability is unrelated to the values of product quality. That is, controlling for the market price range, a used product with poor quality is not more (or less) likely to have unobserved data on product quality.

If the data are either MCAR or MAR and the parameters that control the missing data process are not related to the parameters to be estimated, then the mechanism leading to missing data can generally be ignored, and data analysis can proceed without modeling the missing data mechanism. In general, however, neither MCAR nor MAR may be a realistic assumption in real data situations. We discuss the cases where neither MCAR nor MAR is satisfied in Chapter 4.

Perhaps the simplest imputation method is the Average Method, also known as marginal mean imputation. The method was first mentioned by Wilks (1932). For each missing value on a given variable, the Average Method estimates the value by calculating the mean of the cases with observed data on that variable. The Average Method can introduce substantial bias and can yield incorrect standard errors (Haitovsky 1968), especially if the data set is neither MCAR nor MAR. To illustrate this, suppose the missing values on a real variable $x_k$ are mostly those below the mean of $x_k$; it is clear that using the mean of the $x_k$ values in the sample can shift the distribution of $x_k$ to the right.

We use a numeric example to show how several of these methods work. Tables 2-1 and 2-2 are the data tables we use to illustrate each method. There are three variables and four observations. The true values of each variable on every observation are listed in Table 2-1. Table 2-2 is the same data table except that the variable $x_2$ has missing values on two observations. We use ??? to represent missing data.

Table 2-1. Complete data set
x1    x2    x3
5     1     2
2     5     3
3     5     1
1     2     7

Table 2-2. Data set with missing values
x1    x2    x3
5     1     2
2     ???   3
3     ???   1
1     2     7

Table 2-3 shows the result of using the Listwise Deletion Method. We can see that any observation having missing data is deleted. The results of the Average Method are shown in Table 2-4. It substitutes imputed values for missing values. The Average Method calculates the mean of the values of variable $x_2$ present in the observations. That is, (1+2)/2 = 1.5, where the numerator (1+2) is the sum of the observed values and the denominator 2 is the number of observations having data. Table 2-4 shows the new data set with imputed values.

Table 2-3. Data subset (the Deletion Method)
x1    x2    x3
5     1     2
1     2     7

Table 2-4. New data set with imputed values (the Average Method)
x1    x2    x3
5     1     2
2     1.5   3
3     1.5   1
6     1     7

The Similarity Method finds the observation that is most similar to the record with a missing value, as measured by the values that are not missing, and uses the actual value in that most similar record to replace the missing value. Proponents of the Similarity Method argue that the method improves accuracy since it uses realistic values, and that the method also preserves the shape of the distribution. The underlying principle of the Similarity Method can be used for discrete variables, and this variant of the Similarity Method is called hot deck imputation, which has become popular in survey research. A data set is "hot" if it is currently being used for imputing a score. Hot deck imputation replaces a missing value with the actual score of a similar case. If there are several equally similar cases, then the method randomly chooses one of them.
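The Listwise Deletion and Average Method computations described above can be sketched in a few lines. This is only an illustration on the data of Table 2-2; the function names and the use of None to stand for ??? are our own assumptions, not part of any of the cited methods.

```python
# Table 2-2 from the text; None stands for a missing value ("???").
table_2_2 = [
    (5, 1, 2),
    (2, None, 3),
    (3, None, 1),
    (1, 2, 7),
]


def listwise_deletion(rows):
    """Keep only the observations with no missing value on any variable."""
    return [row for row in rows if None not in row]


def mean_imputation(rows, col):
    """Replace missing entries in column `col` by the mean of the observed entries."""
    observed = [row[col] for row in rows if row[col] is not None]
    mean = sum(observed) / len(observed)
    return [
        tuple(mean if (j == col and v is None) else v for j, v in enumerate(row))
        for row in rows
    ]


deleted = listwise_deletion(table_2_2)
imputed = mean_imputation(table_2_2, col=1)  # column 1 holds x2
print(deleted)
print(imputed)
```

The imputed value is the same for every missing entry, which is why the Average Method shrinks the variance of the imputed variable.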
Similarity between observations $p$ and $q$ is often measured by the Euclidean distance,

$$d(p,q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}.$$

Only variables with observed values are used for calculating the distance when observations have missing values. Since the second and third observations in Table 2-2 have missing values, the Similarity Method needs to find a most similar observation for each of them. We show how to find a substitute for the missing value on the second observation. The method uses the variables $x_1$ and $x_3$ to calculate distances, since $x_1$ and $x_3$ do not have missing data. We need to calculate the distance between the second observation and every other observation that does not have the same missing values. So, we compute the distances of the second observation to the first one and to the fourth one separately. The distance to the first observation is

$$d(2,1) = \sqrt{(2-5)^2 + (3-2)^2} = 3.16,$$

and the distance to the fourth observation is

$$d(2,4) = \sqrt{(2-1)^2 + (3-7)^2} = 4.12.$$

The distance between the second observation and the first is smaller than that to the fourth, so the first observation is more similar than the fourth. We therefore use the value of $x_2$ in the first observation as the estimate. That is why we see 1.00 in Table 2-5 at the original missing spot in the second observation. Similarly, we can find the closest record for the third observation and an estimate for its $x_2$. Table 2-5 is the new data table with imputed values calculated by using the Similarity Method.

Table 2-5. New data set with imputed values (the Similarity Method)
x1    x2     x3
1     1      2
2     1.00   3
3     1.00   1
6     1      2

The Regression Method, also called the conditional mean imputation method, is another statistical imputation method. The Regression Method uses a regression equation to calculate an estimate of the true value. Assume only one variable $x_k$ has missing values on some cases. Using the cases with complete data on $x_k$, the method regresses $x_k$ on all of the other variables and then uses the regression equation to generate the substitutes for the cases with missing data.
The substitutes are predicted values for the missing data. The estimators are unbiased in general (Gourieroux and Monfort 1981) when the data are MCAR, the sample is large, and the imputed values depend only on the independent variables. According to Allison (2001), the Regression Method generates predicted values that preserve deviations from the mean and the shape of the distribution. It does not attenuate correlations between variables as much as mean substitution does. However, both the Regression and Average Methods are sensitive to departures from MCAR or MAR. Since they analyze imputed data as though they were complete, these two methods produce underestimated standard errors and overestimated test statistics.

Using the same data sets in Tables 2-1 and 2-2, we demonstrate how the Regression Method computes the estimates. Since only one variable, $x_2$, has missing values, we can simply take $x_2$ as the dependent variable and $x_1$ and $x_3$ as independent variables, and run a regression to get the vector of coefficients $\hat{\beta}$. As we know, for ordinary least squares,

$$\hat{\beta} = (X'X)^{-1} X'Y, \quad \text{where } X = \begin{pmatrix} x_1 & x_3 \end{pmatrix} = \begin{pmatrix} 5 & 2 \\ 1 & 7 \end{pmatrix}, \quad Y = x_2 = \begin{pmatrix} 1 \\ 2 \end{pmatrix},$$

so

$$\hat{\beta} = \begin{pmatrix} 0.0909 \\ 0.2727 \end{pmatrix}.$$

Thus, the missing data can be calculated as

$$\hat{y} = X\hat{\beta} = \begin{pmatrix} 2 & 3 \\ 3 & 1 \end{pmatrix} \begin{pmatrix} 0.0909 \\ 0.2727 \end{pmatrix} = \begin{pmatrix} 1.00 \\ 0.55 \end{pmatrix}.$$

Substituting the estimates for the missing data, the method generates the new complete data set shown in Table 2-6.

Many variants of the Regression Method have been proposed (Beale and Little 1975, Buck 1960, Grzymala-Busse and Hu 2001, Haitovsky 1968, Raghunathan et al. 2001). Buck (1960) originally studied the case where more than one variable has missing values. If there are $k$ variables having missing values and $n-k$ variables having complete values, then multiple regressions of each missing variable on the remaining variables are run. Predicted values for the unknowns are then estimated from the appropriate regression function.
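The Similarity Method distances and the Regression Method estimates worked through above can be checked with a short script. It is a sketch on the data of Table 2-2 only; the helper names are our own, and the regression step uses Cramer's rule, which is valid here because $X$ is a square, invertible 2x2 matrix (so that $(X'X)^{-1}X'Y = X^{-1}Y$) and no intercept is fitted, as in the text.

```python
import math

# Table 2-2; None stands for a missing value ("???").
rows = [(5, 1, 2), (2, None, 3), (3, None, 1), (1, 2, 7)]
complete = [r for r in rows if r[1] is not None]
incomplete = [r for r in rows if r[1] is None]


def distance(p, q):
    """Euclidean distance over x1 and x3, the variables observed in both rows."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[2] - q[2]) ** 2)


def similarity_estimate(row):
    """Borrow x2 from the complete observation closest to `row`."""
    nearest = min(complete, key=lambda c: distance(row, c))
    return nearest[1]


def regression_coefficients():
    """Solve X beta = Y for the two complete cases (2x2 system, no intercept)."""
    (a, y1, b), (c, y2, d) = complete  # rows of X are (a, b) and (c, d)
    det = a * d - b * c
    return ((y1 * d - b * y2) / det, (a * y2 - c * y1) / det)


beta = regression_coefficients()
regression_estimates = [beta[0] * r[0] + beta[1] * r[2] for r in incomplete]
similarity_estimates = [similarity_estimate(r) for r in incomplete]

print([round(distance(rows[1], c), 2) for c in complete])
print([round(v, 4) for v in beta])
print([round(v, 2) for v in regression_estimates])
print(similarity_estimates)
```

With more complete cases than regressors, the 2x2 solve would have to be replaced by a least-squares fit; the square case is used here only because it matches the worked example.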
Table 2-6. New data set with imputed values (the Regression Method)
x1    x2     x3
1     1      2
2     1.00   3
3     0.55   1
6     1      2

To illustrate, suppose a data set has four variables, $x_1$, $x_2$, $x_3$, and $x_4$. The variables $x_1$ and $x_2$ have missing data, but no values are absent on $x_3$ and $x_4$. Then the regression formulations $x_1 = f(x_3, x_4)$, $x_2 = f(x_3, x_4)$, and $(x_1, x_2) = f(x_3, x_4)$ are obtained. Correspondingly, the variance matrix also needs to be adjusted:

$$\mathrm{var}(x_1) = \mathrm{var}(\hat{x}_1) + \mathrm{residual}_{1,34}, \quad \mathrm{var}(x_2) = \mathrm{var}(\hat{x}_2) + \mathrm{residual}_{2,34}, \quad \mathrm{cov}(x_1, x_2) = \mathrm{cov}(\hat{x}_1, \hat{x}_2) + \mathrm{residual}_{12,34},$$

where $\mathrm{residual}_{1,34}$ is the residual variance from regressing $x_1$ on $x_3$ and $x_4$, and so on. The method thus modifies the variance-covariance matrix and corrects the underestimated residual variance found in most regression methods. Apart from that, Buck's method is applicable to cases where the data are not normally distributed. The method shares some similarity with the Maximum Likelihood Method.

The Maximum Likelihood (ML) Method chooses estimates that maximize the likelihood of the values present. In other words, it computes the distribution that is most likely to have generated what has been observed. The method assumes that data are drawn from a multivariate normal distribution. It has been shown that, even when MCAR is not satisfied, the ML method can still generate consistent and efficient estimates provided the data are MAR. Most algorithms based on the ML method are iterative and computationally expensive. Thus, instead of giving a numeric example to illustrate the underlying principle of the ML method, we use one of its most widely used variants, expectation-maximization (EM), to explain how the Maximum Likelihood Method is applied.

The EM algorithm is an iterative technique. It starts with certain values of the parameters, namely the mean and the covariance matrix. The starting values of these parameters can be estimated from the complete data subset selected by the Deletion Method from the original data set. Each iteration has two steps, expectation and maximization.
The expectation step calculates the regression coefficients of the missing variables on the nonmissing variables by using the mean and covariance matrix. Using the regression coefficients, the expectation step predicts the missing values and substitutes them into the gaps in the data. After that, the maximization step calculates a new variance-covariance matrix from the new complete data set, which has both original and estimated values. The new variance-covariance matrix is used for the next cycle. In the next iteration, the expectation step starts with the new matrix and generates new estimates for the missing values. The cycles are repeated until the estimates converge.

We have reviewed several conventional methods: the Average, Similarity, Regression, and Maximum Likelihood Methods. Except for the Maximum Likelihood Method, we provided a numeric example for each method. We summarize these numeric results in Table 2-7.

Table 2-7 shows that all the estimated values of the different imputation methods are smaller than the true values. The reason is that imputed values are based on the observed data present on the variable having missing values. Since the present data are those with low values, the imputation methods generate smaller predicted values. This demonstrates that when the missing data are greater than the mean of the distribution, the estimated data are biased and the new distribution with the estimates becomes skewed. When imputed data are used for missing values, statistical analysis produces underestimated statistics, such as a smaller standard deviation, variance, and covariance, but overestimated test statistics, and hence understated p values.

Table 2-7. Comparison of method results
               True      Missing   Imputed values of x2
               values    values    Average   Regression   Similarity
               of x2     of x2     Method    Method       Method
Observations   1         1         1         1            1
               5         ???       1.5       1            1
               5         ???       1.5       0.55         1
               2         2         1         1            1
Mean           3.25                1.25      0.89         1.00
Std dev        2.06                0.29      0.23         0.00

Also, it is essential to know that these methods operate under certain assumptions about the missing data. When the assumptions are not satisfied, the methods may be inappropriate. As we see later, when the data are missing strategically those assumptions can easily be violated, which once again shows the importance of considering the missing data mechanism before salvaging data and generating estimated values.

In summary, most statistical methods produce biased estimates if the uncertainty in missing data is ignored. They may obtain biased statistics and test statistics when the underlying assumptions are violated. For the Listwise Deletion Method, if the sample size is large and the missing data are MCAR, then the reduced subsample is a good representative of the original sample and the method can generate unbiased statistics. Yet, if not MCAR but only MAR is satisfied, then the estimates are biased and the standard errors are larger since less information is utilized. As discussed by Allison (2001), the deletion method is not robust if MCAR is violated, and pairwise deletion utilizes more information than listwise deletion. But the biggest drawback of pairwise deletion is that, when constructing the covariance matrix, the method may use different observations to compute the covariance for each pair of variables, so it is difficult to interpret the matrix. The Average Method, as pointed out by Haitovsky (1968), should be avoided, since it gives biased estimates even if the MCAR assumption is met.
In terms of generating robust results, the Average Method is worse than the Regression Method (Jackson 1968). The Regression Method can produce unbiased estimates, such as the least squares coefficients, if it is used on a large sample with MCAR data and imputation is performed only on independent variables (Allison 2001). Proponents of the Similarity Method argue that the method uses actual values rather than estimates to replace missing values. The disadvantage of this method is that little theoretical or empirical work has been completed to determine its accuracy.

In summary, if data are MCAR and the sample size is large, then listwise deletion is not worse than the other methods (Allison 2001). If the independent variables are highly correlated, then the Regression Method can generate consistent estimates. None of the methods considers that unknown values may be hidden strategically rather than missing at random. We analyze this issue in Chapter 4.

Methods Handling Missing Data in Classification Problems

For classification problems, since data sets have a class variable to indicate the membership of observations, the values of the class variable can also be used in obtaining imputed estimates. For example, Grzymala-Busse and Hu (2001) compare several methods, including the Mode Method and the Concept Mode Method. The Mode Method assigns the most common value of a variable in the present data to the missing values of that variable, while the Concept Mode Method assigns the most common value of a variable among the present data with the same concept (class) to the missing values. Here the class memberships are used for imputing estimates. Methods that consider the class variable use only the examples in the same class as the observation having missing data, while other methods use all observations without considering the class values. The mode methods studied in that paper are an example of methods handling missing data in discrete cases.
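The difference between the Mode Method and the Concept Mode Method can be seen on a small example. The data, attribute values, and function names below are hypothetical and chosen only so that the two methods disagree; they are not taken from Grzymala-Busse and Hu (2001).

```python
from collections import Counter

# Hypothetical discrete data: (attribute value, class label); None is a missing value.
examples = [
    ("red", "pos"),
    ("red", "pos"),
    ("blue", "neg"),
    ("blue", "neg"),
    ("blue", "neg"),
    (None, "pos"),
]


def mode(values):
    """Most common value in a non-empty list."""
    return Counter(values).most_common(1)[0][0]


def mode_method(data):
    """Fill every missing value with the most common observed value overall."""
    fill = mode([v for v, _ in data if v is not None])
    return [(fill if v is None else v, c) for v, c in data]


def concept_mode_method(data):
    """Fill a missing value with the most common observed value within the same class."""
    filled = []
    for v, c in data:
        if v is None:
            v = mode([u for u, k in data if u is not None and k == c])
        filled.append((v, c))
    return filled


print(mode_method(examples)[-1])
print(concept_mode_method(examples)[-1])
```

In this example the overall mode is "blue", but within the class of the incomplete example the mode is "red", so using the class label changes the imputed value.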
Depending on the type of missing data, either continuous or discrete, the methods dealing with missing data can be different.

Classification methods induce a classification rule from training examples and then measure the accuracy of classification on test samples. If missing values exist in a training sample, it is difficult to generate a classification rule, since LDFs and other classification methods generate a classifier using all variables. If a test set has missing values, a classification rule cannot proceed as usual. Depending on whether data are missing in the training sample or in the test set, algorithms handling missing data can be grouped into two types: those dealing with unknown values in the training data and those dealing with unknown values in the test data.

For example, decision tree induction methods (Doetsch et al. 2009, Loh and Vanichsetakul 1988, Quinlan 1986, 2006) are popular approaches for creating a classifier. A decision tree has terminal nodes (leaves), internal nodes (attributes), and branches. At internal nodes, the tree chooses the attribute that most effectively separates a set of examples into several subsets (this is called branching). Each terminal node holds training data that belong to one class (or belong mostly to one class). Different criteria can be used to select the attribute to branch on. For instance, ID3 (Quinlan 1986) selects the splitting attribute by information gain, and C4.5 (Quinlan 1993) by a normalized information gain (the gain ratio). The problem is that when a sample has unknown values, a simple tree-building process cannot generate a classification rule, since it does not know which branch to follow when the value of an attribute is required in order to classify a particular case. When values are missing in test data, the membership of a new observation cannot be determined without modifying the tree or the evaluation algorithm. Thus, the question is how these unknown values should be treated.
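The attribute selection criterion mentioned above can be made concrete with a short sketch of information gain and of its normalized form, the gain ratio. The toy labels and attribute columns are hypothetical, and the functions assume complete data; they do not implement any of the missing-value treatments discussed in this chapter.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on an attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the split itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: attribute_a separates the classes, attribute_b does not.
labels = ["+", "+", "-", "-"]
attribute_a = ["s", "s", "t", "t"]
attribute_b = ["u", "v", "u", "v"]

print(information_gain(attribute_a, labels), gain_ratio(attribute_a, labels))
print(information_gain(attribute_b, labels), gain_ratio(attribute_b, labels))
```

If an example had an unknown value on the attribute, neither function could place it in a subset, which is exactly the difficulty described in the text.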
When unknown values exist in the training sample, several techniques find ways to get around the problem of missing data. One method (Quinlan 1986) uses other information to determine the missing values. Assume only one attribute $x_k$ has missing values; then take a subsample $S'$ from the original sample $S$, where $S'$ contains the cases with observed values of $x_k$ and includes the original class as another attribute. The attribute $x_k$ becomes the class to be determined, and a classification tree is built for estimating the values of attribute $x_k$ from the other attributes and the class. The idea of this method is similar to that used in the regression methods. The technique utilizes all available information, but it is only appropriate when missing values are sparse. If the same example has more than one attribute with missing values, this technique is difficult to use (Liu et al. 1997).

Another approach (Quinlan 1986) is to assign a new value called "Unknown" to missing discrete values. The method then builds a classification tree as if no values were missing. Alternatively, another method, as proposed in (Quinlan 2006), simply discards all cases with unknown values on an attribute when that attribute is selected for building a decision tree. The underlying idea of this approach is similar to that of the Deletion Method. Other methods, such as the Predictor Method (White and Liu 1990), which uses a cross tabulation to build a tree, also ignore cases with missing values. Quinlan (2006) compares the deletion approach with several other methods and finds that it performs badly and should be avoided in classification problems.

If test data have missing values, prediction becomes impossible, since traversing the branches for an observation with missing values is impossible. The C4.5 algorithm (Quinlan 1993) uses an approach that considers missing values in test sets.
C4.5 has improvements over ID3 (Quinlan 1986). If the prediction algorithm encounters a variable with a missing value, it assigns all possible values to the variable along with their associated probabilities. Based on these multiple values, it assigns the instance to all the matching leaf nodes. On each leaf node, the probability of class membership of the instance is computed. The overall probability of class membership of the instance is computed as a weighted average of the probabilities from each leaf node that includes the instance with missing values. This procedure handles missing values nicely because it considers multiple possible values instead of assigning one specific value.

There are other methods for handling missing data in test sets. When classifying test examples, the surrogate split method (Breiman et al. 1984) uses a surrogate attribute to replace the originally chosen attribute if data are missing on the original attribute. The surrogate attribute is the one that has the highest correlation with the original attribute having missing data. How effective this method is depends on the data sets being used (Liu et al. 1997). If the chosen surrogate is highly correlated with the original attribute, then it is a good substitute; otherwise the technique does not perform well. A technique called dynamic path generation (White and Liu 1990) provides more flexibility when classifying cases because the generation method produces a rule instead of a tree for classification. Suppose the rule is to select the attribute that gives the highest information gain. At each nonterminal step, the attribute giving the highest information gain is the first choice for the classification to branch on, but if that attribute has missing data, then it is disregarded and the second attribute is considered, and so on. The generation process is dynamic, but computationally expensive.

LDF methods dealing with missing values have been studied (e.g., Chan and Dunn 1972, Chan et al. 1976, Gourieroux and Monfort 1981, Jackson 1968). Most of these methods are extensions of regression methods, except that LDFs use data sets having a dependent variable to indicate the membership of observations. The values of the class variable can also be used in obtaining regression formulations. Except for that, LDFs and regression models are quite similar.

CHAPTER 3
LITERATURE REVIEW OF LDFS, SLT, AND SVMS

Linear Discriminant Functions

Discriminant analysis (DA) is popular in the social sciences, biology, and business. It is used to classify instances into different groups, and it helps in studying the differences between these groups (Hand 1981). The main goal of linear discriminant analysis is to predict a class membership for a new observation by using a linear combination of variables. Linear discriminant analysis determines a nonzero vector $w \in \mathbb{R}^n$ and a scalar $b$ such that the hyperplane $H_{w,b} = \{x : w'x + b = 0\}$ partitions the n-dimensional Euclidean space into two half-spaces, $w'x_i + b > 0$ for group 1 (positive class) observations and $w'x_i + b < 0$ for group 2 (negative class) observations.

Linear discriminant analysis was first proposed by Fisher (1936), who found a linear combination of variables that maximized the ratio of the variance between groups to the variance within groups. Fisher's method was a statistical method for determining the classifier $(w,b)$ and was followed by many other approaches. There are different types of methods, such as mathematical programming (MP) methods (e.g., Freed and Glover 1981) and search methods (Koehler 1991). MP methods have certain advantages over other types of methods; for example, they are free from parametric assumptions. We refer the reader to the survey papers (e.g., Erenguc and Koehler 1990, Joachimsthaler and Stam 1990, Koehler 1990) for discussions of MP methods.

Statistical Learning Theory

Vapnik (1998) provided a comprehensive basis, Statistical Learning Theory (SLT), for inducing functions to describe a concept.
SLT defines generalization, overfitting, learning, and the characterization of the performance of learning algorithms. It provides a framework for the good design of induction algorithms. Because the performance of most algorithms is measured by error or risk, statistical learning theory studies the properties of learning algorithms and investigates techniques to find induced functions.

Given a class of functions $f(x,\alpha)$, $x \in X \subseteq \mathbb{R}^n$, and a random independent sample $S$ of $\ell$ examples drawn from a fixed but unknown distribution function $F(x,y)$, one type of loss function is defined as

$$L(y, f(x,\alpha)) = \begin{cases} 0 & \text{if } y = f(x,\alpha), \\ 1 & \text{if } y \neq f(x,\alpha). \end{cases}$$

Here $f(x,\alpha)$ is an instance of a collection of target functions parametrically defined by $\alpha$, and the loss function is an indicator function. For instance, with LDFs, $\alpha = (w,b)$, $f(x,\alpha) = w'x + b$, and $y \in \{-1, 1\}$. If $y_i(w'x_i + b) > 0$, then $L(y_i, f(x_i,\alpha)) = 0$. Otherwise, it is 1.

The structural risk minimization principle suggests minimizing a risk functional of the form

$$R(\alpha) = \int L(y, f(x,\alpha)) \, dF(x,y).$$

That is, choose the parameters $\alpha$ that minimize the expected loss over all cases. Most linear discriminant methods minimize an empirical risk function (i.e., training errors) or a proxy. That is, instead of optimizing the true risk function, they minimize

$$R_{emp}(\alpha) = \frac{1}{\ell} \sum_{i=1}^{\ell} L(y_i, f(x_i,\alpha)).$$

When one uses the indicator function for the loss function, minimizing the empirical risk function is equivalent to minimizing the number of misclassifications. Other loss functions can be used for the empirical risk function.

Since a small training risk does not mean a small overall risk, the performance of the induced classifier on unseen data is not guaranteed. SLT, however, focuses on the overall risk. Given $S$, $L$, a confidence level $\delta$ to induce $f$, and the capacity $h$ (a measure of the expressiveness of the target function), it provides an upper bound on the risk functional.
It solves

$$\inf_{\alpha} R(\alpha) = \int L(y, f(x,\alpha)) \, dF(x,y),$$

where $R$ is the risk functional mentioned earlier. It has been shown (Schölkopf and Smola 2001, Vapnik 1998) that

$$R \le R_{bound} = R_{emp} + \frac{R_{struct}(\ell,h,\delta)}{2} \left( 1 + \sqrt{1 + \frac{4 R_{emp}}{R_{struct}(\ell,h,\delta)}} \right),$$

where

$$R_{emp} = \frac{1}{\ell} \sum_{i=1}^{\ell} \frac{1}{2} \left| f(x_i,\alpha) - y_i \right|$$

is the empirical risk and shows how the induced classifier performs on the training sample, and

$$R_{struct}(\ell,h,\delta) = \frac{2}{\ell} \left( h \ln\frac{2e\ell}{h} - \ln\frac{\delta}{2} \right)$$

is the structural risk of using the function class $f(x,\alpha)$ on $S$. $R_{struct}(\ell,h,\delta)$ is a measure of how well the induced function will perform on data that were not in $S$ (i.e., the so-called predictive power of $f$). It is a function of the number of examples $\ell$, the confidence factor $\delta$, and the capacity $h$ of the target function class. For binary classification problems, $h$ is the maximal number of points that can be separated into two classes in all possible $2^h$ ways using a classifier in the target class of functions. This measure is called the VC dimension.

For the linear class of functions, the VC dimension is $n+1$. When the input space is bounded (i.e., $\|x\| \le r$ for $x \in X \subseteq \mathbb{R}^n$) and the data are linearly separable, the bound on the VC dimension may become tighter, since it can be further bounded by $r^2/\gamma^2$ using a margin LDF with margin $\gamma$ (Vapnik 1998). The effective VC dimension becomes

$$h \le \min\left( \frac{r^2}{\gamma^2}, n \right) + 1.$$

As shown above, the structural risk can be improved by decreasing the VC dimension, so that for margin LDFs one can focus on minimizing $r^2/\gamma^2$ to minimize the bound on the risk functional. Support Vector Machines (SVMs) provide a powerful method that applies the structural risk minimization principle using margin LDFs. An SVM determines an LDF by directly minimizing the theoretical bound on the risk functional. We now provide a brief review of this methodology.

Support Vector Machines

SVMs are a set of supervised learning techniques that can be applied to classification and regression. When used for classification, SVMs determine a hyperplane that separates data into different groups.
Based on statistical learning theory, SVMs deliver good out-of-training generalization and offer capacity control. In addition, since the basic SVM formulation is a convex minimization problem, there exists a unique global optimum, a property missing in many methods.

There are mainly two types of SVM models. One is for linearly separable training data and is known as the hard-margin problem; the other is for linearly inseparable training examples and is called the soft-margin problem. The hard-margin problem finds a consistent linear hyperplane that separates the examples in the training sample. For the hard-margin problem, SVMs find linear discriminant functions by solving the following model:

$$(\text{SVM2}): \quad \min_{w,b} \; w'w \quad \text{s.t.} \quad y_i(w'x_i + b) \ge 1, \quad i = 1,\dots,\ell.$$

This model is also known as the 2-norm model since $w'w = \|w\|_2^2$, where $\|w\|_2$ is the 2-norm of $w$. This is a convex quadratic program. The 1-norm variant is as follows:

$$(\text{SVM1}): \quad \min_{w,b} \; \|w\|_1 \quad \text{s.t.} \quad y_i(w'x_i + b) \ge 1, \quad i = 1,\dots,\ell,$$

and it can be formulated as a linear program. We also see some models that include the factor $\frac{1}{2}$ in the objective function, such as $\frac{1}{2}\|w\|^2$ or $\frac{1}{2}\|w\|$ in place of $\|w\|^2$ or $\|w\|$, respectively. This is done just for mathematical convenience and does not change the optimal solutions, i.e., the minima of the original and the modified versions have the same $(w,b)$.

To explain the rationale behind the formulation of the SVM, we need to define a few terms. The functional margin of example $x_i$ with respect to $(w,b)$, defined as $\hat{\gamma}_i = y_i(w'x_i + b)$, shows the distance of point $x_i$ to the hyperplane up to scaling by $\|w\|$.
The functional margin of a data set $S$ is the minimal functional margin, i.e., $\hat{\gamma} = \min_{1 \le i \le \ell} \hat{\gamma}_i$. However, the functional margin can be increased indefinitely by scaling up $(w,b)$. The geometric margin places some constraints on $(w,b)$. We define the geometric margin of $x_i$ with regard to $(w,b)$ as

$$\gamma_i = y_i \, \frac{w'x_i + b}{\|w\|}.$$

SVM (SVM1 or SVM2) produces a maximal margin hyperplane $(w,b)$ with a geometric margin equal to $1/\|w\|$. Setting the functional margin to 1, we obtain two hyperplanes, called the supporting hyperplanes, $\{x : w'x + b = 1\}$ and $\{x : w'x + b = -1\}$, and the points in the training sample that lie on these hyperplanes, say $x^+$ or $x^-$, are called support vectors. Minimizing $w'w$ is the same as maximizing the geometric margin, which, in turn, minimizes the term $r^2/\gamma^2$ bounding the VC dimension.

For many real-world problems the sample space may not be linearly separable, either due to measurement errors or simply because the pattern is nonlinear. In the latter case it is necessary to employ kernel mappings in conjunction with SVMs to learn a nonlinear function (Cristianini and Shawe-Taylor 2000). A kernel implicitly maps the input space onto a (usually) higher dimensional space, called the feature space. The feature space allows not only the original input variables but also functions of the original variables. In theory a learning machine using kernels can discover more complex patterns because kernel mappings are typically nonlinear. Kernel mappings, however, do not guarantee separability.

Fortunately, for inseparable data sets, we can formulate the SVM problem as a soft-margin problem:

$$(\text{SVM2S}): \quad \min_{w,b,\xi \ge 0} \; w'w + C \sum_{i=1}^{\ell} \xi_i \quad \text{s.t.} \quad y_i(w'x_i + b) \ge 1 - \xi_i, \quad i = 1,\dots,\ell,$$

where $\xi_i$ is the margin slack variable measuring the degree of misclassification of a point $x_i$ and $C$ is the penalty cost of training errors. The objective function trades off margin maximization against training error minimization. It is possible to augment this formulation with kernels.
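The margin and slack definitions above can be evaluated directly for a given classifier. The sketch below does not solve any of the SVM programs; it only computes the functional margin, the geometric margin, and the slack $\max(0, 1 - y_i(w'x_i + b))$ for a hypothetical $(w,b)$ and a hypothetical three-point sample chosen for illustration.

```python
import math


def functional_margin(w, b, x, y):
    """y (w'x + b) for one labelled example."""
    return y * (sum(wk * xk for wk, xk in zip(w, x)) + b)


def geometric_margin(w, b, x, y):
    """Functional margin divided by the 2-norm of w."""
    return functional_margin(w, b, x, y) / math.sqrt(sum(wk * wk for wk in w))


def slack(w, b, x, y):
    """Margin slack max(0, 1 - y (w'x + b)) used in the soft-margin model."""
    return max(0.0, 1.0 - functional_margin(w, b, x, y))


# Hypothetical classifier and sample (not taken from the text).
w, b = (1.0, 1.0), -3.0
sample = [((3.0, 1.0), 1), ((1.0, 1.0), -1), ((2.0, 1.5), 1)]

functional = [functional_margin(w, b, x, y) for x, y in sample]
geometric = [geometric_margin(w, b, x, y) for x, y in sample]
slacks = [slack(w, b, x, y) for x, y in sample]

print(functional, min(functional))
print([round(g, 4) for g in geometric])
print(slacks)
```

The first two points lie on the supporting hyperplanes and need no slack; the third is correctly classified but falls inside the margin, so it violates the hard-margin constraint and would carry positive slack in the soft-margin model.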
The computationally convenient way to add kernels is to work with the dual of SVM2S (Cristianini and Shawe-Taylor 2000). The problem remains a convex quadratic program when positive semi-definite kernels are used. The advantage of SVM2S is that it allows for mislabeled examples. If there does not exist a hyperplane that separates the examples completely, SVM2S will choose a hyperplane that splits the data points as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. Also, since the penalty function is linear, it brings some nice properties to the dual of the problem: the slack variables vanish in the dual and only the penalty cost appears. In this research, we focus on linearly separable data to study the strategic information revelation problem. In Chapter 4 we provide a review and framework for the strategic learning problem.

CHAPTER 4
STRATEGIC LEARNING

Strategic (or adversarial) learning refers to cases where decision makers (principals) discover optimal classifiers while anticipating strategic behavior by intelligent agents that are to be classified. Agents will act in their own best interest to achieve a positive labeling by manipulating their attributes. One type of manipulation is to hide an attribute value. Other types include changing attribute values, possibly through deception or self-improvement. For example, a plumber can intentionally hide an earlier incident where he caused a more severe leak while fixing someone's water pipes. Reporting the incident might decrease his chances of being hired, so he may choose to conceal it. Instead, he might just mention to prospective customers facts such as how many years he has been working as a plumber, how professional he is, etc. In this example, a prospective customer, a principal, classifies plumbers (the agents). The decision rule is based on the characteristics of plumbers. The classes are competent plumber and incompetent plumber.
A plumber can hide information from the principal to increase his chance of being classified as a competent plumber. Doing so does not require him to know much about the principal's classification rule, but merely that this information might be detrimental under most reasonable classification rules. Similarly, online sellers may hide information from buyers, such as minor scratches on products in previous sales. Others may simply try to improve themselves (at least in the eyes of the principal) to acquire a positive labeling, like studying harder to be classified as college-worthy.

Figure 4-1 shows how hiding information can fool a principal. In Figure 4-1, the principal's classification rule is represented by $(w,b)$. For any agent, the vector $x$ denotes his characteristics. If $w'x + b \ge 1$ the agent is considered positive; if $w'x + b \le -1$, negative. If $-1 < w'x + b < 1$ then we consider that he is not classified or, more commonly, that he is classified as a negative case. Dark points in Figure 4-1 show the agents' positions relative to the hyperplanes. The points where the arrows start are the original $x$ vectors containing complete attribute values, and the ending points are the new positions when one attribute value is hidden. To illustrate the ideas we discuss a simple case where $x$ has only two attributes. The horizontal axis is used for attribute $x_1$ and the vertical axis for $x_2$. We assume the principal uses only observed values to classify agents and does not spend time or money to find hidden values (except for a training sample).

In Figure 4-1, originally $x = (1.5, 1.5)$ and $w'x + b \le -1$, so the agent is a true negative, as he is on the negative side of the hyperplane $\{z : w'z + b = -1\}$. When the agent hides $x_2$, $x = (1.5, 1.5)$ becomes $(1.5, ???)$. If the principal ignores missing data, $w_1x_1 + w_2x_2 + b$ reduces to $w_1x_1 + b$. Since $w_1x_1 + b \ge 1$, the agent will be classified as a positive case. The true negative agent successfully achieves a positive classification by hiding $x_2$. Notice that the agent does not need to know the classifier.
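The hiding example can be reproduced in a few lines. This is only a sketch: the text gives the agent's attribute vector (1.5, 1.5) but not the classifier, so the weights w = (2, -3) and the intercept b = -2 below are hypothetical values chosen to match the described situation (negative with full data, positive once the second attribute is hidden and ignored).

```python
import numpy as np

def classify(w, b, x):
    """Score an agent using observed attributes only.

    Missing attributes are encoded as NaN and simply dropped from w'x,
    mimicking a principal who ignores missing data. Returns (label, score),
    with label +1 if score >= 1 and -1 otherwise.
    """
    observed = ~np.isnan(x)
    score = float(w[observed] @ x[observed] + b)
    return (1 if score >= 1 else -1), score

# Hypothetical classifier (assumption; not taken from Figure 4-1).
w = np.array([2.0, -3.0])
b = -2.0

x_full = np.array([1.5, 1.5])       # agent reveals both attributes
x_hidden = np.array([1.5, np.nan])  # agent hides attribute 2

print("full data:  ", classify(w, b, x_full))
print("x2 hidden:  ", classify(w, b, x_hidden))
```

In this sketch the agent gains by hiding exactly the attribute whose weight is negative, which is consistent with the observation that knowing the signs of the components of w is enough to act strategically.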
The agent only needs to know the direction (i.e., the signs of the components of $w$) of the classification rule to act strategically.

Figure 4-1. Strategically hidden data example

Figure 4-1 shows that a principal can be misled by an agent if she does not consider the missing data mechanism. Agents can intentionally hide attributes for a favorable outcome or classification. However, classical machine learning methods implicitly assume that the attributes of instances under classification are not manipulated (although many methods allow for random errors). In many situations where intelligent agents are being classified, including fraud detection (Phua et al. 2005), spam detection (Fawcett 2003), and intrusion detection (Mahoney and Chan 2002), the information available for inducing classifiers may have been strategically altered to achieve a positive classification. Agents can actively manipulate data to defeat the principal, but classical learning methods have given little attention to possible strategic activity in the data generation process.

In this study, we take into account the case when data are missing strategically. Principals need to anticipate such behavior and determine a classification rule that reduces or removes the impact of agents' strategic hiding of information. In Chapter 2, we summarized many imputation methods dealing with missing values, but most of them are based on the MCAR or MAR assumption and are sensitive to departures from these assumptions. When the data are not missing at random, these imputation methods can be inappropriate and misleading. In Chapter 5, we develop a method to handle missing values such that the principal's classification rule can separate agents correctly even if agents behave strategically. Before moving to that framework and our methodology, we survey current studies that directly anticipate strategic learning in the principal's induction process.

Several papers (Boylu et al. 2008, 2010a, b, c; Dalvi et al. 2004) consider situations in which agents alter their attribute values for a favorable classification. Adversarial learning was proposed by Dalvi et al. (2004). They considered the situation in which one player, the adversary, acts strategically and changes some of his attributes, aiming to mislead the other player, called the classifier, into labeling him as positive. The classifier (which we call the principal) first publishes a classification rule before the game, and the adversary (which we call an agent) modifies the sample points to fool the rule. Assuming the classifier is unaware of the strategic change of attributes, the adversary finds the optimal strategy against the first classification rule. However, the classifier anticipates that the adversary will engage in such behavior and actually uses a new classification rule. The authors assume a single-shot game with a single principal and a single agent. They also assume one move by each of the players and that the players act sequentially. Also, the adversary is assumed to have full knowledge of the initial classification rule. They use a naive Bayes classifier (Sahami et al. 1998). Their study shows that, given the expectation that the adversary will try to alter his attributes, the principal gets better results if she instead uses an updated naive Bayes classifier (NBC) that anticipates such behavior. An optimal strategy for a classifier exists if the adversary incurs some cost when altering attributes. In their study, they show that, in the context of spam filtering, the adversary's problem can be formulated as a mixed integer problem, and they provide an algorithm to compute the classifier's strategy.

Following Dalvi et al. (2004), Lowd and Meek (2005) and Barreno et al. (2006) propose a novel technique for the adversary to learn a classification rule used by the classifier.
In the email spam example, the spammer sends a certain number of emails to the system to learn the classification rule. The objective of the adversary is not to determine the classifier perfectly but to gain some knowledge about the instances that have a high chance of not being classified as spam. The technique provides a unique way to examine the vulnerability of a system.

Boylu et al. (2008, 2010a, b, c) also study the adversarial problem but call it induction over strategic agents (ISA). They focus on finding a classifier that minimizes the classification risk while anticipating that the agents alter their characteristics strategically for a favorable classification. The agents are self-interested individuals who maximize their utility function and can adapt their decisions. Boylu et al. (2010b) assume that agents experience a disutility, expressed as a linear cost, associated with the alteration of values, and that the principal knows each agent's cost vector. Agents can choose a vector $s$ so that the attributes seen by the principal are then $x + s$. With a linear utility and no bounds on possible attribute changes, only one attribute needs to be altered when the data are strictly separable. In addition, they assume both the principal and the agents anticipate each other's behavior and act simultaneously. The game is a multi-agent, single-principal game. They show that the agents' strategic behavior affects the principal's decision and that a better classification rule for the principal needs to be determined. Also, in the base case where non-altered data can be completely separated by a hyperplane and all agents have the same utility function, they find that only marginal positive agents close to the +1 hyperplane have to alter their attributes to gain a positive classification, although both positive and negative agents are assumed to be able to alter their attribute values.
The principal will use a scaled and shifted classifier when anticipating the agents' strategic behavior. No true negative agent will achieve a positive classification by altering its attributes with regard to this shifted classifier. When agents have different cost functions and reservation utilities and the training sample is not linearly separable, the authors include an optional feature in the objective function that penalizes the positive agents' effort. They also consider the amount of bias that the agents can introduce to the classifier. They show that the optimal solution to ISA is a Nash equilibrium and give an MIP formulation for finding one.

Boylu et al. (2010a) consider the case where agents are constrained in how much they can alter their attributes in ISA (Boylu et al. 2010b). One such case is when the feasible set of moves is $\{0,1\}^n$, for example, for the spam problem. With binary alteration, earlier results found in (Boylu et al. 2010b) become inapplicable. Also, this constraint may require multiple changes instead of altering one attribute as shown in (Boylu et al. 2010b). The alteration mainly depends on attribute values, so a negative agent may also be able to change his attributes and get a positive classification. The training sample then becomes inseparable. Boylu et al. (2010a) adapt the general formulation in (Boylu et al. 2010b) for the spam problem.

Learning problems considering the strategic behavior of agents have also been studied in other settings. Individuals make decisions to maximize their utility functions through interactions with the environment and other individuals. For example, Littman and Stone (2001) study a repeated game where individuals interact with one another and play against their opponents based on past interactions. In addition, in Stackelberg games (von Stackelberg 1949) learning takes place in dynamic and changing environments. Players interact with each other and build knowledge about other players' actions.
The fact that agents interact with the principal or with one another leads us to question an assumption made in many supervised learning algorithms, namely that the data miner can rely on the given data sets for inducing classifiers without considering how the data are generated.

As we can see, most of the strategic learning literature considers cases where agents strategically change their attributes for a favorable classification. However, we often see cases where respondents refuse to provide information or intentionally hide information from principals. That is the problem this study focuses on. We study the problem in which agents can act strategically to hide or reveal attribute values. Strictly speaking, the model studied by Boylu et al. (2010b) permits agents to change attributes from $x$ to $x + s$ ($s$ is the vector of changes that an agent makes to $x$), and attribute $k$ can be hidden by choosing $s_k = -x_k$ (assuming a zero attribute value is equivalent to hiding the true value). However, while this is a possibility, it is not a requirement or constraint in their model, so hiding attributes was not the focus of that study, nor is this case covered by their results. In the spam papers mentioned above, the attributes are all binary valued, so choosing a value of zero is equivalent to hiding the attribute (which means not using a particular word in the email). So these are special cases of strategic hiding. In this work we study strategic hiding of continuous attributes. In our study, we assume a principal considers intelligent agents who hide attributes strategically.
In Chapter 5 we derive a set of default values the principal can substitute for missing values so that negative agents cannot game her classification rule and positive agents will still be classified correctly. In Chapter 6, we look at an ideal case, where the training sample size goes to infinity, and conduct a theoretical study to compare the performance of the D Method to other commonly used statistical methods for handling missing data. In Chapter 7, we conduct an experiment to compare the performance of our method with several conventional imputation methods when agents strategically hide their attributes.

CHAPTER 5
THE PRINCIPAL'S PROBLEM

Suppose the principal uses a sample $S$ of $\ell$ observations to induce her classification rule. As denoted earlier, $X$ is the instance space of possible observations. To avoid uninteresting aberrant cases, we assume there is at least one example in $S$ having $y_i = 1$ and one having $y_i = -1$. We assume $S$ is strictly linearly separable, meaning that there is a $w \in \mathbb{R}^n$ and a scalar $b$ such that the hyperplane $H(w,b) = \{x : w'x + b = 0\}$ separates $S$ so that $y_i(w'x_i + b) \ge 1$ for each example $x_i$ in $S$. Let $H$ be the set of all pairs $(w,b)$ giving such strictly separating hyperplanes where at least one positive and one negative example of $S$ yield $y_i(w'x_i + b) = 1$. In keeping with results from SLT, we assume a principal will choose from $H$ to obtain a linear discriminator to classify new observations. An observed $x \in X$ is assigned to the positive class ($y = 1$) if it satisfies $w'x + b \ge 1$ and to the negative class otherwise.

We consider the problem where the agents may strategically attempt to hide their attributes to attain a positive labeling. So an agent will choose $\delta_k \in \{0,1\}$, signifying that attribute $k$ will be published ($\delta_k = 1$) or not ($\delta_k = 0$). Thus, not all of the attributes may be observable by a principal.
In the absence of an attribute, we consider a principal who imputes the value of the attribute using a default vector, $d$. That is, if attribute $k$ is missing for an agent, the principal will assume the value is $d_k$. An astute principal assumes a worst case: that an agent knows which $(w,b)$ and default vector $d$ she will use and that the agent will act in his own best interest. We investigate ways a principal might select such a default vector.

We assume that for agent $i$ there is some disutility associated with not attaining a positive labeling, $V_i$, and a positive cost to publish or divulge attributes, $c_i \in \mathbb{R}^n$ with $c_i > 0$. We define the agent's problem as follows, given that the principal uses $(w,b)$ and $d$.

The Agent's Problem: The agent's problem is defined as

$$\min_{\delta,\, I}\; V_i I + \sum_{k=1}^{n} c_{ik}\delta_k$$
$$\text{s.t.}\quad \sum_{k=1}^{n} w_k\left(\delta_k x_{ik} + d_k - \delta_k d_k\right) + M I + b \ge 1,$$
$$\delta_k \in \{0,1\},\; k = 1,\dots,n,\qquad I \in \{0,1\}.$$

The objective of agent $i$ is to minimize total cost (i.e., disutility), which consists of the cost of publishing attributes plus the disutility of not being labeled as a positive case. The first constraint attempts to attain a positive classification from the principal. The agent decides whether to participate ($I = 0$) or not ($I = 1$) and, if participating, which attributes to hide and which to publish. Here $M$ is a sufficiently large constant, so that the first constraint is always satisfied when $I = 1$.

We now look at the principal's problem of choosing the default vector $d$. Let

$$R_i(d) = \left\{ \sum_{k=1}^{n} \left(\delta_k x_{ik} + d_k - \delta_k d_k\right) e_k : \delta \in \{0,1\}^n \right\}.$$

The elements of $R_i(d)$ are the extreme points of a hyper-rectangle defined by the possible attribute vectors a principal could see from agent $i$, using default values when necessary, and agent $i$'s possible decisions (the $\delta$ values). Note that $x_i \in R_i(d)$ for $\delta = e$ and $d \in R_i(d)$ for $\delta = 0$. Clearly $(w,b)$ correctly classifies a negative agent $i$ when using default values if

$$t \in R_i(d) \;\Rightarrow\; y_i(w't + b) \ge 1.$$

Let

$$D_i(w,b) = \left\{ d : y_i(w't + b) \ge 1 \text{ for all } t \in R_i(d) \right\}.$$

This set gives all the default vectors the principal could use that would result in a correct classification of agent $i$ regardless of the agent's decisions on hiding attribute values. We start by determining a characterization of $D_i(w,b)$. This is achieved in Proposition 1.
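For small n, both the agent's problem and membership in the set of safe default vectors can be explored by enumerating all publish patterns. The sketch below is an illustration under stated assumptions, not the author's procedure: the function names, the classifier (w, b), the cost vector, and the example agents are all hypothetical, and opting out is modeled as paying the disutility V while publishing nothing.

```python
from itertools import product
import numpy as np

def revealed_vectors(x, d):
    """Yield (delta, t) for every publish pattern delta in {0,1}^n, where
    t_k = x_k if attribute k is published and d_k otherwise."""
    for delta in product((0, 1), repeat=len(x)):
        delta = np.array(delta)
        yield delta, delta * x + (1 - delta) * d

def agent_best_response(x, w, b, d, V, c):
    """Brute-force the agent's problem; returns (cost, I, delta).

    I = 1 means the agent opts out (cost V, nothing published);
    I = 0 means the agent attains w't + b >= 1 at publishing cost c'delta.
    """
    best = (float(V), 1, np.zeros(len(x), dtype=int))
    for delta, t in revealed_vectors(x, d):
        if w @ t + b >= 1:
            cost = float(c @ delta)
            if cost < best[0]:
                best = (cost, 0, delta)
    return best

def in_D_i(d, x, y, w, b):
    """True if every vector the principal could see is classified as y."""
    return all(y * (w @ t + b) >= 1 for _, t in revealed_vectors(x, d))

# Hypothetical data (assumptions for illustration only).
w, b = np.array([1.0, 1.0]), -3.0
c = np.array([1.0, 1.0])
x_pos = np.array([3.0, 2.0])   # w'x + b = 2, a positive agent
x_neg = np.array([1.0, 0.5])   # w'x + b = -1.5, a negative agent

print(agent_best_response(x_pos, w, b, d=np.array([1.5, 1.5]), V=5.0, c=c))
print(agent_best_response(x_pos, w, b, d=np.zeros(2), V=5.0, c=c))
print(in_D_i(np.zeros(2), x_neg, -1, w, b), in_D_i(np.array([1.5, 1.5]), x_neg, -1, w, b))
```

With the more generous default vector the positive agent needs to publish only one attribute, while the same default lets the negative agent escape a negative label by hiding everything; this tension is what the choice of d has to resolve.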
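The following three examples show $D_i(w,b)$ for three cases. Figure 5-1 shows $D_i(w,b)$ when $x_i$ is a support vector (i.e., where $y_i(w'x_i + b) = 1$). Figure 5-2 shows $D_i(w,b)$ when $x_i$ is not a support vector (i.e., where $y_i(w'x_i + b) > 1$). Both of these cases have a $w$ vector without any zero terms. Figure 5-3 shows $D_i(w,b)$ for a non-support vector with $w$ having $w_2 = 0$.

Figure 5-1. $D_i(w,b)$ for a support vector $x_i$
Figure 5-2. $D_i(w,b)$ for a non-support vector $x_i$
Figure 5-3. $D_i(w,b)$ where $w_2 = 0$ for a non-support vector $x_i$

Let $(w,b) \in H$, $Z(w) = \{k : w_k = 0\}$, and let $P_i$ be the set of all coordinate-wise projections of $x_i$ on $\{x : y_i(w'x + b) = 1\}$. That is:

$$P_i = \left\{ x_i + \frac{y_i - (w'x_i + b)}{w_k}\, e_k : k \notin Z(w) \right\}.$$

Define

$$E_i = \mathrm{conv}(P_i) + \left\{ y_i F w : F_{kk} \ge 0,\; F_{kj} = 0,\; k \ne j \right\} + \left\{ \sum_{k \in Z(w)} \alpha_k e_k : \alpha_k \in \mathbb{R} \right\}.$$

Lemma 1 (Characterization of $E_i$). For $(w,b) \in H$ let $N = \{k \notin Z(w) : y_i w_k d_k - y_i w_k x_{ik} < 0\}$. Then $d \in E_i$ iff

$$\sum_{k \in N} \left(y_i w_k d_k - y_i w_k x_{ik}\right) \ge 1 - y_i(w'x_i + b).$$

Proof: ($\Rightarrow$) If $d \in E_i$ then $d = g + y_i F w + a$ for some $g \in \mathrm{conv}(P_i)$, $a \in \{\sum_{k \in Z(w)} \alpha_k e_k\}$, and nonnegative diagonal matrix $F$. Write $g$ as a convex combination of the points of $P_i$ with weights $\lambda_k \ge 0$, $\sum_{k \notin Z(w)} \lambda_k = 1$. Thus

$$y_i w_k d_k - y_i w_k x_{ik} = -\lambda_k\left(y_i(w'x_i + b) - 1\right) + F_{kk} w_k^2, \quad k \notin Z(w),$$

and $0 \le \lambda_k \le 1$. Hence $y_i w_k d_k - y_i w_k x_{ik} \ge -\lambda_k\left(y_i(w'x_i + b) - 1\right)$, so

$$\sum_{k \in N}\left(y_i w_k d_k - y_i w_k x_{ik}\right) \ge -\left(y_i(w'x_i + b) - 1\right)\sum_{k \in N}\lambda_k \ge 1 - y_i(w'x_i + b),$$

since $y_i(w'x_i + b) - 1 \ge 0$ and $0 \le \sum_{k \in N}\lambda_k \le 1$.

($\Leftarrow$) We assume we have a $d$ such that $\sum_{k \in N}\left(y_i w_k d_k - y_i w_k x_{ik}\right) \ge 1 - y_i(w'x_i + b)$.

(Case 1) We assume $y_i w_h d_h - y_i w_h x_{ih} \ge 0$ for all $h \notin Z(w)$ (i.e., $N = \emptyset$). Pick any $g \in \mathrm{conv}(P_i)$, with convex weights $\lambda_k$, and let

$$a_k = \begin{cases} d_k - g_k, & k \in Z(w) \\ 0, & k \notin Z(w) \end{cases} \qquad F_{kk} = \begin{cases} y_i \dfrac{d_k - g_k}{w_k}, & k \notin Z(w) \\ 0, & k \in Z(w) \end{cases}$$

so that $a \in \{\sum_{k \in Z(w)} \alpha_k e_k\}$. Then we have $d = g + y_i F w + a$ and

$$F_{kk} = y_i\frac{d_k - g_k}{w_k} = \frac{y_i w_k d_k - y_i w_k x_{ik} + \lambda_k\left(y_i(w'x_i + b) - 1\right)}{w_k^2} \ge \frac{y_i w_k d_k - y_i w_k x_{ik}}{w_k^2} \ge 0.$$

Thus $F$ is a nonnegative diagonal matrix, so $d \in E_i$.

(Case 2) If $x_i$ is a support vector, then the condition $\sum_{k \in N}\left(y_i w_k d_k - y_i w_k x_{ik}\right) \ge 1 - y_i(w'x_i + b)$ implies $\sum_{k \in N}\left(y_i w_k d_k - y_i w_k x_{ik}\right) \ge 0$. Since every term with $k \in N$ is negative, $N$ must be empty, which reduces this case to Case 1.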
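(Case 3) Now assume $y_i(w'x_i + b) > 1$ and $y_i w_h d_h - y_i w_h x_{ih} < 0$ for at least one $h \notin Z(w)$. Then for $\theta \ge 1$ let

$$\lambda_k = \begin{cases} -\theta\,\dfrac{y_i w_k d_k - y_i w_k x_{ik}}{y_i(w'x_i + b) - 1}, & k \in N \\ 0, & \text{otherwise.} \end{cases}$$

The assumed condition $\sum_{k \in N}\left(y_i w_k d_k - y_i w_k x_{ik}\right) \ge 1 - y_i(w'x_i + b)$ implies

$$-\frac{\sum_{k \in N}\left(y_i w_k d_k - y_i w_k x_{ik}\right)}{y_i(w'x_i + b) - 1} \le 1.$$

Then define

$$\theta = \left(-\frac{\sum_{k \in N}\left(y_i w_k d_k - y_i w_k x_{ik}\right)}{y_i(w'x_i + b) - 1}\right)^{-1} \ge 1, \qquad g = x_i + \sum_{k \notin Z(w)} \lambda_k\,\frac{y_i - (w'x_i + b)}{w_k}\, e_k,$$

$$a_k = \begin{cases} d_k - g_k, & k \in Z(w) \\ 0, & k \notin Z(w) \end{cases} \qquad F_{kk} = \begin{cases} y_i \dfrac{d_k - g_k}{w_k}, & k \notin Z(w) \\ 0, & k \in Z(w). \end{cases}$$

Clearly $g \in \mathrm{conv}(P_i)$, since the $\lambda_k$ are nonnegative and sum to one. By construction we have $d = g + y_i F w + a$. Now for $k \in N$,

$$w_k^2 F_{kk} = y_i w_k (d_k - g_k) = y_i w_k d_k - y_i w_k x_{ik} + \lambda_k\left(y_i(w'x_i + b) - 1\right)$$
$$= \left(y_i w_k d_k - y_i w_k x_{ik}\right) - \theta\left(y_i w_k d_k - y_i w_k x_{ik}\right) = (1 - \theta)\left(y_i w_k d_k - y_i w_k x_{ik}\right) \ge 0.$$

For $k \notin N$, $k \notin Z(w)$, we have $w_k^2 F_{kk} = y_i w_k d_k - y_i w_k x_{ik} \ge 0$. Hence $F$ is a nonnegative diagonal matrix and thus $d \in E_i$. QED

Proposition 1: If $(w,b) \in H$ then $D_i(w,b) = E_i$.

Proof: Let $d \in E_i$. Then

$$d_k = \begin{cases} x_{ik} + \lambda_k\,\dfrac{y_i - (w'x_i + b)}{w_k} + y_i F_{kk} w_k, & k \notin Z(w) \\ x_{ik} + \alpha_k, & k \in Z(w) \end{cases}$$

where $\lambda_k \ge 0$, $\sum_{k \notin Z(w)} \lambda_k = 1$, and $\alpha_k \in \mathbb{R}$. So, for any $\delta \in \{0,1\}^n$,

$$y_i\left(w'\sum_{k=1}^{n}\left(\delta_k x_{ik} + d_k - \delta_k d_k\right)e_k + b\right) = y_i\left(\sum_{k \notin Z(w)}\left(\delta_k x_{ik} + d_k - \delta_k d_k\right)w_k + b\right)$$
$$= \sum_{k \notin Z(w)} (1 - \delta_k)\left(-\lambda_k\left(y_i(w'x_i + b) - 1\right) + F_{kk} w_k^2\right) + y_i(w'x_i + b)$$
$$= -\left(y_i(w'x_i + b) - 1\right)\sum_{k \notin Z(w)} (1 - \delta_k)\lambda_k + \sum_{k \notin Z(w)} (1 - \delta_k) F_{kk} w_k^2 + y_i(w'x_i + b).$$

Now, $y_i(w'x_i + b) \ge 1$, $y_i(w'x_i + b) - 1 \ge 0$, and $0 \le \sum_{k \notin Z(w)} (1 - \delta_k)\lambda_k \le 1$, so the last expression is at least

$$1 + \sum_{k \notin Z(w)} (1 - \delta_k) F_{kk} w_k^2 \ge 1.$$

Hence, $d \in D_i(w,b)$.

Suppose $d \notin E_i$. Then by Lemma 1, $\sum_{k \in N}\left(y_i w_k d_k - y_i w_k x_{ik}\right) < 1 - y_i(w'x_i + b)$. Pick any $g \in \mathrm{conv}(P_i)$ and define $a$ and $F$ from $d - g$ as in the proof of Lemma 1. So

$$y_i\left(w'\sum_{k=1}^{n}\left(\delta_k x_{ik} + d_k - \delta_k d_k\right)e_k + b\right) = \sum_{k \notin Z(w)}\left(\delta_k y_i w_k x_{ik} + y_i w_k d_k - \delta_k y_i w_k d_k\right) + y_i b,$$

which, after substituting the decomposition of $d$ and simplifying, reduces to

$$\sum_{k \notin Z(w)} (1 - \delta_k)\left(y_i w_k d_k - y_i w_k x_{ik}\right) + y_i(w'x_i + b).$$

Let $\delta_k = 0$ for $k \in N$ and $\delta_k = 1$ for $k \notin N$. But since $\sum_{k \in N}\left(y_i w_k d_k - y_i w_k x_{ik}\right) < 1 - y_i(w'x_i + b)$ we get

$$y_i\left(w'\sum_{k=1}^{n}\left(\delta_k x_{ik} + d_k - \delta_k d_k\right)e_k + b\right) = \sum_{k \in N}\left(y_i w_k d_k - y_i w_k x_{ik}\right) + y_i(w'x_i + b) < 1 - y_i(w'x_i + b) + y_i(w'x_i + b) = 1.$$

So, $d \notin D_i(w,b)$. QED

Proposition 1 shows the structure of $D_i(w,b)$. Now we turn to choosing $d$ values that work across many agents. We start by looking at how a principal finds a set of default vectors that can classify many agents of the same type correctly.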
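Lemma 1 and Proposition 1 together say that a default vector is safe for agent i exactly when the sum of its negative terms y_i w_k d_k - y_i w_k x_ik is at least 1 - y_i(w'x_i + b). The sketch below compares that closed-form test with brute-force enumeration of all publish patterns. It is only an illustration: the function names, the classifier, the agent, and the small numerical tolerance are assumptions, not part of the original analysis.

```python
from itertools import product
import numpy as np

TOL = 1e-9  # numerical tolerance (assumption, to absorb floating-point noise)

def in_D_i_bruteforce(d, x, y, w, b):
    """Definition of D_i(w, b): every publish pattern must leave the agent
    with y (w't + b) >= 1, where t mixes true values and defaults."""
    for delta in product((0, 1), repeat=len(x)):
        delta = np.array(delta)
        t = delta * x + (1 - delta) * d
        if y * (w @ t + b) < 1 - TOL:
            return False
    return True

def in_E_i_lemma1(d, x, y, w, b):
    """Lemma 1 test: sum over N of (y w_k d_k - y w_k x_k) >= 1 - y (w'x + b),
    where N collects the coordinates with a negative term. Coordinates with
    w_k = 0 contribute zero and are therefore never in N."""
    u = y * w * (d - x)
    return bool(u[u < 0].sum() >= 1 - y * (w @ x + b) - TOL)

# Hypothetical positive agent with one zero weight (assumption).
w, b = np.array([1.0, 0.0, -2.0]), 0.5
x, y = np.array([2.0, 7.0, 0.25]), 1.0   # y (w'x + b) = 2 >= 1

rng = np.random.default_rng(0)
trials = [x + rng.normal(scale=1.5, size=3) for _ in range(2000)]
agree = sum(in_D_i_bruteforce(d, x, y, w, b) == in_E_i_lemma1(d, x, y, w, b) for d in trials)
print(f"{agree} of {len(trials)} random default vectors agree")
```

The closed-form test needs a single pass over the n coordinates, whereas the enumeration visits 2^n publish patterns, so the characterization matters computationally as well as structurally.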
For 1,1 y define ,,iy i yyDwbDwb PAGE 61 61 If the principal choos es ,ydDwb all agents i of type y will be correctly classified with wb regardless of what information they hide. F igure 54 shows the intersection of six negative agents ,iDwb and one with two positive agents. Among these agents, one is a positive support vector to the hyperplane and another is a negative support vector. The following proposition shows the structure of ,yDwb Figure 54: ,yDwb Define ,, 0,0, :kk kkk kZwEyxxyFwFFkke Proposition 2 (Structure of ,yDwb ) Let 1,1 y and wbH then ,,yDwbEyx where PAGE 62 62 11 max 0iikikii yy kk kywxywxbkZw ywyw x kZw Proof: First, we prove ,,yDwbEyx We start by showing for all i where iyy that ,ixDwb Assume ,ixDwb for some such i By Lemma 1 1ikkikikii kNywxywxywxb But by definition of x for kZw 11 maxik ikikii yy kkx ywxywxb ywyw 1kkikikiiywxywxywxb 1kkikikiiywxywxywxb giving a contradiction if 1 N So 0 N and for each kZw 0ikkikikywxywx Pick any igconvP and let 0kk kxgkZw a kZw :kkk kZwae 0kk i k kkxg ykZw w F kZw Then we have ixgyFwa 2221 0ii kkikkiki ikkikik kki k k kkkywxb xgywxywx ywxywx Fy w www which is also a contradiction (since this shows ,ixDwb ). So ,ixDwb for all i where iyy Now, since ,ixDwb then for any nonnegative diagonal matrix F and any PAGE 63 63 :kkk kZwae we get dxyFwaEyx Hence ,idDwb by Proposition 1. So, ,,yDwbEyx Next, we show ,,yDwbEyx by proving ,,ydEyxdDwb Assume dEyx then there exists at least one hZw where 0ihh hh hydx F w For such an h the definition of hx implies there is an i such that 1hhihihiiywxywxywxb Then, for this i and h 210hhhihhihhihhihihiiwFywdywxywdywxywxb That is, 1ihhihihiiywdywxywxb If ,idDwb from Lemma 1 we have 1ihhihihii hZwywdywxywxb giving a contradiction. Thus, ,ydDwb QED The previous propositions investigate the set of default values that the principal can choose to classify agents of one type correctly regardless of their hiding decisions. 
Unfortunately, a principal cannot choose a default vector from the intersection of these cases that will work across all agents (both positive and negative ones). That is because $D^{1}(w,b) \cap D^{-1}(w,b) = \emptyset$, where $D^{y}(w,b)$ is the set of default vectors that correctly classify all agents of type $y$ regardless of what they hide. To see this, suppose $d \in D^{1}(w,b) \cap D^{-1}(w,b)$. Then for the decision to not publish any attributes, a positive agent gives $1 \cdot (w'd + b) \ge 1$ and a negative agent gives $-1 \cdot (w'd + b) \ge 1$, which gives a contradiction (recall, we assume there is at least one example having $y_i = 1$ and one having $y_i = -1$).

Since the costs of publishing attributes are positive, the natural response of negative agents to a default vector is to publish nothing, since publishing achieves nothing but incurs cost. This is a desired response of negative agents, but not of positive agents. So a principal would likely choose default vectors from $D^{-1}(w,b)$. On top of this, a principal might like a default vector that can incent positive agents to publish all their information. Let

$$\hat{D}_i(w,b) = \left\{ d : t \in R_i(d),\; y_i(w't + b) \ge 1 \;\Rightarrow\; t_k = x_{ik},\; k \notin Z(w) \right\}.$$

This set of default vectors incents agent $i$ to publish every attribute $k$ where $k \notin Z(w)$. Then define

$$\hat{D}^{y}(w,b) = \bigcap_{i : y_i = y} \hat{D}_i(w,b).$$

If the principal chooses $d \in \hat{D}^{y}(w,b)$, the default vector $d$ works across all agents of type $y$. Figure 5-5 shows a default set $\hat{D}_i(w,b)$ where a positive agent must publish everything since he will be classified as negative otherwise. This is an extreme case, but it shows that positive agents can be incented to reveal information (assuming the publication costs do not exceed opportunity costs). Proposition 3 characterizes such a set.
Figure 55 ,iDwb the set of default vectors incenting ix to reveal everything Proposition 3 ( Structure of ,iDwb ) : If wbH then ,,i iiDwbEyx where 0ii ik ik kywxb x kZw x w kZw Proof: Let ,iidEyx Then 1 n i kikkkk kywxddeb PAGE 66 66 ii kikik ikkk k ik kZw ii ik ikkk kywxb xx yFw w y wb ywxb x yFw w 211ii kiikkk kZwywxb ywxbFw Let 0,1 jj jZwjZwAee Then for any A we get 211ii kiikkk kZwywxb ywxbFw 1iiywxb For any 0,1nA at least one 0h for hZw so 211ii kiikkk kZwywxb ywxbFw 2 01kii iikkk kZwywxb ywxbFw 21ii iihhhywxbywxbFw 11iiiiywxbywxb Thus, ,,i iiDwbEyx Assume ,iidEyx Then kikikkkdxyFw implies there is at least one hZw such that 0hhF So 1 n i kikkkk kywxddeb PAGE 67 67 ii kikik ikkk k ik kZw ii ik ikkk kywxb xx yFw w y wb ywxb x yFw w 2111ii ii k kkkk kZw kZwywxbywxb Fw Choose an such that only 0h Th en the last expression becomes 221 11ii ii hhhhhhywxbywxbFwFw Thus, ,idDwb QED Now we investigate a set of default values that works across agents of the same type, such that these agents are correctly classified if they publish every attribute. The shaded cone in F igure 56 shows such a set for positive agents. P roposition 4 characterizes th is set Figure 56 ,yDwb Proposition 4 (Structure of ,yDwb ) : Let 1,1 y If wbH then ,,yDwbEyx where PAGE 68 68 1 max 0iikik yy k kywxkZw yw x kZw Proof: First, we show that ,,yDwbEyx We start with showing for all i where iyy that ,ixDwb Suppose ,ixDwb for some i then from Proposition 3 ixxyFwa for 0kik kxxkZw a kZw :kkk kZwae 0kik i k kkxx ykZw w F kZw implies for some kZw that 0kkF Then we get 20kkkikkikikFwywxywx However, by the definition of x for kZw kkikikywxywx which gives a contradiction. 
So, ,ixDwb Now, for any :kkk kZwae And any non negative, diagonal matrix F, we get for dxyFwaEyx by P roposition 3, that ,idDwb So, ,,yDwbEyx Next, we show ,,yDwbEyx Assume dEyx Then for dxyFwa where PAGE 69 69 0kk kdxkZw a kZw :kkk kZwae 0kk ik kkdx ykZw w F kZw implies for some kZw that 0hhF Then we get 20hhhhhhhFwywdywx By the definition of x there is some i, such that hhihihywxywx then 20hhhihhihihFwywdywx That is, ihhihihywdywx If ,idDwb then for all kZw 0ikik kk kydx F w which gives a contradiction. So, ,idDwb and then ,ydDwb Thus, ,,yDwbEyx QED With 1, Dwb all negative agents will be classified correctly regardless of their decisions. With 1, Dwb all positive agents will publish all their attributes and will be classified correctly. So a principal might want to find the intersection of 1, Dwb and 1, Dwb Let 11,,, DwbDwbDwb Dwb is a subset of 1, Dwb If a principal uses 1, Dwb positive agents can publish everything or not If the principal uses Dwb all positive agents must publish PAGE 70 70 everything to be classified c orrectly. With Dwb the principal restricts the strategies of positive agents and saves the cost of finding missing data. She knows what actions the positive agents will take and are more certain about her cost. So, Dwb is a more conservative set than 1, Dwb F igure 57, Dwb is the darker shaded region. The next proposition characterizes Dwb Figure 57 Dwb Proposition 5 (Structure of Dwb ) : If wbH then ,1, DwbEx where 1 max, 0kkkk k kwxwxkZw w x kZw Proof: First, we show that ,1, DwbEx We start with showing 1 xDwb and 1 xDwb By the definition of x for kZw we get kkkk kkkkwxwx wxwx Suppose 1 xDwb then from Proposition 2, xxFwa where PAGE 71 71 0kk kxxkZw a kZw :kkk kZwae 0kk k kkxx kZw w F kZw implies for some hZw 0hh hh hxx F w 2 0hhhhhhhFwwxwx which gives a contradiction. 
So, 1 xDwb Suppose 1 xDwb then from Proposition 4 with 1iyy xxFwa where 0kk kxxkZw a kZw :kkk kZwae 0kk k kkxx kZw w F kZw implies for some hZw 0hh hh hxx F w 2 0hhhhhhhFwwxwx which gives a contradiction. So 1 xDwb Now, for any nonnegative diagonal matrix F and any :kkk kZwae we get 1, dxFwaEx By Proposition 2 and 4, we know 1 ,, 0,0, :kk kkk kZwDwbxFwFFkk e PAGE 72 72 1 ,, 0,0, :kk kkk kZwDwbxFwFFkk e So, 11,, dDwbDwb Thus, ,1, DwbEx Next, we show ,1, DwbEx Assume 1, dEx then for dxFwa where 0kk kdxkZw a kZw :kkk kZwae 0kk k kkdx kZw w F kZw implies for some hZw 0hh hh hdx F w S o 2 0hhhhhhhFwwdwx By the definition of x for kZw kkkkwxwx or kkkkwxwx So, 20hhhhhhhFwwdwx or 20hhhhhhhFwwdwx However, if 1, dDwb and 1, dDwb then for kZw 20kkkkkkkFwwdwx and 20kkkkkkkFwwdwx That is, kkkkwdwx and kkkkwdwx kZw which gives a contradiction. So, 11,, dDwbDwb PAGE 73 73 Thus, ,1, DwbEx QED From Proposition 5, we see Dwb is a default set that induces positive agents to publish everything to be classified correctly while negative agents stay negative no matter what they publish. A principal uses the default set, Dwb to restrict agents strategies while anticipating possible strategic actions from all of agents. The computation of Dwb is simplified below. By Proposition 5, we have 1 max, 0kkkk k kwxwxkZw w x kZw where 1, xDwb and 1, xDwb And, by Proposition 2 and 4, we have 11 max 0iikikii yy kk kywxywxbkZw ywyw x kZw 1 max1 0iikikii yy k kywxywxbkZw yw x kZw For 1 y 11 max1 0ikiki yk kwxwxbkZw w x kZw For 1 y 11 max1 0ikiki y kkwxwxbkZw w x kZw PAGE 74 74 Thus, for kZw 0kx And, for kZw 11 max1 min1k kiki kiki i i kkx wxwxb wxwxb ww Define ,,wbHDDwb D is the set of default vectors that the principal could choose so that regardless of which wbH she uses it would classify agents correctly if positive ones publish everything. However, this set can be empty, which is demonstrated by the following example. 
In this example, we have four agents, two of which are positive and another two are negative. These four points can be separated by different hyperplanes, for instance, 11, wb and 22, wb Different hyperplanes give different Dwb So, we have 11, Dwb and 22, Dwb We can see from F igure 58 that the intersection of 11, Dwb and 22, Dwb is empty. b b b b Figure 58. E mpty D set Nonetheless, if a principal plans to use a dDwb regardless of which wbH she chooses, then she is guaranteed all negative agents refrain from PAGE 75 75 publishing any attributes and a positive agent will either publish all his attributes or refrain from publishing any if the opportunity cost is too low (i.e., if 1 n i ik kV ). So if a principal wants to choose a wbH that minimizes the risk functional then she could solve ,'max0,1',i wbH iGMinwwC wtwbb where 1:,1n i iki kGiVy C is a misclassification cost and 1,,n i kikk k k ktxdwbdwbe is the imputed attribute vector for agent i using agent is solution to his agent problem (the k values). For separable data, wb of the hardmargin solution gives an optimal solution, since negative and positive agents are classified correctly. Given 1 n i ik kV positive agent will publish enough to be classified as positive. The positive will be classified correctly no matter whether the principal uses 1, Dwb or Dwb If the principal uses Dwb then positive agents are induced to publish every attribute. Instead, if 1, Dwb is used, then positive ones are not forced to reveal every attribute. They will reveal enough attributes so that they will be classified as positive. In this sense, 1, Dwb yields more social welfare than Dwb does. PAGE 76 76 In short, we see Dwb is a set of default vecto rs that induces positive agents to publish everything to be classified correctly while negative agents stay negative ly classified no matter what they publish. A principal uses Dwb to restrict agents strategies. 
As shown above, through simple substitutions of the various definitions, we readily get 1 min1k kiki i kx wxwxb w for kZw The results apply to any wbH and separable sample S satisfying the Assumptions. Chapter 6 looks at this method and the Average and Similarity imputation methods as the sample size increases to infinity, an ideal case. PAGE 77 77 CHAPTER 6 LARGE TRAINING SAMPL E CASE In practice, a principal uses a training sa mple T to determine imputation parameters. For example, the Similarity Method searches the training sample to find an example closest to an unclassified new observation. Likewise, the Average Method computes a default average vector from a training sample to use in imputations. The new imputation methods developed earlier in C hapter 5 constructs a set also obtained from the training sample and uses a vector, d from it In practice, training sample is used by a principal to determine these imputation parameters for use on new examples. If training sample is not representative of the population, the imputation methods may suffer. We empirically study this in C hapter 7 This s ection explores some theoretical issues concerning the quality of these induction methods as the training sample size increases. As is often done, we assume the instance space X is a hypersphere around the origin with radius r (for example, see C ristianini and ShaweTaylor 2000) That is nXxxr The volume of such a hypersphere (Huber 19 82) is 21 2n nr n where is the gamma function. Consistent with our assumption of separability, we assume X is partitioned into three sets: :'1 XXxwxb I :'1 XXxwxb I PAGE 78 78 :1'1 XXxwxb I Furthermore, we assume sampling of instances is uniform over XXU and that X is not sampled. As needed below, of interest are the volumes of these three s ets defined earlier. We assume the hyperplane :'0 xwxb intersects the hypersphere. Consider the line passing through the origin intersecting the hyperplane at right angles. 
The two intersection points of this line with the hypersphere's surface are $p^+ = \frac{r w}{\|w\|}$ and $p^- = -\frac{r w}{\|w\|}$. The line intersects $\{x : w'x + b = 1\}$ at $u = \frac{1-b}{w'w}\,w$ and $\{x : w'x + b = -1\}$ at $l = -\frac{1+b}{w'w}\,w$. Using the Pythagorean rule, the radius of the sphere $X \cap \{x : w'x + b = 1\}$, denoted $r^+$, is

$r^+ = \sqrt{r^2 - u'u} = \sqrt{r^2 - \frac{(1-b)^2}{w'w}}$

and the height of the hyperspherical cap $X \cap \{x : w'x + b \ge 1\}$ is given by

$h^+ = \|p^+ - u\| = \left\| \frac{r w}{\|w\|} - \frac{(1-b)\,w}{w'w} \right\| = r - \frac{1-b}{\|w\|}$.

Likewise, the radius and height of the cap $X \cap \{x : w'x + b \le -1\}$ are

$r^- = \sqrt{r^2 - \frac{(1+b)^2}{w'w}}$ and $h^- = r - \frac{1+b}{\|w\|}$.

Using Cavalieri's principle (Kern and Bland 1948) together with these radii and the formula for the volume of a hypersphere, we can compute the volumes of the three sets. Denote the volume of $X$, $X^+$, $X^-$, or $X^0$ as $V$, $V^+$, $V^-$, or $V^0$ respectively. Then, for example, in 3 dimensions,

$V = \frac{4}{3}\pi r^3$

$V^+ = \frac{\pi h^+}{6}\left(3 (r^+)^2 + (h^+)^2\right) = \frac{\pi}{6}\left(r - \frac{1-b}{\|w\|}\right)\left(4 r^2 - \frac{2 r (1-b)}{\|w\|} - \frac{2 (1-b)^2}{w'w}\right)$

$V^- = \frac{\pi}{6}\left(r - \frac{1+b}{\|w\|}\right)\left(4 r^2 - \frac{2 r (1+b)}{\|w\|} - \frac{2 (1+b)^2}{w'w}\right)$

$V^0 = V - V^+ - V^- = \frac{4}{3}\pi r^3 - V^+ - V^-$

In $n$ dimensions, the volumes are (Li 2011)

$V = \frac{\pi^{n/2} r^n}{\Gamma(n/2 + 1)}$

$V^+ = \frac{\pi^{(n-1)/2} r^n}{\Gamma\left(\frac{n+1}{2}\right)} \int_0^{\cos^{-1}\left(\frac{r - h^+}{r}\right)} \sin^n(t)\,dt = \frac{\pi^{(n-1)/2} r^n}{\Gamma\left(\frac{n+1}{2}\right)} \int_0^{\cos^{-1}\left(\frac{1-b}{r\|w\|}\right)} \sin^n(t)\,dt$

$V^- = \frac{\pi^{(n-1)/2} r^n}{\Gamma\left(\frac{n+1}{2}\right)} \int_0^{\cos^{-1}\left(\frac{r - h^-}{r}\right)} \sin^n(t)\,dt = \frac{\pi^{(n-1)/2} r^n}{\Gamma\left(\frac{n+1}{2}\right)} \int_0^{\cos^{-1}\left(\frac{1+b}{r\|w\|}\right)} \sin^n(t)\,dt$

Figure 6-1 illustrates the various items defined above. The darker areas are the volumes of $X^+$ and $X^-$. The volume of $X$ is the sum of $V^+$, $V^-$, and $V^0$.

Figure 6-1. Volumes

When the probability of belonging to $X^+$ differs from that of belonging to $X^-$, we say the instance space is unbalanced; otherwise the instance space is called balanced. Figure 6-2 demonstrates these two cases. The left of Figure 6-2 shows the balanced instance space, while the right shows the unbalanced one.

Figure 6-2. Balanced and unbalanced data sets

As mentioned earlier, the default values found by the various imputation methods depend on the size of the training sample $l$, so we now study the stylized case where $l \to \infty$. Proofs for the following propositions are only sketched, since the ideas are rather straightforward but the details are very involved.
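The cap-volume formulas above lend themselves to a quick numerical check. The following is a minimal sketch, not part of the original study: the function names `ball_volume`, `cap_volume`, and `region_volumes` are hypothetical, and the integral of $\sin^n(t)$ is approximated with a composite Simpson rule rather than evaluated in closed form.

```python
import math

def ball_volume(n, r):
    """Volume of an n-dimensional ball of radius r."""
    return math.pi ** (n / 2) * r ** n / math.gamma(n / 2 + 1)

def cap_volume(n, r, h, steps=20000):
    """Volume of a hyperspherical cap of height h (0 <= h <= 2r).

    Uses V = pi^((n-1)/2) r^n / Gamma((n+1)/2) * int_0^phi sin^n(t) dt
    with phi = arccos((r - h) / r); the integral is approximated by a
    composite Simpson rule (an implementation choice of this sketch).
    """
    h = min(max(h, 0.0), 2.0 * r)                  # clamp to a valid height
    phi = math.acos(max(-1.0, min(1.0, (r - h) / r)))
    if phi == 0.0:
        return 0.0
    if steps % 2:                                   # Simpson needs an even count
        steps += 1
    dt = phi / steps
    total = math.sin(0.0) ** n + math.sin(phi) ** n
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * math.sin(j * dt) ** n
    integral = total * dt / 3.0
    return math.pi ** ((n - 1) / 2) * r ** n / math.gamma((n + 1) / 2) * integral

def region_volumes(w, b, r):
    """Return (V, V_plus, V_minus, V_zero) for the classifier (w, b)."""
    n = len(w)
    norm_w = math.sqrt(sum(wk * wk for wk in w))
    h_plus = r - (1.0 - b) / norm_w                 # height of the positive cap
    h_minus = r - (1.0 + b) / norm_w                # height of the negative cap
    v = ball_volume(n, r)
    v_plus = cap_volume(n, r, h_plus)
    v_minus = cap_volume(n, r, h_minus)
    return v, v_plus, v_minus, v - v_plus - v_minus
```

For $n = 3$ the result can be compared against the closed form $\pi h^2 (3r - h)/3$, and with $b = 0$ the two caps have equal volume, which corresponds to the balanced case.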
Typically all proofs will require making limiting arguments. In a few cases, we will define sets with certain characteristics and will need to sample points that fall in these sets. Even if these sets are not measurable in $X$, their proper perturbations are. We will not rigorously distinguish between the two. In addition, most proofs require that we observe a specific point, or a collection of such points, in our large sample. We will assume that $l$ is large enough so that the sample includes a point that is sufficiently close to the desired point. As with sets, we will not distinguish between the desired point and the sufficiently close sample point. We will highlight these issues during the proofs but, for presentational convenience, we will not rigorously derive the results using limiting arguments.

We study the properties of the Average imputation method first. Here the default vector is found by averaging the attribute values in the training sample. Both $X^+$ and $X^-$ are symmetrical with respect to the line passing through the poles $p^+$ and $p^-$, which we will refer to as the $w$ line. The symmetry is in the sense that any rotation of $X^+$ or $X^-$ about the $w$ line is an isometry of the set onto itself. Therefore, the average of all instances in $X^+$ lies on the $w$ line, and we call it the point $\bar{x}^+$, as shown in Figure 6-3 below. Similarly, the average of all examples in $X^-$ is denoted $\bar{x}^-$ and is also on the $w$ line. Below we argue that the Average Method approaches the weighted average of $\bar{x}^+$ and $\bar{x}^-$ as the training sample size increases.

Figure 6-3. Average Method

We consider three cases. In the first, $u = p^+ = \frac{r w}{\|w\|}$ and $l = p^- = -\frac{r w}{\|w\|}$, and the default vector estimated with the Average Method is the origin. The second consists of the special cases where either $u = p^+$ or $l = p^-$, in which case $\bar{x}^+ = u = p^+ = \frac{r w}{\|w\|}$ or $\bar{x}^- = l = p^- = -\frac{r w}{\|w\|}$ respectively. In the third case, neither $u = p^+$ nor $l = p^-$, and $\bar{x}^+$ and $\bar{x}^-$ can be computed by integrating along the rays starting from $u$ and $l$ respectively.
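The claim that the average of a large sample settles on the $w$ line can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: it draws points uniformly from the ball, rejects those falling between the two margin hyperplanes, and averages the rest. The helper names `sample_ball` and `average_default` and the fixed seed are hypothetical choices, not part of the original experiment.

```python
import math
import random

def sample_ball(n, r, rng):
    """Draw one point uniformly from the n-ball of radius r."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]     # random direction
    norm = math.sqrt(sum(v * v for v in g))
    radius = r * rng.random() ** (1.0 / n)          # radius for uniform density
    return [radius * v / norm for v in g]

def average_default(w, b, r, size, seed=0):
    """Average of `size` points sampled uniformly from X+ and X- combined.

    Points with |w'x + b| < 1 fall between the margins and are rejected,
    mirroring the assumption that this region is never sampled.
    """
    rng = random.Random(seed)
    n = len(w)
    total = [0.0] * n
    kept = 0
    while kept < size:
        x = sample_ball(n, r, rng)
        score = sum(wk * xk for wk, xk in zip(w, x)) + b
        if abs(score) >= 1.0:
            for k in range(n):
                total[k] += x[k]
            kept += 1
    return [t / size for t in total]
```

With $w = (1, 1)$ and $b = 0$ the estimate should be close to the origin; with $b = 1$ the positive region is larger, so the estimate should move along the $w$ line toward the positive side while its component orthogonal to $w$ stays near zero.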
For x the segment length is h So we get PAGE 83 83 0 hS xupud V where S is the volume of the hypersphere : Xxwxwupu Likewise, for x the segment length is h that is 1 b r w We get 0 hS xlpld V where S is the volume of the hypersphere : Xxwxwlpl Finally, the average over XX is a weighted value of x and x based on the volumes of V and V Proposition 6 ( Average Method ) : As the average imputation method produces an imputation vector v such that Case 1. If rw up w and rw lp w then 0 v Case 2. If up but lp then 0 0limhS vlpl d V ; If up but lp then 0 0limhS vupu d V PAGE 84 84 Case 3. If up and lp then VxVx v VV Proof Sketch: We focus on Case 3 with Cases 1 and 2 being trivial special cases. First, we note that the imputation vector v is the sample average for a given Since the sample is created i.i.d. sampling theory tells us that as l EvEx and 0 Varv By definition, we have XExfxxdx Conditioning we have ExExxXPxXExxXPxX We had shown earlier that Ex is on the line w It is convenient to write the integral then using h hS d V where S V is the density with respect to We note that VVV since we assume sampling of instances is uniform over XXU and that X is not sampled. Conditioning we have 0 hS xExxXupud V and similarly 0 hS xlpld V Noting that V PxX V and V PxX V and putting it all together VxVx Ex V QED. For the special case of a balanced instance space, the average of all sample points is the origin of the hypersphere, that is, the default vector is the origin (since only Cases 1 and 3 apply and then xx Figure 64 illustrates the point. PAGE 85 85 Figure 64 Average Method in t he balanced case What does Proposition 6 say about misclassifications? The answer depends on wb but a stylized consideration gives us some ideas on how bad things could be. Suppose we have a balanced dataset, as shown in Figure 64 where v is the center of the sphere implying the volume of positives and negatives are equal. 
Since vectors with missing values get imputed values, they may end up between the two margin hyperplanes. One can label these as unclassified or, more conservatively, as negative, as we now do. If one attribute is missing, the imputed instance will lie on either the horizontal ($x_1$) or vertical ($x_2$) axis through the center. All points missing attribute two will be labeled as negative (meaning all the positive cases will be misclassified). Some of the positives missing attribute one will be misclassified (call that fraction $p_1$) and no negatives will be misclassified. Finally, all positive points missing both attributes will be misclassified while none of the negatives will be. If $\rho$ is the probability of a missing attribute value, for this example we get the following overall misclassification rate:

$(1-\rho)^2 \cdot 0 + \rho(1-\rho)\,\frac{1}{2}\,(1 + p_1) + \rho^2\,\frac{1}{2}$

We note that $p_1$ depends on $w$. It is based on the volume of the positives that fall below the positive margin hyperplane when they get mapped onto $x_2$; $w$ determines that volume.

The Average Method finds a single point for the default imputation vector. The Similarity Method operates differently: the default value for an attribute depends on the known attribute values. As mentioned in Chapter 2, the Similarity Method typically uses the Euclidean distance for measuring the similarity of observations. Figure 6-5 below shows a two-dimensional case where a vector $x$ has a missing value for $x_2$. Depending on the value of $x_1$, the Similarity Method may use any point on the dashed-line segments to generate an estimate for the missing value. The reason is as follows. The Similarity Method uses the point in the training sample that is most similar to $x$ as the default vector. When Euclidean distance is used to measure the similarity between two points, only the known values are considered in the distance measure, that is,

$d(x, y) = \sqrt{\sum_{k \in I} (x_k - y_k)^2}$, $I = \{i_1, \dots, i_s\} \subseteq \{1, \dots, n\}$,

where $I$ is the set of attributes for which both points have known values.
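The distance restricted to known attributes can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: missing values are encoded as `None`, the training examples are assumed complete, and ties are broken by taking the first closest example, which is an assumption of this sketch.

```python
import math

def partial_distance(x, y):
    """Euclidean distance over the attributes known in both x and y.

    A missing value is encoded as None. If no attribute is shared,
    the distance is treated as infinite.
    """
    shared = [(a, c) for a, c in zip(x, y) if a is not None and c is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((a - c) ** 2 for a, c in shared))

def similarity_impute(x, training):
    """Fill the missing entries of x from its closest training example."""
    nearest = min(training, key=lambda t: partial_distance(x, t))
    return [t if a is None else a for a, t in zip(x, nearest)]

# Usage: x reveals only its first attribute.
training = [(0.0, 1.0), (2.0, 5.0), (3.0, -1.0)]
print(similarity_impute((1.9, None), training))
```

Every training point with the same revealed value as `x` is at distance zero, which is why, in the limit, any point on the dashed segments of Figure 6-5 can serve as the estimate.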
In Figure 6-5, $x$ only has $x_1$ as a revealed attribute, and any point on the dashed lines has the same $x_1$ and therefore a zero distance to $x$. As $l \to \infty$, the number of training points arbitrarily close to the dashed line goes to infinity. Any point that is not on the line segments has a distance to $x$ greater than zero. Thus, with the Similarity Method in this two-dimensional example, in the limit any point on the dashed line may be chosen as an estimate. The probability of choosing a positive sample or a negative one is based on the relative sizes of the dashed-line segments, as it is possible to show that the volume created by including all points that are arbitrarily close to the line has the same property.

Figure 6-5. Similarity Method

Without loss of generality, we assume a new observation $x \in X$ hides the first $s$ attributes, $0 < s \le n$. Let $y_k = 0$ for $1 \le k \le s$ and $y_k = x_k$ for $s < k \le n$, that is, $y = (0, t)$, where $t$ is a vector with the trailing $n - s$ attributes of $x$. Define $P : x \mapsto X_x$, where

$X_x = \{(z, t) : z = (z_1, z_2, \dots, z_s) \in \mathbb{R}^s, \|(z, t)\| \le r\}$.

Let

$X_x^+ = \{z \in X_x : w'z + b \ge 1\}$ and $X_x^- = \{z \in X_x : w'z + b \le -1\}$.

Let $V_p$, $V_p^+$, and $V_p^-$ be the volumes of $X_x$, $X_x^+$, and $X_x^-$ respectively.

Proposition 7 (Similarity Method): As $l \to \infty$, the Similarity imputation method chooses a value in the positive class for a default vector with probability $\frac{V_p^+}{V_p^+ + V_p^-}$ and a value in the negative class with probability $\frac{V_p^-}{V_p^+ + V_p^-}$.

Proof Sketch: Without loss of generality, assume $x \in X$ is an observation that has missing values in the first $s$ attributes. As we argued earlier, if $l$ is large enough we will observe at least one point in the neighborhood of $X_x$. (As noted earlier, we will not distinguish between $X_x$ and its neighborhood.) Now, given that we observe such a point and that we sample uniformly, the conditional probability that the point belongs to $X_x^+$ is $\frac{V_p^+}{V_p^+ + V_p^-}$, and similar arguments apply for $X_x^-$ (i.e., the associated probability is $\frac{V_p^-}{V_p^+ + V_p^-}$). QED

A special case of the above proposition is when either $X_x^+$ or $X_x^-$ is empty. Figure 6-6 illustrates such cases. The left shows that no point in the positive class has the same $x_1$ as $x$ does, that is, $X_x^+$ is empty. The right shows a case where $X_x^-$ is empty.
pV pV pV xX xX xX p ppV VV p ppV VV xX s xX xX xX p ppV VV xX p ppV VV xX xX 1x x xX xX PAGE 89 89 Figure 66 Special case one of Similarity Method It is possible that in the instance space only one point has the same 1x value as x does, that is x itself, as shown in Figure 6 7 In this case, 2 2 1xr The closest points to x are on the line closed to x Then the probability of assigning a point in the same class as x equals to one, and the probability of assigning a point in the other class is zero. To generalize, in the cases where 22 1 n k ksyr the probability equals to one when choosing a default vector in the same class, while the probability of choosing a point in the other class equals to zero. Figure 67 Special case two of Similarity Method PAGE 90 90 As for Proposition 6, determining what Proposition 7 says about misclassifications depends on wb and another stylized consideration gives us some ideas on how bad things could be. Again, suppose we have something like that shown in Figure 64 (where v is the center of the sphere) Suppose attribute two is missing. An imputati on will have the probabilities shown above of classifying the point. So 100 % V V of the negative cases will be misclassified and 100 % V V of the positive cases. This gives 11001100 %%50% 22 VV VV If both attributes are missing, one just flips a coin and also gets a 50% misclassification rate. If neither is missing, there is no misclassification. So, if is the probability of a missing attribute we get the following overall misclassification rate: 2 22222 111 011 2 012 222 In contrast, Proposition 8 below shows that as the number of examples in the training sample goes to infinity the misclassification rate goes to zero when a principal uses the Dwb method. Proposition 5 shows that where As , 0,0, :kk kkk kZwDwbxFwFFkk e 1 min1 0kk xS k kwxwxbkZw w x kZw PAGE 91 91 So for Both are well defined convex optimization problems. Using KKT conditions, the solution to the first is as follows. 
We start with KKT conditions are: Note that and Case 1: Suppose Then giving a contradiction. Case 2: Suppose and Then 1 min1 0kk xXX k kwxwxbkZw w x kZw kZw 11 minmin ,mink kk kk xX xX kkb x wxwxwxwx ww minkk xXwxwx minkk xXwxwx 2min' ..'0 '10kkfxwxwx stxxr wxb 2 2 '0 '10 20 '0 '10 ,0 xxr wxb fxw xxr wxb 2''kfwwww 2''kffwww ,0 0 f 0 0 PAGE 92 92 Thus a contradiction. Case 3: Suppose and Then and Thus, and 2 kk kreww x www and Case 4: Suppose and Then So, we have Case 4a: Assume then 0kkewww 1k kw e w 0 0 0 2'0 xxr 20 fx 20kkewwx 2kkeww x 2 2 4 kwww xx 2 2 24kwww r 0kx 0 0 20 kkewwxw 1 2kkeww x 21 1 2kwww wx b 1 b 2 2101k kw www ww 2 2 2 2211 4kwww xx r PAGE 93 93 Hence Case 4b: Assume Then, Further, we get So, where 2 22 2 22 24 2 22 4 2 22 4 2211114 12 4 4 2kk k kk k k k k kww w wwr ww ww ww wr wwww w wr ww w w ww r 2 2 44 2211 2 2k kk k kk kk kkwew w ww w r x eww ww ww ww ww ww r 1 b 21 21kwww b 2 2 2 2211 4kwww xx r 2 2 2 2212114 21k kwww www r b 22 222 2 22 2 2 24211 1kkbwbwrwwwrequa bwwr PAGE 94 94 We have three solutions for the convex problem, one is 2 kk kreww xwww and (the case and ), and the other two are for the case 0 and 2 4 2 k kk k krw x eww ww w w ww if 1 b or 1 2kkeww x if 1 b where and are shown as above. For example, suppose 1 1 w and 0 b the optimal solutions in X and X separately are 17 2 17 2 x and 17 2 17 2 x The corner point of D set is as follows. 111172737 min, 2.835 11222 x 211371757 min, 3.82 11222 x Figure 68 shows these optimal solutions. 
442 4 22 4222 2 224 4 242 4 244 2 234 2 68 222 24 41121 121211 21 131 211kk kk k kk kk kkequabwbbrw bwrbwbrbww bwrwbrbwrww rwrbwrwrw 0kx 0 0 0 PAGE 95 95 Figure 68 Example of optimal solution of convex problem Proposition 8 ( ,yDwb Method) : As the misclassification rate of the ,yDwb imputation method goes to zero for yxX Proof Sketch: Proposition 4 shows the structure of ,yDwb where 11 max 0ykk xSX kk kywxywxbkZw ywyw x kZw As 11 max 0ykk xX kk kywxywxbkZw ywyw x kZw PAGE 96 96 So for kZw 11 min 'yk kk xX kkyb x ywxywx ywyw The minimization problem is a well defined convex optimization problem. Thus, in the limit, ,yDwb covers yX QED Proposition 9 ( Dwb Method) : As the misclassification rate of the Dwb imputation method goes to zero. Proof Sketc h : Proposition 5 shows the structure of Dwb where 1 min1 0kk xS k kwxwxbkZw w x kZw As 1 min1 0kk xXX k kwxwxbkZw w x kZw So for kZw 11 minmin ,mink kk kk xX xX kkb x wxwxwxwx ww Both inner minimization problems are well defined convex optimization problems. Thus, in the limit, Dwb covers the entire instance space. QED As shown above, when the training size increases, the misclassification rate of the D M ethod converges to zero, which is not necessarily the case for the Average Method and the Similarity M ethod. In a balanced data set, as shown in Figure 6 4, the imputed PAGE 97 97 default vector is the origin, which means if X has no inter section with coordinates, then all positives are classif ied as negatives, while all negatives are classified correctly. T he performance of the Similarity Method depends on the dataset and wb For example, in the example we discussed earlier if the set is balanced, then misclassification for positives or negatives is 50%. PAGE 98 98 CHAPTER 7 EXPERIMENTAL STUDY In Chapter 5, we studied several possible sets, ,yDwb ,yDwb and Dwb that a principal might use from which to choose a default vector when imputing missing attribute values. 
Briefly, $D_y(w,b)$, $y \in \{-1, +1\}$, are the sets of default vectors for which all agents in class $y$ will be classified correctly regardless of their strategic actions. More specifically, if a principal chooses any default vector $d \in D_{+1}(w,b)$, every agent of type +1 will be positively classified without regard to the agent's strategy of publishing attributes. Likewise, if a principal uses $d \in D_{-1}(w,b)$, then every negative agent gets classified negatively regardless of any attributes he might hide. Since we assume a separable sample, these two sets have no intersection, so the principal cannot use one default vector that correctly classifies all agents despite their actions.

We posit that a conservative principal will want to classify negative agents correctly independent of their publishing strategy, since the cost of misclassifying negative agents is generally higher than that for positive agents. A natural response of a negative agent to $d \in D_{-1}(w,b)$ is then not to publish any attribute, because he gains nothing but publishing costs. However, when $d \in D_{-1}(w,b)$ is chosen, every positive agent can reveal information freely. The principal wishes to find default vectors that give positive agents an incentive to publish. The default set $\hat{D}_y(w,b)$ serves that purpose. When the principal chooses $d \in \hat{D}_{+1}(w,b)$ as a default vector, all positive agents are induced to publish all attribute values in order to get a positive classification. The principal, therefore, may want to use a default vector $d \in D(w,b) = D_{-1}(w,b) \cap \hat{D}_{+1}(w,b)$, such that all negative agents are classified correctly regardless of what information they publish and all positive agents either reveal all attribute values or get misclassified. This accomplishes the principal's two goals.

The principal can select a default vector in any of the sets $D_y(w,b)$, $\hat{D}_y(w,b)$, $y \in \{-1, +1\}$, or $D(w,b)$ to use as a default for missing data values. Note that these sets are convex cones, due to the assumed existence of at least one positive and one negative support vector.
The principal can choose either the corner point of one of these sets or any other point inside the set as a default vector, without changing decision outcomes. For instance, any $d \in D_{-1}(w,b)$, whether $d = \bar{x}$ or $d \in D_{-1}(w,b) \setminus \{\bar{x}\}$, classifies all negative agents correctly. Recall that $\bar{x}$ is the vertex of $D_{-1}(w,b)$, defined in Proposition 2 (with $y = -1$) as

$\bar{x}_k = \frac{1}{y w_k} \max_{i : y_i = y} \left( y_i w_k x_{ik} - y_i (w'x_i + b) \right) + \frac{1}{y w_k}$ for $k \notin Z(w)$, and $\bar{x}_k = 0$ for $k \in Z(w)$.

In this study, we look at the case where the principal uses the vertex of the $D(w,b)$ or $D_{-1}(w,b)$ set as a default vector for the D and DNeg methods respectively, and also the case where an interior point of these sets is chosen as a default vector.

When the principal chooses a default vector $d \in D_{-1}(w,b)$, we say she uses the D-negative method, or DNeg method. The DNeg method ensures that no negative agent will be labeled as a positive case. Note that when using the DNeg method, the principal does not restrict a positive agent's strategy set. Positive agents can rationally publish all or some of their attributes. If they publish every attribute, they get classified positive with certainty. If they choose to publish only part of their data at a lower publishing cost, they may get labeled negatively. The Agent's Problem assures us that when the opportunity cost is higher than the total publishing cost, an intelligent agent will choose to publish enough attributes so that he will not be classified incorrectly.

When the principal chooses $d \in D(w,b)$, we say she uses the D method for estimating missing values. $D(w,b)$ is a subset of $D_{-1}(w,b)$. It takes into account positive agents' strategies in addition to those of negative agents. Compared to the DNeg method, the D method confines positive agents' freedom in choosing their strategies. The method may further reduce the principal's misclassification rate, since positive agents get a correct classification when each of their attributes is known.
In the experiment described below, we investigate the performance of the principal's D and DNeg methods and several conventional methods when agents reveal or hide information strategically. The conventional imputation methods we study are the Average Method (Average), the negative-concept Average Method (AvgNeg), the Regression Method (Regression), the negative-concept Regression Method (RegNeg), the Similarity Method (Similarity), and the negative-concept Similarity Method (SimNeg). Briefly, the Average Method imputes an estimate for the missing value of a variable by calculating the mean of the observed data on that variable. The negative-concept Average Method computes a substitute for the missing value of a variable by averaging the values of that variable over observations from the negative class only. The Regression Method imputes the values predicted by regression formulations. The negative-concept Regression Method estimates the coefficients of the regression formulations from observations that belong to the negative class. The Similarity Method fills in the missing values of an example with the values of the observation that is most similar (closest) to it. The negative-concept Similarity Method looks for the negative observation that is most similar to the example having missing values.

As in any other supervised classification problem, the principal uses two data sets: a training set for inducing a classification rule and a test set for measuring the performance of the classifier. In our experiment, the training set is also used for finding the predicted values and for determining default vector sets. As is typical in supervised learning situations, it is assumed that a principal will develop a training set having correct labels and values for all attributes.
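The difference between the Average and AvgNeg methods comes down to which training rows enter the mean. The sketch below is a hedged illustration only: the function names `column_means` and `mean_impute` and the `None` encoding of hidden attributes are assumptions of this example, and the training set is assumed complete, as stated above.

```python
def column_means(rows):
    """Per-attribute mean of a list of complete observations."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]

def mean_impute(x, training, labels=None, negative_only=False):
    """Fill missing entries (None) of x with training means.

    With negative_only=True only rows labeled -1 are averaged,
    which corresponds to the negative-concept variant (AvgNeg).
    """
    if negative_only:
        rows = [row for row, label in zip(training, labels) if label == -1]
    else:
        rows = list(training)
    means = column_means(rows)
    return [m if a is None else a for a, m in zip(x, means)]

# Usage: two positive and two negative training examples.
training = [(4.0, 2.0), (2.0, 4.0), (-1.0, -3.0), (-3.0, -1.0)]
labels = [1, 1, -1, -1]
print(mean_impute((None, 1.0), training))                      # Average
print(mean_impute((None, 1.0), training, labels, True))        # AvgNeg
```

Because AvgNeg pulls the imputed value toward the negative class, it is the more conservative of the two from the principal's point of view.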
When the principal observes that an example in the test set has missing data that was strategically hidden, she uses the training set to compute estimates with one of eight methods (D, Average, Regression, Similarity, DNeg, AvgNeg, RegNeg, and SimNeg). Depending on how representative the training set is, the estimated values for missing data can be more or less accurate for the six conventional methods.

When the D and DNeg methods use a training sample to find the sets of default vectors for missing data in a test set, it matters whether the vertex of $D(w,b)$ or $D_{-1}(w,b)$ is used. The further away the default vector is from the negative hyperplane, the more conservative the classification outcome is. The vertex of the set has the shortest distance to the negative hyperplane among all points in the set. That is, for the DNeg method, using $d \in D_{-1}(w,b) \setminus \{\bar{x}\}$ as an estimate for missing values may give misclassification rates that are no greater than the rates obtained with $d = \bar{x}$. Similarly, the D method may perform better on the test set when using $d \in D(w,b) \setminus \{\bar{x}\}$ instead of the vertex $\bar{x}$ defined in Proposition 5.

In this experiment, we also analyze the impact of choosing different $d$ in $D(w,b)$ or $D_{-1}(w,b)$. We select change steps in {1, 2, 4, 6, 8} such that for the D method, $d$ varies in $\{\bar{x} - w, \bar{x} - 2w, \bar{x} - 4w, \bar{x} - 6w, \bar{x} - 8w\}$ and $d \in D_{-1}(w,b)$; for the DNeg method, $d$ varies in $\{\bar{x} - w, \bar{x} - 2w, \bar{x} - 4w, \bar{x} - 6w, \bar{x} - 8w\}$ and $d \in D(w,b)$. By choosing the default vector $d$ in this way, we let $d$ move along the line through the vertex parallel to $w$. As the change step increases, the chosen $d$ is further away from the vertex of the set $D(w,b)$ or $D_{-1}(w,b)$. We believe such a $d$ is more conservative.

In this experiment, we assume that agents know the principal may choose from eight different imputation methods but are not sure which method she will use. We also assume the agents are self-interested, utility-maximizing individuals.
So, if agents know the principal's choice, a natural response for them is to find the lowest-cost selection of attributes that gives them a positive labeling. In addition, we assume that the principal knows that the agents act strategically when publishing information but does not know which imputation method they are anticipating while solving their hiding problem (i.e., the Agent's Problem).

Experiment Design

We assume true agent attribute vectors are linearly separable. Also, all agents face the same opportunity cost and the same publishing costs. In this experiment, we set the publishing cost vector at {1, 2, 3, 4, 5, 4, 3}, so different attributes may have different costs.

Our experiment controls for several factors: opportunity level (OL), dimensionality ($n$), training set size ($l$), testing set size ($l$), agents' percent of randomly missing data (Ram), agents' imputation method (Ma), and the principal's imputation method (Mp). We use 30 replications per case.

The opportunity level (OL) indicates whether the opportunity cost is higher than the sum of all publishing costs. OL=0, or Low, represents the case where the opportunity cost is lower than the total publishing cost. In this experiment, the opportunity cost is set to 1 1 n k k in that case. An agent may then find it optimal to not publish anything, thus forgoing classification by the principal. OL=1, or High, represents the case where the opportunity cost is higher than the total publishing cost, so an agent has no reason to categorically refuse to publish attributes. For the experiment, the opportunity cost is then equal to 1 12n kn k. A negative agent may still choose not to publish any attributes if he cannot obtain a positive labeling.

The number of attributes, the dimension, is denoted $n$. We test the performance of the methods when the dimension ranges from 3 to 7. We randomly generate $n$ real numbers for each observation $x_i$. Each vector is then normalized to a value randomly chosen between zero and a specified maximum norm. The maximum norm is set to 5 in this study.
That is, we are implicitly limiting the instance space to a hypersphere, as was explicitly done in Chapter 6.

The training sample size is represented by $l$. The number of examples of each type in the training sample is $l/2$, which varies in {10, 50, 100, 500}. For example, if $l/2 = 10$, then the sample has ten positive examples and ten negative ones.

In the experiment, a prespecified classifier $(w,b)$ is used to label each example $x_i$. We set $w = (1, 2, 1.2, 2, 4, 2, 2.4)$ and $b = 0$. A randomly generated vector, after normalization to a random value between zero and five, is labeled as positive (+1) if $w'x_i + b \ge 1$ or negative (-1) if $w'x_i + b \le -1$. It is discarded otherwise. This is repeated until the required number of positive and negative cases is generated and at least one positive and one negative support vector are produced. Similarly, we generate the test set with $l/2$ positive examples and $l/2$ negative ones. The size $l/2$ of the test set also varies in {10, 50, 100, 500}.

The performance of classification methods is generally measured by their misclassification rate. The misclassification rate (TotMisc) is the fraction of observations misclassified when a classifier is used on the test set; that is, TotMisc equals the number of examples misclassified divided by the size of the test set. In this study, we also report the misclassification rates per class. The negative misclassification rate (NegMisc) is the misclassification rate for negative examples (i.e., the number of misclassified negatives divided by the total number of negatives in the test set), and the positive misclassification rate (PosMisc) is the rate for positive ones (i.e., the number of misclassified positives divided by the total number of positives in the test set). The rate of examples not classified (Notclassified) measures how many examples in the test set have missing values that cannot be imputed by the principal (e.g., if all the values are missing, the Regression Method cannot impute values).
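The sample-generation procedure and the three misclassification rates can be sketched as follows. This is an illustration under assumptions that go beyond the text: the raw components are drawn uniformly from [-1, 1] (the source only says they are random real numbers), the additional requirement of at least one support vector per class is omitted, and the names `generate_sample` and `misclassification_rates` are hypothetical.

```python
import math
import random

def generate_sample(w, b, per_class, max_norm=5.0, seed=0):
    """Generate per_class positive and per_class negative examples.

    Each raw vector is rescaled to a norm drawn uniformly from
    [0, max_norm], then labeled by the prespecified classifier (w, b);
    vectors falling strictly between the margins are discarded.
    """
    rng = random.Random(seed)
    n = len(w)
    positives, negatives = [], []
    while len(positives) < per_class or len(negatives) < per_class:
        raw = [rng.uniform(-1.0, 1.0) for _ in range(n)]   # assumed distribution
        norm = math.sqrt(sum(v * v for v in raw))
        if norm == 0.0:
            continue
        target = rng.uniform(0.0, max_norm)
        x = [target * v / norm for v in raw]
        score = sum(wk * xk for wk, xk in zip(w, x)) + b
        if score >= 1.0 and len(positives) < per_class:
            positives.append(x)
        elif score <= -1.0 and len(negatives) < per_class:
            negatives.append(x)
    return positives, negatives

def misclassification_rates(true_labels, predicted):
    """Return (TotMisc, PosMisc, NegMisc) as fractions of the test set."""
    pairs = list(zip(true_labels, predicted))
    wrong = sum(1 for t, p in pairs if t != p)
    pos = [(t, p) for t, p in pairs if t == 1]
    neg = [(t, p) for t, p in pairs if t == -1]
    pos_wrong = sum(1 for t, p in pos if t != p)
    neg_wrong = sum(1 for t, p in neg if t != p)
    return wrong / len(pairs), pos_wrong / len(pos), neg_wrong / len(neg)
```

For a three-dimensional run one might take the first three components of $w$, for example `generate_sample([1.0, 2.0, 1.2], 0.0, 10)`; whether the original experiment truncated $w$ this way is not stated and is assumed here.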
In addition, the strategic outcomes are measured. PosStr (listed as StrPos in the tables) is the rate of positive agents who act strategically, that is, who hide at least one attribute. NegStr (StrNeg in the tables) measures how many negative agents in the test set act strategically (i.e., publish at least one attribute). A natural response for the negatives is to disclose nothing, since the negatives get classified negative even if they reveal all attributes. The percent of misclassified positive strategic agents in the test set is called PosStrMisc, while NegStrMisc is the corresponding measure for negative ones. TotStrMisc is the rate of agents who act strategically but get an incorrect classification among all agents (misclassified strategic agents / test set size).

To stabilize the variance of the misclassification rates in statistical tests (Stam and Joachimsthaler 1989), we map the performance measures to 2 arcsin(sqrt(misclassification rate)). Table 7-1 summarizes the experimental design.

Table 7-1. Summary of experimental designs
Treatment                              Parameter
Replications                           30
Opportunity level (OL)                 {Low, High}
Dimensionality (n)                     {3, 4, 5, 6, 7}
Training set size                      {20, 100, 200, 1000}
Testing set size                       {20, 100, 200, 1000}
Default value change size (ds)         {1, 2, 4, 6, 8}
Randomly missing data percent (Ram)    {1%, 2%, 3%, 4%, 5%}
Agents' methods (Ma)                   {D, Average, Regression, Similarity, DNeg, AvgNeg, RegNeg, SimNeg}
Principal's methods (Mp)               {D, Average, Regression, Similarity, DNeg, AvgNeg, RegNeg, SimNeg}
Outcomes                               {TotMisc, PosMisc, NegMisc, Notclassified, TotStrMisc, PosStrMisc, NegStrMisc, StrPos, StrNeg}

In this study, we simulate the case where a principal chooses from the eight methods {D, Average, Regression, Similarity, DNeg, AvgNeg, RegNeg, SimNeg} to impute estimates for the missing data of all agents, while agents simultaneously select a method for determining which attributes to hide. To simplify exposition we will say, for example, that the agent used the D method.
What we mean by this is that the agent computed which attributes to hide by assuming that the principal used the D method. Depending on how the agents choose which attributes to hide, the study is separated into three parts. In the first part, we consider the case where all agents choose the same method, out of the eight methods available, to solve their Agent's Problem. For instance, all agents may choose the Regression Method to hide information strategically. In the second part, each agent chooses one of the eight methods, but different agents may use different methods to decide the hidden attributes. For example, one agent may select the Regression Method while another uses the D method to decide how to conceal information strategically. We call this the Randomized case. In the third part, we consider the situation where all agents conceal their attributes randomly, so no strategic moves are expected from the agents' side. The percent of randomly missing data is measured by Ram, which ranges in {1%, 2%, 3%, 4%, 5%}. We call this the Random case.

Expected Outcomes

Part One (The Same Method Case)

Controlled Factors

A full ANOVA analysis of the misclassification rate, with all experiment parameters controlled, is provided in Table 7-2. All controlled factors (dimension, training size, testing size, opportunity level, agents' methods, and the principal's methods) are significant at the 0.001 level. The interaction effects of these independent variables are also significant at that level, except for the interaction between training set size and test set size. An F value significantly greater than 1 implies that the between-groups variability is larger than the within-group variability. When the F test for a factor is significant, we can conclude that the means for the groups of the factor are significantly different.
For example, it is appropriate to claim from Table 7-2 that the training set size is a significant factor for TotMisc, although the effect of training size on TotMisc can be modified by the test size, as seen from the F test for the interaction of the two sizes.

Table 7-2. Two-way analysis for dependent variable: misclassification rate
Source            DF       Sum of Squares   Mean Square   F Value   Pr>F
Model             271      103371.3         381.4441      15298.5   <.0001
Error             306928   7652.747         0.0249
Corrected Total   307199   111024.1

R-Square   Coeff Var   Root MSE   Rate Mean
0.931      10.579      0.158      1.493

Source                        DF   Anova SS    Mean Square   F Value   Pr>F
Main effect
n                             4    78.763      19.691        789.74    <.0001
OL                            1    9941.297    9941.297      398715    <.0001
Training size                 3    66.170      22.057        884.62    <.0001
Test size                     3    4.944       1.648         66.1      <.0001
Ma                            7    2751.800    393.114       15766.6   <.0001
Mp                            7    42912.157   6130.308      245868    <.0001
Two-way interaction effect
n x OL                        4    852.294     213.074       8545.72   <.0001
n x Training size             12   19.430      1.619         64.94     <.0001
n x Test size                 12   1.015       0.085         3.39      <.0001
n x Ma                        28   430.728     15.383        616.97    <.0001
n x Mp                        28   1277.847    45.637        1830.37   <.0001
OL x Training size            3    159.836     53.279        2136.84   <.0001
OL x Test size                3    17.706      5.902         236.71    <.0001
OL x Ma                       7    19127.513   2732.502      109592    <.0001
OL x Mp                       7    243.368     34.767        1394.39   <.0001
Training size x Test size     9    0.560       0.062         2.5       0.0075
Training size x Ma            21   68.872      3.280         131.54    <.0001
Training size x Mp            21   156.013     7.429         297.96    <.0001
Test size x Ma                21   1.672       0.080         3.19      <.0001
Test size x Mp                21   13.682      0.652         26.13     <.0001
Ma x Mp                       49   25245.673   515.218       20663.8   <.0001

Similarly, we performed ANOVA tests on PosMisc, NegMisc, and NotClassified, to study in detail how these factors affect the classification rates. We also consider the effects in three cases separately, OL=Low, OL=High, and OL=Low or High, since different strategic behavior is anticipated from agents in each case. Due to space constraints, we summarize all of these ANOVA tests in Table 7-3 rather than listing every result in detail. In Table 7-3, * means that the F value is significant at the 0.0001 level; when the F value is not significant, we report the corresponding Pr>F value instead.
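As a reading aid for Table 7-2, each F value is the factor's mean square divided by the error mean square, where a mean square is a sum of squares divided by its degrees of freedom. The short sketch below recomputes one entry from the table; the function name `f_value` is hypothetical, and small differences from the printed values come from rounding in the table.

```python
def f_value(ss_factor, df_factor, ss_error, df_error):
    """F statistic: factor mean square over error mean square."""
    mean_square_factor = ss_factor / df_factor
    mean_square_error = ss_error / df_error
    return mean_square_factor / mean_square_error

# Usage: the dimensionality row (n) of Table 7-2.
f_n = f_value(78.763, 4, 7652.747, 306928)
print(round(f_n, 2))
```

Using the unrounded error mean square (7652.747 / 306928) rather than the printed 0.0249 is what brings the recomputed value into agreement with the reported 789.74.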
As shown in Table 7-3, all factors are significant for all misclassification rates in every case, except for the test set size on PosMisc when OL = High or Low. In addition, the interaction effects of different factors are mostly significant, except for those between the test size and n and between the test size and the training size. The effect of training size and dimension n on different rates is modified by the test size in some cases.

Table 7-3. Two-way analysis for dependent variables: TotMisc, PosMisc, NegMisc
OL =High or Low OL =High OL =Low Independent TotMisc PosMisc NegMisc TotMisc PosMisc NegMisc TotMisc PosMisc NegMisc Main Effect n * * * * I * * * * * * * * * 0.4073 * * * Ma * * * * Mp * * * * Two way Interaction Effect nI * * * * * n * * * * * n 0.9769 0.1571 0.5769 * 0.0231 * nMa * * * * * nMp * * * * * I * * * * * I * * * * * IMa * * * * * IMp * * * * * 0.0075 0.0375 0.8733 0.0013 * 0.2418 * Ma * * * * * Mp * * * * * Ma * * * * * Mp * * * * * MaMp * * * *

As Table 7-3 shows, almost all independent variables are significant for all performance measures in every case. First, since the opportunity level is significant in all cases, we investigate specifically how it affects the classification rates. The numbers in Table 7-3 are averaged across all experimental factors. Table 7-4 summarizes the misclassification rates for different levels of the opportunity cost; each rate is calculated as an average across all experimental factors. As expected, Table 7-4 shows that TotMisc, PosMisc, and NotClassified are lower when OL = High than when OL = Low. NegMisc is almost the same for OL = High as for OL = Low, which means that negative agents cannot fool the system. Negative agents choose to publish nothing in both cases. The minimums of all misclassification rates are zero in all cases.
Furthermore, since the case OL = Low or High is just the union of the cases OL = High and OL = Low, it has smaller TotMisc, PosMisc, and NotClassified than the case OL = Low does.

Table 7-4. Simple summary of statistics

              OL = Low or High                OL = High                       OL = Low
Variable      Mean      Minimum  Maximum      Mean      Minimum  Maximum      Mean      Minimum  Maximum
TotMisc       1.492595  0        3.14159      1.312704  0        3.14159      1.672487  0        3.14159
PosMisc       1.361122  0        3.14159      1.178118  0        3.14159      1.544127  0        3.14159
NegMisc       0.414285  0        3.14159      0.415648  0        3.14159      0.412921  0        3.14159
StrMisc       0.853205  0        2.0944       0.863833  0        2.0944       0.842577  0        2.0944
PosStrMisc    0.65403   0        1.5708       0.664018  0        1.5708       0.644042  0        1.5708
NegStrMisc    0.273526  0        1.5708       0.274822  0        1.5708       0.27223   0        1.5708
StrPos        1.275705  0        1.5708       1.316736  0        1.5708       1.234674  0        1.5708
StrNeg        0.644347  0        1.5708       0.647463  0        1.5708       0.64123   0        1.5708

As stated in the experiment design section, the misclassification rates are mapped to 2 arcsin(sqrt(misclassification rate)) for stability. For readability, we transform the rates back to percent format; Table 7-5 is the percent version of Table 7-4. As shown earlier in Table 7-3, the independent factors are mostly significant. We want to study further how each of these factors affects the rates, for example, whether TotMisc increases as the dimension decreases.
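The mapping between the transformed values and the percentages can be sketched as follows. This is a minimal illustration that assumes the inverse of 2 arcsin(sqrt(p)) is p = sin^2(y / 2); the function names are hypothetical and do not come from the original experiment code.

```python
import math


def to_angle(rate):
    """Variance-stabilizing transform: rate in [0, 1] -> 2 * arcsin(sqrt(rate))."""
    return 2.0 * math.asin(math.sqrt(rate))


def to_rate(angle):
    """Inverse transform: angle in [0, pi] -> rate in [0, 1]."""
    return math.sin(angle / 2.0) ** 2


# Mean TotMisc for OL = Low or High, as reported on the transformed scale.
mean_totmisc = to_rate(1.492595)
# Reported maxima on the transformed scale.
max_totmisc = to_rate(3.14159)
max_strmisc = to_rate(2.0944)
print(mean_totmisc, max_totmisc, max_strmisc)
```

Under this assumption the transformed mean 1.492595 maps back to roughly 46%, and the maxima 3.14159 and 2.0944 map back to roughly 100% and 75%, in line with the percent-format table.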
Table 7-5. Simple summary of statistics in percentage

              OL = Low or High              OL = High                     OL = Low
Variable      Mean    Minimum  Maximum      Mean    Minimum  Maximum      Mean    Minimum  Maximum
TotMisc       46.09%  0        100.00%      37.24%  0        100.00%      55.08%  0        100.00%
PosMisc       39.59%  0        100.00%      30.87%  0        100.00%      48.67%  0        100.00%
NegMisc       4.23%   0        100.00%      4.26%   0        100.00%      4.20%   0        100.00%
StrMisc       17.12%  0        75.00%       17.52%  0        75.00%       16.72%  0        75.00%
PosStrMisc    10.32%  0        50.00%       10.62%  0        50.00%       10.02%  0        50.00%
NegStrMisc    1.86%   0        50.00%       1.88%   0        50.00%       1.84%   0        50.00%
StrPos        35.46%  0        50.00%       37.43%  0        50.00%       33.51%  0        50.00%
StrNeg        10.03%  0        50.00%       10.12%  0        50.00%       9.93%   0        50.00%

Table 7-6 is an example of Tukey's range test comparing the impact of the dimension on TotMisc when OL = High. Recall that the dimension n varies from 3 to 7. From Table 7-6, we see that the experiment has the highest TotMisc when n = 7, and that TotMisc decreases as the dimension decreases. For each comparison, Table 7-6 lists the difference between means, the simultaneous confidence limits, and the significance level. For instance, the comparison of n = 7 and n = 6 shows that the difference between the means of the experiments is 0.012758 and that this difference is significant. Tukey grouping results are provided too. If the means of two experiments are significantly different, the experiments are grouped separately; otherwise, they are in the same group. Tukey's test uses a letter (A, B, C, etc.) for each group. In general, the experiments are compared by their means, and the results are then represented and ranked alphabetically by group letter. For example, the means of TotMisc for the experiments n = 7, 6, 5, 4, and 3 are significantly different, so each experiment is represented by a different group. The experiment n = 7 gives the highest mean of TotMisc, followed by n = 6, 5, 4, and 3.
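The minimum significant difference and the simultaneous confidence limits reported for the dimension comparison (OL = High) can be reproduced from the other reported quantities. The sketch below assumes the usual equal-group-size Tukey HSD formula, MSD = q * sqrt(MSE / N), and takes the critical value q, the error mean square, and the group size N as reported for that test; the function name is illustrative.

```python
import math


def tukey_msd(q_critical, error_mean_square, n_per_group):
    """Minimum significant difference for Tukey's HSD with equal group sizes."""
    return q_critical * math.sqrt(error_mean_square / n_per_group)


# Reported values: critical value 3.8577, error mean square 0.015115, N = 30720.
msd = tukey_msd(3.8577, 0.015115, 30720)

# Comparison of n = 7 against n = 6: reported difference between means.
diff = 0.012758
lower, upper = diff - msd, diff + msd

# A difference is significant when the interval excludes zero.
significant = not (lower <= 0.0 <= upper)
print(msd, lower, upper, significant)
```

With these inputs the interval is approximately (0.010052, 0.015464), which excludes zero, so the 7 versus 6 comparison is flagged as significant.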
Table 7-6. Tukey's Studentized Range (HSD) test for misclassification rate (OL = High)

The ANOVA Procedure for Independent Variable: Dimension

Alpha                                   0.05
Error Degrees of Freedom                153353
Error Mean Square                       0.015115
Critical Value of Studentized Range     3.8577
Minimum Significant Difference          0.0027

Comparisons significant at the 0.05 level are indicated by ***.

Dimension     Difference       Simultaneous 95%
Comparison    Between Means    Confidence Limits
7 - 6          0.012758         0.010052    0.015464   ***
7 - 5          0.030505         0.027799    0.033211   ***
7 - 4          0.0632           0.060495    0.065906   ***
7 - 3          0.115233         0.112527    0.117939   ***
6 - 7         -0.01276         -0.01546    -0.01005    ***
6 - 5          0.017748         0.015042    0.020454   ***
6 - 4          0.050443         0.047737    0.053149   ***
6 - 3          0.102476         0.09977     0.105182   ***
5 - 7         -0.03051         -0.03321    -0.0278     ***
5 - 6         -0.01775         -0.02045    -0.01504    ***
5 - 4          0.032695         0.029989    0.035401   ***
5 - 3          0.084728         0.082022    0.087434   ***
4 - 7         -0.0632          -0.06591    -0.06049    ***
4 - 6         -0.05044         -0.05315    -0.04774    ***
4 - 5         -0.0327          -0.0354     -0.02999    ***
4 - 3          0.052033         0.049327    0.054739   ***
3 - 7         -0.11523         -0.11794    -0.11253    ***
3 - 6         -0.10248         -0.10518    -0.09977    ***
3 - 5         -0.08473         -0.08743    -0.08202    ***
3 - 4         -0.05203         -0.05474    -0.04933    ***

Means with the same letter are not significantly different.

Tukey Grouping   Mean        N       Dimension
A                1.357043    30720   7
B                1.344285    30720   6
C                1.3265376   30720   5
D                1.2938424   30720   4
E                1.2418095   30720   3

Similarly, we perform a number of Tukey's tests on the other independent variables.
The summary of Tukey's test results for all controlled factors on TotMisc, PosMisc, and NegMisc in the three cases, OL = Low, OL = High, and OL = Low or High, is provided in Table 7-7.

Table 7-7. Summary of Tukey's test results for dependent variables: TotMisc, PosMisc, NegMisc
(each cell lists Group and Rank)

                OL = Low or High              OL = High                     OL = Low
Independent     TotMisc  PosMisc  NegMisc     TotMisc  PosMisc  NegMisc     TotMisc  PosMisc  NegMisc
n               A 4      A 7      A 7         A 7      A 7      A 7         A 3      A 7      A 7
                B 5      B 6      B 6         B 6      B 6      B 6         B 4      B 4      B 6
                B 3      C 5      C 5         C 5      C 5      C 5         B 5      C 5
                C 6      D 4      D 4         D 4      D 4      D 4         D 6      B 6      D 4
                C 7      E 3      E 3         E 3      E 3      E 3         E 7      C 3      E 3
Training size   A 500    A 100    A 10        A 10     A 10     A 10        A 500    A 500    A 10
                B 100    BA 50    B 50        B 50     B 50     B 50        B 100    B 100    B 50
                C 50     B 500    C 100       C 100    B 100    C 100       C 50     C 50     C 100
                D 10     C 10     D 500       D 500    C 500    D 500       D 10     D 10     D 500
Test size       A 10     A 10     A 500       A 500    A 500    A 500       A 10     A 10     A 500
                B 50     A 50     B 100       A 100    A 100    B 100       B 50     B 50     B 100
                B 100    A 100    C 50        A 50     A 50     C 50        B 100    B 100    C 50
                B 500    A 500    D 10        B 10     B 10     D 10        C 500    B 500    D 10
Ma              A 2      A 3      A 2         A 2      A 3      A 2         A 0      A 3      A 2
                B 6      B 6      B 6         B 6      B 6      A 6         B 4      B 2      A 6
                C 1      B 2      C 1         C 1      B 2      B 1         C 7      B 6      B 1
                D 0      C 1      D 3         D 3      C 1      C 3         D 2      C 1      C 3
                E 3      D 5      E 5         E 5      D 5      D 5         D 1      D 0      D 5
                F 5      E 7      F 7         F 7      E 7      E 7         E 6      E 5      E 7
                G 7      F 0      G 4         G 4      F 4      F 4         F 5      F 7      F 4
                H 4      G 4      H 0         H 0      G 0      G 0         G 3      G 4      G 0
Mp              A 7      A 0      A 3         A 7      A 0      A 3         A 7      A 0      A 3
                B 3      B 4      B 6         B 3      B 4      B 6         B 3      B 4      B 6
                C 6      C 7      B 2         C 6      C 7      B 2         C 6      C 5      B 2
                D 2      D 5      C 1         D 2      D 5      C 1         D 2      D 7      C 1
                E 0      E 1      D 5         E 0      E 1      D 5         E 0      E 1      D 5
                F 4      F 3      E 7         F 4      F 3      E 7         F 4      F 3      E 7
                G 5      G 6      F 4         G 5      G 6      F 4         G 5      G 6      F 4
                H 1      H 2      G 0         H 1      H 2      G 0         H 1      H 2      G 0
OL              A 0      A 0      A 1
                B 1      B 1      B 0

Note: 1. "0" stands for "D method", "1" for "Average", "2" for "Regression", "3" for "Similarity", "4" for "DNeg", "5" for "AvgNeg", "6" for "RegNeg", and "7" for "SimNeg". 2. The letters A, B, C, and so on, represent the groups of Tukey's range test results.
The means of the experiments with the same group letter are not significantly different. The means of different group letters are significantly different.

As shown in Table 7-7, for the case OL = High, TotMisc, PosMisc, and NegMisc decrease as the dimension decreases, the training size increases, or the test size decreases. For example, the experiment with training size 10 is in group A, training size 50 in group B, training size 100 in group C, and training size 500 in group D. Also, when OL = High, Table 7-7 shows that TotMisc, PosMisc, and NegMisc have the lowest means if agents adopt the D method. The second lowest rates are obtained with the DNeg method, followed by the SimNeg and AvgNeg methods. When the principal chooses the D or DNeg method, the highest PosMisc and the lowest NegMisc are observed.

For the case OL = Low, PosMisc and NegMisc decrease as the dimension decreases, but the training size and test size show no clear trend. When agents select the DNeg or D method, PosMisc and NegMisc are the smallest compared to the other methods. When the principal or the agents use the D method, the lowest NegMisc is observed.

When the cases OL = Low and OL = High are not separated, we notice that PosMisc and NegMisc decrease as the dimension decreases, and NegMisc decreases as the training size increases or the test size decreases. Moreover, the lowest PosMisc and NegMisc are found when agents select the D or DNeg method. The highest PosMisc and the lowest NegMisc are observed when the principal chooses either the D or the DNeg method, which follows from the combined results of the cases OL = High and OL = Low.

Table 7-7 gives general test results about the imputation methods. To test our hypotheses about strategic responses, however, we need to investigate the interaction of the agents' and the principal's methods in more detail. We use Tukey's range test for such pairwise comparisons as before, but with extra controlled factors. Table 7-8 provides an example of our strategy analysis.
It examines the performance of the agents' imputation methods in the case OL = High when the principal uses the D method. The dependent variable is TotMisc, and the independent variable is the agents' method. The principal's method is controlled at the D method, and the opportunity level is High. Table 7-8 shows that TotMisc is the lowest if agents choose the D method to hide information strategically when anticipating that the principal uses the D method to impute estimates. The agents' DNeg method gives the second lowest TotMisc when the D method is anticipated. The Similarity Method gives the highest TotMisc. From Table 7-8 we see that the agents' methods are grouped into different categories: group F is the D Method, group E is the DNeg Method, group D is the SimNeg Method, and so on. Group F (D Method) has the lowest mean, group E (DNeg Method) has the second lowest, and group A (Similarity Method) has the largest mean. So we conclude that, in the case of a high opportunity level, when the principal uses the D Method, the agents' D Method gives the lowest TotMisc.

In summary, all controlled factors in the experiment, including dimension, opportunity level, training size, test size, the agents' methods, and the principal's methods, are significant at the 0.001 level. The interaction effects of these independent variables are also significant, except for the interaction between training size and test size. In addition, for the case OL = High, TotMisc, PosMisc, and NegMisc decrease as the dimension decreases, the training size increases, or the test size decreases. For the case OL = Low, PosMisc and NegMisc decrease as the dimension decreases.
Table 7-8. Tukey's Studentized Range (HSD) test for TotMisc of agents' methods (OL = High, Principal's Method = "D method")

The ANOVA Procedure for Independent Variable: Agents' Methods

Alpha                                   0.05
Error Degrees of Freedom                19079
Error Mean Square                       0.003055
Critical Value of Studentized Range     4.28679
Minimum Significant Difference          0.0048

Comparisons significant at the 0.05 level are indicated by ***.

Agents' Methods   Difference       Simultaneous 95%
Comparison        Between Means    Confidence Limits
3 - 2              0.004728        -0.00011     0.009564
3 - 6              0.004881         0.000045    0.009718   ***
3 - 1              0.008676         0.003839    0.013513   ***
3 - 5              0.062191         0.057354    0.067028   ***
3 - 7              0.183681         0.178844    0.188518   ***
3 - 4              0.272373         0.267536    0.27721    ***
3 - 0              1.57035          1.565513    1.575187   ***
2 - 3             -0.00473         -0.00956     0.000109
2 - 6              0.000154        -0.00468     0.00499
2 - 1              0.003948        -0.00089     0.008785
2 - 5              0.057463         0.052627    0.0623     ***
2 - 7              0.178953         0.174117    0.18379    ***
2 - 4              0.267645         0.262809    0.272482   ***
2 - 0              1.565622         1.560786    1.570459   ***
6 - 3             -0.00488         -0.00972    -4.5E-05    ***
6 - 2             -0.00015         -0.00499     0.004683
6 - 1              0.003795        -0.00104     0.008631
6 - 5              0.05731          0.052473    0.062146   ***
6 - 7              0.1788           0.173963    0.183636   ***
6 - 4              0.267492         0.262655    0.272328   ***
6 - 0              1.565469         1.560632    1.570305   ***
1 - 3             -0.00868         -0.01351    -0.00384    ***
1 - 2             -0.00395         -0.00879     0.000888
1 - 6             -0.0038          -0.00863     0.001042
1 - 5              0.053515         0.048678    0.058352   ***
1 - 7              0.175005         0.170168    0.179842   ***
1 - 4              0.263697         0.25886     0.268534   ***
1 - 0              1.561674         1.556837    1.566511   ***
5 - 3             -0.06219         -0.06703    -0.05735    ***
5 - 2             -0.05746         -0.0623     -0.05263    ***
5 - 6             -0.05731         -0.06215    -0.05247    ***
5 - 1             -0.05352         -0.05835    -0.04868    ***
5 - 7              0.12149          0.116653    0.126327   ***
5 - 4              0.210182         0.205345    0.215019   ***
5 - 0              1.508159         1.503322    1.512996   ***
7 - 3             -0.18368         -0.18852    -0.17884    ***
7 - 2             -0.17895         -0.18379    -0.17412    ***
7 - 6             -0.1788          -0.18364    -0.17396    ***
7 - 1             -0.17501         -0.17984    -0.17017    ***
7 - 5             -0.12149         -0.12633    -0.11665    ***
7 - 4              0.088692         0.083855    0.093529   ***
7 - 0              1.386669         1.381832    1.391506   ***
4 - 3             -0.27237         -0.27721    -0.26754    ***
4 - 2             -0.26765         -0.27248    -0.26281    ***
4 - 6             -0.26749         -0.27233    -0.26266    ***
4 - 1             -0.2637          -0.26853    -0.25886    ***
4 - 5             -0.21018         -0.21502    -0.20535    ***
4 - 7             -0.08869         -0.09353    -0.08386    ***
4 - 0              1.297977         1.29314     1.302814   ***
0 - 3             -1.57035         -1.57519    -1.56551    ***
0 - 2             -1.56562         -1.57046    -1.56079    ***
0 - 6             -1.56547         -1.57031    -1.56063    ***
0 - 1             -1.56167         -1.56651    -1.55684    ***
0 - 5             -1.50816         -1.513      -1.50332    ***
0 - 7             -1.38667         -1.39151    -1.38183    ***
0 - 4             -1.29798         -1.30281    -1.29314    ***

Tukey Grouping   Mean       N      Agents' Method
A                1.570735   2400   3
B                1.566007   2400   2
B                1.565854   2400   6
B                1.562059   2400   1
C                1.508544   2400   5
D                1.387054   2400   7
E                1.298362   2400   4
F                0.000385   2400   0

Note: 1. "0" stands for "D method", "1" for "Average", "2" for "Regression", "3" for "Similarity", "4" for "DNeg", "5" for "AvgNeg", "6" for "RegNeg", and "7" for "SimNeg". 2. The letters A through F represent the groups of Tukey's range test results. The difference between the means of experiments with the same group letter is insignificant. The means of different group letters are significantly different.

Strategic Game

Since all agents choose the same method from the eight methods in this first case, we want to study how agents choose their strategy when anticipating the principal's method and vice versa, that is, the game strategies for both the agents and the principal. In this section, we first formulate some expected results based on the characteristics of each method and our knowledge of the controlling factors.
By using the same method as the principal, both positive and negative agents can obtain the best estimate of the principal's decision and thus find the most efficient way to reveal information strategically. When the same method is adopted, they can predict how the principal imputes estimates and provide what the principal looks for. Stated in a general way:

Hypothesis 1: No response of agents to a principal's method is better than using the same method as the principal (OL = Low or OL = High).

To support Hypothesis 1, a pairwise comparison test is needed. Tukey's range test compares the mean of one treatment with the mean of each other treatment, and it is more suitable for multiple pairwise comparisons than the t test since it controls the Type I error across comparisons. We choose α = 0.05; that is, the confidence level is set at 0.95. In each Tukey's range test, we control the principal's imputation method and then compare the agents' strategies. In general, the best method for positive agents is the one that gives the lowest positive misclassification rates, that is, the smallest PosMisc and PosStrMisc, while the best method for negative agents is the one that gives the highest negative misclassification rates, NegMisc and NegStrMisc, because the negatives benefit from being misclassified. Thus, if the Tukey's range test comparison results show that the best method for agents is the same method as the principal's, then Hypothesis 1 is supported.

Propositions 2 and 5 show that the D method and the DNeg method classify negative agents correctly regardless of their strategic behavior. Yet, as we pointed out earlier, the accuracy of the methods also depends on the sample and test set sizes. When the training size is very small, the default vector d resulting from the D or DNeg method may not be representative. When the sample size is large, we expect to see zero or very small NegMisc, StrNeg, and NegStrMisc. The negatives are classified as negative no matter what their strategy is.
Therefore, we hypothesize:

Hypothesis 2: Negative agents are indifferent among all methods in response to a principal's D or DNeg method.

A Tukey's range test can test Hypothesis 2. In the test, the principal's method is controlled at the D or DNeg method. If the test results show that the misclassification rates are not significantly different when agents select different methods, then the hypothesis is supported. Tukey's test also yields Tukey groupings. If the means of two experiments are significantly different, then the experiments are grouped separately. If the difference of the means is insignificant, they are placed in the same group. Therefore, if the agents' compared methods are in the same group, the hypothesis is supported.

The principal may choose to put more weight on NegMisc than on PosMisc or NotClassified, since she generally has a higher cost of negative misclassification. In addition, she would generally want to give agents an incentive to disclose attributes. StrPos shows how many positive agents reveal all attributes, and PosStrMisc shows the rate of positive agents who hide some attribute and then get misclassified. Thus, if we assume a principal who is more sensitive to negative misclassification rates, her choice should be the D method, as it yields lower NegMisc and StrPos compared to the other methods discussed earlier in the section. Hence,

Hypothesis 3: No response of a principal to any method of the agents is better than using the D method.

Similarly, Hypothesis 3 can be supported by Tukey's range tests. In each treatment, the principal's methods are compared while the agents' method is controlled. If the results show that the D or DNeg method gives the lowest NegMisc and StrPos in each treatment, then we say this hypothesis is supported. The above three hypotheses are expected to hold in general, no matter whether the opportunity cost is high or low. Next, we examine the effect of the opportunity level on classification rates.
To find whether these hypotheses are supported, we carry out a complete analysis of the strategic responses between the principal and agents in the cases OL = High and OL = Low. We conduct a number of Tukey's range tests and summarize the results in Table 7-9 and Table 7-10. Each of these tables includes two parts. The first part compares the agents' strategies when controlling the principal's method. The second part has the Tukey's comparison results of the principal's methods when controlling the agents' imputation strategy. Recall that 0 stands for the D method, 1 for Average, 2 for Regression, 3 for Similarity, 4 for DNeg, 5 for AvgNeg, 6 for RegNeg, and 7 for SimNeg. For example, in Table 7-9 the cell 0>4>7>5>1>2,6,3, meaning D > DNeg > SimNeg > AvgNeg > Average > Regression, RegNeg, Similarity, is the comparison result of Tukey's test on TotMisc for the agents' methods when the principal is assumed to use the D method to impute the default vector. It shows that the agents have the lowest misclassification rate if they use the D Method when the principal uses the D method. The DNeg Method gives the second lowest misclassification rate for the positives. The second part of the table compares the principal's strategies if the agents adopt a certain method. For example, the first cell in the second part, 5,4,1,0>7,6,3,2, shows that the principal has lower TotMisc if she uses any of the D, DNeg, Average, and AvgNeg methods instead of any of the other four methods. Both Table 7-9 and Table 7-10 are based on experiments with large training and test sizes (training size = test size = 1000).

As shown in Tables 7-9 and 7-10, when the principal's method is anticipated, use of the same method by all agents gives the lowest PosMisc. For instance, when the agents anticipate that the principal will use the D method, the lowest PosMisc is observed if the agents also use the D method (0>4>7>5>1>2,6,3).
When the principal is anticipated to use the Similarity Method, the result 3,0>4>7>5>1>2,6 shows that positive agents have the lowest PosMisc if they choose the Similarity Method. Inspecting all other methods, we see that the best strategy for the positive agents is to adopt the same method the principal uses. For negative agents, the best strategy is also to adopt the same method that the principal uses. For instance, in the second part of Table 7-10, 4,0>7>5>3>6,2>1 is the NegMisc comparison result when the principal uses the Average Method. The result says that, by using the Average Method, the negatives achieve the highest misclassification when they anticipate the principal using the Average Method. That is, the negative agents benefit most by strategically hiding information with the same method that the principal uses to impute the default values. Table 7-10 shows that when OL = Low the agents also benefit most if they use the method the principal uses. Therefore, Hypothesis 1 is supported.
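The ranking strings used in the strategy tables can be decoded mechanically. The following is a minimal sketch, assuming the encoding described in the text: ">" separates tiers that differ significantly, "," separates methods whose difference is insignificant, and the digits 0 to 7 map to the eight methods. The function names are hypothetical helpers, not part of the original analysis.

```python
# Method codes as defined in the notes to the strategy tables.
METHOD_NAMES = {
    0: "D", 1: "Average", 2: "Regression", 3: "Similarity",
    4: "DNeg", 5: "AvgNeg", 6: "RegNeg", 7: "SimNeg",
}


def decode_ranking(cell):
    """Split a cell such as '0>4>7>5>1>2,6,3' into ordered tiers of method names."""
    tiers = []
    for tier in cell.split(">"):
        codes = [int(code) for code in tier.split(",")]
        tiers.append([METHOD_NAMES[code] for code in codes])
    return tiers


def best_methods(cell):
    """Return the methods in the first (best-ranked) tier of a cell."""
    return decode_ranking(cell)[0]


tiers = decode_ranking("0>4>7>5>1>2,6,3")
top = best_methods("3,0>4>7>5>1>2,6")
print(tiers)
print(top)
```

A check of Hypothesis 1 for one cell then reduces to testing whether the principal's method appears in the list returned by `best_methods`. Cells that were truncated in the tables would need to be handled separately, since `int` raises `ValueError` on malformed codes.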
Table 7-9. Strategy analysis of principal and agents (OL = "High", TrainSize = TestSize = 1000)

Principal's Agents' Methods Comparison
Methods     TotMisc          PosMisc          NegMisc          NotClassified    PosStrMisc       NegStrMisc       StrPos           StrNeg
D           0>4>7>5>1,6,2,3  0>4>7>5>1>2,6,3  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0  0>4>7>5>1,6,2,3  7,6,5,4,3,2,1,0  0>4>7>5>1,6,2,3  4,0>7>5>1>6,2>3
Average     4,0>7>5>1>6,2>3  5,4,1,0>7>6,2>3  4,0>7>5>3>6,2>1  7,6,5,4,3,2,1,0  5,4,1,0>7>6,2>3  4,0>7>5>3>6,2>1  0>4>7>5>1,6,2,3  4,0>7>5>1>6,2>3
Regression  3>7,2,4,5,0,1>6  5,2,4,0>1,7>6,3  4,0>7>5>3>1>6,2  3>2,6>1>5>7,4,0  5,2,4,0>1,7>6>3  4,0>7>5>3>1>6,2  0>4>7>5>1,6,2,3  4,0>7>5>1>6,2>3
Similarity  3,0,4>7>5>1,2,6  3,0>4>7>5>1>2,6  4,0>7>5>1>6,2>3  3>2,6>1>5>7,4,0  3,0>4>7>5>1>2,6  4,0>7>5>1>6,2>3  0>4>7>5>1,6,2,3  4,0>7>5>1>6,2>3
DNeg        4,0>7>5>1,6,2,3  4,0>7>5>1,2,6,3  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0  4,0>7>5>1,6,2,3  7,6,5,4,3,2,1,0  0>4>7>5>1,6,2,3  4,0>7>5>1>6,2>3
AvgNeg      4,0>7>5 6,2>3    5,4,0>7>1>6,2>3  4,0,3>7>6,2,1>5  7,6,5,4,3,2,1,0  5,4,0>7>1>6,2>3  4,0,3>7>6,2,1>5  0>4>7>5>1,6,2,3  4,0>7>5>1>6,2>3
RegNeg      3>7,6,4,5,0,1    5,6,4,0>7,1>2>3  4,0>7>5>3>1>2,6  3>2,6>1>5>7>4>0  5,6,4,0>7,1>2>3  4,0>7>5>3>1>2,6  0>4>7>5>1,6,2,3  4,0>7>5>1>6,2>3
SimNeg      7,0>4,3>2,6>1,5  7,0>4>5>1>2,6>3  3,6,1,4,2,0,5>7  3>2,6>1>5>7>4>0  7,0>4>5>1,6,2,3  3,6,1,4,2,0>5>7  0>4>7>5>1,6,2,3  4,0>7>5>1>6,2>3

Agents'     Principal's Methods Comparison
Methods     TotMisc          PosMisc          NegMisc          NotClassified    PosStrMisc       NegStrMisc       StrPos           StrNeg
D           5,4,1,0>7,6,3,2  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0  5,4,1,0>7,6,3,2  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
Average     1>5>4,0,2,6>3>7  1>2,6>3>5>7,4,0  7,4,0>5>3>6,2,1  5,4,1,0>7,6,3,2  1>2,6>3>5>7,4,0  7,4,0>5>3>6,2,1  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
Regression  1>5>4,0,2,6>3>7  2>6>3>1>5>7,4,0  7,4,0>5>1>3>6>2  5,4,1,0>7,6,3,2  2>6>3>1>5>7,4,0  7,4,0>5>1>3>6,2  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
Similarity  1>2,6>5>4,0,3>7  3>6,2>1>5>7,4,0  7,4,0,5>1>2,6>3  5,4,1,0>7,6,3,2  3>6,2>1>5>7,4,0  7,4,0>5>1>2,6>3  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
DNeg        5,4,1>0>6,2,3>7  5,6,1,4,2,3>7>0  7,6,5,4,3,2,1,0  5,4,1,0>7,6,3,2  5,6,1,4,2>3>7>0  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
AvgNeg      5,1>4,0>6,2,3>7  5,6,1,2>3>7>4,0  4,0,7>3,5,6,1,2  5,4,1,0>7,6,3,2  5,6,1,2>3>7>4,0  4,0,7>3,5,6,1,2  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
RegNeg      1>5>4,0,6,2>3>7  6>2>3>1>5>7,4,0  7,4,0>5>1>3>2,6  5,4,1,0>7,6,3,2  6>2>3>1>7,4,0    7,4,0>5>1>3>2,6  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
SimNeg      1,5>4>0>2,6,7,3  7,6,2,1>3,5>4>0  4,0>5,1,2,6>3,7  5,4,1,0>7,6,3,2  7>6,2,1>3,5>4>0  4,0>5,1,2>3,7    7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0

Note: 1. "0" stands for "D method", "1" for "Average", "2" for "Regression", "3" for "Similarity", "4" for "DNeg", "5" for "AvgNeg", "6" for "RegNeg", and "7" for "SimNeg". 2. "," stands for "insignificant difference between misclassification rates". ">" stands for "better in terms of lower misclassification rate, and the difference is significant". For example, "1, 2" means the difference between the misclassification rates of the Average and Regression methods is insignificant. "1 > 2" means the Average Method is better than the Regression Method in terms of misclassification rate.

Tables 7-9 and 7-10 also show the comparison result 7, 6, 5, 4, 3, 2, 1, 0 on NegMisc for the agents' methods when the principal's D Method or DNeg Method is anticipated. It means that negative agents are indifferent among all methods when a principal uses the D Method or DNeg Method. Hence, Hypothesis 2 is supported.

Table 7-10. Strategy analysis of principal and agents (OL = "Low", TrainSize = TestSize = 1000)

Principal's Agents' Methods Comparison
Methods     TotMisc          PosMisc          NegMisc          NotClassified    PosStrMisc       NegStrMisc       StrPos           StrNeg
D           0>7,5,6,3,2,1,4  0>7,5,6,3,2,1,4  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0  0>4>7>5>1,2,6,3  7,6,5,4,3,2,1,0  0>4>7>5>1,2,6,3  0,4>7>5>1>6,2>3
Average     5>7,1>6,2,4>3>0  1>5>6,2>7>4,3>0  0,4>7>5>3>2,6>1  7,6,5,4,3,2,1,0  5,4,1,0>7>6,2>3  0,4>7>5>3>2,6>1  0>4>7>5>1,2,6,3  0,4>7>5>1>6,2>3
Regression  3>2,6,1>5>7>4>0  5,2,4,0>1,7>6>3  0,4>7>5>3>1>6,2  3>2,6>1>5>7>4>0  5,2,4,0>1,7>6,3  0,4>7>5>3>1>6,2  0>4>7>5>1,2,6,3  0,4>7>5>1>6,2>3
Similarity  3>6,2,1>5>7>4>0  3,0,4>7>5>1>6,2  0,4>7>5>1>6,2>3  3>2,6>1>5>7>4>0  3,0,4>7>5>1>6,2  0,4>7>5>1>6,2>3  0>4>7>5>1,2,6,3  0,4>7>5>1>6,2>3
DNeg        4>7>5,0,1,6,2,3  4>7>5,0,1,6,2,3  7,6,5,0,3,2,1,4  7,6,5,4,3,2,1,0  4,0>7>5>1,2,6,3  7,6,5,3,2,1,0,4  0>4>7>5>1,2,6,3  0,4>7>5>1>6,2>3
AvgNeg      5>7>4>1,6,2      5>7>4>1,6,2>3>0  0,4,3>7>2,6,1>5  7,6,5,4,3,2,1,0  5,4,0>7>1>6,2>3  0,4,3>7>2,6,1>5  0>4>7>5>1,2,6,3  0,4>7>5>1>6,2>3
RegNeg      3>6,2,1>5>7>4>0  5,6,4,0>7,1>2>3  0,4>7>5>3>1>2,6  3>2,6>1>5>7>4>0  5,6,4,0>7,1>2,3  0,4>7>5>3>1>2,6  0>4>7>5>1,2,6,3  0,4>7>5>1>6,2>3
SimNeg      3>2,6,7>1>4>5>0  7,0>4>5>1>6,2>3  3,6,1,4,2,0,5,7  3>2,6>1>5>7>4>0  7,0>4>5>1>6,2,3  3,6,1,4,2,0,5>7  0>4>7>5>1,2,6,3  0,4>7>5>1>6,2>3

Agents'     Principal's Methods Comparison
Methods     TotMisc          PosMisc          NegMisc          NotClassified    PosStrMisc       NegStrMisc       StrPos           StrNeg
D           5,4,1,0>7,6,3,2  7,6,3,2>5,4,1,0  7,6,5,4,3,2,1,0  5,4,1,0>7,6,3,2  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
Average     1>5>4,0,2,6>3>7  2,6>1>3>5>7>4,0  7,4,0>5>3>6,2,1  5,4,1,0>7,6,3,2  1>2>6>3>5>7,4,0  7,4,0>5>3>6,2,1  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
Regression  1>5>4,0,2,6>3>7  2>6>3>1>5>7>4,0  7,4,0>5>1>3>6>2  5,4,1,0>7,6,3,2  2>6>3>1>5>7,4,0  7,4,0>5>1>3>6>2  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
Similarity  1>2,6>5>4,0,3>7  3>2,6>1>5>7,4,0  7,4,0,5>1>6,2>3  5,4,1,0>7,6,3,2  3>2,6>1>5>7,4,0  7,4,0,5>1>6,2>3  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
DNeg        5,4,1>0>6,2,3,7  6,2,3>7>5,4,1>0  7,0,5,6,3,2,1,4  5,4,1,0>7,6,3,2  5,6,1,4,2,3>7>0  7,0,5,6,3,2,1,4  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
AvgNeg      5,1>4,0>6,2>3>7  6,2>3>5,1>7>4,0  4,0,7>3,5,6,1,2  5,4,1,0>7,6,3,2  5,6,1,2>3>7,4,0  4,0,7>3,5,6,1,2  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
RegNeg      1>5>4,0,6,2>3>7  6>2>3>1>5>7>4,0  7,4,0>5>1>3>2,6  5,4,1,0>7,6,3,2  6>2>3>1>5>7,4,0  7,4,0>5>1>3>2,6  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0
SimNeg      1,5>4>0>2,6,7,3  7,2,6,3>1,5>4>0  4,0>5            5,4,1,0>7,6,3,2  7,2,6,1>3,5>4,0  4,0>5>1,2,6,3,7  7,6,5,4,3,2,1,0  7,6,5,4,3,2,1,0

Note: 1. "0" stands for "D method", "1" for "Average", "2" for "Regression", "3" for "Similarity", "4" for "DNeg", "5" for "AvgNeg", "6" for "RegNeg", and "7" for "SimNeg". 2. "," stands for "insignificant difference between misclassification rates". ">" stands for "better in terms of lower misclassification rate, and the difference is significant". For example, "1, 2" means the difference between the misclassification rates of the Average and Regression methods is insignificant. "1 > 2" means the Average Method is better than the Regression Method in terms of misclassification rate.

Regarding Hypothesis 3, we need to check the lower part of Tables 7-9 and 7-10, which shows the comparison results of the principal's methods when the agents' method is controlled. As we see, regardless of the agents' method, the D method gives the lowest NegMisc and StrPos. For example, when OL = High and the agents are anticipated to use the SimNeg method, 4, 0 > 5, 1, 2, 6 > 3, 7 for NegMisc and 7, 6, 5, 4, 3, 2, 1, 0 for StrPos mean that the D and DNeg methods give the lowest negative misclassification and that all methods give the same StrPos. A similar conclusion holds in the OL = Low case. Therefore, Hypothesis 3 is supported.

Different Default Vectors

After analyzing the strategic responses of the principal and agents, we examine how the performance rates vary when the chosen default vector in the D or DNeg method moves further away from the vertex of D or 1D. We discuss two cases, OL = High and OL = Low, for the D and DNeg methods.
Table 7-11 provides the summary of different performance measurements for the D method in the OL = High case. For readability, the performance measurements are transformed back to percentages, which can be found in Table 7-12. Tables 7-13 and 7-14 (in percent format) are for the D method in the OL = Low case; Tables 7-15 and 7-16 (in percent format) are for the DNeg method in the OL = High case; Tables 7-17 and 7-18 (in percent format) are for the DNeg method in the OL = Low case. Notice that in these tables the sample size is not controlled. As shown in Table 7-11, the vertex point of the D set performs as well as other points along the w direction. The performance measurements are stable as the default vector moves further away from the vertex. Table 7-12 provides the same information in percentage format.

Table 7-11. Comparison of chosen default vector of D method (OL = High)

        TotMisc    PosMisc    NegMisc   TotStrMisc  PosStrMisc  NegStrMisc  StrPos     StrNeg
D       0.689368   1.254881   0.00003   0.6893675   0.689346    0.00002     0.697976   0.287761
D w     0.694968   1.262993   0         0.694968    0.694968    0           0.697976   0.287761
D 2w    0.696724   1.2655     0         0.6967237   0.696724    0           0.697976   0.287761
D 4w    0.697846   1.267096   0         0.6978455   0.697846    0           0.697976   0.287761
D 6w    0.697976   1.267282   0         0.6939869   0.697976    0           0.697976   0.287761
D 8w    0.697976   1.267282   0         0.6979763   0.697976    0           0.697976   0.287761

Table 7-12. Comparison of chosen default vector of D method (OL = High), in percent

        TotMisc  PosMisc  NegMisc  TotStrMisc  PosStrMisc  NegStrMisc  StrPos   StrNeg
D       11.42%   34.47%   0.00%    11.42%      11.42%      0.00%       11.69%   2.06%
D w     11.60%   34.85%   0.00%    11.60%      11.60%      0.00%       11.69%   2.06%
D 2w    11.65%   34.97%   0.00%    11.65%      11.65%      0.00%       11.69%   2.06%
D 4w    11.69%   35.05%   0.00%    11.69%      11.69%      0.00%       11.69%   2.06%
D 6w    11.69%   35.06%   0.00%    11.56%      11.69%      0.00%       11.69%   2.06%
D 8w    11.69%   35.06%   0.00%    11.69%      11.69%      0.00%       11.69%   2.06%

Figure 7-1. Comparison of chosen default vector of D method (OL = High)

Table 7-13. Comparison of chosen default vector of D method (OL = Low)

        TotMisc     PosMisc    NegMisc   TotStrMisc  PosStrMisc  NegStrMisc  StrPos     StrNeg
D       1.5700728   3.135284   1.7E-05   0.578413    0.578401    1.19E-05    0.581986   0.28499
D w     1.5707168   3.140582   0         0.581458    0.581458    0           0.581986   0.28499
D 2w    1.5707788   3.141291   0         0.581839    0.581839    0           0.581986   0.28499
D 4w    1.5708      3.14159    0         0.581986    0.581986    0           0.581986   0.28499
D 6w    1.5708      3.14159    0         0.581986    0.581986    0           0.581986   0.28499
D 8w    1.5708      3.14159    0         0.581986    0.581986    0           0.581986   0.28499

As shown above, the vertex point of the D set performs as well as the other chosen default vectors along the w ray. Next, we check how different default vectors in the D set perform when the opportunity level is low.

Table 7-14. Comparison of chosen default vector of D method (OL = Low), in percent

        TotMisc  PosMisc  NegMisc  TotStrMisc  PosStrMisc  NegStrMisc  StrPos  StrNeg
D       49.96%   100.00%  0.00%    8.13%       8.13%       0.00%       8.23%   2.02%
D w     50.00%   100.00%  0.00%    8.22%       8.22%       0.00%       8.23%   2.02%
D 2w    50.00%   100.00%  0.00%    8.23%       8.23%       0.00%       8.23%   2.02%
D 4w    50.00%   100.00%  0.00%    8.23%       8.23%       0.00%       8.23%   2.02%
D 6w    50.00%   100.00%  0.00%    8.23%       8.23%       0.00%       8.23%   2.02%
D 8w    50.00%   100.00%  0.00%    8.23%       8.23%       0.00%       8.23%   2.02%

Figure 7-2. Comparison of chosen default vector of D method (OL = Low)

The results show that when the opportunity level is low, the performance of different default vectors in the D set is similar. Thus, moving away from the vertex, default vectors chosen along w do not necessarily give better performance.

For the DNeg method in the OL = High case, the performance measurements for negative agents are stable as the default vector moves away from the vertex, but those for positive agents increase. This suggests that choosing the vertex as the default vector generates the best NegMisc and PosMisc. Any default vector further away from the vertex can only yield poorer performance on positive agents.
This amounts to penalizing positive agents for no gain. For the OL = Low case, a similar penalty exists.

Table 7-15. Comparison of chosen default vector of DNeg method (OL = High)
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
DNeg      0.47914     0.88752     0.012678    0.479145    0.474393    0.008898    0.69797     0.28776
DNeg w    0.57770     1.08823     0           0.577701    0.577701    0           0.69797     0.28776
DNeg 2w   0.62968     1.16722     0           0.629689    0.629689    0           0.69797     0.28776
DNeg 4w   0.67054     1.22799     0           0.670547    0.670547    0           0.69797     0.28776
DNeg 6w   0.68736     1.25217     0           0.687367    0.687367    0           0.69797     0.28776
DNeg 8w   0.69548     1.26373     0           0.69548     0.69548     0           0.69797     0.28776

Table 7-16. Comparison of chosen default vector of DNeg method (OL = High), in percent
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
DNeg      5.63%       18.43%      0.00%       5.63%       5.52%       0.00%       11.69%      2.06%
DNeg w    8.11%       26.80%      0.00%       8.11%       8.11%       0.00%       11.69%      2.06%
DNeg 2w   9.59%       30.36%      0.00%       9.59%       9.59%       0.00%       11.69%      2.06%
DNeg 4w   10.83%      33.19%      0.00%       10.83%      10.83%      0.00%       11.69%      2.06%
DNeg 6w   11.35%      34.34%      0.00%       11.35%      11.35%      0.00%       11.69%      2.06%
DNeg 8w   11.61%      34.89%      0.00%       11.61%      11.61%      0.00%       11.69%      2.06%

Figure 7-3. Comparison of chosen default vector of DNeg method (OL = High)

In summary, in both the OL = High and OL = Low cases, choosing the vertex of the D set as the default vector in the D Method performs about as well as choosing default vectors further away along the w ray. For the DNeg set, as the default vector moves further away from the vertex of the 1D set, the performance measurements on positive agents get worse while those on negative agents remain stable. Thus, the vertex points of the D and 1D sets perform quite well.

Table 7-17.
Comparison of chosen default vector of DNeg method (OL = Low)
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
DNeg      1.51048     2.91450     0.012218    0.469299    0.464657    0.008563    0.58198     0.28499
DNeg w    1.55954     3.08439     0           0.546521    0.546521    0           0.58198     0.28499
DNeg 2w   1.56719     3.11811     0           0.568444    0.568444    0           0.58198     0.28499
DNeg 4w   1.57021     3.13585     0           0.578527    0.578527    0           0.58198     0.28499
DNeg 6w   1.57071     3.14031     0           0.581378    0.581378    0           0.58198     0.28499
DNeg 8w   1.5708      3.14159     0           0.581986    0.581986    0           0.58198     0.28499

Table 7-18. Comparison of chosen default vector of DNeg method (OL = Low), in percent
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
DNeg      46.99%      98.72%      0.00%       5.41%       5.30%       0.00%       8.23%       2.02%
DNeg w    49.44%      99.92%      0.00%       7.28%       7.28%       0.00%       8.23%       2.02%
DNeg 2w   49.82%      99.99%      0.00%       7.86%       7.86%       0.00%       8.23%       2.02%
DNeg 4w   49.97%      100.00%     0.00%       8.14%       8.14%       0.00%       8.23%       2.02%
DNeg 6w   50.00%      100.00%     0.00%       8.21%       8.21%       0.00%       8.23%       2.02%
DNeg 8w   50.00%      100.00%     0.00%       8.23%       8.23%       0.00%       8.23%       2.02%

Figure 7-4. Comparison of chosen default vector of DNeg method (OL = Low)

Experiment Outcomes: Part Two (The Randomized Case)

Recall that in the randomized case different agents may use different methods to decide which attributes to reveal. In this part, instead of analyzing the strategic play of a principal and agents, we compare the principal's methods when the agents' method cannot be anticipated. An ANOVA test is performed to examine the effects of the factors and their interactions.
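The summary columns of the ANOVA tables in this chapter (F value, R-square, Root MSE, coefficient of variation) follow from the sums of squares by standard identities. The sketch below recomputes them from the Table 7-19 figures as a consistency check. The function name `anova_summary` and its argument names are illustrative assumptions, not part of the original analysis.

```python
import math

def anova_summary(ss_model, df_model, ss_error, df_error, grand_mean):
    """Recompute standard ANOVA summary statistics from sums of squares.

    Illustrative helper (hypothetical name): uses the textbook identities
    MS = SS / DF, F = MS_model / MS_error, R^2 = SS_model / SS_total.
    """
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    root_mse = math.sqrt(ms_error)
    return {
        "F": ms_model / ms_error,
        "R2": ss_model / (ss_model + ss_error),
        "RootMSE": root_mse,
        "CoeffVar": 100.0 * root_mse / grand_mean,
    }

# Figures copied from the Model and Error rows of Table 7-19.
summary = anova_summary(22744.84, 138, 8587.796, 38261, 1.532)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The same identities apply to each effect row: dividing an effect's mean square by the error mean square gives its F value.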
As shown in Table 7-19, all experiment factors are significant except for the test size, and the interaction effects are significant other than OL * Mp and Mp.

Table 7-19. Two-way analysis for dependent variable: misclassification rate
Source            DF      Sum of Squares  Mean Square  F Value   Pr > F
Model             138     22744.84        164.8177     734.31    <.0001
Error             38261   8587.796        0.22445
Corrected Total   38399   31332.63

R-Square   Coeff Var   Root MSE   Rate Mean
0.726      30.925      0.474      1.532

Source            DF      Anova SS        Mean Square  F Value   Pr > F
Main effect
n                 4       15.771          3.943        17.57     <.0001
OL                1       9641.835        9641.835     42957.00  <.0001
[?]               3       5.969           1.990        8.86      <.0001
[?]               3       2.655           0.885        3.94      0.008
Mp                7       12909.334       1844.191     8216.38   <.0001
Two-way interaction effect
n * OL            4       41.367          10.342       46.08     <.0001
n * [?]           12      19.487          1.624        7.24      <.0001
n * [?]           12      12.570          1.047        4.67      <.0001
n * Mp            28      40.714          1.454        6.48      <.0001
OL * [?]          3       12.695          4.232        18.85     <.0001
OL * [?]          3       2.133           0.711        3.17      0.0233
OL * Mp           7       25.104          3.586        15.98     <.0001
[?] * [?]         9       4.773           0.530        2.36      0.0115
[?] * Mp          21      8.714           0.415        1.85      0.0103
[?] * Mp          21      1.716           0.082        0.36      0.9965

We conduct Tukey's range tests for the principal's methods. The results are summarized in Tables 7-20 and 7-21. Table 7-20 shows that when the principal uses the D or DNeg method, NegMisc is the lowest and PosMisc is the highest. More specifically, NegMisc is 0 for the D method and 0.013 for the DNeg method (or 0.004% in terms of the actual rate before the mapping), which are lower than 0.402 (or 3.98%) for Similarity, 0.324 (or 2.6%) for RegNeg, 0.322 (or 2.58%) for Regression, 0.283 (or 1.99%) for Average, 0.06 (or 0.09%) for AvgNeg, and 0.031 (or 0.02%) for SimNeg. The rate of positive or negative agents who act strategically is the same regardless of the principal's method. Therefore, if a principal is more averse to negative risk, her best strategy is the D method.

Table 7-20.
The comparison results of Tukey's range test
TotMisc                PosMisc                NegMisc                NotClassified
Group  Mean   Method   Group  Mean   Method   Group  Mean   Method   Group  Mean   Method
A      2.205  7        A      2.192  0        A      0.402  3        A      1.902  2
B      2.086  3        B      1.893  4        B      0.324  6        A      1.902  3
B      2.065  6        C      1.629  5        B      0.322  2        A      1.902  6
B      2.062  2        D      1.283  1        C      0.283  1        A      1.902  7
C      1.128  0        E      0.783  7        D      0.060  5        B      0.000  0
D      0.992  4        F      0.185  3        E      0.031  7        B      0.000  1
E      0.895  5        F      0.167  6        FE     0.013  4        B      0.000  4
F      0.822  1        F      0.158  2        F      0.000  0        B      0.000  5

Table 7-21. The comparison results of Tukey's range test for strategic rates
StrMisc                PosStrMisc             NegStrMisc             StrPos                 StrNeg
Group  Mean   Method   Group  Mean   Method   Group  Mean   Method   Group  Mean   Method   Group  Mean   Method
A      0.629  0        A      0.629  0        A      0.252  3        A      0.634  0        A      0.28   0
B      0.468  4        B      0.463  4        B      0.217  6        A      0.634  1        A      0.28   1
B      0.442  7        C      0.427  7        B      0.216  2        A      0.634  2        A      0.28   2
C      0.362  5        D      0.338  5        C      0.189  1        A      0.634  3        A      0.28   3
D      0.307  3        E      0.158  1        D      0.042  5        A      0.634  4        A      0.28   4
D      0.285  1        FE     0.128  3        E      0.022  7        A      0.634  5        A      0.28   5
D      0.279  6        F      0.11   6        FE     0.009  4        A      0.634  6        A      0.28   6
D      0.274  2        F      0.104  2        F      0      0        A      0.634  7        A      0.28   7

Next, we study how different default vectors affect the misclassification rates. As in Part One, we consider two cases: OL = High and OL = Low. For each case, we check the performance of the D and DNeg methods separately. As shown in Table 7-22, which is for the D method in the OL = High case, NegMisc is very small when the vertex point of the D set is chosen as the default vector. When the default vector moves one step away from the vertex point along the w ray, NegMisc becomes zero. As the default moves further away, NegMisc stays at zero. PosMisc increases by a small amount when the chosen default is one step away from the vertex and changes little afterwards. The trends can be seen in Figure 7-5.

Table 7-22.
Comparison of chosen default vector of D method (OL = High)
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
D         0.685739    1.247173    8.35E-05    0.685739    0.68568     5.9E-05     0.69388     0.28532
D w       0.691319    1.255297    0           0.691319    0.691319    0           0.69388     0.28532
D 2w      0.692833    1.257448    0           0.692833    0.692833    0           0.69388     0.28532
D 4w      0.693835    1.258875    0           0.693835    0.693835    0           0.69388     0.28532
D 6w      0.69388     1.258939    0           0.69388     0.69388     0           0.69388     0.28532
D 8w      0.69388     1.258939    0           0.69388     0.69388     0           0.69388     0.28532

In the OL = Low case, NegMisc and PosMisc stay at 0 and 100%, respectively, no matter which default vector along the w ray in the D set is chosen. Tables 7-24 and 7-25 and Figure 7-6 show the results.

Table 7-23. Comparison of chosen default vector of D method (OL = High), in percent
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
D         11.30%      34.10%      0.00%       11.30%      11.30%      0.00%       11.56%      2.02%
D w       11.48%      34.49%      0.00%       11.48%      11.48%      0.00%       11.56%      2.02%
D 2w      11.53%      34.59%      0.00%       11.53%      11.53%      0.00%       11.56%      2.02%
D 4w      11.56%      34.66%      0.00%       11.56%      11.56%      0.00%       11.56%      2.02%
D 6w      11.56%      34.66%      0.00%       11.56%      11.56%      0.00%       11.56%      2.02%
D 8w      11.56%      34.66%      0.00%       11.56%      11.56%      0.00%       11.56%      2.02%

Figure 7-5. Comparison of chosen default vector of D method (OL = High)

Table 7-24. Comparison of chosen default vector of D method (OL = Low)
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
D         1.570226    3.136393    0           0.571691    0.571691    0           0.574643    0.275041
D w       1.570604    3.139899    0           0.573588    0.573588    0           0.574643    0.275041
D 2w      1.570764    3.141047    0           0.574486    0.574486    0           0.574643    0.275041
D 4w      1.5708      3.14159     0           0.574643    0.574643    0           0.574643    0.275041
D 6w      1.5708      3.14159     0           0.574643    0.574643    0           0.574643    0.275041
D 8w      1.5708      3.14159     0           0.574643    0.574643    0           0.574643    0.275041

Table 7-25.
Comparison of chosen default vector of D method (OL = Low), in percent
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
D         49.97%      100.00%     0.00%       7.95%       7.95%       0.00%       8.03%       1.88%
D w       49.99%      100.00%     0.00%       8.00%       8.00%       0.00%       8.03%       1.88%
D 2w      50.00%      100.00%     0.00%       8.03%       8.03%       0.00%       8.03%       1.88%
D 4w      50.00%      100.00%     0.00%       8.03%       8.03%       0.00%       8.03%       1.88%
D 6w      50.00%      100.00%     0.00%       8.03%       8.03%       0.00%       8.03%       1.88%
D 8w      50.00%      100.00%     0.00%       8.03%       8.03%       0.00%       8.03%       1.88%

Figure 7-6. Comparison of chosen default vector of D method (OL = Low)

For the DNeg method in the OL = High case, NegMisc drops to zero when the chosen default moves one step away from the vertex of the 1D set and stays at zero afterwards. PosMisc increases as the chosen default moves away along the w ray. The results are listed in Table 7-26.

Table 7-26. Comparison of chosen default vector of DNeg method (OL = High)
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
DNeg      0.47731     0.882553    0.013456    0.47731     0.472189    0.009439    0.6938      0.2853
DNeg w    0.572212    1.077901    0           0.572212    0.572212    0           0.6938      0.2853
DNeg 2w   0.62664     1.160494    0           0.62664     0.62664     0           0.6938      0.2853
DNeg 4w   0.666149    1.219256    0           0.666149    0.666149    0           0.6938      0.2853
DNeg 6w   0.683527    1.244198    0           0.683527    0.683527    0           0.6938      0.2853
DNeg 8w   0.691114    1.255011    0           0.691114    0.691114    0           0.6938      0.2853

Table 7-27. Comparison of chosen default vector of DNeg method (OL = High), in percent
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
DNeg      5.59%       18.24%      0.00%       5.59%       5.47%       0.00%       11.56%      2.02%
DNeg w    7.96%       26.34%      0.00%       7.96%       7.96%       0.00%       11.56%      2.02%
DNeg 2w   9.50%       30.06%      0.00%       9.50%       9.50%       0.00%       11.56%      2.02%
DNeg 4w   10.69%      32.78%      0.00%       10.69%      10.69%      0.00%       11.56%      2.02%
DNeg 6w   11.23%      33.96%      0.00%       11.23%      11.23%      0.00%       11.56%      2.02%
DNeg 8w   11.47%      34.47%      0.00%       11.47%      11.47%      0.00%       11.56%      2.02%

Figure 7-7.
Comparison of chosen default vector of DNeg method (OL = High)

In the OL = Low case, the performance measures change in much the same way as for the D method in the OL = Low case when the default vector moves away from the vertex. NegMisc decreases slightly at the first step of the default vector and stays at zero afterwards. PosMisc starts at a high value and increases to 100%.

Table 7-28. Comparison of chosen default vector of DNeg method (OL = Low)
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
DNeg      1.50594     2.90384     0.012061    0.458913    0.454408    0.008466    0.57464     0.27504
DNeg w    1.55920     3.08418     0           0.538691    0.538691    0           0.57464     0.27504
DNeg 2w   1.56718     3.11794     0           0.560827    0.560827    0           0.57464     0.27504
DNeg 4w   1.56994     3.13316     0           0.569405    0.569405    0           0.57464     0.27504
DNeg 6w   1.57071     3.13999     0           0.573809    0.573809    0           0.57464     0.27504
DNeg 8w   1.5708      3.14159     0           0.574643    0.574643    0           0.57464     0.27504

Table 7-29. Comparison of chosen default vector of DNeg method (OL = Low), in percent
Default   TotMisc     PosMisc     NegMisc     TotStrMisc  PosStrMisc  NegStrMisc  StrPos      StrNeg
DNeg      46.76%      98.59%      0.00%       5.17%       5.07%       0.00%       8.03%       1.88%
DNeg w    49.42%      99.92%      0.00%       7.08%       7.08%       0.00%       8.03%       1.88%
DNeg 2w   49.82%      99.99%      0.00%       7.66%       7.66%       0.00%       8.03%       1.88%
DNeg 4w   49.96%      100.00%     0.00%       7.89%       7.89%       0.00%       8.03%       1.88%
DNeg 6w   50.00%      100.00%     0.00%       8.01%       8.01%       0.00%       8.03%       1.88%
DNeg 8w   50.00%      100.00%     0.00%       8.03%       8.03%       0.00%       8.03%       1.88%

Figure 7-8. Comparison of chosen default vector of DNeg method (OL = Low)

To summarize, a default vector that is one step away from the vertex point of the D set or the 1D set can produce zero NegMisc without increasing PosMisc too much.

Experiment Outcomes: Part Three (The Random Case)

In the random case, data are missing randomly; that is, the information is not hidden strategically. A principal can still choose one of eight methods to impute the missing information. In this part, we check how the D and DNeg methods perform.
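One way to picture the random case is as a mask applied to a complete data set, where each attribute value is dropped independently with a fixed probability. The sketch below is only an illustration under that assumption: the dissertation does not state its masking procedure here, and the function name `hide_at_random`, the use of `None` as the missing marker, and the toy data are all hypothetical.

```python
import random

def hide_at_random(rows, missing_rate, seed=0):
    """Return a copy of rows in which each cell is independently replaced
    by None with probability missing_rate (no strategic choice involved)."""
    rng = random.Random(seed)
    return [
        [None if rng.random() < missing_rate else value for value in row]
        for row in rows
    ]

# Hypothetical complete data: 1000 agents with 5 continuous attributes each.
data_rng = random.Random(42)
complete = [[data_rng.random() for _ in range(5)] for _ in range(1000)]

# Missing rates of 1% to 5%, matching the levels studied in this part.
for rate in (0.01, 0.02, 0.03, 0.04, 0.05):
    masked = hide_at_random(complete, rate, seed=1)
    n_missing = sum(value is None for row in masked for value in row)
    print(f"target {rate:.0%}: observed {n_missing / 5000:.2%} missing")
```

Because cells are dropped independently, the observed fraction only approximates the target; a design that removes an exact count of cells would be an equally plausible reading.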
Further, we examine how different default vectors along the w ray in the D or 1D set affect the performance of the methods. We also examine the effect of the percentage of randomly missing information.

Table 7-30. Two-way analysis for dependent variable: misclassification rate
Source            DF      Sum of Squares  Mean Square  F Value   Pr > F
Model             214     3798.524        17.75011     1013.14   <.0001
Error             191785  3360.069        0.01752
Corrected Total   191999  7158.593

R-Square   Coeff Var   Root MSE   Rate Mean
0.531      44.463      0.132      0.298

Source            DF      Anova SS        Mean Square  F Value   Pr > F
Main effect
n                 4       98.434          24.608       1404.59   <.0001
OL                1       0.079           0.079        4.5       0.0339
[?]               3       4.711           1.570        89.63     <.0001
[?]               3       187.575         62.525       3568.79   <.0001
Ram               4       1546.322        386.581      22065.1   <.0001
Mp                7       1691.193        241.599      13789.9   <.0001
Two-way interaction effect
n * OL            4       0.085           0.021        1.210     0.3059
n * [?]           12      2.930           0.244        13.940    <.0001
n * [?]           12      3.147           0.262        14.970    <.0001
n * Ram           16      13.559          0.847        48.370    <.0001
n * Mp            28      91.219          3.258        185.950   <.0001
OL * [?]          3       0.406           0.135        7.730     <.0001
OL * [?]          3       0.098           0.033        1.860     0.1346
OL * Ram          4       0.061           0.015        0.870     0.4789
OL * Mp           7       0.015           0.002        0.120     0.9968
[?] * [?]         9       2.128           0.236        13.490    <.0001
[?] * Ram         12      1.425           0.119        6.780     <.0001
[?] * Mp          21      47.441          2.259        128.940   <.0001
[?] * Ram         12      5.449           0.454        25.920    <.0001
[?] * Mp          21      5.567           0.265        15.130    <.0001
Ram * Mp          28      96.678          3.453        197.080   <.0001

An ANOVA test is first conducted to examine the effect of the experimental factors. Table 7-30 shows the results. The opportunity level is not significant at the 0.01 level (p = 0.0339), while the other experiment factors are significant. In the random case agents do not act strategically, so whether their opportunity level is high or low does not affect how much information is missing. The interaction effects of OL with the other factors are also insignificant, with one exception (F = 7.73, p < .0001 in Table 7-30). The interaction effects among all other factors are significant at the 0.0001 level.

Next, we study the impact of the percentage of randomly missing data on the performance measurements.
As shown in Table 7-31, all four misclassification rates (TotMisc, PosMisc, NegMisc, and NotClassified) increase as the percentage increases. Table 7-32 is in percentage format.

Table 7-31. Statistics of randomly missing holes
Missing Percent  TotMisc     PosMisc     NegMisc     NotClassified
1%               0.159571    0.218616    0.015136    0
2%               0.243308    0.333724    0.02756     0.000055379
3%               0.306348    0.421391    0.038911    0.000094914
4%               0.364556    0.502989    0.048766    0.000264511
5%               0.414688    0.573681    0.058068    0.000326308

Table 7-32. Statistics of randomly missing holes, in percent
Missing Percent  TotMisc     PosMisc     NegMisc     NotClassified
1%               0.64%       1.19%       0.01%       0.00%
2%               1.47%       2.76%       0.02%       0.00%
3%               2.33%       4.37%       0.04%       0.00%
4%               3.29%       6.19%       0.06%       0.00%
5%               4.24%       8.00%       0.08%       0.00%

Figure 7-9. Comparison of the impact of missing holes

We also study the impact of the percentage of randomly missing information on the performance of the D and DNeg methods. As demonstrated in Table 7-33, NegMisc and NotClassified are not affected by the increase in the percentage, but TotMisc and PosMisc do increase, and the increase at each step is larger than in the general case where the principal's method is unknown.

Table 7-33. Statistics of randomly missing holes (D)
Missing Percent  TotMisc     PosMisc     NegMisc     NotClassified
1%               0.28438     0.40438     0           0
2%               0.418529    0.597857    0           0
3%               0.515916    0.74025     0           0
4%               0.599436    0.864227    0           0
5%               0.673126    0.974968    0           0

Table 7-34. Statistics of randomly missing holes (D), in percent
Missing Percent  TotMisc     PosMisc     NegMisc     NotClassified
1%               2.01%       4.03%       0.00%       0.00%
2%               4.32%       8.67%       0.00%       0.00%
3%               6.51%       13.09%      0.00%       0.00%
4%               8.72%       17.54%      0.00%       0.00%
5%               10.91%      21.94%      0.00%       0.00%

Figure 7-10. Comparison of missing holes for D method

The DNeg method has different characteristics. As the percentage increases, TotMisc, PosMisc, and NegMisc all increase, but the increase at each step is smaller than for the D method. Table 7-35 summarizes the results. Table 7-36 is in percentage format.
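The percent-format tables in this chapter can be recovered from the transformed values. The tabulated pairs are consistent with the arcsine square-root mapping y = 2 arcsin(sqrt(p)), under which a rate of 50% maps to 1.5708 and a rate of 100% maps to 3.14159. This is inferred from the numbers rather than quoted from the text, and the function names below are illustrative.

```python
import math

def to_angle(p):
    """Map a rate p in [0, 1] to the arcsine scale: y = 2 * asin(sqrt(p))."""
    return 2.0 * math.asin(math.sqrt(p))

def to_rate(y):
    """Invert the mapping for y in [0, pi]: p = sin(y / 2) ** 2."""
    return math.sin(y / 2.0) ** 2

# Transformed TotMisc values taken from Tables 7-31 and 7-33,
# plus the two limiting values that appear in the OL = Low tables.
for y in (0.159571, 0.414688, 0.673126, 1.5708, 3.14159):
    print(f"{y:.6f} -> {100 * to_rate(y):.2f}%")
```

Running the loop gives percentages that can be compared against the corresponding rows of Tables 7-32 and 7-34.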
Finally, we investigate the impact of different default vectors in the D method. As shown in Table 7-37, NotClassified and NegMisc stay at zero when the default moves along the w ray in the D set, while TotMisc and PosMisc increase by a very small amount at each step.

Table 7-35. Statistics of randomly missing holes (DNeg)
Missing Percent  TotMisc     PosMisc     NegMisc     NotClassified
1%               0.20472     0.290578    0.00038     0
2%               0.30636     0.435822    0.001171    0
3%               0.383447    0.546924    0.00157     0
4%               0.452502    0.647036    0.002055    0
5%               0.512359    0.734727    0.002183    0

Table 7-36. Statistics of randomly missing holes (DNeg), in percent
Missing Percent  TotMisc     PosMisc     NegMisc     NotClassified
1%               1.04%       2.10%       0.00%       0.00%
2%               2.33%       4.67%       0.00%       0.00%
3%               3.63%       7.29%       0.00%       0.00%
4%               5.03%       10.11%      0.00%       0.00%
5%               6.42%       12.90%      0.00%       0.00%

Figure 7-11. Comparison of missing holes of DNeg method

Table 7-37. Statistics of randomly missing holes (D kw)
Default vector   TotMisc     PosMisc     NegMisc     NotClassified
D                0.498277    0.716336    0           0
D w              0.499436    0.718054    0           0
D 2w             0.499579    0.718268    0           0
D 4w             0.499664    0.718393    0           0
D 6w             0.499674    0.718408    0           0
D 8w             0.499674    0.718408    0           0

Table 7-38. Statistics of randomly missing holes (D kw), in percent
Default vector   TotMisc     PosMisc     NegMisc     NotClassified
D                6.08%       12.29%      0.00%       0.00%
D w              6.11%       12.35%      0.00%       0.00%
D 2w             6.11%       12.35%      0.00%       0.00%
D 4w             6.11%       12.36%      0.00%       0.00%
D 6w             6.11%       12.36%      0.00%       0.00%
D 8w             6.11%       12.36%      0.00%       0.00%

To summarize, the default vector at the vertex point of the D set performs no worse than other points along the w ray in the D set. It provides zero NegMisc and small positive and total misclassification rates.

We have studied three experiments, in which all agents hide attributes by using the same method (case one) or different methods (case two), or attribute values are missing randomly (case three). We investigate the impact of training sample size, test size, the agents' method, the principal's method, and opportunity level.
We see that all controlled factors are significant except for the test size in certain cases. We also find that the vertices used by the D and DNeg methods perform as well as other chosen default vectors along the w ray, whether data are missing randomly or hidden by agents, and that the misclassification rates converge as the default vector moves away from the vertex. Furthermore, we find that the D and DNeg methods give the lowest negative misclassification rate. In addition, the D Method has the lowest StrPos, which means that when the D Method is adopted by the principal, the positive agents are induced to reveal information and negative agents are classified correctly.

Figure 7-12. Comparison of chosen default vector of D method when data are randomly missing

CHAPTER 8
SUMMARY, CONCLUSIONS AND FUTURE RESEARCH

In this dissertation, we studied two imputation methods for handling missing data. When data are strategically hidden from the principal, she needs to estimate the missing values so that further analysis can be conducted. The two imputation methods we developed create default vectors for the principal while anticipating the agents' strategic behavior.

In Chapter 5, several possible sets, ,yDwb ,yDwb and Dwb are identified from which a principal can choose a default vector. Specifically, if a default vector is chosen from ,yDwb 1,1 y then all agents in class y will be classified correctly regardless of their strategic actions. If a default vector is chosen from ,yDwb then any agent in class y is induced to reveal all attribute values. If a default vector is chosen from Dwb then any negative agent will be classified correctly regardless of his strategic behavior, while any positive agent will be induced to reveal all information.

In Chapter 6, we study how the default vector changes for large samples.
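As a concrete picture of the default-vector mechanism summarized above, the sketch below fills hidden attributes with a default vector and then applies a linear classifier. It is a toy illustration only. The decision rule (positive when w·x >= b), the attribute range [0, 1], the default vector at the origin, and all names and numbers are assumptions made for this example; the construction of the default sets in Chapter 5 is not reproduced.

```python
def fill_defaults(x, default):
    """Replace hidden attributes (None) with the principal's default values."""
    return [d if value is None else value for value, d in zip(x, default)]

def classify(x, w, b):
    """Linear rule, assumed convention: +1 if w.x >= b, otherwise -1."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= b else -1

# Hypothetical classifier and default vector for attributes in [0, 1].
w, b = [1.0, 1.0], 1.0
default = [0.0, 0.0]

positive_agent = [0.9, 0.4]   # w.x = 1.3, truly positive
negative_agent = [0.2, 0.3]   # w.x = 0.5, truly negative

# A positive agent who hides an attribute is scored with the default value
# and loses the positive label, so revealing everything is in his interest.
print(classify(fill_defaults([0.9, None], default), w, b))   # hides attribute 2
print(classify(fill_defaults(positive_agent, default), w, b))

# A negative agent cannot gain by hiding: the default never raises his score.
print(classify(fill_defaults([None, 0.3], default), w, b))
print(classify(fill_defaults(negative_agent, default), w, b))
```

In this toy setting the default acts as the least favorable value for every attribute, which is why hiding never helps; whether a given default vector has that property in general depends on the sets characterized in Chapter 5.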
We show that for large training sets, the misclassification rate goes to zero when the principal uses a default vector in Dwb However, other common statistical methods for handling missing data, such as the Average Method and the Similarity Method, degrade as the training size increases.

In Chapter 7, we conduct three experiments. In the first experiment, all agents use the same method (among eight different methods) to strategically hide data. In the second experiment, each agent randomly chooses a method to hide data. In the third experiment, data are missing randomly and no strategic behavior by agents is considered. The findings support the conclusion that our D Method and DNeg Method improve as the training size increases while other methods degrade.

The findings shown in earlier chapters rest on several assumptions, including that the data are separable and that all agents have the same cost structure, as well as on the various parameter choices. We outline future research that relaxes some of these assumptions and lay out several possible directions of future study.

Theoretical Work

In Chapter 5, we investigate the D set by assuming that the training data set is linearly separable. We also assume that the data consist of continuous numbers and that agents have perfect knowledge of the principal's classifier. We further assumed that all agents had sufficient reservation cost to cover publication of all attributes, although we empirically studied the case in which this assumption was violated. More commonly, we encounter problems where data are not linearly separable or some attributes are discrete. For example, the ratings of online sellers may only take the values 1, 2, 3, 4, or 5. Also, in reality agents may never know the principal's classifier, or even the type of classifier, but may know aspects of it, such as the signs of different coefficients for LDF classifiers.
We can extend our current study theoretically in several directions by relaxing some of these assumptions. When a data set is inseparable, a principal needs to trade off empirical risk against structural risk. Instead of using the SVM hard-margin model, we can use a soft-margin model for non-separable data problems.

Alternatively, we may relax the assumption on the type of data that we handle. Instead of continuous data, we want to consider discrete data, which have different properties than continuous data and so may generate different results. In general, there are four types of measurements: nominal (such as name, hair color, class membership, etc.), ordinal or rank (such as first, second, third, and so on), interval (for instance, temperature), and ratio (for example, debt ratio or liquidity ratio). Each type of data may require different considerations. For classification problems, nominal data are often encountered. Nominal data are usually converted into binary data. Boylu et al. (2010a) have indirectly investigated strategic hiding of binary data, although their model did not use our default value approach. We intend to revisit the Theorems of Section 5 under the assumption of binary attributes.

In addition, we only consider one principal in our study. Future work on multiple principals may give interesting results. The interactions between principals and/or agents, such as collusion and competition, and the assumptions about how much each knows about another's classifier or cost parameters, can be interesting.

Empirical Work

In Chapter 7, we assume that agents have the same publishing costs and opportunity cost. This is a strong assumption. More commonly, different agents have different costs for revealing their information and different disutility from missing an opportunity. We can extend our experiment by considering different costs for different agents.
Also, we can control the values of the max norm when generating a training set and test set. We may also include more imputation methods in the experiment and compare the performance of all included methods.

Application on the Trust and Reputation Systems

Hiding or revealing information is closely related to the trust issue. With the popularity of online transactions, people rely more on information available online to judge the other transaction party's trustworthiness. Strategically hidden information may be too important to be ignored by principals, as discussed in Chapter 1. Below is a list of things one would expect in trust and reputation systems where information sharing or hiding takes place in trust-related settings. For instance, sellers with better reputations gain significant purchase intention and price premium (Kim et al. 2008); sellers with social presence gain higher purchase intention (Gefen and Straub 2003); sellers with higher perceived information quality and the presence of a third-party seal increase buyers' purchase intention (Kim et al. 2008), etc. So, we can further investigate the missing data problem by applying the study to the trust setting.

LIST OF REFERENCES

Allison, P. D. 2001. Missing data. Sage Publications, Inc.
Barreno, M., B. Nelson, R. Sears, A. Joseph, J. Tygar. 2006. Can machine learning be secure? Proceedings of the 2006 ACM Symposium on Information, computer and communications security 16-25.
Beale, E., R. Little. 1975. Missing values in multivariate analysis. Journal of the Royal Statistical Society, Series B (Methodological) 37(1) 129-145.
Boylu, F., H. Aytug, G. Koehler. 2008. Systems for strategic learning. Information Systems and E-Business Management 6(2) 205-220.
Boylu, F., H. Aytug, G. Koehler. 2010a. Induction over constrained strategic agents. European Journal of Operational Research 203(3) 698-705.
Boylu, F., H. Aytug, G. Koehler. 2010b. Induction over strategic agents.
Information Systems Research 21(1) 170-189.
Boylu, F., H. Aytug, G. Koehler. 2010c. Induction over strategic agents: A genetic algorithm solution. Annals of Operations Research 174(1) 135-146.
Breiman, L., J. Friedman, R. Olshen, C. Stone. 1984. Classification and regression trees. Wadsworth and Brooks/Cole.
Buck, S. 1960. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society, Series B (Methodological) 22(2) 302-306.
Chan, L., O. Dunn. 1972. The treatment of missing values in discriminant analysis. Journal of the American Statistical Association 67(336) 473-477.
Chan, L., J. Gilman, O. Dunn. 1976. Alternative approaches to missing values in discriminant analysis. Journal of the American Statistical Association 71(356) 842-844.
Cristianini, N., J. Shawe-Taylor. 2000. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
Dalvi, N., P. Domingos, Mausam, S. Sanghai, D. Verma. 2004. Adversarial classification. KDD '04: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining.
Doetsch, P., C. Buck, P. Golik, N. Hoppe, M. Kramp, J. Laudenberg, C. Oberdörfer, P. Steingrube, J. Forster, A. Mauser. 2009. Logistic model trees with AUC split criterion. Journal of Machine Learning Research: Workshop and Conference Proceedings, KDD Cup 2009 competition 7 77-88.
Donders, A. R. T., G. van der Heijden, T. Stijnen, K. G. M. Moons. 2006. Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology 59(10) 1087-1091.
Erenguc, S., G. Koehler. 1990. Survey of mathematical programming models and experimental results for linear discriminant analysis. Managerial and Decision Economics 11(4) 215-225.
Fawcett, T. 2003. In vivo spam filtering: A challenge problem for data mining. KDD Explorations 5(2) 203-231.
Fisher, R. A. 1936.
The use of multiple measurements in taxonomical problems. Annals of Eugenics 7 179-188.
Freed, N., F. Glover. 1981. Simple but powerful goal programming models for discriminant problems. European Journal of Operational Research 7(1) 44-60.
Gefen, D., D. Straub. 2003. Managing user trust in B2C e-services. e-Service Journal 2(2) 7-24.
Gourieroux, C., A. Monfort. 1981. On the problem of missing data in linear models. The Review of Economic Studies 48(4) 579-586.
Grzymala-Busse, J., M. Hu. 2001. A comparison of several approaches to missing attribute values in data mining. Springer.
Haitovsky, Y. 1968. Missing data in regression analysis. Journal of the Royal Statistical Society, Series B (Methodological) 30(1) 67-82.
Hand, D. J. 1981. Discrimination and classification. Wiley Series in Probability and Mathematical Statistics, Chichester: Wiley.
Huber, G. 1982. Gamma function derivation of n-sphere volumes. American Mathematical Monthly 89(5) 301-302.
Jackson, E. 1968. Missing values in linear multiple discriminant analysis. Biometrics 24(4) 835-844.
Joachimsthaler, E., A. Stam. 1990. Mathematical programming approaches for the classification problem in two-group discriminant analysis. Multivariate Behavioral Research 25(4) 427-454.
Kern, W. F., J. R. Bland. 1948. "Cavalieri's theorem" and "proof of Cavalieri's theorem." Solid mensuration with proofs. Wiley.
Kim, D. J., D. L. Ferrin, H. R. Rao. 2008. A trust-based consumer decision-making model in electronic commerce: The role of trust, perceived risk, and their antecedents. Decision Support Systems 44(2) 544-564.
Koehler, G. 1990. Considerations for mathematical programming models in discriminant analysis. Managerial and Decision Economics 11(4) 227-234.
Koehler, G. 1991. Linear discriminant functions determined by genetic search. INFORMS Journal on Computing 3(4) 345.
Li, S. 2011. Concise formulas for the area and volume of a hyperspherical cap.
Asian Journal of Mathematics and Statistics 4(1) 66-70.
Littman, M., P. Stone. 2001. Implicit negotiation in repeated games. Proceedings of The Eighth International Workshop on Agent Theories, Architectures, and Languages, Intelligent Agents VIII 393-404.
Liu, W. Z., A. P. White, S. G. Thompson, M. A. Bramer. 1997. Techniques for dealing with missing values in classification. Springer.
Loh, W., N. Vanichsetakul. 1988. Tree-structured classification via generalized discriminant analysis. Journal of the American Statistical Association 83(403) 715-725.
Lowd, D., C. Meek. 2005. Adversarial learning. The eleventh ACM SIGKDD international conference on Knowledge discovery in data mining 641-647.
Mahoney, M., P. Chan. 2002. Learning nonstationary models of normal network traffic for detecting novel attacks. The eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 376-385.
Phua, C., V. Lee, K. Smith, R. Gayler. 2005. A comprehensive survey of data mining-based fraud detection research. Cornell University Library.
Quinlan, J. 1993. C4.5: Programs for machine learning. Morgan Kaufmann.
Quinlan, J. R. 1986. Induction of decision trees. Springer.
Quinlan, J. R. 2006. Unknown attribute values in induction. The Sixth International Machine Learning Workshop.
Raghunathan, T., J. Lepkowski, J. Van Hoewyk, P. Solenberger. 2001. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 27(1) 85-96.
Rubin, D. 1976. Inference and missing data. Biometrika 63(3) 581-592.
Sahami, M., S. Dumais, D. Heckerman, E. Horvitz. 1998. A Bayesian approach to filtering junk email. The AAAI 98 Workshop on Learning for Text Categorization. 62 9805.
Schafer, J., J. Graham. 2002. Missing data: Our view of the state of the art. Psychological Methods 7(2) 147-177.
Schölkopf, B., A. J. Smola. 2001. Learning with kernels: Support vector machines, regularization, optimization and beyond. MIT Press.
Stam, A., E. A. Joachimsthaler. 1989. Solving the classification problem in discriminant analysis via linear and nonlinear programming methods. Decision Sciences 20(2) 285-293.
Vapnik, V. N. 1998. Statistical learning theory. Wiley, New York.
von Stackelberg, H. F. 1949. Market structure and equilibrium. ORDO.
White, A., W. Liu. 1990. Probabilistic induction by dynamic path generation for continuous attributes 285.
Wilks, S. 1932. Moments and distributions of estimates of population parameters from fragmentary samples. The Annals of Mathematical Statistics 3(3) 163-195.

BIOGRAPHICAL SKETCH

Juheng Zhang received her doctorate degree in business administration at the University of Florida in August 2011. She also received an M.S. in finance and a B.A. in management information systems. Prior to graduate study, she worked in the Shenzhen Customs District of the People's Republic of China. Her research interests include data mining, quantum computing, game theory, and cloud computing. Her teaching interests include database management, data analysis, and quantitative analysis.