DISCLOSURE CONTROL OF CONFIDENTIAL DATA BY APPLYING PAC LEARNING THEORY

By LING HE

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2005

Copyright 2005 by Ling He

I would like to dedicate this work to my parents, Tianqin He and Yan Gao, for their endless love and encouragement through all these years.

ACKNOWLEDGMENTS

I would like to express my complete gratitude to my advisor, Dr. Gary Koehler. This dissertation would not have been possible without his support, guidance, and encouragement. I have been very fortunate to have an advisor who is always willing to devote his time, patience and expertise to the students. During my Ph.D. program, he taught me invaluable lessons and insights on the workings of academic research. As a distinguished scholar and a great person, he sets an example that always encourages me to seek excellence in the academic area as well as my personal life.

I am very grateful to my dissertation cochair, Dr. Haldun Aytug. His advice, support and help in various aspects of my research carried me through a lot of difficult times. In addition, I would like to thank the rest of my thesis committee members: Dr. Selwyn Piramuthu and Dr. Anand Rangarajan. Their valuable feedback and comments helped me to improve the dissertation in many ways.

I would also like to acknowledge all the faculty members in my department, especially the department chair, Dr. Asoo Vakharia, for their support, help and patience. I also thank my friends for their generous help, understanding and friendship in the past years. My thanks also go to my colleagues in the Ph.D. program for their precious moral support and encouragement.

Last, but not least, I would like to thank my parents for always believing in me.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Background
  1.2 Motivation
  1.3 Research Problem
  1.4 Contribution
  1.5 Organization of Dissertation

2 STATISTICAL AND COMPUTATIONAL LEARNING THEORY
  2.1 Introduction
  2.2 Machine Learning
    2.2.1 Introduction
    2.2.2 Machine Learning Model
  2.3 Probably Approximately Correct Learning Model
    2.3.1 Introduction
    2.3.2 The Basic PAC Model Learning Binary Functions
    2.3.3 Finite Hypothesis Space
    2.3.4 Infinite Hypothesis Space
  2.4 Empirical Risk Minimization and Structural Risk Minimization
    2.4.1 Empirical Risk Minimization
    2.4.2 Structural Risk Minimization
  2.5 Learning with Noise
    2.5.1 Introduction
    2.5.2 Types of Noise
    2.5.3 Learning from Statistical Query
  2.6 Learning with Queries

3 DATABASE SECURITY-CONTROL METHODS
  3.1 A Survey of Database Security
    3.1.1 Introduction
    3.1.2 Database Security Techniques
    3.1.3 Microdata Files
    3.1.4 Tabular Data Files
  3.2 Statistical Database
    3.2.1 Introduction
    3.2.2 An Example: The Compromise of Statistical Databases
    3.2.3 Disclosure Control Methods for Statistical Databases

4 INFORMATION LOSS AND DISCLOSURE RISK
  4.1 Introduction
  4.2 Literature Review

5 DATA PERTURBATION
  5.1 Introduction
  5.2 Random Data Perturbation
    5.2.1 Introduction
    5.2.2 Literature Review
  5.3 Variable Data Perturbation
    5.3.1 CVC Interval Protection for Confidential Data
    5.3.2 Variable-data Perturbation
    5.3.3 Discussion
  5.4 A Bound for the Fixed-data Perturbation (Theoretical Basis)
  5.5 Proposed Approach

6 DISCLOSURE CONTROL BY APPLYING LEARNING THEORY
  6.1 Research Problems
  6.2 The PAC Model for the Fixed-data Perturbation
  6.3 The PAC Model for the Variable-data Perturbation
    6.3.1 PAC Model Setup
    6.3.2 Disqualifying Lemma 2
  6.4 The Bound of the Sample Size for the Variable-data Perturbation Case
    6.4.1 The Bound Based on the Disqualifying Lemma Proof
    6.4.2 The Bound Based on the Sample Size
    6.4.3 Discussion
  6.5 Estimated the Mean and Standard Deviation

7 EXPERIMENTAL DESIGN AND RESULTS
  7.1 Experimental Environment and Setup
  7.2 Data Generation
  7.3 Experimental Results
    7.3.1 Experiment 1
    7.3.2 Experiment 2

8 CONCLUSION
  8.1 Overview and Contribution
  8.2 Limitations
  8.3 Directions for Future Research

APPENDIX

A NOTATION TABLES
B DATA GENERATED FOR THE UNIFORM DISTRIBUTION
C DATA GENERATED FOR THE SYMMETRIC DISTRIBUTION
D DATA GENERATED FOR THE DISTRIBUTION WITH POSITIVE SKEWNESS
E DATA GENERATED FOR THE DISTRIBUTION WITH NEGATIVE SKEWNESS

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
3-1: Original Records
3-2: Masked Records
3-3: Original Table
3-4: Published Table
3-5: A Hospital's Database
5-1: An Example Database
5-2: The Example Database With Camouflage Vector
5-3: An Example of Interval Disclosure
5-4: LP Algorithm
6-1: Bounds on the Sample Size with Different Values of n
6-2: The Relationship among u/ c s and
6-3: Heuristic to Estimate the Mean μ, Standard Deviation σ and the Bound l
6-4: Summary of the Estimated μ, σ, and l in the CVC Example Network
7-1: Summary of Four Cases with Different Means and Standard Deviations
7-2: The Intervals of [a, b] under the Four Cases
7-3: Experiments Results on 16 Tests with the Means, Standard Deviations, Sample Sizes and Average Error Rates
7-4: Experimental Results on the Average Error Rates with l = 6,000 for 16 Cases

LIST OF FIGURES

Figure
2-1: Error Probability
3-1: Microdata File That Has Been Read Into SPSS
4-1: R-U Confidentiality Map, Univariate Case, n = 10, 02 = 5, 2 = 2
5-1: Network With (m,w) = (1,3) (data source: Garfinkel et al. 2002)
5-2: Discrete Distribution of Perturbations from the Bin-CVC Network Algorithm
5-3: Relationships of c, c', c and d
5-4: Illustration of the Connection between the PAC Learning and Data Perturbation
6-1: Relationships of H0, H1, H2, h0, h1 and d in the Fixed-Data Perturbation
6-2: Relationships of H0, H1, H2, h0, h1 and d in the Variable-Data Perturbation
6-3: A Bimodal Distribution of Perturbations in the CVC Network while /
6-4: A Distribution of Perturbations in the CVC Network with u/ > cn > a
7-1: Plots of Four Uniform Distributions of Perturbations at Different Means and Standard Deviations
7-2: Plots of Four Symmetric Distributions of Perturbations at Different Means and Standard Deviations
7-3: Plots of Four Distributions with Positive Skewness of Perturbations at Different Means and Standard Deviations
7-4: Plots of Four Distributions with Positive Skewness of Perturbations at Different Means and Standard Deviations
7-5: Plot of Average Error Rates (%) for 16 Tests
7-6: The Probability Histogram of Perturbation Distribution for the CVC Network
7-7: Plot of Bounds on the Sample Size for 16 Tests
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DISCLOSURE CONTROL OF CONFIDENTIAL DATA BY APPLYING PAC LEARNING THEORY

By Ling He

August 2005

Chair: Gary Koehler
Cochair: Haldun Aytug
Major Department: Decision and Information Sciences

With the rapid development of information technology, massive data collection is easier and cheaper than ever before. Thus, the efficient and safe exchange of information has again become a pervasive focus of database management. The challenge we face today is to provide users with reliable and useful data while protecting the privacy of confidential information contained in the database. Our research concentrates on statistical databases, which usually store a large number of data records and are open to the public, where users are allowed to ask only limited types of queries, such as Sum, Count and Mean. Responses to those queries are aggregate statistics that are intended to prevent disclosing the identity of a unique record in the database.

My dissertation aims to analyze these problems from a new perspective using Probably Approximately Correct (PAC) learning theory, which attempts to discover the true function by learning from examples. Unlike traditional approaches, in which database administrators apply security methods to protect the privacy of statistical databases, we regard the true database as the target concept that an adversary tries to discover using a limited number of queries, in the presence of some systematic perturbations of the true answer. We extend previous work and classify a new data perturbation method, the variable data perturbation, which protects the database by adding random noise to the confidential field.
This method uses a parametrically driven algorithm that can be viewed as generating random perturbations from some (unknown) discrete distribution with known parameters, such as the mean and standard deviation. The bounds we derive for this new method show how much protection is necessary to prevent the adversary from discovering the database with high probability at small error. Put in PAC learning terms, we derive bounds on the amount of error an adversary makes given a general perturbation scheme, a number of queries and a confidence level.

CHAPTER 1 INTRODUCTION

1.1 Background

Statistical organizations, such as the U.S. Census Bureau, National Statistical Offices (NSOs), and Eurostat, collect large amounts of data every year by conducting different types of surveys of assorted individuals. Meanwhile, the data stored in statistical databases (SDBs) are disseminated to the public in various forms, including microdata files, tabular data files or sequential queries to online databases. The data are retrieved, summarized and analyzed by various database users, e.g., researchers, medical institutions or business companies. Among the published data, restrictions are established on the release of sensitive data in order to comply with the confidentiality agreements imposed by the sources or providers of the original information. Therefore, the protection of confidential information becomes a critical issue with serious economic and legal implications, which in turn expands the scope and necessity of improved security in the database field.

Statistical databases usually store a large number of data records and are open to the public, where users are allowed to ask only limited types of queries, such as Sum, Count and Mean. Responses to those queries are aggregate statistics that aim to prevent disclosing the identity of a unique record in the database.
With the rapid development of information technology, it has become easier and cheaper to obtain data than ever before. With the passage of the Personal Responsibility and Work Opportunity Act of 1996 (the Welfare Reform Act) (Fienberg 2000) and the Health Insurance Portability and Accountability Act of 1996 (HIPAA) in the United States, the protection of confidential information collected by statistical organizations, a pervasive issue since the 1970s and 1980s, has again become a focus of database management. Those statistical organizations have legal and ethical obligations to maintain the accuracy, integrity and privacy of the information contained in their databases.

1.2 Motivation

Traditional research on SDB privacy, which is also called Statistical Disclosure Control (SDC), has been under way for over 30 years. SDC provides many types of security control methods. Among them, microaggregation, cell suppression and random data perturbation are some of the most promising. Recently, Garfinkel et al. (2002) developed a new technique called CVC protection, which uses a network algorithm to construct a series of camouflage vectors that hide the true confidential vector. This CVC technique provides interval answers to ad hoc queries. All those SDC methods attempt to provide SDB users with reliable and useful data (minimizing the information loss) while also protecting the privacy of the confidential information in the database (minimizing the disclosure risk).

Probably Approximately Correct (PAC) learning theory is a framework for analyzing machine learning algorithms. It attempts to discover the true function by learning from examples which are randomly drawn from an unknown but fixed distribution. Given accuracy and confidence parameters, the PAC model bounds the error that the learned hypothesis makes with respect to the true function.
Unlike the traditional approach, in which database administrators apply SDC methods to protect the privacy of SDBs, we approach the database security problem from a new perspective: we assume that an adversary regards the true confidential data in the database as the target concept and tries to discover it within a limited number of queries by applying PAC learning theory. We describe how much protection is necessary to guarantee that the adversary cannot uncover the database's confidential information with high probability. Put in PAC learning terms, we derive bounds on the amount of error an adversary makes given a general perturbation scheme, a number of queries and a confidence level.

1.3 Research Problem

Additive data perturbation includes some of the most popular database security methods. Inspired by the CVC technique, we classify a new method into this category, the variable data perturbation, which protects a database by adding random noise. Unlike the fixed random data perturbation method, this method effectively generates random perturbations which have an unknown discrete distribution. However, parameters such as the mean and standard deviation can be estimated. The variable data perturbation method is the focus of our research.

We intend to derive a bound on the level of error that an adversary may make while compromising a database. We extend the previous work by Dinur and Nissim (2003), who found a bound for the fixed data perturbation method, and deploy PAC learning theory to develop a new bound for the variable data perturbation. A threshold on the number of queries is developed from the error bound. With high probability, the adversary can disclose the database at small error if this number of queries is asked. Therefore, we may find out how much protection would be necessary to prevent the disclosure of the confidential information in a statistical database.
Our experiments indicate that a high level of protection may yield answers that are not useful, whereas useful answers can lead to the compromise of a database.

1.4 Contribution

Two major contributions are expected from this research. First, we approach the database security problem from a new perspective instead of following the traditional research paths in this field. By applying PAC learning theory, we regard an adversary of the database as a learner who tries to discover the confidential information within a certain number of queries. We show that SDC methods and PAC learning theory use a similar methodology for different purposes. Second, we derive a PAC-like bound on the sample size for the variable data perturbation method, within which the database can be compromised with high probability at small error. Based on this result, we can determine whether a security method provides enough protection to the database.

1.5 Organization of Dissertation

The dissertation is organized into 8 parts. Chapter 2 provides an overview of the important concepts, methodologies and models in the fields of machine learning and PAC learning theory. In Chapter 3, we summarize database security-control methods for microdata files, tabular data files and the statistical database, which is the emphasis of our efforts. We review the literature on performance measurements for database protection methods in Chapter 4. Following that, in Chapter 5 random data perturbation methods are reviewed and a new data perturbation method, variable-data perturbation, is defined and developed. Two papers that motivated our research are reviewed and explained. We propose our approach at the end of that chapter. In Chapter 6, we introduce our methodology and develop the research model. A bound on the sample size for the variable data perturbation method is derived, within which the confidential information can be disclosed.
In Chapter 7, experiments are designed and conducted to test our theoretical conclusions from previous chapters. Experimental results are summarized and analyzed at the end. Chapter 8 concludes our work and gives directions for future research.

CHAPTER 2 STATISTICAL AND COMPUTATIONAL LEARNING THEORY

In this chapter, we introduce Statistical and Computational Learning Theory, a formal mathematical model of learning. The overview focuses on the PAC model, the most commonly used theoretical framework in this area. We then move to a brief review of statistical learning theory and its two important principles: the empirical and structural risk minimization principles. Other well-known concepts and theorems are also investigated here. At the end of the chapter, we extend the basic PAC framework to more practical models, that is, learning with noise and query learning models.

2.1 Introduction

Since the 1960s, researchers have been diligently working on how to make computing machines learn. Research has focused on both empirical and theoretical approaches. The area is now called machine learning in computer science but is referred to as data mining, knowledge discovery, or pattern recognition in other disciplines. Machine learning is a mainstream area of artificial intelligence. It aims to design learning algorithms that identify a target object automatically without human involvement.

In the machine learning area, it is very common to measure the quality of a learning algorithm based on its performance on a sample dataset. It is therefore difficult to compare two algorithms strictly and rigorously if the criterion depends only on empirical results. Computational learning theory defines a formal mathematical model of learning, and it makes it possible to analyze the efficiency and complexity of learning algorithms at a theoretical level (Goldman 1991).
2.2 Machine Learning

2.2.1 Introduction

In this section we start our review with an introduction to important concepts in the machine learning field, such as hypotheses, training samples, instances, and instance spaces. This is followed by a demonstration of the basic machine learning model, which is designed to generate an hypothesis that closely approximates the unknown target concept. See Natarajan (1991) for a complete introduction.

2.2.2 Machine Learning Model

Many machine learning algorithms are used to tackle classification problems, which attempt to classify objects into particular classes. The three types of classification problems are binary classification, with two classes; multi-class classification, which handles a finite number of output categories; and regression, whose outputs are real values (Cristianini and Shawe-Taylor 2000).

Most machine learning methods learn from examples of the target concept. This is called supervised learning. The target concept (or target function) $f$ is an underlying function that maps data from the input space to the output space. The input space is also called an instance space, denoted $X$, and is used to describe each instance $x \in X \subseteq \mathbb{R}^n$. Here $n$ represents the number of dimensions or attributes of the input instance. The output space, denoted $Y$, contains every possible output label $y \in Y$. In the binary classification case, the target concept $f(x)$ classifies all instances $x \in X$ into negative and positive classes, denoted 0 and 1, so that $f: X \subseteq \mathbb{R}^n \to Y \subseteq \{0,1\}$. Let $f(x) = 1$ if $x$ belongs to the positive (true) class, and $f(x) = 0$ (false) otherwise.

Suppose a sample $S$ includes $l$ pairs of training examples, $S = ((x_1, y_1), \ldots, (x_l, y_l))$. Each $x_i$ is an instance, and the output $y_i$ is the classification label of $x_i$. The learning algorithm takes the training sample as input and outputs an hypothesis $h(x)$, from the set of all hypotheses under consideration, that best approximates the target concept $f(x)$ according to its criteria.
An hypothesis space $H$ is the set of all possible hypotheses. The target concept is chosen from the concept space, $f \in C$, which consists of the set of all possible concepts (functions).

2.3 Probably Approximately Correct Learning Model

2.3.1 Introduction

The PAC model proposed by Valiant in 1984 is considered the first formal theoretical framework to analyze machine learning algorithms, and it formally initiated the field of computational learning theory. By learning from examples, the PAC model combines methods from complexity theory and probability theory, aimed at measuring the complexity of learning algorithms. The core idea is that the hypothesis generated by the learning algorithm approximates the target concept with high probability at small error in polynomial time and/or space.

2.3.2 The Basic PAC Model Learning Binary Functions

The PAC learning model quantifies the worst-case risk associated with learning a function. We discuss its details using binary functions as the learning domain. Suppose there is a training sample $S$ of size $l$. Every example is generated independently and identically from an unknown but fixed probability distribution $D$ over the instance space $X \subseteq \{0,1\}^n$. Thus, the PAC model is also called a distribution-free model. Each instance is an $n$-bit binary vector, $x \in X \subseteq \{0,1\}^n$. The learning task is to choose a specific boolean function that approximates the target concept $f: \{0,1\}^n \to \{0,1\}$, $f \in C$. The target concept $f$ is chosen from the concept space $C = 2^X$ of all possible boolean functions. According to the PAC requirements, a learning algorithm must output an hypothesis $h \in H$ in polynomial time, where $H \subseteq 2^X$. We hope that the target function $f \in H$ and that the hypothesis $h$ approximates $f$ as accurately as possible. If $f \notin H$, then classification errors are inevitable.
Consider a concept space $C = 2^X$, an hypothesis space $H \subseteq 2^X$, and an unknown but fixed probability distribution $D$ over an instance space $X \subseteq \{0,1\}^n$. The error of an hypothesis $h \in H$ with respect to a target concept $f \in C$ is the probability that $h$ and $f$ disagree on the classification of an instance $x \in X$ drawn from $D$. This probability of error is denoted by a risk functional:

$$\mathrm{err}_D(h) = \Pr_D\{(x, f(x)) : h(x) \neq f(x)\}$$

To understand the error more intuitively, see Figure 2-1. The error probability is indicated by areas I and II. Areas I and II in the figure show where $h(x)$ disagrees with $f(x)$ on the instances located in these places. We can think of them as Type I and Type II errors. Areas III and IV contain those instances on whose classification $h(x)$ and $f(x)$ agree.

The PAC model uses an accuracy parameter $\varepsilon$ and a confidence parameter $\delta$ to measure the quality of an hypothesis $h$. Given a sample $S$ of size $l$ and a distribution $D$ from which all training examples are drawn, the PAC model strives to bound by $\delta$ the probability that an hypothesis $h$ gives large error, as in

$$\Pr_{D^l}\{S : \mathrm{err}_D(h_S) > \varepsilon\} < \delta$$

where $h_S$ means that the training set decides the selection of the hypothesis.

[Figure 2-1: Error Probability. Within the instance space $X$, areas I and II mark where $h(x) \neq f(x)$; areas III and IV mark where $h(x) = f(x)$.]

Definition: PAC Learnable. A concept class $C$ of boolean functions is PAC learnable if there exists a learning algorithm $A$, using an hypothesis space $H$, such that for every $f \in C$, for every probability distribution $D$, for every $0 < \varepsilon < 1/2$, and for every $0 < \delta < 1/2$:

(1) An hypothesis $h \in H$, produced by algorithm $A$, approximates the target function $f$ with probability at least $1 - \delta$, such that $\mathrm{err}_D(h) \le \varepsilon$.

(2) The complexity of the learning algorithm $A$ is bounded by a polynomial in the size of the target concept $n$, $1/\varepsilon$ and $1/\delta$.

The sample complexity refers to the sample size that the algorithm $A$ needs in order to output such an hypothesis $h$.
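To make the risk functional concrete, the following is a minimal sketch that computes $\mathrm{err}_D(h)$ exactly by enumerating the instance space $\{0,1\}^n$. The function name `exact_error`, the optional `prob` argument, and the particular target and hypothesis are illustrative assumptions, not taken from the dissertation; the distribution $D$ is assumed uniform when `prob` is omitted.

```python
from itertools import product


def exact_error(h, f, n, prob=None):
    """Return err_D(h) = Pr_D{x : h(x) != f(x)} over X = {0,1}^n.

    prob maps an instance (a tuple of n bits) to its probability under D.
    If prob is None, D is assumed to be uniform (an illustrative choice).
    """
    total = 0.0
    for x in product((0, 1), repeat=n):
        p = prob(x) if prob is not None else 1.0 / 2 ** n
        if h(x) != f(x):
            total += p  # x falls in the disagreement region (areas I and II)
    return total


# Hypothetical target concept and hypothesis on n = 3 bits.
def target(x):
    return int(x[0] and x[1])  # f(x) = x1 AND x2


def hypothesis(x):
    return x[0]  # h(x) = x1


print(exact_error(hypothesis, target, 3))  # 0.25
```

Here the two functions disagree exactly on the instances with $x_1 = 1$ and $x_2 = 0$, which carry a quarter of the uniform probability mass. Enumeration is only practical for small $n$; in the PAC setting $D$ is unknown, so the error is bounded rather than computed.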
2.3.3 Finite Hypothesis Space

An hypothesis space $H$ can be finite or infinite. If an hypothesis $h$ classifies all training examples correctly, it is called a consistent hypothesis. We will derive the main PAC result in multiple steps using well-known inequalities from probability theory.

2.3.3.1 Finite consistent hypothesis space

Assuming the hypothesis space $H$ is finite, if we choose an hypothesis $h$ with a risk greater than $\varepsilon$, the probability that it is consistent on a training sample $S$ of size $l$ is bounded as

$$\Pr_{D^l}\{S : h \text{ consistent and } \mathrm{err}_D(h) > \varepsilon\} \le (1 - \varepsilon)^l \le e^{-\varepsilon l}$$

To see this, observe that the probability that hypothesis $h$ classifies one input pair $(x_i, f(x_i))$ correctly is $\Pr\{h(x_i) = f(x_i)\} \le 1 - \varepsilon$. Given $l$ examples, the probability that $h$ classifies $(x_1, f(x_1)), \ldots, (x_l, f(x_l))$ correctly is

$$\Pr\{(h(x_1) = f(x_1)) \wedge \cdots \wedge (h(x_l) = f(x_l))\} \le (1 - \varepsilon)^l$$

because the sampling is i.i.d. Thus, the probability of finding an hypothesis $h$ with error greater than $\varepsilon$ that is consistent with the training set (of size $l$) is bounded by the union bound (i.e., the worst case) $|H|(1 - \varepsilon)^l$. To see this latter step, first define $E_i$ to represent the event that $h_i$ is consistent. Then we know that

$$\Pr\Big\{\bigcup_{i=1}^{|H|} E_i\Big\} \le \sum_{i=1}^{|H|} \Pr\{E_i\} \le |H|(1 - \varepsilon)^l$$

Finally, $(1 - \varepsilon)^l \le e^{-\varepsilon l}$ is a commonly known simple algebraic inequality.

The idea behind the PAC bound is to bound this unlucky scenario (i.e., algorithm $A$ finds a consistent hypothesis that happens to be one with error greater than $\varepsilon$). The following result formalizes this.

Blumer Bound (Blumer et al. 1987). $|H|(1 - \varepsilon)^l < \delta$. Thus, the sample complexity $l$ for a consistent hypothesis $h$ over a finite hypothesis space $H$ is bounded by

$$l \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$$

2.3.3.2 Finite inconsistent hypothesis space

An hypothesis $h$ is called inconsistent if there exist misclassification errors $\varepsilon_s > 0$ in the training sample.
The sample complexity is therefore bounded by

$$l > \frac{1}{2(\varepsilon - \varepsilon_s)^2}\left(\ln|H| + \ln\frac{1}{\delta}\right)$$

and the error is bounded by

$$\varepsilon \le \varepsilon_s + \sqrt{\frac{1}{2l}\left(\ln|H| + \ln\frac{1}{\delta}\right)}$$

We can see from the above inequality that $\varepsilon$ is usually larger than the training error rate $\varepsilon_s$. Interested readers can see Goldman (1991) for further explanations.

2.3.4 Infinite hypothesis space

When $H$ is finite we can use $|H|$ directly to bound the sample complexity. When $H$ is infinite we need to utilize a different measure of capacity. One such measure is called the VC dimension, which was first proposed by Vapnik and Chervonenkis (1971).

Definition: VC Dimension. The VC dimension of an hypothesis space is the maximum number, $d$, of points of the instance space that can be separated into two classes in all possible $2^d$ ways using functions in the hypothesis space. It measures the richness or capacity of $H$ (i.e., the higher $d$ is, the richer the representation).

Given $H$ with a VC dimension $d$ and a consistent hypothesis $h \in H$, the PAC error bound is (Cristianini and Shawe-Taylor 2000):

$$\varepsilon \le \frac{2}{l}\left(d \log_2\frac{2el}{d} + \log_2\frac{2}{\delta}\right)$$

provided $d \le l$ and $l > 2/\varepsilon$.

2.4 Empirical Risk Minimization and Structural Risk Minimization

2.4.1 Empirical Risk Minimization

Given a VC dimension $d$ and an hypothesis $h \in H$ with a training error $\varepsilon_s$, the error rate $\varepsilon$ is bounded by

$$\varepsilon \le 2\varepsilon_s + \frac{4}{l}\left(d \ln\frac{2el}{d} + \ln\frac{4}{\delta}\right)$$

Therefore, the empirical risk can be minimized directly by minimizing the number of misclassifications on the sample. This principle is called the Empirical Risk Minimization principle.

2.4.2 Structural Risk Minimization

As is well known, one disadvantage of empirical risk minimization is the overfitting problem; that is, for small sample sizes, a small empirical risk does not guarantee a small overall risk. Statistical learning theory uses the structural risk minimization principle (SRM) (Scholkopf and Smola 2001, Vapnik 1998) to solve this problem. The SRM focuses on minimizing a bound on the risk functional.
Minimizing a risk functional is formally developed as a goal of learning a function from examples by statistical learning theory (Vapnik 1998):

$$R(\alpha) = \int L(z, g(z, \alpha))\, dF(z)$$

over $\alpha \in \Lambda$, where $L(\cdot)$ is a loss function for misclassified points, $g(\cdot, \alpha)$ is an instance of a collection of target functions parametrically defined by $\alpha \in \Lambda$, and $z$ is the training pair assumed to be drawn randomly and independently according to an unknown but fixed probability distribution $F(z)$. Since $F(z)$ is unknown, an induction principle must be invoked. It has been shown that for any $\alpha \in \Lambda$, with probability at least $1 - \delta$, the bound on a consistent hypothesis

$$R(\alpha) \le R_{emp}(\alpha) + R_{struct}(d, l, \delta)$$

holds, where the structural risk $R_{struct}(\cdot)$ depends on the sample size, $l$, the confidence level, $\delta$, and the capacity, $d$, of the target function. The bound is tight, up to log factors, for some distributions (Cristianini and Shawe-Taylor 2000). When the loss function is the number of misclassifications, the exact form of $R_{struct}(\cdot)$ is

$$R_{struct}(d, l, \delta) = \sqrt{\frac{d\left(\ln(2l/d) + 1\right) - \ln(\delta/4)}{l}}$$

It is a common learning strategy to find consistent target functions that minimize a bound on the risk functional. This strategy provides the best "worst case" solution, but it does not guarantee finding target functions that actually minimize the true risk functional.

2.5 Learning with Noise

2.5.1 Introduction

The basic PAC model is also called the noise-free model since it assumes that the training set is error-free, meaning that the given training examples are correctly labeled and not corrupted. In order to be more practical in the real world, the PAC algorithm has been extended to account for noisy inputs (defined below). Kearns (1993) initiated another well-studied model in the machine learning area, the Statistical Query model (SQ), which provides a framework for a noise-tolerant learning algorithm.
2.5.2 Types of Noise

Four types of noise are summarized in Sloan's paper (Sloan 1995):

(1) Random Misclassification Noise (RMN)

Random misclassification noise occurs when the learning algorithm, with probability $1 - \eta$, receives noiseless samples $(x, y)$ from the oracle and, with probability $\eta$, receives noisy samples $(x, \bar{y})$ (i.e., $x$ with an incorrect classification). Angluin and Laird (1988) first theoretically modeled PAC learning with RMN noise. Their model presented a benign form of misclassification noise. They concluded that if the rate of misclassification is less than 1/2, then the true concept can be learned by a polynomial algorithm. Within $l$ samples, the algorithm can find an hypothesis $h$ minimizing the number of disagreements $F(h, \sigma)$, where $F(h, \sigma)$ denotes the number of times that hypothesis $h$ disagrees with the training sample $\sigma$. The sample size $l$ is bounded by

$$l \ge \frac{2}{\varepsilon^2 (1 - 2\eta_b)^2} \ln\frac{2|H|}{\delta}$$

provided $0 \le \eta \le \eta_b < 1/2$. Extensive studies can be found in Aslam and Decatur (1993), Blum et al. (1994), Bshouty et al. (2003), Decatur and Gennaro (1995), and Kearns (1993).

(2) Malicious Noise (MN)

Malicious noise occurs when the learning algorithm, with probability $1 - \eta$, gets the correct samples, but with probability $\eta$ the oracle returns noisy data, which may be chosen by a powerful malicious adversary. No assumption is made about the corrupted data, and the nature of the noise is also unknown. Valiant (1985) first simulated this situation of learning from MN. Kearns and Li (1993) further analyzed this worst-case model of noise and presented some general methods that any learning algorithm can apply to bound the error rate, and they showed that learning-with-noise problems are equivalent to standard combinatorial optimization problems. Additional work can be found in Bshouty (1998), Cesa-Bianchi et al. (1999), and Decatur (1996, 1997).
(3) Malicious Misclassification Noise (MMN)

Malicious misclassification (labeling) noise is the case in which misclassification is the only possible noise. The adversary can choose only to change the label $y$ of the sample pair $(x, y)$ with probability $\eta$, while no assumption is made about $y$. Sloan (1988) extended Angluin and Laird's (1988) result to this type of noise.

(4) Random Attribute Noise (RAN)

Random attribute noise is as follows. Suppose the instance space is $\{0,1\}^n$. For every instance $x$ in a sample pair $(x, y)$, its attribute $x_i$, $1 \le i \le n$, is flipped to $\bar{x}_i$ independently and randomly with a fixed probability $\eta$. This kind of noise is called uniform attribute noise. In this case, the noise affects only the input instance, not the output label. Shackelford and Volper (1988) probed the RAN for the problem of k-DNF expressions. A k-DNF expression is a disjunction of terms, where each term is a conjunction of at most $k$ literals. Later, Bshouty et al. (2003) defined a noisy distance measure for function classes, which they proved to be the best possible learning style in an attribute noise case. They also indicated that a concept class $C$ is not learnable if this measure is small (compared with $C$ and the attribute noise distribution $D$). Goldman and Sloan (1995) developed a uniform attribute noise model for product random attribute noise, in which each attribute $x_i$ is flipped with its own probability $\eta_i$, $1 \le i \le n$. They demonstrated that if the algorithm focuses only on minimizing the disagreements, this type of noise is nearly as harmful as malicious noise. They also proved that no algorithm can exist if the noise rate $\eta_i$ ($1 \le i \le n$) is unknown and the noise rate is higher than $2\varepsilon$ ($\varepsilon$ is the accuracy parameter in the PAC model). Decatur and Gennaro (1995) further proved that if each noise probability $\eta_i$ (or an upper bound) is known, then a PAC algorithm may exist for the simple classification problem.
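The random noise models above differ only in which part of a sample pair is corrupted. The sketch below simulates random misclassification noise and the uniform and product forms of random attribute noise on boolean samples, and evaluates the Angluin and Laird (1988) sample-size bound $l \ge \frac{2}{\varepsilon^2(1-2\eta_b)^2}\ln\frac{2|H|}{\delta}$. All function names are hypothetical, and the adversarial models (MN, MMN) are deliberately left out because they make no distributional assumption that could be simulated.

```python
import math
import random

def misclassification_noise(sample, eta, rng):
    """RMN: flip each label independently with probability eta."""
    return [(x, 1 - y) if rng.random() < eta else (x, y) for x, y in sample]

def uniform_attribute_noise(sample, eta, rng):
    """Uniform RAN: flip every attribute bit with the same probability eta;
    labels are left untouched."""
    return [(tuple(b ^ (rng.random() < eta) for b in x), y) for x, y in sample]

def product_attribute_noise(sample, etas, rng):
    """Product RAN: attribute i is flipped with its own probability etas[i]."""
    return [(tuple(b ^ (rng.random() < e) for b, e in zip(x, etas)), y)
            for x, y in sample]

def angluin_laird_sample_size(h_size, eps, delta, eta_b):
    """Sample size sufficient under RMN with noise-rate upper bound eta_b < 1/2."""
    factor = 2.0 / (eps ** 2 * (1.0 - 2.0 * eta_b) ** 2)
    return math.ceil(factor * math.log(2.0 * h_size / delta))

rng = random.Random(0)                    # fixed seed, for repeatability only
clean = [((0, 1, 1), 1), ((1, 0, 0), 0)]  # toy sample over {0,1}^3
noisy_labels = misclassification_noise(clean, 0.2, rng)
noisy_inputs = uniform_attribute_noise(clean, 0.2, rng)
```

The factor $(1 - 2\eta_b)^{-2}$ makes the required sample size grow without bound as the noise-rate bound approaches 1/2, which matches the condition under which learning remains possible.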
2.5.3 Learning from Statistical Query

The Statistical Query (SQ) model introduced by Kearns (1993) provides a general framework for an efficient PAC learning algorithm in the presence of classification noise. Kearns proved that if any function class can be learned efficiently by the SQ model, then it is also learnable in the PAC model, and those algorithms are called SQ-typed. In the SQ model, the learning algorithm sends predicates $(\chi, \alpha)$ to the SQ oracle and asks for the probabilities $P_\chi$ that the predicate is correct. Instead of answering the exact probabilities, the oracle gives only probabilities $\hat{P}_\chi$ within the allowed approximation error $\alpha$, which here indicates a tolerance for error, i.e., $P_\chi - \alpha \le \hat{P}_\chi \le P_\chi + \alpha$.

The approach that the SQ model suggested for generating noise-tolerant algorithms is successful. A large number of noise-tolerant algorithms are formulated as SQ algorithms. Aslam and Decatur (1993) presented a general method to boost the accuracy of a weak SQ learning algorithm. A later study by Blum et al. (1994) proved that weakly learning a concept class requires at least $\Omega(d^{1/3})$ queries, and that the upper bound for the number of queries is $O(d)$. The SQ-dimension $d$ is defined as the number of "almost uncorrelated" concepts in the concept class. Jackson (2003) further improved the lower bound to $\Omega(2^n)$ for learning the class of parity functions in an $n$-bit input space.

However, the SQ model has its limitations. Blumer et al. (1989) proved that there exists a class that cannot be efficiently learned by SQ, but is actually efficiently learnable. Kearns (1993) showed that the SQ model cannot generate efficient algorithms for parity functions, which can be learned in a noiseless-data PAC model. Jackson (2003) later showed that noise-tolerant PAC algorithms developed using the SQ model are not guaranteed to be optimally efficient.

2.6 Learning with Queries

Angluin (1988) initiated the area of Query learning.
In the basic framework, the learner needs to identify an unknown concept $f$ from some finite or countable concept space $C$ of subsets of a universal set. The learner is allowed to ask specific queries about the unknown concept $f$ to an oracle, which responds according to the queries' types. Angluin studied different kinds of queries, such as membership queries, equivalence queries, subset queries, and so forth. Unlike the PAC model, which requires only an approximation to the target concept, query learning is a nonstatistical framework and the learner must identify the target concept exactly. An efficient algorithm and lower bounds are described in Angluin's research. Any efficient algorithm using equivalence queries in query learning can also be converted to satisfy the PAC criterion $\Pr(err(h) > \varepsilon) < \delta$.

CHAPTER 3
DATABASE SECURITY-CONTROL METHODS

In this chapter, we survey important concepts and techniques in the area of database security, such as compromise of a database, inference, disclosure risk, and disclosure control methods, among other issues. According to the way that confidential data are released, we categorize the review of database security methods into three parts: microdata, tabular data, and sequential queries to databases. Our main efforts concentrate on the security control of a special type of database, the statistical database (SDB), which accepts only limited types of queries sent by users. Basic SDB protection techniques in the literature are reviewed.

3.1 A Survey of Database Security

For many decades, computerized databases designed to store, manage, and retrieve information have been implemented successfully and widely in many areas, such as businesses, government, research, and health care organizations. Statistical organizations intend to provide database users with the maximum amount of information with the least disclosure risk of sensitive and confidential data.
With the rapid expansion of the Internet, both the general public and the research community have become much more attentive to the issues of database security. In the following sections, we introduce basic concepts and techniques commonly applied in a general database.

3.1.1 Introduction

A database consists of multiple tables. Each table is constructed with rows and columns representing entities (or records) and attributes (fields), respectively. Some attributes may store confidential information such as income, medical history, financial status, etc. Security methods have been designed and applied to protect the privacy of specific data from outsiders or illegal users. Database security has its own terminology for research purposes. Therefore, we first clarify certain important definitions and concepts. These are used repeatedly in this research and may have varied implications under different circumstances.

When talking about the confidentiality, privacy or security of a database, we refer to the disclosure risk of the confidential data. A compromise of the database occurs when the confidential information is disclosed to illegitimate users exactly, partially or inferentially. Based on the amount of compromised sensitive information, the disclosure can be classified into exact disclosure and partial disclosure (Denning et al. 1979, Beck 1980). Exact disclosure or exact inference refers to the situation in which illegal users can infer the exact true confidential information by sending sequential queries to the database, while in the case of partial disclosure, the true confidential data can be inferred only to a certain level of accuracy. Inferential disclosure or statistical inference is another type of disclosure, which refers to the situation in which an illegal user can infer the confidential data with a high probability by sending sequential queries to the database.
In this case, the probability exceeds the disclosure threshold predetermined by the database administrator. This is known as an inference problem, which also falls within our research focus.

There are mainly two types of disclosure in terms of the disclosure objects: identity disclosure and attribute disclosure. Identity disclosure occurs if the identity of a subject is linked to any particular disseminated data record (Spruill 1983). Attribute disclosure implies that users could learn the attribute value, or an estimate of the attribute value, for the record (Duncan and Lambert 1989, Lambert 1993). Currently, most of the research focuses on identity disclosure.

3.1.2 Database Security Techniques

Database security concerns the privacy of confidential data stored in a database. Two fundamental tools are applied to prevent compromising a database (Duncan and Fienberg 1999): (1) restricting access and (2) restricting data. For example, a statistical office such as the U.S. Census Bureau disseminating data to the public may enforce administrative policies to limit users' access to data. A common method is for the database administrator to assign IDs and passwords to different types of users to restrict access at different security levels. For example, for a medical database, doctors could have full access to all kinds of information while researchers may obtain only the non-confidential records. This security mechanism is referred to as restricting access. When all users have the same level of access to the database, usually only transformed data are released for the purpose of security. This protection approach, which falls in the data restriction category, reduces disclosure risk. However, for some public databases access control alone is neither feasible nor sufficient to prevent inferential disclosure. Thus, both tools are complementary and may be used together. In this research we concentrate on the second category, the data restriction approach.
Database privacy is also known as Statistical Disclosure Control (SDC) or Statistical Disclosure Limitation (SDL). The SDC techniques, which are used to modify original confidential data before their release, try to balance the tradeoff between information loss (or data utility) and disclosure risk. Some measures evaluating the performance of SDC methods will be discussed in Chapter 4.

Based on the way that data are released publicly, all responses from queries can be classified into three types: microdata files, tabular data files, and statistical responses from sequential queries to databases (Mas 2000). Most typical databases deal with all three dissemination formats. Our research focuses on a section of the third category, sequential queries to a statistical database (SDB), which differs from a regular database due to its limited querying interface. Normally only a few types of queries, such as SUM, COUNT, and MEAN, can be executed in an SDB. The goal of applying disclosure control methods is to prevent users from inferring confidential data on the basis of those successive statistical queries. We briefly describe protection mechanisms for microdata and tabular data in the next two subsections, 3.1.3 and 3.1.4. Security control techniques for the statistical database are discussed in detail in Section 3.2.

3.1.3 Microdata files

Microdata are unaggregated or unsummarized original sample data containing every anonymized individual record (such as a person, a business company, etc.) in the file. Normally, microdata come from the responses to census surveys issued by statistical organizations, such as the U.S. Census Bureau (see Figure 3-1 for an example), and include detailed information with many attributes (probably over 40), such as income, occupation, and household composition. Those data are released in the form of flat tables, where rows and columns represent records and attributes for each individual respondent, respectively.
Microdata can usually be read, manipulated and analyzed by computers with statistical software. See Figure 3-1 for an example of microdata that are read into SPSS (Statistical Package for the Social Sciences).

Figure 3-1: Microdata File That Has Been Read Into SPSS. (Data source: Indiana University Bloomington Libraries, Data Services & Resources. http://www.indiana.edu/ libgpd/data/microdata/what.html)

3.1.3.1 Protection Techniques for microdata files

Before disseminating microdata files to the public, statistical organizations apply SDC techniques either to distort or to remove certain information from the original data files, thereby protecting the anonymity of individual records. Two generic types of microdata protection methods are (Crises 2004a):

(1) Masking methods

The basic idea of masking is to add errors to the elements of a dataset before the data are released. Masking methods have two categories: perturbative (see Crises 2004d for a survey) and nonperturbative (see Crises 2004c for a survey). The perturbative category modifies the original microdata before its release. It includes methods such as adding noise (Sullivan 1989 and Brand 2002, Domingo-Ferrer et al. 2004), rounding (Willenborg 1996 and 2000), microaggregation (Defays and Nanopoulos 1993, Anwar 1993, Mateo and Domingo 1999, Domingo and Mateo 2002, Li et al. 2002b, Hansen and Mukherjee 2003), data swapping (Dalenius and Reiss 1982, Reiss 1984, Feinberg 2000, and Fienberg and McIntyre 2004) and others. The nonperturbative category does not change data, but it makes partial suppressions or reductions of detail in the microdata set, and applies methods such as sampling, suppression, recoding, and others (DeWaal and Willenborg 1995, Willenborg 1996 and 2000). The following two tables are simple illustrations of masking methods, i.e., data swapping, additive noise and microaggregation. (Data source: Domingo-Ferrer and Torra 2003).
First, the microaggregation method is used to group "Divorced" and "Widow" into one category, "Widow/er-or-divorced", in the field "Marital Status". Second, the values of record 3 and record 5 in the "Age" column are switched by applying data swapping. Finally, the value of record 4 in the "Age" attribute is perturbed from "36" to "40" by adding noise of "4".

Table 3-1: Original Records

Record  Illness       ...  Sex  Marital Status  Town       Age
1       Heart         ...  M    Married         Barcelona  33
2       Pregnancy     ...  F    Divorced        Tarragona  40
3       Pregnancy     ...  F    Married         Barcelona  36
4       Appendicitis  ...  M    Single          Barcelona  36
5       Fracture      ...  M    Single          Barcelona  33
6       Fracture      ...  M    Widow           Barcelona  81

Table 3-2: Masked Records

Record  Illness       ...  Sex  Marital Status        Town       Age
1       Heart         ...  M    Married               Barcelona  33
2       Pregnancy     ...  F    Widow/er-or-divorced  Tarragona  40
3       Pregnancy     ...  F    Married               Barcelona  33
4       Appendicitis  ...  M    Single                Barcelona  40
5       Fracture      ...  M    Single                Barcelona  36
6       Fracture      ...  M    Widow/er-or-divorced  Barcelona  81

(2) Synthetic data generation

Liew et al. (1985) initially proposed this protection approach, which first identifies the underlying density function with associated parameters for the confidential attribute, and then generates a protected dataset by randomly drawing from that estimated density function. Even though data generated by this method do not derive from the original data, they preserve some statistical properties of the original distributions. However, the utility of those simulated data for the user has always been an issue. See (Crises 2004b) for an overview of this method.

3.1.4 Tabular data files

Another common way to release data is in the tabular data format (also called macrodata), obtained by aggregating microdata (Willenborg 2000). It is also called summary data, table data or compiled data.
The numeric data are summarized into certain units or groups, such as geographic area, racial group, industry, age, or occupation. In terms of the different processes of aggregation, published tables can be classified into several types, such as magnitude tables, frequency count tables, linked tables, etc.

3.1.4.1 Protection techniques for tabular data

Tabular data files collect data at a higher level of aggregation since they summarize individual atomic information. Therefore, they provide higher security for a database than microdata files. However, the disclosure risk is not completely eliminated, and intruders could still infer confidential data from an aggregated table (see Tables 3-3 and 3-4 for an example). Protection techniques, such as cell suppression (Cox 1975, 1980, Malvestuto et al. 1991, Kelly et al. 1992, Chu 1997), table redesign, noise adding, rounding, or swapping, among others, have to be adopted before the release. See Sullivan (1992), Willenborg (2000), and Oganian (2002) for an overview.

See Table 3-3 for an illustration of tabular data. It shows state-level data for various types of food stores. The Economic Division published the economic data by geography and standard industrial classification (SIC) codes. The "Value of Sales" field is considered confidential. Table 3-4 demonstrates how a cell suppression technique is applied to protect the confidential data. (Data source: U.S. Bureau of the Census Statistical Research Division, Sullivan 1992).

Table 3-3: Original Table

SIC  Description      ...  Number of Establishments  Value of Sales ($)
54   All Food Stores  ...  347                       200,900
541  Grocery          ...  333                       196,000
542  Meat and Fish    ...  11                        1,500
543  Fruit Stores     ...  2                         2,400
544  Candy            ...  1                         1,000

Table 3-4: Published Table After Applying Cell Suppression

SIC  Description      ...  Number of Establishments  Value of Sales ($)
54   All Food Stores  ...  347                       200,900
541  Grocery          ...  333                       196,000
542  Meat and Fish    ...  11                        1,500
543  Fruit Stores     ...  2                         D
544  Candy            ...  1                         D

Only one Candy store reported a sales value for this state in Table 3-3. If the table were released as it is, any user would learn the exact sales value for this specific store. In addition, a sales value is listed for two Fruit stores in this state. Therefore, by knowing its own sales figure, either of these two stores can infer the competitor's sales volume. A disclosure occurs in either situation. Thus, SDC methods have to be applied to the original table before its publication. Table 3-4 shows that the confidential data resulting in a compromise are suppressed and replaced by a "D" in the cells. The technique applied is called cell suppression, which is currently in common use at the U.S. Census Bureau.

3.2 Statistical Database

3.2.1 Introduction

A statistical database (SDB) differs from a regular database due to its limited querying interface. Its users can retrieve only aggregate statistics of confidential attributes, such as SUM, COUNT, and MEAN, for a subset of records stored in the database. Those aggregate statistics are calculated from tables in databases. Tables could include microdata or tabular data. In other words, query responses in SDBs could be treated as views of microdata or tabular data tables. However, those views can only be summarized to answer limited types of queries, and they are computed for each query in the form of aggregate statistics. An SDB is compromised if the sensitive data are disclosed by answering a set of queries. Note that some of the protection methods used in SDBs overlap with those for microdata files and tabular data files. However, SDB security methods emphasize preventing disclosure through responses to sequential queries.

Many government agencies, businesses, and research institutions collect and analyze aggregate data for their special purposes. For instance, medical researchers may need to know the total number of HIV-positive patients within a certain age range and gender.
The users should not be allowed to link the sensitive information to any specific record in the SDB by asking sequential statistical queries. We illustrate how a statistical database could be compromised with the following example, and further explain the necessity of applying statistical disclosure control methods before data are released.

3.2.2 An Example: The Compromise of Statistical Databases

Adam and Wortmann (1989) described three basic types of users of a statistical database: the nonstatistical users, who access the database, send queries and update data; the researchers, who are authorized to receive only aggregate statistics; and the snoopers, attackers or adversaries, who seek to compromise the database. The purpose of database security is to provide researchers with useful information while limiting the disclosure risk posed by attackers.

For instance (example from Adam and Wortmann 1989, Garfinkel et al. 2002), a hospital's database (see Table 3-5) that provides aggregate statistics to outsiders contains one confidential field, HIV status, which is denoted by "1" for positive and "0" otherwise. Suppose a snooper knows that Cooper, who works for company D, is a male under the age of 30, and attempts to find out whether or not Cooper is HIV-positive. He therefore types the following queries:

Query 1: Sum = (Sex=M) & (Company=D) & (Age<30);
Query 2: Sum = (Sex=M) & (Company=D) & (HIV=1) & (Age<30);

The response to Query 1 is 1, and the response to Query 2 is 1. Neither query is individually a threat to the privacy of the database. However, when they are put together, the attacker who knows Cooper's personal information can locate Cooper from the answer to Query 1 and immediately infer from the answer to Query 2 that Cooper is HIV-positive. Thus, the confidential data are disclosed. We refer to this case as a compromise of a database.
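The two-query attack can be replayed on the records of Table 3-5. The sketch below is illustrative only: the `Record` type and the functions `count_query` and `restricted_count` are hypothetical names, and the size threshold `k` is an assumed, simplified version of the query-set-size control surveyed in Section 3.2.3.2 (a lower threshold only).

```python
from collections import namedtuple

Record = namedtuple("Record", "name job age sex company hiv")

# Rows of Table 3-5.
DB = [
    Record("Daniel", "Manager", 27, "F", "A", 0),
    Record("Smith", "Trainee", 42, "M", "B", 0),
    Record("Jane", "Manager", 63, "F", "C", 0),
    Record("Mary", "Trainee", 28, "F", "B", 1),
    Record("Selkirk", "Manager", 57, "M", "A", 0),
    Record("Daphne", "Manager", 55, "F", "B", 0),
    Record("Cooper", "Trainee", 21, "M", "D", 1),
    Record("Nevins", "Trainee", 32, "M", "C", 1),
    Record("Granville", "Manager", 46, "M", "C", 0),
    Record("Remminger", "Trainee", 36, "M", "D", 1),
    Record("Larson", "Manager", 47, "M", "B", 1),
    Record("Barbara", "Trainee", 38, "F", "D", 0),
    Record("Early", "Manager", 64, "M", "A", 1),
    Record("Hodge", "Manager", 35, "M", "B", 0),
]

def count_query(db, predicate):
    """Aggregate answer: number of records satisfying the predicate."""
    return sum(1 for r in db if predicate(r))

def knows_about_cooper(r):
    # Background knowledge of the snooper: male, company D, under 30.
    return r.sex == "M" and r.company == "D" and r.age < 30

q1 = count_query(DB, knows_about_cooper)
q2 = count_query(DB, lambda r: knows_about_cooper(r) and r.hiv == 1)

# A query set of size 1 whose members are all positive reveals the value.
compromised = (q1 == 1 and q2 == q1)

def restricted_count(db, predicate, k=2):
    """Assumed countermeasure: refuse (None) when fewer than k records match."""
    size = count_query(db, predicate)
    return size if size >= k else None
```

With the assumed threshold `k = 2`, `restricted_count` refuses both queries, because each isolates a single record. As the survey later notes, such a size check alone does not stop every sequence of overlapping queries.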
From this example, we can tell that the snooper is able to infer the true confidential data by analyzing aggregate statistics obtained through sequential queries. Therefore, security mechanisms have to be established prior to the data release.

Table 3-5: A Hospital's Database (data source: part from Garfinkel et al. 2002)

Record  Name       Job      Age  Sex  Company  HIV
1       Daniel     Manager  27   F    A        0
2       Smith      Trainee  42   M    B        0
3       Jane       Manager  63   F    C        0
4       Mary       Trainee  28   F    B        1
5       Selkirk    Manager  57   M    A        0
6       Daphne     Manager  55   F    B        0
7       Cooper     Trainee  21   M    D        1
8       Nevins     Trainee  32   M    C        1
9       Granville  Manager  46   M    C        0
10      Remminger  Trainee  36   M    D        1
11      Larson     Manager  47   M    B        1
12      Barbara    Trainee  38   F    D        0
13      Early      Manager  64   M    A        1
14      Hodge      Manager  35   M    B        0

3.2.3 Disclosure Control Methods for Statistical Databases

Some basic security control methods for microdata and tabular data have been summarized in the previous sections. In this section, we concentrate on the security control methods for statistical databases. Some methods used for microdata and tabular data may also be utilized here. Adam and Wortmann (1989) conducted a complete survey of security techniques for statistical databases (SDBs). They classified all security methods for SDBs into four categories: conceptual, query restriction, data perturbation, and output perturbation. In addition, Adam and Wortmann provided five criteria to evaluate the performance of security mechanisms. Our literature review follows suit and discusses major security control methods in the following sections.

Figure 3-2: Three Approaches in Statistical Database Security. A) Query Restriction, B) Data Perturbation and C) Perturbed Responses.
Figure 3-2 demonstrates three approaches: Query Restriction, Data Perturbation and Output Perturbation (Data source: Adam and Wortmann 1989). Figure 3-2A shows how the Query Restriction method works. This technique either returns exact answers to the user or refuses to respond at all. Figure 3-2B introduces the Data Perturbation method, which creates a perturbed SDB from the original SDB to respond to all queries. The user can receive only perturbed responses. The Output Perturbation method is illustrated in Figure 3-2C. Each query answer is modified before being sent back to the user.

3.2.3.1 Conceptual approach

The Conceptual approach includes two basic models: the Conceptual and Lattice models. The Conceptual model, proposed by Chin and Ozsoyoglu (1981, 1982), addresses security issues at the conceptual data model level, where users access only entities with common attributes and their statistics. The Lattice model, developed by Denning (1983) and Denning and Schlorer (1983), retrieves data from SDBs in tabular form at different aggregation levels. Both methods provide a fundamental framework for understanding and analyzing SDB security problems, but neither seems functional at the implementation level.

3.2.3.2 Query restriction approach

Based on the users' query history, SDBs either provide the exact answer or decline the query (see Figure 3-2A). The five major methods in this approach include:

(1) Query-set-size control (Hoffman and Miller 1970, Fellegi 1972, Schlorer 1975 and 1980, Denning et al. 1979, Schwartz et al. 1979, Denning and Schlorer 1980, Friedman and Hoffman 1980, Jonge 1983). This method allows the release of the data only if the query set size (the number of records included in the query response) meets some specific conditions.

(2) Query-set-overlap control (Dobkin et al. 1979). This mechanism is based on query-set-size control and further examines the possible overlap of entities involved in successive queries.
(3) Auditing (Schlorer 1976, Hoffman 1977, Chin and Ozsoyoglu 1982, Chin et al. 1984, Brankovic et al. 1997, Malvestuto and Moscarini 1998, Kleinberg et al. 2000, Li et al. 2002a, Malvestuto and Mezzini 2003). This technique keeps a record of the queries of each user and, before answering a new query, checks whether or not the response can lead to a disclosure of the confidential data.

(4) Partitioning (Yu and Chin 1977, Chin and Ozsoyoglu 1979, 1981, Schlorer 1983). This method groups all entities into a number of disjoint subsets. Queries are answered on the basis of those subsets instead of the original data.

(5) Cell suppression (Cox 1975, 1980, Denning et al. 1982, Sande 1983, Malvestuto and Moscarini 1990, Kelly et al. 1992, Malvestuto 1993). The basic idea of the technique is to suppress all cells that may result in the compromise of SDBs.

So far, some methods in this category have proved either inefficient or infeasible. For instance, a statistical database normally includes a large number of data records. In this situation, a traditional auditing method becomes impractical because of its requirement for large memory storage and strong computing power. Among those methods, the most promising is the cell suppression technique, which has been implemented successfully by the U.S. Census Bureau and widely adopted in the real world.

3.2.3.3 Data Perturbation Approach

In this approach, a dedicated perturbed database is constructed once and for all by altering the original database, and it is then used to answer users' queries (see Figure 3-2B). According to Adam and Wortmann (1989), all methods fall into two categories:

(1) The probability distribution. This category treats the SDB as a sample drawn from some distribution. The original SDB is replaced either by another sample coming from the same distribution, or by the distribution itself (Lefons et al. 1983).
Techniques in this category include data swapping (Reiss 1984), multidimensional transformation of attributes (Schlorer 1981), data distortion by probability distribution (Liew et al. 1985), etc.
(2) Fixed data perturbation. This category includes some of the most successful database protection mechanisms. It can be achieved by either an additive or a multiplicative technique (Muralidhar et al. 1999, 1995). An additive technique (Muralidhar et al. 1999) adds noise to the confidential data. Multiplicative data perturbation (Muralidhar et al. 1995) protects the sensitive information by multiplying the original data by a random variable that has a mean of 1 and a prespecified variance.

Our study focuses on additive data perturbation, which we classify into two types in our research: random data perturbation and variable data perturbation. We will introduce these two methods separately in Chapter 5.

3.2.3.4 Output Perturbation Approach

Output Perturbation is also named query-based perturbation. The response to each query is first computed from the original database and is then perturbed (see Figure 3-2C). Three methods are included in this approach:
(1) The Random-Sample Queries technique was proposed by Denning (1980). Later, Leiss (1982) suggested a variant of Denning's method. The basic rationale is that the query response is calculated from a randomly selected sample of the query set. This sample is chosen from the original query set subject to some specific conditions. However, an attacker may compromise the confidential information by repeating the same query and averaging the results.
(2) Varying-Output Perturbation (Beck 1980) works for SUM, COUNT and Percentile queries. This method assigns a varying perturbation to the data that are used to compute the response statistic.
(3) Rounding includes three types of output perturbation: systematic rounding (Achugbue and Chin 1979), random rounding (Fellegi and Phillips 1974, Haq 1975, 1977), and controlled rounding (Dalenius 1981). This technique calculates query answers from unbiased data, and the answer is then rounded up or down to the nearest multiple of a base number set by the Database Administrators (DBAs). Query results do not change for the same query, which provides good protection against averaging attacks.

In this chapter we summarized different types of database security-control methods. For a specific database, one SDC method could be more effective and efficient than another. Therefore, how to select the most suitable security method becomes a critical issue in database privacy. We will review various performance measurements for SDC in the next chapter.

CHAPTER 4 INFORMATION LOSS AND DISCLOSURE RISK

Chapter 3 provided an overview of important SDC methods that are applied to protect the privacy of a database. However, since SDC methods reach their goals by transforming the original data, users of the database obtain only approximate results from the modified data. Therefore, a fundamental issue that every statistical organization has to address is how to protect confidential data maximally while providing database users with as much useful and accurate information as possible. In this chapter, we review the main performance measurements of SDC methods. These assessments are used to evaluate the information loss (used interchangeably with data utility) and the disclosure risk of a database. These measures have become standard criteria for choosing appropriate protection techniques for SDBs.

4.1 Introduction

All SDC methods attempt to optimize two conflicting goals:
(1) Maximizing the data utility, or minimizing the information loss, experienced by legitimate data users.
(2) Minimizing the disclosure risk of the confidential information that data organizations take on by publishing the data.

Efforts to obtain greater protection therefore usually reduce the quality of the data that are released, and database administrators seek to optimize the tradeoff between information loss and disclosure risk. The definitions of information loss and disclosure risk are as follows:

Information Loss (IL) refers to the loss of the utility of data after being released. It measures the damage to data quality, from the legitimate users' point of view, caused by the application of SDC methods.

Disclosure Risk (DR) refers to the risk of disclosure of confidential information in the database. It measures how dangerous it is for statistical organizations to publish modified data.

The problem that statistical organizations always have to confront is how to choose an appropriate SDC method, with suitable parameters, from many potential protection mechanisms. The selected mechanism should minimize disclosure risk as well as information loss. One of the best solutions is to rely on performance measures to evaluate the suitability of different SDC techniques for the database. Good designs for performance criteria quantifying information loss and disclosure risk are therefore desirable and necessary.

4.2 Literature Review

Designing good performance measures is a challenging task because different users collect data for different purposes and organizations define disclosure risk to different extents. Many performance assessment methods exist in the literature. Based on their properties, we divide those measurement techniques into five categories in our research:
(1) Information loss measures for some specific protection methods. This type of measurement assesses the difference between the masked (modified) data and the original data after a specific protection method has been applied.
Refer to Willenborg and Waal (2000) and Oganian (2002) for examples. If variances of the original microdata are critical for the user, then the information loss can be estimated by comparing $\mathrm{Var}(\hat{\theta}(\mathrm{data}_{\mathrm{masked}}))$ with $\mathrm{Var}(\hat{\theta}(\mathrm{data}_{\mathrm{original}}))$, where $\hat{\theta}(\mathrm{data}_{\mathrm{original}})$ is a consistent estimator based on the original data, and $\hat{\theta}(\mathrm{data}_{\mathrm{masked}})$ is the corresponding estimator based on the modified data. We can tell from the above criterion that this measurement depends on a specific purpose of data use, such as means, variances, etc.
(2) Generic information loss measures for different protection methods. A generic information loss measure, which is not limited to any particular data use, is designed to compare different protection methods. Two well-known general information loss measures are as follows.

Shannon's entropy, discussed in Kooiman et al. (1998) and Willenborg and Waal (2000), can be applied to any SDC technique to define and quantify information loss. This measurement models the masking process as noise added to the original dataset, which is then sent through a noisy channel. The receiver of the noisy data intends to reconstruct the probability distribution of the original data. The entropy of this probability distribution measures the uncertainty about the original data that remains after the masked data are released. However, an entropy-based measurement is not a very good criterion since it ignores the impact of covariances and means. Whether or not these two statistics are preserved properly from the original data directly affects the validity and quality of the altered data.

Another measurement, by Domingo-Ferrer et al. (2001) and Oganian (2002), suggests that IL would be small if the original and masked data have a similar analytical structure, but the disclosure risk would be higher in this case.
This method compares statistics, such as mean square error, mean absolute error, and mean variation, which are calculated from the differences between the covariance matrices, coefficient matrices, correlation matrices, etc. of the original data and the modified data.
(3) Disclosure risk measures for specific protection methods. The disclosure risk also affects the quality of the SDC methods. Compared with IL measures, DR measures are more method-specific. The idea of assessing disclosure risk was initially proposed by Lambert (1993). Later, different DR measures were developed for particular SDC methods, e.g., for sampling methods by Chen and Keller-McNulty (1998), Samuel (1998), Skinner et al. (1994), and Truta et al. (2004), and for microaggregation masking methods by Jaro (1989) and Pagliuca and Seri (1998).
(4) Generic disclosure risk measures for different protection methods. Two main types of general DR measurements are applied to measure the quality of different protection methods for tabular data. The first type, called sensitivity rules, is used to estimate DR prior to the publication of data tables. There are three such rules: the (n,k)-dominance rule, the p% rule, and the pq rule (Felso et al. 2001, Holvast 1999, Luige and Meliskova 1999). In contrast to the dominance rule, which is criticized for its failure to reflect the disclosure risk properly, a new a priori measure was proposed by Oganian (2002), who also introduced a posterior DR measure, which takes the modified data into account and operates after SDC methods have been applied. A new method based on Canonical Correlation Analysis was introduced by Sarathy and Muralidhar (2002) to evaluate the security level of different SDC methods. This methodology can also be used to select the appropriate inference control method. For more details, refer to Sarathy and Muralidhar (2002).
(5) Generic performance measures that encompass disclosure risk and information loss for different protection methods.
A sound SDC method should be able to achieve an optimal tradeoff between disclosure risk and information loss. Therefore a joint framework is desired to examine the tradeoffs and compare the performance of distinct SDC methods. Two popular performance measures in the literature are Score Construction and the R-U confidentiality map.

Score Construction, proposed by Domingo-Ferrer and Torra (2001), ranks different SDC methods based on the scores obtained by averaging their information loss and disclosure risk measures. For example (Crises 2004e),

$\mathrm{Score}(V, V') = \dfrac{IL(V, V') + DR(V, V')}{2}$

where $V$ is the original data, $V'$ is the modified data, and $IL$ and $DR$ are the information loss and disclosure risk measures. Refer to Crises (2004e), Domingo-Ferrer et al. (2001), Sebe et al. (2002) and Yancey et al. (2002) for more examples.

An R-U confidentiality map, first proposed by Duncan and Fienberg (1999), provides a general analytical framework with which an information organization can trace the tradeoffs between disclosure risk and data utility. It was further developed by Duncan et al. (2001, 2004) and Gomatam et al. (2004). Trottini and Fienberg (2002) later illustrated two examples of the R-U map in their paper. An application is given in Boyen et al. (2004). Database administrators can decide on the most appropriate SDC method from the R-U map by observing the influence of a particular method with the corresponding parameter choice. See the following figure (Data source: Trottini and Fienberg 2002) for an example.

[Figure 4-1 image: disclosure risk (y-axis) plotted against data utility (x-axis)]

Figure 4-1: R-U Confidentiality Map, Univariate Case, $n = 10$, $\phi^2 = 5$, $\sigma^2 = 2$

$M_0$, $M_1$ and $M_2$ are represented by a diamond, a circle and a dashed line in the figure, and indicate three types of SDC methods: trivial microaggregation, microaggregation, and the combination of additive noise and microaggregation, respectively.
The disclosure risk and data utility are functions determined by the data size $n$, the known variance (prior belief) $\phi^2$, the known population variance $\sigma^2$, and the standard deviation $r$ of the noise added to the original data. The y-axis measures the disclosure risk while the x-axis measures the data utility. For example, in Figure 4-1, if the database administrators intend to keep the disclosure risk below 0.5, the appropriate SDC method that satisfies this requirement is $M_2$, the mixed strategy of additive noise plus microaggregation. From the x-axis, the corresponding data utility is 2.65. The choice of $r$ also affects the R-U map. If $r$ is large, the mixed strategy $M_2$ is close to not releasing any data at all; as $r$ approaches zero, $M_2$ becomes equivalent to the microaggregation method with some specific parameter. In Figure 4-1, $r = 2.081$.

We do not differentiate the measurements for microdata and tabular data in this overview since our research focuses on statistical databases. All examples and methods previously mentioned are applied to microdata, to tabular data, or to both.

CHAPTER 5 DATA PERTURBATION

This chapter provides an introduction to additive data perturbation methods. Based on the different ways of generating perturbative values, additive data perturbation methods are classified into three categories: random-data perturbation, fixed-data perturbation and variable-data perturbation. The first category, random-data perturbation, with five types of perturbation methods, can be found in Kim 1986, Muralidhar et al. 1999, Sullivan 1989, Tendick 1991, and Tendick and Matloff 1994. Our proposed variable-data perturbation method is a new category that includes the interval protection technique given by Gopal et al. (1998, 2002) and Garfinkel et al. (2002). In both random-data perturbation and variable-data perturbation methods, a perturbed database is constructed by adding noise to the confidential data in the original database.
All query responses are computed from the perturbed database. We will review an algorithm by Dinur and Nissim (2003) that finds a bound for fixed-data perturbation, in which the noise is added to each query response. This bound can be applied to both data perturbation and output perturbation methods. Their work considers the tradeoff between privacy and usability of a statistical database. We end the chapter with the proposed approach to the database security problem.

5.1 Introduction

Our study focuses on additive noise perturbation methods, which are usually employed to protect confidential numerical data. Perturbation methods can guarantee the prevention of exact disclosure by adding noise to sensitive data; however, they are still susceptible to partial disclosure and inferential disclosure. (See Chapter 3 for definitions of exact disclosure, partial disclosure and inferential disclosure.) Two types of additive perturbation methods are described in the following sections, distinguished by their approaches to generating noise. An algorithm by Dinur and Nissim (2003) that provides a theoretical basis for our study is also reviewed. Our proposed research approach is discussed at the end of this chapter.

5.2 Random Data Perturbation

5.2.1 Introduction

Random Data Perturbation (RDP) is one of the most popular and practical data protection methods employed in statistical databases today. In order to prevent statistical inference by a snooper, DBAs attempt to provide an appropriate level of security by distorting the sensitive data with random noise. The RDP method can assure adequate protection of confidential information while satisfying legitimate users' needs for aggregate statistics of the database.

5.2.2 Literature Review

In the Random Data Perturbation (RDP) method, a perturbed database is created by adding random noise to the confidential numerical attributes.
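As a minimal sketch of this noise-addition step, the following Python fragment perturbs one numerical attribute with independent normal noise. The function name `additive_perturbation`, the parameter `alpha`, the salary values and the seed are all hypothetical, and the choice of a noise variance proportional to the variance of the attribute is an assumption made for illustration rather than a prescription.

```python
import random
import statistics


def additive_perturbation(values, alpha, rng):
    """Return values[i] + e_i, with e_i drawn independently from N(0, alpha * Var(values)).

    Assumption: the noise variance is a fixed fraction `alpha` of the variance
    of the confidential attribute; `alpha` would be chosen by the DBA.
    """
    noise_sd = (alpha * statistics.pvariance(values)) ** 0.5
    return [v + rng.gauss(0.0, noise_sd) for v in values]


# Hypothetical confidential attribute (for example, salaries in thousands).
salaries = [52.0, 61.5, 48.2, 75.0, 58.3, 66.1, 49.9, 70.4]
rng = random.Random(2005)  # fixed seed so that the sketch is repeatable
perturbed = additive_perturbation(salaries, alpha=0.25, rng=rng)

# Aggregate queries are answered from the perturbed column only.
print("true mean     :", statistics.mean(salaries))
print("perturbed mean:", statistics.mean(perturbed))
```

Because the noise is independent of the data, the variance of the perturbed attribute is inflated by a factor of roughly 1 + alpha in expectation, which is one reason perturbed answers can differ systematically from the original ones.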
We discuss four types of RDP summarized by Crises (2004) and describe a general method for RDP given by Muralidhar et al. (1999). Before walking through the different types of RDP methods, we first discuss the main disadvantage of data perturbation methods. RDP methods may introduce bias into statistical characteristics of databases, such as PERCENTILES, conditional SUMS, and COUNTS. Matloff (1986) initially introduced the concept of bias, which occurs when the responses to certain queries computed from a perturbed database differ from the responses computed from the original database. Four types of bias, A, B, C, and D, are defined and analyzed by Muralidhar et al. (1999). Type A bias occurs when a change in variance causes a change in summary measures of some perturbed attribute. Type B bias occurs when the perturbation distorts the relationships between confidential attributes. Type C bias occurs when the perturbation changes the relationships between confidential and nonconfidential attributes. Type D bias occurs when the underlying distribution of the perturbed database cannot be determined because the original database or the noise term has a non-multivariate normal distribution. Improved perturbation methods are designed to avoid bias (Matloff 1986, Tendick 1991, Tendick and Matloff 1994, Muralidhar et al. 1995). A creative method called General Additive Data Perturbation (GADP), proposed by Muralidhar et al. (1999), eliminates all of these types of bias from additive perturbation methods. For more information about GADP, see Section 5.2.

(1) Masking by uncorrelated noise addition

This method is also called the Simple Additive Data Perturbation method (Muralidhar et al. 1999).
The vector of confidential fields, $d_m$, representing the $m$th attribute of the original database, which contains $n$ records, is replaced by a vector $y_m$ obtained by adding a noise term $e_m$:

$y_m = d_m + e_m$

where each element of $e_m$ is normally distributed and drawn from a random variable $\varepsilon_m \sim N(0, \sigma^2_{\varepsilon_m})$. Each noise term is generated independently of the others, such that $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$. The variances of $\varepsilon_m$ are generally assumed proportional to those of the original vector $d_m$; that is, if the variance of $d_m$ is $\sigma^2_m$, then $\sigma^2_{\varepsilon_m} = \alpha \sigma^2_m$. The distribution of $\varepsilon_m$ and the parameter $\alpha$ are decided by the DBA. This perturbation method introduces Type A, B and C bias.

(2) Masking by correlated noise addition

This method, proposed by Kim (1986) and Tendick (1991), uses correlated noise to perturb the database. It is also called the Correlated-Noise Additive Data Perturbation method (CADP). The formulation of the method is:

$\Sigma = V + V_{\varepsilon}$

where $\Sigma$ is the covariance matrix of the perturbed data and $V_{\varepsilon}$ is the covariance matrix of the errors, that is, $\varepsilon \sim N(0, V_{\varepsilon})$, which is proportional to the covariance matrix of the original data, $V$; that is:

$V_{\varepsilon} = \alpha V$

The CADP method generates Type A and Type C bias.

(3) Masking by noise addition and linear transformations

In Kim (1986), Tendick and Matloff (1994), Crises (2004), and Muralidhar et al. (1999), masking by correlated noise addition was modified to use additional linear transformations to eliminate certain types of bias. As a result, the sample covariance matrix of the masked data is an unbiased estimator of the covariance matrix of the original data. This method is also named the Bias-Corrected Correlated-Noise Additive Data Perturbation (BCADP) method and results only in Type C bias.

(4) Masking by noise addition and nonlinear transformation

Sullivan (1989) proposed a complex algorithm (not discussed here) combining simple additive noise with a nonlinear transformation. This masking method is applied to discrete attributes.

Muralidhar et al.
(1999) introduced a general method for additive data perturbation, the GADP method, which is a further improvement on the previous RDP methods. Suppose the database U has a set C of confidential attributes and a set NC of nonconfidential attributes, with n records. A perturbed database P, which alters only the attributes in set C, is constructed on the basis of the original database U. The perturbation process preserves all statistical relationships, such as the mean values for C and the measures of covariance and canonical correlation between C and NC. Each record in the set C is then generated from a multivariate normal distribution. This process is repeated for all records. The GADP method guarantees that the statistical properties between all attributes are the same before and after perturbation, thereby eliminating all types of bias. Thus, GADP is called a bias-free RDP method. By comparing it empirically with other perturbation methods, Muralidhar et al. suggested that the GADP method provides the highest level of security and represents a general form of additive noise perturbation.

5.3 Variable Data Perturbation

5.3.1 CVC Interval Protection for Confidential Data

Gopal, Goes, and Garfinkel (1998) initiated the idea of interval protection for confidential information in a database and introduced the concept of interval disclosure. They developed three techniques, which they called "Technique-LP", "Technique-ELS", and "Technique-RP", for various query types. As a result, the query types that a user can ask are limited to SUM (COUNT), MEAN, MIN, and MAX for numerical data. This method was further studied in Gopal et al. (2000). Later, Gopal et al. (2002) formally proposed the Confidentiality via Camouflage (CVC) interval protection technique, which is designed to answer numerical ad hoc statistical queries to an online database. Garfinkel et al. (2002, 2004) further extended this technique. Garfinkel et al.
(2002) explored the CVC technique for privacy protection of binary confidential data and answered only ad hoc COUNT queries (the same as SUM queries here). The extended technique is called BinCVC.

Consider a database consisting of $n$ records. The BinCVC technique introduces $s$ binary camouflage vectors, $P = \{P^1, P^2, \ldots, P^s\}$, which are used to camouflage or hide the true confidential vector $d$, where $P^j = d$ for some $j \le s$. Without loss of generality, they assumed the database contained only one binary confidential field. Each camouflage vector is denoted as $P^j = (p^j_1, \ldots, p^j_n)$. When a user asks a query $q$, an interval answer $I(q) = [l(q), u(q)]$ is returned as follows. The upper bound $u(q)$ and lower bound $l(q)$ of the interval are calculated from the maximum and minimum over all camouflage vectors of the sum over the records related to the query, that is,

$u(q) = \max_{j} \sum_{i \in q} p^j_i$ and $l(q) = \min_{j} \sum_{i \in q} p^j_i$

The true answers are guaranteed to be inside the interval response, $\sum_{i \in q} d_i \in I(q)$.

Table 5-1: An Example Database (Data source: Garfinkel et al. 2002)

Record  Name        Job      Age  Company  HIV
1       Jones       Manager  27   A        0
2       Smith       Trainee  42   B        0
3       Johnson     Manager  63   C        0
4       Andres      Trainee  28   B        1
5       Selkirk     Manager  57   A        0
6       Clark       Manager  55   B        0
7       Cooper      Trainee  21   D        1
8       Nevins      Trainee  32   C        1
9       Granville   Manager  46   C        0
10      Brady       Trainee  36   D        1
11      Larson      Manager  47   B        1
12      Remminger   Trainee  28   D        0
13      Early       Manager  64   A        1
14      Hodge       Manager  35   B        0

The HIV status field represents a binary confidential field with 14 records (see Table 5-1). All query responses involving this sensitive field are computed from camouflage vectors generated by the BinCVC technique. Table 5-2 is an example of camouflage vectors for this database, where vector $P^3$ is the true vector.

Table 5-2: The Example Database with Camouflage Vectors (Data source: Garfinkel et al. 2002)

Record  P1  P2  P3 = d
1       1   0   0
2       0   1   0
3       1   0   0
4       0   0   1
5       0   1   0
6       1   0   0
7       0   0   1
8       0   0   1
9       0   1   0
10      0   0   1
11      0   0   1
12      1   0   0
13      0   0   1
14      0   1   0

Camouflage vectors are generated by a complex network algorithm. The design of the network, whose joint paths construct the different camouflage vectors, is a critical step in the success of the BinCVC model. The network represents all $n$ records in the confidential field with variables $(x_1, \ldots, x_n)$. All paths run from the source to the destination. The network is constructed using two parameters. Parameter $w$ gives the total number of paths, and parameter $m$ is the number of paths consisting only of true-value edges. These determine the number of camouflage vectors, $s = \binom{w}{m}$. An illustration of the network construction for the example database (see Table 5-1) using three camouflage vectors (see Table 5-2) is shown in Figure 5-1.

[Figure 5-1 image: network of paths from source to destination over the variables $x_1, \ldots, x_n$]

Figure 5-1: Network With (m,w) = (1,3) (Data source: Garfinkel et al. 2002)

In the example database (Table 5-1), all 14 records in the confidential field are denoted by variables $(x_1, \ldots, x_{14})$. Parameter $w = 3$ indicates that 3 disjoint paths are constructed in the network, and $m = 1$ implies that all variables with true value 1 in the true confidential field are assigned to one of the three paths. Variables representing the other records, with value zero, are assigned as evenly as possible to the remaining two paths. The total number of camouflage vectors is $s = \binom{3}{1} = 3$. Every camouflage vector corresponds to a choice of $m$ out of the $w$ paths. So, in Figure 5-1, each camouflage vector selects one of the three paths and assigns the value one to the records on that path. Comparing with Table 5-2, camouflage vector $P^1$ has records 1, 3, 6, and 12 containing the value one. The remaining records in $P^1$ are zero.
In the corresponding network, accordingly, there is one path including only the variables $(x_1, x_3, x_6, x_{12})$.

The performance measure $CB = 1 - |p^* - m/w|$, where $p^*$ is the proportion of ones in the true confidential field, is employed to assess the quality of networks for a given database with different $w$ and $m$ values; CB stands for Column Balancing. The usefulness of each query answer is computed by the formula

$Z = 100 \times \left(1 - \dfrac{u(q) - l(q)}{|q|}\right)$

where $|q|$ denotes the cardinality of the query $q$, which is the number of records that are involved in that query. The closer to 1.0 Z is, the better the query answer is. The ideal network that yields the tightest interval response has a small $s$, and every camouflage vector has the same number of ones as the true confidential field. That is, $p^j = p^*$, where $p^j$ is the proportion of ones in $P^j$, and $p^*$ is the proportion of ones in the true vector $d$. This ideal structure is called "perfect column balancing". See Table 5-2 as an example. Here $p^1 = p^2 = 0.4$, $p^* = 0.6$. A good CB "increases the probability of (a) better query answer".

BinCVC is a very promising methodology for database privacy. However, instead of an exact answer, it responds to the query with an interval, which reduces the data utility. We define the information loss of the CVC technique as the width of the interval, given by $e = u(q) - l(q)$.

5.3.2 Variable-data Perturbation

Inspired by the CVC technique, we propose a new data perturbation method, variable-data perturbation. Unlike random data perturbation, whose random noise is drawn from a normal distribution $N(0, \sigma^2)$, the variable-data perturbation method is defined as a data perturbation method that modifies the confidential information by adding discrete noise generated by a parametrically driven algorithm, such as the network algorithm with parameters $w$ and $m$ in the CVC interval protection method. The perturbed database is created once and for all. The algorithm can choose various parameters to produce different types of noise.
We can view the output of the algorithm as if it were pulling values randomly from some distribution D with known parameters, with a nonzero mean $\mu$ and variance $\sigma^2$. The mean and variance are always finite. Each query answer is computed from the perturbed data.

A discrete random data perturbation method builds a perturbed database from which all query responses are computed. An output perturbation method does not alter the database, but query answers are perturbed before they are returned to the user. The variable-data perturbation method is a hybrid of data perturbation and output perturbation and generates noise for the confidential field. Perturbed answers to each query involving sensitive data are calculated only from the perturbed confidential vector. We treat variable-data perturbation as a data perturbation method with query protection.

Consider the BinCVC technique as an example of the variable-data perturbation method. The network algorithm creates camouflage vectors to disguise the true confidential vector once and for all. Each query answer is an interval that is computed from the camouflage vectors and is assured to include the true answer. In a worst-case scenario, the noise or perturbation can be regarded as the difference between the lower bound and the upper bound of the interval, $e_q = u(q) - l(q)$, where the $e_q$ are discrete random variables.

We simulated the network algorithm on the example database (see Table 5-1) in Garfinkel et al. (2002) and computed the interval answers for all queries. Since the confidential vector in the database is a 14-bit binary string, the total number of queries involving this binary vector is $2^{14}$. The following figures (Figure 5-2A-D) show four different cases with the parameters of the network algorithm set at (1) w = 5 and m = 2; (2) w = 7 and m = 3; (3) w = 8 and m = 5; (4) w = 12 and m = 6.
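The interval computation behind such a simulation can be sketched in a few lines of Python. The fragment below is only an illustration under stated assumptions: it takes the three camouflage vectors printed in Table 5-2 (the network with m = 1 and w = 3) as given instead of generating them with the network algorithm, so it does not reproduce the four parameter settings listed above. The names `interval_answer` and `width_distribution` are hypothetical.

```python
from collections import Counter

# Camouflage vectors from Table 5-2, records 1..14; P3 is the true vector d.
P1 = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
P2 = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
P3 = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
VECTORS = [P1, P2, P3]


def interval_answer(query, vectors=VECTORS):
    """Return (l(q), u(q)) for a COUNT query given as 1-based record numbers."""
    sums = [sum(v[i - 1] for i in query) for v in vectors]
    return min(sums), max(sums)


def width_distribution(n=14, vectors=VECTORS):
    """Tally e_q = u(q) - l(q) over all 2**n query sets (empty set included)."""
    counts = Counter()
    for mask in range(1 << n):
        query = [i + 1 for i in range(n) if (mask >> i) & 1]
        lower, upper = interval_answer(query, vectors)
        counts[upper - lower] += 1
    return counts


counts = width_distribution()
total = sum(counts.values())
mean = sum(e * c for e, c in counts.items()) / total
variance = sum((e - mean) ** 2 * c for e, c in counts.items()) / total
for e in sorted(counts):
    print(e, counts[e])
print("mean width:", round(mean, 3), "variance:", round(variance, 3))
```

For instance, the query over all 14 records yields the sums 4, 4 and 6 for the three vectors, so the interval answer is [4, 6] and the perturbation is 2. Tallying the widths in this way gives a discrete noise distribution of the same kind as the ones reported for the other networks.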
Among those networks, w = 7 and m = 3 creates perfect column balancing and, based on the frequencies of each noise value over all $2^{14}$ queries, we obtain a noise distribution with mean $\mu = 3.302$ and variance $\sigma^2 = 1.379$, as shown in Figure 5-2B.

[Figure 5-2 image: four bar charts, each titled "Discrete Distribution of Perturbations in the CVC Network" for the stated w and m, with the perturbation value (1 to 12) on the x-axis and the frequency on the y-axis]

Figure 5-2: Discrete Distribution of Perturbations from the BinCVC Network Algorithm. A) w=5 and m=2, B) w=7 and m=3, C) w=8 and m=5, and D) w=12 and m=6.

After the network is set up with parameters $w$ and $m$, the noise distribution D is fixed, and its mean $\mu$ and variance $\sigma^2$ are finite and known. Figure 5-2 shows this property. We intend to bound the noise $e_q$ drawn from D in terms of $\mu$ and $\sigma^2$. We will continue to discuss how to estimate the mean $\mu$ and variance $\sigma^2$ in the next chapter.

5.3.3 Discussion

For BinCVC, there is a conflict between the two performance measures, CB and the Z score. That is, a high Column Balancing value, which indicates good protection for the whole database with some specific $w$ and $m$, cannot guarantee good query answers (i.e., a high Z value).

We claim that interval disclosure, or interval inference, occurs when the maximum error of the snooper's estimate of the true confidential value is less than the tolerance threshold predetermined by the DBA.
Exact inference can be treated as a special case of interval inference with an error value of 0. Gopal et al. (2002) state that the CVC technique can completely eliminate exact disclosure and interval inference. However, Muralidhar et al. (2004) have shown empirically that the CVC technique is sometimes vulnerable to interval inference. By utilizing a simple deterministic procedure, the snooper can sometimes compromise the database by shrinking the interval answers into a smaller range within the predetermined threshold. Suppose the $i$th query is answered by $[l_i, u_i]$. In their example, they show how a snooper could compute the midpoint of the interval, $m_i = (l_i + u_i)/2$, and the half-width of the interval, $w_i = (u_i - l_i)/2$, and then use these to build a new interval $m_i \pm (0.5 \times w_i)$, which still includes the true value but is narrower than the original interval and, hence, narrower than the threshold. See Table 5-3 for this example.

Table 5-3: An Example of Interval Disclosure (Data source: Muralidhar et al. 2004)

                                   Original Interval                 Intruder Interval
Query  True                        Lower   Upper   Width of          Lower   Upper   Width of
       Value  1      2      3      Limit   Limit   Original          Limit   Limit   Modified
                                                   Interval (%)                      Interval (%)
1      276.3  275.2  302.8  263.5  263.5   302.8   14.2              273.3   293.0   7.1
2      35.4   36.2   32.7   36.3   32.7    36.2    10.2              33.6    35.4    5.1
3      37.4   37.4   41.1   35.5   35.5    41.1    14.9              36.9    39.7    7.5

In Gopal et al. (2002), the interval protection requires that the interval length be at least 10% of the original value. In Table 5-3, the intruder's intervals computed using the method provided by Muralidhar et al. (2004) are narrower than the threshold of 10%. Thus, the database is compromised in terms of interval disclosure. However, the test given by Muralidhar et al. (2004) examined the CVC interval protection only empirically. For networks with different $w$ and $m$, this deterministic method may not apply.
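The arithmetic of the midpoint and half-width procedure can be checked with a short Python sketch. The function names `shrink_interval` and `relative_width` are hypothetical, and the inputs are the original limits and true values of queries 1 and 3 as printed in Table 5-3; nothing here guarantees that a shrunken interval contains the true value in general.

```python
def shrink_interval(lower, upper, factor=0.5):
    """Return the interval m +/- factor * w, where m is the midpoint and w the half-width."""
    midpoint = (lower + upper) / 2.0
    half_width = (upper - lower) / 2.0
    return midpoint - factor * half_width, midpoint + factor * half_width


def relative_width(lower, upper, true_value):
    """Interval length as a percentage of the true value."""
    return 100.0 * (upper - lower) / true_value


# (true value, original lower limit, original upper limit) for queries 1 and 3.
examples = [(276.3, 263.5, 302.8), (37.4, 35.5, 41.1)]
for true_value, lower, upper in examples:
    new_lower, new_upper = shrink_interval(lower, upper)
    print(
        round(new_lower, 1),
        round(new_upper, 1),
        round(relative_width(lower, upper, true_value), 1),
        round(relative_width(new_lower, new_upper, true_value), 1),
        new_lower <= true_value <= new_upper,
    )
```

For query 1 the midpoint is 283.15 and the half-width is 19.65, so the shrunken interval is approximately [273.3, 293.0]; its length is about 7.1% of the true value, below the 10% requirement, while the true value 276.3 still lies inside it.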
5.4 A Bound for the Fixed-data Perturbation (Theoretical Basis)

Dinur and Nissim (2003) studied a theoretical tradeoff between privacy and usability of statistical databases (SDBs). They concluded that a minimum perturbation magnitude of $\Omega(\sqrt{n})$ is required for each query q in order to maintain even weak privacy of the database. Otherwise, an adversary could reconstruct the statistical database using $l = n(\lg n)^2$ (base 2 logarithm) queries with high probability in polynomial time. As expected, the SDB can be protected from disclosure if the perturbation bound $\mathcal{E}$ exceeds $o(\sqrt{n})$; however, the data utility may then be too low to be useful. Since Dinur and Nissim make no assumptions beyond assuming the additive error is fixed, their results are valid both for data perturbation and output perturbation methods using fixed additive error. We review their results and methodology in the following sections.

Dinur and Nissim (2003) modeled the confidential field in the database as an n-bit binary string $(d_1, \dots, d_n) \in \{0,1\}^n$. The true answer for a SUM query q, $q \subseteq \{1, \dots, n\}$, is computed as

\[ a_q = \sum_{i \in q} d_i . \]

The perturbed answer for a query q is A(q), obtained by adding a perturbation such that $|a_q - A(q)| \le \mathcal{E}$, where $\mathcal{E} = o(\sqrt{n})$ is the bound for the perturbation of each query. The authors developed a Linear Programming (LP) algorithm to generate the candidate confidential vector, which is the vector that an adversary would use to compromise the database. See Table 5-4 for details of the LP algorithm.

Table 5-4: LP Algorithm (Data source: Dinur and Nissim 2003).

[Query Phase] Let $l = n(\lg n)^2$. For $1 \le j \le l$, choose $q_j \subseteq_R \{1, \dots, n\}$ and set $\tilde{a}_{q_j} \leftarrow A(q_j)$.
[Weeding Phase] Using any linear objective, solve the following linear program with unknowns $c_1, \dots, c_n$:
\[ \tilde{a}_{q_j} - \mathcal{E} \le \sum_{i \in q_j} c_i \le \tilde{a}_{q_j} + \mathcal{E} \quad (1 \le j \le l), \qquad 0 \le c_i \le 1 \quad (1 \le i \le n). \]
[Rounding Phase] Let $c'_i = 1$ if $c_i > 1/2$ and $c'_i = 0$ otherwise. Output $c'$.

Vectors that are far away from the true confidential vector d are weeded out by the algorithm. The output of the LP algorithm is the candidate vector that best estimates the confidential vector.
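To make the weeding idea concrete, the following toy sketch replaces the linear program with brute-force enumeration of all binary candidates, which is feasible only for very small n. It is not the polynomial LP algorithm of Table 5-4; the names `weed` and `hamming`, the database size, the perturbation bound and the number of queries are all illustrative assumptions.

```python
import itertools
import random


def weed(n, answers, bound):
    """Keep every binary candidate whose query sums lie within `bound` of all perturbed answers."""
    survivors = []
    for cand in itertools.product((0, 1), repeat=n):
        if all(abs(sum(cand[i] for i in q) - a) <= bound for q, a in answers):
            survivors.append(cand)
    return survivors


def hamming(u, v):
    """Number of positions in which two vectors differ."""
    return sum(a != b for a, b in zip(u, v))


rng = random.Random(0)
n, bound, num_queries = 8, 1, 40            # toy sizes, chosen only for illustration
d = tuple(rng.randint(0, 1) for _ in range(n))

answers = []
for _ in range(num_queries):
    q = [i for i in range(n) if rng.random() < 0.5]     # random subset query
    noise = rng.randint(-bound, bound)                  # perturbation with |noise| <= bound
    answers.append((q, sum(d[i] for i in q) + noise))

survivors = weed(n, answers, bound)
worst = max(hamming(c, d) for c in survivors)
print(len(survivors), worst)
```

Every candidate that violates at least one query constraint is disqualified; the true vector always survives because its query sums differ from the perturbed answers by at most the bound.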
The n-bit binary vector $c'$ is obtained by rounding c, which is a vector of real numbers produced by the LP algorithm. Dinur and Nissim (2003) also introduced a vector $\bar{c}$ obtained by rounding c to the nearest integer multiple of $1/k$, where k = n represents a precision parameter, and $K = \{0, \frac{1}{k}, \frac{2}{k}, \dots, \frac{k-1}{k}, 1\}$. Hence $\bar{c} \in K^n$. They proved that

\[ \Big| \sum_{i \in q} (\bar{c}_i - d_i) \Big| \le 2\mathcal{E} + 1 . \]

To prove that the candidate vector $c'$ obtained from the algorithm is close to the true confidential field d, Dinur and Nissim (2003) introduced a Disqualifying Lemma, which proves that random queries $q_1, \dots, q_l$ would weed out all vectors $x \in X$, where

\[ X = \Big\{ x \in K^n : \Pr_i\big[\,|x_i - d_i| \ge \tfrac{1}{3}\,\big] > \varepsilon \Big\} \qquad (1) \]

The term $\Pr_i[|x_i - d_i| \ge 1/3]$ in Equation 1 represents the expected fraction of records that obey $|x_i - d_i| \ge 1/3$, for $\varepsilon > 0$. Therefore, X denotes the set of all vectors which are far away from the true vector d. The Disqualifying Lemma states that

\[ \Pr_{q \subseteq_R [n]} \Big[ \Big| \sum_{i \in q} (x_i - d_i) \Big| > 2\mathcal{E} + 1 \Big] > \delta \qquad (2) \]

The lemma proves that there exists a probability $\delta > 0$ such that a query q disqualifies x if $|\sum_{i \in q}(x_i - d_i)| > 2\mathcal{E} + 1$. The vector x will not be a valid LP solution if such a q exists. The lemma guarantees that if x is far away from d, at least one of the l queries $q_1, \dots, q_l$ would disqualify x with high probability. One missing piece is the relationship between inequalities (1) and (2) that relates $\varepsilon$ to $\delta$. The proof of the Disqualifying Lemma establishes this link, and it is possible to think of $\delta$ as a function of $\varepsilon$: $\delta(\varepsilon)$. We will discuss this further in Chapter 6. If l queries $q_1, \dots, q_l$ are chosen independently and randomly, then for each $x \in X$, the probability that none of the l queries disqualifies x is at most $(1-\delta)^l$. A conclusion derived from the Disqualifying Lemma is

\[ \Pr_{q_1, \dots, q_l \subseteq_R [n]} \big[ \forall x \in X \ \exists i,\ q_i \text{ disqualifies } x \big] \ge 1 - (n+1)^n (1-\delta)^l > 1 - \mathrm{neg}(n), \]

or equivalently

\[ 1 - \Pr \big[ \forall x \in X \ \exists i,\ q_i \text{ disqualifies } x \big] \le (n+1)^n (1-\delta)^l . \]

Thus, the probability that none of the l queries can disqualify some $x \in X$ is bounded by a very small number $\mathrm{neg}(n) > 0$.
Therefore, the Disqualifying Lemma guarantees that all disqualified vectors $x \in X$ are ruled out with high probability $(1 - \mathrm{neg}(n))$ and that the Hamming distance between the final candidate vector $c'$ and the true vector d is small, that is, $\mathrm{dist}(c', d) < \varepsilon n$. The number of queries required to weed out disqualified vectors is computed from the Disqualifying Lemma. That is, $l = n(\lg n)^2$. See Figure 5-3 for an illustration of the relationships of c, $c'$, $\bar{c}$ and d.

[Figure 5-3 (diagram): c is rounded to $c'$ ($c'_i = 1$ if $c_i > 1/2$, $c'_i = 0$ otherwise); c is rounded to $\bar{c}$ (nearest integer multiple of 1/k); $\bar{c}$ is rounded to $c'$ ($c'_i = 1$ if $\bar{c}_i > 1/2$, $c'_i = 0$ otherwise); $\mathrm{dist}(c', d)$.]
Figure 5-3: Relationships of c, $c'$, $\bar{c}$ and d.

5.5 Proposed Approach

Although SDC methods and machine learning have completely opposite research goals, similar methodologies are applied in both areas (Domingo-Ferrer and Torra 2003). SDC methods attempt to modify the data intentionally before public release. The data distortion should be large enough to protect the privacy of the confidential data and small enough to minimize the information loss. ML seeks to learn from noisy examples and designs error-resilient algorithms to disclose true information (Angluin and Laird 1988, Goldman and Sloan 1995, Shackelford and Volper 1988, Sloan 1988, Valiant 1985). SDC methods protect the confidential data stored in a database with n records and m fields. ML learns the true function from l examples, each of them having m attributes. Therefore, a common structure is used to express the information in both SDC methods and ML. Although the two areas have different research purposes and often use different terminologies, the underlying methodologies are often the same. In our research, we approach the database privacy problem from a machine learning perspective by applying PAC learning theory. We consider a scenario in which a snooper uses a learning algorithm to discover the true confidential data protected by an SDC method.
For example, Figure 5-4 demonstrates the connection between the methodologies employed in PAC learning theory and in the database protection approach of Dinur and Nissim (2003).

[Figure 5-4 (diagram): matches the terms of the Disqualifying Lemma bound, $(n+1)^n (1 - \delta(\varepsilon))^l < \mathrm{neg}(n)$, with those of the PAC learning bound, $\Pr^l\{S : h \text{ consistent and } \mathrm{error}(h) > \varepsilon\} \le |H| (1-\varepsilon)^l \le \delta$. Labels: random samples with size l, error, cardinality of the hypothesis space, accuracy parameter, confidence level.]
Figure 5-4: Illustration of the Connection between the PAC Learning and Data Perturbation

Figure 5-4 indicates that both approaches determine a training sample size l necessary to accomplish the desired goal. In the Disqualifying Lemma, the probability that some $x \in X$ survives all l queries is controlled by three quantities: the union bound over X, the per-query disqualification probability $\delta(\varepsilon)$, and the small probability $\mathrm{neg}(n)$. These three quantities correspond to the cardinality of the hypothesis space $|H|$, the accuracy parameter $\varepsilon$, and the confidence parameter $\delta$ in PAC learning theory. They are shown in Figure 5-4 as matched terms even though different notation and terminologies are adopted. Therefore, we conclude that PAC learning theory and the Disqualifying Lemma address their problems by using the same methodology for different purposes. The same parameters are required to build the models. From the perspective of PAC learning theory, we regard the true confidential field as the target concept that an adversary seeks to discover within a limited number of queries in the presence of some noise, such as random data perturbation or variable-data perturbation. In Chapter 6, we raise our research questions and extend the work of Dinur and Nissim (2003) by using PAC learning theory. We set up a model to describe how much protection is necessary to guarantee that the adversary cannot discover the database with high probability.
Put in PAC learning terms, we derive bounds on the amount of error an adversary makes, given a general perturbation scheme, the number of queries, and a confidence level. Three types of data perturbation bounds are summarized as follows in terms of different error distributions.

(1) Perturbation with a General Bound Case: General PAC bound
The error is generated independently and identically from an unknown distribution D, so this is also called the Perturbation with a Distribution-free Bound case. A general PAC bound is derived as

\[ l \ge \frac{1}{\varepsilon} \Big( \ln |H| + \ln \frac{1}{\delta} \Big) \]

where l is the number of queries needed to discover the binary confidential data, $\varepsilon$ is the amount of error that an adversary may make in compromising the database and $\delta$ is the confidence parameter. $|H| = 2^n$ is the number of candidate confidential vectors in the hypothesis space H. Without specific information about the distribution of noise, the derivation of l depends wholly on $\varepsilon$ and $\delta$, so this bound is relatively loose.

(2) Perturbation with a Fixed-data Bound Case: Fixed-data perturbation
Dinur and Nissim (2003) derived a fixed-data bound $\mathcal{E} = o(\sqrt{n})$ for the perturbation added to query responses. A bound for the number of queries is also developed, $l = n(\lg n)^2$, which is sufficient to discover the true confidential vector in the database with high probability at a small error.

(3) Perturbation with a Random Variable Bound Case: Variable-data perturbation (Proposed research)
We assume that the random perturbations added to the query responses have an unknown discrete distribution. The moments of the distribution, such as the mean and standard deviation, can be estimated. Variable-data perturbation belongs to this case. In the next chapter, we derive an error bound for this case by applying PAC learning theory. This bound provides the minimum number of queries needed to discover the protected column with specified error and confidence.
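As a rough numerical illustration of the general (distribution-free) bound in case (1), the sketch below evaluates $\frac{1}{\varepsilon}(\ln|H| + \ln\frac{1}{\delta})$ with $|H| = 2^n$. The function name and the parameter values $\varepsilon = 0.1$, $\delta = 0.05$ are assumptions chosen for illustration only; they are not taken from the dissertation.

```python
import math


def general_pac_bound(n, eps, delta):
    """Smallest integer l with l >= (1/eps) * (ln|H| + ln(1/delta)), where |H| = 2**n."""
    return math.ceil((n * math.log(2) + math.log(1.0 / delta)) / eps)


# Illustrative parameter choices (assumptions): eps = 0.1, delta = 0.05.
for n in (10, 14, 100):
    print(n, general_pac_bound(n, eps=0.1, delta=0.05))
```

The bound grows linearly in n and in $1/\varepsilon$ but only logarithmically in $1/\delta$, which is why tightening the confidence parameter is comparatively cheap.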
CHAPTER 6
DISCLOSURE CONTROL BY APPLYING LEARNING THEORY

In Chapters 2 and 3 we reviewed PAC learning theory and database security methods. In this chapter, we approach the database privacy problem using ideas from Probably Approximately Correct learning theory. Our research focuses on the additive noise perturbation masking method, which is classified into three categories: random data perturbation, fixed-data perturbation (reviewed in Chapter 5) and variable-data perturbation. Based on the work of Garfinkel et al. (2002) and Dinur and Nissim (2003), we raise our research questions and construct a theoretical model from the perspective of PAC learning theory. We attempt to derive an error bound for perturbations with a distribution specified by its first two moments, and we also develop a heuristic method to estimate the mean and standard deviation for the variable-data perturbation method. Dinur and Nissim (2003) studied the case of data perturbation bounded by a fixed number and provide a theoretical foundation for our research.

6.1 Research Problems

Our research focuses on the category of variable-data perturbation. First, we intend to derive a bound on the level of error that an adversary may make, given the variable-data perturbation method. We extend the bound on fixed-data perturbation proposed by Dinur and Nissim (2003) by bounding the perturbation of each query with a random variable $e_q$ which has a discrete distribution with known parameters, such as a finite mean and variance. We need to develop a new Disqualifying Lemma, analogous to that of Dinur and Nissim (2003), for variable-data perturbation by deploying PAC learning theory. Like the Disqualifying Lemma in Dinur and Nissim (2003), our result bounds the probability that a query does not eliminate hypotheses that are far away from the true confidential answer.
Using this, we develop a bound on the number of queries within which the database could be compromised with high probability.

6.2 The PAC Model for the Fixed-data Perturbation

We start our model by interpreting the results of Dinur and Nissim (2003) within the methodology of PAC learning theory. Suppose an adversary attempts to compromise the SDB by applying PAC learning theory. We define a Non-Private Database as follows: a database is non-private if a computationally-bounded adversary can expose a $1 - \varepsilon$ fraction of the confidential data for $\varepsilon > 0$ with probability $1 - \delta$, where $\delta > 0$. We call $1 - \delta$ the confidence level.

Consider a statistical database with n records. Its confidential field is a binary string denoted as $(d_1, \dots, d_n)' \in \{0,1\}^n$. See Table 5-1 for an example database. In this table, "HIV" status is the column we represent. An hypothesis space $H_0$ contains n-bit binary vectors, each of which is an hypothesis $h \in H_0 = \{0,1\}^n$ and denotes a candidate vector for the confidential field of the database. The cardinality of the hypothesis space, or the number of hypotheses, is $|H_0| = 2^n$. The true confidential field is regarded as the target concept $d \in H_0$. The online database receives a SUM (or COUNT) query $q \subseteq \{1, \dots, n\}$ sent by the user and responds with a perturbed answer A(q) of the true answer $a_q = \sum_{i \in q} d_i$. A perturbation is added to each query answer instead of every record and is bounded by a fixed number $\mathcal{E} \ge |a_q - A(q)|$.

PAC learning starts by random sampling. We take l samples consisting of queries and their perturbed responses, $S = ((q_1, A(q_1)), \dots, (q_l, A(q_l)))$. Since A(q) is a perturbed answer, we consider this learning from noisy data. Our learning algorithm is a linear program. As such, answers can be continuous and will be rounded. Thus it is useful to define another hypothesis space $H_2 = [0,1]^n$. For analysis, a grid will prove useful. Let the hypothesis space $H_1 = K^n$, where $K = \{0, \frac{1}{n}, \frac{2}{n}, \dots, \frac{n-1}{n}, 1\}$. Note that $H_0 \subset H_1 \subset H_2$, where all containments are strict when n > 1.
Let $h_1 : H_2 \to H_1$ by rounding each component in $H_2$ to the nearest integer multiple of 1/n (midpoints rounded down). Further, let $h_0 : H_i \to H_0$ (i = 1, 2) by rounding each component in $H_i$ to the nearest of 0 and 1 (0.5 rounds down). Note that $h_1(c)_i = c_i + f_i$, where $|f_i| \le 1/n$, $i = 1, \dots, n$. Given a sample S and a fixed perturbation $\mathcal{E}$, Dinur and Nissim (2003) gave a polynomial algorithm $\gamma$ that finds $c \in H_2$, from which one can output $h_0(c)$. We represent this algorithm by $c \leftarrow \gamma(S)$. As already discussed, the specific algorithm is a linear program (see Table 5-4). See Figure 6-1 for an illustration of the relationships of $H_0$, $H_1$, $H_2$, $h_0$, $h_1$ and d.

[Figure 6-1 (diagram): $h_0(c): H_2 \to H_0$; $h_1(c): H_2 \to H_1$; $\mathrm{dist}(h_0(\gamma(S)), d)$.]
Figure 6-1: Relationships of $H_0$, $H_1$, $H_2$, $h_0$, $h_1$ and d in the Fixed-Data Perturbation.

Let $c \in H_0$; then the Hamming distance between c and d is

\[ \mathrm{dist}(c, d) = |\{ i : c_i \ne d_i \}| = \sum_{i=1}^{n} |c_i - d_i| . \]

Let $x \in H_2$. The condition $\Pr_i[|x_i - d_i| \ge 1/3] > \varepsilon$ refers to the probability of choosing $i \in \{1, \dots, n\}$ randomly such that $|x_i - d_i| \ge 1/3$. That is, for this x there are more than $\varepsilon n$ expected records where $|x_i - d_i| \ge 1/3$. Denote this by $E_i[|x_i - d_i| \ge 1/3] > \varepsilon n$, where $\varepsilon > 0$ is arbitrary. Ultimately, we wish to show how to choose a sample size l so that $\mathrm{dist}(h_0(\gamma(S)), d) \le \varepsilon n$.

Lemma 1: If $x \in K^n$ and $E_i[|x_i - d_i| \ge 1/3] \le \varepsilon n$, then $\mathrm{dist}(h_0(x), d) \le \varepsilon n$.

Proof: First note that if $|x_i - d_i| < 1/3$ then $h_0(x)_i = d_i$, so $|h_0(x)_i - d_i| = 0 < 1/3$. Thus, since no more than $\varepsilon n$ indices i, on average, have $|x_i - d_i| \ge 1/3$, no more than $\varepsilon n$ records, on average, of $h_0(x)$ can have $|h_0(x)_i - d_i| \ge 1/3$. The number 1/3 in $|x_i - d_i| < 1/3$ guarantees that $x_i$ rounds to the same number as $d_i$. End of Proof

Let $T = \{ x \in K^n : E_i[|x_i - d_i| \ge 1/3] > \varepsilon n \}$. From the point of view of the intruder, we want our sample to disqualify all points of T with high probability $(1 - \delta)$, where $\delta \in (0,1)$ is usually chosen so that $(1 - \delta)$ is large. For a sample of size l, generated independently and identically according to an unknown but fixed distribution D, the probability that an hypothesis c is far away from the true target d is measured by the risk functional

\[ \mathrm{err}(c) = 1 - D\Big\{ q \subseteq \{1, \dots, n\} : \Big| \sum_{i \in q} (c_i - d_i) \Big| \le 1 + 2\mathcal{E} \Big\} = D\Big\{ q \subseteq \{1, \dots, n\} : \Big| \sum_{i \in q} (c_i - d_i) \Big| > 1 + 2\mathcal{E} \Big\} \]

where $c \in H_1$.
As we stated before (see Figure 6-1), the solution c from the LP can be rounded either to a binary vector $h_0(c)$ or to a vector $h_1(c) \in K^n$. The probability that a component of the rounded vector $h_1(c)$ differs from the corresponding component of the true vector d by 1/3 or more is bounded by $\varepsilon$. Based on this condition, for any random query, the difference between the answers from these two vectors is bounded by a function of the perturbation, $2\mathcal{E} + 1$. So we can see that $\mathcal{E}$ and $\varepsilon$ are related and describe the error from different perspectives. We then use a probability which is a function of $\varepsilon$, denoted as $\kappa(\varepsilon)$, to bound the risk functional as $\mathrm{err}(c) > \kappa(\varepsilon)$. We intend to bound $D^l(S : \mathrm{err}(h_1(\gamma(S))) > \kappa(\varepsilon))$ by $\delta > 0$. Provided $\mathcal{E} = o(\sqrt{n})$, the Disqualifying Lemma of Dinur and Nissim (2003) proved $\kappa(\varepsilon) > 0$. Then

\[ D^l\big(S : \mathrm{err}(h_1(\gamma(S))) > \kappa(\varepsilon)\big) \le (n+1)^n \big(1 - \kappa(\varepsilon)\big)^l \qquad (6.1) \]

where $(n+1)^n = |K^n| \ge |T|$ is the union bound over T, and therefore the worst-case scenario is bounded. The proof of the Disqualifying Lemma in Dinur and Nissim (2003) determines $\kappa(\varepsilon)$ through a constant T, as follows.

Recall that the Disqualifying Lemma (Dinur and Nissim 2003) proves

\[ \Pr_{q \subseteq_R [n]} \Big[ \Big| \sum_{i \in q} (x_i - d_i) \Big| > 2\mathcal{E} + 1 \Big] > \kappa(\varepsilon) . \]

In the proof, $\xi_1, \dots, \xi_n$ are defined as independent random variables such that $\xi_i = x_i - d_i$ and $\xi_i = 0$, both with probability 1/2. Let $\xi = \sum_{i=1}^{n} \xi_i$. The authors approached the proof by dividing it into two cases based on the size of the expected value of $\xi$, denoted as $E(\xi)$. Let $T > 0$ be a constant to be specified later in the proof. In the case of $E(\xi) > T\sqrt{n}$, the probability satisfies

\[ \Pr_{q \subseteq_R [n]} \Big[ \Big| \sum_{i \in q} (x_i - d_i) \Big| > 2\mathcal{E} + 1 \Big] > 1 - 2e^{-T^2/8} . \]

In the second case of $E(\xi) \le T\sqrt{n}$, the probability satisfies

\[ \Pr_{q \subseteq_R [n]} \Big[ \Big| \sum_{i \in q} (x_i - d_i) \Big| > 2\mathcal{E} + 1 \Big] > \frac{\alpha}{3\beta} . \]

The role of $\beta$ is discussed below. (For the proof details, see Appendix A of Dinur and Nissim 2003.) From the result of the Disqualifying Lemma, we choose $\kappa(\varepsilon)$ to be the minimum of the probabilities from these two cases. In term (1), for the chosen T the quantity $2e^{-T^2/8}$ is negligible, so $1 - 2e^{-T^2/8}$ is close to 1. In term (2), we know $\alpha = \varepsilon/36$, so $\alpha/(3\beta) = \varepsilon/(108\beta)$, which is positive and less than 1. Hence term (2) is the smaller of the two, and we choose $\kappa(\varepsilon) = \alpha/(3\beta)$ for the worst case.

Dinur and Nissim choose $\beta$ large enough so that

\[ \alpha > 3\beta \sum_{k=1}^{\infty} (k+1) e^{-k\beta/2} \]

(note the right side is decreasing in $\beta$). Simple manipulations show that

\[ \sum_{k=1}^{\infty} e^{-k\beta/2} = \frac{e^{-\beta/2}}{1 - e^{-\beta/2}} . \]

Taking the derivative of this formula with respect to $\beta$, we obtain

\[ \sum_{k=1}^{\infty} k e^{-k\beta/2} = \frac{e^{-\beta/2}}{(1 - e^{-\beta/2})^2} . \]

Thus

\[ \sum_{k=1}^{\infty} (k+1) e^{-k\beta/2} = \frac{e^{-\beta/2}}{(1 - e^{-\beta/2})^2} + \frac{e^{-\beta/2}}{1 - e^{-\beta/2}} = \frac{2e^{-\beta/2} - e^{-\beta}}{(1 - e^{-\beta/2})^2} . \]

Since $\alpha = \varepsilon/36$, the requirement $\alpha > 3\beta \sum_{k=1}^{\infty}(k+1)e^{-k\beta/2}$ becomes

\[ \varepsilon > 108\beta \, \frac{2e^{-\beta/2} - e^{-\beta}}{(1 - e^{-\beta/2})^2} . \]

Let $x = e^{-\beta/2}$. Then

\[ \varepsilon > 108\beta \, \frac{2x - x^2}{(1 - x)^2} . \]

$\beta$ is decided by $\varepsilon$ ($\varepsilon$ is a predefined parameter). For $0 < \varepsilon < 1$, numerical calculations show we need $\beta > 17$, thus giving $x < 0.0002$. Since $\alpha/(3\beta) = \varepsilon/(108\beta)$, plugging the last inequality into $\alpha/(3\beta)$ gives

\[ \kappa(\varepsilon) = \frac{\alpha}{3\beta} = \frac{\varepsilon}{108\beta} > \frac{2x - x^2}{(1 - x)^2} . \]

Now back to inequality (6.1). If we bound the probability with the parameter $\delta > 0$, we get

\[ D^l\big(S : \mathrm{err}(h_1(\gamma(S))) > \kappa(\varepsilon)\big) \le (n+1)^n \big(1 - \kappa(\varepsilon)\big)^l < \delta \]

where $\delta > 0$ is the confidence parameter. Taking the base 2 logarithm (denoted as lg in all the following formulas) on both sides of the last two terms gives

\[ \lg\big[ (n+1)^n (1 - \kappa(\varepsilon))^l \big] < \lg \delta . \]

Given a predefined parameter $\delta$, the minimum sample size is computed as

\[ l > \frac{\lg(\delta) - n \lg(n+1)}{\lg(1 - \kappa(\varepsilon))} \qquad (6.2) \]

where $\kappa(\varepsilon) = \frac{2x - x^2}{(1-x)^2}$ and $x = e^{-\beta/2}$ with $\beta$ chosen large enough. l is bounded by three parameters $\delta$, $\varepsilon$ and n. Since $\kappa(\varepsilon)$ is a very small number, if we apply it directly in formula (6.2), the resulting bound for the sample size l is quite large, much more than $l = n(\log n)^2$ from Dinur and Nissim (2003), even for a small n. See Table 6-1 for examples of the two bounds on the sample size with different values of n when $\delta = 0.05$. Table 6-1 shows that by interpreting the Disqualifying Lemma of Dinur and Nissim (2003), we get a PAC bound which is looser than the one derived in Dinur and Nissim (2003), no matter what n is.
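A small numerical sketch of formula (6.2) next to the Dinur and Nissim bound $n(\lg n)^2$ may help to see how loose the PAC bound is. The function names are hypothetical, and the value $\beta = 18$ is only an illustrative choice above the threshold $\beta > 17$ quoted in the text; the value of $\beta$ behind Table 6-1 is not stated here, so the printed PAC bounds need not match that table.

```python
import math


def kappa(beta):
    """kappa = (2x - x^2) / (1 - x)^2 with x = exp(-beta / 2)."""
    x = math.exp(-beta / 2.0)
    return (2.0 * x - x * x) / (1.0 - x) ** 2


def pac_bound_62(n, delta, beta):
    """Smallest integer above (lg(delta) - n*lg(n+1)) / lg(1 - kappa), as in formula (6.2)."""
    numerator = math.log2(delta) - n * math.log2(n + 1)
    denominator = math.log1p(-kappa(beta)) / math.log(2.0)   # lg(1 - kappa), computed stably
    return math.ceil(numerator / denominator)


def dinur_nissim_queries(n):
    """n * (lg n)^2, rounded up to an integer."""
    return math.ceil(n * math.log2(n) ** 2)


# beta = 18 is an assumed, illustrative value.
for n in (10, 50, 100):
    print(n, dinur_nissim_queries(n), pac_bound_62(n, delta=0.05, beta=18.0), 2 ** n)
```

Because $\kappa$ shrinks roughly like $2e^{-\beta/2}$, the bound (6.2) grows quickly as $\beta$ increases, which is the looseness the text refers to.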
However, this PAC bound is still much less than the total number of queries in a database, $2^n$, except when n is very small, such as n = 10.

Table 6-1: Bounds on the Sample Size with Different Values of n.

n       n(log n)^2    Bound from formula (6.2)    2^n
10      111           373,643                     1024
50      1,593         2,274,447                   1.1259E+15
100     4,415         5,191,750                   1.2677E+30
500     40,193        34,338,167                  3.2734E+150
1000    99,317        76,188,677                  1.0715E+301
5000    754,940       469,076,527

In section 6.4, we will show how to replace $\kappa(\varepsilon) = \frac{2x - x^2}{(1-x)^2}$ with a more practical number by using the bound in Dinur and Nissim (2003), thereby deriving a tighter bound for the variable-data perturbation case.

6.3 The PAC Model for the Variable-data Perturbation

In this section, we move to the case in which an adversary tries to compromise a database whose confidential data is modified by adding variable-data perturbation. In this method, a perturbation created by a database protection algorithm is added to the answer of each query q. The perturbed response is A(q) while the true query answer is $a_q = \sum_{i \in q} d_i$.

6.3.1 PAC Model Setup

In the fixed-data perturbation case, a fixed number bounds the perturbation: $|a_q - A(q)| \le \mathcal{E}$. In the variable-data perturbation case, $|a_q - A(q)| = e_q$, and we assume that the perturbation $e_q$ is a random variable with an unknown discrete distribution with known finite mean $\mu$ and variance $\sigma^2$. Based on the knowledge of these parameters, we attempt to develop a bound on the error that an adversary makes. The bound will be expressed in terms of these parameters. A threshold on the number of queries, within which the database is compromised, can be derived from this error bound. Given S and $e_q$ for each $q \in S$, we develop a polynomial algorithm $\gamma_2$ that obtains an hypothesis $c \in H_2$ from which we can output $h_0(c)$. The algorithm, $c \leftarrow \gamma_2(S)$, is a linear program:

\[ \min_{c \in [0,1]^n} \sum_{i=1}^{n} c_i \quad \text{s.t.} \quad A(q_j) - e_j \le \sum_{i \in q_j} c_i \le A(q_j) + e_j, \quad j = 1, \dots, l, \]

where $e_j$ is the realization of the random variable $e_q$ in the LP algorithm and is sampled from the perturbation distribution.
Then the distance between $h_1(c)$ and the true vector d is bounded by

\[ \Big| \sum_{i \in q} (h_1(c)_i - d_i) \Big| \le \Big| \sum_{i \in q} (c_i - d_i) \Big| + \Big| \sum_{i \in q} (h_1(c)_i - c_i) \Big| \le e_q + \frac{|q|}{n} \le 1 + e_q \]

where $h_1(c)_i = c_i + f_i$ and $|f_i| \le 1/n$. Recall that $|q|$ denotes the cardinality of the query q.

In the variable-data perturbation case, we need to develop a new Disqualifying Lemma which would disqualify all $h_1(c)$ that are far away from the true vector d. That is, for any $x \in H_2$, query q disqualifies x if $|\sum_{i \in q}(h_1(x)_i - d_i)| > 1 + e_q$. See Figure 6-2 for an illustration of the relationships of $H_0$, $H_1$, $H_2$, $h_0$, $h_1$ and d.

[Figure 6-2 (diagram): $h_0(c): H_2 \to H_0$; $h_1(c): H_2 \to H_1$; $\mathrm{dist}(h_0(\gamma_2(S)), d) \le \varepsilon n$.]
Figure 6-2: Relationships of $H_0$, $H_1$, $H_2$, $h_0$, $h_1$ and d in the Variable-Data Perturbation

6.3.2 Disqualifying Lemma 2

For a sample of size l which is generated i.i.d. according to an unknown but fixed discrete distribution D, the probability that an hypothesis $h_1(c)$ is far away from the true target d is measured by the risk functional

\[ \mathrm{err}(h_1(c)) = 1 - D\Big\{ q \subseteq \{1, \dots, n\} : \Big| \sum_{i \in q} (h_1(c)_i - d_i) \Big| \le 1 + e_q \Big\} = D\Big\{ q \subseteq \{1, \dots, n\} : \Big| \sum_{i \in q} (h_1(c)_i - d_i) \Big| > 1 + e_q \Big\} . \]

We intend to bound this error rate. As in section 6.2, we want

\[ D^l\big(S : \mathrm{err}(h_1(\gamma_2(S))) > \eta(\varepsilon)\big) < \delta \qquad (6.3) \]

where $\delta \in (0,1)$. We now develop our Lemma 2, a disqualifying lemma analogous to Dinur and Nissim's Disqualifying Lemma. Lemma 2 assumes that the mean and standard deviation of the distribution of $e_q$ satisfy $\mu \ge \sigma$, $\sigma + \mu \le 2\sqrt{n}$ and $\mu \le \sqrt{n}$. Practical reasons motivate these respective conditions, as we now discuss by considering what happens when each is violated.

(1) if $\mu < \sigma$: Since the standard deviation measures how spread out the perturbations ($e_q$ values) can be, if $\mu < \sigma$, many perturbations will be widely dispersed, meaning that the corresponding intervals offer little information. This can take many forms. For example (see Figure 6-3), with a bimodal distribution some intervals will be tight and others very disperse. The tight ones might give an attacker the ability to easily disclose parts of the confidential information. The wide intervals may provide too little usable information to be meaningful for the user.
(2) if $\sigma + \mu > 2\sqrt{n}$, there are four possible cases:

a. $\mu > \sqrt{n} > \sigma$
In this case, most perturbations are clustered around a large mean. Although a large perturbation provides better protection of the database, it reduces the usability of the query answers. The user gets very little information. For a demonstration of this case, see Figure 6-4. Consequently, a database security method is meaningless if it produces perturbations with a large mean and relatively small standard deviation.

[Figure 6-3 (bar chart titled "A Bimodal Distribution of Perturbations"; horizontal axis: Perturbation, values 1 to 12)]
Figure 6-3: A Bimodal Distribution of Perturbations in the CVC Network while $\mu < \sigma$.

b. $\mu \ge \sigma \ge \sqrt{n}$
A very high mean and standard deviation imply two situations: (1) all query responses are perturbed with large noise values which are widely spread out in the high-mean area, in which case the user cannot get any useful data from these query answers; and (2) many query answers have large perturbations while others provide users with very tight answers which can reveal the confidential data easily. Neither of the above distributions is meaningful for our research.

c. $\sigma \ge \sqrt{n} > \mu$
The same reasoning described in (1) applies here also.

[Figure 6-4 (bar chart titled "A Discrete Distribution of Perturbations with high mean and small standard deviation"; horizontal axis: Perturbation, values 1 to 12)]
Figure 6-4: A Distribution of Perturbations in the CVC Network with $\mu > \sqrt{n} > \sigma$.

(3) if $\mu > \sqrt{n}$ holds: A database usually includes a large number of records. Therefore, the mean of the perturbations is likely less than $\sqrt{n}$ in most cases. If $\mu > \sqrt{n}$ is true, then the security method likely offers little information to the users, no matter what the standard deviation is. See the discussion in (2) a, b and c for similar explanations.
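Lemma 2: Let $x \in [0,1]^n$, $d \in \{0,1\}^n$ and let $e_q$ be a random variable generated from a distribution with mean $\mu = E(e_q) < \infty$ and variance $\sigma^2 < \infty$, where $\mu \ge \sigma$, $\sigma + \mu \le 2\sqrt{n}$ and $\mu \le \sqrt{n}$. If $\Pr_i[|h_1(x)_i - d_i| \ge 1/3] > \varepsilon$, then there exists a constant $\eta(\varepsilon) > 0$ such that

\[ \Pr_{q \subseteq_R [n]} \Big[ \Big| \sum_{i \in q} (h_1(x)_i - d_i) \Big| > 1 + e_q \Big] > \eta(\varepsilon) \]

where $\eta$ is a function of $\varepsilon$.

Disqualifying Lemma 2 Proof: Let $Y_i = h_1(x)_i - d_i$ be i.i.d. random variables. For any fixed $q \subseteq [n]$, let $m = |q|$, the cardinality of q. Without loss of generality, assume $q = \{1, \dots, m\}$. Given a random variable $e_q$ and a constant $\alpha \in [0, \sqrt{n}]$, we have

\[ P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \Big) = P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q,\ e_q \le 2\alpha \Big) + P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q,\ e_q > 2\alpha \Big) . \]

According to Chebyshev's Inequality, since $e_q$ is a random variable with $\mu = E(e_q)$ and variance $\sigma^2$,

\[ P(e_q \le 2\alpha) \ge 1 - \frac{\sigma^2}{(2\alpha - \mu)^2} . \]

Then we obtain

\[ P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \Big) \ge \underbrace{P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \,\Big|\, e_q \le 2\alpha \Big)}_{(1)} \ \underbrace{\Big[ 1 - \frac{\sigma^2}{(2\alpha - \mu)^2} \Big]}_{(2)} \qquad (6.4) \]

Let the probability $\eta(\varepsilon)$ be equal to the product of term (1) and term (2) in formula (6.4). Next, we continue our proof by solving two problems in turn.

(1) Prove $\eta(\varepsilon)$ is a positive number: In all steps of the proof by Dinur and Nissim (2003) of their Disqualifying Lemma, term (1) can be substituted for $P(|\sum_{i=1}^{m} Y_i| > 1 + 2\mathcal{E})$ provided $\alpha \in [0, o(\sqrt{n})]$. To see this we have the following:

\[ P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \,\Big|\, e_q \le 2\alpha \Big) = \sum_{j=0}^{2\alpha} P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + j \Big) P(e_q = j) \ge P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + 2\alpha \Big) \sum_{j=0}^{2\alpha} P(e_q = j), \]

since $\sum_{j=0}^{2\alpha} P(e_q = j) > 0$. Now, Dinur and Nissim (2003) proved $P(|\sum_{i=1}^{m} Y_i| > 1 + 2\mathcal{E}) > 1 - 2e^{-T^2/8}$ for the appropriate choice of T. Rescaling T in proportion to $\sum_{j=0}^{2\alpha} P(e_q = j)$ proves our point. Similarly, for the second part of their proof, Dinur and Nissim's parameters $\alpha$ and $\beta$ (their $\alpha = \varepsilon/36$ is distinct from the constant $\alpha$ used in this proof) can be rescaled in proportion to $\sum_{j=0}^{2\alpha} P(e_q = j)$. This gives

\[ P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \,\Big|\, e_q \le 2\alpha \Big) \ge \min\Big\{ 1 - 2e^{-T^2/8},\ \frac{\alpha}{3\beta} \Big\} \ge \frac{2x - x^2}{(1 - x)^2} \]

where $x = e^{-\beta/2}$ with $\beta$ chosen large enough, as seen in Section 6.2. Thus

\[ P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \Big) \ge P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \,\Big|\, e_q \le 2\alpha \Big) \Big[ 1 - \frac{\sigma^2}{(2\alpha - \mu)^2} \Big] \ge \frac{2x - x^2}{(1 - x)^2} \Big[ 1 - \frac{\sigma^2}{(2\alpha - \mu)^2} \Big] . \]

So the probability $\eta(\varepsilon)$ will be a positive number as long as term (2) is greater than 0. Thus we need to have

\[ 1 - \frac{\sigma^2}{(2\alpha - \mu)^2} > 0, \]

which is true when $0 \le \alpha < \frac{\mu - \sigma}{2}$ and when $\frac{\mu + \sigma}{2} < \alpha \le \sqrt{n}$, provided $\mu \ge \sigma$ and $\sigma + \mu \le 2\sqrt{n}$ respectively. These latter two conditions are assumed in Lemma 2.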
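The positivity condition on term (2) can be checked numerically. The sketch below scans a grid of threshold constants $\alpha \in [0, \sqrt{n}]$ and prints the sign of $1 - \sigma^2/(2\alpha - \mu)^2$ together with membership in the two admissible ranges. The function names and the grid are hypothetical; the values n = 14, $\mu$ = 2.0236 and $\sigma$ = 1.1150 are those reported for the w=3, m=1 network in Table 6-2.

```python
import math


def chebyshev_factor(alpha, mu, sigma):
    """Term (2) of formula (6.4): 1 - sigma^2 / (2*alpha - mu)^2."""
    gap = 2.0 * alpha - mu
    if gap == 0.0:
        return float("-inf")          # the bound is vacuous when 2*alpha equals mu
    return 1.0 - sigma ** 2 / gap ** 2


def admissible(alpha, mu, sigma, n):
    """True when alpha lies in [0, (mu-sigma)/2) or in ((mu+sigma)/2, sqrt(n)]."""
    low_range = 0.0 <= alpha < (mu - sigma) / 2.0
    high_range = (mu + sigma) / 2.0 < alpha <= math.sqrt(n)
    return low_range or high_range


n, mu, sigma = 14, 2.0236, 1.1150      # w=3, m=1 network values from Table 6-2
root = math.sqrt(n)
grid = [k / 200 * root for k in range(201)]

for alpha in grid[::40]:
    print(round(alpha, 3), admissible(alpha, mu, sigma, n),
          round(chebyshev_factor(alpha, mu, sigma), 4))

# Values of the factor at the two ends of the admissible set.
print(chebyshev_factor(0.0, mu, sigma), chebyshev_factor(root, mu, sigma))
```

On this grid the factor is positive exactly on the admissible ranges, and its value at $\alpha = \sqrt{n}$ exceeds its value at $\alpha = 0$ for these parameters.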
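Thus,

\[ \Pr_{q \subseteq_R [n]} \Big[ \Big| \sum_{i \in q} (h_1(x)_i - d_i) \Big| > 1 + e_q \Big] \ge P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \,\Big|\, e_q \le 2\alpha \Big) \Big[ 1 - \frac{\sigma^2}{(2\alpha - \mu)^2} \Big] \qquad (6.5) \]

where the parameter $\alpha \in \big[0, \frac{\mu - \sigma}{2}\big) \cup \big(\frac{\mu + \sigma}{2}, \sqrt{n}\big]$.

(2) We now maximize the lower bound over $\alpha$. In order to derive a tight bound, we seek the maximum value of (6.5) subject to $\alpha \in \big[0, \frac{\mu - \sigma}{2}\big) \cup \big(\frac{\mu + \sigma}{2}, \sqrt{n}\big]$. So

\[ P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \Big) \ge \max_{\alpha} P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \,\Big|\, e_q \le 2\alpha \Big) \Big[ 1 - \frac{\sigma^2}{(2\alpha - \mu)^2} \Big] \ge P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \,\Big|\, e_q \le 2\alpha \Big) \max_{\alpha} \Big[ 1 - \frac{\sigma^2}{(2\alpha - \mu)^2} \Big] \]

where the $\alpha$ in the first term is any $\alpha \in [0, o(\sqrt{n})]$. Using (for this term) $\alpha = o(\sqrt{n})$ gives us

\[ P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \Big) \ge \frac{2x - x^2}{(1 - x)^2} \max_{\alpha} \Big[ 1 - \frac{\sigma^2}{(2\alpha - \mu)^2} \Big] . \]

Note that $1 - \frac{\sigma^2}{(2\alpha - \mu)^2}$ is decreasing over $\big[0, \frac{\mu - \sigma}{2}\big)$ and increasing over $\big(\frac{\mu + \sigma}{2}, \sqrt{n}\big]$, so we merely need to compare $1 - \frac{\sigma^2}{\mu^2}$ to $1 - \frac{\sigma^2}{(2\sqrt{n} - \mu)^2}$. By assumption $\mu \le \sqrt{n}$, so the latter is maximal. Thus

\[ P\Big( \Big| \sum_{i=1}^{m} Y_i \Big| > 1 + e_q \Big) \ge \eta(\varepsilon) = \frac{2x - x^2}{(1 - x)^2} \Big[ 1 - \frac{\sigma^2}{(2\sqrt{n} - \mu)^2} \Big] > 0 . \]

End of proof.

Lemma 2 is a crucial step for our model. The proof provides a bound on the error $\varepsilon$ in terms of the mean and variance of $e_q$. In the next section, we continue discussing these two parameters. Based on the results of Lemma 2, we are able to derive a bound for the number of queries within which the adversary would be able to compromise, with high probability $(1 - \delta)$, a database protected by the variable-data perturbation method.

6.4 The Bound of the Sample Size for the Variable-data Perturbation Case

In this section, based on the proof of Lemma 2, we develop the sampling bound for the variable-data perturbation case using two approaches. In the first approach, we use the result of Dinur and Nissim (2003) directly from their Disqualifying Lemma proof in our bound; the second approach instead applies their sample bound to obtain a tighter bound.

6.4.1 The Bound Based on the Disqualifying Lemma Proof

Recall that $\mathrm{err}(h_1(c)) > \eta(\varepsilon)$ (see section 6.3), and we intend to bound $D^l(S : \mathrm{err}(h_1(\gamma_2(S))) > \eta(\varepsilon))$ by the confidence parameter $\delta > 0$. Then

\[ D^l\big(S : \mathrm{err}(h_1(\gamma_2(S))) > \eta(\varepsilon)\big) \le (n+1)^n \big(1 - \eta(\varepsilon)\big)^l . \]

Thus we get

\[ D^l\big(S : \mathrm{err}(h_1(\gamma_2(S))) > \eta(\varepsilon)\big) \le (n+1)^n \big(1 - \eta(\varepsilon)\big)^l = (n+1)^n \Big[ 1 - \frac{2x - x^2}{(1 - x)^2} \Big( 1 - \frac{\sigma^2}{(2\sqrt{n} - \mu)^2} \Big) \Big]^l . \]

Bounding this with $\delta$ gives

\[ (n+1)^n \Big[ 1 - \frac{2x - x^2}{(1 - x)^2} \Big( 1 - \frac{\sigma^2}{(2\sqrt{n} - \mu)^2} \Big) \Big]^l < \delta . \]

Taking the base 2 logarithm on both sides, we obtain

\[ n \lg(n+1) + l \, \lg\Big[ 1 - \frac{2x - x^2}{(1 - x)^2} \Big( 1 - \frac{\sigma^2}{(2\sqrt{n} - \mu)^2} \Big) \Big] < \lg \delta . \]

The minimum sample size is thus

\[ l > \frac{\lg \delta - n \lg(n+1)}{\lg\Big[ 1 - \frac{2x - x^2}{(1 - x)^2} \Big( 1 - \frac{\sigma^2}{(2\sqrt{n} - \mu)^2} \Big) \Big]} . \]

Since $\frac{2x - x^2}{(1-x)^2}$, where $x = e^{-\beta/2}$, is a very small number, the resulting bound is very loose (as was the similar bound under the Dinur and Nissim framework discussed earlier). If n is small, the sample size l can be even greater than $2^n$, which is the total number of all possible queries. With larger n, l becomes much smaller than $2^n$. However, l is still a very large number. In order to reduce the sample size l, we need to find a more practical value to use instead of $\frac{2x - x^2}{(1-x)^2}$.

6.4.2 The Bound Based on the Sample Size

Starting from Dinur and Nissim (2003), the sample size l is bounded by $n \lg^2 n$ if the fixed perturbation is less than $\sqrt{n}$. Therefore, we have a sufficient bound for the fixed-data perturbation case (see section 6.2 for the details):

\[ l > \frac{\lg(\delta) - n \lg(n+1)}{\lg\Big( 1 - \frac{2x - x^2}{(1 - x)^2} \Big)} > n \lg^2 n . \]

Consider the boundary case

\[ n \lg^2 n = \frac{\lg(\delta) - n \lg(n+1)}{\lg(1 - \kappa(\varepsilon))} . \]

Then

\[ \lg(1 - \kappa(\varepsilon)) = \frac{\lg(\delta) - n \lg(n+1)}{n \lg^2 n}, \qquad 1 - \kappa(\varepsilon) = 2^{\frac{\lg(\delta) - n \lg(n+1)}{n \lg^2 n}} . \]

Based on the above result for $\kappa(\varepsilon)$, we replace

\[ \frac{2x - x^2}{(1 - x)^2} \quad \text{with} \quad 1 - 2^{\frac{\lg(\delta) - n \lg(n+1)}{n \lg^2 n}} . \]

This formula provides a better value than $\kappa(\varepsilon)$ and leads to a tighter bound for the sample size in the variable-data perturbation case. Since the reasoning used by Dinur and Nissim (2003) to arrive at $n \lg^2 n$ remains unchanged for our case, we can use $1 - 2^{\frac{\lg(\delta) - n \lg(n+1)}{n \lg^2 n}}$ in place of the factor $\frac{2x - x^2}{(1-x)^2}$ in our $\eta(\varepsilon)$. This gives

\[ (n+1)^n \Big[ 1 - \Big( 1 - 2^{\frac{\lg(\delta) - n \lg(n+1)}{n \lg^2 n}} \Big) \Big( 1 - \frac{\sigma^2}{(2\sqrt{n} - \mu)^2} \Big) \Big]^l < \delta \]

from which we obtain

\[ \lg (n+1)^n + l \, \lg\Big[ 1 - \Big( 1 - 2^{\frac{\lg(\delta) - n \lg(n+1)}{n \lg^2 n}} \Big) \Big( 1 - \frac{\sigma^2}{(2\sqrt{n} - \mu)^2} \Big) \Big] < \lg \delta \]

and therefore

\[ l > \frac{\lg \delta - n \lg(n+1)}{\lg\Big[ 1 - \Big( 1 - 2^{\frac{\lg(\delta) - n \lg(n+1)}{n \lg^2 n}} \Big) \Big( 1 - \frac{\sigma^2}{(2\sqrt{n} - \mu)^2} \Big) \Big]} \qquad (6.7) \]

From formula (6.7) we can see that the sample size l decreases when $\mu$ and $\sigma$ decrease.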
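Formula (6.7) is straightforward to evaluate numerically. The following sketch is one possible implementation; the function name `sample_size_bound` is hypothetical, and $\delta = 0.05$ is an assumption carried over from the setting stated for Table 6-1. The inputs in the loop are the $\mu$ and $\sigma$ values reported for the three networks in Table 6-2 with n = 14.

```python
import math


def sample_size_bound(n, mu, sigma, delta=0.05):
    """Smallest integer l satisfying formula (6.7) for n records and perturbation moments mu, sigma."""
    spread = 2.0 * math.sqrt(n) - mu
    if spread <= sigma:
        raise ValueError("requires sigma + mu < 2*sqrt(n) so that the Chebyshev factor is positive")
    numerator = math.log2(delta) - n * math.log2(n + 1)
    # Factor that replaces (2x - x^2)/(1 - x)^2, obtained from the boundary case n * lg^2(n).
    first = 1.0 - 2.0 ** (numerator / (n * math.log2(n) ** 2))
    # Chebyshev factor evaluated at alpha = sqrt(n).
    second = 1.0 - sigma ** 2 / spread ** 2
    return math.ceil(numerator / math.log2(1.0 - first * second))


# (mu, sigma) pairs from Table 6-2: w=3,m=1; w=5,m=2; w=7,m=3.
for mu, sigma in ((2.0236, 1.1150), (2.7760, 1.1114), (3.3019, 1.174)):
    print(mu, sigma, sample_size_bound(14, mu, sigma))
```

Under the assumed $\delta = 0.05$, these three inputs evaluate to 213, 217 and 223, the sample sizes listed in Table 6-2. With $\sigma = 0$ the second factor equals 1 and the bound collapses to $n \lg^2 n$ rounded up.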
6.4.3 Discussion

As we know from section 6.2, the larger the number of camouflage vectors s is, the larger the response intervals are, which leads to a larger perturbation mean and standard deviation. This result simply implies that the sample size l increases with an increase of s. Our experiments based on the three examples in Garfinkel et al. (2002) support these conclusions. The database has 14 records, n = 14. Three cases are considered in Table 6-2.

Table 6-2: The Relationship among $\mu$, $\sigma$, s and l.

                     Network
Variable    w=3 and m=1    w=5 and m=2    w=7 and m=3
s           3              10             35
$\mu$       2.0236         2.7760         3.3019
$\sigma$    1.1150         1.1114         1.174
l           213            217            223

From Table 6-2, we can see that the sample size l increases as s and $\mu$ increase, while $\sigma$ varies only slightly across the three networks. These sample sizes are very close to the bound $n \lg^2 n$ from Dinur and Nissim (2003) and much less than $2^{14} = 16{,}384$.

6.5 Estimating the Mean and Standard Deviation

In the previous section, we derived a bound on the sample size, which is the minimum number of queries required to disclose the binary confidential information in a database protected by the variable-data perturbation method. The bound (see formula 6.7) is decided by four parameters: the number of database records n, the confidence parameter $\delta$, and the mean $\mu$ and standard deviation $\sigma$ of the perturbation distribution. Among these four parameters, n and $\delta$ are known and predetermined. In this section, we develop a method to estimate the mean and standard deviation of the perturbation distribution.

In the technique of Garfinkel et al. (2002), the perturbation mean $\mu$ and standard deviation $\sigma$ are fixed as soon as the algorithm design is finished, as with the networks for camouflage vectors in the CVC technique. However, the actual mean and standard deviation can be calculated only if all responses from the $2^n$ queries are obtained, which is not practical in most situations.
Instead of computing the true mean and standard deviation from $2^n$ queries, our heuristic method estimates these two values approximately, denoted as $\hat{\mu}$ and $\hat{\sigma}$, by using the following random sampling method. Let

$i$: index of the query
$q_i$: the $i$th query
$e_i$: interval length of query $q_i$
$\hat{\mu}_i$: mean of perturbations using queries $1, \ldots, i$
$\hat{\sigma}_i$: standard deviation of perturbations using queries $1, \ldots, i$
$l_i$: sample size computed from $\hat{\mu}_i$ and $\hat{\sigma}_i$ using formula (6.7)

Table 6-3 lists the heuristic steps for estimating the mean, the standard deviation and the bound on the sample size. We use the network example in Garfinkel et al. (2002) to illustrate our heuristic. The basic setting for the network algorithm is as follows: there are $n = 14$ database records, and the parameters are $w = 3$ and $m = 1$. The true mean and standard deviation computed from $2^{14}$ queries are $\mu = 2.023$ and $\sigma = 1.115$, which give a sample size $l = 213$ from formula (6.7). Also see Table 5-2 and Figure 5-1 for all camouflage vectors and the CVC network algorithm. Next, we show how the heuristic is applied to estimate $\hat{\mu}$ and $\hat{\sigma}$ for the CVC technique example in Garfinkel et al. (2002).

Table 6-3: Heuristic to Estimate the Mean $\hat{\mu}$, Standard Deviation $\hat{\sigma}$, and the Bound $l$.

Heuristic:
0. for (i = 1; i <= 30; i++): generate query $q_i$ and record its perturbation $e_i$.
1. Generate query $q_i$ and record its perturbation $e_i$.
2. Compute $\hat{\mu}_i$ and $\hat{\sigma}_i$ using $e_1, \ldots, e_i$.
3. Compute $l_i$ from formula (6.7) using the estimated $\hat{\mu}_i$ and $\hat{\sigma}_i$.
4. Increment $i$ and repeat step 1 to step 3 until $i > l_i$.
This $l_i$ is the final bound on the sample size $l$; $\hat{\mu}_i$ and $\hat{\sigma}_i$ are the final values for the estimated $\hat{\mu}$ and $\hat{\sigma}$.

For example, the intruder sends a random query $q_i$ to the database, asking how many employees in Company B are HIV positive (see Table 5-1). The database responds with the interval answer [1, 2] (see Table 5-2 for the set of camouflage vectors), from which the random perturbation is recorded as $e_i = 2 - 1 = 1$. The intruder continues sending queries and recording perturbations.
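The heuristic of Table 6-3 can be sketched as a short loop. This is an illustration under stated assumptions, not the author's code: the names `estimate_bound`, `ask`, `warmup` and `max_queries` are hypothetical, `ask` stands in for sending one random query and returning its interval length, the stopping rule is read as "stop once $i > l_i$", the standard deviation is computed as a population standard deviation, and $\delta$ must be supplied by the caller because this section does not state it.

```python
import math
import random
import statistics


def sample_size_bound(n, delta, mu, sigma):
    """Formula (6.7); returns math.inf when the estimates fall outside its domain."""
    a = math.log2(delta) - n * math.log2(n + 1)
    eta = 1.0 - 2.0 ** (a / (n * math.log2(n) ** 2))
    gap = 2.0 * math.sqrt(n) - mu
    if gap <= 0 or sigma ** 2 >= gap ** 2:
        return math.inf                                # keep sampling in this case
    bound = a / math.log2(1.0 - eta * (1.0 - sigma ** 2 / gap ** 2))
    return math.floor(bound) + 1                       # smallest integer above the bound


def estimate_bound(ask, n, delta, warmup=30, max_queries=100_000):
    """Estimate the perturbation mean, standard deviation and sample-size bound.

    ask() is a hypothetical callable that issues one random query and
    returns the length of the interval answer (the perturbation e_i).
    """
    e = [ask() for _ in range(warmup)]                 # step 0: initial 30 queries
    while True:
        e.append(ask())                                # step 1: one more query
        mu_hat = statistics.fmean(e)                   # step 2: running estimates
        sigma_hat = statistics.pstdev(e)               # assumption: population form
        l_i = sample_size_bound(n, delta, mu_hat, sigma_hat)   # step 3
        i = len(e)
        if i > l_i or i >= max_queries:                # step 4: assumed stopping rule
            return mu_hat, sigma_hat, l_i, i


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for the database: interval lengths drawn from a small set.
    print(estimate_bound(lambda: rng.choice([1, 2, 3, 4]), n=14, delta=0.05))
```

The `max_queries` guard is not part of Table 6-3; it only prevents an endless loop if the running estimates never enter the range in which formula (6.7) is defined.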
The mean and standard deviation are computed as /,= 'Jl e and usin when the number of queries is more than 30, and 6 = \ 1  using el ,* *, e when the number of queries is more than 30, 1 