QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVAL

By MARK CECCHINI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA, 2005

Copyright 2005 by Mark Cecchini

This document is dedicated to Tara, Julian and Campbell, who were my inspiration in pursuing and finishing a PhD.

ACKNOWLEDGMENTS

I would like to thank Tara and the rest of the Cecchinis for putting up with me throughout this process, and our families for their support throughout this endeavor. Without my committee there would be no dissertation. So, I would like to acknowledge my advisor, Gary Koehler, who came up with the initial research idea and has seen this research through from the beginning; Haldun Aytug, who has been working on this project for three years; Praveen Pathak, for his information retrieval expertise; and Gary McGill, for helping me to understand the accounting relevance of the work. Finally, I would like to thank Karl Hackenbrack for his guidance in the early stages of this work.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF OBJECTS
ABSTRACT

CHAPTER
1 INTRODUCTION AND MOTIVATION
2 FINANCIAL EVENTS
  2.1 Fraud Detection
  2.2 Bankruptcy Detection
  2.3 Restatement Detection
3 INFORMATION RETRIEVAL METHODOLOGIES
  3.1 Overview
  3.2 Vector Space Model
  3.3 WordNet
  3.4 Ontology Creation
4 MACHINE LEARNING METHODOLOGIES
  4.1 Statistical Learning Theory
  4.2 Support Vector Machines
  4.3 Kernel Methods
    4.3.1 General Kernel Methods
    4.3.2 Domain Specific Kernels
5 THE FINANCIAL KERNEL
6 THE ACCOUNTING ONTOLOGY AND CONVERSION OF DOCUMENTS TO TEXT VECTORS
  6.1 The Accounting Ontology
    6.1.1 Step 1: Determine Concepts and Novel Terms that are Specific to the Accounting Domain
    6.1.2 Step 2: Merge Novel Terms with Concepts
    6.1.3 Step 3: Add Multiword Domain Concepts to WordNet
  6.2 Converting Text to a Vector via the Accounting Ontology
7 COMBINING QUANTITATIVE AND TEXT DATA
8 RESEARCH QUESTIONS, METHODOLOGY AND DATA
  8.1 Hypotheses
  8.2 Research Model
  8.3 Datasets
    8.3.1 Fraud Data
    8.3.2 Bankruptcy Data
    8.3.3 Restatement Data
  8.4 The Ontology
  8.5 Data Gathering and Preprocessing
    8.5.1 Preprocessing Quantitative Data
    8.5.2 Preprocessing Text Data
9 RESULTS
  9.1 Fraud Results
  9.2 Discussion of Fraud Results
  9.3 Bankruptcy Results
  9.4 Discussion of Bankruptcy Results
  9.5 Restatement Results
  9.6 Discussion of Restatement Results
  9.7 Support of Hypotheses
10 SUMMARY, CONCLUSION AND FUTURE RESEARCH
  10.1 Summary
  10.2 Conclusion
  10.3 Future Research

APPENDIX
A ONTOLOGIES AND STOPLIST
  A.1 Ontologies
    A.1.1 GAAP, 300 Dimensions, 100 Concepts, 100 2-grams, 100 3-grams
    A.1.2 GAAP, 60 Dimensions, 40 Concepts, 10 2-grams, 10 3-grams
    A.1.3 GAAP, 10 Dimensions, 10 Concepts
    A.1.4 10K, Bankruptcy, 100 Dimensions
    A.1.5 10K, Bankruptcy, 50 Dimensions, 50 Concepts
    A.1.6 10K, Bankruptcy, 25 Dimensions, 25 Concepts
    A.1.7 10K, Fraud, 150 Dimensions, 50 Concepts, 50 2-grams, 50 3-grams
    A.1.8 10K, Fraud, 50 Dimensions, 50 Concepts
    A.1.9 10K, Fraud, 25 Dimensions, 25 Concepts
  A.2 Stoplist
B QUANTITATIVE AND TEXT DATA
  B.1 Quantitative Data
  B.2 Text Data
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1 Financial Kernel Validation
2 Fraud Detection Results using Financial Kernel
3 Fraud Detection Results using Text Kernel, 300 Dim GAAP Ont.
4 Fraud Detection Results using Comb. Kernel, 300 Dim GAAP Ont.
5 Fraud Detection Results using Text Kernel, 60 Dim GAAP Ont.
6 Fraud Detection Results using Comb. Kernel, 60 Dim GAAP Ont.
7 Fraud Detection Results using Text Kernel, 10 Dim GAAP Ont.
8 Fraud Detection Results using Comb. Kernel, 10 Dim GAAP Ont.
9 Fraud Detection Results using Text Kernel, 150 Dim 10K Ont.
10 Fraud Detection Results using Comb. Kernel, 150 Dim 10K Ont.
11 Fraud Detection Results using Text Kernel, 50 Dim 10K Ont.
12 Fraud Detection Results using Comb. Kernel, 50 Dim 10K Ont.
13 Fraud Detection Results using Text Kernel, 25 Dim 10K Ont.
14 Fraud Detection Results using Comb. Kernel, 25 Dim 10K Ont.
15 Bankruptcy Prediction Results using Financial Kernel
16 Bankruptcy Prediction Results using Text Kernel, 300 Dim GAAP Ont.
17 Bankruptcy Prediction Results using Comb. Kernel, 300 Dim GAAP Ont.
18 Bankruptcy Prediction Results using Text Kernel, 60 Dim GAAP Ont.
19 Bankruptcy Prediction Results using Combination Kernel, 60 Dim GAAP Ont.
20 Bankruptcy Prediction Results using Text Kernel, 10 Dim GAAP Ont.
21 Bankruptcy Prediction Results using Combination Kernel, 10 Dim GAAP Ont.
22 Bankruptcy Prediction Results using Text Kernel, 100 Dim 10K Ont.
23 Bankruptcy Prediction Results using Combination Kernel, 100 Dim 10K Ont.
24 Bankruptcy Prediction Results using Text Kernel, 50 Dim 10K Ont.
25 Bankruptcy Prediction Results using Combination Kernel, 50 Dim 10K Ont.
26 Bankruptcy Prediction Results using Text Kernel, 25 Dim 10K Ont.
27 Bankruptcy Prediction Results using Text Kernel combined with Financial Attributes, 25 Dim 10K Ont.
28 Restatement (1,379 cases) Prediction Results using Financial Kernel
29 Restatement Prediction Results using Financial Kernel
30 Restatement Prediction Results using Text Kernel, 300 Dim GAAP Ont.
31 Restatement Prediction Results using Comb. Kernel, 300 Dim GAAP Ont.
32 Restatement Prediction Results using Text Kernel, 60 Dim GAAP Ont.
33 Restatement Prediction Results using Combination Kernel, 60 Dim GAAP Ont.
34 Restatement Prediction Results using Text Kernel, 10 Dim GAAP Ont.
35 Restatement Prediction Results using Combination Kernel, 10 Dim GAAP Ont.
36 Restatement Prediction Results using Text Kernel, 150 Dim 10K Ont.
37 Restatement Prediction Results using Combination Kernel, 150 Dim 10K Ont.
38 Restatement Prediction Results using Text Kernel, 50 Dim 10K Ont.
39 Restatement Prediction Results using Combination Kernel, 50 Dim 10K Ont.
40 Restatement Prediction Results using Text Kernel, 25 Dim 10K Ont.
41 Restatement Prediction Results using Text Kernel combined with Financial Attributes, 25 Dim 10K Ont.

LIST OF FIGURES

1 Ontology Hierarchy
2 Basic Graph Kernel
3 Graph Kernel
4 The Financial Kernel
6 Updated Financial Kernel
7 Accounting Ontology Creation Process
8 WordNet Noun Hierarchy with Domain Concepts
9 WordNet Noun Hierarchy with Domain Concepts Enriched with Novel Terms
10 WordNet Noun Hierarchy with Domain Concepts, Novel Terms and Multiword Concepts
11 Text Kernel
12 Combined Kernel
13 The Discovery Process
14 Fraud Features
15 Bankruptcy Features
16 Fraud Dataset
17 Bankruptcy Dataset
18 Restatement Dataset

LIST OF OBJECTS

1 Text Data

ABSTRACT

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVAL

By Mark Cecchini
August 2005
Chair: Gary Koehler
Major Department: Decision and Information Sciences

A financial event is any happening that dramatically changes the value of a firm. Examples of financial events are management fraud, bankruptcy, exceptional earnings announcements, restatements, and changes in corporate structure.
This dissertation creates a method for timely detection of financial events, using machine learning methods to create a discriminant function. Because any financial event can have myriad possible causes, the method must be powerful. To increase the power of current detection methods, text related to the company is analyzed together with quantitative information on the company. The text variables are chosen based on an automatically created accounting ontology. The quantitative variables are mapped to a higher dimension that takes into account ratios and year-over-year changes. The mapping is achieved via a kernel, which support vector machines use to perform the learning task. The methodology is tested empirically on three datasets: management fraud, bankruptcy, and financial restatements. The results show that the methodology is competitive with the leading management fraud detection methods. The bankruptcy and restatement results show promise.

CHAPTER 1
INTRODUCTION AND MOTIVATION

SAS 99, Consideration of Fraud in a Financial Statement Audit, establishes external auditors' responsibility to plan and perform audits to provide reasonable assurance that the audited financial statements are free of material fraud. Recent events highlight that failing to detect fraudulent financial reporting not only exposes the audit firm to adverse legal consequences (e.g., the demise of Arthur Andersen LLP), but also exposes the audit profession to increased public and governmental scrutiny that can lead to fundamental changes in the structure of the public accounting industry, accounting firm conduct, and government oversight of the accounting profession (consider, for example, the Sarbanes-Oxley Act of 2002 [89] and subsequent actions of the SEC [92] and NYSE [71]). Research that helps auditors better assess the risk of material misstatement during the planning phase of an audit will reduce instances of fraudulent reporting.
Such research is of interest to academics, standard setters, regulators, and audit firms. Current research in accounting has examined methods to assess the risk of fraudulent financial reporting. The methodologies are varied and usually combine behavioral and quantitative factors. For example, Loebbecke, Eining and Willingham [55] compiled an extensive list of company characteristics associated with fraudulent reporting (called "red flags"). This list contains financial ratios and behavioral characteristics of company management. Other methods scrutinize accounting entries that are not easily verified by outside sources; these entries are called discretionary accruals. Board composition and executive compensation are also used to model the type of environment that is ripe for fraud.

This dissertation proposes a methodology that can estimate the likelihood of fraudulent financial reporting. The resulting decision aid has the potential to complement the unaided auditor risk assessments envisioned in SAS 99. Our approach combines novel aspects of the fraud assessment research in accounting with computational methods and theory used in Information Retrieval (IR) and machine learning/data mining. Machine learning uses computational techniques to automate the discovery of patterns that may be difficult to find by normal analytic techniques. Machine learning methodologies have been used to determine financial statement validity or, somewhat related, the likelihood of bankruptcy and creditworthiness. Many models are commonly used in machine learning, with neural networks [66], linear discriminant functions [34], logit functions [3], and decision trees [80] being popular choices. Attempts have been made to recognize patterns in fraudulent companies using neural networks, linear discriminant functions, logit functions, and decision trees. These studies utilized quantitative data from financial statements and surveys from auditors.
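To make the simplest of these model families concrete, the sketch below fits a logit-style classifier to two financial features by plain gradient descent. It is purely illustrative: the feature choices (sales growth and working capital over total assets) and all figures are invented, and this is a toy for exposition, not the dissertation's method.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, y, lr=0.5, epochs=2000):
    """Logistic regression via stochastic gradient descent.

    Returns weights [bias, w1, ..., wk]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = sigmoid(z) - yi          # gradient of the log-loss
            w[0] -= lr * err
            for j, xj in enumerate(xi):
                w[j + 1] -= lr * err * xj
    return w

def fraud_score(w, xi):
    """Estimated probability that a firm-year is fraudulent."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))

# Hypothetical features: [sales growth, working capital / total assets]
X = [[0.90, -0.20], [0.80, -0.10], [0.70, -0.30],   # fraud-like profiles
     [0.10, 0.30], [0.05, 0.40], [0.20, 0.25]]      # non-fraud profiles
y = [1, 1, 1, 0, 0, 0]

w = fit_logit(X, y)
scores = [fraud_score(w, xi) for xi in X]
```

On this cleanly separable toy data the fitted scores exceed 0.5 for the fraud-like rows and fall below 0.5 for the rest; real studies of course face overlapping classes and asymmetric error costs.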
Unlike these earlier studies, recent advances in machine learning theory consider generalization ability and domain knowledge while the learning task is undertaken. Existing work on fraud detection has also left out a key source of information about the company: text documents. In most public documents the preponderance of information is textual, but most automated methods for detecting fraud are based on quantitative information alone. So either an expert has to distill the text to numbers, which is a monumental task, or the text-based information is largely ignored. We hypothesize that there is information hidden "between the lines" that is overlooked. Our approach can incorporate textual materials such as management discussion and analysis, news articles, and so on. An area of research called Information Retrieval (IR) can help us make use of the text. IR is often employed in library science and, more recently, in powerful Internet search engines (such as Google [39]). IR is used for varied purposes, including question answering, document sorting, knowledge engineering, query expansion, and inferencing. We use IR methodologies to cull the financial text down to numbers, which can be used in conjunction with numerical attributes obtained from the financial statements to automatically predict the likelihood of fraud. What distinguishes the proposed approach from prior attempts to understand and aid fraud-risk assessments are advances in machine learning theory, both through a theory that addresses generalization errors and through methods incorporating domain knowledge while the learning task is undertaken, and advances in IR that enable computer programs to analyze textual materials. The methodologies we create can be generalized to other accounting issues, such as the early detection of bankruptcy, detection of earnings management, early detection of increased market value, and general industry stability.
Each of these issues has the potential to impact a company's value significantly, shortly after a related first press release or news item is made public. Because of this speedy impact, these issues can be called financial events. In this dissertation we look at the early detection of bankruptcy, together with the detection of management fraud. The goal of this dissertation is discussed below. In the following chapter we review the financial event detection literature and summarize key concepts and results. Chapter 3 summarizes relevant Information Retrieval literature. Chapter 4 summarizes relevant machine learning literature. In Chapter 5 we develop a machine learning methodology that handles quantitative financial data. In Chapter 6 we develop the IR methodologies that enable us to utilize text for financial event detection. In Chapter 7 we explain how we put the text data together with the quantitative data, extending the methodology of Chapter 5 to include text. These methods are used to study actual data on which we ask a number of questions. The research model and hypotheses are developed and tested in Chapter 8. Chapter 9 explains the results, and Chapter 10 concludes with a summary and an outline of future work.

CHAPTER 2
FINANCIAL EVENTS

As explained in the Introduction, a financial event is any event that significantly alters the value of a company. One can think of such an event as one that raises or lowers the value of the company. A partial list of events that lower the value of the firm includes civil or criminal litigation, bankruptcy, management fraud, defalcations, restatements, earnings management, and poor press. We focus on three such events in particular: management fraud, bankruptcy, and restatements. In Section 2.1 we look at the fraud detection literature from accounting and machine learning. In Section 2.2 we look at the bankruptcy detection literature from those perspectives as well.
In Section 2.3 we look at the restatement literature.

2.1 Fraud Detection

A key result in audit research was given by Loebbecke, Eining and Willingham [55]. They partitioned a large set of indicators into three main components: conditions, motivation, and attitude. They found that in 86% of the fraud cases at least one factor from each component was present, indicating that it is extremely rare for fraud to exist without all three components existing simultaneously. Hackenbrack [41] finds that the relative influence of such components on auditor fraud-risk assessments varies systematically with auditor experience. This research influenced standard setting and much of the fraud assessment research that has followed.

Bell and Carcello [9] developed a logistic regression model to estimate the likelihood of fraudulent financial reporting. The significant risk factors considered were a weak internal control environment, rapid company growth, inadequate or inconsistent relative profitability, management placing undue emphasis on meeting earnings projections, management lying to the auditors or being overly evasive, the ownership status (public vs. private) of the entity, and an interaction term between a weak control environment and an aggressive management attitude toward financial reporting. The model was tested on a sample of 77 fraud engagements and 305 nonfraud engagements. It scored better than auditing professionals in the detection of fraud, and performed as well as audit professionals on the nonfraud portion. The authors suggest that the model might be used to satisfy the SAS 82 requirements for assessing the risk of material misstatement due to fraud.

Hansen, McDonald, Messier, and Bell [42] develop a generalized qualitative response model to analyze management fraud. They use the same dataset of 77 fraud and 305 nonfraud cases as collected by Loebbecke, Eining, and Willingham.
They first tested the model with symmetric costs between type I and type II errors, obtaining 89.3% predictive accuracy over 20 trials. They then adjusted the model to allow for asymmetric costs; accuracy dropped to 85.5%, but type II errors decreased markedly. The consideration of type I and type II errors is important in fraud detection research. Minimizing type II error minimizes the chance that the model will miss an actual fraud company; when type II error is minimized, type I error increases. In fraud detection, type I error is much less important than type II error.

In fraud detection, discretionary accruals are a cause for concern, as they have been known to be used to "smooth" fluctuations in periodic income. Accounts that are used in discretionary accruals, such as Bad Debts Expense, Inventory, and Accounts Receivable, are susceptible to "engineering" on the part of management. By considering year-over-year changes in ratios that include these accounts, a clearer picture of the company emerges. McNichols and Wilson [56] look at the provision for bad debts and consider how it should be reported in the absence of earnings management. Earnings management is a term that describes a spectrum of "cheating" that at a minimum is aggressive and not in strict compliance with GAAP, and at a maximum is management fraud. The research found that firms use the provision for bad debts as an income smoothing method; in other words, it is raised in times of high earnings and lowered in times of low earnings.

Ragothaman, Carpenter, and Buttars [82] developed an expert system, built by rule induction from financial statement data, to help auditors in the planning stage of an audit. The system is designed to detect error potential in order to determine whether additional substantive testing is necessary.
The system decides whether the firm is an "error" firm or a "nonerror" firm. If the firm is an "error" firm, the auditor should consider additional substantive testing. A training sample of 55 firms (22 error firms and 33 nonerror firms) and a holdout sample of 37 firms were used. On the training sample the system grouped 86.4% of error firms and 100% of nonerror firms correctly; on the holdout sample it classified 83.3% of error firms and 92% of nonerror firms correctly. This study was limited by the available data. The accounting literature on fraud detection is covered at great length in Davia [25] and Rezaee [84].

Beneish [10] developed a probit model and considered several quantitative financial variables for fraud detection. Five of the eight variables involved year-over-year changes. The study considered differing levels of relative error cost. At the 40:1 error cost ratio (type I:type II), 76% of the manipulators were correctly identified. Descriptive statistics also showed that the Days Receivable Index and the Sales Growth Index were most effective in separating the manipulators from the nonmanipulators.

2.2 Bankruptcy Detection

Bankruptcy detection is a well-studied area. Many methodologies have been applied to the problem, including discriminant analysis, neural networks, fuzzy networks, ID3 (a decision tree classification algorithm), logistic regression, and genetic algorithms. In this section we describe some major contributions to the literature. In 1966, Beaver [8] showed the efficacy of financial ratios for detecting bankruptcy. The study was performed on a dataset of 79 bankrupt and 79 nonbankrupt firms. Beaver computed the mean values of fourteen financial ratios for all companies in the study over the five-year period prior to bankruptcy. Many of the ratios proved valuable for detection, because the mean values for the bankrupt companies were significantly different from those for the nonbankrupt companies.
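Beaver's univariate idea, comparing one ratio's group means across failed and surviving firms, can be sketched with a two-sample statistic. The figures below are fabricated, cash flow to total debt is used only as an example ratio, and a Welch t-statistic stands in for the formal comparison.

```python
import math
from statistics import mean, stdev

# Hypothetical cash flow / total debt values for two matched groups of firms
bankrupt = [-0.15, -0.05, 0.02, -0.20, -0.10]
healthy = [0.45, 0.30, 0.50, 0.35, 0.40]

def welch_t(a, b):
    """Welch's t-statistic for the difference in two group means."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))

t_stat = welch_t(healthy, bankrupt)
```

A large positive t-statistic here mirrors Beaver's finding: for informative ratios, the group means diverge well before failure.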
Altman's 1968 paper [4] was a seminal work in bankruptcy detection. He developed a discriminant analysis model using financial ratios. Using a paired-sample approach, Altman compared twenty-two ratios for efficacy in bankruptcy prediction. Five ratios stood out, as they were able to accurately predict bankruptcy one year preceding the event. The model predicted bankruptcy correctly 95% of the time and nonbankruptcy correctly 80% of the time. The resulting function, dubbed the Altman Z-score, has been the benchmark for bankruptcy detection work ever since. The ratios of the Z-score include:

Working Capital/Total Assets (WC/TA)
Retained Earnings/Total Assets (RE/TA)
Earnings Before Interest and Taxes/Total Assets (EBIT/TA)
Book Value of Equity/Total Liabilities (BVE/TL)

The function has had several incarnations, and its weights differ based on industry. One four-variable version is

6.56(WC/TA) + 3.26(RE/TA) + 6.72(EBIT/TA) + 1.05(BVE/TL) = Z-score.

The weight on each ratio indicates the ratio's relative importance for classifying healthy and unhealthy companies. A score below some threshold means the company is likely in financial distress, while a score at or above the threshold means the company is likely safe from bankruptcy, at least in the short term. A gray area around the threshold can be construed as an area of concern. The predictive accuracy of this discriminant analysis function is still competitive today for sorting healthy companies from unhealthy ones. Altman et al. noted that the discriminant analysis technique had limitations, one being its inability to handle time series [5]. Bankruptcy is the product of many events: a company that goes bankrupt is likely to have been in a deteriorating state for more than one period. Year-over-year changes can capture this deterioration better than single-year measures.
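The four-ratio function quoted above is easy to turn into code. The sketch below computes the score from raw statement items; the distress/safe cutoffs (1.1 and 2.6) are commonly cited for four-variable Z-score variants but are an assumption here, since the text does not give them.

```python
def z_score(wc, re, ebit, bve, ta, tl):
    """Four-ratio Z-score variant with the weights quoted in the text.

    wc: working capital, re: retained earnings, ebit: earnings before
    interest and taxes, bve: book value of equity, ta: total assets,
    tl: total liabilities (all in the same currency units).
    """
    return (6.56 * (wc / ta) + 3.26 * (re / ta)
            + 6.72 * (ebit / ta) + 1.05 * (bve / tl))

def zone(z, distress_below=1.1, safe_above=2.6):
    """Map a score to the distress / gray / safe zones described in the text."""
    if z < distress_below:
        return "distress"
    if z > safe_above:
        return "safe"
    return "gray"

# Hypothetical firm (in millions): WC 2, RE 3, EBIT 1.5,
# book equity 4, total assets 10, total liabilities 6
z = z_score(wc=2.0, re=3.0, ebit=1.5, bve=4.0, ta=10.0, tl=6.0)
# z is roughly 4.0, which the cutoffs above place in the "safe" zone
```

Note that the gray zone realizes the "area of concern" around the threshold mentioned in the text: scores between the two cutoffs are deliberately left unclassified.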
Ohlson [72] was the first to apply a logistic regression approach to bankruptcy prediction. He identified four factors as statistically significant in affecting the probability of failure within one year: the size of the company, a measure of financial structure, a measure of performance, and a measure of current liquidity. Another finding of the research was that the predictive power of linear transforms of a vector of ratios appears to be robust for estimating the probability of bankruptcy.

Abdel-khalik and El-Sheshai [1] designed an experiment testing human judgment. Decision makers (loan officers) were allowed to choose the information cues they used to judge whether a loan would end in default. In comparison to mechanical models (discriminant analysis), the loan officers performed worse. The finding was that the choice of information cues, more than the processing of the cues, was responsible for the incorrect predictions.

Frydman, Altman and Kao [36] developed a recursive partitioning algorithm (RPA) for bankruptcy classification. The RPA is a Bayesian procedure, with classification rules derived to minimize the expected cost of misclassification. In most cases the RPA outperformed Altman's previous results via discriminant analysis.

Messier and Hansen [62] use inductive inference to analyze examples of bankrupt companies and loan defaults and to infer a set of general rules in the form of if-then-else statements; the set of output statements is called a production system. The bankruptcy study used only the following ratios: current ratio, earnings to total tangible assets, and retained earnings to total tangible assets. The production system was 100% accurate on a very small holdout set (12 bankrupt and 4 nonbankrupt firms). The study also used the production system to detect potential loan defaults.
The method was 100% accurate on the training sample and 87.5% accurate on a validation sample. In both studies, the production system used fewer ratios and was more accurate than the discriminant models it was compared against. It should be noted that Tsai and Koehler [104] tested the robustness of the results of several papers using inductive learning, including the results of Messier and Hansen [62]. The authors determined the accuracy of the induced concepts when tested on the same or similar domains. In the case of Messier and Hansen, their findings included a probability of error on the concept learned from the bankruptcy sample: the probability that the error of the learned concept exceeds 20% is 30.96%. This is due, in part, to the small sample size. The study raises a caution flag, warning readers that the true accuracy of concepts learned by induction may not be revealed in studies with small sample sizes. Tam and Kiang [101] use a back-propagation neural network to predict bank defaults. They compare their results with k-nearest neighbor, discriminant analysis [34], logistic regression [79] and ID3 [80]. When considering the year prior to bankruptcy, a multilayer neural network obtained the best results. When considering two years prior to bankruptcy, logistic regression performed best. Charalambous, Charitou, and Kaourou [17] compare the performance of three neural network methods, namely Learning Vector Quantization, Radial Basis Function, and the feedforward network. They test their results on 139 matched pairs of bankrupt and non-bankrupt U.S. firms. Their results indicate that Learning Vector Quantization gave superior results over feedforward networks and logit analysis. Piramuthu, Raghavan, and Shaw [77] develop a method of feature construction. The method finds the features that are most pertinent to the classification problem and discards the ones that are not. The "constructed" features are fed into a back-propagation neural network.
The method was tested on Tam and Kiang's [101] bankruptcy data. The network showed a significant improvement in both classification results and computation time.

2.3 Restatement Detection

The literature covering restatements is found in conjunction with the earnings management literature as well as the fraud literature. Each company for which the SEC discovers fraud is forced to restate. Not all restatements, however, are fraudulent. Restatements can be made for various reasons, including stock splits, errors, accounting irregularities, and fraud. Restatements may be voluntary or involuntary. For the purposes of this research, restatements are defined as in General Accounting Office report GAO-03-138 [37]. These restatements may be voluntary or involuntary and arise only as a result of accounting irregularities. An accounting irregularity is fraudulent if committed with intent and non-fraudulent if committed by mistake. Restatements can thus be seen as a superset of fraud. The restatement literature specifically related to detection is limited compared to that for fraud and bankruptcy. The literature reviewed in this section gives an overview of the research problems related to restatements. Dechow et al. [26] evaluate the performance of competing models of earnings management detection. The models tested are the Jones model, the Modified Jones model, the DeAngelo model and the Industry model. These models are based on the amount of discretionary accruals made by a company in a particular year. Discretionary accruals are not readily observable from publicly available reports; the models infer the amount of discretionary accruals from other inputs and total accruals. The results show that all methods are accurate in detecting earnings management in extreme cases. However, all methods gave poor results when faced with discretionary accruals that were a small percentage of total assets (1% to 5%).
Earnings management is more likely to occur at the 1% to 5% levels, so the practical value of the methods is brought into question. Koch and Wall develop an economic model of earnings management which elucidates the situations in which earnings management is most likely to occur, based on executive compensation packages. The authors determine how accruals can be used to manage reported earnings and explain several earnings management tactics. A partial list follows: (1) the "Live for Today" strategy: managers minimize accrued expenses in order to maximize profit; (2) the "Occasional Big Bath" strategy: managers attempt to meet earnings targets whenever possible, and if it looks impossible to meet targets they accrue a high amount of expenses in that period to allow for meeting earnings targets the next; (3) the "Miscellaneous Cookie Jar Reserves" strategy: the use of unrealistic assumptions in the process of estimating accruals. These methods can be readily detected after the fact; it is much more difficult to detect them as they are happening. Abbott et al. [1] study the impact of the audit committee on the likelihood of restatement. The authors find that an independent and active audit committee significantly reduces the likelihood of restatement. An audit committee that contains at least one member with financial expertise further reduces the likelihood of restatement. This empirical study gives weight to the arguments for having an audit committee. Feroz et al. [33] study the effect of Accounting and Auditing Enforcement Releases on company valuation. A reporting violation leads to a 13% decline in value over a two-day period, on average. The study also finds that companies in violation substantially underperform the market in the years prior to the release, indicating that the incentive to cheat is at least in part due to economic pressures on the executives of the company.
Hribar and Jenkins [43] study the effect of restatements on a firm's cost of equity capital. The authors find that restatements lead to decreases in expected future earnings and increases in the firm's cost of equity capital. The increases were found to be between 7% and 12%. Over the long term the rates remain higher than before the restatement by at least 6%. Another finding of the work is that firms with greater leverage are associated with larger increases in the cost of capital. Kinney et al. [50] approach the problem from the auditor's perspective. They study the correlation between restatements and the amount of non-audit services performed by the auditor. This topic became especially interesting when the Sarbanes-Oxley Act of 2002 specifically forbade auditors to provide certain non-audit services to their clients. The study found no significant positive correlation between financial information systems services and restatements. There was a significant positive correlation between unspecified non-audit service fees and restatements. This study supports the notion that auditor independence can be compromised by non-audit consulting engagements with audit clients. Peasnell et al. [75] focus on the factors associated with low earnings quality by looking at a sample of 47 firms identified as having defective financial statements. A positive correlation between defective financial statements and losses or significant earnings decreases was found. Restating firms were less likely to increase dividends or provide optimistic forecasts, and more likely to be involved in corporate restructuring. Restating firms were also less likely to employ a Big 4 auditor and often carried higher debt as a percentage of total assets compared to non-restating firms. The study also found that firms which employed active audit committees were less likely to have defective financial statements. In this chapter the literature on financial events was reviewed.
Bankruptcy, fraud and restatement research was reviewed. The next two chapters explain the methodologies used for this research project. Chapter 3 reviews information retrieval methodologies and Chapter 4 reviews machine learning methodologies.

CHAPTER 3
INFORMATION RETRIEVAL METHODOLOGIES

This chapter presents a general overview of research in the area of IR. As this is an enormous area of research, the main focus is on contributions to the field as they relate to this dissertation. Specifically, we focus on methods of ontology creation and WordNet. The sections are as follows: Section 3.1 provides a brief overview of general IR research. Section 3.2 explains the Vector Space Model. Section 3.3 explains the basic concepts of lexical databases, with specific details about WordNet. Section 3.4 explains the fundamentals of ontology creation.

3.1 Overview

"An Information Retrieval system does not inform (i.e., change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request" [105]. This field of study has exploded with the reality of massive amounts of text in an online environment, the Internet. The need to correctly choose the documents that are relevant to a keyword search has become important to industry (in the form of search engines), decision scientists, and computer scientists. There is much more to the field of IR than mere document retrieval; some other areas are as follows. Question answering systems take natural language questions as input, allowing the user to avoid learning tedious query structures. In response, the system outputs a number of short responses designed to answer the specific question of the user. The goal of question answering is to give a more precise response to the user: whereas normal document retrieval outputs a list of documents, question answering outputs small passages from documents [74].
Query expansion is a research area which has grown tremendously as a result of the Internet. Query expansion is most commonly used by search engines as a means to improve the accuracy of the results of user queries. A user types a few words as a query, and the system expands that query by adding words which will presumably give better results. There are many methods of query expansion; automatic query expansion uses machine learning techniques to choose the best expanded query [64]. Inferencing systems are a generalization of query expansion. They can be used at all levels of the IR process. They attempt to "infer" the meaning of a query and add further detail. The inference is usually based on semantic relatedness of the words in the query. Semantic relatedness can be determined by parsing a particular corpus, as in the case of latent semantic analysis [52], which uses statistical techniques to find co-occurrences between words in a corpus, or it can be determined by using a lexical reference system, such as WordNet. Literature-based discovery uses IR techniques to discover hidden truths in a particular domain. The basic idea is: parse a set of documents A related to a particular subject and find a list of subjects that A refers to. Parse a second set of documents B, related to the subjects A refers to, in order to find the subjects B refers to. The subjects B refers to are called C. If some subjects in C are unexplored in relation to A, then they may be worth looking at. The seminal work in this area is by Swanson [99]. Using Medline (a medical document repository) he was able to find previously unknown connections between Raynaud's disease and fish oil. Those connections were tested empirically by medical researchers; the results showed that fish oil can actually reduce the symptoms of Raynaud's disease.

3.2 Vector Space Model

A primary goal of IR research is to relate relevant documents to user queries.
Using IR methods, one seeks to separate relevant textual documents from non-relevant ones. A powerful method in IR research is called the vector space model [18, 54, 88]. This approach begins by truncating all words in the document into word stems. Word stems are the base forms of words, without suffixes. Stemming is important because a computer cannot see that "stymies" and "stymied" are basically the same thing. If we stem the two words, they both become "stymie." This allows the computer to see the two as one word, thus adding to the word's importance (via word count) in the document. The approach then transforms the document into a vector by counting the frequency of each word in the document. Various ways of normalizing these vectors are available. A key observation is that these vectors are now quantitative representations of the textual parts of documents. Here is a more formal explanation of the vector space model [48]. Each document in the vector space model is represented by a vector of keyword weights as follows: $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$, where $n$ is the number of keywords and $w_{i,j}$ is the weight of keyword $i$ in document $j$. This characterization allows us to view the whole document collection as a matrix of weights, called the term-by-document matrix. The columns of this matrix are the documents and the rows are the terms. A document is translated into a point in an $n$-dimensional vector space. For this method to be useful, the vectors must be normalized. The dot product between normalized vectors gives the cosine of the angle between the two vectors. When the vectors representing two documents are identical, they will have a cosine of 1; when they are orthogonal, they will have a cosine of 0. The similarity measure between documents $j$ and $k$ is as follows: $$sim(d_j, d_k) = \frac{\sum_{i=1}^{n} w_{i,j} w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \sqrt{\sum_{i=1}^{n} w_{i,k}^2}}.$$ Finding weights $w$ that most accurately depict the importance of the keywords in the collection is very important to document classification.
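The document-vector and cosine-similarity computations above can be sketched as follows (raw term counts over a fixed vocabulary; stemming and weighting are omitted for brevity):

```python
import math
from collections import Counter

def doc_vector(text, vocab):
    # Count each vocabulary term's frequency in the document.
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocab]

def cosine(dj, dk):
    # Dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(dj, dk))
    norm = math.sqrt(sum(a * a for a in dj)) * math.sqrt(sum(b * b for b in dk))
    return dot / norm if norm else 0.0

vocab = ["bankruptcy", "ratio", "kernel"]
d1 = doc_vector("bankruptcy ratio ratio", vocab)  # [1, 2, 0]
d2 = doc_vector("bankruptcy ratio", vocab)        # [1, 1, 0]
```

Identical vectors give a cosine of 1, orthogonal vectors give 0, matching the text.
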
Sparck Jones [97] made a seminal breakthrough on this problem with the TF-IDF (Term Frequency, Inverse Document Frequency) function. The basic TF-IDF function is as follows: $w_{i,j} = (tf)_{i,j} \cdot \log\frac{N}{n_i}$, where $(tf)_{i,j}$ is the frequency of term $t_i$ in document $d_j$, $N$ is the number of documents in the collection, and $n_i$ is the number of documents in which the term $t_i$ occurs at least once. The logic is as follows: for $(tf)_{i,j}$, or term frequency, a word that occurs more often in a document is more likely to be important to the classification of that document. The measure of inverse document frequency is defined by $(idf)_i = \log\frac{N}{n_i}$. The logic is that a word that occurs in all documents is not helpful in the classification of a document (hence the inverse) and therefore gets a value of 0, while a word that appears in only one document is likely to be helpful in classifying that document and gets the maximum value. Many researchers have attempted to improve upon the basic vector space model. Improvements take the form of making the document vector more accurately depict the document itself. Part-of-speech tagging is one such improvement. A part-of-speech tagger reads a document in natural language and tags every word with a part of speech, such as noun, verb, adjective or adverb. The tags are assigned using sentence structure. All part-of-speech taggers are heuristics with no guaranteed accuracy; however, recent taggers have become so accurate that they make only a few mistakes on entire corpora [11]. Another improvement is word sense disambiguation (WSD). WSD is the attempt to determine the actual meaning of a word in the context of a sentence. Often words that are spelled identically have several meanings. In the basic vector space model, the document vector would take all instances of the word "crane" and add them up. What if one sentence read, "The crane is part of the animal kingdom" and another sentence read, "The crane was the only thing that could move the 2 ton truck to safety"?
Crane in the first sentence refers to a bird, whereas crane in the second sentence refers to a mechanical device. A word-sense-disambiguated vector would have two versions of the word crane if both showed up in the corpus. This avoids confusion that might arise were we comparing the similarity between two documents, one about the bird called a crane and the other about the piece of equipment. How is WSD accomplished? One method is to look at a previously hand-tagged corpus. One such corpus is called SemCor [22]. It is a corpus of documents in which words are tagged with particular word meanings. Researchers use SemCor as a tool to learn WSD. For example, take all sets of word pairs from a corpus and compare them with SemCor, looking for pairs that appear together often enough to be considered statistically significant. The phrase "crane lifts beams" may show up in the corpus. It is possible to determine whether the noun "crane" and the verb "lifts" are found together often enough in SemCor to be considered significant. If this co-occurrence pair is considered significant, then "crane" will be given the particular sense number for which it was tagged in SemCor.

3.3 WordNet

A lexical reference system is one which allows a user to type in a word and get in return that word's relationships with other words. "WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept" [22]. The current version of WordNet has 114,648 nouns, 11,306 verbs, 21,436 adjectives and 4,669 adverbs in its system [22]. WordNet is handcrafted by linguists. The basic relation in WordNet is called synonymy. Sets of synonyms (called synsets) form its basic building blocks. For example, the word "history" is in the same synset as the words past, past times, yesteryear, and yore.
Because of this synonymy, WordNet is much closer to a thesaurus than to a dictionary. Nouns are organized into a separate lexical hierarchy, as are verbs, adjectives and adverbs. There are two main types of relations in WordNet: lexical relations and semantic relations. Lexical relations are between words and semantic relations are between concepts. A concept is another word for a synset. A relationship between concepts can be hierarchical, as in the case of hyponyms and hypernyms. The hyponym/hypernym relation is a relation on nouns. Nouns are separated from other parts of speech because their relationships are considered different from the relationships between verbs and adjectives. A hyponym/hypernym relationship is an "is a" relation. WordNet can be represented as a tree. Starting at the top or root node, the concept is very general (as in the case of "entity" or "psychological feature"). As you go down the tree, you encounter more fine-grained concepts. For example, a robin is a subordinate of the noun bird, and bird is a superordinate of robin. The subordinates are called hyponyms (a robin is a kind of bird) and the superordinates are called hypernyms (bird is what a robin is a kind of). Modifiers, which are adverbs and adjectives, are connected similarly, as are verbs. Hyponymy is only one of many relations in WordNet. Below is a list of other WordNet relations with examples [94]:

  Relation     Example                Applicable POS
  Has-Member   Faculty / Professor    Noun
  Member-Of    Copilot / Crew         Noun
  Has-Part     Table / Leg            Noun
  Part-Of      Course / Meal          Noun
  Antonym      Leader / Follower      Noun
  Antonym      Increase / Decrease    Verb
  Troponym     Walk / Stroll          Verb
  Entails      Snore / Sleep          Verb

Traditional vector space model retrieval techniques focus on the number of times a word stem appears in a document without considering the context of the word. Consider the following two sentences: "What are you eating?" "What's eating you?" The words "what," "are" and "you" would most likely be stop words.
(A stop word is any word that is thought to have little impact on the classification of any document. Common stop words are "the", "and", "but", "what", "are" and "you". The list of stop words is usually determined by taking statistics on the document set. If a word appears too often it is said to carry little weight, and it becomes a stop word. Stop words do not appear in the document vector.) The two sentences above would have identical representations in the vector space model. The meanings of the two sentences are, however, completely different. Using concepts and contexts it is possible to create a lexical reference system that interprets data specific to a particular area of interest.

3.4 Ontology Creation

Figure 1 [67] shows that there are three types of ontologies: top ontologies, upper domain ontologies, and specific domain ontologies. Top ontologies are populated with general, abstract concepts. Upper domain ontologies are more specialized, but still very general. Specific domain ontologies are populated with concepts that are specific to a particular subject. Top ontologies for the English language are relatively complete; upper domain ontologies and specific domain ontologies are still under construction [67].

[Figure 1. Ontology Hierarchy: top ontologies, upper domain ontologies, and specific domain ontologies, populated with progressively more specialized concepts.]

WordNet is a top ontology. Many domain engineers attempt to make domain specific ontologies using the backbone of top ontologies. Often a problem arises in that there is a gap between the top ontology and the specific domain ontology. In this case, an upper domain ontology is necessary. An upper domain ontology connects the top ontology to the specific domain ontology, forming the root nodes for the domain specific ontologies. Domain specific ontologies are usually created for a specific purpose, and they are very difficult to obtain.
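The hierarchical "is a" structure shared by WordNet and these ontologies can be illustrated with a hand-rolled toy hierarchy (a few nodes only; this is not WordNet's actual data or interface):

```python
# Toy "is a" hierarchy: each concept maps to its hypernym (None at the root).
hypernym = {
    "robin": "bird",
    "bird": "animal",
    "animal": "entity",
    "entity": None,
}

def hypernym_chain(concept):
    """Walk up the tree from a concept to the root, e.g. robin -> bird -> ..."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = hypernym[concept]
    return chain

def hyponyms(concept):
    """Direct subordinates: concepts whose hypernym is `concept`."""
    return [c for c, h in hypernym.items() if h == concept]
```

A domain-specific ontology built on such a backbone would attach its own root nodes under existing concepts, exactly the gap the upper domain ontology is meant to bridge.
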
Navigli and Velardi explain: "A domain ontology seeks to reduce or eliminate conceptual and terminological confusion among the members of a user community who need to share various kinds of electronic documents and information" [68]. Domain ontology creation is a new and active research area in IR. Here are some papers which highlight the current state of the research. Khan and Luo [49] construct ontologies using domain corpora and clustering algorithms. The hierarchy is created using a self-organizing tree. WordNet is used to find domain concepts, and the concept hyponyms are added to the tree under each concept. This is a novel usage of WordNet and a completely automated method of ontology construction. The method is tested on the Reuters-21578 text document corpus. Navigli and Velardi [68] give a step-by-step method explaining the process of obtaining an ontology. Candidate terminology is extracted from a domain corpus and filtered by contrastive corpora. The contrastive corpora are used to discard candidate terms which are in actuality part of the general domain. The word senses of domain terminology are discovered via SemCor and WordNet. New domain-specific relationships are determined based on rule-based machine learning techniques. These relationships are used to determine multi-word terms which are domain specific. Finally, the domain ontology is trimmed and pruned. This methodology was used to create a tourism domain ontology. Vossen [108] describes a methodology for extending WordNet to the technical domain. The domain corpus is parsed into header and modifier structures: a header is a noun or verb and a modifier is an adjective or adverb, respectively. A header may have more than one modifier, as in the example "inkjet printer technology", where "technology" is the head and "inkjet" and "printer" are modifiers. Salient multi-word terms are hierarchically organized, creating a domain concept forest.
A domain concept forest is a set of concepts related to a specific domain together with relationships between the concepts. The root node of each of the domain concepts is attached to a WordNet concept; in the above example, "technology" would be the root node. The result is a domain concept forest attached to WordNet. Buitelaar and Sacaleanu [15] create a method of ranking synsets by domain relevance. The relevance of a synset is determined by its importance to the domain corpus, which in turn is determined by the number of times the concept appears in the corpus. A contrastive corpus is used to filter out concepts that are general, as in Navigli and Velardi [68]. A unique contribution of this research is the usage of hyponyms to determine domain relevance. A hyponym is lower on the tree and is therefore a specialization of the concept. The authors look at how often a hyponym of a concept appears in the document as part of the relevance measure. The result is an ordered list of domain terms. The authors tested the methodology on the medical domain by parsing medical journal abstracts. Buitelaar and Sacaleanu [14] extend their work by adding words to domain concepts based on lexico-syntactic patterns. The domain corpus is parsed to examine the syntax patterns of seven word combinations. Each pattern is separately considered for relevance. For all salient patterns, mutual information scores are given to co-occurrences within the pattern. Novel terms from the domain which are not in WordNet are added to WordNet concepts if they are determined to be statistically significant. This methodology is tested on the medical domain. In this chapter, information retrieval methodologies were explained. The specific areas reviewed were the Vector Space Model, WordNet and ontology creation. These areas were chosen because of their relevance to the contributions of this work. In Chapter 4, machine learning methodologies are reviewed.
CHAPTER 4
MACHINE LEARNING METHODOLOGIES

Most machine learning/data-mining methods [66] start with a training set of data from past cases illustrating positive and negative examples of the concept to be learned. This is called supervised learning. For example, if we are trying to learn how to discriminate companies likely to default on loans in the coming year from those unlikely to default, we would collect past cases of defaulting and non-defaulting companies, as done in studies such as [1, 62]. Such a training set consists of $\ell$ observations and a classification for each. That is, there are $\ell$ pairs of the form $z^i = (u^i, y^i)$, where $u^i \in X \subseteq \mathbb{R}^n$ represents the $n$ input attributes (the independent variables), with $X$ called the instance space of all possible companies; $y^i \in \{-1, +1\}$ is the classification ($+1$ means a positive example and $-1$ a negative example of the concept) for $i = 1, \ldots, \ell$; and the sample is $S = ((u^1, y^1), \ldots, (u^\ell, y^\ell)) \subseteq (X \times Y)^\ell$ [20]. Unless otherwise stated, a vector is denoted by a bold, lowercase letter. The superscript on the vector is reserved for the observation number. An unbolded, subscripted, lowercase letter refers to the components of the vector; the subscript represents the index of the component. In Chapter 5 we add a second subscript to denote the year (or period). Typical approaches, such as neural networks, logit, etc., start with a training set and try to fit the data as well as possible using the concept structure chosen (i.e., a neural network, a logit function, etc., respectively). This invariably leads to overfitting. To ameliorate this, the training set is often broken into two (or more) sets, where part of the cases are used to fit a function and part to test its ability to predict on data not used for fitting. These approaches do help with overfitting but are largely ad hoc.
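The holdout practice just described (fit on one part of the sample, test on the rest) can be sketched as below; the split fraction and random seed are arbitrary choices made for illustration:

```python
import random

def holdout_split(sample, test_fraction=0.25, seed=0):
    """Split labeled pairs (u, y) into a training set used to fit a
    concept and a holdout set used to estimate its predictive ability."""
    rng = random.Random(seed)
    shuffled = sample[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# A made-up sample of (attribute vector, label) pairs.
data = [((i, i + 1), 1 if i % 2 else -1) for i in range(20)]
train, test = holdout_split(data)
```
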
4.1 Statistical Learning Theory

Statistical learning theory [106] formally develops the goal of learning a function from examples as that of minimizing a risk functional $R(\alpha) = \int L(z, g(z, \alpha))\,dF(z)$ over $\alpha \in \Lambda$, where $L(\cdot)$ is a loss function and $g(z, \alpha)$ is a set of target functions parametrically defined by $\alpha \in \Lambda$ (the family of functions we are investigating). In this approach it is assumed that observations $z$ are drawn randomly and independently according to an unknown probability distribution $F(z)$. Since $F(z)$ is unknown, an induction principle must be invoked. One common induction principle is to minimize the number of misclassifications, which is directly equivalent to minimizing the empirical risk with the loss function taken as a simple indicator function. Other loss functions give different risk functions. For example, the classical method for linear discriminant functions, developed by Fisher [34], is equivalent to minimizing the probability of misclassification. As is well known, empirical risk minimization often results in overfitting. That is, for small sample sizes, a small empirical risk does not guarantee a small overall risk. This has been observed in many studies; for example, Eisenbeis [28] critiques studies based on such overfitting. Statistical learning theory approaches this problem by using a structural risk minimization principle [106]. For an indicator loss function, it has been shown [106] that for any $\alpha \in \Lambda$, with probability at least $1 - \tau$, the bound $R(\alpha) \le R_{emp}(\alpha) + R_{struct}(h, \ell, \tau)$ holds, where the structural risk $R_{struct}(\cdot)$ depends on the sample size $\ell$, the confidence level $\tau$, and the capacity $h$ of the target function. The $R_{struct}$ expression is as follows [20]: $$R_{struct}(h, \ell, \tau) = \sqrt{\frac{h(\ln(2\ell/h) + 1) - \ln(\tau/4)}{\ell}}.$$ The capacity $h$ measures the expressiveness of the target class of functions.
In particular, for binary classification, $h$ is the maximal number of points $k$ that can be separated into two classes in all possible $2^k$ ways using functions in the target class of functions. This measure is called the VC-dimension. For linear discriminant functions, without additional assumptions, the VC-dimension is $h = n + 1$ [107, 20]. The empirical risk is measured by a loss function on the set of examples as follows [91]: $R_{emp}(\alpha) = \frac{1}{\ell}\sum_{i=1}^{\ell} L(z^i, g(z^i, \alpha))$. Since we cannot directly minimize $R(\alpha)$, the structural risk minimization principle instead tries to minimize the bound $R_{bound}(\alpha) = R_{emp}(\alpha) + R_{struct}(h, \ell, \tau)$. It is almost always the case that the smaller the VC-dimension, the lower this bound.

4.2 Support Vector Machines

Support Vector Machines (SVMs) are rapidly growing in popularity, in part because both theoreticians and applied scientists find them useful. SVMs incorporate ideas from many fields of study, including applied mathematics, operations research, and machine learning. Based on statistical learning theory, early research suggests that SVMs have had good success with supervised learning; they have compared well with other learning algorithms such as neural networks, k-means, and decision trees [20]. Joachims [46] used SVMs to categorize news stories. Pontil and Verri [78] used SVMs for object recognition (independent of aspect). Cortes and Vapnik [23] tested SVMs on handwritten zip code identification, achieving accuracy just shy of human error. Brown et al. [12] applied SVMs to the problem of classifying unseen genes with success. Support vector machines determine a hyperplane in the feature space that best separates positive from negative examples. Features are mappings of the original attributes (we discuss this shortly). The margin of an example $(u^i, y^i)$ with respect to a hyperplane $(w, b)$ is $\gamma^i = y^i(\langle w, u^i \rangle + b)$, where $w$ is a weight vector and $b$ is a bias term. The margin of the hyperplane, $\gamma$, is the minimum of the margin distribution with respect to a training sample $S$.
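The margin quantities just defined can be computed directly; the hyperplane and the two labeled points below are made-up illustrations:

```python
import math

def functional_margin(w, b, u, y):
    # gamma_i = y_i * (<w, u_i> + b)
    return y * (sum(wi * ui for wi, ui in zip(w, u)) + b)

def margin(w, b, sample):
    """Minimum geometric margin over the sample: the smallest functional
    margin divided by the norm of the weight vector."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(functional_margin(w, b, u, y) for u, y in sample) / norm

# Hyperplane x1 + x2 = 2 separating two illustrative points.
w, b = (1.0, 1.0), -2.0
sample = [((2.0, 2.0), 1), ((0.0, 0.0), -1)]
```
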
The VC-dimension is bounded by $h \le 1 + \min\left(n, R^2/\gamma^2\right)$, where $R$ is the radius of a ball large enough to contain the input attribute space. If the margin is large enough, the VC-dimension may be much smaller than $n + 1$. SVMs learn by maximizing the margin, which, in turn, minimizes the VC-dimension and, usually, the bound on the risk functional. This distinguishes them from other popular methods such as neural networks, which use heuristic methods to help find parameters that generalize well. In addition, and unlike most methods, SVM learning is theoretically guaranteed to find the best such linear concept if the data are separable. Neural networks, decision trees, etc. do not carry this guarantee, leading to a plethora of heuristic approaches to find acceptable results. For example, most decision tree induction methods use pruning algorithms that try to create the smallest tree that produces an acceptable training error, in the hope that smaller trees generalize better (the so-called Occam's razor or minimum description length principle) [80]. Unfortunately, there is no guarantee that the tree produced minimizes generalization error. SVMs also scale up to very large data sets and have been applied to problems involving text data, pictures, etc. The SVM is formulated as a quadratic optimization problem with linear inequality constraints. Below is the primal formulation, assuming the data are separable: $$\min \langle w, w \rangle \quad \text{s.t.} \quad y^i(\langle w, u^i \rangle + b) \ge 1, \quad i = 1, \ldots, \ell.$$ $\langle w, w \rangle$ is minimized in the objective function in order to maximize $\gamma$, thus potentially minimizing the bound on the VC-dimension expressed above. This can be explained as follows. We replace the functional margin with the geometric margin; the geometric margin equals the functional margin when the weight vector is a unit vector. We therefore normalize so that $y^i(\langle w, u^i \rangle + b) = 1$ at a support vector (the inequality is tight there), giving $\gamma = 1/\|w\|$. In order to maximize $\gamma$ we merely minimize $\|w\|$. This problem has a dual formulation.
The dual solution is useful: w is no longer explicitly computed, and the explicit use of the data points collapses into a matrix of inner products, allowing for higher, possibly infinite dimensional feature spaces. These feature spaces are implicitly calculated by a kernel, which we explain in detail below. The dual formulation is:

max W(Λ) = Σ_{i=1}^{ℓ} λ^i − (1/2) Σ_{i,j=1}^{ℓ} y^i y^j λ^i λ^j ⟨u^i, u^j⟩
s.t. Σ_{i=1}^{ℓ} y^i λ^i = 0
     λ^i ≥ 0,  i = 1, ..., ℓ,

where the λ^i are the dual variables. w is no longer in the formulation, and all data appear inside the dot product, which is key to using kernels in the SVM.

A kernel is an implicit mapping φ of an input attribute space X onto a potentially higher dimensional feature space F. The kernel improves the computational power of the learning machine by implicitly allowing combinations and functions of the original input variables. For example, if only price and earnings are inputs, a PE ratio would not be explicitly considered by a linear learning mechanism. A kernel, properly chosen, allows many different relationships between variables to be examined simultaneously, presumably including price divided by earnings. The PE measure is termed a "feature" of the input variables. There are many powerful, generic kernels [20, 38], but kernels can also be made to suit a specific application area, as we do later in this study. Some application areas are sensitive to periodic changes, making correct pattern recognition more likely with the use of time series analysis. Rüping [87] shows how to extend a number of kernels to handle time series data. Jin, Lu, and Shi [45] show that choosing the right subset of attributes for a particular domain is important to time series classification for knowledge discovery applications. Their methodology trimmed the attributes to include only data pertinent to the domain's time series. Preliminary research suggests that kernels constructed with the help of application-specific information tend to produce better results [20].
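The structure of the dual, where the data enter only through inner products, can be sketched numerically. This is a minimal sketch, not a production solver: the bias term b is assumed away so the equality constraint drops, and projected gradient ascent stands in for a proper QP solver.

```python
import numpy as np

# Toy separable data; dropping the bias b removes the constraint
# sum_i y_i * lambda_i = 0, leaving only lambda_i >= 0.
X = np.array([[2., 2.], [3., 3.], [-2., -2.], [-3., -3.]])
y = np.array([1., 1., -1., -1.])

K = X @ X.T              # Gram matrix of inner products <u_i, u_j>
Q = np.outer(y, y) * K   # Hessian of the dual objective W

lam = np.zeros(len(y))
eta = 0.01
for _ in range(2000):
    # ascend W(lambda) = sum(lambda) - 0.5 lambda' Q lambda,
    # then project onto lambda >= 0
    lam = np.maximum(0.0, lam + eta * (1.0 - Q @ lam))

def decide(x):
    # decision function f(x) = sum_i y_i lambda_i <u_i, x>
    return np.sign((y * lam) @ (X @ x))

preds = np.array([decide(x) for x in X])
```

Because only the Gram matrix K is used, replacing `X @ X.T` with any kernel matrix kernelizes the learner without other changes.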
4.3 Kernel Methods

A kernel is a central component of the SVM. Shawe-Taylor and Cristianini call it the information bottleneck of the SVM [95], because all data input into an SVM goes through the kernel function and ends up in the kernel matrix. The kernel matrix is a matrix with entries K_ij = ⟨φ(u^i), φ(u^j)⟩, where φ is a mapping φ: X → F and u^i, u^j ∈ X. Often the dimension of the feature space is much larger than that of the attribute space, and may even be infinite (cf. the Gaussian kernel in Section 4.3.1). Key to the value of kernel methods is the ability to capture this feature space implicitly via the mapping φ. The dual formulation expressed in Section 4.2 can be generalized to allow the use of kernels as follows:

max W(Λ) = Σ_{i=1}^{ℓ} λ^i − (1/2) Σ_{i,j=1}^{ℓ} y^i y^j λ^i λ^j K(u^i, u^j).

The kernel function is an inner product between feature vectors and is denoted K(u, v) = ⟨φ(u), φ(v)⟩, where u, v ∈ X. The feature vectors may not have to be explicitly calculated if the kernel function can create the mapping implicitly. In Section 4.3.1 we show how a kernel can increase the dimension of the attribute space, thus allowing for more unique features, without significantly increasing computational cost. An alternative to using a kernel is to explicitly create all features deemed necessary for classification as direct input to the SVM as attributes. However, this is both time consuming and computationally costly. Using a kernel unleashes the potentially nonlinear power of the learning machine, allowing it to find patterns on the attributes that were previously unknown. In Section 4.3.1 we explain the properties of general kernels. In Section 4.3.2 we extend our explanation of kernels by considering domain specific kernels. These kernels are designed with the structure of a particular domain in mind.

4.3.1 General Kernel Methods

As explained above, a kernel is evaluated as an inner product between mappings of examples u^i, where examples are vectors of attributes from the instance space X.
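The kernel matrix can be illustrated with a small sketch. The kernel K(u, v) = ⟨u, v⟩² is a hypothetical choice for the example, and the data are arbitrary; the point is that the implicitly evaluated matrix matches the one built from explicit feature vectors, and that it is symmetric with nonnegative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 3))        # six toy examples, three attributes

def kernel(u, v):
    return np.dot(u, v) ** 2        # implicit evaluation, O(n)

def phi(u):
    # explicit degree-2 monomial features u_i * u_j, O(n^2) of them
    return np.outer(u, u).ravel()

K_implicit = np.array([[kernel(a, b) for b in U] for a in U])
K_explicit = np.array([[np.dot(phi(a), phi(b)) for b in U] for a in U])

agree = np.allclose(K_implicit, K_explicit)
symmetric = np.allclose(K_implicit, K_implicit.T)
min_eig = np.linalg.eigvalsh(K_implicit).min()
```

The implicit route never materializes the n²-dimensional feature vectors, which is the computational point of kernel methods.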
There are many known kernels and the list is growing [21, 46, 87]. Two specific kernels can be used to illustrate the nature and expressive power of these functions. The polynomial kernel is:

K(u, v) = (⟨u, v⟩ + R)^d,

where ⟨u, v⟩ is the ordinary inner product, d is a positive integer, and R is fixed. Consider a set of examples S = ((u^1, y^1), ..., (u^ℓ, y^ℓ)) ⊆ (X × Y), each with four attributes, u = (u_1, u_2, u_3, u_4)' and v = (v_1, v_2, v_3, v_4)'. With d = 1 and R = 0,

K(u, v) = u_1v_1 + u_2v_2 + u_3v_3 + u_4v_4,

and with R = 0 and d = 2,

K(u, v) = (u_1v_1 + u_2v_2 + u_3v_3 + u_4v_4)².

While the d = 1 kernel has four features, namely (u_1, u_2, u_3, u_4)', the d = 2 kernel has 10 features, namely all monomials of degree 2: (u_1², u_2², u_3², u_4², √2·u_1u_2, √2·u_1u_3, √2·u_1u_4, √2·u_2u_3, √2·u_2u_4, √2·u_3u_4)'. For arbitrary d with n attributes, the number of features is C(n+d, d). The computational complexity becomes unreasonable as n and d grow. Due to the implicit mapping in the polynomial kernel (between examples via the inner product), the monomials of degree d can be features of an SVM without their explicit creation.

An even more powerful kernel is the Gaussian, defined as:

K(u, v) = exp(−‖u − v‖²/(2σ²)),

where ‖·‖ is the 2-norm [91] and σ is a positive parameter. An exponential function can be approximated by polynomials with positive coefficients, making the Gaussian kernel a limit of sums of polynomial kernels [95]. The features of the Gaussian are best illustrated by the Taylor expansion of the exponential function, exp(x) = Σ_{i=0}^{∞} x^i/i! [95]. The features are all possible monomials, with no restriction on degree; this feature space has infinitely many dimensions.

Now that it is clear that kernels are a powerful tool, we look at their properties. To be useful in SVM work, a kernel function must have the following minimum characteristics (Cristianini and Shawe-Taylor [20]): (1) the function must be symmetric (i.e., K(u, v) = K(v, u)); (2) the function must be positive semidefinite; and (3)
the function must obey the Cauchy-Schwarz inequality. Property (1) is easy to check. Property (2) is a little more complicated; it is usually determined by studying a related square symmetric matrix A and its eigen-decomposition. Let K(u, v) be a symmetric function on X. K(u, v) is a kernel function if and only if the matrix A = (K(u^i, u^j))_{i,j=1}^{ℓ} is positive semidefinite (has nonnegative eigenvalues) [20]. Property (3) is satisfied as long as the function obeys the Cauchy-Schwarz inequality, which, as applied to kernels, is given by Cristianini and Shawe-Taylor [20] as:

K(u, v)² = ⟨φ(u), φ(v)⟩² ≤ ‖φ(u)‖² ‖φ(v)‖² = ⟨φ(u), φ(u)⟩⟨φ(v), φ(v)⟩ = K(u, u)K(v, v).

A kernel function often alters the dimensionality of the data, mapping it into feature space. The inner products between all feature vectors are collected in the kernel matrix. A matrix formed by such inner products is called a Gram matrix. The Gram matrix has some useful properties; for example, it is positive semidefinite. Since all of the entries in the Gram matrix are in the form of an inner product, we must be concerned with their proper existence. An inner product space is a vector space endowed with an inner product. The inner product induces the metric used to determine the distance between two points. An inner product space is enough structure to properly define each element of the Gram matrix in the finite dimensional case. However, if we want to take advantage of an infinite dimensional feature space (as in the Gaussian case), we need the inner product to define a complete metric (defined below). An inner product space whose metric is complete is a Hilbert space [59]. A complete metric is one in which every Cauchy sequence is convergent. Consider all countable sequences of real numbers x = (x_1, x_2, ..., x_i, ...)
such that ‖x‖² = Σ_{i=1}^{∞} x_i² < ∞. The inner product of sequences can be defined as ⟨x, y⟩ = Σ_{i=1}^{∞} x_i y_i. This infinite space is also called L₂ [59]. An important characteristic of a Hilbert space is that it is isomorphic to R^n in the finite case and to L₂ in the infinite case.

A compelling property of kernel methods is the ability to form new kernels from existing kernels. For example, one could take a polynomial and a Gaussian kernel and add them to obtain the features of each. Kernels are also closed under multiplication. Cristianini and Shawe-Taylor [20] show that the following functions of kernels are in fact kernels:

K(u, v) = K₁(u, v) + K₂(u, v)
K(u, v) = aK₁(u, v)
K(u, v) = K₁(u, v)K₂(u, v)
K(u, v) = K₃(φ(u), φ(v))

where K₁, K₂, and K₃ are kernels, φ: X → R^n, and a > 0.

Until this point, we have looked at two kernels, the polynomial and the Gaussian. These kernels are very powerful, but offer little opportunity for crafting kernels that are specific to a domain. The graph kernel is a general kernel which can be made domain specific, as long as certain rules are followed. A powerful characteristic of the graph kernel is its intuitive nature: the graphical representation allows us to better understand how a kernel works. Before formulating this kernel, a simple example is useful. Consider a graph G(A, E) with nodes a ∈ A and edges e ∈ E, with a source node s connected to a sink node t through edges e₁, ..., e₅, as in Figure 2.

Figure 2. Basic Graph Kernel

All e_i in this graph are base kernels (for example, a polynomial kernel on a component of the attribute space). To differentiate base kernels from general kernels, a base kernel is denoted K(u_i, v_i). Any path from s to t is a feature, arrived at via the product of all edges on the path between s and t. In general, all paths from s to t create features. This allows the researcher to create his own kernel by choosing the structure of the graph. Here is a more formal explanation of the graph kernel.
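The path-product idea can be sketched in code before the formal treatment. The graph below is a hypothetical small DAG (one assumed reading of the example figures): s feeds node 2, node 2 branches to nodes 3, 4, and 5, and nodes 3 and 4 also feed node 5. Base kernel i is taken to be the component product u_i·v_i.

```python
def base(i, u, v):
    # base kernel on component i (1-indexed): K_i(u, v) = u_i * v_i
    return u[i - 1] * v[i - 1]

def graph_kernel(u, v):
    # edges: (tail, head, base-kernel index); the graph is a DAG
    edges = [('s', 2, 1), (2, 3, 2), (2, 4, 3), (2, 5, 4),
             (3, 5, 5), (4, 5, 6)]
    value = {'s': 1.0}          # the empty path from s has value 1
    for node in [2, 3, 4, 5]:   # topological order
        # each node sums, over incoming edges, the tail's value times
        # the edge's base kernel: all s-to-node path products
        value[node] = sum(value[t] * base(k, u, v)
                          for t, h, k in edges if h == node)
    return value[5]

u = [1, 2, 3, 4, 5, 6]
v = [1, 1, 1, 1, 1, 1]
# paths s->2->3->5, s->2->5, s->2->4->5 contribute
# u1*u2*u5 + u1*u4 + u1*u3*u6 when v is all ones
```

The dynamic program sums path products without enumerating paths explicitly, which is what makes larger graphs tractable.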
It is based on a directed graph G with a source vertex s of in-degree 0 and a sink vertex t of out-degree 0. A directed graph is one where the flow on each edge is in a single direction. Each edge is labeled with a base kernel. It is assumed that this is a simple graph, meaning that there are no directed loops. In general, loops are allowed, but that makes proving that the resulting mapping is indeed a kernel extremely complicated. Takimoto and Warmuth [100] proved that a directed, acyclic graph with base kernels on the edges is indeed a kernel. Shawe-Taylor and Cristianini [95] describe the kernel as follows. Let P_st be the set of directed paths from s to t. For a path p = (a₀a₁...a_d), the product of the kernels associated with the edges of p is:

K_p(u, v) = Π_{i=1}^{d} K_{(a_{i−1}, a_i)}(u, v).

The graph kernel is the aggregation of all K_p(u, v):

K(u, v) = Σ_{p∈P_st} K_p(u, v) = Σ_{p∈P_st} Π_{i=1}^{d} K_{(a_{i−1}, a_i)}(u, v).

Here is another example for clarification. Figure 3 below is a slightly more complex version of the graph above. The nodes are labeled for explanatory purposes and the edges are labeled with base kernels K(u_i, v_i) = ⟨u_i, v_i⟩. If s = 1 and t = 2, there would be a single feature, u₁. If s = 1 and t = 3, there would also be a single feature, u₁u₂, the product of the two base kernels on the path through nodes 1, 2, and 3. Three paths converge at node 5. Node 5 can be seen as a kernel which sums the products of the base kernels on each path. If node 5 were t, the output would be the sum over all paths into node 5, u₁u₂u₅ + u₁u₄ + u₁u₃u₆. In general, at each node a (except s), all paths from s to a are summed. The contribution of a path to the kernel is the product of its edges.

Figure 3. Graph Kernel

4.3.2 Domain Specific Kernels

A kernel should have two properties for a particular application.
First, it should capture a similarity measure appropriate to the domain: the features that offer the most information content for that domain need to be represented by the kernel. Second, its evaluation should require significantly less computation than would be needed by using the explicit feature mapping [95]. The first point is key to the contribution of this dissertation. General kernels are building blocks, but the goal of a kernel method is to determine patterns correctly, and tuning a kernel to a specific domain best accomplishes this. Much empirical research has been done in which a dataset is tested using several kernels and results are reported as to which kernel performs better. This is ad hoc. It seems likely that a kernel tuned to a domain will better capture the features necessary to correctly classify instances in that domain. Ultimately we will combine kernels that deal with quantitative financial information and textual information. Below is a brief summary of some text-based kernels.

Joachims [46] uses the polynomial and Gaussian kernels for text categorization. He compares several parameter settings for each kernel (d for the polynomial and C for the Gaussian). He showed that the parameter eliciting the lowest estimated VC-dimension was the one with the best performance in the empirical tests. Thus, he tailored general kernels to the text domain. This is an early and simple example of domain specificity. Cristianini et al. [21] develop a Latent Semantic Kernel designed to sort documents into categories by keywords, which are automatically derived from the text corpus. The kernel implicitly maps keywords into a "semantic" space, which allows documents that share no keywords to be related. This is accomplished by analyzing co-occurrence patterns: two terms that are often found in the same document are considered related.
The co-occurrence information is extracted using a singular value decomposition of the term-by-document matrix. This paper illustrates the use of domain knowledge in the development of a kernel. Another kernel adaptable to text problems is the string subsequence kernel [20]. A string is a finite sequence of characters from a set T; in the case of a subsequence kernel, T is the alphabet. The goal of this kernel is to define the similarity between two documents by calculating the number of subsequences the documents have in common. The subsequences do not have to be contiguous; however, a penalty is incorporated into the function based on the distance between the words of a subsequence.

Early researchers in kernel methods have given us several general forms with which to work. Recent applications of kernel methods to specific domains include protein folding, handwriting recognition, face recognition, image retrieval, and text retrieval. Finding the right kernel for a particular problem has proven to be an ad hoc, yet extremely important, step. The real power of kernels lies in harnessing the general forms to create kernels specific to these domains. This work has just begun. A domain may be defined by more than one type of data, complicating matters. In the case of the accounting domain, both quantitative and text attributes contain information about a firm. In order to utilize the text data, we must first understand how to narrow down our potential attributes by looking at text specific to the domain of accounting. The next chapter utilizes the methodologies reviewed in this chapter to create a Financial Kernel. A review of Chapter 2 together with this chapter should give the reader an understanding of the reasons for the particular design of the Financial Kernel in Chapter 5.
CHAPTER 5
THE FINANCIAL KERNEL

Defining a domain specific kernel for finance entails looking to the finance and accounting literature to see what attributes and features are often utilized for classification. It also requires us to consider the kernels available and which would best fit our work. As this work focuses specifically on financial events, the main publications reviewed were in the realm of management fraud and bankruptcy, as seen in Chapter 2. Almost without fail, financial analyses look to ratios of items on the financial statements. Models for earnings quality in accounting utilize ratios, such as the study by Francis, LaFond, Olsson and Schipper [35]. Loebbecke, Eining and Willingham [55] use financial ratios as part of their management fraud model as well. All of the studies detailed in Section 2.2 on bankruptcy detection use financial ratios. McNichols and Wilson [56] used year-over-year changes in key account values to help determine earnings management. Francis, LaFond, Olsson and Schipper [35] utilized year-over-year changes extensively in their study on earnings quality. Beneish [10] utilized year-over-year changes to help determine management fraud. The majority of the bankruptcy prediction methods reviewed in Section 2.2 report the accuracy of their methodologies for the year of bankruptcy, the year prior to bankruptcy, and sometimes further back. As the number of years prior to bankruptcy increases, the predictive accuracy of the models decreases. In general, the picture is not clear that far in advance; however, a trend may be emerging, and this trend can be captured by year-over-year changes in key ratios. As explained in Section 2.2, Altman [5] notes that a limitation of his discriminant analysis function for bankruptcy detection was its lack of year-over-year changes. Year-over-year changes in ratios are captured by the function

(u_{i2}/u_{j2}) / (u_{i1}/u_{j1}),

where i, j = 1, ..., n are the attribute numbers and the second subscript is the year (or period).
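As a tiny numerical illustration of the year-over-year change function, consider invented account values for a ratio such as RE/TA over two years:

```python
# Hypothetical account values for two consecutive years.
re_1, ta_1 = 50.0, 200.0   # year 1: retained earnings, total assets
re_2, ta_2 = 66.0, 220.0   # year 2

ratio_1 = re_1 / ta_1                 # year 1 ratio
ratio_2 = re_2 / ta_2                 # year 2 ratio
yoy_change = ratio_2 / ratio_1        # (u_i2/u_j2) / (u_i1/u_j1)
```

A value above 1 indicates the ratio grew year over year; below 1, it shrank.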
We created two kernels to handle ratios and year-over-year changes. The first kernel applies the polynomial kernel structure to a mapping of the data that produces inverses. Recall, the general polynomial kernel is K(u, v) = (⟨u, v⟩ + R)^d, where R is a constant and d is the degree of the polynomial. We apply the polynomial kernel to a mapping of the input attributes

φ(u) = (u_1, u_2, ..., u_n, 1/u_1, 1/u_2, ..., 1/u_n)',

where u = (u_1, u_2, ..., u_n)'. Setting R to zero and d = 2, K(u, v) = (⟨φ(u), φ(v)⟩)² gives all possible ratios of individual attributes, u_i/u_j. In addition, we get features of the forms u_i², 1/u_i², u_iu_j, and 1/(u_iu_j). This can be seen in a simple example. Consider u = (u_1, u_2, u_3)' and v = (v_1, v_2, v_3)' for u, v ∈ X, with

φ(u) = (u_1, u_2, u_3, 1/u_1, 1/u_2, 1/u_3)' and φ(v) = (v_1, v_2, v_3, 1/v_1, 1/v_2, 1/v_3)'.

The function result is

K(u, v) = (⟨φ(u), φ(v)⟩)² = Σ_i u_i²v_i² + Σ_i 1/(u_i²v_i²) + 2Σ_{i<j} u_iu_jv_iv_j + 2Σ_{i<j} 1/(u_iu_jv_iv_j) + 2Σ_{i≠j} (u_iv_i)/(u_jv_j) + constant,

which gives a feature vector containing u_i², 1/u_i², √2·u_iu_j, √2/(u_iu_j), √2·u_i/u_j (for i ≠ j), and a constant feature.

We validated this kernel on simulated data. We used the Altman Z-Score with weights for the manufacturing industry (cf. Chapter 2) and created attributes for each variable in the Z-score: TA, EBIT, RE, B.V.E., TL, and WC, as defined in Section 2.2. The attribute values were drawn from normal distributions with means and variances appropriate to the domain. When we created the variables we preserved the structure of the balance sheet (i.e., TA = TL + B.V.E. + RE). Each example was input into the Altman Z-Score function to obtain its score, and the examples were sorted by score. The top 50% of scores were labeled +1 and the bottom 50% were labeled −1. We input only the attributes and labels into the SVM.
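The mechanics of this first kernel can be sketched directly: augment each attribute vector with its componentwise inverses, square the inner product, and check that the result matches the inner product of the explicit degree-2 features. The attribute values are invented and must be nonzero because of the inverses.

```python
import numpy as np

def augment(u):
    # phi(u) = (u_1..u_n, 1/u_1..1/u_n)
    return np.concatenate([u, 1.0 / u])

def fin_poly_kernel(u, v):
    # degree-2 polynomial kernel on the augmented vectors (R = 0, d = 2)
    return np.dot(augment(u), augment(v)) ** 2

def explicit(u, v):
    # inner product of explicit degree-2 monomial features of phi,
    # among which ratios u_i/u_j appear implicitly
    au, av = augment(u), augment(v)
    return np.dot(np.outer(au, au).ravel(), np.outer(av, av).ravel())

u = np.array([2.0, 4.0, 5.0])
v = np.array([1.0, 2.0, 10.0])
```

The agreement of the two computations is exactly the implicit-mapping property the kernel relies on.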
We were able to separate perfectly on the Altman Z-score, but had problems rediscovering the weights of the actual function. We determined this is because many of the extra features created by this kernel are highly correlated with each other. This correlation is due in part to the structure of the Altman Z-score: Total Assets and Retained Earnings are two of the six attributes used in creating the Z-score's ratios, and each of these attributes appears in two different ratios. Our kernel creates on the order of (2n)^d features, some of which do not add a significant amount of information to the learning algorithm (e.g., u_iu_j and 1/(u_iu_j)).

To add a time series representation with this kernel, we would have to represent the following relationship:

(u_{i2}/u_{j2}) / (u_{i1}/u_{j1}) = (u_{i2}u_{j1}) / (u_{j2}u_{i1}).

The left-hand side of this function is the year-over-year change as explained above. The right-hand side is a representation that our kernel can construct by dropping the constant. The attribute vector would double in size, as the second year is concatenated onto the end. In order to get year-over-year changes in this format we need d > 3. The number of features in the year-over-year case would then be at least (2(2n))³. Even a modest number of attributes causes a huge explosion in features: the n = 4 case would generate 4,096 features.

The second kernel we created was built in response to the problems with the first, namely dimensionality explosion and unnecessary features. We designed this kernel with the goal of obtaining all the important features, including all possible intra-year ratios and year-over-year ratios, while avoiding unwanted features. For this we chose the graph kernel. As discussed already, the graph kernel is extremely flexible, which makes it a natural choice when trying to construct specific features. We exploit the research of Takimoto and Warmuth [100] to build this kernel.
We call this kernel the Financial Kernel and denote it K_F(u, v). K_F(u, v) is built on a directed graph G(A, E) with base kernels on all edges e. The Financial Kernel takes as input n attributes per year for 2 years. The attribute vector is u = (u_{11}, ..., u_{n1}, u_{12}, ..., u_{n2})', where the first index is the attribute number and the second index is the time period. See Figures 4 and 5 for an illustration of the financial domain kernel. Figure 4 illustrates one of the n − 1 subgraphs that make up the Financial Kernel. Each of the n − 1 subgraphs has a source node s_i and a sink node t_i. The subgraphs decrease in size with i, because each subgraph carries information for attributes i through n. Each path from source to sink is a feature; the number of features equals the number of paths. All n − 1 subgraphs from Figure 4 are brought together by the graph in Figure 5, and the paths from s to t make up all of the features in K_F(u, v). The kernels on the edges are base kernels. As defined in Chapter 4, a base kernel is a kernel function on a vector component, and we can have as many different kernels as there are edges. For the Financial Kernel we limit the base kernels to two forms: the standard inner product kernel K(u_i, v_i) = ⟨u_i, v_i⟩ and the inverse kernel K(u_i, v_i) = 1/(u_iv_i).

Figure 4. The Financial Kernel 1

Figure 5. The Financial Kernel 2

According to Takimoto and Warmuth [100], in order to prove that K_F(u, v) is a kernel, we need only have a directed graph without cycles and show that each edge carries a valid kernel. (For details of their proof, see p. 33 of Takimoto and Warmuth [100].) Examination of Figures 4 and 5 clearly shows that the graph is directed and free of cycles. We need to show that both K(u_i, v_i) = ⟨u_i, v_i⟩ and K(u_i, v_i) = 1/(u_iv_i) are kernels. The former is simply the standard inner product kernel.
K(u_i, v_i) = 1/(u_iv_i) can be shown to be a kernel as follows: (1) f(u_i) = 1/u_i, i = 1...n; (2) f(v_i) = 1/v_i, i = 1...n; (3) f(u_i)f(v_i) = 1/(u_iv_i); (4) by Cristianini and Shawe-Taylor [20] (p. 42), K(u_i, v_i) = f(u_i)f(v_i) is a kernel.

The features of the Financial Kernel are:

φ(u) = (u_{i1}/u_{j1}, u_{j2}/u_{i2}, (u_{i1}u_{j2})/(u_{j1}u_{i2}))', i, j = 1...n, i < j.

Here is a small example with u = (u_{11}, u_{21}, u_{12}, u_{22})' and v = (v_{11}, v_{21}, v_{12}, v_{22})':

K_F(u, v) = (u_{11}v_{11})/(u_{21}v_{21}) + (u_{22}v_{22})/(u_{12}v_{12}) + (u_{11}v_{11}u_{22}v_{22})/(u_{21}v_{21}u_{12}v_{12}),

which gives the features u_{11}/u_{21}, u_{22}/u_{12}, and (u_{11}u_{22})/(u_{21}u_{12}).

In general, for year 1 we get all ratios of the form u_{i1}/u_{j1}; for year 2 we get all ratios of the form u_{j2}/u_{i2}, the inverse of the year 1 form. We structure the ratios this way in order to get year-over-year changes of the form (u_{i1}u_{j2})/(u_{j1}u_{i2}). The intra-year feature space constructed so far contains u_{i1}/u_{j1} and u_{j2}/u_{i2}; it is evident that with this kernel we get each ratio or its inverse, but not both. In other words, if the true feature is u_{i2}/u_{j2}, this mapping gives only its inverse. By constructing the features in this manner we reduce the dimensionality necessary to get year-over-year changes, but we lose a potentially important set of features in the process. For the year-over-year changes, all we need to do is take the product of the intra-year ratios u_{i1}/u_{j1} and u_{j2}/u_{i2}. The computational complexity of the Financial Kernel is 3n(n−1)/2 for n attributes and 2 periods. This is easy to see, as each pair of attributes i, j is represented three times, as u_{i1}/u_{j1}, u_{j2}/u_{i2}, and (u_{i1}u_{j2})/(u_{j1}u_{i2}), and the number of attribute pairs is n(n−1)/2.

We validate the Financial Kernel on simulated data to test the kernel's ability on inputs of a known function. We take the Altman Z-Score and modify it slightly to add a time series component.
The function we create is:

6.56((CA − CL)/TA)₁ + 3.26(RE/TA)₁ + 6.72(EBIT/RE)₁ + 1.05(BE/TL)₁
+ 6.56((CA − CL)/TA)₂ + 3.26(RE/TA)₂ + 6.72(EBIT/RE)₂ + 1.05(BE/TL)₂
+ 2((CA − CL)/TA)₁(TA/(CA − CL))₂ + 3(RE/TA)₁(TA/RE)₂ + 3(EBIT/RE)₁(RE/EBIT)₂ + (BE/TL)₁(TL/BE)₂ = score

The first and second rows of this function are the year 1 and year 2 individual Altman Z-Scores. The third row contains year-over-year changes in the ratios of the Altman Z-Score; the weights on the year-over-year changes were chosen arbitrarily. Our dataset contains 2,000 randomly generated examples labeled with the modified Altman Z-score function. We divide the examples by sorting the data on the score. The threshold value for our modified Altman function is chosen as the midpoint between the scores of sorted items 1,000 and 1,001; thus the entire top half is labeled +1 and the bottom half −1. We run experiments using the Financial Kernel, a polynomial kernel of degree 2, a Gaussian kernel, and a linear kernel. The results are as follows:

Table 1. Financial Kernel Validation

Kernel               SV     Test on Train   10-fold cross validation
Linear               877    85%             84%
Polynomial (deg 2)   1998   75%             55%
Gaussian             1056   86%             86%
Financial Kernel     707    92%             91%

The results show that the Financial Kernel achieves superior results under 10-fold cross validation. The first column is the number of support vectors; a bound on generalization error is #SV/ℓ. The generalization error of the Financial Kernel is the lowest of the listed kernels. The result is not quite as expected, though: one would expect the Financial Kernel to achieve perfect separation. The reason for the error is an assumption we made when developing the kernel, namely that we could represent both of the ratios u_i/u_j and u_j/u_i by only one of them. In order to get perfect separation, we hypothesize that we need both u_i/u_j and u_j/u_i as features.
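The closed form of K_F for the small n = 2, two-period example given earlier in this chapter can be checked numerically; the attribute values below are invented and must be nonzero.

```python
import numpy as np

def phi(u):
    # explicit features: intra-year ratios and the year-over-year change
    u11, u21, u12, u22 = u
    return np.array([u11 / u21, u22 / u12, (u11 * u22) / (u21 * u12)])

def financial_kernel(u, v):
    # closed form of K_F for n = 2 attributes over 2 periods
    u11, u21, u12, u22 = u
    v11, v21, v12, v22 = v
    return ((u11 * v11) / (u21 * v21)
            + (u22 * v22) / (u12 * v12)
            + (u11 * v11 * u22 * v22) / (u21 * v21 * u12 * v12))

u = np.array([4.0, 2.0, 3.0, 6.0])
v = np.array([1.0, 2.0, 4.0, 8.0])
```

The kernel value equals the inner product of the explicit feature vectors, confirming that K_F implicitly computes exactly the ratio and year-over-year features.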
This has been easily achieved by adding a mirror image of the graph in Figure 4, with its components being the inverses of the components of Figure 4. Figure 6 shows the updated Financial Kernel.

Figure 6. Updated Financial Kernel

This chapter detailed the development of the Financial Kernel, one of the two main methodological contributions of this research. In Chapter 6 the development of the Accounting Ontology is explained.

CHAPTER 6
THE ACCOUNTING ONTOLOGY AND CONVERSION OF DOCUMENTS TO TEXT VECTORS

We describe a methodology for creating an accounting ontology in Section 6.1. Section 6.2 describes how the ontology is used in conjunction with the vector space model to turn accounting documents into text vectors.

6.1 The Accounting Ontology

The accounting ontology is built using an accounting corpus to represent the accounting domain and general corpora to represent the general domain. The accounting corpus is the US Master GAAP Guide [16]. We chose this because it explains generally accepted accounting principles in a fairly nontechnical manner: it uses all the terminology, but in plainer language than a legal publication. We get our general corpora from the Text Research Collection, which is distributed by TREC [103]. This collection includes material from the Associated Press, The Wall Street Journal, the Congressional Record, and various newspapers. The Text Research Collection has been used in many natural language processing applications and is often used to test IR methodologies.

A domain specific ontology is created by a series of major steps, each with its own series of minor steps. Figure 7 shows how the ontology is created. There are two classes of corpora, the domain corpora and the general corpora. Both are part-of-speech tagged and fed into the function that determines which concepts are germane to the accounting domain, as described in Step 1 below.
A set of concepts and other domain specific terms, called Novel Concepts, is put through a process that uses the syntactic structure of the accounting corpora, as described in Step 2 below. The result of this step is a WordNet enriched with novel terms from the accounting domain. The final step in ontology creation is to add new multi-word concepts to WordNet based on an algorithm that uses the syntactic structure of domain concepts, as described in Step 3 below.

Figure 7. Accounting Ontology Creation Process

The details of each step are explained in the remainder of the section.

6.1.1 Step 1: Determine Concepts and Novel Terms that are Specific to the Accounting Domain

We start with a part-of-speech (POS) tagger used to tag the natural language text. This puts additional structure on the individual words. The POS tagger used is a derivative of the Brill tagger called MontyTagger [110]. The tagger is run on both the accounting corpus and the general corpora. For each document, the POS tagged data is culled down to the following form:

Word1#POS#WordCount Word2#POS#WordCount ... WordN#POS#WordCount

where each Word#POS#WordCount entry consists of:

Word: the stemmed word
POS: the part of speech of the word
WordCount: the number of times the word appears in the document

The word counts are run through a function in order to detect the words that carry the most information for a particular domain. For example, when comparing the accounting domain with a general domain, the word "defeasance" will have a higher score for the accounting domain because it is specific to accounting, while the word "balance" will have a lower score, as it is found equally in the accounting and general domains.
The function used is a modification of the basic TFIDF function that incorporates WordNet concepts. Recall the basic TFIDF function:

w_{ij} = tf_{ij} × log(N/n_j),

where w_{ij} is the weight of term t_j in document d_i, tf_{ij} is the frequency of term t_j in document d_i, N is the number of documents in the collection, and n_j is the number of documents in which term t_j occurs at least once [44]. The inverse document frequency is idf_j = log(N/n_j).

We modify the TFIDF as follows:

(1) rlv(t | d) = tf_{t,d} × log(N/n_t)
(2) rlv(c | d) = Σ_{t∈c} rlv(t | d)
(3) rlv(c₊ | d) = (Σ_{t∈c₊} rlv(t | d)) × T/|c₊|
(4) argmax_{c: t∈c} rlv(c₊ | d)

Function (1) mimics the basic TFIDF function, only here d stands for domain instead of document: rlv is the domain relevance of a term t for domain d, and N is the number of domains. In function (2) we introduce c for concept, where c = {t_1, ..., t_m}. This introduces the notion of a synset. By considering the relevance of terms and their synonyms, we get a clearer understanding of the domain. Function (2) sums the relevance rlv over all terms t in the synset c; this is a concept relevance score. Function (3) sharpens this by considering hyponyms: c₊ is the concept unioned with all of its direct hyponym sets. Recall, a hyponym of a concept i is a concept j which is a specific instance of concept i; for example, a bank account is an account. Function (3) sums the relevance rlv over all terms t in c and all of c's direct hyponyms. Looking at the direct hyponyms gives us one more measure of a concept's relevance. Function (3) also adds the term T/|c₊|, where T is the total number of terms t within the concept c₊ that are found in the domain corpus and |c₊| is the cardinality of the concept. This set of functions was developed by Sacaleanu and Buitelaar [15]. We add a measure of word sense disambiguation in Function (4) by comparing the domain frequency of various senses of a term t. In other words, consider concept c₁ with terms (t₁, t₂, t₃) and concept c₂ with terms (t₃, t₄, t₅).
Notice that t_3 is in both c_1 and c_2. We determine which concept t_3 actually belongs to by comparing the Function (3) scores of the two concepts: we choose the concept (c_1 or c_2) which achieves the maximum value in Function (3). Here is an illustrative example. The noun "stock" in WordNet has 17 different senses (definitions). Listed below are 4 of the 17 senses.

1. stock -- (the capital raised by a corporation through the issue of shares entitling holders to an ownership interest (equity); "he owns a controlling share of the company's stock")
   => capital, working capital -- (assets available for use in the production of further assets)

2. broth, stock -- (liquid in which meat and vegetables are simmered; used as a basis for e.g. soups or sauces; "she made gravy with a base of beef stock")
   => soup -- (liquid food especially of meat or fish or vegetable stock often containing pieces of solid food)

3. stock, inventory -- (the merchandise that a shop has on hand; "they carried a vast inventory of hardware")
   => merchandise, wares, product -- (commodities offered for sale; "good business depends on having good merchandise"; "that store offers a variety of products")

4. livestock, stock, farm animal -- (not used technically; any animals kept for use or profit)
   => placental, placental mammal, eutherian, eutherian mammal -- (mammals having a placenta; all mammals except monotremes and marsupials)

Senses 1 and 3 are much more likely to come up in an accounting context than senses 2 and 4. In order to test which sense is the most likely in the context of a document or corpus, we compare relevance scores: for Sense 1, over "stock" and its hyponyms "capital" and "working capital"; for Sense 3, over "stock", "inventory", and its hyponyms "merchandise", "wares", and "product". The sense with the highest relevance becomes the candidate to be a domain-specific concept. All word-sense-disambiguated concepts are sorted by score, and the highest scoring concepts become domain-specific concepts.
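The sense selection of Function (4) can be sketched as follows; the relevance numbers here are made-up toy values standing in for Function (3) scores on an accounting corpus:

```python
def disambiguate(candidate_senses, concept_score):
    """Function (4): assign a shared term to whichever sense's extended
    concept achieves the maximum relevance score."""
    return max(candidate_senses, key=lambda s: concept_score(candidate_senses[s]))

senses = {  # each sense: concept terms plus direct hyponym terms
    "stock/capital":   {"stock", "capital", "working capital"},
    "stock/inventory": {"stock", "inventory", "merchandise", "wares", "product"},
    "stock/broth":     {"stock", "broth", "soup"},
}
toy_rlv = {"stock": 2.0, "inventory": 3.0, "merchandise": 1.0,
           "capital": 2.5, "working capital": 1.5, "wares": 0.2, "product": 0.5}
best = disambiguate(senses, lambda terms: sum(toy_rlv.get(t, 0.0) for t in terms))
```

With these toy scores the inventory sense wins, mirroring the intuition that Senses 1 and 3 dominate in accounting text.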
Novel terms are those terms that have high scores but do not fit into a WordNet category. These terms are very important, as they give us an opportunity to enrich WordNet with domain knowledge. WordNet can be viewed as a hierarchical tree where the nodes are concepts and the edges are relationships. Figure 8 shows a simplified WordNet tree after Step 1; in this tree, accounting domain concepts are filled in with the color gray. We also show a listing of novel terms which are highly important to the accounting domain but cannot be matched to current WordNet concepts. These terms are the subject of Step 2.

Figure 8 WordNet Noun Hierarchy with Domain Concepts

6.1.2 Step 2: Merge Novel Terms with Concepts

In this step, we take the novel terms that were not matched to concepts and attempt to fit them into a domain concept, using the methodology of Buitelaar and Sacaleanu [14]. This process uses lexico-syntactic patterns. Consider the natural text, before preprocessing. In such a text certain syntactic patterns arise, such as [determiner, adjective, noun, verb, noun]. A sentence with this structure is "The large crane eats breakfast": "the" is a determiner, "large" is an adjective (ADJ), "crane" is a noun (NN), "eats" is a verb (V) and "breakfast" is a noun (NN). We consider the syntactic patterns that arise in 7-grams, that is, contiguous 7-word structures. We look for patterns with three words to the left and three words to the right of a central word. This central word will always be either a domain concept (which includes all constituent terms) or a novel term. The basic idea is as follows: look for patterns where novel terms and concepts appear together, often interchangeably.
A visual representation of a 7-gram is below:

[null, null, ADJ, Term/Concept, null, V, NN]

This 7-gram has words in the following sequence: two words on the left that are not important to us (signified by null), an adjective, the term or concept, another unimportant word, and then a verb and a noun. The parts-of-speech we are concerned with are nouns, verbs and adjectives; all other parts-of-speech are considered "null". In order to determine patterns which are populated with words that are related, we use a mutual information score based on co-occurrence. The score is used to determine the semantic similarity of two-word pairs based on how often pairs of words are found together relative to chance. The mutual information score MI is the following function:

MI(c, w) = log_2( (f(c, w) * N) / (f(c) * f(w)) )

where c is a concept, w is a word in the pattern, f(.) denotes a frequency count, and N is the total number of words in the pattern. MI is an approximation of the probabilistic mutual information function:

log_2( P(x, y) / (P(x) * P(y)) )

The details of the derivation of MI(c, w) can be found in [14]. In order to determine if a novel term belongs inside a particular concept we first have to decide whether the pattern is reliable. We assume a pattern is reliable if all the terms of a concept are assigned back to the concept by an unsupervised clustering algorithm, k-nearest neighbor. Below is the data structure for the example pattern above: [c, MI, ADJ, V, NN]. As "null" attributes are unimportant, we simply leave them out. For reliability testing we expect the concept c in the representation above to be clustered together with all of its term instances, represented as [t_i, MI, ADJ, V, NN]. If this is not the case, the pattern is considered unreliable. For all reliable patterns, we use k-nearest neighbor to cluster the concepts, represented as above, together with the novel terms (NT) in the following representation: [NT, MI, ADJ, V, NN]. If an NT is clustered with a concept, then we add the NT to the concept, thus enriching WordNet.
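One reading of the MI formula, with f(.) as frequency counts, is a one-liner; treat this as a sketch of the garbled equation rather than a definitive reconstruction:

```python
import math

def mutual_information(f_cw, f_c, f_w, N):
    """Co-occurrence mutual information: log2(f(c,w) * N / (f(c) * f(w))).
    Positive values mean c and w co-occur more often than chance."""
    return math.log2((f_cw * N) / (f_c * f_w))
```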
Figure 9 shows the WordNet tree after Step 2; this figure updates Figure 8. The domain concepts which were found in Step 1 are shaded gray. Figure 9 illustrates that after Step 2 some of the domain concepts include novel terms, thus enriching WordNet.

Figure 9 WordNet Noun Hierarchy with Domain Concepts Enriched with Novel Terms

6.1.3 Step 3: Add Multiword Domain Concepts to WordNet

At this point, we have domain concepts enriched with novel terms. We would like to extend WordNet by adding new nodes. We do this using a slightly altered form of the method described by Vossen [108]. Vossen utilized the header-modifier relationship to determine multiword concepts. For our purposes, a header is a noun and a modifier is one or more adjectives describing it. For example, "bank account" is a two-word structure with "account" as the header and "bank" as the modifier. Vossen considers all header-modifier structures, limiting the final set to the ones above a statistical threshold for a particular domain. We already have our domain concepts from Steps 1 and 2, so we consider only header-modifier structures where the header is one of the domain concepts. If an instance of the header-modifier structure is statistically significant, it is added as a node below the header in the tree; this means it becomes a hyponym of the domain concept. The potential for more than one layer exists. Consider the phrases "federal tax expense" and "state tax expense". Both of these multiword phrases are actually line items on an income statement. "Expense" is a term specific to the accounting domain. "Tax expense" is a term that belongs below "expense" in the hierarchy, and there can be an additional set of nodes below "tax expense" called "federal tax expense" and "state tax expense". There can be any number of modifiers for any noun, although the number of modifiers will typically be between one and three.
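The header-modifier collection of Step 3 can be sketched as below. This is a simplification: only the single word immediately preceding a domain-concept noun is taken as the modifier, and a raw count threshold stands in for Vossen's statistical significance test:

```python
from collections import Counter

def multiword_candidates(tagged_tokens, domain_concepts, min_count=2):
    """Count modifier+header bigrams whose header noun is a domain concept;
    pairs meeting the threshold become new hyponym nodes under the header."""
    pairs = Counter()
    for (w1, p1), (w2, p2) in zip(tagged_tokens, tagged_tokens[1:]):
        if p2 == "NN" and w2 in domain_concepts and p1 in ("JJ", "NN"):
            pairs[(w1, w2)] += 1
    return {pair: n for pair, n in pairs.items() if n >= min_count}

tokens = [("bank", "NN"), ("account", "NN"), ("eats", "V"),
          ("bank", "NN"), ("account", "NN")]
cands = multiword_candidates(tokens, {"account"})
```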
The WordNet tree takes on new nodes underneath domain concepts. The new nodes are the header-modifier structures deemed significant to the domain. Figure 10 shows a simplified representation of the tree after Step 3. The figure shows (as in Figures 8 and 9) the domain concepts shaded light gray. Additional nodes, shaded dark gray, are added below some domain concepts. These new nodes are the domain-specific multiword concepts added in this step.

Figure 10 WordNet Noun Hierarchy with Domain Concepts, Novel Terms and Multiword Concepts

6.2 Converting Text to a Vector via the Accounting Ontology

Above we developed an accounting ontology methodology. Now we use this ontology to aid in the detection of financial events by using it as a domain filter to remove unwanted noise. Recall that the output of the accounting ontology is a set of concepts specific to the accounting domain, as well as relationships between those concepts. The process of obtaining a quantitative form of a text vector is as follows. We input the company reports in natural language and use a part-of-speech tagger [110] as a preprocessor. The preprocessed document is then parsed down to word counts labeled with a part-of-speech (e.g., Word#POS#WordCount). Each word is compared to all domain concepts. If a word fits into one of the domain concepts, its word count is added to the vector. The power of concepts rests in the fact that all words inside a concept count the same. For example, the word "liability" has the synonyms "indebtedness" and "financial obligation"; all three of these words are part of the same concept. If one document has the word "liability", a word count is placed in the index reserved for the "liability" concept. If another document has the word "indebtedness" or "financial obligation", a word count is placed in the same index reserved for the "liability" concept.
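The concept-counting step can be sketched as follows (illustrative names; the real concept list comes from the accounting ontology):

```python
def concept_vector(word_counts, concepts):
    """Map word counts onto ontology concept indices; all synonyms in a
    concept's synset contribute to the same index."""
    return [sum(word_counts.get(w, 0) for w in synset) for synset in concepts]

concepts = [{"liability", "indebtedness", "financial obligation"}, {"asset"}]
v = concept_vector({"liability": 2, "indebtedness": 1, "asset": 4, "cat": 9}, concepts)
```

Note that "liability" and "indebtedness" land in the same slot, while the non-domain word "cat" is filtered out entirely.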
Below is a class of concepts:

{c_1, c_2, ..., c_n},  w_i in c_j,  i = 1, ..., |c_j|,  j = 1, ..., n

where the c_j are concepts, the w_i are words and |c_j| is the size of the concept set. The filtering process leaves us with only concepts c_j that are specifically related to the accounting domain. We take this reduction a step further by considering the relationships between the concepts in the accounting ontology. We do this by utilizing the tree structure of WordNet. We need a measure to determine the similarity between nodes (or concepts). There is a vast literature on similarity measures, so we choose an off-the-shelf measure that has proven to be among the best. Based on the work of Budanitsky and Hirst [13], we choose the Jiang and Conrath measure, which has been shown to be more accurate on the Miller-Charles [63] set than competing similarity measures. We create a similarity matrix comparing each of the concepts, where s_ij is the similarity between concepts c_i and c_j for i, j in {1, 2, ..., n}. Here is a view of this matrix:

s_11  s_12  ...  s_1n
s_21  s_22  ...  s_2n
 ...   ...  ...  ...
s_1n  s_2n  ...  s_nn

We input the similarity matrix into an agglomerative clustering algorithm [47]. This algorithm clusters the most similar items and shrinks the matrix. The algorithm is iterative; in each run, concepts which are less similar are added to existing clusters, so we choose a parameter k, where k is the minimum level of similarity with which two concepts can be clustered. The clustered concepts c are called superconcepts sc, and |c| <= |sc|, where |.| denotes size. In turn, the total number of superconcepts is less than or equal to the total number of concepts. There are two goals in creating superconcepts: (1) the superconcepts cluster concepts that are similar, so financial documents which share accounting superconcepts are more likely to be similar; (2) the superconcepts allow us to shrink the size of an undoubtedly large vector.
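A minimal sketch of threshold-k agglomerative clustering over the similarity matrix (single linkage is an illustrative choice; the dissertation cites a standard algorithm [47]):

```python
def superconcepts(sim, k):
    """Greedily merge the most similar pair of clusters until no
    inter-cluster similarity reaches the threshold k.
    sim[i][j] is the concept similarity matrix."""
    clusters = [{i} for i in range(len(sim))]
    while True:
        best, pair = k, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if s >= best:
                    best, pair = s, (a, b)
        if pair is None:
            return clusters  # no remaining pair reaches similarity k
        a, b = pair
        clusters[a] |= clusters[b]
        del clusters[b]
```

On a toy 3-concept matrix where concepts 0 and 1 are highly similar, the result is two superconcepts {0, 1} and {2}.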
This can help us avoid overfitting the empirical data, which is possible due to the small datasets available for fraud and bankruptcy. Below is a class of superconcepts:

{sc_1, sc_2, ..., sc_s},  c_j in sc_k,  j = 1, ..., m_k,  k = 1, ..., s

where m_k is the number of concepts in superconcept k.

Chapter 6 explained the methods used to develop the Accounting Ontology. The procedure for converting text to a vector of numbers was also explained. In the next chapter the method of combining the text data with the quantitative data is detailed.

CHAPTER 7
COMBINING QUANTITATIVE AND TEXT DATA

In this chapter we combine quantitative and textual financial data for subsequent analyses. We turn the text into a numeric vector as discussed in Chapter 6; here we concatenate the quantitative form of the text to the vector of quantitative financial data. Since we will be applying a kernel to this concatenated vector, we need to expand the financial kernel developed in Chapter 5. We concatenate the text and quantitative attribute vectors as a single, partitioned vector

u = (u_11, u_21, ..., u_n1, u_12, u_22, ..., u_n2, u_2n+1, u_2n+2, ..., u_m)'.

The Financial Kernel is applied to u_11, ..., u_n1, u_12, u_22, ..., u_n2 and the text kernel is applied to u_2n+1, ..., u_m, where these m - 2n values are the quantitative representation of the text. This is a two-step process: (1) we create a graph kernel K_T(u, v) for the text; (2) we add the text graph to the Financial Kernel graph.

(1) Text graph: the text kernel is a linear kernel, K_T(u, v) = <u, v>. We show K_T(u, v) in graph form in Figure 11.

Figure 11 Text Kernel

This is a directed graph with no cycles, and each edge e is a base kernel. This graph is a kernel by Takimoto and Warmuth [100] (pg. 33).

(2) Add K_T(u, v) to K_F(u, v) to create a combined kernel K_C(u, v). See Figure 12 below:

Figure 12 Combined Kernel

The text graph TG is added to the Financial Kernel. The addition of TG does not alter the fundamental structure of the Financial Kernel graph.
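The combined kernel can be illustrated numerically on a small case with 2 quantitative attributes per year and 4 text features. This is a sketch: the ratio feature map below is my reading of the example formula in the text, so treat the features as assumptions; the test only confirms that K_C(u, v) equals the inner product of the feature vectors:

```python
def phi(x):
    """Assumed feature map: two per-year ratios, their cross product,
    then the four text features unchanged."""
    x11, x21, x12, x22 = x[:4]
    return [x11 / x21, x22 / x12, (x11 * x22) / (x21 * x12)] + list(x[4:])

def combined_kernel(u, v):
    """K_C(u, v): financial ratio terms plus a linear kernel on the text part."""
    u11, u21, u12, u22 = u[:4]
    v11, v21, v12, v22 = v[:4]
    fin = (u11 * v11) / (u21 * v21) + (u22 * v22) / (u12 * v12) \
        + (u11 * v11 * u22 * v22) / (u21 * v21 * u12 * v12)
    return fin + sum(a * b for a, b in zip(u[4:], v[4:]))

u = (2.0, 4.0, 1.0, 3.0, 1.0, 0.0, 2.0, 1.0)
v = (1.0, 2.0, 2.0, 6.0, 0.0, 1.0, 1.0, 2.0)
```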
The graph is still directed and still contains no cycles. Thus K_C(u, v) is a kernel. A simple example illustrates the Combined Kernel. There is an input of 2 quantitative attributes for both years, u_11, u_21, u_12, u_22, and 4 text attributes, u_5, u_6, u_7, u_8. The input vectors are u = (u_11, u_21, u_12, u_22, u_5, u_6, u_7, u_8)' and v = (v_11, v_21, v_12, v_22, v_5, v_6, v_7, v_8)'. Then

K_C(u, v) = (u_11 v_11)/(u_21 v_21) + (u_22 v_22)/(u_12 v_12) + (u_11 v_11 u_22 v_22)/(u_21 v_21 u_12 v_12) + u_5 v_5 + u_6 v_6 + u_7 v_7 + u_8 v_8

with the following features:

u_11/u_21,  u_22/u_12,  (u_11 u_22)/(u_21 u_12),  u_5,  u_6,  u_7,  u_8

Other kernels could be used in place of the linear kernel, giving additional features on the text. For this study it is not necessary, due to the extensive preprocessing steps used during the creation of the text vector. This chapter explained the method used to combine text and quantitative data. Chapter 7 is the final chapter of the methodology creation. The following three chapters delve into the empirical research: testing, results and a conclusion. Specifically, Chapter 8 details the research questions, the three datasets used for testing (management fraud, bankruptcy, and restatements), and the ontologies created. Chapter 9 gives the results from the tests on the datasets. Chapter 10 gives a summary, conclusion and explanation of future research.

CHAPTER 8
RESEARCH QUESTIONS, METHODOLOGY AND DATA

This chapter explains the research hypotheses, the data gathering methodology, accounting ontology creation and data preprocessing. In Section 8.1 the hypotheses and test mechanisms are articulated. Section 8.2 outlines the research model. Section 8.3 explains the methods used for gathering data for the fraud, bankruptcy and restatement datasets. Section 8.4 details the ontologies created and Section 8.5 explains data preprocessing.

8.1 Hypotheses

The main contributions of this research are threefold. (1) We have developed a financial kernel that operates on quantitative financial attributes.
(2) We have developed an accounting ontology to aid in using textual data in learning tasks. (3) We have combined these two kernels to simultaneously analyze quantitative and text information. These methods will be tested for their effectiveness in early detection of financial events. Our first testable hypothesis is as follows.

Hypothesis 1: A support vector machine using the Combined Kernel, which includes the Financial Kernel for quantitative data and the Text Kernel for text data, detects financial events with greater accuracy than quantitative methods alone, including the Financial Kernel.

A series of tests are run on the financial events data using the Combined Kernel. All available data, both quantitative and text, are used. We use 10-fold cross validation as a method for estimating generalization error. We compare the classification accuracy of our method with the other methods explained in Chapter 2, including linear discriminant functions and logit functions. The concepts in WordNet include semantic relationships between individual words. Developing an ontology specific to the domain of accounting allows us to utilize these relationships when creating the text vector. The basic vector space model does not take these relationships into account. The expectation is that the ontology-driven text vector will provide a better representation of accounting-related documents than the basic vector space model.

Hypothesis 2: A support vector machine using data from a text vector filtered through the accounting ontology will detect financial events with greater accuracy than a support vector machine using only the vector space model.

Two tests are run on the financial events data using the combination kernel. One test uses a vector created by filtering the text through the accounting ontology; the other uses a vector of raw word counts. The results of the tests' 10-fold cross validation are reported and compared.
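The 10-fold split used in these tests can be sketched as follows (a plain sequential assignment of example indices to folds; shuffling and stratification are omitted for brevity):

```python
def kfold_indices(n, k=10):
    """Partition n example indices into k folds; yield (train, test)
    index lists, holding out one fold per round."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, test
```

Accuracy is then averaged over the k held-out folds to estimate generalization error.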
Comparing the classification accuracy of the text and quantitative data allows us to effectively compare the "information content" of the numbers against that of the text.

Hypothesis 3: Text filtered through an accounting ontology will detect financial events at least as accurately as pure quantitative methods.

Two tests are run on the financial events data: one uses only the quantitative data, which is fed into the Financial Kernel and the SVM; the other uses only the text data in the form of the concept vector, which is preprocessed using the accounting ontology. The concept vector is fed into the Text Kernel and the SVM. The results of the tests' 10-fold cross validation are compared.

8.2 Research Model

In this section, the research model is explained. Figure 13 shows the process we use to study the efficacy of our approach. The empirical analysis is carried out to test our methodology. Starting on the left of the figure, we gather our dataset, which consists of companies that were shown to be fraudulent and/or bankrupt. We match the fraud and bankrupt companies with non-fraud and non-bankrupt companies based on year, sector, and total assets. Once we have chosen the companies in our dataset, we gather quantitative data from financial statements and text data from the 10Ks. The financial data is converted into a vector of attribute values. The text data is filtered through the accounting ontology and turned into a numerical vector using the counts of the concepts in the ontology. The text and financial vectors are concatenated and run through the combination kernel. An SVM using the combination kernel is used to determine a classifier to distinguish the companies as fraud/non-fraud and bankrupt/non-bankrupt. The financial vector is similarly processed using the financial kernel to get classification results for the quantitative data alone.
We compare the quantitative results against the results for the text-only case by feeding the text vector into the text kernel SVM.

Figure 13 The Discovery Process

8.3 Datasets

The data gathering methods for the Fraud, Bankruptcy and Restatement datasets are described in this section. Text and quantitative data are gathered for all companies in the datasets.

8.3.1 Fraud Data

Gathering fraud data is a task which requires considerable time and effort. The main data sources are the SEC Accounting and Auditing Enforcement Releases [93] as well as the Accounting and Auditing Association monograph by Palmrose [73]. The set was limited to fraud which occurred no earlier than 1993. The extracted financial data consists of financial statement figures for two years. The text dataset consists of the text portion of annual reports (10Ks). As the fraud dataset required both text and quantitative attributes, any company which was missing either the text or the quantitative attributes was deleted from the dataset. The quantitative dataset is shown in Figure 16 of Appendix B. The attribute definitions are as follows:

Ticker - company ticker for stock market
Label - fraudulent (1), non-fraudulent (-1)
Ind - industry number
Year - 1st year of data collection
Salesyr[1,2] - Sales
ARyr[1,2] - Accounts Receivable
INVyr[1,2] - Inventory
TAyr[1,2] - Total Assets
OAyr[1,2] - Other Assets
CEyr[1,2] - Capital Expenditures

The attributes were chosen based on their reported occurrence in cases of fraud. A secondary reason for choosing these particular attributes was the likelihood of getting reported data. This is in contrast to other highly reported fraud attributes, such as Advertising Expense, Research and Development Expense and Allowance for Bad Debts. The dimension of the feature space for the Financial Kernel in this experiment is 90. The features are listed in Figure 14. A "YOY" in front of a ratio means the year-over-year change for that ratio.
Here is a listing of the features:

Salesyr1/ARyr1, ARyr1/Salesyr1, Salesyr2/ARyr2, ARyr2/Salesyr2, YOYSalesyr1/ARyr1, YOYARyr1/Salesyr1
Salesyr1/INVyr1, INVyr1/Salesyr1, Salesyr2/INVyr2, INVyr2/Salesyr2, YOYSalesyr1/INVyr1, YOYINVyr1/Salesyr1
Salesyr1/TAyr1, TAyr1/Salesyr1, Salesyr2/TAyr2, TAyr2/Salesyr2, YOYSalesyr1/TAyr1, YOYTAyr1/Salesyr1
Salesyr1/OAyr1, OAyr1/Salesyr1, Salesyr2/OAyr2, OAyr2/Salesyr2, YOYSalesyr1/OAyr1, YOYOAyr1/Salesyr1
Salesyr1/CEyr1, CEyr1/Salesyr1, Salesyr2/CEyr2, CEyr2/Salesyr2, YOYSalesyr1/CEyr1, YOYCEyr1/Salesyr1
ARyr1/INVyr1, INVyr1/ARyr1, ARyr2/INVyr2, INVyr2/ARyr2, YOYARyr1/INVyr1, YOYINVyr1/ARyr1
ARyr1/TAyr1, TAyr1/ARyr1, ARyr2/TAyr2, TAyr2/ARyr2, YOYARyr1/TAyr1, YOYTAyr1/ARyr1
ARyr1/OAyr1, OAyr1/ARyr1, ARyr2/OAyr2, OAyr2/ARyr2, YOYARyr1/OAyr1, YOYOAyr1/ARyr1
ARyr1/CEyr1, CEyr1/ARyr1, ARyr2/CEyr2, CEyr2/ARyr2, YOYARyr1/CEyr1, YOYCEyr1/ARyr1
INVyr1/TAyr1, TAyr1/INVyr1, INVyr2/TAyr2, TAyr2/INVyr2, YOYINVyr1/TAyr1, YOYTAyr1/INVyr1
INVyr1/OAyr1, OAyr1/INVyr1, INVyr2/OAyr2, OAyr2/INVyr2, YOYINVyr1/OAyr1, YOYOAyr1/INVyr1
INVyr1/CEyr1, CEyr1/INVyr1, INVyr2/CEyr2, CEyr2/INVyr2, YOYINVyr1/CEyr1, YOYCEyr1/INVyr1
TAyr1/OAyr1, OAyr1/TAyr1, TAyr2/OAyr2, OAyr2/TAyr2, YOYTAyr1/OAyr1, YOYOAyr1/TAyr1
TAyr1/CEyr1, CEyr1/TAyr1, TAyr2/CEyr2, CEyr2/TAyr2, YOYTAyr1/CEyr1, YOYCEyr1/TAyr1
OAyr1/CEyr1, CEyr1/OAyr1, OAyr2/CEyr2, CEyr2/OAyr2, YOYOAyr1/CEyr1, YOYCEyr1/OAyr1

Figure 14 Fraud Features

8.3.2 Bankruptcy Data

The bankrupt companies were chosen using the Compustat Research database [19]. All chosen companies are from the Manufacturing sector (industry codes 2000-3999). The companies chosen were delisted between 1993 and 2002. A company is delisted when it does not meet the minimal requirements of financial stability according to the market (NYSE, NASDAQ, AMEX). The analysis is limited to post-1992 company years because the model requires the text of the 10Ks to be in electronic form; electronic 10Ks were not available until 1993. Figure 17 in Appendix B shows the entire quantitative dataset.
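Both the fraud and bankruptcy feature spaces follow the same pattern: 6 attributes give 15 pairs, each contributing both ratios in each year plus a year-over-year term per direction, for 15 x 6 = 90 features. A sketch of this construction (here the YOY term is taken as the change in the ratio, an assumption; Chapter 5 defines the actual form):

```python
from itertools import combinations

ATTRS = ["Sales", "AR", "INV", "TA", "OA", "CE"]  # fraud-dataset attributes

def financial_features(values):
    """Build the 90 ratio features from a dict like {'Salesyr1': 120.0, ...}."""
    feats = {}
    for a, b in combinations(ATTRS, 2):
        for x, y in ((a, b), (b, a)):
            r1 = values[x + "yr1"] / values[y + "yr1"]
            r2 = values[x + "yr2"] / values[y + "yr2"]
            feats["%syr1/%syr1" % (x, y)] = r1
            feats["%syr2/%syr2" % (x, y)] = r2
            feats["YOY%s/%s" % (x, y)] = r2 - r1  # YOY as ratio change (assumed)
    return feats
```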
The attribute definitions are as follows:

Label - bankrupt (1), non-bankrupt (-1)
Ticker - company ticker for stock market
Ind - industry number
Year - 1st year of data collection
TAyr[1,2] - Total Assets
REyr[1,2] - Retained Earnings
WCyr[1,2] - Working Capital
EBITyr[1,2] - Earnings before Interest and Taxes
SEyr[1,2] - Stockholders' Equity
TLyr[1,2] - Total Liabilities

The attributes chosen are the components of the Altman Z-score for manufacturing. The dimension of the feature space for the Financial Kernel in this experiment is 90. The features are listed in Figure 15. A "YOY" in front of a ratio means the year-over-year change for that ratio. Here is a listing of the features:

TAyr1/REyr1, REyr1/TAyr1, TAyr2/REyr2, REyr2/TAyr2, YOYTAyr1/REyr1, YOYREyr1/TAyr1
TAyr1/WCyr1, WCyr1/TAyr1, TAyr2/WCyr2, WCyr2/TAyr2, YOYTAyr1/WCyr1, YOYWCyr1/TAyr1
TAyr1/EBITyr1, EBITyr1/TAyr1, TAyr2/EBITyr2, EBITyr2/TAyr2, YOYTAyr1/EBITyr1, YOYEBITyr1/TAyr1
TAyr1/SEyr1, SEyr1/TAyr1, TAyr2/SEyr2, SEyr2/TAyr2, YOYTAyr1/SEyr1, YOYSEyr1/TAyr1
TAyr1/TLyr1, TLyr1/TAyr1, TAyr2/TLyr2, TLyr2/TAyr2, YOYTAyr1/TLyr1, YOYTLyr1/TAyr1
REyr1/WCyr1, WCyr1/REyr1, REyr2/WCyr2, WCyr2/REyr2, YOYREyr1/WCyr1, YOYWCyr1/REyr1
REyr1/EBITyr1, EBITyr1/REyr1, REyr2/EBITyr2, EBITyr2/REyr2, YOYREyr1/EBITyr1, YOYEBITyr1/REyr1
REyr1/SEyr1, SEyr1/REyr1, REyr2/SEyr2, SEyr2/REyr2, YOYREyr1/SEyr1, YOYSEyr1/REyr1
REyr1/TLyr1, TLyr1/REyr1, REyr2/TLyr2, TLyr2/REyr2, YOYREyr1/TLyr1, YOYTLyr1/REyr1
WCyr1/EBITyr1, EBITyr1/WCyr1, WCyr2/EBITyr2, EBITyr2/WCyr2, YOYWCyr1/EBITyr1, YOYEBITyr1/WCyr1
WCyr1/SEyr1, SEyr1/WCyr1, WCyr2/SEyr2, SEyr2/WCyr2, YOYWCyr1/SEyr1, YOYSEyr1/WCyr1
WCyr1/TLyr1, TLyr1/WCyr1, WCyr2/TLyr2, TLyr2/WCyr2, YOYWCyr1/TLyr1, YOYTLyr1/WCyr1
EBITyr1/SEyr1, SEyr1/EBITyr1, EBITyr2/SEyr2, SEyr2/EBITyr2, YOYEBITyr1/SEyr1, YOYSEyr1/EBITyr1
EBITyr1/TLyr1, TLyr1/EBITyr1, EBITyr2/TLyr2, TLyr2/EBITyr2, YOYEBITyr1/TLyr1, YOYTLyr1/EBITyr1
SEyr1/TLyr1, TLyr1/SEyr1, SEyr2/TLyr2, TLyr2/SEyr2, YOYSEyr1/TLyr1, YOYTLyr1/SEyr1

Figure 15 Bankruptcy Features

8.3.3 Restatement Data

Restatements as defined in
this research are annual reports by publicly traded companies which have been restated either voluntarily or involuntarily. Restatements are a much more loosely defined dataset than that of bankruptcy or fraud. There is strong interest in the underlying causes of restatements, which was a primary motivation for the addition of this dataset. The restatements analyzed in this study are a subset of all restatements of publicly traded companies for the years 1997-2002 (details are explained below). The Restatement dataset was gathered using report GAO-03-138 [37] by the General Accounting Office. The restatements in this report involve accounting irregularities resulting in material misstatements of financial results. Restatements can be seen as a superset which includes fraud and earnings management as subsets: when a company is deemed to have committed fraudulent activity or managed earnings, the SEC requires that the company restate its financials. The GAO report includes an appendix which lists all restatements for the years between 1997 and 2002. The restatement dataset is the largest of the datasets tested (800/1,379); for comparison, the fraud dataset had 122 cases and the bankruptcy dataset had 156 cases. There were 919 restatements for publicly traded companies between the years of 1997 and 2002 [37]. The quantitative dataset covered 1,379 companies, 690 of which were restatements and 689 of which were non-restatements. The smaller, 800-case dataset is the subset of the 1,379-case dataset which includes both text and quantitative attributes; it is split evenly between restatements and non-restatements. The reduction from 919 to 690 was due entirely to the lack of quantitative data available for some of the companies in the GAO report. The drop from 690 to 400 restatements for the combined dataset was due to the inability to get 10K data for some of the GAO companies.
This was due in part to the GAO's inclusion of foreign companies and companies traded on over-the-counter markets, both of which are not required to submit the same type of 10K. The quantitative attributes for this dataset are as follows:

Ticker - company ticker for stock market
Label - restatement (1), non-restatement (-1)
Ind - industry number
Year - 1st year of data collection
Salesyr[1,2] - Sales
ARyr[1,2] - Accounts Receivable
INVyr[1,2] - Inventory
TAyr[1,2] - Total Assets
OAyr[1,2] - Other Assets
CEyr[1,2] - Capital Expenditures

The entire quantitative dataset is in Appendix B as Figure 18. The features for the Restatement dataset are the same as the features in the Fraud dataset, listed in Figure 14.

8.4 The Ontology

The ontology is a three-level ontology composed of concepts, two-grams and three-grams. The concepts may be one-word or two-word concepts. The two-grams and three-grams are built on top of the concepts. The size of the ontology is determined at three levels: the concept, two-gram, and three-gram level. A concept can have many children at the two-gram and three-gram levels. A two-gram can have many children at the three-gram level. A two-gram is always a direct child of a concept. A three-gram may be a direct child of a concept or of a two-gram. Appendix A shows a 300-dimension ontology. This ontology was built using the entire GAAP text [28] as the accounting corpus. The 300 dimensions comprise 100 concepts, 100 two-grams and 100 three-grams. Given the small number of examples in the fraud and bankruptcy datasets, 300 dimensions was the largest ontology created. The concepts are determined by the functions described in Chapter 6; the concepts chosen for this ontology are the ones with the highest scores. The two-grams and three-grams are chosen based on mutual information scores, using respectively the Dice Coefficient and 113 [5]. Commonly accepted mutual information scores are available for two-grams and three-grams.
Higher-order n-grams do not have accepted mutual information scores, so this analysis is limited to two-grams and three-grams. An ontology of 100 two-grams and 100 three-grams makes it feasible to have some concepts with both children and grandchildren. The deeper the tree, the more specific the ontology gets; the effect is a more precise ontology. The prediction accuracy on the test datasets ultimately determines which ontologies are the best for this particular project. The two-grams and three-grams are preceded by their part-of-speech (n - noun, a - adjective, v - verb). As seen in Appendix A, there are two numbers after each two-gram and three-gram. The first is the mutual information score and the second is the overall ranking of the n-gram's importance as compared to all n-grams. The ranking is used to determine which two-grams and three-grams are used in the ontology. A two-gram or three-gram is eligible for the ontology only if at least one of its component words is a concept in the ontology. Ontology creation is an iterative process; the process must be refined based on the actual results achieved. The 300-dimension GAAP ontology (Appendix A) was used in conjunction with 10Ks of bankrupt and non-bankrupt companies (see Section 9.x for further details). Due to the small size of the dataset, the 300-dimension ontology appears to be overfitting. Two additional GAAP ontologies were therefore created, having 60 and 10 dimensions, respectively; these ontologies are also available in Appendix A. Choosing an accounting text as the basis of the ontology has a major impact on the results. GAAP was chosen because it is a general purpose text that covers all major accounting topics and is written in natural language. A drawback of GAAP is its indirect relationship to the MDNAs; a more direct accounting text is the MDNAs themselves. A set of ontologies was therefore created using the MDNAs from the bankrupt and non-bankrupt companies as the accounting text.
These ontologies have the following dimensions: 150 (50 concepts, 50 two-grams, 50 three-grams), 50 (50 concepts), and 25 (25 concepts). All ontologies are available in Appendix A.

8.5 Data Gathering and Preprocessing

The financial information for bankrupt firms was gathered for the two consecutive years prior to delisting. In the event that the financial information was not available for the two years directly prior to delisting, the latest two years of pre-delisting data were taken instead. In the case of fraud, the financial data was gathered for the first year of fraud and the year prior to fraud, as reported by the SEC. For example, if the first year of fraudulent activity was 2000, then data from 1999 and 2000 was gathered. In the case of restatements, the data was gathered for the year of the restatement and the year prior to the restatement. The fraudulent/bankrupt/restatement companies were matched with non-fraudulent/non-bankrupt/non-restatement companies based on industry, year and total assets. A match was accepted if the total assets of a non-fraudulent/non-bankrupt/non-restatement company were within 10% of those of the fraudulent/bankrupt company for year one. If no company met this requirement, then the company with the nearest total asset value was chosen. The Compustat Industrial Annual database was used in conjunction with a Perl script to download the quantitative financial data for all three datasets. The 10Ks were gathered directly from www.sec.gov. There is one 10K per company, and the year of the 10K matches (in most instances) the last year of the financial information. If the 10K was not available for the last year, then the 10K was chosen as follows: (1) the 10K for the year prior to the final year; (2) if (1) was not available, the year after the final year (as long as it is not past the delist year in the bankruptcy case, the restatement year in the restatement case or the fraud year in the fraud case).
If (1) and (2) were not available, both the company and its match company were deleted from the analysis.

The text analysis was limited to the section entitled "Management's Discussion and Analysis of Financial Condition and Results of Operations" (MDNA). The MDNA section is a natural choice, as it is the portion of the 10K in which management explains the underlying causes of the company's financial condition. It is also a section where forward-looking statements are allowed.

Using the Financial Kernel, the attributes are mapped to features, as explained in Chapter 5. The total size of the attribute space is 12 for the fraud, bankruptcy, and restatement datasets. The attributes in the fraud and restatement datasets are described in Section 8.2; the attributes in the bankruptcy dataset are described in Section 8.3. The size of the feature space is determined by the function 6(A/Y choose 2), where A/Y is the number of attributes per year.

8.5.1 Preprocessing: Quantitative Data

There are three issues to consider for quantitative attributes: missing data, "0-valued" data, and scaling. Missing data are a common problem with publicly available financial information. The method used to fill in the missing data for this work is multiple imputation [81]. This method takes into account not only the statistics of the missing variable over the entire dataset, but also the relationship between the missing variable and the other variables in the example. The data were run through a multiple imputation routine in the statistics package R [81]. Quantitative attributes with a value of 0 are a problem in this analysis because of the extensive use of ratios in the Financial Kernel: a ratio of the form x/0 is undefined for any x. To avoid this problem, 0-valued data are given a value of .0001, and the entire dataset is scaled between -1 and 1.

8.5.2 Preprocessing: Text Data

The preprocessing of the text data involved the following steps:

(1) Making all text lowercase.
This is done because a computer will treat the same word as different words if the case differs; for example, "Asset", "asset", and "ASSET" would be considered three separate words. Making all letters lowercase avoids this problem.

(2) Deleting stopwords. Stopwords are common words that add noise to text analysis, and deleting them is a standard way of cleaning the text. The stopword list used for preprocessing the ontology is the same stoplist used for preprocessing the MDNAs; the stoplist is available in Appendix A.

(3) Part-of-speech tagging and stemming. Part-of-speech tagging ensures that matches between the MDNAs and the ontology occur only for words with the same spelling and part of speech. Stemming removes the suffixes from words to facilitate matching of concepts that are the same but used in different tenses.

(4) Concept naming. For this step, all synsets from each concept in the ontology are given a single, representative word. For example, the concept liability has three synonyms: liability, indebtedness, and financial obligation. The MDNA is searched for all three words, and each instance is replaced with the single representative word. This allows correct matches between the ontology (which was preprocessed with concept naming as well) and the MDNAs.

Simple counts of each component of the ontology are placed in vector form for each company MDNA. The size of the ontology is a user-defined parameter: the ontology is limited to the top-scoring concepts, two-grams, and three-grams, and the user decides how many of each it should contain. The main limitation is that only two-grams and three-grams that have an ontology concept as one of their component words can be in the ontology.

Below is an example from one MDNA (company name Fifth Dimension Inc., year 1996). This is a sample of the raw text:

The Corporation spent $122,128 on capital additions during 1996 while recording $124,611 of depreciation expense.
A reduction in capital spending is projected for 1997 while depreciation reserves are projected at slightly lower levels than in 1996.

This is the text after Steps (1) and (2):

corporation spent 122,128 capital additions 1996 recording 124,611 depreciation expense. reduction capital spending projected 1997 depreciation reserves projected slightly lower levels 1996.

This is the text after Steps (3) and (4):

null/JJ/null corporation/NN/corporation spent/VBD/spend 122/CD/122 ,/,/, 128/CD/128 capital/NN/capital additions/NNS/addition 1996/CD/1996 recording/NN/recording 124/CD/124 ,/,/, 611/CD/611 depreciation/NN/depreciation expense/NN/expense ././. reduction/NN/reduction capital/NN/capital spending/NN/spending projected/VBN/project 1997/CD/1997 depreciation/NN/depreciation reserves/NNS/reserve projected/VBN/project slightly/RB/slightly lower/JJR/lower levels/NNS/level 1996/CD/1996 ././.

The complete MDNA is available via a link in Appendix B.

The text vectors are created by totaling the number of times each ontology component is encountered in the text of a company's MDNA. The text vectors are normalized by dividing each vector component by the total word count of the company's MDNA text. This normalization ensures that the importance of concepts to a particular document is not diminished by differences in size between documents.

This chapter gave the research questions along with detailed explanations of the bankruptcy, fraud, and restatement datasets. Data preprocessing was explained, as was ontology creation. In the next chapter, test results are given on the three datasets, along with discussions of each.

CHAPTER 9
RESULTS

This chapter gives the results of the empirical tests. Each dataset is tested individually, and the results are listed in table format. Following the results for each dataset is a discussion of the results. The format of the results is explained below.
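Before turning to the tables, the text-vector construction described in the preprocessing discussion above (count each ontology component in an MDNA, then divide by the document's total word count) can be sketched as follows. The token list and ontology terms are toy stand-ins, not the actual ontology.

```python
from collections import Counter

def text_vector(mdna_tokens, ontology_terms):
    """Count how often each ontology component occurs in a company's MDNA
    token stream, then normalize by the total word count so that long and
    short filings are directly comparable."""
    counts = Counter(mdna_tokens)
    total = len(mdna_tokens)
    return [counts[term] / total for term in ontology_terms]

vec = text_vector(["asset", "liability", "asset", "revenue"],
                  ["asset", "liability", "depreciation"])
```

Here "asset" occurs in 2 of 4 tokens and "depreciation" in none, so the resulting vector is [0.5, 0.25, 0.0] regardless of how long the filing is.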
The experiments are set up so that the hypotheses in Chapter 8 can be either supported or negated. There are three main categories of tests: the quantitative data are tested using an SVM with the Financial Kernel; the Text Kernel is tested using various sizes and types of ontologies; and the Combination Kernel is likewise tested using various sizes and types of ontologies. The results are given in Tables 2-40. The table headings are described as follows:

"SV" is the number of support vectors. SV/ℓ, where ℓ is the number of examples in the dataset, is a rough measure of the generalizability of the "Test on Training" results.

"C" is a user-defined parameter that sets the penalty for a mistake in the quadratic optimization problem. Deciding on the right C is more of an art than a science; after raising C to a certain point, the results level off or decline. Results are given for various values of C.

"Test on Training" gives the test results on the examples used to train the SVM. The number shown is the prediction accuracy of the model.

"10-fold Cross Validation" results are the average prediction accuracy of 10 SVM runs where 10% of the examples are left out from training on each run and used for