Automated knowledge acquisition using inductive learning


Material Information

Automated knowledge acquisition using inductive learning application to Mutual Fund classification
Physical Description:
ix, 217 leaves : ill. ; 29 cm.
Norris, Robert Clayton
Publication Date:


Subjects / Keywords:
Decision and Information Sciences thesis, Ph.D. (lcsh)
Dissertations, Academic -- Decision and Information Sciences -- UF (lcsh)
bibliography (marcgt)
non-fiction (marcgt)


Thesis (Ph.D.)--University of Florida, 1997.
Includes bibliographical references (leaves 155-216).
Statement of Responsibility:
by Robert Clayton Norris.
General Note:

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 028715168
oclc - 48449355
System ID:








Copyright 1997


Robert Clayton Norris, Jr.

To my wife, Suzanne

and my daughter, Christina


I wish to thank Dr. Gary J. Koehler, the chairman of my committee, and Dr.

Robert C. Radcliffe, the external member, who gave me their time, support, guidance,

and patience throughout this research. Dr. Koehler provided the initial concept of the

research topic in the area of artificial intelligence and finance. Dr. Radcliffe provided the

idea of studying the Morningstar rating system. I wish to thank Dr. Richard Elnicki and

Dr. Patrick Thompson for serving on my committee and for their help and advice over

these years.

I wish to thank Dr. H. Russell Fogler who has always shown an interest in my

research. I also would like to thank Dean Earle C. Traynham and Dr. Robert C. Pickhardt

of the College of Business Administration, University of North Florida, for their support.

I wish to thank my wife, Suzanne, for her love, assistance, and understanding as I

worked on this most important undertaking. I would also like to thank my aunt and

uncle, Lillian and Jack Norris, for encouraging me to continue my education. Finally, I

would like to thank my late father for his advice and guidance over the years and my late

mother for her love.


ACKNOWLEDGEMENTS .......................................................... iv

ABSTRACT ................................................................ viii


1 INTRODUCTION ............................................................. 1

1.1 Background ............................................................
1.2 Research Problem ..................................................... 3
1.3 Purpose .............................................................. 5
1.4 Motivation ...........................................................
1.5 Chapter Organization ................................................. 6

2 LITERATURE REVIEW ........................................................ 8

2.1 Historical Overview of Machine Learning .............................. 8
2.1.1 Brief History of AI Research on Learning ........................... 8
2.1.2 Four Perspectives on Learning ..................................... 10
2.2 AI and Financial Applications ....................................... 15
2.2.1 Expert Systems .................................................... 15
2.2.2 Neural Networks and Financial Applications ........................ 17
2.2.3 Genetic Algorithms and Financial Applications ..................... 27
2.3 The C4.5 Learning System ............................................ 32
2.3.1 Brief History of C4.5 ............................................. 33
2.3.2 C4.5 Algorithms ................................................... 37
2.3.3 Limitations of C4.5 ............................................... 44
2.3.4 C4.5 Financial Applications ....................................... 45
2.4 Linear Discriminant Analysis ........................................ 47
2.4.1 Overview .......................................................... 47
2.4.2 Limitations of LDA ................................................ 48
2.5 Logistic Regression (Logit) ......................................... 49
2.6 Summary ............................................................. 50

3 DOMAIN PROBLEM .......................................................... 51

3.1 Overview of Mutual Fund Ratings Systems ............................. 51
3.1.1 Morningstar, Inc. Overview ........................................ 54
3.1.2 Morningstar Rating System ......................................... 55
3.1.3 Review and Criticism of the Morningstar Rating System ............. 58
3.1.4 Investment Managers' Use of Ratings ............................... 59
3.1.5 Performance Persistence in Mutual Funds ........................... 61
3.1.6 Review of Yearly Variation of Morningstar Ratings ................. 64
3.2 Problem Specification ............................................... 68


4.1 Research Goals ...................................................... 71
4.2 Research Caveats .................................................... 71
4.3 Example Databases ................................................... 72
4.4 Brief Overview of the Research Phases ............................... 72
4.5 Phase 1 Classifying 1993 Funds ...................................... 73
4.5.1 Methodology ....................................................... 73
4.5.2 Results ........................................................... 77
4.5.3 Conclusions ....................................................... 81
4.6 Phase 2 1993 Data with Derived Features ............................. 81
4.6.1 Methodology ....................................................... 81
4.6.2 Results for the Regular Dataset ................................... 84
4.6.3 Results for the Derived Features Dataset .......................... 87
4.6.4 Conclusions ....................................................... 89
4.7 Phase 3 Comparing 5-Star and 3-Star Classifications ................. 91
4.7.1 Methodology ....................................................... 91
4.7.2 Results ........................................................... 93
4.7.3 Conclusions ...................................................... 100
4.8 Phase 4 Crossvalidation with C4.5 .................................. 101
4.8.1 Methodology ...................................................... 101
4.8.2 Results .......................................................... 103
4.8.3 Conclusions ...................................................... 104
4.9 Overall Summary .................................................... 105

RATINGS CHANGES .......................................................... 107

5.1 Phase 5 Predicting Ratings with a Common Feature
    Vector Over Two Years .............................................. 109
5.1.1 Methodology ...................................................... 109
5.1.2 Results .......................................................... 110
5.1.3 Conclusions ...................................................... 112

5.2 Phase 6 Predicting Matched Mutual Fund Rating Changes .............. 113
5.2.1 Methodology ...................................................... 113
5.2.2 Results for 1994 Data Predicting 1995 Ratings .................... 116
5.2.3 Results for 1995 Data Predicting 1996 Ratings .................... 125
5.2.4 Conclusions ...................................................... 133
5.3 Phase 7 Predicting Unmatched Mutual Fund Ratings ................... 134
5.3.1 Methodology ...................................................... 134
5.3.2 Results for 1994 Data Predicting 1995 Ratings .................... 135
5.3.3 Results for 1995 Data Predicting 1996 Ratings .................... 141
5.3.4 Conclusions ...................................................... 148
5.4 Overall Summary .................................................... 148

6 SUMMARY AND FUTURE RESEARCH ........................................... 150


A DESCRIPTION OF MUTUAL FUND FEATURES ................................... 155

B PHASE 1 CLASSIFICATION FEATURES ....................................... 161

C PHASE 2 CLASSIFICATION FEATURES ....................................... 165

D PHASE 3 CLASSIFICATION FEATURES ....................................... 174

E BEST CLASSIFICATION TREES FROM PHASES 1-4 ............................. 185

REFERENCES .............................................................. 208

BIOGRAPHICAL SKETCH ..................................................... 217

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Robert Clayton Norris, Jr.

December 1997

Chairman: Dr. Gary J. Koehler
Major Department: Decision and Information Sciences

This research uses an inductive learning methodology that builds decision trees,

Quinlan's C4.5, to classify mutual funds according to the Morningstar Mutual Fund five

class or star rating system and predict the mutual fund ratings one year in the future. In

the first part of the research, I compare the performance of C4.5, Logistic Regression and

Linear Discriminant Analysis in classifying mutual funds according to the Morningstar

rating system. As the size of the training set increases, so does C4.5's performance

versus the two statistical methods. Overall, C4.5 performed as well as Logistic

Regression in classifying the mutual funds and outperformed Linear Discriminant

Analysis. This part of the research also explored the ability of C4.5 to classify equity

mutual funds that were unrated by Morningstar. The results suggested that, with the

proper features and a modification to the Morningstar five class rating system to three

classes, unrated mutual funds could be classified with a 30% error rate.

Anecdotal evidence suggested that investors purchase mutual funds by ratings and

have an expectation that the rating will stay the same or improve. The second part of the

research used a training set of one year to construct a decision tree with C4.5 to predict

the ratings of mutual funds one year in the future. The testing set consisted of examples

from the prediction year in question and the predictions were compared to the actual

ratings for that year. The results were that, with the necessary feature vector, five-star

fund ratings could be predicted with 65% accuracy. With a modification to the rating

system changing it to three stars, predicted mutual fund ratings were 75% accurate.

This research also identifies features that are useful for classifying mutual

funds by the Morningstar rating system and for predicting fund ratings.


1.1 Background

Glancing through a copy of Technical Analysis of Stocks & Commodities

magazine, you find a great deal of information about artificial intelligence (AI) and the

selection of stocks and commodities for portfolios. In the February 1997 issue of the

magazine, the Traders' Glossary even defines the term neural network. However, it is

difficult to find scientific research about AI systems used on Wall Street, since such systems

are usually proprietary and could provide a competitive advantage to the investment firm.

Five years ago AI use in financial applications was just beginning to be noticed.

For example, consider this story in the Wall Street Journal of October 27, 1992, about the use of

artificial intelligence to select stocks for a mutual fund portfolio:

"Bradford Lewis, manager of Fidelity Investment Inc.'s Fidelity
Disciplined Equity Fund, has found a way to out-perform standard indices
using neural network software. A neural network is an artificial
intelligence program that copies the workings of the human brain. The
mutual fund, which invests in the same businesses as the Standard and
Poor's 500 stock index (S & P 500), has won over the index by 2.3 to 5.6
percent for three years running and is maintaining its performance in FY
1993...Lewis checks with analysts at Fidelity to double-check his results,
but sometimes when he buys stock contrary to the computer's advice, he
loses money." (McGough, 1992, p. C1)

Academic research concerning mutual funds and artificial intelligence is

relatively new since only three studies were cited in the literature from 1986 to the

present. Chiang et al. (1996) described a neural network that used historical economic

information to predict the end-of-year Net Asset Value and that outperformed regression

models. A second study (Lettau, 1994) used genetic algorithms to simulate the actions of

investors buying and selling mutual funds. The third paper studied portfolio decisions of

boundedly rational agents by modeling learning with a Genetic Algorithm (Lettau, 1997)

and had nothing to do with mutual fund ratings, the topic of this research.

The difficulty of conducting research about working AI systems requires

researchers to design systems for study that would be of interest to the investor and

practitioner. This leaves much room for applied research to develop systems that could,

for example, classify mutual funds according to one of several popular rating systems. If

we could classify mutual funds according to a rating system, why not go one step further

and try to predict the ratings over some fixed horizon. Classification and prediction of

mutual fund ratings are the essence of our research with an inductive learning program

called C4.5, the direct descendent of ID3 (Quinlan, 1993, p.vii).

Machine learning (ML) is a branch of artificial intelligence concerned with

automated knowledge acquisition and inductive learning is a strategy of ML that seeks to

produce generally applicable rules from the examination of a number of examples (Trippi

and Lee, 1996). Since inductively generated rules could be used for classification

problems, a common concern is the performance relative to other existing models for

classification, such as linear discriminant analysis (LDA) and logistic regression (Logit).

Unlike LDA and Logit, inductive learning makes no a priori assumptions about forms of

the distribution of data (e.g., a normal distribution) or the relationship between the

features or variables describing each mutual fund. C4.5 builds decision trees to classify

the examples, which consist of the features and an indicator function for class

membership. It then uses the best decision tree produced using a training set of examples

to classify unseen examples in a testing set to determine the accuracy of the tree. C4.5

also has the ability to translate the decision tree into a ruleset that could be used as an

alternative means for classification.
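To make the splitting idea concrete, the following sketch computes a C4.5-style gain ratio for candidate features. The toy fund features and star labels are invented for illustration and are not the study's actual data; C4.5 itself also handles continuous features and pruning, which this sketch omits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(examples, labels, feature):
    """Gain ratio for splitting `examples` on a discrete `feature`."""
    n = len(examples)
    # Partition the label lists by the feature's value.
    partitions = {}
    for ex, y in zip(examples, labels):
        partitions.setdefault(ex[feature], []).append(y)
    # Information gain = entropy before the split minus weighted entropy after.
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    gain = entropy(labels) - remainder
    # Split info penalizes features with many distinct values.
    split_info = entropy([ex[feature] for ex in examples])
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical mutual-fund examples: an indicator class label per feature vector.
examples = [
    {"expense": "low", "turnover": "high"},
    {"expense": "low", "turnover": "low"},
    {"expense": "high", "turnover": "high"},
    {"expense": "high", "turnover": "low"},
]
labels = ["5-star", "5-star", "1-star", "1-star"]
print(gain_ratio(examples, labels, "expense"))   # expense separates the classes perfectly
print(gain_ratio(examples, labels, "turnover"))  # turnover carries no class information
```

The tree builder would place the highest-ratio feature at the root, partition the training examples, and recurse on each branch.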

We have selected the C4.5 decision tree generating program for this research for

several reasons. First, as our literature review will show, decision tree programs,

specifically ID3 and its successors, have been used for a variety of financial applications

over 10 years (Braun and Chandler, 1987) and good results were achieved. Second,

decision trees provide practitioners with a way of understanding how an individual

example was classified. They can start at the root and see the values used for the features

in partitioning the examples into their respective classes. The decision tree may be

complex and difficult to understand but each example can be explained with it. Other AI

programs used for financial applications do not have this capability. Neural networks

inherently lack explanatory capability (Trippi and Lee, 1996). Genetic algorithms are

complex systems that are considered hard to design and analyze (Goldberg, 1994). Third,

C4.5 processes discrete, continuous and nominal valued features without transformation

while this is not possible for neural networks and genetic algorithms.

1.2 Research Problem

The number of common stocks far exceeded the number of mutual funds for

many years after the enactment of the Investment Company Act of 1940. After all, the

acquisition of mutual funds was a way of selecting a professional portfolio manager to

pick stocks for you and, by 1976, there were 452 funds. Today we have the situation

where the number of equity and bond funds exceeds the number of stocks on the New

York Stock Exchange. For many investors, selecting a fund has become a difficult

decision. Over the years rating services have appeared to aid the investor in their

decision of which mutual funds to buy and sell.

There is contemporary evidence, presented later in this study, that investors

appear to be buying mutual funds based on the ratings. This occurs despite the rating

services' disclaimer that their ratings evaluate historical performance and are not a

predictor of future performance. We also found that the rating services do not rate all the

mutual funds. Morningstar, the major mutual fund rating service, does not evaluate a

mutual fund for a rating unless it has a three-year financial history. For example, the

June 30, 1996, Morningstar Principia CD-ROM showed Morningstar was tracking 3,794

equity mutual funds but had rated only 1,848 funds, or fewer than half. Chapter 3 discusses

how Morningstar rates mutual funds.

Being able to classify equity mutual funds according to the rules of a rating

system would be an important ability. There may be relationships among the variety of

data, financial and otherwise, that would permit the classification of mutual funds not yet

rated by Morningstar. In addition, if we could classify mutual funds with a high degree

of accuracy to a known rating system, there could be relationships among the data that

permit predicting the future rating of a mutual fund already rated by Morningstar.

In developing a process to classify mutual funds and predict their ratings, we

could also automate the identification of important features that aid the classification

process. In addition, we could use decision trees to develop a knowledge base of the

relationships between these features.

This investigation of inductive learning using decision trees for classification and

prediction consists of two parts. The first part evaluates the performance of C4.5 in

classifying mutual funds against two statistical techniques used for classification in the

field of finance, LDA and Logit. The second part evaluates the ability of C4.5 to predict

mutual fund ratings by comparing the performance of C4.5 to actual ratings and

conducting statistical tests of goodness-of-fit on the predicted and actual ratings

distributions. The results are analyzed to gain insights into the relationships among the

data used for classification and prediction.

The benefits of this research could be extended to studying other mutual fund

rating systems and other types of mutual funds such as bond and hybrid funds. This

problem is of interest because the domain theory is not well developed. In such

problems, data-driven search techniques, such as inductive learning, have been found to

provide particularly good results.

1.3 Purpose

The approach to this study is empirical. Beginning with the research problem a

number of experiments were designed using an AI methodology and current statistical

classification techniques to find a solution to the classification of mutual funds and the

prediction of their ratings.

The major goals of the study are as follows:

Demonstrate the relevance of inductive learning techniques in solving a real
world problem.

Investigate relationships in the domain data that could contribute to
understanding the rating system.

Investigate the application of an existing methodology to a new domain.

1.4 Motivation

Interest in the use of AI techniques in various business domains has grown very

rapidly and this is evident in the field of investment analysis. We have already identified

the use of neural networks in selecting portfolios for mutual funds and predicting the end-

of-year Net Asset Value. Additionally, traditional statistical methods are often used in

conjunction with, or in competition with, AI techniques. However, statistical methods

rely upon assumptions about the underlying distribution of the data that are usually

ignored or assumed away. AI methodologies make no such assumptions and can be

applied without invalidating the model.

Exploring the application of induction to mutual funds classification and

prediction will prove to be very useful and provide an area of research for strategic and

competitive use of artificial intelligence in information systems.

1.5 Chapter Organization

Chapter 2 of this study reviews the literature on artificial intelligence and its

financial applications. The chapter begins with an overview of artificial intelligence and

machine learning research, a review of AI and financial applications, followed by

definitions of learning and other formal concepts. Then we discuss the induction process

of decision trees for classification and prediction. The chapter ends with a discussion of

LDA and Logit classification.

Chapter 3 provides an overview of the problem domain. It reviews mutual fund

rating systems, followed by an analysis of the problem, the hypotheses to be tested, and

the benefits of the research effort. This chapter ends with a figure mapping out the

experimental design.


Chapters 4 and 5 provide details of the experimental methodology, results and

analysis, and the conclusions we draw from the results. Chapter 4 involves classifying

mutual funds using C4.5, LDA, and Logit. Chapter 5 involves predicting mutual fund

ratings one year in the future using C4.5.

Finally, Chapter 6 provides a summary and conclusion of the research and

discusses extensions of this work.


2.1 Historical Overview of Machine Learning

The field of machine learning concerns computer programs that can imitate the

learning behavior of humans (Natarajan, 1991). Learning is the improvement of

performance in some environment through the acquisition of knowledge resulting from

experience in that environment (Langley, 1996). From the very beginning of artificial

intelligence, researchers have sought to understand the process of learning and to create

computer programs that can learn (Cohen and Feigenbaum, 1982). Two reasons for

studying learning are to understand the process and to provide computers with the ability to learn.


2.1.1 Brief History of AI Research on Learning

AI research on learning started in the late 1950s with work on self-organizing

systems that modified themselves to adapt to their environments. Through the use of

feedback and a given set of stimuli, the researchers thought the systems would evolve. Most

of these first attempts did not produce systems of any complexity or intelligence (Cohen and

Feigenbaum, 1982).

In the 1960s, AI research turned to knowledge-based problem solving and natural

language understanding. Workers adopted the view that learning is a complex and difficult

process, and that a learning system could not learn high-level concepts by starting without

any knowledge at all. This viewpoint resulted in some researchers studying simple

problems in great detail and led others to incorporate large amounts of domain knowledge

into learning systems so they could explore high-level concepts (Cohen and Feigenbaum, 1982).


A third stage of learning research, searching for ways to acquire knowledge for

expert systems, focuses on all forms of learning, including advice-taking and learning from

analogies. This stage began in earnest in the late 1970s (Feigenbaum et al., 1988).

An expert system imitates the intellectual activities that make a human an expert in

an area such as financial applications. The key elements of a traditional expert system are a

user interface, knowledge base, an inference engine, explanation capability, and a

knowledge acquisition system (Trippi and Lee, 1996). The knowledge base consists of

facts, rules, and heuristic knowledge supplied by an expert who may be assisted in this task

by a knowledge engineer. Knowledge representation formalizes and organizes the

knowledge using IF-THEN production rules (Feigenbaum et al., 1988). Other

representations of knowledge, such as frames, may be used.

Figure 2.1: Basic Structure of an Expert System.

The inference engine uses the knowledge base plus facts provided by the user to

draw inferences in making a recommendation. The system can chain the IF-THEN rules

together from a set of initial conditions moving to a conclusion. This approach to problem

solving is called forward chaining. If the conclusion is known but the path to that

conclusion is not known, then reasoning backwards, or backward chaining, is used

(Feigenbaum et al., 1988).
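Forward chaining can be sketched in a few lines: rules fire whenever all of their premises are among the known facts, and newly inferred conclusions may enable further rules. The rule contents below are invented for illustration, not drawn from any actual rating system.

```python
# IF-THEN rules as (premises, conclusion) pairs; premises are sets of facts.
rules = [
    ({"high 3-year return", "low expense ratio"}, "strong fund"),
    ({"strong fund", "low risk score"}, "candidate high rating"),
]

def forward_chain(facts, rules):
    """Repeatedly fire any rule whose premises are all known facts."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)  # record the newly inferred fact
                changed = True
    return facts

derived = forward_chain(
    {"high 3-year return", "low expense ratio", "low risk score"}, rules)
print("candidate high rating" in derived)  # the chained conclusion is reached
```

Backward chaining would instead start from the goal ("candidate high rating") and work backwards through rule premises to the initial facts.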

Because an expert system uses uncertain or heuristic knowledge, its credibility is

often in question. The explanation capability is available to explain to the user how a

particular fact was inferred or why a particular question was asked. This capability can be

used to find incorrect rules in the knowledge base (Feigenbaum et al., 1988).

2.1.2 Four Perspectives on Learning

With this brief overview of AI research and learning, it is now important to turn to

the four perspectives on learning itself. Simon defined learning as "any process by which a

system improves its performance" (Cohen and Feigenbaum, 1982, p. 326). This assumes

that the system has a task that it is attempting to perform and it may improve its

performance in two ways: applying new methods and knowledge, or improving existing

methods and knowledge.

Expert systems researchers take a more limited view of learning by saying it is "the

acquisition of explicit knowledge" (Cohen and Feigenbaum, 1982). Expert systems usually

represent knowledge as a collection of rules and this viewpoint means that acquired

knowledge should be explicit so that it can be verified, modified, and explained.

A third view is that learning is skill acquisition. Researchers in AI and cognitive

psychology have sought to understand the kinds of knowledge needed to perform a task.


A fourth view of learning comes from the collective fields of science and focuses on

theory formation, hypothesis formation, and inductive inference.

Simon's perspective of learning has been the most useful for machine learning

development and Cohen and Feigenbaum (1982) have modeled a learning system consisting

of the environment, a learning element, a knowledge base, and a performance element.


Figure 2.2: A Simple Model of Learning Systems.

The environment supplies some information to the learning element, the learning

element uses this information to make improvements in an explicit knowledge base, and the

performance element uses the knowledge base to perform its task. Information gained

during attempts to perform the task can serve as feedback to the learning element. This

simple model allows us to classify learning systems according to how they fit into these four

functional elements.

From these four perspectives and with the availability of a learning model, AI

researchers have developed four learning situations: rote learning, learning by being told,

learning from examples or induction, and learning by analogy.

Rote learning

Rote learning is memorization of the problem and the solution. New knowledge is

saved to be retrieved later. However, rote learning is useful only if it takes less time to

retrieve the desired item than it does to recompute it. Rote learning is not very useful in a

dynamic environment since a basic assumption is that information acquired today will be

valid in the future (Cohen and Feigenbaum, 1982).
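The retrieve-versus-recompute tradeoff is essentially memoization. The sketch below uses a hypothetical, stand-in "expensive" evaluation function; it is illustrative only, not Samuel's actual board scorer.

```python
# Rote learning as memoization: store each solved problem and reuse the answer.
cache = {}
calls = 0

def evaluate(position):
    """Stand-in for an expensive evaluation (e.g., a game-tree search)."""
    global calls
    calls += 1
    return sum(ord(c) for c in position) % 100

def evaluate_rote(position):
    """Look the answer up before recomputing -- the essence of rote learning."""
    if position not in cache:
        cache[position] = evaluate(position)
    return cache[position]

evaluate_rote("board-42")
evaluate_rote("board-42")  # second call is a cache hit
print(calls)  # the expensive evaluation ran only once
```

As the text notes, this pays off only while stored answers remain valid; in a dynamic environment the cache goes stale.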

An example of a rote learning system is Samuel's Checkers Player that evaluated

possible moves by conducting a minimax game-tree search and was able to improve its

performance by memorizing every board position it evaluated. Cohen and Feigenbaum

(1982) describe how the system could not search the 10^40 possible moves in checkers and

evaluated just a few moves into the future, choosing the move that would lead to the best

position. The look-ahead search portion of Samuel's program served as the environment. It

supplied the learning element with board positions and their backed-up minimax values.

The learning element simply stored these board positions and indexed them for rapid

retrieval. The program became capable of playing a very good opening game. Rote

learning did not improve the middle game since the number of possible moves was greater.

At the end game, the system would wander since each possible solution, winning the game,

had comparable values.

Learning by taking advice

Learning by taking advice focuses on converting expert advice into expert

performance. Research on advice-taking systems has followed two major paths: 1) systems

that accept abstract, high-level advice and convert it into rules to guide a performance

element, and 2) systems that develop sophisticated tools that make it easier for the expert to

transform their own expertise into rules. Five processes identified to convert expert advice

into program performance were as follows:

1. Request advice from the expert,

2. Interpret or assimilate the advice into an internal representation,

3. Operationalize or convert the advice into a usable form,

4. Integrate advice correctly into the knowledge base, and

5. Evaluate the resulting actions of the performance element.

The principal shortcoming of learning by taking advice was that the various methods were

quite specific to the task and generalization would require substantial effort.

This approach was used in building knowledge-based expert systems such as

MYCIN, which acted as a medical consultant system aiding in the diagnosis of patients with

bacteremia or meningitis infections (Barr and Feigenbaum, 1981). The system carried on an

interactive dialogue with a physician and was capable of explaining its reasoning. MYCIN

had a knowledge acquisition subsystem, TEIRESIAS, which helped expert physicians

expand or modify the rule base.

Learning by analogy

A third approach to learning is by analogy. If a system has an analogous knowledge

base, it may be able to improve its performance on a related task by recognizing the

analogies and transferring relevant knowledge to another knowledge base specific to the

task (Cohen and Feigenbaum, 1982). An example of this approach in actual use is the AM

computer program written by Douglas Lenat that discovers concepts in elementary

mathematics and set theory. In searching the rule space, AM may employ one of 40

heuristics described as reasoning by analogy. Cohen and Feigenbaum (1982) reported little

research in this area.

Learning by example

Learning by example, or inductive learning, requires a program to reason from

specific instances to general rules that can guide the actions of the performance element

(Cohen and Feigenbaum, 1982). The researcher presents the learning element with very low

level information, in the form of a specific situation, and the appropriate behavior for the

performance element in that situation. The program generalizes this information to obtain

general rules of behavior. An important early paper on induction described the two-space

view of learning from examples:

Simon and Lea (1974)...describe the problem of learning from examples as
the problem of using training instances, selected from some space of possible
instances, to guide a search for general rules. They call the space of possible
training instances the instance space and the space of possible general rules
the rule space. Furthermore, Simon and Lea point out that an intelligent
program might select its own training instances by actively searching the
instance space in order to resolve some ambiguity about the rules in the rule
space (Cohen and Feigenbaum, 1982, p. 360).

Simon and Lea viewed the learning system as moving back and forth between an instance

space and a rule space until it converged on the desired rule.

Many different approaches to learning-by-example, such as neural networks and

genetic algorithms, have been developed and used in financial applications. In the

remainder of this chapter we will review the use of AI in financial applications by

discussing the use of expert systems and describing neural networks, genetic algorithms, and

inductive learning systems.

2.2 AI and Financial Applications

2.2.1 Expert Systems

We have previously defined expert systems in 2.1.1. Artificial intelligence has been

applied to business and finance since the early 1980s, starting with expert systems

(Schreiber, 1984). Expert systems were used for production, management, sales, and

finance. For example, an investment banking firm was using an expert system in its

international brokerage operations to manage foreign institutional portfolios (Wilson and

Koehler, 1986). Hansen and Messier (1986) suggested the use of an expert system for

auditing advanced computer systems while Culbertson (1987) provided an overview

showing how expert systems could also be used in accounting. In (Sena and Smith,

1987) an expert system was developed to ask questions about oil company financial

statements and made a judgment about whether the statements were within industry norms.


By 1988, expert systems were being used in a number of companies. Texas

Instruments, Du Pont, IBM, American Express, Canon, Fujitsu, and Northrop were

showcased in The Rise of the Expert Company (Feigenbaum et al., 1988). Expert

systems could provide internal cost savings, improve product quality control, improve the

consistency of decision making, preserve knowledge, and restructure business to enlarge

customer choice. This book reported on 139 expert systems in use at a variety of

companies in the agriculture, communications, computers, construction, financial,

manufacturing, mining, medical, and transportation industries. The financial applications

included internal audit risk assessment systems, sales tax advising, risk analysis for

underwriting, portfolio management, credit authorization, income tax advising, financial

statement analysis, mortgage loan analysis, and foreign exchange options analysis.

The use of expert systems for commercial loan decisions is described in Duchessi

et al. (1988). The late 1980s saw a wave of bankruptcies in the Savings & Loan industry,

so expert systems were used to conduct financial analyses of potential failures

(Elmer and Borowski, 1988). Shaw and Gentry (1988) described an improvement in the

design of expert systems: the ability to enhance performance by reacting to a changing

environment. Their MARBLE system used an inductive learning capability to update the

80 decision rules that evaluated business loans.

Expert systems were also moving into the area of stock trading. Laurance (1988)

described a trading expert system using 30 rules and a buy-and-hold strategy that was

superior to the Standard & Poor's 500 Index, a benchmark for manager performance.

Arend (1988) reported about an expert trading system that constantly adapted to new

information based on current market conditions. In Holsapple et al. (1988) it was pointed

out that the business world was slow to accept expert systems. Notably, unsatisfactory

results were achieved in finance due to unrealistic expectations and managerial mistakes.

The authors went on to mention that the current technology was inadequate for

applications requiring insight, creativity, and intuition; however, it could be used for

financial decision support systems.

Recognizing that arbitrageurs could take advantage of the discrepancies between

the futures market and the stock market, an expert system for program trading was

proposed (Chen and Liang, 1989). This system had the ability to update its rule base with

a learning mechanism and human observations about the markets. Miller (1990) provided

an overview of financial expert systems technology, including a discussion of the

problem domain, heuristics, and architecture, and discussed the rule structure and

logic of an expert system.

Expert systems have also been applied to assessing audit risks and evaluating

client economic performance (Graham et al., 1991). Coopers & Lybrand developed a

system to assist in audit planning. The field of mortgage guaranty insurance underwriting

has also been able to harness expert systems (Gluch-Rucys and Walker, 1991). United

Guaranty Residential Insurance Co. implemented an expert system that could assist

underwriters with 75% of mortgage insurance applications. The determination of

corporate tax status and liabilities is another application of expert systems to finance (Jih

and Patterson, 1992). The STAX system used the Guru expert system shell to calculate

taxes and to determine the tax consequences of different corporate filing statuses.

2.2.2 Neural Networks and Financial Applications

Neural networks originated as a model of how the brain works:

McCulloch and Pitts formulated the first neural network model [McCulloch
43]. It featured digital neurons but no ability to learn. The work of another
psychologist, Donald Hebb, introduced the idea of Hebbian learning (as
detailed in Organization of Behavior [Hebb 49]), which states that changes
in synaptic strengths (connections between neurons) are proportional to the
activations of the neurons. This was a formal basis for the creation of neural
networks with the ability to learn (Blum, 1992, p. 4).

Real neurons, such as Figure 2.3, consist of a cell body, one axon (a protuberance that

delivers the neuron's output to connections with other neurons), and many dendrites which

receive inputs from axons of other neurons (Winston, 1992). A neuron does nothing unless

the collective influence of all its inputs reaches a threshold level. When that happens, the

neuron produces a full-strength output in the form of an electrical pulse that travels down the

axon to the next cell, which is separated from it by the synapse. Whenever this happens, the

neuron is said to fire. Stimulation at some synapses may cause a neuron to fire while

stimulation at others may discourage the neuron from firing. There is mounting evidence

that learning takes place near synapses (Winston, 1992).



Figure 2.3: Real Neuron.

In the neural network, multipliers, adders, and thresholds replace the neuron

(Winston, 1992). Neural networks do not model much of the character of real neurons. A

simulated neuron simply adds up a weighted sum of its inputs and fires whenever the

threshold level is reached.

The development of neural networks was seriously derailed in 1969 by the

publication of Perceptrons by Marvin Minsky and Seymour Papert which pointed out

limitations in the prevailing memory model at that time (Blum, 1992). It wasn't until 1986

with the development of backpropagation (explained later on), permitting the training of

multi-layer neural networks, that neural networks became a practical tool for solving

problems that would be quite difficult using conventional computer science techniques.

Most neural networks consist of an input layer, a hidden or processing layer, and an

output layer. The hidden layer may itself comprise more than one layer. Figure 2.4 is an

example of a multi-layer neural network. In this figure, x1, h1, and o1 represent unit

activation levels of input, hidden, and output units.

Figure 2.4: A Multilayer Neural Network.

In Figure 2.5, we show a simple neural network. Input signals are received from the

node's links, assigned weights (the ws), and added. The value of the node, Y, is the sum of

all the weighted input signals. This value is compared with the threshold activation level of

the node. When the value meets the threshold level, the node transmits a signal to its

neighboring nodes.

Figure 2.5: An Artificial Neuron.
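The behavior of the artificial neuron described above can be sketched in a few lines of Python. This is an illustrative sketch only; the function name, weights, and threshold value are not from the dissertation.

```python
# A minimal sketch of the threshold unit described above: the node sums its
# weighted input signals and fires (outputs 1) only when the sum reaches the
# node's threshold activation level. All values here are illustrative.

def neuron_output(inputs, weights, threshold):
    """Return 1 if the weighted input sum reaches the threshold, else 0."""
    y = sum(x * w for x, w in zip(inputs, weights))
    return 1 if y >= threshold else 0

# Two inputs with weights 0.5 and 0.8 and a threshold of 1.0:
print(neuron_output([1, 1], [0.5, 0.8], 1.0))  # 1 (sum 1.3 reaches threshold)
print(neuron_output([1, 0], [0.5, 0.8], 1.0))  # 0 (sum 0.5 does not)
```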

Each unit in one layer is connected in the forward direction to every unit in the next

layer. Activations flow from the input layer through the hidden layer, then on to the output

layer. The knowledge of the network is encoded in the weights on connections between

units. The existence of hidden units allows the network to develop complex feature

detectors, or internal representations (Rich and Knight, 1991).

Neural networks learn by supervised training and self-organizing training (Winston,

1992). In supervised training, which we explain here, the network is given a set of examples

(x, y), where y is the correct response for x.

In a multilayered neural network, the output nodes detect the errors. These errors

are propagated back to the nodes in the previous layer, and the process is repeated until the

input layer is reached. An effective algorithm that learns in this fashion (adjusting the

weights incrementally toward reducing the errors to within some threshold) is the

backpropagation algorithm (Rumelhart and McClelland, 1986). The discovery of this

algorithm was largely responsible for the renewal of interest in neural networks in the mid-

1980s, after a decade of dormancy (Trippi and Lee, 1996).




The backpropagation neural network typically starts out with a random set of

weights. The network adjusts its weights each time it sees an example (x,y). Each example

requires two stages: a forward pass and a backward pass. The forward pass involves

presenting a sample input, x, to the network and letting activations flow until they reach the

output layer. During the backward pass, the network's actual output from the forward pass

is compared with the correct response, y, and error estimates are computed for the output

units. The weights connected to the output units are adjusted in order to reduce those errors

(Rich and Knight, 1991).
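The forward and backward passes described above can be sketched for a tiny 2-2-1 network. This is a hand-rolled illustration, not the algorithm as given by Rumelhart and McClelland; the learning rate, network size, and variable names are assumptions for the sketch.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny 2-2-1 network with random starting weights, as in backpropagation.
random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # input->hidden
W2 = [random.uniform(-1, 1) for _ in range(2)]                      # hidden->output

def forward(x):
    """Forward pass: let activations flow from input to output."""
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
    o = sigmoid(sum(W2[j] * h[j] for j in range(2)))
    return h, o

def backward(x, y, lr=0.5):
    """Backward pass: estimate errors at the output and adjust weights."""
    h, o = forward(x)
    delta_o = (o - y) * o * (1 - o)                       # output-unit error term
    for j in range(2):
        delta_h = delta_o * W2[j] * h[j] * (1 - h[j])     # propagated error term
        W2[j] -= lr * delta_o * h[j]                      # adjust hidden->output
        for i in range(2):
            W1[j][i] -= lr * delta_h * x[i]               # adjust input->hidden

x, y = [1.0, 0.0], 1.0
_, before = forward(x)
for _ in range(100):          # repeated forward/backward passes on one example
    backward(x, y)
_, after = forward(x)
print(abs(y - after) < abs(y - before))  # True: the error shrinks with training
```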

The advantages and disadvantages of neural networks are as follows:

1. Neural networks excel at taking data presented to them and determining

what data are relevant. Irrelevant data simply have such low connection

strengths to all of the output neurons that they have no effect.

2. Because of the abundance of input factors, noise in the data is not as much

of a problem with neural networks.

3. Each synapse in a neural net model can be its own processor. There are no

time dependencies among synapses in the same layer. Thus, neural networks

exhibit inherent parallelism.

4. Training may require thousands of evolutions.

5. Back propagation, which uses gradient descent, can get stuck in local

minima or become unstable.

6. Excess weights may lead to overfitting of the data.

Some consider neural network training to be an art that requires trial-and-error (Winston, 1992).


The use of neural networks for financial applications occurred after the use of

expert systems in this area. Dutta and Shekhar (1988) proposed using a neural network to

predict bond ratings. They trained a neural network using ten features they felt were

representative of bond ratings and had thirty bond issues in the training set. They tested

the network against seventeen bonds, and the neural network outperformed regression analysis.


Miller (1990) devoted no more than six pages in his book to explaining the neural

network concept without identifying possible applications. Hawley et al. (1990) outlined

the advantages and disadvantages of neural networks vs. expert systems. They also

included potential applications such as: financial simulation, financial forecasting,

financial valuation, assessing bankruptcy risk, portfolio management, pricing Initial

Public Offerings, identifying arbitrage opportunities, performing technical analysis to

predict the short-term movements in stock prices, and performing fundamental analysis to

evaluate stocks.

Coats and Fant (1991) used neural networks to forecast financial distress in

businesses. The neural network correctly forecast 91% of the distressed firms as

distressed and 96% of the healthy firms as healthy. This is in contrast to multiple

discriminant analysis correctly identifying 72% of the distressed firms and 89% of the

healthy firms. Neural networks have also been used in credit scoring (Jensen, 1992), the

procedures used to grant or deny credit. Applicant characteristics were the input nodes

and three categories of payment history were the output nodes. The neural network was

trained with 125 credit applicants whose loan outcomes were known. Correct

classifications were made on 76% of the testing sample. Neural networks have also been

used in predicting savings and loan company failures (Salchenberger et al., 1992) and

bank failures (Tam and Kiang, 1992).

Trippi and DeSieno (1992) described trading Standard and Poor's 500 index

futures with a neural network. Their system consisted of several trained networks plus a

set of rules for combining network results to generate a composite recommendation for

the current day's position. The training period spanned 1,168 days from January 1986 to

June 1990. The test period covered 106 days from December 1990 through May 1991

and the system outperformed a passive investment strategy in the index. In

(Kryzanowski et al., 1993) a neural network was provided historical and current

accounting data, and macroeconomic data to discriminate between stocks having superior

future returns and inferior future returns. On 149 test cases the system correctly

classified 66.4% of the stocks.

Pirimuthu et al. (1993) studied ways of improving the performance of the

backpropagation algorithm. They noted that back propagation uses the steepest gradient

search for hill climbing. In essence, it is a linear method and they developed a quadratic

method to improve convergence of the algorithm. They compared the results of

predicting bankruptcy by several types of neural networks to the performance of NEWQ

(Hansen et al., 1993), ID3, and Probit. The training set consisted of 56 randomly selected

examples and the testing set was the remaining 46 examples. Overall, the

backpropagation neural network algorithms performed better than the ID3, NEWQ, and

Probit, although the run times of the neural networks were much longer than those of the other methods.


Yoon et al. (1994) brought together neural networks and the rule-based expert

system. The motivation for this study was to highlight the advantages and overcome the

disadvantages of the two approaches used separately. The primary advantage of rule-

based expert systems was the readability of the process since it uses explicit rules. A

disadvantage to developing such an expert system is the difficulty of developing those

rules. The authors used an artificial neural network as the knowledge base of the expert

system. The connection weights of the neural network specify the decision rules in an

implicit manner. The explanation module is a rule-based system in which knowledge

implicitly encoded in the neural network has been translated into an "IF-THEN" format.

The training and testing sets consisted of 76 companies each. The neural network system

achieved a correct classification rate of 76%. In comparison, a Multivariate Discriminant

Analysis model classified only 63% of the data correctly.

Hutchinson et al. (1994) proposed using a neural network for pricing and hedging

derivative securities. They took as inputs the primary economic variables that influenced

the derivative's price and defined the derivative price to be the output into which the

neural network maps the inputs. When properly trained, the network "becomes" the

derivative pricing formula. The neural network would provide a nonparametric pricing

method. It was adaptive and responded to structural changes in the data-generating

processes in ways that parametric models could not, and it was flexible enough to

encompass a wide range of derivative securities. The disadvantage of this approach was

that large quantities of data were required, meaning that this would be inappropriate for

thinly traded derivatives or new instruments. Overall, the system achieved error levels

similar to those of the Black-Scholes formula (Black and Scholes, 1973) used for pricing

the derivatives.

Trading on the Edge (Deboeck, 1994) reviewed the use of neural networks for

securities trading. It provided an overview of neural network techniques, explained the

need for pre-processing financial data, discussed using neural networks for predicting the

direction of the Tokyo stock exchange, and described a neural network for trading U.S.

Treasury notes. The Tokyo stock exchange neural network had a 62.1% correct

prediction rate after being put in service in September 1989. A major benefit was that it

reduced the number of trades needed to implement a hedging position, which saved on

commissions. The Treasury notes neural network was evaluated based on the number of

recommended trades, the average profit and loss per trade, and the maximum gains,

losses, and drawdowns. In each case the system provided a higher average profit than

that achieved during the same period. It was noted, however, that the system performed

better when trained on a specific two-year period than when trained on data from a longer

period. We will mention this book again in our discussion of genetic algorithms.

Jain and Nag (1995) developed a neural network for pricing initial public

offerings (IPO). They noted that a vast body of empirical evidence suggested that such

offerings were underpriced by as much as 15% and this represented a huge loss to the

issuer. In developing the model, 276 new issues were used for training the network and

276 new issues were used for testing the network. They used 11 input features

representing a broad spectrum of financial indicators: the reputation of the investment

banker, the log of the gross proceeds, the extent of ownership retained by the original

entrepreneurs, the inverse of sales in millions of dollars in the year prior to the IPO,

capital expenditures over assets, capital expenditures over sales, operating return over

assets, operating return over sales, operating cash flow over assets, operating cash flow

over sales, and asset turnover. The results showed that the neural network generated

market price distributions that outperformed the pricing of investment bankers.

Neural networks have also been used to predict the targets of investigation for

fraudulent financial reporting by the Securities and Exchange Commission (Kwon and

Feroz, 1996). The network outperformed Logit and the study showed that non-financial

information could provide more predictive information than the financial information alone.


Another study compared the performance of neural networks to LDA and Logit

scoring models for the credit union environment (Desai et al., 1996). The study

determined that neural networks outperformed the other methods in correctly classifying

the percentage of bad loans. If the performance measure was correctly classifying good

and bad loans, then logistic regression was comparable to the neural network.

Hobbs and Bourbakis (1996) studied the success of a neural network computer

model to predict the price of a stock, given the fluctuations in the rest of the market that

day. Based on the neural network's prediction, the program then measured its success by

simulating buying or selling that stock, based on whether the market's price was

determined to be overvalued or undervalued. The program consistently averaged over a

20% annual return and was time-tested over six years with several stocks.

Two books that focused on the use of neural networks for investing were from

Trippi and Turban (1996b), a collection of journal articles written by others from 1988 to

1995, and Trippi and Lee (1996), a revision of an earlier book they published in 1992. This

book reviews modern portfolio theory, provides an overview of AI in investment

management, discusses machine learning and neural networks, and describes integrating

knowledge with databases.

A final study concerned using a neural network to forecast mutual fund end-of-year

net asset value (Chiang et al., 1996). Fifteen economic variables for 101 U.S. mutual funds

were identified as input to three models: a neural network, a linear regression model, and a

nonlinear regression model. The models were developed using a dataset covering the

six-year period from 1981 to 1986 and were evaluated using the actual 1986 Net Asset Values.

The predictions by the neural network had the lowest error rate of the three models.

2.2.3 Genetic Algorithms and Financial Applications

Genetic Algorithms (GAs) are search algorithms based on the mechanics of natural

selection and natural genetics (Holland, 1975). They have been shown to be effective at

exploring large and complex spaces in an adaptive way, guided by the equivalent biological

mechanisms of reproduction, crossover, and mutation. GAs have been used for machine

learning applications, including classification and prediction tasks, to evolve weights for

neural networks, and rules for learning classifier systems (Mitchell, 1997).

Genetic Algorithms combine survival-of-the-fittest among string structures with a

structured, yet randomized, information exchange to form a search algorithm with some of

the innovative flair of human search. The strings are referred to as chromosomes and they

are composed of genes (a feature on the chromosome) which have values referred to as

alleles (Goldberg, 1989).

In every generation, three operators create a new set of chromosomes: selection,

crossover, and mutation. The selection operator selects chromosomes in the population for

reproduction based on a fitness function that assigns a score (fitness) to each chromosome in

the current population. The fitness of a chromosome depends on how well that chromosome

solves the problem at hand (Mitchell, 1997). The fitter the chromosome, the greater the

probability for it to be selected to reproduce. The crossover operator randomly chooses a

locus and exchanges the chromosomal subsequences before and after that locus to create

two offspring. The mutation operator randomly flips some of the bits in a chromosome.

Mutation can occur at each bit position with some very small probability. While

randomized, Genetic Algorithms are no simple random walk. They efficiently exploit

historical information to speculate on new search points with expected improved

performance (Goldberg, 1989).

By way of explanation, we provide a simple Genetic Algorithm with a fitness

function that we want to maximize, for example, the real-valued one dimensional function:

f(y) = y + |sin(32y)|,  0 ≤ y < π
(Riolo, 1992). The candidate solutions are values of y, which are encoded as bit strings

representing real numbers. The fitness calculation translates a given bit string x into a real

number y and then evaluates the function at that value (Mitchell, 1997). The fitness of a

string is the function value at that point.
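The decoding and fitness calculation just described can be sketched directly. The 16-bit string length and the [0, π) interval are illustrative assumptions for this sketch, not values taken from Riolo (1992).

```python
import math

# Decode a bit string into a real number y and evaluate the example
# fitness function f(y) = y + |sin(32y)| at that point.

def decode(bits, lo=0.0, hi=math.pi):
    """Map a bit string to a real number y in [lo, hi)."""
    return lo + int(bits, 2) / 2 ** len(bits) * (hi - lo)

def fitness(bits):
    y = decode(bits)
    return y + abs(math.sin(32 * y))

print(decode("0" * 16))                 # 0.0: the all-zero string decodes to lo
print(decode("1" * 16) < math.pi)       # True: decoded values stay below hi
```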

reproduction step individual strings are copied according to their objective

function values, f(y). Copying strings according to their

fitness value means that strings with a higher value have a

higher probability of contributing one or more offspring in

the next generation. This operator is an artificial version of

natural selection (Goldberg, 1989).

crossover step After reproduction, crossover may proceed in two steps.

First, members of the newly reproduced strings in the mating

pool are mated at random. Second, each pair of strings could

undergo crossing over as shown below, however, not all

strings mate:

Consider strings A1 and A2:

A1 = 1011 | 0101
A2 = 1110 | 0000

The separator | indicates the uniformly randomly

selected crossover site. The resulting crossover

yields two new strings where the prime (') means the

strings are part of the new generation:

A'1 = 1011 0000
A'2 = 1110 0101

The mechanics of reproduction and crossover are surprisingly

simple, involving random number generation, string copies,

and some partial string exchanges.

mutation step Mutation has been referred to as bit flipping. This operator

randomly changes Os to Is, and vice versa. When used

sparingly, as recommended, with reproduction and crossover,

it is an insurance policy against premature loss of important

string values (Goldberg, 1989).
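The crossover and mutation mechanics above can be sketched in a few lines. Reading the parents as A1 = 10110101 and A2 = 11100000 (consistent with the offspring shown), crossing at the site after the fourth bit reproduces A'1 and A'2; the function names and mutation rate are assumptions for this sketch.

```python
import random

def crossover(a, b, site):
    """Exchange the substrings of two parents after the crossover site."""
    return a[:site] + b[site:], b[:site] + a[site:]

def mutate(bits, rate=0.001, rng=random):
    """Flip each bit with a small, independent probability (bit flipping)."""
    return "".join(("1" if c == "0" else "0") if rng.random() < rate else c
                   for c in bits)

a1, a2 = "10110101", "11100000"
print(crossover(a1, a2, 4))  # ('10110000', '11100101') -- i.e., A'1 and A'2
```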

With the production of a new generation, the system evaluates the fitness function

for the maximum fitness of the artificial gene pool. A Genetic Algorithm is typically

iterated for anywhere from 50 to 500 or more generations (Mitchell, 1997). One stopping

criterion for GAs is convergence of the chromosome population, defined as when 95% of the

chromosomes in the population all contain the same value or, more loosely, when the GA

has stopped finding new, better solutions (Heitkoetter and Beasley, 1997). Other stopping

criteria concern the utilization of resources (computer time, etc.). The entire set of

generations is called a run and, at the end of a run, there are often one or more highly fit

chromosomes in the population.
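Putting the operators and the run together, a complete generational run for the f(y) = y + |sin(32y)| example might look as follows. The population size, mutation rate, fixed generation budget, and the elitism step (carrying the best chromosome into each new generation) are all illustrative choices for this sketch, not parameters from the cited sources.

```python
import math
import random

# A minimal GA run: fitness-proportionate selection, single-point crossover,
# rare mutation, elitism, and a fixed number of generations as the stopping
# criterion. All parameter values below are illustrative.
random.seed(1)
BITS, POP, GENS = 16, 40, 60

def fitness(bits):
    y = int(bits, 2) / 2 ** BITS * math.pi   # decode bit string to y in [0, pi)
    return y + abs(math.sin(32 * y))

def select(pop):
    # Fitter chromosomes have a higher probability of reproducing.
    return random.choices(pop, weights=[fitness(c) + 1e-9 for c in pop])[0]

def crossover(a, b):
    s = random.randrange(1, BITS)            # random crossover site
    return a[:s] + b[s:], b[:s] + a[s:]

def mutate(bits, rate=0.01):
    return "".join(("1" if c == "0" else "0") if random.random() < rate else c
                   for c in bits)

pop = ["".join(random.choice("01") for _ in range(BITS)) for _ in range(POP)]
f0 = fitness(max(pop, key=fitness))          # best fitness in generation 0
for _ in range(GENS):
    nxt = [max(pop, key=fitness)]            # elitism: keep the best unchanged
    while len(nxt) < POP:
        c1, c2 = crossover(select(pop), select(pop))
        nxt += [mutate(c1), mutate(c2)]
    pop = nxt[:POP]

best = max(pop, key=fitness)
print(fitness(best) >= f0)  # True: elitism never loses the best found so far
```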

Goldberg (1989) describes the development of a classifier system using genetic

algorithms. The backbone of a classifier system is its rule and message system, a type of

production or rule-based system. The rules are of the form if &lt;condition&gt; then &lt;action&gt;;

however, in classifier systems, conditions and actions are restricted to be fixed-length

strings. Classifier systems have parallel rule activation versus expert systems that use serial

rule activation.

Mitchell (1997) noted that Genetic Algorithms are used for evolving rule-based

systems, e.g., classifier systems, in which incremental learning (and remembering what has

already been learned) is important and in which members of the population collectively

solve the problem at hand. This is often accomplished using the Steady-State population

selection operator, in which only a few of the least-fit chromosomes are replaced

by offspring resulting from crossover and mutation of the fittest individuals.

GAs have been proposed to work in conjunction with other machine learning

systems, such as neural networks (Kuncheva, 1993). In this sketch of an AI application, the

neural network is set up to provide a trading recommendation, for example, stay long, stay

short, or stay out of the market. The Genetic Algorithm is used to estimate the weights for a

neural network that optimizes the user-defined performance objectives and meets user-

defined constraints or risk limits. For example, they used a fitness function of the average

annual return achieved over three years.

A GA was applied to a portfolio merging problem of maximizing the return/risk

ratio with the added constraint of satisficing expected return (Edelson and Gargano, 1995).

The original problem was recast as a goal programming problem so that GAs could be used.

The results obtained with the GAs were comparable to those calculated by quadratic

programming techniques. The use of the goal programming conversion reduced the number

of generations to obtain convergence from 6,527 down to 780.

Mahfoud and Mani (1995) developed a procedure for extending GAs from

optimization problems to classification and prediction so that they could predict individual

stock performance. They describe the use of a niching method that permits the GA to

converge around multiple solutions or niches, instead of the traditional single point in the

solution space. The analogy in the financial forecasting case is that different rules within the

same GA population can perform forecasting for different sets of market and individual

company conditions, contexts, or situations. The niching method was used to predict the

direction of a randomly selected MidCap stock from the Standard & Poor's 400. The GA

correctly predicted the stock's direction relative to the market 47.6% of the time, produced

no prediction 45.8% of the time, and incorrectly predicted the direction relative to the

market 6.6% of the time. The no prediction state is equivalent to the stock being equally

likely to go in either direction.

In Trippi and Lee (1996), they describe a Genetic Algorithm used for a stock market

trading rule generation system. Buy and sell rules were represented by 20-element bit

strings to examine a solution space of 554,496 possible combinations. The GA was run

using a crossover rate of 0.6 and a mutation rate of 0.002. In 10 experiments of 10 trials

each using different starting strings, the average monthly returns of the best rule parameters

ranged from 6.04 to 7.52 percent, ignoring transaction costs. Although these results were

converged upon quickly by the GA, they did not differ much from optimal rules that were

obtained by a time-consuming exhaustive search.

Genetic Algorithms were used to optimize the topology of a neural network

that predicted a stock's systematic risk, using the financial statements of 67 German

corporations from the period 1967 to 1986 (Wittkemper and Steiner, 1996).

Additionally, in two studies related to mutual funds but not rating systems, GAs were

used to simulate adaptive learning in a simple static financial market designed to exhibit

behavior very similar to that of mutual fund investors (Lettau, 1994; Lettau, 1997).

2.3 The C4.5 Learning System

Research on learning is composed of diverse subfields. At one extreme, adaptive

systems monitor their own performance and attempt to improve it by adjusting internal

parameters. A quite different approach sees learning as the acquisition of structured

knowledge in the form of concepts, or classification rules (Quinlan, 1986). A primary task

studied in machine learning has been developing classification rules from examples (also

called supervised learning). In this task, a learning algorithm receives a set of training

examples, each labeled as belonging to a particular class. The goal of the algorithm is to

produce a classification rule for correctly assigning new examples to these classes. For

instance, examples could be a vector of descriptive values or features of mutual funds. The

classes could be the Morningstar Mutual Fund ratings and the task of the learning system is

to produce a rule (or a set of rules) for predicting with high accuracy the rating for new

mutual funds.

We will focus on the data-driven approach of decision trees, specifically the C4.5

system (Quinlan, 1993). We present a brief history of C4.5, the algorithms used, the

limitations of the system, and examples of its use.

2.3.1 Brief History of C4.5

C4.5 traces its roots back to CLS (Concept Learning System), a learning algorithm

devised by Earl Hunt (Hunt et al., 1966). It solved single-concept learning tasks and used

the learned concepts to classify new examples. CLS constructed a decision tree that

attempted to minimize the cost of classifying an object (Quinlan, 1986). This cost had two

components: the measurement cost of determining the value of property A exhibited by the

object, and the misclassification cost of deciding that the object belongs to class J when its

real class was K.

The immediate predecessor of C4.5 was ID3 and it used a feature vector

representation to describe training examples. A distinguishing aspect of the feature vector is

that it may take on continuous real values as well as discrete symbolic or numeric values

(Cohen and Feigenbaum, 1982). Concepts are represented as decision trees. We classify an

example by starting at the root of the tree and making tests and following branches until a

node is reached that indicates the class. For example, Figure 2.4 shows a decision tree

with symbolic values of Good and Bad expert opinions on a stock; the Expert Opinion node

is the root of the decision tree.

Good and Price/Book (P/B) if the Expert Opinion is Bad. If the Expert Opinion is Good and

the P/E is > 3, then we classify the stock as Expected Return = High. If the P/E of the stock

is < 2, then we classify the stock as Expected Return = Medium. If the Expert Opinion is

Bad and the P/B is < 3, then we classify the stock as Expected Return = Medium; for P/B >

4, then Expected Return = Low.

Figure 2.4: Decision Tree Example.
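The classification walk just described can be transcribed directly into nested tests. Below is a minimal Python sketch of the tree in Figure 2.4 (the function name is ours, and the figure leaves some value ranges, such as a P/E between 2 and 3, unspecified):

```python
def expected_return(expert_opinion, pe, pb):
    """Classify a stock by walking the decision tree of Figure 2.4."""
    if expert_opinion == "Good":      # root test
        if pe > 3:
            return "High"
        if pe < 2:
            return "Medium"
    else:                             # Expert Opinion is Bad
        if pb < 3:
            return "Medium"
        if pb > 4:
            return "Low"
    return None  # value ranges the figure leaves undefined

print(expected_return("Good", pe=4, pb=0))  # High
```

Each path from the root to a leaf corresponds to one chain of tests; an example is classified by following exactly one such path.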

Decision trees are inherently disjunctive, since each branch leaving a decision node

corresponds to a separate disjunctive case. The left-hand side of the decision tree in Figure

2.4 for high expected return is equivalent to the predicate calculus expression:

[Expert Opinion (x, Good) ∨ Expert Opinion (x, Bad)] ∧
[P/E (x, > 3) ∨ P/E (x, < 2)]

Consequently, decision trees can be used to represent disjunctive concepts (Cohen and

Feigenbaum, 1982).

ID3 was designed for the learning situation in which there are many features and the

training set contains many examples, but where a reasonably good decision tree is required

without much computation. It has generally been found to construct simple decision trees,

but the approach it uses cannot guarantee that better trees have not been overlooked

(Quinlan, 1986). This will be discussed in more detail in Section 4.2.

The crux of the problem for ID3 was how to form a decision tree for an arbitrary

collection C of examples. If C was empty or contained only examples of one class, the

simplest decision tree was just a leaf labeled with the class. Otherwise, let T be any test on

an example with possible outcomes {O1, O2, ..., Ow}. Each example in C would give one of

these outcomes for T, so T produced a partition {C1, C2, ..., Cw} of C with Ci containing

those examples having outcome Oi. If each subset Ci was replaced by a decision tree for Ci,

the result would be a decision tree for all of C. Moreover, so long as two or more Ci's are

non-empty, each Ci is smaller than C. In the worst case, this divide-and-conquer strategy

would yield single-example subsets that satisfied the one-class requirement for a leaf. Thus,

if a test could always be found that gave a non-trivial partition of any set of examples, this

procedure could always produce a decision tree that correctly classifies each example in C

(Quinlan, 1986).

The choice of test was crucial for ID3 if the decision tree was to be simple and ID3

used an information-based method that depended on two assumptions. Let C contain p

examples of class P and n of class N. The assumptions were:

(1) Any correct decision tree for C will classify examples in the same proportion

as their representation in C. An arbitrary example will be determined to

belong to class P with probability p/(p+n) and to class N with probability n/(p+n).

(2) When a decision tree is used to classify an example, it returns a class. A

decision tree can thus be regarded as a source of a message 'P' or 'N', with

the expected information needed to generate this message given by

I(p, n) = - p/(p+n) × log2(p/(p+n)) - n/(p+n) × log2(n/(p+n))

If feature A with values {A1, A2, ..., Av} is used for the root of the decision tree, it will

partition C into {C1, C2, ..., Cv} where Ci contains those examples in C that have value Ai of

A. Let Ci contain pi examples of class P and ni of class N. The expected information

required for the subtree for Ci is I(pi, ni). The expected information required for the tree

with A as root is then obtained as the weighted average

E(A) = Σi (pi + ni)/(p + n) × I(pi, ni)

where the weight for the ith branch is the proportion of the examples in C that belong to Ci.

The information gained by branching on A is, therefore

gain(A) = I(p, n) - E(A)

One approach would be to choose a feature to branch on which gains the most information.

ID3 examines all candidate features and chooses A to maximize gain(A), forms the tree as

above, and then uses the same process recursively to form decision trees for the residual

subsets C1, C2, ..., Cv (Quinlan, 1986).
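These definitions are easy to state in code. The following is a minimal Python sketch of I(p, n), E(A), and gain(A), applied to an invented feature that splits 9 class-P and 5 class-N examples into three subsets:

```python
from math import log2

def I(p, n):
    """Expected information I(p, n) for p class-P and n class-N examples."""
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

def gain(partition, p, n):
    """gain(A) = I(p, n) - E(A); partition lists (pi, ni) for each value of A."""
    E = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in partition)
    return I(p, n) - E

print(gain([(2, 3), (4, 0), (3, 2)], p=9, n=5))  # ≈ 0.247 bits
```

ID3 would compute this quantity for every candidate feature and branch on the one with the largest gain.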

The worth of ID3's feature-selecting greedy heuristic can be assessed by how well

the trees express real relationships between class and features as demonstrated by the

accuracy with which they classify examples other than those in the training set. A

straightforward method of assessing this predictive accuracy is to use only part of the given

set of examples as a training set and to check the resulting decision tree on the remainder or

testing set.

Quinlan (1986) carried out several experiments to test ID3. In one domain of 1.4

million chess positions, using 49 binary-valued features in the feature vector, the decision

tree correctly classified 84% of the holdout sample. Using simpler domains of the chess

problem, correct classification was 98% of the holdout sample.

2.3.2 C4.5 Algorithms

C4.5 is an improved version of ID3 that provides the researcher with the ability to

prune the decision tree to improve classification of noisy data. It also provides a subsystem

to transform decision trees into classification rules. This system of computer programs

constructs classification models similar to ID3 by discovering and analyzing patterns found

in the examples provided to it. Not all classification tasks lend themselves to this inductive

approach and Quinlan (1993) reviews the essential requirements:

Feature-value description: All information about an example must be

expressible in terms of a fixed collection of properties or features. Each

feature may be either discrete or continuous, but the features used to describe

an example may not vary from one example to another.

Predefined classes: The categories to which the examples are to be assigned

must have been established beforehand. This is the supervised learning approach.

Discrete classes: The classes are sharply delineated. An example belongs to

only one class.

Sufficient data: Inductive generalization proceeds by identifying patterns in

data. The approach fails if valid, robust patterns cannot be distinguished

from chance coincidences. As this differentiation usually depends on

statistical tests of one kind or another, there must be sufficient examples to

allow these tests to be effective.

"Logical" classification models: The programs construct only classifiers that

can be expressed as decision trees or sets of rules. These forms essentially

restrict the description of a class to a logical expression whose primitives are

statements about the values of particular features.

Figure 2.5 presents the schematic diagram of the C4.5 system algorithm (Quinlan et

al., 1987). We will discuss several algorithms concerning the evaluation tests carried out by

C4.5, the handling of unknown feature values, and pruning decision trees to improve

classification accuracy on the testing set.

Most decision tree construction methods are nonbacktracking, greedy algorithms.

A greedy algorithm chooses the best path at the time of the test although this may later be

shown suboptimal. Therefore, as noted before, a greedy algorithm is not guaranteed to

provide an optimal solution (Cormen et al., 1990).


repeat several times:
    initialize working set
    repeat:
        if stopping criterion is satisfied, choose best class
        choose best feature
        divide working set
        invoke FORM TREE on subsets
        test on remainder of training set
        add some misclassified items to working set
    until no improvement possible
while decision tree contains subtrees that are both complex and of marginal benefit:
    replace subtree by leaf
select most promising pruned tree

Figure 2.5: Schematic Diagram of C4.5.

Gain criterion and gain ratio criterion

C4.5 provides two means of evaluating the heuristic test used by the divide-and-

conquer algorithm:

the gain criterion which was used in ID3

the gain ratio criterion

The information theory underpinning the gain criterion has been summarized by

Quinlan (1993, p. 21) as, "The information conveyed by a message depends on its

probability and can be measured in bits as minus the logarithm to base 2 of that probability."

The probability that a randomly drawn example from a set S of examples belongs to some

class Cj is

freq(Cj, S) / |S|

and the information it conveys is

-log2(freq(Cj, S) / |S|) bits.

We define the expected information from such a message pertaining to class membership by

summing over the classes in proportion to their frequencies in S (Quinlan, 1993),

info(S) = - Σ j=1..k freq(Cj, S)/|S| × log2(freq(Cj, S)/|S|) bits.

When applied to the set of training cases, T, info(T) measures the average amount of

information needed to identify the class of a case in T.

Now consider a similar measurement after partitioning T in accordance with the n

outcomes of a test X. The expected information requirement can be found as the weighted

sum over the subsets, as

infoX(T) = Σi |Ti|/|T| × info(Ti).

The quantity

gain(X) = info(T) info (T)

measures the information that is gained by partitioning T in accordance with the test X

(Quinlan, 1993). The gain criterion, then, selects a test to maximize this information gain.

The gain ratio criterion was developed to eliminate the gain criterion of bias in favor

of tests with many outcomes (Quinlan, 1993). The bias can be rectified by the following

sets of equations which, by analogy with the definition of info(S), we have

split info(X) = - Σi |Ti|/|T| × log2(|Ti|/|T|).

This represents the potential information generated by dividing the training set, T, into n

subsets, whereas the information gain measures the information relevant to classification

that arises from the same division. Then,

gain ratio(X) = gain(X) / split info(X)

expresses the proportion of information generated by the split that is useful, i.e., that appears

helpful for classification (Quinlan, 1993). If the split is near-trivial, split information will be

small and this ratio will be unstable. To avoid this, the gain ratio criterion selects a test to

maximize the ratio above, subject to the constraint that the information gain must be large--

at least as great as the average gain over all tests examined.
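These two definitions can be sketched in a few lines of Python (the gain value and subset sizes below are invented for illustration):

```python
from math import log2

def split_info(sizes):
    """Potential information of splitting T into subsets of the given sizes."""
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s)

def gain_ratio(gain, sizes):
    """gain ratio(X) = gain(X) / split info(X)."""
    return gain / split_info(sizes)

# A 3-way split of 14 cases that yielded 0.247 bits of information gain:
print(gain_ratio(0.247, [5, 4, 5]))  # ≈ 0.157
```

Note how a many-way split inflates split info(X), shrinking the ratio; this is the mechanism that corrects the gain criterion's bias toward tests with many outcomes.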

Mingers (1989) performed an empirical comparison of selection measures used for

decision tree induction, reviewing Quinlan's information measure (1979), the χ2

contingency table statistic (using probabilities rather than the χ2 value), the GINI index of

diversity developed by Breiman et al. (1984), the Gain-ratio measure as discussed above,

and the Marshall correction factor, which can be applied to any of the previous measures

and favors features which split the examples evenly and avoids those which produce small

splits. Mingers evaluated these measures on four datasets and concluded that the predictive

accuracy of induced decision trees is not sensitive to the goodness of split measure.

However, the choice of measure does significantly influence the size of the unpruned trees.

Quinlan's Gain-ratio generated the smallest trees, whereas χ2 produced the largest. An

additional study (Buntine and Niblett, 1992) confirmed Mingers' results while taking issue

with his use of random selection as a comparison to the various methods he studied.

Fayyad and Irani (1992) reviewed the ability of ID3 to classify datasets with

continuous-valued features. Such a feature is handled by sorting all the values for that

feature and then partitioning it into two intervals using the gain or gain ratio criterion. They

determined that the algorithm used by ID3 for finding a binary partition for a continuous-

valued feature will always partition the data on a boundary point.

Handling unknown feature values

The above algorithms assume that the outcome of a test for any example can be

determined. In many cases of classification research, unknown features appear due to

missed determinations, etc. In the absence of some procedure to evaluate unknown features,

entire examples would have to be discarded, much the same as for missing data in LDA and

Logit.
C4.5 improves upon the definition of gain to accommodate unknown feature values

(Quinlan, 1993). It calculates the apparent gain from looking at examples with known

values of the relevant feature, multiplied by the fraction of such cases in the training set.

Expressed mathematically this is

gain(X) = probability A is known × (info(T) - infoX(T))

Similarly, the definition of split info(X) can be altered by regarding the examples with

unknown values as an additional group. If a test has n outcomes, its split information is

computed as if the test divided the cases into n + 1 subsets (Quinlan, 1993).

Pruning decision trees

The recursive partitioning method of constructing decision trees continues to

subdivide the set of training cases until each subset in the partition contains cases of a single

class, or until no test offers any improvement (Quinlan, 1986). The result is often a very

complex tree that "overfits the data" by inferring more structure than is justified in the

training cases. Two approaches to improving the results of classification are prepruning or

construction-time pruning, and postpruning. C4.5 uses postpruning.

In prepruning, the typical approach is to look at the best way of splitting a subset and

to assess the split from the point of view of statistical significance, information gain, or error

reduction. If this assessment falls below some threshold, the division is rejected and the tree

for the subset is just the appropriate leaf. Prepruning methods have a weakness in that the

criterion to stop expanding a tree is being made on local information alone. It is possible

that descendent nodes of a node may have better discriminating power (Kim and Koehler,


C4.5 allows the tree to grow through the divide-and-conquer algorithm and then it is

pruned. C4.5 performs pessimistic error rate pruning developed by Quinlan (1987) and uses

only the training set from which the tree is built. An estimate is made of the error caused by

replacing a subtree with a leaf node. If the error is greater with the leaf node, the subtree

remains and vice-versa. Michie (1989) noted that tests on a number of practical problems

gave excellent results with this form of pruning. Mingers (1989) also reported that pruning

improved the classification accuracy of decision trees.
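The comparison at the heart of this pruning step can be sketched roughly as follows. This is a simplified illustration only: it applies a continuity correction of half an error per leaf but omits the standard-error term of Quinlan's (1987) full procedure, and the function names are ours:

```python
def pessimistic_errors(misclassified, leaves):
    """Observed training errors plus a continuity correction of 0.5 per leaf."""
    return misclassified + 0.5 * leaves

def prune_to_leaf(subtree_errors, subtree_leaves, leaf_errors):
    """Replace a subtree by a single leaf when the leaf's corrected
    error count is no greater than the subtree's."""
    return pessimistic_errors(leaf_errors, 1) <= pessimistic_errors(subtree_errors, subtree_leaves)

# A 4-leaf subtree making 2 training errors vs. a single leaf making 3:
print(prune_to_leaf(subtree_errors=2, subtree_leaves=4, leaf_errors=3))  # True: 3.5 <= 4.0
```

The correction penalizes complex subtrees, so a slightly less accurate but much simpler leaf can win, which is what lets the method prune using only the training set.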

2.3.3 Limitations of C4.5

Like any classifier, a decision tree specifies how a description space is to be carved

up into regions associated with the classes. When the task is such that class regions are not

hyperrectangles, the best that a decision tree can do is approximate the regions by

hyperrectangles. This is illustrated in Figure 2.6 below in which the classification region is

defined better by the triangular region on the left versus the rectangular regions that would

be used by C4.5 (Quinlan, 1993).


Figure 2.6: Real and Approximate Divisions for an Artificial Task.

Michie (1987, 1989) identified other limitations of ID3 descendants. The former

reference discussed the difficulty of pruning decision trees when the data are inconclusive, a

point also noted by Quinlan (1993). Data are inconclusive when the features used in describing a

set of examples are not sufficient to specify exactly one outcome or class for each example.

In the latter reference, the use of a feature vector for large domains, such as medical

diagnosis, is discussed.

2.3.4 C4.5 Financial Applications

Braun and Chandler (1987) performed the earliest reported business application

research of rule-induction classification with a variant of ID3, known as ACLS. Using a

database of 80 examples from an investment expert's predictions, they used ACLS to

formulate rules to predict not only the expert's prediction of the market but to predict the

actual market movement. Using 108 examples, ACLS correctly predicted actual market

movement 64.4% of the time. The expert correctly predicted market movement 60.2% of

the time.

Rules for loan default and bankruptcy were developed using a commercial variant of

ID3 (Messier, Jr. and Hansen, 1988). The investigators were surprised with the small

decision tree that was developed and how it correctly classified the testing set with 87.5%

accuracy. The ID3 results of the bankruptcy data were favorably compared to LDA.

Miller (1990) discussed the use of classification trees, similar to ID3, for credit

evaluation systems. Chung and Silver (1992) compared Logit, ID3, and Genetic Algorithms

to the outcomes of experts for graduate admissions and bidder selection. The three methods

performed comparably on the graduate admissions problem but significantly different on the

bidder selection problem where the GA had the superior performance. One conclusion of

the study is that the nature of the problem-solving task matters. A review of the data

showed that the bidder selection problem had a feature that made it difficult for ID3 to build

the decision tree to a high degree of accuracy.

An application of ID3 that is pertinent to our present study is for stock screening and

portfolio construction (Tam, 1991). To demonstrate the effectiveness of the inductive

approach, trading rules were inferred from eight features. Three portfolios were constructed

from three rules each year, and their performance was compared to market standards. In

every case, the portfolios outperformed the Standard & Poor's 500 Index.

In Kattan et al. (1993), human judgment was compared to the machine learning

techniques of ID3, regression trees (Breiman et al., 1984), a back-propagation neural

network, and LDA. Human subjects were put into teams and were allowed to induce rules

from historical data under ideal conditions, such as adequate time and opportunity to sort the

data as desired. The task at hand was to emulate the decisions made by a bank officer when

processing checking account overdrafts. A sample of 340 usable observations was

gathered for the experiment. The results on multiple holdout samples indicated that human

judgment, regression trees, and ID3 were equally accurate and outperformed the neural

network.
Fogler (1995) discussed the strengths and weaknesses of using classification tree

programs, such as ID3, to explain nonlinear patterns in stocks. The algorithm sequentially

maximizes the explanatory power at each branch, looking forward only one step at a time.

Additionally, he notes that in larger problems, the classification trees might differ. This

paper also reviews financial applications of Neural Nets, Genetic Algorithms, Fuzzy Logic,

and Chaos.

Harries and Horn (1995) researched the use of strategies to enhance C4.5 to deal

with concept drift and non-determinism in a time series domain. An aim of this study was

to demonstrate that machine learning is capable of providing useful predictive strategies

in financial prediction. For short term financial prediction, a successful prediction rate of

60% is considered the minimum useful to domain experts. Their results implied that

machine learning can exceed this target with the use of new techniques. By trading off

coverage for accuracy, they were able to minimize the effect of both noise and concept

drift.
Trippi and Lee (1996) suggest that inductive learning algorithms such as ID3 could

be used to generate rules that classify stocks and bonds into grades. They note that when

used for classification problems, inductive learning algorithms compete well with neural

network approaches.

The classification performance of twenty-two decision tree, nine statistical, and two

neural network methods was recently compared in terms of prediction error,

computational time, and the number of terminal nodes for decision trees using thirty-two

datasets (Lim et al., 1997). The datasets were obtained from the University of California at

Irvine Repository of Machine Learning Databases. It was found that a majority of the

methods, including C4.5, LDA, and Logit, had similarly low prediction error rates in the

sense that differences in their error rates were not statistically significant.

2.4 Linear Discriminant Analysis

2.4.1 Overview

Linear discriminants, first studied by Fisher (1936), are the most common form of

classifier, and are quite simple in structure (Weiss and Kulikowski, 1991). The name

indicates that a linear combination of the evidence will be used to separate or discriminate

among the classes and to select the class assignment for an unseen case. For a problem

involving d features, this means geometrically that the separating surface between the

sample will be a (d-1) dimensional hyperplane.

The general form for any linear classifier is given as follows:

w1e1 + w2e2 + ... + wded > w0

where (e1, e2, ..., ed) is the feature vector, d is the number of features, and the wi are constants

that must be estimated. Intuitively, we can think of the linear discriminant as a scoring

function that adds to or subtracts from each observation, weighing some observations more

than others and yielding a final total score. The class selected, Ci, is the one with the highest

score (Weiss and Kulikowski, 1991).
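This scoring view of the linear discriminant can be sketched in a few lines of Python (the classes, weights, thresholds, and feature values below are invented for illustration):

```python
def score(weights, w0, features):
    """Linear discriminant score for one class: w1*e1 + ... + wd*ed - w0."""
    return sum(w * e for w, e in zip(weights, features)) - w0

# Each class has its own weight vector and constant term:
classes = {
    "C1": ([0.8, -0.2], 0.1),
    "C2": ([0.3, 0.5], 0.0),
}
x = [1.0, 2.0]  # observed feature vector (e1, e2)

# Select the class whose scoring function gives the highest total:
best = max(classes, key=lambda c: score(*classes[c], x))
print(best)  # C2
```

Geometrically, the points where two class scores are equal form the (d-1)-dimensional separating hyperplane described above.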

2.4.2 Limitations of LDA

In classical LDA there are some limits on the statistical properties which the

discriminating variables are allowed to have. No variable may be a linear combination of

other discriminating variables. A "linear combination" is the sum of one or more variables

which may have been weighted by constant terms. Thus, one may not use either the sum or

the average of several variables along with all those variables. Likewise, two variables

which are perfectly correlated cannot be used at the same time. Another requirement is that

the population covariance matrices are equal for each group (Klecka, 1980).

Another assumption for classical LDA is that each group is drawn from a population

that has a multivariate normal distribution. Such a distribution exists when each variable has

a normal distribution about fixed values on all the others. This permits the precise

computation of tests of significance and probabilities of group membership. When this

assumption is violated, the computed probabilities are not exact but they may still be quite

useful if interpreted with caution.

It should be noted that there are many generalizations of LDA, including the Linear

Programming variant, that don't have these restrictions (Koehler, 1989); however, many

financial applications studies don't seem to be concerned about the restrictions. Karels and

Prakash (1987), in their study of the use of discriminant analysis for bankruptcy prediction,

noted that violating the multivariate normality constraint was the rule rather than the

exception in finance and economics. A more recent study of neural network classification

vs. discriminant analysis (Lacher et al., 1995, p. 54) noted that the multivariate normal

constraint, and others, are incompatible with the complex nature and interrelationships of

financial ratios. Discriminant analysis techniques have proven better in financial

classification problems until recently when new artificial intelligence classification

procedures were developed.

2.5 Logistic Regression (Logit)

Logit techniques are well-described in the literature (Altman et al., 1981),(Johnston,

1972), and (Judge et al., 1980). If we incorrectly specify a model as linear, the statistical

properties derived under the linearity assumption will not, in general, hold. The obvious

solution to this problem is to specify a nonlinear probability model in place of the linear

model (Sestito and Dillon, 1994). Logistic regression uses a nonlinear probability model

that investigates the relationship between the response probability and the explanatory

features. It is useful in classification problems involving nonnormal population distributions

and noncontinuous features. Studies have shown that the normal linear discriminant and

logistic regression usually give similar results (Weiss and Kulikowski, 1991).

Logistic regression calculates the probability of membership in a class. The model

has the form:

Logit(p) = log(p / (1 - p)) = α + β'x

where p = Pr(Y = 1 | x) is the response probability to be modeled, α is the intercept

parameter, and β is the vector of slope parameters (SAS Institute, 1992). Output from the

analysis is used to calculate the probability of membership in a class.
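As a sketch of how the fitted model is used, the intercept and slope parameters yield a membership probability by inverting the logit (the parameter and feature values below are invented):

```python
from math import exp

def membership_probability(alpha, beta, x):
    """Invert the logit: p = 1 / (1 + e^-(alpha + beta'x)) gives Pr(Y = 1 | x)."""
    z = alpha + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + exp(-z))

p = membership_probability(alpha=-1.0, beta=[0.8, 0.3], x=[2.0, 1.0])
print(round(p, 3))  # 0.711
```

Because the logistic function maps any real-valued score into (0, 1), the model produces valid probabilities without assuming normally distributed features.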

Several recent financial application studies compared Logit to machine learning or

other AI techniques such as Case-Based Reasoning (CBR). Dwyer (1992) compared the

performance of Logit and nonparametric Discriminant Analysis to two types of Neural

Networks in predicting corporate bankruptcies. Bankruptcy data drawn from a ten-year time

horizon was input into each of the four models, with predictive ability tested at one, three,

and five years prior to bankruptcy filing. The results suggested that Logit and the

backpropagation Neural Network were generally evenly matched as prediction techniques.

Hansen et al. (1993) studied a difficult audit decision problem requiring expertise

and they compared the performance of ID3, Logit, and a new machine learning algorithm

called NEWQ. While NEWQ performed best with 15 errors, Logit produced 16 errors and

ID3 had 18 errors out of 80 examples.

Logit was compared to a CBR system called ReMind (Bryant, 1996) for predicting

corporate bankruptcies. The database used in the study consisted of nonbankrupt and

bankrupt firms in a 20:1 ratio. Logit outperformed ReMind and one conclusion of the study

was that the sample size of bankrupt firms was too small for ReMind to work well.

2.6 Summary

This chapter has reviewed the literature of machine learning, inductive learning,

C4.5 and ID3, and the statistical techniques that we propose using in our research. In

Chapter 3 we focus on the domain problem of classifying mutual funds, identify our

hypotheses for further study, explain the statistical tests used to verify them, and conclude

with a discussion of the benefits of this research.


In this chapter we will provide an overview of mutual fund ratings systems and

provide background on Morningstar, Inc., the mutual fund rating company of interest to

this research. We will describe the Morningstar rating system, mention observations and

criticisms of the Morningstar rating system, and discuss how investment professionals

use rating systems. We will also review research about the persistence of mutual fund

performance and identify an interesting relationship between the average mutual fund

rating one year and the succeeding year's rating and average one-year return. This

chapter will end with a specification of the problems to be studied in this research.

3.1 Overview of Mutual Fund Ratings Systems

Mutual funds are open-end investment companies that buy and sell their shares to

individuals or corporations. Mutual fund buy and sell transactions occur between the fund

and the investor, and do not take place on a secondary market such as the New York Stock

Exchange. Since asset holdings are restricted to various forms of marketable securities, the

total market value of the fund's assets is relatively easy to calculate at the end of each day's

trading. The market value per share of a given mutual fund is equal to the total market value

of its assets divided by the number of shares of stock the fund has outstanding. We refer to

this value as the fund's Net Asset Value (NAV) and it is the price at which the fund will buy

or sell shares to the public (Radcliffe, 1994).
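The NAV computation itself is simple division; a one-line sketch with invented figures:

```python
def net_asset_value(total_assets, shares_outstanding):
    """NAV: total market value of the fund's assets divided by shares outstanding."""
    return total_assets / shares_outstanding

# A hypothetical fund holding $50 million in securities with 2 million shares:
print(net_asset_value(50_000_000, 2_000_000))  # 25.0 dollars per share
```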

Peter Lynch (1993), until recently manager of the large Fidelity Magellan mutual

fund, wrote that, " funds were supposed to take the confusion out of investing--no

more worrying about which stock to pick." The growth in the number of mutual funds has

been quite astounding. The November 13, 1995 issue of Business Week noted:

"Still, many investors can't resist them. Equities have been on a long bull
run, and the number of new funds keeps growing-474 have been added to
Morningstar Inc.'s 6,730-fund database so far this year. The temptation to
invest in newbies is understandable, since 61% of all equity mutual funds are
less than three years old, according to CDA/Wiesenberger, a Rockville (Md.)
mutual-fund data service. In fact, nearly one-third of all money flowing into
equity mutual funds in the past 12 months went to those with less than a five-
year track record, says State University of New York at Buffalo finance
professor Charles Trzcinka." (Dunkin, 1995, p.160)

In 1976, 452 mutual funds existed and this number had only grown to 812 mutual

funds managing $241.9 billion in assets by 1987. According to the Investment Company

Institute (ICI), there are 2,855 stock mutual funds today managing $2.13 trillion of assets.

This is comparable to the number of stocks on the New York Stock Exchange. The ICI is an

association that represents investment companies. Its membership includes 5,951 open-end

investment companies or mutual funds, 449 closed-end investment companies and 10

sponsors of unit investment trusts. Its mutual fund members have assets of about $3.056

trillion, accounting for approximately 95% of total industry assets, and have over 38 million

individual shareholders. Moreover, the growth rate continues. The August 28, 1997 issue

of the Wall Street Journal noted that investors put a net $26.56 billion into stock funds in

July and net bond inflows for July were $4.21 billion.

Mutual fund rating services, similar to stock and bond rating services, have been in

existence since 1940. According to the January 7, 1993 Wall Street Journal, Wiesenberger

Financial Services was the oldest mutual fund performance tracking company and had been

rating funds since The Investment Company Act of 1940 which authorized the creation of

mutual funds. This company merged with CDA Investment Technologies Inc. into

CDA/Wiesenberger (CDA/W) in 1991.

CDA/W provides a monthly report to subscribers listing performance, portfolio

characteristics, and risk and dividend statistics on mutual funds (CDA/Wiesenberger, 1993).

The company determines the CDA Rating of mutual funds, a proprietary measure, which is

a composite percentile rating from 1 (best) to 99 (worst), based on the fund's performance

over the past four market cycles. Two up cycles and two down cycles are used if available,

however, at least two cycles (one up and one down) are required for a rating. According to

CDA/W, the best-rated funds will be those that have done well in different market

environments, whose recent performance continues strong, and whose comparative results

have not fluctuated wildly over varying time periods. In determining the CDA Rating, they

give extra weight to recent performance (latest 12-months), and penalize funds for

inconsistency. CDA/W does not provide the methodology for determining the CDA Rating

in their newsletter.

The newest rating service is the Value Line Mutual Fund Survey that rates mutual

funds by a proprietary rating system. Fund risk is rated from one (safest) to five (most

volatile) by Value Line. They also provide an overall rating for the fund on a scale of 1 to 5.

Value Line also prints a one-page summary of the fund's performance.

Lipper Analytical Services publishes mutual fund indexes in the Wall Street Journal

and Barron's, and a Mutual Fund Scorecard in the Wall Street Journal. Lipper's scorecard

does not provide a rating for funds but lists the top 15 and bottom 10 performers based on

total return over 4 weeks, 52 weeks, and 5 years. The Wall Street Journal mutual fund list

ranks mutual funds by investment objective from A, B, C, D, or E (units of 20% each) for

total return. The list includes 4 week, 13 week, 26 week, 39 week, 1 year, 3 year, 4 year,

and 5 year returns. In 1996, Lipper was tracking 4,555 stock mutual funds.

3.1.1 Morningstar. Inc. Overview

Morningstar Mutual Funds is a mutual fund rating service started in April 1984. Its

first publication was the quarterly Mutual Fund Sourcebook for stock equity funds. Within

two years it was publishing Mutual Fund Values, featuring the one-page analysis of funds

that was to become the firm's cornerstone product. It also added bond funds to its coverage.

In November 1985, Business Week asked Morningstar to provide data for a new mutual

fund issue. Business Week insisted upon a fund rating system, and development work on

the magazine's rating system paved the way for Morningstar's own 5-star rating system,

which was introduced early in 1986 (Leckey, 1997). Morningstar sales went from $750,000

in its third year of operation to $11 million in its seventh year.

Morningstar publishes the following software and print products:

Software Published Monthly
Morningstar Ascent: Software for the do-it-yourself investor with a database of 7,800 funds

Morningstar Stock Tools: online stock newsletter that lets you screen, rank, and create
model portfolios from a database of 7,900 stocks

Morningstar Principia and Principia Plus: Software for investment professionals providing
data on mutual funds, closed-end funds, and variable annuities. The Principia Plus also
features a portfolio developer and advanced analytics.

Morningstar Mutual Funds: In-depth data and analysis on more than 1,600 funds that is
published every other week.

Morningstar No-Load Funds: a detailed look at nearly 700 no- and low-load funds that is
published every four weeks.

Morningstar Investor: A 48-page monthly publication featuring articles and information on
500 mutual funds.

Morningstar Mutual Fund 500: A year-end synopsis of 500 of the best funds.

Morningstar Variable Annuity/Life Performance Report: A monthly guide that covers the
variable annuity universe.

The Chicago-based firm has become the preeminent source of fund information for

investors. Charles A. Jaffe, the Boston Globe financial reporter, said in an August 6, 1996

article that more than 95 percent of all money flowing into funds goes to those carrying

Morningstar's four- and five-star ratings. According to the February 24, 1997 Wall Street

Journal, eighty percent of Morningstar's client base is made up of financial planners and

brokers, who say that the firm's star ratings are a big factor in selling funds.

3.1.2 Morningstar Rating System

In June 1994, Morningstar had over 4,371 mutual funds in its total universe of funds

on CD-ROM. Of these, 2,342 (54%) having three or more years of performance data were

rated according to the Morningstar one-star to five-star rating system. Only 1,052 (24%) of

the funds were equity mutual funds based on a self-identified investment objective. By June

1995, the number of funds had increased by 51% to 6,584 with 2,871 (44%) having

Morningstar ratings and 1,234 (19%) rated as equity funds. The July 1996 Morningstar

Principia had 1,583 rated equity mutual funds. Thus, less than half of all rated funds are

equity funds and this represents only part of all the funds in the Morningstar database.

The main criterion Morningstar uses for including a fund in the biweekly newsletter is

that the fund be listed on the NASDAQ (National Association of Security Dealers

Automatic Quotation System). Other factors that enter this determination are the

cooperation of the fund group, the space limitations of the publication, the asset value of the

fund, and the investor interest in the fund (Morningstar, 1992).

Domestic stock funds, taxable-bond funds, tax-free bond funds, and international

stock funds are each rated separately. These are referred to as the fund classes. Morningstar

includes hybrid funds in the domestic stock universe. To determine the star rating,

Morningstar analysts calculate Morningstar risk and Morningstar return. They then

determine the relative placement of a fund in the rating system by subtracting Morningstar

Risk from Morningstar Return (Morningstar, 1992) and ordering the funds from highest to

lowest.
A fund's Morningstar Risk is determined by subtracting the 3-month Treasury bill

return from each month's return by the fund. They sum those months with a negative value

and the total losses are divided by the total number of months in the rating period (36, 60, or

120). They compare the average monthly loss for a fund to those of all equity funds by

dividing the average risk for this class of funds into all these values and this sets the average

Morningstar Risk to 1.00. The resulting risk value expresses the percentage points of how

risky the fund is relative to the average fund. For example, a mutual fund with a

Morningstar Risk rating of 1.30 is 30% more risky than the average mutual fund.
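The calculation just described can be sketched as follows. This is an illustrative reconstruction only; the function names are ours and Morningstar's exact implementation is proprietary:

```python
def average_monthly_loss(fund_returns, tbill_returns):
    """Sum the months where the fund underperformed the 3-month Treasury
    bill, then divide by the total number of months in the rating period
    (36, 60, or 120), not just the losing months."""
    shortfalls = [f - t for f, t in zip(fund_returns, tbill_returns) if f - t < 0]
    return -sum(shortfalls) / len(fund_returns)

def morningstar_risk(avg_losses):
    """Divide each fund's average loss by the class average so that the
    average fund's Morningstar Risk comes out at exactly 1.00."""
    class_avg = sum(avg_losses) / len(avg_losses)
    return [loss / class_avg for loss in avg_losses]
```

A fund whose normalized value comes out at 1.30 is then read as 30% more risky than the average fund in its class.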

Morningstar Return is the fund's total return, adjusted for all loads (sales

commissions) and management fees applied by the fund, in excess of the Treasury bill

rate. Morningstar (1992) asserts that the effect of loads will clearly affect a three-year star

rating more than a ten-year one. Unless the fund's load is substantially different from those

of its competitors, the effect will not be unduly pronounced. The main reason for including

the load in the rating process is so that investors can compare the Morningstar Return

numbers for load and no-load funds. For example, in the June 1995 Morningstar CD-ROM

database there were 2,360 mutual funds with a mean front-end load of 4.33%. Of these, 940

were equity funds having a mean front-end load of 4.75%.

Morningstar assumes the full load was paid on front-end loads. Investors who hit

the load fund's breakpoints will receive higher than published returns. Deferred sales

charges and redemption fees are included in the calculation by assuming the investor sold

the fund at the end of the particular rating period. For the three-year period Morningstar

typically charges half of the fund's maximum deferred charge and, for the ten-year period,

they ignore the deferred charges. The average value of the Morningstar Return is divided

into each calculated return value resulting in the average Morningstar Return being set to

1.00 to allow quick comparisons between different mutual funds. The interpretation of

Morningstar Return is similar to Morningstar Risk: a Morningstar Return of 1.30 means that

the fund is returning 30% more than the average fund.

The result of Morningstar Return minus Morningstar Risk is the Risk-Adjusted

Performance (RAP). Morningstar calculates it for each fund class for three years, five years

and ten years, if the financial data are available for those periods. Based on the number of

years of data available, a weighted average is calculated to report the overall Morningstar

Risk-Adjusted Rating:

1. If three years are available, Morningstar uses 100% of 3 year RAP.

2. If five years of data are available, they use 40% of the 3-year RAP and 60% of the 5-
year RAP.

3. If ten years of data are available, they use 20% of the 3- year RAP, 30% of the 5-
year RAP, and 50% of the 10-year RAP.

Morningstar orders the results from highest to lowest and distributes the star ratings by a

symmetric normal curve. The top 10% of funds receive five stars (highest), the next 22.5%

receive four stars, the middle 35% receive 3 stars, the lower 22.5% receive two stars, and the

bottom 10% receive 1 star.
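The weighting scheme and the star distribution can be sketched together. The percentages are taken from the description above; how Morningstar handles ties and rounding at the percentile boundaries is our assumption:

```python
def overall_rap(rap3, rap5=None, rap10=None):
    """Weighted Risk-Adjusted Performance based on the history available."""
    if rap10 is not None:
        return 0.2 * rap3 + 0.3 * rap5 + 0.5 * rap10
    if rap5 is not None:
        return 0.4 * rap3 + 0.6 * rap5
    return rap3

def assign_stars(raps):
    """Order funds by RAP and hand out stars on a 10/22.5/35/22.5/10
    percent split, from five stars at the top down to one star."""
    n = len(raps)
    order = sorted(range(n), key=lambda i: raps[i], reverse=True)
    stars = [0] * n
    for rank, i in enumerate(order):
        frac = (rank + 1) / n  # fraction of funds at or above this one
        if frac <= 0.10:
            stars[i] = 5
        elif frac <= 0.325:
            stars[i] = 4
        elif frac <= 0.675:
            stars[i] = 3
        elif frac <= 0.90:
            stars[i] = 2
        else:
            stars[i] = 1
    return stars
```

For example, with ten funds ordered by RAP, the best fund receives five stars and the worst receives one.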

3.1.3 Review and Criticism of the Morningstar Rating System

The Morningstar ratings are published in the biweekly Morningstar Mutual Fund

newsletter and are updated monthly. Morningstar suggests that an investor could

develop their own rating system by revising the weights for the data, for example, to

emphasize risk more than return. In practice, this is difficult to do since Morningstar

publishes the detailed quantitative data on a select sample of 130 mutual funds every two

weeks in the newsletter. A subscriber to the newsletter will only see the quantitative values

needed for developing their own rating system about twice a year unless they subscribe to

the more expensive CD-ROM database that provides all the necessary information on a

monthly basis.

Morningstar (1992) states that its rating system is a purely quantitative method for

evaluating mutual funds and they only use objective data to determine the Morningstar

rating. They tell investors to use it only as a screening device and not a predictor for future

performance. However, in the early years of the star rating system Morningstar labeled 5-

star funds "buy" and 1-star funds "sell", a practice that was dropped in 1990 (Leckey, 1997).

Two studies of the Morningstar rating have shown that the 5-star system is not

predictive of fund performance, according to Leckey (1997). Mark Hulbert, editor of The

Hulbert Financial Digest, which tracks investment letters, used the Morningstar ratings to

invest $10,000 in 5-star funds. He sold the mutual funds when their rating declined and,

over a five-year period, this trading system failed to beat the Wilshire 5000 index.

Morningstar argued that the 5-star funds should not be seen as a portfolio, but Hulbert


A study by Lipper Analytical Services looked at how the 5-star funds performed

over the next twelve months when purchased at the beginning of 1990, 1991, 1992, and

1993. Lipper reported that a majority of 5-star stock funds did worse in the rest of the year

than the average stock fund (Leckey, 1997).

Morningstar has conducted a study that shows that stock funds rated 5-stars in 1986,

when the rating system started, posted respectable or better results during the next nine

years. Conversely, more than a dozen 1-star stock funds have performed so badly as to be

merged out of existence (Leckey, 1997). Another study performed by Morningstar showed

that a statistically significant majority of funds that received 4- and 5-star ratings in 1987

maintained those high ratings a decade later. In her commentary, Morningstar senior analyst

Laura Lallos noted that, "By the standards of what it sets out to do-separating the long-term

winners from the losers-the rating is actually quite successful (Harrell, 1997)."

3.1.4 Investment Managers' Use of Ratings

Ratings possess a considerable value to the investment community as indicated by

the extensive use made of them by many institutional and individual investors over a long

period (Teweles and Bradley, 1987). Even organizations with extensive staffs use them in

cross-checking their investigations. They are a quick, easy reference available to most

investors and, when used with care, they are a valuable source of information to supplement

other data.

The Morningstar approach to rating mutual funds with five classes is similar to the

classification of stocks in the securities industry today. Elton and Gruber (1987) found that

the stockbroker making an investment decision quite often receives a list of stocks with a

ranking on each (usually from one to five) and perhaps some partial risk information. If

stocks are ranked (grouped) from 1 to 5, stocks ranked in group 1 are best buys, a 2 is a buy,

3 is a hold, 4 is a sell, while a 5 is a definite sell. Of 40 financial institutions they surveyed,

80% stated that the data the brokerage community and/or their analysts supplied to the

portfolio managers were in the form of grouped data. The institutions using grouped data

reported that 50% grouped them on expected return, 30% on risk-adjusted return, 10% on

expected deviations from a Capital Asset Pricing Model, and 10% responded they did not

even know the basis of the grouping.

Fama (1991) noted that the Value Line Investment Survey publishes weekly

rankings of 1,700 common stocks into five groups. Group 1 has the best return prospects

and group 5 the worst. There is evidence that, adjusted for risk and size of the company,

group 1 stocks have higher average returns than group 5 stocks for horizons out to one year.

A study of the Banker's Trust Company stock forecast system (Elton et al., 1986)

used the ranking of stocks on a five-point scale from 33 brokerage firms. A rating of 1 or 2

was a buy recommendation, 3 was neutral, and 4 and 5 were sell recommendations.

Approximately 700 stock analysts used the system over the three-year period of the study.

We made two observations from this study:

1. On average, changes occurred to 11% of the classifications every month.

2. Table 3.1 shows that, over the three-year period of the study, the distribution

of the monthly average of 9,977 stock forecast ratings was in a skewed curve

favoring buy recommendations.

Table 3.1: Distribution of Stock Ratings

Rating 1981 1982 1983 Overall
1 17.4% 14.9% 14.7% 15.8%
2 37.6% 29.4% 30.3% 32.5%
3 32.9% 36.8% 40.6% 38.0%
4 10.6% 10.7% 11.6% 11.2%
5 1.5% 2.5% 2.8% 2.4%

The use of rankings for investment instruments has a long history in the financial

community, and the investment community readily accepts them in recommending

broker sales to customers.

3.1.5 Performance Persistence in Mutual Funds

Brokers usually base their recommendation to purchase a financial instrument on

historical performance. Investors flock to well-performing mutual funds and common

stocks based on the anticipation of continued performance. This continued performance, if

it does exist, is referred to as performance persistence and a review of the literature provides

mixed results about this strategy. Goetzmann and Ibbotson (1994) showed that past returns

and relative ranking were useful in predicting future performance, particularly for raw

returns. A later study (Brown and Goetzmann, 1995) demonstrated that the relative

performance pattern of funds depended upon the period observed and was correlated across

managers. A year-by-year decomposition showed that persistence of return was due to a

common investment strategy and not standard stylistic categories and risk-adjustment


Another study of performance persistence (Manly, 1995) showed that relative

performance of no-load, growth-oriented mutual funds persisted for the near term with the

strongest evidence for a one-year evaluation horizon. The difference in risk-adjusted

performance between the top and bottom octile portfolios was six to eight percent per year.

Malkiel (1990) observed that while performance rankings always show many funds

beating the averages--some by significant amounts--the problem is that there is no

consistency to the performances. He felt that no scientific evidence has yet been assembled

to indicate that the investment performance of professionally managed portfolios as a group

has been any better than that of randomly selected portfolios. A later study (Malkiel, 1995)

noted that performance persistence existed in the 1970s but not in the 1980s.

Grinblatt and Titman (1992) identified the difficulty of detecting superior

performance by mutual fund managers by analyzing total returns since the fund manager

may be able to charge higher load fees or expenses. They constructed gross returns for a

sample of mutual funds for December 31, 1974, to December 31, 1984. They concluded

that superior performance by fund managers might exist, particularly among aggressive-

growth and growth funds and those funds with the smallest net asset values. However,

funds with the smallest net asset values have the highest expenses so that actual returns, net

of expenses, will not exhibit abnormal performance.

Patel et al. (1991) found past performance of a mutual fund to have an effect

on cash flows. A one-percentage-point return higher than the average fund's return implies a

$200,000 increased flow in the next year (where the median fund's size is $80 million and

the median flow is $21 million). This performance effect is based on investors' belief that a

managed fund with a superior past will perform better than individuals.

Another confirmatory study, Phelps (1995), showed that sophisticated and

unsophisticated investors were chasing prior year returns. Was this a good investment

strategy? Yes. Risk-adjusted fund specific performance was significantly positively related

to past performance during the earlier half of Phelps' sample for 1985-89.

Three more recent studies show that persistence is a real phenomenon while disagreeing

about whether fund managers are responsible for it. Carhart (1997) stated that

common factors in stock returns and investment expenses almost completely explain

persistence in equity mutual funds' mean and risk-adjusted returns. He argues against fund

managers being skilled portfolio managers but does not deny persistence when it comes to

strong underperformance by the worst-return mutual funds. Elton et al. (1996) examined

predictability for stock mutual funds using risk-adjusted return. They found that past

performance is predictive of future risk-adjusted return. A combination of actively managed

portfolios was formed with the same risk as a portfolio of index funds. The actively

managed funds were found to have a small, but statistically positive risk-adjusted return

during a period where mutual funds in general had negative risk-adjusted returns. Phelps

and Detzel (1997) claim that they confirmed the persistence of returns from 1985-89 but it

disappeared when risk was properly controlled for, or the more recent past was examined.

With these differing results about performance persistence, we would expect

investors would rely on performance and rating systems as a way of investing wisely. The

popularity of rating systems attests to their use. The ability to predict mutual fund ratings,

and improvements or declines in ratings, from one month (the Morningstar rating cycle) to

one year would be an important tool in developing an investment plan.

3.1.6 Review of Yearly Variation of Morningstar Ratings

We studied matched mutual funds, funds with no name change over a one-year

period, from 1993 to 1996 to determine the relationship between the Morningstar rating

of one year, and the average one-year return and Morningstar rating of the succeeding

year. If some relationship existed, it would show why it would be interesting to know the

predicted Morningstar rating one year in the future. The 1993-94 period had 770

matched funds, 1994-95 had 934 matched funds, and 1995-96 had 1,059 matched funds.

In Figure 3.1, the Morningstar 5-star ratings for 1993 appear as the data points on

the graph. The 1994 Morningstar ratings are the x-axis. The y-axis is the one-year

average return percentage. The line connecting the data points shows, for example, the

one-year average return percentage for a fund with a 2-Star rating in 1993 if it increased

or decreased its star rating in 1994. To be more specific, the average fund with a 1993 2-

Star rating would have a 2% one-year return if it stayed 2-Star. If this average fund

increased to 3-Star it had a 15% one-year return. If the average fund decreased to 1-Star

it had a -7% one-year return.

Figure 3.1 also shows that a 1993 2-star and 3-star fund that retained their

Morningstar rating in 1994 had the same one-year average return, approximately 2%. As

noted in the preceding paragraph, the 1993 2-Star fund that increased its rating had an

average 15% one-year return. However, the 1993 3-Star that went up to 4-Star in 1994

only had an average 10% one-year return.

This effect for one-year average returns was more distinct in Figure 3.2, using a

three-star rating system for rating the mutual funds. The relationship between average

return and rating changes was found to exist for all three years studied. Of course, it

should be noted that these are years when the stock market increased year-by-year.

[Figure: line graph titled "1993 Rating vs. One-Year Average Return and 1994 Rating"; one series per 1993 star rating (93 1-Star through 93 5-Star), x-axis: 1994 Rating, y-axis: one-year average return (%).]

Figure 3.1: 1993 Ratings and 1994 Ratings for the Five-Star Rating System.

[Figure: line graph titled "1993 Rating vs One-Year Average Return and 1994 Rating" for the three-star system; series 93 1-Star through 93 3-Star, x-axis: 1994 Ratings, y-axis: one-year average return (%).]

Figure 3.2: 1993 Ratings and 1994 Ratings for the Three-Star System.

[Figure: line graph titled "1994 Ratings vs One-Year Average Return and 1995 Rating"; series 94 1-Star through 94 5-Star, x-axis: 1995 Ratings, y-axis: one-year average return (%).]

Figure 3.3: 1994 Ratings and 1995 Ratings for the Five-Star System.

[Figure: line graph titled "1994 Ratings vs. One-Year Return and 1995 Rating" for the three-star system; series 94 1-Star through 94 3-Star, x-axis: 1995 Rating, y-axis: one-year return (%).]

Figure 3.4: 1994 Ratings and 1995 Ratings for the Three-Star System.

Figures 3.2, 3.4 and 3.6 for the 3-Star rating system show that it was better to hold

2-Star rated funds that maintained their rating or improved it to 3-Star than to

own 3-Star funds that declined to 2-Star; the declining funds had a lower average

one-year return than the steady 2-Star funds.

[Figure: line graph of 1995 Ratings vs One-Year Average Return and 1996 Rating; series 95 1-Star through 95 5-Star, x-axis: 1996 Rating, y-axis: one-year average return (%).]

Figure 3.5: 1995 Ratings and 1996 Ratings for the Five-Star System.

[Figure: line graph titled "1995 Ratings vs One-Year Return and 1996 Rating" for the three-star system; series 95 1-Star through 95 3-Star, x-axis: 1996 Rating, y-axis: one-year return (%).]

Figure 3.6: 1995 Ratings and 1996 Ratings for the Three- Star Rating System.


Based upon this evidence, for the period studied, it would have been interesting and

profitable to have a prediction of the Morningstar mutual fund ratings one year in the future.

The graphs indicate that over a one-year period the Morningstar rating responds to increased

returns by having the rating go up and the ratings decline when average returns go down.

This is somewhat surprising given the 3-year, 5-year, and 10-year return data used by

Morningstar in calculating its ratings and the moderating effect it should have on one-year

rating changes.

3.2 Problem Specification

Our review indicates that there are two issues of interest associated with the

Morningstar Mutual Fund rating system:

1) Due to the rapidly increasing number of mutual funds, Morningstar rates

approximately half the mutual funds in their database because unrated funds

do not have three years of financial data for Morningstar to calculate a rating.

Classifying unrated funds could be useful information for investors wanting

to know how these funds compare to Morningstar-rated mutual funds, and

2) Anecdotal evidence exists that investors buy mutual funds that will

maintain or improve their Morningstar rating, meaning investors have high

regard for the Morningstar rating system. The ability to predict Morningstar

mutual fund ratings one year in advance would be useful information for

planning an investment portfolio.

Based on the concept of performance persistence we would expect that mutual fund

data could have information that would indicate a fund would continue to maintain or

improve their Morningstar rating. It could be due to average returns remaining the same or

improving, or a combination of features. Likewise, if the average return or these features

decrease, we would expect the rating of the mutual fund to decrease.

We will determine the ability of C4.5 to classify mutual funds versus LDA and Logit

to demonstrate that unrated funds can be classified with this technology. Success with

classification by C4.5 will provide a foundation for our study of the prediction of

Morningstar mutual fund ratings. Therefore, our research hypotheses are as follows:

1. C4.5 can classify mutual funds as well as LDA and Logit, and

2. C4.5 can predict mutual fund ratings changes one year in the future compared to an

investment strategy of the fund maintaining the same rating one year in the future.

Figure 3.7 on the following page shows a map of the experimental design of this

study. In Chapter 4, we compare the classification of mutual funds by C4.5 to LDA and

Logit to test hypothesis 1. In Chapter 5, we perform experiments with C4.5 to predict

mutual fund ratings and ratings changes one year hence to test hypothesis 2. We conducted

these studies using two rating systems: the standard Morningstar 5-Star rating system; and a

new 3-Star rating system based on merging 1-Star and 2-Star ratings, and the 4-Star and 5-

Star ratings into two new ratings. The 3-Star system was designed to reduce the

classification error caused by the small number of funds rated 1-Star and 5-Star.
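The merge of the rating classes is a simple mapping, sketched below (the function name is ours):

```python
def to_three_star(five_star_rating):
    """Collapse the 5-Star scale to three classes: 1- and 2-Star become
    class 1, 3-Star becomes class 2, and 4- and 5-Star become class 3."""
    return {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}[five_star_rating]
```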

Experimental Plan

Chapter 4: Classifying Funds with C4.5, LDA and Logit

Phase 1: Classifying 1993 Funds
Phase 2: Classifying with Derived Features
Phase 3: Comparing 3-Star and 5-Star Classification
Phase 4: Classification by C4.5 with Large Feature Vector

Chapter 5: Predicting Fund Ratings with C4.5

Phase 5: Predicting with a Common Feature Vector
Phase 6: Predicting Matched Mutual Fund Ratings
Phase 7: Predicting Unmatched Mutual Fund Ratings

Figure 3.7: Overview of Research Phases.


The research design for this study divides into two parts. In this chapter, we

determined that C4.5 classifies mutual funds by their Morningstar rating as well as Linear

Discriminant Analysis (LDA) and Logistic Regression (Logit). In Chapter 5, C4.5

predicted future Morningstar mutual fund ratings and ratings changes.

4.1 Research Goals

The research that we have conducted has several broad research goals:

1) Demonstrate the use of decision trees as a useful classification technique for

mutual funds,

2) Improve our understanding of the domain, the Morningstar rating system,

from the standpoint of the relationship between ratings and the features or

attributes used to describe the mutual funds, and

3) Develop a knowledge base and several decision trees that can be used to

predict mutual fund ratings.

4.2 Research Caveats

Before reviewing the research methodology of these phases it should be noted that

C4.5, and also the stepwise forms of LDA and Logit in this research, use a greedy

algorithm heuristic for selecting a feature to classify an example. As noted earlier in

Chapter 2, a greedy algorithm always makes the choice that, at the moment, looks best.

Greedy algorithms do not always yield optimal solutions but, for many problems, they

often do.

Another caveat is that the decision tree that is generated by the algorithm to

correctly classify examples in the training set is only one of the many possible trees that

could classify better or worse (Quinlan, 1990). These two points mean that the order of

feature selection in these three methods of classification is not a measure of their value to

the classification process. Specifically, in the case of C4.5, more than one decision tree

could exist that would classify the training set as well as the one it selected. Nevertheless,

if a feature is consistently selected in our samples, we feel that this is an indication of

its importance to the classification process.

4.3 Example Databases

Morningstar, Inc. approved the use of their data for the experiments conducted as

part of this research. We selected examples from the Morningstar Mutual Funds OnDisc

CD-ROM for April 1993, July 1994, and July 1995, and the Morningstar Principia for

July 1996. An example consists of various features describing a mutual fund and the

Morningstar rating of 1-Star to 5-Star. A complete listing of the available features is in

Appendix A. In each of the following phases we will list the features that made up the

feature vector for each dataset.

4.4 Brief Overview of the Research Phases

Phase 1 consisted of classifying the April 1993 Morningstar data with C4.5, LDA,

and Logit. In Phase 2 we added new features derived from the April 1993 data to determine

if this improved classification. Phase 3 of this study used the July 1994 Morningstar

database to study improvements to classification caused by increased sample size (more

mutual funds now qualified for a Morningstar rating). We also consolidated the

Morningstar rating system into three ratings instead of five ratings and explained the reason

for doing this in Section 4.7. Phase 4 departed from comparing C4.5 with LDA and Logit

and tested the ability of C4.5 to classify funds using a large number of features, fifty, by

cross-validation with the 5-Star and 3-Star rating systems. We also studied the effect of three-

year features on classification. The results of this experiment were used to design the final

three phases of our research presented in Chapter 5.

4.5 Phase 1 Classifying 1993 Funds

4.5.1 Methodology

In this phase, we performed three separate classifications of the April 1993

Morningstar data using C4.5, LDA, and Logit. First, we obtained examples of equity

mutual funds with complete data from the selected database. This meant that the

examples did not have missing feature values.

Equity mutual fund features used in the classification process were selected by

two criteria:

(1) minimizing the expected correlation among certain features since LDA

required that the features not be highly correlated (Klecka, 1980), and

(2) excluding total and average return features beyond the first year (three, five,

and ten year values are used by Morningstar in calculating Morningstar return) to

eliminate correlation with the Morningstar classification.

We tested for correlation among features and Percentage Rank All Funds had a

highly negative mean correlation of -0.85 with Total Return. The mean correlation was

0.64 for Total Return and Percentage Rank Funds by Objective. Percentage Rank Funds

by Objective had a mean correlation of 0.75 with Percentage Rank All Funds. Thus, we

used the Total Return feature in the dataset and excluded those correlating features.

Factor analysis was not used to reduce this set to independent uncorrelated factors

due to its underlying assumption that the variates are multivariate normal (Morrison,

1990). We show later in this chapter that most features are not univariate normal and,

therefore, not multivariate normal. Manly (1995) considers univariate normality a

minimum requirement before testing for multivariate normality. In addition, the data

consisted of five classification groupings (the Morningstar stars). Finding a single

orthogonal transformation such that the factors are simultaneously uncorrelated for the

five groupings would require that the transformations be the same and differ only by the

sampling error (Flury and Riedwyl, 1988). This would be very difficult to measure and

would be of suspect value.

C4.5 did not require an assumption about an underlying distribution of the data or

the correlation between features (Quinlan, 1993). However, others (Han et al., 1996)

have identified decreased classification accuracy of ID3, the C4.5 precursor, as

correlation among explanatory variables increased. Therefore, the data just were not

ideal for any of the three classification systems.

The twenty-four continuous features used for this phase are presented in Table

4.1. We selected 555 equity mutual funds having no missing feature values from the

Morningstar database. The funds included four investment objectives: Aggressive

Growth, Equity-Income, Growth, and Growth-Income.

We used the procedure described by Weiss and Kulikowski (1991), and referred

to as the train-and-test paradigm, for estimating the true error rate and performance of the

Table 4.1: Classification Features for Phase 1

Yield Return on Assets
Year-to-Date Return Debt % Total Capitalization
3-Month Total Return Median Market Capitalization
Year-1 Total Return Cash %
Alpha Natural Resources Sector
Beta Industrial Products Sector
R-Squared Consumer Durables Sector
Total Assets Non-Durables Sector
Expense Ratio Retail Trade Sector
Turnover Services Sector
P/E Ratio Financial Services Sector
P/B Ratio Manager Tenure

three systems in classifying the mutual funds. We randomly sampled the 555 funds using

a uniform distribution to produce a training set of 370 examples and a testing set of 185

examples. We calculated the error rate according to the following formula:

error rate = (number of errors) / (number of cases)
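The train-and-test paradigm described above can be sketched in Python. This is a minimal illustration of the procedure, not the authors' actual tooling; the fund records and random seed are hypothetical stand-ins.

```python
import random

def train_test_split(examples, n_train, seed=0):
    """Uniform random split into a training set and a testing set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def error_rate(predicted, actual):
    """error rate = number of errors / number of cases."""
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual)

# 555 funds split into 370 training and 185 testing examples, as in the study
funds = list(range(555))
train, test = train_test_split(funds, 370)
```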

We computed a lower bound for the size of the training set with twenty-four

continuous features for a two class decision tree induction sufficient to guarantee an error

level, ε, within a specified degree of confidence, δ, for binary data (Kim and Koehler,

1995). With ε = 0.1 and δ = 0.01 the sample size required was 378. This would be

appropriate because our data consisted of continuous features upon which C4.5 would

perform binary splits.

Our classification problem, however, had five classes, so we required five times

as many examples for training (Langley, 1996, p. 31) or 1,892 examples using the above

parameters. However, we only had 370 examples and had to modify the error and

confidence parameters to ε = 0.35 and δ = 0.115. This meant that given the 370

examples, we were 88.5% confident that the output of the algorithm was 65% correct for

the five classes or star ratings.

We processed each of the twenty training datasets with the SAS stepwise

discriminant analysis procedure, STEPDISC. A major assumption for this statistical

procedure is that the features are multivariate normal with a common covariance matrix.

Using the one-sample two-tailed Kolmogorov-Smirnov Test we determined that of the 22

features only Debt % of Total Capitalization (p = 0.396) and Return on Assets (p =

0.044) were univariate normally distributed and concluded that the data were not

multivariate normal.
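The one-sample Kolmogorov-Smirnov statistic used here can be sketched as follows. This is only an illustration of the D statistic against a normal reference distribution (the p-value computation and the SAS implementation details are omitted; the sample values are hypothetical):

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, mu, sigma):
    """One-sample Kolmogorov-Smirnov D: the maximum gap between the
    empirical CDF of the sample and the hypothesized normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mu, sigma)
        d = max(d, abs((i + 1) / n - cdf), abs(cdf - i / n))
    return d
```

A large D (small p-value) leads to rejecting univariate normality for that feature.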

The classification features determined by STEPDISC then were processed with

the SAS DISCRIM procedure to compute the discriminant function based on the selected

features. The result was a classification of the mutual funds into the five classes or

Morningstar ratings for the training and testing sets. In addition, the SAS system

produced a confusion matrix for each set of examples that compared the classification

determined by the system to the actual Morningstar rating. We made an error count for

each holdout set.

Next, we processed the training set with the SAS LOGISTIC procedure and this

fitted a linear logistic regression model for ordinal response data by the method of

maximum likelihood. The approach selected for this procedure was the stepwise

selection of features and, using the following equations, the system performed a

probability calculation to determine the classification of each example:
logit(p) = intercept + Σ (parameter × feature value)

where p was the probability of the classification, defined by:

p = e^logit(p) / (1 + e^logit(p))

We considered the first rating classification with a probability greater than 50% to be the

determined rating and we compared this to the actual Morningstar rating. A count was

made of the misclassified examples to determine the classification errors of Logit. We

determined a classification rating for every example for the training set and testing set

and all were included in the error count.
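The probability rule above can be sketched in Python. This is a minimal illustration, assuming one intercept per cumulative logit as in an ordinal model; the intercepts and coefficients are hypothetical, not SAS output.

```python
import math

def logistic_prob(intercept, coefficients, features):
    """p = e^logit(p) / (1 + e^logit(p)), where
    logit(p) = intercept + sum(parameter * feature value)."""
    logit = intercept + sum(b * x for b, x in zip(coefficients, features))
    return math.exp(logit) / (1.0 + math.exp(logit))

def assign_rating(intercepts, coefficients, features):
    """Take the first rating class whose probability exceeds 50%."""
    for star, a in enumerate(intercepts, start=1):
        if logistic_prob(a, coefficients, features) > 0.5:
            return star
    return len(intercepts) + 1  # no class exceeded 50%: fall to the last class
```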

C4.5 classified the training set by building the classification decision tree using

the Gain Ratio Criterion described earlier, and the test examples were then

classified with the best tree. We used the default settings for C4.5

described in Quinlan (1993) with the exception that any test used in the classification tree

must have at least two outcomes with a minimum of x examples. Quinlan (1993)

recommends higher values for x for noisy data. We incremented this value in units

between 5 and 25 for the experiments.

The C4.5 program calculated the error rate for the training set and the testing set

and then produced a Confusion Matrix of the classification of the test set by comparing it

to the actual Morningstar rating from the example.

4.5.2 Results

Logit performed best with mean classification errors of 33.5% (62 mean errors

per sample) over the twenty samples. C4.5 followed with 37.2% mean classification

errors (69 mean errors per sample) and LDA had 39.6% mean classification errors (73

mean errors). Figure 4.1 shows the error rate for classification for each of the 20 samples

(labeled A through T). We performed an Analysis of Variance (ANOVA) test with the

null hypothesis of equal means on the number of classification errors. With F = 30 (p = 0.0),

for df1 = 2 and df2 = 57, we rejected the null hypothesis, meaning that the three methods did

not classify the funds equivalently.
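The one-way ANOVA behind these F statistics can be sketched in pure Python. This is a minimal illustration, not the statistical package actually used; the sample data in the usage below are hypothetical. Note that three methods with twenty samples each give df1 = 2 and df2 = 57, as in the text.

```python
def anova_f(groups):
    """One-way ANOVA F statistic under the null hypothesis of equal means."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df1, df2 = k - 1, n - k
    return (ss_between / df1) / (ss_within / df2), df1, df2
```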


Figure 4.1: C4.5, LDA, and Logit Classification Errors for April 1993 Morningstar Data.

Table 4.2: C4.5 Classification Features for Phase 1.

Feature Frequency Feature Frequency
Expense Ratio 20 Year 1 Total Return 4
Alpha 19 YTD Total Return 4
Yield 14 Cash % 3
Median Market 13 Debt % of Total 3
Capitalization Capitalization
Assets 11 Industrial Products Sector 3
Return on Assets 10 Natural Resources Sector 3
R-Squared 9 Nondurables Sector 3
P/B Ratio 8 Manager Tenure 2
Beta 5 P/E Ratio 2
Consumer Durables Sector 4 Retail Sector 2
Turnover 4 Service Sector 2

In Table 4.2, we list the features selected by C4.5 for the classification process

and the selection frequency. While frequency is not an absolute measure of classification

importance, we gained useful knowledge about how often a feature was selected. On

average, each decision tree consisted of 7.4 features with a median of 7.5 features.

LDA and Logit select features one at a time, although a selected feature may later

be discarded due to the stepwise nature of both procedures. Tables 4.3 and 4.4

display not only the selection frequency of the features but also their position in the

classification process. Both values, taken together, provide some indication of the

classification importance of a feature to the 20 samples. For example, while the Services

Sector has an average position of 6.9, it has a position standard deviation (a) of 2.7 and it

Table 4.3: LDA Classification Features.

Feature Average Position a Frequency
Alpha 1.0 0.0 20
R-Squared 2.0 0.0 20
Beta 3.0 0.0 20
Assets 4.8 2.1 20
Year 1 Total Return 7.2 1.9 20
Turnover 6.3 1.9 19
Retail Sector 7.6 1.3 16
Expense Ratio 7.9 2.4 16
Return on Assets 8.1 2.0 14
Cash % 9.4 2.6 13
Debt % of Total Capitalization 8.9 2.5 10
Consumer Durables Sector 10.4 2.1 8
P/E Ratio 11.0 1.3 8
Services Sector 6.9 2.7 7
P/B Ratio 9.4 3.2 7
Financial Sector 12.4 1.5 5
Industrial Products Sector 9.8 2.6 4
Median Market Capitalization 11.3 3.3 4
Yield 10.0 2.0 3
Nondurables Sector 11.3 0.6 3
YTD Total Return 12.0 0.0 3
Natural Resources Sector 12.7 1.5 3

was used in only 7 samples. Year-1 Total Return, while having an Average Position of

7.2, has a much smaller a and was used in classifying all 20 samples. Thus, we

considered Year-1 Total Return to be more important to the classification process than

the Services Sector feature. On the other hand, we would be suspicious of a classification

starting out with Natural Resources Sector rather than Alpha, Beta, or Assets.

LDA used a mean of 12.2 features to classify each sample (median = 12.0). Logit

required the fewest features with a mean of 5.9 features for classifying the training set

(median = 6.0).

Table 4.4 : Logit Classification Features.

Feature Average Position a Frequency
Alpha 1.0 0.0 20
R-Squared 2.0 0.0 20
Beta 3.0 0.0 20
Assets 4.3 0.6 20
Expense Ratio 5.2 0.8 13
Debt % of Total Capitalization 5.3 0.5 11
Median Market Capitalization 6.3 0.6 3
Financials Sector 6.0 0.0 3
Natural Resources Sector 8.0 1.4 2
P/E Ratio 6.0 1.4 2
Return on Assets 5.0 0.0 2
Yield 5.0 0 1
Retail Sector 8.0 0 1

A listing of Phase 1 features used for classification appears in Appendix B. In Table 4.5,

we provide a consensus list of the features that the three programs selected consistently

for classification. This listing represents the features most frequently selected by all three

classification methodologies.


Table 4.5: Consensus List of Classification Features for Phase 1

Expense Ratio
Debt % of Total Capitalization
Return on Assets

4.5.3 Conclusions

The results showed that Logit had fewer classification errors than C4.5 and LDA.

Logit performed better than C4.5 in seventeen of the twenty samples. Also, the three

classification algorithms used a very limited and comparable mix of the twenty-four

features to classify the twenty testing sets. After reviewing the error and confidence

factors, we concluded that a larger sample size could improve the performance of C4.5.

4.6 Phase 2 1993 Data with Derived Features

4.6.1 Methodology

The second phase of this study used the Morningstar April 1993 database to

determine if new features, derived from the database, improved the classification of

mutual funds. The derived features included approximating Morningstar Return minus

Morningstar Risk (not published by Morningstar in 1993), reversing the weights on the

approximation of Morningstar Return minus Morningstar Risk, and the Treynor Index.

The Treynor Performance Index, Tp, is a measure of relative performance not

calculated by Morningstar and is defined as follows:

Tp = (Rp - RF) / βp

where Rp is the portfolio return, RF is the risk-free rate of return, and βp is the portfolio's

beta or the nondiversifiable past risk. The Treynor Performance Index treats only that

portion of a portfolio's historical risk that is important to investors, as estimated by βp, and

neglects any diversifiable risk (Radcliffe, 1994).

Tp was calculated by subtracting the three-year Mean Treasury Bill rate from the

Morningstar Three-year Annualized Return for the mutual fund, and dividing this value

by the three-year β of the mutual fund in the Morningstar database.
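The calculation just described is a one-liner; the sketch below shows it with hypothetical input values (the actual returns, T-bill rates, and betas come from the Morningstar database).

```python
def treynor_index(annualized_return, tbill_rate, beta):
    """Treynor Performance Index: Tp = (Rp - RF) / beta_p."""
    return (annualized_return - tbill_rate) / beta

# Hypothetical fund: 12% three-year annualized return, 4% mean T-bill rate,
# three-year beta of 1.25
tp = treynor_index(0.12, 0.04, 1.25)
```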

We also constructed the other new features using data from Morningstar. The

April 1993 Morningstar CD-ROM did not provide the actual Morningstar Return values

but did provide Morningstar Risk. To approximate Morningstar Return the 3-year, 5-

year, and 10-year Average Returns were used as surrogates in the Morningstar Return

minus Risk formulas. The difference between them and the actual Morningstar Return

for those periods was that Morningstar deducted the effects of mutual fund loads and

redemption fees from these average returns. Morningstar made assumptions about the

fund loads and redemption fees that would make it very difficult for anyone to construct a

precise determination of the load-adjusted return.

In developing the Reversed Weight feature, we reversed the weights used by

Morningstar in their Risk-Adjusted Return formulas and applied these to the calculation

of five-year and ten-year Return minus Risk data. For five-year old mutual funds,

Morningstar used 40% of the three-year value and 60% of the five-year value and we

reversed them. For ten-year old mutual funds, Morningstar used 20% of the three-year

data, 30% of the five-year data, and 50% of the ten-year data. We reversed these weights

to 50% for three-year data, 30% for five-year data, and 20% for ten-year data.
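The weighting scheme above amounts to a weighted sum over the period values. The sketch below illustrates it with the ten-year weights from the text; the period values in the usage are hypothetical.

```python
def weighted_return_minus_risk(period_values, weights):
    """Combine (Return - Risk) values for the 3-, 5-, and 10-year periods."""
    assert abs(sum(weights) - 1.0) < 1e-9  # weights must sum to 100%
    return sum(v * w for v, w in zip(period_values, weights))

# Morningstar's weights for a ten-year fund, and the study's reversed weights
MORNINGSTAR_10YR = (0.20, 0.30, 0.50)   # 3-yr, 5-yr, 10-yr
REVERSED_10YR = (0.50, 0.30, 0.20)
# Five-year funds: Morningstar used (0.40, 0.60); reversed is (0.60, 0.40)
```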

We prepared a dataset consisting of 20 random samples. By increasing the

number of Investment Objectives from the four used in Phase 1 to the twelve domestic

equity investment objectives we were able to increase the total number of examples to

784. The Regular Dataset had 23 features listed in Table 4.6. We constructed a similar

dataset, including the three derived features or 26 features, and referred to it as the

Derived Dataset. This provided a training set of 523 examples and a testing set of 261

examples. A lower bound for the size of the 5-Star training set, using the method of

Phase 1, was calculated at 523 examples for the Regular Dataset with ε = 0.35 and δ =

0.078, and 523 examples for the Derived Dataset with ε = 0.35 and δ = 0.111.

The measurement of interest for this Phase was the fewest classification errors

using C4.5, LDA, and Logit. We processed the training sets and testing sets through the

same SAS procedures used in Phase 1, as well as C4.5. Table 4.6 lists the features

common to the Regular and Derived datasets.

Table 4.6: Phase 2 Common Features for Classification by C4.5, LDA, and Logit.

Yield Return on Assets
YTD Return Debt % Total Capitalization
Year 1 Total Return Median Mkt. Capitalization
Alpha Cash % of Holdings
Beta Natural Resources Sector
R-Squared Industrial Products Sector
Total Assets Consumer Durables Sector
Expense Ratio Non-Durables Sector
Turnover Retail Trade Sector
P/E Ratio Services Sector
P/B Ratio Financial Services Sector
Manager Tenure

4.6.2 Results for the Regular Dataset

The mean classification errors over the twenty samples for the Regular Dataset

were as follows: C4.5 was 36.6% (95.6 mean errors), LDA was 37.7% (98.5 mean errors),

and Logit was 35.8% (93.4 mean errors). Figure 4.2 shows the errors by sample. Logit

had fewer errors than C4.5 in fifteen samples.

However, an ANOVA test with a null hypothesis of equal means was performed

on the number of errors and calculated F = 1.35 (p = 0.267) for df1 = 2 and df2 = 57. Thus, we

failed to reject the null hypothesis and we assumed the three methods classified the

mutual funds equivalently.




Figure 4.2: C4.5, LDA, and Logit Classification Errors for 1993 Morningstar Data for the
Regular Dataset.

The features selected by C4.5 and their frequency are displayed in Table 4.7.

C4.5 used an average of 8.9 features per sample with a median of 8.0 features.

Table 4.7: C4.5 Feature Selection for the Regular Dataset.

Feature Frequency Feature Frequency
Alpha 20 Natural Resources Sector 6
Assets 20 Service Sector 5
Industrial Products Sector 16 Cash % 4
Median Market Capitalization 13 Turnover 4
R-Squared 13 YTD Total Return 4
SEC Yield 13 Beta 3
Expense Ratio 12 Consumer Durables Sector 3
Return on Assets 10 P/B Ratio 3
Debt % of Total Capitalization 9 Financial Sector 1
P/E Ratio 8 Manager Tenure 1
Year 1 Total Return 8 Nondurables 1
Retail Sector 1

Table 4.8: LDA Feature Selection for the Regular Dataset.

Features Average Position a Frequency
Alpha 1.0 0.0 20
R-Squared 2.0 0.0 20
Beta 3.0 0.0 20
Assets 5.0 1.0 20
Industrial Products Sector 5.7 1.7 20
Debt % of Total Capitalization 7.7 2.7 20
YTD Total Return 7.9 2.7 18
P/B Ratio 8.1 2.4 14
Retail Sector 10.0 1.8 14
Cash % 11.0 2.8 14
Year 1 Total Return 11.0 2.3 14
Turnover 9.2 1.8 13
P/E Ratio 11.0 4.0 10
Expense Ratio 11.0 3.4 10
Finance Sector 11.0 3.2 9
Service Sector 11.0 2.6 8
Return on Assets 13.0 3.3 8
Consumer Durables 12.0 2.3 7
Natural Resources Sector 5.8 3.1 6
Manager Tenure 14.0 1.2 6
SEC Yield 12.0 2.1 3
Non-Durables Sector 13.0 0 1

Table 4.8 summarizes the feature selection positioning, the standard deviation of

the position (a), and the frequency with which LDA used the feature for classification.

LDA required an average of 13 features to classify the samples.

The features most frequently selected by Logit are listed in Table 4.9. Alpha, R-

Squared, Beta, and Assets were selected for every sample. Logit used an average of 7.3

features to classify the 20 samples.

Table 4.9: Logit Feature Selection for the Regular Dataset.

Features Average Position a Frequency
Alpha 1.0 0.0 20
R-Squared 2.0 0.0 20
Beta 3.0 0.0 20
Assets 4.6 0.9 20
YTD Total Return 6.2 0.6 13
Return on Assets 5.5 1.4 11
Debt % of Total Capitalization 4.8 0.6 10
Consumer Durables Sector 6.4 1.6 8
Expense Ratio 5.7 0.8 6
Industrial Products Sector 6.5 1.3 4
Manager Tenure 7.3 0.6 3
P/B Ratio 7.7 0.6 3
P/E Ratio 6.0 1.4 2
Retail Sector 7.0 1.4 2
Median Market Capitalization 7.0 1
SEC Yield 8.0 1
Year 1 Total Return 8.0 1

Table 4.10: Consensus List for Regular Features.

YTD Total Return

Table 4.10 lists the consensus features used with high regularity by the three

methodologies classifying the Regular Dataset. There was little agreement beyond these

five features.

4.6.3 Results for the Derived Features Dataset

C4.5 had a mean classification error rate of 20.9% (54.6 mean errors), LDA

performed with a mean error rate of 26.7% (69.7 mean errors) and Logit had a mean error

rate of 20.5% (53.6 mean errors). Figure 4.3 shows the classification errors for the

samples. While Logit had the fewest mean errors, its advantage over C4.5 was not

statistically significant. We performed an ANOVA test with a null hypothesis of equal

means for the twenty samples and calculated F = 52.7 (p = 0.0) for df1 = 2 and df2 = 57,

so we rejected the null hypothesis. A separate t-test with a null hypothesis of equal

means for C4.5 vs. Logit had a p-value of 0.40, so we failed to reject that null hypothesis

and considered the mean errors of C4.5 and Logit to be equal. Therefore, C4.5 and Logit

classified the mutual funds equally well and LDA performed the worst.




Figure 4.3: C4.5, LDA, and Logit Classification Errors with Derived Features.

The features most frequently selected by C4.5 are listed in Table 4.11. On

average, the number of features used for classification declined from 8.9 to 4.4, with a median of 4.0.

The Treynor Index was only selected twice.

Table 4.11: C4.5 Derived Feature Selection.

Feature Frequency Feature Frequency
Return minus Risk 19 Median Market Capitalization 2
Assets 17 R-Square 2
SEC Yield 9 Treynor Index 2
Reversed-Weight Return minus Risk 5 Cash % 1
Alpha 4 Manager Tenure 1
Beta 4 Nondurables Sector 1
Debt % of Total Capitalization 4 P/B Ratio 1
Consumer Durables Sector 3 P/E Ratio 1
Turnover 3 Retail Sector 1
Expense Ratio 2 Services Sector 1
Finance Sector 2 Year 1 Total Return 1
Industrial Sector 2

In Table 4.12 we see that LDA used more features to classify

these datasets than did C4.5 and Logit. Fifteen features (mean and median) were required

for classification. Surprisingly, it even used the derived Reversed-Weight Return minus

Risk feature in all twenty samples. The Treynor Index was used for 12 samples.

Table 4.13 shows that Logit used a mean of 4.8 features for

classification (median = 4.5).

Table 4.14 lists the consensus features used regularly by the three

classification methodologies. Return minus Risk was the most commonly used feature

and dominated the selection of other features. Both C4.5 and Logit had few features that

were used in more than ten samples. LDA used many more features with poor results.

Table 4.12: LDA Derived Feature Selection.

Features Average Position a Frequency
Return minus Risk 1.1 0.2 20
R-Squared 2.0 0.2 20
Industrial Products Sector 3.8 1.5 20
Assets 5.1 1.6 20
Alpha 6.4 2.2 20
Reversed-Weight Return minus Risk 6.9 2.0 20
Beta 7.8 1.8 20
Expense Ratio 9.6 2.7 19
Debt % of Total Capitalization 11 2.6 18
Return on Assets 8.7 3.2 17
Turnover 9.4 2.3 14
Cash % 13 3.6 12
Treynor Index 10.0 2.9 12
Consumer Durables Sector 13 2.1 11
Manager Tenure 14 1.4 10
Retail Sector 12 3.0 9
P/B Ratio 12 3.9 8
YTD Total Return 6.6 5.3 7
Service Sector 13 3.5 7
Yield 12 4.2 5
Year 1 Total Return 13 2.5 5
Natural Resources Sector 14 2.5 5
P/E Ratio 15 0.8 4
Median Market Capitalization 15 6.4 2
Finance Sector 15 2.8 2
Non-Durable Sector 13 1

4.6.4 Conclusions

The classification features for each sample are in Appendix C. A review of the

features selected by C4.5 and Logit showed that a small number of the 23 available

features were used to classify the Regular Dataset and the Derived Dataset. Increased

sample size reduced the classification errors of C4.5 to the point where it was equivalent to Logit.

Table 4.13: Logit Derived Feature Selection.

Features Average Position a Frequency
Return minus Risk 1.0 0.0 20
Beta 2.2 0.5 19
Manager Tenure 3.7 1.1 14
Assets 3.1 0.8 11
Natural Resources Sector 4.4 0.9 11
Expense Ratio 4.1 0.9 9
R-Squared 4.5 0.6 4
Consumer Durables Sector 6.0 1.0 3
Alpha 6.0 1
Retail Sector 5.0 1
Debt % of Total Capitalization 4.0 1
P/B Ratio 2.0 1

Table 4.14: Consensus List for Derived Features

Return minus Risk
Manager Tenure

Using the derived feature of Return minus Risk resulted in improved

classification versus the Regular Dataset. We observed that when this feature was present

in the feature vector, C4.5 and Logit used substantially fewer features for classification.

The derived features of Reversed Weight Return minus Risk and the Treynor Index were

not selected for classification by Logit and were seldom used by C4.5.

The interesting result of the second phase was the disappointing performance of

LDA with the derived features. We concluded that the derived features either introduced

an unacceptable amount of noise or a degree of nonlinearity, or that LDA was affected by

high correlation among the features. Since the data violated this methodology's

requirements of multivariate normality and low multicollinearity, it was impossible to

determine the exact cause of this failure.

4.7 Phase 3 Comparing 5-Star and 3-Star Classification

4.7.1 Methodology

Phase 3 of this research tested the idea that we could improve classification error

rates, i.e., lower them, by combining the five Morningstar ratings into three using the

following scheme:

(1) Morningstar ratings 1-Star and 2-Star became new rating 1-Star,
(2) Morningstar rating 3-Star became new rating 2-Star, and
(3) Morningstar ratings 4-Star and 5-Star became new rating 3-Star.

We proposed this variation of the standard Morningstar rating system to increase the

number of examples at each end of the Morningstar scale. An examination of the data

from the previous phases showed that a higher error rate was occurring for 1-Star and 5-

Star Morningstar ratings than for 2-Star, 3-Star, or 4-Star. The 1-Star and 5-Star rating

classifications each represented 10% or less of the examples in the

datasets. Additionally, previously cited material showed that investors mostly purchased

4- and 5-Star mutual funds. Combining the individual ratings in this manner still

permitted segmenting the 4- and 5-Star funds from the funds with lower ratings.
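The mapping above can be sketched as a simple lookup (a minimal illustration of the rating-combination scheme):

```python
def combine_stars(star):
    """Collapse the five Morningstar star ratings into three classes:
    1 or 2 stars -> 1, 3 stars -> 2, 4 or 5 stars -> 3."""
    return {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}[star]
```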

Minitab was used to generate ten uniformly random training sets and testing sets

from the July 1994 Morningstar CD-ROM database. Thirty-two features were selected

for the classification problem and are listed in Table 4.15. The number of complete

examples extracted from the equity mutual funds database was 999, resulting in 666

training set examples and 333 testing set examples. We calculated a lower bound on the

required number of training examples. For ε = 0.30 and δ = 0.13 we required 669

training examples for the three-class experiment. With ε = 0.30 and δ = 0.1925 we

required 668 training examples for the five-class experiment.
