AN APPROACH TO SOFTWARE SYSTEM MODULARIZATION BASED ON DATA AND TYPE BINDINGS

By Roger M. Ogando

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1991

Copyright 1991 by Roger M. Ogando

ACKNOWLEDGMENTS

This work is dedicated to Ms. Olga E. Rivera and my parents and grandmothers. I would like to express my deepest appreciation to my chairman, Prof. Stephen Yau, former co-chairman Prof. Sying-Syang Liu, current co-chairman Prof. Stephen Thebaut, Prof. Randy Chow, Prof. Justin Graver, and Prof. Jack Elzinga for their guidance and invaluable insight throughout this research. I am particularly indebted to Prof. Liu, Prof. Norman Wilde of the University of West Florida, Prof. Yau, and Prof. Thebaut for their financial support, supportive counsel, and for dedicating long hours to discussions and reviews. I am also grateful to many fellow graduate and undergraduate students whose contributions in research and development made this thesis possible; in particular, I am grateful to my colleague, Abu-Bakr M. Taha, who made valuable contributions to this research project. In addition, my thanks go to many roommates, classmates, and people from all over the world who made an initially strange land feel like home. Finally, I would like to thank the PRA/OAS Fellowship for the additional financial support which they provided during my study.

TABLE OF CONTENTS

ACKNOWLEDGMENTS ... iii
LIST OF TABLES ... vi
LIST OF FIGURES ... vii
ABSTRACT ... ix

CHAPTERS

1 INTRODUCTION ... 1
2 BACKGROUND ... 6
  2.1 Design Recovery ... 6
    2.1.1 Clustering Approaches ... 7
    2.1.2 Program Slicing Approach ... 11
    2.1.3 Program Dependence Approach ... 12
    2.1.4 Knowledge-based Approach ... 13
  2.2 Complexity Metrics ... 13
    2.2.1 Modularity Metrics ... 14
    2.2.2 Zage's Design Metrics ... 19
    2.2.3 Cyclomatic Complexity ... 21
    2.2.4 Stability Metrics ... 22
3 THE PROPOSED APPROACH ... 24
  3.1 The Proposed Approach ... 25
  3.2 Applicability of the Algorithms ... 37
  3.3 Conditions for Best Results ... 38
4 TIME AND SPACE COMPLEXITY ANALYSIS ... 39
  4.1 Algorithm 1: Globals-based Object Finder Complexity ... 39
  4.2 Algorithm 2: Types-based Object Finder Complexity ... 41
5 EVALUATION OF THE APPROACH ... 44
  5.1 Goals of the Evaluation Studies ... 45
  5.2 Methodology of the Evaluation Studies ... 47
  5.3 Primitive Metrics of Complexity ... 48
    5.3.1 Definitions ... 49
    5.3.2 Intergroup Complexity Factors ... 57
    5.3.3 Intragroup Complexity Factors ... 61
    5.3.4 Example of the Factors ... 68
    5.3.5 Validation of the Factors ... 69
  5.4 The Test Cases: Identified Objects, Clusters and Groups ... 80
    5.4.1 Test Case 1: Name Cross-reference Program ... 80
    5.4.2 Test Case 2: Algebraic Expression Evaluation Program ... 83
    5.4.3 Test Case 3: Symbol Table Management for Acx ... 88
  5.5 Comparison of Complexity ... 94
    5.5.1 Test Case 1: Name Cross-reference Program ... 99
    5.5.2 Test Case 2: Algebraic Expression Evaluation Program ... 101
    5.5.3 Test Case 3: Symbol Table Management Program for Acx ... 103
    5.5.4 Summary and Conclusions of the Comparison ... 106
6 APPLICATIONS: DESIGN RECOVERY BASED ON THE APPROACH ... 110
7 A PROTOTYPE FOR THE PROPOSED APPROACH ... 115
  7.1 The Object Finder: A GNU Emacs-based Design Recovery Tool ... 115
  7.2 Design Goals of the Object Finder ... 117
  7.3 Design of the Object Finder ... 119
  7.4 Xobject: Graphical User Interface ... 122
8 EXPERIENCE WITH THE OBJECT FINDER ... 124
  8.1 Example of the Top-down Analysis: Name Cross-reference Program ... 124
  8.2 Comparison with C++ Classes: Algebraic Expression Evaluation Program ... 129
  8.3 Example of the Bottom-up Analysis: Name Cross-reference Program ... 136
9 CONCLUSIONS AND FURTHER STUDY ... 140

APPENDIX A: OBJECT FINDER PROTOTYPE USER'S MANUAL ... 145
  A.1 Introduction ... 145
  A.2 Operation ... 145
    A.2.1 Basic Setup Commands ... 146
    A.2.2 Object Finder Analysis ... 146
    A.2.3 Display Analysis Results and Identified Objects ... 147
  A.3 Buffer Structure and Files ... 151
    A.3.1 System Buffer Structure ... 151
    A.3.2 System Files ... 153

REFERENCES ... 155
BIOGRAPHICAL SKETCH ... 159

LIST OF TABLES

2.1 Information flows relation generated for the example ... 18
3.1 Complexity index function for types in the "C" programming language ... 34
5.1 Type size associated with several types ... 56
5.2 Type size associated with variables of different types in the original version of the recursive descent expression parser ... 70
5.3 Type size associated with variables of different types in version 1 of the recursive descent expression parser ... 72
5.4 Statistics of the test case programs ... 81
8.1 Statistics of the name cross-reference program ... 124
8.2 Statistics of the algebraic expression evaluation program ... 129
8.3 Comparison of candidate objects and "C++" classes ... 131

LIST OF FIGURES

2.1 An example of information flow ... 17
2.2 Possible skeleton code for the example ... 17
3.1 Tree representation of types for complexity index function ... 35
5.1 Schematic illustrations of access pairs and data bindings ... 55
5.2 Example of the primitive complexity metrics factors ... 68
5.3 Identified objects in original version of recursive descent expression parser ... 71
5.4 Identified objects in version 1 of recursive descent expression parser ... 73
5.5 Validation of the primitive complexity metrics factors ... 75
5.6 Groups based on objects identified in name cross-reference program ... 82
5.7 Clusters found in the name cross-reference program by basili ... 82
5.8 Groups based on clusters found in name cross-reference program ... 83
5.9 Groups based on types-based objects identified in algebraic expression evaluation program ... 84
5.10 Groups based on globals-based objects identified in algebraic expression evaluation program ... 86
5.11 Clusters found in algebraic expression evaluation program ... 86
5.12 Groups based on clusters found in algebraic expression evaluation program ... 87
5.13 Types-based candidate objects identified in symbol table management program ... 89
5.14 Globals-based candidate objects identified in symbol table management program ... 91
5.15 Groups based on types-based objects identified in symbol table management program ... 92
5.16 Groups based on globals-based objects identified in symbol table management program ... 94
5.17 Clusters found in symbol table management program ... 95
5.18 Groups based on clusters found in symbol table management program ... 96
6.1 The object finder system flow ... 111
7.1 The object finder conceptual model ... 116
7.2 The object finder implementation outline ... 118
7.3 A scanner of tokens ... 121
7.4 Process to control the ANSI "C" cross-reference tool acx ... 121
7.5 Xobject commands ... 122
8.1 Candidate objects identified in the name cross-reference program ... 126
8.2 Candidate objects identified in the name cross-reference program displayed by xobject ... 128
8.3 Modified candidate objects in the name cross-reference program displayed by xobject ... 130
8.4 Types-based candidate objects identified in the algebraic expression evaluation program ... 132
8.5 Globals-based candidate objects identified in the algebraic expression evaluation program ... 133
8.6 Candidate objects identified in algebraic expression evaluation program displayed by xobject ... 135
8.7 Initial candidate object defined by the user in the name cross-reference program ... 137
8.8 Extended candidate object resulting from the data-routine analysis ... 137
8.9 Final candidate object ... 138

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

AN APPROACH TO SOFTWARE SYSTEM MODULARIZATION BASED ON DATA AND TYPE BINDINGS

By Roger M. Ogando
August 1991
Chairman: Stephen S. Yau
Major Department: Computer and Information Sciences

The maintenance of software systems usually begins with considerable effort spent in understanding their system structures.
A system modularization defines a structure of the system in terms of the grouping of routines into modules within the system. This dissertation presents an approach for obtaining a modularization of a program based on the object-like features found in the program. While object-oriented methodologies for software design and development have only been clearly enunciated in the last few years, many object-like features, such as data grouping, abstract data types, and inheritance, have been in use for some time. In this dissertation, methodologies which aid in the recovery of the object-like features of a program written in a non-object-oriented programming language are explained. Two complementary methods for "object" identification are proposed which focus on data bindings and on type bindings in a program. The proposed approach looks for clusters of data, structures, and routines that are analogous to the objects and object classes of object-oriented programming. The object finder is an interactive tool that combines the two methods while using human input to guide the object identification process. The experience of using the object finder and two evaluation studies of the object identification methods are also presented.

CHAPTER 1
INTRODUCTION

The maintenance of software systems usually begins with considerable effort spent in understanding their system structures and data. A system modularization defines a structure of the system in terms of the grouping of routines into modules within the system. This dissertation presents an approach for obtaining a modularization of a program based on the object-like features found in the program. In modifying existing software, professional maintainers are almost unanimous in identifying the understanding of system data as one of their greatest challenges. Successful maintenance requires precise knowledge of the data items in the system, the ways these items are created and modified, and their relationships.
Changing a program without a clear vision of its implicit data model is a very risky undertaking. There seems to be little work that has explicitly addressed the problem of data understanding during software maintenance. A number of methodologies attempt to aid human understanding of program constructs by cross-referencing, by capturing dependencies [9, 46], by program slicing [13, 43], or by the ripple-effect analysis of changes [50]. Other tools that have been proposed so far, such as those described by Ambras and O'Day [1], Biggerstaff [4], Kaiser et al. [16], Rich and Waters [32], and Yau and Liu [49], use knowledge-based approaches to provide inference capabilities. As a result, a user can derive additional information that may not be explicit in the program code. Technologists in software design have made great progress in abstracting the ways computer programs use data. Most notable, perhaps, has been the emergence of the concept of object-oriented design and development. Booch defines an object as "an entity whose behavior is characterized by the actions it suffers and that it requires of other objects" [5]. In practice, most objects are collections of data, together with the methods needed to access and manipulate those data. Although object-oriented programming constructs are not directly supported in conventional programming languages such as "C" and "Ada," several object-like features, such as groupings of related data, abstract data types, and inheritance, have been in use for some time and may occur in an existing program. If such software needs to be maintained, it would be highly advantageous to identify the object-like features in the system. Knowledge of such "objects" would be important to:

1. Understand the system's design more precisely.

2. Facilitate reuse of existing "methods" contained in the system.

3. Avoid degrading the existing design by introducing unnecessary references to data that should be private to a given class of "objects."
4. Reengineer the system from a conventional programming language (such as "C") into an object-oriented language (such as "C++") to facilitate future enhancements.

We identified two important factors necessary for characterizing objects: global or persistent data [18] and the types of formal parameters and return values. Each factor in turn gives rise to an algorithm for object identification. This dissertation presents these two algorithms for identifying object-like features in existing source code. One focuses on the data bindings between program components and the other on the type bindings between program components. The two algorithms, as well as their implementations, are collectively known as the object finder. The Globals-based Object Finder algorithm uses the information provided by persistent data structures to identify the objects in a program. Data bindings between system components have been previously used as the basis for clustering. In hierarchical clustering [14], for example, the elements chosen for grouping are the ones with the smallest "dissimilarity," that is to say, with the highest number of data bindings between elements. Our approach for globals-based object identification provides similar capabilities by grouping those routines which access a common global variable into "highly connected" objects. This algorithm handles procedural programming languages, such as "Ada," "C," "COBOL," "Pascal," or "FORTRAN," that provide scoping mechanisms allowing the definition of global variables and side effects on those global variables. The Types-based Object Finder algorithm uses type binding as the basis for grouping routines into objects. This algorithm groups the routines according to the types of data used for the return value and formal parameters of routines. Types-based object identification considers the "semantic" information provided by the types for the clustering.
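The two identification criteria just described can be sketched roughly as follows. This is an illustrative sketch only, not the object finder's actual implementation; all routine, variable, and type names here are hypothetical.

```python
from collections import defaultdict

def globals_based_objects(accesses):
    """accesses: routine -> set of global variables it reads or writes.
    Merges routines that share a global into one candidate object,
    returned as (set_of_routines, set_of_globals) pairs."""
    groups = [({r}, set(gs)) for r, gs in accesses.items()]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if groups[i][1] & groups[j][1]:  # a shared global binds them
                    groups[i] = (groups[i][0] | groups[j][0],
                                 groups[i][1] | groups[j][1])
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups

# Assumed set of built-in types for the sketch; user-constructed types
# are the ones that anchor candidate objects.
BUILTINS = {"int", "char", "float", "double", "void"}

def types_based_objects(signatures):
    """signatures: routine -> (return_type, [parameter_types]).
    Binds each routine to the user-constructed types in its signature."""
    objects = defaultdict(set)
    for routine, (ret, params) in signatures.items():
        for t in [ret] + params:
            if t not in BUILTINS:
                objects[t].add(routine)
    return dict(objects)

# push/pop share the globals "stack" and "top"; lookup stands alone.
g_objs = globals_based_objects({
    "push": {"stack", "top"},
    "pop": {"stack", "top"},
    "lookup": {"table"},
})
# st_insert/st_lookup are bound to the user-defined type SymbolTable.
t_objs = types_based_objects({
    "st_insert": ("void", ["SymbolTable", "char"]),
    "st_lookup": ("Symbol", ["SymbolTable", "char"]),
})
```

In the types-based case the grouping key is the type name appearing in a routine's signature, which is what lets an abstract data type surface even when its operations share no global data.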
This is different from other clustering techniques based on semantic information, such as conceptual clustering [35], which considers "light semantic" profiles about a system (from detailed cross-reference information) during clustering. In the latter, a concept tree represents the common features of the members of a group, such as the names that a software unit uses ("names-used") and the names of the places where it is used ("user-names"). This algorithm handles those procedural programming languages which provide explicit type construction mechanisms. This dissertation also presents the experience of using the object finder algorithms and an evaluation of those algorithms through careful examination of the results. Two studies were performed to evaluate the object identification algorithms. In study I, the evaluation consisted of comparing the groups (identified objects) found using the object finder with the groups (clusters) found using hierarchical clustering [14]. Study II compared the identified objects found in a program with the object-oriented programming classes found in the object-oriented version of the program. The comparison in study I was based on the complexity of the two partitionings resulting from each clustering technique. The measure of the complexity of the partitionings is similar to coupling and cohesion [6]. Thus, we defined a new set of complexity metrics factors to measure the complexity of a given group in a partitioning: intergroup complexity, which measures the complexity of the interface of a group, and intragroup complexity, which measures the internal complexity of a group. For this evaluation, we instrumented the program with access pairs and data binding triples in order to measure those metrics. These newly defined complexity metrics factors were validated, and the results are reported in this work. The comparison results are also reported in this dissertation.
In study II, a "C" program, translated from a "C++" program, is used to compare the objects identified in the "C" program with the classes found in the "C++" version of the program. This example also illustrates some of the design recovery capabilities of the object finder when abstracting the underlying structure of a system. The remainder of the dissertation is organized as follows: Chapter 2 provides a brief overview of design recovery and complexity metrics. Chapter 3 introduces the new approach, by way of several examples, for modularizing an existing program by identifying the object-like features found in the program. Chapter 4 discusses the time and space complexity of the new approach. Chapter 5 presents the evaluation results of the proposed approach, using complexity metrics, and discusses the new set of complexity metrics factors and their validation. Chapter 6 discusses an application of the proposed approach in software systems design recovery and introduces a new approach for design recovery, either top-down or bottom-up. Chapter 7 outlines the implementation of a prototype that demonstrates the flexibility and portability of the design. Chapter 8 discusses the experience of using the approach to modularize and recover design information from several programs. The most important limitation of the proposed approach is that it is currently restricted to two criteria of object identification. Many other criteria are suggested for future study, including object-oriented principles such as the classification of the objects in the application, the organization of the objects, and the identification of operations on the objects. More experimentation is also needed to further evaluate the object identification approach, and other metrics factors should be added to measure the "object-orientedness" of the recovered design. The conclusions and future study are discussed in Chapter 9.
CHAPTER 2
BACKGROUND

This chapter summarizes some relevant background information on design recovery and complexity metrics.

2.1 Design Recovery

Design recovery is defined by Biggerstaff [4] as recreating design abstractions from a combination of code, existing design documentation (if available), personal experience, and general knowledge about the problem and application domain. Design recovery must generate all of the information required for a person to understand the functionality of a program, how it accomplishes its responsibilities, and why it performs that functionality. Design recovery is common and critical throughout the software life cycle. The developer of new software usually has to spend a great deal of time trying to understand the structure of similar systems and system components. The software maintainer spends most of his or her time studying a system's structure to understand the nature and effect of a requested change. Without fully understanding the similar systems, the developer may not harvest the reusable components and reusable designs. Without fully understanding the program to be maintained, the maintainer may conduct inefficient or incorrect program modification. Without automated techniques, design recovery is difficult and time consuming because the source code does not usually contain much of the original design information, changes are not adequately documented, there is no change stability, and there are ripple effects when making changes. Furthermore, large-scale software worsens these difficulties. Thus, automated support for the understanding process is very desirable. In the following sections we survey the current approaches to design recovery.

2.1.1 Clustering Approaches

The main characteristics of the clustering approach are the structural analysis of large software systems via adequate clustering techniques and the retrieval of high-level structural information from the existing code [23].
The goal is to group routines into modules within the system so as to reflect the modules defined by the developer. Clustering is the analysis of the interface between the components of a system. It helps to determine the modularization that those interfaces define. The modules defined by this analysis are called clusters. Clustering techniques with data bindings have been well studied, but the technique which involves the type bindings approach has only recently been published. Our object identification approach falls under the latter of these clustering techniques. In the following two sections, the data binding approach and the type binding approach are examined.

2.1.1.1 Data binding

In this section, we explain in detail the clustering techniques, based on data bindings, used in deriving a system's clusters. Later, we use these clusters to compare the grouping of the routines derived by the object finder with the groups derived by these clustering techniques. Data binding [2, 41, 14] reflects the interaction among the components of a system; it has been previously used for module interaction metrics [2]. In their work, Hutchens and Basili [14] use data bindings to measure the interface between components of a system and to derive system clusters. For example, assume there are two procedures, p1 and p2, and a variable x in a program. When procedures p1 and p2 and the variable x are in the same static scope, whether the procedures access the variable or not, this is called a potential data binding, denoted as (p1, x, p2), because it reflects the possibility of data interaction between the two procedures. If both procedures access variable x, then there exists a used data binding. More work is required to calculate the used data binding than the potential data binding, but the used data binding better reflects a similarity between p1 and p2.
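The two binding levels just defined can be computed directly from scope and access information. The sketch below is illustrative (the variable and procedure names are hypothetical, and real tools would extract the scope and use sets from source code):

```python
from itertools import combinations

def data_bindings(scope, uses):
    """scope: variable -> set of procedures in its static scope.
    uses: variable -> set of procedures that actually access it.
    Returns (potential, used) as sets of (p1, x, p2) triples, p1 < p2."""
    potential, used = set(), set()
    for x, procs in scope.items():
        for p1, p2 in combinations(sorted(procs), 2):
            # Both procedures share x's scope: a potential binding.
            potential.add((p1, x, p2))
            # Both actually access x: a used binding as well.
            if p1 in uses.get(x, set()) and p2 in uses.get(x, set()):
                used.add((p1, x, p2))
    return potential, used

# x is in scope for p1, p2, p3, but only p1 and p2 access it.
potential, used = data_bindings(
    scope={"x": {"p1", "p2", "p3"}},
    uses={"x": {"p1", "p2"}},
)
```

The potential set here has three triples while the used set has only (p1, x, p2), mirroring the extra computation (and extra precision) of the used level.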
If procedure p1 assigns a value to variable x and procedure p2 references x, this reflects a flow of information from p1 to p2 via the variable x. This is called an actual data binding, and it is difficult to calculate since a distinction between reference and assignment must be maintained. Besides these bindings, control-flow data binding is another kind of data binding, but it requires considerably more computation effort than actual data binding, because static data flow analysis must be performed. After the data binding information is available, the system components can be grouped based on the strength of their relationships with each other [3]. This grouping is derived by specialized clustering techniques. The use of data bindings in clustering analysis provides meaningful pictures of high-level system interactions and even system "fingerprints," which are descriptive analogies to galactic star systems. Next, we summarize Hutchens and Basili's hierarchical clustering modularization approach [14] and the current implementation [21] of their approach. The nature of Hutchens and Basili's clustering algorithms is bottom-up, or agglomerative, since they iteratively create larger and larger groups until the elements have coalesced into a single cluster. Thus, this is a module composition technique. The elements chosen for grouping are the ones with the smallest dissimilarity. There are several methods for computing the dissimilarity [14]; the clustering algorithms used by Hutchens and Basili correspond to the "single-link" algorithm [14], which takes the smallest dissimilarity between the elements of each pair of newly formed clusters as the new coefficient between them. The first step in hierarchical clustering is to abstract the data to obtain a binding matrix, which represents the number of data bindings between any two components of the system. This matrix is symmetric. The current implementation considers all levels of data binding up to actual data bindings.
The second step is to obtain a dissimilarity matrix using one of two alternative methods:

1. Recomputed Bindings: based on the percentage of the bindings that connect to either of two components of the system. The dissimilarity matrix p is defined by

    d(i,j) = p(i,j) = (sum_i + sum_j - 2b(i,j)) / (sum_i + sum_j - b(i,j))

where sum_i is the number of data bindings in which component i occurs, sum_j is the number of data bindings in which component j occurs, and b(i,j) is the number of data bindings between i and j. In this case, p(i,j) is the probability that "a random data binding chosen from the union of all bindings associated with i or j is not in the intersection of all bindings associated with i or j" [14].

2. Expected Bindings: a weight is assigned to each binding level relative to the total number of elements under consideration in a given iteration. The dissimilarity matrix is defined by

    d(i,j) = (k / (n - 1)) / bind(i,j)

where n is the number of elements under consideration and k is the number of bindings involving either element i or element j. Thus, one would expect k/(n - 1) of the bindings to be between i and j.

The current implementation of this clustering technique uses expected bindings to compute the dissimilarity matrix. During the third step, new clusters are formed by grouping together those elements whose dissimilarity is the smallest. Then a new binding matrix is obtained from the clusters, and the process starts again from this new binding matrix. Hutchens and Basili summarize the problem of using data bindings alone for modularization: "whenever a module that defines an abstract data type and has no local data that is shared among the operations on the type, [the module] will not be located using this method" [14]. They explain that this is because there is no direct data binding between the operations of the module, and all the interactions are indirect, through the procedures that use the abstraction.
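The recomputed-bindings dissimilarity and the single-link selection of the closest pair can be sketched as follows. The binding matrix values are invented for illustration; this is not the implementation described in [14] or [21].

```python
def dissimilarity(b):
    """b: symmetric matrix, b[i][j] = number of data bindings between
    components i and j. Applies the recomputed-bindings formula:
    d(i,j) = (sum_i + sum_j - 2*b(i,j)) / (sum_i + sum_j - b(i,j))."""
    n = len(b)
    sums = [sum(row) for row in b]  # sum_i: bindings in which i occurs
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                denom = sums[i] + sums[j] - b[i][j]
                # Components with no bindings at all are maximally dissimilar.
                d[i][j] = ((sums[i] + sums[j] - 2 * b[i][j]) / denom
                           if denom else 1.0)
    return d

# Three components: 0 and 1 share many bindings, 2 is loosely connected.
b = [[0, 4, 1],
     [4, 0, 1],
     [1, 1, 0]]
d = dissimilarity(b)

# One single-link step: merge the pair with the smallest dissimilarity.
pair = min(((i, j) for i in range(3) for j in range(i + 1, 3)),
           key=lambda ij: d[ij[0]][ij[1]])
```

With these numbers, d(0,1) = (5 + 5 - 8)/(5 + 5 - 4) = 1/3, while d(0,2) = d(1,2) = 5/6, so components 0 and 1 are merged first.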
The algorithms for object identification in this dissertation present a solution to this problem that consists of using data types and establishing "relationships" between such types and the routines that use them for formal parameters or return values. Then the abstract data type is revealed from the hiding imposed by the abstraction mechanisms.

2.1.1.2 Type binding

Type binding methodology [22] is the most recently published design recovery technique. It analyzes conventional procedural programs in the context of an object-oriented paradigm. Due to the gradual software paradigm progression from a purely procedural approach to an object-based approach and now to the object-oriented approach, object-like features, such as data grouping, abstract data types, and inheritance, already exist in conventional programming languages such as "C" and "Ada." It is very likely that object-oriented concepts have been used in existing programs for some time. Type binding methodology consists of identifying the objects in conventional procedural programs in order to recover the underlying system structures [22]. The focus of this dissertation, the object finder, is a methodology that combines both the data binding approach and the type binding approach in modularizing a software system. The object finder is somewhat analogous to other software clustering methods, but it is unique in searching for a particular kind of cluster which is similar to an abstract data type or object and cannot be clustered by a data binding methodology.

2.1.2 Program Slicing Approach

The concept of program slicing was originally introduced by Mark Weiser [43]. Weiser defines a slicing criterion as a pair (p, V), where p is a program point and V is a subset of the program's variables. In his work, a slice of a program consists of all statements and predicates of the program that might affect the values of the variables in V at point p. Weiser's work has been improved by Horwitz et al.
[13] to address the problem of interprocedural slicing: generating a slice of an entire program, where the slice crosses the boundaries of procedure calls. The program slicing technique can help a programmer understand complicated source code by isolating individual computation threads within a program. It can also aid in debugging and be used for automatic parallelization. Program slicing is also used for automatically integrating program variants; in this case, slices are used to compute a safe approximation to the change in behavior between a program P and a modified version of P, and to help determine whether two different modifications to P interfere [13]. The main drawback of program slicing in recovering high-level information is that this approach is oriented towards low-level information abstraction. This makes it difficult to understand high-level system interaction and the overall system structure.

2.1.3 Program Dependence Approach

Program dependencies arise as the result of two separate effects. First, a dependence exists between two statements whenever a variable appearing in one of the statements may have an incorrect value if the statements were reversed. For example, given

    S1: A = B + C
    S2: D = A + E

S2 depends on S1, since executing S2 before S1 would result in S2 using an incorrect value for A. Dependencies of this type are called data dependencies. Second, a dependence exists between a statement and the predicate whose value immediately controls the execution of that statement. In the sequence

    S1: if (A) then
    S2:     B = C + D
        endif

S2 depends on predicate A, since the value of A determines whether S2 is executed. Dependencies of this type are called control dependencies. Program dependence analysis can be used in program slicing and in identifying reusable components for software maintenance [46]. Capturing the program dependencies can help program understanding to aid modification.
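The data-dependence rule above can be sketched as a single pass over a straight-line statement list. This is an illustrative sketch on a simplified program representation (one defined variable per statement, no control flow), not a full dependence analyzer:

```python
def data_dependencies(stmts):
    """stmts: ordered list of (label, defined_var, set_of_used_vars).
    Returns the set of (earlier, later) label pairs where the later
    statement uses a variable most recently defined by the earlier one."""
    deps = set()
    last_def = {}  # variable -> label of its most recent definition
    for label, defined, used in stmts:
        for v in used:
            if v in last_def:
                deps.add((last_def[v], label))  # reversing them would be wrong
        last_def[defined] = label
    return deps

# The example from the text: S1: A = B + C;  S2: D = A + E
deps = data_dependencies([
    ("S1", "A", {"B", "C"}),
    ("S2", "D", {"A", "E"}),
])
```

Here the only dependence found is S2 on S1, through the variable A, matching the discussion above.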
On the other hand, this approach is limited to specific components of a program and does not facilitate high-level understanding of a program. However, one significant application of program dependence knowledge is in the detection of parallelism and in code optimization [9].

2.1.4 Knowledge-based Approach

Knowledge-based approaches [49, 1, 16, 32, 4] are very sophisticated and complicated general methodologies of design recovery. Their most distinguishing property, as far as program understanding is concerned, is that instead of just using the existing code, they use all the program information, including the source code, the documentation, the execution histories, program analysis results, etc. The program information is expressed as patterns of program structures, problem domain structures, language structures, naming conventions, and so forth. It is stored in a central knowledge base which provides frameworks for the interpretation of the code. A knowledge-based system is not a single tool, such as an editor or debugger. It is a collection of tools sharing a common knowledge base. It holds the promise of providing programmers with a next-generation programming environment with the goal of dramatically improving the quality and productivity of software development [1]. However, knowledge-based approaches are still at the research stage. We consider that a more constrained tool, such as the object finder, provides immediate, more accurate knowledge about the high-level understanding of a program, directly from its source code.

2.2 Complexity Metrics

This section presents the complexity metrics considered for the evaluation of the object finder. We also explain the rationale for developing a new set of metrics to be used in this evaluation. Software complexity is defined by Ramamoorthy [30] as "the degree of difficulty in analysis, testing, design, and implementation."
However, our notion of software complexity is closer to the structure complexity metrics [15], which view the program as a component of a larger system and focus on the interconnections of the system components. Software metrics are classified as either design or implementation (code) metrics; design metrics measure the complexity of a design, whereas implementation (code) metrics measure that of an implementation. The metrics chosen include design metrics, such as the metrics of modularity [27] (cohesion and coupling), and implementation (code) metrics, such as the cyclomatic complexity [25] of the program. Furthermore, we include a design-implementation metric: the stability of programs [48]. In addition, we considered code metrics, which focus on the individual system components (procedures and modules) and require a detailed knowledge of their internal mechanisms, such as McCabe's cyclomatic complexity number [25]. We also consider, as indicated above, structure metrics, including Henry and Kafura's Information Flow metrics [11] and Yau and Collofello's Logical Stability metric [48]. Each of these metrics is measured through different characteristics of the program; that is to say, by evaluating those characteristics we may infer the degree of the metric. In the following sections we explain several of these metrics and the rationale for discarding each metric.

2.2.1 Modularity Metrics

Modularity metrics were initially defined by Myers [27]. They include two related aspects: the cohesion (strength or binding) and the coupling of modules. Coupling is an indication of the level of interconnections between modules. A software component is said to exhibit a high degree of cohesion if the elements in that unit exhibit a high degree of "functional relatedness" [39]; that is to say, they exhibit functional unity [8].
The coupling among "modules" (either procedures/functions in a non-object-oriented language or objects in the object finder) is measured by "structural fan-in/fan-out" degrees [6] and by "informational fan-in/fan-out" [11]. Since the data flow count includes procedure calls, informational fan-in/fan-out subsumes structural fan-in/fan-out [39]. Information flow concepts [11] are used to measure the coupling between modules in a software system. These measures focus on the interface which connects system components. Myers [27] established six categories of coupling based on the data relationships among the modules. The information flow metrics can recognize two of these categories: content coupling (direct references between the modules) and common coupling (the sharing of a global data structure). The information flow metrics are also used to measure the procedure and module complexity of a software system. A measure of the "strength of the connections from module A to module B" [11] is:

    (PEI(A) + PII(B)) × IP(A, B)

where PEI(A) is the number of procedures exporting information from module A, PII(B) is the number of procedures importing information into module B, and IP(A, B) is the number of information paths between the procedures. Thus, the resulting metric is a matrix of the coupling between any two modules in the software system. The information needed to calculate this measure of coupling follows. The information flow between modules depends on the information flow between the procedures which are part of the module. A module is defined with respect to a data structure D in the program and consists of those procedures which either directly update D or directly retrieve information from D. Thus, examination of the global flow in each module reveals the number of procedures in each module and all possible interconnections between the module procedures and the data structure.
There are several kinds of information flow between modules:

Definition 1. There is a global flow of information from module A to module B through a global data structure D if A deposits (updates) information into D and B retrieves information from D.

Definition 2. There is a local flow of information from module A to module B if one or more of the following conditions hold: 1. A calls B; 2. B calls A and A returns a value to B, which B subsequently utilizes; or 3. C calls both A and B, passing an output value from A to B.

Definition 3. There is a direct local flow of information from module A to module B if condition (1) of Definition 2 holds for a local flow.

Definition 4. There is an indirect local flow of information from module A to module B if condition (2) or condition (3) of Definition 2 holds for a local flow.

Some examples of these information flows are illustrated in Figure 2.1. Figure 2.1 shows a simple system consisting of six modules (A, B, C, D, E, F), a data structure (DS), and the connections among these components. A possible skeleton code for this system is shown in Figure 2.2. As indicated in the skeleton code, module A retrieves information from DS and then calls B passing a parameter; module B then updates DS. C calls D, passing a parameter. Module D calls E with a parameter, and E returns a value which D then uses and passes to F. The function of F is to update DS.

[Figure 2.1. An example of information flow]

type_ds DS;
void A()                            { type_ds x; x = DS + 1; B(x); }
void B(type_ds x)                   { DS = x; }
void C(type_ds p)                   { D(p); }
void D(type_ds p)                   { type_ds q; q = E(p); F(p, q); }
type_ds E(type_ds p)                { return (p + 3); }
void F(type_ds p, type_ds q)        { DS = p + q; }

Figure 2.2. Possible skeleton code for the example

Table 2.1. Information flows relation generated for the example.
Direct local flows:    A → B,  C → D,  D → E,  D → F
Indirect local flows:  E → D,  E → F
Global flows:          B → A,  F → A

The information flow analysis process [11] of a program consists of deriving the complete flow structure using a procedure-by-procedure analysis of the program. The information flows for this example are summarized in Table 2.1. The direct local flows are simply due to parameter passing. The indirect local flows are due to "side-effect" relationships between modules. The first indirect flow, from E to D, results when E returns a value (q) which D "uses" in its computation. The second indirect flow, from E to F, results when the information (q) that D receives from E is passed unchanged to F. Finally, the global flows are due to information passing through the global data structure DS.

The next step is to compute the complexity of a module. The complexity of a module is the sum of the complexities of the procedures within the module. The complexity of a procedure depends on two factors: the complexity of the procedure code and the complexity of the procedure's connections to its environment. A very simple length measure was used as an index of procedure code complexity [11]: the number of lines of text in the source code of the procedure (including embedded comments but not those preceding the procedure statement). The connections of a procedure to its environment are determined by the (informational) fan-in and fan-out of the procedure, defined as follows:

Definition 5. The fan-in of procedure A is the number of local flows into procedure A plus the number of data structures from which procedure A retrieves information.

Definition 6. The fan-out of procedure A is the number of local flows from procedure A plus the number of data structures which procedure A updates.

However, in order to compute the complexity of the procedures which are included in a module, one should only consider the local flows for the data structures associated with the module.
Finally, the complexity of a procedure is:

    length × (fan-in × fan-out)²

The term (fan-in × fan-out) represents the total possible number of combinations of an input source to an output destination. In conclusion, the complexity of a procedure contained in a specific module is computed using only the local flows for the data structure associated with that module. The complexity of a module is computed as the sum of the complexities of the procedures within the module. We use these complexity metrics concepts in defining a new set of complexity measures to evaluate the object finder.

2.2.2 Zage's Design Metrics

In the case of a structured design, Zage [51] developed a design quality metric D(G) of the form

    D(G) = k1(De) + k2(Di)

In this equation, k1 and k2 are constants, and De and Di are, respectively, an external and an internal design quality component. De considers a module's external relationships to other modules in the software system, whereas Di considers factors related to the internal structure. The calculation of De and Di is performed in two different stages of software design. De is based on information available during architectural design, whereas Di is calculated after the detailed design is completed. De is calculated for each module of a system and is comprised of two terms: one product related to the amount of data flowing through the module and another product giving the number of paths through the module:

    De = (Weighted Inflows × Weighted Outflows) + (Fan-In × Fan-Out)

De appears to highlight stress points in architectural design. By redesigning these points, lower values of De were obtained [51], which, in effect, says that a reduction of De means a reduction of the coupling between modules.
The internal design metrics component Di is calculated as follows:

    Di = w1(CC) + w2(DSM) + w3(I/O)

where CC (Central Calls) are procedure or function invocations, DSM (Data Structure Manipulations) are references to complex data types, I/O (Input/Output) are external device accesses, and w1, w2, and w3 are weighting factors. The use of these three measures (CC, DSM, and I/O) is due to the desire to "choose a small set of measures which would be easy to collect and which would identify stress points to capture the essence of complexity" [51]. Their results indicate that stressing the data structure manipulation usage within modules gives excellent results as a predictor of error-prone modules. The proposed values of the weighting factors are w1 = 1, w2 = 2.5, and w3 = 1. Di was also better than cyclomatic complexity v(G) and lines of code (LOC) as a predictor of error-prone modules [51]. These design metrics were used as guidelines for the development of our new set of complexity metrics.

2.2.3 Cyclomatic Complexity

The computation of the cyclomatic complexity metric [25] of a program is based on the number of connected components, the number of edges, and the number of nodes in the program control graph. In practice, for structured programs this metric consists of the number of predicates in a program plus one [25]. Conditionals are treated as contributing one for each predicate; thus we need to add two whenever there is a logical "and" and N − 1 whenever there is a case statement with N cases. According to Shooman [37], McCabe has validated this metric of cyclomatic complexity. McCabe concluded that, from a set of programs, those with a complex graph and a large v(G) are often the trouble-prone programs. The cyclomatic complexity of a program is derived by computing the number of conditions in each component (function).
That is, McCabe's cyclomatic number for a collection of strongly connected graphs is determined from the union of those strongly connected graphs. The cyclomatic number of this union is computed to determine the cyclomatic number of the complete program. The measure of complexity provided by this metric focuses on individual system components. The complexity metrics which we consider should focus on the structure of the system as well as on the high-level design information as criteria to determine the complexity of a modularization.

2.2.4 Stability Metrics

The program stability metrics [48] are based on the stability of a module, defined as a measure of the resistance to the potential ripple effect that a modification of the module would have on other modules in the program. There are two sides to these metrics: the logical stability of a module, defined in terms of logical considerations, and the performance stability of a module, defined in terms of performance considerations. In our metrics study, we concentrate on the logical stability aspect of programs. The logical stability of a program is defined in terms of a primitive subset of the maintenance activity, such as a change to a single variable definition in a module. The incremental approach to computing the logical stability metrics [48] of a program begins by computing the sets of interface variables which are affected by primitive modifications to variable definitions in a module through intramodule change propagation. In addition, for each interface variable, it is necessary to compute the set of modules which are involved in intermodule change propagation as a consequence of affecting the variable. Then, for each variable definition, one must compute the set of modules which are involved in intermodule change propagation as a consequence of modifying the variable definition. Next, the individual complexity associated with each module is defined using McCabe's cyclomatic complexity.
The potential ripple effect of a module is the probability that a variable definition will be chosen for modification times the complexity associated with the modules affected by that variable definition. Finally, the logical stability of a program is the inverse of this potential ripple effect of a primitive modification to a variable definition in a module. We argue that our measure of complexity is based on similar grounds as the stability metrics. The stability metrics concentrate on a software system's opposition to the propagation of the effect of primitive modifications. The propagation occurs through the data transfers in the system. On the other hand, our metrics consider the relationships between components based on the links established by those data transfers.

CHAPTER 3
THE PROPOSED APPROACH

Our approach aids software maintenance by assisting the process of understanding the design of a software system from its source code. The approach's output is a modularization of a program that consists of the collection of "objects" found in the program. This modularization of a system, which usually has no (or little) existing high-level documentation, gives the maintainers an understanding of the structure of the system. Since the approach is based on the data and types of the system, it also assists the maintainers with understanding the system data. The proposed approach consists of a partial classification of the program elements (routines, types, and data items) that is meaningful in the context of the target program and its real-world domain. The information required for this classification consists of the relationships between those program elements in terms of data bindings and type bindings. Two methods of object identification are used. The first is based on global and persistent data and establishes links to the routines that manipulate such data.
The second method is based on data types and establishes relationships between such types and the routines that use them for formal parameters and return values. Both methods result in sets of identified candidate objects. The resulting candidate objects from both methods are not completely disjoint, since they represent both object classes and instances of objects in the more classical sense. In addition, this allows the methods to capture the intentional "violations" of the underlying design made by the designer/implementor. The candidate objects represent the structure of the program in terms of groups of routines implicit in the design and the relationships between those groups.

3.1 The Proposed Approach

The proposed approach consists of identifying the object-like features in conventional programming languages. Object-oriented constructs are not directly supported in conventional programming languages; however, several object-like features, such as groupings of related data and abstract data types, are found in those programming languages. The proposed approach identifies these groupings of data and abstract data types in terms of "objects." Most objects are collections of data, together with the methods needed to access and manipulate those data. An "object," in a conventional programming language, can be identified as a collection of routines, types, and/or data items. The routines implement the methods associated with the object, the types structure the data they conceal or process, and the data items represent or point to actual instances of the object class. Thus, we may characterize our candidate "objects" as tuples of three sets:

    Candidate Object = (F, T, D)

where F is a set of routines, T is a set of types, and D is a set of data items. Any of these sets may be empty; ideally, sets from distinct objects will not overlap, so that a routine, type, or data item should not appear in more than one object.
A program contains routines, types, and data items. The proposed approach consists of a partial classification of these elements that is meaningful in the context of the target program and its real-world domain. A large part of the information for this classification can come from analyzing the relationships between the components of the program, but human intervention or very carefully chosen heuristics will be needed to remove coincidental and meaningless relationships. The identified candidate objects are not completely disjoint; there is some intentional fuzziness in the definition of candidate objects. It is often the case in a real program that the original implementor (or a later maintainer!) has violated the cleanness of the underlying design in a few instances, either from laziness or to gain efficiency. It would unnecessarily reduce the usefulness of object finding to reject out of hand any candidates that had small overlaps or violations of good information hiding practice. Furthermore, the definition given does not distinguish clearly between the concept of an object class and the concept of an object. As a practical matter, in some cases it may be easier to first find the class and then its instances, and in other cases to reverse this procedure. Thus, it is more convenient to treat the two together. Two broad methods of object finding seem to be useful. The first is based on global and persistent (e.g., static in "C") data and establishes links to the routines that manipulate such data. The second is based on data types and establishes relationships between such types and the routines that use them for formal parameters or return values. Without loss of generality, we assume that all identifiers, such as routines, variables, and types, are distinguishable by their names. The first method of object identification is given in Algorithm 1.
Algorithm 1. Globals-based Object Finder

Input: A program in a conventional programming language, such as Ada, COBOL, C, or FORTRAN, with scoping mechanisms.
Output: A collection of candidate objects.
Steps:
1. For each global variable x (i.e., a variable shared by at least two routines), let P(x) be the set of routines which directly use x.
2. Considering each P(x) as a node, construct a graph G = (V, E) such that:
   V = {P(x) | x is shared by at least two routines}
   E = {(P(x1), P(x2)) | P(x1) ∩ P(x2) ≠ ∅}
3. Construct a candidate object (F, T, D) from each strongly connected component (v, e) in G, where
   F = ∪{P(x) | P(x) ∈ v}
   T = ∅
   D = ∪{x | P(x) ∈ v}

Example 1. Globals-based Object Finder, Single Stack/Queue: To motivate this first method, take as an example a package in an Ada-like language for manipulating a single queue and a single stack of data with type Element. The package provides the interface routines so that the following three routines access a global data item STACK:

procedure Push_S (X : Element);     -- Push an element X to the Stack.
function Pop_S return Element;      -- Pop the top element from the Stack.
function IsEmpty_S return Boolean;  -- Return true if the Stack is empty.

and the following three routines access a global data item QUEUE:

procedure Push_Q (X : Element);     -- Push an element X to the Queue.
function Pop_Q return Element;      -- Pop the front element from the Queue.
function IsEmpty_Q return Boolean;  -- Return true if the Queue is empty.

If there is no other direct relation between these two groups, then clearly Push_S, Pop_S, and IsEmpty_S belong to one candidate object and Push_Q, Pop_Q, and IsEmpty_Q belong to another.
The Globals-based Object Finder given above would easily identify these two objects as the following two tuples:

(F1, T1, D1) = ({Push_S, Pop_S, IsEmpty_S}, ∅, {STACK})
(F2, T2, D2) = ({Push_Q, Pop_Q, IsEmpty_Q}, ∅, {QUEUE})

However, in many cases this method may produce objects which are "too big," since any routine that uses global data from two objects creates a link between them. Thus, a further stage of refinement will likely be necessary, in which human intervention or heuristically guided search procedures improve the candidate objects by excluding offending routines or data items from the graph G. The Globals-based Object Finder could utilize information other than the accesses to global variables by routines; in particular, it could also use the information about references and definitions of local variables and formal parameters. A more detailed analysis of the internals of a routine would then be required to obtain that information, such as a semantic analysis to determine whether the references to local variables and formal parameters in a routine effectively access the same data item across invocations of the routine. The routine could then be made part of the group of routines that access that data. The kind of analysis needed to obtain this knowledge will be developed in future study beyond this dissertation. One suggestion is to use data flow analysis of the internals of routines to determine the "pattern" of accesses to local variables and formal parameters inside the routines. A pattern represents the uses and definitions of local variables and formal parameters in a routine, similar to the knowledge-based approaches in Section 2.1.4. These patterns could be used as criteria to group together those routines that exhibit similar patterns. For example, a pattern could be defined as the use of a local variable as an index into an array of elements.
If this pattern is identified in a routine that has an array representation of an address table and in another routine with an array representation of a symbol table, we argue that both routines should be part of the same table-indexing group. The second method of object identification is given in Algorithm 2.

Algorithm 2. Types-based Object Finder

Input: A program in a conventional programming language, such as Ada, COBOL, or C, with data type abstraction mechanisms.
Output: A collection of candidate objects.
Steps:
1. (Ordering) Define a topological order of all types in the program as follows:
   (a) If type x is used to define type y, then we say x is a "part of" y and y "contains" x, denoted by x < y.
   (b) x < x is true.
   (c) If x < y and y < x, then we say x is "equivalent" to y, denoted x = y.
   (d) If x < y and y < z, then x < z.
2. (Initial classification) Construct a relationship matrix R(F, T) in which rows are routines and columns are types of formal parameters and return values. Initially, all entries of R(F, T) are zeroes. An entry R(f, t) is set to 1 if type t is a "part of" the type of a formal parameter or of a return value of routine f.
3. (Classification readjustment) For each row f of the matrix R, mark R(f, t) as 0 if there exists any other type on the same row which "contains" type t and has been marked as 1.
4. (Grouping) Collect the routines into maximal groups based on sharing of types. Specifically, routines f1 and f2 are in the same group if there exists a type t such that R(f1, t) = R(f2, t) = 1.
5. Construct a candidate object (F, T, D) from each group, where
   F = {f | the routine f is a member of the group}
   T = {t | R(f, t) = 1 for some f in F}
   D = ∅

Again, in many cases the candidate objects created may be "too big." As can be seen in the following example, the culprit will often be a type that a human can easily identify as irrelevant to the objects being identified. Example 2.
Types-based Object Finder, Multiple Stacks/Queues: In this example, there are four basic groups of routines. The first group manipulates complex numbers. The second group is related to multiple instances of stacks of complex numbers. The third group is related to multiple instances of queues of complex numbers. The fourth group involves routines that manipulate both stacks and queues. The four groups are specified in algebraic form as follows:

Group I:
Construct (Real, Real) => Complex    -- construct a complex from two reals
"+" (Complex, Complex) => Complex    -- plus
"-" (Complex, Complex) => Complex    -- minus
"*" (Complex, Complex) => Complex    -- multiplication
"/" (Complex, Complex) => Complex    -- division

Group II:
Pop_S (Stack) => Stack x Complex     -- remove the top element from a stack and return it
Push_S (Stack, Complex) => Stack     -- push a complex number onto a stack
IsEmpty_S (Stack) => Boolean         -- return true if the stack is empty

Group III:
Pop_Q (Queue) => Queue x Complex     -- remove the head element from a queue and return it
Push_Q (Queue, Complex) => Queue     -- push a complex number onto a queue
IsEmpty_Q (Queue) => Boolean         -- return true if the queue is empty

Group IV:
Queue_to_Stack (Queue) => Stack      -- convert a queue to a stack
Stack_to_Queue (Stack) => Queue      -- convert a stack to a queue

In this example, no global variables are used. In applying Algorithm 2, we develop the following matrix R:

Group  Routine          Complex  Real  Stack  Queue  Boolean
I      Construct           1      0
I      "+"                 1
I      "-"                 1
I      "*"                 1
I      "/"                 1
II     Push_S              0             1
II     Pop_S               0             1
II     IsEmpty_S           0             1              1
III    Push_Q              0                    1
III    Pop_Q               0                    1
III    IsEmpty_Q           0                    1       1
IV     Queue_to_Stack      0             1      1
IV     Stack_to_Queue      0             1      1

Blank entries (which will be marked as 0's according to Algorithm 2) indicate no direct relationship between the routine and the type. An explicit "0" means a "part of" relationship has been found by examining the internal data structures of the program. In this example, at some point a Complex will be found to contain Real values, and the Stack and Queue will be found to contain Complex values.
The Types-based Object Finder will initially classify all the types and routines of groups II, III, and IV into a single large object because of the false links created by the Boolean type, which link the stack and the queue. Some heuristics could be used to reduce such conflicts (e.g., eliminate primitive types), but it would also be a fairly easy task for a user to intervene at this point and identify Complex, Stack, and Queue as the objects of interest, provided that the data can be presented to him clearly. There seems, however, to be no easy way to categorize group IV, which involves routines that operate on both Queue and Stack objects. A solution given by object-oriented design consists of a guideline which requires that, for a routine to be a member of a class, it must either access or modify data defined within the class [18]. Clearly, this indicates that a routine would be classified according to the types of its input formal parameters; then, routine Queue_to_Stack belongs to group III (the Queue candidate object) and routine Stack_to_Queue belongs to group II (the Stack candidate object). Excluding group IV and type Boolean, the objects identified by Algorithm 2 would be listed as follows:

(F1, T1, D1) = ({Construct, "+", "-", "*", "/"}, {Complex}, ∅)
(F2, T2, D2) = ({Push_S, Pop_S, IsEmpty_S}, {Stack}, ∅)
(F3, T3, D3) = ({Push_Q, Pop_Q, IsEmpty_Q}, {Queue}, ∅)

The topological ordering of types described previously, in Step 1 of Algorithm 2, is appropriate whenever all types in a program are related in terms of the "part of" relationship defined above. This ordering represents the fact that a given type is "part of" another type; thus, the latter type is "more important" than the former when classifying routines into candidate objects.
That is to say, a routine with formal parameters or return values of multiple types should be classified as part of the group of routines that manipulate data with the most important of those types, i.e., the type that was defined using the other types of the data manipulated by the routine. The main problem with this type ordering scheme occurs when some types are not related by the "part of" relationship. In this case, the topological ordering of types does not completely characterize the relative importance of all the types; thus, the classification of routines using this type ordering scheme results in routines which may be classified under more than one type, such as the routines in group IV of Example 2. This problem can be handled by an alternative type ordering scheme based on the "complexity" of the data types in a system. The relative complexity type ordering consists of ordering all the types in the program based on the "complexity" of the types. This complexity is expressed by a complexity index function, called CI, for a given type. Assume that the types in a system are represented using a tree that captures the "part of" relationship between types. Then, if type y is a "part of" type x, type x is an ancestor of type y and type y is a descendent of type x in the tree. An example of trees representing structure types is given in Figure 3.1 of Example 3. The complexity index of type t, denoted CI(t, 0), is computed using the complexity index function in Table 3.1. Given that the type of interest is the root of a tree representation of the type, the complexity index function recursively computes the complexity of the type as the sum of the path lengths (in number of arcs) from the root type to all its descendent types in the tree. In Table 3.1, the complexity added by a primitive type is simply the length of the path, denoted d, from the root type to the primitive type.

Table 3.1. Complexity index function for types in the "C" programming language.

type t                                            CI(t, d)
primitive                                         d
pointer to primitive or user-defined type y       d + (drf × CI(y, d + 1))
array of primitive or user-defined type y         d + (dimension(t) × CI(y, d + 1))
struct t {f1, f2, ..., fn}, where f1, ..., fn     d + Σ(j=1..n) CI(fj, d + 1)
  are all base field types
struct t {f1, ..., fk, ..., fn}, where            d + S + f × S, where
  f1, ..., fk are base field types and              S = Σ(j=1..k) CI(fj, d + 1)
  fk+1, ..., fn are recursive field types           and f = (n − k)/n

The complexity due to pointer types is modified by the dereferencing factor, called drf, which represents the pointer's complexity added to the complexity of the type; currently, the dereferencing factor may fluctuate between 1.5 and 2.0 depending on the system's use of pointer types. The complexity due to an array type t is modified by the number of elements in the array, i.e., its dimension(t). The complexity due to a structure type consists of the sum of the complexities of its base field types. The field types of a structure are of two categories: recursive field types are those which consist of either a pointer to the containing structure or essentially the same structure type as the containing structure (in "C", a typedef type); base field types are those which are not recursive types. In the presence of recursive field types, the complexity due to a structure type consists of the sum of the complexities of the base field types, denoted by S, plus a fraction of this complexity caused by the recursive field types. Two types can easily be ordered by comparing their complexity indexes, where a type A is more complex than type B iff CI(B) < CI(A). Hence, type A is "more important" than type B in the topological ordering of types.

[Figure 3.1. Tree representation of types for the complexity index function]

A partial ordering of the types in a program is established by comparing the complexity indexes of all the types.
Example 3 illustrates the computation of the complexity indexes.

Example 3. Type ordering based on the type complexity index function: Consider the two "C" data structures below, where an identifier in capital letters denotes a type in the language and one in lowercase letters denotes a data structure field name. Assume that types A, B, K, and L are structure types and types C, D, E, M, and N are primitive types in the "C" programming language, as follows:

    struct A {              struct K {
        C c;                    struct L {
        struct B {                  struct B {
            D d;                        D d;
            E e;                        E e;
        };                          };
    };                              N n;
                                };
                                M m;
                            };

The tree representations of these structure types, which capture the "part of" relationship among types, are shown in Figure 3.1. The complexity indexes of types A and B in structure A, according to the complexity index function in Table 3.1, are:

    CI(A, 0) = 0 + CI(C, 1) + CI(B, 1)
             = 0 + 1 + (1 + CI(D, 2) + CI(E, 2))
             = 0 + 1 + (1 + 2 + 2)
             = 6

    CI(B, 0) = 0 + CI(D, 1) + CI(E, 1)
             = 0 + 1 + 1
             = 2

Type A is more complex than type B; thus, type A is higher in the ordering of types than type B. The complexity index of type K is

    CI(K, 0) = 0 + CI(M, 1) + CI(L, 1)
             = 0 + 1 + (1 + CI(N, 2) + CI(B, 2))
             = 0 + 1 + (1 + 2 + (2 + CI(D, 3) + CI(E, 3)))
             = 0 + 1 + (1 + 2 + (2 + 3 + 3))
             = 12

Then, type K is more complex than type A. This type ordering scheme allows the comparison of two unrelated types to determine the most complex, and most important, of the types. The complexity indexes of type B by itself and within structures A and L are equal, since the complexity index depends on the structure that the type represents, which is the same in all three cases. An additional problem, which is not addressed by either of the two type ordering schemes, occurs when fields of a structure are conceptually "more important" than the structure; in that case, the routines could be classified into objects according to the former.
For example, consider another routine, AddToptoOperand, in the multiple stacks/queues program (Example 2), which implements the following functionality:

    Complex AddToptoOperand (Stack: st, Complex: op)
      -- add the top element of a stack to another complex and return the result
      Complex: result, top;
      top = PopS(st);
      result = "+"(top, op);
      return (result);

Using either of the two type ordering schemes, routine AddToptoOperand would be classified under the Stack object. However, the functionality of the routine indicates that it should be classified under the Complex object, since the routine implements operations on Complex entities of the program. The functionality of routines could be captured by a data flow analysis of the internals of a routine and by examination of the types of variables accessed inside the routine (including global variables, local variables, and formal parameters).

3.2 Applicability of the Algorithms

These object identification algorithms apply to conventional, procedural programming languages such as "Ada," "C," "COBOL," or "Pascal." The Globals-based Object Finder handles those conventional programming languages as well as "FORTRAN." Procedural programming languages provide static and dynamic scoping mechanisms which allow the definition of global variables as well as local variables with respect to a particular scope level. The execution of the body of a function in these programming languages results in side effects being observed on the values of the global variables [24]. The Types-based Object Finder handles procedural programming languages that provide a data abstraction mechanism which permits the construction of composite types from more primitive types. Clearly, typing mechanisms are also required for this algorithm. "FORTRAN" is one programming language which does not provide explicit type construction mechanisms, with the exception of arrays.
The limitation of applicative languages, such as "LISP," is that they are not explicitly typed; in addition, their (abstract) data type construction mechanisms (e.g., list structures) are not currently handled by our analysis approach.

3.3 Conditions for Best Results

The proposed modularization approaches are particularly useful when the following conditions hold:

1. The program being maintained is written in a programming language that supports object-like features such as grouping of related data and abstract data types. Implicit abstract data types are identified by the object identification approach. For programming languages that explicitly support the syntactic specification of abstract data types, this modularization is used to define the relationships between the abstract data types defined in the system.

2. The Globals-based Object Finder requires that the language of the program being maintained support a static scoping mechanism. This allows the definition of global variables and the occurrence of side effects through the invocation of routines referencing the global variables.

3. The Types-based Object Finder requires that the language of the program being maintained support a data type abstraction mechanism that permits the construction of composite types from more primitive types.

CHAPTER 4
TIME AND SPACE COMPLEXITY ANALYSIS

In this chapter, the time and space complexities of the proposed approach are discussed. These complexities are analyzed independently for the two object identification approaches.

4.1 Algorithm 1. Globals-based Object Finder Complexity

The time and space complexity of this algorithm follows: Step 1. Build set P(x) for each global variable x: Let N be the number of routines and g be the number of global variables in a system. Assume that the input to step 1 is a symbol table representation of a program. This symbol table consists of a sorted list of entries, each of which contains an identifier's definition and references information.
The time required to look up the references information about an identifier in the symbol table of a program is linearly proportional to the number of identifiers in the program; the time required is at most O(g + N). Thus, for each global variable x, the time required to build the unordered set P(x) of routines which directly access x is O(g + N). For g global variables, the total time complexity of this step is O(g(g + N)). For real programs, N is usually larger than g, and the time complexity is O(gN). The space requirement for this step consists of the space to store the sets P(x); the maximum size of a set P(x) is O(N). For g global variables, the space requirement is O(gN). Step 2. Construct graph G: The construction of graph G consists of making each set P(x) a node in the graph; an edge of the graph corresponds to the set intersection between two nodes (only its magnitude is important). The data structure to store graph G consists of lists which represent the edges of the graph as follows: an edge is stored as (gb1 gb2 P(gb1) ∩ P(gb2)), where gb1 and gb2 are global variables and P(gb1) ∩ P(gb2) is the set of common routines, which is also represented as a list. The time required to construct one edge is the time required to obtain the intersection of the unordered sets P(gb1) and P(gb2). The current implementation does not presort the sets P(x) before an intersection operation; thus, the time is proportional to |P(gb1)||P(gb2)| and the maximum time is O(N²). An improvement consists of presorting the sets P(x). This improvement is possible since the kind of maintenance tasks (design recovery) which we consider do not involve changes to the connectivity of a program; hence, the set P(x), for global variable x, remains unchanged after a modification. In this case, the time required to perform an intersection would be O(N). The total time required to construct all the edges is proportional to the number of global variables, g, in the program.
The maximum number of edges is O(g²). Hence, the total time complexity is O(g²N²). Step 3. Construct candidate objects from strongly connected components: These candidate objects are obtained using a depth-first search algorithm [10, 42] for determining the connected components of the graph G; the complexity of such an algorithm is O(M), where M is the number of edges in the graph, which in turn is bounded by the number of nodes in the graph as O(g²). This algorithm starts with some node of graph G. Then, we visit all the connected nodes in the order of a depth-first search, i.e., we walk, first, as far as possible into the graph without forming a cycle, and then we return to the last bifurcation which had been ignored, and so on until we return to the node from which we started. We restart this procedure from some new node until all nodes have been visited. Based on this analysis, the time complexity of the Globals-based Object Finder is O(gN + g²N² + g²). The bounding time complexity is O(g²N²). The space complexity of this algorithm is proportional to the space required to store graph G. The space required to store all the nodes in the graph is clearly proportional to the number of nodes (i.e., the number of global variables) and the number of routines in the set P(x) associated with a node. Since the maximum number of routines in a node is N, this space complexity is O(gN). The space required to store all the edges in the graph is bounded by the number of nodes and the intersection set corresponding to an edge, i.e., O(g²N²). Then, the total space complexity is O(gN + g²N²).

4.2 Algorithm 2. Types-based Object Finder Complexity

The time and space complexity analysis of this algorithm considers only the type ordering scheme based on the "part of" relationship between types; it follows: Step 1.
Type ordering: A topological ordering must be defined for all the (abstract data) types used as types of formal parameters or return values of the routines in a program. The topological ordering of types defines an ordering of all the types in the program according to the "part of" relationship between any two types; i.e., if type t1 is used to define type t2, then we say that t1 is a "part of" t2. Assume the number of types used in a program, T, is proportional to the size of the program in lines of code, L.¹ One algorithm for topological sort [38] has time complexity O(n²), where n is the number of vertices in the graph; in our analysis, the number of vertices is T. Then, the time complexity, using this topological sort, is bounded by O(T²). Another algorithm for topological sort [17] has a total time complexity bounded by (32m + 24n + 7b + 2c + 16a), where m is the number of input relations between types, n is the number of objects, a is the number of objects with no predecessors (primitive types), b is the number of tape records in input, and c is the number of tape records in output. The topological ordering of types is stored as a tree which represents the "part of" relationship between types in a program. The "part of" relationship between two types x and y is stored as a list (x y), where type y is "part of" type x and type x "contains" type y. Given a list (t1 t2 ... tn), type t1 "contains" types t2 through tn. The maximum space required to store this tree is O(T²). This tree of types usually has a maximum of three or four levels in its branches.

¹For real programs, T will usually be smaller than L.

Step 2. Initial classification: The time required to construct matrix R with N routines and T (abstract data) types is O(TN). For real programs, T is usually smaller than N. Thus, the time required to construct matrix R is bounded by O(N). Step 3.
Classification readjustment: The time required for the classification readjustment is proportional to the number of routines, N, and the number of types, T, in matrix R. For a given routine in row r which has been marked with 1 under type t, the time required to determine whether there exists any other type s in the same row r which "contains" type t and has also been marked with 1 is proportional to the number of types T and to the time required to determine whether type s "contains" type t. The latter time is proportional to the number of types in the type ordering tree, and it is equivalent to the time to search for a path from type s to type t in the subtree of the type ordering tree rooted at s. The algorithm used to search for this path is a depth-first search [47], which has constant time complexity here since the tree of types has at most three or four levels [38]. Hence, for a routine, the time required for the classification readjustment is O(T). Given that we have N routines, the time required for the classification readjustment of matrix R is O(NT). Step 4. Grouping: The candidate objects are formed by collecting all routines r that share a type t; that is, for a given type in column t, a candidate object of routines sharing type t consists of all the routines whose row r has (column) type t set to 1 after step 3. The time required to form these candidate objects is proportional to the number of (abstract data) types T and the number of routines N in the system. The time complexity of this step is O(TN). Based on this analysis, the time complexity of the Types-based Object Finder is O(T² + N + NT + NT). As indicated above, for real programs T is usually small with respect to N, and the bounding time complexity is O(max(T², NT)). The space complexity of this algorithm is the space required to store matrix R plus the space required to store the tree representing the type ordering, i.e., O(NT + T²).
CHAPTER 5
EVALUATION OF THE APPROACH

This chapter presents some guidelines for an evaluation of the object identification methods. The goals of this evaluation are to compare the algorithms used for identifying objects with other existing modularization techniques in terms of the complexity of the resulting modularization, and to compare the resulting modularization of a program written in a conventional, procedural programming language with the "classes" found in an object-oriented version of the program. The evaluation of the object finder algorithms consists of a careful examination of the results of this approach. Two studies were used to evaluate the identified objects. In the first study, named study I, we compared the groups (based on the identified candidate objects) identified by the object finder with the groups (based on the clusters) identified with hierarchical clustering [14]. The comparison was based on the complexity of the two partitionings resulting from each analysis. The results of this study are presented in this chapter. In the second study, named study II, we compared the identified objects found in a program with the object-oriented programming classes found in the object-oriented version of the program. We explain the results of study II in Section 8.2. A system's partitioning results from the system modularization, i.e., the grouping of routines into disjoint groups. The object finder and the hierarchical clustering technique [14] are different methods to obtain these partitionings. Metrics of the complexity of software structures have been used as valuable management aids, important design tools, and as basic elements in research efforts to establish a quantitative foundation for comparing language constructs and design methodologies [11]. In addition, module and interface metrics have been used in evaluating the modularization and the level of coupling of a system [14].
In study I, we continue to use metrics to evaluate the complexity of a system partitioning in terms of the complexity of its interfaces and the complexity of its components. These two views of complexity parallel coupling and cohesion. Conte et al. [7] measure coupling as the number of interconnections among modules and cohesion as a measure of the relationships of the elements within a module. A design with a high degree of coupling and low module cohesion will contain more errors than a design with low module coupling and high module cohesion. Several primitive complexity metrics can be used to quantify the coupling and cohesion of a module; accordingly, several factors are used to quantify our complexity metrics.

5.1 Goals of the Evaluation Studies

The primary goal of the evaluation studies was to demonstrate that the algorithms used for identifying objects result in system partitionings less complex than those produced by other existing modularization methods. This chapter also includes the development of a new set of primitive factors that measure the complexity of a modularization of a system by characterizing the complexity of the corresponding partitioning. The evaluation consisted of comparing the complexity of the candidate objects identified by the object finder with the complexity of the clusters defined using Basili's hierarchical clustering technique [14]. Another motivation for comparing the two approaches was to determine whether the object finder results effectively capture the structure of a software system. Since Basili's hierarchical clusters represent an experimentally validated initial approximation of the groups intended by the designer of the software system [14], the results of the object finder should be less, or at least equally, complex than the hierarchical clustering results. The criterion chosen for the comparison was the complexity of each partitioning, measured by complexity metrics similar to coupling and cohesion.
According to Hutchens and Basili [14], the question of strength and coupling between elements of a given partitioning is still largely unresolved. Thus, a new set of primitive factors was developed to measure the complexity of the relationships between modules of a software system as well as the complexity of the relationships inside the modules. Furthermore, a measure of module strength in terms of the uniqueness of the types manipulated by the module was developed. That is to say, the new set of factors, when applied to the partitioning of a system, measures the complexity of the interface between the system modules, the complexity inside the modules, and the strength of the components inside a module. Other metrics [27, 25, 6, 11, 51, 48, 29] were considered for computing the complexity of the partitioning. They were not suitable for this purpose, since the characteristics to be measured correspond to the complexity of the interface (similar to coupling) and to the complexity of the relationships between elements inside a module (similar to cohesion). The object finder approach assumes that we identify candidate objects in programs which were originally designed without explicit object-oriented syntactic features, even though we assume that object-like features are present in the programs. Thus, metrics based on object-oriented syntactic features could not be used in the evaluation. Hence, we created a new set of primitive complexity metrics factors which measure three aspects of complexity: factors that measure the complexity of the interfaces between modules in a partitioning; factors that measure the complexity of the relationships between components inside modules, in terms of interactions between components in the modules; and primitive metrics factors that measure the strength of the relationships between components in a module in terms of the similarity of data types manipulated by the components of the module.
5.2 Methodology of the Evaluation Studies

The methodology of the evaluation of the object finder included two studies: (1) study I consisted of comparing the groups identified by the object finder (candidate objects) with those identified by hierarchical clustering [14] (clusters), and (2) study II compared the identified candidate objects in a program with the classes found in the object-oriented version of the same program. Several other evaluation studies are proposed for future research, including: use of experts to compare the identified candidate objects in a system with the experts' knowledge about the system; use of student academic projects as well as industrial-size programs to further evaluate the object finder algorithms.

The steps in study I were:

1. Identify example programs for the study. Three sample programs were identified for this study from the literature. Another industrial-size software system was considered for future evaluation studies.

2. Compute the identified candidate objects using the Top-down Analysis (Method 1 in Chapter 6) for a sample program. The result of this method is a partitioning of the system that consists of groups which correspond to the identified candidate objects.

3. Compute the clusters in the sample program using Basili's hierarchical clustering technique [14]. The result of this technique is a partitioning of the system which normally corresponds to the top-level clusters of the system.

4. Compute the primitive complexity metrics factors for each partitioning above. In this step, we determined the primitive metrics factors for a partitioning of the system according to the definitions in Section 5.3.

5. Compare the complexity of the two partitionings using the results of the primitive complexity metrics factors for each partitioning.

The steps in study II are:

1. Identify example programs for the study. One program was chosen for this study from the literature.

2.
Conversion of object-oriented code into non-object-oriented code. For the comparison of identified objects with the classes of an object-oriented version of a program, we require a program with two versions: an object-oriented version and an equivalent non-object-oriented version. We chose an object-oriented program from the literature and then derived a functionally equivalent version of this program using non-object-oriented techniques.

3. Compare the identified objects with the classes. The instrument and results of study II are reported in Section 8.2.

5.3 Primitive Metrics of Complexity

Assume that a system contains a large number of routines and data structures. Our complexity metrics factors measure the ability to partition [3] the system so as to absorb as many relations between routines within a group as possible and thus leave few intergroup connections, which results in less complex group interfaces. In addition, the complexity metrics measure the ability to reduce the number of relations between routines inside a group, which results in less complex relations inside groups. That is to say, we characterize what makes one partitioning "better" than another despite the fact that the connectivity of routines is always the same and only their group assignment changes from partitioning to partitioning. Belady and Evangelisti define the degree of connectivity of a cluster as the number of "connections" between the elements of the cluster. Program routines and data structures are interconnected by routine invocations and references in software systems [3]. In the absence of a direct measure of intergroup complexity and intragroup complexity which, given a partitioning, would compute these metrics, we developed a set of primitive complexity metrics factors which, we argue, measure intergroup and intragroup complexities, as well as intragroup strength, as demonstrated in Section 5.3.5.
The union of all primitive metrics factors related to one kind of complexity (intergroup complexity, intragroup complexity, or intragroup strength) quantifies the corresponding complexity. The following sections present important definitions and examples, the primitive factors of complexity, and a validation of these factors.

5.3.1 Definitions

Assume that all variables in a program have unique names. A use of a variable refers to either a definition (a value is assigned to the variable) or a reference (the variable's value is used) of the variable [12].

Definition 7 A global variable is a variable directly used within at least two modules.

Definition 8 Let P be a module. There is a module-global access pair (P, g) if g is a global variable used within P.

Synopsis: A module-global access pair (M, g) represents the fact that global variable g is used within module M.

Example 1 Module-global access pair (M, g)

    M() {
        // g is used within M
    }

Definition 9 Let (Q, g) be a module-global access pair. There is a module-global indirect access pair (P, g) if there exists a module Q with a call to P, P(..., g, ...), such that the formal parameter of P corresponding to g is used within P.

Synopsis: A module-global indirect access pair (M, g) represents the fact that global variable g is indirectly used by module M when there is a call to M with actual parameter g in a module N which uses g.

Example 2 Module-global indirect access pair (M, g)

    N() {                M(a) {
        M(g)                 // a is used within M
    }                    }

Definition 10 Let P, Q be two modules and g be a global variable in P and Q. There exists a module-global access data binding triple (P, g, Q) if g is defined within P and referenced within Q.

Synopsis: A module-global access data binding triple (P, g, Q) represents the fact that module P defines the value of global variable g and module Q references the value of global variable g. It reflects the data relationship between modules P and Q, and the direction of the information flow (from P to Q).
Example 3 Module-global access data binding triple (P, g, Q)

    P() {                        Q() {
        g = ...   // g is defined    ... = g   // g is referenced
    }                            }

Definition 11 Let P, Q be two modules and x be a local variable in P. There is a local-export data binding triple (P, x, Q) if 1. x is defined within P before a call Q(..., x, ...) in P, and 2. the formal parameter of Q corresponding to x is referenced within Q.

Synopsis: A local-export data binding triple (P, x, Q) represents the fact that module P defines the value of local variable x, and the corresponding formal parameter is referenced within module Q. It is based on the binding between the local variable and the corresponding formal parameter and the direction of the information flow (from P to Q).

Example 4 Local-export data binding triple (P, x, Q)

    P() {                        Q(a: [by reference, by value-result, by value]) {
        x: local
        x = ...   // x is defined    ... = a   // a is referenced
        Q(x)
    }                            }

Definition 12 Let P, Q be two modules and x be a local variable in P. There is a local-import data binding triple (Q, x, P) if 1. there exists a call Q(..., x, ...) in P such that the formal parameter of Q corresponding to x is a call-by-reference or call-by-value-result parameter and is defined within Q, and 2. x is referenced within P after the call.

Synopsis: A local-import data binding triple (Q, x, P) represents the fact that module Q defines the value of the formal parameter a, and the corresponding actual parameter, in a call to Q within P, is referenced within P after the call. It reflects the data relationship between modules P and Q due to the local variable-formal parameter bindings, and the direction of the information flow (from Q to P).

Example 5 Local-import data binding triple (Q, x, P)

    P() {                        Q(a: [by reference, by value-result]) {
        x: local
        Q(x)                         a = ...   // a is defined
        ... = x   // x is referenced
    }                            }

The data bindings due to return value relationships are handled similarly to local-import data bindings.
First, one of several transformations is performed on the function invocation and on the invoked function definition, as follows:

Invoked function definition transformations:

1.  C code:                      Transformation:
    Q(b, c) {                    Q(a, b, c) {
        ...                          ...
        return a;                    // a becomes a formal parameter
    }                            }

2.  C code:                      Transformation:
    Q(b, c) {                    Q(<retval>, b, c) {
        ...                          ...
        return val;                  <retval> = val   // is defined
    }                            }

Function invocation transformations:

1.  C code:                      Transformation:
    P(...) {                     P(...) {
        x;                           x;
        x = Q(y, z);                 Q(x, y, z);
        ... = x   // x is used       ... = x   // x is used
    }                            }

2.  C code:                      Transformation:
    P(...) {                     P(...) {
        exp(Q(y, z));                x;
    }                                Q(x, y, z);
                                     exp(x);
                                 }

Definition 13 Let P, Q be two modules where P uses the return value from Q after a call Q(...) in P. Perform one of the two kinds of transformations on the invoked module definition and on the module invocation. Let a or <retval> be the formal parameter resulting from the transformation on the invoked module definition. There is a return-value data binding triple (Q, a, P) or (Q, <retval>, P) if 1. there exists a call Q(x) in P such that x is a transformation-generated local variable in P corresponding to the transformation-generated formal parameter a or <retval> in Q that is defined within Q, and 2. x is referenced within P after the call.

Synopsis: A return-value data binding triple (Q, a, P) or (Q, <retval>, P) represents the fact that module Q defines a return value, returns it to module P, and this return value is referenced within module P after the call. Either the return value is saved in a variable after the call for later reference, or it is directly referenced after the call. Notice that the data binding is expressed in terms of the return value variable as opposed to the local variable in the invoking module.

Definition 14 Let P, Q be two modules and g be a global variable in P. There is a global-export data binding triple (P, g, Q) if 1. g is defined within P before a call Q(..., g, ...) in P, and 2. the formal parameter of Q corresponding to g is referenced within Q.
Synopsis: A global-export data binding triple (P, g, Q) is symmetric to the local-export data binding triple, except that the local variable x is replaced by a global variable g.

Example 6 Global-export data binding triple (P, g, Q)

    P() {                        Q(a: [by reference, by value-result, by value]) {
        g = ...   // g is defined    ... = a   // a is referenced
        Q(g)
    }                            }

Definition 15 Let P, Q be two modules and g be a global variable in P. There is a global-import data binding triple (Q, g, P) if 1. there exists a call Q(..., g, ...) in P such that the formal parameter of Q corresponding to g is a call-by-reference or call-by-value-result parameter and is defined within Q, and 2. g is referenced within P after the call.

Synopsis: A global-import data binding triple (Q, g, P) is symmetric to the local-import data binding triple, except that the local variable x is replaced by a global variable g.

Example 7 Global-import data binding triple (Q, g, P)

    P() {                        Q(a: [by reference, by value-result]) {
        Q(g)                         a = ...   // a is defined
        ... = g   // g is referenced
    }                            }

Whenever it is not confusing, we will use the term data binding interchangeably with any of the forms of data bindings above.

[Figure 5.1. Schematic illustrations of access pairs and data bindings: module-global access pair (M, g); module-global indirect access pair (M, g); module-global access data binding (P, g, Q); local-export data binding (P, x, Q); local-import data binding (Q, x, P); global-export data binding (P, g, Q); global-import data binding (Q, g, P); return-value data binding (Q, a, P).]

Definition 16 Let Type(v) be the type of variable v. The type size of variable v, denoted Tsize(v), is the amount of information carried by type Type(v) of variable v. Tsize(v) quantifies the information level associated with variable v due to its type.

Table 5.1. Type size associated with several types.

    TYPE (based on "C"/"Ada")                     TSIZE
    ------------------------------------------   -----------------------------------
    primitive types, e.g.: int, char, float,     1
      void, boolean, enum
    pointer                                      1
    array                                        f(Tsize(array[i]), no. of elements,
                                                   control info size)
    user-defined types, e.g.: struct, union      sum of the Tsizes of the elements

In a conventional programming language, such as C or Ada, a primitive type is the least difficult of its types to understand. Since a pointer represents the address of an object, we conclude that it is as easy to understand as primitive types. The user-defined types, such as structures and unions, are more complicated because of their components' information level; thus, we assume that the difficulty of understanding a user-defined type is the added difficulty of understanding each of its components. Finally, an array has special mechanisms to manipulate its elements, which add to the difficulty of understanding the array. Table 5.1 specifies the type size for the types found in two conventional programming languages. The higher the Tsize value, the more information is carried by the type and the more difficult it is to understand a variable of that type. For arrays, control info size refers to the complexity associated with the array control mechanisms of the particular language under analysis (e.g., in C, control info size = 2: 1 unit to account for the index offset used in addressing the individual elements of the array and 1 unit to account for the storage of the address of the beginning of the array).

Definition 17 A group G is a pair of sets of modules and global variables.

Definition 18 The boundary of group G is the set of data bindings (Mi, v, Mj) such that either Mi ∈ G and Mj ∉ G, or Mi ∉ G and Mj ∈ G. In the case of a module-global access data binding (Mi, g, Mj) in the boundary of group G, there are two associated access pairs in the boundary, namely (Mi, g) and (Mj, g).

Definition 19 The interior set of group G is the set of data bindings (Mi, v, Mj) such that both Mi ∈ G and Mj ∈ G. Let |S| be the number of elements in set S.
5.3.2 Intergroup Complexity Factors

The intergroup complexity of group G measures the complexity of the interface between group G and the other groups of the system. The intergroup complexity of group G is characterized by the union of all the primitive measurements of intergroup complexity. The total intergroup complexity of a system is the average intergroup complexity of the groups in the system. Another interesting measure is the maximum intergroup complexity over all the groups in the system.

5.3.2.1 Data-based primitive measurements

In this section, we present the primitive measurements of the intergroup complexity which are based on the data of the system.

* f1(G) is the set of global variables used by modules inside group G and also used by modules outside group G, or the set of direct interface variables, i.e.:

  f1(G) = { g | ∃ module-global access pairs (M, g) and (N, g) such that M ∈ G and N ∉ G. }

Assume that a global variable which is concurrently used by modules in two groups increases the intergroup complexity of both groups. We expect that the more global variables are shared between a group and other groups, the more this group is related to these other groups. The relationship is due to the potential of sharing the same data space among several groups.

Belady's intercluster complexity [3] is proportional to N, the total number of nodes (where nodes are either modules or global variables) in a system. This complexity is given by C0 = N · E0, where E0 is the number of "intercluster" edges. Thus, since N grows with the number of global variables, we conclude that the number of global variables increases the intergroup complexity.
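Factor f1 reduces to a set intersection over the access pairs. A minimal sketch, where the set of module-global access pairs is an illustrative assumption:

```python
# Sketch of factor f1: direct interface variables of a group G.
# access_pairs is a hypothetical set of module-global access pairs (M, g);
# G is the set of module names in the group.

def f1(G, access_pairs):
    """Globals accessed both by a module in G and by a module outside G."""
    inside = {g for (m, g) in access_pairs if m in G}
    outside = {g for (m, g) in access_pairs if m not in G}
    return inside & outside
```

With pairs {(M1, g1), (M2, g1), (M2, g2), (M3, g2)} and G = {M1, M2}, only g2 is shared across the boundary, so f1(G) = {g2}.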
* f2(G) is the set of global variables outside of group G indirectly used by modules within group G, or the set of indirect interface variables, i.e.:

  f2(G) = { g | ∃ module-global indirect access pair (M, g) such that M ∈ G and g ∉ G. }

We expect that the more global variables indirectly affect or are affected by modules in the group, the higher the probability that the group is indirectly related to other groups. This relationship is due to the indirect knowledge about a global that the module must have. For this factor, Belady's intercluster complexity is also used to verify our intuition. Since the intercluster complexity is proportional to N, the total number of nodes in a system, we conclude that the number of global variables increases the intergroup complexity.

* f3(G) is the boundary set of group G, i.e.:

  f3(G) = { (Mi, v, Mj) | ∃ a data binding (Mi, v, Mj) in the boundary set of G. }

The intercluster complexity is proportional to the number of "intercluster edges" [3]. Accordingly, the more intercluster edges there are, the more complex is the interface of the system. For this factor, the data bindings across group boundaries correspond to the intercluster edges. Hence, the complexity increases with the number of data bindings. A similar observation was made by Henry and Kafura [11] in the case of modularity metrics.

* f4(G) is the set of different variables transferring information between group G and other groups, or the set of variables in the boundary set, i.e.:

  f4(G) = { v | ∃ a data binding (Mi, v, Mj) in the boundary set of G. }

Assume that variables which transfer information between the group and other groups, thus defining data bindings, relate the group to those other groups. Hence, the more variables relate a group to other groups, the higher the intergroup complexity. Similarly, Belady's intercluster complexity counts unique nodes, where a node is either a module or a global datum in the system, as a factor which increases intercluster complexity.
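The boundary set of Definition 18, and with it factors f3 and f4, can be sketched as set comprehensions; the triple encoding of data bindings is an illustrative assumption:

```python
# Sketch of f3 (boundary set, Definition 18) and f4 (its distinct variables).
# bindings is a hypothetical set of data-binding triples (Mi, v, Mj).

def boundary_set(G, bindings):
    """f3: bindings with exactly one endpoint module in group G."""
    return {(mi, v, mj) for (mi, v, mj) in bindings if (mi in G) != (mj in G)}

def boundary_variables(G, bindings):
    """f4: the distinct variables occurring in the boundary set."""
    return {v for (_, v, _) in boundary_set(G, bindings)}
```

The exclusive-or test `(mi in G) != (mj in G)` captures "either Mi ∈ G and Mj ∉ G, or Mi ∉ G and Mj ∈ G" in a single condition.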
Factor f4 considers unique occurrences of variables in the boundary set.

5.3.2.2 Type-based primitive measurements

In this section, we present the primitive measurements of the intergroup complexity which are based on the types of data in the system. Assume that the information level of a variable is determined by its type size (Tsize) according to Definition 16.

* f5(G) is the sum of type sizes of the global variables used by group G and also used by modules outside group G, or the sum of type sizes of direct interface variables, i.e.:

  f5(G) = Σ Tsize(g), over g ∈ { g | ∃ module-global access pairs (M, g) and (N, g) such that M ∈ G and N ∉ G. }

We expect that the higher the total type size of all global variables concurrently used by modules inside and outside the group, the higher the probability that the group is related to those other groups, and the higher the intergroup complexity.

* f6(G) is the sum of type sizes of global variables outside group G indirectly used by modules within group G, or the sum of type sizes of indirect interface variables, i.e.:

  f6(G) = Σ Tsize(g), over g ∈ { g | ∃ module-global indirect access pair (M, g) such that M ∈ G and g ∉ G. }

We expect that the higher the total type size of the global variables outside the group indirectly used by modules inside the group, the higher the probability that the group is related to those other groups, and the higher the intergroup complexity.

* Consider the set of data binding triples in the boundary set of a group. f7(G) is the sum of type sizes of boundary set variables.
The sum of the type sizes of the different variables passing data into group G is:

  f7.1(G) = Σ Tsize(v), over v ∈ { v | ∃ data binding (Mi, v, Mj) ∈ boundary set of G such that Mi ∉ G and Mj ∈ G. }

The sum of the type sizes of the different variables passing data out of group G is:

  f7.2(G) = Σ Tsize(v), over v ∈ { v | ∃ data binding (Mi, v, Mj) ∈ boundary set of G such that Mi ∈ G and Mj ∉ G. }

Assume that variables which transfer information between the group and other groups, defining data bindings, contribute to the intergroup complexity of the group according to the variable's type size. We expect that the higher the total type size of the variables passing data to or from the group, the higher the intergroup complexity.

* f8(G) is the set of different types of variables passing data between group G and other groups, or the set of types of boundary set variables, i.e.:

  f8(G) = { Type(v) | ∃ data binding (Mi, v, Mj) ∈ boundary set of G. }

Assume that each type of variable passing data to or from the group needs to be supported by the group's interface. Each type increases the intergroup complexity. Consequently, the more different types of variables are supported, the higher the intergroup complexity. That is due to the different behaviors which need to be supported by the group's interface.

5.3.3 Intragroup Complexity Factors

The intragroup complexity of group G consists of the complexity of group G in a system partition. The intragroup complexity of group G is characterized by the union of all primitive measurements of intragroup complexity. The total intragroup complexity of a system is the sum of the intragroup complexities of all the system's groups. Another measure related to these factors is the intragroup strength¹, which is a measure of the relationship between the modules in a group.

5.3.3.1 Data-based primitive measurements

In this section, we present the primitive measurements of the intragroup complexity which are based on the data of the system.
* f9(G) is the set of global variables used exclusively by modules in group G, or the set of direct internal variables, i.e.:

  f9(G) = { g | ∀ module-global access pairs (Mi, g), Mi ∈ G. }

¹This term was originally used by Myers [27].

* f10(G) is the set of global variables used indirectly only by modules in group G, or the set of indirect internal variables, i.e.:

  f10(G) = { g | ∀ module-global indirect access pairs (Mi, g), Mi ∈ G. }

Factors f9 and f10 above consist of the global variables "exclusively used" by modules in the group.

Assume that a global variable exclusively used by modules in the group weakens the intragroup strength and increases the intragroup complexity. According to [3], a global variable negatively affects the intracluster complexity by increasing its complexity value, given a set of edges. Consequently, a global variable increases the intragroup complexity. Also, we argue that a global variable negatively affects the intragroup strength, since other groups may use this global variable, thus reducing the functional relatedness of the group.

We expect that the more global variables there are (exclusively used by modules in the group and thus not used by modules in other groups), the higher the intragroup complexity. An extreme case occurs when the global variable becomes a local variable with respect to the group. Similarly, we expect that the strength between modules in the group weakens.

* Consider the set of data binding triples in the interior set of a group.
f11(G) is the interior set of group G, i.e.:

f11.1(G) is the set of module-global access data bindings involving modules within group G, i.e.:
  f11.1(G) = { (Mi, v, Mj) | ∃ a module-global access data binding (Mi, v, Mj) in the interior set of G. }

f11.2(G) is the set of local-export data bindings involving modules within group G, i.e.:
  f11.2(G) = { (Mi, v, Mj) | ∃ a local-export data binding (Mi, v, Mj) in the interior set of G. }

f11.3(G) is the set of local-import data bindings involving modules within group G, i.e.:
  f11.3(G) = { (Mi, v, Mj) | ∃ a local-import data binding (Mi, v, Mj) in the interior set of G. }

f11.4(G) is the set of global-export data bindings involving modules within group G, i.e.:
  f11.4(G) = { (Mi, v, Mj) | ∃ a global-export data binding (Mi, v, Mj) in the interior set of G. }

f11.5(G) is the set of global-import data bindings involving modules within group G, i.e.:
  f11.5(G) = { (Mi, v, Mj) | ∃ a global-import data binding (Mi, v, Mj) in the interior set of G. }

f11.6(G) is the set of return-value data bindings involving modules within group G, i.e.:
  f11.6(G) = { (Mi, v, Mj) | ∃ a return-value data binding (Mi, v, Mj) in the interior set of G. }

According to Belady and Evangelisti [3], the intracluster complexity of a single cluster j is Cj = nj · ej, where nj is the number of nodes in the cluster and ej the number of pairwise edges, that is, "connections between the same elements." Given that edges correspond to data bindings in our model, we conclude that the more data bindings between the modules in a group, the higher the intragroup complexity of the group.

Furthermore, the intragroup strength increases with the number of data bindings, since the more "channels" of data passing between modules in the group, the higher the functional relatedness of the group.

It appears as if factors f11 conflict with factors f9 and f10; however, data bindings define relations between modules which, theoretically speaking, will increase the strength.
However, the fact that the data bindings are sometimes due to global variables involves a risk in that other groups in the system may access these global variables, which weakens the strength of the group.

* This factor applies to groups derived from using the globals-based analysis (Algorithm 1):

f12(G) is the percentage of modules in group G which directly use all global variables in group G, i.e.:

  f12(G) = |{ M | ∃ module-global access pairs (M, gi) ∀ global variables gi ∈ G, and M ∈ G }| / |{ M | M ∈ G }|

Assume that a global variable used by all modules in the group increases the intragroup strength. The higher the ratio of the number of modules which directly use all global variables in the group to the total number of modules, the higher the intragroup strength.

5.3.3.2 Type-based primitive measurements

In this section, we present the primitive measurements of the intragroup complexity which are based on the types of data in the system. Assume that the information level of a variable is determined by its type size (Tsize) according to Definition 16.

* Consider the set of data binding triples in the interior set of group G. Also, consider only the types of the different variables, since we measure the complexity due to the kind of information passed between modules of the group, as opposed to the volume of this information.

f13(G) is the sum of the type sizes of the different variables passing data between modules Mi and Mj in group G:

  f13(G) = Σ Tsize(v), over v ∈ { v | ∃ data binding (Mi, v, Mj) ∈ interior set of G. }

We expect that the higher the total type size of all the variables used by modules in the group, the higher the intragroup complexity and the lower the intragroup strength.
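Factor f12 is a simple ratio over the access pairs. A sketch, where the module set, the group's globals, and the access-pair set are all illustrative assumptions:

```python
# Sketch of factor f12 (globals-based analysis): the fraction of modules in G
# that have a direct module-global access pair with every global in G.

def f12(group_modules, group_globals, access_pairs):
    """Ratio of modules accessing all of the group's globals to all modules."""
    if not group_modules:
        return 0.0
    # Note: with no globals in G, every module qualifies vacuously (ratio 1.0).
    full_users = [m for m in group_modules
                  if all((m, g) in access_pairs for g in group_globals)]
    return len(full_users) / len(group_modules)
```

For instance, if modules A and C access both g1 and g2 but B accesses only g1, f12 = 2/3.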
* f14(G) is the set of types of the different variables passing data between modules Mi and Mj in group G, or the set of types of interior set variables, i.e.:

  f14(G) = { Type(v) | ∃ data binding (Mi, v, Mj) ∈ the interior set of G. }

Assume that each type of variable passing data among the modules in the group needs to be supported by the module. Each type increases the intragroup complexity. Therefore, the more different types of variables are supported, the higher the intragroup complexity. Also, less commonality is observed in the group, and the intragroup strength decreases due to the reduced functional relatedness of the group.

* The following factors only apply to groups derived from using the types-based analysis (Algorithm 2):

Definition 20 The base type of group G, denoted Btype(G), is the type used as the grouping criterion during the "grouping" step of the Types-based Object Finder, Algorithm 2.

Definition 21 Module M manipulates type t, denoted by the manipulation pair [M, t], if t is the type of a formal parameter of M, or t is the type of the return value of M, or t = Type(g) such that ∃ access pair (M, g). That is, t is one of the types that module M may manipulate.

Definition 22 A grouping manipulation pair in group G is a manipulation pair [M, t] such that M ∈ G and t = Btype(G). There exists a grouping manipulation pair for a module M, M ∈ G, and type t, denoted [M, t], iff any of module M's types of formal parameters or return value, t, is equivalent² to the base type of group G, Btype(G); i.e., t = Btype(G). Also, there exists a grouping manipulation pair for a module M, M ∈ G, and type of variable v, Type(v), denoted [M, Type(v)], iff there exists a module-global access pair (M, v) and the type of variable v, Type(v), is equivalent to the base type of group G; i.e., Type(v) = Btype(G).
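Definitions 21 and 22 can be sketched mechanically. The dictionary encoding of module signatures is an illustrative assumption, and the thesis's type equivalence is simplified here to plain equality:

```python
# Sketch of Definitions 21-22: manipulation pairs [M, t] and grouping pairs.
# Modules are hypothetical dicts: {"name", "param_types", "return_type"}.

def manipulated_types(module, access_pairs, var_types):
    """Types module M may manipulate: formal-parameter types, the return
    type, and Type(g) for each access pair (M, g)."""
    types = set(module["param_types"])
    if module["return_type"] is not None:
        types.add(module["return_type"])
    types |= {var_types[g] for (m, g) in access_pairs if m == module["name"]}
    return types

def grouping_pairs(modules, access_pairs, var_types, btype):
    """Grouping manipulation pairs [M, t] with t equivalent to Btype(G);
    equivalence is simplified to equality in this sketch."""
    return {(m["name"], t)
            for m in modules
            for t in manipulated_types(m, access_pairs, var_types)
            if t == btype}
```

For a hypothetical stack module set, every routine whose signature mentions `stack*` contributes a grouping pair when Btype(G) = `stack*`.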
f15(G) is the set of grouping manipulation pairs [M, t] such that M ∈ G, i.e.:

  f15(G) = { [M, t] | [M, t] is a grouping manipulation pair and M ∈ G. }

We expect that the more grouping manipulation pairs there are, the higher the intragroup strength. The evidence indicates that this factor does not affect the intragroup complexity.

²Equivalent types are defined in Algorithm 2 in Chapter 3.

* Definition 23 A degrouping manipulation pair in group G is a manipulation pair [M, t] such that M ∈ G and either t < Btype(G) or t > Btype(G). There exists a degrouping manipulation pair for a module M, M ∈ G, and type t, denoted [M, t], iff any of module M's types t of formal parameters or return value is a "part of" the base type of group G or "contains" a type that is the base type of group G, Btype(G); i.e., t < Btype(G) or t > Btype(G). Also, there exists a degrouping manipulation pair for a module M, M ∈ G, and type of variable v, Type(v), denoted [M, Type(v)], iff there exists a module-global access pair (M, v) and the type of variable v, Type(v), is a "part of" the base type of group G or "contains" a type that is the base type of group G, Btype(G); i.e., Type(v) < Btype(G) or Type(v) > Btype(G).

f16(G) is the set of degrouping manipulation pairs [M, t] such that M ∈ G, i.e.:

  f16(G) = { [M, t] | [M, t] is a degrouping manipulation pair and M ∈ G. }

We expect that the more degrouping manipulation pairs there are, the lower the intragroup strength. Once more, this factor does not affect the intragroup complexity.

* f17(G) is the ratio of the number of grouping manipulation pairs to the number of degrouping manipulation pairs:

  f17(G) = |f15(G)| / |f16(G)|

Given factors f15 and f16 above, we expect that the higher the ratio of grouping manipulation pairs to degrouping manipulation pairs, the higher the intragroup strength. In the case that types t and Btype(G) are not explicitly related (i.e., the types are not equivalent and there is no "part of" or "contains" relationship between them),

[Figure 5.2, a diagram of modules M1 through M11, variables v1 through v5, and group G, is not reproduced here.]

Figure 5.2.
Example of the primitive complexity metrics factors

the corresponding manipulation pairs of the form [M, t] are not considered in factors f15, f16, and f17.

5.3.4 Example of the Factors

This section presents an example of the primitive complexity metrics factors. Figure 5.2 consists of a set of several routines, global variables, and a group. Examples of the complexity metrics factors related to group G in Figure 5.2 are:

1. Intergroup Complexity

f3 Boundary set:
  f3(G) = { (M1, v1, M2), (M3, v3, M5), (M3, v3, M7), (M4, v4, M8), (M5, v5, M9), (M10, v4, M11), (M10, v5, M11) }

f4 Variables in boundary set:
  f4(G) = { v1, v3, v4, v5 }

f7 Sum of type sizes of different variables in boundary set:
  f7(G) = Tsize(v1) + Tsize(v3) + Tsize(v4) + Tsize(v5)

2. Intragroup Complexity

f11 Interior set:
  f11(G) = { (M2, v2, M6), (M2, v3, M7), (M2, v3, M8), (M5, v5, M11) }

f13 Sum of type sizes of different variables in interior set:
  f13(G) = Tsize(v2) + Tsize(v3) + Tsize(v5)

5.3.5 Validation of the Factors

This section presents the validation of the complexity metrics factors of the previous sections. The validation approach consists of proving that the validation hypothesis holds. This approach is described in Figure 5.5. This section also illustrates the sensitivity of the factors to a particular modification of a program. More experimentation is needed to further validate these factors with respect to other kinds of changes applied to programs. Other interesting modifications to programs include a functionally equivalent program with several functions merged into a single function, or several abstract data types of the original program replaced with equivalent data structures.

The example program consists of a simple recursive descent expression parser implemented in C. First, we present the complexity metrics computed on a version of this program which consists of two source files, eighteen functions, and four global variables.
Second, we present the metrics computed on another version of this program with the same functionality, except that most function parameters are converted into global variables; this version consists of the same number of source files as the previous version but with a total of nineteen functions and six global variables. Third, we explain the effect that the structure of each version of the program had on the primitive complexity metrics factors, and illustrate the sensitivity of the metrics to this class of changes in the structure of a program.

Table 5.2. Type size associated with variables of different types in the original version of the recursive descent expression parser.

  Variable   Type            TSIZE
  prog       char*           1
  toktype    char            1
  token      char[80]        80 · 1 + 2 = 82
  vars       float[26]       26 · 1 + 2 = 28
  answer     float           1
  op         register char   1
  result     float*          1
  c          char            1
  hold       float           1

In the first version of the program, named the original version, the top-down analysis approach was used with both globals-based and types-based analyses; we chose to ignore C primitive types during the types-based analysis. The identified objects are shown in Figure 5.3. Some modifications of the candidate objects were performed to obtain completely disjoint candidate objects in terms of their routines. These modifications consisted of removing common components (routines) shared with object Gtoktype#2, leaving each routine in a single candidate object. The resulting objects after the modifications are shown in Figure 5.3.

Table 5.2 illustrates the type sizes corresponding to the types of variables in the original version of the example. The primitive metrics factors were computed for intergroup complexity, intragroup complexity, and intragroup strength.
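The array rows in Table 5.2 follow the array rule of Table 5.1: the element Tsize times the number of elements, plus the control info size (2 for C). A quick check of those entries:

```python
# Checking the array rows of Table 5.2 against Table 5.1's array rule.
CONTROL_INFO_SIZE = 2  # C: index offset + base-address storage

def array_tsize(elem_tsize, n_elements):
    """Tsize of an array per Table 5.1's rule."""
    return n_elements * elem_tsize + CONTROL_INFO_SIZE

token_tsize = array_tsize(1, 80)  # token: char[80]
vars_tsize = array_tsize(1, 26)   # vars: float[26]
```

These reproduce the table's values of 82 and 28, respectively.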
Object z#objTchar*#3 is { (T) char*; (R) iswhite, isdelim, isin }
Object z#objTfloat*#4 is { (T) float*; (R) primitive, level6, level5, level4, level3, level2, level1, getexp }
Object z#objT+undeterminedtype#5 is { (T) float*, char*; (R) unary, findvar, arith }
Object z#objGtoktype#2 is { (G) prog, vars, token, toktype; (R) main, putback, gettoken, findvar }

(In the object finder's output, each member is listed with its source file and a unique numeric ID, e.g., Float_Parsing iswhite 181; the IDs are omitted here for readability.)

Figure 5.3. Identified objects in original version of recursive descent expression parser

Table 5.3. Type size associated with variables of different types in version 1 of the recursive descent expression parser.

  Variable   Type            TSIZE
  prog       char            1
  toktype    char            1
  token      char[80]        80 · 1 + 2 = 82
  vars       float[26]       26 · 1 + 2 = 28
  answer     float[1048]     1048 · 1 + 2 = 1050
  result     float           1
  op         register char   1
  c          char            1
  hold       float           1

For the other version of the program, named version 1, the top-down analysis approach was used with both globals-based and types-based analyses; we chose to ignore C primitive types during the types-based analysis. The identified objects are shown in Figure 5.4. Some modifications of the candidate objects were performed to obtain completely disjoint candidate objects; specifically, routine findvar was removed from object T+undeterminedtype#4. The resulting objects after the modifications are shown in Figure 5.4.
Table 5.3 illustrates the type sizes used for the metrics computation. In this version of the program, as well as in the rest of the examples in this thesis, whenever possible the name of an identifier denotes the identifier, instead of using the unique ID which is automatically generated by the object finder tool of Chapter 7. See Section 8.1 for a complete explanation of unique IDs in the internal representation of the program used by the object finder tool.

The primitive metrics factors were computed for intergroup complexity, intragroup complexity, and intragroup strength.

Object z#objTchar*#3 is { (T) char*; (R) iswhite, isdelim, isin }
Object z#objT+undeterminedtype#4 is { (T) float*, char*; (R) unary, arith }
Object z#objGanswer#2 is { (G) prog, vars, token, toktype, result, answer; (R) main, resultinc, putback, primitive, level6, level5, level4, level3, level2, level1, gettoken, getexp, findvar }

Figure 5.4. Identified objects in version 1 of recursive descent expression parser

The approach to validate the complexity metrics consists of proving that the following hypothesis holds. The hypothesis states that the primitive complexity metrics factors are sensitive to changes in the complexity caused by different partitionings.
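The sensitivity hypothesis can be read as a monotonicity check over measured factor values: for a factor family where larger values mean more complexity, the version expected to be less complex should not score higher on any shared factor. A sketch, with the dictionary encoding of measured factors as an illustrative assumption:

```python
# Sketch of the validation hypothesis: if version A is expected to be less
# complex than version B, no shared factor of A should exceed B's.
# Factor dictionaries (e.g. {"f3": 5, "f13": 10}) are hypothetical inputs.

def supports_hypothesis(measured_less_complex, measured_more_complex):
    """True iff every factor measured in both versions is consistent with
    the expected complexity ordering (larger value = more complex)."""
    shared = measured_less_complex.keys() & measured_more_complex.keys()
    return all(measured_less_complex[f] <= measured_more_complex[f]
               for f in shared)
```

For strength factors, where larger values mean more cohesion rather than more complexity, the comparison direction would be reversed.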
Different partitionings are the result of (1) using different partitioning approaches under the same connectivity, or (2) totally different connectivities. The connectivity of a program is the set of relationships between components of the program defined by the data bindings among components due to global variables and calling sequences. We use the second situation to prove that the hypothesis holds.

The methodology to prove the hypothesis is to use a case study program to show that the primitive metrics reflect different complexity measures in two versions of the program which are functionally equivalent and have different connectivities as a result of having different structures. A schematic description of this validation methodology is given in Figure 5.5. From Figure 5.5, the validation of the metrics factors consists of proving the following hypothesis:

  If Co1 is less complex than Co2, then Me1 < Me2,

where Co1 is the expected complexity of the original version, Co2 is the expected complexity of version 1, Me1 is the measured primitive metrics factor values of the original version, and Me2 is the measured primitive metrics factor values of version 1.

For the proof, we use the program above with two versions: the original version and version 1. We made version 1 more complex in terms of the connectivity of the program, since a program with greater connectivity is expected to be more complex given that all other factors remain the same. Version 1 is made more complex by replacing most formal parameters with global variables. Thus, the number of expected relations between components of the program will increase. The expected complexity of each of the two versions is defined as the complexity resulting from the connectivities between components of a system. We expect that the

[Figure 5.5, a schematic relating the connectivities, partitionings, expected complexities (Co1, Co2), and measured complexity metrics (Me1, Me2) of the two versions, is not reproduced here.]

Figure 5.5.
Validation of the primitive complexity metrics factors

groups in the partitioning obtained from version 1 are more intragroup complex than the groups in the partitioning obtained from the original version. This is the case since each group in the version 1 partitionings is expected to be made highly cohesive by the global variables added in version 1. We also expect that the groups in version 1 are less intergroup complex, because the groups in the version 1 partitioning are expected to be loosely connected by the global variables in version 1.

Each version of the program exhibits different connectivity between the program components. Hence, the object finder partitionings for each version of the program were different. In addition, the objective of the study, that the program versions be functionally equivalent but structurally different, was met. The validation results for each complexity metric factor are summarized next. First, the results about the complexity metrics factors that measure intergroup complexity.

f1 Set of direct interface variables. The output for the two versions of the program indicated that the original version presented higher intergroup complexity than version 1. This observation shows the effect of different partitionings on the metrics. In addition, as expected, there are more interactions between components in the original version than between components in version 1.

f2 Set of indirect interface variables. No relevant results were obtained for this factor.

f3 Boundary set. The output for the two versions of the program indicated that the original version presented slightly higher intergroup complexity than version 1. This observation is in accordance with the expected variation in the metrics.

f4 Variables in boundary set. The output for the two versions of the program indicated that the two versions presented similar intergroup complexity.
f5 Sum of type sizes of direct interface variables. The output for the two versions of the program indicated that the original version presented higher intergroup complexity than version 1. This is the case since the version 1 partitioning reduces the number of global variables transferring information between components.

f6 Sum of type sizes of indirect interface variables. No relevant results were obtained for this factor.

f7 Sum of type sizes of boundary set variables. The output for the two versions of the program indicated that the original version presented lower intergroup complexity than version 1. Version 1 added some global variables to the program, and they influenced the sizes observed in the new program. However, the increased value of the type size is related to the number of relations (data bindings) between components, and this number was lower in version 1 than in the original version.

f8 Set of types of boundary set variables. The output for the two versions of the program indicated that the original version presented higher intergroup complexity than version 1. This is the case since the version 1 partitioning reduces the number of variables transferring information between components.

In conclusion, the intergroup complexity in version 1 is lower than that in the original version. This coincides with our expectations regarding the effect of the changes and the resulting partitioning on the primitive complexity metrics factors. That is to say, the partitioning resulting from version 1 agglomerates most relations between components inside the groups, which reduces the intergroup complexity.

Next, the results about the factors that measure intragroup complexity.

f9 Set of direct internal variables. The output for the two versions of the program indicated that version 1 presented higher intragroup complexity than the original version. This coincides with our expectation that the intragroup complexity will increase in version 1 since more global variables are used.
f10 Set of indirect internal variables. The output for the two versions of the program indicated that the two versions presented similar intragroup complexity. One reason for this is the fact that the code related to factor f10 was not modified between versions.

f11 Interior set. The output for the two versions of the program indicated that the original version presented lower intragroup complexity than version 1. This coincides with our expectation that the intragroup complexity will increase in version 1 since more global variables are used.

f12 Percentage of modules accessing all global variables. No relevant results were obtained for this factor.

f13 Sum of type sizes of interior set variables. The output for the two versions of the program indicated that the original version presented lower intragroup complexity than version 1. This demonstrates that factor f13 increases with a higher intragroup complexity partitioning, such as the one in version 1 of the program.

f14 Set of types of interior set variables. The output for the two versions of the program indicated that the original version presented lower intragroup complexity than version 1. This coincides with our expectation that the intragroup complexity will increase in version 1 since more relations occur.

In conclusion, the intragroup complexity in version 1 is higher than the intragroup complexity in the original version. This coincides with our expectation regarding the effect of the version 1 partitioning on the metrics. That is to say, the partitioning resulting from version 1 agglomerates most relations between components inside the groups, which increases the intragroup complexity.

f15 Set of grouping manipulation pairs. The output for the two versions of the program indicated that the original version presented higher intragroup strength than version 1.
This coincides with our expectation that the partitioning of version 1 decreases the strength, because parameters have been replaced with global variables and, as previously indicated, the addition of global variables reduces the strength of individual components.

f16 Set of degrouping manipulation pairs. The output for the two versions of the program indicated that the original version presented higher intragroup strength than version 1. This coincides with our expectation that the partitioning resulting from version 1 decreases the strength.

f17 Ratio of grouping to degrouping manipulation pairs. The output for the two versions of the program indicated that the original version presented higher intragroup strength than version 1. This coincides with our expectation that the partitioning resulting from version 1 decreases the strength.

In conclusion, the intragroup strength in the original version of the program is higher than that in version 1. This coincides with our expectations regarding the effect of the version 1 partitioning on the primitive metrics factors. That is to say, version 1 reduces the strength of the resulting components due to the elimination of the parameters from the routines' interfaces, which were replaced with global variables.

The validation of these primitive metrics factors shows a correlation between the primitive metrics factors and the expected complexity associated with a partitioning of a system. We conclude that the complexity of a partitioning is effectively measured by these primitive complexity metrics factors. Different complexities will result in different values of the primitive metrics factors. Future research consists of describing the relationships between the complexity metrics factors in terms of a formula.

5.4 The Test Cases: Identified Objects, Clusters and Groups

The following sections present the partitionings identified in the three test case programs using two modularization techniques.
The partitionings consist of the objects identified using the top-down analysis method of the object finder and the clusters identified using Hutchens and Basili's hierarchical clustering technique [14]. A generalized view of both kinds of partitionings consists of a set of groups, from Definition 17; a group usually corresponds to an identified object in the object finder and to a cluster in hierarchical clustering. In the following sections, we explain how the groups based on identified objects and on clusters were specified for each test case.

5.4.1 Test Case 1: Name Cross-reference Program

The first test case consists of the name cross-reference program from Section 8.1. The statistics of this example are given in Table 5.4. This section presents the objects identified by the object finder using the top-down analysis method. In addition, we present the clusters identified in this program using Basili's hierarchical clustering technique [14].

The identified objects, obtained by the object finder during the top-down analysis method, are shown in Figure 8.1. The groups corresponding to the identified objects in Figure 8.1 are derived after the user performs some modifications on the identified objects. The purpose of the user's modifications is to eliminate the commonality between objects by removing common components between objects, as explained in Section 8.1, using xobject, which results in the objects of Figure 8.3; these disjoint objects were used to derive the groups. The groups are shown in Figure 5.6. Since these groups are based on objects after the modifications, we name a group using the name of the object that corresponds to the group.

Table 5.4. Statistics of the test case programs.

Program name                     Lines of code   Global variables   Functions   Types
Name cross-reference             282             1                  10          10
Algebraic expression evaluation  1,324           10                 50          14
Table management                 1,900           8                  50          4
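The derivation of disjoint groups from overlapping candidate objects, performed interactively with xobject, can be sketched mechanically as follows. This is a simplification: it keeps each shared component in the first object that claims it, whereas in the tool the user decides where a common component belongs. The candidate objects and component names below are hypothetical.

```python
def make_disjoint(candidate_objects):
    """Keep each component only in the first object that claims it,
    so the resulting objects are pairwise disjoint."""
    seen = set()
    disjoint = {}
    for name, components in candidate_objects.items():
        kept = [c for c in components if c not in seen]
        seen.update(kept)
        disjoint[name] = kept
    return disjoint

# Hypothetical candidates: both objects claim "strsave".
candidates = {
    "objT_WTPTR": ["add_word", "add_lineno", "strsave"],
    "objT_char*": ["strsave", "getword"],
}
print(make_disjoint(candidates))
# {'objT_WTPTR': ['add_word', 'add_lineno', 'strsave'], 'objT_char*': ['getword']}
```

Once the objects are disjoint, each one maps directly to a group, which is why the modified objects of Figure 8.3, rather than the raw candidates, were used to derive the groups.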
The clusters of this test case were identified by a clustering tool called basili [21] that implements Basili's hierarchical clustering technique. The identified clusters in the cross-reference program are shown in Figure 5.7. The groups corresponding to these clusters are shown in Figure 5.8. In this case, a group consists of one or more clusters of routines which maintain the same levels of coupling and strength [14] as the clusters identified with the hierarchical clustering technique. The groups are constructed as follows. Initially, groups are defined based on the top level of the corresponding clusters; the naming of groups is arbitrary and only serves to identify each group distinctly. Then, the routines which did not cluster were considered to be groups of size one. An alternative approach, during this last step, is to group the unclustered routines together.

Group z#objTLINEPTR#3 is {
  :(T)LINEPTR
  :(R)make_linenode
}
Group z#objTWTPTR*#4 is {
  :(T)WTPTR*
  :(R)make_wordnode
  :(R)init_tab
  :(R)add_word
  :(R)add_lineno
  :(R)writewords
}
Group z#objTchar*#5 is {
  :(T)char*
  :(R)strsave
}
Group z#objTint*#6 is {
  :(T)int*
  :(R)main
}
Group z#objGlineno#2 is {
  :(G)lineno
  :(R)getword
  :(R)getachar
}

Figure 5.6. Groups based on objects identified in name cross-reference program

Final dendrogram:
  Cluster No.1
  (100 (50 add_lineno add_word) make_linenode (50 make_wordnode strsave))
  Cluster No.2
  (100 getachar getword)

Figure 5.7.
Clusters found in the name cross-reference program by basili

Group I is {
  :(R)add_lineno
  :(R)add_word
  :(R)make_wordnode
  :(R)strsave
  :(R)make_linenode
}
Group II is {
  :(R)getword
  :(R)getachar
}
Group III is {
  :(R)init_tab
}
Group IV is {
  :(R)writewords
}
Group V is {
  :(R)main
}

Figure 5.8. Groups based on clusters found in name cross-reference program

5.4.2 Test Case 2: Algebraic Expression Evaluation Program

The second test case consists of the simple algebraic expression evaluation program from Section 8.2. The statistics of this program are listed in Table 5.4. This section presents the objects identified by the object finder during the top-down analysis method. In addition, we present the clusters identified in this program using Hutchens and Basili's hierarchical clustering technique [14].

The identified objects, obtained by the object finder during the top-down analysis method, are shown in Figures 8.4 and 8.5. The groups corresponding to these identified objects are derived after the user performs modifications on the identified objects. Similarly to Section 5.4.1, we name a group based on the name of the object that corresponds to the group. The groups are shown in Figures 5.9 and 5.10.
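The group construction used for the clustering-based partitionings in Section 5.4.1 (top-level clusters become groups; routines that did not cluster become groups of size one) can be sketched as follows. This is a simplified illustration assuming clusters are given as plain routine lists; the basili tool's actual dendrogram output format differs.

```python
def groups_from_clusters(clusters, all_routines):
    """clusters: list of top-level clusters, each a list of routine names.
    Returns a dict mapping arbitrary group names to member lists."""
    groups = {f"Group {i}": list(c) for i, c in enumerate(clusters, start=1)}
    clustered = {r for c in clusters for r in c}
    n = len(groups)
    for r in sorted(all_routines - clustered):  # unclustered -> singletons
        n += 1
        groups[f"Group {n}"] = [r]
    return groups

# Top-level clusters from the cross-reference dendrogram of Figure 5.7.
clusters = [["add_lineno", "add_word", "make_linenode", "make_wordnode", "strsave"],
            ["getachar", "getword"]]
routines = set(sum(clusters, [])) | {"init_tab", "writewords", "main"}
g = groups_from_clusters(clusters, routines)
print(len(g))  # 5
```

The two clusters plus three singleton groups reproduce the five groups of Figure 5.8; the alternative mentioned above would instead merge the three unclustered routines into a single group.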
Group z#objTNODE*#6 is {
  :(T)NODE*
  :(R)Variable_eval
  :(R)Plus_eval
  :(R)Node_eval
  :(R)Multiply_eval
  :(R)Minus_eval
  :(R)Function_precedence
  :(R)Function_delete_function
  :(R)Function_check_and_add
  :(R)Function_build_tree
  :(R)Function_add_operator
  :(R)Function_add_operands
  :(R)Echo_tree0
  :(R)Divide_eval
  :(R)Construct_function
  :(R)Constant_eval
  :(R)Function_parenthesis
  :(R)Function_ovldop
  :(G)total_paren
  :(G)root
  :(G)queue
  :(G)last
  :(G)first
}
Group z#objTchar*#7 is {
  :(T)char*
  :(R)State9_transition
  :(R)State7_transition
  :(R)State6_transition
  :(R)State5_transition
  :(R)State4_transition
  :(R)State3_transition
  :(R)State2_transition
  :(R)State1_transition
  :(R)State0_transition
  :(R)Echo_tree
}
Group z#objTfloat#8 is {
  :(T)float
  :(R)F6
  :(R)F5
  :(R)F4
  :(R)F3
  :(R)F2
  :(R)F1
}
Group z#objTint#9 is {
  :(T)int
  :(R)main
}
Group z#objTvoid#10 is {
  :(T)void
  :(R)Destruct_function
}

Figure 5.9. Groups based on types-based objects identified in algebraic expression evaluation program

Similarly to Section 5.4.1, the clusters of this test case were identified by the clustering tool basili [21] based on Basili's hierarchical clustering technique. The identified clusters in the expression evaluation program are shown in Figure 5.11. The groups corresponding to these clusters are shown in Figure 5.12. In this test case, a group consists of one or more clusters and subclusters of routines which result in the same degree of coupling and strength [14] as the clusters identified by the hierarchical clustering technique. The groups are constructed as follows. First, a group corresponds to a set of routines which form subclusters with the lowest coupling and highest cohesion between the routines, according to Hutchens and Basili's [14] definitions of strength and coupling.
Next, the routines which did not cluster were grouped into a single group, named Group V.

Group z#objGtableindex#2 is {
  :(G)table
  :(G)table_index
  :(R)Symbol_table_get_value
  :(R)Symbol_table_get_index
  :(R)Symbol_table_clear
  :(R)Symbol_table_add_variable
  :(R)Symbol_table_add_value
  :(R)Echo_symbol_table
  :(R)Construct_symbol_table
}
Group z#objGparencount#3 is {
  :(G)paren_count
  :(R)State0_inc_paren
  :(R)State0_get_paren
  :(R)State0_dec_paren
}
Group z#objGexpression#4 is {
  :(G)index
  :(G)expression
  :(R)State0_get_char
  :(R)Construct_state0
  :(R)Function_valid
}

Figure 5.10. Groups based on globals-based objects identified in algebraic expression evaluation program

Final dendrogram:
  Cluster No.1
  (9 Function_ovldop Symbol_table_add_value)
  Cluster No.2
  (100 (66 Construct_function (62 Function_add_operands (44 Function_build_tree Symbol_table_add_variable) Function_check_and_add)) Echo_queue Echo_tree0 Function_add_operator Function_delete_function main)

Figure 5.11.
Clusters found in algebraic expression evaluation program

Group I is {
  :(R)Function_ovldop
  :(R)Symbol_table_add_value
}
Group II is {
  :(R)Function_add_operator
  :(R)Function_delete_function
  :(R)main
}
Group III is {
  :(R)Construct_function
  :(R)Function_add_operands
  :(R)Function_check_and_add
}
Group IV is {
  :(R)Function_build_tree
  :(R)Symbol_table_add_variable
}
Group V is {
  :(R)Variable_eval
  :(R)Plus_eval
  :(R)Node_eval
  :(R)Multiply_eval
  :(R)Minus_eval
  :(R)Function_precedence
  :(R)Echo_tree0
  :(R)Divide_eval
  :(R)Constant_eval
  :(R)Function_parenthesis
  :(R)State9_transition
  :(R)State7_transition
  :(R)State6_transition
  :(R)State5_transition
  :(R)State4_transition
  :(R)State3_transition
  :(R)State2_transition
  :(R)State1_transition
  :(R)State0_transition
  :(R)Echo_tree
  :(R)F6
  :(R)F5
  :(R)F4
  :(R)F3
  :(R)F2
  :(R)F1
}

Figure 5.12. Groups based on clusters found in algebraic expression evaluation program

Group V (cont.)
is {
  :(R)Destruct_function
  :(R)Symbol_table_get_value
  :(R)Symbol_table_get_index
  :(R)Symbol_table_clear
  :(R)Echo_symbol_table
  :(R)Construct_symbol_table
  :(R)State0_inc_paren
  :(R)State0_get_paren
  :(R)State0_dec_paren
  :(R)State0_get_char
  :(R)Construct_state0
  :(R)Function_valid
}

Figure 5.12 (continued)

5.4.3 Test Case 3: Symbol Table Management for Acx

The third test case consists of the symbol table management module of an ANSI "C" parser tool. The statistics of this program are listed in Table 5.4. This section presents the objects identified by the object finder during the top-down analysis method. In addition, we present the clusters identified in this program using Hutchens and Basili's hierarchical clustering technique [14].

The identified objects, obtained by the object finder during the top-down analysis method, are shown in Figures 5.13 and 5.14. The groups corresponding to these identified objects are derived after the user performs modifications on the identified objects. Similarly to Section 5.4.1, we name a group based on the name of the object that corresponds to the group. The groups are shown in Figures 5.15 and 5.16.

Similarly to Section 5.4.1, the clusters of this test case were identified by the clustering tool basili [21] based on the hierarchical clustering technique. The identified clusters in the symbol table management program are shown in Figure 5.17.
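The idea behind the types-based candidate objects shown in the next figures is that routines which return or manipulate the same user-defined type gravitate toward one candidate object. A minimal sketch of that grouping step, assuming a precomputed routine-to-type table (the table below is hypothetical and abbreviated from the figures, not the object finder's actual analysis):

```python
from collections import defaultdict

def types_based_objects(routine_types):
    """Group routines by the user-defined type they manipulate:
    each distinct type yields one candidate object."""
    objects = defaultdict(list)
    for routine, typ in routine_types.items():
        objects[typ].append(routine)
    return dict(objects)

# Hypothetical routine -> manipulated-type table for the symbol table module.
routine_types = {
    "new_symbol": "Symbol*", "find_symbol": "Symbol*", "insert_symbol": "Symbol*",
    "new_token": "Token*", "token_cmp": "Token*",
    "new_type": "Type*", "type_cat": "Type*",
}
objs = types_based_objects(routine_types)
print(sorted(objs))  # ['Symbol*', 'Token*', 'Type*']
```

Routines that touch several user-defined types cannot be assigned this way, which is why the figures also contain an object of undetermined type collecting such routines.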
Object z#objTFILE*#3 is {
  :(T)FILE*
  :(R)output_usage
  :(R)output_table
}
Object z#objTSymbol*#4 is {
  :(T)Symbol*
  :(R)where_defined
  :(R)symbol_dup
  :(R)new_symbol
  :(R)move_use
  :(R)move_reference
  :(R)move_def
  :(R)merge_type
  :(R)make_undefined_symbol
  :(R)make_symbol
  :(R)is_member_of
  :(R)is_defined
  :(R)is_type_alias
  :(R)is_macro_name
  :(R)insert_use
  :(R)insert_symbol
  :(R)insert_reference
  :(R)insert_def
  :(R)get_top_symbol
  :(R)get_entry_string
  :(R)get_default
  :(R)find_symbol
  :(R)add_type
}
Object z#objTToken*#5 is {
  :(T)Token*
  :(R)token_dup
  :(R)token_cmp
  :(R)new_token
  :(R)is_there
}
Object z#objTType*#6 is {
  :(T)Type*
  :(R)type_dup
  :(R)type_cat
  :(R)new_type
}

Figure 5.13.
Types-based candidate objects identified in symbol table management program

Object z#objTchar*#7 is {
  :(T)char*
  :(R)get_type_name
  :(R)get_table_index
}
Object z#objTentrytag*#8 is {
  :(T)entry_tag*
  :(R)new_entry
}
Object z#objT+undeterminedtype#9 is {
  :(T)symbotag*
  :(T)Type*
  :(T)Token*
  :(T)Symbol*
  :(T)FILE*
  :(R)output_use
  :(R)output_type
  :(R)output_type_member
  :(R)output_token
  :(R)output_reference
  :(R)output_parameter
  :(R)output_parameter_usage
  :(R)output_def
  :(R)output_declaration_type
  :(R)make_name
}

Figure 5.13 (continued)

Object z#objGcurrentcount#2 is {
  :(G)use_list
  :(G)symbol_table
  :(G)goto_reference_list
  :(G)def
  :(G)debug_token
  :(G)current_side
  :(G)current_scope
  :(G)current_func_name
  :(G)current_count
  :(R)where_defined
  :(R)table_reset
  :(R)table_init
  :(R)table_final
  :(R)remove_scope_flag
  :(R)pr_debug_token
  :(R)output_usage
  :(R)output_table
  :(R)make_symbol
  :(R)make_name
  :(R)is_defined
  :(R)is_type_alias
  :(R)is_macro_name
  :(R)insert_use
  :(R)insert_symbol
  :(R)insert_reference
  :(R)insert_def
  :(R)flag_scope_flag
  :(R)find_symbol
}

Figure 5.14. Globals-based candidate objects identified in symbol table management program