An approach to software system modularization based on data and type bindings

Material Information

Title:
An approach to software system modularization based on data and type bindings
Physical Description:
ix, 159 leaves : ill. ; 28 cm.
Language:
English
Creator:
Ogando, Roger M., 1961-
Publication Date:
1991
Subjects

Subjects / Keywords:
Software maintenance   ( lcsh )
Computer and Information Sciences thesis Ph. D
Dissertations, Academic -- Computer and Information Sciences -- UF
Genre:
bibliography   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1991.
Bibliography:
Includes bibliographical references (leaves 155-158).
General Note:
Vita.
Statement of Responsibility:
by Roger M. Ogando.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 026528860
oclc - 25248350
System ID:
AA00022634:00001

Full Text

AN APPROACH TO SOFTWARE SYSTEM MODULARIZATION
BASED ON
DATA AND TYPE BINDINGS

By

Roger M. Ogando

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

1991

Copyright 1991

by

Roger M. Ogando

ACKNOWLEDGMENTS

This work is dedicated to Ms. Olga E. Rivera and to my parents and grandmothers. I would like to express my deepest appreciation to my chairman, Prof. Stephen Yau, former cochairman Prof. Sying-Syang Liu, current cochairman Prof. Stephen Thebaut, Prof. Randy Chow, Prof. Justin Graver, and Prof. Jack Elzinga for their guidance and invaluable insight throughout this research. I am particularly indebted to Prof. Liu, Prof. Norman Wilde of the University of West Florida, Prof. Yau, and Prof. Thebaut for their financial support, supportive counsel, and for dedicating long hours to discussions and reviews. I am also grateful to many fellow graduate and undergraduate students whose contributions in research and development made this thesis possible; in particular, I am grateful to my colleague, Abu-Bakr M. Taha, who made valuable contributions to this research project. In addition, my thanks go to many roommates, classmates, and people from all over the world who made an initially strange land feel like home. Finally, I would like to thank the PRA/OAS Fellowship for the additional financial support provided during my studies.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTERS

1 INTRODUCTION

2 BACKGROUND
  2.1 Design Recovery
    2.1.1 Clustering Approaches
    2.1.2 Program Slicing Approach
    2.1.3 Program Dependence Approach
    2.1.4 Knowledge-based Approach
  2.2 Complexity Metrics
    2.2.1 Modularity Metrics
    2.2.2 Zage's Design Metrics
    2.2.3 Cyclomatic Complexity
    2.2.4 Stability Metrics

3 THE PROPOSED APPROACH
  3.1 The Proposed Approach
  3.2 Applicability of the Algorithms
  3.3 Conditions for Best Results

4 TIME AND SPACE COMPLEXITY ANALYSIS
  4.1 Algorithm 1. Globals-based Object Finder Complexity
  4.2 Algorithm 2. Types-based Object Finder Complexity

5 EVALUATION OF THE APPROACH
  5.1 Goals of the Evaluation Studies
  5.2 Methodology of the Evaluation Studies
  5.3 Primitive Metrics of Complexity
    5.3.1 Definitions
    5.3.2 Inter-group Complexity Factors
    5.3.3 Intra-group Complexity Factors
    5.3.4 Example of the Factors
    5.3.5 Validation of the Factors
  5.4 The Test Cases: Identified Objects, Clusters and Groups
    5.4.1 Test Case 1: Name Cross-reference Program
    5.4.2 Test Case 2: Algebraic Expression Evaluation Program
    5.4.3 Test Case 3: Symbol Table Management for Acx
  5.5 Comparison of Complexity
    5.5.1 Test Case 1: Name Cross-reference Program
    5.5.2 Test Case 2: Algebraic Expression Evaluation Program
    5.5.3 Test Case 3: Symbol Table Management Program for Acx
    5.5.4 Summary and Conclusions of the Comparison

6 APPLICATIONS: DESIGN RECOVERY BASED ON THE APPROACH

7 A PROTOTYPE FOR THE PROPOSED APPROACH
  7.1 The Object Finder: A GNU Emacs-based Design Recovery Tool
  7.2 Design Goals of the Object Finder
  7.3 Design of the Object Finder
  7.4 Xobject: Graphical User Interface

8 EXPERIENCE WITH THE OBJECT FINDER
  8.1 Example of the Top-down Analysis: Name Cross-reference Program
  8.2 Comparison with C++ Classes: Algebraic Expression Evaluation Program
  8.3 Example of the Bottom-up Analysis: Name Cross-reference Program

9 CONCLUSIONS AND FURTHER STUDY

APPENDIX A  OBJECT FINDER PROTOTYPE USER'S MANUAL
  A.1 Introduction
  A.2 Operation
    A.2.1 Basic Setup Commands
    A.2.2 Object Finder Analysis
    A.2.3 Display Analysis Results and Identified Objects
  A.3 Buffer Structure and Files
    A.3.1 System Buffer Structure
    A.3.2 System Files

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2.1 Information flows relation generated for the example
3.1 Complexity index function for types in the "C" programming language
5.1 Type size associated with several types
5.2 Type size associated with variables of different types in the original version of the recursive descent expression parser
5.3 Type size associated with variables of different types in version 1 of the recursive descent expression parser
5.4 Statistics of the test case programs
8.1 Statistics of the name cross-reference program
8.2 Statistics of the algebraic expression evaluation program
8.3 Comparison of candidate objects and "C++" classes

LIST OF FIGURES

2.1 An example of information flow
2.2 Possible skeleton code for the example
3.1 Tree representation of types for complexity index function
5.1 Schematic illustrations of access pairs and data bindings
5.2 Example of the primitive complexity metrics factors
5.3 Identified objects in original version of recursive descent expression parser
5.4 Identified objects in version 1 of recursive descent expression parser
5.5 Validation of the primitive complexity metrics factors
5.6 Groups based on objects identified in name cross-reference program
5.7 Clusters found in the name cross-reference program by basili
5.8 Groups based on clusters found in name cross-reference program
5.9 Groups based on types-based objects identified in algebraic expression evaluation program
5.10 Groups based on globals-based objects identified in algebraic expression evaluation program
5.11 Clusters found in algebraic expression evaluation program
5.12 Groups based on clusters found in algebraic expression evaluation program
5.13 Types-based candidate objects identified in symbol table management program
5.14 Globals-based candidate objects identified in symbol table management program
5.15 Groups based on types-based objects identified in symbol table management program
5.16 Groups based on globals-based objects identified in symbol table management program
5.17 Clusters found in symbol table management program
5.18 Groups based on clusters found in symbol table management program
6.1 The object finder system flow
7.1 The object finder conceptual model
7.2 The object finder implementation outline
7.3 A scanner of tokens
7.4 Process to control the ANSI "C" cross-reference tool acx
7.5 Xobject commands
8.1 Candidate objects identified in the name cross-reference program
8.2 Candidate objects identified in the name cross-reference program displayed by xobject
8.3 Modified candidate objects in the name cross-reference program displayed by xobject
8.4 Types-based candidate objects identified in the algebraic expression evaluation program
8.5 Globals-based candidate objects identified in the algebraic expression evaluation program
8.6 Candidate objects identified in algebraic expression evaluation program displayed by xobject
8.7 Initial candidate object defined by the user in the name cross-reference program
8.8 Extended candidate object resulting from the data-routine analysis
8.9 Final candidate object

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment
of the Requirements for the Degree of Doctor of Philosophy

AN APPROACH TO SOFTWARE SYSTEM MODULARIZATION
BASED ON
DATA AND TYPE BINDINGS

By

Roger M. Ogando

August 1991

Chairman: Stephen S. Yau
Major Department: Computer and Information Sciences

The maintenance of software systems usually begins with considerable effort spent in understanding their system structures. A system modularization defines a structure of the system in terms of the grouping of routines into modules within the system. This dissertation presents an approach to obtaining a modularization of a program based on the object-like features found in the program.

While object-oriented methodologies for software design and development have only been clearly enunciated in the last few years, many object-like features, such as data grouping, abstract data types, and inheritance, have been in use for some time. In this dissertation, methodologies which aid in the recovery of the object-like features of a program written in a non-object-oriented programming language are explained. Two complementary methods for "object" identification are proposed, which focus on data bindings and on type bindings in a program. The proposed approach looks for clusters of data, structures, and routines that are analogous to the objects and object classes of object-oriented programming. The object finder is an interactive tool that combines the two methods while using human input to guide the object identification process. The experience of using the object finder and two evaluation studies of the object identification methods are also presented.

CHAPTER 1
INTRODUCTION

The maintenance of software systems usually begins with considerable effort spent in understanding their system structures and data. A system modularization defines a structure of the system in terms of the grouping of routines into modules within the system. This dissertation presents an approach to obtaining a modularization of a program based on the object-like features found in the program. In modifying existing software, professional maintainers are almost unanimous in identifying the understanding of system data as one of their greatest challenges. Successful maintenance requires precise knowledge of the data items in the system, the ways these items are created and modified, and their relationships. Changing a program without a clear vision of its implicit data model is a very risky undertaking.

Little work has explicitly addressed the problem of data understanding during software maintenance. A number of methodologies attempt to aid human understanding of program constructs by cross-referencing, by capturing dependencies [9, 46], by program slicing [13, 43], or by the ripple effect analysis of changes [50]. Other tools that have been proposed so far, such as those described by Ambras and O'Day [1], Biggerstaff [4], Kaiser et al. [16], Rich and Waters [32], and Yau and Liu [49], use knowledge-based approaches to provide inference capabilities. As a result, a user can derive additional information that may not be explicit in the program code.

Technologists in software design have made great progress in abstracting the ways computer programs use data. Most notable, perhaps, has been the emergence of the concept of object-oriented design and development. Booch defines an object as "an entity whose behavior is characterized by the actions it suffers and that it requires of other objects" [5]. In practice, most objects are collections of data, together with the methods needed to access and manipulate those data.

Although object-oriented programming constructs are not directly supported in conventional programming languages such as "C" and "Ada," several object-like features, such as groupings of related data, abstract data types, and inheritance, have been in use for some time and may occur in an existing program. If such software needs to be maintained, it would be highly advantageous to identify the object-like features in the system. Knowledge of such "objects" would be important to:

1. Understand the system's design more precisely.

2. Facilitate reuse of existing "methods" contained in the system.

3. Avoid degrading the existing design by introducing unnecessary references to data that should be private to a given class of "objects."

4. Reengineer the system from a conventional programming language (such as "C") into an object-oriented language (such as "C++") to facilitate future enhancements.

We identified two important factors necessary for characterizing objects: global or persistent data [18] and the types of formal parameters and return values. Each factor in turn gives rise to an algorithm for object identification. This dissertation presents these two algorithms for identifying object-like features in existing source code. One focuses on the data bindings between program components and the other on the type bindings between program components. The two algorithms, as well as their implementations, are collectively known as the object finder.

The Globals-based Object Finder algorithm uses the information provided by persistent data structures to identify the objects in a program. Data bindings between system components have previously been used as the basis for clustering. In hierarchical clustering [14], for example, the elements chosen for grouping are the ones with the smallest "dissimilarity," that is to say, with the highest number of data bindings between elements. Our approach for globals-based object identification provides similar capabilities by grouping those routines which access a common global variable into "highly connected" objects. This algorithm handles procedural programming languages, such as "Ada," "C," "COBOL," "Pascal," or "FORTRAN," that provide scoping mechanisms allowing the definition of global variables and side effects on those global variables.

The Types-based Object Finder algorithm uses type bindings as the basis for grouping routines into objects. This algorithm groups routines according to the types of data used for the return values and formal parameters of routines. Types-based object identification considers the "semantic" information provided by the types during clustering. This is different from other clustering techniques based on semantic information, such as conceptual clustering [35], which considers "light semantic" profiles of a system (from detailed cross-reference information) during clustering. In the latter, a concept tree represents the common features of the members of a group, such as the names that a software unit uses ("names-used") and the names of the places where it is used ("user-names"). This algorithm handles those procedural programming languages which provide explicit type construction mechanisms.

This dissertation also presents the experience of using the object finder algorithms and an evaluation of the object finder algorithms through careful examination of the results. Two studies were performed to evaluate the object identification algorithms. In study I, the evaluation consisted of comparing the groups (identified objects) identified using the object finder with the groups (clusters) identified using hierarchical clustering [14]. Study II compared the identified objects found in a program with the object-oriented programming classes found in the object-oriented version of the program.

The comparison in study I was based on the complexity of the two partitionings resulting from each clustering technique. The measure of the complexity of the partitionings is similar to coupling and cohesion [6]. Thus, to measure the complexity of a given group in a partitioning, we defined a new set of complexity metrics factors: inter-group complexity, which measures the complexity of the interface of a group, and intra-group complexity, which measures the internal complexity of a group. For this evaluation, we instrumented the program with access pairs and data binding triples in order to measure those metrics. These newly defined complexity metrics factors were validated, and the results are reported in this work. The comparison results are also reported in this dissertation.

In study II, a "C" program, translated from a "C++" program, is used to compare the objects identified in the "C" program with the classes found in the "C++" version of the program. This example also illustrates some of the design recovery capabilities of the object finder when abstracting the underlying structure of a system.

The remainder of the dissertation is organized as follows: Chapter 2 provides a brief overview of design recovery and complexity metrics. Chapter 3 introduces the new approach, by way of several examples, for modularizing an existing program by identifying the object-like features found in the program. Chapter 4 discusses the time and space complexity of the new approach. Chapter 5 presents the evaluation results of the proposed approach, using complexity metrics, and discusses the new set of complexity metrics factors and their validation. Chapter 6 discusses an application of the proposed approach in software system design recovery and introduces a new approach for design recovery, either top-down or bottom-up. Chapter 7 outlines the implementation of a prototype designed to demonstrate the flexibility and portability of the design. Chapter 8 discusses the experience of using the approach to modularize and recover design information from several programs. The most important limitation of the proposed approach is that it is currently restricted to two criteria of object identification. Many other criteria are suggested for future study, including object-oriented principles such as the classification of the objects in the application, organization of the objects, and identification of operations on the objects. More experimentation is also needed to further evaluate the object identification approach, and other metrics factors should be added to measure the "object-orientedness" of the recovered design. The conclusions and the future study are discussed in Chapter 9.

CHAPTER 2
BACKGROUND

This chapter summarizes some relevant background information on design recovery and complexity metrics.

2.1 Design Recovery

Design recovery is defined by Biggerstaff [4] as recreating design abstractions from a combination of code, existing design documentation (if available), personal experience, and general knowledge about the problem and application domain. Design recovery must generate all of the information required for a person to understand the functionality of a program, how it accomplishes its responsibilities, and why it performs that functionality.

Design recovery is common and critical throughout the software life cycle. The developer of new software usually has to spend a great deal of time trying to understand the structure of similar systems and system components. The software maintainer spends most of his or her time studying a system's structure to understand the nature and effect of a requested change. Without fully understanding the similar systems, the developer may not harvest the reusable components and reusable designs. Without fully understanding the program to be maintained, the maintainer may conduct inefficient or incorrect program modifications.

Without automated techniques, design recovery is difficult and time consuming because the source code does not usually contain much of the original design information, changes are not adequately documented, there is no change stability, and there are ripple effects when making changes. Furthermore, large-scale software worsens these difficulties. Thus, automated support for the understanding process is very desirable.

In the following sections we survey the current approaches to design recovery.

2.1.1 Clustering Approaches

The main characteristics of the clustering approach are the structural analysis of large software systems via adequate clustering techniques and the retrieval of high-level structural information from the existing code [23]. The goal is to group routines into modules within the system so as to reflect the modules defined by the developer.

Clustering is the analysis of the interfaces between the components of a system. It helps to determine the modularization that those interfaces define. The modules defined by this analysis are called clusters.

Clustering techniques based on data bindings have been well studied, but the technique based on type bindings has only recently been published. Our object identification approach falls under the latter of these clustering techniques. In the following two sections, the data binding approach and the type binding approach are examined.

2.1.1.1 Data binding

In this section, we explain in detail the clustering techniques, based on data bindings, used in deriving a system's clusters. Later, we use these clusters to compare the grouping of routines derived by the object finder with the groups derived by these clustering techniques.

Data binding [2, 41, 14] reflects the interaction among the components of a system; it has previously been used for module interaction metrics [2]. In their work, Hutchens and Basili [14] use data bindings to measure the interfaces between components of a system and to derive system clusters. For example, assume there are two procedures, p1 and p2, and a variable x in a program. When procedures p1 and p2 and the variable x are in the same static scope, whether the procedures access the variable or not, this is called a potential data binding, denoted (p1, x, p2), because it reflects the possibility of data interaction between the two procedures. If both procedures access variable x, then there exists a used data binding. More work is required to calculate a used data binding than a potential data binding, but the former reflects a similarity between p1 and p2. If procedure p1 assigns a value to variable x and procedure p2 references x, this reflects a flow of information from p1 to p2 via the variable x. This is called an actual data binding, and it is difficult to calculate, since a distinction between reference and assignment must be maintained. Besides these bindings, control flow data binding is another kind of data binding, but it requires considerably more computation effort than actual data binding, because static data flow analysis must be performed.
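
These distinctions can be illustrated with a small "C" fragment (a hypothetical example of ours, not drawn from the test case programs):

int x;                       /* global variable in the scope of both routines */

void p1(void) { x = 10; }    /* p1 assigns a value to x */
int  p2(void) { return x; }  /* p2 references x         */

/* (p1, x, p2) is a potential data binding: p1, p2, and x share a scope.
 * It is also a used data binding, since both routines actually access x,
 * and an actual data binding, since p1 assigns x and p2 references it,
 * so information may flow from p1 to p2 through x. */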

After the data binding information is available, the system components can be grouped based on the strength of their relationships with each other [3]. This is derived by specialized clustering techniques. The use of data bindings in clustering analysis provides meaningful pictures of high-level system interactions and even system "fingerprints," which are descriptive analogies to galactic star systems. Next, we summarize Hutchens and Basili's hierarchical clustering modularization approach [14] and the current implementation [21] of their approach.

The nature of Hutchens and Basili's clustering algorithms is bottom-up or agglomerative, since they iteratively create larger and larger groups until the elements have coalesced into a single cluster. Thus, this is a module composition technique. The elements chosen for grouping are the ones with the smallest dissimilarity. There are several methods for computing the dissimilarity [14]; the clustering algorithm used by Hutchens and Basili corresponds to the "single-link" algorithm [14], which takes the smallest dissimilarity between the elements of each pair of newly formed clusters as the new coefficient between them.

The first step in hierarchical clustering is to abstract the data to obtain a binding matrix which represents the number of data bindings between any two components of the system. This matrix is symmetric. The current implementation considers all levels of data binding up to actual data bindings.

The second step is to obtain a dissimilarity matrix using one of two alternative methods:

1. Recomputed Bindings: based on the percentage of the bindings that connect to either of two components of the system. The dissimilarity matrix p is defined by d(i,j) = p(i,j) = (sum_i + sum_j - 2b(i,j)) / (sum_i + sum_j - b(i,j)), where sum_i is the number of data bindings in which component i occurs, sum_j is the number of data bindings in which component j occurs, and b(i,j) is the number of data bindings between components i and j. In this case, p(i,j) is the probability that "a random data binding chosen from the union of all bindings associated with i or j is not in the intersection of all bindings associated with i or j" [14].

2. Expected Bindings: a weight is assigned to each binding level relative to the total number of elements under consideration in a given iteration. The dissimilarity matrix is defined by d(i,j) = (k/(n - 1)) / bind(i,j), where n is the number of elements under consideration, k is the number of bindings involving either element i or element j, and bind(i,j) is the number of bindings between i and j. Thus, one would expect k/(n - 1) of the bindings to be between i and j.

The current implementation of this clustering technique uses expected bindings to compute the dissimilarity matrix.

During the third step, the new clusters are formed by grouping together those elements whose dissimilarity is the smallest. Then a new binding matrix is obtained from the clusters, and the process starts again from this new binding matrix.
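
To make the second computation concrete, the following "C" sketch derives a dissimilarity matrix from a binding matrix using the expected bindings formula above (a minimal sketch of ours; the fixed matrix size and example values are assumptions, not part of the published implementation):

#include <stdio.h>

#define N 3   /* number of elements (routines) under consideration */

/* Expected bindings dissimilarity: d(i,j) = (k/(n-1)) / bind(i,j),
   where k is the number of bindings involving element i or element j. */
void dissimilarity(int bind[N][N], double d[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            if (i == j || bind[i][j] == 0) { d[i][j] = 1e9; continue; }
            int k = -bind[i][j];   /* the i-j bindings appear in both row sums */
            for (int m = 0; m < N; m++)
                k += bind[i][m] + bind[j][m];
            d[i][j] = ((double)k / (N - 1)) / bind[i][j];
        }
}

int main(void)
{
    int bind[N][N] = { {0, 4, 1}, {4, 0, 0}, {1, 0, 0} };  /* symmetric */
    double d[N][N];
    dissimilarity(bind, d);
    printf("d(0,1) = %.3f  d(0,2) = %.3f\n", d[0][1], d[0][2]);
    return 0;
}

The pair with the smallest dissimilarity, here elements 0 and 1, would be merged first.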

Hutchens and Basili summarize the problem of using data bindings alone for modularization: "whenever a module that defines an abstract data type and has no local data that is shared among the operations on the type, [the module] will not be located using this method" [14]. They explain that this is due to the fact that there is no direct data binding between the operations of the module and that all the interactions are indirect, through the procedures that use the abstraction.

The algorithms for object identification in this dissertation present a solution to this problem that consists of using data types and establishing "relationships" between such types and the routines that use them for formal parameters or return values. Then the abstract data type is revealed from the hiding imposed by the abstraction mechanisms.

2.1.1.2 Type binding

Type binding methodology [22] is the most recently published design recovery technique. It analyzes conventional procedural programs in the context of an object-oriented paradigm.

Due to the gradual software paradigm progression from a purely procedural approach to an object-based approach and now to the object-oriented approach, object-like features, such as data grouping, abstract data types, and inheritance, already exist in conventional programming languages such as "C" and "Ada." It is very likely that object-oriented concepts have been used in existing programs for some time. Type binding methodology consists of identifying the objects in conventional procedural programs in order to recover the underlying system structures [22]. The focus of this dissertation, the object finder, is a methodology that combines both the data binding approach and the type binding approach in modularizing a software system.

The object finder is somewhat analogous to other software clustering methods, but it is unique in searching for a particular kind of cluster which is similar to an abstract data type or object and cannot be found by a data binding methodology.

2.1.2 Program Slicing Approach

The concept of program slicing was originally discussed by Mark Weiser [43]. Weiser defines a slicing criterion as a pair (p, V), where p is a program point and V is a subset of the program's variables. In his work, a slice of a program consists of all statements and predicates of the program that might affect the values of the variables in V at point p. Weiser's work has been improved by Horwitz et al. [13] on the problem of interprocedural slicing: generating a slice of an entire program, where the slice crosses the boundaries of procedure calls.

The program slicing technique can help a programmer understand complicated source code by isolating individual computation threads within a program. It can also aid in debugging and be used for automatic parallelization. Program slicing is also used for automatically integrating program variants; in this case, slices are used to compute a safe approximation to the change in behavior between a program P and a modified version of P, and to help determine whether two different modifications to P interfere [13].
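
For example, consider the following "C" fragment and the slicing criterion formed by the return statement and the variable sum (a hypothetical example of ours). The slice consists of every statement except those that contribute only to prod:

int main(void)
{
    int i = 1;
    int sum = 0;
    int prod = 1;         /* not in the slice on sum */
    while (i <= 10) {     /* in the slice: the predicate controls sum */
        sum = sum + i;
        prod = prod * i;  /* not in the slice on sum */
        i = i + 1;        /* in the slice: i affects the predicate */
    }
    return sum;           /* slicing criterion: the value of sum here */
}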

The main drawback of program slicing in recovering high-level information is that this approach is oriented towards low-level information abstraction. This makes it difficult to understand high-level system interaction and the overall system structure.

2.1.3 Program Dependence Approach

Program dependencies arise as the result of two separate effects. First, a dependence exists between two statements whenever a variable appearing in one of the statements may have an incorrect value if the statements were reversed. For example, given

A = B + C    -- S1
D = A + E    -- S2

S2 depends on S1, since executing S2 before S1 would result in S2 using an incorrect value for A. Dependencies of this type are called data dependencies. Second, a dependence exists between a statement and the predicate whose value immediately controls the execution of that statement. In the sequence

if (A) then   -- S1
  B = C + D   -- S2
endif

S2 depends on predicate A, since the value of A determines whether S2 is executed. Dependencies of this type are called control dependencies.

Program dependence analysis can be used in program slicing and in identifying reusable components for software maintenance [46]. Capturing the program dependencies can help program understanding to aid modification. On the other hand, this is limited to specific components of a program and does not facilitate high-level understanding of a program. However, one significant application of program dependence knowledge is in the detection of parallelism and code optimization [9].

2.1.4 Knowledge-based Approach

Knowledge-based approaches [49, 1, 16, 32, 4] are very sophisticated and complicated general methodologies of design recovery. Their most distinguishing property, as far as program understanding is concerned, is that instead of just using the existing code, they use all the program information, including the source code, the documentation, the execution histories, program analysis results, etc.

The program information is expressed as patterns of program structures, problem domain structures, language structures, naming conventions, and so forth. It is stored in a central knowledge base which provides frameworks for the interpretation of the code.

A knowledge-based system is not a single tool, such as an editor or debugger. It is a collection of tools sharing a common knowledge base. It holds the promise of providing programmers a next-generation programming environment with the goal of dramatically improving the quality and productivity of software development [1]. However, knowledge-based approaches are still at the research stage. We consider that a more constrained tool, such as the object finder, provides immediate, more accurate knowledge about the high-level understanding of a program, directly from its source code.

2.2 Complexity Metrics

This section presents the complexity metrics considered for the evaluation of the object finder. We also explain the rationale for developing a new set of metrics to be used in this evaluation.

Software complexity is defined by Ramamoorthy [30] as "the degree of difficulty in analysis, testing, design, and implementation." However, our notion of software complexity is closer to that of structure complexity metrics [15], which view the program as a component of a larger system and focus on the interconnections of the system components.

Software metrics are classified as either design or implementation (code) metrics; design metrics measure the complexity of a design, whereas implementation (code) metrics measure that of an implementation. The metrics chosen include design metrics, such as the metrics of modularity [27] (cohesion and coupling), and implementation (code) metrics, such as the cyclomatic complexity [25] of the program. Furthermore, we include a design-implementation metric: the stability of programs [48].

In addition, we considered code metrics, which focus on the individual system components (procedures and modules) and require a detailed knowledge of their internal mechanisms, such as McCabe's cyclomatic complexity number [25]. We also consider, as indicated above, structure metrics, including Henry and Kafura's information flow metrics [11] and Yau and Collofello's logical stability metric [48]. Each of these metrics is measured by different characteristics of the program; that is to say, by evaluating the characteristics we may infer the degree of the metrics.

In the following sections we explain several of these metrics and the rationale for discarding each metric.

2.2.1 Modularity Metrics

Modularity metrics were initially defined by Myers [27]. They include two related aspects: the cohesion (strength or binding) and the coupling of modules. Coupling is an indication of the level of interconnection between modules. A software component is said to exhibit a high degree of cohesion if the elements in that unit exhibit a high degree of "functional relatedness" [39]; that is to say, they exhibit functional unity [8]. The coupling among "modules" (either procedures/functions in a non-object-oriented language or objects in the object finder) is measured by "structural fan-in/fan-out" degrees [6] and by "informational fan-in/fan-out" [11]. Since the data flow count includes procedure calls, informational fan-in/fan-out subsumes structural fan-in/fan-out [39].

Information flow [11] concepts are used to measure the coupling between modules in a software system. These measures focus on the interfaces which connect system components. Myers [27] established six categories of coupling based on the data relationships among modules. The information flow metrics can recognize two of these categories: content coupling (direct references between the modules) and common coupling (the sharing of a global data structure). The information flow metrics are also used to measure the procedure and module complexity of a software system.

A measure of the "strength of the connections from module A to module B" [11] is:

(PEI(A) + PII(B)) x IP(A, B)

where PEI(A) is the number of procedures exporting information from module A, PII(B) is the number of procedures importing information into module B, and IP(A, B) is the number of information paths between the procedures. Thus, the resulting metric is a matrix of the coupling between any two modules in the software system.

The information needed to calculate this measure of coupling follows.

The information flow between modules depends on the information flow between the procedures which are part of the modules. A module is defined with respect to a data structure D in the program, consisting of those procedures which either directly update D or directly retrieve information from D. Thus, examination of the global flow in each module reveals the number of procedures in each module and all possible interconnections between the module procedures and the data structure. There are several kinds of information flow between modules:

Definition 1. There is a global flow of information from module A to module B through a global data structure D if A deposits (updates) information into D and B retrieves information from D.

Definition 2. There is a local flow of information from module A to module B if one or more of the following conditions hold:

1. A calls B;

2. B calls A and A returns a value to B, which B subsequently utilizes; or

3. C calls both A and B, passing an output value from A to B.

Definition 3. There is a direct local flow of information from module A to module B if condition (1) of Definition 2 holds for a local flow.

Definition 4. There is an indirect local flow of information from module A to module B if condition (2) or condition (3) of Definition 2 holds for a local flow.

Some examples of these information flows are illustrated in Figure 2.1, which shows a simple system consisting of six modules (A, B, C, D, E, F), a data structure (DS), and the connections among these components.

The possible skeleton code for this system is shown in Figure 2.2. As indicated in the skeleton code, module A retrieves information from DS and then calls B, passing a parameter; module B then updates DS. C calls D, passing a parameter. Module D calls E with a parameter, and E returns a value which D then uses and passes to F. The function of F is to update DS.

Figure 2.1. An example of information flow


typedef int type_ds;   /* representative data type for the example */

type_ds DS;            /* the global data structure */

void A()
{
    type_ds x;

    x = DS + 1;        /* retrieve information from DS */
    B(x);
}

void B(type_ds x)
{
    DS = x;            /* update DS */
}

void C(type_ds p)
{
    D(p);
}

void D(type_ds p)
{
    type_ds q;

    q = E(p);
    F(p, q);
}

type_ds E(type_ds p)
{
    return (p + 3);
}

void F(type_ds p, type_ds q)
{
    DS = p + q;        /* update DS */
}

Figure 2.2. Possible skeleton code for the example

Table 2.1. Information flows relation generated for the example.

Direct local flows:    A -> B,  C -> D,  D -> E,  D -> F
Indirect local flows:  E -> D,  E -> F
Global flows:          B -> A,  F -> A


The information flow analysis [11] of a program consists of deriving the complete flow structure using a procedure-by-procedure analysis of the program.

The information flows for this example are summarized in Table 2.1. The direct local flows are simply due to parameter passing. The indirect local flows are due to "side-effect" relationships between modules. The first indirect flow, from E to D, results when E returns a value (q) which D "uses" in its computation. The second indirect flow, from E to F, results when the information (q) that D receives from E is passed unchanged to F. Finally, the global flows are due to information passing through the global data structure DS.

The next step is to compute the complexity of a module. The complexity of a module is the sum of the complexities of the procedures within the module.

The complexity of a procedure depends on two factors: the complexity of the procedure code and the complexity of the procedure's connections to its environment. A very simple length measure was used as an index of procedure code complexity [11]: the number of lines of text in the source code of the procedure (including embedded comments, but not those preceding the procedure statement). The connections of a procedure to its environment are determined by the (informational) fan-in and fan-out of the procedure, defined as follows:

Definition 5. The fan-in of procedure A is the number of local flows into procedure A plus the number of data structures from which procedure A retrieves information.

Definition 6. The fan-out of procedure A is the number of local flows from procedure A plus the number of data structures which procedure A updates.

However, in order to compute the complexity of the procedures which are included in a module, one should only consider the local flows for the data structures associated with the module. Finally, the complexity of a procedure is:

length x (fan-in x fan-out)^2

The term (fan-in x fan-out) represents the total possible number of combinations of an input source to an output destination.

In conclusion, the complexity of a procedure contained in a specific module is computed using only the local flows for the data structure associated with that module. The complexity of a module is computed as the sum of the complexities of the procedures within the module. We use these complexity metrics concepts in defining a new set of complexity measures to evaluate the object finder.
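
A minimal "C" sketch of this computation follows (the data layout and function names are ours; the formula is the one given above):

/* Procedure complexity: length x (fan-in x fan-out)^2. */
struct procedure {
    int length;   /* lines of source text, including embedded comments */
    int fan_in;   /* local flows in, plus data structures read         */
    int fan_out;  /* local flows out, plus data structures updated     */
};

long proc_complexity(const struct procedure *p)
{
    long flows = (long)p->fan_in * p->fan_out;
    return p->length * flows * flows;
}

/* Module complexity: the sum over the procedures within the module. */
long module_complexity(const struct procedure procs[], int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += proc_complexity(&procs[i]);
    return sum;
}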

2.2.2 Zage's Design Metrics

In the case of a structured design, Zage [51] developed a design quality metric D(G) of the form

D(G) = k1(De) + k2(Di)

In this equation, k1 and k2 are constants, and De and Di are, respectively, an external and an internal design quality component. De considers a module's external relationships to other modules in the software system, whereas Di considers factors related to the internal structure.

The calculation of De and Di is performed at two different stages of software design. De is based on information available during architectural design, whereas Di is calculated after the detailed design is completed.

De is calculated for each module of a system and is comprised of two terms: one product related to the amount of data flowing through the module and another product giving the number of paths through the module:

De = (Weighted Inflows x Weighted Outflows) + (Fan-In x Fan-Out)

De appears to highlight stress points in architectural design. By redesigning these points, lower values of De were obtained [51], which, in effect, says that a reduction of De means a reduction in the coupling between modules.

The internal design metrics component Di is calculated as follows:

Di = w1(CC) + w2(DSM) + w3(I/O)

where

CC (Central Calls) are procedure or function invocations,

DSM (Data Structure Manipulations) are references to complex data types,

I/O (Input/Output) are external device accesses,

and w1, w2, and w3 are weighting factors.

The use of these three measures (CC, DSM, and I/O) is due to the desire to "choose a small set of measures which would be easy to collect and which would identify stress points to capture the essence of complexity" [51]. Their results indicate that stressing the data structure manipulation usage within modules gives excellent results as a predictor of error-prone modules. The proposed values of the weighting factors are w1 = 1, w2 = 2.5, and w3 = 1. Di was also a better predictor of error-prone modules than cyclomatic complexity v(G) and lines of code (LOC) [51].

The design metrics were used as guidelines for the development of our new set of complexity metrics.
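
The two components can be sketched in "C" as follows (a minimal sketch of ours, using the weighting factors quoted above; the data layout is an assumption):

struct module_counts {
    int weighted_inflows, weighted_outflows;  /* data through the module  */
    int fan_in, fan_out;                      /* paths through the module */
    int cc;   /* Central Calls                */
    int dsm;  /* Data Structure Manipulations */
    int io;   /* Input/Output                 */
};

/* External component:
 * De = (Weighted Inflows x Weighted Outflows) + (Fan-In x Fan-Out) */
double De(const struct module_counts *m)
{
    return (double)m->weighted_inflows * m->weighted_outflows
         + (double)m->fan_in * m->fan_out;
}

/* Internal component: Di = w1(CC) + w2(DSM) + w3(I/O),
 * with w1 = 1, w2 = 2.5, and w3 = 1.                               */
double Di(const struct module_counts *m)
{
    return 1.0 * m->cc + 2.5 * m->dsm + 1.0 * m->io;
}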

2.2.3 Cyclomatic Complexity

The computation of the cyclomatic complexity [25] of a program is based on the number of connected components, the number of edges, and the number of nodes in the program control graph. In practice, for structured programs this metric consists of the number of predicates in the program plus one [25]. The conditionals are treated as contributing one for each predicate; thus we need to add two whenever there is a logical "and" and N - 1 whenever there is a case statement with N cases.
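
The counting rule can be sketched as follows (a minimal "C" sketch of ours; the counts are assumed to have been extracted from the program text beforehand):

/* v(G) for a structured program: one plus the number of predicates,
 * where each logical connective in a compound predicate adds one
 * more condition and a case statement with N cases contributes N - 1. */
int cyclomatic(int simple_predicates,  /* if, while, for, ...            */
               int logical_operators,  /* each "and"/"or" in a predicate */
               int total_cases,        /* cases over all case statements */
               int case_statements)    /* number of case statements      */
{
    return 1 + simple_predicates + logical_operators
             + (total_cases - case_statements);
}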

According to Shooman [37], McCabe has validated this metric of cyclomatic complexity. McCabe concluded that, from a set of programs, those with a complex graph and a large v(G) are often the trouble-prone programs.

The cyclomatic complexity of a program is derived by computing the number of conditions in each component (function). That is, McCabe's cyclomatic number for a collection of strongly connected graphs is determined from the union of those strongly connected graphs. The cyclomatic number of this union is computed to determine the cyclomatic number of the complete program.

The measure of complexity provided by this metric focuses on individual system components. The complexity metrics which we must consider should focus on the structure of the system, as well as on the high-level design information, as criteria to determine the complexity of a modularization.

2.2.4 Stability Metrics

The metrics on program stability [48] are based on the stability of a module, defined as a measure of the resistance to the potential ripple effect from a modification of the module on other modules in the program. There are two sides to these metrics: the logical stability of a module, defined in terms of logical considerations, and the performance stability of a module, defined in terms of performance considerations. In our metrics study, we concentrate on the logical stability aspect of programs.

The logical stability of programs is defined in terms of a primitive subset of the maintenance activity, such as a change to a single variable definition in a module. The incremental approach to computing the logical stability metrics [48] of a program begins by computing the sets of interface variables which are affected by primitive modifications to variable definitions in a module by intramodule change propagation. In addition, for each interface variable, it is necessary to compute the set of modules which are involved in intermodule change propagation as a consequence of affecting the variable. Then, for each variable definition, one must compute the set of modules which are involved in intermodule change propagation as a consequence of modifying the variable definition. Next, the individual complexity associated with each module is defined using McCabe's cyclomatic complexity. The potential ripple effect of a module is the probability that a variable definition will be chosen for modification times the complexity associated with the modules affected by that variable definition. Finally, the logical stability of a program is the inverse of this potential ripple effect of a primitive modification to a variable definition in a module.
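
Under the simplifying assumption that every variable definition is equally likely to be chosen for modification, the computation just outlined can be sketched in "C" as follows (the data layout is ours):

/* affected_complexity[i]: total cyclomatic complexity of the modules
 * involved in change propagation when variable definition i is modified. */
double logical_stability(const double affected_complexity[], int num_defs)
{
    double ripple = 0.0;
    for (int i = 0; i < num_defs; i++)
        ripple += (1.0 / num_defs) * affected_complexity[i];
    /* the logical stability is the inverse of the potential ripple effect */
    return ripple > 0.0 ? 1.0 / ripple : 1e9;
}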

We argue that our measure of complexity is based on similar grounds as the stability metrics. The stability metrics concentrate on a software system's opposition to the propagation of the effects of primitive modifications. The propagation occurs through the data transfers in the system. Our metrics, on the other hand, consider the relationships between components based on the links established by those data transfers.

CHAPTER 3
THE PROPOSED APPROACH

Our approach aids software maintenance by assisting in the understanding of the design of a software system from its source code. The approach's output is a modularization of a program that consists of the collection of "objects" found in the program. This modularization of a system that usually has no (or little) existing high-level documentation gives the maintainers an understanding of the structure of the system. Since the approach is based on the data and types of the system, it also assists the maintainers with the understanding of the system data.

The proposed approach consists of a partial classification of the program elements (routines, types, and data items) that is meaningful in the context of the target program and its real-world domain. The information required for this classification consists of the relationships between those program elements in terms of data bindings and type bindings.

Two methods of object identification are used. The first is based on global and persistent data and establishes links to the routines that manipulate such data. The second method is based on the data types and establishes relationships between such types and the routines that use them for formal parameters and return values. Both methods result in sets of identified candidate objects.

The resulting candidate objects from both methods are not completely disjoint, since they represent both object classes and instances of objects in the more classical sense. In addition, this allows the methods to capture the intentional "violations" of the underlying design made by the designer/implementor. The candidate objects represent the structure of the program in terms of groups of routines implicit in the design and the relationships between those groups.

3.1 The Proposed Approach

The proposed approach consists of identifying the object-like features in conventional programming languages. Object-oriented constructs are not directly supported in conventional programming languages; however, several object-like features, such as groupings of related data and abstract data types, are found in those programming languages. The proposed approach identifies these groupings of data and abstract data types in terms of "objects." Most objects are collections of data, together with the methods needed to access and manipulate those data.

An "object," in a conventional programming language, can be identified as a collection of routines, types, and/or data items. The routines implement the methods associated with the object, the types structure the data they conceal or process, and the data items represent or point to actual instances of the object class. Thus, we may characterize our candidate "objects" as tuples of three sets:

Candidate Object = (F, T, D)

where F is a set of routines, T is a set of types, and D is a set of data items. Any of these sets may be empty; ideally, sets from distinct objects will not overlap, so that a routine, type, or data item should not appear in more than one object.
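
As an illustration only, such a tuple could be represented in "C" roughly as follows (a sketch of ours, not the prototype's actual data structure; program elements are represented by their names):

struct candidate_object {
    const char **routines;    /* F: routines implementing the methods    */
    int          num_routines;
    const char **types;       /* T: types structuring the concealed data */
    int          num_types;
    const char **data_items;  /* D: instances of the object class        */
    int          num_data;
};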

A program contains routines, types, and data items. The proposed approach consists of a partial classification of these elements that is meaningful in the context of the target program and its real-world domain. A large part of the information for this classification can come from analyzing the relationships between the components of the program, but human intervention or very carefully chosen heuristics will be needed to remove coincidental and meaningless relationships.

The identified candidate objects are not completely disjoint; there is some in-

tentional fuzziness in the definition of candidate objects. It is often the case in a

real program that the original implementor (or a later maintainer!) has violated the

cleanness of the underlying design in a few instances, either from laziness or to gain

efficiency. It would unnecessarily reduce the usefulness of object finding to reject out

of hand any candidates that had small overlaps or violations of good information

hiding practice.

Furthermore, the definition given does not distinguish clearly between the con-

cept of an object class and the concept of an object. As a practical matter, in some

cases it may be easier to first find the class and then its instances and in other cases

to reverse this procedure. Thus, it is more convenient to treat the two together.

Two broad methods of object finding seem to be useful. The first is based on

global and persistent (e.g. static in "C") data and establishes links to the routines

that manipulate such data. The second methodology is based on data types and

establishes relationships between such types and the routines that use them for formal

parameters or return values. Without loss of generality, we assume that all identifiers, such as routines, variables, and types, are distinguishable by their names.

The first method of object identification is given in Algorithm 1.


Algorithm 1 Globals-based Object Finder

Input : A program in a conventional programming language, such as Ada, COBOL,

C, or FORTRAN, with scoping mechanisms.

Output : A collection of candidate objects.


Steps :

1. For each global variable x (i.e., a variable shared by at least two routines), let P(x) be the set of routines which directly use x.

2. Considering each P(x) as a node, construct a graph G = (V, E) such that:

   V = {P(x) | x is shared by at least two routines}
   E = {(P(x1), P(x2)) | P(x1) ∩ P(x2) ≠ ∅}

3. Construct a candidate object (F, T, D) from each connected component (v, e) in G, where

   F = ∪{P(x) | P(x) ∈ v}
   T = ∅
   D = ∪{x | P(x) ∈ v}

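To illustrate the algorithm end to end, the following C sketch is our own (the uses relation is hard-coded, standing in for the symbol table of a real program); it builds the sets P(x), links globals whose sets intersect, and reports each connected component of G as a candidate object. On the single stack/queue package of Example 1 below, it reports the two expected components:

    #include <stdio.h>

    #define NR 6   /* number of routines (hypothetical) */
    #define NG 2   /* number of global variables (hypothetical) */

    /* uses[r][x] is nonzero when routine r directly uses global x.
     * Routines 0..2 use global 0 (STACK); routines 3..5 use global 1
     * (QUEUE), as in Example 1 below. */
    int uses[NR][NG] = { {1,0}, {1,0}, {1,0}, {0,1}, {0,1}, {0,1} };

    /* Union-find over globals; equivalent to the depth-first component
     * search discussed in Chapter 4. */
    int parent[NG];
    int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }

    /* nonzero when P(x1) and P(x2) share at least one routine */
    int share_routine(int x1, int x2) {
        for (int r = 0; r < NR; r++)
            if (uses[r][x1] && uses[r][x2]) return 1;
        return 0;
    }

    int main(void) {
        for (int x = 0; x < NG; x++) parent[x] = x;
        /* Step 2: an edge whenever P(x1) and P(x2) intersect */
        for (int x1 = 0; x1 < NG; x1++)
            for (int x2 = x1 + 1; x2 < NG; x2++)
                if (share_routine(x1, x2)) parent[find(x1)] = find(x2);
        /* Step 3: each component yields a candidate object (F, {}, D) */
        for (int c = 0; c < NG; c++) {
            if (find(c) != c) continue;
            printf("candidate object: D = {");
            for (int x = 0; x < NG; x++)
                if (find(x) == c) printf(" g%d", x);
            printf(" }, F = {");
            for (int r = 0; r < NR; r++)
                for (int x = 0; x < NG; x++)
                    if (find(x) == c && uses[r][x]) { printf(" r%d", r); break; }
            printf(" }\n");
        }
        return 0;
    }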

Example 1. Globals-based Object Finder-Single Stack/Queue:

To motivate this first method, take as an example a package in an Ada-like

language for manipulating a single queue and a single stack of data with type Element.

The package provides interface routines such that the following three routines access the global data STACK:

procedure Push_S (X: Element);
-- Push an element X onto the Stack.
function Pop_S return Element;
-- Pop the top element from the Stack.
function Is_Empty_S return Boolean;
-- Return true if the Stack is empty.

and the following three routines access the global data QUEUE:

procedure Push_Q (X: Element);
-- Push an element X onto the Queue.
function Pop_Q return Element;
-- Pop the front element from the Queue.
function Is_Empty_Q return Boolean;
-- Return true if the Queue is empty.

If there is no other direct relation between these two groups, then clearly Push_S, Pop_S, and Is_Empty_S belong to one candidate object and Push_Q, Pop_Q, and Is_Empty_Q belong to another. The Globals-based Object Finder given above would easily identify these two objects as the following two tuples:

(F1, T1, D1) = ({Push_S, Pop_S, Is_Empty_S}, ∅, {STACK})
(F2, T2, D2) = ({Push_Q, Pop_Q, Is_Empty_Q}, ∅, {QUEUE})

However, this method in many cases may produce objects which are "too big," since

any routine that uses global data from two objects creates a link between them. Thus,

a further stage of refinement will likely be necessary in which human intervention or

heuristically guided search procedures improve the candidate objects by excluding
offending routines or data items from the graph G.

The Globals-based Object Finder could utilize information other than the ac-

cesses to global variables by routines; in particular, it could also use the information

about references and definitions of local variables and formal parameters. However, a more detailed analysis of the internals of a routine would be required to obtain that information, such as a semantic analysis to determine whether the references to local variables and formal parameters in a routine effectively access the same data item across invocations of the routine. In that case, the routine could be made part of the group of routines that access that data.

The kind of analysis needed to obtain this knowledge will be developed in future work on this dissertation. One suggestion is to use data flow analysis of the

internals of routines to determine the "pattern" of accesses of local variables and

formal parameters inside the routines. A pattern represents the uses and definitions

of local variables and formal parameters in a routine, similar to the knowledge-based approaches in Section 2.1.4. These patterns could be used as a criterion to group together those routines that exhibit similar patterns. For example, a pattern could be defined as the use of a local variable as an index into an array of elements. If this pattern is identified in a routine that has an array representation of an address table and in another routine with an array representation of a symbol table, we argue that both routines should be part of the same table indexing group.
The second method of object identification is given in Algorithm 2.

Algorithm 2 Types-based Object Finder

Input : A program in a conventional programming language, such as Ada, COBOL,
or C with data type abstraction mechanisms.

Output : A collection of candidate objects.

Steps :

1. (Ordering) Define a topological order of all types in the program as follows:

(a) If type x is used to define type y, then we say x is a "part of" y and y "contains" x, denoted by x ≤ y.

(b) x ≤ x is true.

(c) If x ≤ y and y ≤ x, then we say x is "equivalent" to y, denoted x = y.

(d) If x ≤ y and y ≤ z, then x ≤ z.

2. (Initial classification) Construct a relationship matrix R(F, T) in which
rows are routines and columns are types of formal parameters and return
values. Initially, all entries of R(F, T) are zeroes. An entry R(f, t) is set
to 1 if type t is a "part of" the type of a formal parameter or of a return
value of routine f.











3. (Classification readjustment) For each row f of the matrix R, mark R(f,t)

as 0 if there exists any other type on the same row which "contains" type
t and has been marked as 1.

4. (Grouping) Collect the routines into maximal groups based on sharing of types. Specifically, routines f1 and f2 are in the same group if there exists a type t such that R(f1, t) = R(f2, t) = 1.

5. Construct a candidate object (F, T, D) from each group, where

   F = {f | the routine f is a member of the group}
   T = {t | R(f, t) = 1 for some f in F}
   D = ∅

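A minimal C sketch of these steps follows (our own illustration; the matrix R, the "part of" relation, and the type and routine identifiers are hard-coded stand-ins for what a real front end would extract). Step 3 clears entries dominated by a containing type, and Steps 4 and 5 merge routines that share a surviving type:

    #include <stdio.h>

    #define NF 4   /* routines (hypothetical) */
    #define NT 4   /* types    (hypothetical) */

    /* Type ids: 0 = Complex, 1 = Stack, 2 = Queue, 3 = Boolean.
     * Routine ids: 0 = Construct, 1 = Push_S, 2 = Push_Q, 3 = Is_Empty_S. */
    int R[NF][NT] = {        /* Step 2: initial classification */
        {1, 0, 0, 0},
        {1, 1, 0, 0},
        {1, 0, 1, 0},
        {0, 1, 0, 1},
    };
    int partof[NT][NT] = {   /* Complex is "part of" Stack and Queue */
        {0, 1, 1, 0},
        {0, 0, 0, 0},
        {0, 0, 0, 0},
        {0, 0, 0, 0},
    };

    int parent[NF];
    int find(int f) { return parent[f] == f ? f : (parent[f] = find(parent[f])); }

    int main(void) {
        for (int f = 0; f < NF; f++) parent[f] = f;

        /* Step 3: clear R[f][t] when a containing type s is marked on row f */
        for (int f = 0; f < NF; f++)
            for (int t = 0; t < NT; t++)
                if (R[f][t])
                    for (int s = 0; s < NT; s++)
                        if (s != t && R[f][s] && partof[t][s]) { R[f][t] = 0; break; }

        /* Step 4: merge routines that share a surviving type (union-find) */
        for (int t = 0; t < NT; t++) {
            int first = -1;
            for (int f = 0; f < NF; f++)
                if (R[f][t]) {
                    if (first < 0) first = f;
                    else parent[find(f)] = find(first);
                }
        }

        /* Step 5: report each group as a candidate object (F, T, {}) */
        for (int g = 0; g < NF; g++) {
            if (find(g) != g) continue;
            printf("object: F = {");
            for (int f = 0; f < NF; f++) if (find(f) == g) printf(" f%d", f);
            printf(" }, T = {");
            for (int t = 0; t < NT; t++)
                for (int f = 0; f < NF; f++)
                    if (find(f) == g && R[f][t]) { printf(" t%d", t); break; }
            printf(" }\n");
        }
        return 0;
    }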
Again, in many cases the candidate objects created may be "too big." As can

be seen in the following example, the culprit will often be a type that a human can

easily identify as irrelevant to the objects being identified.

Example 2. Types-based Object Finder-Multiple Stacks/Queues:
In this example, there are four basic groups of routines. The first group manip-

ulates complex numbers. The second group is related to multiple instances of stacks

of complex numbers. The third group is related to multiple instances of queues of
complex numbers. The fourth group involves routines that manipulate both stacks

and queues. The four groups are specified in algebraic form as follows:
-- Group I:
Construct (Real, Real) => Complex
-- construct a complex from two reals
"+" (Complex, Complex) => Complex -- plus
"-" (Complex, Complex) => Complex -- minus
"*" (Complex, Complex) => Complex -- multiplication
"/" (Complex, Complex) => Complex -- division

-- Group II:
Pop_S (Stack) => Stack x Complex
-- remove the top element from a stack and return it
Push_S (Stack, Complex) => Stack
-- push a complex number onto a stack
Is_Empty_S (Stack) => Boolean
-- return true if Stack is empty

-- Group III:
Pop_Q (Queue) => Queue x Complex
-- remove the head element from a queue and return it
Push_Q (Queue, Complex) => Queue
-- push a complex number onto a queue
Is_Empty_Q (Queue) => Boolean
-- return true if Queue is empty

-- Group IV:
Queue_to_Stack (Queue) => Stack -- convert a queue to a stack
Stack_to_Queue (Stack) => Queue -- convert a stack to a queue


In this example, there are no global variables used. In applying Algorithm 2,

we will develop the following matrix R:



Group ID   Routine          Complex  Real  Stack  Queue  Boolean
I          Construct           1      0
I          "+"                 1
I          "-"                 1
I          "*"                 1
I          "/"                 1
II         Push_S              0             1
II         Pop_S               0             1
II         Is_Empty_S                        1               1
III        Push_Q              0                    1
III        Pop_Q               0                    1
III        Is_Empty_Q                               1        1
IV         Queue_to_Stack                    1      1
IV         Stack_to_Queue                    1      1



Blank entries (which will be marked as 0's according to Algorithm 2) indicate no direct relationship between the routine and the type. An explicit "0" entry means that a "part of" relationship has been found by examining the internal data structures of the program.

In this example, at some point a complex will be found to contain real values and

the stack and queue found to contain complex values.
The Types-based Object Finder will initially classify all the types and routines

of groups II, III, and IV in a single large object because of the false links created by

the Boolean type that link the stack and the queue. Some heuristics could be used
to reduce such conflicts (e.g., eliminate primitive types, etc.), but it would also be

a fairly easy task for a user to intervene at this point and identify Complex, Stack,

and Queue as the objects of interest, provided that the data can be presented to him

clearly.

There seems, however, to be no easy way to categorize group IV, which involves routines that operate on both Queue and Stack objects. A solution given by object-oriented design consists of a guideline which requires that, for a routine to be a member of a class, it must either access or modify data defined within the class [18]. Clearly, this indicates that a routine should be classified according to the type of its input formal parameters; then, routine Queue_to_Stack belongs to group III (the Queue candidate object) and routine Stack_to_Queue belongs to group II (the Stack candidate object).

Excluding group IV and type Boolean, the objects identified by Algorithm 2 would be listed as follows:

(F1, T1, D1) = ({Construct, "+", "-", "*", "/"}, {Complex}, ∅)
(F2, T2, D2) = ({Push_S, Pop_S, Is_Empty_S}, {Stack}, ∅)
(F3, T3, D3) = ({Push_Q, Pop_Q, Is_Empty_Q}, {Queue}, ∅)

The previously described topological ordering of types, in Step 1 of Algorithm 2,
is appropriate whenever all types in a program are related in terms of the "part of"

relationship defined above. This ordering represents the fact that a given type is

"part of" another type; thus, the latter type is "more important" than the former

type when classifying routines into candidate objects. That is to say, a routine with

formal parameters or return values with multiple types should be classified as part of

the group of routines that manipulate data with the most important of those types,

i.e., the type that was defined using the other types of the data manipulated by the

routine.

The main problem with this type ordering scheme occurs when some types

are not related by the "part of" relationship. In this case, the previous topological

ordering of types does not completely characterize the relative importance of all the

types; thus, the classification of routines using this type ordering scheme results in

routines which may be classified under more than one type, such as the routines in

group IV of Example 2. This problem can be handled by an alternative type ordering

scheme based on the "complexity" of the data types in a system.

The relative complexity type ordering consists of ordering all the types in the

program based on the "complexity" of the types. This complexity is expressed by a

complexity index function, called CI, for a given type. Assume that the types in a

system are represented using a tree that captures the "part of" relationship between

types. Then, if type y is a "part of" type x, type x is an ancestor of type y and type

y is a descendent of type x in the tree. An example of trees representing structure

types is given in Figure 3.1 of Example 3. The complexity index of type t, denoted CI(t, d) and evaluated as CI(t, 0) at the root, is computed using the complexity index function in Table 3.1.

Given that the type of interest is the root of a tree representation of the type, the

complexity index function recursively computes the complexity of the type as the sum

of the path lengths (in number of arcs) from the root type to all its descendent types in

the tree. In Table 3.1, the complexity added by a primitive type is simply the length of the path, denoted d, from the root type to the primitive type. The complexity due to











Table 3.1. Complexity index function for types in the "C" programming language.

type t                                        CI(t, d)
primitive                                     d
pointer to primitive type y                   d + (drf × CI(y, d+1))
  or user-defined type y
array of primitive type y                     d + (dimension(t) × CI(y, d+1))
  or user-defined type y
struct t {f1, f2, ..., fn},                   d + Σ(j=1..n) CI(fj, d+1)
  where f1, f2, ..., fn are all
  base field types
struct t {f1, ..., fk, ..., fn},              d + S + f × S,
  where f1, ..., fk are base field types      where S = Σ(j=1..k) CI(fj, d+1)
  and fk+1, ..., fn are recursive field       and f = (n-k)/n
  types


pointer types is modified by the dereferencing factor, called drf, which represents the complexity the pointer adds to the complexity of the type; currently, the dereferencing factor may fluctuate between 1.5 and 2.0 depending on the system's use of pointer types. The complexity due to an array type t is modified by the number of elements in the array, i.e., its dimension(t). The complexity due to a structure type consists of the sum of the complexities of its base field types. The field types of a structure fall into two categories: recursive field types are those which consist of either a pointer to the containing structure or essentially the same structure type as the containing structure (in "C", a typedef type); base field types are those which are not recursive types. In the presence of recursive field types, the complexity due to a structure type consists of the sum of the complexities of the base field types, denoted by S, plus a fraction of this complexity caused by the recursive field types.

Two types can be easily ordered by comparing their complexity indexes, where type A is more complex than type B iff CI(B) < CI(A). Hence, type A is "more important" than type B in the topological ordering of types. A partial ordering of












        A                K
       / \              / \
      C   B            L   M
         / \          / \
        D   E        B   N
                    / \
                   D   E

Figure 3.1. Tree representation of types for complexity index function

the types in a program is established by comparing the complexity indexes of all the
types. Example 3 illustrates the computation of the complexity indexes.
Example 3. Type ordering based on the type complexity index function:

Consider two "C" data structures below where an identifier in capital letters
denotes a type in the language and one in lowercase letters denotes a data structure
field name. Assume that types A, B, K, and L are structure types and types C, D, E,
M, and N are primitive types in the "C" programming language as follows:
struct A {
    C c;
    struct B {
        D d;
        E e;
    };
};

struct K {
    struct L {
        struct B {
            D d;
            E e;
        };
        N n;
    };
    M m;
};


The tree representations of those structure types, which capture the "part of" relationship among types, are shown in Figure 3.1. Then,

The complexity indexes of types A and B in structure A, according to the complexity index function in Table 3.1, are:

    CI(A, 0) = 0 + CI(C, 1) + CI(B, 1)
             = 0 + 1 + (1 + CI(D, 2) + CI(E, 2))
             = 0 + 1 + (1 + 2 + 2)
             = 6

    CI(B, 0) = 0 + CI(D, 1) + CI(E, 1)
             = 0 + 1 + 1
             = 2


Type A is more complex than type B, thus type A is higher in the ordering of
types than type B.

The complexity index of type K is


    CI(K, 0) = 0 + CI(M, 1) + CI(L, 1)
             = 0 + 1 + (1 + CI(N, 2) + CI(B, 2))
             = 0 + 1 + (1 + 2 + (2 + CI(D, 3) + CI(E, 3)))
             = 0 + 1 + (1 + 2 + (2 + 3 + 3))
             = 12


Then, type K is more complex than type A. This type ordering scheme allows two unrelated types to be compared to determine the most complex, and thus most important, of the types.

The complexity indexes of type B by itself and within structures A and L are equal, since the complexity index depends only on the structure that the type represents, which is the same in all three cases.
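
To make the recursion of Table 3.1 concrete, the following C sketch (our own illustration; it omits the pointer, array, and recursive-field rules for brevity) reproduces the indexes just computed for structures A and B:

    #include <stdio.h>

    /* A type is a node in the "part of" tree; primitives are leaves. */
    struct type {
        const char *name;
        int nchildren;
        struct type *children[8];
    };

    /* CI(t, d): d for a primitive at depth d; otherwise d plus the CIs of
     * the fields one level deeper (the base-field struct row of Table 3.1). */
    int CI(struct type *t, int d) {
        int sum = d;
        for (int i = 0; i < t->nchildren; i++)
            sum += CI(t->children[i], d + 1);
        return sum;
    }

    int main(void) {
        /* struct A { C c; struct B { D d; E e; }; }  from Example 3 */
        struct type C = {"C", 0, {0}}, D = {"D", 0, {0}}, E = {"E", 0, {0}};
        struct type B = {"B", 2, {&D, &E}};
        struct type A = {"A", 2, {&C, &B}};
        printf("CI(A,0) = %d\n", CI(&A, 0));  /* prints 6 */
        printf("CI(B,0) = %d\n", CI(&B, 0));  /* prints 2 */
        return 0;
    }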

An additional problem, which is not addressed by either of the two type ordering schemes, occurs when fields of a structure are conceptually "more important" than the structure itself, so that the routines should be classified into objects according to the fields. For example, consider another routine Add_Top_to_Operand in the multiple stacks/queues setting (Example 2) which implements the following functionality:

Complex Add_Top_to_Operand (Stack: st, Complex: op)
-- add the top element from a stack to another complex and return the result
    Complex: result, top;

    top = Pop_S(st);
    result = "+"(top, op);
    return (result);


Using either of the two type ordering schemes, routine Add_Top_to_Operand would be classified under the Stack object. However, the functionality of the routine indicates that it should be classified under the Complex object, since the routine implements an operation on the Complex entities of the program. The functionality of routines could be captured by a data flow analysis of the internals of a routine and by examination of the types of the variables accessed inside the routine (including global variables, local variables, and formal parameters).

3.2 Applicability of the Algorithms

These algorithms for object identification apply to conventional, procedural programming languages such as "Ada," "C," "COBOL," or "Pascal."

The Globals-based Object Finder handles those conventional programming languages as well as "FORTRAN." Procedural programming languages provide static and dynamic scoping mechanisms which allow the definition of global variables as well as local variables with respect to a particular scope level. In these languages, executing the body of a routine can produce side effects on the values of the global variables [24].

The Types-based Object Finder handles procedural programming languages that provide a data abstraction mechanism which permits the construction of composite types from more primitive types. Clearly, typing mechanisms are also required for this algorithm. "FORTRAN" is excluded because it provides no explicit type construction mechanisms, with the exception of arrays. The limitation of applicative languages, such as "LISP," is that they are not explicitly typed; in addition, their (abstract) data type construction mechanisms (e.g., the list structure) are not currently handled by our analysis approach.

3.3 Conditions for Best Results

The proposed modularization approaches are particularly useful when the fol-

lowing conditions hold:


1. The program being maintained is written in a programming language that

supports object-like features such as grouping of related data and abstract

data types. Implicit abstract data types are identified by the approach of

object identification. In the case of programming languages that explicitly

support the syntactic specification of abstract data types, this modularization

is used to define the relationships between the abstract data types defined in

the system.

2. The Globals-based Object Finder requires that the program being maintained be written in a language with a static scoping mechanism. This allows the definition of global variables and the occurrence of side effects through the invocation of routines referencing the global variables.

3. The Types-based Object Finder requires that the program being maintained be written in a language with a data type abstraction mechanism that permits the construction of composite types from more primitive types.













CHAPTER 4
TIME AND SPACE COMPLEXITY ANALYSIS

In this chapter, the time and space complexities of the proposed approach are

discussed. These complexities are independently analyzed for the two approaches of

object identification.

4.1 Algorithm 1. Globals-based Object Finder Complexity

The time and space complexity of this algorithm follows:


Step 1. Build set P(x) for each global variable x: Let N be the number of rou-

tines and g be the number of global variables in a system. Assume that the input

to step 1 is a symbol table representation of a program. This symbol table consists

of a sorted list of entries each of which contains an identifier's definition and refer-

ences information. The time required to look up the references information about an

identifier from the symbol table of a program is linearly proportional to the number

of identifiers in the program; the time required is at most O(g + N). Thus, for each

global variable x, the time required to build the unordered set P(x) of routines which

directly access x is O(g + N). For g global variables, the total time complexity of

this step is O(g(g + N)). For real programs, N is usually larger than g, and the time

complexity is O(gN).

The space requirement for this step consists of the space to store the sets P(x); the maximum size of a set P(x) is O(N). For g global variables, the space requirement is O(gN).











Step 2. Construct graph G: The construction of graph G consists of making each set P(x) a node in the graph; an edge of the graph corresponds to the set intersection between two nodes (only its magnitude is important).

The data structure that stores graph G consists of lists which represent the edges of the graph as follows: an edge (g1, g2, P(g1) ∩ P(g2)), where g1 and g2 are global variables and P(g1) ∩ P(g2) is the set of common routines, which is also represented as a list. The time required to construct one edge is the time required to obtain the intersection of the unordered sets P(g1) and P(g2). The current implementation does not presort the sets P(x) before an intersection operation; thus, the time is proportional to |P(g1)||P(g2)| and the maximum time is O(N²). An improvement consists of presorting the sets P(x). This improvement is possible since the kind of maintenance tasks (design recovery) which we consider do not involve changes to the connectivity of a program; the set P(x), for a global variable x, will therefore remain unchanged after a modification. In this case, the time required to perform an intersection would be O(N).
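
For illustration, a merge pass over two presorted routine lists computes the size of an intersection in linear time; the following C sketch is ours, not the prototype's code:

    /* Size of P(x1) ∩ P(x2) for presorted routine-id lists a[0..na) and
     * b[0..nb); one merge pass, O(na + nb) time. */
    int intersect_size(const int *a, int na, const int *b, int nb)
    {
        int i = 0, j = 0, n = 0;
        while (i < na && j < nb) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { n++; i++; j++; }   /* common routine */
        }
        return n;
    }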

The total time required to construct all the edges is proportional to the number of global variables, g, in the program. The maximum number of edges is O(g²). Hence, the total time complexity of this step is O(g²N²).


Step 3. Construct candidate objects from connected components: These candidate objects are obtained using a depth-first search algorithm [10, 42] for determining the connected components of the graph G; the complexity of such an algorithm is O(M), where M is the number of edges in the graph, which in turn is bounded by the number of nodes as O(g²). This algorithm starts with some node of graph G. Then, we visit all the connected nodes in the order of a depth-first search; i.e., we walk, first, as far as possible into the graph without forming a cycle, and then

we return to the last bifurcation which had been ignored, and so on until we return

to the node from which we started. We restart the procedure just described from

some new node until all nodes have been visited.

Based on this analysis, the time complexity of the Globals-based Object Finder is O(gN + g²N² + g²). The bounding time complexity is O(g²N²).

The space complexity of this algorithm is proportional to the space required to store graph G. The space required to store all the nodes in the graph is clearly proportional to the number of nodes (i.e., the number of global variables) and the number of routines in the set P(x) associated with a node. Since the maximum number of routines in a node is N, this space complexity is O(gN). The space required to store all the edges in the graph is bounded by the number of node pairs and the intersection set corresponding to an edge, i.e., O(g²N²). Then, the total space complexity is O(gN + g²N²).

4.2 Algorithm 2. Types-based Object Finder Complexity

The time and space complexity analysis of this algorithm considers only the type ordering scheme based on the "part of" relationship between types; it follows:


Step 1. Type ordering: The definition of a topological ordering is required for

all the (abstract data) types used as types of formal parameters or return values of the

routines in a program. The topological ordering of types defines an ordering of all the

types in the program according to the "part of" relationship between any two types;

i.e., if type t1 is used to define type t2, then we say that t1 is a "part of" t2. Assume the number of types used in a program, T, is proportional to the size of the program in lines of code L.¹ One algorithm for topological sort [38] has time complexity of O(n²), where n is the number of vertices in the graph. In our analysis, the number of vertices is T. Then, the time complexity, using this topological sort, is bounded by O(T²). Another algorithm for topological sort [17] has a total time complexity bounded by 32m + 24n + 7b + 2c + 16a, where m is the number of input relations between types, n is the number of objects, a is the number of objects without predecessors (primitive types), b is the number of tape records in input, and c is the number of tape records in output.

¹For real programs, T is usually smaller than L.

The topological ordering of types is stored as a tree which represents the "part of" relationship between types in a program. The "part of" relationship between two types x and y is stored as a list (x y), where type y is "part of" type x and type x "contains" type y. Given a list (t1 t2 ... tn), type t1 "contains" types t2 through tn. The maximum space required to store this tree is O(T²). This tree of types usually has a maximum of three or four levels in its branches.
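
A small C sketch of this ordering follows (our own illustration, using Kahn's topological-sort algorithm over a hard-coded, hypothetical "part of" relation); types with no remaining parts are emitted first, so containing, "more important" types appear later in the order:

    #include <stdio.h>

    #define NT 4   /* number of types (hypothetical) */

    /* partof[x][y] is nonzero when type x is "part of" type y.
     * Hypothetical ids: 0 = Real, 1 = Complex, 2 = Stack, 3 = Boolean;
     * Real < Complex < Stack, and Boolean is unrelated. */
    int partof[NT][NT] = {
        {0,1,0,0},
        {0,0,1,0},
        {0,0,0,0},
        {0,0,0,0},
    };

    int main(void) {
        int indeg[NT] = {0}, done[NT] = {0};
        for (int x = 0; x < NT; x++)
            for (int y = 0; y < NT; y++)
                if (partof[x][y]) indeg[y]++;
        for (int out = 0; out < NT; out++) {   /* O(T^2) overall */
            for (int t = 0; t < NT; t++) {
                if (done[t] || indeg[t] != 0) continue;
                printf("t%d ", t);             /* next type in the order */
                done[t] = 1;
                for (int y = 0; y < NT; y++)   /* release its containers */
                    if (partof[t][y]) indeg[y]--;
                break;
            }
        }
        printf("\n");                          /* prints: t0 t1 t2 t3 */
        return 0;
    }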


Step 2. Initial classification: The time required to construct matrix R with N routines and T (abstract data) types is O(TN). For real programs, T is usually smaller than N; thus, the time required to construct matrix R is bounded by O(N).


Step 3. Classification readjustment: The time required for the classification readjustment is proportional to the number of routines, N, and the number of types, T, in matrix R. For a given routine in row r which has been marked with 1 for type t, the time required to determine whether there exists any other type s on the same row which "contains" type t and has also been marked with 1 is proportional to the number of types T and to the time required to determine whether type s "contains" type t. The latter time is proportional to the number of types in the type ordering tree, and it is equivalent to the time to search for a path from type t to type s in the subtree of the type ordering tree with root equal to s. The algorithm used to search for this path is a depth-first search [47], which has constant time complexity here because the tree of types has a small, bounded depth [38].

Hence, for a routine, the time required for the classification readjustment is O(T). Given that we have N routines, the time required for the classification readjustment of matrix R is O(NT).


Step 4. Grouping: The candidate objects are formed by collecting all routines that share a type t; that is to say, for a given type in column t, a candidate object consists of all the routines whose entries in column t are set to 1 after Step 3. The time required to form these candidate objects is proportional to the number of (abstract data) types T and the number of routines N in the system. The time complexity of this step is O(TN).

Based on this analysis, the time complexity of the Types-based Object Finder is O(T² + N + NT + NT). As indicated above, for real programs T is usually small with respect to N, and the bounding time complexity is O(max(T², NT)).

The space complexity of this algorithm is clearly the space required to store matrix R plus the space required to store the tree representing the type ordering, i.e., O(NT + T²).














CHAPTER 5
EVALUATION OF THE APPROACH

This chapter presents some guidelines for an evaluation of the object identification methods. The goals of this evaluation are to compare the algorithms used for identifying objects with other existing modularization techniques in terms of the complexity of the resulting modularization, and to compare the resulting modularization in a conventional, procedural programming language with the "classes" found in an object-oriented version of the program.

The evaluation of the object finder algorithms consists of careful examination

of the results of this approach. Two studies were used to evaluate the identified

objects. In the first study, named study I, we compared the groups (based on the

identified candidate objects) identified by the object finder with those groups (based

on the clusters) identified with hierarchical clustering [14]. The comparison was

based on the complexity of these two partitionings resulting from each analysis. The

results of this study are presented in this chapter. In the second study, named

study II, we compared the identified objects found in a program with the object-

oriented programming classes found in the object-oriented version of the program.

We explained the results of study II in Section 8.2.

A system's partitioning results from the system modularization, i.e., the group-

ing of routines into disjoint groups. The object finder and the hierarchical clustering

technique [14] are different methods to obtain these partitionings.

Metrics of the complexity of software structures have been used as valuable

management aids, important design tools, and as basic elements in research efforts

to establish a quantitative foundation for comparing language constructs and design

methodologies [11]. In addition, module and interface metrics have been used in

evaluating the modularization and the level of coupling of a system [14]. In study I,

we continue to use metrics to evaluate the complexity of a system partitioning in terms

of the complexity of the interfaces and the complexity of its components. These two

views of complexity parallel coupling and cohesion. Conte et al. [7] measure coupling as the number of interconnections among modules and cohesion as a measure of the relationships of the elements within a module. A design with a high degree of coupling and low module cohesion will contain more errors than a design with low module coupling and high module cohesion. Several primitive complexity metrics can be used to quantify the coupling and cohesion of a module. Accordingly, several factors are used to quantify our complexity metrics.

5.1 Goals of the Evaluation Studies

The primary goal of the evaluation studies was to demonstrate that the algorithms used for identifying objects result in system partitionings less complex than those produced by other existing modularization methods.

of a new set of primitive factors that measured the complexity of a modularization

of a system by characterizing the complexity of the corresponding partitioning.

The evaluation consisted of comparing the complexity of the candidate objects identified by the object finder with the complexity of the clusters defined using Basili's hierarchical clustering technique [14]. Another motivation for comparing the two approaches was to determine whether the object finder's results effectively capture the structure of a software system. Since Basili's hierarchical clusters represent an experimentally validated initial approximation of the groups intended by the designer of the software system [14], the results of the object finder should be less complex than, or at most as complex as, the results of the hierarchical clustering approach.











The chosen criterion for the comparison was the complexity of each partitioning, measured by complexity metrics similar to coupling and cohesion. According

to Hutchens and Basili [14], the question of strength and coupling between elements

of a given partitioning is largely still unresolved. Thus, a new set of primitive factors

was developed to measure the complexity of the relationships between modules of

a software system as well as the complexity of the relationships inside the modules.

Furthermore, a measure of module strength in terms of the uniqueness of the types

manipulated by the module was developed. That is to say, the new set of factors

when applied to the partitioning of a system, measures the complexity of the interface

between the system modules, the complexity inside the modules, and the strength of

the components inside a module.

Other metrics [27, 25, 6, 11, 51, 48, 29] were considered for computing the complexity of the partitionings. They were not suitable for this purpose, since the characteristics to be measured correspond to the complexity of the interface (similar to coupling) and to the complexity of the relationships between elements inside a module (similar to cohesion). The object finder approach assumes that we identify candidate objects in programs which were originally designed without explicit object-oriented syntactic features, even though we assume that object-like features are present in the programs. Thus, metrics based on object-oriented syntactic features could not be used in the evaluation. Hence, we created a new set of primitive complexity metrics factors which measure three aspects of complexity: factors that measure the complexity of the interfaces between modules in a partitioning; factors that measure the complexity of the relationships between components inside modules, in terms of interactions between components in the modules; and primitive metrics factors that measure the strength of the relationships between components in a module, in terms of the similarity of the data types manipulated by the components of the module.

5.2 Methodology of the Evaluation Studies

The methodology of the evaluation of the object finder included two studies: (1)

study I consisted of comparing the groups identified by the object finder (candidate

objects) with those groups identified by hierarchical clustering [14] (clusters), and (2)

study II compared the identified candidate objects in a program with those classes

found in the object-oriented version of the same program.

Several other evaluation studies are proposed for future research, including:

* Use of experts to compare the identified candidate objects in a system with the expert knowledge about the system.

* Use of student academic projects as well as industrial-size programs to further evaluate the object finder algorithms.

The steps in study I were:

1. Identify example programs for the study. Three sample programs were identified

for this study from the literature. Another industrial-size software system was

considered for future evaluation studies.

2. Compute the identified candidate objects using the Top-down Analysis (Method 1 in Chapter 6) for a sample program. The result of this method is a partitioning of the system that consists of groups which correspond to the identified candidate objects.

3. Compute the clusters in the sample program using Basili's hierarchical cluster-
ing technique [14]. The result of this technique is a partitioning of the system

which normally corresponds to the top-level clusters of the system.

4. Compute the primitive complexity metrics factors for each partitioning above.

In this step, we determined the primitive metrics factors for a partitioning of

the system according to the definitions in Section 5.3.

5. Compare the complexity of the two partitionings using the results of the prim-

itive complexity metrics factors for each partitioning.

The steps in study II are:

1. Identify example programs for the study. One program was chosen for this

study from the literature.

2. Conversion of object-oriented code into non object-oriented code. For the comparison of identified objects and classes of an object-oriented version of a program, we require a program with two versions: an object-oriented version and an equivalent non object-oriented version. We chose an object-oriented program from the literature. Then, we derived a functionally equivalent version of this program using non object-oriented techniques.

3. Compare the identified objects with the classes.

The instrument and results of study II are reported in Section 8.2.

5.3 Primitive Metrics of Complexity

Assume that a system contains a large number of routines and data structures.

Our complexity metrics factors measure the ability of a partitioning [3] of the system to absorb as many relations between routines within a group as possible and thus leave few inter-group connections, which results in less complex group interfaces. In addition, the complexity metrics measure the ability to reduce the number of relations between routines inside a group, which results in less complex relations inside groups.

That is to say, we characterize what makes one partitioning "better" than another in spite of the fact that the connectivity of routines is always the same and only their group assignment changes from partitioning to partitioning. Belady and Evangelisti define the degree of connectivity of a cluster as the number of "connections" between the elements of the cluster. Program routines and data structures are interconnected by routine invocations and references in software systems [3].

In the absence of a direct measure of inter-group complexity and intra-group complexity which, given a partitioning, would compute these metrics, we developed a set of primitive complexity metrics factors which, we argue, measure inter-group and intra-group complexities, as well as intra-group strength, as demonstrated in Section 5.3.5. The union of all primitive metrics factors related to one kind of complexity (inter-group complexity, intra-group complexity, or intra-group strength) quantifies the corresponding complexity.

The following sections present important definitions and examples, the primitive

factors of complexity, and a validation of these factors.

5.3.1 Definitions

Assume that all variables in a program have unique names. A use of a vari-

able refers to either a definition (a value is assigned to a variable) or a reference (a

variable's value is used) of the variable [12].


Definition 7 A global variable is a variable directly used within at least two modules.

Definition 8 Let P be a module. There is a module-global access pair (P,g) if g is a

global variable used within P.


Synopsis: A module-global access pair (M,g) represents the fact that global
variable g is used within module M.











Example 1 Module-global access pair (M, g)
M()
{
// g is used within M
}

Definition 9 Let (Q, g) be a module-global access pair. There is a module-global indirect access pair (P, g) if ∃ a module Q with a call to P, P(..., g, ...), such that the formal parameter of P corresponding to g is used within P.

Synopsis: A module-global indirect access pair (M, g) represents the fact that global variable g is indirectly used by module M when there is a call to M in some module N with g as the actual parameter, and the corresponding formal parameter is used within M.

Example 2 Module-global indirect access pair (M, g)

    N()                 M(a)
    {                   {
      M(g)                // a is used within M
    }                   }

Definition 10 Let P, Q be two modules and g be a global variable in P and Q. There
exists a module-global access data binding triple (P,g, Q) if g is defined within P
and referenced within Q.

Synopsis: A module-global access data binding triple (P, g, Q) represents the
fact that module P defines the value of global variable g and module Q references
the value of global variable g. It reflects the data relationship between modules P
and Q, and the direction of the information flow (from P to Q).
Example 3 Module-global access data binding triple (P, g, Q)

    P()                      Q()
    {                        {
      g <- // is defined       <- g // is referenced
    }                        }










Definition 11 Let P, Q be two modules and x be a local variable in P. There is a
local-export data binding triple (P, x, Q) if

1. x is defined within P before a call Q(..., x,...) in P, and

2. the formal parameter of Q corresponding to x is referenced within Q.

Synopsis: A local-export data binding triple (P, x, Q) represents the fact that
module P defines the value of local variable x, and the corresponding formal pa-
rameter is referenced within module Q. It is based on the binding between the local
variable and the corresponding formal parameter and the direction of the information
flow (from P to Q).
Example 4 Local-export data binding triple (P, x, Q)

    P()                      Q(a: [by reference, by value-result, by value])
    { x: local               {
      x <- // is defined       <- a // is referenced
      Q(x)                   }
    }

Definition 12 Let P, Q be two modules and x be a local variable in P. There is a
local-import data binding triple (Q, x, P) if

1. ∃ a call Q(..., x, ...) in P such that the formal parameter of Q corresponding to x is a call-by-reference or call-by-value-result parameter and is defined within Q, and

2. x is referenced within P after the call.

Synopsis: A local-import data binding triple (Q, x, P) represents the fact that
module Q defines the value of the formal parameter a, and the corresponding actual parameter in a call to Q within P is referenced within P after the call. It reflects the data relationship between modules P and Q due to the local variable-formal parameter bindings, and the direction of the information flow (from Q to P).












Example 5 Local-import data binding triple (Q, x, P)

    P()                        Q(a: [by reference, by value-result])
    { x: local                 {
      Q(x)                       a <- // is defined
      <- x // is referenced    }
    }


The data bindings due to return-value relationships are handled similarly to local-import data bindings. First, one of several transformations is performed on the function invocation and on the invoked function definition, as follows:

* Invoked function definition transformations:

1.  C CODE                    TRANSFORMATION
    Q(b,c)                    Q(a,b,c)
    {                         {
      ...                       ...
      return a;                 a <- // is defined
    }                         }

2.  C CODE                    TRANSFORMATION
    Q(b,c)                    Q(<retval>,b,c)
    {                         {
      ...                       ...
      return val;               <retval> <- val // is defined
    }                         }

* Function invocation transformations:

1.  C CODE                    TRANSFORMATION
    P(...)                    P(...)
    { x;                      { x;
      ...                       ...
      x = Q(y,z);               Q(x,y,z);
      <- x // is used           <- x // is used
    }                         }

2.  C CODE                    TRANSFORMATION
    P(...)                    P(...)
    {                         { x;
      ...                       ...
      exp(Q(y,z));              Q(x,y,z);
    }                           exp(x);
                              }










Definition 13 Let P, Q be two modules such that P uses the return value from Q after a call Q(...) in P. Perform one of the two kinds of transformations above on the invoked module definition and on the module invocation. Let a or <retval> be the formal parameter resulting from the transformation on the invoked module definition. There is a return-value data binding triple (Q, a, P) or (Q, <retval>, P) if

1. ∃ a call Q(x) in P such that x is a transformation-generated local variable in P corresponding to the transformation-generated formal parameter a or <retval> in Q that is defined within Q, and

2. x is referenced within P after the call.

Synopsis: A return-value data binding triple (Q, a, P) or (Q, <retval>, P) represents the fact that module Q defines a return value, returns it to module P, and this return value is referenced within module P after the call. Either the return value is saved in a variable after the call for later reference, or it is directly referenced after the call. Notice that the data binding is expressed in terms of the return-value variable, as opposed to the local variable in the invoking module.

Definition 14 Let P, Q be two modules and g be a global variable in P. There is a
global-export data binding triple (P,g, Q) if

1. g is defined within P, before a call Q(...,g,...) in P, and

2. the formal parameter of Q corresponding to g is referenced within Q.

Synopsis: A global-export data binding triple (P, g, Q) is symmetric to the local-
export data binding triple except that in the former the local variable x is replaced
by a global variable g.
Example 6 Global-export data binding triple (P,g, Q)

    P()                      Q(a: [by reference, by value-result, by value])
    {                        {
      g <- // is defined       <- a // is referenced
      Q(g)                   }
    }

Definition 15 Let P, Q be two modules and g be a global variable in P. There is a
global-import data binding triple (Q, g, P) if

1. ∃ a call Q(..., g, ...) in P such that the formal parameter of Q corresponding to g is a call-by-reference or call-by-value-result parameter and is defined within Q, and

2. g is referenced within P after the call.

Synopsis: A global-import data binding triple (Q,g,P) is symmetric to the
local-import data binding triple except that in the former the local variable x is
replaced by a global variable g.
Example 7 Global-import data binding triple (Q, g, P)
P() Q(a:[by reference,by value-result])
{ {
Q(g) a<- // is defined
<-g // is referenced
} }

Whenever it is not confusing, we will use the term data binding interchangeably
with any of the forms of data bindings above.

Definition 16 Let Type(v) be the type of variable v. Type size of variable v, denoted
Tsize(v), is the amount of information carried by type Type(v) of variable v.

Tsize(v) quantifies the information level associated with variable v due to its type. In a conventional programming language, such as C or Ada, a primitive type is the least difficult of its types to understand. Since a pointer represents the address










[Figure 5.1 schematically illustrates each of the relationships defined above: the module-global access pair (M,g); the module-global indirect access pair (M,g); the module-global access data binding (P,g,Q); the local-export data binding (P,x,Q); the local-import data binding (Q,x,P); the global-export data binding (P,g,Q); the global-import data binding (Q,g,P); and the return-value data bindings (Q,a,P) and (Q,<retval>,P).]

Figure 5.1. Schematic illustrations of access pairs and data bindings












Table 5.1. Type size associated with several types.

TYPE (based on "C"/"Ada")                     TSIZE
primitive types
  (e.g., int, char, float,                    1
  void, boolean, enum)
pointer                                       1
array                                         f(Tsize(array[i]),
                                                No. of elements,
                                                Control Info Size)
user-defined types (e.g., struct, union)      sum of the Tsizes of the elements


of an object, we conclude that it is as easy to understand as the primitive types. The user-defined types, such as structures and unions, are more complicated because of the information levels of their components; thus, we assume that the difficulty of understanding a user-defined type is the sum of the difficulties of understanding each of its components. Finally, an array has special mechanisms to manipulate its elements, which add to the difficulty of understanding the array.

Table 5.1 specifies the type sizes for the types found in two conventional programming languages. The higher the Tsize value, the more information is carried by the type and the more difficult it is to understand a variable of that type. For arrays, the control info size refers to the complexity associated with the array control mechanisms of the particular language under analysis (e.g., in C, the control info size is 2: 1 unit to account for the index offset used in addressing the individual elements of the array and 1 unit to account for the storage of the address of the beginning of the array).
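
As an illustration, the following C sketch (our own; in particular, the array rule instantiates the otherwise unspecified function f of Table 5.1 as the sum of the element Tsizes plus a control info size of 2, which is an assumption) computes Tsize recursively:

    #include <stdio.h>

    enum kind { PRIMITIVE, POINTER, ARRAY, STRUCT };

    struct ctype {
        enum kind k;
        int nelems;                 /* ARRAY: number of elements */
        struct ctype *elem;         /* ARRAY: element type */
        int nfields;                /* STRUCT: number of fields */
        struct ctype *fields[8];    /* STRUCT: field types */
    };

    int Tsize(struct ctype *t) {
        switch (t->k) {
        case PRIMITIVE:
        case POINTER:
            return 1;               /* as easy as a primitive (Table 5.1) */
        case ARRAY:                 /* assumed f: element sizes + control info */
            return t->nelems * Tsize(t->elem) + 2;
        case STRUCT: {
            int s = 0;              /* sum of the Tsizes of the fields */
            for (int i = 0; i < t->nfields; i++) s += Tsize(t->fields[i]);
            return s;
        }
        }
        return 0;
    }

    int main(void) {
        struct ctype ci  = {PRIMITIVE};
        struct ctype arr = {ARRAY, 10, &ci};              /* int[10] */
        struct ctype rec = {STRUCT, 0, 0, 2, {&ci, &arr}};
        printf("Tsize(int[10])       = %d\n", Tsize(&arr));  /* 12 */
        printf("Tsize(struct{int,a}) = %d\n", Tsize(&rec));  /* 13 */
        return 0;
    }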

Definition 17 A group G is a pair of sets of modules and global variables.


Definition 18 The boundary of group G is the set of data bindings (Mi, v, Mj) such that either Mi ∈ G and Mj ∉ G, or Mi ∉ G and Mj ∈ G.











In the case of a module-global access data binding (Mi, g, Mj) in the boundary

of group G, there are two associated access pairs in the boundary, namely (Mi, g)
and (Mj,g).


Definition 19 The interior set of group G is the set of data bindings (Mi, v, Mj) such that both Mi ∈ G and Mj ∈ G.

Let |S| be the number of elements in set S.

5.3.2 Inter-group Complexity Factors

The inter-group complexity of group G measures the complexity in the interface

between the group G and other groups of the system. The inter-group complexity of

group G is characterized by the union of all the primitive measurements of inter-group
complexity. The total inter-group complexity of a system is the average inter-group

complexity of the groups in the system. Another interesting measure is to consider

the maximum inter-group complexity of all the groups in the system.

5.3.2.1 Data-based primitive measurements

In this section, we present the primitive measurements of the inter-group com-

plexity which are based on the data of the system.


* f1(G) is the set of global variables used by modules inside group G and also used by modules outside group G, or the set of direct interface variables, i.e.:

f1(G) = { g | ∃ module-global access pairs (M, g) and (N, g) such that M ∈ G and N ∉ G }

Assume that a global variable which is concurrently used by modules in two groups increases the inter-group complexity of both groups.











We expect that the more global variables are shared between a group and other

groups, the more this group is related to these other groups. The relationship

is due to the potential of sharing the same data space among several groups.

Belady's intercluster complexity [3] is proportional to N, the total number of nodes (where nodes are either modules or global variables) in a system. This complexity is given by C0 = N · E0, where E0 is the number of "intercluster" edges. Thus, given the number of global variables N, we conclude that the number of global variables increases the inter-group complexity.

* f2(G) is the set of global variables outside of group G indirectly used by modules

within group G, or the set of indirect interface variables, i.e.:

f2(G) = { g | ∃ module-global indirect access pair (M, g) such that M ∈ G and g ∉ G }

We expect that the more global variables which indirectly affect or are affected

by modules in the group, the probability that the group is indirectly related

to other groups increases. This relationship is due to the indirect knowledge

about a global that the module must have.

In this factor, Belady's intercluster complexity is also used to verify our intuition.

Since the intercluster complexity is proportional to N, the total number of

nodes in a system, we conclude that the number of global variables increases

the inter-group complexity.

* f3(G) is the boundary set of group G, i.e.:

f3(G) = { (Mi, v, Mj) | ∃ a data binding (Mi, v, Mj) in the boundary set of G }

The intercluster complexity is proportional to the number of "intercluster edges" [3]. Accordingly, the more intercluster edges there are, the more complex the interface of the system is. In this factor, the data bindings across group boundaries correspond to the intercluster edges. Hence, the complexity increases with the number of data bindings. A similar observation was made by Henry and Kafura [11] in the case of modularity metrics.

* f4(G) is the set of different variables transferring information between group G and other groups, or the set of variables in the boundary set, i.e.:

f4(G) = { v | ∃ a data binding (Mi, v, Mj) in the boundary set of G }

Assume that variables which transfer information between the group and other groups, thus defining data bindings, relate the group to those other groups. Hence, the more variables relate a group to other groups, the more the inter-group complexity increases. Similarly, Belady's intercluster complexity considers unique nodes, where a node is either a module or a global datum in the system, as a factor which increases intercluster complexity. Factor f4 considers unique occurrences of variables in the boundary set (see the sketch after this item).
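
As an illustration of factors f3 and f4, the following C sketch (our own; the binding triples and the module-to-group assignment are hypothetical) counts the data bindings in the boundary set of a group and the distinct variables appearing there:

    #include <stdio.h>
    #include <string.h>

    #define NM 4   /* modules (hypothetical) */
    #define NB 4   /* data-binding triples (hypothetical) */

    struct binding { int mi, mj; const char *v; };

    int group_of[NM] = {0, 0, 1, 1};         /* module -> group */
    struct binding b[NB] = {                 /* (Mi, v, Mj) triples */
        {0, 1, "g1"}, {1, 2, "g2"}, {0, 3, "g2"}, {2, 3, "x"}
    };

    int main(void) {
        int G = 0, f3 = 0, f4 = 0;
        const char *seen[NB];
        for (int i = 0; i < NB; i++) {
            int in_i = (group_of[b[i].mi] == G), in_j = (group_of[b[i].mj] == G);
            if (in_i == in_j) continue;      /* interior or unrelated: skip */
            f3++;                            /* binding is in the boundary set */
            int dup = 0;
            for (int k = 0; k < f4; k++)
                if (strcmp(seen[k], b[i].v) == 0) dup = 1;
            if (!dup) seen[f4++] = b[i].v;   /* distinct boundary variable */
        }
        printf("|f3(G)| = %d, |f4(G)| = %d\n", f3, f4);  /* prints 2 and 1 */
        return 0;
    }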

5.3.2.2 Type-based primitive measurements

In this section, we present the primitive measurements of the inter-group com-
plexity which are based on the types of data in the system.
Assume that the information level of a variable is determined by its type size

(Tsize) according to Definition 16.

* f5(G) is the sum of the type sizes of the global variables used by modules inside group G and also used by modules outside group G, or the sum of the type sizes of direct interface variables, i.e.:

f5(G) = Σ Tsize(g) over g ∈ { g | ∃ module-global access pairs (M, g) and (N, g) such that M ∈ G and N ∉ G }











We expect that the higher the total type size of all global variables concurrently used by modules inside and outside the group, the higher the probability that the group is related to those other groups, and the higher the inter-group complexity.

* f6(G) is the sum of the type sizes of the global variables outside group G indirectly used by modules within group G, or the sum of the type sizes of indirect interface variables, i.e.:

f6(G) = Σ Tsize(g) over g ∈ { g | ∃ module-global indirect access pair (M, g) such that M ∈ G and g ∉ G }

We expect that the higher the total type size of the global variables outside the group indirectly used by modules inside the group, the higher the probability that the group is related to those other groups, and the higher the inter-group complexity.

* Consider the set of data binding triples in the boundary set of a group. f7(G) is the sum of the type sizes of the boundary set variables.

The sum of the type sizes of the different variables passing data into group G is:

f7.1(G) = Σ Tsize(v) over v ∈ { v | ∃ data binding (Mi, v, Mj) ∈ boundary set of G such that Mi ∉ G and Mj ∈ G }

The sum of the type sizes of the different variables passing data out of group G is:

f7.2(G) = Σ Tsize(v) over v ∈ { v | ∃ data binding (Mi, v, Mj) ∈ boundary set of G such that Mi ∈ G and Mj ∉ G }
Assume that variables which transfer information between the group and other groups, defining data bindings, contribute to the inter-group complexity of the group according to the variable's type size. We expect that the higher the total type size of the variables passing data to or from the group, the higher the inter-group complexity.











* f8(G) is the set of the different types of variables passing data between group G and other groups, or the set of types of the boundary set variables, i.e.:

f8(G) = { Type(v) | ∃ data binding (Mi, v, Mj) ∈ boundary set of G }

Assume that each type of variable passing data to or from the group needs to be supported by the group's interface, so each type increases the inter-group complexity. Consequently, the more different types of variables are supported, the higher the inter-group complexity. This is due to the different behaviors which need to be supported by the group's interface.


5.3.3 Intra-group Complexity Factors

The intra-group complexity of group G consists of the complexity of group G in

a system partition. The intra-group complexity of group G is characterized by the

union of all primitive measurements of intra-group complexity. The total intra-group

complexity of a system is the sum of the intra-group complexity of all the system's

groups.

Another measure related to these factors is the intra-group strength¹, which is a measure of the relationship between the modules in a group.

5.3.3.1 Data-based primitive measurements

In this section, we present the primitive measurements of the intra-group com-

plexity which are based on the data of the system.

* f9(G) is the set of global variables used exclusively by modules in group G, or the set of direct internal variables, i.e.:

f9(G) = { g | ∀ module-global access pairs (Mi, g), Mi ∈ G }

¹This term was originally used by Myers [27].











* f10(G) is the set of global variables used indirectly only by modules in group G, or the set of indirect internal variables, i.e.:

f10(G) = { g | ∀ module-global indirect access pairs (Mi, g), Mi ∈ G }

Factors f9 and f10 above consist of the global variables "exclusively used" by modules in the group.

Assume that a global variable exclusively used by modules in the group weakens
the intra-group strength and increases the intra-group complexity. According

to [3], a global variable negatively affects the intracluster complexity by increas-
ing its complexity value, given a set of edges. Consequently, a global variable
increases the intra-group complexity. Also, we argue that a global variable neg-
atively affects the intra-group strength since other groups may use this global
variable thus reducing the functional relatedness of the group.

We expect that the more global variables are exclusively used by modules in the group (and thus not used by modules in other groups), the higher the intra-group complexity. An extreme case occurs when the global variable becomes a local variable with respect to the group. Similarly, we expect that the strength between the modules in the group weakens.

* Consider the set of data binding triples in the interior set of a group. f11(G) is the interior set of group G, i.e.:

f11.1(G) is the set of module-global access data bindings involving modules within group G, i.e.:

f11.1(G) = { (Mi, v, Mj) | ∃ a module-global access data binding (Mi, v, Mj) in the interior set of G }

f11.2(G) is the set of local-export data bindings involving modules within group G, i.e.:

f11.2(G) = { (Mi, v, Mj) | ∃ a local-export data binding (Mi, v, Mj) in the interior set of G }

f11.3(G) is the set of local-import data bindings involving modules within group G, i.e.:

f11.3(G) = { (Mi, v, Mj) | ∃ a local-import data binding (Mi, v, Mj) in the interior set of G }

f11.4(G) is the set of global-export data bindings involving modules within group G, i.e.:

f11.4(G) = { (Mi, v, Mj) | ∃ a global-export data binding (Mi, v, Mj) in the interior set of G }

f11.5(G) is the set of global-import data bindings involving modules within group G, i.e.:

f11.5(G) = { (Mi, v, Mj) | ∃ a global-import data binding (Mi, v, Mj) in the interior set of G }

f11.6(G) is the set of return-value data bindings involving modules within group G, i.e.:

f11.6(G) = { (Mi, v, Mj) | ∃ a return-value data binding (Mi, v, Mj) in the interior set of G }

According to Belady and Evangelisti [3], the intracluster complexity of a single cluster j is Cj = nj ej, where nj is the number of nodes in the cluster and ej the number of pairwise edges, that is, "connections between the same elements." Given that edges correspond to data bindings in our model, we conclude that the more data bindings between the modules in a group, the higher the intra-group complexity of the group.

Furthermore, the intra-group strength increases with the number of data bindings, since the more "channels" of data passing between modules in the group, the higher the functional relatedness of the group.

Factors f11 may appear to conflict with factors f9 and f10: data bindings define relations between modules which, theoretically speaking, increase the strength. However, when the data bindings are due to global variables, there is a risk that other groups in the system access those global variables, which weakens the strength of the group.
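
The interior set can be tallied by binding kind to obtain the sizes of f11.1 through f11.6. The sketch below assumes each binding triple carries a kind tag; the tags and data are hypothetical, and only three of the six kinds appear in the sample.

# Illustrative sketch: tallying the interior set of G by data-binding kind,
# i.e. the sizes of f11.1 .. f11.6. Data is invented.

from collections import Counter

# interior set: (Mi, v, Mj, kind) with Mi, Mj in G
interior_set = [
    ("M2", "v2", "M6", "module-global access"),
    ("M2", "v3", "M7", "local-export"),
    ("M5", "v5", "M9", "return-value"),
    ("M5", "v5", "M9", "return-value"),
]

per_kind = Counter(kind for (_, _, _, kind) in interior_set)
print(per_kind)
# Counter({'return-value': 2, 'module-global access': 1, 'local-export': 1})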

* This factor applies to groups derived using the globals-based analysis (Algorithm 1): f12(G) is the percentage of modules in group G which directly use all global variables in group G, i.e.:

f12(G) = |{ M | M ∈ G and ∃ module-global access pair (M, gi) ∀ global variable gi ∈ G }| / |{ M | M ∈ G }|

Assume that a global variable used by all modules in the group increases the intra-group strength. The higher the ratio of the number of modules which directly use all global variables in the group to the total number of modules, the higher the intra-group strength.
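
A sketch of the f12 ratio on hypothetical data (illustrative only, not the prototype):

# Illustrative sketch: factor f12 for a globals-based group, the fraction
# of modules in G that directly access every global grouped into G.

G_modules = {"M1", "M2", "M3"}
G_globals = {"g1", "g2"}
access_pairs = {("M1", "g1"), ("M1", "g2"), ("M2", "g1"),
                ("M2", "g2"), ("M3", "g1")}

uses_all = {M for M in G_modules
            if all((M, g) in access_pairs for g in G_globals)}
f12 = len(uses_all) / len(G_modules)
print(f12)   # 2/3: M3 does not access g2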

5.3.3.2 Type-based primitive measurements

In this section, we present the primitive measurements of the intra-group complexity which are based on the types of data in the system.

Assume that the information level of a variable is determined by its type size (Tsize), according to Definition 16.











* Consider the set of data binding triples in the interior set of group G. Also, consider only the types of the different variables, since we measure the complexity due to the kind of information passed between modules of the group, as opposed to the volume of this information.

f13(G) is the sum of the type sizes of the different variables passing data between modules Mi and Mj in group G:

f13(G) = Σ Tsize(v), for v ∈ { v | ∃ data binding (Mi, v, Mj) ∈ interior set of G }

We expect that the higher the total type size of all the variables used by modules in the group, the higher the intra-group complexity and the lower the intra-group strength.

* f14(G) is the set of types of the different variables passing data between modules Mi and Mj in group G, or the set of types of interior set variables, i.e.:

f14(G) = { Type(v) | ∃ data binding (Mi, v, Mj) ∈ interior set of G }

Assume that each type of variable passing data among the modules in the group needs to be supported by the module, and that each type increases the intra-group complexity. Therefore, the more different types of variables are supported, the higher the intra-group complexity. Also, less commonality is observed in the group, and the intra-group strength decreases due to the reduced functional relatedness of the group.
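
The following sketch computes f13 and f14 over a hypothetical interior set, counting each distinct variable once, as the definitions require:

# Illustrative sketch: the type-based factors f13 and f14 over a group's
# interior set. Only distinct variables contribute, since the factors
# measure the kind of information exchanged, not its volume. Data invented.

interior_set = [("M2", "v2", "M6"), ("M2", "v3", "M7"),
                ("M2", "v3", "M8"), ("M5", "v5", "M11")]
Type = {"v2": "int", "v3": "char[80]", "v5": "float"}
Tsize = {"int": 1, "char[80]": 82, "float": 1}

distinct_vars = {v for (_, v, _) in interior_set}
f13 = sum(Tsize[Type[v]] for v in distinct_vars)   # 1 + 82 + 1 = 84
f14 = {Type[v] for v in distinct_vars}             # distinct types
print(f13, f14)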

* The following factors apply only to groups derived using the types-based analysis (Algorithm 2):











Definition 20 The Base Type of group G, denoted Btype(G), is the type used as the grouping criterion during the "grouping" step of the Types-based Object Finder, Algorithm 2.

Definition 21 Module M manipulates type t, denoted by the manipulation pair [M, t], iff t is the type of a formal parameter of M, or t is the type of the return value of M, or t = Type(g) such that ∃ access pair (M, g). That is, t is one of the types that module M may manipulate.

Definition 22 A grouping manipulation pair in group G is a manipulation pair [M, t] such that M ∈ G and t = Btype(G).

There exists a grouping manipulation pair for a module M, M ∈ G, and type t, denoted [M, t], iff any of module M's types of formal parameters or return value, t, is equivalent² to the base type of group G, Btype(G); i.e., t = Btype(G). Also, there exists a grouping manipulation pair for a module M, M ∈ G, and type of variable v, Type(v), denoted [M, Type(v)], iff there exists a module-global access pair (M, v) and the type of variable v, Type(v), is equivalent to the base type of group G; i.e., Type(v) = Btype(G).

f15(G) is the set of grouping manipulation pairs [M, t] such that M ∈ G, i.e.:

f15(G) = { [M, t] | [M, t] is a grouping manipulation pair and M ∈ G }

We expect that the more grouping manipulation pairs, the higher the intra-group strength. The evidence indicates that this factor does not affect the intra-group complexity.

² Equivalent types are defined in Algorithm 2 in Chapter 3.











* Definition 23 A degrouping manipulation pair in group G is a manipulation pair [M, t] such that M ∈ G and t < Btype(G) or t > Btype(G).

There exists a degrouping manipulation pair for a module M, M ∈ G, and type t, denoted [M, t], iff any of module M's types t of formal parameters or return value is a "part of" the base type of group G, or "contains" the type that is the base type of group G, Btype(G); i.e., t < Btype(G) or t > Btype(G). Also, there exists a degrouping manipulation pair for a module M, M ∈ G, and type of variable v, Type(v), denoted [M, Type(v)], iff there exists a module-global access pair (M, v) and the type of variable v, Type(v), is a "part of" the base type of group G or "contains" the type that is the base type of group G; i.e., Type(v) < Btype(G) or Type(v) > Btype(G).

f16(G) is the set of degrouping manipulation pairs [M, t] such that M ∈ G, i.e.:

f16(G) = { [M, t] | [M, t] is a degrouping manipulation pair and M ∈ G }

We expect that the more degrouping manipulation pairs, the lower the intra-group strength. Once more, this factor does not affect the intra-group complexity.

* f17(G) is the ratio of the number of grouping manipulation pairs to the number of degrouping manipulation pairs:

f17(G) = |f15(G)| / |f16(G)|

Given factors f15 and f16 above, we expect that the higher the ratio of grouping manipulation pairs to degrouping manipulation pairs, the higher the intra-group strength.
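
These three factors can be illustrated with a small sketch. The type_rel helper below is a hypothetical stand-in for the equivalence and "part of"/"contains" tests of Algorithm 2, and the module and type names are invented; pairs whose types are unrelated to the base type are ignored, as the next paragraph notes.

# Illustrative sketch: the grouping/degrouping factors for a types-based
# group. type_rel(t, base) returns "eq" (equivalent), "part"/"contains",
# or None (unrelated; the pair is then not considered).

def type_rel(t, base):
    relations = {("stack", "stack"): "eq", ("node", "stack"): "part"}
    return relations.get((t, base))

# manipulation pairs [M, t] for modules of G; hypothetical data
manipulation_pairs = [("push", "stack"), ("pop", "stack"), ("top", "node")]
base = "stack"                               # Btype(G)

f15 = [(M, t) for (M, t) in manipulation_pairs if type_rel(t, base) == "eq"]
f16 = [(M, t) for (M, t) in manipulation_pairs
       if type_rel(t, base) in ("part", "contains")]
f17 = len(f15) / len(f16) if f16 else float("inf")
print(len(f15), len(f16), f17)   # 2 1 2.0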

In the case that types t and Btype(G) are not explicitly related (i.e., the types are neither equivalent nor related by a "part of" or "contains" relationship), the corresponding manipulation pairs of the form [M, t] are not considered in factors f15, f16, and f17.

Figure 5.2. Example of the primitive complexity metrics factors

5.3.4 Example of the Factors

This section presents an example of the primitive complexity metrics factors. Figure 5.2 consists of a set of several routines, global variables, and a group. Examples of the complexity metrics factors related to group G in Figure 5.2 are:

1. Inter-group Complexity

f3 Boundary set:
f3(G) = {(M1, v1, M2), (M3, v3, M5), (M3, v3, M7), (M4, v4, M8), (M5, v5, M9), (M10, v1, M11), (M10, v5, M11)}

f4 Variables in boundary set: f4(G) = {v1, v3, v4, v5}

f7 Sum of type sizes of different variables in boundary set:
f7(G) = Tsize(v1) + Tsize(v3) + Tsize(v4) + Tsize(v5)

2. Intra-group Complexity

f11 Interior set:
f11(G) = {(M2, v2, M6), (M2, v3, M7), (M2, v3, M8), (M5, v5, M11)}

f13 Sum of type sizes of different variables in interior set:
f13(G) = Tsize(v2) + Tsize(v3) + Tsize(v5)

5.3.5 Validation of the Factors

This section presents the validation of the complexity metrics factors of the previous sections. The validation approach consists of proving that the validation hypothesis holds, and is described in Figure 5.5. This section also illustrates the sensitivity of the factors to a particular modification of a program. More experimentation is needed to further validate these factors with respect to other kinds of changes applied to programs. Other interesting modifications include a functionally equivalent program with several functions merged into a single function, or several abstract data types of the original program replaced with equivalent data structures.

The example program is a simple recursive descent expression parser implemented in "C". First, we present the complexity metrics computed on a version of this program which consists of two source files, eighteen functions, and four global variables. Second, we present the metrics computed on another version of this program with the same functionality, except that most function parameters are converted into global variables; this version consists of the same number of source files as the previous version, but with a total of nineteen functions and six global variables. Third, we explain the effect that the structure of each version of the program had on the primitive complexity metrics factors, and illustrate the sensitivity of the metrics to this class of changes in the structure of a program.












Table 5.2. Type size associated with variables of different types in the original version of the recursive descent expression parser.

Variable   Type            TSIZE
prog       char*           1
toktype    char            1
token      char[80]        80 + 2 = 82
vars       float[26]       26 + 2 = 28
answer     float           1
op         register char   1
result     float*          1
c          char            1
hold       float           1



In the first version of the program, named the original version, the top-down analysis approach was used with both globals-based and types-based analyses; we chose to ignore C primitive types during the types-based analysis. The identified objects are shown in Figure 5.3. Some modifications of the candidate objects were performed to obtain completely disjoint candidate objects in terms of their routines; these consisted of removing the common components (routines) of object Gtoktype#2 so that each belonged to a single candidate object. The resulting objects after the modifications are shown in Figure 5.3.

Table 5.2 lists the type sizes corresponding to the types of variables in the original version of the example. The primitive metrics factors were computed for inter-group complexity, intra-group complexity, and intra-group strength.




















Object z#objTchar*#3 is {
  ~~char*~76                           :(T)char*
  Float_Parsing~___II~iswhite~181      :(R)iswhite
  Float_Parsing~___II~isdelim~175      :(R)isdelim
  Float_Parsing~___II~isin~172         :(R)isin
}
Object z#objTfloat*#4 is {
  ~~float*~24                          :(T)float*
  Float_Parsing~___II~primitive~240    :(R)primitive
  Float_Parsing~___II~level6~198       :(R)level6
  Float_Parsing~___II~level5~196       :(R)level5
  Float_Parsing~___II~level4~194       :(R)level4
  Float_Parsing~___II~level3~192       :(R)level3
  Float_Parsing~___II~level2~190       :(R)level2
  Float_Parsing~___II~level1~188       :(R)level1
  Float_Parsing~___II~getexp~143       :(R)getexp
}
Object z#objT+undetermined-type#5 is {
  ~~float*~24                          :(T)float*
  ~~char*~76                           :(T)char*
  Float_Parsing~___II~unary~371        :(R)unary
  Float_Parsing~___II~findvar~100      :(R)findvar
  Float_Parsing~___II~arith~21         :(R)arith
}
Object z#objGtoktype#2 is {
  main~___II~prog~243                  :(G)prog
  Float_Parsing~___II~vars~374         :(G)vars
  Float_Parsing~___II~token~357        :(G)token
  Float_Parsing~___II~toktype~356      :(G)toktype
  main~___II~main~206                  :(R)main
  Float_Parsing~___II~putback~244      :(R)putback
  Float_Parsing~___II~gettoken~145     :(R)gettoken
  Float_Parsing~___II~findvar~100      :(R)findvar
}

Figure 5.3. Identified objects in original version of recursive descent expression parser












Table 5.3. Type size associated with variables of different types in version 1 of the recursive descent expression parser.

Variable   Type            TSIZE
prog       char            1
toktype    char            1
token      char[80]        80 + 2 = 82
vars       float[26]       26 + 2 = 28
answer     float[1048]     1048 + 2 = 1050
result     float           1
op         register char   1
c          char            1
hold       float           1

For the other version of the program, named version 1, the top-down analysis approach was used with both globals-based and types-based analyses; we chose to ignore C primitive types during the types-based analysis. The identified objects are shown in Figure 5.4. Some modifications of the candidate objects were performed to obtain completely disjoint candidate objects; specifically, routine findvar was removed from object T+undetermined-type#4. The resulting objects after the modifications are shown in Figure 5.4.

Table 5.3 lists the type sizes used for the metrics computation. In this version of the program, as well as in the rest of the examples in this thesis, whenever possible the name of an identifier denotes the identifier, instead of the unique ID automatically generated by the object finder tool of Chapter 7. See Section 8.1 for a complete explanation of the unique IDs in the internal representation of the program used by the object finder tool.

The primitive metrics factors were computed for inter-group complexity, intra-group complexity, and intra-group strength.





















Object z#objTchar*#3 is {
  ~~char*~77                           :(T)char*
  Float_Parsing~___II~iswhite~177      :(R)iswhite
  Float_Parsing~___II~isdelim~171      :(R)isdelim
  Float_Parsing~___II~isin~168         :(R)isin
}
Object z#objT+undetermined-type#4 is {
  ~~float*~25                          :(T)float*
  ~~char*~77                           :(T)char*
  Float_Parsing~___II~unary~362        :(R)unary
  Float_Parsing~___II~arith~22         :(R)arith
}
Object z#objGanswer#2 is {
  main~___II~prog~232                  :(G)prog
  Float_Parsing~___II~vars~365         :(G)vars
  Float_Parsing~___II~token~348        :(G)token
  Float_Parsing~___II~toktype~347      :(G)toktype
  Float_Parsing~___II~result~309       :(G)result
  Float_Parsing~___II~answer~20        :(G)answer
  main~___II~main~196                  :(R)main
  Float_Parsing~___II~resultinc~310    :(R)resultinc
  Float_Parsing~___II~putback~233      :(R)putback
  Float_Parsing~___II~primitive~230    :(R)primitive
  Float_Parsing~___II~level6~189       :(R)level6
  Float_Parsing~___II~level5~188       :(R)level5
  Float_Parsing~___II~level4~187       :(R)level4
  Float_Parsing~___II~level3~186       :(R)level3
  Float_Parsing~___II~level2~185       :(R)level2
  Float_Parsing~___II~level1~184       :(R)level1
  Float_Parsing~___II~gettoken~145     :(R)gettoken
  Float_Parsing~___II~getexp~144       :(R)getexp
  Float_Parsing~___II~findvar~101      :(R)findvar
}

Figure 5.4. Identified objects in version 1 of recursive descent expression parser











The approach to validate the complexity metrics consists of proving that the following hypothesis holds. The hypothesis states that the primitive complexity metrics factors are sensitive to changes in the complexity caused by different partitionings. Different partitionings are the result of (1) using different partitioning approaches under the same connectivity, or (2) totally different connectivities. The connectivity of a program is the set of relationships between components of the program defined by the data bindings among components due to global variables and calling sequences. We use the second situation to prove that the hypothesis holds.

The methodology to prove the hypothesis is to use a case study program to show that the primitive metrics reflect different complexity measures in two versions of the program which are functionally equivalent and have different connectivities as a result of having different structures. A schematic description of this validation methodology is given in Figure 5.5. From Figure 5.5, the validation of the metrics factors consists of proving the following hypothesis: if Co1 is less complex than Co2, then Me1 < Me2, where Co1 is the expected complexity of the original version, Co2 is the expected complexity of version 1, Me1 is the measured primitive metrics factor values of the original version, and Me2 is the measured primitive metrics factor values of version 1.

For the proof, we use the program above with two versions: the original version and version 1. We made version 1 more complex in terms of the connectivity of the program, since a program with greater connectivity is expected to be more complex given that all other factors remain the same. Version 1 is made more complex by replacing most formal parameters with global variables, so the number of expected relations between components of the program increases.
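
The comparison itself is mechanical: for each factor, check whether the measured values of the two versions differ in the expected direction. A sketch with hypothetical placeholder numbers (not the thesis measurements):

# Illustrative sketch of the validation step: compare measured factor
# values of the two versions against the expected direction of change.
# The numbers below are invented placeholders.

expected_higher_in_v1 = {"f9", "f11", "f13", "f14"}   # intra-group complexity
Me_original = {"f9": 2, "f11": 4, "f13": 30, "f14": 3}
Me_version1 = {"f9": 5, "f11": 9, "f13": 84, "f14": 5}

for factor in sorted(expected_higher_in_v1):
    holds = Me_version1[factor] > Me_original[factor]
    print(factor, "hypothesis holds" if holds else "hypothesis violated")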

Figure 5.5. Validation of the primitive complexity metrics factors

The expected complexity of each of the two versions is defined as the complexity resulting from the connectivities between the components of a system. We expect that the groups in the partitioning obtained from version 1 are more intra-group complex than

the groups in the partitioning obtained from the original version. This is the case since each group in the version 1 partitioning is expected to be made highly cohesive by the global variables added in version 1. We also expect that the groups in version 1 are less inter-group complex, because the groups in the version 1 partitioning are expected to be loosely connected by the global variables in version 1.

Each version of the program exhibits different connectivity between the program components. Hence, the object finder partitionings for each version of the program were different. In addition, the objective of the study, that the program versions be functionally equivalent but structurally different, was met.

The validation results for each complexity metric factor are summarized next. First, we present the results for the complexity metrics factors that measure inter-group complexity.

f1 Set of direct interface variables. The output for the two versions of the program indicated that the original version presented higher inter-group complexity than version 1. This observation shows the effect of different partitionings on the metrics. In addition, as expected, there are more interactions between components in the original version than between components in version 1.


f2 Set of indirect interface variables. No relevant results were obtained for this factor.


f3 Boundary set. The output for the two versions of the program indicated that the original version presented slightly higher inter-group complexity than version 1. This observation is in accordance with the expected variation in the metrics.











f4 Variables in boundary set. The output for the two versions of the program indicated that the two versions presented similar inter-group complexity.


f5 Sum of type sizes of direct interface variables. The output for the two versions of the program indicated that the original version presented higher inter-group complexity than version 1. This is the case since the version 1 partitioning reduces the number of global variables transferring information between components.


f6 Sum of type sizes of indirect interface variables. No relevant results were obtained for this factor.


f7 Sum of type sizes of boundary set variables. The output for the two versions of the program indicated that the original version presented lower inter-group complexity than version 1. Version 1 added some global variables to the program, which influenced the sizes observed in the new program. However, the increased value of the type size is related to the number of relations (data bindings) between components, and this number was lower in version 1 than in the original version.


f8 Set of types of boundary set variables. The output for the two versions of the program indicated that the original version presented higher inter-group complexity than version 1. This is the case since the version 1 partitioning reduces the number of variables transferring information between components.

In conclusion, the inter-group complexity in version 1 is lower than that in the original version. This coincides with our expectations regarding the effect of the changes and the resulting partitioning on the primitive complexity metrics factors. That is to say, the partitioning resulting from version 1 agglomerates most relations between components inside the groups, which reduces the inter-group complexity.











f9 Set of direct internal variables. The output for the two versions of the program indicated that version 1 presented higher intra-group complexity than the original version. This coincides with our expectation that the intra-group complexity increases in version 1 since more global variables are used.


f10 Set of indirect internal variables. The output for the two versions of the program indicated that the two versions presented similar intra-group complexity. One reason for this is that the code related to factor f10 is not modified between versions.


f11 Interior set. The output for the two versions of the program indicated that the original version presented lower intra-group complexity than version 1. This coincides with our expectation that the intra-group complexity increases in version 1 since more global variables are used.


f12 Percentage of modules accessing all global variables. No relevant results were obtained for this factor.


f13 Sum of type sizes of interior set variables. The output for the two versions of the program indicated that the original version presented lower intra-group complexity than version 1. This demonstrates that factor f13 increases with a higher intra-group complexity partitioning, such as the one in version 1 of the program.


f14 Set of types of interior set variables. The output for the two versions of the program indicated that the original version presented lower intra-group complexity than version 1. This coincides with our expectation that the intra-group complexity increases in version 1 since more relations occur.











In conclusion, the intra-group complexity in version 1 is higher than the intra-group complexity in the original version. This coincides with our expectation regarding the effect of the version 1 partitioning on the metrics. That is to say, the partitioning resulting from version 1 agglomerates most relations between components inside the groups, which increases the intra-group complexity.


f15 Set of grouping manipulation pairs. The output for the two versions of the program indicated that the original version presented higher intra-group strength than version 1. This coincides with our expectation that the partitioning of version 1 decreases the strength, because parameters have been replaced with global variables and, as previously indicated, the addition of global variables reduces the strength of individual components.


f16 Set of degrouping manipulation pairs. The output for the two versions of the program indicated that the original version presented higher intra-group strength than version 1. This coincides with our expectation that the partitioning resulting from version 1 decreases the strength.


f17 Ratio of grouping to degrouping manipulation pairs. The output for the two versions of the program indicated that the original version presented higher intra-group strength than version 1. This coincides with our expectation that the partitioning resulting from version 1 decreases the strength.

In conclusion, the intra-group strength in the original version of the program is higher than that in version 1. This coincides with our expectations regarding the effect of the version 1 partitioning on the primitive metrics factors. That is to say, version 1 reduces the strength of the resulting components due to the elimination of the parameters from the routines' interfaces, which were replaced with global variables.











The validation of these primitive metrics factors shows a correlation between the primitive metrics factors and the expected complexity associated with a partitioning of a system. We conclude that the complexity of a partitioning is effectively measured by these primitive complexity metrics factors; different complexities will result in different values of the primitive metrics factors. Future research consists of describing the relationships between the complexity metrics factors in terms of a formula.

5.4 The Test Cases: Identified Objects, Clusters and Groups

The following sections present the partitionings identified in the three test case programs using two modularization techniques. The partitionings consist of the objects identified using the top-down analysis method of the object finder, and the clusters identified using Hutchens and Basili's hierarchical clustering technique [14]. A generalized view of both kinds of partitionings consists of a set of groups, from Definition 17; a group usually corresponds to an identified object in the object finder and to a cluster in hierarchical clustering. In the following sections, we explain how the groups based on identified objects and on clusters were specified for each test case.

5.4.1 Test Case 1: Name Cross-reference Program

The first test case consists of the name cross-reference program from Section 8.1.

The statistics of this example are given in Table 5.4. This section presents the iden-

tified objects by the object finder using the top-down analysis method. In addition,

we present the clusters identified in this program using Basili's hierarchical clustering

technique [14].

The identified objects, obtained by the object finder during the top-down anal-

ysis method, are shown in Figure 8.1. The groups corresponding to the identified

objects in Figure 8.1 are derived after the user performs some modifications on the












Table 5.4. Statistics of the test case programs.

Program name                      Lines of code   Global variables   Functions   Types
Name cross-reference              282             1                  10          10
Algebraic expression evaluation   1324            10                 50          14
Table management                  1900            8                  50          4


The purpose of the user's modifications is to eliminate the commonality between objects by removing common components between objects, as explained in Section 8.1, using xobject; this results in the objects of Figure 8.3. These disjoint objects were used to derive the groups, which are shown in Figure 5.6. Since these groups are based on the objects after the modifications, we name each group using the name of the object that corresponds to it.

The clusters of this test case were identified by a clustering tool called basili [21] that implements Basili's hierarchical clustering technique. The identified clusters in the cross-reference program are shown in Figure 5.7.

Figure 5.7 shows the clusters defined on this example using Basili's clustering technique. The groups corresponding to these clusters are shown in Figure 5.8. In this case, a group consists of one or more clusters of routines which maintain the same levels of coupling and strength [14] as the clusters identified with the hierarchical clustering technique. The groups are constructed as follows (a sketch of the rule follows this paragraph). Initially, groups are defined based on the top level of the corresponding clusters; the naming of groups is arbitrary and only serves to distinctly identify each group. Then, the routines which did not cluster were considered to be groups of size one. An alternative approach, during this last step, is to group together the unclustered routines.
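
A sketch of this construction rule on the routines of Figure 5.7 (illustrative Python, with the dendrogram flattened by hand; not the thesis tooling):

# Illustrative sketch: build groups from top-level clusters, then make
# each unclustered routine a group of size one. Routine names are taken
# from Figure 5.7.

clusters = [
    ["addlineno", "addword", "makelinenode", "makewordnode", "strsave"],
    ["getachar", "getword"],
]
all_routines = {"addlineno", "addword", "makelinenode", "makewordnode",
                "strsave", "getachar", "getword", "inittab", "writewords",
                "main"}

groups = [set(c) for c in clusters]                        # one group per cluster
clustered = set().union(*groups)
groups += [{r} for r in sorted(all_routines - clustered)]  # singleton groups
for i, g in enumerate(groups, 1):
    print(f"Group {i}: {sorted(g)}")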
















Group z#objTLINEPTR#3 is {
  ~~LINEPTR~16                          :(T)LINEPTR
  xref_tab~___II~makelinenode~138       :(R)makelinenode
}
Group z#objTWTPTR*#4 is {
  ~~WTPTR*~104                          :(T)WTPTR*
  xref_tab~___II~makewordnode~140       :(R)makewordnode
  xref_tab~___II~inittab~102            :(R)inittab
  xref_tab~___II~addword~64             :(R)addword
  xref_tab~___II~addlineno~61           :(R)addlineno
  xrefout~___II~writewords~181          :(R)writewords
}
Group z#objTchar*#5 is {
  ~~char*~56                            :(T)char*
  xref_tab~___II~strsave~164            :(R)strsave
}
Group z#objTint*#6 is {
  ~~int*~69                             :(T)int*
  xref~___II~main~137                   :(R)main
}
Group z#objGlineno#2 is {
  xrefin~___II~lineno~130               :(G)lineno
  xrefin~___II~getword~98               :(R)getword
  xrefin~___II~getachar~93              :(R)getachar
}

Figure 5.6. Groups based on objects identified in name cross-reference program







Final dendrogram:
-- Cluster No.1 --
(100 (50 addlineno addword)
     makelinenode
     (50 makewordnode strsave) )
-- Cluster No.2 --
(100 getachar getword)

Figure 5.7. Clusters found in the name cross-reference program by basili













Group I is {
  xref_tab~___II~addlineno~61           :(R)addlineno
  xref_tab~___II~addword~64             :(R)addword
  xref_tab~___II~makewordnode~140       :(R)makewordnode
  xref_tab~___II~strsave~164            :(R)strsave
  xref_tab~___II~makelinenode~138       :(R)makelinenode
}
Group II is {
  xrefin~___II~getword~98               :(R)getword
  xrefin~___II~getachar~93              :(R)getachar
}
Group III is {
  xref_tab~___II~inittab~102            :(R)inittab
}
Group IV is {
  xrefout~___II~writewords~181          :(R)writewords
}
Group V is {
  xref~___II~main~137                   :(R)main
}

Figure 5.8. Groups based on clusters found in name cross-reference program

5.4.2 Test Case 2: Algebraic Expression Evaluation Program

The second test case consists of the simple algebraic expression evaluation program from Section 8.2. The statistics of this program are listed in Table 5.4. This section presents the objects identified by the object finder during the top-down analysis method. In addition, we present the clusters identified in this program using Hutchens and Basili's hierarchical clustering technique [14].

The identified objects, obtained by the object finder during the top-down analysis method, are shown in Figures 8.4 and 8.5. The groups corresponding to these identified objects are derived after the user performs modifications on the identified objects. Similarly to Section 5.4.1, we name a group based on the name of the object that corresponds to the group. The groups are shown in Figures 5.9 and 5.10.

















Group z#objTNODE*#6 is {
  ~~NODE*~15                                   :(T)NODE*
  function~___II~Variable_eval~140             :(R)Variable_eval
  function~___II~Plus_eval~104                 :(R)Plus_eval
  function~___II~Node_eval~101                 :(R)Node_eval
  function~___II~Multiply_eval~96              :(R)Multiply_eval
  function~___II~Minus_eval~94                 :(R)Minus_eval
  function~___II~Function_precedence~82        :(R)Function_precedence
  function~___II~Function_delete_function      :(R)Function_delete_function
  function~___II~Function_check_and_add~68     :(R)Function_check_and_add
  function~___II~Function_build_tree~66        :(R)Function_build_tree
  function~___II~Function_add_operator~62      :(R)Function_add_operator
  function~___II~Function_add_operands~60      :(R)Function_add_operands
  function~___II~Echo_tree0~29                 :(R)Echo_tree0
  function~___II~Divide_eval~22                :(R)Divide_eval
  function~___II~Construct_function~16         :(R)Construct_function
  function~___II~Constant_eval~12              :(R)Constant_eval
  function~___II~Function_parenthesis~80       :(R)Function_parenthesis
  function~___II~Functionovldop~73             :(R)Functionovldop
  function~___II~totalparen~292                :(G)totalparen
  function~___II~root~267                      :(G)root
  function~___II~queue~263                     :(G)queue
  function~___II~last~216                      :(G)last
  function~___II~first~193                     :(G)first
}
Group z#objTchar*#7 is {
  ~~char*~18                                   :(T)char*
  state~___II~State9_transition~128            :(R)State9_transition
  state~___II~State7_transition~126            :(R)State7_transition
  state~___II~State6_transition~124            :(R)State6_transition
  state~___II~State5_transition~122            :(R)State5_transition
  state~___II~State4_transition~120            :(R)State4_transition
  state~___II~State3_transition~118            :(R)State3_transition
  state~___II~State2_transition~116            :(R)State2_transition
  state~___II~State1_transition~114            :(R)State1_transition
  state~___II~State0_transition~112            :(R)State0_transition
  function~___II~Echo_tree~27                  :(R)Echo_tree
}

Figure 5.9. Groups based on types-based objects identified in algebraic expression evaluation program













Group z#objTfloat#8 is {
  ~~float~5                                    :(T)float
  function~___II~F6~52                         :(R)F6
  function~___II~F5~46                         :(R)F5
  function~___II~F4~41                         :(R)F4
  function~___II~F3~37                         :(R)F3
  function~___II~F2~34                         :(R)F2
  function~___II~F1~32                         :(R)F1
}
Group z#objTint#9 is {
  ~~int~4                                      :(T)int
  exprtst~___II~main~217                       :(R)main
}
Group z#objTvoid#10 is {
  ~~void~2                                     :(T)void
  function~___II~Destruct_function~21          :(R)Destruct_function
}

Figure 5.9 - continued


Similarly to Section 5.4.1, the clusters of this test case were identified by the clustering tool basili [21], based on Basili's hierarchical clustering technique. The identified clusters in the expression evaluation program are shown in Figure 5.11.

Figure 5.11 shows the clusters defined in this example using Basili's clustering technique. The groups corresponding to these clusters are shown in Figure 5.12. In this test case, a group consists of one or more clusters and sub-clusters of routines which result in the same degree of coupling and strength [14] as the cluster identified by the hierarchical clustering technique. The groups are constructed as follows. First, a group corresponds to the set of routines which form sub-clusters with the lowest coupling and highest cohesion between the routines, according to Hutchens and Basili's [14] definition of strength and coupling. Next, the routines which did not cluster were grouped into a single group, named group V.














Group z#objGtableindex#2 is {
  symbol~___II~table~284                       :(G)table
  symbol~___II~tableindex~286                  :(G)tableindex
  symbol~___II~Symbol_table_get_value~138      :(R)Symbol_table_get_value
  symbol~___II~Symbol_table_get_index~136      :(R)Symbol_table_get_index
  symbol~___II~Symbol_table_clear~135          :(R)Symbol_table_clear
  symbol~___II~Symbol_table_add_variable       :(R)Symbol_table_add_variable
  symbol~___II~Symbol_table_add_value~130      :(R)Symbol_table_add_value
  symbol~___II~Echo_symbol_table~26            :(R)Echo_symbol_table
  symbol~___II~Construct_symbol_table~20       :(R)Construct_symbol_table
}
Group z#objGparencount#3 is {
  state~___II~parencount~254                   :(G)parencount
  state~___II~State0_inc_paren~111             :(R)State0_inc_paren
  state~___II~State0_get_paren~110             :(R)State0_get_paren
  state~___II~State0_dec_paren~108             :(R)State0_dec_paren
}
Group z#objGexpression#4 is {
  state~___II~index~213                        :(G)index
  state~___II~expression~178                   :(G)expression
  state~___II~State0_get_char~109              :(R)State0_get_char
  state~___II~Construct_state0~19              :(R)Construct_state0
  function~___II~Function_valid~85             :(R)Function_valid
}

Figure 5.10. Groups based on globals-based objects identified in algebraic expression evaluation program


Final dendrogram:
-- Cluster No.1 --
(9 Functionovldop
   Symbol_table_add_value)

-- Cluster No.2 --
(100 (66 Construct_function
         (62 Function_add_operands
             (44 Function_build_tree
                 Symbol_table_add_variable)
             Function_check_and_add) )
     Echo_queue Echo_tree0
     Function_add_operator
     Function_delete_function main)

Figure 5.11. Clusters found in algebraic expression evaluation program













Group I is {
  function~___II~Functionovldop~73             :(R)Functionovldop
  symbol~___II~Symbol_table_add_value~130      :(R)Symbol_table_add_value
}
Group II is {
  function~___II~Function_add_operator~62      :(R)Function_add_operator
  function~___II~Function_delete_function      :(R)Function_delete_function
  exprtst~___II~main~217                       :(R)main
}
Group III is {
  function~___II~Construct_function~16         :(R)Construct_function
  function~___II~Function_add_operands~60      :(R)Function_add_operands
  function~___II~Function_check_and_add~68     :(R)Function_check_and_add
}
Group IV is {
  function~___II~Function_build_tree~66        :(R)Function_build_tree
  symbol~___II~Symbol_table_add_variable       :(R)Symbol_table_add_variable
}
Group V is {
  function~___II~Variable_eval~140             :(R)Variable_eval
  function~___II~Plus_eval~104                 :(R)Plus_eval
  function~___II~Node_eval~101                 :(R)Node_eval
  function~___II~Multiply_eval~96              :(R)Multiply_eval
  function~___II~Minus_eval~94                 :(R)Minus_eval
  function~___II~Function_precedence~82        :(R)Function_precedence
  function~___II~Echo_tree0~29                 :(R)Echo_tree0
  function~___II~Divide_eval~22                :(R)Divide_eval
  function~___II~Constant_eval~12              :(R)Constant_eval
  function~___II~Function_parenthesis~80       :(R)Function_parenthesis
  state~___II~State9_transition~128            :(R)State9_transition
  state~___II~State7_transition~126            :(R)State7_transition
  state~___II~State6_transition~124            :(R)State6_transition
  state~___II~State5_transition~122            :(R)State5_transition
  state~___II~State4_transition~120            :(R)State4_transition
  state~___II~State3_transition~118            :(R)State3_transition
  state~___II~State2_transition~116            :(R)State2_transition
  state~___II~State1_transition~114            :(R)State1_transition
  state~___II~State0_transition~112            :(R)State0_transition
  function~___II~Echo_tree~27                  :(R)Echo_tree
  function~___II~F6~52                         :(R)F6
  function~___II~F5~46                         :(R)F5
  function~___II~F4~41                         :(R)F4
  function~___II~F3~37                         :(R)F3
  function~___II~F2~34                         :(R)F2
  function~___II~F1~32                         :(R)F1
}

Figure 5.12. Groups based on clusters found in algebraic expression evaluation program













Group V (cont.) is {
  function~___II~Destruct_function~21          :(R)Destruct_function
  symbol~___II~Symbol_table_get_value~138      :(R)Symbol_table_get_value
  symbol~___II~Symbol_table_get_index~136      :(R)Symbol_table_get_index
  symbol~___II~Symbol_table_clear~135          :(R)Symbol_table_clear
  symbol~___II~Echo_symbol_table~26            :(R)Echo_symbol_table
  symbol~___II~Construct_symbol_table~20       :(R)Construct_symbol_table
  state~___II~State0_inc_paren~111             :(R)State0_inc_paren
  state~___II~State0_get_paren~110             :(R)State0_get_paren
  state~___II~State0_dec_paren~108             :(R)State0_dec_paren
  state~___II~State0_get_char~109              :(R)State0_get_char
  state~___II~Construct_state0~19              :(R)Construct_state0
  function~___II~Function_valid~85             :(R)Function_valid
}

Figure 5.12 - continued


5.4.3 Test Case 3: Symbol Table Management for Acx


The third test case consists of the symbol table management module of an ANSI "C" parser tool. The statistics of this program are listed in Table 5.4. This section presents the objects identified by the object finder during the top-down analysis method. In addition, we present the clusters identified in this program using Hutchens and Basili's hierarchical clustering technique [14].

The identified objects, obtained by the object finder during the top-down analysis method, are shown in Figures 5.13 and 5.14. The groups corresponding to these identified objects are derived after the user performs modifications on the identified objects. Similarly to Section 5.4.1, we name a group based on the name of the object that corresponds to the group. The groups are shown in Figures 5.15 and 5.16.

Similarly to Section 5.4.1, the clusters of this test case were identified by the clustering tool basili [21], based on the hierarchical clustering technique. The identified clusters in the symbol table management program are shown in Figure 5.17.















Object z#objTFILE*#3 is {
  ~~FILE*~124                            :(T)FILE*
  table~___II~output_usage~283           :(R)output_usage
  table~___II~output_table~271           :(R)output_table
}
Object z#objTSymbol*#4 is {
  ~~Symbol*~64                           :(T)Symbol*
  table~___II~where_defined~448          :(R)where_defined
  table~___II~symbol_dup~361             :(R)symbol_dup
  table~___II~new_symbol~251             :(R)new_symbol
  table~___II~move_use~244               :(R)move_use
  table~___II~move_reference~242         :(R)move_reference
  table~___II~move_def~237               :(R)move_def
  table~___II~merge_type~234             :(R)merge_type
  table~___II~make_undefined_symbol~211  :(R)make_undefined_symbol
  table~___II~make_symbol~209            :(R)make_symbol
  table~___II~is_member_of~194           :(R)is_member_of
  table~___II~is_defined~192             :(R)is_defined
  table~___II~is_type_alias~190          :(R)is_type_alias
  table~___II~is_macro_name~188          :(R)is_macro_name
  table~___II~insert_use~184             :(R)insert_use
  table~___II~insert_symbol~182          :(R)insert_symbol
  table~___II~insert_reference~180       :(R)insert_reference
  table~___II~insert_def~178             :(R)insert_def
  table~___II~get_top_symbol~149         :(R)get_top_symbol
  table~___II~get_entry_string~145       :(R)get_entry_string
  table~___II~get_default~144            :(R)get_default
  table~___II~find_symbol~132            :(R)find_symbol
  table~___II~add_type~63                :(R)add_type
}
Object z#objTToken*#5 is {
  ~~Token*~95                            :(T)Token*
  table~___II~token_dup~416              :(R)token_dup
  table~___II~token_cmp~413              :(R)token_cmp
  table~___II~new_token~252              :(R)new_token
  table~___II~is_there~197               :(R)is_there
}
Object z#objTType*#6 is {
  ~~Type*~254                            :(T)Type*
  table~___II~type_dup~429               :(R)type_dup
  table~___II~type_cat~426               :(R)type_cat
  table~___II~new_type~253               :(R)new_type
}

Figure 5.13. Types-based candidate objects identified in symbol table management program
























Object z#objTchar*#7 is {
  ~~char*~58                              :(T)char*
  table~___II~get_type_name~151           :(R)get_type_name
  table~___II~get_table_index~147         :(R)get_table_index
}
Object z#objTentrytag*#8 is {
  ~~entrytag*~250                         :(T)entrytag*
  table~___II~new_entry~249               :(R)new_entry
}
Object z#objT+undetermined-type#9 is {
  ~~symbotag*~208                         :(T)symbotag*
  ~~Type*~254                             :(T)Type*
  ~~Token*~95                             :(T)Token*
  ~~Symbol*~64                            :(T)Symbol*
  ~~FILE*~124                             :(T)FILE*
  table~___II~output_use~285              :(R)output_use
  table~___II~output_type~276             :(R)output_type
  table~___II~output_type_member~280      :(R)output_type_member
  table~___II~output_token~273            :(R)output_token
  table~___II~output_reference~268        :(R)output_reference
  table~___II~output_parameter~262        :(R)output_parameter
  table~___II~output_parameter_usage~265  :(R)output_parameter_usage
  table~___II~output_def~259              :(R)output_def
  table~___II~output_declaration_type~256 :(R)output_declaration_type
  table~___II~make_name~205               :(R)make_name
}

Figure 5.13 - continued






















Object z#objGcurrentcount#2 is {
  table~___II~use_list~446               :(G)use_list
  table~___II~symbol_table~363           :(G)symbol_table
  table~___II~goto_reference_list~157    :(G)goto_reference_list
  table~___II~def~98                     :(G)def
  table~___II~debug_token~94             :(G)debug_token
  table~___II~current_side~92            :(G)current_side
  table~___II~current_scope~91           :(G)current_scope
  table~___II~current_func_name~90       :(G)current_func_name
  table~___II~current_count~89           :(G)current_count
  table~___II~where_defined~448          :(R)where_defined
  table~___II~table_reset~382            :(R)table_reset
  table~___II~table_init~381             :(R)table_init
  table~___II~table_final~380            :(R)table_final
  table~___II~remove_scope_flag~310      :(R)remove_scope_flag
  table~___II~pr_debug_token~294         :(R)pr_debug_token
  table~___II~output_usage~283           :(R)output_usage
  table~___II~output_table~271           :(R)output_table
  table~___II~make_symbol~209            :(R)make_symbol
  table~___II~make_name~205              :(R)make_name
  table~___II~is_defined~192             :(R)is_defined
  table~___II~is_type_alias~190          :(R)is_type_alias
  table~___II~is_macro_name~188          :(R)is_macro_name
  table~___II~insert_use~184             :(R)insert_use
  table~___II~insert_symbol~182          :(R)insert_symbol
  table~___II~insert_reference~180       :(R)insert_reference
  table~___II~insert_def~178             :(R)insert_def
  table~___II~flag_scope_flag~136        :(R)flag_scope_flag
  table~___II~find_symbol~132            :(R)find_symbol

Figure 5.14. Globals-based candidate objects identified in symbol table management program