DATA FRAGMENTATION AND ALLOCATION ALGORITHMS
FOR DISTRIBUTED DATABASE DESIGN
By
MINYOUNG RA
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1990
"The Lord is my shepherd; I shall not want."
(Psalms 23:1)
ACKNOWLEDGMENTS
I would like to express my deep gratitude to my advisor,
Dr. Shamkant B. Navathe, for giving me the opportunity to work in the interesting area of distributed database design and for his continual encouragement and support through the course of this work. I indeed owe him a great deal.
I would like to express my special thanks to my supervisory committee co-chair, Dr. Ravi Varadarajan, for his ever-present help and valuable comments on this work. I would like to thank Dr. Sharma Chakravarthy for his useful suggestions and discussions. I would like to thank Dr. Jose C. Principe and Dr. Manuel Bermudez for their time and interest in my research.
I would like to thank Mrs. Sharon Grant for the unceasing help and assistance. I would also like to thank all those unnamed friends who helped me in pursuing my research.
I would like to thank the Korea Military Academy and the Korean Army for allowing me to study here and for their encouragement and support.
Finally, I would like to thank my family, especially my
wife Eunjin and children Shinil and Eunsong, for their continuous prayer and encouragement.
This work was supported by the National Science Foundation under grant no. IRI 8716798.
TABLE OF CONTENTS
Page
ACKNOWLEDGMENTS............................................ iii
ABSTRACT................................................... vii
CHAPTERS
1. INTRODUCTION........................................... 1
1.1 Basic Concepts of Fragmentation and Allocation 2
1.2 Proposed Problem.................................. 4
1.3 Contributions of This Dissertation............... 5
2. BACKGROUND AND PROPOSED RESEARCH..................... 9
2.1 Previous Related Work............................. 9
2.2 The Outline of the Proposed Research............ 14
3. A VERTICAL PARTITIONING ALGORITHM.................... 18
3.1 Overview.......................................... 19
3.2 Definitions and Notations........................ 21
3.3 Fundamental Concepts............................. 22
3.4 The Graph Oriented Algorithm..................... 25
3.5 Application and Extension........................ 34
4. A HORIZONTAL PARTITIONING ALGORITHM.................. 43
4.1 Overview.......................................... 43
4.2 Predicate Usage Matrix and Predicate Affinity
Matrix............................................ 45
4.3 The Algorithm..................................... 47
4.4 Nonoverlapping Horizontal Partitioning.......... 54
5. A MIXED PARTITIONING METHODOLOGY..................... 61
5.1 Proposed Methodology............................. 61
5.2 Grid Creation..................................... 65
5.3 Grid Optimization................................. 65
 5.4 Effect of Changes in Data and Transactions......... 74
6. AN ALLOCATION ALGORITHM FOR THE MIXED FRAGMENTS....... 79
6.1 Distributed Transaction Processing.............. 80
6.2 Representation and Strategy...................... 84
6.3 A Heuristic Procedure for Mixed Fragment
Allocation........................................ 92
6.4 Complexity of the Algorithm...................... 95
6.5 Experimental Results............................. 96
6.6 Dealing with Nonaccessed Fragments.............. 97
7. ALTERNATIVE WAYS OF DEALING WITH FRAGMENTATION AND
ALLOCATION............................................ 106
7.1 Alternatives for Combining Fragmentation and
Allocation........................................ 106
7.2 Cell Allocation Followed by Local Optimization 108
8. CONCLUSIONS AND FURTHER RESEARCH.................... 111
REFERENCES................................................. 113
BIOGRAPHICAL SKETCH........................................ 116
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
DATA FRAGMENTATION AND ALLOCATION ALGORITHMS
FOR DISTRIBUTED DATABASE DESIGN
By
Minyoung Ra
December 1990
Chairman: Dr. Shamkant B. Navathe
Major Department: Computer and Information Sciences
In a distributed database system, data fragmentation and allocation are the major design issues. This dissertation presents a set of algorithms for fragmentation (or partitioning) and allocation. The relational model of data is assumed for convenience. The algorithms can be applied to other data models with minor variations.
The partitioning of a global schema into fragments can be performed in two different ways: vertical partitioning and horizontal partitioning. Vertical partitioning is the process of subdividing the attributes of a relation or a record type
into multiple records, thereby creating fragments. Horizontal partitioning is the process that divides a global relation into subsets of tuples, called horizontal fragments. In this dissertation a new vertical partitioning algorithm is
presented. This algorithm starts from the attribute affinity
matrix by considering it as a complete graph. Then, forming a linearly connected spanning tree, it generates all meaningful fragments simultaneously. A new horizontal partitioning algorithm is also presented, which is developed by applying the same graphical technique as in vertical partitioning.
The need for mixed partitioning arises because database
users usually access data subsets which are simultaneously vertical and horizontal fragments of global relations.
However, this problem has not been addressed well in the current literature. A mixed partitioning methodology, which first forms a grid by partitioning a global relation
vertically and horizontally in an independent fashion and then produces the final fragments (called mixed fragments) by merging the grid cells, is addressed.
In most of the previous allocation work, the unit of allocation is the fragment that results from horizontal
partitioning or vertical partitioning. Little work has been done for the allocation of the result of mixed partitioning. This dissertation presents an allocation algorithm for
fragments that are generated from our mixed partitioning procedure. In this algorithm a mixed fragment is the unit of allocation. This algorithm is developed based on a heuristic called the pseudoallocation technique.
The contribution of this dissertation consists of providing efficient graph oriented algorithms for partitioning, developing a mixed partitioning approach to distribution, and a new approach to fragment allocation in distributed databases.
CHAPTER 1
INTRODUCTION
In recent years, due to the demand for system availability and autonomy, and enabled by advances in database and communication technology, distributed database systems have become an important area of information processing, and it is easy to foresee that their importance will rapidly grow. There are both organizational and technological reasons for this trend; distributed databases eliminate many of the shortcomings of centralized databases and fit more naturally into the decentralized structures of many organizations [Ceri 84].
There are two alternative approaches to the design of data distribution: the top-down and the bottom-up approaches [Cer 83a, Ceri 84]. The former is typical of distributed databases developed from scratch, while the latter is typical of the development of multidatabase systems as the aggregation of existing databases. This dissertation is concerned with the top-down approach.
In the top-down approach, the description of the global schema is given, the fragmentation of the databases is designed, and then these fragments are allocated to the sites. The objective of this approach is to evaluate the utility of partitioning data objects into fragments both horizontally and vertically, and then to determine the allocation of fragments to the database sites [Ceri 82]. Optimal and heuristic
mathematical models are used to determine the allocation of data to reduce the aggregate cost of transaction processing [Cer 83b, Nava 84].
Figure 1-1 describes an overview of distributed database design when it is performed in a top-down manner [Cer 83b].
The main difference from centralized database design is the distribution design step. Prior to distribution design, a global schema is produced by using standard techniques of centralized database design. During the distribution design step, several local logical schemas for the distributed database are produced. Data objects which belong to the global database schema are fragmented and allocated to the different sites, and associations between objects are taken into account in the process. The design is completed by performing the physical design of the data which are allocated to each site.
1.1 Basic Concepts of Fragmentation and Allocation
The top-down approach to distributed design consists of solving for each global object the design problems of fragmentation and allocation [Cer 83b]. Fragmentation or partitioning is the process of subdividing a global object (entity or relation) into several pieces, called fragments, and allocation is the process of mapping each fragment to one or more sites. Fragments must be appropriate units of allocation. Thus the database designers should define fragments as homogeneous collections of information from the viewpoint of distribution [Ceri 87].
The decomposition of a global schema into fragments can be performed using two different types of fragmentation techniques: horizontal fragmentation and vertical fragmentation. Vertical fragmentation is the process that divides a global object into groups of its attributes, called vertical fragments. In order to be able to access the same data after partitioning, it is required that each fragment include a key attribute of the global object. Horizontal fragmentation is the process that divides a global
relation into subsets of tuples, called horizontal fragments. Each fragment is associated with a predicate which indicates
the distinguishing property possessed by the instances or tuples of that fragment. Note that fragments need not be disjoint. This means that fragments can have common attributes in vertical fragmentation and common tuples in horizontal fragmentation. Another type of fragmentation is mixed
fragmentation. Mixed fragmentation can be built by alternating horizontal and vertical fragmentations.
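As an illustration only, and not part of the dissertation's formal development, the two basic fragmentation types can be sketched on a toy relation. The relation, attribute names, and selection predicate below are invented for the example:

```python
# Toy sketch of vertical and horizontal fragmentation on a relation
# stored as a list of dicts. Relation contents are invented.

def vertical_fragment(relation, attrs, key="id"):
    # Each vertical fragment keeps the key attribute so the original
    # tuples can be reconstructed by joining the fragments on the key.
    cols = [key] + [a for a in attrs if a != key]
    return [{a: t[a] for a in cols} for t in relation]

def horizontal_fragment(relation, predicate):
    # A horizontal fragment is the subset of tuples satisfying the
    # fragment's distinguishing predicate.
    return [t for t in relation if predicate(t)]

employee = [
    {"id": 1, "name": "Kim", "dept": "CIS", "salary": 50},
    {"id": 2, "name": "Lee", "dept": "EE", "salary": 60},
]

v = vertical_fragment(employee, ["name"])                 # key plus name
h = horizontal_fragment(employee, lambda t: t["dept"] == "CIS")
```

Note that, as stated above, fragments need not be disjoint: every vertical fragment here shares the key attribute.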
The allocation of fragments can be either nonredundant or redundant. A nonredundant (nonreplicated) allocation requires each fragment to be mapped to exactly one site, whereas a redundant (replicated) allocation allows each fragment to be mapped to one or more sites. Typically, replication permits higher availability and reliability of the system, though it needs more storage and additional effort for maintaining consistency.
1.2 Proposed Problem
Figure 1-2 shows the Distributed Database Design Tool (D3T) reference architecture which is used in our research. The term "GRID" is used to refer to the result of applying horizontal partitioning and vertical partitioning algorithms simultaneously to a global relation. It consists of cells called "grid cells." The grid suggests all possible ways in which the global relation in a distributed database may be partitioned. The input information to D3T is the global schema, which consists of a set of relations, together with information about the important transactions on the proposed database. As stated in the previous work [Cer 83b, Nava 84], it is not necessary to collect information on 100% of the expected transactions (that would, of course, be impossible). Since the 80-20 rule applies to most practical situations, it is adequate to supply information regarding the 20% of the heavily used transactions which account for about 80% of the activity against the database. The other input to D3T shown in Figure 1-2 is the distribution constraints, which include preferences or special considerations designers/users may have
that would influence partitioning and allocation. In this figure, grid creation is related to vertical fragmentation and horizontal fragmentation, and grid optimization is related to mixed fragmentation. In this dissertation, we will deal with vertical fragmentation, horizontal fragmentation, mixed
fragmentation and allocation with no redundancy by presenting new algorithms that have less complexity or provide minimum transaction processing cost.
1.3 Contributions of This Dissertation
In a distributed database system, data fragmentation and
allocation are the major design issues. This dissertation presents a set of algorithms for data fragmentation and allocation in a distributed database system environment.
In this dissertation, a number of problems for data fragmentation and allocation are defined and attacked by using graphical techniques. One of the main contributions of this dissertation is the development of the graph oriented partitioning algorithm for vertical and horizontal partitioning. The major feature of this algorithm is that all fragments are generated in one iteration in time O(n²), where n represents the number of attributes, which is more efficient than the previous approaches. Furthermore, it provides a cleaner alternative without arbitrary objective functions.
Another contribution is the development of the first comprehensive mixed partitioning methodology for this problem.
A graphbased allocation algorithm is also presented along with some heuristics.
[Figure 1-1 The overall distributed database design methodology: top-down manner [Cer 83b]. The diagram proceeds from the users' requirement specifications through the enterprise schema, transaction definitions, and distribution requirements, to the distribution of the enterprise schema into local logical schemas, physical database design for each local database, and the implementation schemas of the operational databases.]
[Figure 1-2 Distributed database design tool reference architecture. The global schema, transaction information, and distribution constraints feed the site-independent grid creation step; grid optimization and allocation follow as site-dependent steps, producing the local schemas and the physical schemas.]
CHAPTER 2
BACKGROUND AND PROPOSED APPROACH
2.1 Previous Related Work
We review the previous work in the design of data
distribution. This review is divided into the work on data fragmentation and the work on data allocation.
2.1.1 Previous Work on Data Fragmentation
Data fragmentation is performed during the design of a database to improve performance of transactions. In order to obtain improved performance, fragments must be "closely matched" to the requirements of the transactions.
The work related to this topic can be classified into the work on vertical partitioning, on horizontal partitioning, and on mixed partitioning.
2.1.1.1 Vertical partitioning
Hoffer and Severance [Hoff 75] measure the affinity
between pairs of attributes and try to cluster attributes according to their pairwise affinity by using the bond energy algorithm (BEA) developed in [McCo 72].
Navathe et al. [Nava 84] extend the results of Hoffer and Severance and propose a two-phase approach to vertical
partitioning. During the first phase, they use the given input parameters in the form of an attribute usage matrix and transactions to construct the attribute affinity matrix on which clustering is performed. After clustering, iterative binary partitioning is attempted, first with an empirical objective function. The process is continued until no further partitioning results. During the second phase, the fragments
can be further refined by incorporating estimated cost factors weighted on the basis of the type of problem being solved.
Cornell and Yu [Corn 87] apply the work of Navathe et al. [Nava 84] to relational databases. They propose an algorithm
which decreases the number of disk accesses to obtain an optimal binary partitioning. They show how knowledge of
specific physical factors can be incorporated into the overall fragmentation methodology.
Ceri, Pernici and Wiederhold [Ceri 88] extend the work of Navathe et al. [Nava 84] by considering it as a DIVIDE tool and by adding a CONQUER tool. Their CONQUER tool extends the same basic approach in the direction of adding details about
operations and physical accesses similar to [Corn 87]. This approach focuses on the decomposition of the design process into several design subproblems and provides no algorithmic
improvement in the process of vertical partitioning itself.
2.1.1.2 Horizontal partitioning
Ceri, Negri and Pelagatti [Ceri 82] analyze the horizontal partitioning problem, dealing with the
specification of partitioning predicates and the application of horizontal partitioning to different database design problems.
Ceri, Navathe and Wiederhold [Cer 83b] develop an optimization model for horizontal partitioning without replication in the form of a linear integer 0-1 programming problem. They model the logical schema of a database as a directed graph with objects as nodes and links as edges, and
required the user to specify the information about (1) the data about schema (attribute size, relationship size,
cardinality), (2) tabulation of transactions (frequencies, site of origin), and (3) distribution requirement (constraints).
Yu et al. [Yu 85] propose an adaptive algorithm for record clustering, which is conceptually simple and highly intuitive. However, the adaptive approach does not use the transaction information that is useful for partitioning. Hence, it cannot be applied until the execution of transactions.
2.1.1.3 Mixed partitioning
Currently mixed partitioning has not been adequately addressed in the literature. The only work we are aware of is
that of Apers [Aper 88], who considers the fragmentation problem together with the allocation problem. In his approach, the fragmentation scheme is the output of the allocation
algorithm. Thus we cannot know the fragmentation scheme before allocation.
2.1.2 Previous Work on Data Allocation
Previous work on data allocation has been mainly in two
areas: (1) data allocation independent of network parameters, and (2) extension to the pure data allocation problems including the network topology and communication channels. Note that since a distributed file system differs greatly from a distributed database, the solutions for the file allocation problems [Dowd 82] do not characterize solutions to the
allocation problems in a distributed database. This is because in the distributed database systems the way the data are accessed is far more complex [Aper 88]. This review
concentrates on how the data allocation problem has been attacked in the context of distributed databases and mainly in areas of data allocation itself.
Ceri, Martella and Pelagatti [Cer 80] consider the problem of data allocation for typical database applications with a simple model of transaction execution. However, they do not consider any partitioning.
Ceri, Navathe and Wiederhold [Cer 83b] propose a model for a nonreplicated data allocation in the form of a linear 0-1 integer programming problem. But only horizontal
partitioning is considered and a certain query processing strategy is assumed.
Wah [Wah 84] investigates the important problems of file
allocation in general. However, he does not consider data fragmentation for database design.
Sacca and Wiederhold [Sacc 85] propose a heuristic allocation algorithm in a cluster of processors in which the
user does not address a transaction request to a specific processor. They try to minimize the global cost for allocating data in the given network. However, this approach cannot be applied to the system in which processors are remotely distributed, because, in a remotely distributed system, the user has to request data from a specific site.
Cornell and Yu [Corn 88, Corn 89] propose a strategy to integrate the treatment of relation assignment and query strategy to optimize performance by using the 0-1 integer programming method. They, however, consider neither partitioning nor join order.
Apers [Aper 88] proposes a model for computing the cost
of a completely or partially specified allocation for various cost functions. In this approach fragments are the result of
the allocation. So we do not know about the fragmentation scheme before allocation. When fragments are generated, since clustering of fragments is not considered, the generated fragments may contain only a few or even one tuple.
Ceri, Pernici and Wiederhold [Ceri 88] propose a
distributed design tool which provides a flexible environment for the design of data distribution. In this paper vertical
fragments are allocated to each site according to the DIVIDE and CONQUER tools. They, however, do not consider horizontal partitioning.
2.2 The Outline of the Proposed Approach
This dissertation is primarily concerned with the algorithms for data fragmentation and allocation. We provide
an overview of the structure of the dissertation by discussing the basic ideas for solving partitioning and allocation problems in distributed database design.
2.2.1 Vertical Partitioning
In all vertical partitioning algorithms that we have surveyed, the binary partitioning technique has been used for partitioning after clustering attributes. Thus binary
partitioning is repeated until all meaningful fragments are determined. It is also necessary that clustering be repeated on two new affinity matrices corresponding to the newly generated fragments. In Chapter 3 we propose a new vertical
partitioning algorithm which has less computational complexity and generates all meaningful fragments simultaneously by using a graphical method. This approach is based on the fact that
all pairs of attributes in a fragment have high "within fragment affinity" but low "between fragment affinity."
2.2.2 Horizontal Partitioning
In horizontal partitioning, most of the previous
approaches that we know of try to select an optimal horizontal partitioning for each relation. In the present approach, however, we do not consider how to select an optimal horizontal partitioning for each relation in the database. We are focusing our attention on identifying all the candidate
horizontal partitions by using the same graphical approach as in the vertical partitioning. To this end, we construct a predicate affinity matrix. Based on this matrix we develop a horizontal partitioning algorithm in Chapter 4.
2.2.3 Mixed Partitioning
Mixed partitioning can be accomplished in a sequential manner in one of two ways: first performing vertical partitioning and then horizontally partitioning the vertical partitions (called VH partitioning), or first performing horizontal partitioning and then vertically partitioning the horizontal partitions (called HV partitioning). Obviously, this is not adequate since it potentially leads to different
results and leaves out the possibility of combining fragments at a smaller granularity to produce more efficient data distribution. Thus, in Chapter 5, we propose a uniform mixed
partitioning methodology in order to solve the above problems. A grid consisting of "cells" is created by simultaneously
applying the vertical partitioning and horizontal partitioning algorithms on the relation. The grid cells will be merged by taking into account the cost factors to minimize the global transaction processing costs. This approach produces optimal mixed fragments that cannot be otherwise produced by independent partitioning models.
2.2.4 Allocation
The problem of data allocation in distributed database systems has been extensively studied. However, in most cases the unit of allocation is the fragment resulting from horizontal partitioning or vertical partitioning. Little work
has been done for the allocation of the result of mixed partitioning. So, in Chapter 6, we propose an allocation algorithm for the result of mixed partitioning. A mixed
fragment, which comes from our mixed partitioning procedure by considering both vertical partitioning and horizontal
partitioning simultaneously, is the unit of allocation. The objective of this algorithm is to allocate the mixed fragments nonredundantly with minimum data transmission cost while
satisfying the load balancing constraints. To this end we will present some heuristics that combine fragment allocation and
transaction processing strategy. Based on the heuristics we will develop this algorithm.
2.2.5 Alternative Ways of Dealing with Fragmentation and Allocation
Distributed database design has been dealt with in terms of fragmentation of data and allocation of these fragments. There may be other alternatives with which these problems can be attacked. In Chapter 7, we describe the various alternatives and evaluate them by comparing them with one another.
CHAPTER 3
A VERTICAL PARTITIONING ALGORITHM
Vertical partitioning is the process that divides a "global object," which may be a single relation or something more like a universal relation, into groups of its attributes called
vertical fragments [Cer 83a, Nava 84, Corn 87]. It is used during the design of a database to enhance the performance of transactions [Nava 84]. In order to obtain improved
performance, fragments must closely match the requirements of the transactions. Vertical partitioning has a variety of applications wherever the "match" between data and transactions can affect performance. That includes
partitioning of individual files in centralized environments, data distribution in distributed databases, dividing data among different levels of memory hierarchies, and so on. Figure 3-1 shows the transaction specification for our example, which will cover both the vertical and horizontal algorithms.
Although this and the subsequent chapters are presented
using the relational model, the approach presented in this dissertation is general enough. By replacing the term relation with record or object, it can be easily applied to the hierarchical, network, or object-oriented models.
3.1 Overview
The algorithm that we propose starts from the attribute
affinity (AA) matrix, which is generated from the attribute usage matrix using the same method as that of the previous approach in [Nava 84]. The attribute usage matrix represents
the use of attributes in important transactions. Each row refers to one transaction; a "1" entry in a column indicates that the transaction "uses" the corresponding attribute. Whether the transaction retrieves or updates the relation can also be captured by another column vector with R and U entries for retrieval and update. That information may be used by an
empirical objective function as in [Nava 84]. The attribute usage matrix for 10 attributes and 8 transactions is shown in Figure 3-2. Attribute affinity is defined as
    aff_ij = Σ_{k ∈ T} acc_kij    (i ≠ j)

where acc_kij is the number of accesses of transaction k referencing both attributes i and j. The summation occurs over all transactions that belong to the set of important transactions T. This definition means that attribute affinity measures the strength of an imaginary bond between the two attributes, predicated on the fact that attributes are used together by transactions. Although the affinity aff_ii does not have any physical meaning, it is reasonable to define it as follows:

    aff_ii = Σ_{k ∈ T} acc_ki
where acc_ki means the number of accesses of transaction k referencing attribute i. This is reasonable since it shows the "strength" of that attribute in terms of its use in all transactions.
Based on these definitions of attribute affinity, the attribute affinity matrix is defined as follows: it is an n x n matrix for the n-attribute problem whose (i,j) element equals aff_ij. Figure 3-3 shows the attribute affinity matrix which was formed from Figure 3-2.
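The construction of the AA matrix from an attribute usage matrix can be sketched as follows. This is an illustrative reading of the definitions above, not the dissertation's own code; the small usage matrix and access frequencies are invented:

```python
# Sketch of building the attribute affinity (AA) matrix from an
# attribute usage matrix. use[k][i] is 1 if transaction k uses
# attribute i; acc[k] is the access frequency of transaction k.
# The input values are invented for illustration.

def affinity_matrix(use, acc):
    n = len(use[0])                      # number of attributes
    aff = [[0] * n for _ in range(n)]
    for k, row in enumerate(use):
        for i in range(n):
            if not row[i]:
                continue
            aff[i][i] += acc[k]          # aff_ii: strength of attribute i
            for j in range(i + 1, n):
                if row[j]:               # i and j used together by k
                    aff[i][j] += acc[k]
                    aff[j][i] += acc[k]  # the AA matrix is symmetric
    return aff

use = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]  # 3 transactions, 3 attributes
acc = [10, 5, 1]
aff = affinity_matrix(use, acc)          # e.g. aff[0][1] == 10
```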
A note about the attributes: in this proposed technique, as well as in the previous techniques, the set of attributes considered may be
(1) the universal set of attributes in the whole database, or
(2) the set of attributes in a single relation (or record type or object type).
By using (1), the fragments generated may be interpreted as relations or record types. By using (2), fragments of a single relation are generated.
In previous approaches, a clustering algorithm is applied to the AA matrix. In our present approach, however, we consider the AA matrix as a complete graph, called the affinity graph, in which an edge value represents the affinity between the two attributes. Then, forming a linearly connected spanning tree, the algorithm generates all meaningful fragments in one iteration by considering a cycle as a fragment. A "linearly connected" tree has only two ends. Figure 3-4 shows the affinity graph corresponding to the AA matrix of Figure 3-3. Note that the AA matrix serves as a data structure for the affinity graph.
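Since the AA matrix doubles as the adjacency structure of the affinity graph, enumerating the graph's edges in decreasing affinity order is straightforward. The sketch below is illustrative only, with an invented 3x3 matrix; the actual edge selection rules of the algorithm are those of Sections 3.3 and 3.4, not a pure global sort:

```python
# The AA matrix serves directly as the affinity graph: node i is
# attribute i and edge (i, j) carries weight aff[i][j]. This lists
# the edges of the complete affinity graph from highest affinity to
# lowest, the candidate order a greedy tree-growing procedure would
# examine them in.

def affinity_edges(aff):
    n = len(aff)
    edges = [(aff[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    edges.sort(reverse=True)             # highest-affinity edge first
    return edges

aff = [[15, 10, 5], [10, 11, 1], [5, 1, 6]]
edges = affinity_edges(aff)
```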
The major advantages of the proposed method over that in [Nava 84] are as follows:
(1) There is no need for iterative binary partitioning. The major weakness of iterative binary partitioning is that at each step two new problems are generated, increasing the complexity.
(2) The method obviates the need for using any empirical
objective functions as in [Nava 84]. As shown in [Corn 87], the intuitive objective functions used in [Nava 84] do not necessarily work well when an actual detailed cost formulation for a specific
system is utilized.
(3) The method requires no complementary algorithms such
as the SHIFT algorithm of [Nava 84].
(4) The complexity of the algorithm is O(n²), better than the O(n² log n) complexity of the previous algorithms.
3.2 Definitions and Notations
We shall use the following notation and terminology in the description of our algorithm.
- A, B, C, ... denote nodes.
- a, b, c, ... denote edges.
- p(e) denotes the affinity value of an edge e.
- "primitive cycle" denotes any cycle in the affinity graph.
- "cycle completing edge" denotes a "to be selected" edge that would complete a cycle.
- "cycle node" is that node of the cycle completing edge which was selected earlier.
- "affinity cycle" denotes a primitive cycle that contains a cycle node. We assume that a cycle means an affinity cycle, unless otherwise stated.
- "former edge" denotes an edge that was selected earlier than the cycle node.
- "cycle edge" is any of the edges forming a cycle.
- "extension of a cycle" refers to a cycle being extended by pivoting at the cycle node.
The above definitions are used in the proposed algorithm to process the affinity graph and to generate possible cycles from the graph. Note that each cycle gives rise to a vertical fragment.
3.3 Fundamental Concepts
Based on the above definitions we would like to explain the mechanism of forming cycles. For example, in Figure 35,
suppose edges a and b were selected already and c was selected
next. At this time, since c forms a primitive cycle, we have to check if it is an affinity cycle. This can be done by checking the possibility of a cycle. "Possibility of a cycle" results from the condition that no former edge exists, or p(former edge) ≤ p(all the cycle edges). The primitive cycle a,b,c is an affinity cycle because it has no former edge and satisfies the possibility of a cycle (i.e., the first cycle with three edges is always an affinity cycle). Therefore the primitive cycle a,b,c is marked as a candidate partition and node A becomes a cycle node.
Now let us explain how the extension of a cycle is performed. In Figure 3-5, after the cycle node is determined, suppose edge d was selected. At this time, d is checked as a potential edge for extension. This can be done by checking the possibility of extension of the cycle by d. "Possibility of extension" results from the condition p(edge being considered or cycle completing edge) ≥ p(any one of the cycle edges). Thus the old cycle a,b,c is extended to the new cycle a,b,d,f if the edge d under consideration, or the cycle completing edge f, satisfies the possibility of extension, which is: p(d) or p(f) ≥ minimum of (p(a),p(b),p(c)). Now the process is continued: suppose e was selected as the next edge. But we know from the definition of the "extension of a cycle" that e cannot be considered as a potential extension because the primitive cycle d,b,e does not include the cycle node A. Hence it is discarded and the process is continued.
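The two checks just described can be phrased as small predicate functions. The following Python sketch mirrors the stated conditions; the function names are our own, and `None` stands for "no former edge":

```python
def possibility_of_cycle(former_edge_wt, cycle_edge_wts):
    # An affinity cycle requires: no former edge exists, or the former
    # edge is no heavier than every cycle edge.
    return former_edge_wt is None or former_edge_wt <= min(cycle_edge_wts)

def possibility_of_extension(considered_wt, completing_wt, cycle_edge_wts):
    # The edge being considered OR the cycle completing edge must weigh
    # at least as much as the lightest current cycle edge.
    return max(considered_wt, completing_wt) >= min(cycle_edge_wts)
```

With the cycle 9,3,2,8 of the first example below (edge weights 115, 75, 110, 75) and candidate edges of weight 60 and 25, `possibility_of_extension` returns False, which is exactly why that cycle is cut off as a partition.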
The next concept that we wish to explain corresponds to the relationship between a cycle and a partition. There are two cases in partitioning.
(1) Creating a partition with a new edge.
In the event that the edge selected next for inclusion (e.g., d in Figure 3-5) was not considered before, we call it a new edge. If a new edge by itself does not satisfy the possibility of extension, then we continue to check an additional new edge, called the cycle completing edge (e.g., f in Figure 3-5), for the possibility of extension. In Figure 3-5, new edges d and f would potentially provide such a possibility of extension of the earlier cycle formed by edges a,b,c. If d,f meet the condition for possibility of extension stated above (namely p(d) or p(f) ≥ minimum of (p(a),p(b),p(c))), then the extended new cycle would contain edges a,b,d,f. If that condition were not met, we produce a cut on edge d (called the cut edge), isolating the cycle a,b,c. This cycle can now be considered a partition.
(2) Creating a partition with a former edge.
After a cut is produced in (1), if there is a former edge, then change the previous cycle node to the node on which the cut edge was incident, and check for the possibility of extension of the cycle by the former edge. For example, in Figure 3-6, suppose that a, b, and c form a cycle with A as the cycle node, that there is a cut on d, and that the former edge w exists. Then the cycle node A is changed to C because the cut edge d originates in C. We are now evaluating the possibility of extending the cycle a,b,c into one that would contain the former edge w. Hence we consider the possibility of the cycle a,b,e,w. Assume that neither w nor e satisfies the possibility of extension, i.e., that "p(w) or p(e) ≥ minimum of (p(a),p(b),p(c))" is not true. Then the result is the following: i) w is declared as a cut edge, ii) C remains as the cycle node, and iii) a,b,c becomes a partition. Alternately, if the possibility of extension is satisfied, the result is as follows: i) cycle a,b,c is extended to cycle w,a,b,e, ii) C remains as the cycle node, and iii) no partition can yet be formed.
3.4 The Graph Oriented Algorithm
An algorithm for generating the vertical fragments by the affinity graph is described below.
3.4.1 Description of the Algorithm
First we briefly describe the algorithm in 5 steps.
Step 1. Construct the affinity graph of the attributes of
the object being considered. Note that the AA matrix
is itself an adequate data structure to represent this
graph. No additional physical storage of data would
be necessary.
Step 2. Start from any node.
Step 3. Select an edge which satisfies the following
conditions:
(1) It should be linearly connected to the tree already
constructed.
(2) It should have the largest value among the possible
choices of edges at each end of the tree. Note that
if there are several largest values, any one can be
selected.
This iteration will end when all nodes are used for
tree construction.
Step 4. When the next selected edge forms a primitive cycle.
(1) If a cycle node does not exist, check for the
"possibility of a cycle" and if the possibility exists, mark the cycle as an affinity cycle. Consider
this cycle as a candidate partition. Go to step 3.
(2) If a cycle node exists already, discard this edge and
go to step 3.
Step 5. When the next selected edge does not form a cycle and
a candidate partition exists.
(1) If no former edge exists, check for the possibility
of extension of the cycle by this new edge. If there
is no possibility, cut this edge and consider the
cycle as a partition. Go to step 3.
(2) If a former edge exists, change the cycle node and
check for the possibility of extension of the cycle
by the former edge. If there is no possibility, cut
the former edge and consider the cycle as a partition.
Go to step 3.
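Step 3, the growth of the linearly connected spanning tree, can be sketched in Python as follows. This is a simplified sketch of the tree-construction step only, not of the cycle bookkeeping in steps 4 and 5; ties are broken by the lower node number (the algorithm allows any choice), and the matrix is transcribed from Figure 3-3:

```python
A = [  # attribute affinity matrix of Figure 3-3 (10 x 10)
    [75, 25, 25, 0, 75, 0, 50, 25, 25, 0],
    [25, 110, 75, 0, 25, 0, 60, 110, 75, 0],
    [25, 75, 115, 15, 25, 15, 25, 75, 115, 15],
    [0, 0, 15, 40, 0, 40, 0, 0, 15, 40],
    [75, 25, 25, 0, 75, 0, 50, 25, 25, 0],
    [0, 0, 15, 40, 0, 40, 0, 0, 15, 40],
    [50, 60, 25, 0, 50, 0, 85, 60, 25, 0],
    [25, 110, 75, 0, 25, 0, 60, 110, 75, 0],
    [25, 75, 115, 15, 25, 15, 25, 75, 115, 15],
    [0, 0, 15, 40, 0, 40, 0, 0, 15, 40],
]

def linear_spanning_order(A, start):
    """Grow a linearly connected (path-shaped) spanning tree: at each step
    take the heaviest edge available at either end of the current path."""
    w = lambda i, j: A[i - 1][j - 1]
    ends, order = [start], []
    remaining = set(range(1, len(A) + 1)) - {start}
    while remaining:
        # heaviest edge from either end; ties go to the lower node number
        k = max(sorted(remaining), key=lambda j: max(w(e, j) for e in ends))
        best = max(w(e, k) for e in ends)
        idx = 0 if w(ends[0], k) == best else 1  # which end it attaches to
        if len(ends) == 1:
            ends.append(k)       # the start node becomes one end
        else:
            ends[idx] = k        # the new node replaces that end
        remaining.remove(k)
        order.append(k)
    return order
```

Starting from node 9, this selects nodes 3, 2, 8, 7, 1, 5, 4, 6, 10 in order, yielding the path 5-1-7-8-2-9-3-4-6-10; the three fragments of the first example are contiguous segments of this path.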
To obtain a more detailed algorithm, suppose that the following data structures are used during implementation [Bras 88]: The nodes of the affinity graph are numbered from 1 to n, N = {1, 2, ..., n}, and a symmetric matrix L gives the weight of each edge. Three vectors are used: B, strongest, and maxwt. B gives the sequence of scanned nodes. For each node i ∈ N \ B, strongest[i] gives the node in B that is strongest with respect to i, and maxwt[i] gives the weight from i to strongest[i]; strongest[1] and maxwt[1] are not used. Without loss of generality we can assume that the algorithm starts from node 1. The detailed description of the algorithm now follows. The algorithm uses variables with the following meaning:
- p_cycle: a binary variable which denotes whether a primitive cycle exists
- c_node: a binary variable which denotes whether a cycle node exists
- f_edge: a binary variable which denotes whether a former edge exists
- candidate_p: a binary variable which denotes whether an affinity cycle exists which can potentially generate a fragment
- cycle_cedge_wt: an integer variable for the weight of the cycle completing edge
- former_edge_wt: an integer variable for the weight of the former edge
procedure Make_Partition(L[1..n, 1..n]): set of edges
  { initialize flags and variables }
  B[1] ← 1
  f1 ← 1, f2 ← 0   { f1 & f2 refer to the two ends of the spanning tree }
  p_cycle, c_node, f_edge, candidate_p ← false
  pmin ← ∞   { minimum edge of a cycle }
  { initialize vectors }
  for i ← 2 to n do
    strongest[i] ← 1
    maxwt[i] ← L[i,1]
    B[i] ← 0
  endfor
  repeat n-1 times
    max ← -1
    { select the next node }
    for j ← 2 to n do
      if maxwt[j] > max and
         (strongest[j] = f1 or strongest[j] = f2)
      then max ← maxwt[j]
           k ← j
    endfor
    { adjust the pointers for checking a primitive cycle }
    if strongest[k] = f1
    then if f2 = 0 then f2 ← k
         else swap f1, f2
              f2 ← k
    else f2 ← k
    { check if there is a primitive cycle }
    for j ← 2 to n do
      if B[j] = k and c_node = false
      then p_cycle ← true
           if B[j-1] > 0 then f_edge ← true
    endfor
    if p_cycle = true   { primitive cycle exists }
    then { check if it is an affinity cycle }
         if pmin ≥ former_edge_wt or f_edge = false
         then candidate_p ← true
              c_node ← true
         else f_edge ← false
    else
      insert k into B, maxwt[k] ← ∞
      if candidate_p = true
      then if maxwt[k] < pmin or cycle_cedge_wt < pmin
           then { partition exists; reinitialize variables }
                if f_edge = false
                then save this partition
                else change the cycle node
                     if former_edge_wt < pmin
                     then save this partition
                          remove k from B
           else extend the cycle
      else extend the cycle
      { pmin contains the minimum edge of a cycle }
      if pmin > maxwt[k] then pmin ← maxwt[k]
    { rearrange vectors for next selection }
    for j ← 2 to n do
      if L[k,j] > maxwt[j]
      then maxwt[j] ← L[k,j]
           strongest[j] ← k
    endfor
  endrepeat
3.4.2 Why Does the Above Algorithm Produce Reasonable Partitioning?
Now we will give the proof of the reasonable partitioning of the algorithm. The partitioning produced by this algorithm satisfies the following properties: (1) For every cluster C, every vertex v ∈ C, and every vertex w ∉ C, there exists another vertex u ∈ C such that p(v,u) > p(v,w). Intuitively, this partitioning guarantees that for a vertex v ∈ C, it is better to cluster v with a vertex in C than with a vertex outside C. (2) For every cluster C, there exist two vertices a ∈ C, b ∈ C such that for every vertex c ∈ C, there exists a vertex d ∈ C with p(c,d) > p(a,b). Intuitively, this partitioning guarantees that there is a certain threshold affinity for each cluster, which is satisfied for the pairs of vertices (v1, x1), (v2, x2), ..., (vp, xp), where C = {v1, ..., vp} and xi ∈ C, for all 1 ≤ i ≤ p. Note that the second property applies only when the cluster set consists of more than one cluster.
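Property (1), in the disjunctive form proved below, can be checked mechanically. The following Python sketch (our own helper names) verifies it for the fragments (1,5,7), (2,3,8,9), (4,6,10) of the first example against the affinity matrix of Figure 3-3:

```python
A = [  # attribute affinity matrix of Figure 3-3 (10 x 10)
    [75, 25, 25, 0, 75, 0, 50, 25, 25, 0],
    [25, 110, 75, 0, 25, 0, 60, 110, 75, 0],
    [25, 75, 115, 15, 25, 15, 25, 75, 115, 15],
    [0, 0, 15, 40, 0, 40, 0, 0, 15, 40],
    [75, 25, 25, 0, 75, 0, 50, 25, 25, 0],
    [0, 0, 15, 40, 0, 40, 0, 0, 15, 40],
    [50, 60, 25, 0, 50, 0, 85, 60, 25, 0],
    [25, 110, 75, 0, 25, 0, 60, 110, 75, 0],
    [25, 75, 115, 15, 25, 15, 25, 75, 115, 15],
    [0, 0, 15, 40, 0, 40, 0, 0, 15, 40],
]
p = lambda i, j: A[i - 1][j - 1]

def pm(v, cluster):
    """Strongest affinity from v to another member of its own cluster."""
    others = [u for u in cluster if u != v]
    return max(p(v, u) for u in others) if others else p(v, v)

def property_one_holds(clusters):
    """For every cross-cluster pair (v, w): pm(v) > p(v,w) or pm(w) > p(v,w)."""
    cluster_of = {v: c for c in clusters for v in c}
    return all(pm(v, cluster_of[v]) > p(v, w) or pm(w, cluster_of[w]) > p(v, w)
               for v in cluster_of for w in cluster_of
               if cluster_of[v] is not cluster_of[w])
```

`property_one_holds([(1, 5, 7), (2, 3, 8, 9), (4, 6, 10)])` returns True, while an arbitrary regrouping such as `[(1, 4), (2, 3, 5, 6, 7, 8, 9, 10)]` fails the check.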
To prove these properties formally, we define the following variables.
Γ = {C1, C2, ..., Cr}, the set of clusters
pm(v) = (i) the maximum of p(u,v) for all 1 ≤ j ≤ r, v ∈ Cj, u ∈ Cj;
        (ii) p(v,v), if Cj = {v}
(1) States that for every (v,w) such that v ∈ Cj, w ∈ Ck, j ≠ k, either pm(v) > p(v,w) or pm(w) > p(v,w).
Proof: Case 1: Let Cj be constructed before Ck. When vertex w is encountered for the first time in building the cycle Ck, let (v,w) be the largest affinity edge incident on w. In this case, since, when Cj was constructed, an edge (v,u), u ∈ Cj, with affinity greater than (v,w) was selected as part of a cycle edge, pm(v) > p(v,w) is true. Case 2: Let Ck be constructed before Cj. When vertex w is encountered, a cycle edge (w,x), x ∈ Ck, with affinity greater than (v,w) is selected. In this case, therefore, pm(w) > p(v,w) is true.
(2) States that for every j except k, 1 ≤ k ≤ r, there exists a cut edge (a,b), a ∈ Ck, b ∈ Cj, such that pm(v) > p(a,b) for all v ∈ Cj.
Proof: Note that there are r-1 cut edges. Since there is a unique cycle (cluster) that precedes a cut edge, it follows that each cut edge is associated uniquely with a cluster. By the definition of extension of a cycle, each cut edge has less affinity than all the cycle edges associated with this cut edge. For example, in Figure 3-7, suppose that a, b, c, and d formed a cycle and that there was a cut on edge w. Then, if p(w) had been greater than or equal to the minimum of (p(a), p(b), p(c), p(d)), then, by the definition of the possibility of extension, the cycle A,B,C,D would have been extended to the larger cycle A,B,C,D,W.
3.4.3 Examples
We will use the same example problems from Navathe et al. [Nava 84] to illustrate how this algorithm works and to compare partitioning decisions. Since our algorithm uses the same attribute affinity matrix, we assume that it has already
been completed from the original transaction matrix and the computation of affinities. For ease of understanding, we will refer back to the steps of the algorithm in Chapter 3.4.1.
The attribute affinity matrix of the first example is shown in Figure 3-3 and its affinity graph after excluding the zero-valued edges appears in Figure 3-4. Suppose we start at node 9 (step 2); then, by the algorithm, edges 9-3, 9-2, and 2-8 are selected in order (step 3). At this time, edge 8-9 cannot form a cycle because it does not satisfy the possibility of a cycle (step 3). Thus edge 8-3 is selected as the next edge and it forms a candidate partition (step 4). Note that node 3 becomes a cycle node (step 4). Then the process is continued and edge 8-7 is selected (step 3). Since there is a candidate partition, the possibility of extension is checked (step 5.1). Thus the cycle 9,3,2,8 is considered as a partition because edge 8-7 (the edge being considered) and edge 3-7 (the cycle completing edge) are both less than any of the cycle edges (step 5.1). The relevant part of the graph is shown again in Figure 3-8. As shown in Figure 3-8, our algorithm generates three affinity cycles separated by edges 3-4 and 7-8. They generate three fragments: (1,5,7), (2,3,8,9), (4,6,10). From that figure, we know that the result of our algorithm is the same as that of Navathe et al. [Nava 84].
We conjecture that this algorithm does not depend upon the starting node. For example, let us start at node 1. By the algorithm, the first affinity cycle is not formed until edges 1-5, 5-7, 7-8, 8-2, 2-9, and 9-3 are selected. The first cycle 8,2,9,3 is identified as a candidate partition and node 8 becomes a cycle node. Then a cut occurs on edge 3-4 because neither edge 3-4 nor edge 4-8 satisfies the possibility of extension of the cycle (step 5.1). At this time, since there is a former edge, we have to change the cycle node to node 3 and check for the possibility of extension of the cycle by the former edge 7-8 (step 5.2). Thus another cut occurs on edge 7-8 because edges 7-8 and 7-3 are both less than any of the cycle edges. Figure 3-9 shows this result. Thus we can
conclude that the resulting fragments are always the same irrespective of the node from which you start.
The second example we will use is a global relation with 20 attributes and 15 transactions. The result of Navathe et al. [Nava 84] partitions this relation into four fragments in three iterations: (1,4,5,6,8), (2,9,12,13,14), (3,7,10,11,17,18), (15,16,19,20). Our algorithm, however, generates five fragments in one iteration, as shown in Figure 3-10: (1,5,8), (4,6), (2,9,12,13,14), (3,7,10,11,17,18), (15,16,19,20). Note that the algorithm starts from node 1 and the cut of edge 3-2 is performed earlier than that of edge 4-7. This result shows that our algorithm can find one more possibility of partitioning. Thus what the empirical objective function could not discriminate as a potential partitioning in [Nava 84] is actually detected by our procedure.
3.4.4 Complexity of the Algorithm
Now we consider the computational complexity. Step 1 does not affect the computational complexity because the attribute affinity matrix can be used as the symmetric matrix L. The repeat loop in the detailed description is executed n-1 times, where n denotes the number of attributes. At each iteration, selection of the next edge takes time O(n). Also, whether a cycle exists or not can be checked in time O(n) by scanning the vector B. Thus the algorithm takes time O(n²), which is less than that of [Nava 84], namely O(n² log n).
3.5 Application and Extension
This algorithm can be used effectively for vertical partitioning because it overcomes the shortcomings of binary partitioning and it does not need any complementary procedures such as the SHIFT procedure and the CLUSTER procedure that are used in [Nava 84]. Furthermore, the algorithm involves no arbitrary empirical objective functions to evaluate candidate partitions such as those used in [Nava 84]. Also, this algorithm can be used in place of the first phase without the use of any cost factors as in [Nava 84]. The second phase, involving the cost-optimized vertical partitioning approach, can still be applied to refine the results of the first phase.
Another important application of this algorithm is the mixed partitioning methodology that will be proposed in
Chapter 5. The mixed partitioning methodology will first generate a grid for a relation vertically and horizontally, and then merge cells as much as possible by using a cost function for determining a fragment.
This algorithm can be further enhanced to address the problem of primary/secondary memory partitioning, or in the context of any memory hierarchy. By combining it with the MULTIALLOCATE algorithm in [Nava 84], this algorithm can be used to achieve the allocation of vertical fragments over a network.
Transactions  Attributes             Predicates  Number of accesses per time period
T1            a1,a5,a7               p1,p7       25
T2            a2,a3,a8,a9            p2,p7       50
T3            a4,a6,a10              p3,p7       25
T4            a2,a7,a8               p4,p8       35
T5            a1,a2,a3,a5,a7,a8,a9   p5,p8       25
T6            a1,a5                  p6,p8       25
T7            a3,a9                  p5,p8       25
T8            a3,a4,a6,a9,a10        p6,p8       15
Figure 3-1 Transaction specifications
Attribute usage matrix

Transactions  Attributes                     Type  Number of accesses
              1  2  3  4  5  6  7  8  9  10       per time period
T1            1  0  0  0  1  0  1  0  0  0   R    Acc 1 = 25
T2            0  1  1  0  0  0  0  1  1  0   R    Acc 2 = 50
T3            0  0  0  1  0  1  0  0  0  1   R    Acc 3 = 25
T4            0  1  0  0  0  0  1  1  0  0   R    Acc 4 = 35
T5            1  1  1  0  1  0  1  1  1  0   U    Acc 5 = 25
T6            1  0  0  0  1  0  0  0  0  0   U    Acc 6 = 25
T7            0  0  1  0  0  0  0  0  1  0   U    Acc 7 = 25
T8            0  0  1  1  0  1  0  0  1  1   U    Acc 8 = 15
Figure 3-2 Attribute usage matrix
Attributes 1 2 3 4 5 6 7 8 9 10
1 75 25 25 0 75 0 50 25 25 0
2 25 110 75 0 25 0 60 110 75 0
3 25 75 115 15 25 15 25 75 115 15
4 0 0 15 40 0 40 0 0 15 40
5 75 25 25 0 75 0 50 25 25 0
6 0 0 15 40 0 40 0 0 15 40
7 50 60 25 0 50 0 85 60 25 0
8 25 110 75 0 25 0 60 110 75 0
9 25 75 115 15 25 15 25 75 115 15
10 0 0 15 40 0 40 0 0 15 40
Figure 3-3 Attribute affinity matrix
[Figure 3-4 Affinity graph after excluding zero-valued edges]
[Figure 3-5 A cycle and its extension]
[Figure 3-6 Producing a partition]
[Figure 3-7 Using affinity cycle as a means for reasonable partitioning]
[Figure 3-8 Result of the first example, starting at node 9]
[Figure 3-9 Result of the first example, starting at node 1]
[Figure 3-10 Result of the second example]
CHAPTER 4
A HORIZONTAL PARTITIONING ALGORITHM
4.1 Overview
Horizontal partitioning is the process that divides a global relation into subsets of tuples, called horizontal fragments [Ceri 82, Ceri 83a, Ceri 84]. Ceri, Negri, and Pelagatti [Ceri 82] analyze the horizontal partitioning problem, dealing with the specification of partitioning predicates. Ceri, Navathe, and Wiederhold [Ceri 83b] propose an optimization model for designing distributed database schemas with all meaningful candidate horizontal partitionings.
In the present research, we are focusing our attention on identifying all the candidate horizontal partitions. For
this we propose a horizontal partitioning methodology which uses the MAKEPARTITION algorithm presented in Chapter 3. In
order to use the MAKEPARTITION procedure of our vertical partitioning algorithm we consider only those transactions whose processing frequency is large. These transactions access tuples of the relations based on some predicates. These predicates are called simple predicates [Ceri 82]. The syntax for a simple predicate is as follows:
simple_predicate ::= attribute_name comparison_op attribute_name | attribute_name comparison_op value
attribute_name ::= string
value ::= element of the domain of the attribute
A transaction gives rise to a set of simple predicates.
The WHERE clause of a database operation (say SELECT) may consist of a disjunction, conjunction, or negation of a set of simple predicates. Simple predicates are easier to handle and understand. Moreover, a simple predicate splits a relation into two horizontal fragments: one horizontal fragment consists of those tuples that satisfy the simple predicate, and the other of those tuples that do not. The correctness of fragmentation requires that each tuple of the global relation be selected in one and only one fragment.
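This correctness requirement can be demonstrated for a single simple predicate with a few lines of Python (the tuple layout and names are invented for illustration): a predicate and its negation always yield a complete, disjoint pair of horizontal fragments.

```python
def split_horizontally(tuples, predicate):
    """Split a relation into the tuples satisfying the predicate and the rest."""
    frag_yes = [t for t in tuples if predicate(t)]
    frag_no = [t for t in tuples if not predicate(t)]
    # correctness: every tuple lands in exactly one fragment
    assert len(frag_yes) + len(frag_no) == len(tuples)
    return frag_yes, frag_no

# hypothetical relation with attributes D# and SAL
relation = [{"D#": 5, "SAL": 50}, {"D#": 12, "SAL": 30}, {"D#": 25, "SAL": 45}]
low_dept, rest = split_horizontally(relation, lambda t: t["D#"] < 10)
```

Because the two fragments are defined by a predicate and its negation, completeness and disjointness hold for any input relation.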
Another consideration in horizontal partitioning is derived partitioning as stated in [Ceri 84]. Thus the predicates which give rise to derived partitioning (we call them derived predicates) should be considered in the same way as simple predicates. We limit the scope of this dissertation by assuming that all simple and derived predicates are previously determined.
As explained earlier, the focus of this research is on single relation fragmentation. Hence join predicates of the
form R1.A = R2.B, which deal with a pair of relations, are not considered for horizontal partitioning. They are very much a
part of the allocation phase and will be considered for minimizing the effort and cost of joins. In contrast, horizontal partitioning approaches used in systems like Bubba [Cope 88] or Gamma [DeW 86] attempt to achieve parallelism of join queries, etc. In our decomposition of the overall
distribution problem, join predicates are in fact considered, but at a later stage.
4.2 Predicate Usage Matrix and Predicate Affinity Matrix
We use a simple example to explain our horizontal partitioning methodology below. The inputs are a set of
transactions and a corresponding set of predicates as follows (assume D# and SAL are attributes of an example relation):
T1 : D#<10 (p1), SAL>40K (p7)
T2 : D#<20 (p2), SAL>40K (p7)
T3 : D#>20 (p3), SAL>40K (p7)
T4 : 30<D#<40 (p4), SAL<40K (p8)
T5 : D#<15 (p5), SAL<40K (p8)
T6 : D#>40 (p6), SAL<40K (p8)
T7 : D#<15 (p5), SAL<40K (p8)
T8 : D#>40 (p6), SAL<40K (p8)
Note that the above set of predicates does not span all the tuples of the relation; tuples with SAL=40K will not be accessed by any of the above transactions.
As previous work [Ceri 83b, Nava 84] pointed out, it is not necessary to collect information on 100% of the expected transactions (that would of course be impossible). Since the 80-20 rule applies to most practical situations, it is adequate to supply information regarding the 20% of the heavily used transactions which account for about 80% of the activity against the database.
The algorithm that we propose starts from the predicate usage matrix. The predicate usage matrix represents the use of predicates in important transactions. The predicate usage matrix for the above example (8 predicates and 8 transactions) is shown in Figure 4-1. Each row refers to one transaction: a "1" entry in a column indicates that the transaction uses the corresponding predicate. Whether the transaction retrieves or updates the relation can also be captured by another column vector with R and U entries for retrieval and update. Predicate affinity is generated in a similar manner as attribute affinity. Figure 4-2 shows a predicate affinity matrix generated from the predicate usage matrix in Figure 4-1. The numerical value of the (i,j) element in this matrix gives the combined frequency of all transactions accessing both predicates i and j and is obtained the same way as in vertical partitioning in Chapter 3. The value "==>" of the (i,j) element indicates that predicate i implies predicate j, and the value "*" represents the close usage of predicates.
Two predicates i and j are "close" when the following
conditions are satisfied: (1) i and j must be defined on the
same attribute, (2) i and j must be used jointly with some common predicate c, (3) c must be defined on an attribute other than the attribute used in i and j. This is reasonable because predicates i, j, and c are different from one another and thus two fragments generated by predicates i,c and
predicates j,c are considered "closely related" since they both involve predicate c. In the above example, p1 and p2 are "close" because of their usage with the common predicate p7 in transactions T1 and T2. These two relationships are introduced to represent logical connectivity between
predicates. The attractive features of this approach are as follows:
(1) Fragments are based on actual predicates; by applying
implication between predicates, the number of
fragments is reduced,
(2) We can make use of the predeveloped algorithm in
Chapter 3,
(3) By using clustering of predicates suggested above, a
relatively small number of horizontal fragments are
generated,
(4) A 0-1 integer formulation is not needed.
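The "close" relationship defined above can be computed directly from the predicate usage data. A small Python sketch, with the attribute of each predicate and the transaction usage transcribed from the running example (the function name is our own):

```python
attr_of = {"p1": "D#", "p2": "D#", "p3": "D#", "p4": "D#",
           "p5": "D#", "p6": "D#", "p7": "SAL", "p8": "SAL"}
uses = {  # transaction -> set of predicates it uses
    "T1": {"p1", "p7"}, "T2": {"p2", "p7"}, "T3": {"p3", "p7"},
    "T4": {"p4", "p8"}, "T5": {"p5", "p8"}, "T6": {"p6", "p8"},
    "T7": {"p5", "p8"}, "T8": {"p6", "p8"},
}

def are_close(i, j):
    """Predicates i and j are close when they are on the same attribute and
    each is used jointly with a common predicate c on a different attribute."""
    if i == j or attr_of[i] != attr_of[j]:
        return False
    with_i = set().union(*(u for u in uses.values() if i in u)) - {i}
    with_j = set().union(*(u for u in uses.values() if j in u)) - {j}
    return any(attr_of[c] != attr_of[i] for c in with_i & with_j)
```

`are_close("p1", "p2")` is True through their common use with p7 in T1 and T2, which is exactly the example given in the text.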
4.3 The Algorithm
First a modified version of the graphical partitioning algorithm for clustering predicates will be given. Then by
using this clustering, the entire steps for horizontal fragmentation will be described.
4.3.1 Clustering of Predicates
We will apply the graphical algorithm in Chapter 3 to horizontal partitioning. However, we cannot use this algorithm directly because the predicate affinity matrix may be too sparse. Thus we introduce two more relationships which represent logical connectivity between predicates. They are "==>" for implication and "*" for close usage, as explained in Chapter 4.2.
To obtain the modified version of the graphical
partitioning algorithm, the following heuristic rules are applied.
(1) A numerical value (except zero) has higher priority than the values "==>", "<==" and "*" when selecting a next edge. This is because we place more importance on affinity values which are obtained from transaction usage rather than on logical connectivity among the predicates.
(2) In the comparisons involved in checking for the possibility of a cycle or extension of a cycle, we ignore cycle edges with affinity values "==>", "<==" and "*". For example, in Figure 4-3, in comparing edge (p4,p8) with edges of the cycle (p8,p5,p6) we ignore edge (p5,p6), which has affinity "*". This is because the affinity values "==>", "<==" and "*" represent implicit relationships and therefore can be collapsed.
(3) "==>" and "<==" are considered to have higher affinity value than "*", since the former indicates direct implication, whereas the latter represents only logical connectivity between the two predicates through their usage with a common predicate.
(4) If there are two "==>"s in a column corresponding to predicate pk, one implied by predicate pi and another implied by predicate pj, then the entry (i,k) has higher priority than the entry (j,k), either if the entry (i,j) is equal to "<==", or if the entry (j,i) is equal to "==>". In other words,
pi ==> pj ==> pk .... (a)
pj ==> pi ==> pk .... (b)
In the above implications, if (a) holds, then j has higher priority than i, and if (b) holds, then i has higher priority than j.
These rules can be easily incorporated into the graphical algorithm in Chapter 3. For example, if we assume that every affinity value in the predicate affinity matrix is greater than 2, then we can assign the value 1 for "*" and 2 for "==>" to represent rules 1 and 3. The detailed description of the modified algorithm now follows. We use the same data structures and variables as in vertical partitioning. The meanings of the variables are available from Chapter 3.
procedure Make_Hori_Partition(L[1..n, 1..n]): set of edges
  { assign the value 1 for "*" and 2 for "==>" respectively
    in the predicate affinity matrix }
  { initialize flags and variables }
  B[1] ← 1
  f1 ← 1, f2 ← 0   { f1 & f2 each refer to an end of the spanning tree }
  p_cycle, c_node, f_edge, candidate_p ← false
  pmin ← ∞   { minimum edge of a cycle }
  { initialize vectors }
  for i ← 2 to n do
    strongest[i] ← 1
    maxwt[i] ← L[i,1]
    B[i] ← 0
  endfor
  repeat n-1 times
    max ← -1
    { select the next node }
    for j ← 2 to n do
      if maxwt[j] > max and
         (strongest[j] = f1 or strongest[j] = f2)
      then max ← maxwt[j]
           k ← j
    endfor
    { adjust the pointers for checking a primitive cycle }
    if strongest[k] = f1
    then if f2 = 0 then f2 ← k
         else swap f1, f2
              f2 ← k
    else f2 ← k
    { check if there is a primitive cycle }
    for j ← 2 to n do
      if B[j] = k and c_node = false
      then p_cycle ← true
           if B[j-1] > 0 then f_edge ← true
    endfor
    if p_cycle = true   { primitive cycle exists }
    then { check if it is an affinity cycle }
         if pmin ≥ former_edge_wt or f_edge = false
         then candidate_p ← true
              c_node ← true
         else f_edge ← false
    else
      insert k into B, maxwt[k] ← ∞
      if candidate_p = true
      then if maxwt[k] < pmin or cycle_cedge_wt < pmin
           then { partition exists; reinitialize variables }
                if f_edge = false
                then save this partition
                else change the cycle node
                     if former_edge_wt < pmin
                     then save this partition
                          remove k from B
           else extend the cycle
      else extend the cycle
      { pmin contains the minimum edge of a cycle;
        ignore the logical-connectivity values 1 ("*") and 2 ("==>") }
      if maxwt[k] ≠ 1 and maxwt[k] ≠ 2 and pmin > maxwt[k]
      then pmin ← maxwt[k]
    { rearrange vectors for next selection }
    for j ← 2 to n do
      if L[k,j] > maxwt[j]
      then maxwt[j] ← L[k,j]
           strongest[j] ← k
    endfor
  endrepeat
4.3.2 A Heuristic Procedure for Horizontal Partitioning
Now we describe the complete horizontal partitioning procedure by using the MakeHori Partition algorithm and the example introduced above.
Procedure HORIZPARTITIONING
Step 1. Perform clustering of predicates: This is done by using the Make_Hori_Partition algorithm presented in Chapter 4.3.1. We obtain in this step a set of subsets of predicates. In our example, we obtain three clusters of predicates, as shown in Figure 4-3: (p1,p7,p2), (p3,p4), (p5,p6,p8).
Step 2. Optimize predicates in each subset: In this step predicate inclusion and predicate implication are considered to minimize the number of predicates. In our example, the first subset (D#<10, D#<20, SAL>40K) is refined into (D#<20, SAL>40K) since D#<10 ==> D#<20. The second subset (D#>20, 30<D#<40) is refined into (D#>20) since 30<D#<40 ==> D#>20, but the last one (D#<15, D#>40, SAL<40K) has no change. Note that this optimization can be done before step 1. But in this dissertation, we perform this step here in order to allow more detailed clustering. For example, a pair of predicates such as p1 and p2, in which one (say, p1) implies the other (say, p2), can be grouped in different clusters in step 1. The three clusters of predicates produced in this step, called "cluster sets", are listed in Figure 4-3.
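The refinement in step 2 can be sketched as a one-line filter: within each cluster, drop any predicate that implies another member of the same cluster, keeping the weaker, implied predicate. The predicate labels and the implication set below follow the running example:

```python
# p1: D#<10 ==> p2: D#<20;   p4: 30<D#<40 ==> p3: D#>20
implications = {("p1", "p2"), ("p4", "p3")}

def optimize(cluster):
    """Drop every predicate that implies another predicate in the cluster."""
    return {p for p in cluster
            if not any((p, q) in implications for q in cluster if q != p)}

clusters = [{"p1", "p2", "p7"}, {"p3", "p4"}, {"p5", "p6", "p8"}]
refined = [optimize(c) for c in clusters]
```

This reproduces the refinement described above: (p1,p2,p7) becomes (p2,p7), (p3,p4) becomes (p3), and (p5,p6,p8) is unchanged.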
Step 3. Compose predicate terms: Corresponding to the cluster sets (Figure 4-3) produced in step 2, we proceed as follows. The cluster sets are first evaluated to determine "the least common attribute". In our example, since SAL does not appear in cluster set 2 (corresponding to the second cluster of predicates), it is the least common attribute. Note that D# appears in all three sets. A table called the "predicate term schematic table" is now constructed by placing in the first column the chosen attribute with its appropriate ranges to cover that attribute exhaustively. In our example, we create two entries: SAL<40K and SAL>40K
for the SAL attribute. Then, we apply the next to least common attribute and write its appropriate ranges that appear in the cluster sets against each entry for the first column. Note that these ranges may be overlapping. In our example, D# is the next attribute. Its ranges applicable to the cluster sets are: D#<15 OR D#>40 coupled with SAL<40K (from cluster set 3), and D#<20 coupled with SAL>40K (from cluster set 1). The D#>20 predicate appearing in cluster set 2 must be written twice into the table, against each entry for SAL. The resulting predicate term schematic table is shown at the top of Figure 4-4. Now we construct predicate terms from the above table as follows. Each horizontal entry in the table gives rise to one predicate term. If predicates refer to the same attribute then they are ORed; otherwise they are ANDed. The resulting predicate terms are as follows:
SAL<40K AND D#>20,
SAL<40K AND (D#<15 OR D#>40),
SAL>40K AND D#<20,
SAL>40K AND D#>20.
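The composition rule just stated — OR within an attribute, AND across attributes — can be sketched directly from the schematic table. The row encoding below is our own; the table contents are from the example:

```python
def compose_terms(table):
    """Each table row yields one predicate term: predicates on the same
    attribute are ORed, and clauses on different attributes are ANDed."""
    terms = []
    for row in table:  # row: attribute -> list of predicate strings
        parts = []
        for preds in row.values():
            clause = " OR ".join(preds)
            parts.append(f"({clause})" if len(preds) > 1 else clause)
        terms.append(" AND ".join(parts))
    return terms

schematic_table = [
    {"SAL": ["SAL<40K"], "D#": ["D#>20"]},
    {"SAL": ["SAL<40K"], "D#": ["D#<15", "D#>40"]},
    {"SAL": ["SAL>40K"], "D#": ["D#<20"]},
    {"SAL": ["SAL>40K"], "D#": ["D#>20"]},
]
```

Applied to the schematic table above, this yields exactly the four predicate terms listed in the text.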
Step 4. Perform fragmentation: We have one horizontal fragment per predicate term. Thus the number of horizontal fragments will at most be the number of predicate terms plus one, because there is one remaining fragment called the "ELSE" fragment, which is the negation of the conjunction of predicate terms.
Note that the result of step 4 may be overlapped fragments or non-overlapped fragments. Chapter 4.4 provides the ADJUST function to obtain non-overlapped horizontal fragments.
This algorithm was implemented in the C language to demonstrate its effectiveness. It should be noted that the complexity of this algorithm is dominated by step 1, and thus is O(n^2) for n predicates, as in the vertical partitioning algorithm in Chapter 3. A small value of n indicates a good understanding of the heavily used predicates by users; the smaller the n, the better the performance of horizontal partitioning.
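The dissertation's implementation was in C; as an illustration only, steps 3 and 4 can be sketched in Python. The table representation below (a list of rows, each pairing an attribute with its predicate ranges) is our own assumption, not the original data structure.

```python
# Sketch of steps 3-4: compose predicate terms from a predicate term
# schematic table. Predicates on the same attribute are ORed; the
# attribute groups of one row are ANDed together.

def compose_term(row):
    """Build one predicate term from one schematic-table row."""
    parts = []
    for _attribute, predicates in row:
        if len(predicates) == 1:
            parts.append(predicates[0])
        else:
            parts.append("(" + " OR ".join(predicates) + ")")
    return " AND ".join(parts)

# Schematic table for the running example (top of Figure 4-4).
table = [
    [("SAL", ["SAL<40K"]), ("D#", ["D#>20"])],
    [("SAL", ["SAL<40K"]), ("D#", ["D#<15", "D#>40"])],
    [("SAL", ["SAL>40K"]), ("D#", ["D#<20"])],
    [("SAL", ["SAL>40K"]), ("D#", ["D#>20"])],
]

terms = [compose_term(row) for row in table]
# Step 4: one horizontal fragment per term, plus the ELSE fragment.
fragments = terms + ["ELSE"]
```

Run on the example table, this yields the four predicate terms listed above plus the ELSE fragment, five fragments in all.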
4.4 Nonoverlapping Horizontal Partitioning
The result of predicate partitioning may give rise to overlapped horizontal fragments. If nonoverlapped fragments are needed, then we apply the ADJUST function. For example, Figure 4-4 shows the final nonoverlapped fragments obtained by applying the ADJUST function to the overlapping fragmentation generated in Chapter 4.3. There may be more than one way of creating nonoverlapped fragments from overlapped fragments. Figure 4-5 shows three possible ways of dealing with overlapped horizontal fragments. Among these, the best is selected based on the following criteria: (1) transaction processing cost, and (2) minimization of the number of fragments. It may also be desirable to choose the one that creates finer granularity for the grid cells.
A heuristic procedure for generating nonoverlapped fragments is proposed based on these criteria. Let us define the following variables. Here, by "part" we mean a potential horizontal partition (for example, there are three parts in Figure 4-5).

i = 1,...,n denote the parts.
k = 1,...,m denote the transactions accessing the parts.
C_ik = cost of accessing part i by transaction k.
F_ik = frequency with which transaction k accesses part i.
U_(ij)k = cost of unioning two parts i and j into part ij, required by transaction k.
S_(ij)k = cost of selecting the i-th part from the merged part ij by transaction k.
The cost factors would depend on implementation details such as access methods, storage structures, use of special algorithms for duplicate elimination, etc. This level of detail is out of the present scope of our dissertation.
Using the above parameters and variables, we develop a heuristic adjusting procedure. A merge occurs if the overall cost is reduced because some transactions access the two parts together.
Procedure ADJUST

Step 1. For each pair of contiguous parts i and j, find the cost of accessing these parts with and without merging, respectively.

A. Cost without merging:

    sum_{k=1..m} C_ik F_ik + sum_{k=1..m} C_jk F_jk + sum_{k=1..m} U_(ij)k

B. Cost with merging:

    sum_{k=1..m} C_(ij)k F_(ij)k + sum_{k=1..m} S_(ij)k
Step 2. Find the two parts i and j that produce the maximum saving if they are merged, and then merge them.

Step 3. After merging two parts, only one new part is generated. Let the set of parts after step (2) be S.

Step 4. Repeat steps (1)-(3) for the parts in set S till no two contiguous parts can be merged.
Note that if no merging occurs, then each part will be a horizontal fragment.
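The greedy structure of ADJUST can be sketched as follows. This is an illustrative Python sketch, not the dissertation's implementation: the saving function (cost without merging minus cost with merging, from step 1) is supplied by the caller, and the toy one used at the end is an invented example, not the cost model above.

```python
# Greedy sketch of Procedure ADJUST: repeatedly merge the pair of
# contiguous parts with the largest positive saving, stopping when no
# merge reduces the overall cost.

def adjust(parts, saving_fn):
    """parts is an ordered list, so only contiguous pairs may merge."""
    parts = list(parts)
    while len(parts) > 1:
        # Step 1: evaluate every contiguous pair.
        candidates = [(saving_fn(parts[x], parts[x + 1]), x)
                      for x in range(len(parts) - 1)]
        best_saving, x = max(candidates)
        if best_saving <= 0:      # step 4: stop when no merge helps
            break
        # Steps 2-3: merge the best pair into a single new part.
        parts[x:x + 2] = [parts[x] + parts[x + 1]]
    return parts

# Toy illustration: only parts "A" and "B" benefit from merging.
example = adjust(["A", "B", "C"],
                 lambda i, j: 1 if (i, j) == ("A", "B") else -1)
```

With this toy saving function, "A" and "B" are merged and "C" is left as its own horizontal fragment, matching the note that unmerged parts become fragments.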
Transaction  p1 p2 p3 p4 p5 p6 p7 p8   Type   Accesses per time period
T1            1  0  0  0  0  0  1  0    R      Acc1 = 25
T2            0  1  0  0  0  0  1  0    R      Acc2 = 50
T3            0  0  1  0  0  0  1  0    R      Acc3 = 25
T4            0  0  0  1  0  0  0  1    R      Acc4 = 35
T5            0  0  0  0  1  0  0  1    U      Acc5 = 25
T6            0  0  0  0  0  1  0  1    U      Acc6 = 25
T7            0  0  0  0  1  0  0  1    U      Acc7 = 25
T8            0  0  0  0  0  1  0  1    U      Acc8 = 15

Figure 4-1 Predicate usage matrix
Figure 4-2 Predicate affinity matrix
(The matrix is symmetric over predicates 1-8; only the last row, for p8, survives extraction: its affinities with p1-p7 are 0, 0, 0, 35, 50, 40, 0.)
Cluster sets produced:
  Cluster 1: {D#<20, SAL>40K}
  Cluster 2: {D#>20}
  Cluster 3: {D#<15, D#>40, SAL<40K}

Figure 4-3 Clustering of predicates
PREDICATE TERM SCHEMATIC TABLE

SAL < 40K (p8)  |  D# > 20 (p3,p4)
                |  D# < 15 (p5) OR D# > 40 (p6)
SAL > 40K (p7)  |  D# < 20 (p1,p2)
                |  D# > 20 (p3,p4)
ELSE

    | ADJUST
    v

NONOVERLAPPED FRAGMENTS

H1: SAL < 40K AND D# > 20   (p3,p4,p6,p8)
H2: SAL < 40K AND D# < 15   (p5,p8)
H3: SAL > 40K AND D# < 20   (p1,p2,p7)
H4: SAL > 40K AND D# > 20   (p3,p4,p7)
H5: ELSE

Figure 4-4 Nonoverlapping horizontal fragmentation
Figure 4-5 Three possible ways of dealing with overlapped horizontal fragments (P denotes the overlapping part of fragments F1 and F2):
(a) F2 includes P
(b) P is separated
(c) F1 includes P
CHAPTER 5
A MIXED PARTITIONING METHODOLOGY
Partitioning of a global relation in a distributed
database can be performed in two different ways: vertical partitioning and horizontal partitioning. The current
literature has addressed vertical and horizontal partitioning independently.
However, since database users usually access data subsets that are both vertical and horizontal partitions of global relations, there is a definite need for research on mixed partitioning. Presently there is little, if any, work on the
investigation of the effects of the different sequences in which the vertical partitioning, horizontal partitioning, and mixed partitioning problems can be solved.
5.1 Proposed Methodology
As shown in Figure 1-2, the allocation problem is considered a consequence of partitioning in our research, and thus it is studied after the partitioning problem. The mixed fragmentation approach consists of deriving a distribution design by taking into consideration the cost of
processing the distributed transactions. We shall use the vertical partitioning and horizontal partitioning algorithms
presented in Chapters 3 and 4 to generate a grid consisting of grid cells. These grid cells are candidate fragments, as they are based on the affinity between the transactions and the data stored in the database. These grid cells may be small, and hence a transaction may need to access a set of grid cells in order to process the data. Hence, a set of grid cells in
its most refined form may not give an optimal transaction processing performance. That is, the grid cells may need to be joined for the transaction to process the data. In order to minimize the number of fragments that need to be accessed by a transaction, the grid cells are considered for merging. The two major phases in the proposed mixed partitioning methodology are as follows.
GRID CREATION: This phase is composed of two stages: the vertical and the horizontal partitioning stages for grid creation. The output of GRID CREATION is a grid corresponding to a global relation. The grid suggests all possible ways in which the global relation in a distributed database may be partitioned. Each element of the grid is called a grid cell.
GRID OPTIMIZATION: After the grid is created, GRID OPTIMIZATION merges grid cells as much as possible according to the merging procedures. The merging of the grid cells is an anti-fragmentation procedure. Having created the cells in a top-down fashion as the ultimate smallest fragments of a relation, we now consider whether we should combine them in a bottom-up fashion. Merging is considered desirable if it reduces the overall transaction processing cost. In this dissertation two types of merging sequences are considered: vertical merging followed by horizontal merging, and horizontal merging followed by vertical merging. We calculate the cost of transaction processing at the end of each sequence and then select the better of the two end results. We call the result of this phase mixed fragments.
Based on Figure 1-2, the proposed methodology for mixed partitioning using a grid can be described in the following steps.
Procedure MIXEDPARTITIONING
Step 1. Specification of inputs: In this step, the following
inputs are specified.
(1) Schema information: relations, attributes,
cardinalities, attribute sizes, etc.
(2) Transaction information: name, frequency, attribute
usage, etc. The attribute usage matrix is a matrix
containing transactions as rows and attributes as columns. Element (i,j) = 1 if transaction i uses
attribute j, else it is 0.
(3) Distribution constraints: any predetermined
partitions or fixed allocation of data.
(4) System information: number of sites, transmission
costs, etc. This information is used particularly to
solve the allocation problem.
Step 2. Vertical partitioning for grid: In this step all
candidate vertical fragments are determined. We use a graphical algorithm proposed in Chapter 3 for
generating all fragments in one iteration.
Step 3. Horizontal partitioning for grid: In this step, all
candidate horizontal fragments are determined. An algorithm for this step is described in Chapter 4.
Note that the sequence of steps 2 and 3 can be
changed.
Step 4. Merging fragments in different sequences: In this step, cells are merged as much as possible based on cost functions. The merging is carried out in two possible sequences: VH (vertical merging followed by horizontal merging) or HV (horizontal merging followed by vertical merging).
Step 5. Selection of the better sequence: Between the two
different sequences, the better one is selected based on the estimation of the transaction processing cost.
This selected sequence determines the final fragments
of the global relation.
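The five steps above can be sketched as a pipeline. This is a hedged Python sketch only; every helper passed in (vertical_partition, horizontal_partition, merge_vh, merge_hv, cost) is a hypothetical stand-in for the algorithms of Chapters 3-5, not an interface the dissertation defines.

```python
# High-level sketch of the MIXEDPARTITIONING procedure: build the grid
# (steps 2-3), merge in both sequences (step 4), keep the cheaper
# result (step 5).

def mixed_partitioning(schema, transactions,
                       vertical_partition, horizontal_partition,
                       merge_vh, merge_hv, cost):
    # Steps 2-3: build the grid (the order of the two steps is immaterial).
    grid = (vertical_partition(schema, transactions),
            horizontal_partition(schema, transactions))
    # Step 4: merge grid cells in both sequences.
    results = [merge_vh(grid, transactions), merge_hv(grid, transactions)]
    # Step 5: keep the sequence with the lower estimated processing cost.
    return min(results, key=lambda frags: cost(frags, transactions))

# Toy run with stub helpers: the HV sequence is cheaper here.
chosen = mixed_partitioning(
    schema=None, transactions=None,
    vertical_partition=lambda s, t: "V-fragments",
    horizontal_partition=lambda s, t: "H-fragments",
    merge_vh=lambda g, t: "VH-result",
    merge_hv=lambda g, t: "HV-result",
    cost=lambda frags, t: {"VH-result": 10, "HV-result": 5}[frags],
)
```

The sketch mirrors the point that steps 2 and 3 commute, while the choice between VH and HV is made only at the end, by comparing estimated transaction processing costs.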
5.2 Grid Creation
In Chapters 3 and 4, we have seen the vertical and horizontal partitioning schemes based on the attribute and predicate affinities. We can now construct a grid consisting of cells by simultaneously applying the vertical and horizontal partitioning algorithms on the relation. Each of these cells belongs to one vertical fragment and one horizontal fragment. Note that the cells generated are the same whichever of the two algorithms is applied first. We shall use these cells for the merging procedure, or grid optimization.
5.3 Grid Optimization
Based on the cost functions, we want to merge cells as much as possible so that the total cost of transaction processing (as seen from the standpoint of the one relation in question) is minimized. Two kinds of merging are defined: horizontal merging and vertical merging. Horizontal merging deals with merging of the cells in the same row of the grid. In horizontal merging, the total number of ways of merging a fragment is sum_i C(n, i) = 2^n - 1, where C represents combination and n represents the number of fragments, because the sequence of attributes has no meaning in a relation. In our approach, however, we can reduce the number of combinations considered in horizontal merging by using the ordered
sequence of fragments generated in vertical partitioning. Thus, given n candidate horizontal fragments, a total of (n-1) + (n-2) + ... + 1 = n(n-1)/2 merging possibilities are generated. This is because, in the linearly connected spanning tree produced by vertical partitioning, a cut edge between two vertical fragments is a bridge that has the maximum affinity value among all connectable edges; thus a fragment is more closely related to contiguous fragments than to noncontiguous fragments. For example, in Figure 3-8, if merging is needed, fragment (1,5,7) will be merged with fragment (2,3,8,9), because fragment (1,5,7) is more related to fragment (2,3,8,9) in terms of transaction access than to fragment (4,6,10). This allows us to develop a heuristic procedure for horizontal merging. In vertical merging, cells in the same vertical column of the grid may be merged to produce larger fragments. Since the same graphical partitioning approach is used for horizontal partitioning, the total number of possible ways of vertical merging can be reduced in the same way as in horizontal merging.
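The two counts in this argument are easy to verify numerically; the snippet below is a quick illustrative check, not part of the dissertation's algorithms.

```python
# Unrestricted merging of n fragments admits sum_i C(n, i) = 2^n - 1
# nonempty groupings, while restricting candidate merges to contiguous
# pairs in the ordered sequence leaves (n-1) + (n-2) + ... + 1 = n(n-1)/2.

from math import comb

def unrestricted_groupings(n):
    # Number of nonempty subsets of n fragments.
    return sum(comb(n, i) for i in range(1, n + 1))

def contiguous_merge_candidates(n):
    # (n-1) + (n-2) + ... + 1 pairs in a linear order.
    return sum(range(1, n))
```

For n = 10 fragments this is 1023 unrestricted groupings versus only 45 contiguous-pair candidates, which is why the ordered sequence makes the greedy heuristic tractable.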
In this dissertation, we consider both merging sequences: the VH sequence (performing vertical merging followed by horizontal merging) and the HV sequence (performing horizontal merging followed by vertical merging).
5.3.1 A Heuristic Procedure for Horizontal Merging
It should be noted that during vertical partitioning in
Chapter 3 we did not take transaction processing cost into consideration, but only attribute affinity values. To that extent this is an "intuitive" affinity based partitioning.
Horizontal merging results in bringing together tuples of the same id or key from two fragments. Thus, for a transaction requiring two or more fragments, some saving accrues. This is a form of join in which a one-to-one "matching" operation among horizontal fragments occurs based on key matching. In horizontal merging, as noted before, a fragment is merged with a contiguous fragment in the spanning tree first and then extended recursively to a larger fragment if possible.
We give below a heuristic procedure for horizontal
merging. This procedure is iterative and at each iteration all the contiguous cells are considered for the possibility of merging. Those two cells which produce the maximum saving by merging are merged. This procedure stops when no two contiguous cells can be merged. We assume that the costs and
frequencies considered in the formulation can be estimated. For cost estimation we use information inherent to the data and its use which is system independent, such as the length of attributes accessed and the number of tuples accessed. This approach is similar to the "bottom-up" approach used by Hammer and Niamir [Hamm 79]. They started with single attributes as
the extreme vertical fragments and combined them into pairs,
triplets, etc., successively. The parameters and notations for the horizontal merging are given as follows.
i = 1,...,n denote the horizontal cells.
k = 1,...,m denote the transactions accessing the horizontal cells.
C_ik = cost of accessing horizontal cell i by transaction k.
F_ik = frequency with which transaction k accesses horizontal cell i.
J_(ij)k = cost of matching horizontal cells i and j into cell ij, required by transaction k. This matching cost takes into consideration the frequency of transaction accesses to cells i and j, and the attributes accessed from the i and j "parts" of the cell ij. This cost is dependent on the transaction k.
P_(ij)k = cost of projecting the i-th cell from the merged cell ij by transaction k.

Then, the heuristic procedure for horizontal merging is formulated by using a greedy method as follows.
Then, the heuristic procedure for horizontal merging is formulated by using a greedy method as follows.
Procedure HORIZONTALMERGING

Step 1. For each pair of horizontally contiguous cells i and j, find the cost of accessing these cells with and without merging, respectively.
A. Cost without merging:

    sum_{k=1..m} C_ik F_ik + sum_{k=1..m} C_jk F_jk + sum_{k=1..m} J_(ij)k

B. Cost with merging:

    sum_{k=1..m} C_(ij)k F_(ij)k + sum_{k=1..m} P_(ij)k
Step 2. Find the two cells i and j that produce the maximum saving if they are merged, and then merge them.

Step 3. After merging two cells horizontally, only one new cell is generated. Let the set of resultant merged cells after step (2) be I = {1,...,p}.

Step 4. Repeat steps (1)-(3) for the cells in set I till no two contiguous cells can be merged.
Note that this merging procedure is dominated by step 4. The other steps take O(n) time, where n represents the number of fragments. Thus the complexity of this merging procedure is O(n^2).
5.3.2 A Heuristic Procedure for Vertical Merging
Vertical merging is very similar to horizontal merging. Note that we performed graphical top-down decomposition based on the notion of affinities. But affinity is a heuristic; cost-based optimization gives further improvement. We achieve this by merging. A merge occurs if the overall cost is reduced because some transactions access the two cells together. Let us define the following set of variables:
i = 1,...,n denote the vertical cells.
k = 1,...,m denote the transactions accessing the vertical cells.
C_ik = cost of accessing vertical cell i by transaction k.
F_ik = frequency with which transaction k accesses vertical cell i.
U_(ij)k = cost of unioning two vertical cells i and j into cell ij, required by transaction k.
S_(ij)k = cost of selecting the i-th cell from the merged cell ij by transaction k.
Using the above parameters and variables, we develop a heuristic procedure for vertical merging in a similar manner as in the horizontal merging.
Procedure VERTICALMERGING

Step 1. For each pair of vertically contiguous cells i and j, find the cost of accessing these cells with and without merging, respectively.
A. Cost without merging:

    sum_{k=1..m} C_ik F_ik + sum_{k=1..m} C_jk F_jk + sum_{k=1..m} U_(ij)k

B. Cost with merging:

    sum_{k=1..m} C_(ij)k F_(ij)k + sum_{k=1..m} S_(ij)k
Step 2. Find the two cells i and j that produce the maximum saving if they are merged, and then merge them.

Step 3. After merging two cells vertically, only one new cell is generated. Let the set of resultant merged cells after step (2) be I = {1,...,p}.

Step 4. Repeat steps (1)-(3) for the cells in set I till no two contiguous cells can be merged.
Note that this merging procedure is dominated by step 4, as in horizontal merging. The other steps take O(n) time, where n represents the number of fragments. Thus the complexity of this merging procedure is also O(n^2).
5.3.3 An Example
In our example we use an access factor a that reflects the cost of processing a merged fragment relative to the total cost of accessing its constituent cells if they are not merged. Thus a is defined as the ratio of the cost of accessing a merged fragment to the cost of accessing its constituent grid cells. A detailed estimate of a would depend upon implementation factors, particularly access structures such as indices and the access methods used.
In Figure 5-1 we show the mapping of the transactions to the grid cells. Note that some of the grid cells are not accessed by the set of most important transactions, but may be accessed by other transactions. The mapping is done by matching the attributes and the predicates of the transactions with
the attributes and the predicates forming the grid cells. For example, transaction T4 accesses attributes a2, a7 and a8, and is based on predicates p4 and p8, whereas the grid cell G1 is formed of attributes a1, a5 and a7, and predicates p3, p4, p6 and p8. Hence transaction T4 accesses the cell G1.
Figure 5-2 shows the costs of accessing each of the horizontal fragments H1, H2, H3, H4 and H5. They are 100, 150, 200, 75 and 125, respectively. The lengths of the attributes of the three vertical fragments are 14, 20 and 16, respectively. Note that we assume a linear access cost model in which the cost of accessing a grid cell is proportional to the length of the attributes in that cell. Hence the cost of accessing the single grid cell G5 is C5 = 150 * 20/(14+20+16) = 60.
Now we give an example that shows how two contiguous cells are merged. In the following illustration, we treat the join cost J as a variable to show how merging depends on it. Let us consider the two contiguous cells G1 and G2 in Figure 5-2. The parameter a is assumed to be 1.2 and the projection cost is assumed to be 0.1*J.
Cost without merging:

    sum_{k=1..8} F_1k C_1k = F_14 C_14 + F_16 C_16 = 35 * 28 + 25 * 28 = 1680

    sum_{k=1..8} F_2k C_2k = F_24 C_24 + F_28 C_28 = 35 * 40 + 15 * 40 = 2000

    sum_{k=1..8} F_(12)k J_12 = F_(12)4 J_12 + F_(12)6 J_12 + F_(12)8 J_12
                              = 35*J + 0 + 0
                              = 35*J

Note that J_12 is the join cost for the grid cells G1 and G2; only transaction T4 requires both cells together, with frequency 35.

Cost with merging:

    sum_{k=1..8} F_(12)k C_(12)k = F_(12)4 C_(12)4 + F_(12)6 C_(12)6 + F_(12)8 C_(12)8
                                 = 35 * 81.6 + 25 * 81.6 + 15 * 81.6
                                 = 6120

    sum_{k=1..8} F_1k P_1(12) + sum_{k=1..8} F_2k P_2(12) = F_16 P_1(12) + F_28 P_2(12)
                                 = 25 * 0.1*J + 15 * 0.1*J = 4*J

    where C_12 = a * (C_1 + C_2) = 1.2 * (28 + 40) = 81.6

Thus we conclude that the two cells are merged if 1680 + 2000 + 35*J > 6120 + 4*J, that is, if 31*J > 2440, or J > 78.7. Figure 5-3 shows one of the possible results of grid optimization. The cost factor J generally depends upon a variety of factors, including join selectivity, implementation details, and join algorithms.
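The arithmetic of this merge decision can be reproduced directly. The following is an illustrative Python recomputation of the example's numbers (a = 1.2, projection cost 0.1*J), not code from the dissertation.

```python
# Recompute the G1/G2 merge decision of Section 5.3.3 and find the
# join-cost threshold above which merging pays off.

lengths = {"V1": 14, "V2": 20, "V3": 16}   # vertical fragment lengths
total_len = sum(lengths.values())           # 50
H1_cost = 100                               # cost of horizontal fragment H1
C1 = H1_cost * lengths["V1"] / total_len    # cost of cell G1 = 28
C2 = H1_cost * lengths["V2"] / total_len    # cost of cell G2 = 40

a = 1.2                                     # access factor
C12 = a * (C1 + C2)                         # cost of merged cell = 81.6

# Frequencies: T4 (35) uses both cells; T6 (25) uses only G1;
# T8 (15) uses only G2.
cost_without = 35 * C1 + 25 * C1 + 35 * C2 + 15 * C2   # 1680 + 2000
cost_with = (35 + 25 + 15) * C12                        # 6120

# Without merging the join term is 35*J; with merging the projection
# term is (25 + 15) * 0.1*J = 4*J. Merge iff
# cost_without + 35*J > cost_with + 4*J, i.e. 31*J > 2440.
threshold = (cost_with - cost_without) / (35 - 4)       # J must exceed this
```

The computed threshold is 2440/31, about 78.7, matching the conclusion in the text.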
5.3.4 Result Comparison by Total Cost Computation
The above two procedures for horizontal and vertical merging are independent and can be applied in any sequence. The end result of both can be subjected to a cost computation
against the given set of transactions to pick the better of the two sequences (i.e. HV and VH).
The total cost for the process of merging cells is thus calculated in the following way:

    CT = sum_{i=1..n} sum_{k=1..m} COST_ik

where
    n = number of final fragments,
    m = number of transactions accessing the final fragments,
    CT = total cost,
    COST_ik = cost of accessing a resulting fragment i (which is given at the end of the merging procedure) by transaction k.
Note that at the end of this step we have achieved a fragment configuration which could not be obtained by either an HV or VH sequence of partitioning independently. The merging steps are necessary to achieve the meaningful combinations of grid cells that minimize the total processing cost. The resulting set of cells is then evaluated together with allocation (with and without redundancy) later.
5.4 Effect of Changes in Data and Transactions
In the case of new transactions we have to perform all five steps of the MIXEDPARTITIONING procedure given in Chapter 5.1. In the case of the same schema and the same transactions but new data tuples, only the cost parameters of the grid optimization part are affected. Hence only
step 4 of the MIXEDPARTITIONING procedure has to be redone
for merging the fragments. This is an advantage of this methodology. In the case of adding new relations to the schema, if existing transactions are not modified to access these new relations, an analysis based on the old set of transactions will not reveal any new result. However, if we incorporate new transactions that use the new relations, we have to consider all the five steps of the MIXEDPARTITIONING procedure.
Figure 5-1 Mapping from transactions to grid cells

(The grid columns are the vertical fragments V1 (a1,a5,a7), V2 (a2,a3,a8,a9) and V3 (a4,a6,a10); the rows are the horizontal fragments H1 (p3,p4,p6,p8), H2 (p5,p8), H3 (p1,p2,p7), H4 (p3,p4,p7) and H5. Transactions are mapped to cells by matching their attributes and predicates with those forming the cells; in row H1, for instance, the cells are accessed by {T4,T6}, {T4,T8} and {T8} respectively, in the next rows by {T5} and {T5,T7}, by {T1} and {T2}, and T3 accesses a cell in the V3 column.)
C_i = (cost of accessing the corresponding horizontal fragment)
      * L(attributes of the corresponding vertical fragment) / L(all attributes)

e.g. C5 = 150 * 20/(14+20+16) = 60

    Cost | Attribute length:  14    20    16
    100  |                    G1    G2    G3
    150  |                    G4    G5    G6
    200  |                    G7    G8    G9
    75   |                    G10   G11   G12
    125  |                    G13   G14   G15

Figure 5-2 Cost model for merging grid cells
Figure 5-3 A possible result of grid optimization
(The grid cells are combined into merged fragments F1-F6, with the remaining cells X1-X7 left unmerged.)
CHAPTER 6
AN ALLOCATION ALGORITHM FOR THE MIXED FRAGMENTS
In a distributed database system, a user accesses the system through a local processor. The system may make local and remote accesses functionally transparent, so data allocation is one of the most important distributed database issues [Cer 83b, Corn 88]. The problem of data allocation in distributed systems has been extensively studied together with the file assignment problem. Even though closely related to the file assignment problem, it differs greatly from it because, in distributed database systems, the way the data are accessed is far more complex [Aper 88].
Data allocation is the process of mapping each logical fragment to one or more sites. Previous research in this area has been performed in two ways: data allocation by itself, and extension of the pure data allocation problem by including the network topology and communication channels among the decision variables. This dissertation is concerned with pure data allocation in which the unit of allocation is the mixed fragment produced by our mixed partitioning procedure in Chapter 5, which considers horizontal and vertical partitioning simultaneously. The objective of our
approach is to allocate the mixed fragments nonredundantly with minimum distributed transaction processing cost. To achieve this we present some heuristics that combine fragment allocation and transaction processing strategy. Based on the heuristics we develop an algorithm that uses not the 0-1 integer programming method but a graphical method called the allocationrequest graph. Note that in mixed fragment allocation there may be many different types of fragments. Thus we define three more operations, in addition to the join operation, for these mixed fragments. This approach begins from the transaction analysis, which produces operation trees and a fragment usage matrix. Then, by using this fragment usage matrix, an allocationrequest graph is generated, and finally fragments are allocated according to the heuristic allocation procedure.
6.1 Distributed Transaction Processing
In data allocation problems, data should be processed locally by applying the conditions of the query before any transmission. Selections and projections, which do not require data movement, should be performed locally. The really difficult problem of distributed query optimization is the execution of joins.
When the unit of allocation is the mixed fragment, since there are no restrictions on the type of fragments, in
addition to the join operation, there are other operations
which require data movement: outerjoin, union, and outerunion. We call them multifragment operations. Outerjoin is introduced to keep all tuples in the fragments regardless of whether they have matching tuples in the other fragment, and outerunion is introduced to keep non-union-compatible attributes.
In a distributed system several factors must be taken into account, such as link cost, data transmission cost, storage cost, I/O cost, and CPU cost. Among them, the first and most important factor to be considered in distributed environments is the cost of transferring data over the network [Wied 87]. Thus, in this dissertation we minimize the amount of data transmission, which we call the total transmission cost, while satisfying the CPU cost constraint at each site. The details are explained in Chapter 6.2. Note that we assume that the link transmission costs between any two nodes are the same.
When a multifragment operation involves fragments that reside at different sites, the choice of the site at which to perform the operation can have a significant impact on data movement. There are basically two alternatives for the query processing strategy [Elma 89]. One is the siteoforigin strategy, in which all multifragment operations are performed at the site where the query was submitted; the other is the movesmall strategy, in which, for each multifragment operation, the smaller fragment
participating in the operation is always sent to the site where the relation of larger size resides in order to perform the multifragment operation. The movesmall strategy was shown to be superior to the siteoforigin strategy [Corn 88]. We will illustrate this with a simple example query. Suppose fragments F1 and F2 are as shown in Figure 6-1. Consider the query Q: "For each employee retrieve the employee name and salary." Each transaction against a relational database can be decomposed into a sequence of relational algebra operations. Suppose the query was submitted at a distinct site 3. Neither the F1 nor the F2 fragment resides at site 3. There
are two simple strategies for executing the distributed query:
(1) Data transfer in the siteoforigin strategy: Transfer both the F1 and F2 fragments to site 3 and perform the query at site 3. In this case we need to transfer a total of 1000*50 + 100*30 = 53000 bytes.

(2) Data transfer in the movesmall strategy: Transfer F2 to site 1, execute the query at site 1, and send the result to site 3. The size of the query result is 36*1000 = 36000 bytes, so we transfer 36000 + 3000 = 39000 bytes.
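The two transfer volumes can be recomputed as follows. This is an illustrative Python sketch; the sizes follow the totals stated in the text (F1: 1000 tuples of 50 bytes, F2: 100 tuples of 30 bytes, result: 1000 tuples of 36 bytes), and we take F1 to reside at site 1 since the text ships F2 there.

```python
# Compare the data volumes of the two distributed query strategies.

f1_bytes = 1000 * 50        # fragment F1 shipped whole
f2_bytes = 100 * 30         # fragment F2 shipped whole
result_bytes = 1000 * 36    # join result shipped to the origin site

# (1) Site-of-origin: ship both fragments to site 3, join there.
site_of_origin = f1_bytes + f2_bytes

# (2) Move-small: ship the smaller fragment (F2) to F1's site,
#     join there, then ship the result to site 3.
move_small = f2_bytes + result_bytes
```

As in the text, move-small transfers 39000 bytes against 53000 for site-of-origin, a saving of 14000 bytes for this query.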
However, in the movesmall strategy the multifragment operation is performed at the site where the larger fragment is located, and its result must be returned to the requesting site if that site does not hold the larger fragment. Thus we have to determine the query processing strategy in order to minimize
the amount of data transmission. However, even though we consider the transaction processing strategy to reduce the total transmission cost, it is difficult to obtain an optimal allocation by considering the transaction processing strategy alone, because optimal allocation is deeply related to both transaction processing strategy and fragment allocation. In fact, finding a nonredundant, minimum total data transmission cost allocation is NP-complete [Gare 79]. Hence,
to minimize the transmission cost, we will consider some heuristics which allow multifragment operations to be performed effectively in an integrated way.
In our research, to illustrate how a transaction is decomposed into a sequence of algebraic operations, we use an operation tree, which represents the algebraic formulation. Figure 6-2 shows an example of an operation tree. Note that at this step we assume that the amount of data requested by each transaction can be estimated, by considering the attributes accessed and selectivity factors. After analyzing the transactions, a fragment usage matrix is generated. The fragment usage matrix is similar to the attribute usage matrix and represents the use of the fragments by the important transactions. Each row refers to one transaction:
the numeric value in a column indicates that the transaction originated from the specified site or sites (the numeric value) and "uses" the corresponding fragment. For example, in Figure 6-3, transaction T5 originates from sites 1 and 2 and uses fragment F3.
6.2 Representation and Strategy

6.2.1 Representation
To allocate fragments optimally, most approaches [Cer 83b, Corn 88] use the 0-1 integer programming method. However, this method is hard to understand and lacks intuition. It also suffers from limitations on the size of the problem that can be solved with commercially available software. In this dissertation we use a graph called the allocationrequest graph to represent how data are requested and allocated. Figure 6-4 shows an example of the allocationrequest graph. As shown in Figure 6-4, an allocationrequest graph consists of three components.
(1) rectangular nodes, for fragments
(2) circular nodes, for sites
(3) edges, that denote requests by the transactions for
accessing the fragments
Edges, which are directed, are labeled with a quadruple (i,j,f,d), where d stands for the amount of data requested from fragment j (which corresponds to a rectangular node in the graph) for processing the i-th transaction, and f stands for the frequency with which this transaction is executed. For example, in Figure 6-4, the label (2,1,30,200) means that transaction 2 requests 200 bytes of fragment 1, with a frequency of 30.
Join operations are represented by drawing arcs among the edges participating in the join and marking them with a "J". Outerunion and outerjoin operations are represented by "O", and union operations are represented by "U". Figure 6-4 shows these representations. Note that n-way (n > 2) multifragment operations can also be represented in the same way, as shown in Figure 6-4.
6.2.2 Strategy
The objective is to allocate the mixed fragments
nonredundantly with minimum distributed transaction processing cost. The distributed transaction processing cost here is measured by the total transmission cost. The allocation and
transaction processing are subject to some load balancing constraints. The total transmission cost is obtained by multiplying the transaction frequency by the amount of data requested. To reduce the total transmission cost, a fragment should be allocated to the site that requests the largest amount of data. To increase the availability of the system for local processing, CPU/IO cost is considered as a balancing constraint at each site. This means that there is an upper limit on the CPU/IO time at each site. The CPU/IO cost is
determined by (cost of tuple retrieval and processing) x (# of tuples) x (# of retrievals). This is because most relational database systems use a tuple-at-a-time retrieval storage system (e.g., RSS in System R). Thus what we are trying to do is to minimize the total transmission cost while satisfying the CPU/IO cost constraint at each site.
This objective would be easily attained if transactions did not contain multifragment operations. However, when transactions do contain multifragment operations, as would normally happen, we cannot easily attain the objective because there is an interdependency between the query processing strategy and data allocation, as pointed out in the previous approaches [Wah 84, Aper 88, Corn 88]. To solve this problem, data allocation and the query processing strategy should be considered together. This means that we should choose the sites where data are to be stored while simultaneously determining where join operations should take place [Corn 89].
Another consideration is the priority of the requests. When there are multiple requests for a fragment, we emphasize multifragment requests in selecting a candidate request, because this allows effective processing of multifragment operations. Note that nonmultifragment operations are not neglected; the relationship with the other fragment participating in a multifragment operation is considered in order to reduce the data transmission. The details will be explained in Strategy 3. This is reasonable because, as pointed out in [Swam 89], some future applications built on top of relational systems will require processing of queries with a much larger number of joins. Object-oriented database systems that use relational systems for information storage are another class of potential applications performing many joins.
Thus, in our research, we present an integrated heuristic that takes into consideration the interdependency problem in multifragment operations and the join emphasis for selecting a candidate request. The integrated heuristic is as follows: "Let the fragment that causes the largest data transmission be located at the requesting site as much as possible while satisfying the load balancing constraints." This heuristic can be accomplished by using an allocation technique called pseudoallocation, which allows fragments to be allocated dynamically with multifragment operations so as to minimize the distributed transaction processing cost. This will be explained later.
Based on the above discussions we can formulate the following allocation strategies. Note that at any step the final allocation of a fragment is determined as long as the load balancing constraint is not violated; otherwise the fragment goes back to the fragment pool.
Strategy 1. If there is only a single request, allocate
this fragment to the requesting site.
Strategy 2. When there are multiple requests from different sites for one fragment, none of which participates in any multifragment operation, allocate this fragment to the site that gives rise to the largest amount of data transmission.
Strategy 3. If the requests are mixed with multifragment operations, two cases may occur. To explain further, we use the following terminology.
- former fragment: the fragment that is being considered for allocation first between the two fragments participating in a multifragment operation. For example, in Figure 6-5, F1 is a former fragment.
- latter fragment: the fragment that is being considered for allocation after the former fragment. In Figure 6-5, F2 is a latter fragment.
- data-multi-former (data-multi-latter): the total amount of data requested of the former fragment (latter fragment) from the site that has multifragment operations. Note that if there are several sites that have multifragment operations, then the site that has the largest amount of total data requested is considered. In Figure 6-5, a and c represent data-multi-former and data-multi-latter, respectively.
- data-no-multi-former (data-no-multi-latter): the total amount of data requested of the former fragment (latter fragment) from the site that has no multifragment operations. Note that if there are several sites that have no multifragment operations, then the site that has the largest amount of total data requested is considered. In Figure 6-5, b and d represent data-no-multi-former and data-no-multi-latter, respectively.
3.1 Former fragment: If the candidate fragment is a former fragment, then the requests that are participating in multifragment operations have higher priority for allocation, without cost computation, than those that are not participating in multifragment operations. Thus, in Figure 6-5, fragment F1 will be allocated to site 1 without any cost computation.
3.2 Latter fragment: If the candidate fragment is a latter fragment, then apply the pseudoallocation technique.
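The selection logic of Strategies 1, 2 and 3.1 can be sketched as follows. This is our own illustrative rendering, not code from the dissertation; each request is represented as (site, frequency, data_bytes, in_multifragment_op).

```python
def candidate_site(requests):
    # Strategy 1: a single request -- allocate to the requesting site.
    if len(requests) == 1:
        return requests[0][0]
    # Strategy 3.1: requests participating in multifragment operations
    # have priority over those that are not.
    multi = [r for r in requests if r[3]]
    pool = multi if multi else requests
    # Strategy 2: among the remaining candidates, pick the site that gives
    # rise to the largest data transmission (frequency * data amount).
    best = max(pool, key=lambda r: r[1] * r[2])
    return best[0]

# Site 2 causes 30*200 = 6000 bytes of traffic, site 1 only 10*400 = 4000.
print(candidate_site([(1, 10, 400, False), (2, 30, 200, False)]))  # 2
```

Whether the chosen site finally keeps the fragment still depends on the load balancing check described above; if the constraint is violated the fragment returns to the fragment pool.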
6.2.3 Pseudoallocation Technique
The pseudoallocation technique is an allocation heuristic that allows fragments to be allocated dynamically with multifragment operations. The job of the pseudoallocation technique is to determine whether a former fragment is kept at the allocated site or not. Thus the pseudoallocation technique is always meaningful when there is a former fragment. In order to satisfy our proposed heuristic we introduce a parameter β to represent the ratio of the result size of a multifragment operation to the size of the larger of the two participating fragments. Thus if both fragments participating in a multifragment operation are allocated to remote sites, some portion of a fragment, depending on this ratio, will be added to the total data transmission as a penalty for the remote accessing. The rules are as follows. We refer to Figure 6-5.
(1) a < b & c < d
- Meaning: site S1 requests less data from F1 and F2 than site S2.
- Additional condition
(a) a > c & β*a > b
- Meaning: the data requested from F1 is greater than the data requested from F2 in the multifragment operation, and the data resulting from the multifragment operation due to F1 is greater than the data requested by S2 from F1.
- Decision: keep F1 at S1 and return F2 to the fragment pool. Note, however, that if β*a < b, both fragments are returned to the fragment pool.
(b) a < c & β*c > d
- Meaning: the data requested from F1 is less than the data requested from F2 in the multifragment operation, and the data resulting from the multifragment operation due to F2 is greater than the data requested by S2 from F2.
- Decision: release F1 and allocate F2 to S1. Note that if β*c < d, both fragments are returned to the fragment pool.
(2) a < b & c > d
- Meaning: S1 requests less data from F1 than S2, and S2 requests less data from F2 than S1.
- Decision: release F1 and allocate F2 to S1.
(3) a > b & c < d
- Meaning: S1 requests more data from F1 than S2, and S2 requests more data from F2 than S1.
- Decision: keep F1 at S1 and return F2 to the fragment pool.
(4) a > b & c > d
- Meaning: S1 requests more data from F1 and F2 than S2.
- Additional condition
(a) a > c
- Meaning: the data requested from F1 is greater than the data requested from F2 in the multifragment operation.
- Decision: keep F1 at S1 and return F2 to the fragment pool.
(b) a < c
- Meaning: the data requested from F1 is less than the data requested from F2 in the multifragment operation.
- Decision: release F1 and allocate F2 to S1.
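The decision rules above can be summarized in code. The sketch below is our own reading of rules (1)-(4); ties between the quantities are not covered by the rules as stated, and the function name is an assumption.

```python
def pseudo_allocate(a, b, c, d, beta):
    # a, b: data requested of the former fragment F1 from S1 (the
    #       multifragment site) and from S2 (the non-multifragment site).
    # c, d: the same quantities for the latter fragment F2.
    # beta: ratio of the multifragment result size to the larger fragment.
    # Returns which fragment stays allocated at S1: "F1", "F2", or "none"
    # (both fragments returned to the fragment pool).
    if a < b and c < d:                      # rule (1)
        if a > c:                            # (1a)
            return "F1" if beta * a > b else "none"
        return "F2" if beta * c > d else "none"   # (1b)
    if a < b and c > d:                      # rule (2)
        return "F2"
    if a > b and c < d:                      # rule (3)
        return "F1"
    return "F1" if a > c else "F2"           # rule (4): a > b and c > d

print(pseudo_allocate(100, 300, 80, 200, 1.5))   # rule (1a): 1.5*100 < 300 -> "none"
print(pseudo_allocate(500, 100, 200, 300, 1.5))  # rule (3) -> "F1"
```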
Now let us explain how the pseudoallocation technique works. For example, in Figure 6-6, suppose F2 is the former fragment and β = 1.5. Then, according to Strategy 3.1, fragment F2 is allocated to site 1 with data-multi-former <

Full Text 
DATA FRAGMENTATION AND ALLOCATION ALGORITHMS FOR DISTRIBUTED DATABASE DESIGN

By

MINYOUNG RA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1990
"The Lord is my shepherd; I shall not want." (Psalms 23:1)
ACKNOWLEDGMENTS

I would like to express my deep gratitude to my advisor, Dr. Shamkant B. Navathe, for giving me the opportunity to work in the interesting area of distributed database design and for his continual encouragement and support through the course of this work. I indeed owe him a great deal. I would like to express my special thanks to my supervisory committee co-chair, Dr. Ravi Varadarajan, for his ever-present help and valuable comments on this work. I would like to thank Dr. Sharma Chakravarthy for his useful suggestions and discussions. I would like to thank Dr. Jose C. Principe and Dr. Manuel Bermudez for their time and interest in my research. I would like to thank Mrs. Sharon Grant for her unceasing help and assistance. I would also like to thank all those unnamed friends who helped me in pursuing my research. I would like to thank the Korea Military Academy and the Korean Army for allowing me to study here and for their encouragement and support. Finally, I would like to thank my family, especially my wife Eunjin and children Shinil and Eunsong, for their continuous prayer and encouragement.
This work was supported by the National Science Foundation under grant no. IRI 8716798.
TABLE OF CONTENTS

ACKNOWLEDGMENTS iii
ABSTRACT vii

CHAPTERS

1. INTRODUCTION 1
   1.1 Basic Concepts of Fragmentation and Allocation 2
   1.2 Proposed Problem 4
   1.3 Contributions of This Dissertation 5
2. BACKGROUND AND PROPOSED RESEARCH 9
   2.1 Previous Related Work 9
   2.2 The Outline of the Proposed Research 14
3. A VERTICAL PARTITIONING ALGORITHM 18
   3.1 Overview 19
   3.2 Definitions and Notations 21
   3.3 Fundamental Concepts 22
   3.4 The Graph Oriented Algorithm 25
   3.5 Application and Extension 34
4. A HORIZONTAL PARTITIONING ALGORITHM 43
   4.1 Overview 43
   4.2 Predicate Usage Matrix and Predicate Affinity Matrix 45
   4.3 The Algorithm 47
   4.4 Nonoverlapping Horizontal Partitioning 54
5. A MIXED PARTITIONING METHODOLOGY 61
   5.1 Proposed Methodology 61
   5.2 Grid Creation 65
   5.3 Grid Optimization 65
   5.4 Effect of Changes in Data and Transactions 74
6. AN ALLOCATION ALGORITHM FOR THE MIXED FRAGMENTS 79
   6.1 Distributed Transaction Processing 80
   6.2 Representation and Strategy 84
   6.3 A Heuristic Procedure for Mixed Fragment Allocation 92
   6.4 Complexity of the Algorithm 95
   6.5 Experimental Results 96
   6.6 Dealing with Nonaccessed Fragments 97
7. ALTERNATIVE WAYS OF DEALING WITH FRAGMENTATION AND ALLOCATION 106
   7.1 Alternatives for Combining Fragmentation and Allocation 106
   7.2 Cell Allocation Followed by Local Optimization 108
8. CONCLUSIONS AND FURTHER RESEARCH 111

REFERENCES 113
BIOGRAPHICAL SKETCH 116
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DATA FRAGMENTATION AND ALLOCATION ALGORITHMS FOR DISTRIBUTED DATABASE DESIGN

By Minyoung Ra

December 1990

Chairman: Dr. Shamkant B. Navathe
Major Department: Computer and Information Sciences

In a distributed database system, data fragmentation and allocation are the major design issues. This dissertation presents a set of algorithms for fragmentation (or partitioning) and allocation. The relational model of data is assumed for convenience. The algorithms can be applied to other data models with minor variations. The partitioning of a global schema into fragments can be performed in two different ways: vertical partitioning and horizontal partitioning. Vertical partitioning is the process of subdividing the attributes of a relation or a record type into multiple records, thereby creating fragments. Horizontal partitioning is the process that divides a global relation into subsets of tuples, called horizontal fragments. In this dissertation a new vertical partitioning algorithm is
presented. This algorithm starts from the attribute affinity matrix by considering it as a complete graph. Then, forming a linearly connected spanning tree, it generates all meaningful fragments simultaneously. A new horizontal partitioning algorithm is also presented, which is developed by applying the same graphical technique as in vertical partitioning. The need for mixed partitioning arises because database users usually access data subsets which are simultaneously vertical and horizontal fragments of global relations. However, this problem has not been addressed well in the current literature. A mixed partitioning methodology, which first forms a grid by partitioning a global relation vertically and horizontally in an independent fashion and then produces the final fragments (called mixed fragments) by merging the grid cells, is addressed. In most of the previous allocation work, the unit of allocation is the fragment that results from horizontal partitioning or vertical partitioning. Little work has been done for the allocation of the result of mixed partitioning. This dissertation presents an allocation algorithm for fragments that are generated from our mixed partitioning procedure. In this algorithm a mixed fragment is the unit of allocation. This algorithm is developed based on a heuristic called the pseudoallocation technique.
The contribution of this dissertation consists of providing efficient graph oriented algorithms for partitioning, developing a mixed partitioning approach to distribution, and a new approach to fragment allocation in distributed databases.
CHAPTER 1
INTRODUCTION

In recent years, due to the demand for system availability and autonomy, and enabled by advances in database and communication technology, distributed database systems have become an important area of information processing, and it is easy to foresee that their importance will rapidly grow. There are both organizational and technological reasons for this trend; distributed databases eliminate many of the shortcomings of centralized databases and fit more naturally in the decentralized structures of many organizations [Ceri 84]. There are two alternative approaches to the design of data distribution: the top-down and the bottom-up approaches [Cer 83a, Ceri 84]. The former approach is typical of distributed databases developed from scratch, while the latter approach is typical of the development of multidatabase systems as the aggregation of existing databases. This dissertation is concerned with the top-down approach. In the top-down approach, the description of the global schema is given, the fragmentation of the databases is designed, and then these fragments are allocated to the sites. The objective of this approach is to evaluate the utility of
partitioning data objects into fragments both horizontally and vertically, and then to determine the allocation of fragments to the database sites [Ceri 82]. Optimal and heuristic mathematical models are used to determine the allocation of data to reduce the aggregate cost of transaction processing [Cer 83b, Nava 84]. Figure 1-1 describes an overview of distributed database design when it is performed in a top-down manner [Cer 83b]. The main difference of this overview compared with centralized database design is the distribution design step. Prior to distribution design, a global schema is produced by using standard techniques of centralized database design. During the distribution design step, several local logical schemas for the distributed database are produced. Data objects which belong to the global database schema are fragmented and allocated to the different sites, and associations between objects are taken into account in the process. The design is completed by performing the physical design of the data which are allocated to each site.

1.1 Basic Concepts of Fragmentation and Allocation

The top-down approach to distributed design consists of solving for each global object the design problems of fragmentation and allocation [Cer 83b]. Fragmentation or partitioning is the process of subdividing a global object (entity or relation) into several pieces, called fragments,
and allocation is the process of mapping each fragment to one or more sites. Fragments must be appropriate units of allocation. Thus the database designers should define fragments as homogeneous collections of information from the viewpoint of distribution [Ceri 87]. The decomposition of a global schema into fragments can be performed using two different types of fragmentation techniques: horizontal fragmentation and vertical fragmentation. Vertical fragmentation is the process that divides a global object into groups of its attributes, called vertical fragments. In order to be able to access the same data after partitioning, it is required that each fragment include a key attribute of the global object. Horizontal fragmentation is the process that divides a global relation into subsets of tuples, called horizontal fragments. Each fragment is associated with a predicate which indicates the distinguishing property possessed by the instances or tuples of that fragment. Note that fragments need not be disjoint. This means that fragments can have common attributes in vertical fragmentation and common tuples in horizontal fragmentation. Another type of fragmentation is mixed fragmentation. Mixed fragmentation can be built by alternating horizontal and vertical fragmentations. The allocation of fragments can be either nonredundant or redundant. A nonredundant (nonreplicated) allocation requires each fragment to be mapped to exactly one site,
whereas a redundant (replicated) allocation allows each fragment to be mapped to one or more sites. Typically, replication permits higher availability and reliability of the system, though it needs more storage and additional effort for maintaining consistency.

1.2 Proposed Problem

Figure 1-2 shows the Distributed Database Design Tool (D²T) reference architecture which is used in our research. The term "GRID" is used to refer to the result of applying horizontal partitioning and vertical partitioning algorithms simultaneously to a global relation. It consists of cells called "grid cells." The grid suggests all possible ways in which the global relation in a distributed database may be partitioned. The input information to D²T is the global schema, which consists of a set of relations, together with information about the important transactions on the proposed database. As stated in the previous work [Cer 83b, Nava 84], it is not necessary to collect information on 100% of the expected transactions (that would, of course, be impossible). Since the 80-20 rule applies to most practical situations, it is adequate to supply information regarding the 20% of heavily used transactions which account for about 80% of the activity against the database. The other input to D²T shown in Figure 1-2 is the distribution constraints, which include preferences or special considerations designers/users may have
that would influence partitioning and allocation. In this figure, grid creation is related to vertical fragmentation and horizontal fragmentation, and grid optimization is related to mixed fragmentation. In this dissertation, we will deal with vertical fragmentation, horizontal fragmentation, mixed fragmentation and allocation with no redundancy by presenting new algorithms that have less complexity or provide minimum transaction processing cost.

1.3 Contributions of This Dissertation

In a distributed database system, data fragmentation and allocation are the major design issues. This dissertation presents a set of algorithms for data fragmentation and allocation in a distributed database system environment. In this dissertation, a number of problems for data fragmentation and allocation are defined and attacked by using graphical techniques. One of the main contributions of this dissertation is the development of the graph oriented partitioning algorithm for vertical and horizontal partitioning. The major feature of this algorithm is that all fragments are generated in one iteration in time O(n²), where n represents the number of attributes, which is more efficient than the previous approaches. Furthermore, it provides a cleaner alternative without arbitrary objective functions. Another contribution is the development of the first comprehensive mixed partitioning methodology for this problem.
A graph-based allocation algorithm is also presented along with some heuristics.
[Figure 1-1: The overall distributed database design methodology, top-down manner [Cer 83b]]
[Figure 1-2: Distributed database design tool reference architecture]
CHAPTER 2
BACKGROUND AND PROPOSED APPROACH

2.1 Previous Related Work

We review the previous work in the design of data distribution. This review is divided into the work on data fragmentation and the work on data allocation.

2.1.1 Previous Work on Data Fragmentation

Data fragmentation is performed during the design of a database to improve the performance of transactions. In order to obtain improved performance, fragments must be "closely matched" to the requirements of the transactions. The work related to this topic can be classified into the work on vertical partitioning, on horizontal partitioning, and on mixed partitioning.

2.1.1.1 Vertical partitioning

Hoffer and Severance [Hoff 75] measure the affinity between pairs of attributes and try to cluster attributes according to their pairwise affinity by using the bond energy algorithm (BEA) developed in [McCo 72]. Navathe et al. [Nava 84] extend the results of Hoffer and Severance and propose a two-phase approach to vertical
partitioning. During the first phase, they use the given input parameters in the form of an attribute usage matrix and transactions to construct the attribute affinity matrix on which clustering is performed. After clustering, iterative binary partitioning is attempted, first with an empirical objective function. The process is continued until no further partitioning results. During the second phase, the fragments can be further refined by incorporating estimated cost factors weighted on the basis of the type of problem being solved. Cornell and Yu [Corn 87] apply the work of Navathe et al. [Nava 84] to relational databases. They propose an algorithm which decreases the number of disk accesses to obtain an optimal binary partitioning. They show how knowledge of specific physical factors can be incorporated into the overall fragmentation methodology. Ceri, Pernici and Wiederhold [Ceri 88] extend the work of Navathe et al. [Nava 84] by considering it as a DIVIDE tool and by adding a CONQUER tool. Their CONQUER tool extends the same basic approach in the direction of adding details about operations and physical accesses similar to [Corn 87]. This approach focuses on the decomposition of the design process into several design subproblems and provides no algorithmic improvement in the process of vertical partitioning itself.

2.1.1.2 Horizontal partitioning

Ceri, Negri and Pelagatti [Ceri 82] analyze the horizontal partitioning problem, dealing with the
specification of partitioning predicates and the application of horizontal partitioning to different database design problems. Ceri, Navathe and Wiederhold [Cer 83b] develop an optimization model for horizontal partitioning without replication in the form of a linear integer 0-1 programming problem. They modeled the logical schema of a database as a directed graph with objects as nodes and links as edges, and required the user to specify information about (1) the data about the schema (attribute size, relationship size, cardinality), (2) the tabulation of transactions (frequencies, site of origin), and (3) the distribution requirements (constraints). Yu et al. [Yu 85] propose an adaptive algorithm for record clustering, which is conceptually simple and highly intuitive. However, the adaptive approach does not use the transaction information that is useful for partitioning. Hence, it cannot be applied until the execution of transactions.

2.1.1.3 Mixed partitioning

Currently mixed partitioning has not been adequately addressed in the literature. The only work we are aware of is that of Apers [Aper 88], who considers the fragmentation problem together with the allocation problem. In his approach, the fragmentation scheme is the output of the allocation
algorithm. Thus we cannot know the fragmentation scheme before allocation.

2.1.2 Previous Work on Data Allocation

Previous work on data allocation has been mainly in two areas: (1) data allocation independent of network parameters, and (2) extensions of the pure data allocation problems including the network topology and communication channels. Note that since a distributed file system differs greatly from a distributed database, the solutions for the file allocation problems [Dowd 82] do not characterize solutions to the allocation problems in a distributed database. This is because in distributed database systems the way the data are accessed is far more complex [Aper 88]. This review concentrates on how the data allocation problem has been attacked in the context of distributed databases, and mainly in areas of data allocation itself. Ceri, Martella and Pelagatti [Cer 80] consider the problem of data allocation for typical database applications with a simple model of transaction execution. However, they do not consider any partitioning. Ceri, Navathe and Wiederhold [Cer 83b] propose a model for a nonreplicated data allocation in the form of a linear 0-1 integer programming problem. But only horizontal partitioning is considered and a certain query processing strategy is assumed.
Wah [Wah 84] investigates the important problems of file allocation in general. However, he does not consider data fragmentation for database design. Sacca and Wiederhold [Sacc 85] propose a heuristic allocation algorithm in a cluster of processors in which the user does not address a transaction request to a specific processor. They try to minimize the global cost for allocating data in the given network. However, this approach cannot be applied to a system in which processors are remotely distributed, because, in a remotely distributed system, the user has to request data from a specific site. Cornell and Yu [Corn 88, Corn 89] propose a strategy to integrate the treatment of relation assignment and query strategy to optimize performance by using the 0-1 integer programming method. They, however, consider neither any partitioning nor join order. Apers [Aper 88] proposes a model for computing the cost of a completely or partially specified allocation for various cost functions. In this approach fragments are the result of the allocation, so we do not know about the fragmentation scheme before allocation. When fragments are generated, since clustering of fragments is not considered, the generated fragments may contain only a few or even one tuple. Ceri, Pernici and Wiederhold [Ceri 88] propose a distributed design tool which provides a flexible environment for the design of data distribution. In this paper vertical
fragments are allocated to each site according to the DIVIDE and CONQUER tools. They, however, do not consider horizontal partitioning.

2.2 The Outline of the Proposed Approach

This dissertation is primarily concerned with the algorithms for data fragmentation and allocation. We provide an overview of the structure of the dissertation by discussing the basic ideas for solving partitioning and allocation problems in distributed database design.

2.2.1 Vertical Partitioning

In all vertical partitioning algorithms that we have surveyed, the binary partitioning technique has been used for partitioning after clustering attributes. Thus binary partitioning is repeated until all meaningful fragments are determined. It is also necessary that clustering be repeated on two new affinity matrices corresponding to the newly generated fragments. In Chapter 3 we propose a new vertical partitioning algorithm which has less computational complexity and generates all meaningful fragments simultaneously by using a graphical method. This approach is based on the fact that all pairs of attributes in a fragment have high "within fragment affinity" but low "between fragment affinity."
2.2.2 Horizontal Partitioning

In horizontal partitioning, most of the previous approaches that we know of try to select an optimal horizontal partitioning for each relation. In the present approach, however, we do not consider how to select an optimal horizontal partitioning for each relation in the database. We are focusing our attention on identifying all the candidate horizontal partitions by using the same graphical approach as in the vertical partitioning. To this end, we construct a predicate affinity matrix. Based on this matrix we develop a horizontal partitioning algorithm in Chapter 4.

2.2.3 Mixed Partitioning

Mixed partitioning can be accomplished in a sequential manner in one of two ways: first performing vertical partitioning and then horizontally partitioning the vertical partitions (called VH partitioning), or first performing the horizontal partitioning and then vertically partitioning the horizontal partitions (called HV partitioning). Obviously, this is not adequate since it potentially leads to different results and leaves out the possibility of combining fragments at a smaller granularity to produce more efficient data distribution. Thus, in Chapter 5, we propose a uniform mixed partitioning methodology in order to solve the above problems. A grid consisting of "cells" is created by simultaneously
applying the vertical partitioning and horizontal partitioning algorithms on the relation. The grid cells will be merged by taking into account the cost factors to minimize the global transaction processing costs. This approach produces optimal mixed fragments that cannot otherwise be produced by independent partitioning models.

2.2.4 Allocation

The problem of data allocation in distributed database systems has been extensively studied. However, in most cases the unit of allocation is the fragment resulting from horizontal partitioning or vertical partitioning. Little work has been done for the allocation of the result of mixed partitioning. So, in Chapter 6, we propose an allocation algorithm for the result of mixed partitioning. A mixed fragment, which comes from our mixed partitioning procedure by considering both vertical partitioning and horizontal partitioning simultaneously, is the unit of allocation. The objective of this algorithm is to allocate the mixed fragments nonredundantly with minimum data transmission cost while satisfying the load balancing constraints. To this end we will present some heuristics that combine fragment allocation and transaction processing strategy. Based on these heuristics we will develop the algorithm.
2.2.5 Alternative Ways of Dealing with Fragmentation and Allocation

Distributed database design has been dealt with in terms of fragmentation of data and allocation of these fragments. There may be other alternatives with which these problems can be attacked. In Chapter 7, we describe the various alternatives and evaluate them by comparing them with one another.
CHAPTER 3
A VERTICAL PARTITIONING ALGORITHM

Vertical partitioning is the process that divides a "global object," which may be a single relation or more, like a universal relation, into groups of its attributes called vertical fragments [Cer 83a, Nava 84, Corn 87]. It is used during the design of a database to enhance the performance of transactions [Nava 84]. In order to obtain improved performance, fragments must closely match the requirements of the transactions. Vertical partitioning has a variety of applications wherever the "match" between data and transactions can affect performance. That includes partitioning of individual files in centralized environments, data distribution in distributed databases, dividing data among different levels of memory hierarchies, and so on. Figure 3-1 shows the transaction specification for our example, which will cover both the vertical and horizontal algorithms. Although this and the subsequent chapters are presented using the relational model, the approach presented in this dissertation is general enough. By replacing the term relation with record or object it can be easily applied to the hierarchical, network or the object-oriented models.
3.1 Overview

The algorithm that we propose starts from the attribute affinity (AA) matrix, which is generated from the attribute usage matrix using the same method as the previous approach in [Nava 84]. The attribute usage matrix represents the use of attributes by the important transactions. Each row refers to one transaction; a "1" entry in a column indicates that the transaction "uses" the corresponding attribute. Whether the transaction retrieves or updates the relation can also be captured by another column vector with R and U entries for retrieval and update; that information may be used by an empirical objective function as in [Nava 84]. The attribute usage matrix for 10 attributes and 8 transactions is shown in Figure 3-2. Attribute affinity is defined as

    aff_ij = Σ_k acc_kij,   i ≠ j,

where acc_kij is the number of accesses of transaction k referencing both attributes i and j. The summation ranges over all transactions belonging to the set of important transactions. This definition means that attribute affinity measures the strength of an imaginary bond between two attributes, predicated on the fact that the attributes are used together by transactions. Although the diagonal affinity aff_ii does not have any physical meaning, it is reasonable to define it as follows:

    aff_ii = Σ_k acc_ki
where acc_ki means the number of accesses of transaction k referencing attribute i. This is reasonable since it shows the "strength" of that attribute in terms of its use in all transactions. Based on these definitions of attribute affinity, the attribute affinity matrix is defined as follows: it is an n × n matrix for the n-attribute problem whose (i,j) element equals aff_ij. Figure 3-3 shows the attribute affinity matrix formed from Figure 3-2.

A note about the attributes: in this proposed technique, as well as in the previous techniques, the set of attributes considered may be (1) the universal set of attributes in the whole database, or (2) the set of attributes in a single relation (or record type or object type). By using (1), the fragments generated may be interpreted as relations or record types. By using (2), fragments of a single relation are generated.

In previous approaches, a clustering algorithm is applied to the AA matrix. In our present approach, however, we consider the AA matrix as a complete graph, called the affinity graph, in which an edge value represents the affinity between the two attributes. Then, forming a linearly connected spanning tree, the algorithm generates all meaningful
fragments in one iteration by considering a cycle as a fragment. A "linearly connected" tree has only two ends. Figure 3-4 shows the affinity graph corresponding to the AA matrix of Figure 3-3. Note that the AA matrix serves as a data structure for the affinity graph.

The major advantages of the proposed method over that in [Nava 84] are as follows: (1) There is no need for iterative binary partitioning. The major weakness of iterative binary partitioning is that at each step two new problems are generated, increasing the complexity. (2) The method obviates the need for any empirical objective functions as in [Nava 84]. As shown in [Corn 87], the intuitive objective functions used in [Nava 84] do not necessarily work well when an actual detailed cost formulation for a specific system is utilized. (3) The method requires no complementary algorithms such as the SHIFT algorithm of [Nava 84]. (4) The complexity of the algorithm is O(n²), better than the O(n² log n) complexity of the previous algorithms.

3.2 Definitions and Notations

We shall use the following notation and terminology in the description of our algorithm.
A, B, C, ... denote nodes. a, b, c, ... denote edges. p(e) denotes the affinity value of an edge e. "Primitive cycle" denotes any cycle in the affinity graph. "Cycle completing edge" denotes a "to be selected" edge that would complete a cycle. "Cycle node" is that node of the cycle completing edge which was selected earlier. "Affinity cycle" denotes a primitive cycle that contains a cycle node; we assume that a cycle means an affinity cycle, unless otherwise stated. "Former edge" denotes an edge that was selected earlier than the cycle node. "Cycle edge" is any of the edges forming a cycle. "Extension of a cycle" refers to a cycle being extended by pivoting at the cycle node.

These definitions are used in the proposed algorithm to process the affinity graph and to generate possible cycles from the graph. Note that each cycle gives rise to a vertical fragment.

3.3 Fundamental Concepts

Based on the above definitions, we now explain the mechanism of forming cycles. For example, in Figure 3-5, suppose edges a and b were selected already and c was selected
next. At this point, since c forms a primitive cycle, we have to check whether it is an affinity cycle. This is done by checking the "possibility of a cycle," which results from the condition that no former edge exists, or that p(former edge) < p(every cycle edge). The primitive cycle a,b,c is an affinity cycle because it has no former edge and thus satisfies the possibility of a cycle (i.e., the first cycle with three edges is always an affinity cycle). Therefore the primitive cycle a,b,c is marked as a candidate partition, and node A becomes a cycle node.

Now let us explain how the extension of a cycle is performed. In Figure 3-5, after the cycle node is determined, suppose edge d was selected. At this point, d is checked as a potential edge for extension, by testing the "possibility of extension" of the cycle by d. The possibility of extension results from the condition p(edge being considered, or cycle completing edge) > p(some cycle edge). Thus the old cycle a,b,c is extended to the new cycle a,b,d,f if the edge d under consideration, or the cycle completing edge f, satisfies the possibility of extension, that is: p(d) or p(f) > minimum of (p(a), p(b), p(c)). Now the process is continued: suppose e was selected as the next edge. We know from the definition of the "extension of a cycle" that e cannot be considered as a potential extension, because the primitive cycle d,b,e does not include the cycle node A. Hence it is discarded and the process is continued.
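The two tests just described can be stated compactly. The following is a minimal sketch (Python is used purely for illustration, and the helper names are ours, not the dissertation's):

```python
# Affinity-cycle and cycle-extension tests, as described in the text.

def possible_cycle(cycle_edge_weights, former_edge_weight):
    """A primitive cycle is an affinity cycle if no former edge exists,
    or the former edge has lower affinity than every cycle edge."""
    if former_edge_weight is None:
        return True
    return former_edge_weight < min(cycle_edge_weights)

def possible_extension(cycle_edge_weights, new_edge_weight, completing_edge_weight):
    """A cycle may be extended if the edge under consideration, or the
    cycle completing edge, exceeds the minimum affinity among the
    current cycle edges."""
    return max(new_edge_weight, completing_edge_weight) > min(cycle_edge_weights)
```

For the worked example of Chapter 3.4.3, the cycle 9,3,2,8 has cycle edge affinities (115, 75, 110, 75); testing edge 8-7 (60) with completing edge 3-7 (25) fails the extension test, producing the cut described there.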
The next concept that we wish to explain is the relationship between a cycle and a partition. There are two cases in partitioning.

(1) Creating a partition with a new edge. If the edge selected next for inclusion (e.g., d in Figure 3-5) was not considered before, we call it a new edge. If a new edge by itself does not satisfy the possibility of extension, then we continue by checking an additional new edge, the cycle completing edge (e.g., f in Figure 3-5), for the possibility of extension. In Figure 3-5, new edges d and f would potentially provide such a possibility of extension of the earlier cycle formed by edges a,b,c. If d, f meet the condition for possibility of extension stated above (namely, p(d) or p(f) > minimum of (p(a), p(b), p(c))), then the extended new cycle would contain edges a,b,d,f. If that condition is not met, we produce a cut on edge d (called the cut edge), isolating the cycle a,b,c. This cycle can now be considered a partition.

(2) Creating a partition with a former edge. After a cut is produced in (1), if there is a former edge, then change the previous cycle node to the node on which the cut edge is incident, and check for the possibility of extension of the cycle by the former edge. For example, in Figure 3-6, suppose that a, b, and c form a cycle with A as the cycle node, that there is a cut on d, and that the former edge w exists. Then the cycle node A is changed to C because
the cut edge d originates in C. We now evaluate the possibility of extending the cycle a,b,c into one that contains the former edge w; hence we consider the possibility of the cycle a,b,e,w. Assume that neither w nor e satisfies the possibility of extension, i.e., that "p(w) or p(e) > minimum of (p(a), p(b), p(c))" is not true. Then the result is the following: i) w is declared a cut edge, ii) C remains the cycle node, and iii) a,b,c becomes a partition. Alternatively, if the possibility of extension is satisfied, the result is as follows: i) cycle a,b,c is extended to cycle w,a,b,e, ii) C remains the cycle node, and iii) no partition can yet be formed.

3.4 The Graph-Oriented Algorithm

An algorithm for generating the vertical fragments from the affinity graph is described below.

3.4.1 Description of the Algorithm

First we briefly describe the algorithm in five steps.

Step 1. Construct the affinity graph of the attributes of the object being considered. Note that the AA matrix is itself an adequate data structure to represent this graph; no additional physical storage of data is necessary.

Step 2. Start from any node.
Step 3. Select an edge that satisfies the following conditions: (1) it should be linearly connected to the tree already constructed, and (2) it should have the largest value among the possible choices of edges at each end of the tree. If there are several largest values, any one can be selected. This iteration ends when all nodes are used for tree construction.

Step 4. When the next selected edge forms a primitive cycle: (1) If a cycle node does not exist, check for the "possibility of a cycle"; if the possibility exists, mark the cycle as an affinity cycle and consider it a candidate partition. Go to step 3. (2) If a cycle node already exists, discard this edge and go to step 3.

Step 5. When the next selected edge does not form a cycle and a candidate partition exists: (1) If no former edge exists, check for the possibility of extension of the cycle by this new edge. If there is no possibility, cut this edge and consider the cycle as a partition. Go to step 3. (2) If a former edge exists, change the cycle node and check for the possibility of extension of the cycle by the former edge. If there is no possibility, cut
the former edge and consider the cycle as a partition. Go to step 3.

To obtain a more detailed algorithm, suppose that the following data structures are used during implementation [Bras 88]: the nodes of the affinity graph are numbered from 1 to n, N = {1, 2, ..., n}, and a symmetric matrix L gives the weight of each edge. Three vectors are used: B, strongest, and maxwt. B gives the sequence of scanned nodes. For each node i ∈ N \ B, strongest[i] gives the node in B that is strongest with respect to i, and maxwt[i] gives the weight from i to strongest[i]; strongest[1] and maxwt[1] are not used. Without loss of generality, we can assume that the algorithm starts from node 1. The detailed description of the algorithm now follows. The algorithm uses variables with the following meanings:

p_cycle: a binary variable denoting whether a primitive cycle exists
c_node: a binary variable denoting whether a cycle node exists
f_edge: a binary variable denoting whether a former edge exists
candidate_p: a binary variable denoting whether an affinity cycle exists that can potentially generate a fragment
cycle_c_edge_wt: an integer variable for the weight of the cycle completing edge
former_edge_wt: an integer variable for the weight of the former edge

procedure Make_Partition(L[1..n, 1..n]): set of edges
  { initialize flags and variables }
  B[1] ← 1
  f1 ← 1, f2 ← 0   { f1 and f2 refer to the two ends of the spanning tree }
  p_cycle, c_node, f_edge, candidate_p ← false
  repeat
    { step 3: select the largest-weight edge incident on either end of the tree }
    max ← 0
    for each j in N \ B do
      if maxwt[j] > max and (strongest[j] = f1 or strongest[j] = f2)
        then max ← maxwt[j], k ← j
    { steps 4 and 5: maintain cycle status and the candidate partition }
    if the selected edge completes a primitive cycle then p_cycle ← true
    if former_edge_wt > 0 then f_edge ← true
    if cycle_c_edge_wt > former_edge_wt or f_edge = false
      then candidate_p ← true
    if there is no possibility of extension
      then { partition exists }
        reinitialize variables
        if f_edge = false
          then save this partition
          else change the cycle node
               if former_edge_wt < pmin
                 then save this partition
                      remove k from B
                 else extend the cycle
      else extend the cycle
    { pmin contains the minimum edge of a cycle }
    if pmin > maxwt[k] then pmin ← maxwt[k]
    { update strongest and maxwt with respect to the newly scanned node k }
    for each j in N \ B do
      if L[k,j] > maxwt[j] then maxwt[j] ← L[k,j], strongest[j] ← k
  until all nodes are in B

3.4.2 Properties of the Partitioning

The partitioning produced by the algorithm has the following properties. Let C_1, ..., C_r denote the clusters (cycles) generated, and let p_j(v) denote the affinity of a cycle edge of C_j incident on vertex v.

(1) For every edge (v,w) with v ∈ C_j and w ∈ C_k, j ≠ k: p_j(v) > p(v,w) or p_k(w) > p(v,w). Intuitively, this partitioning guarantees that for a vertex v ∈ C_j, it is better to cluster v with a vertex in C_j than with a vertex outside C_j.

(2) For every cluster C_j, there exist two vertices a ∈ C_j, b ∉ C_j such that for every vertex c ∈ C_j, there exists a vertex d ∈ C_j with p(c,d) > p(a,b). Intuitively, this partitioning guarantees that there is a certain threshold affinity for each cluster, which is satisfied for the pairs of vertices (v_1, x_1), (v_2, x_2), ..., (v_p, x_p), where C_j = {v_1, ...,
v_p} and x_i ∈ C_j for all 1 ≤ i ≤ p. We now argue these two properties.

(1) For every edge (v,w) with v ∈ C_j and w ∈ C_k, j ≠ k: p_j(v) > p(v,w) or p_k(w) > p(v,w).

Proof: Case 1: Let C_j be constructed before C_k. When vertex w is encountered for the first time in building the cycle C_k, let (v,w) be the largest affinity edge incident on w. In this case, since, when C_j was constructed, an edge (v,u), u ∈ C_j, with affinity greater than (v,w) was selected as a cycle edge, p_j(v) > p(v,w) is true.

Case 2: Let C_k be constructed before C_j. When vertex w is encountered, a cycle edge (w,x), x ∈ C_k, with affinity greater than (v,w) is selected. In this case, therefore, p_k(w) > p(v,w) is true.

(2) For every cluster C_j, there exists a cut edge (a,b), a ∈ C_j, b ∉ C_j, such that every vertex v ∈ C_j belongs to some cycle edge with affinity greater than p(a,b).

Proof: Note that there are r−1 cut edges. Since there is a unique cycle (cluster) that precedes a cut edge, it follows that each cut edge is associated uniquely with
a cluster. By the definition of extension of a cycle, each cut edge has less affinity than all the cycle edges associated with it. For example, in Figure 3-7, suppose that a, b, c, and d formed a cycle and that there was a cut on edge w. If p(w) had been greater than or equal to the minimum of (p(a), p(b), p(c), p(d)), then, by the definition of the possibility of extension, the cycle A,B,C,D would have been extended to the larger cycle A,B,C,D,W.

3.4.3 Examples

We use the same example problems from Navathe et al. [Nava 84] to illustrate how this algorithm works and to compare partitioning decisions. Since our algorithm uses the same attribute affinity matrix, we assume that it has already been computed from the original transaction matrix and the computation of affinities. For ease of understanding, we refer back to the steps of the algorithm in Chapter 3.4.1. The attribute affinity matrix of the first example is shown in Figure 3-3, and its affinity graph, after excluding the zero-valued edges, appears in Figure 3-4. Suppose we start at node 9 (step 2); then, by the algorithm, edges 9-3, 9-2, and 2-8 are selected in order (step 3). At this point, edge 8-9 cannot form a cycle, because it does not satisfy the possibility of a cycle (step 3). Thus edge 8-3 is selected as the next edge, and it forms a candidate partition (step 4).
Note that node 3 becomes a cycle node (step 4). The process is then continued, and edge 8-7 is selected (step 3). Since there is a candidate partition, the possibility of extension is checked (step 5.1). The cycle 9,3,2,8 is considered a partition, because edge 8-7 (the edge being considered) and edge 3-7 (the cycle completing edge) both have lower affinity than any of the cycle edges (step 5.1). The relevant part of the graph is shown again in Figure 3-8. As Figure 3-8 shows, our algorithm generates three affinity cycles separated by edges 3-4 and 7-8. They generate three fragments: (1,5,7), (2,3,8,9), (4,6,10). From that figure, we know that the result of our algorithm is the same as that of Navathe et al. [Nava 84]. We conjecture that this algorithm does not depend upon the starting node. For example, let us start at node 1. By the algorithm, the first affinity cycle is not formed until edges 1-5, 5-7, 7-8, 8-2, 2-9, and 9-3 are selected. The first cycle 8,2,9,3 is identified as a candidate partition, and node 8 becomes a cycle node. Then a cut occurs on edge 3-4, because neither edge 3-4 nor edge 4-8 satisfies the possibility of extension of the cycle (step 5.1). At this point, since there is a former edge, we change the cycle node to node 3 and check for the possibility of extension of the cycle by the former edge 7-8 (step 5.2). Another cut then occurs on edge 7-8, because edges 7-8 and 7-3 both have lower affinity than any of the cycle edges. Figure 3-9 shows this result. Thus we can
conclude that the resulting fragments are always the same, irrespective of the node from which we start. The second example is a global relation with 20 attributes and 15 transactions. The result of Navathe et al. [Nava 84] partitions this relation into four fragments in three iterations: (1,4,5,6,8), (2,9,12,13,14), (3,7,10,11,17,18), (15,16,19,20). Our algorithm, however, generates five fragments in one iteration, as shown in Figure 3-10: (1,5,8), (4,6), (2,9,12,13,14), (3,7,10,11,17,18), (15,16,19,20). Note that the algorithm starts from node 1 and that the cut of edge 3-2 is performed earlier than that of edge 4-7. This result shows that our algorithm can find one more possibility of partitioning: what the empirical objective function could not discriminate as a potential partitioning in [Nava 84] is actually detected by our procedure.

3.4.4 Complexity of the Algorithm

Now we consider the computational complexity. Step 1 does not affect the computational complexity, because the attribute affinity matrix itself can be used as the symmetric matrix L. The repeat loop in the detailed description is executed n−1 times, where n denotes the number of attributes. At each iteration, selection of the next edge takes time O(n). Likewise, whether a cycle exists or not can be determined in time O(n) by scanning the vector B. Thus the algorithm takes time O(n²), which is less than that of [Nava 84], namely O(n² log n).
3.5 Application and Extension

This algorithm can be used effectively for vertical partitioning because it overcomes the shortcomings of binary partitioning, and it needs no complementary procedures such as the SHIFT procedure and the CLUSTER procedure used in [Nava 84]. Furthermore, the algorithm involves no arbitrary empirical objective functions to evaluate candidate partitions such as those used in [Nava 84]. This algorithm can also be used in place of the first phase of [Nava 84], without the use of any cost factors; the second phase, involving the cost-optimized vertical partitioning approach, can still be applied to refine the results of the first phase. Another important application of this algorithm is the mixed partitioning methodology that will be proposed in Chapter 5. The mixed partitioning methodology will first generate a grid for a relation vertically and horizontally, and then merge cells as much as possible by using a cost function for determining a fragment. This algorithm can be further enhanced to address the problem of primary/secondary memory partitioning, or partitioning in the context of any memory hierarchy. By combining it with the MULTI_ALLOCATE algorithm in [Nava 84], this algorithm can be used to achieve the allocation of vertical fragments over a network.
Transaction  Attributes               Predicates  Accesses per time period
T1           a1,a5,a7                 p1,p7       25
T2           a2,a3,a8,a9              p2,p7       50
T3           a4,a6,a10                p3,p7       25
T4           a2,a7,a8                 p4,p8       35
T5           a1,a2,a3,a5,a7,a8,a9     p5,p8       25
T6           a1,a5                    p6,p8       25
T7           a3,a9                    p5,p8       25
T8           a3,a4,a6,a9,a10          p6,p8       15

Figure 3-1 Transaction specifications

Attributes:  1  2  3  4  5  6  7  8  9  10   Type   Accesses per time period
T1           1  0  0  0  1  0  1  0  0  0    R      Acc 1 = 25
T2           0  1  1  0  0  0  0  1  1  0    R      Acc 2 = 50
T3           0  0  0  1  0  1  0  0  0  1    R      Acc 3 = 25
T4           0  1  0  0  0  0  1  1  0  0    R      Acc 4 = 35
T5           1  1  1  0  1  0  1  1  1  0    U      Acc 5 = 25
T6           1  0  0  0  1  0  0  0  0  0    U      Acc 6 = 25
T7           0  0  1  0  0  0  0  0  1  0    U      Acc 7 = 25
T8           0  0  1  1  0  1  0  0  1  1    U      Acc 8 = 15

Figure 3-2 Attribute usage matrix
Attributes:   1    2    3    4    5    6    7    8    9   10
 1           75   25   25    0   75    0   50   25   25    0
 2           25  110   75    0   25    0   60  110   75    0
 3           25   75  115   15   25   15   25   75  115   15
 4            0    0   15   40    0   40    0    0   15   40
 5           75   25   25    0   75    0   50   25   25    0
 6            0    0   15   40    0   40    0    0   15   40
 7           50   60   25    0   50    0   85   60   25    0
 8           25  110   75    0   25    0   60  110   75    0
 9           25   75  115   15   25   15   25   75  115   15
10            0    0   15   40    0   40    0    0   15   40

Figure 3-3 Attribute affinity matrix
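The entries of Figure 3-3 follow mechanically from the usage matrix and access frequencies of Figure 3-2. A minimal sketch (Python is used here purely for illustration; it is not part of the original design):

```python
# Attribute usage matrix of Figure 3-2: rows are transactions T1..T8,
# columns are attributes 1..10; acc[k] is the access frequency of Tk.
use = [
    [1, 0, 0, 0, 1, 0, 1, 0, 0, 0],  # T1
    [0, 1, 1, 0, 0, 0, 0, 1, 1, 0],  # T2
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 1],  # T3
    [0, 1, 0, 0, 0, 0, 1, 1, 0, 0],  # T4
    [1, 1, 1, 0, 1, 0, 1, 1, 1, 0],  # T5
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # T6
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 0],  # T7
    [0, 0, 1, 1, 0, 1, 0, 0, 1, 1],  # T8
]
acc = [25, 50, 25, 35, 25, 25, 25, 15]

n = 10
# aff[i][j] sums acc_k over transactions using both i and j;
# the diagonal aff[i][i] sums acc_k over transactions using i.
aff = [[sum(a for row, a in zip(use, acc) if row[i] and row[j])
        for j in range(n)] for i in range(n)]
```

For instance, aff[0][4] = 75 (attributes 1 and 5 co-occur in T1, T5, and T6) and the diagonal entry aff[6][6] = 85 (attribute 7 is used by T1, T4, and T5), matching Figure 3-3.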
Figure 3-4 Affinity graph after excluding zero-valued edges
Figure 3-6 Producing a partition
Figure 3-7 Using an affinity cycle as a means for reasonable partitioning
Figure 3-8 Result of the first example starting at node 9
Figure 3-9 Result of the first example starting at node 1
Figure 3-10 Result of the second example
CHAPTER 4
A HORIZONTAL PARTITIONING ALGORITHM

4.1 Overview

Horizontal partitioning is the process that divides a global relation into subsets of tuples, called horizontal fragments [Ceri 82, Cer 83a, Ceri 84]. Ceri, Negri, and Pelagatti [Ceri 82] analyze the horizontal partitioning problem, dealing with the specification of partitioning predicates. Ceri, Navathe, and Wiederhold [Cer 83b] propose an optimization model for designing distributed database schemas with all meaningful candidate horizontal partitionings. In the present research, we focus our attention on identifying all the candidate horizontal partitions. For this we propose a horizontal partitioning methodology that uses the MAKE_PARTITION algorithm presented in Chapter 3. In order to use the MAKE_PARTITION procedure of our vertical partitioning algorithm, we consider only those transactions whose processing frequency is large. These transactions access tuples of the relations based on some predicates, called simple predicates [Ceri 82]. The syntax for a simple predicate is as follows:
simple_predicate ::= attribute_name <op> attribute_name | attribute_name <op> value
<op>             ::= < | > | <= | >= | = | !=
attribute_name   ::= string
value            ::= element of the domain of the attribute

A transaction gives rise to a set of simple predicates. The WHERE clause of a database operation (say, SELECT) may consist of a disjunction, conjunction, or negation of a set of simple predicates. Simple predicates are easy to handle and understand. Moreover, a simple predicate splits a relation into two horizontal fragments: one consists of those tuples that satisfy the simple predicate, and the other of those tuples that do not. The correctness of fragmentation requires that each tuple of the global relation be selected into one and only one fragment. Another consideration in horizontal partitioning is derived partitioning, as stated in [Ceri 84]; the predicates which give rise to derived partitioning (we call them derived predicates) should be considered in the same way as simple predicates. We limit the scope of this dissertation by assuming that all simple and derived predicates are determined in advance. As explained earlier, the focus of this research is on single-relation fragmentation. Hence join predicates of the
form R1.A=R2.B, which deal with a pair of relations, are not considered for horizontal partitioning. They are very much a part of the allocation phase and will be considered there for minimizing the effort and cost of joins. In contrast, horizontal partitioning approaches used in systems like Bubba [Cope 88] or Gamma [DeW 86] attempt to achieve parallelism of join queries, etc. In our decomposition of the overall distribution problem, join predicates are in fact considered, but at a later stage.
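As a concrete illustration of the one-predicate split described above (the toy data are ours, not the dissertation's): a single simple predicate induces exactly two horizontal fragments, and every tuple lands in exactly one of them.

```python
# Toy relation: tuples of (D#, SAL). The simple predicate SAL > 40 (in K)
# splits the relation into two horizontal fragments that together cover
# every tuple exactly once.
relation = [(5, 55), (12, 38), (25, 41), (44, 30)]

pred = lambda t: t[1] > 40            # simple predicate: SAL > 40K
frag_true = [t for t in relation if pred(t)]
frag_false = [t for t in relation if not pred(t)]

# Correctness of fragmentation: each tuple is in one and only one fragment.
assert sorted(frag_true + frag_false) == sorted(relation)
```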
46 As the previous work [Cer 83b, Nava 84] pointed out, it is not necessary to collect information on 100% of the expected transactions (that would of course be impossible) . Since the 8020 rule applies to most practical situations, it is adequate to supply information regarding the 20% of the heavily used transactions which account for about 80% of the activity against the database. The algorithm that we propose starts from the predicate usage matrix. The predicate usage matrix represents the use of predicates in important transactions. The predicate usage matrix for the above example (8 predicates and 8 transactions) is shown in Figure 41. Each row refers to one transaction: the "1" entry in a column indicates that the transaction uses the corresponding predicates. Whether the transaction retrieves or updates the relation can also be captured by another column vector with R and U entries for retrieval and update. Predicate affinity is generated in a similar manner as attribute affinity. Figure 42 shows a predicate affinity matrix generated from the predicate usage matrix in Figure 41. The numerical value of the (i,j) element in this matrix gives the combined frequency of all transactions accessing both predicates i and j and is obtained the same way as in vertical partitioning in Chapter 3. The value "==>" of the (i,j) element indicates that predicate i implies predicate j, and the value Â•Â•*" represents the close usage of predicates. Two predicates i and j are "close" when the following
47 conditions are satisfied: (1) i and j must be defined on the same attribute, (2) i and j must be used jointly with some common predicate c, (3) c must be defined on an attribute other than the attribute used in i and j . This is reasonable because predicates i, j, and c are different from one another and thus two fragments generated by predicates i,c and predicates j,c are considered "closely related" since they both involve predicate c. In the above example, pi and p2 are "close" because of their usage with the common predicate p7 in transactions Tl and T2. These two relationships are introduced to represent logical connectivity between predicates. The attractive features of this approach are as follows: (1) Fragments are based on actual predicates; by applying implication between predicates, the number of fragments is reduced, (2) We can make use of the predeveloped algorithm in Chapter 3, (3) By using clustering of predicates suggested above, a relatively small number of horizontal fragments are generated, (4) 01 integer formulation is not needed. 4.3 The Alaorithm First a modified version of the graphical partitioning algorithm for clustering predicates will be given. Then by
48 using this clustering, the entire steps for horizontal fragmentation will be described. 4.3.1 Clustering of Predicates We will apply the graphical algorithm in Chapter 3 to horizontal partitioning. However, we cannot use this algorithm directly because the predicate affinity matrix may be too sparse. Thus we introduce two more relationships which represent logical connectivity between predicates. They are "==>" for implication and for close usage as explained in Chapter 4.2. To obtain the modified version of the graphical partitioning algorithm, the following heuristic rules are applied. (1) A numerical value (except zero) has higher priority than the values "==>", Â•Â•<==" and "*" when selecting a next edge. This is because we place more importance on affinity values which are obtained from transaction usage rather than on logical connectivity among the predicates. (2) In the comparisons involved in checking for the possibility of a cycle or extension of a cycle, we ignore cycle edges with affinity values "==>", "<==Â•Â• and "*". For example, in Figure 43, in comparing edge (p4,p8) with edges of the cycle (p8,p5,p6) we ignore edge (p5,p6) which has affinity "*". This is because
the affinity values Â•Â•==>", "<==Â•Â• and "*" represent implicit relationships and therefore can be collapsed. (3) 'Â•==>Â•Â• and "<==Â•Â• are considered to have higher affinity value than "*" since the former indicates direct implication, whereas the latter represents only logical connectivity between the two predicates through their usage with a common predicate. (4) If there are two "==>"s in a column corresponding to predicate p^, one implied by predicate p; and another implied by predicate pj, then the entry (i,k) has higher priority than the entry (j,k), either if the entry (i,j) is equal to Â•<==", or if the entry (j,i) is equal to "==>". In other words. Pi => Pj => Pk (a) p. => p. => (b) in the above implication, if (a) holds, then j has higher priority than i, and if (b) holds, then i has higher priority than j . These rules can be easily incorporated into the graphical algorithm in Chapter 3. For example, if we assume that every affinity value in predicate affinity matrix is greater than 2, then we can assign value 1 for Â•'*Â•Â• and 2 for Â•Â•==>Â•Â• to represent rules 1 and 3. The detailed description of the modified algorithm now follows. We use the same data structures and variables as in vertical partitioning. The meanings of variables are available from Chapter 3.
50 procedure Make_Hori_Partition(L[l. .n, 1. .n] ) : set of edges { assign value 1 for "*" and 2 for "==>Â•Â• respectively in a predicate affinity matrix} { initialize flags and variables } B[l] <1 fl <1, f2 <0 { fl & f2 each refer to an end of the spanning tree } p_cycle,c_node, f_edge,candidate_p max and (strongest[ j ] = fl or strongest[j] = f2) then max 0 then f_edge former_edge_wt or f_edge = false then candidate_p
51 then { partition exists } reinitialize variables if f_edge = false then save this partition else change the cycle node if former_edge_wt < pmin then save this partition remove k from B else extend the cycle else extend the cycle { pmin contains the minimvun edge of a cycle } if (maxwt[k] != 1 or maxwt[k] != 2) and pmin > maxwt[k] then pmin maxwt[j] then maxwt[j]
52 considered to minimize the number of predicates. In our example, the first subset {D#<10, D#<20, SAL>40K} is refined into {D#<20, SAL>40K} since D#<10 ==> D#<20. The second subset {D#>20, 3020} since 30 D#>20, but the last one {D#<15, D#>40, SAL<40K} has no change. Note that this optimization can be done before step 1. But in this dissertation, we perform this step here in order to allow more detailed clustering. For example, a pair of predicates such as pi and p2 in which one, (say, pi) , implies another (say, p2) , can be grouped in different clusters in step 1. The three clusters of predicates produced in this step, called "cluster sets", are listed in Figure 43. Step 3 . Compose predicate terms: Corresponding to the cluster sets (Figure 43) produced in step 2, we proceed as follows. The cluster sets are first evaluated to determine "the least common attribute". In our example, since SAL does not appear in cluster set 2 (corresponding to the second cluster of predicates) , it is the least common attribute. Note that D# appears in all three sets. A table called the "predicate term schematic table" is now considered by placing in the first colvimn the chosen attribute with its appropriate ranges to cover that attribute exhaustively. In our example, we create two entries: SAL<40K and SAL>40K
for the SAL attribute. Then we take the next-to-least common attribute and write its appropriate ranges, as they appear in the cluster sets, against each entry of the first column. Note that these ranges may be overlapping. In our example, D# is the next attribute. Its ranges applicable to the cluster sets are: D#<15 OR D#>40, coupled with SAL<40K (from cluster set 3), and D#<20, coupled with SAL>40K (from cluster set 1). The D#>20 predicate appearing in cluster set 2 must be written twice into the table, against each entry for SAL. The resulting predicate term schematic table is shown at the top of Figure 4-4.

Now we construct predicate terms from the above table as follows. Each horizontal entry in the table gives rise to one predicate term. If predicates refer to the same attribute, they are ORed; otherwise they are ANDed. The resulting predicate terms are as follows:

SAL<40K AND D#>20,
SAL<40K AND (D#<15 OR D#>40),
SAL>40K AND D#<20,
SAL>40K AND D#>20.

Step 4. Perform fragmentation: we have one horizontal fragment per predicate term. Thus the number of horizontal fragments will be at most the number of predicate terms plus one, because there is one remaining fragment
called the "ELSE" fragment, which is the negation of the disjunction of the predicate terms. Note that the result of step 4 may be overlapped fragments or nonoverlapped fragments. Chapter 4.4 provides the ADJUST function to obtain nonoverlapped horizontal fragments. This algorithm was implemented in the C language to demonstrate its effectiveness. It should be noted that the complexity of this algorithm is dominated by step 1, and thus will be O(n^2) for n predicates as in the vertical partitioning algorithm in Chapter 3. A smaller value of n indicates a good understanding of the heavily used predicates by users. The smaller the n, the better will be the performance of horizontal partitioning.

4.4 Nonoverlapping Horizontal Partitioning

The result of predicate partitioning may give rise to overlapped horizontal fragments. If nonoverlapped fragments are needed, then we apply the ADJUST function. For example, Figure 4-4 shows the final nonoverlapped fragments obtained by using the ADJUST function from the overlapping fragmentation generated in Chapter 4.3. There may be more than one way of creating nonoverlapped fragments from overlapped fragments. Figure 4-5 shows the three possible ways of dealing with overlapped horizontal fragments. Among these three possible ways, the best way is selected based on the following criteria: (1) transaction processing cost, (2) minimization
of the number of fragments. It may be desirable to choose the one that creates finer granularity for the grid cells. A heuristic procedure for generating nonoverlapped fragments is proposed based on these criteria. Let us define the following set of variables. Here, by "part" we mean a potential horizontal partition (for example, there are three parts in Figure 4-5).

i = 1,...,n denote the parts.
k = 1,...,m denote the transactions accessing the parts.
Cik = cost of accessing part i by transaction k.
Fik = the frequency with which transaction k accesses part i.
U(ij)k = cost of unioning two parts i and j into part ij required by transaction k.
S(i,ij)k = cost of selecting the ith part from the merged part ij by transaction k.

The cost factors would depend on implementation details such as access methods, storage structures, use of special algorithms for duplicate elimination, etc. This level of detail is out of the present scope of our dissertation. Using the above parameters and variables, we develop a heuristic adjusting procedure. The merging occurs if the overall cost is reduced because of some transactions accessing the two parts together.
Procedure ADJUST

Step 1. For each pair of contiguous parts i and j, find the cost of accessing these parts with and without merging respectively.

A. Cost without merging:
   SUM(k=1 to m) Cik*Fik + SUM(k=1 to m) Cjk*Fjk + SUM(k=1 to m) U(ij)k

B. Cost with merging:
   SUM(k=1 to m) C(ij)k*F(ij)k + SUM(k=1 to m) S(i,ij)k

Step 2. Find the two parts i and j that produce the maximum saving if they are merged, and then merge them.

Step 3. After merging two parts, only one new part is generated. Let the set of parts after step (2) be S = {1,...,p}.

Step 4. Then repeat steps (1)-(3) for the parts in set S till no two contiguous parts can be merged.

Note that if no merging occurs, then each part will be a horizontal fragment.
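As a rough illustration, the greedy loop of procedure ADJUST can be sketched in Python. The function name, the tuple representation of parts, and the callable-based cost interface are all illustrative; real cost functions would depend on the implementation details mentioned above.

```python
# A rough sketch of procedure ADJUST: parts are kept in contiguous
# order, and at each iteration the adjacent pair whose merge yields the
# largest positive saving is merged, until no merge reduces total cost.

def adjust(parts, access_cost, union_cost, selection_cost):
    """Greedily merge contiguous parts while merging reduces cost.

    parts                -- ordered list of parts (tuples of labels)
    access_cost(p)       -- SUM(k) C_pk * F_pk for part p
    union_cost(i, j)     -- SUM(k) U_(ij)k for unmerged parts i, j
    selection_cost(i, j) -- SUM(k) S_(i,ij)k for the merged part ij
    """
    parts = [tuple(p) for p in parts]
    while True:
        best_saving, best_idx = 0, None
        for idx in range(len(parts) - 1):
            i, j = parts[idx], parts[idx + 1]
            cost_without = access_cost(i) + access_cost(j) + union_cost(i, j)
            cost_with = access_cost(i + j) + selection_cost(i, j)
            saving = cost_without - cost_with
            if saving > best_saving:
                best_saving, best_idx = saving, idx
        if best_idx is None:        # no merge reduces the cost: stop
            return parts
        parts[best_idx:best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
```

With toy cost functions in which only the first two parts are ever accessed together, the procedure merges exactly that pair and then stops.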
              Predicates                        Number of accesses
              p1 p2 p3 p4 p5 p6 p7 p8   Type   per time period
Transactions
    T1         1  0  0  0  0  0  1  0    R     Acc 1 = 25
    T2         0  1  0  0  0  0  1  0    R     Acc 2 = 50
    T3         0  0  1  0  0  0  1  0    R     Acc 3 = 25
    T4         0  0  0  1  0  0  0  1    R     Acc 4 = 35
    T5         0  0  0  0  1  0  0  1    U     Acc 5 = 25
    T6         0  0  0  0  0  1  0  1    U     Acc 6 = 25
    T7         0  0  0  0  1  0  0  1    U     Acc 7 = 25
    T8         0  0  0  0  0  1  0  1    U     Acc 8 = 15

Figure 4-1 Predicate usage matrix

Predicates    1    2    3    4    5    6    7    8
    1
    2                       (symmetric)
    3         *    *
    4         0    0
    5         0    0    0    *
    6         0    0    0    *    *
    7        25   50   25    0    0    0
    8         0    0    0   35   50   40    0

Figure 4-2 Predicate affinity matrix
CLUSTER 1: D#<10, D#<20, SAL>40K
CLUSTER 2: D#>20, 30<D#<40
CLUSTER 3: D#<15, D#>40, SAL<40K

Cluster sets produced:
{D#<20, SAL>40K}
{D#>20}
{D#<15, D#>40, SAL<40K}

Figure 4-3 Clustering of predicates
PREDICATE TERM SCHEMATIC TABLE

SAL < 40K (p8)    D# > 20 (p3,p4)
                  D# < 15 (p5) OR D# > 40 (p6)
SAL > 40K (p7)    D# < 20 (p1,p2)
                  D# > 20 (p3,p4)
ELSE

                  | ADJUST
                  v

NONOVERLAPPED FRAGMENTS

SAL < 40K AND D# > 20    H1 (p3,p4,p6,p8)
SAL < 40K AND D# < 15    H2 (p5,p8)
SAL > 40K AND D# < 20    H3 (p1,p2,p7)
SAL > 40K AND D# > 20    H4 (p3,p4,p7)
ELSE                     H5

Figure 4-4 Nonoverlapping horizontal fragmentation
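The rule used in step 3 to turn each schematic-table row into a predicate term (OR within an attribute, AND across attributes) can be sketched as follows; the row encoding and the function name are illustrative, not from the dissertation.

```python
# Sketch of step 3's term construction: each row of the predicate term
# schematic table maps an attribute to the predicate strings applicable
# in that row. Predicates on the same attribute are ORed; the
# per-attribute groups are then ANDed.

def predicate_term(row):
    groups = []
    for preds in row.values():
        clause = " OR ".join(preds)
        groups.append("(%s)" % clause if len(preds) > 1 else clause)
    return " AND ".join(groups)

# The four rows of the example table:
rows = [
    {"SAL": ["SAL<40K"], "D#": ["D#>20"]},
    {"SAL": ["SAL<40K"], "D#": ["D#<15", "D#>40"]},
    {"SAL": ["SAL>40K"], "D#": ["D#<20"]},
    {"SAL": ["SAL>40K"], "D#": ["D#>20"]},
]
terms = [predicate_term(r) for r in rows]
```

Applied to the four rows above, this reproduces the four predicate terms given in the text.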
(Diagram: fragments F1 and F2 with overlapping part P; details not recoverable from the scan.)

Figure 4-5 Three possible ways of dealing with overlapped horizontal fragments
(a) F2 includes P  (b) P is separated  (c) F1 includes P
CHAPTER 5
A MIXED PARTITIONING METHODOLOGY

Partitioning of a global relation in a distributed database can be performed in two different ways: vertical partitioning and horizontal partitioning. The current literature has addressed vertical and horizontal partitioning independently. However, since database users usually access data subsets that are both vertical and horizontal partitions of global relations, there is a definite need for research on mixed partitioning. Presently there is little, if any, work on the investigation of the effects of the different sequences in which the vertical partitioning, horizontal partitioning, and mixed partitioning problems can be solved.

5.1 Proposed Methodology

As shown in Figure 1-2, the allocation problem is considered as a consequence of partitioning in our research, and thus it will be studied after the partitioning problem. The mixed fragmentation approach consists of deriving a distribution design by taking into consideration the cost of processing the distributed transactions. We shall use the vertical partitioning and horizontal partitioning algorithms
presented in Chapters 3 and 4 to generate a grid consisting of grid cells. These grid cells are candidate fragments as they are based on the affinity between the transactions and the data stored in the database. These grid cells may be small, and hence a transaction needs to access a set of grid cells in order to process the data. Hence, a set of grid cells in its most refined form may not give optimal transaction processing performance. That is, the grid cells may need to be joined for the transaction to process the data. In order to minimize the number of fragments that need to be accessed by a transaction, the grid cells are considered for merging. The two major phases in the proposed mixed partitioning methodology are as follows.

GRID CREATION: This phase is composed of two stages: the vertical and the horizontal partitioning stages for grid creation. The output of the GRID CREATION is a grid corresponding to a global relation. The grid suggests all possible ways in which the global relation in a distributed database may be partitioned. Each element of the grid is called a grid cell.

GRID OPTIMIZATION: After making a grid, the GRID OPTIMIZATION performs merging of grid cells as much as possible according to the merging procedures. The merging of the grid cells is an anti-fragmentation procedure. Having created the cells in a top-down fashion as the ultimate smallest fragments of a relation, we now
consider whether we should combine them in a bottom-up fashion. Merging is considered desirable if it reduces the overall transaction processing cost. In this dissertation two types of merging sequences are considered: vertical merging followed by horizontal merging, and horizontal merging followed by vertical merging. Thus we can calculate the cost of transaction processing at the end of each sequence respectively and then select the better of the two end results. We call the result of this phase mixed fragments. Based on Figure 1-2, the proposed methodology for mixed partitioning using a grid can be described in the following steps.

Procedure MIXED_PARTITIONING

Step 1. Specification of inputs: In this step, the following inputs are specified.
(1) Schema information: relations, attributes, cardinalities, attribute sizes, etc.
(2) Transaction information: name, frequency, attribute usage, etc. The attribute usage matrix is a matrix containing transactions as rows and attributes as columns. Element (i,j) = 1 if transaction i uses attribute j, else it is 0.
(3) Distribution constraints: any predetermined partitions or fixed allocation of data.
(4) System information: number of sites, transmission costs, etc. This information is used particularly to solve the allocation problem.

Step 2. Vertical partitioning for grid: In this step all candidate vertical fragments are determined. We use the graphical algorithm proposed in Chapter 3 for generating all fragments in one iteration.

Step 3. Horizontal partitioning for grid: In this step, all candidate horizontal fragments are determined. An algorithm for this step is described in Chapter 4. Note that the sequence of steps 2 and 3 can be changed.

Step 4. Merging fragments in different sequences: In this step, cells are merged as much as possible based on cost functions. The merging is carried out in two possible sequences: VH (vertical merging followed by horizontal merging) or HV (horizontal merging followed by vertical merging).

Step 5. Selection of the better sequence: Between the two different sequences, the better one is selected based on the estimation of the transaction processing cost. This selected sequence determines the final fragments of the global relation.
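As a small illustration of the step 1 inputs, the attribute usage matrix can be encoded directly from the transaction descriptions; all attribute and transaction names below are made up.

```python
# Illustrative encoding of the attribute usage matrix (hypothetical
# data): transactions are rows, attributes are columns, and element
# (i, j) is 1 iff transaction i uses attribute j.

attributes = ["a1", "a2", "a3", "a4"]
transactions = {
    "T1": {"attrs": {"a1", "a3"}, "frequency": 25},
    "T2": {"attrs": {"a2", "a4"}, "frequency": 50},
}

usage_matrix = {
    name: [1 if a in info["attrs"] else 0 for a in attributes]
    for name, info in transactions.items()
}
```

The same tabular form serves as input to both the vertical and the horizontal partitioning stages.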
5.2 Grid Creation

In Chapters 3 and 4, we have seen the vertical and horizontal partitioning schemes based on the attribute and predicate affinities. We can now construct a grid consisting of cells by simultaneously applying the vertical partitioning and horizontal partitioning algorithms on the relation. Each of these cells belongs to one vertical fragment and one horizontal fragment. Note that the cells generated by both methods are the same. We shall use these cells for the merging procedure or grid optimization.

5.3 Grid Optimization

Based on the cost functions we want to merge cells as much as possible in order that the total cost of transaction processing (as seen from the standpoint of the one relation in question) will be minimized. Two kinds of merging are defined: horizontal merging and vertical merging. Horizontal merging deals with merging of the cells in the same row of the grid. In horizontal merging, the total number of ways of merging fragments is SUM(i=1 to n) C(n,i) = 2^n - 1, where C represents combination and n represents the number of fragments, because the sequence of attributes has no meaning in a relation. In our approach, however, we can minimize the possible ways of combinations of horizontal merging by using the ordered
sequence of fragments generated in vertical partitioning. Thus, given n candidate horizontal fragments, a total of (n-1)+(n-2)+...+1 = n(n-1)/2 merging possibilities are generated. This is because, in the linearly connected spanning tree in vertical partitioning, since a cut edge between two vertical fragments is a bridge that has the maximum affinity value among all connectable edges, we can say that a fragment is more closely related to contiguous fragments compared to noncontiguous fragments. For example, in Figure 3-8, if merging is needed, fragment (1,5,7) will be merged with fragment (2,3,8,9), because fragment (1,5,7) is more related to fragment (2,3,8,9) in terms of transaction access than to fragment (4,6,10). This allows us to develop a heuristic procedure for horizontal merging. In vertical merging, cells in the same vertical column of the grid may be merged to produce larger fragments. Since the same graphical partitioning approach is used for horizontal partitioning, the total number of possible ways of vertical merging can be minimized in the same way as in horizontal merging. In this dissertation, we consider both merging sequences: the VH sequence (performing vertical merging followed by horizontal merging) and the HV sequence (performing horizontal merging followed by vertical merging).
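The restriction to contiguous fragments can be seen with a short sketch (fragment names are made up): only runs of adjacent fragments are merge candidates, so n fragments give n(n-1)/2 possibilities instead of 2^n - 1.

```python
# Enumerate the contiguous merge candidates for an ordered fragment
# sequence: every run of two or more adjacent fragments, i.e.
# (n-1) + (n-2) + ... + 1 = n(n-1)/2 candidates for n fragments.

def contiguous_merge_candidates(fragments):
    n = len(fragments)
    return [tuple(fragments[i:j]) for i in range(n) for j in range(i + 2, n + 1)]

candidates = contiguous_merge_candidates(["V1", "V2", "V3", "V4"])
# 4 fragments -> 4*3/2 = 6 candidate merges, e.g. ("V1", "V2") and
# ("V2", "V3", "V4"), but never the non-contiguous pair ("V1", "V3").
```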
5.3.1 A Heuristic Procedure for Horizontal Merging

It should be noted that during vertical partitioning in Chapter 3 we did not take transaction processing cost into consideration, but only attribute affinity values. To that extent this is an "intuitive" affinity-based partitioning. Horizontal merging results in bringing together tuples of the same id or key from two fragments. Thus for a transaction requiring two or more fragments, some saving accrues. This is a form of join where a one-to-one "matching" operation among horizontal fragments occurs based on key matching. In horizontal merging, as noted before, a fragment is merged with a contiguous fragment in the spanning tree first and then extended recursively to a larger fragment if possible. We give below a heuristic procedure for horizontal merging. This procedure is iterative and at each iteration all the contiguous cells are considered for the possibility of merging. The two cells which produce the maximum saving by merging are merged. This procedure stops when no two contiguous cells can be merged. We assume that the costs and frequencies considered in the formulation can be estimated. For cost estimation we use information inherent to the data and its use which is system independent, such as the length of attributes accessed and the number of tuples accessed. This approach is similar to the "bottom-up" approach used by Hammer and Niamir [Hamm 79]. They started with single attributes as
the extreme vertical fragments and combined them into pairs, triplets, etc., successively. The parameters and notations for the horizontal merging are given as follows.

i = 1,...,n denote the horizontal cells.
k = 1,...,m denote the transactions accessing the horizontal cells.
Cik = cost of accessing horizontal cell i by transaction k.
Fik = the frequency with which transaction k accesses horizontal cell i.
J(ij)k = cost of matching horizontal cells i and j into cell ij required by transaction k. This matching cost takes into consideration the frequency of transaction accesses to cells i and j, and the attributes accessed from the i and j "parts" of the cell ij. This cost is dependent on the transaction k.
P(i,ij)k = cost of projection of the ith cell from the merged cell ij by transaction k.

Then, the heuristic procedure for horizontal merging is formulated by using a greedy method as follows.

Procedure HORIZONTAL_MERGING

Step 1. For each pair of horizontally contiguous cells i and j, find the cost of accessing these cells with and without merging respectively.
A. Cost without merging:
   SUM(k=1 to m) Cik*Fik + SUM(k=1 to m) Cjk*Fjk + SUM(k=1 to m) J(ij)k

B. Cost with merging:
   SUM(k=1 to m) C(ij)k*F(ij)k + SUM(k=1 to m) P(i,ij)k

Step 2. Find the two cells i and j that produce the maximum saving if they are merged, and then merge them.

Step 3. After merging two cells horizontally, only one new cell is generated. Let the set of resultant merged cells after step (2) be I = {1,...,p}.

Step 4. Then repeat steps (1)-(3) for the cells in set I till no two contiguous cells can be merged.

Note that this merging procedure is obviously dominated by step 4. Other steps take a time of O(n), where n represents the number of fragments. Thus the complexity of this merging procedure is O(n^2).

5.3.2 A Heuristic Procedure for Vertical Merging

Vertical merging is very similar to horizontal merging. Note that we performed graphical top-down decomposition based on the notion of affinities. But affinity is a heuristic. Cost-based optimization gives further improvement. We achieve this by merging. The merging occurs if the overall cost is reduced because of some transactions accessing the two cells together. Let us define the following set of variables:
i = 1,...,n denote the vertical cells.
k = 1,...,m denote the transactions accessing the vertical cells.
Cik = cost of accessing vertical cell i by transaction k.
Fik = the frequency with which transaction k accesses vertical cell i.
U(ij)k = cost of union of two vertical cells i and j into cell ij required by transaction k.
S(i,ij)k = cost of selection of the ith cell from the merged cell ij by transaction k.

Using the above parameters and variables, we develop a heuristic procedure for vertical merging in a similar manner as in the horizontal merging.

Procedure VERTICAL_MERGING

Step 1. For each pair of vertically contiguous cells i and j, find the cost of accessing these cells with and without merging respectively.

A. Cost without merging:
   SUM(k=1 to m) Cik*Fik + SUM(k=1 to m) Cjk*Fjk + SUM(k=1 to m) U(ij)k

B. Cost with merging:
   SUM(k=1 to m) C(ij)k*F(ij)k + SUM(k=1 to m) S(i,ij)k
Step 2. Find the two cells i and j that produce the maximum saving if they are merged, and then merge them.

Step 3. After merging two cells vertically, only one new cell is generated. Let the set of resultant merged cells after step (2) be I = {1,...,p}.

Step 4. Then repeat steps (1)-(3) for the cells in set I till no two contiguous cells can be merged.

Note that this merging procedure is obviously dominated by step 4 as in horizontal merging. Other steps take a time of O(n), where n represents the number of fragments. Thus the complexity of this merging procedure is also O(n^2).

5.3.3 An Example

In our example we use an access factor a that reflects the cost of processing a merged fragment relative to the total cost of accessing its constituent cells if they are not merged together. Thus a is defined as the ratio of the cost of accessing a merged fragment to the cost of accessing its constituent grid cells. Detailed estimation of a would depend upon implementation factors, particularly the access structures such as indices and the access methods used. In Figure 5-1 we show the mapping of the transactions to the grid cells. Note that some of the grid cells are not accessed by the set of most important transactions, but may be accessed by other transactions. The mapping is done by mapping attributes and the predicates of the transactions with
the attributes and the predicates forming the grid cells. For example, transaction T4 accesses attributes a2, a7 and a8, and is based on predicates p4 and p8, whereas the grid cell G1 is formed of attributes a1, a5 and a7, and predicates p3, p4, p6 and p8. Hence transaction T4 accesses the cell G1. Figure 5-2 shows the costs of accessing each of the horizontal fragments H1, H2, H3, H4 and H5. They are 100, 150, 200, 75 and 125 respectively. The lengths of the attributes of the vertical fragments are 14, 20 and 16 respectively. Note that we assume a linear cost access model in which the cost of accessing a grid cell is proportional to the length of the attributes in that cell. Hence the cost of accessing a single grid cell G5 is C5 = 150 * 20/(14+20+16) = 60.

Now we give an example that shows how two contiguous cells are merged. In the following illustration, we use the join cost factor as a variable to show how merging is dependent upon the join cost. Let us consider the two contiguous cells G1 and G2 in Figure 5-2. The parameter a is assumed to be 1.2 and the projection cost is assumed to be 0.1*J.

Cost without merging:
   SUM(k=1 to 8) F1k*C1k = F14*C14 + F16*C16 = 35 * 28 + 25 * 28 = 1680
   SUM(k=1 to 8) F2k*C2k = F24*C24 + F28*C28 = 35 * 40 + 15 * 40 = 2000
   SUM(k=1 to 8) J(12)k = 35 * J12

Note that J12 is the join cost for the grid cells G1 and G2; only transaction T4, which accesses the two cells together, contributes to it, with frequency 35.

Cost with merging:
   SUM(k=1 to 8) F(12)k*C(12)k = (35 + 25 + 15) * 81.6 = 6120,
   where C12 = a * (C1 + C2) = 1.2 * (28 + 40) = 81.6
   SUM(k=1 to 8) F1k*P(1,12)k + SUM(k=1 to 8) F2k*P(2,12)k = 25 * 0.1*J + 15 * 0.1*J = 4*J

Thus we can conclude that the two cells are merged if 1680 + 2000 + 35*J > 6120 + 4*J, that is if 31*J > 2440 or J > 78.7. Figure 5-3 shows one of the possible results of grid optimization. The cost factor J generally would depend upon a variety of factors including join selectivity, implementation details, and join algorithms.

5.3.4 Result Comparison by Total Cost Computation

The above two procedures for horizontal and vertical merging are independent and can be applied in any sequence. The end result of both can be subjected to a cost computation
against the given set of transactions to pick the better of the two sequences (i.e., HV and VH). The total cost for the process of merging cells is thus calculated in the following way:

   Ct = SUM(i=1 to n) SUM(k=1 to m) COSTik

where
   n = number of final fragments
   m = number of transactions accessing the final fragments
   Ct = total cost
   COSTik = cost for accessing a resulting fragment i (which is given at the end of the merging procedure) by transaction k.

Note that at the end of this step we have achieved a fragment configuration which could not be obtained by either an HV or VH sequence of partitioning independently. The merging steps are necessary to achieve the meaningful combinations of grid cells to minimize the total processing cost. The resulting set of cells is then evaluated together with allocation (with and without redundancy) later.

5.4 Effect of Changes in Data and Transactions

In the case of new transactions we have to do all the five steps given in the MIXED_PARTITIONING procedure in Chapter 5.1. In the case of the same schema and the same transactions, but new data tuples, only the cost parameters will be affected for the grid optimization part. Hence only
step 4 of the MIXED_PARTITIONING procedure has to be redone for merging the fragments. This is an advantage of this methodology. In the case of adding new relations to the schema, if existing transactions are not modified to access these new relations, an analysis based on the old set of transactions will not reveal any new result. However, if we incorporate new transactions that use the new relations, we have to consider all the five steps of the MIXED_PARTITIONING procedure.
(Figure content not recoverable from the scan.)

Figure 5-1 Mapping of the transactions to the grid cells
    Cost    Attributes length
              14     20     16
    100      G1     G2     G3
    150      G4     G5     G6
    200      G7     G8     G9
     75     G10    G11    G12
    125     G13    G14    G15

Ci = cost of accessing the corresponding horizontal fragment * L(attributes of the corresponding vertical fragment) / L(all attributes)

e.g. C5 = 150 * 20 / (14+20+16) = 60

Figure 5-2 Cost model for merging grid cells
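The G1-G2 merging example worked in the text can be replayed with this cost model; the sketch below only reproduces the arithmetic of the example (frequencies 35, 25 and 15, access factor 1.2, projection cost 0.1*J per access).

```python
# Replaying the G1-G2 merge decision under the linear cost model:
# Ci = (horizontal fragment cost) * (vertical length) / (total length).

lengths = {1: 14, 2: 20, 3: 16}        # vertical fragment lengths
total_len = sum(lengths.values())      # 14 + 20 + 16 = 50
h1_cost = 100                          # cost of horizontal fragment H1

c1 = h1_cost * lengths[1] / total_len  # 28.0
c2 = h1_cost * lengths[2] / total_len  # 40.0
c12 = 1.2 * (c1 + c2)                  # 81.6, with access factor a = 1.2

# One transaction (freq 35) accesses both cells; the others access a
# single cell with frequencies 25 (G1) and 15 (G2).
cost_without = 35 * c1 + 25 * c1 + 35 * c2 + 15 * c2  # 1680 + 2000, plus 35*J
cost_with = (35 + 25 + 15) * c12                      # 6120, plus 4*J projections

# Merge iff 3680 + 35*J > 6120 + 4*J, i.e. J > 2440/31 (about 78.7):
threshold = (cost_with - cost_without) / (35 - 4)
```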
(Diagram not fully recoverable from the scan; it shows fragments F1-F6 and cells X1-X7.)

Figure 5-3 A possible result of grid optimization
CHAPTER 6
AN ALLOCATION ALGORITHM FOR THE MIXED FRAGMENTS

In a distributed database system, a user accesses the system through a local processor. The system may make local and remote accesses functionally transparent. Thus data allocation is one of the most important distributed database issues [Cer 83b, Corn 88]. The problem of data allocation in distributed systems has been extensively studied together with the file assignment problem. Even though closely related to file assignment, data allocation differs greatly from it because, in distributed database systems, the way the data are accessed is far more complex [Aper 88]. Data allocation is the process of mapping each logical fragment to one or more sites. Previous research in this area has been performed in two ways: data allocation by itself, and extension of the pure data allocation problem by including the network topology and communication channels in the decision variables. This dissertation is concerned with pure data allocation in which the unit of allocation is the mixed fragment which comes from our mixed partitioning procedure in Chapter 5, which considers both horizontal partitioning and vertical partitioning simultaneously. The objective of our
approach is to allocate the mixed fragments nonredundantly with minimum distributed transaction processing cost. To achieve this we present some heuristics that combine fragment allocation and transaction processing strategy. Based on the heuristics we develop an algorithm not by using the 0-1 integer programming method, but by using a graphical method called the allocation-request graph. Note that in mixed fragment allocation there may be many different types of fragments. Thus we define three more operations in addition to the join operation for these mixed fragments. This approach begins from the transaction analysis which produces operation trees and a fragment usage matrix. Then, by using this fragment usage matrix, an allocation-request graph is generated, and finally fragments are allocated according to the heuristic allocation procedure.

6.1 Distributed Transaction Processing

In data allocation problems, data should be locally processed by applying the conditions of the query before any transmission. The selections and projections, which do not require data movement, should be performed locally. The really difficult problem of distributed query optimization is the execution of joins. When the unit of allocation is the mixed fragment, since there are no restrictions on the type of fragments, in addition to the join operation, there are other operations
which require data movement: outerjoin, union, and outerunion. We call them multifragment operations. Outerjoin is introduced for keeping all tuples in fragments regardless of whether or not they have matching tuples in the other fragment, and outerunion is introduced for keeping non-union-compatible attributes. In a distributed system there are several factors that must be taken into account, such as link cost, data transmission cost, storage cost, I/O cost, and CPU cost. Among them, the first and most important factor to be considered in distributed environments is the cost of transferring data over the network [Wied 87]. Thus, in this dissertation we minimize the amount of data transmission, which we call the total transmission cost, while satisfying the CPU cost constraint at each site. The details will be explained in Chapter 6.2. Note that we assume that the link transmission costs between any two nodes are the same. When a multifragment operation is related to fragments that reside at different sites, the query processing strategy on site selection for performing this operation can have a significant impact on data movement. There are basically two alternatives for the query processing strategy [Elma 89]. One is the site-of-origin strategy in which all multifragment operations are performed at the site where the query was submitted; the other is the move-small strategy in which, for each multifragment operation, the smaller fragment
participating in the operation is always sent to the site where the relation of larger size resides in order to perform the multifragment operation. The move-small strategy was shown to be superior to the site-of-origin strategy [Corn 88]. We will illustrate this with a simple example query. Suppose fragments F1 and F2 are as shown in Figure 6-1. Consider the query Q: "For each employee retrieve the employee name and salary". Each transaction against a relational database can be decomposed into a sequence of relational algebra operations. Suppose the query was submitted at a distinct site 3. Neither the F1 nor the F2 fragment resides at site 3. There are two simple strategies for executing the distributed query:

(1) Data transfer in the site-of-origin strategy: Transfer both the F1 and F2 fragments to site 3 and perform the query at site 3. In this case we need to transfer a total of 1000*50 + 100*30 = 53000 bytes.

(2) Data transfer in the move-small strategy: Transfer F2 to site 1, execute the query at site 1, and send the result to site 3. The size of the query result is 36*1000 = 36000 bytes, so we transfer 36000 + 3000 = 39000 bytes.

However, in the move-small strategy the multifragment operation is performed at the site where the larger fragment is located. The result of the multifragment operation should
be returned to the site which requests the multifragment operation if the larger fragment participating in the multifragment operation is not at that site. Thus we have to determine the query processing strategy in order to minimize the amount of data transmission. However, even though we consider the transaction processing strategy to reduce the total transmission cost, it is difficult to obtain an optimal allocation by considering transaction processing strategy alone, because optimal allocation is deeply related to both transaction processing strategy and fragment allocation. Actually, finding a nonredundant, minimum total data transmission cost allocation is NP-complete [Gare 79]. Hence, to minimize the transmission cost, we will consider some heuristics which allow multifragment operations to be performed effectively in an integrated way. In our research, to illustrate how a transaction is decomposed into a sequence of algebraic operations, we use an operation tree which represents the algebraic formulation. Figure 6-2 shows an example of an operation tree. Note that at this step we assume that the amount of data requested by each transaction is estimated. This can be done by considering the attributes accessed and selectivity factors. After analyzing transactions, a fragment usage matrix is generated. The fragment usage matrix is similar to the attribute usage matrix and represents the use of the fragments in important transactions. Each row refers to one transaction:
the numeric value in a column indicates that the transaction originated from the specified site or sites (numeric value) and "uses" the corresponding fragment. For example, in Figure 6-3, transaction T5 originates from sites 1 and 2 and uses fragment F3.

6.2 Representation and Strategy

6.2.1 Representation

To allocate fragments optimally most approaches [Cer 83b, Corn 88] use the 0-1 integer programming method. However, this method is hard to understand and lacks intuition. It also suffers from limitations on the size of the problem that can be solved with commercially available software. In this dissertation we use a graph called the allocation-request graph to represent how data are requested and allocated. Figure 6-4 shows an example of the allocation-request graph. As shown in Figure 6-4, an allocation-request graph consists of three components:
(1) rectangular nodes, for fragments
(2) circular nodes, for sites
(3) edges, which denote requests by the transactions for accessing the fragments

Edges, which are directed, are labeled with a quadruple (i,j,f,d), where d stands for the amount of data requested from the fragment j (corresponding to a rectangular node in the graph) for processing the ith transaction, and f stands
for the frequency with which this transaction is executed. For example, in Figure 6-4, the label (2,1,30,200) means that transaction 2 requests 200 bytes of fragment 1, with a frequency of 30. Join operations are represented by drawing arcs among edges participating in joins and marking them with a "J". Outerunion operations and outerjoin operations are represented by "O", and union operations are represented by "U". Figure 6-4 shows these representations. Note that n-way (n > 2) multifragment operations can also be represented in the same way, as shown in Figure 6-4.

6.2.2 Strategy

The objective is to allocate the mixed fragments nonredundantly with minimum distributed transaction processing cost. The distributed transaction processing cost here is measured by the total transmission cost. The allocation and transaction processing are subject to some load balancing constraints. The total transmission cost is obtained by multiplying transaction frequency by the amount of data requested. To reduce the total transmission cost, a fragment should be allocated to the site that requests the largest amount of data. To increase the availability of the system for local processing, CPU/IO cost is considered as a balancing constraint at each site. This means that there is an upper limit on the CPU/IO time at each site. The CPU/IO cost is
determined by (cost of tuple retrieval and processing) x (# of tuples) x (# of retrievals). This is because most relational database systems use a tuple-at-a-time retrieval storage system (e.g., RSS in System R). Thus what we are trying to do is to minimize the total transmission cost while satisfying the CPU/IO cost constraint at each site. This objective would be easily obtained if transactions did not contain multifragment operations. However, if transactions contain multifragment operations, as would normally happen, we cannot easily achieve the objective because there is an interdependency between query processing strategy and data allocation, as pointed out in the previous approaches [Wah 84, Aper 88, Corn 88]. To solve this problem, data allocation and query processing strategy should be considered together. This means that we should choose the sites where data are to be stored while simultaneously determining where join operations should take place [Corn 89]. Another consideration is the priority of the requests. When there are multiple requests for a fragment, we give emphasis to multifragment requests when selecting a candidate request. This is because it allows effective processing of multifragment operations. Note that non-multifragment operations are not neglected: the relationship with the other fragment participating in a multifragment operation is considered to reduce the data transmission. The details will be explained in strategy 3. This is reasonable because, as
pointed out in [Swam 89], some future applications built on top of relational systems will require processing of queries with a much larger number of joins. Object-oriented database systems that use relational systems for information storage are another class of potential applications performing many joins. Thus, in this research, we present an integrated heuristic that takes into consideration the interdependency problem in multifragment operations and the join emphasis for selecting a candidate request. The integrated heuristic is as follows: "Let the fragment that causes the largest data transmission be located at the requesting site as much as possible while satisfying the load balancing constraints." This heuristic is accomplished by an allocation technique called pseudo-allocation, which allows fragments to be allocated dynamically with multifragment operations so as to minimize the distributed transaction processing cost; it is explained later. Based on the above discussion we can formulate the following allocation strategies. Note that at any step the final allocation of a fragment is confirmed as long as the load balancing constraint is not violated; otherwise the fragment goes back to the fragment pool.

Strategy 1. If there is only a single request, allocate the fragment to the requesting site.
Strategy 2. When there are multiple requests for one fragment from different sites, none of which participates in any multifragment operation, allocate the fragment to the site that gives rise to the largest amount of data transmission.

Strategy 3. If the requests are mixed with multifragment operations, two cases may occur. To explain further, we use the following terminology.

former fragment: the fragment that is considered for allocation first of the two fragments participating in a multifragment operation. For example, in Figure 6-5, F1 is a former fragment.

latter fragment: the fragment that is considered for allocation after the former fragment. In Figure 6-5, F2 is a latter fragment.

data_multi_former (data_multi_latter): the total amount of data requested of the former fragment (latter fragment) from the site that has multifragment operations. If several sites have multifragment operations, the site with the largest total amount of data requested is considered. In Figure 6-5, a and c represent data_multi_former and data_multi_latter, respectively.

data_nomulti_former (data_nomulti_latter): the total amount of data requested of the former fragment (latter
fragment) from the site that has no multifragment operations. If several sites have no multifragment operations, the site with the largest total amount of data requested is considered. In Figure 6-5, b and d represent data_nomulti_former and data_nomulti_latter, respectively.

3.1 Former fragment: If the candidate fragment is a former fragment, then the requests participating in multifragment operations have higher priority for allocation, without cost computation, than those that are not participating in multifragment operations. Thus, in Figure 6-5, fragment F1 is allocated to site 1 without any cost computation.

3.2 Latter fragment: If the candidate fragment is a latter fragment, then apply the pseudo-allocation technique.

6.2.3 Pseudo-allocation Technique

The pseudo-allocation technique is an allocation heuristic that allows fragments to be allocated dynamically with multifragment operations. Its job is to determine whether a former fragment is kept at the site to which it was allocated. Thus the pseudo-allocation technique is meaningful whenever there is a former fragment. In order to satisfy our proposed heuristic we introduce a parameter β to represent the ratio of the result size of multifragment
operations to the size of the larger of the two participating fragments. Thus, if both fragments participating in a multifragment operation are allocated to remote sites, some portion of a fragment, depending on this ratio, is added to the total data transmission as a penalty for the remote access. The rules are as follows; we refer to Figure 6-5.

(1) a > c and β*a > b
Meaning: the data requested from F1 exceeds the data requested from F2 in the multifragment operation, and the data resulting from the multifragment operation due to F1 exceeds the data requested by S2 from F1.
Decision: keep F1 at S1 and return F2 to the fragment pool. If, however, β*a < b, F1 is not kept at S1.

(2) a < c and β*c > d
Meaning: the data requested from F1 is less than the data requested from F2 in the multifragment operation, and the data resulting from the multifragment operation due to F2 exceeds the data requested by S2 from F2.
Decision: release F1 and allocate F2 to S1. If β*c < d, F2 is not allocated to S1.

(3) a < b and c > d
Meaning: S1 requests less data from F1 than S2 does, and S2 requests less data from F2 than S1 does.
Decision: release F1 and allocate F2 to S1.

(4) a > b and c < d
Meaning: S1 requests more data from F1 than S2 does, and S2 requests more data from F2 than S1 does.
Decision: keep F1 at S1 and allocate F2 to S2.

(5) a > b and c > d
Meaning: S1 requests more data from both F1 and F2 than S2 does.
Additional condition (a) a > c
Meaning: the data requested from F1 exceeds the data requested from F2 in the multifragment operation.
Decision: keep F1 at S1 and return F2 to the fragment pool.
(b) a < c
Meaning: the data requested from F2 exceeds the data requested from F1 in the multifragment operation.
Decision: release F1 and allocate F2 to S1.
For example, suppose that a former fragment F2 has been allocated to site 1 with data_multi_former < data_nomulti_former. Then we would like to allocate the next fragment F3, with data_multi_latter < data_nomulti_latter. Since data_multi_former < data_multi_latter and β*data_multi_latter > data_nomulti_latter, the latter fragment F3 is allocated to site 1 and the former fragment F2 is kicked off from site 1 by Strategy 3.2. Fragment F2 is reconsidered and allocated to site 2 if it satisfies the load balancing constraints. Suppose, however, that data_multi_latter < 4500. Then the former fragment F2 would remain at site 1 and the latter fragment F3 would go back to the fragment pool for reconsideration. This means that the allocation of a fragment participating in a multifragment operation is confirmed only after considering the effect of all the multifragment operations. In the next section we give a heuristic allocation procedure.

6.3 A Heuristic Procedure for Mixed Fragment Allocation

A heuristic procedure for allocating the mixed fragments by means of the allocation-request graph is described below. First we briefly describe the algorithm in three steps.

Step 1. Form an allocation-request graph by using the operation trees and a fragment usage matrix.

Step 2. Sort the fragments based on some reasonable sorting criterion (e.g., size of fragment, frequency of
retrieval, or the ratio of frequency over size). Place the result in a circular queue.

Step 3. While the queue is not empty do: (1) take the front fragment; (2) allocate it to an appropriate site according to the allocation strategies; (3) apply the pseudo-allocation technique as defined and explained in Section 6.2.3. The iteration ends when all fragments are allocated.

To obtain a more detailed algorithm, suppose that the following data structures are used during implementation. The fragments are numbered 1 to n. An adjacency list L represents the allocation-request graph, in which the array HEAD[1..n] contains, for each fragment, the fragment_id, the total amount of data requested of the fragment, the number of requests to the fragment, and the number of multifragment operations in the list pointed to by this fragment. Each list node has five pieces of information: transaction_id, frequency, data requested, site_of_origin, and type of transaction. The detailed description of the algorithm now follows. The algorithm uses the following variables in addition to those of Section 6.2.

#_of_req: an integer variable for the number of data requests to the fragment.

#_of_mulop: an integer variable for the number of multifragment operations in the list pointed to by the fragment.
pre_allo(i): an integer vector that denotes the preallocation site (assigned without cost computation) for fragment i.

confirm(i): a boolean vector that denotes whether the allocation of fragment i is final.

Procedure Mixed_F_Allo (adjacency list L with HEAD[1..n])
  {prepare a circular queue}
  sort the HEAD array based on the sorting criterion and place the result in a circular queue
  while the queue is not empty do
    take the front fragment from the queue (call it fragment i)
    if #_of_req = 1 then {there is a single request}
      check the balancing constraints and allocate fragment i to the requesting site
      confirm(i) := true
    else {there are multiple requests}
      case
      #_of_mulop = 0: {no requests participate in multifragment operations}
        check the balancing constraints and allocate fragment i to the site that gives rise to the largest data transmission
        confirm(i) := true
      #_of_mulop >= 1 and pre_allo(k) = empty for all k: {fragment i is a former fragment}
        if all multifragment operations are from one site then
          check the balancing constraints and allocate fragment i to the requesting site
          pre_allo(i) := that site
      #_of_mulop >= 1 and pre_allo(k) is not empty: {fragment i is a latter fragment}
        find the former fragment and call it j
        check the balancing constraints and allocate fragment i to the site where the former fragment resides
        if data_multi_former < data_nomulti_former then
          if data_multi_latter < data_nomulti_latter then
            if data_multi_former > data_multi_latter
              then determine the former fragment's site using β; insert the latter into the queue
              else determine the latter fragment's site using β; insert the former into the queue
          else
            insert the former into the queue; confirm(i) := true
        else
          if data_multi_latter < data_nomulti_latter then
            insert the latter into the queue; confirm(j) := true
          else
            if data_multi_former > data_multi_latter
              then confirm(j) := true; check the balancing constraint for the latter fragment
              else confirm(i) := true; check the balancing constraint for the former fragment
      end_case
  end_while
end_procedure

6.4 Complexity of the Algorithm

Let us now analyze the time complexity of the above algorithm step by step. Step 2 obviously takes O(n log n) time, where n is the number of fragments, because it requires sorting. Step 3(2) requires the request list for the fragment to be traversed, so it takes O(m) time, where m is the number of transactions. Step 3(3) also takes O(m) time, because finding a former fragment needs a traversal of a fragment and its list. The "while" loop runs m/n times per fragment, giving a time in O(m); this is because, by using a flag, a fragment is allocated only once per site before its final allocation. This flag guarantees the
termination of the algorithm. Thus step 3 takes O(m²) time, and the time required by the algorithm is therefore in O(m²).

6.5 Experimental Results

In order to show that the proposed heuristic procedure indeed works as claimed, and how the pseudo-allocation technique can be used for obtaining optimal solutions, we implemented the algorithm in the C language. To simplify the implementation we assume the following: (1) all multifragment operations are 2-way operations; (2) if both fragments are allocated to remote sites, then these sites are distinct; (3) load balancing consideration is deferred. We use three examples, namely EX1, EX2, and EX3, as shown in Figures 6-6, 6-7, and 6-8, respectively. EX1 has 3 sites and 3 fragments, EX2 has 3 sites and 6 fragments, and EX3 has 4 sites and 6 fragments. Figure 6-9 shows some experimental results. We varied the parameter β from 0.5 to 2.0. As explained before, β represents the ratio of the result size of multifragment operations. We computed the amount of data transmission of the optimal allocation scheme by exhaustive search and compared it with that of the heuristic allocation scheme. The optimality of the algorithm depends on the parameter β: it appears that as β increases, the heuristic solution tends to equal the optimal solution. This approach assumed an average β; it can be extended to pairwise β values.
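To make the comparison concrete, here is a small Python sketch of the cost model and the exhaustive search used to obtain an optimal nonredundant allocation. This code is illustrative only: the dissertation's experiments were programmed in C, the function names and the example request data are invented here, and the β penalty for multifragment operations is omitted for brevity.

```python
from itertools import product

def transmission_cost(requests, alloc):
    """Total transmission cost of a nonredundant allocation.

    requests: list of (fragment, site, frequency, bytes_requested) tuples,
              one per (transaction, fragment) request.
    alloc:    dict mapping each fragment to the single site holding it.
    A request costs frequency * bytes when the fragment is remote,
    and nothing when the fragment resides at the requesting site.
    """
    return sum(freq * size
               for frag, site, freq, size in requests
               if alloc[frag] != site)

def optimal_allocation(requests, fragments, sites):
    """Exhaustively try every nonredundant allocation (tiny inputs only)."""
    best_alloc, best_cost = None, None
    for combo in product(sites, repeat=len(fragments)):
        alloc = dict(zip(fragments, combo))
        cost = transmission_cost(requests, alloc)
        if best_cost is None or cost < best_cost:
            best_alloc, best_cost = alloc, cost
    return best_alloc, best_cost

# Hypothetical example: site 1 dominates the demand for F1,
# site 2 dominates the demand for F2.
reqs = [('F1', 1, 30, 200), ('F1', 2, 10, 100), ('F2', 2, 50, 400)]
alloc, cost = optimal_allocation(reqs, ['F1', 'F2'], [1, 2])
print(alloc, cost)  # {'F1': 1, 'F2': 2} 1000
```

The exhaustive search enumerates all |sites|^|fragments| allocations, which is exactly why the heuristic of Section 6.3 is needed for realistic problem sizes.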
It should be noted that we have so far been considering only the accessed fragments. We now discuss the allocation of the nonaccessed fragments.

6.6 Dealing with Nonaccessed Fragments

Let us consider the fragments that are not accessed by the important transactions. As shown in Figure 5-3, some fragments are accessed by the 20% of important transactions, whereas others are not. This does not mean that these nonaccessed fragments are never accessed, because the 80% of less important transactions may still be accessing them. Thus, when we allocate fragments, we have to consider the allocation of these nonaccessed fragments as well. The problem is caused by the fact that we do not know the nature of the 80% of less important transactions. Based on the current inputs, the following solutions are possible.

(1) Merging with a contiguous accessed fragment. When we consider merging a nonaccessed fragment with an accessed fragment, the merging should be in the direction that removes outerjoins and outerunions, so as to reduce the multifragment operation cost. This is because outerjoin and outerunion operations cost more on account of padding information. For example, in Figure 5-3, consider fragments F5 and F6 and cells X2 and X4. Here there are at least two choices in merging. One way is that cells X2 and X4 are
merged with F5; the other is that cell X2 is merged with F5 and cell X4 is merged with F6. It is then obvious that the latter choice is better than the former in the context of transaction processing cost, because the former requires an outerunion.

(2) Allocation to the site that has the cheapest storage cost. If there is a site whose storage cost is the cheapest, we can allocate all nonaccessed fragments to that site.

(3) Random allocation. Basically, the nonaccessed fragments are treated as don't-care conditions: it does not matter where these nonaccessed fragments are allocated.
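Option (2) is straightforward to mechanize. The following Python sketch is illustrative only; the function name and the storage-cost table are invented for this example and are not part of the dissertation's implementation.

```python
def place_nonaccessed(nonaccessed, storage_cost):
    """Allocate every nonaccessed fragment to the site with the
    cheapest storage cost (option 2 above).

    nonaccessed:  iterable of fragment names.
    storage_cost: dict mapping site -> per-unit storage cost.
    """
    cheapest = min(storage_cost, key=storage_cost.get)
    return {frag: cheapest for frag in nonaccessed}

# Hypothetical costs: site 3 has the cheapest storage.
print(place_nonaccessed(['X2', 'X4'], {1: 5.0, 2: 4.0, 3: 1.5}))
# {'X2': 3, 'X4': 3}
```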
EMPLOYEE relation

Site 1: F1 (ENO, FNAME, LNAME, SEX, BDATE, MGRSSN)
  1000 records; each record is 50 bytes long
  ENO field: 4 bytes; FNAME field: 15 bytes; LNAME field: 15 bytes

Site 2: F2 (ENO, ADDR, SAL)
  100 records; each record is 30 bytes long
  ENO field: 4 bytes; SAL field: 6 bytes

Figure 6-1. Example to illustrate data transfer
Figure 6-2. Operation tree (a query Q over fragments F1 and F2)

Figure 6-3. Fragment usage matrix: transactions 1 through 8 against fragments F1 through F6, with the site(s) of origin of each access and the access frequencies Acc1 = 25, Acc2 = 50, Acc3 = 25, Acc4 = 35, Acc5 = 25, Acc6 = 25, Acc7 = 25, Acc8 = 15
Figure 6-5. Example to explain terminology
Figure 6-6. Example 1
Figure 6-7. Example 2
Figure 6-8. Example 3
Total data transmission for the heuristic and optimal allocation schemes:

   β      EX1 heuristic  EX1 optimal  EX2 heuristic  EX2 optimal  EX3 heuristic  EX3 optimal
  0.5         7250          7250         19500         17750         35250         32000
  1.0        10000         10000         19500         19500         35250         35000
  1.5        10800         10800         19500         19500         35250         35250
  2.0        10800         10800         19500         19500         35250         35250

Figure 6-9. Experimental results
CHAPTER 7
ALTERNATIVE WAYS OF DEALING WITH FRAGMENTATION AND ALLOCATION

In this dissertation we have presented a set of distributed database design methodologies. In our approach, mixed fragmentation involves a combined use of horizontal and vertical partitioning to construct grid cells. These cells are then merged during grid optimization. Allocation is deferred as a separate problem that is governed by the site of origin of transactions. We describe the other approaches in the next section.

7.1 Alternatives for Combining Fragmentation and Allocation

Distributed database design has been dealt with in terms of fragmentation of data and allocation of the resulting fragments. Figure 7-1 shows the various alternatives with which these problems can be attacked; for simplicity, replication is not shown. The arm C in Figure 7-1 represents our proposed approach to data fragmentation and allocation. We will explain all the alternatives except our own.

A: This approach expects users to specify fragmentation and allocation requirements together. Ceri et al. [Cer 83b] assumed that users are able to specify such candidates and derived horizontal
partitioning schemes. The resulting design contains partitioning and allocation based on a 0-1 integer programming formulation. Replication was introduced after first solving the nonreplicated problem.

B: In this approach fragmentation and allocation are handled independently. Navathe et al. [Nava 84] applied this approach to vertical partitioning, where fragments were designed first to achieve the best clustering of attributes based on an intuitive empirical objective function. The allocation problem requires detailed cost information; they proposed an objective function based on the costs of accessing multiple fragments, irrelevant attributes, and so on, which can be weighted differently. Having designed the vertical fragments, they proposed separate allocation algorithms for both centralized and distributed environments. This dissertation falls in this approach.

D: This approach corresponds to wide area networks, where we would use allocation algorithms first to allocate the grid cells. Then, based on the given allocation, local merging of fragments is performed to minimize overall transaction processing costs. Sreewastav [Sree 90] proposed an integrated system architecture for the initial design based on this approach.
E: This approach involves a more complex model of the merging and allocation process, in which site-dependent costs are brought into the picture.

Among these alternatives we explain approach D in more detail in the following section, because it is the most closely related to ours.

7.2 Cell Allocation Followed by Local Optimization

The grid cells produced after simultaneously applying the vertical and horizontal partitioning algorithms represent all possible partitionings of the relation based on its usage by the important transactions. The grid cells are the unit of allocation. The major difference from approach C is that in approach D grid cell allocation is performed first, without merging, and then grid cell merging is performed at each site. There are basically two stages in this approach.

(1) Initial allocation of the fragments. In this stage each cell of the grid is allocated to a site. The objective of this stage is to optimize the total processing cost. The allocation of one fragment is not dependent on that of another, although the usage of fragments together by transactions is taken into account.

(2) Local fragment optimization. Local optimization of fragments refers to the merging of cells allocated to a specific site into larger fragments. The
vertical and horizontal partitioning algorithms produce a grid of cells based on the attribute and predicate affinities. This stage generates all the possible fragments. The granularity of these fragments is very small; therefore a transaction may access a number of fragments, and a number of joins or unions may be involved in transaction processing. Merging those fragments that are accessed together very frequently by transactions may result in savings due to the elimination of joins and unions. Based on this approach, an integrated system architecture for the initial design of a distributed database was proposed in [Sree 90]. Sreewastav adopted our mixed partitioning methodology to form a grid for each relation by simultaneously applying vertical and horizontal partitioning. The grid cells, or fragments, are then allocated nonredundantly using a greedy heuristic procedure. This initial allocation is followed by a replication procedure, which again is a greedy heuristic: a fragment is replicated at a site if the benefit of replicating the fragment at that site is positive and maximum among all possibilities. Finally, the fragments allocated to a site are further optimized with the objective of reducing the processing costs attributed to join and union processing.
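The greedy replication step described for [Sree 90] can be sketched as a simple loop in Python. This is an illustrative reconstruction, not Sreewastav's actual procedure: the function name and benefit table are invented, and it assumes for simplicity that the benefit of each (fragment, site) copy is independent of the copies chosen earlier.

```python
def greedy_replicate(fragments, sites, benefit):
    """Add replica copies while some copy still pays off.

    benefit(frag, site): net gain of an extra copy of frag at site
    (e.g., saved remote-read traffic minus update and storage overhead).
    Copies are chosen in order of decreasing benefit; the loop stops
    as soon as the best remaining candidate has non-positive benefit.
    """
    replicas = []
    candidates = [(f, s) for f in fragments for s in sites]
    while candidates:
        best = max(candidates, key=lambda fs: benefit(*fs))
        if benefit(*best) <= 0:
            break  # no remaining replication is worthwhile
        replicas.append(best)
        candidates.remove(best)
    return replicas

# Hypothetical benefit table for two fragments and two extra sites.
gain = {('F1', 2): 5, ('F1', 3): -1, ('F2', 2): 3, ('F2', 3): 0}
print(greedy_replicate(['F1', 'F2'], [2, 3], lambda f, s: gain[(f, s)]))
# [('F1', 2), ('F2', 2)]
```

In a fuller model the benefit of a copy would shrink as other copies of the same fragment are added, so the benefit function would be re-evaluated after each placement rather than fixed up front.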
Figure 7-1. Alternatives for combining fragmentation and allocation
CHAPTER 8
CONCLUSIONS AND FURTHER RESEARCH

The major issue of designing an efficient distributed database gives rise to the problems of fragmentation and allocation of data. In this dissertation we have formally defined these problems and proposed a set of algorithms for partitioning and allocation. In our view, the major results and contributions of this work are the following.

This dissertation has presented a graph-theoretical algorithm for vertical partitioning. In this algorithm all fragments are generated in one iteration in O(n²) time, which is more efficient than previous approaches. Furthermore, it does not need an arbitrary objective function. We consider it very useful because it is applicable to any domain for which an affinity matrix is provided; in fact, the horizontal partitioning algorithm proposed here is the result of applying this same algorithm to the horizontal partitioning domain.

The mixed partitioning methodology allows the optimal partitioning of global relations in a distributed database by using a grid approach that incorporates horizontal and vertical fragmentation simultaneously. This is the first comprehensive treatment of mixed partitioning.
An allocation algorithm suited to mixed partitioning has been presented. We have specified four multifragment operations and defined allocation strategies based on the heuristic called the pseudo-allocation technique, using a graphical method.

Further extensions of this work will be in the direction of the following issues. The first is the development of an interactive design tool that will allow users to make fragmentation and allocation decisions for distributed databases using vertical partitioning, horizontal partitioning, mixed partitioning, and allocation. Another is the investigation of a model for multiple copies of fragments in the allocation phase: replication can be performed as a consequence of nonreplicated allocation by ascertaining whether there is a cost benefit after replication. More work is needed to model the performance evaluation with respect to the response time of transactions and the effect of network topology. It would also be worthwhile to tie this work closely to actual environments where parameters such as β are actually computed for an actual system.
REFERENCES

[Aper 88] Apers, P. M. G., "Data Allocation in Distributed Database Systems," ACM Trans. on Database Systems, Vol. 13, No. 3, September 1988, pp. 263-304.

[Bras 88] Brassard, G., and Bratley, P., Algorithmics: Theory and Practice, Prentice-Hall, Englewood Cliffs, New Jersey, 1988.

[Cer 80] Ceri, S., Martella, G., and Pelagatti, G., "Optimal File Allocation for a Distributed Database on a Network of Minicomputers," Proc. International Conference on Databases, Aberdeen, Hayden, July 1980.

[Cer 83a] Ceri, S., and Navathe, S. B., "A Methodology for the Distribution Design of Databases," Proc. IEEE COMPCON Conference, San Francisco, CA, February 1983.

[Cer 83b] Ceri, S., Navathe, S. B., and Wiederhold, G., "Distribution Design of Logical Database Schemas," IEEE Trans. on Software Engineering, Vol. SE-9, No. 4, July 1983, pp. 487-504.

[Ceri 82] Ceri, S., Negri, M., and Pelagatti, G., "Horizontal Data Partitioning in Database Design," Proc. ACM SIGMOD International Conference on Management of Data, Orlando, FL, June 1982.

[Ceri 84] Ceri, S., and Pelagatti, G., Distributed Databases: Principles and Systems, McGraw-Hill, New York, 1984.

[Ceri 87] Ceri, S., Pernici, B., and Wiederhold, G., "Distributed Database Design Methodologies," Proceedings of the IEEE, April 1987.

[Ceri 88] Ceri, S., Pernici, B., and Wiederhold, G., "Optimization Problems and Solution Methods in the Design of Data Distribution," working paper, Stanford University, 1988.

[Cope 88] Copeland, G., Alexander, W., Boughter, E., and Keller, T., "Data Placement in Bubba," Proc. ACM SIGMOD International Conference on Management of Data, Chicago, IL, June 1988.

[Corn 87] Cornell, D. W., and Yu, P. S., "A Vertical Partitioning Algorithm for Relational Databases," Proc. Third International Conference on Data Engineering, Los Angeles, CA, February 1987.

[Corn 88] Cornell, D. W., and Yu, P. S., "Site Assignment for Relations and Join Operations in the Distributed Transaction Processing Environment," Proc. Fourth International Conference on Data Engineering, Los Angeles, CA, February 1988.

[Corn 89] Cornell, D. W., and Yu, P. S., "On Optimal Site Assignment for Relations in the Distributed Database Environment," IEEE Trans. on Software Engineering, Vol. 15, No. 8, August 1989, pp. 1004-1009.

[DeW 86] DeWitt, D. J., Gerber, R. H., Graefe, G., Heytens, M. L., Kumar, K. B., and Muralikrishna, M., "GAMMA: A High Performance Dataflow Database Machine," Proc. Twelfth International Conference on Very Large Data Bases, Kyoto, Japan, August 1986.

[Dowd 82] Dowdy, L. W., and Foster, D. V., "Comparative Models of the File Assignment Problem," Computing Surveys, Vol. 14, No. 2, June 1982, pp. 287-313.

[Elma 89] Elmasri, R., and Navathe, S. B., Fundamentals of Database Systems, Benjamin/Cummings, Redwood City, California, 1989.

[Gare 79] Garey, M. R., and Johnson, D. S., Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, New York, 1979.

[Hamm 79] Hammer, M., and Niamir, B., "A Heuristic Approach to Attribute Partitioning," Proc. ACM SIGMOD International Conference on Management of Data, Boston, MA, May 1979.

[Hoff 75] Hoffer, J. A., and Severance, D. G., "The Use of Cluster Analysis in Physical Database Design," Proc. First International Conference on Very Large Data Bases, Framingham, MA, September 1975.

[McCo 72] McCormick, W. T., Schweitzer, P. J., and White, T. W., "Problem Decomposition and Data Reorganization by a Clustering Technique," Operations Research, Vol. 20, No. 5, September 1972, pp. 993-1009.

[Nava 84] Navathe, S. B., Ceri, S., Wiederhold, G., and Dou, J., "Vertical Partitioning Algorithms for Database Design," ACM Trans. on Database Systems, Vol. 9, No. 4, December 1984, pp. 680-710.

[Sacc 85] Sacca, D., and Wiederhold, G., "Database Partitioning in a Cluster of Processors," ACM Trans. on Database Systems, Vol. 10, No. 1, March 1985, pp. 29-56.

[Sree 90] Sreewastav, K., "A Distributed Database Initial Design System," Master's thesis, University of Florida, 1990.

[Swam 89] Swami, A., "Optimization of Large Join Queries: Combining Heuristics and Combinatorial Techniques," Proc. ACM SIGMOD International Conference on Management of Data, Portland, OR, May 1989.

[Wah 84] Wah, B. W., "File Placement on Distributed Computer Systems," IEEE Computer, Vol. 17, No. 1, January 1984, pp. 23-33.

[Wied 87] Wiederhold, G., File Organization for Database Design, McGraw-Hill, New York, 1987.

[Yu 85] Yu, C. T., Suen, C., Lam, K., and Siu, M. K., "Adaptive Record Clustering," ACM Trans. on Database Systems, Vol. 10, No. 2, June 1985, pp. 180-204.
BIOGRAPHICAL SKETCH

Minyoung Ra is from Korea. He graduated from the Korea Military Academy, Seoul, Korea, in 1978. He received his bachelor's and master's degrees in computer science at Seoul National University, Seoul, Korea, in 1983 and 1986, respectively. Before starting his Ph.D. studies he worked as a faculty member at the Korea Military Academy. Since 1987 he has been working toward the doctoral degree in computer and information sciences at the University of Florida. His research interests are in the areas of database management systems, distributed database design, federated information bases, and semantic data modeling.
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Shamkant B. Navathe, Chair
Professor of Computer and Information Sciences

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Ravi Varadarajan, Cochair
Assistant Professor of Computer and Information Sciences

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Sharma Chakravarthy
Associate Professor of Computer and Information Sciences

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
This dissertation was submitted to the Graduate Faculty of the College of Engineering and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

December 1990

Winfred M. Phillips
Dean, College of Engineering

Madelyn M. Lockhart
Dean, Graduate School

