Department of Computer and Information Science, University of Florida, Gainesville, Fla. Technical Report, 1993.

Sensitivity Analysis of Frequency Counting*

Theodore Johnson
Dept. of Computer and Information Science
University of Florida

Many database optimization activities, such as prefetching, data clustering and partitioning, and
buffer allocation, depend on the detection of hot spots in access patterns. While a database designer can
in some cases use special knowledge about the data and the users to predict hot spots, in general one
must use information about past activity to predict future activity. However, algorithms that make use
of hot spots pay little attention to the way in which hot spot information is gathered, or to the quality of
this information. In this paper, we present a model for analyzing hot spot estimates based on frequency
counting. We present a numerical method for estimating the quality of the data, and a rule-of-thumb.
We find that if b of the references are made to the hottest a of the N data items, then one should process
Na[(1 a)/(b a)2] references.

1 Introduction

Many database optimization activities require the identification and classification of hot spots, or regions of

the database with an exceptionally high access frequency. In some cases, a database designer can identify

hot spots based on special knowledge of the data and the users. However, one is usually forced to predict

future referencing patterns based on past behavior. A common approach is to use frequency counting. The

number of references to the objects in the system are counted over a period of time. Objects with a high

frequency count are labeled "hot", and the other objects are labeled "cold".

Frequency counting will accurately identify the hot objects in the database if the reference pattern is

stationary and if enough references are collected. However, one usually wants to make database organization

or re-organization decisions based on limited data. The earlier the hot spots are detected, the sooner the

database can be optimized and the corresponding performance benefits obtained. In addition, reference

patterns are usually not stationary, and decisions based on stale data can degrade performance. Since one is

forced to make decisions based on limited data, a method of evaluating the quality of the hot spot information

is needed.

In this paper, we examine the performance of frequency counting. The input to the frequency counting

algorithm is a reference string composed of the sequence of object references that the system generates.
*We acknowledge the support of USRA grant #5555-19. Part of this work was performed while Theodore Johnson was an
ASEE Summer Faculty Fellow at NASA's National Space Science Data Center.

We assume the Independent Reference Model (IRM), so that each reference is independent and identically

distributed. Given the length of the reference string and the distribution of references, we calculate the

quality of the hot spot estimate. One measure of the estimate quality is the probability that a hot object

is labeled "hot". A second measure of quality is the sum of the reference probabilities of the objects that are

labeled "hot". Our objective is to calculate these measures, with an emphasis on the second. Towards this

end, we derive procedures for calculating both performance metrics, and a rule of thumb for estimating when

enough references have been processed.

1.1 Background

An accurate knowledge of the access frequency of objects in a database permits an optimal database orga-

nization. We present a brief survey here.

Suppose that references to database objects in a reference stream are independent and chosen from a

stationary distribution. In this case, the optimal page replacement algorithm is the A0 algorithm, which

locks the most frequently accessed data objects into memory [6]. The layout of objects on a disk drive that

minimizes head travel time is the "organ pipe arrangement" (place the most frequently accessed object in

the middle, and more frequently accessed objects are placed closer to the center than less frequently accessed

objects) [28].

Shared-nothing parallel database systems need to decluster the data stored in a relation, and scatter

it across the processors so that the proper number of processors is involved in an average query. The

declustering algorithms make use of statistics on average queries as part of their input [11, 7].

An object-oriented database typically consists of a number of objects which can have links to one another.

Queries on the object base typically follow a path through the object links. As a result there has been much

interest in placing objects that are 'close' on the same page, to reduce the page miss rate. Many researchers

have observed that some links are more often traversed than others, and have proposed algorithms that use

this kind of information [26, 17, 16, 25]. Tsangaris and Naughton [27] have found that stochastic clustering

out-performs other methods. The reference stream for accessing blocks in a file system often shows a great

deal of regularity. Several authors have proposed algorithms that observe patterns in the reference stream,

and pre-fetch blocks that were often requested after the previous request [8, 19, 22, 21]. For prefetching and

object clustering, one does not count object accesses, rather one counts pairs of references. Hence, the set

of objects is correspondingly larger and the available data correspondingly scarcer.

In the above cited works, there is little analysis of the quality of the access frequency data. In this

work we make a contribution to determining when one has collected enough access frequency data to make

clustering, declustering, migration, and prefetching decisions. We would like to make note of a related work

by Salem, Barbara, and Lipton [23]. These authors consider the problem of identifying the most frequently

accessed objects in a very large reference string while using a limited amount of work space. In contrast to

the work by Salem, Barbara, and Lipton, we assume that we already have accurate access frequency counts

and we ask how well these counts reflect the distribution that generated them.

There is also a considerable body of related work in the Statistics literature. We defer a comparison

between this work and related statistics work until Section 6.

2 Problem Statement

We assume that the database consists of N objects, labeled d1, d2, ..., dN. We are given a reference string
of length M, R = (r1, r2, ..., rM). The reference string is generated by the independent reference model [1],
so that Pr[ri = dj] = pj for all ri in R. We denote the distribution {pj} by P, and we assume that pi ≥ pi+1
for i = 1, ..., N − 1.

We would like to identify the b "hottest" objects, d1 through db, based on the information in R. Since
we are restricted to using the information contained in R, we rank objects according to the number of
references they received in R. For every di, we compute ref(di) to be the number of references to di in
R. For every i = 1, ..., N we compute pos(i) to be the dj such that ref(pos(i)) ≥ ref(pos(i + 1)) (if two
objects have the same reference count, we break ties arbitrarily). The b items with the most references are
pos(1) through pos(b). Finally, we define the rank of a data item, rank(di), to be the inverse mapping of pos,
pos(rank(di)) = di.

After we process our reference string, we would like to determine how good our ranking of the data items

is. There are several ways to measure a good ranking:

1. The quality of a ranking can be calculated by counting the number of data items d1 through db which
are ranked b or less (i.e., how many hot data items did we identify?). However, this measure does not
capture well the meaning of a "good ranking" that we are after, since some hot items are fairly cool
and some cool items are fairly hot. For example, suppose that the distribution is P = (.4, .3, .299, .001).
Choosing (d1, d3) to be our hot items is judged to be a poor choice, but intuitively it is almost as good
as (d1, d2). Also, (d2, d4) is measured to be as good as (d1, d3), but is actually much worse.

2. The quality of a ranking can be calculated by summing the reference probabilities of pos(1) through
pos(b). We call this measure the buffer value. It more directly reflects the intuitive idea of a good
ranking. In the above example, (d1, d2) has a buffer value of .7, (d1, d3) has a buffer value of .699, and
(d2, d4) has a buffer value of .301.

We are mostly interested in buffer value as a function of M, but we also calculate the probability that a

hot item is in the buffer, and the expected ranking of a data item.

3 The Model

Let us first try a direct approach. An object di is ranked s if it receives more references than N − s other objects
and fewer references than s − 1 others. Let ref(j) be the number of references received by object dj, with
distribution Fj and density fj. Let P(s − 1, i) be the set of all partitions of {d1, ..., d(i−1), d(i+1), ..., dN}
into A ∪ B such that |A| = s − 1. Let R(s, i) be the probability that object di is ranked s. Then [9]

R(s, i) = ∫ Σ_{(A,B) ∈ P(s−1,i)} [ Π_{j∈A} (1 − Fj(x)) · Π_{j∈B} Fj(x) ] fi(x) dx

This integral is extremely difficult to evaluate for a large or an arbitrary number of objects in the system.

In addition, the distribution functions Fj do not have closed form for their exact value. Therefore, we are

compelled to search for approximate answers.

3.1 An Approximate Approach

Let us first look at the distribution functions. Since object dj is accessed independently on every reference

with probability pj, the density function of the number of accesses over M references is binomial(M, pj). The
distribution is therefore an incomplete binomial sum, and does not have a closed form [13]. We instead use
the normal approximation to the binomial distribution:

binomial(M, p) ≈ N(Mp, Mp(1 − p))

The factor of (1 − p) in the variance of the normal distribution creates difficulties in the subsequent
calculations. The value of p is typically very small (a data item with p = .1 is extremely hot). Therefore we
make the further approximation that binomial(M, p) ≈ N(Mp, Mp).
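The accuracy of this simplified approximation can be checked numerically. The sketch below (with illustrative parameter values, not values from the paper) compares the exact incomplete binomial sum against N(Mp, Mp):

```python
import math

def binom_cdf(k, M, p):
    """Exact incomplete binomial sum Pr[X <= k] for X ~ binomial(M, p)."""
    return sum(math.comb(M, i) * p**i * (1 - p)**(M - i) for i in range(k + 1))

def norm_cdf(x, mu, var):
    """CDF of N(mu, var)."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))

# A moderately hot item: Mp = 10 expected references.
M, p = 1000, 0.01
errors = [abs(binom_cdf(k, M, p) - norm_cdf(k, M * p, M * p))
          for k in (5, 10, 15)]
# The variance-Mp approximation stays within a few percent in the tails;
# the largest discrepancy is near the mean.
```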

Two objects di and dj do not receive independent numbers of references. However, this dependence

generally has a very small effect, and is cumbersome to account for. We make the approximation that all

objects receive references independently of the number of references to other objects.

To compute the ranking of a data item di, we compare ref(di) to ref(dj) for each dj, and count the

number of times that ref(di) is larger (breaking ties arbitrarily). If we are told that ref(di) > ref(dj), then

we know that di is likely to have a large reference count and thus a low ranking. So, the indicator variables

I(ref(di) > ref(dj)) and I(ref(di) > ref(dk)) are not independent, and in fact show a very strong dependence. If di
receives K references, rank(di) is likely to be located in a small interval centered on a point that depends
on K, which we call E[K].

We therefore take the following approach to testing the sensitivity of frequency counting. We ask, if data

item di receives K references, what is the chance that it is ranked in the top b in terms of reference counts?

We denote this probability by R<(b|K). To determine the probability R<(b, i) that object di is in the buffer,
we uncondition on K:

R<(b, i) = Σ_{K=0}^{M} R<(b|K) · fi(K)

The Derivations

We first list the important symbols that we use in the analysis:

N : Number of data items.

di : Data item i, 1 < i < N.

R : Reference string.

M = |R|.

b : Size of the buffer (number of items to mark as hot).

ref(di) : Number of references to di in R.

pos(j) : The index of the data item that is ranked j.

rank(i) : The ranking of di.

Fi (fi) : The distribution (density) of ref(i).

pi : Probability that a reference is to item i.

buffer value : Σ_{j=1}^{b} p_{pos(j)}.

R(s, i) : The probability that di is ranked s.

R<(b, i) = Σ_{s=1}^{b} R(s, i).

R<(b|K) = R<(b, i | ref(di) = K).

E[K] : Average ranking of an object with K references.

V[K] : Variance of the ranking of an object with K references.

V'[K] : Variance in E[K].

VT[K] : Variance due to tie breaking.

p(−1), p(+1) : Lower and upper bounds of integration wrt. pj.

jlo, jhi : Lower and upper bounds of integration wrt. j.

To calculate R<(b|K), we take the following approach. For every object dj, we calculate the probability
that the object will have a lower rank than an object with K references as Xj[K] = Pr[ref(j) > K] +
Pr[ref(j) = K]/2 (recall that ref(j) is the random variable corresponding to the number of references to
dj, and we break ties arbitrarily when ranking). Each object dj makes its own contribution to the number
of objects with more than K references. The total number of objects ranked lower than an object with K
references has mean E[K] = Σj Xj[K]. If a data item receives K references, its ranking is a random variable with
mean E[K] and variance V[K]. We approximate the distribution of the ranking by a Normal distribution,


R<(b|K) = Pr[N(E[K], V[K]) < b]

We need to calculate E[K]. We have approximated the distribution of ref(j) as a normal N(Mpj, Mpj)
distribution, which automatically captures the tie-breaking:

Pr[ref(j) > K] + Pr[ref(j) = K]/2 = Pr[N(Mpj, Mpj) > K]
                                  = Pr[N(0, 1) < (Mpj − K)/√(Mpj)]

Even the Normal distribution is too difficult to work with. Fortunately there is a simple first-order
approximation. We will approximate N(0, 1; x) by a linear approximation N*(0, 1; x), which is non-constant
only in [−1, 1]. As a result, we find that:

                 0                            if (Mpj − K)/√(Mpj) < −1
Pr[ref(j) > K] ≈ C1 + C2 (Mpj − K)/√(Mpj)     if −1 ≤ (Mpj − K)/√(Mpj) ≤ 1     (1)
                 1                            if (Mpj − K)/√(Mpj) > 1

where C1 = .5 and C2 = 1/√(2π) ≈ .3989. We note that more accurate higher-order approximations can

be used, at the cost of a more complex model.
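Equation 1 amounts to clamping a straight line through the standard Normal CDF at zero. A small sketch (C1 and C2 as in the text):

```python
import math

C1, C2 = 0.5, 1 / math.sqrt(2 * math.pi)   # C2 ≈ .3989

def phi(x):
    """Exact standard Normal CDF N(0, 1; x)."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def phi_star(x):
    """N*(0, 1; x): the first-order approximation of equation 1,
    non-constant only on [-1, 1]."""
    if x < -1:
        return 0.0
    if x > 1:
        return 1.0
    return C1 + C2 * x

# Inside [-1, 1] the two agree to within about .06; the largest error
# is the jump to 0 or 1 just outside the interval (about .16).
```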

We define Ij[K] to be 1 if ref(j) > K and 0 otherwise. Let Ej[K] and Vj[K] be the mean and variance
of Ij[K]. Let I[K] = Σj Ij[K]. Then I[K] has mean E[K] = Σj Ej[K] and variance V'[K] = Σj Vj[K].

The sums required to compute E[K] and V'[K] are usually intractable, so we approximate them by an

integral over pj. So, the first thing to determine are the interesting bounds of the integration. We note that
(Mpj − K)/√(Mpj) is an increasing function of pj. By solving

(Mpj − K)/√(Mpj) = ±1     (2)

we find that

p(−1) = (1 + 2K − √(1 + 4K)) / (2M)
p(+1) = (1 + 2K + √(1 + 4K)) / (2M)     (3)
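These roots are easy to verify numerically; a quick sketch (the values of M and K are illustrative):

```python
import math

def p_bounds(K, M):
    """p(-1) and p(+1): the roots of (M*p - K)/sqrt(M*p) = -1 and +1."""
    s = math.sqrt(1 + 4 * K)
    return (1 + 2 * K - s) / (2 * M), (1 + 2 * K + s) / (2 * M)

M, K = 1000, 25
p_m1, p_p1 = p_bounds(K, M)
g = lambda p: (M * p - K) / math.sqrt(M * p)
# g(p_m1) = -1 and g(p_p1) = +1, up to floating-point error
```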

We need to convert p(−1) and p(+1) into bounds over dj, i.e., we need to find the j(−1) and j(+1) such that
p_{j(−1)} = p(−1) and p_{j(+1)} = p(+1). As an example, let us specialize to a triangular distribution, pj = a + bj
(b < 0). Note that since b < 0 (i.e., we order the hottest data items first), j(−1) is the index of the first dj
that is not 'assured' of receiving at least K references, and j(+1) is the index of the first dj that is 'assured'
of receiving less than K references. Then

j(−1) = (1 + 2K − 2Ma − √(1 + 4K)) / (2Mb)
j(+1) = (1 + 2K − 2Ma + √(1 + 4K)) / (2Mb)

Computing E[K] requires an integral of Ej[K] over j. We define jlo and jhi as jlo = max(j(+1), 0)
and jhi = min(j(−1), N). Then, the integral is

E[K] = ∫_0^{jlo} 1 dj + ∫_{jlo}^{jhi} (C1 + C2 (Mpj − K)/√(Mpj)) dj + ∫_{jhi}^N 0 dj
     ≈ (jhi + jlo)/2 + C2 ∫_{jlo}^{jhi} (Mpj − K)/√(Mpj) dj     (4)

The bounds of integration can fall into several ranges, as illustrated in Figure 1, each of which provides
a different answer. If jlo > N (case 6) then E[K] = N. If jhi = 0 (case 1), then E[K] = 0. The principal
non-trivial cases are 0 < jlo < jhi < N (case 3) and 0 = jlo < jhi < N (case 2). Case 4 is degenerate, and
we approximate case 5 by case 3.

Next, we turn to estimating the variance V[K]. There are two components to the variance in the ranking
of a data item that receives K references. The first component is the variance in E[K], which we write as
V'[K]. The second component of the variance is due to the random tie-breaking, which we write as VT[K].
We calculate V[K] = V'[K] + VT[K].

Since variances sum, we can compute V'[K] by summing the individual contributions of each Vj[K],
which is Ej[K](1 − Ej[K]). Thus,

V'[K] ≈ ∫_{jlo}^{jhi} [ 1/4 − C2² ((Mpj − K)/√(Mpj))² ] dj     (5)

Figure 1: Possibilities for the bounds of integration.

If Y[K] data items receive K references, then the ranking of a data item that receives K references
is uniformly distributed in [E[K] − Y[K]/2, E[K] + Y[K]/2]. So, the variance due to tie-breaking is
VT[K] = Y[K]²/12. For small K, one can estimate Y[K] by integrating Pr[ref(j) = K] over j. For large K, the
binomial distribution or its Poisson approximation become cumbersome. In that case, one can approximate
Y[K] ≈ (E[K] − E[K − 1])/2 + (E[K + 1] − E[K])/2 = (E[K + 1] − E[K − 1])/2. VT[K] usually dominates V'[K].


We next calculate E[K] and V[K] for three distributions: triangular, exponential, and partitioned. The
triangular density function has a gradual slope, while the exponential density function has a steep slope.
The partitioned density function has a very steep slope, and is useful for approximating an unknown distribution.

3.1.1 Triangular Distribution

In a triangular distribution, pi = a + b·i. We start by calculating the bounds of integration for E[K] and
V'[K]:

j(−1) = (1 + 2K − 2Ma − √(1 + 4K)) / (2Mb)
j(+1) = (1 + 2K − 2Ma + √(1 + 4K)) / (2Mb)

The term √(1 + 4K) makes the integral difficult, so we approximate it by √(1 + 4K) ≈ 2√K + 1/(4√K),
giving the approximate bounds:

ĵ(−1) = (1 + 2K − 2Ma − 2√K − 1/(4√K)) / (2Mb) = j(−1) + O(1/(Mb·K^{3/2}))
ĵ(+1) = (1 + 2K − 2Ma + 2√K + 1/(4√K)) / (2Mb) = j(+1) + O(1/(Mb·K^{3/2}))

When we substitute pj = a + b·j into equation 4, we get

E[K] ≈ (jhi + jlo)/2 + C2 [ 2√(a + bj) (M(a + bj) − 3K) / (3b√M) ]_{j=jlo}^{jhi}

When jlo = j(+1) and jhi = j(−1), we get

E[K] ≈ (2K + 1 − 2Ma) / (2Mb) − 2C2 / (3Mb)

We note that determining E[K] is not as simple as calculating the number of data items dj such that
Mpj > K, since that approach incorrectly calculates (K − Ma)/(Mb) for E[K].

When jlo = 0 and jhi = j(−1), we get

E[K] ≈ (2K + 1 − 2√K − 2Ma) / (4Mb) − √2·C2·√(2K + 1 − 2√K)·(4K − 1 + 2√K) / (6Mb) − 2C2·√a·(Ma − 3K) / (3b√M)

We next need to calculate V'[K]. When jlo = j(+1) and jhi = j(−1), we get

V'[K] = −[ √K (1 + 4C2²K − 2C2²) + 2C2²K² ln(2K + 1 − 2√K) − 2C2²K² ln(2K + 1 + 2√K) ] / (2Mb)

We want to make an approximation to remove the logs. We observe that

ln( (2K + 1 + 2√K) / (2K + 1 − 2√K) ) ≈ (2/√K)(1 − 1/(6K))

We empirically find that the two-term approximation is very close to the exact value even for small K.
Putting this approximation into the formula for V'[K] gives us

V'[K] ≈ −√K (1 − 4C2²/3) / (2Mb)

When jlo = 0 and jhi = j(−1), a similar but lengthier calculation gives

V'[K] = [ 1 + 2K − 2Ma − √(1 + 4K) − 2C2² + 12C2²K² + 4C2²a²M² − 16C2²aMK
          − 2C2²(2K − 1)√(1 + 4K) − 8C2²K² ln( (1 + 2K − √(1 + 4K)) / (2Ma) ) ] / (8Mb)

3.1.2 Exponential Distribution

In the exponential distribution, pj = c rJ, where c = (r 1)/(rN 1). After substituting for pj into

equation 3 and solving for j, we find that

n(2)-ln(1+2 K+1+ K)- n(rN+'- 1)+ln(M)+ln(r-1)
(-1) =- In(r)
) In(2)-In(1+2 K-V+4-K)-ln(r +l-1)+1n(M)+ln(r-1)
=(+1) In(r)

After solving equation 4, we find that

E[K] = (jlo + jhi)/2 + (2C2 / ln r) [ (Mpj + K) / √(Mpj) ]_{j=jlo}^{jhi}

After solving equation 5, we find that

V'[K] = (jhi − jlo)/4 − C2² [ (Mpj − K²/(Mpj)) / ln r − 2Kj ]_{j=jlo}^{jhi}

3.1.3 Partitioned Distribution

In a partitioned distribution, the data items are partitioned into n parts. Partition i contains aiN data

items, and receives bi of the references. Every data item in partition i is equally likely to be referenced.

Because the partitioned distribution has an inherently discrete description, we take a different approach

to calculating E[K] and V[K].

1. Set E[−1] = hits[−1] = 0, and hits[K] = 0 for all K.

2. For every K,

(a) For every partition i,

i. Calculate the expected number of data items in partition i that will receive K references,
using the Poisson approximation to the binomial distribution.

ii. Add this value to hits[K].

(b) Set E[K] = E[K − 1] + (hits[K − 1] + hits[K])/2.

(c) Set V[K] = hits[K]²/12.
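The steps above can be sketched directly in a few lines of Python (a minimal version; the example parameters are the 80/20 partitioned distribution used in the validation experiments):

```python
import math

def partitioned_EV(M, N, parts, K_max):
    """E[K] and V[K] for a partitioned distribution, following the
    steps above. `parts` lists (a_i, b_i): partition i holds a_i*N
    items and draws a fraction b_i of the references; the per-item
    reference count is approximated as Poisson."""
    E, V, hits = {-1: 0.0}, {}, {-1: 0.0}
    for K in range(K_max + 1):
        hits[K] = 0.0
        for a_i, b_i in parts:
            lam = M * b_i / (a_i * N)            # mean references per item
            pmf = math.exp(-lam) * lam**K / math.factorial(K)
            hits[K] += a_i * N * pmf             # expected items with K refs
        E[K] = E[K - 1] + (hits[K - 1] + hits[K]) / 2
        V[K] = hits[K] ** 2 / 12                 # tie-breaking variance
    return E, V, hits

# 80/20 partitioned distribution: 20% of the items get 80% of the references.
E, V, hits = partitioned_EV(M=120, N=200, parts=[(0.2, 0.8), (0.8, 0.2)], K_max=25)
```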

3.2 Performing The Computation

To calculate the performance measures of interest (the probability that a data item is in the buffer, and the

probability value in the buffer), we use the following procedure:

1. Determine the range of K for which we need to compute E[K] and V[K]. Set Kmin to the K that
satisfies j(+1)(K) = N, and set Kmax to the K that satisfies j(−1)(K) = 0. If Kmin < 0, set
Kmin = 0, and if Kmax > M, set Kmax = M.

2. For each K in [Kmin... Kmax], compute E[K] and V[K].

3. For each K, compute R<(b|K) = Pr[N(E[K], V[K]) < b]

4. For each data item j, compute the probability that it is admitted to the buffer.

(a) Compute bounds on the number of references that dj is likely to receive, [Klo, ..., Khi]. These can
be estimated by the normal approximation to the binomial distribution.

(b) Set padmitted[j] = 0.

(c) For each K in [Klo, ..., Khi],

i. Compute the probability that dj receives K references, probK, by using the Poisson approximation
to the binomial distribution.

ii. Set padmitted[j] = padmitted[j] + probK · R<(b|K)

5. Compute a normalizing constant, normalize, by summing padmitted[j] over j and dividing by b. Divide
each padmitted[j] by normalize (to reduce errors).

6. Compute buffer_value by summing pj · padmitted[j] across all j.
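The full procedure can be sketched as follows. This is a simplified Python version: it computes E[K] and V[K] by direct summation over items (the exact sums that the closed-form integrals approximate), uses the Normal N(Mpj, Mpj) model for the counts, and omits the tie-breaking variance VT[K]; the distribution in the example is an illustrative 80/20 split, not a configuration from the paper.

```python
import math

def norm_cdf(x, mu, var):
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))

def poisson_pmf(k, lam):
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def buffer_value(p, M, b):
    """Steps 1-6: predicted buffer value after M references, for a
    distribution p (list of per-item probabilities) and buffer size b."""
    K_max = int(max(M * pj + 6 * math.sqrt(M * pj) for pj in p)) + 1
    # Steps 1-3: R_less[K] = Pr[an item with K references is ranked <= b].
    R_less = []
    for K in range(K_max + 1):
        X = [1 - norm_cdf(K, M * pj, M * pj) for pj in p]   # X_j[K]
        E = sum(X)
        V = sum(x * (1 - x) for x in X)                     # V'[K] only
        R_less.append(norm_cdf(b, E, max(V, 1e-12)))
    # Step 4: admission probability for each item, via Poisson counts.
    padm = []
    for pj in p:
        lam = M * pj
        lo = max(0, int(lam - 6 * math.sqrt(lam)))
        hi = min(K_max, int(lam + 6 * math.sqrt(lam)) + 1)
        padm.append(sum(poisson_pmf(K, lam) * R_less[K]
                        for K in range(lo, hi + 1)))
    # Steps 5-6: normalize so the buffer holds b items, then sum p_j.
    normalize = sum(padm) / b
    return sum(pj * pa for pj, pa in zip(p, padm)) / normalize

# Illustrative 80/20 split: 4 hot items out of 20, buffer of 4.
p = [0.8 / 4] * 4 + [0.2 / 16] * 16
bv = buffer_value(p, M=100, b=4)   # between .2 (random) and .8 (perfect)
```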

This procedure requires that we perform O(1) operations for each K, and O(maxj(Khi − Klo)) operations
for each j. We know that maxj(Khi − Klo) is O(√Kmax), so the complexity of the calculation is
O(N·√Kmax) = O(N·√(pM)).

4 Results

To validate our model, we wrote a simulator that generated a reference string and ranked the data items

according to the number of references they received, breaking ties arbitrarily. Based on the ranking, we

calculated the probability that specific data items were ranked B or less, the reference probability of the

data items ranked B or less, and the average ranking of a data item that received K references. We ran the

simulator for 10,000 reference strings to compute expected values. We used a database of N = 200 items,

set B = 40, and varied the length of the reference string.

We used three different probability distributions: triangular, exponential, and partitioned. The results for

the triangular distribution are shown in Figure 2. In the triangular distribution, the probability of accessing

data item i is a + b·i, and a and b have been assigned so that p1 = 10·p200. In Figure 2, min is the value

the buffer would have if items were placed in the buffer randomly, and max is the best possible buffer value.

The agreement between simulation and analytical predictions is very close. The results for the exponential

distribution are shown in Figure 3. Here, pi = c·r^i, and r = .97. For small numbers of references,

the analytical model underestimates the value of the buffer. However, for moderate to large numbers of

references, the predictions are accurate. Figure 4 shows the results for the partitioned distribution. In this

distribution, 80% of the references are made to 20% of the data items. The agreement between simulation

and analytical predictions is good throughout the range of M. In general, the analytical model makes

accurate predictions, though accuracy suffers when the reference string is small and the distribution pi has

a steep slope. The inaccuracies are due to the approximations made in the calculations.

We also measure the quality of a ranking by the probability that a hot item is included in the buffer.

We ran experiments with the triangular, exponential, and partitioned distributions, and plot the probability

that the 30th hottest item is included in the buffer against length of the reference string. The comparison

between analytical and simulation predictions is shown in Figure 5. The agreement between the simulation

and analytic predictions is good.

5 A Rule Of Thumb

While the analytical model that we have presented is far faster than a simulation, a simple model that can
be immediately applied and does not depend on the distribution {pj} is more useful in practice. In this section,
we present a simple rule of thumb for determining when enough references have been collected.

Suppose that we want to buffer the aN hottest items, which receive b of the total references. We can
look at a partitioned distribution where phot = b/(aN) and pcold = (1 − b)/((1 − a)N).


Figure 2: Comparison of buffer value predictions for a triangular distribution.

By solving equation 2 for K, we can determine the highest and lowest K for which the approximation 1
is non-constant. We denote these as Khi and Klo, and we find that

Khi = Mp + √(Mp)     Klo = Mp − √(Mp)

The value of M for which Khi(pcold) = Klo(phot) is enough references to distinguish a hot item from a
cold item in most cases. So, to find Mcut, we solve Khi(pcold) = Klo(phot) for M to find

Mcut = Na(1 − a) / (b − a)²     (6)

We notice that Na is the size of the buffer. Therefore, formula 6 asks that we process X references for
every item in the buffer, where

X = (1 − a) / (b − a)²

X is usually non-integral. We can improve our estimate of the number of references we need to collect
by processing ⌈X⌉ references for every hot item. This leads us to:

Rule of Thumb: If the buffer holds aN items, and b of the references are directed to the aN hottest items,
then collect MROT references, where

MROT = Na·⌈(1 − a)/(b − a)²⌉     (7)
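The Rule of Thumb is trivial to apply; a small helper (not code from the paper) reproduces the MROT column of Table 1:

```python
import math

def m_rot(N, a, b):
    """Equation 7: references to collect when the buffer holds a*N items
    that receive a fraction b of the references."""
    return round(N * a) * math.ceil((1 - a) / (b - a) ** 2)

# The three validation distributions of Table 1:
assert m_rot(200, 0.2, 0.332) == 1840   # triangular
assert m_rot(200, 0.2, 0.706) == 160    # exponential
assert m_rot(200, 0.2, 0.8) == 120      # partitioned
```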

Table 1 summarizes the predictions that the Rule of Thumb makes on the distributions in our validation

experiments. The parameters a and b are the minimum and maximum possible buffer values. We applied

Figure 3: Comparison of buffer value predictions for an exponential distribution.

distribution a b MROT buffer value percent
triangular .2 .332 1840 .304 78.9
exponential .2 .706 160 .613 81.6
partitioned .2 .8 120 .623 70.5

Table 1: Summary of the Rule of Thumb predictions.

the Rule of Thumb to obtain MROT, and then reported the buffer value that our analytical model predicts.

Finally, we report the quality of the ranking as the distance of the buffer value between the minimum

and maximum, in percent. The table shows that the Rule of Thumb makes a reasonable prediction of the

number of references to process, since the buffer collects 70-82% of the additional buffer value that can

be obtained by processing references. The table also shows that one must collect more references to distinguish

between hot and cold data items as the difference in reference probability becomes smaller. Thus, fewer than

N references need to be processed for the exponential and partitioned distributions, while a huge number of

references must be processed for the triangular distribution.

6 Comparison to Statistics Research

Statistics researchers working in the area of design of experiments have examined problems related to the

one examined in this paper. A good survey of the area is contained in the book by Gibbons, Olken, and

Sobel [12].

Figure 4: Comparison of buffer value predictions for the 80/20 partitioned distribution.

Bechhofer [2] worked on the original problems in this area. Given k populations, each with (unknown)
mean Xi and (known) variance σi², find the t populations with the largest (best) means. The best populations
can be ranked or unranked. Bechhofer's procedure is to draw Ni samples from population i, i = 1, ..., k,

compute the sample means Xi, and rank the sample populations accordingly. The problem becomes, how

many samples Ni should I draw from each population to be certain that the t populations that I have selected

are the t best with probability P? For the case when all variances are equal and the populations have a

Normal distribution, Bechhofer gives a formula for determining the number of samples to collect that looks

similar to our rule of thumb (see also [12]). However, this formula has a different interpretation,

and depends on an expansion factor that is a function of k, t, and P in a complicated way. Later works

in the area have considered the problem of finding the t best of k populations when s > t populations are

selected [10]. A large set of selection and ranking problems involving a variety of distributions and both

known and unknown variances is discussed in [3].

In the problem addressed in this paper, the populations have a Bernoulli distribution and the same

number of samples is drawn from each population. Sobel and Huyett [24] calculate the number of samples to

collect when selecting the best one of k Bernoulli populations with confidence P, and the best population has

a chance of success d larger than the second best population. The Sobel and Huyett procedure was shown

to be the optimal single-stage procedure [14]. The number of required samples can be reduced by adaptive

sampling. A survey of this work is found in [15, 5, 4, 20]. In [4], the theory is extended to selecting the best

Figure 5: Probability that the 30th hottest item is in the buffer.

t out of k Bernoulli populations (this work shows that choosing the populations with the largest number of

success is the best procedure). Jennison and Kulkarni [18] improve the work of [4] to minimize the number

of samples that must be collected.

The current work is most closely related to that of Sobel and Huyett [24], since every population (data

item) has the same number of samples taken (M). Sobel and Huyett calculated the number of samples to

collect by approximating the binomial distributions (the sum of Bernoulli samples) as a Normal distribution

and applying the methods described by Bechhofer in [2]. An extension to selecting the t best Bernoulli

populations out of K is analogous to the work in [24]. However, we are more concerned with buffer value

than with finding the t best populations. Though a stopping rule with a similar form to our rule of thumb

can be found in [2], it depends on k and t (or N and b) in a complicated way, and the expansion constant

has been computed only for small k and t. Since we are concerned with buffer value rather than finding all

of the best populations, our rule of thumb calls for far fewer samples.

In Figure 6, we plot the probability that an item is labeled "hot" against the item number for the triangular, exponential, and partitioned distributions when the rule of thumb number of references is taken. Depending on the distribution, the probability that a hot item is labeled "hot" can be fairly low, and the chance of buffering all of the hot items is vanishingly small. Yet, the buffer value is fairly high.
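The frequency-counting estimate behind these curves is easy to reproduce by simulation. The following is a minimal sketch, assuming a partitioned 80/20 reference distribution over 100 items and treating the top-counted items as the "hot" label; all parameter values and the function name are illustrative, not taken from the paper's experiments.

```python
import random
from collections import Counter

def hot_label_probability(probs, n_refs, buffer_size, trials=200, seed=1):
    """Monte-Carlo estimate, per item, of the chance the item lands in the
    top `buffer_size` of a frequency count after `n_refs` references."""
    rng = random.Random(seed)
    items = list(range(len(probs)))
    labeled = [0] * len(probs)
    for _ in range(trials):
        counts = Counter(rng.choices(items, weights=probs, k=n_refs))
        for i, _ in counts.most_common(buffer_size):
            labeled[i] += 1
    return [c / trials for c in labeled]

# Partitioned "80/20" distribution over N = 100 items:
# 80% of references go to the hottest 20 items.
N, n_hot = 100, 20
probs = [0.80 / n_hot] * n_hot + [0.20 / (N - n_hot)] * (N - n_hot)
p_hot = hot_label_probability(probs, n_refs=1000, buffer_size=n_hot)
```

Plotting `p_hot` against the item number reproduces the qualitative shape of Figure 6: hot items are labeled "hot" with high but not certain probability, so labeling every hot item correctly is unlikely even when most of the buffer value is captured.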

[Figure 6 plot: probability of being labeled "hot" vs. item number (0 to 200), with curves for the exponential and partitioned distributions.]

Figure 6: Probability that an item is labeled "hot" when the rule of thumb number of references is collected.

7 Conclusion

Many database optimization activities require an estimate of reference frequencies. However, little work has

been done to investigate the quality of the available information about reference frequencies. In this work,

we present an analytical model of frequency counting that predicts the quality of a hot spot estimate as a

function of the number of references processed. We validate this model by a comparison to simulation results.

Finally, we present a simple but useful rule of thumb that accurately predicts the number of references that

need to be processed in order to make a good hot spot estimate.
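The rule of thumb itself (as stated in the abstract) is simple to apply. A minimal helper, with an illustrative 80/20 example; the function name and the example parameters are ours:

```python
def rule_of_thumb_refs(N, a, b):
    """Number of references to process per the paper's rule of thumb:
    N * a * (1 - a) / (b - a)**2, where fraction b of the references
    are made to the hottest fraction a of the N data items."""
    if not (0.0 < a < b <= 1.0):
        raise ValueError("need 0 < a < b <= 1")
    return N * a * (1.0 - a) / (b - a) ** 2

# 80/20 rule over N = 200 items: a = 0.2, b = 0.8.
refs = rule_of_thumb_refs(200, 0.2, 0.8)
```

Note that the count grows linearly in N and blows up as b approaches a, i.e., as the hot set becomes harder to distinguish from the cold set.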


References

[1] O.I. Aven, E.G. Coffman, and Y.A. Kogan. Stochastic Analysis of Computer Storage. D. Reidel Publishing, 1987.

[2] R.E. Bechhofer. A single-sample multiple decision procedure for ranking means of normal populations with known variances. Annals of Mathematical Statistics, 25:16-39, 1954.

[3] R.E. Bechhofer, J. Kiefer, and M. Sobel. Sequential Identification and Ranking. University of Chicago

Press, 1968.

[4] R.E. Bechhofer and R.V. Kulkarni. Statistical Decision Theory and Related Topics III, Vol. 1, pages

61-108. Academic Press, 1982.

[5] H. Buringer, S.M. Johnson, and K-H Schriever. Nonparametric Sequential Selection Procedures.

Birkhauser, 1980.

[6] E.G. Coffman and P.J. Denning. Operating System Theory. Prentice-Hall, 1973.

[7] G. Copeland, W. Alexander, E. Boughter, and T. Keller. Data placement in Bubba. In ACM SIGMOD Conf., 1988.

[8] K.M. Curewitz, P. Krishnan, and J.S. Vitter. Practical prefetching via data compression. In ACM SIGMOD Conf., pages 257-266, 1993.

[9] H.A. David. Order Statistics. John Wiley, 1981.

[10] M.M. Desu and M. Sobel. A fixed subset-size approach to the selection problem. Biometrika, 55(2):401-410, 1968.

[11] S. Ghandeharizadeh, D.J. DeWitt, and W. Qureshi. A performance analysis of alternative multi-attribute declustering strategies. In ACM SIGMOD Conf., pages 29-38, 1992.

[12] J.D. Gibbons, I. Olkin, and M. Sobel. Selecting and Ordering Populations: A New Statistical Method-

ology. John Wiley and Sons, 1977.

[13] R.L. Graham, D.E. Knuth, and O. Patashnik. Concrete Mathematics. Addison Wesley, 1989.

[14] W.J. Hall. The most economical character of Bechhofer and Sobel decision rules. Annals of Mathematical Statistics, 30:964-969, 1959.

[15] D.G. Hoel, M. Sobel, and G.H. Weiss. Perspectives in Biometry, pages 29-61. Academic Press, 1975.

[16] M.F. Hornick and S.B. Zdonik. A shared, segmented memory system for an object-oriented database. ACM Transactions on Office Information Systems, 5(1), 1987.

[17] S.E. Hudson and R. King. Cactis: A self-adaptive, concurrent implementation of an object-oriented database management system. ACM Trans. on Database Systems, 14(3):291-321, 1989.

[18] C. Jennison and R.V. Kulkarni. Design of Experiments: Ranking and Selection, pages 113-125. Dekker, 1984.


[19] D. Kotz and C.S. Ellis. Prefetching in file systems for MIMD multiprocessors. IEEE Trans. on Parallel

and Distributed Systems, 1(2):218-230, 1990.

[20] R.V. Kulkarni and C. Jennison. Optimal properties of the Bechhofer-Kulkarni Bernoulli selection procedure. Annals of Statistics, 14(1):298-314, 1986.

[21] M. Palmer and S.B. Zdonik. Fido: A cache that learns to fetch. In Proc. 17th Int'l Conf. on Very Large Databases, pages 255-264, 1991.

[22] K. Salem. Adaptive prefetching for disk buffers. Technical Report TR-91-64, CESDIS, NASA Goddard

Space Flight Center, 1991.

[23] K. Salem, D. Barbara, and R.J. Lipton. Probabilistic diagnosis of hot spots. In IEEE Data Engineering

Conf., pages 30-39, 1992.

[24] M. Sobel and M.J. Huyett. Selecting the one best of several binomial populations. The Bell System Technical Journal, 36:537-576, 1957.

[25] J.W. Stamos. Static grouping of small objects to enhance performance of a paged memory system. ACM Transactions on Computer Systems, 2(2):155-180, 1984.

[26] M.M. Tsangaris and J.F. Naughton. A stochastic approach for clustering in object bases. In ACM SIGMOD Conf., pages 12-21, 1991.

[27] M.M. Tsangaris and J.F. Naughton. On the performance of object clustering techniques. In ACM SIGMOD Conf., pages 144-153, 1992.

[28] C.K. Wong. Algorithmic Studies in Mass Storage Systems. Computer Science Press, 1983.
