Sensitivity Analysis of Frequency Counting*
Theodore Johnson
Dept. of Computer and Information Science
University of Florida
Abstract
Many database optimization activities, such as prefetching, data clustering and partitioning, and
buffer allocation, depend on the detection of hot spots in access patterns. While a database designer can
in some cases use special knowledge about the data and the users to predict hot spots, in general one
must use information about past activity to predict future activity. However, algorithms that make use
of hot spots pay little attention to the way in which hot spot information is gathered, or to the quality of
this information. In this paper, we present a model for analyzing hot spot estimates based on frequency
counting. We present a numerical method for estimating the quality of the data, and a rule of thumb.
We find that if a fraction b of the references are made to the hottest fraction a of the N data items, then
one should process Na ⌈(1 − a)/(b − a)²⌉ references.
1 Introduction
Many database optimization activities require the identification and classification of hot spots, or regions of
the database with an exceptionally high access frequency. In some cases, a database designer can identify
hot spots based on special knowledge of the data and the users. However, one is usually forced to predict
future referencing patterns based on past behavior. A common approach is to use frequency counting. The
number of references to the objects in the system are counted over a period of time. Objects with a high
frequency count are labeled "hot", and the other objects are labeled "cold".
Frequency counting will accurately identify the hot objects in the database if the reference pattern is
stationary and if enough references are collected. However, one usually wants to make database organization
or reorganization decisions based on limited data. The earlier the hot spots are detected, the sooner the
database can be optimized and the corresponding performance benefits obtained. In addition, reference
patterns are usually not stationary, and decisions based on stale data can degrade performance. Since one is
forced to make decisions based on limited data, a method of evaluating the quality of the hot spot information
is needed.
In this paper, we examine the performance of frequency counting. The input to the frequency counting
algorithm is a reference string composed of the sequence of object references that the system generates.
*We acknowledge the support of USRA grant #555519. Part of the work was performed while Theodore Johnson was an
ASEE Summer Faculty Fellow at NASA's National Space Science Data Center.
We assume the Independent Reference Model (IRM), so that each reference is independent and identically
distributed. Given the length of the reference string and the distribution of references, we calculate the
quality of the hot spot estimate. One measure of the estimate quality is the probability that a hot object
is labeled "hot". A second measure of quality is sum of the reference probabilities of the objects that are
labeled "hot". Our objective is to calculate these measures, with an emphasis on the second. Towards this
end, we derive procedures for calculating both performance metrics, and a rule of thumb for estimating when
enough references have been processed.
1.1 Background
An accurate knowledge of the access frequency of objects in a database permits an optimal database
organization. We present a brief survey here.
Suppose that references to database objects in a reference stream are independent and chosen from a
stationary distribution. In this case, the optimal page replacement algorithm is the A0 algorithm, which
locks the most frequently accessed data objects into memory [6]. The layout of objects on a disk drive that
minimizes head travel time is the "organ pipe arrangement" (the most frequently accessed object is placed
in the middle of the disk, and more frequently accessed objects are placed closer to the middle than less
frequently accessed objects) [28].
Shared-nothing parallel database systems need to decluster the data stored in a relation, and scatter
it across the processors so that the proper number of processors is involved in an average query. The
declustering algorithms make use of statistics on average queries as part of their input [11, 7].
An object-oriented database typically consists of a number of objects which can have links to one another.
Queries on the object base typically follow a path through the object links. As a result there has been much
interest in placing objects that are 'close' on the same page, to reduce the page miss rate. Many researchers
have observed that some links are more often traversed than others, and have proposed algorithms that use
this kind of information [26, 17, 16, 25]. Tsangaris and Naughton [27] have found that stochastic clustering
outperforms other methods. The reference stream for accessing blocks in a file system often shows a great
deal of regularity. Several authors have proposed algorithms that observe patterns in the reference stream,
and prefetch blocks that were often requested after the previous request [8, 19, 22, 21]. For prefetching and
object clustering, one does not count object accesses, rather one counts pairs of references. Hence, the set
of objects is correspondingly larger and the available data correspondingly scarcer.
In the above cited works, there is little analysis of the quality of the access frequency data. In this
work we make a contribution to determining when one has collected enough access frequency data to make
clustering, declustering, migration, and prefetching decisions. We would like to make note of a related work
by Salem, Barbara, and Lipton [23]. These authors consider the problem of identifying the most frequently
accessed objects in a very large reference string while using a limited amount of work space. In contrast to
the work by Salem, Barbara, and Lipton, we assume that we already have accurate access frequency counts
and we ask how well these counts reflect the distribution that generated them.
There is also a considerable body of related work in the Statistics literature. We defer a comparison
between this work and related statistics work until Section 6.
2 Problem Statement
We assume that the database consists of N objects, labeled d1, d2, ..., dN. We are given a reference string
of length M, R = (r1, r2, ..., rM). The reference string is generated by the independent reference model [1],
so that Pr[ri = dj] = pj for all ri in R. We denote the distribution {pj} by P, and we assume that pi ≥ pi+1
for i = 1, ..., N − 1 (the hottest items come first).
We would like to identify the b "hottest" objects, d1 through db, based on the information in R. Since
we are restricted to using the information contained in R, we rank objects according to the number of
references they received in R. For every di, we compute ref(di) to be the number of references to di in
R. For every j = 1, ..., N we compute pos(j) such that ref(pos(j)) ≥ ref(pos(j + 1)) (if two
objects have the same reference count, we break ties arbitrarily). The b items with the most references are
pos(1) through pos(b). Finally, we define the rank of a data item, rank(di), to be the inverse mapping of pos,
pos(rank(di)) = di.
After we process our reference string, we would like to determine how good our ranking of the data items
is. There are several ways to measure a good ranking:
1. The quality of a ranking can be calculated by counting the number of data items d1 through db which
are ranked b or less (i.e., how many hot data items did we identify?). However, this measure does not
capture well the meaning of a "good ranking" that we are after, since some hot items are fairly cool
and some cool items are fairly hot. For example, suppose that the distribution is P = (.4, .3, .299, .001).
Choosing (d1, d3) to be our hot items is judged to be a poor choice, but intuitively it is almost as good
as (d1, d2). Also, (d2, d4) is measured to be as good as (d1, d3), but is actually much worse.
2. The quality of a ranking can be calculated by summing the reference probabilities of pos(1) through
pos(b). We call this measure the buffer value. It more directly reflects the intuitive idea of a good
ranking. In the above example, (d1, d2) has a buffer value of .7, (d1, d3) has a buffer value of .699, and
(d2, d4) has a buffer value of .301.
We are mostly interested in buffer value as a function of M, but we also calculate the probability that a
hot item is in the buffer, and the expected ranking of a data item.
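As a concrete illustration of these measures, the buffer value of a frequency-count ranking can be estimated by simulation. The following Python sketch (the function name and parameters are ours, not part of the model) draws an IRM reference string and scores the b top-ranked items:

```python
import random
from collections import Counter

def buffer_value(P, M, b, seed=0):
    """Draw an IRM reference string of length M from distribution P, rank the
    N items by frequency count, and return the buffer value: the sum of the
    true reference probabilities of the b top-ranked items."""
    rng = random.Random(seed)
    N = len(P)
    # Each reference r_i is drawn i.i.d. from P (the independent reference model).
    refs = rng.choices(range(N), weights=P, k=M)
    counts = Counter(refs)
    # Rank items by reference count; ties are broken arbitrarily.
    ranking = sorted(range(N), key=lambda i: -counts[i])
    return sum(P[i] for i in ranking[:b])

# The example from the text: P = (.4, .3, .299, .001) with a buffer of b = 2.
# A perfect ranking (d1, d2) scores .7; (d1, d3) is nearly as good at .699.
print(buffer_value([.4, .3, .299, .001], M=1000, b=2))
```

Averaging this quantity over many reference strings gives the expected buffer value that the analysis below predicts.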
3 The Model
Let us first try a direct approach. An object di is ranked s if it receives more references than N − s other objects
and fewer references than s − 1 others. Let ref(j) be the number of references received by object dj, with
distribution Fj and density fj. Let P(s − 1, i) be the set of all partitions of {d1, ..., di−1, di+1, ..., dN}
into A ∪ B such that |A| = s − 1. Let R(s, i) be the probability that object di is ranked s. Then [9]

R(s, i) = ∫₀^∞ Σ_{(A,B) ∈ P(s−1,i)} [ Π_{j∈A} (1 − Fj(x)) Π_{j∈B} Fj(x) ] fi(x) dx
This integral is extremely difficult to evaluate for a large or an arbitrary number of objects in the system.
In addition, the distribution functions Fj do not have closed form for their exact value. Therefore, we are
compelled to search for approximate answers.
3.1 An Approximate Approach
Let us first look at the distribution functions. Since object dj is accessed independently on every reference
with probability pj, the density function of the number of accesses over M references is binomial(M, pj). The
distribution is therefore an incomplete binomial sum, and does not have a closed form [13]. We instead use
the normal approximation to the binomial distribution:

binomial(M, p) ≈ N(Mp, Mp(1 − p))

The factor of (1 − p) in the variance of the normal distribution creates difficulties in the subsequent
calculations. The value of p is typically very small; a data item with p = .1 is extremely hot. Therefore we
make the further approximation that binomial(M, p) ≈ N(Mp, Mp).
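The adequacy of this approximation is easy to check numerically. The sketch below (helper names are ours) compares the exact binomial tail with the N(Mp, Mp) tail for a moderately hot item:

```python
import math

def binom_tail(M, p, K):
    """Exact Pr[ref >= K] when the reference count is binomial(M, p)."""
    return sum(math.comb(M, k) * p**k * (1 - p)**(M - k) for k in range(K, M + 1))

def normal_tail(mean, var, K):
    """Pr[N(mean, var) >= K] via the standard normal CDF."""
    z = (K - mean) / math.sqrt(var)
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))

# A fairly hot item: p = .01 over M = 1000 references (10 expected references).
M, p = 1000, 0.01
for K in (5, 10, 15):
    print(K, round(binom_tail(M, p, K), 3), round(normal_tail(M * p, M * p, K), 3))
```

For small p the variance Mp(1 − p) is nearly Mp, so dropping the (1 − p) factor costs little accuracy.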
Two objects di and dj do not receive independent numbers of references. However, this dependence
generally has a very small effect, and is cumbersome to account for. We make the approximation that all
objects receive references independently of the number of references to other objects.
To compute the ranking of a data item di, we compare ref(di) to ref(dj) for each dj, and count the
number of times that ref(di) is larger (breaking ties arbitrarily). If we are told that ref(di) > ref(dj), then
we know that di is likely to have a large reference count and thus a low ranking. So, the indicator variables
I[ref(di) > ref(dj)] and I[ref(di) > ref(dk)] are not independent, and in fact show a very strong dependence. If di
receives K references, rank(di) is likely to be located in a small interval centered on a point that depends
on K, which we call E[K].
We therefore take the following approach to testing the sensitivity of frequency counting. We ask, if data
item di receives K references, what is the chance that it is ranked in the top b in terms of reference counts?
We denote this probability by R<(b|K). To determine the probability R<(b, i) that object di is in the buffer,
we uncondition on K:

R<(b, i) = Σ_{K=0}^{M} R<(b|K) fi(K)
The Derivations
We first list the important symbols that we use in the analysis:
N : Number of data items.
di : Data item i, 1 ≤ i ≤ N.
R : Reference string.
M = |R|.
b : Size of the buffer (number of items to mark as hot).
ref(di) : Number of references to di in R.
pos(j) : The index of the data item that is ranked j.
rank(i) : The ranking of di.
Fi (fi) : The distribution (density) of ref(i).
pi : Probability that a reference is to item i.
buffer value : Σ_{j=1}^{b} p_{pos(j)}.
R(s, i) : The probability that di is ranked s.
R<(b, i) = Σ_{s=1}^{b} R(s, i).
R<(b|K) = R<(b, i | ref(di) = K).
E[K] : Average ranking of an object with K references.
V[K] : Variance of the ranking of an object with K references.
V'[K] : Variance in E[K].
VT[K] : Variance due to tie breaking.
p(−1), p(+1) : Lower and upper bounds of integration wrt. pj.
jlo, jhi : Lower and upper bounds of integration wrt. j.
To calculate R<(b|K), we take the following approach. For every object dj, we calculate the probability
that the object will have a lower rank than an object with K references as Xj[K] = Pr[ref(j) > K] +
Pr[ref(j) = K]/2 (recall that ref(j) is the random variable corresponding to the number of references to
dj, and we break ties arbitrarily when ranking). Each object dj makes its own contribution to the number
of objects with more than K references. The total number of objects ranked lower than an object with K
references has mean E[K] = Σj Xj[K]. If an object receives K references, its ranking is a random variable with
mean E[K] and variance V[K]. We approximate the distribution of the ranking by a Normal distribution,
so

R<(b|K) = Pr[N(E[K], V[K]) ≤ b]

We need to calculate E[K]. We have approximated the distribution of ref(j) as a normal N(Mpj, Mpj)
distribution, which automatically captures the tie-breaking:

Pr[ref(j) > K] + Pr[ref(j) = K]/2 ≈ Pr[N(Mpj, Mpj) > K] = Pr[ N(0, 1) < (Mpj − K)/√(Mpj) ]
Even the Normal distribution is too difficult to work with. Fortunately there is a simple first-order
approximation. We will approximate the standard normal CDF N(0, 1; x) by a linear approximation N*(0, 1; x), which is nonconstant
only in [−1, 1]. As a result, we find that:

Xj[K] ≈ 0 if (Mpj − K)/√(Mpj) < −1
Xj[K] ≈ C1 + C2 (Mpj − K)/√(Mpj) if −1 ≤ (Mpj − K)/√(Mpj) ≤ 1     (1)
Xj[K] ≈ 1 if (Mpj − K)/√(Mpj) > 1

where C1 = .5 and C2 = 1/√(2π) ≈ .3989. We note that more accurate higher-order approximations can
be used, at the cost of a more complex model.
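The piecewise-linear form (1) can be compared directly against the full normal approximation. In the following sketch the function names are ours:

```python
import math

C1 = 0.5
C2 = 1.0 / math.sqrt(2 * math.pi)  # ~ .3989

def X_linear(M, p, K):
    """Approximation (1): the standard normal CDF replaced by a line that is
    nonconstant only on [-1, 1] in the variable x = (Mp - K)/sqrt(Mp)."""
    x = (M * p - K) / math.sqrt(M * p)
    if x < -1:
        return 0.0
    if x > 1:
        return 1.0
    return C1 + C2 * x

def X_normal(M, p, K):
    """The same quantity X_j[K] under the full N(Mp, Mp) approximation."""
    z = (M * p - K) / math.sqrt(M * p)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# At K = Mp both forms give exactly 1/2; inside [-1, 1] the linear form is
# off by at most a few hundredths, and it clips the tails outside the band.
print(X_linear(1000, 0.01, 8), X_normal(1000, 0.01, 8))
```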
We define Ij[K] to be 1 if object dj outranks an object with K references, and 0 otherwise. Let Ej[K] and Vj[K] be the mean and variance
of Ij[K]. Let I[K] = Σj Ij[K]. Then I[K] has mean E[K] = Σj Ej[K] and variance V'[K] = Σj Vj[K].
The sums required to compute E[K] and V'[K] are usually intractable, so we approximate them by an
integral over pj. So, the first thing to determine are the interesting bounds of the integration. We note that
(Mpj − K)/√(Mpj) is an increasing function of pj. By solving

(Mpj − K)/√(Mpj) = ∓1     (2)

we find that

p(−1) = (1 + 2K − √(1 + 4K)) / (2M)
p(+1) = (1 + 2K + √(1 + 4K)) / (2M)     (3)
We need to convert p(−1) and p(+1) into bounds over dj, i.e., we need to find the j(−1) and j(+1) such that
p_{j(−1)} = p(−1) and p_{j(+1)} = p(+1). As an example, let us specialize to a triangular distribution, pj = a + bj
(b < 0). Note that since b < 0 (i.e., we order the hottest data items first), j(+1) is the index of the first dj
that is not 'assured' of receiving at least K references, and j(−1) is the index of the first dj that is 'assured'
of receiving fewer than K references. Then

j(−1) = (1 + 2K − 2Ma − √(1 + 4K)) / (2Mb)
j(+1) = (1 + 2K − 2Ma + √(1 + 4K)) / (2Mb)
Computing E[K] requires an integral of Ej[K] over j. We define jlo = max(j(+1), 0)
and jhi = min(j(−1), N). Then, the integral is

E[K] = ∫_0^{jlo} 1 dj + ∫_{jlo}^{jhi} [ C1 + C2 (Mpj − K)/√(Mpj) ] dj + ∫_{jhi}^{N} 0 dj
     ≈ (jhi + jlo)/2 + C2 ∫_{jlo}^{jhi} (Mpj − K)/√(Mpj) dj     (4)

The bounds of integration can fall into several ranges, as illustrated in Figure 1, each of which provides
a different answer. If jlo ≥ N (case 6) then E[K] = N. If jhi ≤ 0 (case 1), then E[K] = 0. The principal
nontrivial cases are 0 < jlo < jhi < N (case 3) and 0 = jlo < jhi < N (case 2). Case 4 is degenerate, and
we approximate case 5 by case 3.
Next, we turn to estimating the variance V[K]. There are two components to the variance in the ranking
of a data item that receives K references. The first component is the variance in E[K], which we write as
V'[K]. The second component of the variance is due to the random tie-breaking, which we write as VT[K].
We calculate V[K] = V'[K] + VT[K].
Since variances sum, we can compute V'[K] by summing the individual contributions of each Vj[K],
[Diagram: six cases for the positions of jlo and jhi relative to 0 and N.]
Figure 1: Possibilities for the bounds of integration.
which is Vj[K] = Ej[K](1 − Ej[K]) (since Ij[K] is an indicator, Ij[K]² = Ij[K]). Within the linear range of
approximation (1), Ej[K] = C1 + C2 (Mpj − K)/√(Mpj), so

V'[K] ≈ ∫_{jlo}^{jhi} [ 1/4 − C2² (Mpj − K)² / (Mpj) ] dj     (5)
If Y[K] data items receive K references, then the ranking of a data item that receives K references
is uniformly distributed in [E[K] − Y[K]/2, E[K] + Y[K]/2]. So, the variance due to tie-breaking is
VT[K] = Y[K]²/12. For small K, one can estimate Y[K] by integrating Pr[ref(j) = K] over j. For large K, the
binomial distribution or its Poisson approximation become cumbersome. In that case, one can approximate
Y[K] ≈ (E[K] − E[K−1])/2 + (E[K+1] − E[K])/2 = (E[K+1] − E[K−1])/2. VT[K] usually dominates
V'[K].
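Before specializing to particular distributions, E[K] and V[K] = V'[K] + VT[K] can also be estimated by direct summation over the items, which is what the integrals above approximate. A Python sketch (the function name and the use of a Poisson density for Y[K] are our choices):

```python
import math

C1, C2 = 0.5, 1.0 / math.sqrt(2 * math.pi)

def rank_moments(P, M, K):
    """Estimate E[K] and V[K] = V'[K] + VT[K] by summing over all items,
    using the linear approximation for X_j[K] and a Poisson density for Y[K]."""
    E = V1 = Y = 0.0
    for p in P:
        x = (M * p - K) / math.sqrt(M * p)
        Xj = min(1.0, max(0.0, C1 + C2 * x))  # approximation (1)
        E += Xj                                # E[K] = sum_j X_j[K]
        V1 += Xj * (1 - Xj)                    # variance of each indicator I_j[K]
        lam = M * p                            # expected references to this item
        # Expected number of items tied at exactly K references.
        Y += math.exp(-lam + K * math.log(lam) - math.lgamma(K + 1))
    VT = Y * Y / 12.0                          # variance due to random tie-breaking
    return E, V1 + VT

# A triangular distribution on N = 200 items (p_i proportional to 200 - i).
P = [2 * (200 - i) / (200 * 201) for i in range(200)]
print(rank_moments(P, M=1000, K=5))
```

The closed forms derived next replace these O(N) sums with O(1) formulas per K.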
We next calculate E[K] and V[K] for three distributions: triangular, exponential, and partitioned. The
triangular density function has a gradual slope, while the exponential density function has a steep slope.
The partitioned density function has a very steep slope, and is useful for approximating an unknown distribution.
3.1.1 Triangular Distribution
In a triangular distribution, pi = a + b·i. We start by calculating the bounds of integration for E[K] and
V'[K]:

j(−1) = (1 + 2K − 2Ma − √(1 + 4K)) / (2Mb)
j(+1) = (1 + 2K − 2Ma + √(1 + 4K)) / (2Mb)

The term √(1 + 4K) makes the integral difficult, so we approximate the bounds (using √(1 + 4K) ≈ 2√K + 1/(4√K)) by:

J(−1) = (1 + 2K − 2Ma − 2√K − 1/(4√K)) / (2Mb)
J(+1) = (1 + 2K − 2Ma + 2√K + 1/(4√K)) / (2Mb)

where

J(−1) = j(−1) + O(1/(MbK^{3/2}))
J(+1) = j(+1) + O(1/(MbK^{3/2}))
When we substitute pj = a + bj into equation 4, we get

E[K] ≈ (jhi + jlo)/2 + C2 [ 2√(M(a + bj)) (M(a + bj) − 3K) / (3Mb) ]_{j=jlo}^{j=jhi}

When jlo = J(+1) and jhi = J(−1), we get

E[K] ≈ (2K + 1 − 2Ma)/(2Mb) − 2C2/(3Mb)
     = (K + 1/2 − Ma − 2C2/3)/(Mb)
We note that determining E[K] is not as simple as calculating the number of data items dj such that
Mpj > K, since that approach incorrectly calculates (K − Ma)/(Mb) for E[K].
When jlo = 0 and jhi = J(−1), we get

E[K] ≈ (2K + 1 − 2√K − 2Ma)/(4Mb) + C2 (−4K^{3/2} + 3√K/2 − 3/4)/(3Mb) − 2C2 √a (Ma − 3K)/(3b√M)
We next need to calculate V'[K]. When jlo = J(+1) and jhi = J(−1), we get

V'[K] = (1/(Mb)) [ −√K/2 − C2² ( (2K − 1)√K + K² ln((2K + 1 − 2√K)/(2K + 1 + 2√K)) ) ]

We want to make an approximation to remove the logs. We observe that

ln((2K + 1 + 2√K)/(2K + 1 − 2√K)) ≈ (2/√K)(1 − 1/(6K))

We empirically find that the two-term approximation is very close to the exact value even for small K.
Putting this approximation into the formula for V'[K] gives us

V'[K] ≈ √K (4C2² − 3)/(6Mb)
When jlo = 0 and jhi = J(−1), the same antiderivative applies with the lower limit of integration at pj = a:

V'[K] = (1/(Mb)) [ u/4 − C2² ( u²/2 − 2Ku + K² ln u ) ]_{u=Ma}^{u=K+1/2−√K}

where u = Mpj.
3.1.2 Exponential Distribution
In the exponential distribution, pj = c·r^j, where c = (r − 1)/(r^{N+1} − 1). After substituting for pj into
equation 3 and solving for j, we find that

j(−1) = [ ln(1 + 2K − √(1 + 4K)) − ln(2Mc) ] / ln(r)
j(+1) = [ ln(1 + 2K + √(1 + 4K)) − ln(2Mc) ] / ln(r)

After solving equation 4, we find that

E[K] ≈ (jlo + jhi)/2 + (2C2/ln(r)) [ √(Mc r^j) + K/√(Mc r^j) ]_{j=jlo}^{j=jhi}

After solving equation 5, we find that

V'[K] ≈ [ (1/4 + 2C2²K) j − (C2²/ln(r)) ( Mc r^j − K²/(Mc r^j) ) ]_{j=jlo}^{j=jhi}
3.1.3 Partitioned Distribution
In a partitioned distribution, the data items are partitioned into n parts. Partition i contains ai·N data
items, and receives a fraction bi of the references. Every data item in partition i is equally likely to be referenced.
Because the partitioned distribution has an inherently discrete description, we take a different approach
to calculating E[K] and V[K]:
1. Set E[0] = 0 and hits[K] = 0 for all K.
2. For every K,
(a) For every partition i,
i. Calculate the expected number of data items in partition i that will receive K references,
using the Poisson approximation to the binomial distribution.
ii. Add this value to hits[K].
(b) Set E[K] = E[K − 1] + (hits[K − 1] + hits[K])/2.
(c) Set V[K] = hits[K]²/12.
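The steps above translate directly into code. In the following sketch the function name and argument conventions are ours:

```python
import math

def partitioned_moments(N, M, parts, Kmax):
    """E[K] and V[K] for a partitioned distribution, following the steps above.
    parts is a list of (a_i, b_i): partition i holds a_i*N items and draws a
    fraction b_i of the references, so each of its items has p = b_i/(a_i*N)."""
    hits = [0.0] * (Kmax + 1)
    for K in range(Kmax + 1):
        for a_i, b_i in parts:
            lam = M * b_i / (a_i * N)  # mean reference count per item in partition i
            # Poisson approximation to the binomial: expected number of items in
            # partition i that receive exactly K references.
            pmf = math.exp(-lam + K * math.log(lam) - math.lgamma(K + 1))
            hits[K] += a_i * N * pmf
    E, V = [0.0] * (Kmax + 1), [0.0] * (Kmax + 1)
    for K in range(1, Kmax + 1):
        E[K] = E[K - 1] + (hits[K - 1] + hits[K]) / 2
        V[K] = hits[K] ** 2 / 12
    return E, V

# The 80/20 distribution on N = 200 items used in the validation experiments.
E, V = partitioned_moments(N=200, M=300, parts=[(0.2, 0.8), (0.8, 0.2)], Kmax=30)
```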
3.2 Performing The Computation
To calculate the performance measures of interest (the probability that a data item is in the buffer, and the
probability value in the buffer), we use the following procedure:
1. Determine the range of K for which we need to compute E[K] and V[K]. Set Kmin to the K that
satisfies j(+1)(K) = N, and set Kmax to the K that satisfies j(−1)(K) = 0. If Kmin < 0, set
Kmin = 0, and if Kmax > M, set Kmax = M.
2. For each K in [Kmin, ..., Kmax], compute E[K] and V[K].
3. For each K, compute R<(b|K) = Pr[N(E[K], V[K]) ≤ b].
4. For each data item j, compute the probability that it is admitted to the buffer.
(a) Compute bounds on the number of references that dj is likely to receive, [Klo, ..., Khi]. These can
be estimated by the normal approximation to the binomial distribution.
(b) Set padmitted[j] = 0.
(c) For each K in [Klo, ..., Khi],
i. Compute the probability that dj receives K references, probj, by using the Poisson approximation
to the binomial distribution.
ii. Set padmitted[j] = padmitted[j] + probj · R<(b|K).
5. Compute a normalizing constant normalize by summing padmitted[j] over j and dividing by b. Divide
each padmitted[j] by normalize (to reduce errors).
6. Compute buffer_value by summing pj · padmitted[j] across all j.
This procedure requires that we perform O(1) operations for each K, and O(maxj(Khi − Klo)) operations
for each j. We know that maxj(Khi − Klo) is O(√Kmax), so the complexity of the calculation is
O(N √Kmax) = O(N √(p1 M)).
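Steps 3 through 6 of this procedure can be sketched as follows, taking the arrays E[K] and V[K] as input and using a Poisson density for ref(j) as in step 4(c)i (function and variable names are ours):

```python
import math

def buffer_value_estimate(P, M, b, E, V):
    """Steps 3-6 of the procedure: given E[K] and V[K] for K = 0..Kmax, compute
    each item's admission probability and the expected buffer value."""
    Kmax = len(E) - 1
    phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    # Step 3: R<(b|K) = Pr[N(E[K], V[K]) <= b].
    R = [phi((b - E[K]) / math.sqrt(V[K])) if V[K] > 0 else float(E[K] <= b)
         for K in range(Kmax + 1)]
    padmitted = []
    for p in P:
        lam = M * p
        # Step 4a: likely range of reference counts (normal approximation).
        Klo = max(0, int(lam - 3 * math.sqrt(lam)))
        Khi = min(Kmax, int(lam + 3 * math.sqrt(lam)))
        prob = 0.0
        for K in range(Klo, Khi + 1):
            # Step 4c: Poisson approximation for Pr[ref(j) = K].
            pmf = math.exp(-lam + K * math.log(lam) - math.lgamma(K + 1))
            prob += pmf * R[K]
        padmitted.append(prob)
    # Step 5: normalize the admission probabilities so that they sum to b.
    norm = sum(padmitted) / b
    padmitted = [q / norm for q in padmitted]
    # Step 6: the buffer value.
    return sum(p * q for p, q in zip(P, padmitted))
```

For a uniform distribution every item is admitted with the same probability, so after normalization the buffer value is exactly b/N times the total probability mass, a useful sanity check.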
4 Results
To validate our model, we wrote a simulator that generated a reference string and ranked the data items
according to the number of references they received, breaking ties arbitrarily. Based on the ranking, we
calculated the probability that specific data items were ranked B or less, the reference probability of the
data items ranked B or less, and the average ranking of a data item that received K references. We ran the
simulator for 10,000 reference strings to compute expected values. We used a database of N = 200 items,
set B = 40, and varied the length of the reference string.
We used three different probability distributions: triangular, exponential, and partitioned. The results for
the triangular distribution are shown in Figure 2. In the triangular distribution, the probability of accessing
data item i is a + b·i, and a and b have been assigned so that p1 = 10·p200. In Figure 2, min is the value
the buffer would have if items were placed in the buffer randomly, and max is the best possible buffer value.
The agreement between simulation and analytical predictions is very close. The results for the exponential
distribution are shown in Figure 3. Here, pi = c·r^i, and r = .97. For small numbers of references,
the analytical model underestimates the value of the buffer. However, for moderate to large numbers of
references, the predictions are accurate. Figure 4 shows the results for the partitioned distribution. In this
distribution, 80% of the references are made to 20% of the data items. The agreement between simulation
and analytical predictions is good throughout the range of M. In general, the analytical model makes
accurate predictions, though accuracy suffers when the reference string is small and the distribution pi has
a steep slope. The inaccuracies are due to the approximations made in the calculations.
We also measure the quality of a ranking by the probability that a hot item is included in the buffer.
We ran experiments with the triangular, exponential, and partitioned distributions, and plot the probability
that the 30th hottest item is included in the buffer against length of the reference string. The comparison
between analytical and simulation predictions is shown in Figure 5. The agreement between the simulation
and analytic predictions is good.
5 A Rule Of Thumb
While the analytical model that we have presented is far faster than a simulation, a simple model that can
be immediately applied and does not depend on the distribution {pj} is more useful in practice. In this section,
we present a simple rule of thumb for determining when enough references have been collected.
Suppose that we want to buffer the aN hottest items, which receive a fraction b of the total references. We can
look at a partitioned distribution where phot = b/(aN) and pcold = (1 − b)/((1 − a)N).
[Plot: buffer value vs. number of references (0 to 2,000); curves for min, max, simulation, and the analytic value.]
Figure 2: Comparison of buffer value predictions for a triangular distribution.
By solving equation 2 for K, we can determine the highest and lowest K for which the approximation (1)
is nonconstant. We denote these as Khi and Klo, and we find that

Khi = Mp + √(Mp),  Klo = Mp − √(Mp)

The value of M for which Khi(pcold) = Klo(phot) is enough references to distinguish a hot item from a
cold item in most cases. So, to find Mcut, we solve Khi(pcold) = Klo(phot) for M to find

Mcut ≈ Na(1 − a)/(b − a)²     (6)
We notice that Na is the size of the buffer. Therefore, formula 6 asks that we process X references for
every item in the buffer, where

X = (1 − a)/(b − a)²

X is usually nonintegral. We can improve our estimate of the number of references we need to collect
by processing ⌈X⌉ references for every hot item. This leads us to:
Rule of Thumb: If the buffer holds aN items, and a fraction b of the references are directed to those aN items,
then collect MROT references, where

MROT = Na ⌈(1 − a)/(b − a)²⌉     (7)
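The Rule of Thumb is simple enough to compute directly; a sketch (the function name is ours):

```python
import math

def rule_of_thumb(N, a, b):
    """M_ROT = N*a*ceil((1 - a)/(b - a)**2): the number of references to collect
    when the buffer holds a*N items that receive a fraction b of the references."""
    X = (1 - a) / (b - a) ** 2
    return int(N * a * math.ceil(X))

# The three validation distributions (N = 200, buffer of 40 items):
print(rule_of_thumb(200, .2, .332))  # triangular  -> 1840
print(rule_of_thumb(200, .2, .706))  # exponential -> 160
print(rule_of_thumb(200, .2, .8))    # partitioned -> 120
```

These three values match the MROT column of Table 1.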
Table 1 summarizes the predictions that the Rule of Thumb makes on the distributions in our validation
experiments. The parameters a and b are the minimum and maximum possible buffer values. We applied
[Plot: buffer value vs. number of references (0 to 600); curves for min, max, simulation, and the analytic value.]
Figure 3: Comparison of buffer value predictions for an exponential distribution.
distribution a b MROT buffer value percent
triangular .2 .332 1840 .304 78.9
exponential .2 .706 160 .613 81.6
partitioned .2 .8 120 .623 70.5
Table 1: Summary of the Rule of Thumb predictions.
the Rule of Thumb to obtain MROT, and then reported the buffer value that our analytical model predicts.
Finally, we report the quality of the ranking as the position of the buffer value between the minimum
and maximum, in percent. The table shows that the Rule of Thumb makes a reasonable prediction of the
number of references to process, since the buffer collects 70% to 82% of the additional buffer value that can
be obtained by processing references. The table also shows that one must collect more references to distinguish
between hot and cold data items as the difference in reference probability becomes smaller. Thus, fewer than
N references need to be processed for the exponential and partitioned distributions, while a huge number of
references must be processed for the triangular distribution.
6 Comparison to Statistics Research
Statistics researchers working in the area of design of experiments have examined problems related to the
one examined in this paper. A good survey of the area is contained in the book by Gibbons, Olkin, and
Sobel [12].
[Plot: buffer value vs. number of references (0 to 300); curves for min, max, simulation, and the analytic value.]
Figure 4: Comparison of buffer value predictions for the 80/20 partitioned distribution.
Bechhofer [2] worked on the original problems in this area. Given k populations, each with (unknown)
mean μi and (known) variance σi², find the t populations with the largest (best) means. The best populations
can be ranked or unranked. Bechhofer's procedure is to draw Ni samples from population i, i = 1, ..., k,
compute the sample means X̄i, and rank the sample populations accordingly. The problem becomes, how
many samples Ni should I draw from each population to be certain that the t populations that I have selected
are the t best with probability P? For the case when all variances are equal and the populations have a
Normal distribution, Bechhofer gives a formula for determining the number of samples to collect that looks
similar to our rule of thumb (see also [12]). However, this formula has a different interpretation,
and depends on an expansion factor that is a function of k, t, and P in a complicated way. Later works
in the area have considered the problem of finding the t best of k populations when s > t populations are
selected [10]. A large set of selection and ranking problems involving a variety of distributions and both
known and unknown variances is discussed in [3].
In the problem addressed in this paper, the populations have a Bernoulli distribution and the same
number of samples is drawn from each population. Sobel and Huyett [24] calculate the number of samples to
collect when selecting the best one of k Bernoulli populations with confidence P, and the best population has
a chance of success d larger than the second best population. The Sobel and Huyett procedure was shown
to be the optimal single-stage procedure [14]. The number of required samples can be reduced by adaptive
sampling. A survey of this work is found in [15, 5, 4, 20]. In [4], the theory is extended to selecting the best
[Plot: probability in buffer vs. number of references (0 to 1,200) for item 30; analytic and simulation curves for the exponential, triangular, and 80/20 distributions.]
Figure 5: Probability that the 30th hottest item is in the buffer.
t out of k Bernoulli populations (this work shows that choosing the populations with the largest number of
successes is the best procedure). Jennison and Kulkarni [18] improve the work of [4] to minimize the number
of samples that must be collected.
The current work is most closely related to that of Sobel and Huyett [24], since every population (data
item) has the same number of samples taken (M). Sobel and Huyett calculated the number of samples to
collect by approximating the binomial distributions (the sum of Bernoulli samples) as a Normal distribution
and applying the methods described by Bechhofer in [2]. An extension to selecting the t best Bernoulli
populations out of K is analogous to the work in [24]. However, we are more concerned with buffer value
than with finding the t best populations. Though a stopping rule with a similar form to our rule of thumb
can be found in [2], it depends on k and t (or N and b) in a complicated way, and the expansion constant
has been computed only for small k and t. Since we are concerned with buffer value rather than finding all
of the best populations, our rule of thumb calls for far fewer samples.
In Figure 6, we plot the probability that an item is labeled "hot" against the item number for the
triangular, exponential, and partitioned distributions when the rule of thumb number of references is
collected. Depending on the distribution, the probability that a hot item is labeled "hot" can be fairly low, and
the chance of buffering all of the hot items is vanishingly low. Yet, the buffer value is fairly high.
[Plot: probability of being labeled "hot" vs. item number (0 to 200); curves for the exponential and partitioned distributions.]
Figure 6: Probability that an item is labeled "hot" when the rule of thumb number of references is collected.
7 Conclusion
Many database optimization activities require an estimate of reference frequencies. However, little work has
been done to investigate the quality of the available information about reference frequencies. In this work,
we present an analytical model of frequency counting that predicts the quality of a hot spot estimate as a
function of the number of references processed. We validate this model by a comparison to simulation results.
Finally, we present a simple but useful rule of thumb that accurately predicts the number of references that
need to be processed in order to make a good hot spot estimate.
References
[1] D.I. Aven, E.G. Coffman, and Y.A. Kogan. Stochastic Analysis of Computer Storage. D. Reidel Publishing, 1987.
[2] R.E. Bechhofer. A single-sample multiple decision procedure for ranking means of normal populations with known variances. Annals of Mathematical Statistics, 25:16-39, 1954.
[3] R.E. Bechhofer, J. Kiefer, and M. Sobel. Sequential Identification and Ranking. University of Chicago
Press, 1968.
[4] R.E. Bechhofer and R.V. Kulkarni. Statistical Decision Theory and Related Topics III, Vol. 1, pages
61108. Academic Press, 1982.
[5] H. Buringer, S.M. Johnson, and KH Schriever. Nonparametric Sequential Selection Procedures.
Birkhauser, 1980.
[6] E.G. Coffman and P.J. Denning. Operating System Theory. Prentice-Hall, 1973.
[7] G. Copeland, W. Alexander, E. Boughter, and T. Keller. Data placement in Bubba. In ACM SIGMOD Conf., 1988.
[8] K.M. Curewitz, P. Krishnan, and J.S. Vitter. Practical prefetching via data compression. In ACM SIGMOD Conf., pages 257-266, 1993.
[9] H.A. David. Order Statistics. John Wiley, 1981.
[10] M.M. Desu and M. Sobel. A fixed subset-size approach to the selection problem. Biometrika, 55(2):401-410, 1968.
[11] S. Ghandeharizadeh, D.J. DeWitt, and W. Qureshi. A performance analysis of alternative multi-attribute declustering strategies. In ACM SIGMOD Conf., pages 29-38, 1992.
[12] J.D. Gibbons, I. Olkin, and M. Sobel. Selecting and Ordering Populations: A New Statistical Methodology. John Wiley and Sons, 1977.
[13] R.L. Graham, D.E. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, 1989.
[14] W.J. Hall. The most economical character of Bechhofer and Sobel decision rules. Annals of Mathematical Statistics, 30:964-969, 1959.
[15] D.G. Hoel, M. Sobel, and G.H. Weiss. Perspectives in Biometry, pages 29-61. Academic Press.
[16] M.F. Hornick and S.B. Zdonik. A shared, segmented memory system for an object-oriented database. ACM Transactions on Office Information Systems, 5(1), 1987.
[17] S.E. Hudson and R. King. Cactis: A self-adaptive, concurrent implementation of an object-oriented database management system. ACM Trans. on Database Systems, 14(3):291-321, 1989.
[18] C. Jennison and R.V. Kulkarni. Design of Experiments: Ranking and Selection, pages 113125. Dekker,
1984.
[19] D. Kotz and C.S. Ellis. Prefetching in file systems for MIMD multiprocessors. IEEE Trans. on Parallel
and Distributed Systems, 1(2):218230, 1990.
[20] R.V. Kulkarni and C. Jennison. Optimal properties of the bechhoferkulkarni bernoulli selection proce
dure. Annals of Statistics, 14(1):298314, 1986.
[21] M. Palmer and S.B. Zdonik. Fido: A cache that learns to fetch. In Proc. 17th Int'l Conf. on Very Large Databases, pages 255-264, 1991.
[22] K. Salem. Adaptive prefetching for disk buffers. Technical Report TR9164, CESDIS, NASA Goddard
Space Flight Center, 1991.
[23] K. Salem, D. Barbara, and R.J. Lipton. Probabilistic diagnosis of hot spots. In IEEE Data Engineering
Conf., pages 3039, 1992.
[24] M. Sobel and M.J. Huyett. Selecting the one best of several binomial populations. The Bell System Technical Journal, 36:537-576, 1957.
[25] J.W. Stamos. Static grouping of small objects to enhance performance of a paged memory system. ACM Transactions on Computer Systems, 2(2):155-180, 1984.
[26] M.M. Tsangaris and J.F. Naughton. A stochastic approach for clustering in object bases. In ACM SIGMOD Conf., pages 12-21, 1991.
[27] M.M. Tsangaris and J.F. Naughton. On the performance of object clustering techniques. In ACM SIGMOD Conf., pages 144-153, 1992.
[28] C.K. Wong. Algorithmic Studies in Mass Storage Systems. Computer Science Press, 1983.
