Efficient Processing of Aggregates in Probabilistic Databases
Lixia Chen and Alin Dobra
University of Florida
{lixchen,adobra}@cise.ufl.edu
Abstract
Computing expectations of aggregates over probabilistic databases is believed to be a hard problem. Moreover, since providing confidence intervals for aggregation is strictly necessary, the problem is even harder: at least the expectation and variance need to be evaluated to provide usable confidence intervals. In this paper, we present efficient algorithms for computing moments of SUM-like aggregates over non-correlated multi-relation queries on databases with tuple uncertainty. The core idea in our work is to statistically analyze moments of the aggregates by making extensive use of linearity of expectation. The theoretical formulas can be expressed in SQL and evaluated using a traditional DBMS. Our experiments indicate that only a small overhead, usually below 10%, is needed to evaluate probabilistic aggregations using a traditional DBMS. Furthermore, the distribution of probabilistic aggregates tends to be normal and is well approximated by the normal distribution with the mean and variance computed using our method. This means good-quality confidence bounds based on normality can be efficiently computed. When the normal approximation is not appropriate, conservative Chebychev bounds, which are also based on the mean and variance, can be used instead.
1 Introduction
Probabilistic databases have attracted much attention in recent years because of the increasing demand for
processing uncertain data in practice. Computing aggregates over probabilistic data is useful in situations
where analytical processing is required over uncertain data. Since errors are inevitably introduced in the
data, or the nature of certain kinds of data is uncertain, in a natural manner the need arises to process
aggregates over probabilistic databases. While the need is there, the speed of processing aggregates over probabilistic data has to be comparable with the speed current systems achieve when computing aggregates over non-probabilistic data. Should the former be much slower, the appeal of probabilistic processing is significantly diminished.
The estimation of aggregates over probabilistic databases is perceived as a harder problem than determining probabilities for result tuples. As a consequence, dealing with aggregate queries has been mostly postponed until enough progress is made with probabilistic databases. Moreover, when it comes to aggregation of probabilistic data, it is not enough to compute aggregates over the top-k most probable tuples, since there can be tuples with low probabilities but large aggregate values that have a major influence on the aggregate value. Even more, computation of the average/expected value of the aggregate is not enough; actual confidence intervals need to be provided: confidence intervals that indicate the region in which the value of the aggregate can lie, not only the average behavior. For example, completely different decisions would be made by a manager if the total revenue is $10,000,000 ± $10,000 or if the revenue is $10,000,000 ± $100,000,000. In both cases the average is the same, but the latter situation has the potential to be very risky whereas the first is safe. By computing the variance of the aggregate, good confidence bounds can be provided.
The type of probabilistic database we consider in this paper allows uncertainty only in the existence of
the tuples in the database. This tuple uncertainty model is widely used in previous work [8, 9, 7, 18]. The
actual information in the tuple is not probabilistic and the existence of each tuple is independent of the
existence of all other tuples. While considering extensions of this model is interesting, the problem needs to
be tackled first for this simpler model in order to have a chance to solve the more complicated extensions. A natural line of attack for the computation of aggregates under this probabilistic model is to use the possible-worlds paradigm of [16, 13, 2, 17, 22, 4, 8, 15]. While this helps in understanding the semantics, it does not lead to practical methods for the computation of the moments of aggregates, and thus of the confidence intervals.
A better approach, used successfully in the analysis of sampling methods in [20, 21], is to make extensive use of the linearity of expectation. Other key ingredients are expressing the aggregates in terms of simple random variables and making use of the Kronecker delta symbol. These techniques allow us to characterize SUM- and AVERAGE-like aggregates for non-correlated, distinct-free, multi-relation queries. More precisely, we make the following contributions:
We derive formulas for the moments of SUM aggregates, which allow the computation of confidence intervals.
We indicate how these formulas can be evaluated using SQL and a traditional DBMS.
We optimize the formula for the variance to remove the need to perform computation over the cross-product of matching tuples, and indicate how these optimized formulas can be evaluated using SQL.
We extend the analysis to non-linear aggregates like AVERAGE and to queries containing GROUP-BY clauses.
We evaluate the performance of our algorithms on the TPC-H data set and observe that the distribution of probabilistic aggregates tends to be normal and is well approximated by the normal distribution with the mean and variance computed using our method.
Except for requiring the aggregate query not to have subqueries or duplicate elimination, we put no restrictions on the WHERE condition or on the aggregate functions to be computed. Indeed, the treatment in this paper is fully general for the types of queries we consider. Interestingly, as our experiments with an implementation of our theoretical results show, confidence intervals for aggregate queries over probabilistic data can be efficiently provided. As we will see in Section 6, in most situations the overhead is below 10% (in the worst case a factor of 2) when the optimized formulas for the variance are used. Such performance was obtained using query rewriting alone, without any change to the database system. This might suggest that probabilistic aggregate queries can be easily integrated in existing DBMSes without the need for major redesign.
This paper is organized as follows. In Section 2, we introduce the probabilistic data model, the type
of queries and techniques used in this paper. In Section 3, we analyze the first two moments of aggregates
over one relation. This analysis serves as a warm-up for the general case. Section 4 studies the moments of aggregates over multiple relations and how to evaluate them efficiently using SQL queries. Section 5 extends the analysis to non-linear aggregates and queries with a GROUP-BY clause. We present experimental results in Section 6. Related work is discussed in Section 7 and conclusions are drawn in Section 8.
2 Preliminaries
In this section we formally introduce the queries we support in this work, the formal definition of the probabilistic model we use (the tuple-uncertainty model), and basic techniques used throughout this paper: obtaining confidence intervals from moments and the use of the Kronecker delta δ_ij to encode cases in order to streamline the analysis of random variables.
2.1 Queries and Equivalent Algebraic Expressions
The basic type of query we consider in this paper is:
SELECT SUM(F(t1, ..., tn)) FROM R1 AS t1, ..., Rn AS tn WHERE SIMPLE_CONDITION;
where R1, ..., Rn are n relations, F(.) is an aggregate function that can depend on tuples from each relation, and SIMPLE_CONDITION is a condition involving only tuples from relations R1, ..., Rn. Arbitrary selections and join conditions are allowed, but no subqueries of any kind or DISTINCT are allowed (no queries with set semantics).
In order to perform the statistical analysis in this paper, we need to translate the SQL query above into
algebraic formulas. To this end, we unify the WHERE condition and the aggregate function F(.) into a single
function f(t1, ..., tn) in the following way: (a) introduce a filtering function I_C(t) that takes value 1 if the tuple t satisfies the condition C and 0 otherwise, and (b) define f(t) = F(t) I_C(t). With this, the value of the aggregate we consider is simply:

$$A = \sum_{t_1 \in R_1} \cdots \sum_{t_n \in R_n} f(t_1, \ldots, t_n)$$

that is, the sum of the function f(.) over the cross-product of the relations involved.
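As a concrete illustration, the cross-product semantics above can be sketched in a few lines of Python; the relations, the aggregate function F, and the condition C below are hypothetical stand-ins for a join query, not data from the paper:

```python
from itertools import product

def aggregate(relations, F, C):
    """A = sum of F(t1,...,tn) over the cross product of the relations,
    restricted by the WHERE condition C (i.e. f = F * I_C)."""
    return sum(F(*ts) for ts in product(*relations) if C(*ts))

# hypothetical two-relation example: SUM(r.v * s.w) WHERE r.k = s.k
R1 = [{"k": 1, "v": 2}, {"k": 2, "v": 3}]
R2 = [{"k": 1, "w": 10}, {"k": 2, "w": 5}]
A = aggregate([R1, R2],
              F=lambda r, s: r["v"] * s["w"],
              C=lambda r, s: r["k"] == s["k"])
print(A)  # 2*10 + 3*5 = 35
```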
2.2 Confidence Intervals from Moments
As we argued in the introduction, from the user's perspective it is not enough to provide the expected value of an aggregate over the probabilistic database. A confidence interval would provide a lot more information.
The standard way to obtain confidence intervals for random variables is to compute the first two central moments, E[X] and Var(X), and then to use either a distribution-dependent or a distribution-independent bound. The distribution-dependent bounds assume the type of distribution is known and is one of the two-parameter nice distributions. The most common situation is the application of the Central Limit Theorem to argue that the distribution is asymptotically normal. The two parameters of the distribution are computed from E[X] and Var(X), and then the α/2 and 1 - α/2 quantiles are determined, with 1 - α the desired confidence (α is the allowed error). The distribution-independent bounds use the Chebychev inequality to provide conservative bounds (bounds that are correct irrespective of the distribution but might be unnecessarily large). This bound requires the quantities E[X] and Var(X) as well.
The two types of bounds we discussed above require the computation of E[X] and Var(X). Usually, E[X] is easy to compute but Var(X) poses significant problems. Unfortunately, it is not possible to avoid the computation of Var(X) and still obtain reasonable confidence intervals. If only E[X] is known, only Markov's inequality or Hoeffding bounds can be produced. Both can be reasonably efficient if multiple copies of the random variable are available and averaged, but both are completely inefficient if this is not the case. As we will see in the next section, we have only one copy of the random variable that characterizes the aggregates, thus Var(X) is strictly required if reasonable confidence bounds are to be produced. An alternative to consider is to obtain multiple independent instances of X using Monte Carlo simulation. For the error to be reasonable, at least 100 such samples are required, which results in a 100-fold increase in the running time over non-probabilistic aggregates, a less than ideal scenario.
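To make the two kinds of bounds concrete, here is a minimal Python sketch that turns a mean and variance into a normal-based interval and a conservative Chebychev interval; the revenue numbers echo the example from the introduction, and the z-quantiles are standard normal values, not something computed by the paper's method:

```python
import math

def normal_ci(mean, var, alpha=0.05):
    # distribution-dependent bound: assume asymptotic normality (CLT)
    z = {0.05: 1.959964, 0.01: 2.575829}[alpha]  # standard normal quantiles
    half = z * math.sqrt(var)
    return mean - half, mean + half

def chebyshev_ci(mean, var, alpha=0.05):
    # distribution-independent bound: P(|X - mu| >= k*sigma) <= 1/k^2,
    # so k = 1/sqrt(alpha) guarantees coverage 1 - alpha for any distribution
    half = math.sqrt(var / alpha)
    return mean - half, mean + half

mean, var = 10_000_000.0, 10_000.0 ** 2  # revenue example: sigma = $10,000
print(normal_ci(mean, var))
print(chebyshev_ci(mean, var))  # wider: the price of distribution independence
```

The Chebychev interval is always wider than the normal one at the same confidence level, which is exactly the conservatism the text describes.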
2.3 Probabilistic Database as a Description of a Probability Space
In this paper we use probabilistic databases with the following properties: (a) we are given a set of base relations R1, ..., Rn, (b) to each tuple t_i in a relation R_i we associate a probability p_{t_i} of inclusion into the instance R'_i of R_i, and (c) the inclusion of a tuple is independent of the inclusion of all other tuples. The instances R'_1, ..., R'_n of relations R1, ..., Rn form a possible world or database instance. Since tuple inclusions are independent, the probability of any such possible world is the product of the probabilities of the tuples that appear in the world. We denote the possible worlds by w. This is essentially the tuple-uncertainty probabilistic model.
Since a possible world is a database instance, aggregates of the type described in Section 2.1 over the
database are well defined. If we denote by A such aggregates, then
$$A = \sum_{t_1 \in R'_1} \cdots \sum_{t_n \in R'_n} f(t_1, \ldots, t_n)$$
i.e. the aggregate is the sum of applications of the function f(.) over tuples from the possible world. Since the relations R'_i are random (obtained according to the probabilistic model above), A is a random variable. The moments of A, by definition, are:
$$E[A^k] = \sum_{w \in W} A(w)^k\, p(w) \qquad (1)$$
where p(w) is the probability of the possible world w and W is the set of all possible worlds. This immediately gives formulas for E[A] and Var(A) = E[A^2] - E[A]^2, the two moments that are needed to compute
confidence intervals. Unfortunately, computing the moments of A using this formula is impractical since
the number of possible worlds is exponential in the size of relations. This paper is mostly concerned with
deriving computation methods that avoid the enumeration of all possible worlds.
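For intuition, Equation 1 can be evaluated literally on a toy single relation by enumerating all 2^|R| worlds; the three (value, probability) tuples below are made up for illustration. This is exactly the exponential computation the rest of the paper avoids:

```python
from itertools import product

def moments_by_worlds(tuples, F):
    """Enumerate all 2^|R| worlds of a single tuple-uncertain relation;
    return E[A] and E[A^2] for A = SUM(F(t)) over the world."""
    e1 = e2 = 0.0
    for mask in product([0, 1], repeat=len(tuples)):
        prob = 1.0
        a = 0.0
        for bit, (val, pt) in zip(mask, tuples):
            prob *= pt if bit else (1.0 - pt)
            if bit:
                a += F(val)
        e1 += a * prob
        e2 += a * a * prob
    return e1, e2

# toy relation: (value, inclusion probability)
R = [(2.0, 0.5), (3.0, 0.8), (5.0, 0.1)]
e1, e2 = moments_by_worlds(R, F=lambda v: v)
print(round(e1, 6), round(e2 - e1 * e1, 6))  # E[A] ~ 3.9, Var(A) ~ 4.69
```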
2.4 Use of the Kronecker delta δ_ij
Cases appear naturally in the analysis of complex random variables, since the simple random variables that they are constructed from interact differently with themselves and with other random variables (for example, when the variance of the complex random variable is computed). An elegant method used in [20, 21] is to make use of the Kronecker symbol to encode cases. The Kronecker symbol is defined as:

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
Assume that we are given a quantity Q_ij that takes value a if i = j and b if i ≠ j. We can express Q_ij in terms of δ_ij by observing that we can multiply a by δ_ij and b by (1 - δ_ij) and obtain:

$$Q_{ij} = a\,\delta_{ij} + b\,(1 - \delta_{ij}) = b + (a - b)\,\delta_{ij}$$

This is correct since, if i = j, then δ_ij = 1, thus Q_ij = b + (a - b) · 1 = a (and the symmetric argument for i ≠ j).
The following simplification rule is useful for removing δ_ij from expressions:

$$\sum_i \sum_j f(i, j)\,\delta_{ij} = \sum_i f(i, i)$$

i.e. the double sum collapses into a single sum because of δ_ij. Even if δ_ij is not removed, it is easy to evaluate.
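Both the case-encoding trick and the simplification rule are easy to sanity-check numerically; the values a, b and the vectors x, y below are arbitrary:

```python
def delta(i, j):
    return 1 if i == j else 0

# case expression Q_ij = a if i == j else b, written without branching
a, b = 7.0, 3.0
Q = lambda i, j: b + (a - b) * delta(i, j)
assert Q(1, 1) == a and Q(1, 2) == b

# simplification rule: the delta collapses a double sum into a single sum
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
double = sum(x[i] * y[j] * delta(i, j) for i in range(3) for j in range(3))
single = sum(x[i] * y[i] for i in range(3))
print(double == single)  # True
```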
3 SUM aggregates over one relation
In this section we analyze the SUM aggregate over a single relation. As we will see, the analysis is intuitive
and makes explicit use of the independence of the selection of tuples in the database instance. For this reason,
the proof is direct and poses no problems. The resulting formulas can be rewritten in SQL, which gives a straightforward way to evaluate them using a DBMS. While the analysis is straightforward for the one-relation case, as we will discuss at the end of the section, it cannot be generalized to the multi-relation case. The main reason is that independence is lost the moment two or more relations are involved. Nevertheless, part of the technique is useful for the generalization, and it is better exemplified by this simpler case.
For the one-relation case, the aggregate to be estimated over the instance R' of the relation R is:

$$A = \sum_{t \in R'} f(t)$$

In order to be able to compute E[A] and Var(A), for each tuple t ∈ R we introduce the 0/1 random variable X_t that indicates whether t is selected in R' or not. With this,

$$A = \sum_{t \in R} X_t f(t)$$
Writing A in this manner is key for the analysis, since the dependence on a random range is replaced by a dependence on random variables, so linearity of expectation commutes with the sum. This technique is useful in the general case as well. Using the linearity of expectation, we have:

$$E[A] = \sum_{t \in R} E[X_t]\, f(t) = \sum_{t \in R} p_t\, f(t)$$

where we used the fact that E[X_t] = P[X_t = 1] = p_t.
To compute Var(A), we use the fact that, according to our probability model, the selection of tuples into R' is independent, thus X_t is independent of X_t'. This means that the variance of the sum is the sum of the variances, thus

$$\mathrm{Var}(A) = \sum_{t \in R} \mathrm{Var}(X_t)\, f(t)^2 = \sum_{t \in R} p_t (1 - p_t)\, f(t)^2$$

Above we used the fact that, for a Bernoulli r.v.,

$$\mathrm{Var}(X_t) = p_t (1 - p_t)$$

and the fact that constants are squared when taken out of variances.
In order to evaluate these formulas in SQL, we observe that

f(t) = F(t) I_C(t)

with C the selection condition, thus f(t)^2 = F(t)^2 I_C(t). In order to compute the formulas for E[A] and Var(A) using SQL, we add a P attribute to the relation R that specifies the probability of each tuple. The SQL statement that computes the expectation and variance is:

SELECT SUM(F(t) * P), SUM(F(t)^2 * P * (1 - P))
FROM R AS t WHERE C;
It is immediately apparent that the computation of the expectation and variance is as easy as the computation of the non-probabilistic aggregate. The most surprising fact is that the variance can be computed so efficiently.
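The SQL above runs unchanged on a stock relational engine. As a sketch, here it is executed with Python's built-in sqlite3 on a made-up three-tuple relation; F(t) is just the value attribute v, the WHERE condition is omitted (always true), and since SQLite has no ^ operator the square is spelled v * v:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R (v REAL, P REAL)")
con.executemany("INSERT INTO R VALUES (?, ?)",
                [(2.0, 0.5), (3.0, 0.8), (5.0, 0.1)])

# E[A] = SUM(F(t) * P) and Var(A) = SUM(F(t)^2 * P * (1 - P)), with F(t) = v
e, var = con.execute(
    "SELECT SUM(v * P), SUM(v * v * P * (1 - P)) FROM R"
).fetchone()
print(round(e, 6), round(var, 6))  # expectation ~ 3.9, variance ~ 4.69
```

The numbers match an exhaustive possible-worlds enumeration of the same relation, as the formulas guarantee.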
The above results for the one-relation case are encouraging and might suggest that similar results should exist for the multi-relation case. Unfortunately, the above proof technique cannot be extended to the two-relation case. To see this, let t1 be a tuple in R1 and t2 and t'2 be tuples in R2. Tuples (t1, t2) and (t1, t'2) can both be part of the aggregate computation, but their existence in R'_1 × R'_2 is not independent since the t1 part is common.1 This means that variance cannot be pushed inside the summations, thus a much more complicated computation is required.
4 Analysis for Multirelation SUM Aggregates
As we have seen in Section 2, the computation of the moments of an aggregate using the possible worlds
is computationally infeasible. Some more progress can be made if 0, 1 random variables are introduced to
indicate whether a tuple is included in a relation instance or not, as we have seen in Section 3. We reached an impasse with this method since we depended on independence for the derivation, and independence is lost the moment two relations are involved in the aggregate. The approach in this section to overcoming this problem is to not make explicit use of independence between tuples of the same relation, by introducing a proof that carries out the full computation and does not use the independence shortcut. It might seem that, when this technique is used, significantly longer proofs could result. It turns out that this is not the case and, when
1This can be checked by observing that P[t1 ∈ R'_1 ∧ t2 ∈ R'_2 ∧ t'2 ∈ R'_2] = p_{t1} p_{t2} p_{t'2}, which differs by a factor p_{t1} when compared to P[t1 ∈ R'_1 ∧ t2 ∈ R'_2] · P[t1 ∈ R'_1 ∧ t'2 ∈ R'_2] = p_{t1}^2 p_{t2} p_{t'2}.
pairing up this style of proofs with extensive use of linearity of expectation, compact and general derivations
result. The proof technique we use is similar to [20, 21] and makes use of the Kronecker delta symbol.
The first key ingredient of the analysis is to use the 0/1 indicator variables X_t for each tuple in each of the relations.2 Each X_t has a Bernoulli distribution parametrized by the probability p_t. The second key ingredient is to express the interaction between two random variables X_t and X_t' in terms of the Kronecker delta. This allows a purely algebraic manipulation without the need to deal with cases that significantly complicate the formulas. When these two ingredients are used together, the derivations become straightforward.
With the above comments in mind, the analysis of the aggregate A will consist in: (a) expressing A in terms of the data and the r.v. X_t, (b) using linearity of expectation to compute E[A] and Var(A) in terms of X_t, and (c) interpreting the resulting formulas from a database point of view and expressing them in SQL. With the notation in Section 2.3 and using the same idea, the aggregate A can be expressed in terms of X_t as:

$$A = \sum_{t_1 \in R'_1} \cdots \sum_{t_n \in R'_n} f(t_1, \ldots, t_n) = \sum_{t_1 \in R_1} \cdots \sum_{t_n \in R_n} \left( \prod_{i=1}^{n} X_{t_i} \right) f(t_1, \ldots, t_n) \qquad (3)$$
The following properties of X_t, which follow directly from the fact that it has a Bernoulli distribution, are needed in the rest of the paper:

$$E[X_t] = P[t \in R'] = p_t \qquad (4)$$

$$E[X_t X_{t'}] = \begin{cases} P[t \in R'] & t = t' \\ P[t \in R' \wedge t' \in R'] & t \neq t' \end{cases} = \begin{cases} p_t & t = t' \quad (\times\, \delta_{tt'}) \\ p_t p_{t'} & t \neq t' \quad (\times\,(1 - \delta_{tt'})) \end{cases} = p_t p_{t'} + p_t (1 - p_t)\, \delta_{tt'} \qquad (5)$$
where we used the fact that X_t and X_t' are independent (if t ≠ t') and the technique explained in Section 2.4 to express the cases using the Kronecker delta.
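Equation 5 can be verified directly by enumerating the joint outcomes of two independent Bernoulli indicators; the probabilities below are arbitrary:

```python
from itertools import product

p = [0.3, 0.6]  # hypothetical inclusion probabilities for two tuples

def delta(i, j):
    return 1 if i == j else 0

def moment(i, j):
    """E[X_i X_j] by enumerating the four joint outcomes of the indicators."""
    total = 0.0
    for x in product([0, 1], repeat=2):
        prob = 1.0
        for bit, q in zip(x, p):
            prob *= q if bit else 1 - q
        total += prob * x[i] * x[j]
    return total

for i in range(2):
    for j in range(2):
        rhs = p[i] * p[j] + p[i] * (1 - p[i]) * delta(i, j)
        assert abs(moment(i, j) - rhs) < 1e-12
print("equation (5) verified for all pairs")
```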
Since the formulas would be too large with the current notation (too many summation signs), as in [21] we introduce a more compact notation for summations:

$$\sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} f(\{t_i \mid i \in \{1:n\}\}) = \sum_{t_1 \in R_1} \cdots \sum_{t_n \in R_n} f(t_1, \ldots, t_n)$$

Sums that are subscripted by a set {t_i ∈ R_i | i ∈ S} are equivalent to sums over each of the indexes in the set. We use the same compact notation for the arguments of f(.).
With this, we have the following result:

Theorem 1 The moments of A, the aggregate defined by Equation 3, are:

$$E[A] = \sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \left( \prod_{i=1}^{n} p_{t_i} \right) f(\{t_i \mid i \in \{1:n\}\})$$

$$E[A^2] = \sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \; \sum_{\{t'_i \in R_i \mid i \in \{1:n\}\}} \left( \prod_{i=1}^{n} \left( p_{t_i} p_{t'_i} + p_{t_i} (1 - p_{t_i})\, \delta_{t_i t'_i} \right) \right) f(\{t_i \mid i \in \{1:n\}\})\, f(\{t'_i \mid i \in \{1:n\}\})$$

$$\mathrm{Var}(A) = E[A^2] - E[A]^2$$
2There is no need to index the random variables by the relation as well, as is done in [21], since it will be clear from the
context what relation we are referring to and all random variables interact the same way.
Proof. Using Equation 4, linearity of expectation, and the fact that the X_t r.v. are independent for different relations, thus the expectation of products is the product of expectations, we have:

$$E[A] = \sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \left( \prod_{i=1}^{n} E[X_{t_i}] \right) f(\{t_i \mid i \in \{1:n\}\}) = \sum_{\{t_i \in R_i \mid i \in \{1:n\}\}} \left( \prod_{i=1}^{n} p_{t_i} \right) f(\{t_i \mid i \in \{1:n\}\})$$

Creating two instances of A, with the indexes primed for the second one, and using Equation 5 and linearity of expectation, we have:

$$E[A^2] = \sum_{\{t_i \in R_i\}} \sum_{\{t'_i \in R_i\}} E\!\left[ \prod_{i=1}^{n} X_{t_i} X_{t'_i} \right] f(\{t_i\})\, f(\{t'_i\}) = \sum_{\{t_i \in R_i\}} \sum_{\{t'_i \in R_i\}} \left( \prod_{i=1}^{n} E[X_{t_i} X_{t'_i}] \right) f(\{t_i\})\, f(\{t'_i\})$$

$$= \sum_{\{t_i \in R_i\}} \sum_{\{t'_i \in R_i\}} \left( \prod_{i=1}^{n} \left( p_{t_i} p_{t'_i} + p_{t_i} (1 - p_{t_i})\, \delta_{t_i t'_i} \right) \right) f(\{t_i\})\, f(\{t'_i\})$$
Making the same observations as in Section 3, the formulas in Theorem 1 can be efficiently evaluated using a DBMS. Remember that f(t) = F(t) I_C(t), where F(t) is the aggregate function and I_C(t) is the indicator function of the condition C. To compute E[A] using SQL, we observe that the query is the same except that the aggregate function is multiplied by the product of probabilities:

SELECT SUM(F(t1, ..., tn) * t1.P * ... * tn.P)
FROM R1 AS t1, ..., Rn AS tn
WHERE SIMPLE_CONDITION;
In order to compute E[A^2], thus Var(A), we have to compute the aggregate over the cross product of the result tuples. To translate this efficiently into SQL, we put the unaggregated value and the probabilities into a temporary table and then compute the aggregate over the cross-product of this table with itself. It is possible to write this as a single query but here, for ease of exposition, we prefer two queries. We will assume that a function Delta(i,j) that implements the Kronecker delta is available (functions can be added to all major database systems). In order to be able to apply the function, we need the ids of the tuples (primary keys are ideal ids; we will assume that the attribute ID exists). With this, the SQL code to compute E[A^2] is:

SELECT F(t1, ..., tn) AS F, t1.P AS P1, ..., tn.P AS Pn, t1.ID AS ID1, ..., tn.ID AS IDn
INTO unaggregated                                                              (6)
FROM R1 AS t1, ..., Rn AS tn
WHERE SIMPLE_CONDITION;

SELECT SUM(U.F * V.F * (U.P1 * V.P1 + U.P1 * (1 - U.P1) * Delta(U.ID1, V.ID1)) * ...)
FROM unaggregated AS U, unaggregated AS V;
Clearly, except for a slightly more complicated aggregate function, the effort to compute E[A] is the same as the effort to compute the original aggregate A. Furthermore, the query plan can remain the same and still be efficient. The computation of E[A^2] involves the cross-product of all result tuples, with no possibility of further optimization at the database level; there are no conditions to take advantage of to replace the cross-product by a join. This evaluation is equivalent to the method described in Section 4.1 if the probabilistic database is highly optimized for this particular type of query. The fact that SQL can be used directly instead of redesigning the database helps, but this solution is not fundamentally better. As we will see in Section 4.2, further optimizations can be applied to this solution in order to significantly reduce the time complexity.
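The formulas of Theorem 1 are easy to cross-check on a toy instance. The sketch below (hypothetical two-relation data, f equal to the product of the two values, no WHERE condition) computes E[A] and E[A^2] from the theorem's cross-product expression and compares them against brute-force enumeration of all possible worlds:

```python
from itertools import product

def delta(i, j):
    return 1 if i == j else 0

# hypothetical tuple-uncertain relations: (id, value, probability)
R1 = [(1, 2.0, 0.5), (2, 3.0, 0.7)]
R2 = [(1, 10.0, 0.4), (2, 5.0, 0.9)]

def f(t, v):  # f = F (the condition is trivially true here)
    return t[1] * v[1]

# E[A] and E[A^2] from Theorem 1: sums over (pairs of) matching tuples,
# with Kronecker deltas encoding the same-tuple cases
EA = sum(t[2] * v[2] * f(t, v) for t in R1 for v in R2)
EA2 = sum(
    (t[2] * tp[2] + t[2] * (1 - t[2]) * delta(t[0], tp[0]))
    * (v[2] * vp[2] + v[2] * (1 - v[2]) * delta(v[0], vp[0]))
    * f(t, v) * f(tp, vp)
    for t in R1 for tp in R1 for v in R2 for vp in R2
)

# brute-force check: enumerate all 2^4 possible worlds
e1 = e2 = 0.0
for m1 in product([0, 1], repeat=2):
    for m2 in product([0, 1], repeat=2):
        prob = 1.0
        for bit, (_, _, pt) in zip(m1 + m2, R1 + R2):
            prob *= pt if bit else 1 - pt
        a = sum(f(t, v) for b1, t in zip(m1, R1) if b1
                for b2, v in zip(m2, R2) if b2)
        e1 += prob * a
        e2 += prob * a * a
print(abs(EA - e1) < 1e-9, abs(EA2 - e2) < 1e-9)  # True True
```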
4.1 Connection with existing probabilistic databases
Let us consider a more general type of query in this section: aggregation queries in which DISTINCT and set operators are allowed. The only restriction is for the aggregate to be applied as the last operation. Essentially, this means that the queries have the same structure as described in Section 2, but the relations R1, ..., Rn can be views instead of stored relations. The algebraic formula of the aggregate is:

$$A = \sum_{t \in R_M} F(t)$$
where R_M is the relation containing the matching tuples and the aggregate is SUM(F(t)). If the database is probabilistic, the value of the aggregate over an instance R'_M is:

$$A = \sum_{t \in R'_M} F(t) = \sum_{t \in R_M} X_t F(t)$$

where, as we did in Sections 3 and 4, we introduced a 0/1 random variable X_t that indicates whether the matching tuple t is included in the instance (world) R'_M or not. The introduction of X_t is crucial since we
can use the linearity of expectation to prove the following result:
Proposition 1

$$E[A] = \sum_{t \in R_M} P[t \in R'_M]\, F(t)$$

$$E[A^2] = \sum_{t \in R_M} \sum_{t' \in R_M} P[t \in R'_M \wedge t' \in R'_M]\, F(t)\, F(t')$$
Proof. First, by linearity of expectation:

$$E[A] = E\!\left[ \sum_{t \in R_M} X_t F(t) \right] = \sum_{t \in R_M} E[X_t]\, F(t) = \sum_{t \in R_M} P[t \in R'_M]\, F(t)$$

where we used the fact that X_t is a 0/1 random variable, thus

$$E[X_t] = 1 \cdot P[X_t = 1] + 0 \cdot P[X_t = 0] = P[X_t = 1] = P[t \in R'_M]$$

Similarly,

$$E[A^2] = E\!\left[ \sum_{t \in R_M} \sum_{t' \in R_M} X_t X_{t'} F(t) F(t') \right] = \sum_{t \in R_M} \sum_{t' \in R_M} E[X_t X_{t'}]\, F(t)\, F(t') = \sum_{t \in R_M} \sum_{t' \in R_M} P[t \in R'_M \wedge t' \in R'_M]\, F(t)\, F(t')$$

where we used the fact that

$$E[X_t X_{t'}] = P[t \in R'_M \wedge t' \in R'_M]$$

(using the same reasoning as for E[X_t]) and linearity of expectation. □
The formulas in Proposition 1 can be evaluated using a probabilistic database in the following manner.
To compute E [A] go over all possible matching tuples and multiply the probability to see the tuple and the
value of the aggregate function (all these terms are accumulated in the total sum). To compute E [A2] go
over the cross product of the possible matching tuples and multiply the probability to simultaneously see
the two tuples and the product of values of the aggregate function applied to the two tuples.
Interestingly, the above method to compute E[A] and Var(A) can be seen as a way to implement, using existing probabilistic databases, the results in Theorem 1. For the queries we consider in this paper, since we were able in the previous section to write SQL statements to carry out their computation, the fact that the same formulas can be computed using existing probabilistic databases is not very useful, since executing the SQL statements using a traditional DBMS will be significantly more efficient. The usefulness comes in when we observe that the same technique can be applied to queries that contain DISTINCT and set operators, in which case a probabilistic database can compute the probability to see a matching tuple. Potentially, existing probabilistic databases can be modified to compute the probability of simultaneous inclusion of two matching tuples, thus allowing the computation of E[A^2], which gives Var(A).
An even tighter connection can be established with the safe plans of Dalvi and Suciu [8]. If we relate the formulas for E[A] in Theorem 1 and Proposition 1, we immediately have:

$$P[(t_1, \ldots, t_n) \in R'_M] = \prod_{i=1}^{n} p_{t_i}$$

This is precisely the probability, as computed by the method in [8], of the matching tuple t = (t_1, ..., t_n).
It is important to notice that, for the non-aggregate part of the queries we consider in this paper, any plan is a safe plan since they do not contain DISTINCT. It is no surprise then that the same formula should be used for P[t ∈ R'_M]. Since safe plans cannot contain the same relation multiple times, the method in [8] is not general enough to allow the computation of P[t ∈ R'_M ∧ t' ∈ R'_M], which is required according to Proposition 1 in order to compute E[A^2] and Var(A).
The above observations suggest that the moment-based analysis of SUM-like aggregates that we consider in this paper can be used to compute confidence intervals for aggregates in the most general scenarios as well. The fact that the cross product of matching tuples has to be considered might be a bigger problem than the fact that complex events have to be dealt with if the queries are relatively simple. For complex queries or queries with a large number of matching tuples, both the method described in this section and the method in the previous section seem impractical. A significantly better method is described in the next section, but it works only for the queries considered in this paper, not for general queries.
4.2 Reducing the time complexity
When we derived the formula for Var(A) for the one-relation case in Section 3, the resulting formula could be evaluated directly over the set of matching tuples; there was no need to go over the cross-product of the matching tuples for the computation of the E[A^2] part of the variance. An interesting question to ask is whether such a reduction is possible for the multi-relation case.
A fundamental question that can be asked about the formula for E [A2] in Theorem 1 is whether we
can remove the Kronecker deltas, for example by using the simplification rule in Section 2.4. While it is not
clear why this might reduce the complexity, it would be a necessary step in that direction since otherwise
there is no way to proceed.
To see where the opportunity lies, let us consider the two-relation case. In this case, the formula for E[A^2] can be written as:

$$E[A^2] = \sum_{t \in R_1} \sum_{t' \in R_1} \sum_{v \in R_2} \sum_{v' \in R_2} \left( p_t p_{t'} + p_t (1 - p_t)\, \delta_{tt'} \right) \left( p_v p_{v'} + p_v (1 - p_v)\, \delta_{vv'} \right) f(t, v)\, f(t', v')$$

$$= \sum_{t \in R_1} \sum_{t' \in R_1} \sum_{v \in R_2} \sum_{v' \in R_2} \left( p_t p_{t'} p_v p_{v'} + p_t (1 - p_t)\, \delta_{tt'}\, p_v p_{v'} + p_t p_{t'}\, p_v (1 - p_v)\, \delta_{vv'} + p_t (1 - p_t)\, \delta_{tt'}\, p_v (1 - p_v)\, \delta_{vv'} \right) f(t, v)\, f(t', v')$$

Applying the simplification rule of Section 2.4 to collapse the sums involving the Kronecker deltas, the four terms factorize as:

$$E[A^2] = \left( \sum_{t \in R_1} \sum_{v \in R_2} p_t p_v f(t, v) \right)^2 + \sum_{t \in R_1} p_t (1 - p_t) \left( \sum_{v \in R_2} p_v f(t, v) \right)^2 + \sum_{v \in R_2} p_v (1 - p_v) \left( \sum_{t \in R_1} p_t f(t, v) \right)^2 + \sum_{t \in R_1} \sum_{v \in R_2} p_t (1 - p_t)\, p_v (1 - p_v)\, f(t, v)^2$$
Now we can see why such a derivation leads to better evaluation algorithms for E[A^2]. The first and the last terms straightforwardly require just aggregates over the matching tuples. To see how to efficiently evaluate the second term (the third term is symmetric), we observe that the inner square needs to be computed for each tuple t ∈ R1. Thus, we can compute these squares using a GROUP BY with the correct aggregates. Once the squares are computed, one for each t ∈ R1, we simply compute the rest of the aggregate. The query in SQL is:

SELECT SUM(SQ * P1 * (1 - P1))
FROM ( SELECT SUM(F * P2)^2 AS SQ, P1
       FROM unaggregated
       GROUP BY ID1, P1 );

The grouping on P1 is needed since otherwise the information cannot be passed up to the outer query. Thus, it seems that each term can be evaluated almost as efficiently as an aggregate over the relation unaggregated (an extra GROUP BY is required).
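The four-term decomposition can be checked numerically. The Python sketch below (made-up two-relation data) computes E[A^2] term by term, mirroring what the GROUP BY queries compute, and verifies the result against possible-worlds enumeration:

```python
from itertools import product

# hypothetical tuple-uncertain relations: (value, probability)
R1 = [(2.0, 0.5), (3.0, 0.7)]
R2 = [(10.0, 0.4), (5.0, 0.9)]
f = lambda x, y: x * y

# four-term decomposition of E[A^2] for the two-relation case:
#   S = {}    : (sum_t sum_v pt pv f)^2 = E[A]^2
#   S = {1}   : sum_t pt(1-pt) (sum_v pv f)^2   (a GROUP BY on R1's id)
#   S = {2}   : sum_v pv(1-pv) (sum_t pt f)^2   (a GROUP BY on R2's id)
#   S = {1,2} : sum_t sum_v pt(1-pt) pv(1-pv) f^2
EA = sum(pt * pv * f(t, v) for t, pt in R1 for v, pv in R2)
t1 = sum(pt * (1 - pt) * sum(pv * f(t, v) for v, pv in R2) ** 2
         for t, pt in R1)
t2 = sum(pv * (1 - pv) * sum(pt * f(t, v) for t, pt in R1) ** 2
         for v, pv in R2)
t12 = sum(pt * (1 - pt) * pv * (1 - pv) * f(t, v) ** 2
          for t, pt in R1 for v, pv in R2)
EA2 = EA ** 2 + t1 + t2 + t12

# brute-force check over all possible worlds
e2 = 0.0
for m1 in product([0, 1], repeat=2):
    for m2 in product([0, 1], repeat=2):
        prob = 1.0
        for bit, (_, pt) in zip(m1 + m2, R1 + R2):
            prob *= pt if bit else 1 - pt
        a = sum(f(t, v) for b1, (t, _) in zip(m1, R1) if b1
                for b2, (v, _) in zip(m2, R2) if b2)
        e2 += prob * a * a
print(abs(EA2 - e2) < 1e-9)  # True
```

Note that the decomposition touches only the matching tuples (plus two grouped scans), never the quadratic cross-product of tuple pairs.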
The above derivation suggests that we might be able to efficiently compute Var(A) in the general case. Indeed, this is possible, as we show in the rest of this section. In order to provide the result, we need the following technical lemma. The result will be used in Section 5 as well:
Lemma 1 Let a_{t_i}, b_{t_i}, i ∈ {1:n}, t_i ∈ R_i be arbitrary values. Furthermore, let f({t_i}) and g({t_i}) be arbitrary functions of the tuples {t_i}. Then

$$\sum_{\{t_i \in R_i\}} \sum_{\{t'_i \in R_i\}} \left( \prod_{i=1}^{n} \left( a_{t_i} a_{t'_i} + b_{t_i}\, \delta_{t_i t'_i} \right) \right) f(\{t_i\})\, g(\{t'_i\}) = \sum_{S \in \mathcal{P}(n)} \sum_{\{t_k \in R_k \mid k \in S\}} \left( \prod_{k \in S} b_{t_k} \right) \left( \sum_{\{t_j \in R_j \mid j \in S^c\}} \left( \prod_{j \in S^c} a_{t_j} \right) f(\{t_i\}) \right) \left( \sum_{\{t'_j \in R_j \mid j \in S^c\}} \left( \prod_{j \in S^c} a_{t'_j} \right) g(\{t_k \mid k \in S\} \cup \{t'_j \mid j \in S^c\}) \right)$$

with P(n) the powerset of {1:n}, S a subset of {1:n}, and S^c the complement of S w.r.t. {1:n}.
The lemma above provides a recipe to remove the Kronecker deltas and factorize the result. Using this lemma, we can prove the following result:

Theorem 2 Let A be the aggregate defined by Equation 3. Then,

$$E[A^2] = \sum_{S \in \mathcal{P}(n)} \sum_{\{t_k \in R_k \mid k \in S\}} \left( \prod_{k \in S} p_{t_k} (1 - p_{t_k}) \right) \left( \sum_{\{t_j \in R_j \mid j \in S^c\}} \left( \prod_{j \in S^c} p_{t_j} \right) f(\{t_i \mid i \in \{1:n\}\}) \right)^2$$

Proof. The result follows directly from the expression of E[A^2] in Theorem 1 and Lemma 1 with a_t = p_t, b_t = p_t(1 - p_t), and g(.) = f(.). □
The above result indicates that, for each set S ∈ P(n), 2^n in all, we have to perform a computation that generalizes the two-relation case we saw before. Using the same reasoning, the SQL query that computes the term corresponding to the set S is:

SELECT SUM(SQ * Π_{i ∈ S} Pi * (1 - Pi))
FROM ( SELECT SUM(F * Π_{j ∈ S^c} Pj)^2 AS SQ, {Pi | i ∈ S}                    (8)
       FROM unaggregated
       GROUP BY {IDi, Pi | i ∈ S} );
The computation is exponential in n, since 2^n different aggregates need to be computed, but, in the worst case, it just requires a sort and a linear scan of the relation unaggregated for each of the sets S. When the size of unaggregated is large, we expect this method of computing E[A^2] to be much faster than the method in Section 4. The unoptimized method would be comparable only when the number of matching tuples is smaller than n2^n.
5 Extensions
The analysis and computation for SUM aggregates can serve as the basis for computing more complicated aggregates. The two extensions we consider here are non-linear aggregates and GROUP-BY.
5.1 Nonlinear aggregates
The method described so far can provide confidence intervals only for SUM aggregates. In
practice, other aggregates similar to SUM, such as AVERAGE and VARIANCE, are useful. Such aggregates
can be computed from multiple SUM aggregates by combining them nonlinearly.3 For example,
AVERAGE is the ratio of SUM and COUNT4. Since our analysis made extensive use of the linearity of the SUM aggregate,
it cannot be applied directly to nonlinear aggregates like AVERAGE.
3If the combination is linear, then the F(.) functions can be combined into a single such function that is used for computation.
4COUNT is a SUM with the aggregate function F(.) = 1.
To provide a general treatment, let A1,...,Ak be SUM aggregates (with different aggregation functions).
Then, a general nonlinear aggregate computes A = F(A1,...,Ak). For the AVERAGE aggregate, A1 is
SUM(F), A2 is SUM(1) and F(A1, A2) = A1/A2.
Now, let A1,...,Ak be the random variables that give the values of these aggregates for the probabilistic
database instance. The value of the aggregate A on the instance would be A = F(A1,...,Ak). As we
did before, we need to compute (or estimate with good precision) E[A] and Var(A) in order to provide
confidence intervals for the aggregate. The standard method in Statistics for analyzing nonlinear combinations
of random variables is the delta method [26]. The delta method consists in expressing the moments of A in
terms of the moments of A1,...,Ak. In particular:

E[A] ≈ F(E[A1],...,E[Ak])
Var(A) ≈ ∇F(E[A1],...,E[Ak])^T Var(A1,...,Ak) ∇F(E[A1],...,E[Ak])
where ∇F(·) is the gradient of the function F (i.e., the vector consisting of the partial derivatives
w.r.t. each component) evaluated at E[A1],...,E[Ak], and Var(A1,...,Ak) is the variance matrix, which has
Var(Ai) on the diagonal and Cov(Ai, Aj) off the diagonal. We can use the technique developed in this
paper to compute E[Ai] and Var(Ai) efficiently for each Ai. The only remaining task is to compute

Cov(Ai, Aj) = E[Ai Aj] - E[Ai] E[Aj]

The only challenge is computing the E[Ai Aj] component. The formula is provided by the following result:
Theorem 3 Let Au, Av be the sampling estimates defined by Equation 3 for two different SUM aggregates
given by aggregate functions f_u(·) and f_v(·). Then,

E[Au Av] = Σ_{S ∈ P(n)} Σ_{t_k ∈ R_k, k ∈ S} ( ∏_{k ∈ S} p_{t_k}(1 - p_{t_k}) ) × ( Σ_{t_j ∈ R_j, j ∈ S^c} ( ∏_{j ∈ S^c} p_{t_j} ) f_u({t_i}) ) × ( Σ_{t'_{j'} ∈ R_{j'}, j' ∈ S^c} ( ∏_{j' ∈ S^c} p_{t'_{j'}} ) f_v({t'_{i'}}) )
Proof. First, we observe that the same algebraic manipulations as in the case of E[A^2] in the proof of
Theorem 1 can be used for E[Au Av]. Using linearity of expectation, we obtain:

E[Au Av] = Σ_{t_i ∈ R_i, i ∈ {1:n}} Σ_{t'_{i'} ∈ R_{i'}, i' ∈ {1:n}} ( ∏_{i=1}^{n} ( p_{t_i} p_{t'_i} + p_{t_i}(1 - p_{t_i}) δ_{t_i t'_i} ) ) × f_u({t_i | i ∈ {1:n}}) × f_v({t'_{i'} | i' ∈ {1:n}})

Now, we can again use Lemma 1 with a_t = p_t, b_t = p_t(1 - p_t), f(·) = f_u(·) and g(·) = f_v(·),
and we obtain the required result. □
To see how we can evaluate the E[Au Av] terms using SQL, we first observe that the two aggregates are
computed using the same WHERE condition but different aggregate functions. In particular,

f_u(·) = F_u(·) δ_C(·)
f_v(·) = F_v(·) δ_C(·)

where δ_C(·) is the indicator function of the selection condition C.5
5Not to be confused with the Kronecker delta.
This means that the sets of matching tuples over which the two aggregates are computed are the same. Using the
same reasoning as in Section 4.2, where we expressed the computation of E[A^2] in SQL, we get the following
SQL expression for the computation of the terms corresponding to set S in the expression of E[Au Av]:
SELECT SUM(SQ_u × SQ_v × ∏_{i ∈ S} P_i × (1 - P_i))
FROM ( SELECT SUM(F_u × ∏_{j ∈ S^c} P_j) AS SQ_u, SUM(F_v × ∏_{j ∈ S^c} P_j) AS SQ_v, {P_i, i ∈ S}
       FROM unaggregated
       GROUP BY {ID_i, P_i, i ∈ S} );
Furthermore, since the SQL queries that compute the terms corresponding to a set S for all the E[Au^2] and
E[Au Av] terms are identical except for the actual aggregates computed (the GROUP BY is the same), they can
all be combined into a single query that computes all required aggregates simultaneously. This means that
the overall number of SQL queries is 2^n, irrespective of the number of SUM aggregates combined by the
nonlinear aggregate. The number of aggregates computed by each such query is k(k + 1)/2, where k is
the number of SUM aggregates.
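For AVERAGE, the delta method reduces to a few arithmetic operations once the moments of the underlying SUM aggregates are available. The following Python sketch (the moment values are hypothetical placeholders for the quantities produced by the SQL queries above) applies the formulas for F(A1, A2) = A1/A2:

```python
# Delta-method sketch for AVERAGE = F(A1, A2) = A1 / A2 (a SUM over a COUNT).
# The moment values below are hypothetical placeholders.
e1, e2 = 10.0, 5.0                  # E[A1], E[A2]
var1, var2, cov12 = 4.0, 1.0, 1.0   # Var(A1), Var(A2), Cov(A1, A2)

# Gradient of F(a1, a2) = a1 / a2 evaluated at (E[A1], E[A2]).
grad = (1.0 / e2, -e1 / (e2 * e2))

mean_avg = e1 / e2                  # E[A]   ~ F(E[A1], E[A2])
var_avg = (grad[0] ** 2 * var1      # Var(A) ~ grad^T Sigma grad
           + 2.0 * grad[0] * grad[1] * cov12
           + grad[1] ** 2 * var2)
```

The same gradient computation extends to any differentiable combining function F.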
5.2 GROUPBY queries
If the query contains a GROUPBY clause, confidence intervals for the values of the aggregates have to be
provided for each of the groups. The computation of the moments of the aggregates for each of these groups is
no different from the computation of the aggregates over the entire relations. With this in mind, one method
to obtain the desired confidence intervals per group would be to generate the tuples in each group and then
apply the techniques described in this paper to each such group. Fortunately, this can be accomplished
using SQL, without the need to change the database engine or to perform external computation. Essentially, for each
aggregate that is generated, we have to add a GROUP BY clause to account for the extra grouping and make
sure we have the group information where we need it. For example, with grouping on the set of attributes
G, the SQL statement at the end of Section 4.2 becomes:
SELECT SUM(SQ × ∏_{i ∈ S} P_i × (1 - P_i)), G
FROM ( SELECT SUM(F × ∏_{j ∈ S^c} P_j)^2 AS SQ, {P_i, i ∈ S}, G
       FROM unaggregated
       GROUP BY {ID_i, P_i, i ∈ S}, G )
GROUP BY G;
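Because the rewriting produces plain SQL, any DBMS can evaluate it. As a minimal illustration of the per-group idea, the following Python sketch uses SQLite and the single-relation formulas of Section 3 (the schema grp, val, prob and the data are hypothetical) to compute the per-group expectation and variance of a probabilistic SUM entirely in SQL:

```python
import sqlite3

# Minimal single-relation sketch: for a tuple-independent relation,
# E[A] = SUM(val * prob) and Var(A) = SUM(val^2 * prob * (1 - prob)) per group.
# The schema (grp, val, prob) and the rows are hypothetical toy data.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (grp TEXT, val REAL, prob REAL)')
con.executemany('INSERT INTO t VALUES (?, ?, ?)',
                [('a', 10.0, 0.5), ('a', 20.0, 0.25), ('b', 4.0, 1.0)])

rows = con.execute("""
    SELECT grp,
           SUM(val * prob)                      AS exp_sum,  -- E[A] per group
           SUM(val * val * prob * (1.0 - prob)) AS var_sum   -- Var(A) per group
    FROM t
    GROUP BY grp
    ORDER BY grp
""").fetchall()
print(rows)  # [('a', 10.0, 100.0), ('b', 4.0, 0.0)]
```

The certain tuple (prob = 1.0) contributes nothing to the variance, as expected.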
6 Experiments
The most important question to answer through experiments is what the overhead of computing the moments of
the probabilistic aggregates is when compared to the execution of the nonprobabilistic query. As we will see
in this section, the overhead is usually small (this is especially true for the optimized version), thus the
moments of the probabilistic aggregates can be computed with only minimal performance degradation.
A second question to ask is what the distribution of the probabilistic aggregates is. As our experiments
will show, the distribution tends to be normal; thus the expectation and variance provide complete information
about the probabilistic aggregates. This is the best possible scenario, since it means that tight confidence
intervals can be produced efficiently.
Methodology All the experiments were carried out on a machine with 4 processors (2.4 GHz) and 8GB of RAM
running Linux (Ubuntu, 2.6.20 kernel). We used Postgres as the DBMS, without any modification. Our
algorithms work as a client program that generates the SQL queries that implement the probabilistic aggregates.
Both the client and the DBMS were running on the same machine. We used the TPCH benchmark with data sets of size
0.1G, 1G and 10G. For each tuple, we added a probability attribute generated uniformly at random
in the [0, 1] interval. Notice that the query execution time does not depend on the values of the probabilities,
since the same computations are performed, just with different numbers.
6.1 Computation of Moments
We implemented both the unoptimized (Algorithm 1) and the optimized (Algorithm 2) versions of our
algorithm using query rewriting. Aggregation over one relation was special-cased in both algorithms (the
specialized algorithm in Section 3 was used). Both algorithms first compute the unaggregated table and
then process it to compute the moments. The main difference is the fact that the unoptimized algorithm
needs to evaluate a single query over the cross product unaggregated × unaggregated, while the optimized
algorithm needs to evaluate 2^n GROUP BY queries over unaggregated.
Algorithm 1 ComputeMomentsUnOpt(q)
1: if q is an aggregation over a single relation then
2: Generate and execute SQL code(2)
3: else
4: Create a temporary table unaggregated using SQL code(6)
5: Generate and execute SQL code(7)
6: end if
Algorithm 2 ComputeMomentsOpt(q)
1: if q is an aggregation over a single relation then
2: Generate and execute SQL code (2)
3: else
4: Create a temporary table unaggregated using SQL code (6)
5: for i ← 0 to 2^#relations - 1 do
6: Map each relation to set S or S^c according to the bits of i
7: Generate and execute SQL code (8)
8: end for
9: Sum up the computed 2^#relations items of each group.
10: end if
The algorithms are evaluated on queries Q1, Q3, Q5 and Q6 (all the queries in TPCH without subqueries)
for databases of size 1G (Table 1) and 10G (Table 2). To get insight into how the time is spent, for all queries
that involve more than one relation we timed both the generation of the unaggregated table and the time to
evaluate the moments once the unaggregated table is generated, for the unoptimized and the optimized algorithm. Q3
is executed with the original GROUPBY and without it in order to see the impact of large groups for which the
aggregates need to be computed. The experimental results reveal the following:
Queries  # rel.  GroupBy  # matches  time           time probabilistic (ms)
                                     nonprob. (ms)  gen. unagg. table  unoptimized  optimized  % inc.
Q1       1       ✓        6001197    35105          -                  -            61673      -
Q3       3       ✓        30569      20637          21010              2905         1500       -
Q3       3       ✗        30569      20270          21056              ~4 hours     711        -
Q5       6       ✓        23         5297           5492               86           2422       -
Q6       1       ✗        202        21534          -                  -            21535      -

Table 1: Experimental results for TPCH queries, 1G database
Queries  # rel.  GroupBy  # matches  time           time probabilistic (ms)
                                     nonprob. (ms)  gen. unagg. table  unoptimized  optimized  % inc.
Q1       1       ✓        59986052   529867         -                  -            1199574    -
Q3       3       ✓        301618     714125         737278             65723        32412      7.8%
Q3       3       ✗        301618     709218         711089             ~15 days     10064      1.7%
Q5       6       ✓        195        52798          53630              292          2520       6.4%
Q6       1       ✗        2094       400423         -                  -            403630     -

Table 2: Experimental results for TPCH queries, 10G database
Aggregation over one relation Queries Q1 and Q6 involve a single relation (for this case, the method
in Section 3 was used instead of the general method). For Q6, the time to perform the probabilistic aggregate
is virtually the same as the time for the nonprobabilistic aggregate for both database sizes. The time to find
matching tuples dwarfs the aggregation time. For Q1, most of the time is spent on computing aggregates, not on
finding matching tuples. Since the probabilistic aggregate needs 25 aggregations versus only 8 aggregations
for the nonprobabilistic aggregate, the running time is approximately 25/8 = 3.125 times higher. Q1
appears to be CPU bound, not I/O bound.
Table unaggregated The time to generate the unaggregated table is almost the same as the execution time of
the original nonprobabilistic query. This is expected, since the aggregation is usually the last operation
performed; thus the table unaggregated is generated as an intermediate result anyway.
Comparison of unoptimized and optimized algorithms The advantage of the optimized algorithm
is that it does not require the formation of the cross product within each group of tuples in unaggregated.
Query Q3, as it appears in TPCH, contains a GROUPBY that limits the number of tuples within each group.
In this case the optimized algorithm is about twice as efficient as the unoptimized algorithm for both the 1G and
the 10G databases. To see what happens when the size of the groups grows, we removed the GROUPBY clause from Q3.
For the 1G data set, 30569 tuples are now part of a single group; for the 10G data set, 301618 tuples form the
single group. When we tried to run these queries in Postgres, the system gave up with a message that the
maximum number of tuples allowed was exceeded. We estimated the running times based on the rate at which tuples
were processed for the 0.1G data set, 62500 tuples/second. This results in an estimate of 4 hours for the 1G data
set and 15 days for the 10G data set. In both situations, all the tuples of relation unaggregated can be
stored in memory; thus the large running time is due exclusively to computational inefficiency. In comparison,
the optimized algorithm can compute the moments in only about 10 seconds for the 10G data set. For truly large
data sets, it is crucial to use the optimized algorithm to ensure the moments can be computed in reasonable
time.
The unoptimized algorithm works faster only when the number of matching tuples is small and the
number of joined relations is large, which is the case for Q5. There are only 23 matching tuples for the 1G
data set and 195 matching tuples for the 10G data set for this query, but 6 relations are involved. In this
case, a large number of queries (2^6 = 64 queries) are posed to the database engine in order to run the
optimized algorithm. It seems that the time is dominated by parsing and query-plan generation, not
by the execution itself. Indeed, when the number of matching tuples becomes 9 times bigger (1G vs. 10G
data sets), the execution time increases only marginally. In such a situation, the inefficiency can be removed by
processing the unaggregated table in the client rather than using query rewriting and running the query on
the server.
Total overhead While the algorithms we proposed, especially the optimized algorithm, might seem to
require a lot of effort, the percentage overhead in the experimental result tables paints a different picture.
Except for Q1, which is CPU bound, for all the other queries at least one of the versions of the algorithm
requires less than 10% overhead. The overhead seems to shrink for larger databases, which means that for
TPCH-type queries the algorithms are expected to scale to very large databases. Finding matching tuples
seems to be much more time consuming than computing moments from the unaggregated table.
Figure 1: PDF of Q3 Without GroupBy (empirical PDF of 2000 samples vs. the PDF of the fitted normal distribution)
Figure 2: CDF of Q3 Without GroupBy (empirical CDF of 2000 samples vs. the CDF of the fitted normal distribution)
Figure 3: PDF of Q6 (empirical PDF of the samples vs. the PDF of the fitted normal distribution)
Figure 4: CDF of Q6 (empirical CDF of the samples vs. the CDF of the fitted normal distribution)
6.2 Distribution of Probabilistic Aggregates
In order to examine the distribution of probabilistic aggregates, we performed experiments on TPCH Q3
without GROUPBY and TPCH Q6 over the 0.1G data set. For each query, we generated more than 1000 instances
of the base relations according to the probabilistic model and executed the original SQL query on these
instances. The values of the aggregates thus obtained (these are i.i.d. samples from the distribution of the
probabilistic aggregate) were used to estimate the PDF and CDF of the distribution.
Figures 1-4 depict the experimental results together with the approximations of these quantities obtained
by approximating the distribution of the probabilistic aggregates with a normal distribution whose mean and
variance are computed using our method. What is immediately apparent from these experiments is that the
normal approximation is surprisingly good, especially when it comes to approximating the CDF. This
immediately means that reliable confidence intervals can be computed using the normal approximation; the
quality of the confidence intervals depends only on the quality of the approximation of the CDF.
The fact that the distribution tends to be normal is due to the fact that the aggregates are obtained by
combining a large number of 0-1 random variables with the data, which has a normalizing effect similar to
the Central Limit Theorem. We expect the distribution to diverge from the normal distribution when tuples
with large contributions to the aggregate appear with small probabilities. In such circumstances, Chebyshev
bounds can be derived based on the moments, bounds that are correct but conservative.
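Given the mean and variance, both kinds of confidence intervals are cheap to compute. The following Python sketch (the moment values are hypothetical) contrasts a normal-based 95% interval with the conservative, distribution-free interval obtained from the Chebyshev inequality P(|A - mu| >= k*sigma) <= 1/k^2:

```python
import math

# Two 95% confidence intervals from the same mean and variance. The moments
# mu and var are hypothetical stand-ins for E[A] and Var(A).
mu, var = 100.0, 25.0
sigma = math.sqrt(var)

z = 1.96                          # normal 97.5% quantile (approximate)
normal_ci = (mu - z * sigma, mu + z * sigma)

k = 1.0 / math.sqrt(0.05)         # solve 1/k^2 = 0.05, giving k ~ 4.47
cheb_ci = (mu - k * sigma, mu + k * sigma)
```

At the 95% level the Chebyshev interval is about 4.47/1.96, roughly 2.3 times wider than the normal one, which is the price paid for dropping the normality assumption.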
7 Related Work
A lot of research has been published on probabilistic databases. Below we mostly survey and comment on
work that is closest to the current contribution, with a particular emphasis on how our work differs.
In order to allow uncertain data in databases, [6, 5, 12, 11, 23, 14] modeled uncertain data and
extended the standard relational algebra to a probabilistic algebra. [8, 10, 9] further studied the complexity
of queries over probabilistic databases and proved that computing the probability of a Boolean query on a
disjoint-independent database is #P-hard. [9] also proved that the evaluation of any conjunctive query is
either #P-complete or in PTIME.
Aggregates over probabilistic databases, perceived as a much harder problem by the community, have
attracted attention in recent years. [25] studied aggregation over probabilistic databases. The focus
in [25] is on probabilistic databases with attribute uncertainty, where the probability of each attribute lies in
a bounded interval. [3, 27] developed the TRIO system for managing uncertainty and lineage of data.
Aggregation in TRIO is based on the possible-worlds model; therefore the operations are simple
to implement but intractable in most situations. Only expectation seems to be implemented in TRIO
(details are scant in the published literature), even though any other moments could be computed as easily
(but inefficiently). It is worth mentioning that TRIO can also compute lower and upper deterministic bounds
for aggregates, but these bounds are likely to be very pessimistic: the probability that the lower or upper
bound is achieved is extremely low in most situations.
[18, 19] and [7] studied aggregation over probabilistic data streams. The problem in [18, 19] is to
estimate the expected value of various aggregates over a single probabilistic data stream (or probabilistic
relation). As part of this work, they had to derive the expected-value formulas for the one-relation case that we
provide in Section 3. [7] studied the same problem together with the estimation of the size of the join of
two relations. The analysis provided in these papers is significantly more restricted than ours: expectation
and variance for the one-relation case and just expectation for the two-relation case. Furthermore, the aggregate is
restricted to COUNT (the work is only concerned with frequency moments). It is important to note that the
problem solved in all three pieces of work is harder, since the estimation has to be performed in small
space (a data-streaming problem). It would be interesting to investigate how the formulas we derive could be
approximated using small space as well.
Inspired by the same observation that the expected value of an aggregate cannot capture its distribution
clearly, [24] studied the problem of dealing with HAVING predicates that necessarily use aggregates. The
basic problem they consider is: compute the probability that, for a given group, the aggregate a is in
relationship θ with the constant k, i.e. a θ k. The types of aggregates considered are MIN, MAX, COUNT, SUM,
and θ is a comparison operator such as >. Only integer constants k are supported,
since the operations are performed on the semiring S_{k+1}. The probabilities of events a < k are in fact
the cumulative distribution function (c.d.f.) of the aggregate a at the point k. The efficient computation of such
probabilities can be readily used to compute confidence intervals for a by essentially inverting the c.d.f. This
can be accomplished efficiently using binary search, since the c.d.f. is monotone. Unfortunately, most of the
results in [24] are negative. For most queries, computing exactly the probability of the event a θ k is #P-hard. Even
for the queries for which the computation is polynomial (this is the case for MIN, MAX, COUNT, SUM(y), but
only for α-safe plans and y a single attribute), the complexity is linear in k, the constant involved. This is
especially troublesome for SUM aggregates, since k can be as large as the product of the size of the domain of
the aggregate and the size of the group.
In view of the above comments on the difficulty of computing exact confidence intervals, a fundamental
question needs to be asked: how is it possible to have these negative results and, at the same time, provide
efficient algorithms for determining confidence intervals like we do in this paper? The most important
observation about the present work is that only the first two moments are computed exactly, not the confidence
intervals. The confidence intervals are either pessimistic, if the Chebyshev bound is used, or based on extra
information about the approximate distribution of the aggregate. The pessimism of the Chebyshev bound
results only in a small multiplicative constant for the size of the confidence interval; for typical confidence
levels the constant is about 3. It is important to notice that, from the point of view of the user, the exact
confidence interval is not that important; having some idea of the fluctuation around the expected value is the
most useful piece of information. It is worth mentioning that the c.d.f. of most discrete distributions is hard
to compute efficiently. Even for the Binomial distribution, special functions like the regularized incomplete
beta function have to be used to compute the c.d.f. [1]. For this reason, the use of Chebyshev bounds or of
empirical approximations of the distributions of discrete random variables is standard in Statistics [26]. Users
of statistical methods in the natural sciences all understand and are comfortable with these limitations.
The proof techniques used in this paper were also used in previous work of the authors in a different
context: the analysis of sampling estimators [20, 21]. The 0-1 random variables significantly simplified the
analysis of sampling estimators, making possible a generic analysis independent of the type of sampling. The
Kronecker δ_ij symbol was used to keep formulas involving cases under control, as is the case in the current
work. The analysis of sampling estimators in [21] was performed using a similar proof technique to the one in
the current work; Lemma 1 in this paper is in fact a generalization of the formula in [21]. We believe
similar techniques can be used for other problems related to probabilistic databases and approximate query
processing.
8 Conclusions
In this paper we described a method to efficiently compute confidence intervals for non-correlated, DISTINCT-free
aggregates over probabilistic databases. The method requires simple query rewriting and the use of a
regular database system to perform the required computations. The core of the method is a statistical analysis
of the expectation and variance of the aggregates viewed as random variables. We derived both unoptimized (but
more general) and optimized formulas for the variance of the aggregates and indicated how these formulas
can be evaluated using SQL aggregate queries. As our experimental results indicate, computing moments
of probabilistic aggregates using the method we described and existing database technology incurs a small
overhead, even without any changes or optimizations to the database system. Moreover, as our experiments
showed, the distribution of the probabilistic aggregates tends to be normal and is well approximated by the
normal distribution with the mean and variance computed using our method. This effectively means that
the probabilistic aggregates of the type we considered can be characterized statistically well, with small effort
and without the need to rewrite the database engine.
References
[1] http://mathworld.wolfram.com/.
[2] S. Abiteboul, P. C. Kanellakis, and G. Grahne. On the representation and querying of sets of possible
worlds. In SIGMOD Conference, 1987.
[3] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio:
A system for data, uncertainty, and lineage. In VLDB, 2006.
[4] M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In
PODS, 1999.
[5] D. Barbara, H. GarciaMolina, and D. Porter. The management of probabilistic data. IEEE Trans.
Knowl. Data Eng., 4(5), 1992.
[6] R. Cavallo and M. Pittarelli. The theory of probabilistic databases. In VLDB '87: Proceedings of the
13th International Conference on Very Large Data Bases, pages 71-81, San Francisco, CA, USA, 1987.
[7] G. Cormode and M. N. Garofalakis. Sketching probabilistic data streams. In SIGMOD Conference,
2007.
[8] N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
[9] N. N. Dalvi and D. Suciu. The dichotomy of conjunctive queries on probabilistic structures. In PODS,
2007.
[10] N. N. Dalvi and D. Suciu. Management of probabilistic data: foundations and challenges. In PODS,
2007.
[11] D. Dey and S. Sarkar. A probabilistic relational model and algebra. ACM Trans. Database Syst.,
21(3):339-369, September 1996.
[12] N. Fuhr and T. Rölleke. A probabilistic relational algebra for the integration of information retrieval
and database systems. ACM Trans. Inf. Syst., 15(1):32-66, January 1997.
[13] G. Grahne. Dependency satisfaction in databases with incomplete information. In VLDB, 1984.
[14] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007.
[15] T. J. Green and V. Tannen. Models for incomplete and probabilistic information. In EDBT Workshops,
2006.
[16] T. Imielinski and W. Lipski Jr. Incomplete information in relational databases. J. ACM, 31(4), 1984.
[17] T. Imielinski, S. A. Naqvi, and K. V. Vadaparty. Incomplete objects a data model for design and
planning applications. In SIGMOD Conference, 1991.
[18] T. S. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In SODA,
2007.
[19] T. S. Jayram, A. McGregor, S. Muthukrishnan, and E. Vee. Estimating statistical aggregates on
probabilistic data streams. In PODS, 2007.
[20] C. Jermaine, A. Dobra, S. Arumugam, S. Joshi, and A. Pol. The sort-merge-shrink join. ACM Trans.
Database Syst., 31(4):1382-1416, December 2006.
[21] C. M. Jermaine, S. Arumugam, A. Pol, and A. Dobra. Scalable approximate query processing with the
DBO engine. In SIGMOD Conference, 2007.
[22] L. Libkin and L. Wong. Semantic representations and query languages for orsets. In PODS, 1993.
[23] M. Pittarelli. An algebra for probabilistic databases. IEEE Trans. Knowl. Data Eng., 6(2):293-303,
April 1994.
[24] C. Ré and D. Suciu. Efficient evaluation of HAVING queries on a probabilistic database. In DBPL, 2007.
[25] R. B. Ross, V. S. Subrahmanian, and J. Grant. Aggregate operators in probabilistic databases. J. ACM,
52(1):54-101, January 2005.
[26] J. Shao. Mathematical Statistics. Springer-Verlag, 1999.
[27] J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, 2005.