An Optimal Parallel Algorithm
for Sorting Multisets*
Dept. of CISE, Univ. of Florida, Gainesville, FL 32611.
Abstract. In this paper we consider the problem of sorting $n$ numbers
such that there are only $k$ distinct values. We present a randomized
arbitrary CRCW PRAM algorithm that runs in $O(\log n)$ time using
$\frac{n \log k}{\log n}$ processors. The algorithm is clearly optimal. The same
algorithm runs in $O\left(\frac{\log n}{\log\log n}\right)$ time with a total work of
$O(n (\log k)^{1+\epsilon})$ for any fixed $\epsilon > 0$.
All the stated bounds hold with high probability.
Keywords: multiset sorting, randomized algorithms, arbitrary CRCW PRAM
1 Introduction
Several optimal algorithms have been devised for sorting, sequentially as well as in
parallel. For sorting $n$ general keys, $\Omega(n \log n)$ is a well known lower bound on the
work. When additional information about the keys to be sorted is available, sorting
can be done with less work. For instance, sorting $n$ keys where each key is an integer
in the range $[1, n^{O(1)}]$ can be accomplished in $O(n)$ time sequentially using radix sort.
Another interesting case of sorting is when the number of distinct keys is $k < n$.
A lower bound of $\Omega(n \log k)$ on the work is easy to derive. An algorithm with a
sequential run time of $O(n \log k)$ is also straightforward.
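One simple way to see the sequential $O(n \log k)$ bound is to keep the distinct values seen so far in sorted order and binary search each incoming key among them. The following is a minimal Python sketch under stated assumptions: the function name `multiset_sort_sequential` is hypothetical, and a plain list is used for brevity, so only the number of comparisons is $O(\log k)$ per key; a balanced search tree would be needed to make the running time itself $O(n \log k)$.

```python
from bisect import bisect_left

def multiset_sort_sequential(keys):
    """Sort keys with O(log k) comparisons per key, k = number of distinct values."""
    distinct = []   # sorted list of the distinct values seen so far
    counts = []     # counts[i] is the multiplicity of distinct[i]
    for x in keys:
        i = bisect_left(distinct, x)      # binary search among at most k values
        if i < len(distinct) and distinct[i] == x:
            counts[i] += 1
        else:
            # list.insert shifts up to k elements; a balanced tree avoids this cost
            distinct.insert(i, x)
            counts.insert(i, 1)
    out = []
    for value, c in zip(distinct, counts):
        out.extend([value] * c)           # emit each value as often as it occurred
    return out
```

For example, `multiset_sort_sequential([3, 1, 3, 2, 1, 3])` returns `[1, 1, 2, 3, 3, 3]`.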
*This research was supported in part by an NSF Award CCR-95-03-1 I'I and an EPA Grant
Recently, Farach and Muthukrishnan [3] looked at the related problem of renaming
the keys. Here the input is an array $a[\,]$ of $n$ keys. The output is an array $b[\,]$ such
that the entries in $b$ are integers in the range $[1, k]$. Also, if $a[i] = a[j]$, for any
$1 \leq i, j \leq n$, then $b[i] = b[j]$. They presented a randomized CRCW PRAM algorithm
that runs in $O(\log k)$ time and does $O(n \log k)$ work with high probability. Note that
if the keys can be sorted, then the renaming problem can be solved trivially.
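The reduction from renaming to sorting can be sketched as follows. This is only an illustration, assuming Python's built-in `sorted` as a stand-in for any sorting routine; the name `rename_via_sort` is hypothetical. The labels it produces also preserve the order of the keys, which is more than renaming requires.

```python
def rename_via_sort(a):
    """Return b with b[i] in [1, k] and b[i] == b[j] whenever a[i] == a[j]."""
    # Indices of a in sorted key order; any sorting algorithm could be used here.
    order = sorted(range(len(a)), key=lambda i: a[i])
    b = [0] * len(a)
    label = 0
    prev = None
    for pos, i in enumerate(order):
        if pos == 0 or a[i] != prev:
            label += 1          # a new distinct value starts here
        b[i] = label
        prev = a[i]
    return b
```

For instance, `rename_via_sort([30, 10, 30, 20])` gives `[3, 1, 3, 2]`.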
In this paper we present a randomized algorithm for sorting an array of $n$ numbers
given that there are only $k < n$ distinct values. The value of $k$ need not be given as
a part of the input.
2 Some Preliminaries
The amount of resource (like time, space, etc.) used by any randomized algorithm
is said to be $\tilde{O}(f(n))$ if the amount used is no more than $c \alpha f(n)$ with probability
$\geq (1 - n^{-\alpha})$, where $c$ is some constant. Let $B(n, p)$ denote a binomial random variable
with parameters $n$ and $p$. If $X$ is a random variable with a distribution of $B(n, p)$,
then Chernoff bounds can be used to get tight upper bounds on the tail ends of $X$:
$$\mathrm{Prob.}[X > (1 + \epsilon) np] \leq e^{-\epsilon^2 np/2}$$
$$\mathrm{Prob.}[X < (1 - \epsilon) np] \leq e^{-\epsilon^2 np/3}$$
for any fixed $0 < \epsilon < 1$.
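To get a feel for how such a tail bound compares with the true tail, the lower-tail bound $e^{-\epsilon^2 np/3}$ can be checked numerically against the exact binomial probability. The sketch below is illustrative only; the function names and the parameter values ($n = 1000$, $p = 0.1$, $\epsilon = 0.5$) are assumptions chosen for the example.

```python
from math import comb, exp

def exact_lower_tail(n, p, t):
    """Exact Prob[X <= t] for X ~ B(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(t + 1))

def chernoff_lower_bound(n, p, eps):
    """Lower-tail Chernoff bound exp(-eps^2 * n * p / 3)."""
    return exp(-eps**2 * n * p / 3)

# Example: mean np = 100, threshold (1 - eps) * np = 50.
tail = exact_lower_tail(1000, 0.1, 50)
bound = chernoff_lower_bound(1000, 0.1, 0.5)
print(tail <= bound)   # the exact tail stays below the bound
```

In this instance the bound evaluates to roughly $2.4 \times 10^{-4}$ and the exact tail is smaller still.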
3 The Algorithm
Our algorithm is based on random sampling. We pick a random sample of size about $\frac{n}{\log n}$
and sort it using any general sorting algorithm. As a result, we will be able to estimate
$k$. If $k = \Omega(\sqrt{n})$, we sort the whole input since then the work done will be $O(n \log k)$.
Otherwise, we collect all the distinct keys and sort them. A binary search is performed
for each input key so that each key is assigned a label in the range $[1, k]$ depending
on its value. Finally, the keys are sorted with respect to the assigned labels using the
algorithm of Rajasekaran and Reif [6]. More details follow. Let $k_1, k_2, \ldots, k_n$ be the
input sequence. The number of processors used is $P = \frac{n \log k}{\log n}$.
Step 1. Each processor is assigned $\frac{\log n}{\log k}$ keys from the input. Every input
key is independently and randomly chosen to be in the sample $S$ with
probability $\frac{1}{\log n}$.
Step 2. Collect the sample in successive cells of common memory using
a prefix computation and sort S. Let S' be the sorted sample.
Step 3. Perform a prefix computation in $S'$ to form a sequence $Q$ of
distinct values in $S$, i.e., if $S$ has more than one key of the same value
then only one key with this value is retained in $Q$. Note that $|Q|$ can
possibly be less than $k$. If $|Q| > \sqrt{n}$, sort the input using any general
sorting algorithm, output and quit.
Step 4. For each input key perform a binary search in Q.
Step 5. Those input keys whose values are not represented in $Q$ are
collected using a prefix computation. Let $R$ be this collection.
Step 6. Sort Q and R together. Perform a prefix computation and keep
only one key of each value. Let U be the resultant sequence.
Step 7. Perform a binary search for every input key in $U$ and assign a
label to this key in the range $[1, k]$. If a key $k_i$ has a value equal to the
$j$th smallest value in the input then it gets a label of $j$.
Step 8. Sort the input keys with respect to the labels assigned in Step 7.
The resultant sequence is the desired output.
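The eight steps above can be traced with a sequential simulation. The following is a minimal Python sketch, not the PRAM algorithm itself: parallel prefix computations and parallel sorts are replaced by ordinary list operations, the sampling probability $1/\log n$ is an assumption taken from the analysis, a counting pass stands in for the integer sort of Step 8, and the names `multiset_sort` and `dedupe_sorted` are hypothetical.

```python
import math
import random
from bisect import bisect_left

def dedupe_sorted(seq):
    """Keep one copy of each value of an already sorted sequence (Steps 3 and 6)."""
    return [x for i, x in enumerate(seq) if i == 0 or x != seq[i - 1]]

def multiset_sort(keys, rng=None):
    rng = rng or random.Random()
    n = len(keys)
    if n < 4:                      # too small for sampling to make sense
        return sorted(keys)
    p = 1.0 / math.log2(n)         # assumed sampling probability 1 / log n
    # Steps 1-2: pick the sample S and sort it.
    sample = sorted(x for x in keys if rng.random() < p)
    # Step 3: distinct sample values Q; fall back to a general sort if Q is large.
    q = dedupe_sorted(sample)
    if len(q) > math.sqrt(n):
        return sorted(keys)
    # Steps 4-5: binary search each key in Q and collect the unrepresented keys R.
    def in_q(x):
        i = bisect_left(q, x)
        return i < len(q) and q[i] == x
    r = [x for x in keys if not in_q(x)]
    # Step 6: sort Q and R together and keep one key of each value.
    u = dedupe_sorted(sorted(q + r))
    # Step 7: the label of a key is its 1-based rank among the distinct values.
    labels = [bisect_left(u, x) + 1 for x in keys]
    # Step 8: sort by label; a counting pass stands in for the integer sort.
    counts = [0] * (len(u) + 1)
    for lab in labels:
        counts[lab] += 1
    out = []
    for lab in range(1, len(u) + 1):
        out.extend([u[lab - 1]] * counts[lab])
    return out
```

The output is correct for every outcome of the random sampling; the randomness only affects which branch is taken and how large $Q$ and $R$ become.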
Theorem 3.1 Algorithm MultisetSort runs in time $O(\log n)$ using $\frac{n \log k}{\log n}$ CRCW
PRAM processors and solves the multiset sorting problem.
Proof. The correctness of the algorithm is quite evident.
Step 1 takes $\frac{\log n}{\log k}$ time. The number of keys in $S$ has a distribution of $B\left(n, \frac{1}{\log n}\right)$.
Thus, the cardinality of $S$ is $\tilde{O}\left(\frac{n}{\log n}\right)$.
Prefix computation in Step 2 can be performed in $O(\log n)$ time, the total work
done being $O(n)$. Sorting takes $O(\log n)$ time using $\frac{n}{\log n}$ processors using the parallel
merge sort algorithm of Cole [1].
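The use of a prefix computation to collect selected keys in successive memory cells can be illustrated sequentially. In the sketch below (names are hypothetical), the prefix sums of the 0/1 selection flags give each selected item its destination cell; on a PRAM the final loop would be one parallel step, since every iteration writes to a different cell.

```python
from itertools import accumulate

def compact(items, flags):
    """Move the items whose flag is 1 into consecutive cells, keeping their order."""
    prefix = list(accumulate(flags))              # inclusive prefix sums of the flags
    out = [None] * (prefix[-1] if prefix else 0)
    for i, (item, f) in enumerate(zip(items, flags)):
        if f:
            out[prefix[i] - 1] = item             # destination is independent of other items
    return out
```

For example, `compact(['a', 'b', 'c', 'd'], [1, 0, 1, 1])` returns `['a', 'c', 'd']`.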
Step 3 takes $O(\log n)$ time using $\frac{n}{\log n}$ processors.
Since $|Q| \leq k$, Step 4 can be completed in $O(\log k)$ time using $n$ processors. Or
equivalently, it can be done in $O(\log n)$ time, the total work done being $O(n \log k)$.
Step 5 takes $O(\log n)$ time using $O\left(\frac{n}{\log n}\right)$ processors.
If a value is represented $m$ times in the input, then the expected number of
occurrences of this value in $S$ is $\frac{m}{\log n}$. If $m \geq 5\alpha \log^2 n$, then with probability $\geq$
$(1 - n^{-16\alpha/15})$, there will be at least $\log n$ copies of this value in $S$ (for any fixed
$\alpha \geq 1$). In other words, if a value is not represented in $S$, then with high probability
the number of occurrences of this value in the input is $O(\log^2 n)$. This implies that
the cardinality of $R$ is $\tilde{O}(k \log^2 n)$.
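The exponent $16\alpha/15$ can be traced back to the lower-tail Chernoff bound with $\epsilon = 4/5$. The derivation below is a sketch under the assumptions that each key is sampled with probability $1/\log n$, that the value occurs $m \geq 5\alpha \log^2 n$ times, that the lower-tail bound $e^{-\epsilon^2 \mu/3}$ is used, and that $\log$ denotes the natural logarithm (with base-2 logarithms the final probability is only smaller).

```latex
\begin{align*}
X &\sim B\!\left(m, \tfrac{1}{\log n}\right), \qquad
\mu = E[X] = \frac{m}{\log n} \geq 5\alpha \log n, \\
\mathrm{Prob.}[X < \log n]
  &\leq \mathrm{Prob.}\!\left[X < \left(1 - \tfrac{4}{5}\right)\mu\right]
  && \text{since } \tfrac{1}{5}\mu \geq \alpha \log n \geq \log n \\
  &\leq \exp\!\left(-\left(\tfrac{4}{5}\right)^{2} \frac{\mu}{3}\right)
   \leq \exp\!\left(-\frac{16}{25} \cdot \frac{5\alpha \log n}{3}\right) \\
  &= \exp\!\left(-\frac{16\alpha}{15} \log n\right) = n^{-16\alpha/15}.
\end{align*}
```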
Assume that there are more than $N = \sqrt{n} \log^2 n$ distinct values in the input. Let
$q_1, q_2, \ldots, q_N$ be any $N$ keys of the input with distinct values. Then, from among
these keys we expect $\sqrt{n} \log n$ of them to be in $S$. That is, the cardinality of $Q$ will
be $\tilde{\Omega}(\sqrt{n} \log n)$. Therefore, if $|Q| \leq \sqrt{n}$, the value of $k$ has to be $\tilde{O}(\sqrt{n} \log^2 n)$.
As a consequence, Step 6 can be completed in $O(\log n)$ time using $\frac{n}{\log n}$ processors,
since $|Q| + |R| = \tilde{O}(\sqrt{n} \log^4 n)$.
Step 7 takes $O(\log n)$ time with a total work of $O(n \log k)$.
Finally, Step 8 takes $O(\log n)$ time using $\frac{n}{\log n}$ processors. The algorithm of
Rajasekaran and Reif [6] can sort $n$ integers in the range $[1, n(\log n)^{O(1)}]$ in
$O(\log n)$ time using $\frac{n}{\log n}$ arbitrary CRCW PRAM processors. $\Box$
4 Sub-Logarithmic Time Sorting
In this section we show that multiset sorting can be done in $O\left(\frac{\log n}{\log\log n}\right)$ time, the total
work done being $O(n (\log k)^{1+\epsilon})$, for any fixed $\epsilon > 0$.
Since $\Omega(\log n / \log\log n)$ is a lower bound on the parallel time needed to sort $n$ bits
(given only a polynomial number of processors), the time bound is the best possible.
The sub-logarithmic time algorithm is the same as MultisetSort with some modifications.
Theorem 4.1 We can sort $n$ keys with $k$ distinct values in $\tilde{O}\left(\frac{\log n}{\log\log n}\right)$ time with a
total work of $O(n (\log k)^{1+\epsilon})$, for any fixed $\epsilon > 0$.
Proof. We employ $P = \frac{n (\log k)^{1+\epsilon} \log\log n}{\log n}$ processors, for any fixed $\epsilon > 0$.
In Step 1, employ $\frac{n \log\log n}{\log n}$ processors to pick the sample $S$ in $\frac{\log n}{\log\log n}$ time.
In Step 2, the sample $S$ can be sorted using the general sorting algorithm given
in Rajasekaran and Reif [6]. This algorithm can sort $N$ keys in $\tilde{O}\left(\frac{\log N}{\log\log N}\right)$ time with a total work of
$O(N (\log N)^{1+\epsilon})$ for any constant $\epsilon > 0$. Thus Step 2 can be completed in $\tilde{O}\left(\frac{\log n}{\log\log n}\right)$
time using the given processors. The same bounds hold for Step 6 as well.
In Step 3, if $|Q| > \sqrt{n}$, the input keys can be sorted using the general sorting
algorithm of Rajasekaran and Reif [6]. The work done will be optimal.
Prefix computations in Steps 2, 3, 5, and 6 can be done in $O\left(\frac{\log n}{\log\log n}\right)$ time using
$\frac{n \log\log n}{\log n}$ processors using the algorithm of Cole and Vishkin [2], since the sequences
operated on in these steps are binary.
In Steps 4 and 7 we assign $(\log k)^{\epsilon}$ processors to each key and perform a $(\log k)^{\epsilon}$-ary
search. Thus the search takes $O\left(\frac{\log k}{\log\log k}\right)$ time, the total work done being $O(n (\log k)^{1+\epsilon})$.
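The effect of replacing binary search by a $d$-ary search can be seen in a sequential simulation. In the sketch below the function name `dary_search` and the parameter `d` are hypothetical; `d` plays the role of $(\log k)^{\epsilon}$, and the probes of one round, made here in a loop, would be made simultaneously by the processors assigned to the key. Each round shrinks the candidate range by a factor of more than `d`, so about $\log k / \log d$ rounds suffice.

```python
def dary_search(sorted_keys, x, d):
    """Return (index of the first key >= x, number of rounds) using d-ary search."""
    assert d >= 2
    lo, hi = 0, len(sorted_keys)        # the answer always lies in [lo, hi]
    rounds = 0
    while lo < hi:
        rounds += 1
        step = -(-(hi - lo) // d)       # ceil((hi - lo) / d): block length
        # Last position of each block; one (simulated) processor per probe.
        probes = range(lo + step - 1, hi, step)
        below = sum(1 for p in probes if sorted_keys[p] < x)
        if below < len(probes):
            hi = probes[below]          # first probe with key >= x bounds the answer
        if below > 0:
            lo = probes[below - 1] + 1  # the answer lies beyond the last probe with key < x
    return lo, rounds
```

With 1000 sorted keys and `d = 10`, every query finishes within a handful of rounds, compared with about ten rounds for binary search.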
For sorting in Step 8, a sub-logarithmic time integer sorting algorithm is needed.
An algorithm for sorting $N$ integers in the range $[1, N(\log N)^{O(1)}]$ in $\tilde{O}\left(\frac{\log N}{\log\log N}\right)$ time
with a total work of $O(N \log\log N)$ was given in Rajasekaran and Reif [6]. The total work done in this
algorithm was later improved to $O(N)$ in the independent works of Hagerup [4],
Matias and Vishkin [5], and Raman [7]. Thus Step 8 can also be completed within
the stated resource bounds. $\Box$
References
[1] R. Cole, Parallel Merge Sort, SIAM Journal on Computing, vol. 17, no. 4, 1988,
pp. 770-785.
[2] R. Cole and U. Vishkin, Faster Optimal Parallel Prefix Sums and List Ranking,
Information and Computation 81, 1989, pp. 334-352.
[3] M. Farach and S. Muthukrishnan, Optimal Parallel Randomized Renaming,
Information Processing Letters 61(1), 1997, pp. 7-10.
[4] T. Hagerup, Fast Parallel Space Allocation, Estimation and Integer Sorting,
Proc. IEEE Symposium on Foundations of Computer Science, 1991.
[5] Y. Matias and U. Vishkin, Converting High Probability into Nearly-Constant
Time with Applications to Parallel Hashing, Proc. ACM Symposium on Theory
of Computing, 1991, pp. 307-316.
[6] S. Rajasekaran and J.H. Reif, Optimal and Sub-Logarithmic Time Randomized
Parallel Sorting Algorithms, SIAM Journal on Computing 18(3), 1989, pp. 594-607.
[7] R. Raman, The Power of Collision: Randomized Parallel Algorithms for Chaining
and Integer Sorting, Technical Report 336, Dept. of Computer Science,
University of Rochester, January 1991.