
Block matrix methods: Taking advantage of high-performance computers

Tim Davis
Dept. of Computer & Info. Sci. & Eng.
Univ. of Florida
TR-98-024

October 29, 1998


Abstract
Block matrices and block matrix operations are essential for high-performance
computing. Using these techniques, applications can take advantage of
architectural features such as cache, virtual memory, parallelism, and vector
and other special instructions, that are present in most modern computers,
from the latest PC's to the most powerful supercomputers. As a result, block
matrix operations and the applications that rely on them (including
scientific simulations, graphics, and multimedia applications) can often
reach close to the theoretical maximum performance on any computer. This
document, written for COT 4501 (Numerical Analysis: A Computational
Approach), explains how these techniques work. This work was supported by the
National Science Foundation, grant number 9634470.


1 Introduction

As you read this document, refer to Chapter 5 (Matrix Computations) from
the book Introduction to Scientific Computing by Charles Van Loan, 1997.
In Section 2, below, block matrices are presented. Section 3 explains how
two block matrices are multiplied together. Using this framework, the
matrix operations in Chapter 5 of Van Loan are explained in terms of block
matrix methods in Section 4 (except for parallel matrix multiplication).
Sections 5 and 6 explain how to take advantage of cache, and vector or other
special instructions, such as those in Intel's MMX architecture. Parallel
computing is introduced in Section 7; this section discusses the parallel
methods in Chapter 5 of Van Loan in terms of block matrix operations.
Finally, Section 8 concludes with a brief look at more advanced techniques,
and Section 9 provides an annotated bibliography.
The following notation is used throughout this document. Matrices are
shown in bold upper case, like A, B, and C. Scalars are in lower case. A
scalar entry in row i and column j of a matrix is shown in lower case with
subscripts, a_ij for the matrix A and b_ij for the matrix B, for example.
Vectors are in bold lower case, such as x and y. By convention, a vector x
denotes a column vector; if we want to denote a row vector we need to use
x^T, which is the transpose of a column vector. Row and column vectors
extracted from a matrix are shown in bold lower case with subscripts. For
example, b_j and b_i^T are the jth column and ith row of B, respectively.
Submatrices that are not scalars or vectors are shown as bold upper case
with subscripts, such as A_12. If the range of one of the subscripts is
always 1, for a set of submatrices, then it is dropped. For example, if the
m-by-n matrix A is split into only two submatrices, each with half the
columns of A, then the two m-by-n/2 submatrices are referred to as A_1
and A_2.


2 Block matrices

Block matrices are not really a different kind of matrix, just another way
of looking at a regular matrix. A block matrix can simply be thought of as a
matrix of matrices. For example, suppose A is a 4-by-4 matrix,

        [ a_11  a_12  a_13  a_14 ]
    A = [ a_21  a_22  a_23  a_24 ]
        [ a_31  a_32  a_33  a_34 ]
        [ a_41  a_42  a_43  a_44 ]

We can look at this as a 2-by-2 block matrix, where each block is a 2-by-2
matrix,

    A = [ A_11  A_12 ]
        [ A_21  A_22 ]

and where

    A_11 = [ a_11  a_12 ]    A_12 = [ a_13  a_14 ]
           [ a_21  a_22 ]           [ a_23  a_24 ]

    A_21 = [ a_31  a_32 ]    A_22 = [ a_33  a_34 ]
           [ a_41  a_42 ]           [ a_43  a_44 ]
By dividing up a matrix in different ways like this, different matrix
methods can be devised that take better advantage of the underlying computer
architecture. Cache, virtual memory, parallel computing, and special
instruction sets (such as Cray vector instructions and Intel's MMX
architecture) can all be exploited to speed up matrix computations.
Now for some more formal definitions. Suppose we have an m-by-n matrix
A. The m rows can be partitioned into m̄ pieces, where the ith piece is of
size m_i. The pieces of course must add up to m, that is,
m = \sum_{i=1}^{m̄} m_i. Similarly, the n columns of A can be split into n̄
pieces, where the jth piece is of size n_j, and we have
n = \sum_{j=1}^{n̄} n_j. The matrix A is treated as an m̄-by-n̄ block
matrix, where block A_ij has dimension m_i-by-n_j. We have

    A = [ A_1,1   A_1,2   ...  A_1,n̄ ]
        [ A_2,1   A_2,2   ...  A_2,n̄ ]
        [   :                    :   ]
        [ A_m̄,1   A_m̄,2   ...  A_m̄,n̄ ]

In the above example, m = 4, m̄ = 2, m_1 = 2, and m_2 = 2. The columns of
A are partitioned in the same way.
This definition can be applied recursively, where, for example, any one or
more of the blocks of A can itself be viewed as a block matrix.
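
To make the block point of view concrete, here is a small C sketch (my own
illustration, not code from this report) that stores a 4-by-4 matrix A as an
ordinary two-dimensional array and treats it as a 2-by-2 block matrix; block
(I,J) is the 2-by-2 submatrix starting at row 2I and column 2J (0-based):

#include <stdio.h>

#define N  4    /* matrix dimension */
#define NB 2    /* block dimension  */

/* Print the 2-by-2 block in block row I and block column J (0-based) of an
   N-by-N matrix; it occupies rows NB*I..NB*I+1 and columns NB*J..NB*J+1. */
static void print_block(const double A[N][N], int I, int J)
{
    for (int i = 0; i < NB; i++)
    {
        for (int j = 0; j < NB; j++)
            printf("%6.1f ", A[NB*I + i][NB*J + j]);
        printf("\n");
    }
}

int main(void)
{
    double A[N][N];
    for (int i = 0; i < N; i++)          /* a_ij = 10*i + j, easy to read */
        for (int j = 0; j < N; j++)
            A[i][j] = 10.0*(i+1) + (j+1);

    printf("block A_21 (rows 3:4, columns 1:2 of A):\n");
    print_block(A, 1, 0);
    return 0;
}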


3 Block matrix multiplication

Block matrices are a useful way of looking at the many different ways of
doing matrix multiplication, and also a useful way of coming up with new
techniques. Suppose we want to compute C = AB, where C is m-by-n, A
is m-by-r, and B is r-by-n. Remember that C = AB is not defined unless

1. the number of rows in C and A are the same (m),

2. the number of columns in C and B are the same (n),















3. and the number of columns in A is the same as the number of rows in
B (which is r).
The definition of this operation is given as a collection of dot products
(also called inner products). The c_ij entry is a dot product of row i of A
with column j of B. That is,

    c_ij = \sum_{k=1}^{r} a_ik b_kj

where i and j are in the range 1 to m and 1 to n, respectively. (This can be
a lousy way of computing C = AB, however.) Writing this in vector notation,
we get

    c_ij = a_i^T b_j

where a_i^T is the ith row of A, and b_j is the jth column of B.
For a block matrix multiplication, we can view the larger matrix mul-
tiplication (C = AB) as a collection of smaller matrix operations, such
as matrix-matrix multiplication with smaller matrices, dot products, scaled
vector additions, vector outer products, and matrix-vector multiplications.
Suppose C, A, and B are viewed as block matrices, and that
1. the partitioning of the m rows of C and A are the same,

2. the partitioning of the n columns of C and B are the same,

3. and the partitioning of the r columns of A is the same as the parti-
tioning of the r rows of B.
Then we can define the block matrix multiplication C = AB as

    C_ij = \sum_{k=1}^{r̄} A_ik B_kj

where i ranges from 1 to m̄ and j ranges from 1 to n̄ (the number of block
rows and block columns of C, respectively), and r̄ is the number of block
columns of A (equal to the number of block rows of B).
For example, consider C = AB where C is 3-by-5, A is 3-by-4, and B is
4-by-5. Suppose we split m = 3 into two pieces of size 1 and 2, n = 5 into
three pieces of size 2, 1, and 2, and r = 4 into two pieces of size 2 each. We
get

    [ c_11 c_12 c_13 c_14 c_15 ]   [ a_11 a_12 a_13 a_14 ] [ b_11 b_12 b_13 b_14 b_15 ]
    [ c_21 c_22 c_23 c_24 c_25 ] = [ a_21 a_22 a_23 a_24 ] [ b_21 b_22 b_23 b_24 b_25 ]
    [ c_31 c_32 c_33 c_34 c_35 ]   [ a_31 a_32 a_33 a_34 ] [ b_31 b_32 b_33 b_34 b_35 ]
                                                           [ b_41 b_42 b_43 b_44 b_45 ]

or, with block submatrices,

    [ C_11  C_12  C_13 ]   [ A_11  A_12 ] [ B_11  B_12  B_13 ]
    [ C_21  C_22  C_23 ] = [ A_21  A_22 ] [ B_21  B_22  B_23 ]

where

    C_11 = [ c_11  c_12 ]        C_12 = [ c_13 ]      C_13 = [ c_14  c_15 ]

    C_21 = [ c_21  c_22 ]        C_22 = [ c_23 ]      C_23 = [ c_24  c_25 ]
           [ c_31  c_32 ]               [ c_33 ]             [ c_34  c_35 ]

    A_11 = [ a_11  a_12 ]        A_12 = [ a_13  a_14 ]

    A_21 = [ a_21  a_22 ]        A_22 = [ a_23  a_24 ]
           [ a_31  a_32 ]               [ a_33  a_34 ]

    B_11 = [ b_11  b_12 ]        B_12 = [ b_13 ]      B_13 = [ b_14  b_15 ]
           [ b_21  b_22 ]               [ b_23 ]             [ b_24  b_25 ]

    B_21 = [ b_31  b_32 ]        B_22 = [ b_33 ]      B_23 = [ b_34  b_35 ]
           [ b_41  b_42 ]               [ b_43 ]             [ b_44  b_45 ]
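
The same example can be checked numerically. The following C sketch (my own
illustration; the helper and variable names are hypothetical, not from the
textbook) multiplies a 3-by-4 matrix A by a 4-by-5 matrix B once in the
ordinary scalar way and once block-by-block using the partitions above, and
verifies that the two results agree:

#include <stdio.h>
#include <math.h>

#define M 3
#define R 4
#define N 5

/* C += A*B, where A is m-by-r, B is r-by-n, C is m-by-n, all stored by rows
   inside larger arrays with leading dimensions ldA, ldB, ldC */
static void matmul(int m, int r, int n, const double *A, const double *B,
                   double *C, int ldA, int ldB, int ldC)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
        {
            double s = 0.0;
            for (int k = 0; k < r; k++)
                s += A[i*ldA + k] * B[k*ldB + j];
            C[i*ldC + j] += s;      /* accumulate, so it can be used block-wise */
        }
}

int main(void)
{
    double A[M][R], B[R][N], C[M][N] = {{0}}, Cblk[M][N] = {{0}};
    for (int i = 0; i < M; i++) for (int k = 0; k < R; k++) A[i][k] = i + 2*k + 1;
    for (int k = 0; k < R; k++) for (int j = 0; j < N; j++) B[k][j] = k - j + 3;

    /* ordinary multiply */
    matmul(M, R, N, &A[0][0], &B[0][0], &C[0][0], R, N, N);

    /* block multiply: C_IJ = sum_K A_IK * B_KJ, with the partitions above */
    int mi[] = {1, 2},    moff[] = {0, 1};      /* row partition of C and A    */
    int nj[] = {2, 1, 2}, noff[] = {0, 2, 3};   /* column partition of C and B */
    int rk[] = {2, 2},    roff[] = {0, 2};      /* inner partition of A and B  */
    for (int I = 0; I < 2; I++)
        for (int J = 0; J < 3; J++)
            for (int K = 0; K < 2; K++)
                matmul(mi[I], rk[K], nj[J],
                       &A[moff[I]][roff[K]], &B[roff[K]][noff[J]],
                       &Cblk[moff[I]][noff[J]], R, N, N);

    /* the two results should be identical */
    double err = 0.0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            err = fmax(err, fabs(C[i][j] - Cblk[i][j]));
    printf("max difference = %g\n", err);
    return 0;
}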


4 Block matrices in the textbook

In Chapter 5 of Introduction to Scientific Computing by Charles Van Loan,
1997, several matrix-vector and matrix-matrix multiplication methods are
presented. Each of these has a block matrix counterpart.

4.1 Matrix-vector: row oriented
The row-oriented matrix-vector product y = Ax, as computed by MatVecRO,
can be written as

    [ y_1 ]   [ a_1^T ]
    [ y_2 ]   [ a_2^T ]
    [  :  ] = [   :   ] x
    [ y_m ]   [ a_m^T ]
where y_i is a scalar and x is an n-by-1 column vector (a single 1-by-1 block
matrix, with a single n-by-1 block). The matrix A is an m-by-1 block matrix,
where each block a_i^T is a 1-by-n row vector.


Algorithm MatVecRO:
    for i = 1 : m
        y_i = a_i^T x
    end for



4.2 Matrix-vector: column oriented

The acronym saxpy stands for (single-precision) ax + y, where a is a scalar
and x and y are vectors. The result is accumulated into y, so the complete
operation is y = ax + y. This can be used to perform other matrix
operations, such as matrix-vector and matrix-matrix multiply. The
column-oriented matrix-vector multiply (MatVecCO) can be written as

    y = [ a_1  a_2  ...  a_n ] [ x_1 ]
                               [  :  ]
                               [ x_n ]

where y is a single m-by-1 column vector. The matrix A is a 1-by-n block
matrix, where each block a_j is an m-by-1 column vector.


Algorithm MatVecCO:
    note that this is n scaled vector additions (saxpys)
    y = \sum_{j=1}^{n} a_j x_j
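
A matching C sketch of the column-oriented version (again my own
illustration, not the textbook's MatVecCO): y is accumulated as a sum of
saxpy operations, y = y + x_j a_j, which suits column-major storage since
each column a_j is then contiguous in memory:

/* y = A*x, column oriented (in the spirit of MatVecCO): y = sum_j x_j a_j.
   A is m-by-n and stored by columns (as Matlab stores it), so column j of A
   is the contiguous piece A[j*m .. j*m+m-1]. */
void matvec_col(int m, int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < m; i++)
        y[i] = 0.0;
    for (int j = 0; j < n; j++)
    {
        const double *aj = A + j*m;     /* column a_j */
        for (int i = 0; i < m; i++)
            y[i] += x[j] * aj[i];       /* saxpy: y = y + x_j * a_j */
    }
}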















4.3 Matrix-matrix: dot product version


The dot product form of a matrix multiply, C = AB as computed by
MatMatDot, can be written as

    [ c_11  c_12  ...  c_1n ]   [ a_1^T ]
    [ c_21  c_22  ...  c_2n ] = [ a_2^T ] [ b_1  b_2  ...  b_n ]
    [   :                :  ]   [   :   ]
    [ c_m1  c_m2  ...  c_mn ]   [ a_m^T ]

where C is a matrix of 1-by-1 blocks, A is an m-by-1 block matrix with each
block a_i^T a row vector of size 1-by-r, and B is a 1-by-n block matrix with
each block b_j a column vector of size r-by-1.


Algorithm MatMatDot:
    note that the for loops can be interchanged
    for i = 1 : m
        for j = 1 : n
            c_ij = a_i^T b_j
        end for
    end for




4.4 Matrix-matrix: saxpy version

The MatMatSax routine is a saxpy-based matrix-matrix multiply. It can be
written as

    [ c_1  c_2  ...  c_n ] = [ a_1  a_2  ...  a_r ] [ b_11  b_12  ...  b_1n ]
                                                    [ b_21  b_22  ...  b_2n ]
                                                    [   :                :  ]
                                                    [ b_r1  b_r2  ...  b_rn ]

where c_j and a_k are m-by-1 column vectors.
















Algorithm MatMatSax:
    for j = 1 : n
        c_j = \sum_{k=1}^{r} a_k b_kj
    end for



4.5 Matrix-matrix: matrix-vector version
The MatMatVec routine is based on matrix-vector multiplication. It can be
written as

    [ c_1  c_2  ...  c_n ] = A [ b_1  b_2  ...  b_n ]

where c_j is an m-by-1 column vector, and b_j is an r-by-1 column vector.


Algorithm MatMatVec:
    for j = 1 : n
        c_j = A b_j
    end for



4.6 Matrix-matrix: outer product version
Recall that a dot product is a row vector times a column vector, a = x^T y,
with the result being a scalar. An outer product is a column vector times
a row vector, E = x y^T, with the result being a matrix. Outer products are
used in the MatMatOuter matrix-matrix multiply. It can be written as

    C = [ a_1  a_2  ...  a_r ] [ b_1^T ]
                               [ b_2^T ]
                               [   :   ]
                               [ b_r^T ]

where a_k is an m-by-1 column vector and b_k^T is a 1-by-n row vector.
















Algorithm MatMatOuter:
    note that this is the summation of r outer products
    C = \sum_{k=1}^{r} a_k b_k^T
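
The four O(n^3) routines above differ only in which kernel sits innermost: a
dot product (MatMatDot), a saxpy (MatMatSax), a whole matrix-vector product
(MatMatVec), or a rank-1 outer-product update (MatMatOuter). Here is a C
sketch of the two extremes (my own code, not from the textbook), with all
matrices stored by columns:

/* C = A*B with C m-by-n, A m-by-r, B r-by-n, all stored by columns.
   ENTRY(X, ldx, i, j) is entry (i,j) of the column-major array X. */
#define ENTRY(X, ldx, i, j) ((X)[(i) + (j)*(ldx)])

/* dot-product version (in the spirit of MatMatDot): c_ij = a_i^T b_j */
void matmat_dot(int m, int r, int n, const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
        {
            double s = 0.0;
            for (int k = 0; k < r; k++)
                s += ENTRY(A, m, i, k) * ENTRY(B, r, k, j);
            ENTRY(C, m, i, j) = s;
        }
}

/* outer-product version (in the spirit of MatMatOuter): C = sum_k a_k b_k^T */
void matmat_outer(int m, int r, int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < m*n; i++)
        C[i] = 0.0;
    for (int k = 0; k < r; k++)         /* rank-1 update with column a_k and row b_k^T */
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++)
                ENTRY(C, m, i, j) += ENTRY(A, m, i, k) * ENTRY(B, r, k, j);
}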



4.7 Matrix-matrix: Strassen's method
Strassen's matrix multiplication of C = AB is best thought of as a recursive
2-by-2 block partitioning of the three matrices. Suppose the matrices are all
square (n-by-n) and n is a power of 2. (The method can be extended to handle
more general cases; this assumption just makes the presentation simpler.) We
can write

    [ C_11  C_12 ]   [ A_11  A_12 ] [ B_11  B_12 ]
    [ C_21  C_22 ] = [ A_21  A_22 ] [ B_21  B_22 ]

where all submatrices are n/2-by-n/2. The trick is to replace the standard
computation (requiring eight multiplications and four additions with
n/2-by-n/2 submatrices) with a more complicated rearrangement that requires
only seven multiplications (and 18 additions). The seven n/2-by-n/2 matrix
multiplications can be done the same way, recursively. As a result,
Strassen's method takes O(n^{log2 7}) time, or about O(n^{2.81}). This is
faster than the standard O(n^3) we've seen so far. There are even faster
methods. Surprisingly, no one yet knows the tight bounds on the time
complexity of matrix multiplication! This is one of the fascinating unsolved
problems of linear algebra. You would think that by now we would know how
much work it takes to multiply two matrices ... but we don't.
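
A recursive C sketch of Strassen's method for n a power of 2 follows (my own
illustration of the standard formulas, not code from the textbook); below a
small cutoff it falls back to the ordinary O(n^3) multiply, as practical
implementations also do:

#include <stdlib.h>
#include <string.h>

/* n-by-n matrices stored by rows in contiguous arrays */
static void add(int n, const double *X, const double *Y, double *Z)
{   /* Z = X + Y */
    for (int i = 0; i < n*n; i++) Z[i] = X[i] + Y[i];
}

static void sub(int n, const double *X, const double *Y, double *Z)
{   /* Z = X - Y */
    for (int i = 0; i < n*n; i++) Z[i] = X[i] - Y[i];
}

static void naive(int n, const double *A, const double *B, double *C)
{   /* C = A*B by the definition */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
        {
            double s = 0.0;
            for (int k = 0; k < n; k++) s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

static void get_block(int n, const double *X, int I, int J, double *Y)
{   /* copy the n/2-by-n/2 block (I,J) of X into the contiguous array Y */
    int h = n/2;
    for (int i = 0; i < h; i++)
        memcpy(Y + i*h, X + (I*h + i)*n + J*h, h * sizeof(double));
}

static void put_block(int n, double *X, int I, int J, const double *Y)
{   /* copy the contiguous n/2-by-n/2 array Y into block (I,J) of X */
    int h = n/2;
    for (int i = 0; i < h; i++)
        memcpy(X + (I*h + i)*n + J*h, Y + i*h, h * sizeof(double));
}

/* C = A*B by Strassen's method; n must be a power of 2 */
void strassen(int n, const double *A, const double *B, double *C)
{
    if (n <= 64) { naive(n, A, B, C); return; }   /* cutoff (would be tuned in practice) */
    int h = n / 2;
    size_t sz = (size_t)h * h;
    double *w = malloc(17 * sz * sizeof(double)); /* 8 blocks + 7 products + 2 temporaries */
    double *A11 = w,        *A12 = w + sz,   *A21 = w + 2*sz,  *A22 = w + 3*sz;
    double *B11 = w + 4*sz, *B12 = w + 5*sz, *B21 = w + 6*sz,  *B22 = w + 7*sz;
    double *M1 = w + 8*sz,  *M2 = w + 9*sz,  *M3 = w + 10*sz,  *M4 = w + 11*sz;
    double *M5 = w + 12*sz, *M6 = w + 13*sz, *M7 = w + 14*sz;
    double *T1 = w + 15*sz, *T2 = w + 16*sz;

    get_block(n, A, 0,0, A11); get_block(n, A, 0,1, A12);
    get_block(n, A, 1,0, A21); get_block(n, A, 1,1, A22);
    get_block(n, B, 0,0, B11); get_block(n, B, 0,1, B12);
    get_block(n, B, 1,0, B21); get_block(n, B, 1,1, B22);

    /* the seven half-size products, computed recursively */
    add(h, A11, A22, T1); add(h, B11, B22, T2); strassen(h, T1, T2, M1);
    add(h, A21, A22, T1);                       strassen(h, T1, B11, M2);
    sub(h, B12, B22, T2);                       strassen(h, A11, T2, M3);
    sub(h, B21, B11, T2);                       strassen(h, A22, T2, M4);
    add(h, A11, A12, T1);                       strassen(h, T1, B22, M5);
    sub(h, A21, A11, T1); add(h, B11, B12, T2); strassen(h, T1, T2, M6);
    sub(h, A12, A22, T1); add(h, B21, B22, T2); strassen(h, T1, T2, M7);

    /* C11 = M1+M4-M5+M7, C12 = M3+M5, C21 = M2+M4, C22 = M1-M2+M3+M6 */
    add(h, M1, M4, T1); sub(h, T1, M5, T2); add(h, T2, M7, T1); put_block(n, C, 0,0, T1);
    add(h, M3, M5, T1);                                         put_block(n, C, 0,1, T1);
    add(h, M2, M4, T1);                                         put_block(n, C, 1,0, T1);
    sub(h, M1, M2, T1); add(h, T1, M3, T2); add(h, T2, M6, T1); put_block(n, C, 1,1, T1);

    free(w);
}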

4.8 Comparison
The MatBench program referred to in the textbook compares the four O(n^3)
methods for computing C = AB just discussed (MatMatDot, MatMatSax,
MatMatVec, and MatMatOuter) with the single Matlab statement C = A*B.
The conclusion is correct; the performance of the method depends on the
computation kernel used (saxpy, dot product, outer product, etc.). But be
careful not to draw too many conclusions from the table on page 170. The
different run times have more to do with the Matlab interpreter overhead
than with the details of the kernel. Since Matlab stores matrices by columns,
it most likely uses a saxpy-based method (MatMatSax). This method appears
to be the slowest in the table, however. Matlab does not use specialized
block algorithms such as the ones discussed below in Sections 5 and 6.


5 Cache performance

Cache is a small fast memory that most modern computers have, including
supercomputers, all high-performance workstations, and nearly all PC's. It
is used to keep copies of the most recently used data from main memory.
If a copy of a variable is held in cache, then operating on that variable is
much faster. The cache is managed by hardware, so software need not be
concerned with its use; it gets used automatically. Since it is managed
by hardware, a very simple technique is used for figuring out what to keep
in cache and what to throw out. When a new item needs to be put in cache,
one typical strategy is to throw out the oldest item still in the cache (the
one that has been there the longest, or the one that has not been used for
the longest amount of time). If you want fast matrix multiplications
(critical for multimedia applications, scientific simulations, and such)
then you have to write the software so that it takes advantage of cache. All
we need to do is to use the correct block matrix method.
Suppose we have a cache that can hold only 1024 floating-point numbers
(this is small, about 8K bytes if we're using IEEE double precision, but it
makes the example simpler). If we attempt to multiply two 1024-by-1024
matrices using MatMatDot, we compute

    c_11 = a_1^T b_1
    c_12 = a_1^T b_2

When the first statement has been executed, the last 512 entries from a_1^T
and b_1 remain in cache, since this is the most recently used data (neglect
the scalar c_11). So when the second statement starts, the leftmost entries
of a_1^T and the uppermost entries of b_2 start to get loaded into cache,
flushing out the less recently used data. By the time we get to the middle
of a_1^T, none of
the latter part of a_1^T is left in cache. So no use is made of cache at all.
For every 2 scalar operations, c_ij = c_ij + a_ik b_kj, two pieces of data
(a_ik and b_kj) are not in cache. The ratio of cache misses to floating-point
operations is 2/2 = 1 (a reference to memory is a miss if it is not in cache;
for a cache miss we have to get the data from main memory). This is terrible,
since main memory is much slower than the floating-point hardware on the
chip.
However, if we break up the matrices A, B, and C into 16-by-16 blocks,
we can do the following:


    note that the for loops can be interchanged
    for i = 1 : m̄
        for j = 1 : n̄
            C_ij = \sum_{k=1}^{r̄} A_ik B_kj
        end for
    end for

(here each matrix is a 64-by-64 block matrix of 16-by-16 blocks, so
m̄ = n̄ = r̄ = 1024/16 = 64)


Suppose C_ij is already in cache. When the product A_ik B_kj is computed,
there are 2(16^2) cache misses to load the two submatrices into cache, and
then 2(16^3) + 16^2 floating-point operations to multiply them and add the
result to C_ij. Since the cache holds 1024 words, all these submatrices fit
into cache (including C_ij). So data is not in cache only 2(16^2) times when
we need it. Thus the ratio of cache misses to floating-point operations is
2(16^2) / (2(16^3) + 16^2), which is 2/33 or about 0.06. This is much better
than the naive MatMatDot.
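
A C sketch of this blocked method (my own code, not from this report), with
16-by-16 blocks and matrices stored by rows; for simplicity it assumes n is a
multiple of the block size:

#define BS 16   /* block size: three 16-by-16 blocks of doubles use 6K bytes,
                   comfortably inside the 8K-byte cache of the example above */

/* C = A*B, all n-by-n and stored by rows; n is assumed to be a multiple of
   BS. The three innermost loops multiply one pair of BS-by-BS blocks and
   accumulate the result into the matching block of C, so everything they
   touch stays in cache. */
void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n*n; i++)
        C[i] = 0.0;
    for (int ib = 0; ib < n; ib += BS)
        for (int jb = 0; jb < n; jb += BS)
            for (int kb = 0; kb < n; kb += BS)
                /* C(ib.., jb..) += A(ib.., kb..) * B(kb.., jb..) */
                for (int i = ib; i < ib + BS; i++)
                    for (int j = jb; j < jb + BS; j++)
                    {
                        double s = C[i*n + j];
                        for (int k = kb; k < kb + BS; k++)
                            s += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = s;
                    }
}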


6 Special instructions

Some computers have special instructions that can be used to speed up
matrix operations. Cray supercomputers have vector operations that can
quickly perform vector dot products and saxpy's. The Intel Pentium MMX
architecture has special instructions for improving the performance of
multimedia applications (the MMX architecture is in two Pentium families:
the Pentium with MMX, and the Pentium II). In particular, a single MMX
instruction
can multiply a 2-by-2 matrix times a 2-by-1 vector,

    [ y_1 ]   [ a_11  a_12 ] [ x_1 ]
    [ y_2 ] = [ a_21  a_22 ] [ x_2 ]

that is, the single instruction computes y_1 = a_11 x_1 + a_12 x_2 and
y_2 = a_21 x_1 + a_22 x_2, all at once, and much faster than doing the four
multiplications and two additions separately. It also has an instruction for
adding two short vectors, y = a + b.

A block matrix method can be designed based on these two instructions,
rather than on single multiply and add instructions (this will be a homework
assignment). For a matrix-vector multiply with a 16-by-16 matrix, the MMX
version is almost six times faster than the regular version.
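
To see how such a kernel is used, here is a plain C sketch of a 16-by-16
matrix-vector multiply organized as 2-by-2 blocks times 2-by-1 pieces of x
(my own illustration; it uses ordinary C arithmetic on doubles, not actual
MMX instructions). Each call to the 2-by-2 kernel corresponds to the work one
MMX multiply instruction would do, and the two-element accumulation
corresponds to the vector add:

#define N 16

/* y = y + A*x for one 2-by-2 block A and 2-by-1 pieces of x and y.
   This is the work the single MMX multiply instruction described above
   performs (here written as ordinary C). */
static void kernel2x2(const double A[2][2], const double x[2], double y[2])
{
    y[0] += A[0][0]*x[0] + A[0][1]*x[1];
    y[1] += A[1][0]*x[0] + A[1][1]*x[1];
}

/* y = A*x with A 16-by-16, viewed as an 8-by-8 block matrix of 2-by-2 blocks */
void matvec_2x2(const double A[N][N], const double x[N], double y[N])
{
    for (int i = 0; i < N; i++)
        y[i] = 0.0;
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; j += 2)
        {
            double blk[2][2] = { { A[i][j],   A[i][j+1]   },
                                 { A[i+1][j], A[i+1][j+1] } };
            kernel2x2(blk, &x[j], &y[i]);   /* accumulate a 2-by-1 partial result */
        }
}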


7 Parallel computing in the textbook

Block matrix methods are particularly useful when performing matrix oper-
ations on a parallel computer. Parallel computing is introduced in Section
5.5 of the textbook; read that section before continuing.

7.1 Models of computing
The textbook uses the term processor. The terms "process" and "processor"
are used in this discussion. Informally, a process is a "running program"
(that is, it has a program, and its own data, stack, and program counter). If
you know what "threads" are, then the term process means a process with a
single thread. A processor, on the other hand, is a piece of hardware. It can
run one or more processes at a time. If it runs more than one, it must switch
between them (only one will run at any given instant in time). On a typical
single-processor PC, it may appear that multiple processes are running
simultaneously. What is really happening is time-sharing or multi-tasking, in
which the processor is running one process, then the other, and back again,
in quick alternation. If you have a PC with two processors, then two of your
processes can truly be running simultaneously. They also could have access
to the same memory. This is an example of a "shared memory" parallel
computer. Processes on a shared memory parallel computer communicate using
shared memory and synchronization techniques such as semaphores, locks,
barriers, et cetera (see Section 4.5 for more details; we won't be discussing
shared memory algorithms in this course).
With a "distributed memory" parallel computer, each processor has its
own memory, and there is no single "shared" memory that all processors can
access. If we assume one process per processor, then no two processes can
share memory either. We can also run more than one process per processor,
where the processes on a single processor could choose to share memory or
not, depending on the algorithm. Distributed memory processors communicate
and synchronize themselves using messages, rather than shared memory.
In both types of parallel computers (shared or distributed memory), each
processor may have cache and special instructions such as MMX or vector
instructions. Thus all of the preceding discussion in Sections 5 and 6
applies, as well as the new issue of parallelism. Since this can get pretty
complicated, only parallelism is discussed here, and the other issues are
neglected.

7.2 Two-process matrix multiplication: high communication

The 2-process algorithm presented in Section 5.5.2 is a simple block
algorithm, which can be written as

    [ C_1  C_2 ] = A [ B_1  B_2 ]

where A, B, and C are n-by-n. Assume that n is even, so that C_1, C_2, B_1,
and B_2 are all n-by-n/2. Note that in contrast to the diagram on page 188 of
the textbook, the matrix A has not (yet) been split up. There are two
processes. Process 1 computes

    C_1 = A B_1

while process 2 computes

    C_2 = A B_2.

On a distributed memory parallel computer, this means that either (1) both
processes must have a complete copy of A, or (2) the matrix A must be split
between the two processes. The method in the textbook takes the latter
approach. The matrices C and B don't pose any problem; Process 1 does not
need C_2 or B_2, nor does Process 2 need C_1 or B_1. Method (2) is more
complicated, but allows our 2-processor parallel computer to solve larger
problems than if we were to use Method (1).
The algorithm for Method (2), on the bottom of page 189, can be seen in a
block matrix context. Process 1 computes C_1 = A B_1 using MatMatSax,

    [ c_1  c_2  ...  c_n/2 ] = [ a_1  a_2  ...  a_n ] [ b_1,1  b_1,2  ...  b_1,n/2 ]
                                                      [ b_2,1  b_2,2  ...  b_2,n/2 ]
                                                      [   :                   :    ]
                                                      [ b_n,1  b_n,2  ...  b_n,n/2 ]

while process 2 computes C_2 = A B_2 using the same technique,

    [ c_n/2+1  c_n/2+2  ...  c_n ] = [ a_1  a_2  ...  a_n ] [ b_1,n/2+1  b_1,n/2+2  ...  b_1,n ]
                                                            [ b_2,n/2+1  b_2,n/2+2  ...  b_2,n ]
                                                            [    :                        :    ]
                                                            [ b_n,n/2+1  b_n,n/2+2  ...  b_n,n ]

To save memory, we split A into two blocks,

    A = [ A_1  A_2 ]

where process 1 is given A_1, and process 2 is given A_2. The problem, now,
is that both processes need all of A. We could send all of A_1 from process
1 to process 2, and all of A_2 from process 2 to process 1, and then do the
matrix multiply, but this defeats the purpose (we get back to Method (1)).
The method on page 189 of the textbook only sends single columns at a time
between the processes. A process computes one column of C at a time, and
alternates between using local columns of A and columns of A that are owned
by the other process and sent to it when needed. For example, instead of
computing column 1 of C as

    c_1 = [ a_1  a_2  ...  a_n ] [ b_1,1 ]
                                 [ b_2,1 ]
                                 [   :   ]
                                 [ b_n,1 ]
process 1 computes c_1 by alternating between its own columns of A and the
columns received from process 2,

    c_1 = [ a_1  a_n/2+1  a_2  a_n/2+2  ... ] [ b_1,1     ]
                                              [ b_n/2+1,1 ]
                                              [ b_2,1     ]
                                              [ b_n/2+2,1 ]
                                              [    :      ]

The result is the same, namely

    c_1 = \sum_{k=1}^{n} a_k b_k,1.

When process 1 needs a non-local column of A, it is explicitly sent by process
2 and then received by process 1.
The problem with this approach, pointed out on page 190, is that to
compute column c_2, process 1 will need all of A again. Communication is
expensive, so we'd like to avoid sending all of A_2 from process 2 to
process 1 all over again. It would make more sense for process 1 to use each
incoming column for as many computations as it needs, for all of C, not
just c_1.


7.3 Two-process matrix multiplication: low communication

Let's avoid the large amount of communication needed for the method just
discussed in Section 7.2. Look back over the matrix-matrix multiply methods
in Section 4. There is only one method that uses A just once: MatMatOuter.
The other methods use and then reuse parts of A later in the computation.
For example, MatMatDot uses a_i^T in each iteration of the j loop, for a
total of n times. The MatMatSax method uses a_k for each iteration of the j
loop, or n times for the whole computation. MatMatVec uses all of A for each
of the n iterations.
The method in the book is a slight reworking of MatMatOuter. A method
using MatMatOuter is explained next, followed by a description of how it
is modified to get the method on the bottom of page 190 in the textbook.















Consider the block matrix multiply

    [ C_1  C_2 ] = [ a_1  a_2  ...  a_n ] [ b_1,1^T   b_1,2^T ]
                                          [ b_2,1^T   b_2,2^T ]
                                          [    :         :    ]
                                          [ b_n,1^T   b_n,2^T ]

where C_1 and C_2 are n-by-n/2 matrices, a_k is an n-by-1 column vector, and
each block b_k,j^T of the matrix B is a 1-by-n/2 row vector. Processes 1 and
2 still own the same parts of A, B, and C. Process 1 can compute

    C_1 = [ a_1  a_2  ...  a_n ] [ b_1,1^T ]  =  \sum_{k=1}^{n} a_k b_k,1^T
                                 [ b_2,1^T ]
                                 [    :    ]
                                 [ b_n,1^T ]

while process 2 can compute

    C_2 = [ a_1  a_2  ...  a_n ] [ b_1,2^T ]  =  \sum_{k=1}^{n} a_k b_k,2^T
                                 [ b_2,2^T ]
                                 [    :    ]
                                 [ b_n,2^T ]

With this division of labor, a simple parallel algorithm can be written, with
a small amount of communication. The following algorithm is not in the
textbook.


Process 1 does this:                      Process 2 does this:
    C_1 = 0                                   C_2 = 0
    for k = 1 : n                             for k = 1 : n
        if k <= n/2                               if k <= n/2
            send a_k to process 2                     recv a_k from process 1
        else                                      else
            recv a_k from process 2                   send a_k to process 1
        endif                                     endif
        C_1 = C_1 + a_k b_k,1^T                   C_2 = C_2 + a_k b_k,2^T
    end for                                   end for














Note that once a_k is received by a process, it is used and then no longer
needed, so we can discard it. Thus, neither process needs to hold the entire
A matrix all at once, just half of it. This seems like a minor point, but if
we had 1000 processors, each one would hold only 3n^2/1000 values, plus
memory of size n to hold the received message. Our parallel computer would
need a total of 1000(3n^2/1000 + n) = 3n^2 + 1000n memory (adding up the
memory size on all the processors). If each needed an entire copy of A, then
each processor would need 2n^2/1000 to hold its part of B and C, plus n^2 to
hold the copy of A, for a total of 2n^2/1000 + n^2. Using this method, our
parallel computer would need a total of memory of size
1000(2n^2/1000 + n^2) = 1002n^2, which is enormous.
The method in the book (on the bottom of page 190) is a little more
complicated. It can be written as

Process 1 does this:                          Process 2 does this:
    C_1 = 0                                       C_2 = 0
    for k = 1 : n/2                               for k = 1 : n/2
        send a_k to process 2                         send a_{k+n/2} to process 1
        C_1 = C_1 + a_k b_k,1^T                       C_2 = C_2 + a_{k+n/2} b_{k+n/2,2}^T
        recv a_{k+n/2} from process 2                 recv a_k from process 1
        C_1 = C_1 + a_{k+n/2} b_{k+n/2,1}^T           C_2 = C_2 + a_k b_k,2^T
    end for                                       end for


Note that the for-j loops shown in the book have been replaced here with a
simpler description of what they are really computing: an outer product of
two vectors. The code in the book breaks the outer product down into a
collection of saxpy operations (look back at MatMatSax and set r to 1, and
you get a saxpy-based method for computing the outer product C = a b^T).
There is a minor mistake in Section 5.5.3 (Performance Analysis) in the
textbook. It counts inner products of a row of A with a column of B. That
is not what the algorithm is doing. It performs outer products of a column
of A with a row of B.
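
The follow-up paper mentioned in Section 8 covers message passing in detail,
but for concreteness, here is a rough MPI sketch of the simple (non-textbook)
algorithm given at the start of this section, written for exactly two
processes. The data layout and names are my own assumptions: each process
owns half of the columns of A and B and computes its half of the columns of
C, all stored by columns; the routine is called between MPI_Init and
MPI_Finalize:

/* Two-process C = A*B with low communication, in the outer-product form
   above. Rank 0 plays process 1 (owns columns 1..n/2 of A, B, and C); rank 1
   plays process 2 (owns columns n/2+1..n). Local pieces are n-by-n/2 and
   stored by columns. A rough sketch only: n is assumed even, and no errors
   are checked. */
#include <mpi.h>
#include <stdlib.h>

void two_process_matmul(int n, const double *Aloc, const double *Bloc, double *Cloc)
{
    int rank, half = n / 2;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int other = 1 - rank;

    double *ak = malloc(n * sizeof(double));    /* one column of A at a time */
    for (int i = 0; i < n * half; i++)
        Cloc[i] = 0.0;

    for (int k = 0; k < n; k++)
    {
        int owner = (k < half) ? 0 : 1;         /* which process owns column k of A */
        if (rank == owner)
        {
            const double *src = Aloc + (k % half) * n;   /* the local copy of a_k */
            for (int i = 0; i < n; i++)
                ak[i] = src[i];
            MPI_Send(ak, n, MPI_DOUBLE, other, k, MPI_COMM_WORLD);
        }
        else
        {
            MPI_Recv(ak, n, MPI_DOUBLE, other, k, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* outer-product update with the local half of row k of B:
           rank 0 does C_1 = C_1 + a_k b_k,1^T, rank 1 does C_2 = C_2 + a_k b_k,2^T */
        for (int j = 0; j < half; j++)
            for (int i = 0; i < n; i++)
                Cloc[i + j*n] += ak[i] * Bloc[k + j*n];
    }
    free(ak);
}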














8 Summary


You should now understand the basic principles of block matrices, and how
they are used in matrix multiplication. Chapter 5 of the textbook has been
fully explained in terms of block matrix methods, and cache and MMX issues
have also been presented. Block matrix methods provide a powerful framework
for parallel computing, as presented in the textbook and explained here
in terms of the block matrix technique.
The next step will be to elaborate on the parallel computing techniques
presented here. Another paper will follow this one, describing more parallel
block matrix techniques and the block cyclic matrix distribution for
distributed memory parallel computers. It will describe how to simulate a
parallel distributed-memory algorithm in Matlab, and how to implement the
algorithm truly in parallel using MPI, a message-passing standard.


9 Bibliography

The following papers and books provide useful or essential additional
material related to block matrix operations. The technical reports and many
of the articles are available at http://www.netlib.org/index.html and
http://www.cs.utexas.edu/users/plapack/.
1. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz,
A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and
D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, 1992 (second
edition 1995). Discusses block matrix operations, primarily for
sequential (single processor) computers. Matlab is based on LINPACK,
which is the non-block precursor to LAPACK.

2. J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet,
K. Stanley, D. Walker, and R. C. Whaley, LAPACK Working Note 95,
ScaLAPACK: A portable linear algebra library for distributed memory
computers: design issues and performance. Discusses the block cyclic
matrix distribution for distributed memory parallel computers, and
related parallel block matrix operations. Relies on the PBLAS.

3. J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and
R. C. Whaley, LAPACK Working Note 100: A proposal for a set of parallel
basic linear algebra subprograms. The PBLAS, or parallel BLAS. Uses
block matrix methods exclusively.

4. A. Chtchelkanova, J. Gunnels, G. Morrow, J. Overfelt, and R. van de
Geijn, Parallel implementation of the BLAS: General techniques for
level 3 BLAS, Concurrency: Practice and Experience, vol. 9, no. 9,
pp. 837 ff., Sept. 1997.

5. J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling, A set of level 3
basic linear algebra subprograms, ACM Transactions on Mathematical
Software, vol. 16, no. 1, pp. 1-17, 1990. Describes the BLAS, a
collection of key matrix kernels. The BLAS is a portable library that
is normally tuned or rewritten by each computer manufacturer so that
codes using the BLAS can achieve portable performance. The model
implementation does not use block matrix techniques, but most tuned
versions do.

6. M. Mittal, A. Peleg, U. Weiser, MMX Technology Architecture Overview,
Intel Technology Journal, Q3 1997, Intel Corporation. Available at
http://developer.intel.com/technology/itj/q31997/pdf/archite.pdf.
Discusses MMX, and how it can be used to speed up multimedia
applications. In particular, it presents a block matrix-vector multiply
method, for multiplying a 16-by-16 matrix times a 16-by-1 vector. A copy
of this paper was handed out in class.

7. R. van de Geijn, SUMMA: scalable universal matrix multiplication al-
gorithm. A parallel block matrix multiply algorithm for distributed
memory computers.

8. C. Van Loan, Introduction to Scientific Computing, Prentice Hall, 1997.
For the COT 4501 class, the essential starting point. Read Chapter 5
first, then this document.



