Block matrix methods: Taking advantage of
high-performance computers
Tim Davis
Dept. of Computer & Info. Sci. & Eng.
Univ. of Florida
TR98024
October 29, 1998
Abstract
Block matrices and block matrix operations are essential for high-
performance computing. Using these techniques, applications can take
advantage of architectural features such as cache, virtual memory,
parallelism, and vector and other special instructions, that are present in
most modern computers, from the latest PC's to the most powerful
supercomputers. As a result, block matrix operations and the applications
that rely on them (including scientific simulations, graphics,
and multimedia applications) can often reach close to the theoretical
maximum performance on any computer. This document, written for
COT 4501 (Numerical Analysis: A Computational Approach), explains
how these techniques work. This work was supported by the
National Science Foundation, grant number 9634470.
1 Introduction
As you read this document, refer to Chapter 5 (Matrix Computations) from
the book Introduction to Scientific Computing by Charles Van Loan, 1997.
In Section 2, below, block matrices are presented. Section 3 explains how
two block matrices are multiplied together. Using this framework, the
matrix operations in Chapter 5 of Van Loan are explained in terms of block
matrix methods in Section 4 (except for parallel matrix multiplication).
Sections 5 and 6 explain how to take advantage of cache, and vector or other
special instructions, such as those in Intel's MMX architecture. Parallel
computing is introduced in Section 7; this section discusses the parallel methods
in Chapter 5 of Van Loan in terms of block matrix operations. Finally, Section 8
concludes with a brief look at more advanced techniques, and Section 9
provides an annotated bibliography.
The following notation is used throughout this document. Matrices are
shown in bold upper case, like A, B, and C. Scalars are in lower case. A
scalar entry in row i and column j of a matrix is shown in lower case with
subscripts, aij for the matrix A, and bij for the matrix B, for example. Vectors
are in bold lower case, such as x and y. By convention, a vector x denotes a
column vector; if we want to denote a row vector we need to use xT, which
is the transpose of a column vector. Row and column vectors extracted from
a matrix are shown in bold lower case with subscripts. For example, bj and
biT are the jth column and ith row of B, respectively. Submatrices that are
not scalars or vectors are shown as bold upper case with subscripts, such as
A12. If the range of one of the subscripts is always 1, for a set of submatrices,
then it is dropped. For example, if the m-by-n matrix A is split into only
two submatrices, each with half the columns of A, then the two m-by-n/2
submatrices are referred to as A1 and A2.
2 Block matrices
Block matrices are not really a different kind of matrix, just another way of
looking at a regular matrix. A block matrix can simply be thought of as a
matrix of matrices. For example, suppose A is a 4-by-4 matrix,
    A = [ a11 a12 a13 a14 ]
        [ a21 a22 a23 a24 ]
        [ a31 a32 a33 a34 ]
        [ a41 a42 a43 a44 ]

We can look at this as a 2-by-2 block matrix, where each block is a 2-by-2
matrix,

    A = [ A11 A12 ]
        [ A21 A22 ]

and where

    A11 = [ a11 a12 ]    A12 = [ a13 a14 ]
          [ a21 a22 ]          [ a23 a24 ]

    A21 = [ a31 a32 ]    A22 = [ a33 a34 ]
          [ a41 a42 ]          [ a43 a44 ]
By dividing up a matrix in different ways like this, different matrix methods
can be devised that take better advantage of the underlying computer
architecture. Cache, virtual memory, parallel computing, and special
instruction sets (such as Cray vector instructions and Intel's MMX
architecture) can all be exploited to speed up matrix computations.
Now for some more formal definitions. Suppose we have an m-by-n matrix
A. The m rows can be partitioned into M pieces, where the ith piece is of size
mi. The pieces of course must add up to m, that is m = sum_{i=1}^{M} mi.
Similarly, the n columns of A can be split into N pieces, where the jth piece
is of size nj, and we have n = sum_{j=1}^{N} nj. The matrix A is then treated
as an M-by-N block matrix, where block Aij has dimension mi-by-nj. We have

    A = [ A11 A12 ... A1N ]
        [ A21 A22 ... A2N ]
        [  :   :       :  ]
        [ AM1 AM2 ... AMN ]

In the above example, m = 4, M = 2, m1 = 2, and m2 = 2. The columns of
A are partitioned in the same way.
This definition can be applied recursively, where, for example, any one or
more of the blocks of A can itself be viewed as a block matrix.
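This block viewpoint is easy to express in code. The following Python sketch (not part of the original report; it uses plain lists of lists, and the function name `block` is mine) extracts one block of a matrix given the row and column partition sizes:

```python
# Extract the (I, J) block of A, given row partition sizes ms and
# column partition sizes ns (each list must sum to A's dimensions).
# I and J are zero-based block indices.
def block(A, ms, ns, I, J):
    r0 = sum(ms[:I])          # first row of block (I, J)
    c0 = sum(ns[:J])          # first column of block (I, J)
    return [row[c0:c0 + ns[J]] for row in A[r0:r0 + ms[I]]]

A = [[11, 12, 13, 14],
     [21, 22, 23, 24],
     [31, 32, 33, 34],
     [41, 42, 43, 44]]

# View the 4-by-4 matrix as a 2-by-2 block matrix of 2-by-2 blocks;
# this pulls out the lower-left block (A21 in the notation above).
print(block(A, [2, 2], [2, 2], 1, 0))   # [[31, 32], [41, 42]]
```

Nothing is copied or rearranged in the underlying matrix; a block is just a window onto a range of rows and columns.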
3 Block matrix multiplication
Block matrices are a useful way of looking at the many different ways of
doing matrix multiplication, and also a useful way of coming up with new
techniques. Suppose we want to compute C = AB, where C is m-by-n, A
is m-by-r, and B is r-by-n. Remember that C = AB is not defined unless
1. the number of rows in C and A are the same (m),
2. the number of columns in C and B are the same (n),
3. and the number of columns in A is the same as the number of rows in
B (which is r).
The definition of this operation is given as a collection of dot products (also
called inner products). The cij entry is a dot product of row i of A with
column j of B. That is,

    cij = sum_{k=1}^{r} aik bkj

where i and j are in the range 1 to m and n, respectively. (This can be a
lousy way of computing C = AB, however.) Writing this in vector notation,
we get

    cij = aiT bj

where aiT is the ith row of A, and bj is the jth column of B.
For a block matrix multiplication, we can view the larger matrix
multiplication (C = AB) as a collection of smaller matrix operations, such
as matrix-matrix multiplication with smaller matrices, dot products, scaled
vector additions, vector outer products, and matrix-vector multiplications.
Suppose C, A, and B are viewed as block matrices, and that
1. the partitioning of the m rows of C and A are the same,
2. the partitioning of the n columns of C and B are the same,
3. and the partitioning of the r columns of A is the same as the parti
tioning of the r rows of B.
Then we can define the block matrix multiplication C = AB as

    Cij = sum_{k=1}^{R} Aik Bkj

where i ranges from 1 to M (the number of row blocks), j ranges from 1 to N
(the number of column blocks), and R is the number of pieces into which r
is partitioned.
For example, consider C = AB where C is 3-by-5, A is 3-by-4, and B is
4-by-5. Suppose we split m = 3 into two pieces of size 1 and 2, n = 5 into
three pieces of size 2, 1, and 2, and r = 4 into two pieces of size 2 each. We
get

    [ c11 c12 c13 c14 c15 ]   [ a11 a12 a13 a14 ] [ b11 b12 b13 b14 b15 ]
    [ c21 c22 c23 c24 c25 ] = [ a21 a22 a23 a24 ] [ b21 b22 b23 b24 b25 ]
    [ c31 c32 c33 c34 c35 ]   [ a31 a32 a33 a34 ] [ b31 b32 b33 b34 b35 ]
                                                  [ b41 b42 b43 b44 b45 ]
or, with block submatrices,

    [ C11 C12 C13 ]   [ A11 A12 ] [ B11 B12 B13 ]
    [ C21 C22 C23 ] = [ A21 A22 ] [ B21 B22 B23 ]

where

    C11 = [ c11 c12 ]    C12 = [ c13 ]    C13 = [ c14 c15 ]

    C21 = [ c21 c22 ]    C22 = [ c23 ]    C23 = [ c24 c25 ]
          [ c31 c32 ]          [ c33 ]          [ c34 c35 ]

    A11 = [ a11 a12 ]    A12 = [ a13 a14 ]

    A21 = [ a21 a22 ]    A22 = [ a23 a24 ]
          [ a31 a32 ]          [ a33 a34 ]

    B11 = [ b11 b12 ]    B12 = [ b13 ]    B13 = [ b14 b15 ]
          [ b21 b22 ]          [ b23 ]          [ b24 b25 ]

    B21 = [ b31 b32 ]    B22 = [ b33 ]    B23 = [ b34 b35 ]
          [ b41 b42 ]          [ b43 ]          [ b44 b45 ]
4 Block matrices in the textbook
In Chapter 5 of Introduction to Scientific Computing by Charles Van Loan,
1997, several matrix-vector and matrix-matrix multiplication methods are
presented. Each of these has a block matrix counterpart.
4.1 Matrix-vector: row oriented
The row-oriented matrix-vector product, y = Ax as computed by MatVecRO,
can be written as

    [ y1 ]   [ a1T ]
    [ y2 ] = [ a2T ] x
    [ :  ]   [  :  ]
    [ ym ]   [ amT ]

where yi is a scalar and x is an n-by-1 column vector (a single 1-by-1 block
matrix, with a single n-by-1 block). The matrix A is an m-by-1 block matrix,
where each block aiT is a 1-by-n row vector.

Algorithm MatVecRO:
    for i = 1 : m
        yi = aiT x
    end for
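As a concrete (if slow) illustration, MatVecRO can be sketched in plain Python; the function name is mine, and a tuned library would of course be used in practice:

```python
# Row-oriented matrix-vector product y = A x: each y[i] is the dot
# product of row i of A with x (a sketch of MatVecRO).
def matvec_ro(A, x):
    return [sum(aik * xk for aik, xk in zip(row, x)) for row in A]

A = [[1, 2], [3, 4], [5, 6]]
x = [10, 1]
print(matvec_ro(A, x))   # [12, 34, 56]
```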
4.2 Matrix-vector: column oriented
The acronym saxpy stands for (single-precision) ax + y, where a is a scalar
and x and y are vectors. The result is accumulated into y, so the complete
operation is y = ax + y. This can be used to perform other matrix
operations, such as matrix-vector and matrix-matrix multiply. The column-
oriented matrix-vector multiply (MatVecCO) can be written as

    y = [ a1 a2 ... an ] [ x1 ]
                         [ x2 ]
                         [ :  ]
                         [ xn ]

where y is a single m-by-1 column vector. The matrix A is a 1-by-n block
matrix, where each block aj is an m-by-1 column vector.

Algorithm MatVecCO:
    note that this is n scaled vector additions (saxpys)
    y = sum_{j=1}^{n} aj xj
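A Python sketch of MatVecCO (my own illustration, not from the textbook) makes the saxpy structure explicit:

```python
# Column-oriented matrix-vector product y = A x: accumulate n saxpy
# operations, y = x[j] * (column j of A) + y (a sketch of MatVecCO).
def matvec_co(A, x):
    m, n = len(A), len(x)
    y = [0.0] * m
    for j in range(n):
        for i in range(m):       # saxpy: y = x[j] * a_j + y
            y[i] += A[i][j] * x[j]
    return y

A = [[1, 2], [3, 4], [5, 6]]
print(matvec_co(A, [10, 1]))   # [12.0, 34.0, 56.0]
```

It produces the same vector as the row-oriented version; only the order in which A is traversed (by columns instead of by rows) differs.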
4.3 Matrix-matrix: dot product version
The dot product form of a matrix multiply, C = AB as computed by
MatMatDot, can be written as

    [ c11 c12 ... c1n ]   [ a1T ]
    [ c21 c22 ... c2n ] = [ a2T ] [ b1 b2 ... bn ]
    [  :   :       :  ]   [  :  ]
    [ cm1 cm2 ... cmn ]   [ amT ]

where C is a matrix of 1-by-1 blocks, A is an m-by-1 block matrix with each
block aiT a row vector of size 1-by-r, and B is a 1-by-n block matrix with
each block bj a column vector of size r-by-1.

Algorithm MatMatDot:
    note that the for loops can be interchanged
    for i = 1 : m
        for j = 1 : n
            cij = aiT bj
        end for
    end for
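MatMatDot can be sketched in plain Python as a triple loop (an illustration of mine, not tuned code):

```python
# Dot-product matrix multiply C = A B: c[i][j] is the dot product of
# row i of A with column j of B (a sketch of MatMatDot).
def matmat_dot(A, B):
    m, r, n = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(r))
             for j in range(n)] for i in range(m)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmat_dot(A, B))   # [[19, 22], [43, 50]]
```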
4.4 Matrix-matrix: saxpy version
The MatMatSax routine is a saxpy-based matrix-matrix multiply. It can be
written as

    [ c1 c2 ... cn ] = [ a1 a2 ... ar ] [ b11 b12 ... b1n ]
                                        [ b21 b22 ... b2n ]
                                        [  :   :       :  ]
                                        [ br1 br2 ... brn ]

where cj and ak are m-by-1 column vectors.

Algorithm MatMatSax:
    for j = 1 : n
        cj = sum_{k=1}^{r} ak bkj
    end for
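A Python sketch of MatMatSax (again my own illustration) shows that the innermost loop is a saxpy that accumulates into a column of C:

```python
# Saxpy-based matrix multiply C = A B: column j of C is a sum of
# scaled columns of A, c_j = sum_k b[k][j] * a_k (MatMatSax sketch).
def matmat_sax(A, B):
    m, r, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for j in range(n):
        for k in range(r):
            for i in range(m):   # saxpy: c_j = b[k][j] * a_k + c_j
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmat_sax([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```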
4.5 Matrix-matrix: matrix-vector version
The MatMatVec routine is based on matrix-vector multiplication. It can be
written as

    [ c1 c2 ... cn ] = A [ b1 b2 ... bn ]

where cj is an m-by-1 column vector, and bj is an r-by-1 column vector.
Algorithm MatMatVec:
for j = 1 : n
cj = Abj
end for
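MatMatVec in plain Python (an illustrative sketch of mine) just reuses a matrix-vector product, once per column of B:

```python
# Matrix-vector based multiply C = A B: column j of C is the
# matrix-vector product A b_j (a sketch of MatMatVec).
def matmat_vec(A, B):
    m, r, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for j in range(n):
        bj = [B[k][j] for k in range(r)]    # column j of B
        for i in range(m):                  # c_j = A * b_j
            C[i][j] = sum(A[i][k] * bj[k] for k in range(r))
    return C

print(matmat_vec([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```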
4.6 Matrix-matrix: outer product version
Recall that a dot product is a row vector times a column vector, a = xTy,
with the result being a scalar. An outer product is a column vector times
a row vector, E = xyT, with the result being a matrix. Outer products are
used in the MatMatOuter matrix-matrix multiply. It can be written as

    C = [ a1 a2 ... ar ] [ b1T ]
                         [ b2T ]
                         [  :  ]
                         [ brT ]

where ak is an m-by-1 column vector and bkT is a 1-by-n row vector.

Algorithm MatMatOuter:
    note that this is the summation of r outer products
    C = sum_{k=1}^{r} ak bkT
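A Python sketch of MatMatOuter (my own illustration) accumulates the r rank-1 updates:

```python
# Outer-product matrix multiply C = A B: C is the sum of r outer
# products, column k of A times row k of B (a sketch of MatMatOuter).
def matmat_outer(A, B):
    m, r, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for k in range(r):           # add the rank-1 update a_k * b_k^T
        for i in range(m):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmat_outer([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```

Note that each column ak of A is touched in exactly one iteration of the k loop, a property that matters for the parallel methods in Section 7.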
4.7 Matrix-matrix: Strassen's method
Strassen's matrix multiplication of C = AB is best thought of as a recursive
2-by-2 block partitioning of the three matrices. Suppose the matrices are all
square (n-by-n) and n is a power of 2. (The method can be extended to handle
more general cases; this assumption just makes the presentation simpler.) We
can write

    [ C11 C12 ]   [ A11 A12 ] [ B11 B12 ]
    [ C21 C22 ] = [ A21 A22 ] [ B21 B22 ]

where all submatrices are n/2-by-n/2. The trick is to replace the standard
computation (requiring eight multiplications and four additions with
n/2-by-n/2 submatrices) with a more complicated rearrangement that
requires seven multiplications and 18 additions, respectively. The seven
n/2-by-n/2 matrix multiplications can be done the same way, recursively.
As a result, Strassen's method takes O(n^{log2 7}) time, or about O(n^2.81).
This is faster than the standard O(n^3) we've seen so far. There are even
faster methods.
Surprisingly, no one yet knows the tight bounds on the time complexity for
matrix multiplication! This is one of the fascinating unsolved problems of
linear algebra. You would think that by now we would know how much work
it takes to multiply two matrices ... but we don't.
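For concreteness, here is one standard arrangement of Strassen's seven products, sketched in Python (not from the textbook; the helper names are mine, and no attempt is made at efficiency):

```python
# Recursive Strassen multiply for n-by-n matrices with n a power of 2:
# seven half-size multiplies (P1..P7) instead of eight.
def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    blk = lambda M, r, c: [row[c:c + h] for row in M[r:r + h]]
    add = lambda X, Y: [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    sub = lambda X, Y: [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = blk(A,0,0), blk(A,0,h), blk(A,h,0), blk(A,h,h)
    B11, B12, B21, B22 = blk(B,0,0), blk(B,0,h), blk(B,h,0), blk(B,h,h)
    P1 = strassen(add(A11, A22), add(B11, B22))
    P2 = strassen(add(A21, A22), B11)
    P3 = strassen(A11, sub(B12, B22))
    P4 = strassen(A22, sub(B21, B11))
    P5 = strassen(add(A11, A12), B22)
    P6 = strassen(sub(A21, A11), add(B11, B12))
    P7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(P1, P4), P5), P7)   # C11 = P1 + P4 - P5 + P7
    C12 = add(P3, P5)                     # C12 = P3 + P5
    C21 = add(P2, P4)                     # C21 = P2 + P4
    C22 = add(add(sub(P1, P2), P3), P6)   # C22 = P1 - P2 + P3 + P6
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])

print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```

In practice the recursion is stopped well above n = 1, switching to an ordinary blocked multiply for small submatrices.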
4.8 Comparison
The MatBench program referred to in the textbook compares the four O(n^3)
methods for computing C = AB just discussed (MatMatDot, MatMatSax,
MatMatVec, and MatMatOuter) with the single MATLAB statement C = A*B.
The conclusion is correct; the performance of the method depends on the
computational kernel used (saxpy, dot product, outer product, etc.). But be
careful not to draw too many conclusions from the table on page 170. The
different run times have more to do with the MATLAB interpreter overhead
than the details of the kernel. Since MATLAB stores matrices by columns, it
most likely uses a saxpy-based method (MatMatSax). This method appears
to be the slowest in the table, however. MATLAB does not use specialized block
algorithms such as the ones discussed below in Sections 5 and 6.
5 Cache performance
Cache is a small fast memory that most modern computers have, including
supercomputers, all highperformance workstations, and nearly all PC's. It
is used to keep copies of the most recently used data from main memory.
If a copy of a variable is held in cache, then operating on that variable is
much faster. The cache is managed by hardware, so software need not
be concerned with its use; it gets used automatically. Since it is managed
by hardware, a very simple technique is used for figuring out what to keep
in cache and what to throw out. When a new item needs to be put in cache,
one typical strategy is to throw out the oldest item still in the cache (the one
that has been there the longest, or the one that has not been used for
the longest amount of time). If you want fast matrix multiplications (critical
for multimedia applications, scientific simulations, and such) then you have
to write the software so that it takes advantage of cache. All we need to do
is to use the correct block matrix method.
Suppose we have a cache that can hold only 1024 floating-point numbers
(this is small, about 8K bytes if we're using IEEE double precision, but it
makes the example simpler). If we attempt to multiply two 1024-by-1024
matrices using MatMatDot, we compute

    c11 = a1T b1
    c12 = a1T b2
    ...

When the first statement has been executed, the last 512 entries from a1T
and b1 remain in cache, since this is the most recently used data (neglecting
the scalar c11). So when the second statement starts, the leftmost entries of a1T
and the uppermost entries of b2 start to get loaded into cache, flushing out
the less recently used data. By the time we get to the middle of a1T, none of
the latter part of a1T (left over from the first statement) remains in cache. So
no use is made of cache at all. For every 2 scalar operations, cij = cij + aik bkj,
two pieces of data (aik and bkj) are not in cache. The ratio of cache misses
to floating-point operations is 2/2 = 1 (a reference to memory is a "miss" if it
is not in cache; for a cache miss we have to get the data from main memory).
This is terrible, since main memory is much slower than the floating-point
hardware on the chip.
However, if we break up the matrices A, B, and C into 16-by-16 blocks,
we can do the following (with 1024-by-1024 matrices there are
M = N = R = 1024/16 = 64 blocks in each dimension):

    note that the for loops can be interchanged
    for i = 1 : M
        for j = 1 : N
            Cij = sum_{k=1}^{R} Aik Bkj
        end for
    end for

Suppose Cij is already in cache. When the product Aik Bkj is computed,
there are 2(16^2) operations to load the two submatrices into cache, and then
2(16^3) + (16^2) operations to multiply them and add them to Cij. Since the
cache holds 1024 words, all these submatrices fit into cache (including Cij).
So data is not in cache only 2(16^2) times. Then the ratio
of cache misses to floating-point operations is 2(16^2)/(2(16^3) + (16^2)), which
is 2/33 or about 0.06. This is much better than the naive MatMatDot.
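The blocked loop above can be sketched in plain Python (an illustration of mine; the block size s plays the role of 16, and real code would tune s to the cache size):

```python
# Blocked (cache-friendly) multiply of square matrices: operate on
# s-by-s submatrices so three blocks at a time fit in cache.
def blocked_matmul(A, B, s):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):          # C_ij += A_ik * B_kj
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        aik = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += aik * B[k][j]
    return C

A = [[1, 2, 3, 4]] * 4                           # simple 4-by-4 test
I = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
print(blocked_matmul(A, I, 2))                   # same as unblocked A*I
```

Python itself gains nothing from this reordering (the interpreter overhead dominates), but the identical loop structure in C or Fortran is what makes tuned BLAS libraries fast.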
6 Special instructions
Some computers have special instructions that can be used to speed up
matrix operations. Cray supercomputers have vector operations that can
quickly perform vector dot products and saxpys. The Intel Pentium MMX
architecture has special instructions for improving the performance of
multimedia applications (the MMX architecture is in two Pentium families:
the Pentium with MMX, and the Pentium II). In particular, a single MMX
instruction can multiply a 2-by-2 matrix times a 2-by-1 vector,

    [ y1 ]   [ a11 a12 ] [ x1 ]
    [ y2 ] = [ a21 a22 ] [ x2 ]

that is, the single instruction computes y1 = a11 x1 + a12 x2 and y2 = a21 x1 +
a22 x2, all at once, and much faster than doing the four multiplications and
two additions separately. It also has an instruction for adding vectors,

    [ y1 ]   [ a1 ]   [ b1 ]
    [ y2 ] = [ a2 ] + [ b2 ]

A block matrix method can be designed based on these two instructions,
rather than the single multiply and add instructions (this will be a homework
assignment). For a matrix-vector multiply with a 16-by-16 matrix, the MMX
version is almost six times faster than the regular version.
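The following Python sketch (my own model, not the homework solution) mimics this approach by treating the 2-by-2 multiply and the 2-vector add as single "instructions"; on real MMX hardware each of these would be one operation:

```python
# Model of an MMX-style block matrix-vector multiply: the two "special
# instructions" (2-by-2 times 2-by-1 multiply, and 2-vector add) are
# ordinary functions here; hardware would do each in one step.
def mult2x2(a11, a12, a21, a22, x1, x2):
    return (a11 * x1 + a12 * x2, a21 * x1 + a22 * x2)

def add2(y1, y2, b1, b2):
    return (y1 + b1, y2 + b2)

def block_matvec(A, x):
    n = len(A)                   # n is assumed even
    y = [0] * n
    for i in range(0, n, 2):
        for j in range(0, n, 2):
            t1, t2 = mult2x2(A[i][j], A[i][j+1],
                             A[i+1][j], A[i+1][j+1], x[j], x[j+1])
            y[i], y[i+1] = add2(y[i], y[i+1], t1, t2)
    return y

A = [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(block_matvec(A, [10, 1, 5, 6]))   # [12, 34, 5, 6]
```

The point is the loop structure: the n-by-n product is expressed entirely in terms of 2-by-2 block operations, so each "instruction" does useful work on four matrix entries at once.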
7 Parallel computing in the textbook
Block matrix methods are particularly useful when performing matrix
operations on a parallel computer. Parallel computing is introduced in Section
5.5 of the textbook; read that section before continuing.
7.1 Models of computing
The textbook uses the term processor. The terms "process" and "processor"
are used in this discussion. Informally, a process is a "running program"
(that is, it has a program, and its own data, stack, and program counter). If
you know what "threads" are, then the term process means a process with a
single thread. A processor, on the other hand, is a piece of hardware. It can
run one or more processes at a time. If it runs more than one, it must switch
between them (only one will run at any given instant in time). On a
typical single-processor PC, it may appear that multiple processes are running
simultaneously. What is really happening is time-sharing or multitasking, in
which the processor is running one process, then the other, and back again,
in quick alternation. If you have a PC with two processors, then two of your
processes can truly be running simultaneously. They also could have access
to the same memory. This is an example of a "shared memory" parallel
computer. Processes on a shared memory parallel computer communicate using
shared memory and synchronization techniques such as semaphores, locks,
barriers, et cetera (see Section 4.5 for more details; we won't be discussing
shared memory algorithms in this course).
With a "distributed memory" parallel computer, each processor has its
own memory, and there is no single "shared" memory that all processors can
access. If we assume one process per processor, then no two processes can
share memory either. We can also run more than one process per processor,
where the processes on a single processor could choose to share memory or
not, depending on the algorithm. Distributed memory processors communi
cate and synchronize themselves using messages, rather than shared memory.
In both types of parallel computers (shared or distributed memory), each
processor may have cache and special instructions such as MMX or vector
instructions. Thus all of the preceding discussion in Sections 5 and 6 applies,
as well as the new issue of parallelism. Since this can get pretty complicated,
only parallelism is discussed here, and the other issues are neglected.
7.2 Two-process matrix multiplication: high communication
The two-process algorithm presented in Section 5.5.2 is a simple block
algorithm, which can be written as

    [ C1 C2 ] = A [ B1 B2 ]

where A, B, and C are n-by-n. Assume that n is even, so that C1, C2, B1,
and B2 are all n-by-n/2. Note that in contrast to the diagram on page 188 of
the textbook, the matrix A has not (yet) been split up. There are two processes.
Process 1 computes
C1 = AB1
while process 2 computes
C2 = AB2.
On a distributed memory parallel computer, this means that either (1) both
processes must have a complete copy of A, or (2) the matrix A must be split
between the two processes. The method in the textbook takes the latter
approach. The matrices C and B don't pose any problem; Process 1 does
not need C2 or B2, nor does Process 2 need C1 or B1. Method (2) is more
complicated, but allows our 2-processor parallel computer to solve larger
problems than if we were to use Method (1).
The algorithm for Method (2), on the bottom of page 189, can be seen in
a block matrix context. Process 1 computes C1 = AB1 using MatMatSax,

    [ c1 c2 ... c_{n/2} ] = [ a1 a2 ... an ] [ b11 b12 ... b1,n/2 ]
                                             [ b21 b22 ... b2,n/2 ]
                                             [  :   :        :    ]
                                             [ bn1 bn2 ... bn,n/2 ]

while process 2 computes C2 = AB2 using the same technique,

    [ c_{n/2+1} ... cn ] = [ a1 a2 ... an ] [ b1,n/2+1 ... b1n ]
                                            [ b2,n/2+1 ... b2n ]
                                            [    :          :  ]
                                            [ bn,n/2+1 ... bnn ]
To save memory, we split A into two blocks,

    A = [ A1 A2 ]

where process 1 is given A1, and process 2 is given A2. The problem, now,
is that both processes need all of A. We could send all of A1 from process
1 to process 2, and all of A2 from process 2 to process 1, and then do the
matrix multiply, but this defeats the purpose (we get back to Method (1)).
The method on page 189 of the textbook only sends single columns at a time
between the processes. A process computes one column of C at a time,
and alternates between using local columns of A and columns of A that are
owned by the other process and sent to it when needed. For example, instead
of computing column 1 of C as

    c1 = [ a1 a2 ... an ] [ b11 ]
                          [ b21 ]
                          [  :  ]
                          [ bn1 ]
process 1 computes c1 by alternating the columns of A,

    c1 = [ a1  a_{n/2+1}  a2  a_{n/2+2}  ...  a_{n/2}  an ] [ b11         ]
                                                            [ b_{n/2+1,1} ]
                                                            [ b21         ]
                                                            [ b_{n/2+2,1} ]
                                                            [     :       ]
                                                            [ b_{n/2,1}   ]
                                                            [ bn1         ]

The result is the same, namely

    c1 = sum_{k=1}^{n} ak bk1.
When process 1 needs a nonlocal column of A, it is explicitly sent by process
2 and then received by process 1.
The problem with this approach, pointed out on page 190, is that to
compute column c2, process 1 will need all of A again. Communication is
expensive, so we'd like to avoid sending all of A2 from process 2 to process 1
all over again. It would make more sense for process 1 to use each incoming
column for as many computations as it needs, for all of C1, not just c1.
7.3 Two-process matrix multiplication: low communication
Let's avoid the large amount of communication needed for the method just
discussed in Section 7.2. Look back over the matrix-matrix multiply methods
in Section 4. There is only one method that uses A just once: MatMatOuter.
The other methods use and then reuse parts of A later in the computation.
For example, MatMatDot uses each column of A in every iteration of the
i loop, for a total of m times. The MatMatSax method uses ak for each
iteration of the j loop, or n times for the whole computation. MatMatVec
uses all of A for each of the n iterations.
The method in the book is a slight reworking of MatMatOuter. A method
using MatMatOuter is explained next, followed by a description of how it
is modified to get the method on the bottom of page 190 in the textbook.
Consider the block matrix multiply

    [ C1 C2 ] = [ a1 a2 ... an ] [ b11T b12T ]
                                 [ b21T b22T ]
                                 [  :     :  ]
                                 [ bn1T bn2T ]

where C1 and C2 are n-by-n/2 matrices, ak is an n-by-1 column vector, and
each block bkjT of the matrix B is a 1-by-n/2 row vector. Process 1 and 2 still
own the same parts of A, B, and C. Process 1 can compute

    C1 = [ a1 a2 ... an ] [ b11T ]  =  sum_{k=1}^{n} ak bk1T
                          [ b21T ]
                          [  :   ]
                          [ bn1T ]

while process 2 can compute

    C2 = [ a1 a2 ... an ] [ b12T ]  =  sum_{k=1}^{n} ak bk2T
                          [ b22T ]
                          [  :   ]
                          [ bn2T ]
With this division of labor, a simple parallel algorithm can be written,
with a small amount of communication. The following algorithm is not in
the textbook.

Process 1 does this:
    C1 = 0
    for k = 1 : n
        if k <= n/2
            send ak to process 2
        else
            recv ak from process 2
        endif
        C1 = C1 + ak bk1T
    end for

Process 2 does this:
    C2 = 0
    for k = 1 : n
        if k <= n/2
            recv ak from process 1
        else
            send ak to process 1
        endif
        C2 = C2 + ak bk2T
    end for
Note that once ak is received by a process, it is used and then no longer needed,
so we can discard it. Thus, neither process needs to hold the entire A matrix
all at once, just half of it. This seems like a minor point, but if we had 1000
processors, each one would hold only 3n^2/1000 values, plus memory of size
n to hold the received message. Our parallel computer would need a total of
1000(3n^2/1000 + n) = 3n^2 + 1000n memory (adding up the memory size on
all the processors). If each needed an entire copy of A, then each processor
would need 2n^2/1000 to hold its part of B and C, plus n^2 to hold the copy
of A, for a total of 2n^2/1000 + n^2. Using this method, our parallel computer
would need a total of memory of size 1000(2n^2/1000 + n^2) = 1002n^2, which
is enormous.
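The low-communication algorithm can be simulated sequentially in Python (my own sketch; the sends and receives are implicit, since one program plays both processes, but each column ak is still consumed exactly once per process):

```python
# Sequential simulation of the low-communication two-process
# outer-product multiply: each "process" owns half the columns of
# B and C, and accumulates its half of C from all n rank-1 updates.
def two_process_outer(A, B):
    n = len(A)
    h = n // 2
    C1 = [[0.0] * h for _ in range(n)]      # owned by process 1
    C2 = [[0.0] * h for _ in range(n)]      # owned by process 2
    for k in range(n):       # in the real algorithm, a_k crosses the
        for i in range(n):   # wire once here (one send, one recv)
            for j in range(h):
                C1[i][j] += A[i][k] * B[k][j]       # C1 += a_k * bk1^T
                C2[i][j] += A[i][k] * B[k][h + j]   # C2 += a_k * bk2^T
    return [r1 + r2 for r1, r2 in zip(C1, C2)]      # glue [C1 C2] back

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(two_process_outer(A, B))   # [[19.0, 22.0], [43.0, 50.0]]
```

Verifying that the glued-together [C1 C2] equals the ordinary product AB is a good sanity check when developing the real message-passing version.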
The method in the book (on the bottom of page 190) is a little more
complicated. It can be written as

Process 1 does this:
    C1 = 0
    for k = 1 : n/2
        send ak to process 2
        C1 = C1 + ak bk1T
        recv a_{k+n/2} from process 2
        C1 = C1 + a_{k+n/2} b_{k+n/2,1}T
    end for

Process 2 does this:
    C2 = 0
    for k = 1 : n/2
        send a_{k+n/2} to process 1
        C2 = C2 + a_{k+n/2} b_{k+n/2,2}T
        recv ak from process 1
        C2 = C2 + ak bk2T
    end for

Note that the for j = 1 : n/2 loops shown in the book have been replaced with
a simpler description of what they are really computing: an outer product
of two vectors. The code in the book breaks the outer product down into a
collection of saxpy operations (look back at MatMatSax and set r to 1, and
you get a saxpy-based method for computing the outer product C = abT).
There is a minor mistake in Section 5.5.3 (Performance Analysis) in the
textbook. It counts inner products of a row of A with a column of B. That
is not what the algorithm is doing. It performs outer products of a column
of A with a row of B.
8 Summary
You should now understand the basic principles of block matrices, and how
they are used in matrix multiplication. Chapter 5 of the textbook has been
fully explained in terms of block matrix methods, and cache and MMX issues
have also been presented. Block matrix methods provide a powerful framework
for parallel computing, as presented in the textbook and explained here
in terms of the block matrix technique.
The next step will be to elaborate on the parallel computing techniques
presented here. Another paper will follow this one, describing more parallel
block matrix techniques and the block cyclic matrix distribution for
distributed memory parallel computers. It will describe how to simulate a
parallel distributed-memory algorithm in MATLAB, and how to implement the
algorithm truly in parallel using MPI, a message-passing standard.
9 Bibliography
The following papers and books provide useful or essential additional
material related to block matrix operations. The technical reports and many
of the articles are available at http://www.netlib.org/index.html and
http://www.cs.utexas.edu/users/plapack/.
1. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz,
A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and
D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, 1992 (second
edition 1995). Discusses block matrix operations, primarily for
sequential (single processor) computers. MATLAB is based on LINPACK,
which is the nonblock precursor to LAPACK.
2. J. Choi, J. Demmel, I. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet,
K. Stanley, D. Walker, and R. C. Whaley, LAPACK Working Note 95,
ScaLAPACK: A portable linear algebra library for distributed memory
computers: design issues and performance. Discusses the block cyclic
matrix distribution for distributed memory parallel computers, and
related parallel block matrix operations. Relies on the PBLAS.
3. J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley,
LAPACK Working Note 100: A proposal for a set of parallel basic
linear algebra subprograms. The PBLAS, or parallel BLAS. Uses block
matrix methods exclusively.
4. A. Chtchelkanova, J. Gunnels, G. Morrow, J. Overfelt, and R. van de
Geijn, Parallel implementation of the BLAS: General techniques for
level 3 BLAS, Concurrency: Practice and Experience, vol. 9, no. 9,
pp. 837 ff. (Sept. 1997).
5. J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling, A set of level 3
basic linear algebra subprograms, ACM Transactions on Mathematical
Software, vol. 16, no. 1, pp. 1-17, 1990. Describes the BLAS, a
collection of key matrix kernels. The BLAS is a portable library that
is normally tuned or rewritten by each computer manufacturer so that
codes using the BLAS can achieve portable performance. The model
implementation does not use block matrix techniques, but most tuned
versions do.
6. M. Mittal, A. Peleg, and U. Weiser, MMX Technology Architecture Overview,
Intel Technology Journal, Q3 1997, Intel Corporation. Available at
http://developer.intel.com/technology/itj/q31997/pdf/archite.pdf.
Discusses MMX, and how it can be used to speed up multimedia
applications. In particular, it presents a block matrix-vector multiply method,
for multiplying a 16-by-16 matrix times a 16-by-1 vector. A copy of this
paper was handed out in class.
7. R. van de Geijn, SUMMA: scalable universal matrix multiplication
algorithm. A parallel block matrix multiply algorithm for distributed
memory computers.
8. C. Van Loan, Introduction to Scientific Computing, Prentice Hall, 1997.
For the COT 4501 class, the essential starting point. Read Chapter 5
first, then this document.
