
Simulating simple parallel matrix algorithms
in Matlab
Tim Davis
Dept. of Computer & Info. Sci. & Eng.
Univ. of Florida
TR99002
January 19, 1999
Abstract
Cell arrays in Matlab can be used to simulate simple parallel
matrix algorithms for distributed memory parallel computers. Each
processor in the k-by-k grid of processors owns a single entry in a
k-by-k cell array. One entry in a cell array can be a scalar, a matrix,
another cell array, or any other data type supported by Matlab.
Messages are also passed using cell arrays. This paper describes Matlab's
cell arrays and how they can be used to simulate the distribution of
matrices and the sending of messages on a parallel computer. A
complete 2-processor parallel matrix multiply algorithm is described. This
paper was written in order to introduce parallel computing concepts
into the course COT 4501 (Numerical Analysis: A Computational
Approach). This work was supported by the National Science
Foundation, grant number 9634470.
1 Introduction
Before you read this paper, refer to Chapter 5 (Matrix Computations) from
the textbook (Introduction to Scientific Computing by Charles Van Loan,
1997). Then read the paper Block matrix methods: Taking advantage of
high-performance computers, TR98024, by Tim Davis (October 1998). The
notation used here is described in the preceding paper.
2 A model of parallel computing
One common method for organizing computations on a distributed-memory
parallel computer is to consider the parallel computer as a grid of processors.
Each processor has a row index and a column index. An example 2-by-2
processor grid is shown below.

    P1,1  P1,2
    P2,1  P2,2

This assumes that we start counting row and column indices with 1. It makes
for simpler index computations to start with zero, but Matlab is 1-based, so
we'll follow the normal Matlab usage.
Each processor runs its own program. Each one has its own memory, and
cannot directly access the memory of other processors in the grid. Instead,
data is shared by the processors by sending and receiving messages. Memory
distribution and message handling can be simulated in Matlab using cell
arrays, described in the next section.
In general, a loop of the following form defines a single parallel step on a
k-by-k parallel computer, taken by each processor independently.
for Pi = 1:k
for Pj = 1:k
% Process does something
end
end
In a truly parallel step of computation, the result should not depend on the
order of execution. This can be tested in Matlab by running the iterations
in a random order, with the following code:
for pp = forall (k)
Pi = pp (1) ;
Pj = pp (2) ;
% Process does something
end
where the forall.m function is shown below. It is available at
http://www.cise.ufl.edu/~davis/COT4501/Block/forall.m

function proclist = forall (k)
% return a 2-by-k^2 array.  Each column is a unique combination of two
% numbers from the range 1:k.  The columns of the array are randomly
% permuted.  The purpose of this function is to simulate the parallel
% execution of a single step on all processors of a k-by-k grid.
Pi = repmat (1:k, k, 1) ;
Pi = Pi (:)' ;
Pj = repmat (1:k, 1, k) ;
proclist = [Pi ; Pj] ;
proclist = proclist (:, randperm (k^2)) ;
If the result of the loop changes depending on the order in which the
iterations are executed, then something is wrong in your algorithm.
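One simple way to apply this test (a sketch, not from the original paper) is to run the same step twice, once in natural order and once in random order via forall, and compare the results with isequal. The loop body here is just a stand-in for real work.

```matlab
% Run a parallel step in natural order, then in random order via forall,
% and check that the results agree.  ("Pi + Pj" is a placeholder for
% whatever the real step computes.)
k = 2 ;
R1 = cell (k,k) ;
for Pi = 1:k
    for Pj = 1:k
        R1 {Pi,Pj} = Pi + Pj ;
    end
end
R2 = cell (k,k) ;
for pp = forall (k)
    Pi = pp (1) ;
    Pj = pp (2) ;
    R2 {Pi,Pj} = Pi + Pj ;
end
isequal (R1, R2)    % 1 (true) if the step is order-independent
```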
3 Cell arrays in Matlab
The textbook does not describe Matlab's cell array feature. We will use this
to simulate distributed memory. A cell array is a Matlab array where each
element of the cell array is not just a single number, but a container that can
hold other Matlab arrays. Here are some examples.
A = cell (2,3)
This creates a 2-by-3 cell array A with empty entries. The result is

    []    []    []
    []    []    []

A 1-dimensional cell array is called a list.
A{1,2} = [1 2 ; 3 4]
This sets the 1,2 entry of A to be a 2-by-2 matrix. Note the use of
the curly brackets { and } instead of standard parentheses. This is an
example of "content indexing" for cell arrays. We obtain the
cell array

    []    [2x2 double]    []
    []    []              []

Typing the statement above, without a semicolon at the end, prints a
summary of the cell array A. If you want to see the detailed contents
of each entry of A, type celldisp(A).
C = {[1 2 9], [3 4]; [5 6], [7]}
The curly brackets are also used to construct cell lists and cell arrays.
The statement creates a cell array that looks like this:

    [1 2 9]    [3 4]
    [5 6]      [7]

Note that the sizes of each cell entry do not have to be the same. They
do not even have to be of the same type (you can mix numeric and
character entries, for example).
B = {A C}
This statement creates a 1-by-2 cell array, each entry of which is also a
cell array (the A and C arrays just described). The expression B{1} will
give you a cell array equal to A, and B{2}{1,2} will give you a 1-by-2
array with the contents [3 4].
A word of caution about cell arrays. If you're creating a new cell array
and you have an existing numeric array with the same name, Matlab will
complain. Clear the old array first, using the clear function. For example,
clear A will clear the array A. clear, by itself, will clear the whole Matlab
workspace.
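The examples above can be tried as one short session. This sketch uses only the functions already described in this section.

```matlab
clear A                          % make sure no old numeric A is in the way
A = cell (2,3) ;                 % 2-by-3 cell array, all entries empty
A {1,2} = [1 2 ; 3 4] ;          % content indexing with curly brackets
celldisp (A)                     % print the contents of every entry
C = {[1 2 9], [3 4]; [5 6], [7]} ;
B = {A C} ;                      % a cell array whose entries are cell arrays
B {2}{1,2}                       % prints the 1-by-2 array [3 4]
```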
4 Cell arrays for distributing matrices
Suppose we have a 20-by-20 matrix A, and we want to cut it into four
submatrices of equal size and put one submatrix on each processor of our
2-by-2 processor grid. In a real distributed memory computer, each processor
would have its own 10-by-10 matrix. In Matlab, we can simulate this with a
2-by-2 cell array. The following code splits the matrix A and distributes it
to the processors as the cell array Agrid.
Agrid {1,1} = A (1:10, 1:10)
Agrid {1,2} = A (1:10, 11:20)
Agrid {2,1} = A (11:20, 1:10)
Agrid {2,2} = A (11:20, 11:20)
Then, if we come to a step in the algorithm in which each processor needs
to operate on its part of the A matrix, we can simulate it with a loop like
this:
for Pi = 1:2
for Pj = 1:2
% Process gets its local copy of A
Alocal = Agrid {Pi,Pj} ;
% ... do something with Alocal here
% Process puts its local copy back into A
Agrid {Pi,Pj} = Alocal ;
end
end
In the above example, Pi is a processor's row index, and Pj is a processor's
column index. Since Matlab uses 1-based indexing for its cell arrays (and
all arrays), we think of P1,1 as the first processor in the upper left of the
processor grid (instead of P0,0, which is more typical of parallel distributed
memory computing).
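The 2-by-2 distribution above generalizes to any k-by-k grid. The following sketch is not part of the original report; the block size b assumes that k divides n evenly.

```matlab
k = 2 ;                          % grid dimension
n = size (A,1) ;                 % A is an n-by-n matrix
b = n / k ;                      % block size; assumes mod(n,k) == 0
Agrid = cell (k,k) ;
for Pi = 1:k
    for Pj = 1:k
        % processor P<Pi,Pj> owns the b-by-b block in position (Pi,Pj)
        Agrid {Pi,Pj} = A ((Pi-1)*b+1 : Pi*b, (Pj-1)*b+1 : Pj*b) ;
    end
end
```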
5 Cell arrays for message passing
Suppose processor P2,2 wants to access the submatrix A(1:3, 1:3). Since
it doesn't own this piece of the matrix A, it must be sent from P1,1 to P2,2.
We can simulate this message passing by using cell arrays. There are many
kinds of message passing:
1. point-to-point: a single processor sends a message to a single processor.
2. broadcast: a single processor sends a message to all processors, or to
all processors in its row of the processor grid, or to all processors in its
column, or some other subset.
3. all-to-all: all processors send messages to all other processors.
4. others... There are many more types of communication patterns.
Suppose we're in the middle of a parallel loop in which each processor is
sending a message to one or more processors. We can use a cell array to
hold each processor's message, as in the following example.

for Pi = 1:2
    for Pj = 1:2
        % processor P<Pi,Pj> sends a message (here, a corner of its block)
        Message {Pi,Pj} = Agrid {Pi,Pj} (1:3, 1:3) ;
    end
end
% synchronization point
for Pi = 1:2
    for Pj = 1:2
        % get a message sent by another processor
        if (Pi == 2)
            % in this case P<2,*> gets a message from P<1,*>
            GotIt = Message {1,Pj} ;
            % ... do something with the matrix "GotIt"
        end
    end
end
The example uses two loops. Each processor would do the first part, and
finish it, before any processor begins the second part. This is a simple
way of enforcing barrier synchronization. In a barrier, all processors must
reach the barrier point in the code before any can proceed past it. Its use
here simply ensures that all messages are sent before they are received.
Since cell arrays can accommodate entries of different sizes, each processor
in the example can send messages of different sizes. We can find out how
big a message was using the Matlab size function.
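A row broadcast can be simulated the same way. In this sketch (an illustration, not from the original paper), the first processor in each row of the grid sends its block of Agrid to every processor in that row; each recipient then queries the message's size.

```matlab
k = 2 ;
Message = cell (k,k) ;           % one "mailbox" per processor
for Pi = 1:k
    for Pj = 1:k
        % P<Pi,1> broadcasts its block to all of row Pi
        Message {Pi,Pj} = Agrid {Pi,1} ;
    end
end
% synchronization point: all sends complete before any receive
for pp = forall (k)
    Pi = pp (1) ;
    Pj = pp (2) ;
    GotIt = Message {Pi,Pj} ;
    [m, n] = size (GotIt) ;      % how big was the message?
end
```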
6 A complete parallel matrix algorithm
Consider the parallel matrix algorithm on page 16 of TR98024, repeated
below. The algorithm computes C = A*B using the following block matrix
multiply:

                                  [ b11' b12' ]
    [ C1 C2 ] = [ a1 a2 ... an ]  [ b21' b22' ]
                                  [    ...    ]
                                  [ bn1' bn2' ]

where ak is column k of A and bk' = [bk1' bk2'] is row k of B, so that
C1 = sum of ak*bk1' and C2 = sum of ak*bk2'.
Process 1 owns the left half of each matrix, and process 2 owns the right
half. Here's the algorithm in parallel pseudocode.

Process 1 does this:

    C1 = 0
    for k = 1:n
        if k <= n/2
            send ak to process 2
        else
            recv ak from process 2
        end if
        C1 = C1 + ak*bk1'
    end for

Process 2 does this:

    C2 = 0
    for k = 1:n
        if k <= n/2
            recv ak from process 1
        else
            send ak to process 1
        end if
        C2 = C2 + ak*bk2'
    end for
The corresponding Matlab script is shown below. It is also available from my
web page at http://www.cise.ufl.edu/~davis/COT4501/Block/mult2.m.

% mult2: 2-processor parallel matrix multiply
% Computes C = A*B, where A and B are both n-by-n
n = size (A,1) ;
n2 = floor (n/2) ;
% distribute A, B, and C.
Agrid = cell (1,2) ;
Agrid {1} = A (:, 1:n2) ;
Agrid {2} = A (:, n2+1:n) ;
Bgrid = cell (1,2) ;
Bgrid {1} = B (:, 1:n2) ;
Bgrid {2} = B (:, n2+1:n) ;
Cgrid = cell (1,2) ;
Cgrid {1} = zeros (n, n2) ;
Cgrid {2} = zeros (n, n-n2) ;
for k = 1:n
    % a parallel step, but only one processor does something
    if (k <= n2)
        % send A (:,k) from process 1 to process 2
        Message = Agrid {1} (:,k) ;
    else
        % send A (:,k) from process 2 to process 1.
        % Note the local column index k-n2.
        Message = Agrid {2} (:,k-n2) ;
    end
    % synchronization point here.  Each processor gets the message
    for p = 1:2 % a parallel loop
        Cgrid {p} = Cgrid {p} + Message * Bgrid {p} (k,:) ;
    end
end
% get the final result
clear C
C = [Cgrid{1} Cgrid{2}] ;
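The script can be exercised as follows (a hypothetical test, not part of mult2.m itself); if the script is correct, the residual norm(C-A*B) is on the order of roundoff.

```matlab
n = 20 ;
A = rand (n) ;        % random n-by-n test matrices
B = rand (n) ;
mult2                 % run the script in the current workspace
err = norm (C - A*B)  % small (near machine epsilon) if mult2 is correct
```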
For more information about cell arrays, see Chapter 13, Structures and Cell
Arrays, in the Using Matlab manual, or page 59 of the Getting Started with
Matlab manual. Both manuals are available online. Type helpdesk in
Matlab, or go to my course web page.
8 Exercises
1. Write a Matlab pseudo-parallel program to implement the parallel
algorithm on page 17 of the paper Block matrix methods: Taking advantage
of high-performance computers, TR98024. You may assume n is even.
Note that the algorithm uses the same matrix distribution as the mult2
algorithm in Section 6. Check your result by computing norm (C-A*B).
2. Suppose we split the three n-by-n matrices A, B, and C into 2-by-2
block matrices (similar to what's done in Section 4). Each processor
holds only one n/2-by-n/2 sized block of each of the three matrices.
We can write

    [ C11 C12 ]   [ A11 A12 ] [ B11 B12 ]
    [ C21 C22 ] = [ A21 A22 ] [ B21 B22 ]

Processor Pi,j owns the submatrices Aij, Bij, and Cij. Verify that the
following algorithm correctly computes C = A*B. Draw a diagram of
how messages are passed between the processors, one for each of the
two phases of communication. Draw the diagram as a square grid of
processors, with arrows denoting the messages. Label the arrows with a
description of what data is being sent in the messages. Write a Matlab
pseudo-parallel program to implement the algorithm. Note that not all
messages are immediately used in the next computation phase. Check
your result by computing norm (C-A*B).
2-by-2 Multiplication Algorithm:

Perform the following steps in parallel:
    Process P1,1 sends A11 to process P1,2
    Process P1,2 sends A12 to process P1,1
    Process P2,1 sends A21 to process P2,2
    Process P2,2 sends A22 to process P2,1

Perform the following steps in parallel:
    Process P1,1 computes C11 = A11*B11
    Process P1,2 computes C12 = A11*B12
    Process P2,1 computes C21 = A22*B21
    Process P2,2 computes C22 = A22*B22

Perform the following steps in parallel:
    Process P1,1 sends B11 to process P2,1
    Process P1,2 sends B12 to process P2,2
    Process P2,1 sends B21 to process P1,1
    Process P2,2 sends B22 to process P1,2

Perform the following steps in parallel:
    Process P1,1 computes C11 = C11 + A12*B21
    Process P1,2 computes C12 = C12 + A12*B22
    Process P2,1 computes C21 = C21 + A21*B11
    Process P2,2 computes C22 = C22 + A21*B12
9 What's next
A subsequent technical report will describe the block cyclic matrix distribu
tion for distributed memory parallel computers, and how to simulate that
in :'.iI i lI. Also left for a followon paper is how to implement a matrix
algorithm truly in parallel using '.iPI, a messagepassing standard.

