Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: Simulating simple parallel matrix algorithms in Matlab
Full Citation
Permanent Link:
 Material Information
Title: Simulating simple parallel matrix algorithms in Matlab
Series Title: Department of Computer and Information Science and Engineering Technical Report ; 99-002
Physical Description: Book
Language: English
Creator: Davis, Tim
Publisher: Department of Computer and Information Science and Engineering, University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: January 19, 1999
Copyright Date: 1999
 Record Information
Bibliographic ID: UF00095436
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.


This item has the following downloads:

1999370 ( PDF )

Full Text

Simulating simple parallel matrix algorithms

in Matlab

Tim Davis
Dept. of Computer & Info. Sci. & Eng.
Univ. of Florida

January 19, 1999

Cell matrices in :i 1.i.1i, can be used to simulate simple parallel
matrix algorithms for distributed memory parallel computers. Each
process in the k-by-k grid of processors owns a single entry in a k-
by-k cell array. One entry in cell array can be a scalar, a matrix,
another cell array, or any other data type supported by : i.i.,t l 1. --
sages are also passed using cell arrays. I !, i, paper describes : i. 1i.1i,'s
cell arrays and how they can be used to simulate the distribution of
matrices and the sending of messages on a parallel computer. A com-
plete 2-processor parallel matrix multiply algorithm is described. I In-
paper was written in order to introduce parallel computing concepts
into the course COT 4501 (miir. i,. .,I Analysis A Computational
Approach). This work was supported by the National Science
Foundation, grant number 9634470.

1 Introduction

Before you read this paper, refer to C11,11i r 5 (Matrix Computations) from
the textbook (Introduction to Scientific Computing by C!.i11 1 Van Loan,
1997). "'- read the paper Block matrix methods: Taking advantage of

/. 'I-~-, rformance computers, TR-98-024, by Tim Davis (October 1998). Ti,
notation used here is described in the preceding paper.

2 A model of parallel computing

One common method for organizing computations on a distributed-memory
parallel computer is to consider the parallel computer as a grid of processors.
Each processor has a row index and a column index. An example 2-by-2
processor grid is shown below.

P1I P1,2
P2,1 P2,2
This assumes what we start counting row and column indices with 1. It makes
for simpler index computations to start with zero, but '.i 1, l.ii is 1-based, so
we'll follow the normal '.1 111, i usage.
Each processor runs its own program. Each one has its own memory, and
cannot directly access the memory of other processors in the grid. Instead,
data is shared by the processors by sending and receiving messages. '.1 i .'ry
distribution and message handling can be simulated in -.i 1.111i using cell
arrays, described in the next section.
In general, a loop of the following form defines a single parallel step on a
k-by-k parallel computer, taken by each processor independently.

for Pi = 1:k
for Pj = 1:k
% Process does something

In a truly parallel step of computation, the result should not depend on the
order of execution. This can be tested in 1 111, 11 by running the iterations
in a random order. This can be done with the following code:

for pp = forall (k)
Pi = pp (1) ;
Pj = pp (2) ;
% Process does something

where the forall.m function is shown below. It is available at davis/COT4501/Block/forall.m

function proclist = forall (k)

% return a 2-by-k^2 array. Each column is a unique combination of two
% numbers from the range l:k. The columns of the array are randomly
% permuted. The purpose of this function is to simulate the parallel
% execution of a single step on all processors of a k-by-k grid.

Pi = repmat (l:k, k, 1) ;
Pi = Pi (:)' ;
Pj = repmat (1:k, 1, k) ;
proclist = [Pi ; Pj] ;
proclist = proclist (:, randperm (k^2))

If the action of the loop changes from iteration to iteration, then some-
thing is wrong in your algorithm.

3 Cell arrays in Matlab

Ti, textbook does not describe '.i .i 1 I's cell array feature. We will use this
to simulate distributed memory. A cell array is a '.i,, 1,l, array where each
element of the cell array is not just a single number, but a container that can
hold other il., 11II arrays. Here are some examples.

A = cell (2,3)
This creates a 2-by-3 cell array A with empty entries. Tit, result is

[1 [1] [1]
[1 [1] [1]

A 1-dimensional cell array is called a list.

A{1,2} = [1 2 ; 3 4]
This sets the 1,2 entry of A to be a 2-by-2 matrix. Note the use of
the curly brackets { and } instead of standard parentheses. This is an

example of '.i,,II ii, .iI, ii iiiil ii," for cell arrays. We obtain the
cell array

[1 [1 [1
S 3 4

Typing the statement above, without a semicolon at the end, prints a
summary of the cell array A. If you want to see the detailed contents
of each entry of A, type celldisp(A).

C = {[1 2 9], [3 4]; [5 6], [7]}
Tit, curly brackets are also used to construct cell lists and cell arrays.
Tit, statement creates a cell array that looks like this:

1 2 9 3 4
5 6 7

Note that the sizes of each cell entry do not have to be the same. Tli
do not even have to be of the same type (you can mix numeric and
character entries, for example).

B = {A C}
This statement creates a 1-by-2 cell array, each entry of which is also a
cell array (the A and C arrays just described). Tit, expression B{ 1} will
give you a cell array equal to A, and B{2}{1,2} will give you a 1-by-2
array with the contents [3 4].

A word of caution about cell arrays. If you're creating a new cell array
and you have a existing numeric array with the same name, '.1, iil will
complain. Clear the old array first, using the clear function. For example
clear A will clear the array A. clear, by itself, will clear the whole '.1I11i.1 I

4 Cell arrays for distributing matrices

Suppose we have a 20-by-20 matrix A, and we want to cut it into four
submatrices of equal size and put one submatrix on each processor of our 2-
by-2 processor grid. In a real distributed memory computer, each processor

would have its own 10-by-10 matrix. In '.1i11,i I'. we can simulate this with a
2-by-2 cell array. Ti, following code splits the matrix A and distributes it
to the processors as the cell array Agrid.

Agrid {1,1} = A (1:10, 1:10)
Agrid {1,2} = A (1:10, 11:20)
Agrid {2,1} = A (11:20, 1:10)
Agrid {2,2} = A (11:20, 11:20)

T1 11 I, if we come to a step in the algorithm in which each processor needs
to operate on its part of the A matrix, we can simulate it with a loop like

for Pi = 1:2
for Pj = 1:2
% Process gets its local copy of A
Alocal = Agrid {Pi,Pj} ;
% ... do something with Alocal here
% Process puts its local copy back into A
Agrid {Pi,Pj} = Alocal ;

In the above example, Pi is a processor's row index, and Pj is a processor's
column index. Since '.ii liiI uses 1-based indexing for its cell arrays (and
all arrays), we think of P1,i as the first processor in the upper left of the
processor grid (instead of Po,o, which is more typical of parallel distributed
memory computing).

5 Cell arrays for message passing

Suppose processor P2,2 wants to access the submatrix A(1 : 3, 1 : 3). Since
it doesn't own this piece of the matrix A, it must be sent from P1,1 to P2,2.
We can simulate this message passing by using cell arrays. Ti, 1, are many
kinds of message passing:

1. point-to-point: a single processor sends a message to a single processor

2. broadcast: a single processor sends a message to all processors, or to
all processors in its row of the processor grid, or to all processors in its
column, or some other subset.

3. all-to-all: all processors send messages to all other processors.

4. others... T1i, are many more types of communication patterns.

Suppose we're in the middle of a parallel loop in which each processor is
a message to one or more processors. We can use a cell array to hold each
processor's message, as in the following example.

for Pi = 1:2
for Pj = 1:2
% send a message from processor
Message {Pi,Pj} =

% synchronization point

for Pi = 1:2
for Pj = 1:2
% get a message sent by another processor
if (Pi == 2)
% in this case P <2,*> gets a message from P <1,*>
GotIt = Message {1,Pj} ;
% ... do something with the matrix "GotIt"

Ti1, example uses two loops. Each processor would do the first part, and
finish it, before any processor completes the second part. This is a simple
way of enforcing barrier synchronization. In a barrier, all processors must
reach the barrier point in the code before any can proceed past it. Its use
here simply ensures that all messages are sent before they are received.

Since cell arrays can accommodate entries of different sizes, each processor
in the example can send different size messages. We can find out how big it
was using the .ii, .11, size function.

6 An complete parallel matrix algorithm

Consider the parallel matrix algorithm on page 16 of TR-98-024, repeated
below. Ti, algorithm computes C = A B using the following block matrix
bT b

Ci C2 [ai a2 Can b21 22

b bT 2
n1 bn2

Process 1 owns the left half of each matrix, and
Here's the algorithm in parallel pseudo-code.

Process 1 does this:
C1 =0
for k= : n
if k < n/2
send ak to process 2
recv ak from process 2
C1 = C1 + akbkl
end for

process 2 owns the right half.

Process 2 does this:
C2 =0
for k= : n
if k < n/2
recv ak from process 1
send ak to process 1
C2 = C2 + akb
end for

Tii corresponding '.,, 11, I script is shown below. It is also available from my
web page at

% mult2: 2-processor parallel matrix multiply

% Computes C = A*B, where A and B are both n-by-n

n = size (A,1) ;
n2 = floor (n/2) ;

% distribute A, B, and C.
Agrid = cell (1,2) ;
Agrid {i} = A (:, l:n2)
Agrid {2} = A (:, n2+1:n)
Bgrid = cell (1,2) ;
Bgrid {i} = B (:, 1:n2)
Bgrid {2} = B (:, n2+1:n)
Cgrid = cell (1,2) ;
Cgrid 1)} = zeros (n, n2)
Cgrid {2} = zeros (n, n-n2) ;

for k = 1:n

%a parallel step, but only one processor does something
if (k <= n2)
% send A (:,k) from process 1 to process 2
Message = Agrid {i} (:,k)
% send A (:,k) from process 2 to process 1.
% IJote the local column index k-n2.
Message = Agrid {2} (:,k-n2)

% synchronization point here. Each processor gets the message

for p = 1:2 % a parallel loop
Cgrid {p} = Cgrid {p} + Message Bgrid {p} (k,:)

% get the final result
clear C
C = [Cgrid{l} Cgrid{2}] ;

7 For more information

For more information about cell arrays, see Chliipl r 13, i ,.!.,, res and Cell
Arrays, in the Using Matlab manual, or page 59 of the Getting Started with
Matlab manual. Both manuals are available on-line. Type helpdesk in
.1 ii!,l., or go to my course web page.

8 Exercises

1. Write '.1 11, 11 pseudo-parallel program to implement the parallel algo-
rithm on page 17 of the paper Block matrix methods: Taking advantage
of 1. '/l,- p rformance computers, TR-98-024. You may assume n is even.
Note that the algorithm uses the same matrix distribution as the mult2
algorithm in Section 6. C11 I 1: your result by computing norm (C-A*B).

2. Suppose we split the three n-by-n matrices A, B, and C into 2-by-2
block matrices (similar to what's done in Section 4). Each processor
holds only one n/2-by-n/2 sized block of each of the three matrices.
We can write

C11 C12 All A12 [1 B1 812
C21 C22 A21 A22 B21 B22

Processor Pj owns the submatrices Ai,, Bij, and Cij. Verify that the
following algorithm correctly computes C = AB. Draw a diagram of
how messages are passed between the processors, one for each of the
two phases of communication. Draw the diagram as a square grid of
processors, with arrows denoting the messages. Label the arrows with a
description of what data is being sent in the messages. Write a. I ii I 11
pseudo-parallel program to implement the algorithm. Note that not all
messages are immediately used in the next computation phase. C11 I 1:
your result by computing norm (C-A*B).

2-by-2 Multiplication Algorithm:
Perform the following steps in parallel:
Process P1,1 sends All to process Pi,2

Process PI,2
Process P2,1
Process P2,2

sends A12
sends A21
sends ~.

to process P1,1
to process P2,2
to process P2,1

Perform the following steps in parallel:
Process P1,1 computes C11 A11B11
Process PI,2 computes C12 = A11B12
Process Pa,1 computes C21 = A22B21
Process P2,2 computes C22 = A22B22

Perform the following steps
Process P1,1 sends B11
Process PI,2 sends B12
Process P2,1 sends B21
Process P2,2 sends B22

in parallel:
to process P2,1
to process P2,2
to process PI1,
to process Pi,2

Perform the following steps in parallel:
Process P1,1 computes C11 = C11 + A12B21
Process PI,2 computes C12 = C12 + A12B22
Process P2,1 computes C21 = C21 + A21B11
Process P2,2 computes C22 C22 + A21B12

9 What's next

A subsequent technical report will describe the block cyclic matrix distribu-
tion for distributed memory parallel computers, and how to simulate that
in :'.iI i lI. Also left for a follow-on paper is how to implement a matrix
algorithm truly in parallel using '.iPI, a message-passing standard.

University of Florida Home Page
© 2004 - 2010 University of Florida George A. Smathers Libraries.
All rights reserved.

Acceptable Use, Copyright, and Disclaimer Statement
Last updated October 10, 2010 - - mvs