Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: Distributed shared memory using reflective memory : the LAM system
 Material Information
Physical Description: Book
Language: English
Creator: Denton, Roger
Johnson, Theodore
Affiliation: University of Florida
Publisher: Department of Computer and Information Science and Engineering, University of Florida
Place of Publication: Gainesville, Fla.
Copyright Date: 1996
 Record Information
Bibliographic ID: UF00095376
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.



Distributed Shared Memory Using Reflective Memory:
The LAM System

Roger Denton
Encore Computer Corporation
Plantation, Florida


In the past, acceptance of Distributed Shared Memory (DSM)
systems has been limited because of poor performance.
Performance has been limited by the availability of an
interconnect media whose properties are similar to those
required by a DSM system. In an effort to compensate for
inappropriate interconnect, significant research has been
devoted to maximizing the utilization of the available
interconnect.
A class of interconnect media is available that begins to
address the inhibitors to DSM performance. These
interconnect media have one essential property; the
interface is mapped into the address space of the process
participating in the DSM system.
This project focuses on the development of the LAM DSM.
LAM is a well-balanced hybrid DSM system implemented using
Reflective Memory interconnect hardware, with consistency
provided by software library routines. A distinctive
property of this system is that access time is directly
proportional to the number of bytes accessed and unrelated
to the size of the data structure being accessed. The
architecture, interface, and results are described.
Additionally, a case is made for the inherent advantage a
memory-mapped interconnect has over an interconnect accessed
through the operating system and network protocols.

1 Introduction

The problems being presented to computers are increasingly
complex (e.g., the Grand Challenge problems). While single
processor and multiprocessor machines can be expected to
continue making incremental performance gains, distributed
systems have the potential to introduce scalable increases
in available performance. One of the fundamental components
of a useful distributed computer system is a distributed
interprocess communications mechanism.

Theodore Johnson
Dept. of Computer and Information Science
University of Florida
Gainesville, Florida

To be useful, the communications mechanism must
have the following properties:


high throughput

low latency

intuitive programming interface.

The goal of this project was to develop such a
communications mechanism in the form of a Distributed
Shared Memory (DSM) system.
In general, the communications mechanism in distributed
systems is either a message passing (MP) system or a DSM
system. Numerous studies have been done advocating MP
systems as the mechanism of choice; a seemingly equal number
of studies have been done resulting in proof that DSM
systems are superior. By now it is clear that the choice of
DSM or MP in the design of a given system depends on a
number of factors peculiar to the system being designed.
In the past one primary factor in the choice between DSM
and MP models was the underlying communication hardware
[LeB92]. MP systems made few assumptions regarding the
support hardware. In contrast, the performance of a DSM
system is directly proportional to the performance of the
underlying interconnect hardware. This factor has led to
performance being listed as one of the primary advantages
of MP over DSM systems. This also led to DSM systems being
built on top of an MP interface/hardware.
Recently, the parameters associated with the premises
regarding performance have evolved. Commercially available
communications mechanisms routinely exceed 1Gb/s in speed
(SCI [Gus92], ATM) and in addition, some hardware supports
mapping the interconnect hardware into the address space of
the process participating in the DSM system (e.g.,
Reflective Memory(1)). These developments encourage a new
generation of higher performance DSM systems.
This project focuses on the development of a DSM system
meeting the previously mentioned criteria that utilizes a
high-speed, memory-mapped communications medium, Reflective
Memory (RM).

2 Related Work

Over the last decade extensive research has been done in
the area of DSM with a number of systems being built in
both the research and commercial communities. Below you
will find information related to prior DSM implementations
and observations on the trends found with attention drawn
to key conclusions.
Previous research in the area of DSM has focused in two
areas; the development of software systems utilizing
readily available hardware, and hardware systems attempting
to develop hardware specifically for solving a set of DSM
issues. A grey area, hybrid systems, utilizes both special
purpose hardware and software to attack DSM issues. An
extensive bibliography of DSM systems and related work can
be found in [?].
A number of significant software DSM systems have been
developed. Amber [?] took the approach of migrating the
process to the data as well as the data to the process in
order to reduce messaging. Clouds [Ram89] was one of the
first systems to treat the shared data as an object. Linda
[?] used compiler inserted DSM primitives to support
parallelizing the application. Midway [Ber93] introduced
entry consistency. Munin [Car91] implemented a variety of
coherence policies that were selected by the user and
enforced by Munin. Munin also developed facilities to
support recovery in the face of system failure. Software
DSM system research has tended to concentrate on the
development of increasingly relaxed consistency schemes and
attempts to increase effective throughput (e.g., multiple
writers).
Hardware DSM systems are fewer, as would be expected given
the additional resources required to develop a hardware
system. DASH [Len92] was one of the early hardware DSM
systems utilizing a distributed directory for cache
coherence. Alewife [?] supports coherent DSM and message
passing interfaces. SCI [Gus92] offers a memory mapped
interconnect interface and directory based cache coherence
protocols. Another project [?] is attempting to produce a
system capable of scaling from a few to several thousand
processors.

(1) Reflective Memory is a trademark of Encore Computer
Corporation.

Hybrid DSM systems are becoming more prevalent and are
appealing because of the opportunity of performing
operations that are common or time-critical in hardware and
implementing the remaining logic in software; as we've all
noticed, software is considerably easier to update than
hardware. FLASH [Kus94], PLUS [Bis90] and SHRIMP [Blu94]
implement DSM using a hybrid approach. LAM is also a hybrid
system. LAM differs from the previously mentioned hybrid
systems in that the RM hardware is considerably less
complex. The reduced complexity of the hardware is not
transferred directly to more complex software, indicating
an overall reduction in system complexity. Simpler hardware
does not significantly impact performance, as will be shown
later in this paper.
A common thread through many of the previous DSM research
projects was the search for improvements in the consistency
model. In [Li89] it is stated that one of the most
fundamental design decisions made in the development of a
DSM system is the choice of a consistency model. [?]
provides an intuitive definition of consistency models.

2.1 DSM Performance Improvement

In [?] a seemingly obvious conclusion is stated: DSM
systems perform better with hardware assist. In this
particular case they were working with the Cashmere system
[?] and network interfaces that allowed the shared memory
to be mapped into the address space of the cooperating
processors.
The authors of TreadMarks [Kel94] quantify the overhead
incurred with interfaces that are not mapped into the local
address space (e.g., ATM, Ethernet, ...). The TreadMarks
group found that in their case the majority of the overhead
is incurred in the operating system (in their case, Unix)
interprocess communication primitives and network
protocols. Additionally, they provide measurements
comparing the performance of TreadMarks using a 10 Mbit/s
Ethernet and a 100 Mbit/s ATM LAN. Not surprisingly, a 10:1
performance increase was not seen; the average was closer
to 2:1. Given that the operating system overhead would
remain constant in the two cases, the implication is that
the network protocols (UDP over Ethernet and AAL3/4 over
the ATM LAN) are the primary bottleneck preventing the
system performance from scaling with the performance of the
underlying communication medium. [Ber93] corroborates these
ratios with similar work done under their Midway DSM
system.

Clearly, significant improvement is possible in the area of
scalability and effective use of available bandwidth. Since
most (up to 78 percent [Blu94]) of the lost bandwidth is
associated with the interface and associated protocols, an
efficient way to regain the lost bandwidth is through the
design of a memory mapped interface. The RM/LAM combination
provides an efficient method for taking advantage of this
potential for performance improvement.
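
The 10:1 versus 2:1 observation above can be read through a
simple two-part model. The sketch below is only an
illustration under an assumption that is not taken from the
cited papers: the measured time is split into a share that
shrinks in proportion to link speed and a fixed share
(operating system and protocol cost) that does not. The
function names are hypothetical.

```python
def fixed_fraction(link_speedup, observed_speedup):
    """Share of the original run time that does not shrink with a faster link.

    Assumes time = fixed + scalable, where only the scalable part is divided
    by link_speedup. Solves f + (1 - f) / link_speedup = 1 / observed_speedup.
    """
    inv_link = 1.0 / link_speedup
    return (1.0 / observed_speedup - inv_link) / (1.0 - inv_link)


def predicted_speedup(fixed, link_speedup):
    """Overall speedup when only the non-fixed share benefits from the link."""
    return 1.0 / (fixed + (1.0 - fixed) / link_speedup)


# Ratios quoted in the text: a 10x faster link gave roughly a 2x gain.
f = fixed_fraction(10.0, 2.0)
ceiling = 1.0 / f  # limit of predicted_speedup as link_speedup grows

print(f"fixed share of original time: {f:.3f}")
print(f"speedup with a 100x link:     {predicted_speedup(f, 100.0):.2f}")
print(f"upper bound on speedup:       {ceiling:.2f}")
```

Under this assumed model, a fixed share of about 44 percent
is enough to hold a tenfold faster link to a twofold gain,
and no link speed would push the gain past 2.25. This is
consistent with the argument that the interface and
protocol path, rather than raw bandwidth, is what needs to
be removed.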

2.2 DSM Trends

While reviewing previously cited DSM work a few trends were
consistently noted.

In-kernel implementations or fast user-level
implementations are necessary for an efficient DSM.

Process synchronization must be kept to a minimum.
Synchronization calls are expensive and will likely have
secondary (local cache, TLB management, interrupt overhead)
effects on the processors preempted to deal with
synchronization. This requirement to minimize
synchronization operations should influence the design of
the DSM system as well as the ultimate application.

Early design decisions dramatically impact the performance
of a DSM implementation. Decisions related to hardware,
software, hybrid implementation, coherence models, and
granularity of the coherence model are critical.

DSM implementations utilizing standard network interfaces
and protocols will be limited by the performance of the
protocol and interface.

Hardware DSM implementations are faster than software, and
hybrids can be, on average, almost as fast as a pure
hardware implementation with significantly less complexity.

3 LAM Environment

Development and measurement of LAM was accomplished at
Encore Computer Corporation using a four node Encore
computer system. This multi-node system was configured as
shown in Figure 1.

Figure 1: System configuration

Each node contained 4 processors arranged in a symmetric,
shared memory multiprocessor architecture. Each node is
controlled by an autonomous operating system (Unix).

3.1 RM Hardware Overview

RM is a memory mapped interconnect media. Be-
low are a few salient RM properties.

145MB/s interconnect network (the RM network)

8 nodes per network (multiple networks per node are
possible)

16 RM networks may be interconnected

RM network distances supported:

80 feet with copper cables
240 feet with coaxial cable
15000 feet with fiber optic cable

4KB RM window (page) size

64bit bus

word reflection or block reflection

processors not interrupted by RM traffic

Only memory writes are transmitted on the RM network.
Writes are causally ordered on the RM network. Reads are
not transmitted on the RM network; they are satisfied from
the memory on the RM board. Reads and writes from RM space
are not cached. Additionally, RM memory is not strictly
consistent across nodes.
Each RM board has three ports to memory, one for the RM
network, one for the I/O bus and one for the processor bus.
Arbitration logic and speed matching buffers are used to
manage the traffic between the busses. The RM board
supports block transfer operations so that an I/O
controller (e.g., SCSI) may transfer data directly to RM.
Processor test and set instructions are not supported
across RM since there is no inherent provision for freezing
bus access and synchronizing all nodes on the RM network.
Control registers are primarily initialized at system boot
time. A device driver is installed by the operating system
at system boot time. This driver is primarily responsible
for error handling so that processes utilizing RM do not
need to be concerned with handling RM (board and network)
errors.
RM windows provide a method to control mapping of the RM
space; they have no other effect on the RM space. Window
control allows determination of whether or not a given
window is reflected and also allows mapping transmit and
receive locations to different addresses. This ability to
map transmit and receive to different addresses is a
fundamental property used in the implementation of RM
synchronization primitives. Additionally, a window may be
mapped for peer to peer, multicast or broadcast
transmission.

3.2 RM Software Environment

Operating system support for mapping RM pages into the
address space of the process using LAM is required. This
support is provided by versions of standard Unix system
calls (shmget and shmat) that have been modified to support
RM. Once RM is mapped into the process address space it is
manipulated with memory access (e.g., load and store)
instructions. RM pages that are mapped into the process
virtual address space are protected (but not paged or
replaced) by the same virtual memory support that protects
local memory.
RM has been the interconnect media for a number of systems
at Encore and recently, DEC [Gil96]. Projects utilizing RM
have included a disk subsystem cache, a database
distributed lock manager [?] and a distributed,
fault-tolerant file system [?]. Obviously, each of these
requires coherent access to control and data space. In each
case the problem was solved with an implementation designed
specifically for the problem at hand. To date, there has
been no general service implemented to support coherent
shared memory across RM.
LAM was designed to fill this functional gap in RM.
4 LAM Architecture

LAM is an implementation of coherent distributed
shared memory. LAM was designed to be scalable
and provide an intuitive interface to the programmer.

Essentially, LAM must manage allocation of RM space and
synchronize access to this shared space.
In this section the architectural properties of LAM
are described.

4.1 Architectural Properties

In terms of the DSM classification system provided in [?]
LAM has the following architectural properties:

DSM implementation: hybrid with library routines. LAM uses
RM hardware as the interconnect media. The library routines
are linked into the application and allow the programmer to
allocate and synchronize shared memory from RM space.

Shared data organization: data structure. LAM allows the
programmer to allocate and share data structures of any
size that fits within the available RM space.

Granularity of coherence unit: data structure. In fact, LAM
has no knowledge of the content or structure of the data;
it only knows the size of the data structure; the structure
is imposed by the application. Each allocated data
structure is guaranteed to be consistent using LAM
primitives. Maximum parallelism can be obtained by defining
the size and content of the data structures to maximize
data availability and minimize synchronization events.

DSM algorithm: multiple reader, multiple writer. Any
cooperating process on any participating node may initiate
a read or write operation at any time.

Responsibility for DSM management: distributed. Each node
has access to all control information.

Consistency model: entry. LAM uses acquire and release
primitives to achieve memory consistency. The shared data
structure is guaranteed to be coherent and available for
exclusive update after the acquire call to LAM has
completed.

Coherence policy: write-update. RM is responsible for this
portion of coherence management. As RM locations are
updated locally they are staged for transmission (assuming
they are shared locations) on the RM network to update the
local RM space of the participating nodes.

4.2 Consistency Implementation

Fundamentally, although the RM network is causally ordered,
strict consistency of memory at each processor is not
guaranteed because of asynchronous, non-instantaneous
communication. In LAM, mutual exclusion and entry
consistency are implemented utilizing properties of the RM
hardware.
A distributed lock is associated with each LAM structure.
By convention, the process owning the lock owns the LAM
structure.
As previously mentioned, RM allows mapping transmit (write)
and receive (read) operations to different addresses.
Additionally, the RM network arbitration logic guarantees
that RM network traffic is causally ordered with respect to
the nodes attached. This allows for an efficient
implementation of distributed locks as described below.
LAM allocates an area of RM for locks. This area is
configured to have the receive/transmit locations at
different addresses. A node bids for an apparently
available lock by setting a word associated with its node
id in the lock structure (a processor test-and-set
instruction first acquires a node local guard lock to
achieve intranodal mutual exclusion). When this
transmission appears in the receive window the lock word is
checked for competing bids. RM ordering ensures that all
bids that were present at the time this process made a bid
are visible at the time this check is made. If no other
nodes have bid for the lock then the bidding processor has
acquired the lock and may enter the mutual exclusion
region. If other nodes have bid for the lock then the
competing nodes enter a prioritized retry algorithm (each
node clears its bid and retries in a few microseconds; the
number of microseconds is based on the node identification
number and in a four node system will be less than 5
microseconds) until one node successfully acquires the
lock. RM locks are released by clearing the RM bid word and
then the processor local guard lock.
LAM uses such locks (transparent to the program-
mer) as part of the acquire and release code. Since
writes are causally ordered on the RM network, if a
node is able to obtain a lock then it must also be true
that writes to the shared data structure preceding the
acquisition of the lock must have been stored in the
local copy of the shared data structure. Again, due
to write ordering on the RM network, once the lock is
acquired this also guarantees that the associated LAM
shared data structure is globally consistent. In [Bir87]
a similar scheme using causal updates to guarantee
consistency is discussed.
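
The bidding scheme described in this section can be
illustrated with a toy model. This is a minimal sketch and
not the LAM implementation: it assumes the RM network
behaves as a single ordered log of writes that every node
applies in the same order, it omits the node local guard
lock, and all class and function names (OrderedNetwork,
Node, retry_delay_usec) are hypothetical.

```python
class OrderedNetwork:
    """Stand-in for the RM network: one global order of reflected writes."""

    def __init__(self):
        self.log = []  # (node_id, value) writes to the bid words, in order

    def transmit(self, node_id, value):
        self.log.append((node_id, value))
        return len(self.log)  # log position just past this write


class Node:
    def __init__(self, node_id, n_nodes, net):
        self.node_id = node_id
        self.net = net
        self.bids = [0] * n_nodes  # receive-side copy of the lock's bid words
        self.applied = 0           # number of log entries reflected locally

    def receive_up_to(self, position):
        """Apply reflected writes, in network order, up to a log position."""
        while self.applied < position:
            writer, value = self.net.log[self.applied]
            self.bids[writer] = value
            self.applied += 1

    def bid(self):
        """Transmit a bid; return the position where it is reflected back."""
        return self.net.transmit(self.node_id, 1)

    def check(self, position):
        """Once our own bid is visible locally, look for competing bids."""
        self.receive_up_to(position)
        rivals = [i for i, b in enumerate(self.bids) if b and i != self.node_id]
        if rivals:
            self.net.transmit(self.node_id, 0)  # clear bid; caller retries
            return False
        return True

    def release(self):
        self.net.transmit(self.node_id, 0)


def retry_delay_usec(node_id):
    """Hypothetical priority rule: lower node ids retry sooner."""
    return 1 + node_id


# Two nodes of a four node system bid at nearly the same time.
net = OrderedNetwork()
a, b = Node(0, 4, net), Node(1, 4, net)
pos_a = a.bid()
pos_b = b.bid()
got_a = a.check(pos_a)        # a sees only its own bid
got_b = b.check(pos_b)        # b also sees a's earlier bid and backs off
a.release()
got_b_retry = b.check(b.bid())
print(got_a, got_b, got_b_retry)
```

Because every node applies the same ordered log, a node
that sees its own bid reflected has necessarily seen every
bid ordered before it, which is the property the text
relies on. In this sketch the earlier bidder wins and the
later one retries; in a real system both bidders may see
each other and both back off, which is where the
node-id-based retry delay matters.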

4.3 Internal Structure

LAM manages three chunks of RM space; data, control, and
lock space.
Lock pages contain the node local and distributed locks
used to maintain shared data structure consistency. One
lock is required for each LAM shared data structure.
Control pages contain information related to allocating,
locating and updating LAM shared data structures. The
process allocating a shared structure provides an integer
tag uniquely identifying the shared data structure. This
tag is used by other processes to attach to the shared
structure. One control entry is required for each LAM
shared data structure.
Data pages contain the shared data. LAM has no knowledge
regarding the structure or content of the shared data. LAM
structures may cross page (and window) boundaries and no
particular byte alignment is required. LAM simply manages
the allocation and deallocation of the data pages.
Since the lock and control pages must be shared across
nodes they reside in RM space. RM spinlocks protect these
LAM control areas to maintain consistency.

4.4 External Interfaces

One of the design goals was to produce an intuitive,
programmer friendly interface. LAM has an interface that is
certainly simple and hopefully intuitive. Each entry point
is listed below:

lam_init() initialize the LAM system. Called once by each
participating process.

lam_alloc() allocate a LAM shared data structure.

lam_acquire() obtain exclusive access to a LAM shared data
structure.

lam_release() relinquish exclusive access to a LAM shared
data structure.

lam_free() remove a LAM shared data structure from the
global pool.

lam_retire() withdraw this process from the LAM system.

5 Performance Model

The most common sequence of operations within a program
using LAM will be a call to lam_acquire(), followed by some
sequence of data access, followed by a call to
lam_release(). This sequence may be described with the
following relationship.

$$T_{access} = T_{LAM} + T_{RM} \times Bytes_{accessed}$$
$$T_{LAM} = T_{lam\_acquire()} + T_{lam\_release()}$$

Where T_access is the total time required to acquire,
manipulate, and release the data. T_LAM is the time
associated with LAM overhead. T_RM is the time required to
read or write a RM location (see Table 1).
T_lam_acquire() and T_lam_release() can be further
dissected into the relationships given below.

$$T_{lam\_acquire()} = T_{local\_acquire} + D_{RM} \times (2\,T_{RMread} + T_{RMwrite})$$
$$T_{lam\_release()} = T_{local\_release} + D_{RM} \times T_{RMwrite}$$

Where T_local_acquire and T_local_release are the times
required to acquire or release a node local lock. T_RMread
and T_RMwrite are the times required to read or write a RM
location. D_RM is the RM coefficient of delay as described
later in this section.
Note that an equivalent set of relationships for a DSM
utilizing a non-memory mapped interconnect would also
contain terms for operating system service calls and
protocol execution. Each of these terms would be relatively
expensive in terms of time. LAM makes no system calls after
system initialization and the protocol is limited to
library calls to lam_acquire() and lam_release(). LAM/RM
impose no software overhead for propagating writes to other
nodes.
Also, note these relationships imply that the time
required to access one byte in a small data structure
will be equivalent to the time required to access one
byte in a much larger data structure.
The time required to complete each RM operation is
primarily dependent on the available bandwidth of the RM
network. As the load through the RM network increases
T_access will increase proportionately. With respect to
this performance model, the RM delay related to increased
load (traffic) on the RM network may be expressed as a
delay coefficient D_RM. The value of D_RM will range from 1
(when RM load is at its minimum) to a larger value as load
increases.
In an attempt to measure D_RM, a program was written to
provide a configurable background load on the RM network.
Measurements of throughput and LAM overhead were taken with
various background loads.
Three of the four nodes in the system (the fourth was used
to run the timing programs) were dedicated to flood the RM
network, each node running 20 copies (60 copies total) of
the load generation program. The 12 processors in the three
nodes were 100 percent utilized during the run. This load
did not significantly affect (less than 10 percent) the RM
network latency in any of the measurements. This implies
that in this 4 node configuration the RM coefficient of
delay (D_RM) was equal to 1.
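
The relationships of this section can be transcribed
directly into a few functions, which makes it easy to see
how the RM coefficient of delay enters the acquire and
release costs. This is a sketch only: the RM read and write
times are the per-word values from Table 1, while the node
local lock times are placeholder values chosen for
illustration and are not measurements from this paper.

```python
def t_lam_acquire(t_local_acquire, d_rm, t_rm_read, t_rm_write):
    """Acquire cost: local guard lock plus two RM reads and one RM write."""
    return t_local_acquire + d_rm * (2 * t_rm_read + t_rm_write)


def t_lam_release(t_local_release, d_rm, t_rm_write):
    """Release cost: local guard lock plus one RM write."""
    return t_local_release + d_rm * t_rm_write


def t_access(t_lam, t_rm, bytes_accessed):
    """Total time to acquire, touch bytes_accessed bytes, and release."""
    return t_lam + t_rm * bytes_accessed


T_RM_READ = 0.37    # usec, from Table 1
T_RM_WRITE = 0.20   # usec, from Table 1
T_LOCAL = 1.0       # usec, hypothetical node local lock cost


def t_lam(d_rm):
    """LAM overhead for one acquire/release pair at a given D_RM."""
    return (t_lam_acquire(T_LOCAL, d_rm, T_RM_READ, T_RM_WRITE)
            + t_lam_release(T_LOCAL, d_rm, T_RM_WRITE))


for d_rm in (1.0, 2.0, 4.0):
    print(f"D_RM={d_rm:.0f}: overhead term = {t_lam(d_rm):.2f} usec")
```

Note that the size of the shared data structure does not
appear as a parameter anywhere in these functions; only the
number of bytes actually accessed does, which is the
property the model is meant to express.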

6 Results

Measurements of LAM fundamental quantities (latency and
overhead) are provided followed by results related to
scaling and speedup.

6.1 Latency

A relatively simple program was developed to measure the
rate at which RM reads and writes were processed by the
system. Recall that RM reads and writes are not cached by
the processor. This was primarily a measurement of the RM
system since LAM simply provided the memory allocator.
These measurements were taken using one processor on one
node. The memory accesses were timed by sequentially
accessing each word in a contiguous region of 64K bytes.
Table 1 shows the amount of time required to perform the
indicated operation on one word (32 bits, 4 bytes).

Table 1: Memory latency measurements

Operation Type    Time (usec)
RM writes         0.20
RM reads          0.37

6.2 Overhead

In this section timings are provided for the fundamental
LAM primitives (T_LAM). Also, a significant optimization to
the LAM primitives is discussed.
A program was written to time the relative cost of LAM
operations as the size of the LAM shared data structure
varied. This was accomplished by timing multiple calls to
lam_acquire() and lam_release() then reporting the average
cost. Table 2 provides timing information on the original
as well as the optimized version of the LAM primitives
(T_LAM).

Table 2: LAM primitive timings

         Original time    Optimized time
T_LAM    43.0 usec        14.7 usec

The initial timings of LAM primitives indicated significant
opportunity for optimization. As previously described,
lam_acquire() is essentially a mutual exclusion entry point
and lam_release() is a mutual exclusion release point;
additionally, each contains a component of time related to
the number of instructions required to execute the LAM code
and the associated mutual exclusion entry or exit. With an
understanding of the underlying media and the algorithms
employed there was reason to believe that the acquire and
release times could be significantly decreased. The source
code was streamlined and flattened and the performance
improvements in Table 2 were measured. Please note that
unless specified otherwise all results in this document
were recorded using the optimized primitives.

The value of this optimization is apparent in Figure 2;
access to less than 1024 bytes of a LAM structure requires
significantly less time. Figure 2 shows the amount of time
required to acquire an uncontested LAM shared data
structure, access a given number of bytes in the structure,
and then release the structure. Results are shown for the
original unoptimized primitives, the optimized primitives
and predicted results derived from a linear regression
model. The linear regression model was calculated to be:

$$T_{access} = 14.7\,\mathrm{usec} + 0.20\,\mathrm{usec} \times Bytes_{written}$$

Note that the predicted results are well aligned with the
measured values for the optimized primitives.

Figure 2: LAM throughput measurements (access time versus
bytes accessed (written); series: Optimized, Original,
Model)

It is important to remember that access time is di-
rectly proportional to the number of bytes accessed
and is unrelated to the size of the data structure be-
ing accessed.

6.2.1 Overhead Relative to Data Accesses

LAM overhead as a percentage of total data access time is
inversely proportional to the amount of data accessed in
the shared structure. Overhead was calculated using the
relationship:

$$Overhead_{percent} = \frac{T_{LAM}}{T_{access}} \times 100$$

Where T_LAM is the time required to execute lam_acquire()
(T_acquire) plus the time required to execute lam_release()
(T_release). T_LAM equals 14.7usec. T_access was defined in
the previous section. The results of this calculation for a
variety of access ranges are given in Figure 3.

Figure 3: LAM overhead as a function of shared data access
range (overhead in percent versus bytes accessed (written))
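
The overhead relationship can be evaluated numerically from
the two coefficients reported in this section (a fixed cost
of 14.7 usec for T_LAM and 0.20 usec per byte written). The
sketch below is a worked illustration of that arithmetic,
not a reproduction of the measured curve; the function
names and the chosen byte counts are assumptions.

```python
T_LAM_USEC = 14.7        # optimized acquire + release cost reported above
T_PER_BYTE_USEC = 0.20   # slope of the regression model reported above


def t_access_usec(bytes_written):
    """Predicted time to acquire, write bytes_written bytes, and release."""
    return T_LAM_USEC + T_PER_BYTE_USEC * bytes_written


def overhead_percent(bytes_written):
    """LAM overhead as a percentage of the predicted total access time."""
    return 100.0 * T_LAM_USEC / t_access_usec(bytes_written)


for n in (16, 64, 256, 1024, 4096):
    print(f"{n:5d} bytes: {t_access_usec(n):8.1f} usec total, "
          f"{overhead_percent(n):5.1f}% overhead")
```

With these coefficients the overhead equals half of the
total access time at 14.7 / 0.20 = 73.5 bytes and keeps
falling as more of the structure is touched per
acquire/release pair.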

6.3 Scaling

Scaling is the ability of a system or resource to accept
and complete incremental work. (An alternative definition
of scaling has been given as the ability to produce greater
precision results in the same period of time given
additional resources.) Additional work should be completed
in less time until the system or resource is depleted of
capacity.
Matrix multiplication is a fairly common benchmark used to
measure parallel systems in general and DSM systems in
particular. With this in mind, the code for this matrix
multiplication program was derived from that used in the
CRL project [?]. Relatively minor modifications were
required to replace the CRL interface with the appropriate
LAM calls. The matrix sizes were increased to 512x512 in
order to increase the runtime. The matrix multiplication
program was measured in three environments; running on the
local system in one process on one node, running in 1-32
processes across 4 nodes (16 processors) using LAM
services, and running in 1-32 processes across 4 nodes
using LAM services with the source matrices marked as read
only.
Figure 4 is a graph of the multiprocess scaling properties
observed running the matrix multiplication program with
LAM.

Figure 4: Multiprocess scaling results (series: local
memory, remote memory, remote memory with read only)

6.3.1 Speedup

Speedup is described by the relationship:

$$Speedup = \frac{T_{ref}}{T_{n}}$$

Where T_ref is the reference time and T_n is the elapsed
time measured with n processes. In this case, T_ref is
either the time required to run the application in local
memory or the time required to run the application in LAM
memory with one process.
Speedup measurements show that for this matrix
multiplication application two or more processes will
provide lower overall elapsed time (compared to the local
case) and that scaling continued to improve across all 32
processes. Figure 5 is a graph of the speedup obtained
using LAM compared to local memory and LAM memory.

Figure 5: Speedup

7 Conclusion

In this project a high-performance, hybrid DSM system was
implemented and measured.
The measured results show that this hybrid approach results
in good scaling, high throughput and low latency.
Additionally, the overhead imposed by LAM is relatively
small (see Figure 2).

7.1 Potential Areas for Future Work

The current implementation has a number of shortcomings:
There is one LAM memory pool. Multiple pools
would be useful for prioritized partitioning of
LAM shared space.

The shared data structure identification tags
must be known by each participating process. A
distributed data structure identification service
could be implemented to remove this limitation.

In the current implementation it would be interesting to
investigate the potential for more application supplied
hints (e.g., a conditional acquire based on whether or not
a shared data structure is available). The most direct way
to accomplish this would be to port additional DSM
applications to LAM with an eye out for any optimizations
that would be generally useful.
A strictly consistent, transparent (to the application
programmer) DSM system using RM could be implemented. This
system could be integrated with the virtual memory support
of the operating system so that logical acquire and release
operations could be done each time the page was accessed.
Notification of the page being accessed could be done
through manipulation of the processor page table entry
control bits. False sharing [Bol93] could be a problem with
this implementation given that the sharing would occur at
the page level.
Investigation into a protocol allowing multiple writers (as
in Munin [Car91] and TreadMarks [Kel94]) should increase
parallelism and improve overall performance.
Implementation of increasingly relaxed consistency
protocols would improve performance. For example,
a lazy entry protocol would take advantage of tempo-
rally local references, thereby reducing the incidence
of expensive inter-node synchronization events and re-
ducing RM network load.
Recovery in the presence of node failures would be
relatively simple to add to LAM since each node could
have a copy of all shared data necessary to synchronize
a recovering node with its pre-failure state.

7.2 The Importance of Low Overhead to
DSM Performance and Acceptance

Low overhead is directly related to increased performance
and it is subtly related to the acceptance of the DSM
system. Applications written for distributed systems are
generally constructed to avoid distributed communications
(as much as possible) and the associated drop in
performance. If the overhead associated with distributed
communications can be driven towards zero, then this
burden will be removed from the programmer; distributed
applications will be easier to produce and are therefore
more likely to be produced.
A DSM system imposing a significant additional burden on
the programmer is unlikely to be accepted for reasons
related to economy: increased software complexity
translates directly to extended development cycles and an
increased incidence of programming errors, both of which
are costly. The LAM interface is simple, intuitive and
provides a relatively straightforward implementation
platform for developing or porting applications.
As previously mentioned, in [Kel94] and [Ber93] an effort
was made to measure the overhead associated with DSM
operations using interfaces requiring a network protocol
and operating system intervention in DSM operations. In
TreadMarks [Kel94] it was shown that from 3 to 17 percent
of the total execution time was spent on communication.
In Midway [Ber93] it was found that communication time
varied between 6 and 26 percent of the total application
time depending on the application and the interconnect
media. Reports from the SHRIMP project [Blu95] show that
"virtual memory-mapped communication can reduce the send
latency overhead by as much as 78 percent." In addition
to extending processing times, this overhead also serves
to effectively limit the useful bandwidth of a
communication medium. [Blu95] noted that a network
interface was only able to drive 20 MB/s of its rated
bandwidth; the remainder was lost to overhead associated
with the protocol and data path to the interface.
An interface that is mapped into the address space of the
process avoids virtually all of the overhead associated
with the operating system. An interconnect medium that
appears to be memory (operated upon with processor load
and store instructions) minimizes the protocol overhead.
RM has both properties; it is mapped into the process
address space and it is accessed using memory access
instructions.
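The difference between the two access paths can be sketched as follows. This is an illustration of the interfaces only, under stated assumptions: an anonymous memory mapping stands in for the RM window, a pipe stands in for a network interface reached through the operating system, and in Python both paths pass through the interpreter, so the sketch says nothing about relative timing.

```python
import mmap
import os
import struct

# Path 1: memory-mapped window. An anonymous mapping stands in for the RM
# region; on real hardware the store would also be reflected to other nodes.
window = mmap.mmap(-1, mmap.PAGESIZE)
struct.pack_into("<I", window, 0, 42)                  # a store into mapped memory
(mapped_value,) = struct.unpack_from("<I", window, 0)  # a load from mapped memory

# Path 2: operating system interface. Every transfer is an explicit system
# call; a pipe stands in for a network device and its protocol stack.
read_fd, write_fd = os.pipe()
os.write(write_fd, struct.pack("<I", 42))              # system call on the send side
(piped_value,) = struct.unpack("<I", os.read(read_fd, 4))  # and on the receive side

os.close(read_fd)
os.close(write_fd)
window.close()
```

Both paths deliver the same value; the point is that the first needs no kernel entry or protocol processing per access once the mapping exists.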
In LAM there is little overhead associated with inter-node
DSM communication as long as there is bandwidth available
on the RM bus. Assuming similar inherent execution times
for the DSM primitives, this gives a system with a memory
mapped DSM interconnect a significant performance
advantage (up to 78 percent [Blu95]) over interconnect
devices accessed through the operating system and
protocols. This advantage is a result of the lower
overhead associated with inter-node communication: load
and store instructions will always be much faster than
system calls followed by protocol processing.
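A rough estimate shows how the figures quoted above combine. The sketch assumes, optimistically, that all of the measured communication time is overhead subject to the 78 percent reduction and that computation time is unchanged; under those assumptions the relative execution time is 1 - c * r, where c is the communication fraction and r the overhead reduction. The function name and the pairing of the figures are illustrative, not measurements from LAM.

```python
def relative_time(comm_fraction, overhead_reduction):
    """Execution time relative to baseline when a fraction
    `overhead_reduction` of the communication time is removed
    and computation time is unchanged."""
    return 1.0 - comm_fraction * overhead_reduction


REDUCTION = 0.78  # upper bound quoted from the SHRIMP measurements

# Communication fractions reported for TreadMarks and Midway.
cases = [
    ("TreadMarks, low", 0.03),
    ("TreadMarks, high", 0.17),
    ("Midway, low", 0.06),
    ("Midway, high", 0.26),
]

for label, fraction in cases:
    ratio = relative_time(fraction, REDUCTION)
    print(f"{label}: {ratio:.3f} of baseline execution time")
```

Read this way, the 78 percent figure bounds the saving on the communication component; the effect on total execution time depends on how communication-bound the application is.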


References

[Aga95] A. Agarwal, R. Bianchini, D. Chaiken, K. Johnson,
D. Kranz, J. Kubiatowicz, B. Lim, K. Mackenzie, D. Yeung,
"The MIT Alewife Machine: Architecture and Performance,"
Proceedings of the 22nd Annual International Symposium on
Computer Architecture, June 1995.
[Ahu86] S. Ahuja, N. Carriero, D. Gelernter, "Linda and
Friends," IEEE Computer, vol. 19, no. 8, August 1986,
pp 26-34.
[Ald95] M. Aldred, I. Gertner, S. McKellar, "A Distributed
Lock Manager on Fault Tolerant MPP," Proceedings of the
28th Annual Hawaii International Conference on System
Sciences, January 1995, pp 134-136.
[Ber93] B.N. Bershad, M.J. Zekauskas, W.A. Sawdon, "The
Midway Distributed Shared Memory System," Digest of Papers
COMPCON, February 1993, pp 528-537.
[Bir87] K.P. Birman, T.A. Joseph, "Reliable Communication
in the Presence of Failures," ACM Transactions on Computer
Systems, vol. 5, no. 1, February 1987, pp 47-76.
[Bis90] R. Bisiani, M. Ravishankar, "PLUS: A Distributed
Shared Memory System," Proceedings of the 17th
International Symposium on Computer Architecture, vol. 18,
no. 2, May 1990, pp 115-124.
[Blu94] M. Blumrich, K. Li, R. Alpert, C. Dubnicki,
E. Felten, J. Sandberg, "Virtual Memory Mapped Network
Interface for the SHRIMP Multicomputer," Proceedings of
the 21st International Symposium on Computer Architecture,
April 1994, pp 142-153.
[Blu95] M. Blumrich, C. Dubnicki, K. Li, M. Mesarina,
"Virtual Memory Mapped Network Interfaces," IEEE Micro,
vol. 15, no. 1, February 1995, pp 21-28.
[Bol93] W. Bolosky, M. Scott, "False Sharing and its
Effect on Shared Memory Performance," Proceedings of the
Fourth Symposium on Experiences with Distributed and
Multiprocessor Systems (SEDMS), September 1993.
[Car91] J. Carter, J. Bennett, W. Zwaenepoel,
"Implementation and Performance of Munin," Proceedings of
the 13th Symposium on Operating Systems Principles,
October 1991.
[Cha89] J. Chase, F. Amador, E. Lazowska, H. Levy,
R. Littlefield, "The Amber System: Parallel Programming on
a Network of Multiprocessors," Proceedings of the 12th
Symposium on Operating Systems Principles, December 1989,
pp 147-158.
[Esk96] M.R. Eskicioglu, "A Comprehensive Bibliography of
Distributed Shared Memory," ACM Operating Systems Review,
January 1996.
[Gil96] R. Gillett, "Memory Channel Network for PCI,"
IEEE Micro, February 1996, pp 12-18.
[Gus92] D. Gustavson, "The Scalable Coherent Interface and
Related Standards Projects," IEEE Micro, February 1992,
pp 10-22.
[Hag92] E. Hagersten, A. Landin, S. Haridi, "DDM - A
Cache-Only Memory Architecture," IEEE Computer, vol. 25,
no. 9, September 1992, pp 44-54.
[Joh95] K. Johnson, M. Kaashoek, D. Wallach, "CRL:
High-Performance All-Software Distributed Shared Memory,"
Proceedings of the Fifteenth Symposium on Operating
Systems Principles, December 1995.
[Kus94] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein,
R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira,
J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum,
J. Hennessy, "The Stanford FLASH Multiprocessor,"
Proceedings of the 21st Annual International Symposium on
Computer Architecture, April 1994, pp 302-313.
[Kel94] P. Keleher, S. Dwarkadas, A. Cox, W. Zwaenepoel,
"TreadMarks: Distributed Shared Memory On Standard
Workstations and Operating Systems," Proceedings of the
1994 Winter USENIX Conference, January 1994, pp 115-131.
[Kon95] L. Kontothanassis, M. Scott, "Distributed Shared
Memory for New Generation Networks," University of
Rochester Technical Report 578, March 1995.
[LeB92] T. LeBlanc, E. Markatos, "Shared Memory vs.
Message Passing in Shared-Memory Multiprocessors," Fourth
Symposium on Parallel and Distributed Processing,
December 1992.
[Len92] D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber,
A. Gupta, J. Hennessy, M. Horowitz, M. Lam, "The Stanford
DASH Multiprocessor," IEEE Computer, March 1992,
pp 63-79.
[Li89] K. Li and P. Hudak, "Memory Coherence in Shared
Virtual Memory Systems," ACM Transactions on Computer
Systems, vol. 7, no. 4, November 1989, pp 321-359.
[Min95] R. Minnich, D. Burns, F. Hady, "The
Memory-Integrated Network Interface," IEEE Micro, vol. 15,
no. 1, February 1995.
[Nit91] B. Nitzberg and V. Lo, "Distributed Shared Memory:
A Survey of Issues and Algorithms," IEEE Computer,
vol. 24, no. 8, August 1991, pp 52-60.
[Now95] A. Nowatzyk, G. Aybay, M. Browne, E. Kelly,
M. Parkin, B. Radke, S. Vishin, "The S3.mp Scalable Shared
Memory Multiprocessor," Proceedings of the 24th
International Conference on Parallel Processing,
August 1995, pp I:1-10.
[Pro95] J. Protic, M. Tomasevic, V. Milutinovic, "A Survey
of Distributed Shared Memory Systems," Proceedings of the
28th Annual Hawaii International Conference on System
Sciences, January 1995, pp 74-84.
[Ram88] U. Ramachandran, M. Ahamad, Y. Khalidi, "Unifying
Synchronization and Data Transfer in Maintaining Coherence
of Distributed Shared Memory," Georgia Institute of
Technology Technical Report GIT-CS-88/23, June 1988.
[Vek95] N. Vekiarides, "Fault-Tolerant Disk Storage and
File Systems Using Reflective Memory," Proceedings of the
28th Annual Hawaii International Conference on System
Sciences, January 1995, pp 103-113.
