Enhancing Multistage Interconnection Network Performance in Computer and Telecommunications Systems

by Sandra E. Cheung

Physical description: xi, 149 leaves : ill. ; 29 cm.

Thesis (Ph.D.)--University of Florida, 1993.
Includes bibliographical references (leaves 142-147).








ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisor, Professor Yann-Hang Lee, for his guidance and infinite wisdom that made this work possible. His sound and thorough words have guided me in many aspects of academic life.

I would also like to thank my other committee members: Professor C. Chow, Professor R. Newman-Wolfe, Professor Davis, and Professor S. X. Bai.

I am deeply indebted to my brother and friend, Professor Shun Yan Cheung, without whose encouragement and support throughout all of my life, but in particular during the past few years, this work would not have been possible.

A great number of people are responsible for instilling in me the passion to pursue knowledge and truth. I would like to thank all of my friends, colleagues, professors, and support staff who contributed to this experience. These include, but are not limited to, Dr. F. D. Anger, R. Rodriguez, Dr. M. E. Bermudez, Dr. S. Sahni, G. Haskins, Javier, Padmashree, Jin-Joo, Balaji, and Mario. Special mention goes to the entire Happy Hour Gang for those sobering moments of truth.

I would also like to acknowledge my sister-in-law Stella and my niece Nathanie for adding a new dimension to my life. Last, but certainly not least, I would like to thank my father and my mother for their love and vote of confidence. To them, I dedicate this work.





TABLE OF CONTENTS

2.1 Overview of Multistage Interconnection Networks
2.2 Multistage Interconnection Networks in Large-Scale Multiprocessor Systems
2.3 Multistage Interconnection Networks in Telecommunication Systems
2.4 Chapter Summary

3.1 Non-Uniform Reference Patterns
    Vector Interference
    Vector-Scalar Interference
3.2 Path Setup Algorithms
3.3 Fast Packet Scheduling
    Sorter Based Networks
    Input Queue Architectures
    Shared Queue Architectures
3.4 Chapter Summary

4.1 Consecutive Request Traffic Pattern
4.2 Dynamic Priority and Bit-Reverse Scheme
    Dynamic Priority Scheme
    The Combined Approach
4.3 Effects of Spatial Distribution on CR Pattern
    Dynamic Priority and Stride Based Mapping Revisited
    Skewed Storage Schemes
4.4 Performance Evaluation of Forward Network
    Dynamic Priority and Bit-Reverse Mapping Results
    Dynamic Priority and Stride Based Mapping Results
4.5 The Effects of the Reverse Network on CR Traffic
    Processor Model
    Memory Model
    Network Model
    Performance Evaluation
4.6 Chapter Summary

5.1 The Effects of Vector/Scalar Interaction
5.2 Vector/Scalar Interaction Models
    Special Interaction Types
    Performance Evaluation
    Performance of SSRM Model
    Performance of DDRM Model
    Chapter Summary

6   Optimal Path Setup
    Transportation Network Flow Problems
    Multicommodity Flows in MIN Transportation Networks
    Maximum Setup Example
    The Re-Attempt Parallel Path Setup (ReAPPS) Method
    ReAPPS with Distributed Synchronization
    Analysis of Path Setup Delay of the ReAPPS Scheme
    Performance Evaluation of ReAPPS
    Chapter Summary

7   The Design of the Circular Window Control (CWC) Scheme
    Switch Architecture and Operation
    Analytic Models for the CWC Scheme
    Throughput Analysis

REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

Skewed Storage Scheme
Simulation Parameters
Performance Measures
Assumptions of the Vector/Scalar Model
SSRM Model Parameters
DDRM Model Parameters
Setup Overhead for PPS, IPS and ReAPPS Schemes
Various Schemes under Uniform Traffic
Various Schemes under Permutation Traffic
Effect of Network Size on Setup Schemes
Schedulability Analysis


LIST OF FIGURES

An 8 x 8 Omega Network
Large Scale Multiprocessor System Configuration
An Output Queued Packet Switch
An Input Queued Packet Switch
Stride-2 Access
CR Concurrent Fetch
Same Bank Offset/Stride 1 - No Initial Load
Random Bank Offset/Stride 1 - No Initial Load
Same Bank Offset/Stride 1 - Uniform Load for 500 Cycles
Random Bank Offset/Stride 1 - Uniform Load for 500 Cycles
Throughput for Dynamic Priority and Round Robin
Throughput of Dynamic Priority
Effect of Buffer Size on Message Delay
Stride Dependent Mapping in Interleaved Memory
Average Vector Delay in Interleaved Memory
Uniform Traffic Under Proposed Schemes
Average Forward Vector Delay - Omega Reverse Network
Average Reverse Vector Delay - Omega Reverse Network
Average Forward Vector Delay - Baseline Reverse Network
Average Reverse Vector Delay - Baseline Reverse Network
Vector-Scalar Interaction Example
Average Vector Delay - SSRM Model
Average Scalar Delay - SSRM Model
Average Vector Delay - SSRM Model (Scalar Priority)
Average Scalar Delay - SSRM Model (Scalar Priority)
Average Vector Performance for All Schemes
Average Scalar Performance for All Schemes
Effect of Scalar Burst Length on Vector Delay - No Scalar Priority
Effect of Scalar Burst Length on Scalar Delay - No Scalar Priority
Effect of Scalar Issue Rate on Average Vector Delay
A Three Commodity Flow Problem
Path Setup is Not a Single Commodity Problem
Input Pattern in a Transport Network
Embedded Synchronization Example
Embedded Synchronization Example (Continued)
Embedded Synchronization Example (Continued)
Embedded Synchronization Example (Continued)
Time Slot in a Nonbuffered Packet Switch Banyan Network
Transmission of Request in a 4-Stage Banyan Network
CWC Architecture Model
The Circular Window Control Scheme
An Illustrating CWC Example
Throughput of CWC Crossbar Switch Systems
Throughput of CWC Banyan Switch Systems
Mean Waiting Times of CWC Scheme vs. Input Queueing Scheme
Throughput of CWC Schemes under Uniform and HotSpot Traffic
Mean Waiting Time of CWC Schemes under Uniform and HotSpot Traffic
Throughput of CWC Scheme under Different Alternate Queue Selections
Throughput of CWC Scheme under Different Window Sizes

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

ENHANCING MULTISTAGE INTERCONNECTION NETWORK PERFORMANCE
IN COMPUTER AND TELECOMMUNICATIONS SYSTEMS

By Sandra E. Cheung

August 1993

Chairman: Yann-Hang Lee
Major Department: Computer and Information Sciences

Multistage interconnection networks (MINs) can provide an excellent tradeoff between cost and efficiency. Their regular structure and multiple access points make them not only amenable to cost-effective VLSI implementation but also able to provide higher bandwidth than bus-based systems. MINs are used in both computer and telecommunications systems. Although the design goals in these two areas are different, similar problems can be identified. Despite their advantages, the internal blocking of these networks limits their throughput considerably and is aggravated by the size of the network. In enhancing the throughput of MINs it is thus imperative to consider reducing the number of internal link conflicts.

In computer systems we observe the effects of the Consecutive Requests (CR) traffic pattern in MINs. By identifying the causes of the deterioration we are able to propose two methods, the Dynamic Priority and the Bit-Reverse schemes, which can significantly improve the performance under CR traffic.

In telecommunications systems, the design goals are to grant as many connections as possible (in a circuit-switched design) and to pass as many packets as possible in a conflict-free manner (in a packet-switched design). We propose a path setup scheme which can improve on existing parallel path setup methods. To enhance the throughput of input queued fast packet switches, the Circular Window Control scheme can be implemented to schedule packets in a manner in which the probability of conflicts is greatly reduced.

The methods and schemes presented in this thesis show that, with simple and cost-effective enhancements, the performance of MINs can be improved considerably.


Communication bandwidth is a key factor limiting the performance not only of high performance computers but also of data communications networks. In both types of systems, multistage interconnection networks (MINs) provide a good tradeoff between cost and performance compared to other available means of interconnection. Some large scale shared memory multiprocessors, such as the NYU Ultracomputer [GOTT83], the IBM RP3 [PFIS85a], and the Illinois Cedar [GAJS83], utilize MINs. The bandwidth of the interconnection between processors and memory modules, memory access delays, and the delays resulting from network and memory conflicts all contribute to the overall system's performance.

In data communication networks MINs are used as switches, i.e., functional units that receive units of information (sometimes referred to as cells) from a set of incoming channels and route them as appropriate before transmitting the cells on a set of outgoing channels. With the advent of high transmission bandwidths, the main challenge remains to build switches which can handle these rates. One category of space division fast packet switches is based on multistage interconnection networks; examples include the Batcher-Banyan [HUI87] and the Starlite switch [HUAN84].

Common methods for optimizing MINs range from enhancing the switches of the MIN to scheduling the units of information in order to reduce the contention on the communication medium. Switch enhancement involves modifying the basic crossbar switches by adding circuitry that can handle additional functions. For example, by equipping switches with some buffer space, conflicts can be tolerated more gracefully than by having to drop messages. Scheduling of data permits a much higher bandwidth to be achieved if conflicts are actively avoided by not permitting contending units of information to access the medium at the same time.

The problem of improving the performance of a MIN can be addressed from several angles. In a computer system the network delay is a direct measure of performance, and there are a number of ways to attack the problem:

* Reducing memory access delays.

* Supporting special traffic patterns.

* Improving the bandwidth of the interconnect.

The efforts in data communication networks, in turn, have concentrated on the problems which arose due to recent technologies. The accelerated development of higher speed communication has opened new avenues for all kinds of applications, which in turn have become dependent upon networks or networked computing applications. Building switches which can handle these enormous rates remains a challenge. MINs are used in the class of space-division fast packet switches, and the methods for enhancing their throughput include:

* Reducing or eliminating head-of-line blocking.

* Scheduling requests in a conflict free manner.

* Providing multiple interconnection networks.

This thesis addresses the problem of providing a high degree of throughput using MINs in both high performance computer systems and in high speed communication networks. It describes their vital role in both systems and presents efficient techniques for enhancing the performance of such switches according to the requirements of each environment.

Though the requirements of computer systems and communications systems differ vastly, a common goal of achieving enhanced throughput can be pursued by studying the switch and its operations. In computer systems the switch can be modified in order to be robust under severe traffic conditions by supporting nonuniform reference patterns. In telecommunications switches, throughput enhancement mechanisms vary according to the switching technique implemented. In a circuit switched architecture, the path setup phase should attempt to grant the largest number of connections. In a packet switched architecture, output contention and possibly internal blocking must be avoided by scheduling packets which will be non-contending; this may involve having to relax the strict FIFO order of the packet arrivals.

The first half of the thesis addresses the aspects of throughput improvement in computer systems.


The remainder of this thesis addresses the problem of MIN-based switch enhancement in telecommunications. Here, the problem is examined from both a circuit-switching and a packet-switching point of view. This thesis further shows that these networks can provide excellent throughput and have applications in both computer and data communications systems.

The thesis is organized as follows. Chapter 2 gives a general overview of multistage interconnection networks and formulates the goals in both computer and telecommunication networks. Chapter 3 details the background work and approaches taken in related work. Chapter 4 presents the Consecutive Requests traffic pattern and investigates the causes of network deterioration under this type of traffic. Chapter 5 studies the effects of the Consecutive Requests traffic pattern when it interacts with regular traffic. Chapter 6 presents a parallel path setup scheme for a connection-oriented MIN-based high speed network switch. Chapter 7 presents the Circular Window Control scheme for scheduling packets in a highly nonconflicting manner in a MIN-based switch for high speed networks. Chapter 8 summarizes this thesis and discusses future research directions.

The methods presented are by no means exhaustive, and new techniques can be developed using the work described in this thesis.


Multistage interconnection networks (MINs), and the research associated with them, date back to the 1950s in connection with switching exchanges for telephone systems. In the early 1970s MINs were proposed for computer system applications, and since then the research in both the computer and the telecommunications areas has reached numerous milestones.

With the present research on fabrication methods, there is no reason to suspect that the technological advancement will not continue. With the advent of high performance computers on one hand, and high speed networking on the other, MINs will play an integral part in providing data communication. The purpose of this thesis is to present methods to avoid or overcome pitfalls of MINs in computer and telecommunication systems.

The remainder of this chapter is organized as follows. An overview is provided in Section 2.1 to give the basic terminology and underlying topological design principles of MINs. Sections 2.2 and 2.3 deal with specific design issues of MINs in large-scale multiprocessor systems and telecommunication systems, respectively. Finally, Section 2.4 summarizes the requirements of MINs for computer and telecommunication systems.

2.1 Overview of Multistage Interconnection Networks

The vast array of physical implementations for interconnection switches ranges from bus-oriented to crossbar systems. The simplest of these is perhaps the time-shared bus, but it is unable to provide the performance required in a large-scale high performance system, mainly because of its inability to support a large number of processors, i.e., its lack of scalability. Moreover, buses are single access and would therefore incur high contention. At the other end of the bandwidth spectrum is the full crossbar, which can connect any of its free input ports to any free output port by providing a separate switching gate for each input-output connection. A crossbar switch with N input and N output ports requires N^2 switching gates, and this is the main drawback of crossbar switches. The hardware costs required for interconnecting thousands of processors are too vast for crossbar switches to be considered economically viable. The feasibility of implementing large crossbars in VLSI presents another problem.

The main motivation behind selecting a multistage interconnection network as a communication medium is the hardware cost reduction. By connecting stages of smaller crossbar switches, a MIN can provide the required interconnection. Although the number of gates, O(N log2 N), is significantly less than in crossbars, the speed is reduced because of the multiple stages and the time needed to set the network control, unless the network is self-routing (in which case the control is distributed to each switch).
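To make the cost contrast concrete, here is a rough sketch (my own illustration; the counts, not the code, come from the text) comparing a full crossbar's gate count with the number of 2 x 2 switching elements in a MIN:

```python
import math

def crossbar_gates(n):
    # A full N x N crossbar needs one switching gate per input-output pair.
    return n * n

def min_switches(n):
    # A MIN built from 2 x 2 switches has log2(N) stages of N/2 switches
    # each, i.e. O(N log2 N) switching elements overall.
    return (n // 2) * int(math.log2(n))

for n in (8, 64, 1024):
    print(n, crossbar_gates(n), min_switches(n))
# For N = 1024: 1,048,576 crossbar gates vs. 5,120 2 x 2 switches.
```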


In general, a MIN consists of a sequence of switching elements, arranged in stages. Physical wires connect successive stages and are also referred to as interstage links. The most common switching element used in a MIN is a crossbar switch, small enough to be economically attractive. For simplicity, we shall consider only networks with square crossbar switches, i.e., switches where the number of inputs is equal to the number of outputs.


MINs can be classified in a number of ways. For the purpose of this thesis, we shall consider one class of permutation networks (i.e., networks which can connect an input port to at most one output port) called blocking networks, which exhibit the Unique Path Property (UPP). As their name implies, these networks have a unique path between every input-output pair. However, a connection between a free input-output pair may not always be possible due to conflicts with existing connections, hence the name blocking.

The network topology refers to the number of stages, the number and size of the switches, and the interconnection pattern between successive stages. Two networks are said to be topologically equivalent if their underlying graphs are isomorphic; they are isomorphic if there exists a label-preserving graph isomorphism between them. A topological equivalence relationship was established for several networks belonging to the class of UPP networks [WU80]. This equivalence implies that they can perform the same set of permutation functions if the destinations are rearranged appropriately, and that under the same routing algorithm, equivalent networks will have identical performance and fault tolerance characteristics. Equivalent networks can also be used to simulate each other.

UPP networks can be controlled with a distributed routing scheme known as the destination tag algorithm, proposed by Lawrie [LAWR75]. It is based on output labels called destination tags. The binary representation of the destination tag, d_1 d_2 ... d_n (with d_n the least significant bit), is used to route the packet: in stage i the packet is sent to the upper output port if d_i = 0, or to the lower port if d_i = 1.

An example of a UPP network is the Omega network [LAWR75]. An Omega network of size N x N (N = 2^n) consists of n stages, where each stage contains N/2 switches and has an interstage pattern referred to as the perfect shuffle. An example of the Omega network constructed with 2 x 2 crossbar switches is shown in Figure 2.1, where N is equal to 8.
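The perfect shuffle has a compact bit-level description: position i is connected to the position obtained by rotating the n-bit representation of i one place to the left. A minimal sketch (my own code, using this standard definition):

```python
def perfect_shuffle(i, n_bits):
    # Rotate the n-bit representation of i one position to the left.
    msb = (i >> (n_bits - 1)) & 1
    return ((i << 1) & ((1 << n_bits) - 1)) | msb

print([perfect_shuffle(i, 3) for i in range(8)])
# [0, 2, 4, 6, 1, 3, 5, 7]: the two halves of the inputs are interleaved,
# like a riffle shuffle of a deck of cards.
```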

In general, a multistage interconnection network can be built with a complexity of O(N log2 N), in contrast to the O(N^2) complexity of a fully connected crossbar network. The destination tag routing algorithm is illustrated in Figure 2.1, where the bold lines indicate the route taken by a packet issued from processor 1 to its destination memory module.
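A minimal sketch of destination tag routing (my own illustration, assuming 2 x 2 switches and most-significant-bit-first stages as described above):

```python
def destination_tag_route(dest, n_stages):
    """Return the output port ('up'/'down') taken at each stage."""
    route = []
    for i in range(n_stages):
        # Stage i examines bit d_i of the destination, d_1 being the MSB.
        bit = (dest >> (n_stages - 1 - i)) & 1
        route.append('down' if bit else 'up')
    return route

# In an 8 x 8 network (n = 3), a packet destined for output 5 = 101 takes
# lower, upper, lower -- regardless of which input port it entered on.
print(destination_tag_route(5, 3))  # ['down', 'up', 'down']
```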


Due to the unique path nature of UPP networks, simultaneous connection of input ports to output ports may lead to routing conflicts. The routing conflicts are due to the sharing of common internal links and are also referred to as internal blocking.

Figure 2.1. An 8 x 8 Omega Network

In addition to internal conflicts, MINs also suffer from output conflicts, a type of blocking even crossbars experience. Output conflicts occur when multiple input ports request the same destination. Of the contending requests, one has to be delayed by one cycle. The victim is usually chosen in a round robin fashion. Conflicts may propagate back to the preceding stage and prohibit a successive message from transmission.

The regular structure of multistage interconnection networks makes them very amenable to VLSI implementation. In addition, their modular structure allows the construction of larger networks out of smaller ones without the need to substantially modify the physical layout or the algorithms.

The drawbacks to UPP networks are that the unique path leads to blocking and makes full connection capability more difficult to maintain in the event of a single failure. These shortcomings of UPP networks have led to the design and implementation of enhanced UPP networks, or multipath MINs. Another drawback of UPP networks is that a given UPP network cannot realize the full set of permutations from inputs to outputs without using multiple passes or multiple copies of the network. Issues concerned with faults and the discussion of enhanced networks will be the topic of the next chapter.

2.2 Multistage Interconnection Networks in Large-Scale Multiprocessor Systems

Network latency, compounded with memory access delay, is known as the main factor limiting the performance of large-scale shared memory multiprocessors. Methods to reduce the memory access delay include memory hierarchies and asynchronous block transfers. Though I/O constitutes another important factor in multiprocessor performance, we assume that the applications under investigation are computationally intensive and that the I/O involved is negligible.

As computer technology has been moving towards high performance architectures, the trend has become to organize multiple processors (as many as thousands) which can then work in parallel on a single task or on multiple tasks. These tasks are partitioned into processes which are then assigned to different processors. Eventually these processes must communicate with one another, and it is this communication cost which becomes a deciding factor in determining the performance and efficiency of the application.

Multistage interconnection networks have frequently been chosen to interconnect processors and memory modules due to the many desirable features described in the previous section. Although there are a large number of different possible MINs, only a surprisingly small number are actually considered for multiprocessors. One that recurs in many different forms is the multistage shuffle-exchange network [BATC76].

Figure 2.2. Large Scale Multiprocessor System Configuration

MIN systems together with memory interleaving can provide adequate network performance. The performance depends not only on the memory system, but also on the rate at which processors issue memory references into the network (the processor issue rate), which in turn is affected by the network delay. After issuing a request, the processor waits for all the messages associated with that request to return. In general there are two types of messages that can be issued by a processor: reads and writes. Read requests are typically synchronous, i.e., the processor waits for the reply of memory in order to resume. However, in pipelined and vector processors, several reads can be issued in a sequence. Write requests, on the other hand, are asynchronous, as processors do not need to wait for a reply. The requests which are initiated by the processors are sent through the forward network, and memory modules return data through the backward network (Figure 2.2).

Network performance is not only dependent on the request issue rate but also on the reference pattern. The references generated by applications running in any computer environment constitute a traffic pattern. Some patterns have no particular structure to them and can be considered quite random in their frequency and destinations. The reference pattern in which processors access all memory modules with the same probability, and in which accesses are not dependent on one another, is referred to as the uniform traffic pattern. The bandwidth of multistage interconnection networks can be very high for requests that do not conflict [LAWR75], but unfortunately most applications generate reference patterns which cause a high degree of conflicts in the MIN.

Nonuniform traffic patterns can be generated by a variety of applications which are data intensive or communication intensive. Nonuniform traffic patterns can cause a large amount of link contention for an indefinite time and have a deteriorating effect on regular traffic as well. It is this effect that makes the study of traffic patterns which can arise from typical applications in multiprocessor environments important, for they directly impact the throughput provided by the interconnection network.

2.3 Multistage Interconnection Networks in Telecommunication Systems

Over the years, the growing number of applications which require high bandwidth has accelerated the research on high speed transmission media such as optical fibers. These applications are no longer solely computer related but can be found in areas such as video, voice and image, collectively known as multimedia [ARMB86, KIM90a]. The applications produce a range of traffic flow characteristics, requiring that each traffic type be allocated sufficient resources to ensure its flow requirement.

Early networks were typically designed to support different traffic characteristics and requirements, and each was tailored for particular applications. Lately the drive is to design a single communication system which can provide the same services in a unified and integrated fashion. Some of the reasons are the ease of maintenance and installation, economy, and ease of access. This movement has prompted the proposal of a set of services called the Integrated Services Digital Network (ISDN).

The constant drive to achieve higher transmission bandwidth has led to the development of optical communications. The usage of optical fiber communication changed the requirements of data communication dramatically. Using lasers, which can be switched at a high rate, and optical fibers, the light signal can be carried over long distances with little attenuation. With this technology, data rates of 4 Gb/s over 100 km without repeaters [TOBA90] are possible.

The emergence of numerous applications which require much higher bandwidth than is possible in present networks was inevitable once the high transmission capacity of fiber optics technology was made available. The real challenge now becomes the creation of a network which can provide the high bandwidth services to the users. The bottleneck comes primarily from switching, since the data can be transmitted at very high speed. These high speed networks will carry, in an integrated fashion, applications which all require different qualities of service. The most appropriate switching technique must handle the wide diversity of data rate and latency requirements resulting from the integration of these services.

Packet switching is also known as the asynchronous transfer mode (ATM), and circuit switching is known as the synchronous transfer mode (STM). At the present time, ATM specifies fixed-size packets of which 48 bytes are data and 5 bytes are control information. The line speeds which are specified are nominal rates of 150 Mb/s (for digitized high definition television (HDTV)) and 600 Mb/s.


Several architectures for fast packet switches have emerged in recent years, namely, (i) shared-memory, (ii) shared-medium, and (iii) space-division packet switches. Of these three, space division fast packet switches allow multiple concurrent paths to be established from the inputs to the outputs. Space division switches can be categorized into two different types of fabrics: crossbar fabrics and banyan-based fabrics. Though crossbar fabrics are nonblocking, the size of realizable crossbar switches tends to be limited. This is the primary reason why banyan-based fabrics have been considered as alternative candidates for fast packet switching.

The routing of data from an input to an output depends on the switching technique used in the switch. There are two principal switching techniques, circuit switching and packet switching. In circuit switching, a complete path of connected links from the origin to the destination is set up when the call is placed and before data is sent over it. The path remains dedicated to the call until it is released by the communicating parties. The overhead incurred during the setup phase is one of the costs of this technique; it is amortized when there is a steady flow of information, and circuit switching is hence the switching method used for voice communication.

Figure 2.3. An Output Queued Packet Switch

Communication between computers tends to be bursty in nature, however. Circuit switching would be too costly and the circuits would be underutilized. Packet switching is a form of store-and-forward technique where the data is transmitted in chunks, referred to as packets, which are not allowed to exceed a certain maximum length. Packet switching achieves benefits such as (i) dynamic bandwidth allocation, and (ii) easy error recovery, because smaller chunks are sent at a time. Once the packets arrive at their destination, reordering may need to take place.

Nonblocking MINs can be classified based on their queueing strategy: (i) input queueing, (ii) output queueing, and (iii) shared queueing. The speed of the switch fabric also plays a factor in where the queueing is done. For example, if in an N x N switch the switch fabric runs N times as fast as the input and output links, all the packets that arrive during a slot can be delivered at the outputs, even if multiple inputs request the same output. A slot is the time in which a packet can be transmitted on a link. Output queueing is necessary when multiple packets can arrive at the same output port in a slot (Figure 2.3).




Figure 2.4. An Input Queued Packet Switch

Input queueing is necessary when each output can accept at most one packet per time slot (Figure 2.4). Input queued architectures have a maximum throughput of 0.586 [HLUC88] for an infinitely large switch with FIFO input queues of infinite length under uniform traffic. Thus, in spite of the queueing capability, the capacity is worse than that of an unbuffered crossbar switch. This phenomenon is due to head-of-line (HOL) blocking. HOL blocking occurs when the packet at the head of the input queue cannot be transmitted and consequently blocks the other packets behind it, although they may be addressed to idle outputs.
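For reference, the 0.586 figure quoted above is the well-known closed form obtained from the HOL saturation analysis of FIFO input queues (stated here for context; the derivation is in [HLUC88]):

```latex
\rho_{\max} \;=\; 2 - \sqrt{2} \;\approx\; 0.586
```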

Typically, three parameters are used to describe the performance of a switching fabric:

* Switch throughput. The utilization of the output links, defined as the probability that a packet is transmitted in a time slot by the switch. The maximum throughput, also referred to as the switch capacity, is the traffic carried when the packet arrival rate is one packet per time slot.

* Average packet delay. The average packet delay is defined to be the number of time slots a packet spends in the switch, from its arrival until its transmission.

* Packet loss probability. The packet loss probability is defined to be the probability that a packet received at the switch input is lost due to buffer overflow.

2.4 Chapter Summary

In this chapter, a general overview of MINs was presented; MINs were found to be widely used in large scale multiprocessor systems as well as in telecommunication systems. They provide a reasonable throughput at moderate hardware costs.

The requirements on the MIN depend highly on the environment in which it is being used. In multiprocessor systems, where the MIN is used to connect a number of processors to a number of memory modules, the objective is to minimize the network delay incurred by the memory references. These references are characterized by traffic patterns generated by applications running in the multiprocessor system. In telecommunications, on the other hand, the MIN is used to switch data from incoming links to outgoing links. The objective is to do this as fast as possible, with the least amount of internal blocking. Depending on the switching technique used in the switch, this requirement can be phrased as (i) improving the average number of setups granted (in circuit switched systems), or (ii) optimizing packet throughput (in packet switched systems).


Most studies have examined the behavior of MINs in isolation, and these do not reflect the characteristics of the entire system. Such studies make simplifying assumptions, avoid the effects of head-of-line blocking, and thus obtain overestimated results.

The remainder of this chapter is organized as follows. Section 3.1 presents several studies performed on non-uniform traffic patterns and emphasizes the work done in modeling vector interference. Section 3.2 presents several setup algorithms previously proposed for multistage interconnection networks. Section 3.3 details studies performed on improving the performance of banyan-based fast packet switches. Section 3.4 presents a summary of the related work discussed.

3.1 Non-Uniform Reference Patterns

One of the most unrealistic assumptions to make is that generated traffic always follows a uniform and independent pattern. The consequences of nonuniform traffic patterns, even when they occur sporadically, can have lasting effects on the entire system. This stresses the need to study the system in its entirety rather than studying its parts in isolation. Note that the non-uniform traffic patterns discussed in this section are common to both computer and telecommunication systems, although the terminology differs between the two fields.


Reference patterns in computer systems are the result of application programs executing and requiring communication to and from other processes, or from memories. One type of reference pattern which has received much attention in the literature is one in which one particular destination address is accessed with a higher probability than the remaining ones. This destination is also referred to as a hotspot. This phenomenon, first observed by Pfister and Norton [PFIS85b], causes the buffers of the paths leading to the hotspot to become saturated. The tree saturation phenomenon is a direct effect of hotspot access and leads to serious degradation in network throughput, not only of accesses directed to the hotspot, but of other memory references as well. The problem of hotspot contention has been studied extensively and various solutions have been proposed [SCOT89, DIAS89].

Hotspot traffic also occurs in communications networks, where it is referred to as output concentration. When each processor references its own particular memory module more than others, the reference pattern which is generated is referred to as the preferred module pattern or, in telecommunications terminology, communities of interest. This type of traffic can occur in the case of bulk transfer. Hotspot traffic can be seen as an instance of preferred module traffic where all processors prefer the same module. The consequences of preferred module traffic are not as severe as those of hotspot traffic when the individual preferred modules are well distributed. The traffic referencing a processor's preferred module obstructs other traffic from the same processor.


The burst issue rate is the rate at which a processor issues bursts of requests to the memory system. In most studies, the assumption is made that the burst issue rate is the same for all the processors in the system. A processor will typically wait for the burst requests to be satisfied before issuing the next burst. If the behavior has a distinct repeating pattern, it can also be termed periodic. Burstiness is of particular interest in communications, as it is characterized by random gaps encountered during message transmission, variability of the message size, and the low delay tolerance of the source.

Another non-uniformity may occur when not all processors have the same issue rate. This case is also referred to as unbalanced input, and may not be of much interest in a computer system but is of utmost importance in telecommunication systems.



Vector Interference

Numerical applications which operate primarily on vectors and matrices have a distinct pattern of reference. These systems are often required to use interleaved banks of memory. A parallel interleaved memory system allows concurrent access to multiple data items by placing these items in distinct memory modules. The assumption is that addresses can be presented to all the modules in parallel and that, after a delay equal to the cycle time of the memory, data can be removed in parallel from all of the modules. Hashing of addresses can spread out the accesses at the cost of destroying any spatial locality; Gottlieb [GOTT83] states that hashing, however, only works for small systems.

In [CHEU86], a simulation study of the Cray X-MP's memory system was performed in which the vector traffic caused linked conflicts. Vector activity was studied in a multiple bus architecture, and suggestions were made with respect to the bank cycle time and the number of bus lines. Other studies on memory in vector supercomputers are given in [BAIL87, OED85].

Storage schemes and address transformations have been proposed to reduce the contention created by the access of vectors. The conventional storage schemes are the interleaved scheme and the skewed storage scheme [BUDN71]. The objective of storage schemes is to permit conflict-free access for a set of frequent access patterns. Storage schemes based on the access pattern of the vectors are presented in [HARP87].

Vector traffic in a MIN and its adverse effects have been studied in [TURN89], where it was demonstrated that the forward network, memory, and backward network all affect each other. They propose placing long buffers at the memory inputs, which will eliminate blocking in the forward network.

An evaluation of the Cedar multiprocessor, which is a MIN-based system, was done in [GALL91]. They observed that the degradation was due to the density of the requests, which they resolved by adding dead cycles (NOPs) in order to reduce the density.


Vector-Scalar Interference

In many commercially available supercomputers, such as the Cray X-MP, multiple processors can access both vectors and scalars simultaneously. Their interaction can cause the degradation of one or both types of access.

Processors which have a scalar cache cause cache blocks to be transferred between cache and main memory. Since data in cache lines are stored in contiguous sets of words, the interaction between these cache lines and ordinary vector accesses would amount to the interaction between vectors, which was introduced in the section above. In order to observe vector-scalar interference, the system is assumed to be without any scalar caches.

In [RAGH91] the interference among vector and scalar accesses has been analyzed and found to reduce the performance of vector accesses that may already be in progress by as much as 40%. Their model assumed that the memory system had reached a conflict-free steady state, operating at 100% efficiency. Other important assumptions are that there were no memory bank buffers present and that preference was given to scalar accesses. Then, a single scalar access caused a series of simultaneous bank conflicts leading to a detrimental effect on the memory efficiency. By using a crossbar interconnection network they avoided the memory path conflicts which typically arise in MINs.

3.2 Path Setup Algorithms


In a circuit-switched MIN, a path must be established for each request from its source to its destination. Setting up a path between a source and a destination can be done using the self-routing property of a UPP.

Unique path property networks can realize only a small fraction of the N! possible source-destination permutations. In order for a UPP network to realize an arbitrary permutation, several passes through the network are necessary, where each pass realizes a submap of the permutation. The problem of finding the minimum number of passes to realize a permutation is intractable. Finding whether an arbitrary permutation can be realized in a given UPP topology has been done, for example for cube-connected networks in [ORUC91].

Generally, however, input patterns are not necessarily permutations. This adds output conflicts to the setup problem: if n different sources request the same destination, at most one of these can send its packet.

A setup procedure is a mechanism which specifies which sources of a given request pattern get to send their data to the destinations. The setup procedure typically precedes the actual packet transfer protocols, particularly when the switching fabric is unbuffered and dropping of packets is inherent. In order to avoid retransmission, the setup phase ensures the packet can be transmitted prior to being admitted to the switch. The main goal of any setup procedure is to grant as many connections as possible at the least possible overhead. A subgoal in synchronous transmission is to combine this with synchronization that notifies the sources that the setup has terminated and that transmission may start.

The speed of the switch, with respect to the speed at which the input links are operating, has a direct impact on the path setup phase. If the transmission speed of the switch is n times as fast as the link transmission speed, then the packet at the input port has n opportunities to attempt to set up a path within a slot time. This increases the throughput considerably.
The fundamental assumption made in most path setup schemes is that at the beginning of a transmission cycle, all setup requests are aligned and processed in a parallel fashion. Output link conflicts are usually resolved in a random manner whereby the victimized requests are dropped and denied a setup during the data transmission interval. Victors continue on to subsequent stages where again they might have output link conflicts. Ultimately, the victors of all stages are granted a connection. This is also known as parallel path setup (PPS); an analysis of the maximum throughput possible under this scheme has been given in the literature.

A study conducted by Lea [LEA92] showed that an incremental path setup (IPS) procedure can be implemented which can increase the number of possible setups at the expense of having more setup intervals (assuming the parallel path setup scheme only uses a single interval). Lea [LEA92] is able to achieve a spectrum of improvements by increasing the number of setup intervals. In order to achieve a reasonable improvement in throughput, the number of setup intervals must equal the number of connection requests, which are attempted for each of the sources, one at a time. The order in which the setups occur (the setup sequence) places a certain priority on the traffic at the inlets: the highest priority is a priori assigned to the inlet which is attempted first, and the inlet at the end of the sequence has the lowest priority. Another drawback of the IPS scheme is that the data transmission cannot start until all the setup intervals are over, even though the network might be idle during some of these intervals due to a lack of requests (low input rate).

3.3 Fast Packet Scheduling

Most Asynchronous Transfer Mode (ATM) switches rely on the use of multiple stages of very simple switching elements, each of which is self-routing. MINs are constructed on these very principles and are thus heavily used in many of these switches. Blocking MINs require means of controlling packet loss, either by buffering in the switching elements or by other flow control means; alternatively, nonblocking MINs are used. A nonblocking interconnection network guarantees the absence of internal conflicts and can thus establish a path from any free input to any free output. Nonblocking MINs can be further classified based on the queueing strategy adopted: (i) input queueing, (ii) output queueing, and (iii) shared queueing.


Sorter Based Networks

A number of nonblocking MINs are based on placing a sorting network prior to a routing network; such MINs are built from sorting and switching elements, respectively. An N x N Batcher sorting network has n(n + 1)/2 stages (n = log2 N), with each stage consisting of N/2 sorting elements. Additional means are required to guarantee that the sorted elements are all distinct and skewed into adjacent outlets.
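As a quick sanity check of these counts (my arithmetic, not from the text), for a small Batcher network:

```latex
N = 8,\quad n = \log_2 N = 3:\qquad
\frac{n(n+1)}{2} = 6 \text{ stages},\qquad
\frac{N}{2} = 4 \text{ elements per stage},\qquad
6 \cdot 4 = 24 \text{ sorting elements in total.}
```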
It is well known that a routing banyan network is nonblocking if the set of packets to be routed is sorted based on the output addresses and received on adjacent inlets of the banyan network. Examples of sort-banyan fast packet switches are the Starlite switch [HUAN84], and others [HUI87].


Input Queue Architectures

Input queued switch architectures run at the same speed as the inputs and outputs, and packets are queued at the inputs. Typically, these input queues are serviced in a FIFO manner, creating HOL blocking. This kind of blocking involves a packet at the head of the queue obstructing passage of packets behind it whose outputs are otherwise idle.

In the Three-Phase switch, each slot is subdivided into three phases. During the probe phase, each active inlet issues a request packet indicating the outlet addressed by its head-of-line (HOL) packet. The interconnection network then generates as many acknowledgements as there are granted requests and sends them back to their respective requesting inlets. Finally, the inlets which have received an acknowledgement packet transmit their HOL packets during the third phase.


The Ring reservation switch [BING88] coordinates the inlets by interconnecting them in a ring structure. A reservation frame is serially transferred along the ring slot by slot so that each inlet can reserve the outlet addressed by its HOL packet, if it is not already reserved by some upstream inlet. Successful inlets transmit their HOL packets without blocking. The reservation phase and the data phase can be pipelined, but the serial reservation scheme makes the solution unviable for large switches, as the bit rate grows with the switch size.

HOL blocking can be reduced in several ways [PATT88]:

* Switch expansion. The switch has more outlets than inlets (N). The throughput is improved because the number of conflicts for each outlet is reduced.

* Windowing. Windowing is a technique that relieves HOL blocking by allowing non-HOL packets to contend for switch outlets. A window of depth W means that a search is done for up to W packets, including the HOL packet (a small sketch of this discipline follows this list).

* Channel grouping. The switch outlets are subdivided into groups of R each, and packets now address groups rather than single outlets. In each slot each output link in a group is allocated to a specific inlet which is addressing that group with its HOL packet.
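A minimal sketch of the windowing discipline (my own code, under simplified assumptions: synchronous slots, and outlets reserved in the order the inputs are scanned):

```python
def schedule_with_window(queues, w, n_outlets):
    """queues: one list of requested outlet numbers per input, head first.
    Each input scans up to w packets from the head of its queue and selects
    the first one whose outlet is still unreserved in this slot."""
    reserved = [False] * n_outlets
    chosen = {}                        # input index -> position in its queue
    for i, q in enumerate(queues):
        for pos, outlet in enumerate(q[:w]):
            if not reserved[outlet]:
                reserved[outlet] = True
                chosen[i] = pos
                break
    return chosen

# With W = 1 this degenerates to plain FIFO input queueing (HOL blocking);
# with W = 2, input 1 may bypass its blocked HOL packet:
print(schedule_with_window([[3, 2], [3, 0]], 2, 4))  # {0: 0, 1: 1}
```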


Shared Queue Architectures


In shared queue architectures, the approach is to mark and separate, slot by slot, the set of transmittable packets from the loser packets. The resulting sequence then consists of a sorted permutation which, after being skewed (removing the idle gaps in the sequence), is passable in any UPP network. Winner packets are forwarded to the routing network and delivered to their requested outputs. Loser packets are fed back through a recirculation network to contend again with newly arrived packets. The recirculation network implements a distributed shared queue for packets that cannot be transmitted to their outputs.

3.4 Chapter Summary

In this chapter, relevant work on MINs was discussed and the approaches taken by other researchers were summarized. In the next chapters, we will present methods by which we can improve on the techniques presented here and/or solve problems which were also observed by others.



Communication between processors in a shared memory multiprocessor is achieved by reading and writing shared variables that reside in the shared memory. Each processor is typically equipped with its own local memory where its local data and its local program can reside. When these processors are executing in a parallel fashion, the computations should be arranged in such a manner that the individual processors have to wait as little as possible. The traffic between the shared and local memories must be carefully managed, since these data transfers incur a significant cost.

The architecture of the interconnection network affects parallel algorithm development, and it is paramount to reduce the number of possible access collisions in this network. The performance of an interconnection network is dependent on the behavior of the imposed traffic. The pattern of the traffic is characterized by the rate of the requests and the frequency with which destinations are requested.

In a centralized-control SIMD environment, network performance can attain its optimum when certain permutations appear in the memory access traffic. Unfortunately, these special patterns cannot always be guaranteed in practice. In most performance studies, the assumption is made that requests are issued independently and "uniformly" to all memory modules. This means that at each source subsequent requests are independent from one another and that requests to memory modules are generated with equal probability. When the memory access traffic does not follow this independence and uniformity assumption, the performance is often overestimated.

Even if the memory accesses could be distributed uniformly, the independence assumption might not be valid. One instance is the consecutive memory access, in which a request targeted to memory location i is followed by a request to location i+1. This kind of access pattern may occur when processors access a stride-1 vector or load program code. This consecutive request (CR) pattern can also be caused by the transfer of a cache block between cache and shared memory if each processor is associated with a primary cache. When the interleaved memory allocation scheme is used, these consecutive memory accesses will be targeted to contiguous memory modules. In this case, the network receives sequences of dependent requests which may experience repeated network conflicts, as observed in [SHEN89].
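Under low-order interleaving (the standard interleaved allocation assumed here for illustration), the module index is simply the address modulo the number of modules, so a stride-1 sequence walks through the modules in order:

```python
# Low-order interleaving: address a resides in module a mod M (M modules).
# A stride-1 (CR) sequence therefore targets contiguous modules in order,
# producing the dependent request streams discussed above.
M = 8
addresses = range(16, 24)            # consecutive locations i, i+1, ...
print([a % M for a in addresses])    # [0, 1, 2, 3, 4, 5, 6, 7]
```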

In vector applications, vector accesses are typically preceded by periods of uniform traffic. These requests can be directed to synchronization variables or could be updates of local information. Hence our work considers the case in which, at the time of the vector access, the network might have some messages left from the preceding uniform access. These messages might obstruct the smooth passage of the CR traffic. Other factors which might affect the performance of our proposed schemes are the starting destination of the request sequences and whether all sequences are issued synchronously or asynchronously. We investigate these cases and evaluate the performance of the proposed solutions using simulations.

The evaluations show that, under CR traffic, the performance of the original switch design is badly degraded and that the proposed solutions significantly mitigate this degradation. Evaluations of the new designs are also done under the uniform traffic pattern, to reflect the non-deterministic nature of memory accesses in multiprocessor systems. These results show that the performance of the proposed schemes is the same as that of the primitive design under uniform traffic.

The remainder of this chapter is organized as follows. Section 4.1 discusses how, under CR traffic, a network composed of conventional switch elements will experience heavy conflicts in the initial stages. Section 4.2 presents two effective methods for mitigating the effects of the CR traffic. First, the Bit-Reverse address mapping scheme is proposed, in order to reduce the number of conflicts in the initial stages by reordering the destinations of consecutive requests. Second, the Dynamic Priority scheme is proposed, to increase the overlap of the uses of the two links at each switch. Section 4.3 discusses the effects of spatial distribution on the CR traffic pattern and generalizes the Bit-Reverse scheme. Sections 4.4 and 4.5 present the performance evaluation of the schemes on the forward network and backward network, respectively. Finally, Section 4.6 gives an overview of the pertinent results obtained in this chapter.

4.1 Consecutive Request Traffic Pattern

It is questionable whether two successive requests from a processor are independent. If they are dependent, the question is whether switching elements would experience unbalanced traffic and correlated conflicts. To investigate the impact of dependent requests, the CR traffic pattern is described, which has the following characteristics:

1. A sequence of requests is generated at each source node, destined to consecutive sink nodes; for example, the request generated at processing element PE1 to memory module MM4 is followed by requests to memory modules MM5, MM6 and MM7.

2. This sequence of requests is generated continuously, without interference from other requests.

The first place where this access pattern occurs is in vector accessing. A large number of numeric arrays and numerous iterations of computations involving array elements are used in engineering and scientific applications. These data are distributed over many memory modules to prevent memory contention and increase computational parallelism. To execute these applications in a parallel manner, multiple processors are assigned to perform similar operations on a selected subset of operands. Furthermore, the use of instruction pipelining and data prefetching makes possible the prefetching of all these operands in very short intervals. Consequently, each processor issues sequences of consecutive requests. Cache line fetches and program loading have the same consecutive nature. In multiprocessor systems where each processor has its own private cache, or even in the event of a shared cache, upon a cache miss the cache controller must read/write a line of words from/to global memory.


The Effects of the CR Pattern

In this section we describe the effects of CR traffic in a network with conventional switches. Consider the activities in a switch in the first stage when the same CR sequence arrives simultaneously at each of its input ports. For example, PE0 and PE1 both send requests with destinations 000, 001, 010, ..., 111 consecutively in an 8 x 8 system. In the first half of the CR sequence, messages are addressed to the first half of the destinations, i.e., 000, 001, 010, 011. Therefore, the sequence of the first bits, used to control routing, remains the same; as in the above example, 0, 0, 0, 0. We define the duration in which routings remain unchanged as the Duration of Identical Routings (DIR); in our example, the DIR at the first stage is 4. We can see that the possibility of insertion conflicts is high during this half of the CR sequence, since the DIRs of both inputs are the same and long, i.e., both of the inputs try to route messages to the same output port for many cycles. The same thing occurs during the second half of the CR sequence, which is directed to the lower half of memory.
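The DIR is easy to compute mechanically. The following small sketch (my own formulation of the definition above) measures the initial run of identical routing bits at a given stage:

```python
def routing_bit(dest, stage, n_stages):
    # The routing bit used at stage i (counting from 1) is d_i, the i-th
    # most significant bit of the destination in an N = 2**n network.
    return (dest >> (n_stages - stage)) & 1

def first_dir(seq, stage, n_stages):
    """Length of the initial run of identical routing bits at a stage."""
    first = routing_bit(seq[0], stage, n_stages)
    run = 0
    for d in seq:
        if routing_bit(d, stage, n_stages) != first:
            break
        run += 1
    return run

cr = list(range(8))          # the CR sequence 000, 001, ..., 111
print(first_dir(cr, 1, 3))   # 4: stage 1 routes upward for four cycles
print(first_dir(cr, 2, 3))   # 2
```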

In the second and the subsequent stages, conflicting messages will not only affect one of the messages at the input ports but also messages at the stages before it, as blocked messages back up toward the earlier stages. These are cases of effects which will be generalized in Section 4.3. Below are the effects of CR traffic for unit-stride accesses.

* Effect (1') The usages of the two links connected to the two output ports of a switch are not fully overlapped, i.e., resources are not fully utilized. As in the above example, at the output ports of a first-stage switch, the lower link is idle while the upper link is busy during the first half of the input CR sequences.

* Effect (2') The DIR is still long in the latter stages. If two input sequences overlap with each other, the length of the DIR has an adverse effect on the performance of the CR traffic.

If we can find a way to prevent or reduce the above two phenomena, we can mitigate the performance degradation caused by the CR traffic. For simplicity, we shall refer to the above two effects created by the CR traffic as Effect (1') and Effect (2') in the subsequent sections.

4.2 Dynamic Priority and Bit-Reverse Scheme

From the above, we can see that the network performance will degrade due to poor link utilization and long durations of identical routings. In this section, we propose two approaches to remedy these deficiencies. The approaches are easily incorporated in conventional switches and have the same performance as the conventional switch design under an independent and uniform traffic pattern.


4.2.1 Bit-Reverse Mapping

We consider the conventional switch design which uses the round robin priority scheme to resolve insertion conflicts. When a processor issues consecutive requests starting from the same memory module or approximately the same memory module, the destination addresses of these requests vary only in the low-order bits. The routing of messages in the first stage depends on the first bit of the destination address. Thus, the routes of successive requests will be overlapped in the first stages of the network. One way to split the routes of successive requests and to reduce their overlap is to adopt a bit-reverse mapping in the destination addresses and to rearrange the memory modules accordingly.

In the interface between the processor and the network, we reverse the order of the destination address bits; for example, in an 8 x 8 MIN system, 3 (011) is mapped to 6 (110), and 4 (100) is mapped to 1 (001). The consecutive requests 0,1,2,3,4,5,6,7 are then mapped to 0,4,2,6,1,5,3,7. The data are also distributed according to the bit-reverse order in memory modules. By doing this, we have not changed the network performance of independent requests. It is the same as that without address mapping on uniform traffic, since the destination addresses are generated randomly and are distributed to each destination with equal probability. No matter with which mapping function the addresses are mapped, they are still randomly distributed.
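As a small illustration of this mapping (a sketch in C, not the dissertation's own implementation), the n-bit destination address is simply read in reverse bit order:

    #include <stdio.h>

    /* Bit-Reverse mapping: the n-bit address d_{n-1}...d_1 d_0 is
       rewritten as d_0 d_1 ... d_{n-1}. */
    unsigned bit_reverse(unsigned addr, int nbits)
    {
        unsigned mapped = 0;
        for (int b = 0; b < nbits; b++)
            if (addr & (1u << b))
                mapped |= 1u << (nbits - 1 - b);
        return mapped;
    }

    int main(void)
    {
        /* For an 8 x 8 MIN (3 address bits), 0..7 map to 0,4,2,6,1,5,3,7. */
        for (unsigned d = 0; d < 8; d++)
            printf("%u -> %u\n", d, bit_reverse(d, 3));
        return 0;
    }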

The Bit-Reverse mapping scheme can reduce the effect of CR traffic for a unit-stride access. After the mapping, the first routing bit of successive requests becomes a "0" followed by a "1" alternately, so any two successive requests from a processor will have unoverlapped routes, except that they are received from the same input port. Also, if two consecutive requests enter a switch at the first stage, then the DIR is very short (only one cycle). If sequences arrive synchronously at the two input ports, a routing conflict arises, but no conflict other than the initial one can occur in the first stage. As a result, both the output ports are kept busy forwarding data to the next stage. The usage of the two links from the same switch in the first stage is therefore well overlapped.

The main drawback to the Bit-Reverse mapping scheme is that Effect (1') becomes prominent in the latter stages as the DIR increases there. Since the round robin priority scheme is used to resolve insertion conflicts, all the messages that route to the same destination will cluster together at the stages approaching the destination. In other words, the blockage in the first stage of the network will be removed in order for requests to enter the network, but these then get congested in the later stages. In spite of this deficiency, we expect that the advantages of the Bit-Reverse address mapping will lead to improvements in the network performance. This will be confirmed in Section 4.4.

4.2.2 Dynamic Priority Scheme


In the previous sections, we observed that the round robin priority scheme is used to select which of the two conflicting messages is routed to the output port. In order to reduce these effects, we propose to use a dynamic priority scheme to resolve the insertion conflicts.

Adhering to the conventional switches with no address mapping mechanism at the processor-network interface, we will now give preference to a message coming from a particular input port in the cases when successive input messages are trying to route to the same output port. With Dynamic Priority, any input register can initially be given the priority, and it will keep this priority as long as it continues having messages in its input register in subsequent cycles. Upon resolution of insertion conflicts the priority circuit will indicate the register which has the current priority. If at any point in time the input register holding the priority has no message but the other one does, the priority circuit will revert to reflect this change of traffic.
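The per-switch arbitration this implies is small; the following is a minimal sketch (in C, for illustration only, not the dissertation's circuit) of the decision rule just described, where has_msg[i] indicates whether input register i holds a message:

    typedef struct {
        int priority_port;   /* input port (0 or 1) currently holding priority */
    } dyn_priority_t;

    /* Winner of an insertion conflict this cycle: the port keeps priority
       while its register is non-empty; otherwise priority reverts to the
       other port when that one has traffic. */
    int arbitrate(dyn_priority_t *dp, const int has_msg[2])
    {
        int p = dp->priority_port;
        if (!has_msg[p] && has_msg[1 - p])
            p = 1 - p;
        dp->priority_port = p;
        return p;
    }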

Since the insertion priority is given to a particular input, if the next arriving message is to be routed to the same output port, the first half of the CR sequence from that input can pass quickly without being blocked. Therefore, the second half of the CR sequence can appear at that input earlier and also pass through the second output link of the switch earlier. That is, the overlapping of the link usages from the same switch is higher and Effect (1') is reduced. Moreover, since the sequence of messages to the second and subsequent stages remains in a consecutive order (i.e., not mixed with the sequence issued by other processors), the DIR length is halved in that stage. In our example, the sequence remains 000, 001, 010, 011 when it arrives at the second stage, and the DIR at this stage is two. Thus, Effect (2') can be reduced in the second stage and subsequent stages.

An intuitive drawback of this approach is that the unfair message-passing would cause uneven responses to the processors; that is, unfair services would be perceived by the processors. However, if all the processors in the network are performing vector accesses, this unfairness within each switch element does not have an impact on the performance of the overall system. Moreover, in uniform traffic this unfairness is not present, mainly because the DIR of any stage in this type of traffic is on the average two.


4.2.3 The Combined Approach

The hybrid model which combines both the Dynamic Priority scheme and the Bit-Reverse mapping offers the same benefits as the ones described above, but with one added benefit. Observe that the Dynamic Priority allows for rapid passage of messages in the early stages, but in the latter stages it is still almost round robin since the DIR in those stages is not long. In Bit-Reverse mapping, however, the DIR becomes larger in the last few stages, and the Dynamic Priority can alleviate this in the same manner as described above, but now effective in the last stages, where the Bit-Reverse mapping needs it most. We therefore expect this scheme to have the advantages of both.


4.3 Effects of Spatial Distribution on CR Pattern


The spatial distribution of a CR access is determined by the stride and the bank offset, which can be of paramount importance to the performance of the memory access. The stride of an access refers to the constant offset between consecutive requests. For example, a processor which starts at module 0 and accesses with a stride of 2 will send its requests to the even-numbered memory banks. Hence it can be seen that the stride not only affects the order in which memory modules are accessed but also the number of distinct memory modules which are accessed.

Bank offset can be an important factor in the performance of CR accesses. Its effects can be seen by observing that the DIR of the sequences depends highly on the sequences which arrive at the input ports of a switch. The most restrictive case is when both the sequences are identical, which makes the DIRs at both inlets of the same length and causes the least amount of overlap. When the sequences differ by just one module, some overlap is created, reducing the Effect (1') mentioned above.

In some vector computers the stride is an important consideration. Some vector computers (the Cyber-205, for instance) allow only a stride of 1. These types of computers use "gather" methods to create temporary vectors which are contiguously stored. Most vector computers can process strides other than 1, and in our processor model we assume the processors can handle strides larger than 1. Stride accesses other than 1 are not unusual. In fact, a number of numeric array operations and numeric computations in loops involving array accesses contain loop iteration variables which are incremented by numbers other than 1.

In Figure 4.1 a code segment is given where the participating processors are accessing every other element of an array.

Figure 4.1. Stride-2 Access
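Since the original listing did not survive extraction, the following is a hedged reconstruction in C of the kind of loop Figure 4.1 shows; the fetch() helper is an illustrative stand-in for a global-memory load, not part of the original figure.

    extern void fetch(const double *addr);   /* illustrative global-memory load */

    /* Each participating processor touches every other element of a shared
       array, so only the even-numbered memory modules receive requests. */
    void stride2_access(const double *a, int n)
    {
        for (int j = 0; j < n; j += 2)       /* access stride is 2 */
            fetch(&a[j]);
    }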

In order to see the effect of the stride on a network with conventional switches, we will assume the same type of scenario as described above. We will consider the activities in a switch in the first stage when the following CR sequence arrives simultaneously at each of its input ports. Assuming the stride to be 2 and the access to start from memory module 0, we then observe at a first-stage switch the sequence with destinations 000, 010, 100, 110, 000, 010, 100, 110 arriving consecutively in an 8 x 8 system. It can immediately be observed that only the even-numbered modules are addressed. The DIR of this sequence for the first stage is 2, half of what it would have been in the case of a stride-1 access. The possibility of insertion conflicts is still high due to the length of the DIR at both inputs (this is particularly the case for larger systems). The next observation to be made is that the DIR for the second stage is exactly one. This is because the stride-2 access causes the second least significant bit to toggle. This is consistent with the findings of the unit-stride example in Section 4.2.1.


In general, a stride-i access can deteriorate network performance for the following reasons (the first two are a reformulation of the two effects stated in Section 4.1).

* Effect (1) The usages of the two links connected to the two output ports of a switch are not fully overlapped, i.e., resources are not fully utilized. In essence, this underutilization may occur at different stages, depending on the stride and the size of the network.

* Effect (2) The length of the DIR depends on the stride as well as on the size of the network, and is long while traversing stages which do not use the bit toggled by the stride (henceforth referred to as the stride-bit). In the stage which uses the stride-bit, the DIR is exactly one, and it doubles with each subsequent stage. If two input sequences overlap with each other, the length of the DIR has an adverse effect on the performance of the CR traffic.

* Effect (3) A stride-i access is in essence a special kind of multiple hot-spot pattern, since only n/i memory modules will be accessed (assuming there are n memory modules and that i divides n evenly). In the worst case, all the traffic is directed to only those modules.

Strides which are not powers of 2 do not have the deteriorating characteristics which are described by the effects above. The primary reason for this is that the length of the DIR of such accesses is not as long as for accesses of powers of 2.


4.3.1 Dynamic Priority and Stride Based Mapping Revisited

The Dynamic Priority scheme as described above can still be effective with stride accesses other than 1. It is well understood that the most restrictive case is the stride-1 access (where the DIR is the longest), and hence that type of access can benefit the most from the Dynamic Priority. However, Dynamic Priority still offers the same benefits to accesses of other strides, though the underutilization is perceivably less, due to their shorter DIRs.

The Bit-Reverse mapping scheme described in Section 4.2.1 did not take the stride into consideration, but rather assumed it to be 1. The reason why this approach is not sufficient here is that the method was meant to spread the requests of a stride-1 access, which accesses all the modules in a consecutive fashion. Strides other than 1 cannot take advantage of this characteristic, and hence a new mapping is proposed which generalizes the Bit-Reverse mapping previously described.

We will show by means of a simple example why the original Bit-Reverse mapping is not adequate for strides other than 1. Suppose we had the sequence of requests 000, 010, 100, 110 (access stride is 2). Using the original Bit-Reverse mapping scheme would map these to addresses 000, 010, 001, 011 respectively. These newly mapped addresses all lie in the lower half of the memory modules. With higher numbered strides (powers of 2), the mappings will concentrate in an even smaller corner of the address spectrum.
Let the n-bit destination address be denoted d_{n-1} d_{n-2} ... d_0, where d_0 is the least significant bit. The original motivation of the Bit-Reverse scheme was to split the routes of successive requests in order to allow more overlap in the initial stages of the MIN. By applying the Bit-Reverse mapping, the DIR of successive requests became 1 (in the first stage) and gradually grew as the stages approached memory. In order to achieve the same results with the sequence generated by a stride of 2^i, the reversal of bits should involve only bits d_{n-1} d_{n-2} ... d_i, and the remaining bits d_{i-1} d_{i-2} ... d_0 should stay the same. This is the general stride-based mapping family of which the Bit-Reverse is a particular case (for stride 1).
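A minimal sketch in C of this stride-based mapping for a stride of 2^k (an illustration, not the original implementation): the low-order k bits are preserved and the remaining high-order bits are reversed among themselves; k = 0 reduces to the plain Bit-Reverse mapping.

    unsigned stride_map(unsigned addr, int nbits, int k)
    {
        unsigned low  = addr & ((1u << k) - 1);  /* bits d_{k-1}..d_0, kept */
        unsigned high = addr >> k;               /* bits d_{n-1}..d_k */
        unsigned rev  = 0;
        for (int b = 0; b < nbits - k; b++)      /* reverse the high field */
            if (high & (1u << b))
                rev |= 1u << (nbits - k - 1 - b);
        return (rev << k) | low;                 /* e.g., 010 -> 100 for k = 1 */
    }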

Using this stride-based mapping scheme the sequence 000, 010, 100, 110 gets mapped to 000, 100, 010, 110. In contrast to the mappings created by the Bit-Reverse mapping, the newly acquired destinations are scattered in the upper half as well as the lower half of the memory modules, instead of concentrated in a corner of the address spectrum. Because the addresses remain well distributed, the network performance of independent requests has not changed. It is still the same as that without address mapping, by the same argument as was given in the Bit-Reverse discussion.

The general stride-based scheme can reduce the effect of any CR pattern because it reduces the two effects described above. The drawback of the scheme is still that the DIR increases in the latter stages. Furthermore, Effect (3) has not been affected by the mapping scheme. We shall see later that Effect (3) becomes the predominant reason that stride-based mapping alone is not sufficient as the stride increases.



4.3.2 Skewed Storage Schemes

Interleaved memory works well when the reference sequences address all the memory modules. Strides other than one are quite common in a multiprocessor vector processing environment, where each processor will generate highly correlated address sequences [HARP87]. The interleaved system will suffer under strides other than one because references will not be distributed to all memory modules.

Using any of the stride-dependent schemes described above will not increase the number of distinct memory modules being accessed (Effect (3)). Skewing schemes have been proposed to eliminate the conflicts which arise from parallel access due to strides greater than one. It has been proven [BUDN71] that there is no single skewing scheme which allows conflict-free access to a vector using all strides.

There are several classes of skewing schemes [SHAP78, WIJS85], but the scheme which is used in our experiments is based on a function which maps the ith element of a vector to memory module (i + ⌊i/N⌋) mod N, where N is the number of memory modules and ⌊i/N⌋ denotes the largest integer less than or equal to i/N. Table 4.1 illustrates the skewing scheme in an eight-module system.
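As a minimal sketch (in C, for illustration; not the dissertation's implementation), the mapping of vector element i to a module is:

    /* Skewed storage: element i of the vector is stored in module
       (i + floor(i/N)) mod N; C integer division supplies floor(i/N).
       For N = 8, elements 8..15 land one module to the right of
       elements 0..7, i.e., each row is rotated by one place. */
    unsigned skew_module(unsigned i, unsigned N)
    {
        return (i + i / N) % N;
    }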

In order to see the relationship between the stride (S) and the number of memory modules (M), we consider the case in which both the stride and the number of memories are powers of 2 (S = 2^s and M = 2^m). We will consider two cases: strides for which there is at least one access per row (for S < M, or s < m), and strides for which there is at most one access per row (s >= m).


For the following discussion, we assume that the initial access of the vector is to memory module 0; there is no loss of generality in this assumption. Assuming an interleaved memory scheme, the sequence of references is 0, 2^s, 2·2^s, ..., and after 2^(m-s) references the sequence repeats. In the interleaved scheme, the 2^(m-s)th access falls in the same module as the initial access. This obviously causes a collision with the initial access. However, if the current row is skewed relative to the former row, then rather than referencing the same modules as the former row, references are made to modules adjacent to the corresponding modules in the former row. In other words, to allow M accesses to occur before a conflict occurs, it is necessary to skew 2^s rows relative to each other. For any stride 2^s with s < m, a skewed storage scheme would consist of circularly rotating each row r by r mod 2^s places relative to its original location in the interleaved scheme.

In the case where there is at most one access per row (s >= m), consecutive elements of the vector are 2^(s-m) rows apart. In an interleaved scheme all references fall in the same module, which causes the most performance degradation. Rotating the rows can alleviate this, but the pattern of rotation is different: blocks of rows are rotated relative to preceding blocks. Blocks of 2^(s-m) rows are rotated in unison, which is precisely the number of rows between consecutive elements. The rotation places consecutive elements in adjacent modules. Over a period of M references, each module is referenced exactly once.

Table 4.1. Skewed Storage Scheme

The general skewed storage scheme is based on two parameters:

1. The maximum rotation a row can have relative to the first row (r_max). For s >= m, r_max is always M, and for the s < m case r_max is given by 2^s; in general, r_max = min(2^s, M).

2. The amount of memory to be rotated as a single block. In the s < m case, this is always one row; otherwise it is equal to 2^(s-m) rows.


The efficiency of any skewed storage scheme is measured in terms of its ease in generating addresses, and the number of possible conflict-free accesses it can provide (a parallel access is said to be conflict free if it can be accessed in a single memory cycle, i.e., without conflicts).

In the context of a multistage interconnection network, a skewed storage scheme will improve the performance of a CR traffic pattern, mainly because Effect (3) above is alleviated. However, note that there is still underutilization of the links: for a stride-2 access in an interleaved memory the reference pattern is 000, 010, 100, 110, 000, 010, 100, 110. In a skewed memory storage the reference pattern becomes 000, 010, 100, 110, 001, 011, 101, 111. It can be seen from this simple example that the output links will be underutilized in certain portions of the access. If we apply the stride-dependent Bit-Reverse mapping, we get the sequence 000, 100, 010, 110, 001, 101, 011, 111 (or 0,4,2,6,1,5,3,7). Interestingly enough, this is the same pattern which we would get when applying the Bit-Reverse mapping scheme onto a unit-stride access starting at module 000. From the above we will then expect similar performance of a Bit-Reverse mapping scheme of a stride-1 access in an interleaved memory and a stride-dependent mapping scheme of stride-2 in a skewed memory.

In our investigations, we assume a vector is only accessed with a single stride. Even if vectors are not always accessed with a single stride, it is by far the most common case. An example in which a vector is not accessed with a single stride is the well-known row/column access pattern found in matrix manipulation routines.

4.4 Performance Evaluation of Forward Network

In order to measure the performance of the system under CR traffic and to evaluate the effectiveness of our schemes, we have conducted a simulation study of the Dynamic Priority and Bit-Reverse (and stride-based) schemes in a multistage interconnection network. The simulation model used a 64 x 64 buffered Omega network, larger than the 8 x 8 examples above, with a queue size of eight. The reason the size of eight was chosen is that in [KRUS83] it was shown that a queue size of eight could model an infinite queue size, so that blockage caused by a full queue can be almost eliminated.

A processor has at most one outstanding request and stops issuing requests when the corresponding input register of the first stage of the MIN is busy. Once the processor starts issuing CR messages, it will generate one every clock cycle if it can insert it into the network. This dependency of subsequent messages is the main characteristic of the CR traffic.

The time at which the processors start their vector access is also varied, from synchronous to asynchronous. To study the spatial effect caused by the bank offset, we shall distinguish between two types of access: CR sequences starting with the same bank offset, and CR sequences with bank offsets chosen randomly for each processor. The spatial effect caused by different stride values is investigated in conjunction with stride-dependent mapping and skewing schemes. Finally, we also study vector accesses under an idle network before the access and under an initial load of uniform traffic.


The code which implements the CR concurrent fetch is typical (Figure 4.2). Each processor is assumed to run the exact same code; this code is assumed to be pre-scheduled (to minimize the effects of scheduling overhead). We assume that a single vector load is made, and that no NOPs are inserted between fetches.
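As a minimal sketch (in C, assuming a hypothetical issue_request() network-interface call, not the original figure's code) of the concurrent fetch each processor runs:

    extern void issue_request(int pe, int module);  /* hypothetical interface */

    /* One pre-scheduled vector load: one element request per cycle, no NOPs
       between fetches; with unit stride and the same bank offset, requests
       walk through consecutive memory modules. */
    void cr_concurrent_fetch(int pe, int vector_len, int num_modules)
    {
        for (int j = 0; j < vector_len; j++)
            issue_request(pe, j % num_modules);
    }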

Figure 4.2. CR Concurrent Fetch

The main measures which we took from simulations using the CR traffic model are the message delay incurred on the average for each message of the vector sequence, the throughput (number of messages which reach memory per cycle), and the average sequence (vector) delay, which is defined as the average number of cycles it takes for all the processors to complete their vector access in the forward network. It represents the average of all the worst case completions over all iterations.

These measurements conform to a processing model in which a number of processors, N, participate in a barrier synchronization. As processors reach the barrier they are suspended and wait until the last one finishes. The barrier is the most restricted form of synchronization. Thus, a good measure of performance is the worst case execution of each of the processors; this is an indication of how long the other processors must wait before they can proceed. In our model, the execution is the issuing of a CR sequence, and the average sequence delay, as defined above, is a measure of the average time the processors need to spend during the barrier synchronization.

Figure 4.3. Processing Model

In each case, a number of 200 runs were collected per simulation point (measurements were taken at inter-start periods of 0.0, 5.0, 10.0, 15.0, and 20.0). The comparison to the optimum is also given. The model for this section assumed no memory model, i.e., memory modules had infinite-length input buffers. The optimum vector delay in every graph is also provided to illustrate the point; the optimum vector delay is one in which it is assumed that there are no conflicts in the network.


4.4.1 Dynamic Priority and Bit-Reverse Mapping Results

Figure 4.4. Same Bank Offset/Stride 1 - No Initial Load (average vector delay; vector = 256, queue = 4; No Priority/No Bit Reverse, Priority/No Bit Reverse, No Priority/Bit Reverse, Priority/Bit Reverse; vs. inter-start period)

The results presented in this section assume the CR sequence to be accessed with unit stride. Our simulation primarily focused on the relative performance of the proposed schemes against the conventional design. In the text

which follows we will refer to the round robin strategy with no memory mapping as

the conventional design. Measurements were collected for the conventional design, the

Dynamic Priority, Bit-Reverse, and the hybrid scheme.

The time that each processor

begins to issue its CR requests is distributed within the inter-start period.

The simplest scenario is the one in which there is no initial load on the system

and the processors issue CR sequences with the same bank offset.

Figure 4.4 shows

the average vector delay for these four schemes under these assumptions.

From this

figure we can conclude that the Dynamic Priority scheme gives the best results if the

processors start synchronously. In the asynchronous case this scheme deteriorates, but the combined scheme performs slightly better than the Dynamic Priority alone.



Figure 4.5. Random Bank Offset/Stride 1 - No Initial Load (average vector delay; vector = 256, queue = 4; vs. inter-start period)

In all cases our proposed schemes yield better performance than the conventional design.


It is of importance in which sequence the allocated vector is being accessed, since this determines the sequence of requests issued by a processor and also the number of conflicts. Our simulation has concentrated on CR sequences with the same bank offset; for completeness we will show how our schemes would perform under accesses starting at different memory modules. This type of access causes fewer conflicts, and its performance can be seen in Figure 4.5. In this case, the hybrid scheme yields the best results while the individual schemes perform alike. It can also be seen that all the proposed schemes outperform the conventional switch design.

Figure 4.6. Same Bank Offset/Stride 1 - Uniform Load for 500 Cycles (average vector delay; vector = 256, queue = 4; No Priority/No Bit Reverse, Priority/No Bit Reverse, No Priority/Bit Reverse, Priority/Bit Reverse, and optimum; vs. inter-start period)

The same bank offset access where the CR traffic is preceded by a uniform traffic period of 500 cycles at a request rate of 0.5 is shown in Figure 4.6. The initial load has a larger effect on the performance of the synchronous Dynamic Priority scheme, since this scheme is more dependent on the network being idle before the access. The hybrid scheme shows the most stability, as can be observed from Figures 4.4 and 4.6, where there is no change in the curve.

We also obtained the case in which the same initial load precedes a CR access starting from different bank offsets. The results are shown in Figure 4.7. Remarkably, the initial load has no effect on the average vector delay in the case of an access with different bank offsets. The reason for this is that the initial load makes the DIR smaller by interfering with the vector sequences.




Figure 4.7. Random Bank Offset/Stride 1 - Uniform Load for 500 Cycles (average vector delay; vector = 256; vs. inter-start period)

Another factor which we measured is the throughput under the CR traffic. In Figure 4.8 it can be seen that the throughput obtained under synchronous access using the Dynamic Priority scheme is the highest.

Figures 4.8 and 4.9 show an interesting property of the Dynamic Priority scheme under synchronous access with CR sequences having the same bank offset: irrespective of the length of the vector, after an initial setup time (different for each processor), each processor will deliver messages with the same message delay. Once the first request reaches memory, requests keep arriving at a constant rate of one per cycle. This indicates that after the bottleneck is resolved in the early stages, the pipeline of requests is operating at its fullest capacity.


Figure 4.8. Throughput for Dynamic Priority and Round Robin (message throughput vs. clock cycles; Dynamic Priority with synchronous access, Dynamic Priority with inter-start time = 10, and conventional switch with inter-start time = 10)


The first message arrives at memory in a total of s + 1 cycles, where s is the number of stages of the network. At each subsequent cycle there is one more processor whose first request arrives. If the vector is at least as long as the number of memory modules, this constant will increment continuously until all processors have one arrival per cycle. It starts decrementing at a constant rate of one processor per cycle once the first processor finishes its vector. If the vector is shorter than the number of memory modules, the curve still has the same shape, but some processors will have finished their entire vector even before others have begun. Figure 4.9 depicts the case in which a vector larger than the number of memory modules is accessed in a 64 x 64 network.

From this the average vector delay is given by:

(1/N) Σ_{i=1}^{N} (s + v + i − 1) = s + v − 1 + (N + 1)/2

where s is the number of stages in the MIN, N is the number of processors in the network, and v is the number of elements in the vector.
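As a quick check of this expression (under the assumption above that the processor whose requests begin arriving ith completes its vector at cycle s + v + i − 1), the average over the N processors is

(1/N) Σ_{i=1}^{N} (s + v + i − 1) = s + v − 1 + (1/N) · N(N + 1)/2 = s + v − 1 + (N + 1)/2,

using Σ_{i=1}^{N} i = N(N + 1)/2.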

From the simulation results it is clear that the CR traffic benefits from the Dynamic Priority scheme and suffers considerably under the conventional switch design. This scheme gives continual priority to one port until the requests from that port are exhausted. Thus the utilizations of the links can be overlapped, and the overall turnaround time for all the processors is reduced compared to that of the round robin strategy.

Figure 4.10. Effect of Buffer Size on Message Delay (average message delay; effect of buffer size on Dynamic Priority; vs. inter-start period)

The size of the buffers inside the switches does not have a great impact on the overall performance of the Dynamic Priority scheme. As Figure 4.10 shows, for a vector of length 256 the average message delay degrades only slightly as the buffer space is reduced. The improvement in the average vector delay only becomes significant in the asynchronous case (as can be seen in Figure 4.11). However, even with a small amount of buffer space inside a switch, the proposed schemes perform well.


Figure 4.11. Effect of Buffer Size on Vector Delay (average vector delay; vector = 256; buffer lengths 4, 6, 8; vs. inter-start period)


4.4.2 Dynamic Priority and Stride Based Mapping Results

In this section, results obtained on CR sequences accessed with strides other than unit stride are discussed. We have limited our experiments to CR sequences with the same bank offset and no load on the network prior to the CR access. The degradation caused by non-unit stride access, as described in Section 4.3, can clearly be seen in Figure 4.12.


From the same figure it can be observed that even for the conventional switch design, a skewed storage scheme is able to improve the performance of several non-unit stride accesses. It is not surprising that the disparity between different allocation schemes (interleaved vs. skewed) gets larger as the access is made with a larger stride; a larger stride also means that fewer modules are accessed in an interleaved memory.


The effect of the stride and the evaluation of the stride-dependent mapping scheme in an interleaved memory can be seen in Figure 4.13. By merely adopting the stride-dependent mapping, an adequate improvement over the original Bit-Reverse mapping in an interleaved memory can be obtained. An important observation to be made from these results is that even though the number of modules accessed remains the same (so Effect (3) is not changed), the distribution of the requests itself can cause the improvement.

When Bit-Reverse mapping is performed without taking the stride into account (i.e., assuming unit stride), all the mapped addresses will be in one corner of the address spectrum, causing a localized multiple hot-spot area. The stride-dependent mapping distributes these highly requested addresses, and by doing so the requests are also distributed in these directions, causing less contention for similar MIN links.

In the event that no skewed storage is available, the best route to take is still to

employ the Dynamic Priority.

This can

be concluded from Figure 4.14,

where for

a stride 4 access the Dynamic Priority scheme is certainly the one which yields the

lowest average vector delay.

To be complete, we provide the graph which shows that the uniform traffic has the same message delay in all four schemes; compare it to the minimum message delay in Figure 4.15.

Figure 4.12. Conventional Scheme - Interleaved vs. Skewed Storage (average vector delay; vector length = 128; interleaved and skewed storage at strides 1, 2, 4, with optimum; vs. interstart period)

Figure 4.13. Stride Dependent Mapping in Interleaved Memory (vector length = 64)

Figure 4.14. Average Vector Delay in Interleaved Memory (vs. interstart period)

Figure 4.15. Average Message Delay under Uniform Traffic (No Priority/No Bit Reverse, Priority/No Bit Reverse, No Priority/Bit Reverse, Priority/Bit Reverse, and optimum)




4.5 The Effects of the Reverse Network on CR Traffic

The overall system performance can be measured in terms of the message throughput which can be achieved by the MIN. In practice, the message throughput is not only determined by the bandwidth of the network but also by its latency. A processor may not be able to place a new request until it has received the response to a previous one. In the case of CR accesses one might consider three cases upon which a new CR sequence can be issued:

* As soon as the entire previous CR sequence has been issued, with no need to wait for its return.

* As soon as the first message of the previously requested CR sequence, or the element which caused the processor to issue the sequence (for example, in the case of a cache miss), has returned.

* As soon as the entire previous CR sequence has returned from memory.

These cases have an impact on the utilization (the processor might not be able to do anything useful until the requested message has returned).


4.5.1 Processor Model

For the purpose of the discussion in this section, we assumed the following processor model:

* The CR sequences will be accessed with the same stride, namely 1.

4.5.2 Memory Model

We assume the memory to be interleaved and that all requests to memory have identical latency (we do not distinguish between reads and writes), other than of course the individual latencies incurred due to the network conflicts. In reality, these requests result in distinct traffic patterns on the forward and reverse networks. Reads generate a request of one word which traverses the forward network but returns with two words over the reverse network. Writes, on the other hand, generate two words over the forward network, and a single word travels back over the reverse network to convey the success of the write to the processor. Though this asymmetric behavior may cause a difference in the performance of the two types of memory traffic, we will not consider this in our experiments.

Each memory module consists of four memory banks, but only one input buffer

and one output buffer.

After being serviced in a memory bank, messages contend for

the reverse network and are queued according to their finishing times in the memory

module (FIFO).


4.5.3 Network Model

It is of importance to study the behavior of the CR traffic in the forward as well as the reverse network, and to observe what improvements the proposed schemes can offer. All the permutations of the schemes are worthy of study, except one. We assumed the forward network to be an Omega network and varied the reverse network from an Omega network to a Baseline network. The consequences of using a different type of MIN, in particular as the backward network, are investigated and found to be of interest. The type of reverse network and the schemes used in both the forward and reverse network have a direct bearing on the performance obtained.


4.5.4 Performance Evaluation

The delay incurred per CR sequence is a direct measure of the processor utilization, since it is generally assumed that the processor idles as long as there are messages pending. In our experiments, the average delay incurred in the forward network and the average delay incurred in the reverse network were measured separately for comparison. Table 4.2 lists the parameters used in our model and Table 4.3 lists the measures obtained.

Table 4.2. Simulation Parameters

Parameter                               Value
Number of CR Sequences per Processor    1
Memory Latency                          4
Number of Processors                    64
Number of Memory Modules                64
Number of Banks per Memory Module       4
Buffer Size per Switch                  4

Table 4.3. Performance Measures

Average Forward Vector Delay
Average Reverse Vector Delay


Here, the average forward vector delay is taken to be the same as defined in Section 4.4. The average reverse vector delay, however, includes the memory latency. The sum of the forward and reverse vector delays results in the total average vector turnaround time. In this section, we will present the results obtained from the forward and the backward network in a segregated fashion in order to illustrate the usage of different backward network topologies.

In the following, the notation x/x/x is used to indicate the implementation of Bit-Reverse in the forward network, Dynamic Priority in the forward network, and Dynamic Priority in the reverse network, respectively. The values for x are E for Enabled, or D for Disabled. For example, E/D/E refers to the case where the forward network has the Bit-Reverse mapping implemented and the Dynamic Priority scheme disabled; the reverse network, on the other hand, has the Dynamic Priority enabled.

The performance of Dynamic Priority in the forward network depends on what

type of backward network is used (compare Figures 4.16 and 4.18).

In the first one,

Dynamic Priority schemes outperform all the other six combinations whereas in the

latter one the Hybrid schemes and the Dynamic Priority schemes seem to have similar

results. From these two figures it is also apparent that the Bit Reverse schemes have

different outcomes depending on the type of reverse network used.

We shall return

to this later.

Implementing Dynamic Priority in the reverse network can be beneficial as well, although the nature of the requests depends on what schemes were used in the forward network; sequences heading in the same general direction and sequences with long DIRs benefit most from this scheme. The nature of the returning traffic, which is generated at the memory modules and heads towards the processors, depends on the order in which the requests arrived at the modules.

The choice of multistage interconnection network, together with whether the Bit-Reverse mapping was used or not, has a big impact on the latency suffered by the CR traffic. Note that the schemes which implement Bit-Reverse in the forward network perform worse than the conventional scheme when the Omega network is used as the reverse network, but not in the case of the Baseline network (compare Figures 4.16 and 4.17 with 4.19).

The reason for this phenomenon lies in the mapping performed and the topology of the reverse network. The observations can be explained by means of Figure 2.1.

When Bit Reverse mapping is used in the forward network, the modules

to which the consecutive requests are mapped, are connected to the same switch in

the backward network (if an Omega Network is used as the backward network). Since

the contending messages are headed for the same processor, they will be contending

on the whole way back.

This same phenomenon will occur for all pairs of messages.

Any different backward network which does not connect the modules described

above in the first stage would remove this phenomenon, and the Baseline network is

an example of a network which meets this criterion.

Figure 4.16. Average Forward Vector Delay - Omega Reverse Network (Reverse = Omega, Vector Length = 128; vs. interstart period)

Figure 4.17. Average Reverse Vector Delay - Omega Reverse Network (Reverse = Omega, Vector Length = 128)

Figure 4.18. Average Forward Vector Delay - Baseline Reverse Network (Reverse = Baseline, Vector Length = 128; vs. interstart period)

Figure 4.19. Average Reverse Vector Delay - Baseline Reverse Network (Reverse = Baseline, Vector Length = 128)



4.6 Chapter Summary

In this chapter, the Consecutive Requests traffic pattern was discussed and the causes for network deterioration under CR traffic have been determined. Two methods were discussed to alleviate the poor performance of CR traffic in a MIN. The first method is a scheduling mechanism that would replace the round robin scheduling, namely the Dynamic Priority. The second method, Bit-Reverse mapping, is a mapping scheme with the purpose of moving the link contention to the latter stages of the network. It is shown that these schemes can improve the performance of the network under CR traffic considerably. In the following chapter, the issues involving the interaction of CR traffic with regular accesses are addressed.


In Chapter 4 the problem of access conflicts due to similar CR pattern accesses was studied. While this problem in itself merits the attention it received, a more common processing model is one in which processors access different types of data, resulting in a mix of interacting traffic patterns. Memory traffic generated by processors can be characterized by bursts of requests which may belong to either a vector or a scalar stream. The network performance in this case will depend on the rate of issue of each of these request streams, the type of access (traffic pattern) made by each of the processors, as well as the degree of interaction of these multiple streams. The objective is still that the requests issued are to be satisfied by memory. From the previous chapter it can be gathered that when left alone, CR traffic can be detrimental in MINs. The question now remains whether the performance of this type of traffic is made worse by the presence of non-CR traffic patterns. Some of these non-CR traffic types are, for example, uniform traffic and hot spot traffic. The reverse is absolutely of interest as well: whether the presence of CR traffic has a degrading effect on non-CR traffic. Non-CR traffic will be referred to as scalar traffic in the remainder of this chapter.


One example in which vector accesses are interspersed with scalar references can be drawn from a parallel algorithm known as the Gaxpy algorithm. In the shared memory Gaxpy computation, a vector z is computed as the sum of a vector y and the product of a matrix A and vector x. We assume throughout that A, x, and y reside in shared memory and that the shared memory multiprocessor has n processors. The following example is that of the partitioning of the n-by-n gaxpy problem z = Ax + y. For simplicity we assume that there are n processors and that each processor is allocated one element from each vector and one row of the matrix A. From the algorithm in Figure 5.1 it can be observed that each processor engages in the fetch of a scalar element y_i and two vectors x and A_i. The vector x is a shared read-only vector which all processors will load into their local memory. The vector A_i, which pertains to a row of the matrix A, is not shared, but different elements of the different rows may still reside in similar memory modules, causing potential access conflicts. (We assume that the matrix is stored in row-major order.) Note that after each processor completes the computation, it returns the value to the shared memory by a store.


Local Variables: x_local(1:n), z_local, a_local
for id = 1, n do in parallel
    load x into x_local              // vector load //
    load y(id) into z_local          // scalar load //
    for j = 1, n
        load A(id,j) into a_local    // vector load //
        z_local = z_local + a_local * x_local(j)
    end for
    store z_local into z(id)         // scalar store //
end gaxpy;

The remainder of this chapter is organized as follows. Section 5.1 describes the nature of the vector-scalar interaction. Section 5.2 depicts the model used to evaluate the performance of the vector-scalar interaction. Section 5.3 details the performance results obtained by implementing the Dynamic Priority and Bit-Reverse mapping schemes. Section 5.4 summarizes the findings in this chapter.

5.1 The Effects of Vector/Scalar Interaction

Both vector and scalar request streams can be adversely affected by one another when they interfere with each other for a long period of time. The switching elements of a MIN may experience unbalanced traffic, and in order to understand the nature of this interference we will consider the network behavior which is caused by elements of scalar streams and CR (vector) streams. This interference has the following characteristics:

1. A sequence of uniform and independent requests (scalars) generated at a source node interacts with a sequence of requests destined to consecutive sink nodes.

2. Consecutive requests are generated in a continuous fashion. Scalars, on the other hand, have a probability of being generated in a given cycle. The probability that a scalar reference will be made is also referred to as the scalar issue rate.


The effects and performance of uniform traffic in MINs have been studied in the literature. The detrimental effects caused by vector/vector interaction were the subject of the previous chapter, for which the Dynamic Priority and Bit-Reverse were proposed. The first one is an alternative to the round robin scheduling strategy; the second, the Bit-Reverse mapping, reverses the addresses for unit-stride accesses.


When vectors and scalars interact, it would seem likely that the scalar requests would be penalized the most. Uniform scalar accesses will find the paths which they must traverse to be mostly saturated with requests from vector streams. It is empirically found that when the percentage of processors actively accessing vectors is equal to the percentage of processors engaged in scalar requests, the volume of requests generated by the vector accesses is sufficient to congest the paths which are shared by the scalar requests.

Insertion conflicts in individual switches of the MIN can be handled in various ways. When the round robin strategy is used, vector requests and scalar requests will be interleaved and the consecutive nature of the vector requests will be altered. In larger networks this scalar interaction will cause vector sequences to be broken up, which randomizes the traffic pattern and eliminates the congestion normally accompanying nonuniform traffic patterns. It can hence be expected that the interaction of scalars with vectors can improve the performance of the vector access, at the cost of the scalar references.

Figure 5.1. Vector-Scalar Interaction Example

Consider for example the activities in a switch in the first stage when a CR sequence arrives at the upper input port and a scalar sequence arrives at the lower input port, with an arrival rate of 0.5 (one arrival every two cycles). For example, in Figure 5.1, the upper input port receives the CR sequence 000, 001, 010, ..., 111 (in an 8 x 8 system) in a continuous fashion, while at the lower input port there is an arrival of a random sequence of destinations. The result of a round robin strategy in this switch is sequences at the output ports which no longer exhibit a long DIR (which, as we conjectured from the previous chapter, is the main cause for the poor performance).


By implementing the Dynamic Priority scheme the issue of fairness becomes important. Since Dynamic Priority gives priority to a single port until that port is totally exhausted, the port at which the vector sequences arrive will undoubtedly be favored whenever the current priority-receiving port happens to be the one at which the vector sequence is arriving. Vector sequences, on the other hand, will benefit from the interaction due to the randomization effect, as was explained earlier.

The Bit-Reverse mapping has a specific purpose, which is that of dispersing consecutive accesses to non-consecutive locations. When this mapping is applied to uniform scalar accesses the result is another uniform set of accesses. The interference of these two types of requests, even after the mapping, will have no remarkable effect on one another. The scalar requests will be subject to the congestion caused by the volume of vector requests, as was described above, but that is not due to the mapping.


5.2 Vector/Scalar Interaction Models

For the discussion of our processing model we made the following assumptions (see Table 5.1). We are only considering the interaction in the forward network, and thus assume the memory bank buffers to be infinite. This way, memory can handle all the requests, and there will be no queueing in the forward MIN due to busy memory modules.

The simulation model then proceeds to issue vectors with equal length, unit stride, and with the same bank offset. The interacting processors are continuously either issuing vectors or scalars (following one of the request models described below). Each measured point of the simulation results represents the average of 200 runs.

Table 5.1. Assumptions of the Vector/Scalar Model

Simulation Parameters
No scalar caches
Infinite memory bank buffers
Scalar traffic is uniform
Vectors of equal length
Vector stride of one
Same bank offset
Interleaved memory
Forward network only

Three request models are considered: the Single Request Mode (SRM), the Dual Dynamic Request Mode (DDRM), and the Single Static Request Mode (SSRM).

The Single Request Mode model is followed by the Cray X-MP system, where vector instructions cannot be issued as long as scalar memory accesses are still in progress. The SRM model requires barriers to be placed for all processors participating. It also places the most stringent requirements on the network. In a way, the interaction of scalar and vector requests is not an issue because they will never interact. What is of essence in this model is the latency incurred during both the scalar burst as well as the vector burst, as this will be the decisive factor. For this reason, we shall not consider this model one in which the desired interaction is manifested.

In the Dual Dynamic Request Mode (DDRM) model processors can issue both vector and scalar streams, without having to wait for either to have completed before issuing the other. The third model, the Single Static Request Mode (SSRM), is one in which the processors are dedicated to one particular request mode, instead of being general purpose. For the duration of the experiment, a processor has a single request mode which remains unchanged.

The request mode affects the issue rate and the dependencies between requests from the same stream. Scalars are issued according to a burst rate and are independent of one another, whereas individual vector requests are dependent and are issued in a continuous fashion (CR traffic pattern). Additionally, the number of processors which are each issuing either scalar or vector traffic also has a direct bearing on the overall performance.


In the DDRM model, processors can request both vectors as well as scalars, but not simultaneously. A processor can start either a vector or a scalar request. The probability with which a processor will do one or the other should be a parameter to be varied. The duration of a scalar stream is drawn from a burst interval. The frequency of scalar requests during this scalar burst is decided by the scalar issue rate. Scalar bursts and vector bursts are interleaved, so a scalar burst is always followed by a vector access and a vector access is always followed by a scalar burst. The processing model assumes enough buffers are provided to have all requests outstanding which may still be in the network or are being serviced by memory. The end of a vector access is indicated by the issue of the last vector element, upon which another scalar burst is started. The end of a scalar burst is indicated by the system clock, and any of the remaining requests which are still pending in the system are allowed to finish. The scalar burst length determines the duration of interference, whereas the scalar issue rate directly affects the degree of interference.
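As a rough illustration, the following is a minimal sketch in C of DDRM request generation for one processor under the assumptions above; rand01(), draw_burst_interval(), issue_vector(), and issue_scalar() are hypothetical stand-ins for the simulator's random-number and network-interface routines, not code from this study.

    typedef enum { VECTOR_BURST, SCALAR_BURST } burst_mode_t;

    typedef struct {
        burst_mode_t mode;
        long clock, burst_end;
        int issued, vector_len;
        double scalar_issue_rate;
    } proc_t;

    extern double rand01(void);              /* uniform in [0,1) */
    extern long draw_burst_interval(void);   /* scalar burst length */
    extern void issue_vector(proc_t *p);     /* next CR element */
    extern void issue_scalar(proc_t *p);     /* uniform random target */

    /* One simulated cycle: a vector burst issues one CR element per cycle
       and ends with the last element; a scalar burst issues independent
       requests at the scalar issue rate and ends at the drawn clock time. */
    void ddrm_cycle(proc_t *p)
    {
        if (p->mode == VECTOR_BURST) {
            issue_vector(p);
            if (++p->issued == p->vector_len) {     /* last element ends burst */
                p->mode = SCALAR_BURST;
                p->burst_end = p->clock + draw_burst_interval();
            }
        } else {
            if (rand01() < p->scalar_issue_rate)    /* probabilistic issue */
                issue_scalar(p);
            if (p->clock >= p->burst_end) {         /* clock ends scalar burst */
                p->mode = VECTOR_BURST;
                p->issued = 0;
            }
        }
        p->clock++;
    }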

In the SSRM model, the request mode of each processor is fixed. In contrast with the DDRM model, a vector-requesting processor issues vectors which are not followed by scalar requests but instead are followed by an idle period. Scalar-issuing processors issue scalars according to a certain issue rate, and for simplicity we will assume that the issue rate is the same for all scalar-requesting processors. The interaction now is defined by the locations and the number of the scalar requesters with respect to the vector requesters. The simplest assumption to make is that the request mode is chosen in an independent and random manner, so the locations of these individual processors are random. Together with the scalar issue rate, the scalar/vector interaction has tangible parameters which can be varied and observed in the experiments we shall present.



5.2.1 Special Interaction


In this section we shall describe the traffic patterns which may arise from the DDRM model:

* Scalar burst length of 0. The interaction here degenerates to a vector/vector interaction where there are continuous vector references (there are no scalar references between vectors). The traffic generated resembles that of a single vector access.

* Scalar issue rate of 1. This results in the most severe interaction between scalar and vector references. The scalar references are issued continuously but in an independent fashion. They will serve primarily as a means to randomize the vector sequence.

Special traffic patterns which arise from the SSRM model are:

* Scalar burst length of 0 or issue rate of 0. The interaction becomes one in which vector/vector interaction takes place, but in addition to that, there are idle processors (the scalar requesters are idle).

* Scalar rate of 1. The interference caused by a continuous sequence of scalar references and a vector reference is similar to that of the DDRM model with scalar rate of 1. The vector sequence will be randomized by the interference of the uniform traffic.

* Percentage of scalar requesters of 0. This results in a vector/vector interaction model, such as was studied in the previous chapter. The vector sequences will be interleaved by idle periods.

5.3 Performance Evaluation

The simulation model used was a 64 x 64 buffered Omega network. The buffer capacity of the switches was held at 4 messages per port. We assume that a processor can continue issuing messages as long as the input register to the first stage is available. The model which we are investigating assumes that the processors in the system are executing parallel tasks which have been previously allocated. Processors execute and request scalars and vectors in an interleaved fashion. We assume that the vectors are of the same length and are accessed with the same stride. The scalar access burst length is selected from a uniform distribution. The time at which the processors start their access (vector or scalar) is also varied, from synchronous to asynchronous. We have limited the experiment to vector references which start with the same bank offset.


The model primarily performed comparison experiments between the conventional

switch and the Dynamic Priority and Bit Reverse mapping schemes.

The measures

which were of interest were the average scalar throughput and delay, and the average

vector throughput and delay.

The scalar (vector) throughput is defined to be the number of scalar (vector) messages which arrive at memory per cycle.

This chapter

will concern itself with the forward network only, and hence the memory is assumed

to be able to service all the requests.

The average vector delay is defined to be the

average number of cycles it takes for an entire vector to reach memory.

The results

obtained are then compared against the optimum, where the optimum is obtained by

assuming that there are no conflicts in the network.
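Concretely, these measures can be computed from arrival records as in the following sketch; the record formats and the pipelined form of the conflict-free optimum are assumptions made here for illustration.

    def throughput(arrival_cycles, total_cycles):
        # Messages that reached memory, averaged per simulation cycle.
        return len(arrival_cycles) / total_cycles

    def vector_delay(issue_cycle, element_arrival_cycles):
        # A vector has reached memory only when its last element has.
        return max(element_arrival_cycles) - issue_cycle

    def optimum_vector_delay(vector_length, n_stages):
        # Conflict-free reference: elements pipeline through the network
        # one per cycle, assuming unit delay per stage.
        return n_stages + vector_length - 1

    optimum_vector_delay(64, 6)   # a 64 x 64 Omega network of 2 x 2
                                  # switches has 6 stages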


Performance of SSRM Model

We assume that, per experiment, the number of processors which request vectors is fixed amongst all vector requesting processors. Unless explicitly given, the parameters used in the SSRM experiments are assumed to be the ones given in Table 5.2.

Giving scalar references a higher priority will result in a degradation of vector performance. This can be observed in Figure 5.2, where four schemes used scalar priority; as the scalar issue rate increases, the delay incurred by vector sequences gets worse.

The curves in Figure 5.3 reflect this same information, but show the scalar delay for the same schemes as in Figure 5.2. As expected, the highest scalar delays are incurred by the scheme which did not give priority to scalar references.
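The scalar priority referred to here amounts to a simple arbitration rule at each switch output port: when scalar and vector messages compete for the same port, the scalar message is served first. A minimal sketch, assuming messages are tagged with their kind:

    def arbitrate(competing, scalar_priority=True):
        # competing: messages contending for one output port, in queue
        # order; each message is a (kind, payload) pair.
        if scalar_priority:
            scalars = [m for m in competing if m[0] == 'scalar']
            if scalars:
                return scalars[0]
        return competing[0] if competing else None

    arbitrate([('vector', 7), ('scalar', 3)])    # -> ('scalar', 3)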

The number of processors actively engaged in vector requests decides the degree of contention in the system. By giving scalar references a higher priority, all four schemes have more or less the same performance (see Figure 5.4), except when the vector load is low. In that case, the conflicts in the early stages can be removed effectively by using the Bit Reverse mapping; scalar requests then randomize the sequence in the latter stages, and together these provide an excellent turnaround time (see also Figure 5.5).
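For reference, the Bit Reverse mapping reverses the bits of the bank index, so that consecutive requests are spread across banks reached through different early-stage switches. The following is a sketch of the underlying index transformation, assuming the reversal applies to the 6-bit bank field of the 64-bank system.

    def bit_reverse(bank, bits=6):
        # Reverse the low-order `bits` bits of the bank index
        # (6 bits address the 64 memory banks).
        reversed_index = 0
        for _ in range(bits):
            reversed_index = (reversed_index << 1) | (bank & 1)
            bank >>= 1
        return reversed_index

    [bit_reverse(b) for b in range(4)]    # -> [0, 32, 16, 48]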

Table 5.2. SSRM Model Parameters

    Parameter                         Value
    ------------------------------    -------
    System Size                       64 x 64
    Vector Length                     64
    Buffer Size per Switch            4
    Access Stride                     1
    Scalar Issue Rate                 0.50
    Idle Interval                     100
    Ratio Scalar/Vector Processors    1:1


[Figure omitted. Plot: average vector delay versus scalar issue rate, System Size = 64, for No Dynamic Priority/No Bit Reverse, Dynamic Priority/No Bit Reverse, No Dynamic Priority/Bit Reverse, Dynamic Priority/Bit Reverse, Dynamic Priority/No Bit Rev/No Scalar Priority, and the Optimum.]

Figure 5.2. Average Vector Delay, SSRM Model


[Figure omitted. Plot: average scalar delay versus scalar issue rate for the same schemes as Figure 5.2.]

Figure 5.3. Average Scalar Delay, SSRM Model




[Figure omitted. Plot: average vector delay versus scalar issue rate, System Size = 64, with scalar priority, for No Dynamic Priority/No Bit Reverse, Dynamic Priority/No Bit Reverse, No Dynamic Priority/Bit Reverse, and Dynamic Priority/Bit Reverse.]

Figure 5.4. Average Vector Delay, SSRM Model (Scalar Priority)


[Figure omitted. Plot: average scalar delay versus scalar issue rate, System Size = 64, with scalar priority, for the same schemes as Figure 5.4 plus Dynamic Priority/No Bit Rev/No Scalar Priority.]

Figure 5.5. Average Scalar Delay, SSRM Model (Scalar Priority)



Performance of DDRM Model

The assumptions made are that each processor has an equal probability of starting either a vector or a scalar stream. Ideally, this would depend on the application; since most vectorized code does not contain many scalar accesses, the proportion of scalars with respect to vectors is perhaps slightly exaggerated. All vectors were assumed to be of the same length and accessed with the same stride. We assume that the vectors are being accessed from the same bank and that the scalar requests are uniformly distributed over the entire memory spectrum. The length of the scalar burst is a decisive factor in the number of vector accesses in the model, since the vector accesses are interleaved with scalar bursts. Unless otherwise specified, the assumptions made in the DDRM experiments are the ones listed in Table 5.3.

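These assumptions can be summarized in a small sketch of the per-processor stream selection; the names and structure below are illustrative only.

    import random

    N_BANKS, VECTOR_LENGTH, STRIDE = 64, 64, 1

    def next_stream(vector_start_bank=0, max_burst=100, issue_rate=0.5):
        # Each processor starts a vector or a scalar stream with equal
        # probability. Vectors share a common start bank; scalar requests
        # fall uniformly over the entire memory spectrum.
        if random.random() < 0.5:
            return [('vector', (vector_start_bank + i * STRIDE) % N_BANKS)
                    for i in range(VECTOR_LENGTH)]
        burst = random.randint(0, max_burst)   # burst length drawn from 0-100
        # Within the burst a scalar is issued each cycle with probability
        # issue_rate; idle cycles are simply omitted from the request list.
        return [('scalar', random.randrange(N_BANKS))
                for _ in range(burst) if random.random() < issue_rate]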
The results in Figures 5.6 and 5.7 were obtained from issuing vectors of length 64 with a scalar request rate of 0.5. The scalar burst length was uniformly drawn from an interval of 0-100 cycles. In this experiment, vector requests were not distinguished from scalar requests (i.e., no scalar priority was given).

Table 5.3. DDRM Model Parameters

    Parameter                 Value
    ----------------------    -------
    System Size               64 x 64
    Vector Length             64
    Buffer Size per Switch    4
    Access Stride             1
    Scalar Issue Rate         0.50
    Scalar Burst Length       0-100

[Figure omitted. Plot: average vector delay versus interstart period; Vector Length = 64, Scalar Burst 0-100, No Scalar Priority; curves for No Dynamic Prior/No Bit Reverse, Dynamic Prior/No Bit Reverse, No Dynamic Prior/Bit Reverse, Dynamic Prior/Bit Reverse, and the Optimum.]

Figure 5.6. Average Vector Performance for All Schemes

The scalar delays (Figure 5.7) were quite improved compared to the conventional scheme. Figure 5.6 confirms the fact that even with scalar interference, the Dynamic Priority and Bit Reverse schemes can outperform the conventional scheme, as was found in the previous chapter.

The scalar burst parameter not only has a direct bearing on the vector access, it also allows the study of special cases, such as continuous vector references, obtained by setting the scalar burst length equal to zero (hence no scalar references) and the scalar issue rate equal to one (which means the issue rates of scalars and vectors are the same). When the scalar issue rate is one, this can be seen as studying the temporal correlation of CR patterns.


[Figure omitted. Plot: average scalar delay versus interstart period; Vector Length = 64, Scalar Burst 0-100, No Scalar Priority; same schemes as Figure 5.6.]

Figure 5.7. Average Scalar Performance for All Schemes

[Figure omitted. Plot: average vector delay versus scalar burst interval (DDRM), No Scalar Priority; curves for No Dynamic Prior/No Bit Reverse, Dynamic Prior/No Bit Reverse, No Dynamic Prior/Bit Reverse, and Dynamic Prior/Bit Reverse.]

Figure 5.8. Effect of Scalar Burst Interval on Vector Delay, No Scalar Priority

[Figure omitted. Plot: scalar delay versus burst interval length (110.0 to 190.0 cycles shown); Vector Length = 64, No Scalar Priority; same schemes as Figure 5.8.]

Figure 5.9. Effect of Scalar Burst Length on Scalar Delay, No Scalar Priority

In Figure 5.8 the average vector delay for the conventional scheme becomes increasingly worse even under long scalar burst intervals. Long burst intervals reflect a large number of scalar requests and a long vector/scalar interaction period. The Dynamic Priority, Bit Reverse, and Hybrid schemes can offer a significant improvement in the vector performance. As was expected, the Dynamic Priority scheme favors long DIRs, which is not advantageous to the scalar requests, as can be perceived in Figure 5.9. The Bit Reverse and Hybrid schemes offer results similar to the conventional scheme due to the distributing effect of the Bit Reverse mapping.

The scalar issue rate depicts the volume of scalars injected into the system.

[Figure omitted. Plot: average vector delay versus scalar issue rate (DDRM); Vector Length = 64, With Scalar Priority; curves for No Dyn Prior/No Bit Reverse, Dyn Prior/No Bit Reverse, No Dyn Prior/Bit Reverse, and Dyn Prior/Bit Reverse.]

Figure 5.10. Effect of Scalar Issue Rate on Average Vector Delay

When the scalar issue rate is zero (the equivalent of having only vector interaction amongst the processors, with idle periods between consecutive vector sequences), it can be seen in Figure 5.10 that the conventional scheme is significantly worse than any of the proposed ones.

The scalar delay is slightly better in the proposed schemes as long as the issue rate is less than 0.5, but with higher rates the Bit Reverse and Hybrid schemes perform slightly worse than the conventional one. Due to the scalar priority, a higher scalar issue rate does not affect the average scalar delay dramatically.

Chapter Summary

The nature of the interaction between Consecutive Requests and scalar references has been examined in this chapter: scalar references randomize the CR sequences, causing more overlap of link utilization. Unless priority is given to scalar references, the scalar delay increases due to the volume of CR requests.