Enhancing multistage interconnection network performance in computer and telecommunications systems


Material Information

Title:
Enhancing multistage interconnection network performance in computer and telecommunications systems
Physical Description:
xi, 149 leaves : ill. ; 29 cm.
Language:
English
Creator:
Cheung, Sandra E., 1967-
Publication Date:
1993
Subjects

Subjects / Keywords:
Computer networks -- Evaluation   ( lcsh )
Telecommunication systems -- Evaluation   ( lcsh )
Data transmission systems   ( lcsh )
Telecommunication -- Switching systems   ( lcsh )
Computer and Information Sciences thesis Ph.D
Dissertations, Academic -- Computer and Information Sciences -- UF
Genre:
bibliography   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1993.
Bibliography:
Includes bibliographical references (leaves 142-147).
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Sandra E. Cheung.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001922334
oclc - 30495379
notis - AJZ8146
System ID:
AA00003236:00001

Full Text











ENHANCING MULTISTAGE INTERCONNECTION
NETWORK PERFORMANCE IN COMPUTER AND
TELECOMMUNICATIONS SYSTEMS









BY
SANDRA E. CHEUNG


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1993








ACKNOWLEDGEMENTS


First and foremost, I would like to thank my advisor Professor Yann-Hang Lee for
his guidance and infinite wisdom that made this work possible. His sound and
thorough words have guided me in many aspects of academic life.

I would also like to thank my other committee members: Professor C. Chow,
Professor R. Newman-Wolfe, Professor Davis, and Professor S. X. Bai.

I am deeply indebted to my brother and friend Professor Shun Yan Cheung, with-
out whose encouragement and support throughout all of my life, but in particular
during the past few years, this work would not have been possible.

A great number of people are responsible for instilling in me the passion to pursue
knowledge and truth. I would like to thank all of my friends, colleagues, professors,
and support staff who contributed to this experience. These include, but are not
limited to, Dr. F. D. Anger, R. Rodriguez, Dr. M. E. Bermudez, Dr. S. Sahni,
G. Haskins, Javier, Padmashree, Jin-Joo, Balaji, and Mario. Special mention goes
to the entire Happy Hour Gang for those sobering moments of truth.

I would also like to acknowledge my sister-in-law Stella, and my niece Nathanie,
for adding a new dimension to my life. Last, but certainly not least, I would like to
thank my father and my mother for their love and vote of confidence.

To them, I dedicate this work.














TABLE OF CONTENTS

CHAPTERS

1 INTRODUCTION

2 MINS IN COMPUTER AND TELECOMMUNICATION SYSTEMS
  2.1 Overview of Multistage Interconnection Networks
  2.2 Multistage Interconnection Networks in Large-Scale Multiprocessor Systems
  2.3 Multistage Interconnection Networks in Telecommunication Systems
  2.4 Chapter Summary

3 RELATED WORK
  3.1 Non-Uniform Reference Patterns
      3.1.1 Vector Interference
      3.1.2 Vector-Scalar Interference
  3.2 Path Setup Algorithms
  3.3 Fast Packet Scheduling
      3.3.1 Sorter Based Networks
      3.3.2 Input Queue Architectures
      3.3.3 Shared Queue Architectures
  3.4 Chapter Summary

4 THE CONSECUTIVE REQUESTS TRAFFIC PATTERN
  4.1 Consecutive Request Traffic Pattern
      4.2.2 Dynamic Priority Scheme
      4.2.3 The Combined Approach
  4.3 Effects of Spatial Distribution on CR Pattern
      4.3.1 Dynamic Priority and Stride Based Mapping Revisited
      4.3.2 Skewed Storage Schemes
  4.4 Performance Evaluation of Forward Network
      4.4.1 Dynamic Priority and Bit-Reverse Mapping Results
      4.4.2 Dynamic Priority and Stride Based Mapping Results
  4.5 The Effects of the Reverse Network on CR Traffic
      4.5.1 Processor Model
      4.5.2 Memory Model
      4.5.3 Network Model
      4.5.4 Performance Evaluation
  4.6 Chapter Summary

5 VECTOR/SCALAR INTERACTION IN MINS
  5.1 The Effects of Vector/Scalar Interaction
  5.2 Vector/Scalar Interaction Models
      5.2.1 Special Interaction Types
  5.3 Performance Evaluation
      5.3.1 Performance of SSRM Model
      5.3.2 Performance of DDRM Model
  5.4 Chapter Summary

6 SETUP AUGMENTATION PROCEDURES
  6.1 Optimal Path Setup
      6.1.1 Transportation Network Flow Problems
      6.1.2 Multicommodity Flows in MIN Transportation Networks
      6.1.3 Maximum Setup Example
  6.2 The Re-Attempt Parallel Path Setup Method
  6.3 ReAPPS with Distributed Synchronization
  6.4 Analysis of Path Setup Delay of ReAPPS Scheme
  6.5 Performance Evaluation of Re-APPS
  6.6 Chapter Summary

7 IMPROVING THE PERFORMANCE OF FAST PACKET SWITCHES
  7.1 The Design of Circular Window Control (CWC) Scheme
      7.1.1 Switch Architecture and Operation
  7.2 Analytic Models for the CWC Scheme
      7.2.1 Throughput Analysis

8 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH

















LIST OF TABLES

Skewed Storage Scheme
Simulation Parameters
Performance Measures
Assumptions of the Vector/Scalar Model
SSRM Model Parameters
DDRM Model Parameters
Setup Overhead for PPS, IPS and ReAPPS Schemes
Various Schemes under Uniform Traffic
Various Schemes under Permutation Traffic
Effect of Network Size on Setup Schemes
Schedulability Analysis


















LIST OF FIGURES

An 8 x 8 Omega Network
Large Scale Multiprocessor System Configuration
An Output Queued Packet Switch
An Input Queued Packet Switch
Stride-2 Access
CR Concurrent Fetch
Same Bank Offset/Stride 1 - No Initial Load
Random Bank Offset/Stride 1 - No Initial Load
Same Bank Offset/Stride 1 - Uniform Load for 500 Cycles
Random Bank Offset/Stride 1 - Uniform Load for 500 Cycles
Throughput for Dynamic Priority and Round Robin
Throughput of Dynamic Priority
Effect of Buffer Size on Message Delay
Stride Dependent Mapping in Interleaved Memory
Average Vector Delay in Interleaved Memory
Uniform Traffic Under Proposed Schemes
Average Forward Vector Delay - Omega Reverse Network
Average Reverse Vector Delay - Omega Reverse Network
Average Forward Vector Delay - Baseline Reverse Network
Average Reverse Vector Delay - Baseline Reverse Network
Vector-Scalar Interaction Example
Average Vector Delay - SSRM Model
Average Scalar Delay - SSRM Model
Average Vector Delay - SSRM Model (Scalar Priority)
Average Scalar Delay - SSRM Model (Scalar Priority)
Average Vector Performance for All Schemes
Average Scalar Performance for All Schemes
Effect of Scalar Burst Length on Vector Delay - No Scalar Priority
Effect of Scalar Burst Length on Scalar Delay - No Scalar Priority
Effect of Scalar Issue Rate on Average Vector Delay
A Three Commodity Flow Problem
Path Setup is Not a Single Commodity Problem
Input Pattern in Omega Transport Network
Embedded Synchronization Example Continued
Embedded Synchronization Example Continued
Embedded Synchronization Example Continued
Time Slot in a Nonbuffered Packet Switch Banyan Network
Transmission of Request in a 4-Stage Banyan Network
CWC Architecture Model
The Circular Window Control Scheme
An Illustrating CWC Example
Throughput of CWC Crossbar Switch Systems
Throughput of CWC Banyan Switch Systems
Mean Waiting Times of CWC Scheme vs. Input Queueing Scheme
Throughput of CWC Schemes under Uniform and HotSpot Traffic
Mean Waiting Time of CWC Schemes under Uniform and HotSpot Traffic
Throughput of CWC Scheme under Different Alternate Queue Selections
Throughput of CWC Scheme under Different Window Sizes















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree Doctor of Philosophy


ENHANCING MULTISTAGE INTERCONNECTION
NETWORK PERFORMANCE IN COMPUTER AND
TELECOMMUNICATIONS SYSTEMS


By


SANDRA E. CHEUNG


August 1993


Chairman: Yann-Hang Lee


Major Department: Computer and Information Sciences


Multistage interconnection networks (MINs) can provide an excellent tradeoff
between cost and efficiency. Their regular structure and multiple access points make
them not only amenable to cost effective VLSI implementation but also provide higher
throughputs than bus-based systems. Nowadays, MINs are used in both parallel
computer and telecommunications systems. Though the design goals in these two
areas are different, similar problems can be identified. Despite their advantages, the
internal blocking of these networks limits their throughput considerably, and is
aggravated by the size of the network. In enhancing the throughput of MINs it is
thus imperative to consider reducing the number of internal link conflicts.

In computer systems we observe the effects of the Consecutive Requests Traffic
Pattern (CR) in MINs. By identifying the causes for deterioration we are able to
propose two methods, the Dynamic Priority and the Bit Reverse schemes, which can
significantly improve the performance under CR traffic. In telecommunications
systems, the design goals are to grant as many connections as possible (in a
circuit-switched design) and to pass as many packets as possible in a conflict-free
manner (in a packet-switched design). We propose a new path setup scheme which
can improve existing parallel path setup schemes. To enhance the throughput of
input queued fast packet switches, the Circular Window Control scheme can be
implemented to schedule packets in a manner in which the probability of having
conflicts is greatly reduced. The methods and schemes presented in this thesis show
that with simple and cost effective enhancements, the performance of MINs can be
improved considerably.













CHAPTER 1
INTRODUCTION


Communication bandwidth is a key factor in limiting the performance of not
only high performance computers but also that of data communications networks.
In both types of systems, multistage interconnection networks (MINs) provide a
favorable tradeoff between cost and performance compared to other available means
of interconnection. Some large scale shared memory multiprocessors such as the NYU
Ultracomputer [GOTT83], the IBM RP3 [PFIS85a], and the Illinois Cedar [GAJS83]
utilize MINs. The bandwidth of the interconnection between processors and memory
modules, memory access delays, and the delays resulting from network and memory
conflicts contribute to the overall system's performance.


In data communication networks MINs are used as a switch, i.e., a functional
unit that receives units of information (sometimes referred to as cells) from a set of
incoming channels and routes them as appropriate before transmitting the cells on a
set of outgoing channels. With the advent of high transmission bandwidths, the main
challenge remains to build switches which can handle these rates. One category of
space division fast packet switches is based on multistage interconnection networks;
in fact, some of these include the Batcher-Banyan [HUI87] and the Starlite switch
[HUAN84].






Common methods for optimizing MINs range from enhancing the switches of the
MIN to scheduling the units of information in order to reduce the contention on the
communication medium. Switch enhancement involves modifying the basic crossbar
switches by adding circuitry that can handle additional functions. For example, by
equipping switches with some buffer space, conflicts can be tolerated in a more graceful
way than having to drop the messages. Scheduling of data permits a much higher
bandwidth to be achieved if conflicts can be actively avoided by not permitting
contending messages to access the medium at the same time.

The problem of improving the performance of a MIN can be addressed from several
angles. In a computer system the network delay is a direct measure of performance
and for this there are a number of ways to attack the problem:

* Reducing memory access delays.

* Supporting special traffic patterns.

* Improving bandwidth of the interconnect.


The efforts in data communication networks, in turn, have concentrated on the
problems which arose due to recent technologies. The accelerated development of
higher speed communication has opened new avenues for all kinds of applications
which in turn become dependent upon networks or networked computing applica-
tions. Building switches which can handle these enormous rates remains a challenge.
MINs are used in the class of space-division fast packet switches, and the methods
to improve their performance include:





* Reducing or eliminating head-of-line blocking.


* Scheduling requests in a conflict free manner.


* Providing multiple interconnection networks.


This thesis addresses the problem of providing a high degree of throughput using
MINs in both high performance computer systems and in high speed communica-
tion networks. This thesis describes their vital role in both systems and presents
efficient techniques for enhancing the performance of such switches according to the
requirements.

Though the requirements of computer systems and communications systems dif-
fer vastly, a common goal of achieving an enhanced throughput can be obtained by
studying the switch and its operations. In computer systems the switch can be mod-
ified in order to be robust under severe traffic conditions by supporting nonuniform
reference patterns. In telecommunications switches, throughput enhancement mech-
anisms vary according to the switching technique implemented. In a circuit switched
architecture, the path setup phase should attempt to grant the most number of con-
nections. In a packet switched architecture output contention and possibly internal
blocking must be avoided by scheduling packets which will be non-contending. Often
this may involve having to relax the strict FIFO order of the packet arrivals.

The first half of the thesis addresses the aspects of throughput improvement in
computer systems.






The remainder of this thesis addresses the problem of MIN-based switch enhance-
ment in telecommunications. Here, the problem is examined both from a circuit-
switching as well as a packet-switching point of view. This thesis further attests
that these networks can provide excellent throughput and have applications in both
computer and data communications systems.


The thesis is organized as follows. Chapter 2 gives a general overview of multistage
interconnection networks and formulates the goals in both computer as well as in
telecommunication networks. Chapter 3 details the background work and approaches
taken in related work. Chapter 4 presents the Consecutive Requests traffic pattern
and investigates the causes of network deterioration under this type of traffic. Chapter
5 studies the effects of the Consecutive Request traffic pattern when it interacts
with regular traffic. Chapter 6 presents a parallel path setup scheme for a connection
oriented MIN-based high speed network switch. Chapter 7 presents the Circular
Window Control scheme for scheduling packets in a highly nonconflicting manner in
a MIN-based switch for high speed networks. Chapter 8 summarizes this thesis and
discusses future research directions.

The methods presented are by no means exhaustive, and new techniques can be

developed using the work described in this thesis.














CHAPTER 2
MINS IN COMPUTER AND TELECOMMUNICATION SYSTEMS


Multistage interconnection networks (MINs), and research associated with them,

date back to the 1950s in connection with switching exchanges for telephone systems.


In the early


1970s MINs were proposed for computer system applications and since


then the research in both the areas of computer and telecommunications has reached

numerous milestones.

With the present research on fabrication methods, there is no reason to suspect
that the technological advancement will not continue. With the advent of high per-
formance computers on one hand, and high speed networking on the other, MINs will
be playing an integral part in providing data communication. The purpose of this
thesis is to present methods to avoid or overcome pitfalls of MINs in computer and
telecommunication systems.


The remainder of this chapter is organized as follows. An overview is provided
in Section 2.1 to give the basic terminology and underlying topological design prin-
ciples of MINs. Sections 2.2 and 2.3 deal with specific design issues of MINs
in large-scale multiprocessor systems and telecommunication systems, respectively.
Finally, Section 2.4 summarizes the requirements of MINs for computer and telecom-
munication systems.






2.1 Overview of Multistage Interconnection Networks


The vast array of physical implementations for interconnection switches ranges from
bus-oriented to crossbar systems. The simplest of these is perhaps the time-shared
bus, but it is unable to provide the performance required in a large-scale high perfor-
mance system, mainly because of its inability to support a large number of processors
in terms of scalability. Moreover, buses are single access and would therefore
incur high contention. At the other end of the bandwidth spectrum is the full cross-
bar, which can connect any of its free input ports to any free output port by providing
a separate switching gate for each input-output connection. A crossbar switch with
N input and N output ports requires N^2 switching gates, and this becomes the main
drawback found with crossbar switches. The hardware costs required for intercon-
necting thousands of processors are too vast for crossbar switches to be considered
economically viable. The feasibility of implementing large crossbars in VLSI presents
another problem.

The main motivation behind selecting a multistage interconnection network as
a communication medium is the hardware cost reduction. By connecting stages of
smaller crossbar switches, a MIN can provide the required interconnection. While the
number of gates, O(N log2 N), is significantly less than in crossbars, the speed is re-
duced because of the multiple number of stages and the time needed to set the network
control, unless the network is self-routing (in which case the control is distributed to
each switch).
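As an illustrative aside (our own sketch, not part of the original text), the cost contrast can be made concrete by counting switching gates, assuming a MIN built from 2 x 2 crossbar switches at four gates each:

```python
import math

def crossbar_gates(n):
    # A full N x N crossbar needs one switching gate per input-output pair.
    return n * n

def min_gates(n):
    # A MIN built from 2 x 2 switches has log2(N) stages of N/2 switches;
    # we assume 4 gates per 2 x 2 switch (one per crosspoint).
    stages = int(math.log2(n))
    return stages * (n // 2) * 4

for n in (8, 64, 1024):
    print(n, crossbar_gates(n), min_gates(n))
# For N = 1024: 1,048,576 crossbar gates vs. 20,480 MIN gates.
```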






In general, a MIN consists of a sequence of switching elements, arranged in stages.
Physical wires connect successive stages and are also referred to as interstage links.
The most common switching element used in a MIN is a crossbar switch, small enough
to be economically attractive. For simplicity, we shall consider only networks with
square crossbar switches, i.e., where the number of inputs is equal to the number of
outputs.


MINs can be classified in a number of ways. For the purpose of this thesis, we
shall consider one class of permutation networks (i.e., networks which can connect an
input port to at most one output port) called blocking networks which exhibit the
Unique Path Property (UPP). As their name implies, these networks have a unique
path between every input-output pair. However, a connection between a free input-
output pair may not always be possible due to conflicts with existing connections,
ergo the name blocking.

The network topology refers to the number of stages, the number and size of the
switches, and the interconnection pattern between successive stages. Two networks
are said to be topologically equivalent if their underlying graphs are isomorphic; they
are isomorphic if there exists a label-preserving graph isomorphism between them.
A topological equivalence relationship was established for several networks belonging
to the class of UPP networks [WU80]. This equivalence implies they can perform the
same set of permutation functions if the destinations are rearranged appropriately
and that, under the same routing algorithm, equivalent networks will have identical
performance and fault tolerance characteristics. Equivalent networks can also be used
to simulate each other.


UPP networks can be controlled with a distributed routing scheme known as
the destination tag algorithm proposed by Lawrie [LAWR75]. It is based on output
labels called destination tags. The binary representation of the destination tag,
d1 d2 ... dn (where dn is the least significant bit), is used to route the packet: in
stage i the packet is routed to the upper output port if di = 0 or to the lower port if
di = 1.

An example of a UPP network is the Omega network [LAWR75]. An Omega
network of size N x N (N = 2^n) consists of n = log2 N stages, where each stage
contains N/2 switches and has an interstage pattern referred to as the perfect shuffle.
An example of the Omega network constructed with 2 x 2 crossbar switches is shown
in Figure 2.1, where N is equal to 8. In general, a multistage interconnection network
can be built with a complexity of O(N log2 N), in contrast to the O(N^2) complexity
of a fully connected crossbar network. The destination tag routing algorithm is
illustrated in Figure 2.1, where the bold lines indicate the route taken by a packet
issued from processor 1 to memory module M5.
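As an illustration (our own sketch; the function names and the link-address bookkeeping are ours, not the dissertation's), destination tag routing through an Omega network takes only a few lines:

```python
def destination_tag_route(src, dst, n_stages):
    """Trace a packet through an n-stage Omega network of size N = 2**n_stages.

    At each stage the packet is first shuffled (left rotate of the n-bit link
    address), then the stage's destination bit selects the upper (0) or lower
    (1) switch output. Returns the list of link addresses visited.
    """
    n = n_stages
    addr = src
    path = [addr]
    for stage in range(n):
        # Perfect shuffle: rotate the n-bit address left by one.
        addr = ((addr << 1) | (addr >> (n - 1))) & ((1 << n) - 1)
        # Exchange: the stage's destination bit replaces the last bit.
        bit = (dst >> (n - 1 - stage)) & 1
        addr = (addr & ~1) | bit
        path.append(addr)
    return path

# Route from processor 1 to memory module 5 in an 8 x 8 Omega network.
print(destination_tag_route(1, 5, 3))  # [1, 3, 6, 5]: final element is 5
```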

Due to the unique path nature of UPP networks, simultaneous connection of input


ports to output port may lead to routing conflicts.


The routing conflicts are due to the


sharing of common internal links and are also referred to as internal blocking. Internal

















Figure 2.1. An 8 x 8 Omega Network


In addition to internal conflicts, MINs also suffer from output conflicts, a type of
blocking even crossbars experience. Output conflicts occur when multiple input ports
request the same destination. Of the contending requests, one has to be delayed by
one cycle. The victim is usually chosen in a round robin fashion. Conflicts may
propagate back to the preceding stage and prohibit a successive message from
transmission.

The regular structure of multistage interconnection networks makes them very
amenable to VLSI implementation. In addition to this, their modular structure allows
the construction of larger networks out of smaller ones without the need to
substantially modify the physical layout or the algorithms.

The drawbacks to UPP networks are that the unique path property leads to blocking
and makes full access capability more difficult in the event of a single failure. These
shortcomings of the UPP networks have led to the design and implementation of
enhanced UPP networks, or multipath MINs. Another drawback of UPP networks is
that a given UPP network cannot realize the full set of permutations from inputs to
outputs without using multiple passes or by multiple copies of the network.




Issues concerned with faults and the discussion of enhanced networks will be the topic
of the next chapter.

2.2 Multistage Interconnection Networks in Large-Scale Multiprocessor Systems

Network latency, compounded with memory access delay, is known as the main

factor limiting the performance of large-scale shared memory multiprocessors. Meth-

ods to reduce the memory access delay include memory hierarchies and asynchronous


block transfers.


Though I/O constitutes another important factor in multiprocessor


performance, we assume that the applications under investigation are computation-

ally intensive and that the I/O involved is negligible.

As computer technology has been moving towards high performance architectures,
the trend has become to organize multiple processors (as many as thousands) which
can then work in parallel on a single or on multiple tasks. These tasks are then
partitioned into processes which are then assigned to different processors. Inevitably,
these processes must communicate with one another and it is this cost which becomes
a deciding factor in determining the performance and efficiency of the application.

Multistage interconnection networks have frequently been chosen to interconnect
processors and memory modules due to the many desirable features described in the
previous section. Although there are a large number of different possible MINs, only
a surprisingly small number are actually considered for multiprocessors. One that
recurs in many different forms is the multistage shuffle-exchange network [BATC76, ...].


















Figure 2.2. Large Scale Multiprocessor System Configuration


MIN systems together with memory interleaving can provide adequate network
performance. That performance depends not only on the memory system, but also
on the rate at which processors issue memory references into the network (the
processor issue rate), which in turn is affected by the network delay. Requests issued
by a processor are buffered when they cannot be issued into the interconnect. After
issuing a request, the processor waits for all the messages associated with that request
to return. In general there are two types of messages that can be issued by a
processor: reads and writes. Read requests are typically synchronous, i.e., the
processor waits for the reply of memory in order to resume execution. However, in
pipelined and vector processors, several reads can be issued in a sequence. Write
requests, on the other hand, are asynchronous, as processors do not need to wait for
a reply. The requests which are initiated by the processors are sent through the
forward network and memory modules return data through the backward network
(Figure 2.2).


Network performance is not only dependent on the request issue rate but also on
the pattern of the references. The references generated by applications running in
any computer environment constitute a traffic pattern. Some patterns have no
particular structure to them and can be considered quite random in their frequency
and destinations. The reference pattern in which processors access all memory
modules with the same probability and where accesses are not dependent on one
another is referred to as the uniform traffic pattern. The bandwidth of multistage
interconnection networks can be very high for requests that do not conflict [LAWR75],
but unfortunately most applications generate reference patterns which cause a high
degree of conflicts in the MIN.

Nonuniform traffic patterns can be generated from a variety of applications which

are data intensive or communication intensive. Nonuniform traffic patterns can cause

a large amount of link contentions for an indefinite time and have a deteriorating effect


on regular traffic as well.


It is this effect that makes the study of traffic patterns


which can arise from typical applications in multiprocessor environments important,

for they directly impact the throughput provided by the interconnection network.


2.3 Multistage Interconnection Networks in Telecommunication Systems


Over the years, the growing number of applications which require high bandwidth
has accelerated the research of high speed transmission media such as optical fibers.
These applications are no longer solely computer related but can be found in areas
such as video, voice and image, collectively known as multimedia [ARMB86, KIM90a,
WOOD90]. The applications produce a range of traffic flow characteristics, which
complicates bandwidth allocation: ensuring that each application type is allocated
enough bandwidth to satisfy its flow requirement.

Early networks were typically designed to support different traffic characteristics


and requirements, and each was tailored for particular applications.


Lately the drive


is to design a single communication system which can provide the same services in a

unified and integrated fashion. Some of the reasons are the ease of maintenance and


installation, economy and ease of access.


This movement has prompted the proposal


of a set of services called Integrated Services Digital Networks (ISDN).

The constant drive to achieve higher transmission bandwidth has led to the devel-
opment of optical communications. The usage of optical fiber communication changed
the requirements of data communication dramatically. Using lasers, which can be
switched at a high rate, and optic fibers, the light signal can be carried over long
distances with little attenuation. With this technology, data rates of 4 Gb/s over 100
km without repeaters [TOBA90] are possible.

The emergence of numerous applications which require much higher bandwidth
than possible in present networks was inevitable once the high transmission capacity
of fiber optics technology was made available. The real challenge now becomes the
creation of a network which can provide the high bandwidth services to the users.
The bottleneck comes primarily from switching, since the data can be transmitted in
a very high speed fashion. These high speed networks will carry all applications in an
integrated fashion, which all require different qualities of service. The most appropriate
switching technique must be able to handle the wide diversity of data rate and latency
requirements resulting from the integration of these services.


Packet switching is also known as the asynchronous transfer mode (ATM) and
circuit switching is known as the synchronous transfer mode (STM). At the present
time, ATM specifies fixed-size packets of which 48 bytes are data and 5 bytes are
control information. Line speeds which are specified are nominal rates of 150 Mb/s
(for digitized high definition television (HDTV)) and 600 Mb/s [TOBA90].


Several architectures for fast packet switches have emerged in recent years, namely,
(i) shared-memory, (ii) shared-medium, and (iii) space-division packet switches.
Of these three, space division fast packet switches allow multiple concurrent paths
to be established from the inputs to the outputs. Space division switches can be
categorized into two different types of fabrics: crossbar fabrics and banyan-based
fabrics. Though crossbar fabrics are nonblocking, the size of realizable crossbar
switches tends to be limited. This is the primary reason why banyan-based fabrics
have been considered as alternative candidates for fast packet switching.

The routing of data from an input to an output depends on the switching tech-
nique used in the switch. There are two principal switching techniques, circuit switch-
ing and packet switching. In circuit switching, a complete path of connected links
from the origin to the destination is set up when the call is placed and before data is
sent over it. The path remains dedicated to the call until it is released by the com-
municating parties. The overhead incurred during the setup phase is one of the costs
of circuit switching.









Figure 2.3. An Output Queued Packet Switch


Circuit switching is efficient when there is a steady flow of information and is hence
the switching method used for voice communication.


Communication between computers tends to be bursty in nature, however. Cir-
cuit switching would be too costly and the circuits would be underutilized. Packet
switching is a form of store-and-forward technique where the data is transmitted in
chunks which are not to exceed a certain maximum length and are referred to as
packets. Packet switching achieves benefits such as (i) dynamic bandwidth alloca-
tion, and (ii) easy error recovery, because smaller chunks are sent at a time. Once the
packets arrive at their destination, reordering may need to take place.


Nonblocking MINs can be classified based on their queueing strategy: (i) input
queueing, (ii) output queueing, and (iii) shared queueing. The speed of the switch
fabric also plays a factor in where the queueing is done. For example, if in an N x N
switch the switch fabric runs N times as fast as the input and output links, all
the packets that arrive during a slot can be delivered at the outputs, even if
multiple inputs request the same output. A slot is the time in which a packet can
be transmitted on an input or output link.


Figure 2.4. An Input Queued Packet Switch


Input queueing is necessary when each output can accept at most one packet per
time slot (Figure 2.4). Input queued architectures have a maximum throughput of
0.586 [HLUC88] for an infinitely large switch, FIFO input queues with infinite queue
length, and uniform traffic. Thus, in spite of the queueing capability, its capacity is
worse than that of an unbuffered crossbar switch. This phenomenon is due to head-of-
line (HOL) blocking. HOL blocking occurs when the packet at the head of the input
queue cannot be transmitted and consequently blocks the other packets behind it
although they may be addressed to idle outputs.
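A small Monte Carlo sketch (ours, not the author's; it assumes saturated inputs, uniform destinations, and one randomly chosen winner per contended output each slot) reproduces the flavor of this result — the simulated throughput approaches the 0.586 asymptote as N grows:

```python
import random
from collections import defaultdict

def hol_saturation_throughput(n_ports, n_slots=20000, seed=1):
    """Estimate saturation throughput of a FIFO input-queued switch.

    Every input always holds a head-of-line (HOL) packet with a uniformly
    random destination; each output serves one randomly chosen contender
    per slot, and only served inputs advance to a fresh HOL packet.
    """
    rng = random.Random(seed)
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(n_slots):
        contenders = defaultdict(list)
        for inp, out in enumerate(hol):
            contenders[out].append(inp)
        for out, inputs in contenders.items():
            winner = rng.choice(inputs)      # resolve the output conflict
            delivered += 1
            hol[winner] = rng.randrange(n_ports)
    return delivered / (n_slots * n_ports)

print(hol_saturation_throughput(32))  # close to the 0.586 asymptote
```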


Typically there are three parameters used to describe the performance of the
switching fabric:

* Switch throughput. The utilization of the output links, defined as the proba-
bility that a packet is transmitted in a time slot by the switch. The maximum
throughput, also referred to as the switch capacity, is the traffic carried when the
packet arrival rate is one packet per time slot.


* Average packet delay. The average packet delay is defined to be the number of
time slots elapsed between the arrival of a packet at the switch and its transmission.

* Packet loss probability. The packet loss probability is defined to be the proba-
bility that a packet received at the switch input is lost due to buffer overflow.


2.4 Chapter Summary


In this chapter, a general overview of MINs was presented; MINs were found to be
widely used in both large scale multiprocessor systems as well as telecommunications
systems. They provide a reasonable throughput at moderate hardware costs.


The requirements of the MIN depend highly on the environment in which it is
being used. In multiprocessor systems, where the MIN is used to connect a number
of processors to a number of memory modules, the objective is to minimize the
network delay incurred by the memory references. These references are characterized
by traffic patterns generated by applications running in the multiprocessor system. In
telecommunications, on the other hand, the MIN is used to switch data from incoming
links to outgoing links. The objective is to do this as fast as possible, with the
least amount of internal blocking. Depending on the switching technique used in
the switch, this requirement can be phrased as (i) improving the average number of
setups granted (in circuit switched systems), and (ii) optimizing packet throughput
(in packet switched systems).














CHAPTER 3
RELATED WORK


Most studies examined the behavior of MINs in isolation, and these do not reflect
the characteristics of the entire system. The studies make simplifying assumptions,
avoid the effects of head-of-line blocking, and thus obtain overestimating results.

The remainder of this chapter is organized as follows. Section 3.1 presents several
studies performed on non-uniform traffic patterns and emphasizes the work done in
modeling vector interference. Section 3.2 presents several setup algorithms previously
proposed for some multistage interconnection network topologies. Section 3.3 details
studies performed in improving the performance of banyan-based fast packet
switches. Section 3.4 presents a summary of the related work discussed.


3.1 Non-Uniform Reference Patterns


One of the most unrealistic assumptions to make is that generated traffic always
follows a uniform and independent pattern. The consequences of nonuniform traffic
patterns, even when they occur sporadically, can have lasting effects on the entire
system. This stresses the need to study the system in its entirety rather than study-
ing its parts in isolation. Note that the non-uniform traffic patterns discussed in this
chapter are of common occurrence in both computer and telecommunication systems.






Reference patterns in computer systems are the result of application programs exe-

cuting and requiring communication to and from other processes, or from memories.

One type of reference pattern which has received much attention in the literature
is the one in which one particular destination address is accessed with a higher proba-
bility than the remaining ones. This destination is also referred to as a hotspot. This
phenomenon, first observed by Pfister and Norton [PFIS85b], causes the buffers of the
paths leading to the hotspot to become saturated. The tree saturation phenomenon is
a direct effect of hotspot access and leads to serious degradation in network through-
put of not only accesses directed to the hotspot, but other memory references as well.
The problem of hotspot contention has been studied extensively and various solutions
have been proposed [PFIS85b, YEW87, HO89, SCOT89, DIAS89]. Hotspot traffic
occurs also in communications networks, where it is referred to as output concentra-
tion.

When each processor references its own particular memory module more than
others, the reference pattern which is generated is referred to as the preferred mod-
ule pattern or, in telecommunications' terminology, communities of interest. This type
of traffic can occur in the case of bulk transfer. The hotspot traffic can be seen as
an instance of preferred module traffic where all processors prefer the same module.
The consequences of preferred module traffic are not as severe as those of hotspot
traffic when the individual modules are well distributed. The traffic referencing a
processor's preferred module obstructs other traffic from the same processor.
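For concreteness, a hotspot reference stream is commonly modeled by giving one module an extra probability mass h on top of a uniform background; the helper below is our own illustrative sketch of that standard model, not code from the dissertation:

```python
import random

def hotspot_destination(n_modules, hot_module, h, rng=random):
    """Draw one destination: with probability h go to the hotspot,
    otherwise pick uniformly among all n_modules (the usual model)."""
    if rng.random() < h:
        return hot_module
    return rng.randrange(n_modules)

# 10% extra traffic aimed at module 0 out of 64:
stream = [hotspot_destination(64, 0, 0.10) for _ in range(8)]
print(stream)
```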






The burst issue rate is the rate at which the processor issues bursts of requests
to the memory system. In most studies, the assumption is made that the burst issue
rate is the same for all the processors in the system. A processor will typically wait
for the burst requests to be satisfied before issuing the next burst. If the behavior
has a distinct repeating pattern, it can also be termed periodic. Burstiness is of
particular interest in communications; it is characterized by random gaps encountered
during message transmission, variability of the message size, and the low delay
tolerance of the source.

the low tolerance of delay of the source.

Another non-uniformity may occur when not all processors have the same issue
rate. This case is also referred to as unbalanced input, and may not be of much
interest in a computer system but is of utmost importance in telecommunications
switches.


3.1.1 Vector Interference


Numerical applications which operate primarily on vectors and matrices have a
distinct pattern of reference. These systems are often required to use interleaved
banks of memory. A parallel interleaved memory system allows concurrent access
to multiple data items by placing these items in distinct memory modules. The
assumption is that addresses can be presented to all the modules in parallel and that
after a delay equal to the cycle time of the memory, data can be removed in parallel
from all of the modules.







any spatial locality. Gottlieb [GOTT83] states that hashing, however, only works for
small systems.


In [CHEU86], a simulation study of the Cray X-MP's memory system was per-
formed where the vector traffic caused linked conflicts. Vector activity was studied in
a multiple bus architecture and suggestions were made with respect to the bank cycle
time and the number of bus lines. Other studies on memory in vector supercomputers
have been given in [BAIL87, OED85].

Storage schemes and address transformations have been proposed to reduce the
contention created by the access of vectors. The conventional storage schemes are
the interleaved scheme and the skewed storage scheme [BUDN71]. The objective of
storage schemes is to permit conflict-free access for a set of frequent access patterns.
Storage schemes based on the access pattern of the vectors are presented in [HARP87,
HARP89].
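To illustrate the idea (our own sketch; Chapter 4 develops the dissertation's own variants), the classic skewed scheme rotates each matrix row by its row index, so that both rows and columns fall in distinct banks. A minimal sketch, assuming M memory banks and row-major layout:

```python
def interleaved_bank(addr, n_banks):
    # Conventional interleaving: consecutive addresses hit consecutive banks.
    return addr % n_banks

def skewed_bank(row, col, n_banks):
    # Skewed storage: each row is rotated by its row index, so a column
    # (stride-N access) also touches n_banks distinct banks.
    return (col + row) % n_banks

n = 4
column = [skewed_bank(r, 0, n) for r in range(n)]
print(column)  # [0, 1, 2, 3]: the column access is conflict-free under skewing
```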

Vector traffic in a MIN and its adverse effects have been studied in [TURN89],
where it was demonstrated that the forward network, memory, and backward network
all affect each other. They propose placing long buffers at the memory inputs, which
will eliminate blocking in the forward network.

Evaluation of the Cedar multiprocessor, which is a MIN-based system, was done
in the work by [GALL91]. They observed that the degradation was due to the density
of the requests, which they resolved by adding dead cycles (NOPs) in order to reduce
the density.








3.1.2 Vector-Scalar Interference


In many commercially available supercomputers, such as the Cray X-MP, multiple
processors can access both vectors as well as scalars simultaneously. Their interaction
can cause the degradation of one or both access types.

Processors which have a scalar cache cause cache blocks to be transferred between
cache and main memory. Since data in cache lines are stored in contiguous sets of
words, the interaction between these cache lines and ordinary vector accesses would
result in the interaction between vectors, which was introduced in the section above.
In order to observe vector-scalar interference, the system is assumed to be without
any scalar caches.

In [RAGH91] the interference among vector and scalar accesses has been ana-
lyzed and found to reduce the performance of vector accesses that may already be
in progress by as much as 40%. Their model assumed that the memory system had
reached a conflict free steady state, operating at 100% efficiency. Other important
assumptions are that there were no memory bank buffers present and that preference
was given to scalar accesses. Then, a single scalar access caused a series of simul-
taneous bank conflicts leading to a detrimental effect on the memory efficiency. By
using a crossbar interconnection network they avoided memory path conflicts which
typically arise in MINs.


3.2 Path Setup Algorithms






In circuit-switched operation, a path must be established from its source to its
destination(s). Setting up a path between a source and a destination can be done
using the self-routing property of a UPP.
can be done using the self-routing property of a UPP.

Unique path property networks can realize only a small fraction of the N! possible
source-destination permutations. In order for a UPP network to realize an arbitrary
permutation, several passes through the network are necessary, where each pass realizes
a submap of the permutation. The problem of finding the minimum number of passes
to realize a permutation is intractable. Finding whether an arbitrary permutation can
be realized in a given UPP topology has been done, for example for cube-connected
networks in [ORUC91].


Generally, however, input patterns are not necessarily permutations. This adds
another dimension to the setup problem. If n different sources request the same
destination, at most one of these can send its packet.

A setup procedure is a mechanism which specifies which sources of a given request
pattern get to send their data to the destinations. The setup procedure typically
precedes the actual packet transfer protocols, particularly when the switching fabric
is unbuffered and dropping of packets is inherent. In order to avoid retransmission,
the setup phase ensures the packet can be transmitted prior to being admitted to
the switch. The main goal of any setup procedure is to grant as many connections
as possible at the least possible overhead. A subgoal in synchronous transmission
is to combine this with synchronization that notifies the sources that the setup has
terminated and that transmission may start.







The speed of the switch, with respect to the speed at which the input links are
operating, has a direct impact on the path setup phase. If the transmission speed of
the switch is n times as fast as the link transmission speed, then the packet at the
input port has n opportunities to attempt to set up a path within a slot time. This
increases the throughput considerably. Output queueing is necessary when multiple
packets can arrive at the same output port.
The fundamental assumption made in most path setup schemes is that at the
beginning of a transmission cycle, all setup requests are aligned and processed in
a parallel fashion. Output link conflicts are usually resolved in a random manner
whereby the victimized requests are dropped and denied a setup during the data
transmission interval. Victors continue on to subsequent stages where again they
might have output link conflicts. Ultimately, the victors of all stages are granted
a connection. This is also known as parallel path setup (PPS); notably, [PATE81]
gives an analysis of the maximum throughput possible under these assumptions.
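The PPS mechanism is easy to sketch in code. The simulation below is our own illustration (assuming an Omega topology with 2 x 2 switches, destination-tag routing, and random loser selection; the function names are ours) and counts how many of a batch of requests survive all stages:

```python
import random
from collections import defaultdict

def parallel_path_setup(requests, n, rng=random):
    """One-shot PPS on an n-stage Omega network of size N = 2**n.

    `requests` maps input port -> destination. At every stage, packets in
    the same 2 x 2 switch that want the same output conflict and one random
    victor survives; survivors of the last stage get their paths granted.
    """
    N = 1 << n
    # state: list of (current link address, destination, original input)
    state = [(src, dst, src) for src, dst in requests.items()]
    for stage in range(n):
        groups = defaultdict(list)
        for addr, dst, src in state:
            shuffled = ((addr << 1) | (addr >> (n - 1))) & (N - 1)
            bit = (dst >> (n - 1 - stage)) & 1
            out = (shuffled & ~1) | bit      # requested output link
            groups[out].append((out, dst, src))
        state = [rng.choice(cands) for cands in groups.values()]
    return {src: dst for _, dst, src in state}

reqs = {i: random.randrange(8) for i in range(8)}
granted = parallel_path_setup(reqs, 3)
print(f"{len(granted)} of {len(reqs)} setups granted")
```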

A study conducted by Lea [LEA92] showed that an incremental path setup (IPS)
procedure can be implemented which can increase the number of possible setups at
the expense of having more setup intervals (assuming the parallel path setup scheme
only uses a single interval). Lea [LEA92] is able to achieve a spectrum of improve-
ments by increasing the number of setup intervals. In order to achieve a reasonable
improvement in throughput, the number of setup intervals must approach the number
of contending sources, attempting the setup of each of the sources one at a time.
The order in which the setups occur (the setup sequence) places a certain priority
on the traffic at these inlets; the highest priority is a priori assigned to the inlet
which is to be attempted first and the inlet at the end of the sequence has the lowest
priority. Another drawback to the IPS scheme is that the data transmission cannot
start until all the setup intervals are over, even though the network might be idle
during some of these intervals due to a lack of requests (low input rate).


3.3 Fast Packet Scheduling


Most Asynchronous Transfer Mode (ATM) switches rely on the use of multiple
stages of very simple switching elements, each of which is self-routing. MINs are
constructed on these very principles and are thus heavily used in many of these
proposals.

Blocking MINs require means of controlling packet loss by either buffering in the
switching elements or deflection routing. Rather, nonblocking MINs are used in ATM
architectures. Nonblocking MINs can be further classified based on the queueing
strategy adopted: (i) input queueing, (ii) output queueing and (iii) shared queueing.
A nonblocking interconnection network guarantees the absence of internal conflicts
and can thus establish a path from any free input to any free output.


3.3.1 Sorter Based Networks

A number of nonblocking MINs are based on placing a sorting network prior to a
routing network. An N x N Batcher sorting network has n(n + 1)/2 stages (where
n = log2 N), with each stage consisting of N/2 sorting elements. Additional means
are required to guarantee that the sorted elements are all distinct and skewed into
adjacent outlets.
p4
It is well known that a routing banyan network is nonblocking if the set of packets
to be routed is sorted based on the output addresses and received on adjacent inlets
of the banyan network. Examples of sort-banyan fast packet switches are the Starlite
[HUAN84], and others [HUI87].
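This property is easy to check empirically. The sketch below is our own (it assumes an Omega-style banyan and reuses the destination-tag trace discussed in Section 2.1) and verifies that sorted, distinct destinations presented on adjacent inlets never collide on an internal link:

```python
def route_links(src, dst, n):
    """Return the link addresses a packet visits in an n-stage Omega banyan."""
    addr, links = src, []
    for stage in range(n):
        addr = ((addr << 1) | (addr >> (n - 1))) & ((1 << n) - 1)  # shuffle
        bit = (dst >> (n - 1 - stage)) & 1
        addr = (addr & ~1) | bit
        links.append((stage, addr))
    return links

def is_conflict_free(dests, n):
    """True if routing dests[i] from inlet i uses each internal link at most once."""
    seen = set()
    for src, dst in enumerate(dests):
        for link in route_links(src, dst, n):
            if link in seen:
                return False
            seen.add(link)
    return True

print(is_conflict_free([0, 1, 4, 5], 3))  # sorted on adjacent inlets: True
print(is_conflict_free([4, 0, 5, 1], 3))  # same destinations unsorted: False
```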


3.3.2 Input Queue Architectures

Input queue switch architectures run at the same speed as the inputs and outputs,
and packets are queued at the inputs. Typically, these input queues are serviced in
a FIFO manner, creating the HOL blocking. This kind of blocking involves a packet
at the head of the queue obstructing passage to packets behind it whose outputs are
otherwise idle.


In the Three-Phase switch [HUI87], each slot is subdivided into three phases: probe,
acknowledgement, and data. In the probe phase each active inlet issues a request
packet indicating the outlet addressed by its head-of-line (HOL) packet. The
interconnection network generates as many acknowledgements as there are granted
requests and sends them back to their respective requesting inlets. Finally, the inlets
which have received the acknowledgement packet transmit their HOL packet during
the data phase.






The Ring reservation switch [BING88] coordinates the inlets by interconnecting
them in a ring structure. A reservation frame is serially transferred along the ring
slot by slot so that each inlet can reserve the outlet addressed by its HOL packet, if it
is not already reserved by some upstream inlet. Successful inlets transmit their HOL
packet nonblockingly. The reservation phase and the data phase can be pipelined,
but the serial reservation scheme makes the solution unviable for large switches, as
the bit rate grows with the switch size.

The HOL blocking can be reduced in several ways [PATT88]; a sketch of the
windowing approach follows this list:

* Switch expansion. The switch has more outlets than inlets (N). The
throughput is improved because the number of conflicts for each outlet is re-
duced.

* Windowing. Windowing is a technique that relieves the HOL blocking by al-
lowing non-HOL packets to contend for switch outlets. A window of depth W
means that a search is done for up to W packets, including the HOL packet.

* Channel grouping. The switch outlets are subdivided into groups of R each
and packets now address groups rather than single outlets. In each slot each
output link in a group is allocated to a specific inlet which is addressing that
group with its HOL packet.
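A windowed scheduler can be sketched compactly. The code below is our own illustration of the generic windowing technique (not the dissertation's CWC scheme, which Chapter 7 develops): for each input it scans up to W packets deep and sends the first one whose outlet is still free in this slot.

```python
from collections import deque

def windowed_schedule(queues, window, n_outputs):
    """One slot of window-W scheduling over FIFO input queues.

    queues: list of deques of destination ports. Returns {input: dest}
    for the packets selected this slot; selected packets are dequeued.
    """
    free = [True] * n_outputs
    selected = {}
    for inp, q in enumerate(queues):
        for depth, dst in enumerate(q):
            if depth >= window:
                break
            if free[dst]:           # first sendable packet within the window
                free[dst] = False
                selected[inp] = dst
                del q[depth]        # bypassed packets ahead of it stay queued
                break
    return selected

qs = [deque([2, 0]), deque([2, 1])]
print(windowed_schedule(qs, window=2, n_outputs=3))  # {0: 2, 1: 1}
# With window=1 (pure FIFO), input 1 would be HOL-blocked this slot.
```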


3.3.3 Shared Queue Architectures

The basic idea is to mark and separate, slot by slot, the set of transmittable packets
from the loser packets. The resulting sequence then consists of a sorted permutation
which, after being skewed (removing the idle gaps in the sequence), is then passable
in any UPP network. Winner packets are forwarded to the routing network and
transmitted. Loser packets are fed back through a recirculation network to contend
again with newly arrived packets. The recirculation network implements a distributed
shared queue for packets that cannot be transmitted to their outputs.


3.4 Chapter Summary


In this chapter, relevant work in MINs was discussed and the approaches taken by
other researchers were summarized. In the next chapters we will present methods by
which we can improve on the already presented techniques and/or solve problems
which were also observed by others.



























CHAPTER 4
THE CONSECUTIVE REQUESTS TRAFFIC PATTERN


Communication between processors in a shared memory multiprocessor is achieved
by reading and writing to shared variables that reside in the shared memory. Each
processor is typically equipped with its own local memory where its local data and
its local program can reside. When these processors are executing in a parallel fash-
ion, the computations should be arranged in such a manner that the individual
processors have to wait as little as possible. The traffic between the shared and local
memories must be carefully managed since these data transfers incur a significant
overhead.

The architecture of the interconnection affects parallel algorithm development and

it is paramount to reduce the number of possible access collisions in this network.

The performance of an interconnection network is dependent on the behavior of the


imposed traffic.


The pattern of the traffic is characterized by the rate of the requests


and the frequency with which destinations are requested.


In a centralized-control SIMD environment, network performance can attain
its optimum when certain permutations appear in the memory access traffic
[LAWR75]. Unfortunately, these special patterns cannot always be guaranteed in
MIMD environments. A common assumption is instead that requests are generated
independently and "uniformly" to all memory modules. This means that at each
source subsequent requests are independent from one another and that requests to
memory modules are generated with equal probability. When the memory access
traffic does not follow this independence and uniformity assumption, the performance
is often overestimated.

Even if the memory accesses could be distributed uniformly, the independence
assumption might not be valid. One instance is the consecutive memory access, in
which a request targeted to memory location i is followed by a request to location i+1.
This kind of access pattern may occur when processors access stride-1 vector data
or load program codes. This consecutive request (CR) pattern can also be caused by
the transfer of a cache block between cache and shared memory if each processor is
associated with a primary cache. When the interleaved memory allocation scheme
is used, these consecutive memory accesses will be targeted to contiguous memory
modules. In this case, the network receives sequences of dependent requests which
may experience repeated network conflicts as observed in [SHEN89].
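As a concrete illustration (ours, not from the text): under interleaving with M modules, address a resides in module a mod M, so a stride-1 burst walks across contiguous modules in lockstep, which is exactly the CR pattern:

```python
def module_of(address, n_modules):
    # Interleaved allocation: consecutive addresses map to consecutive modules.
    return address % n_modules

def cr_burst(start, length, n_modules):
    """Destination modules of a stride-1 burst: the CR traffic pattern."""
    return [module_of(start + k, n_modules) for k in range(length)]

print(cr_burst(4, 8, 8))  # [4, 5, 6, 7, 0, 1, 2, 3]
```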


In vector applications, vector accesses are typically preceded and followed by
periods of uniform traffic. These requests can be directed to synchronization variables
or could be updates of local information. Hence our work considers the case in which,
at the time of the vector access, the network might have some messages left from the
preceding uniform access. These messages might obstruct the smooth passage of the
CR traffic. Other factors which might affect the performance of our proposed schemes
are the starting destination of the request sequences and whether all sequences were
issued synchronously or asynchronously. We investigate these cases and evaluate the
performance of the proposed solutions using simulations.

The evaluations show that, under the CR traffic, the performance of the original
switch design is badly degraded and that the solutions proposed significantly mitigate
this degradation. Evaluations of the new designs are also done under the uniform
traffic pattern to fulfill the nature of non-deterministic memory accesses in
multiprocessor systems. These results show that the performance of the proposed
schemes is the same as that of the primitive design running on the uniform traffic.


The remainder of this chapter is organized as follows. Section 4.1 discusses how,
under the CR traffic, a network composed of the conventional switch elements will
experience heavy conflicts in the initial stages. Section 4.2 presents two effective
methods for mitigating the effects of the CR traffic. First, the Bit-Reverse address
mapping scheme is proposed, in order to reduce the number of conflicts in the initial
stages by reordering the destinations of continuous requests. Second, the Dynamic
Priority scheme is proposed, to increase the overlap of the uses of the two links at
each switch. Section 4.3 discusses the effects of spatial distribution on the CR traffic
pattern and generalizes the Bit-Reverse scheme. Sections 4.4 and 4.5 present the
performance evaluation of the schemes on the forward network and backward network,
respectively. Finally, Section 4.6 gives an overview of pertinent results obtained in
this chapter.







4.1 Consecutive Request Traffic Pattern


It is questionable whether two successive requests from a processor are independent. If they are dependent, the question is whether switching elements would experience unbalanced traffic and correlated conflicts. To investigate the impact of dependent requests, the CR traffic pattern is described, which has the following characteristics:


1. A sequence of requests is generated at each source node and is destined to consecutive sink nodes; for example, the request generated at processing element PE1 to memory module MM4 is followed by requests to memory modules MM5, MM6 and MM7. A simple generator of such sequences is sketched after this list.

2. This sequence of requests is generated continuously, without interference from other requests.
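As a minimal illustration of characteristic 1, the following Python sketch (the function name and parameters are ours, not part of the original simulator) enumerates the destination modules of a CR sequence under interleaved allocation:

    def cr_sequence(start_module, length, n_modules, stride=1):
        """Destination modules of a consecutive-request (CR) sequence
        under interleaved allocation, wrapping modulo the module count."""
        return [(start_module + i * stride) % n_modules for i in range(length)]

    # PE1's request to MM4 followed by MM5, MM6, MM7 in an 8-module system:
    print(cr_sequence(start_module=4, length=4, n_modules=8))  # [4, 5, 6, 7]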


The first place where this access pattern occurs is in vector accessing. A large number of numeric arrays and numerous iterations of computations involving array accesses are used in engineering and scientific applications. These data are distributed over many memory modules to prevent memory contention and increase computational parallelism. To execute these applications in a parallel manner, multiple processors are assigned to perform similar operations on a selected subset of operands. Furthermore, the use of instruction pipelining and data prefetching makes it possible to prefetch all of these operands in very short intervals. Consequently, each processor issues requests to consecutive memory modules in rapid succession. Instruction fetches and program loading have the same consecutive nature. In multiprocessor systems where each processor has its own private cache, or even in the event of a shared cache, upon a cache miss the cache controller must read/write a line of words from/to global memory.


4.1.1 The Effects of the CR Pattern


In this section we will describe the effects of CR traffic in a network with conventional switches. Consider the activities in a switch in the first stage when the same CR sequence arrives simultaneously at each of its input ports. For example, PE0 and PE1 both send requests with destinations 000, 001, 010, ..., 111 consecutively in an 8 x 8 system. In the first half of the CR sequence, messages are addressed to the first half of the destinations, i.e., 000, 001, 010, 011. Therefore, the sequence of the first bits, used to control routing, remains the same; as in the above example, 0, 0, 0, 0. We define the duration in which routings remain unchanged as the Duration of Identical Routings (DIR); in our example, the DIR at the first stage is 4.
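The DIR can be measured directly on a destination sequence. The sketch below (a hypothetical helper of ours, not taken from the dissertation's simulator) lists the run lengths of the routing bit used at a given stage:

    def dir_runs(destinations, stage, n_bits):
        """Run lengths of the routing bit used at a given stage
        (stage 0 routes on the most significant destination bit)."""
        bits = [(d >> (n_bits - 1 - stage)) & 1 for d in destinations]
        runs, count = [], 1
        for prev, cur in zip(bits, bits[1:]):
            if cur == prev:
                count += 1
            else:
                runs.append(count)
                count = 1
        runs.append(count)
        return runs

    # CR sequence 0..7 in an 8x8 network: the DIR at the first stage is 4.
    print(dir_runs(list(range(8)), stage=0, n_bits=3))  # [4, 4]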


We can see that the possibility of insertion conflicts is high during this half of the CR sequence, since the DIRs of both inputs are the same and long, i.e., both of the inputs try to route messages to the same output port for many cycles. The same thing occurs during the second half of the CR sequence, which is directed to the lower half of memory. In the second and the subsequent stages, conflicting messages will not only affect one of the messages at the input ports but also messages at the stages before it as well. These are special cases of effects which will be generalized in Section 4.3. Below are the effects of CR traffic for unit stride accesses.


* Effect (1') The usages of the two links connected to the two output ports of a switch are not fully overlapped, i.e., resources are not fully utilized. As in the above example, at the output ports of a first-stage switch, the lower link is idle while the upper link is busy during the first half of the input CR sequences.

* Effect (2') The DIR is still long in the latter stages. If two input sequences overlap with each other, the length of the DIR has an adverse effect on the performance of the CR traffic.


If we can find a way to prevent or reduce the above two phenomena, we can mitigate the performance degradation caused by the CR traffic. For simplicity, we shall refer to the above two effects created by the CR traffic as Effect (1') and Effect (2') in the subsequent sections.


4.2 Dynamic Priority and Bit-Reverse Scheme


From the above, we can see that the network performance will degrade due to poor link utilization and long durations of identical routings. In this section, we propose two approaches to remedy these deficiencies. The approaches are easily incorporated in conventional switches and have the same performance as the conventional switch design under an independent and uniform traffic pattern.


4.2.1 Bit-Reverse Mapping

We consider the conventional switch design which uses the round robin priority scheme to resolve insertion conflicts. When a processor issues consecutive requests starting from the same memory module, or approximately the same memory module, the destination addresses of these requests vary in the low-order bits. However, the routing of messages in the first stage depends on the first bit of the destination address. Thus, the routes of successive requests will overlap in the first stages of the network.


One way to split the routes of successive requests and to reduce their overlap is to adopt a bit-reverse mapping of the destination addresses and to rearrange the memory modules accordingly. In the interface between the processor and the network, we reverse the order of the bits of the destination addresses; for example, in an 8 x 8 MIN system, 3 (011) is mapped to 6 (110), and 4 (100) is mapped to 1 (001). The consecutive requests 0,1,2,3,4,5,6,7 are then mapped to 0,4,2,6,1,5,3,7. Correspondingly, the memory locations are also distributed according to the bit-reverse order in the memory modules. By doing this, we have not changed the network performance of independent requests. It is the same as that without address mapping on uniform traffic, since the destination addresses are generated randomly and are distributed to each destination with equal probability. No matter with which mapping function the addresses are mapped, they are still randomly distributed.
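A minimal sketch of the bit-reverse mapping applied at the processor-network interface (our own Python rendering, with hypothetical names):

    def bit_reverse(addr, n_bits):
        """Reverse the n_bits-bit binary representation of a destination address."""
        result = 0
        for _ in range(n_bits):
            result = (result << 1) | (addr & 1)
            addr >>= 1
        return result

    # Consecutive requests 0..7 in an 8x8 MIN are remapped to 0,4,2,6,1,5,3,7:
    print([bit_reverse(d, 3) for d in range(8)])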

The Bit-Reverse mapping scheme can reduce the effect of CR traffic for a unit stride access: after the mapping, the routing bit at the first stage becomes a "0" followed by a "1" alternately, so any two successive requests from a processor will have unoverlapped routes, except that they are received from the same input port. Also, if two consecutive requests enter a switch at the first stage, then the DIR is very short (only one cycle). If sequences arrive synchronously at the two input ports, a routing conflict arises, but no conflict other than the initial one can occur in the first stage. As a result, both output ports are kept busy forwarding data to the next stage. The usage of the two links from the same switch in the first stage is therefore well overlapped.

The main drawback to the Bit-Reverse mapping scheme is that Effect (1') becomes serious as the DIR increases in the latter stages. Since the round robin priority scheme is used to resolve insertion conflicts, all the messages that route to the same destination will cluster together at the stages approaching the destination. In other words, the blockage in the first stage of the network is removed in order for requests to enter the network, but these then get congested in the later stages. In spite of this deficiency, we expect that the advantages of the Bit-Reverse address mapping will lead to improvements in the network performance. This will be confirmed in Section 4.4.

4.2.2 Dynamic Priority Scheme

From the previous sections, we observed that when the round robin priority scheme is used to select which of the two conflicting messages is routed to the output port, the CR traffic suffers from the two effects described above; to reduce these effects we propose to use a dynamic priority scheme to resolve the insertion conflicts. Adhering to the conventional switches with no address mapping mechanism at the processor-network interface, we will now give preference to a message coming from a particular input port in the cases when successive input messages are trying to route to the same output port.


Under Dynamic Priority, any input can initially be given the priority and will keep this priority as long as it continues to have messages in its input register in subsequent cycles. Upon resolution of insertion conflicts, the priority circuit will indicate the register which has the current priority. If at any point in time the input register holding the priority has no message but the other one does, the priority circuit will revert to reflect this change of traffic.
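A behavioral sketch of the priority circuit just described, assuming a two-input switch with inputs labeled 'upper' and 'lower' (the labels and function signature are our own):

    def arbitrate(upper_has_msg, lower_has_msg, priority):
        """One cycle of dynamic-priority arbitration. Returns the input that
        may insert this cycle and the updated priority. The priority holder
        keeps the priority while its register has a message; otherwise the
        priority reverts to the other, non-empty input.
        (A real switch would arbitrate only when both messages target the
        same output port.)"""
        if upper_has_msg and lower_has_msg:
            return priority, priority          # conflict: the holder wins
        if upper_has_msg:
            return 'upper', 'upper'            # priority follows the traffic
        if lower_has_msg:
            return 'lower', 'lower'
        return None, priority                  # idle cycle: nothing changes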


Since the insertion priority is given to a particular input, if the next arriving message is to be routed to the same output port, the first half of the CR sequence from that input can pass quickly without being blocked. Therefore, the second half of the CR sequence can appear at that input earlier and also pass through the second output link of the switch earlier. That is, the overlapping of the link usages from the same switch is higher and Effect (1') is reduced. Moreover, since the sequence of messages to the second and subsequent stages remains in consecutive order (i.e., not mixed with the sequence issued by other processors), the DIR length is halved in that stage. In our example, the sequence remains 000, 001, 010, 011 when it arrives at the second stage and the DIR at this stage is two. Thus, Effect (2') can be reduced in the second stage and subsequent stages.

An intuitive drawback of this approach is that the unfair message passing would cause uneven responses to the processors; that is, unfair service would be perceived by the processors. However, if all the processors in the network are performing vector accesses, this unfairness within each switch element does not have an impact on the performance of the overall system. Also, under uniform traffic this unfairness is not present, mainly because the DIR of any stage in this type of traffic is on average two.


4.2.3 The Combined Approach

The hybrid model which combines both the Dynamic Priority scheme and the Bit-Reverse mapping offers the same benefits as the ones described above, but with one added benefit. Observe that the Dynamic Priority allows for rapid passage of messages in the early stages, but in the latter stages it behaves almost like round robin since the DIR in those stages is not long. In Bit-Reverse mapping, however, the DIR becomes larger in the last few stages, and the Dynamic Priority can alleviate this in the same manner as described, but now effective in the last stages, where the Bit-Reverse mapping needs it most. We therefore expect this scheme to have the advantages of both.


4.3 Effects of Spatial Distribution on the CR Pattern

Two spatial parameters characterize a CR access: the stride and the bank offset, which can be of paramount importance to the performance of the memory access. The stride of an access refers to the constant offset between consecutive requests. For example, a processor which starts at module 0 and accesses with a stride of 2 will send its requests to the even-numbered memory banks. Hence it can be seen that the stride not only affects the order in which memory modules are accessed but also the number of distinct memory modules which are accessed.


The bank offset can be an important factor in the performance of CR accesses. Its effects can be seen by observing that the DIR of the sequences depends highly on the sequences which arrive at the input ports of a switch. The most restrictive case is when both sequences are identical, which makes the DIRs at both inlets of the same length and causes the least amount of overlap. When the sequences differ by just one module, some overlap is created, which reduces the Effect (1') mentioned above.


In some vector computers the stride is an important parameter. Some vector computers (the Cyber-205, for instance) allow only a stride of 1. These types of processors use "gather" and "scatter" methods to create temporary vectors which are contiguously stored. Most vector computers can process strides other than 1, and in our processor model we assume the processors can handle strides larger than 1. Stride accesses other than 1 are not unusual. In fact, a number of numeric array accesses and numeric computations in loops involving array accesses contain loop iteration variables which are incremented by numbers other than 1. In Figure 4.1 a code segment is given where the participating processors are accessing every other element of a vector.








DOALL i=1, number-of-processors
    DO j=0, number-of-vector-elements, 2    (step size is 2 between consecutive iterations)
        LOAD A(j)
    ENDDO
ENDDOALL

Figure 4.1. Stride-2 Access


In order to see the effect of the stride on a network with conventional switches, we will assume the same type of scenario as described above. We consider the activities in a switch in the first stage when the following CR sequence arrives simultaneously at each of its input ports. Assuming the stride to be 2 and the access to start from memory module 0, we then observe that at a first-stage switch the sequence with destinations 000, 010, 100, 110, 000, 010, 100, 110 arrives consecutively in an 8 x 8 system. It can immediately be observed that only the even-numbered modules are addressed. The DIR of this sequence for the first stage is 2, half of what it would have been in the case of a stride-1 access. The possibility of insertion conflicts is still high due to the length of the DIR at both inputs (this is particularly the case for larger systems). The next observation to be made is that the DIR for the second stage is exactly one. This is because the stride-2 access causes the second least significant bit to toggle. This is consistent with the findings of the unit stride access example in Section 4.2.1.




In general, a stride-i access can deteriorate network performance for the following reasons (the first two are a reformulation of the two effects stated in Section 4.1):


* Effect (1) The usages of the two links connected to the two output ports of a switch are not fully overlapped, i.e., resources are not fully utilized. In essence, this underutilization may occur at different stages, depending on the stride and the size of the network.

* Effect (2) The length of the DIR depends on the stride as well as on the size of the network, and it is long while traversing stages which do not use the bit toggled by the stride (henceforth referred to as the stride-bit). In the stage which uses the stride-bit, the DIR is exactly one, and it doubles with each subsequent stage. If two input sequences overlap with each other, the length of the DIR has an adverse effect on the performance of the CR traffic.

* Effect (3) A stride-i access is in essence a special kind of multiple hot-spot pattern, since only n/i memory modules will be accessed (assuming there are n memory modules and that i divides n evenly); a quick check is sketched below. In the worst case, all the traffic is directed to only those modules.

Strides which are not powers of 2 do not have the deteriorating characteristics described by the effects above. The primary reason for this is that the length of the DIR of such accesses is not as long as for accesses with power-of-2 strides.



4.3.1 Dynamic Priority and Stride Based Mapping Revisited

The Dynamic Priority scheme as described above can still be effective with stride accesses other than 1. It is well understood that the most restrictive case is the stride-1 access (where the DIR is the longest), and hence that type of access can benefit the most from the Dynamic Priority. However, Dynamic Priority still offers the same benefits to accesses of other strides, though the underutilization is perceivably less due to their shorter DIRs.

The Bit-Reverse mapping scheme described in Section 4.2.1 did not take the stride into consideration, but rather assumed it to be 1. The reason why this approach is not sufficient here is that this method was meant to spread the accesses of a stride-1 access, which accessed all the modules in a consecutive fashion. Strides other than 1 cannot take advantage of this characteristic, and hence a new mapping is proposed which generalizes the Bit-Reverse mapping previously described.

We will show by means of a simple example why the original Bit-Reverse mapping is not adequate for strides other than 1. Suppose we had the sequence of requests 000, 010, 100, 110 (access stride is 2). The original Bit-Reverse mapping scheme would map these to addresses 000, 010, 001, 011, respectively. These newly mapped addresses all lie in the lower half of the memory modules. With higher-numbered strides (powers of 2), the mappings will concentrate in an even smaller corner of the address spectrum.

Consider the destination addresses written as d_{n-1} d_{n-2} ... d_0, where d_0 is the least significant bit. The original motivation of the Bit-Reverse scheme was to split the routes of successive requests in order to allow more overlap in the initial stages of the MIN. By applying the Bit-Reverse mapping, the DIR of successive requests became 1 in the first stage and gradually grew as the stages approached memory. In order to achieve the same results with the sequence generated by a stride of 2^i, the reversal of bits should involve only bits d_{n-1} d_{n-2} ... d_i, and the remaining bits d_{i-1} d_{i-2} ... d_0 should stay the same. This is the general stride-based mapping family, of which the Bit-Reverse is a particular case (for stride 1).


Using this stride-based mapping scheme, the sequence 000, 010, 100, 110 gets mapped to 000, 100, 010, 110. Compared to the mappings created by the Bit-Reverse mapping, the newly acquired destinations are scattered over the upper half as well as the lower half of the memory modules, instead of being concentrated in a corner of the address spectrum. By distributing the addresses in this way, the network performance of independent requests has not changed. It is still the same as that without address mapping, by the same argument as was given in the Bit-Reverse discussion.
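A sketch of the general stride-based mapping for power-of-2 strides (our own rendering; names are hypothetical): the high n−i bits are reversed and the low i bits are left unchanged.

    def stride_reverse(addr, n_bits, i):
        """Stride-based mapping for a stride of 2**i: reverse the high
        (n_bits - i) bits d_{n-1}..d_i and keep the low bits d_{i-1}..d_0."""
        low = addr & ((1 << i) - 1)
        high = addr >> i
        rev = 0
        for _ in range(n_bits - i):
            rev = (rev << 1) | (high & 1)
            high >>= 1
        return (rev << i) | low

    # The stride-2 sequence 000, 010, 100, 110 maps to 000, 100, 010, 110:
    print([stride_reverse(a, 3, 1) for a in [0, 2, 4, 6]])  # [0, 4, 2, 6]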

The general stride-based scheme can reduce the effect of any CR pattern because it reduces both Effect (1) and Effect (2) described above. The drawback of this scheme is still that the DIR increases in the latter stages. Moreover, Effect (3) is not affected by the mapping scheme. We shall see later that Effect (3) becomes the predominant reason that stride-based mapping alone is not sufficient, particularly as the stride increases.








4.3.2 Skewed Storage Schemes

Interleaving memory works well when reference sequences address contiguous memory modules. Strides other than one are quite common in a multiprocessor vector processing environment, where each processor will generate highly correlated address sequences [HARP87]. The interleaved system will suffer under strides other than one because references will not be distributed over all memory modules.

Using any of the stride-dependent schemes described above will not increase the number of distinct memory modules being accessed (Effect (3)). Skewing schemes have been used to eliminate the conflicts which arise from parallel access with strides greater than one. It has been proven [BUDN71] that there is no single skewing scheme which allows conflict-free access to a vector using all strides. There are several classes of skewing schemes [SHAP78, WIJS85], but the scheme which is used in our experiments is defined by a function which maps the i-th element of a vector to memory module (i + ⌊i/N⌋) mod N, where N is the number of memory modules and ⌊i/N⌋ denotes the largest integer less than or equal to i/N. Table 4.1 illustrates the skewing scheme in an eight-module system.
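The skewing function is straightforward to express; a minimal sketch (ours) follows:

    def skewed_module(i, N):
        """Memory module holding the i-th vector element under the
        skewing scheme (i + floor(i/N)) mod N."""
        return (i + i // N) % N

    # Stride-2 access over 16 elements with N = 8 modules: the second pass
    # over the modules is shifted by one, avoiding repeated bank conflicts.
    print([skewed_module(i, 8) for i in range(0, 16, 2)])  # [0, 2, 4, 6, 1, 3, 5, 7]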


In order to see the relationship between the stride (S) and the number of memory modules (M), we consider the case in which both the stride and the number of memories are powers of 2 (S = 2^s and M = 2^m). We will consider two cases: strides for which there is at least one access per row (S < M, or s < m), and strides for which there is at most one access per row (S > M, or s > m).




For the following discussion, we assume that the initial access of the vector is to memory module 0; there is no loss of generality in this assumption. Assuming an interleaved memory scheme, the sequence of references is 0, 2^s, 2·2^s, ..., 2^m − 2^s; after 2^(m−s) references the sequence repeats. In the interleaved scheme, the (2^(m−s)+1)-th access falls in the same module as the initial access. This obviously causes a collision with the initial access. However, if the current row is skewed relative to the former row, then rather than referencing the same modules as the former row, references are made to modules adjacent to the corresponding modules in the former row. In other words, to allow M accesses to occur before a conflict occurs, it is necessary to skew 2^s rows relative to each other. For any stride with s < m, a skewed storage scheme would consist of circularly rotating each row r by r mod 2^s places relative to its original location in the interleaved scheme.
original location in the interleaved scheme.

In the case where there is at most one access per row (s > m), all references fall in the same module, which causes the most performance degradation. Rotating the rows can alleviate this, but the pattern is different from the one described above: blocks of contiguous rows are rotated relative to preceding blocks. Blocks of 2^(s−m) rows are rotated in unison, which is precisely the number of rows between consecutive elements. The rotation places consecutive accesses in adjacent modules. Over a period of M accesses, each module is referenced exactly once.

The generalized skewed storage scheme is thus based on two parameters:





















Table 4.1. Skewed Storage Scheme


1. The maximum rotation a row can have relative to the first row (r_max). For s > m, r_max is always M, and for the s < m case, r_max is given by 2^s:

    r_max = min(2^s, M)    (4.1)

2. The amount of memory to be rotated as a single block. In the s < m case, this is always 1; otherwise it is equal to 2^(s−m).
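Under the stated power-of-2 assumptions, both parameters reduce to one-line computations; a small sketch (ours):

    def skew_parameters(s, m):
        """Rotation parameters for stride S = 2**s and M = 2**m modules:
        maximum rotation r_max = min(2**s, M), and the number of rows
        rotated in unison as one block."""
        M = 2 ** m
        r_max = min(2 ** s, M)
        block = 1 if s <= m else 2 ** (s - m)
        return r_max, block

    print(skew_parameters(1, 3))  # stride 2, 8 modules  -> (2, 1)
    print(skew_parameters(4, 3))  # stride 16, 8 modules -> (8, 2)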


The efficiency of any skewed storage scheme is measured in terms of its ease of generating addresses and the number of possible conflict-free accesses it can provide (a parallel access is said to be conflict free if it can be accessed in a single memory cycle, i.e., without conflicts).


In the context of a multistage interconnection network, a skewed storage scheme will improve the performance of a CR traffic pattern, mainly because Effect (3) above is alleviated. However, note that there is still underutilization of the links. Consider again the stride-2 reference pattern in an interleaved memory: 000, 010, 100, 110, 000, 010, 100, 110. In a skewed memory storage the reference pattern would then become 000, 010, 100, 110, 001, 011, 101, 111. It can be seen from this simple example that the output links will still be underutilized in certain portions of the sequence.


When we apply the stride-dependent Bit-Reverse mapping, we get the sequence 000, 100, 010, 110, 001, 101, 011, 111 (or 0,4,2,6,1,5,3,7). Interestingly enough, this is the same pattern which we would get when applying the Bit-Reverse mapping scheme to a unit stride access starting at module 000. From the above we then expect similar performance for a Bit-Reverse mapping scheme of a stride-1 access in an interleaved memory and a stride-dependent mapping scheme of stride-2 in a skewed memory.
in a skewed memory.


In our investigations, we assume a vector is only accessed with a single stride. Even if vectors are not always accessed with a single stride, it is by far the most common case. An example in which a vector is not accessed with a single stride is the well-known row/column access pattern found in matrix manipulation routines.


4.4 Performance Evaluation of the Forward Network


In order to measure the performance of the system under CR traffic and to evaluate the effectiveness of our schemes, we have conducted a simulation study of the Dynamic Priority and Bit-Reverse (and stride-based) schemes in a multistage interconnection network. The simulation model used a 64 x 64 buffered Omega network, with switch queues of size eight each. The reason the size of eight was chosen is that in [KRUS83] it was shown that a queue size of eight could model an infinite queue size, so that blockage caused by a full queue can be almost eliminated. A processor has at most one outstanding request and stops issuing a request when the corresponding input register of the first stage of the MIN is busy. Once the processor starts issuing CR messages, it will generate one every clock cycle, if it can insert it into the network. This dependency of subsequent messages is the main characteristic of the CR traffic.


The time at which the processors start their vector access is also varied, from synchronous to asynchronous. To study the spatial effect caused by the bank offset, we shall distinguish between two types of access: CR sequences starting with the same bank offset, and CR sequences with bank offsets chosen randomly for each processor. The spatial effect caused by different stride accesses is investigated in conjunction with the stride-dependent mapping and skewing schemes. Finally, we also study vector accesses both under a network that is idle before the access and under an initial load of uniform accesses.


The code which implements the CR concurrent fetch is typical (Figure 4.2). Each processor is assumed to run the exact same code, and this code is assumed to be pre-scheduled (to minimize the effects of scheduling overhead). We assume that a single vector load is made and that no NOPs are inserted between fetches.









DOALL i=1, number-of-processors
    DO j=1, number-of-vector-elements
        LOAD A(j)
    ENDDO
ENDDOALL

Figure 4.2. CR Concurrent Fetch


The main measures which we took from simulations using the CR traffic model are the average message delay (the number of cycles incurred on the average by each message of the vector sequence); the throughput (the number of messages which reach memory per cycle); and the average sequence (vector) delay, which is defined as the average number of cycles it takes for all the processors to complete their vector access in the forward network. It represents the average of the worst-case completions over all iterations.


These measurements conform to a processing model in which a number of processors, N, participate in a barrier synchronization. As processors reach the barrier they are suspended and wait until the last one finishes. The barrier is the most restricted form of synchronization; thus a good measure of performance is the worst-case execution among the processors, since this indicates how long the other processors must wait before they can proceed. In our model, the execution is the issuing of a CR sequence, and the average sequence delay, as defined above, measures the average time the processors need to spend in the barrier synchronization.




























Figure 4.3. Processing Model


In each case, 200 runs were collected per simulation point (measurements were taken at inter-start periods of 0.0, 5.0, 10.0, 15.0, and 20.0). The model for this section assumed no memory model, i.e., memory modules had infinite-length input buffers. The optimum vector delay is also provided in every graph for comparison; the optimum assumes that there are no conflicts in the network.

4.4.1 Dynamic Priority and Bit-Reverse Mapping Results

The results presented in this section assume the CR sequence to be accessed with unit stride. Our simulation primarily focused on the relative performance of the four schemes.










[Figure: average vector delay (vector = 256, queue = 4) versus inter-start period, same bank offset, no initial uniform load; curves for No Priority/No Bit Reverse, Priority/No Bit Reverse, No Priority/Bit Reverse, Priority/Bit Reverse, and Optimum.]

Figure 4.4. Same Bank Offset/Stride 1 - No Initial Load


In the discussion which follows, we will refer to the round robin strategy with no memory mapping as the conventional design. Measurements were collected for the conventional design, the Dynamic Priority, the Bit-Reverse, and the hybrid scheme. The time at which each processor begins to issue its CR requests is distributed within the inter-start period.

The simplest scenario is the one in which there is no initial load on the system and the processors issue CR sequences with the same bank offset. Figure 4.4 shows the average vector delay for these four schemes under these assumptions. From this figure we can conclude that the Dynamic Priority scheme gives the best results if the processors start synchronously. In the asynchronous case this scheme deteriorates, but the combined scheme performs slightly better than the Dynamic Priority alone.









[Figure: average vector delay (vector = 256, queue = 4) versus inter-start period, random bank offset, no initial uniform load.]

Figure 4.5. Random Bank Offset/Stride 1 - No Initial Load


In all cases our proposed schemes yield better performance than the conventional design.

It is of importance in which sequence the allocated vector is being accessed, since this determines the sequence of requests issued by a processor and also the number of conflicts. Our simulation has concentrated on CR sequences with the same bank offset; for completeness, we also show how our schemes perform under accesses starting at different memory modules. This type of access causes fewer conflicts, and its performance can be seen in Figure 4.5. In this case, the hybrid scheme yields the best results while the individual schemes perform alike. It can also be seen that all the proposed schemes outperform the conventional switch design.











[Figure: average vector delay (vector = 256, queue = 4) versus inter-start period, same bank offset, uniform load for 500 cycles.]

Figure 4.6. Same Bank Offset/Stride 1 - Uniform Load for 500 Cycles


The same bank offset access, where the CR traffic is preceded by a uniform traffic period of 500 cycles at a request rate of 0.5, is shown in Figure 4.6. The initial load has a larger effect on the performance of the synchronous Dynamic Priority scheme, since this scheme is more sensitive to whether the network is idle before the access. The hybrid scheme shows the most stability, as can be observed from Figures 4.4 and 4.6: there is no change in the curve.


We also obtained the case in which the same initial load precedes a CR access starting from different bank offsets. The results are presented in Figure 4.7. Remarkably, the initial load has no effect on the average vector delay in the case of an access with different bank offsets. The reason for this is that the initial load makes the DIR smaller by interfering with the vector access.









[Figure: average vector delay (vector = 256, queue = 4) versus inter-start period, random bank offset, uniform load for 500 cycles.]

Figure 4.7. Random Bank Offset/Stride 1 - Uniform Load for 500 Cycles


Another factor which we measured is the throughput under the CR traffic. From Figure 4.8 it can be seen that the throughput obtained under synchronous access using the Dynamic Priority scheme is the highest.


Figures 4.8 and 4.9 show an interesting property of the Dynamic Priority scheme under synchronous CR sequences with the same bank offset access: irrespective of the length of the vector, after an initial setup time (different for each processor), each processor will deliver messages with the same message delay. Once the first request reaches memory, requests keep arriving at a constant rate of one per cycle. This indicates that after the bottleneck is resolved in the early stages, the pipeline of requests is operating at its fullest capacity.





















[Figure: throughput (requests per cycle) versus clock cycles, vector = 256, queue = 4; curves for Dynamic Priority with synchronous access, Dynamic Priority with interstart time = 10, and the conventional switch with interstart time = 10.]

Figure 4.8. Throughput for Dynamic Priority and Round Robin







The first message arrives at memory in a total of s + 1 cycles, where s is the number of stages of the network. At each subsequent cycle there is one more processor whose first request arrives. If the vector is at least as long as the number of memory modules, the number of arrivals per cycle will keep incrementing until all processors contribute one arrival per cycle. It starts decrementing at a constant rate of one processor per cycle once the first processor finishes its vector. If the vector is shorter than the number of memory modules, the curve still has the same shape, but some processors will have finished their entire vector even before others have begun. Figure 4.9 depicts the case in which a vector larger than the number of memory modules is accessed in a 64 x 64 network.


From this, the average vector delay is given by:

    AvgVectorDelay = (1/N) * sum_{i=1}^{N} (s + v + i - 1) = s + v - 1 + (N+1)/2

where s is the number of stages in the MIN, N is the number of processors in the network, and v is the number of elements in the vector.
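As a numerical sanity check of this closed form (under the same conflict-free pipeline assumption, in which processor i finishes after s + v + i − 1 cycles):

    # Check: average of (s + v + i - 1) over i = 1..N equals s + v - 1 + (N+1)/2.
    s, v, N = 6, 256, 64                      # e.g., a 64x64 MIN has 6 stages
    avg = sum(s + v + i - 1 for i in range(1, N + 1)) / N
    assert avg == s + v - 1 + (N + 1) / 2
    print(avg)  # 293.5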

From the simulation results it is clear that the CR traffic benefits from the Dynamic Priority scheme and suffers considerably under the conventional switch design. This scheme gives continual priority to one port until the requests from that port are exhausted. Thus the utilization of the links can be overlapped, and the overall turnaround time for all the processors is reduced compared to that of the round robin strategy.










[Figure: average message delay (vector = 256) versus inter-start period, showing the effect of buffer size on Dynamic Priority; curves for buffer lengths 4, 6, and 8.]

Figure 4.10. Effect of Buffer Size on Message Delay



The size of the buffers inside the switches does not have a great impact on the overall performance of the Dynamic Priority scheme. As Figure 4.10 shows, for a vector of length 256 the average message delay degrades only slightly as the buffer space increases. The improvement in the average vector delay only becomes significant in the asynchronous case (as can be seen in Figure 4.11). However, even with a small amount of buffer space inside a switch, the proposed schemes perform well.














[Figure: average vector delay (vector = 256) versus inter-start period, showing the effect of buffer size on Dynamic Priority.]

Figure 4.11. Effect of Buffer Size on Vector Delay


4.4.2 Dynamic Priority and Stride Based Mapping Results


In this section, results obtained on CR sequences accessed with strides other than unit stride are discussed. We have limited our experiments to CR sequences with the same bank offset and no load on the network prior to the CR access. The degradation caused by non-unit stride accesses, as described in Section 4.3, can clearly be seen in Figure 4.12. From the same figure it can be observed that even for the conventional switch design, a skewed storage scheme is able to improve the performance of several non-unit stride accesses. It is not surprising that the disparity between the different allocation schemes (interleaved vs. skewed) gets larger as the access stride increases, since a larger stride also means that fewer modules are accessed in an interleaved memory.


The effect of the stride on the evaluation of the stride-dependent mapping scheme in an interleaved memory can be seen in Figure 4.13. By merely adopting the stride-dependent mapping, an adequate improvement over the original Bit-Reverse mapping in an interleaved memory can be obtained. An important observation to be made from these results is that even though the number of modules accessed remains the same (so Effect (3) is not changed), the distribution of the requests itself can cause the improvement.


When Bit-Reverse mapping is performed without taking the stride into account (i.e., defaulting to unit stride), all the mapped addresses will lie in one corner of the address spectrum, causing a localized multiple hot-spot area. The stride-dependent mapping distributes these highly requested addresses, and by doing so the requests themselves are also spread out, causing less contention for the same MIN links.

In the event that no skewed storage is available, the best route to take is still to employ the Dynamic Priority. This can be concluded from Figure 4.14, where for a stride-4 access the Dynamic Priority scheme is certainly the one which yields the lowest average vector delay.

To be complete, we also provide the graph which shows that uniform traffic has the same message delay under all four schemes, compared against the minimum message delay.













[Figure: conventional scheme, interleaved vs. skewed storage, vector length = 128, versus interstart period; curves for interleaved and skewed storage at strides 1, 2, and 4, plus the optimum.]

Figure 4.12. Conventional Scheme - Interleaved vs. Skewed


[Figure: stride-dependent mapping in interleaved memory, vector length = 64, versus interstart period.]

Figure 4.13. Stride Dependent Mapping in Interleaved Memory















[Figure: average vector delay in interleaved memory, vector length = 64, stride = 4, versus interstart period.]

Figure 4.14. Average Vector Delay in Interleaved Memory


[Figure: average message delay under uniform traffic; curves for No Priority/No Bit Reverse, Priority/No Bit Reverse, No Priority/Bit Reverse, Priority/Bit Reverse, and Optimum.]








4.5 The Effects of the Reverse Network on CR Traffic


The overall system performance can be measured in terms of the message throughput which can be achieved by the MIN. In practice, the message throughput is not only determined by the bandwidth of the network but also by its latency. A processor may not be able to place a new request until it has received the response to a previous one. In the case of CR accesses one might consider three conditions under which a new CR sequence can be issued:


* As soon as the entire previous CR sequence has been issued, with no need to wait for its return.

* As soon as the first message of the previously requested CR sequence, or the element which caused the processor to issue the sequence (for example, in the case of a cache miss), has returned.

* As soon as the entire previous CR sequence has returned from memory.

These cases have an impact on the processor utilization (the processor might not be able to do anything useful until the requested message has returned).


4.5.1 Processor Model

For the purpose of the discussion in this section, we assumed the following processor model:

* The CR sequences will be accessed with the same stride, namely 1.


4.5.2 Memory Model

We assume the memory to be interleaved and that all requests to memory have identical latency (we do not distinguish between reads and writes), apart of course from the individual latencies incurred due to network conflicts. In reality, these requests result in distinct traffic patterns on the forward and reverse networks. Reads generate a request of one word which traverses the forward network but returns with two words over the reverse network. Writes, on the other hand, generate two words over the forward network, and a single word travels back over the reverse network to convey the success of the write to the processor. Though this asymmetric behavior may cause a difference in the performance of the two types of memory traffic, we will not consider this in our experiments.

Each memory module consists of four memory banks, but only one input buffer and one output buffer. After being serviced in a memory bank, messages contend for the reverse network and are queued according to their finishing times in the memory module (FIFO).


4.5.3 Network Model

It is of importance to study the behavior of the CR traffic in the forward as well as the reverse network, and to observe what improvements the proposed schemes can offer. All the permutations of the schemes are worthy of study, except one.






We assumed the forward network to be an Omega network and varied the reverse network from an Omega network to a Baseline network. The consequences of using a different type of MIN, in particular as the backward network, are investigated and found to be of interest. The type of reverse network and the schemes used in both the forward and reverse networks have a direct bearing on the performance obtained.


4.5.4 Performance Evaluation

The delay incurred per CR sequence is a direct measure of the processor utilization, since it is generally assumed that the processor idles as long as there are messages pending. In our experiments, the average delay incurred in the forward network and the average delay incurred in the reverse network were measured separately for comparison. Table 4.2 lists the parameters used in our model and Table 4.3 lists the measures obtained.


Table 4.2. Simulation Parameters

    Parameter                               Value
    Number of CR Sequences per Processor    1
    Memory Latency                          4
    Number of Processors                    64
    Number of Memory Modules                64
    Number of Banks per Memory Module       4
    Buffer Size per Switch                  4

Table 4.3. Performance Measures

    Average Forward Vector Delay
    Average Reverse Vector Delay






Here, the average forward vector delay is taken to be the same as defined in Section 4.4. The average reverse vector delay, however, includes the memory latency. The sum of the forward and reverse vector delays gives the total average vector turnaround time. In this section, we will present the results obtained from the forward and the backward network separately, in order to illustrate the effect of using different backward network topologies.

In the following, the notation x/x/x is used to indicate the implementation of Bit-Reverse in the forward network, Dynamic Priority in the forward network, and Dynamic Priority in the reverse network, respectively. The values for x are E for Enabled or D for Disabled. For example, E/D/E refers to the case where the forward network has the Bit-Reverse mapping implemented and the Dynamic Priority scheme disabled, while the reverse network has the Dynamic Priority enabled.
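For readability, the notation can be decoded mechanically; a tiny sketch (ours, not part of the original simulator):

    def decode(config):
        """Decode the x/x/x scheme notation, e.g. 'E/D/E'."""
        br_fwd, dp_fwd, dp_rev = (c == 'E' for c in config.split('/'))
        return {'bit_reverse_forward': br_fwd,
                'dynamic_priority_forward': dp_fwd,
                'dynamic_priority_reverse': dp_rev}

    print(decode('E/D/E'))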


The performance of Dynamic Priority in the forward network depends on what type of backward network is used (compare Figures 4.16 and 4.18). In the first one, the Dynamic Priority schemes outperform all the other six combinations, whereas in the latter one the hybrid schemes and the Dynamic Priority schemes seem to have similar results. From these two figures it is also apparent that the Bit-Reverse schemes have different outcomes depending on the type of reverse network used. We shall return to this later.

Implementing Dynamic Priority in the reverse network can be beneficial as well, although the nature of the requests depends on what schemes were used in the forward network; sequences with long DIRs benefit most from this scheme. The nature of the returning traffic, which is generated at the memory modules and heads towards the processors, depends on the order in which the requests arrived at the modules.

The choice of multistage interconnection network, together with whether the Bit-Reverse mapping was used or not, has a big impact on the latency suffered by the CR traffic. Note that the schemes which implement Bit-Reverse in the forward network perform worse than the conventional scheme when the Omega network is used instead of the Baseline network (contrast Figures 4.16 with 4.18, and 4.17 with 4.19). The reason for this phenomenon lies in the mapping performed and the topology of the reverse network. The observations can be explained by means of Figure 2.1.


When Bit-Reverse mapping is used in the forward network, the modules to which the consecutive requests are mapped are connected to the same switch in the backward network (if an Omega network is used as the backward network). Since the contending messages are headed for the same processor, they will be contending the whole way back. This same phenomenon will occur for all pairs of messages. Any different backward network which does not connect the modules described above in the first stage would remove this phenomenon, and the Baseline network is an example of a network which meets this criterion.














[Figure: average forward vector delay, reverse network = Omega, vector length = 128, versus interstart period.]

Figure 4.16. Average Forward Vector Delay - Omega Reverse Network


[Figure: average reverse vector delay, reverse network = Omega, vector length = 128; curves for the x/x/x scheme combinations and the optimum.]

Figure 4.17. Average Reverse Vector Delay - Omega Reverse Network












[Figure: average forward vector delay, reverse network = Baseline, vector length = 128, versus interstart period.]

Figure 4.18. Average Forward Vector Delay - Baseline Reverse Network


[Figure: average reverse vector delay, reverse network = Baseline, vector length = 128, versus interstart period.]

Figure 4.19. Average Reverse Vector Delay - Baseline Reverse Network







4.6 Chapter Summary


In this chapter, the Consecutive Request traffic pattern was discussed and the causes of network deterioration under CR traffic were determined. Two methods were presented to alleviate the poor performance of CR traffic in a MIN. The first method is a scheduling mechanism that replaces the round robin scheduling, namely the Dynamic Priority. The second method, Bit-Reverse mapping, is a mapping scheme whose purpose is to move the link contention to the latter stages of the network. It was shown that these schemes can improve the performance of the network under CR traffic considerably. In the following chapter, the issues involving the interaction of CR traffic with regular accesses are addressed.













CHAPTER 5
VECTOR/SCALAR INTERACTION IN MINS


In Chapter 4 the problem of access conflicts due to similar CR pattern accesses was studied. Though this problem in itself deserved the attention it received, a more common processing model is one in which processors access different types of data, resulting in a mix of interacting traffic patterns. Memory traffic generated by processors can be characterized by bursts of requests which may belong to either a vector or a scalar stream. The network performance in this case will depend on the rate of issue of each of these request streams, the type of access (traffic pattern) made by each of the processors, as well as the degree of interaction of these multiple streams.


The objective is still that the requests issued are to be satisfied promptly by memory. From the previous chapter it can be gathered that, when left alone, CR traffic can be detrimental in MINs. The question now remains whether the performance of this type of traffic is made worse by the presence of non-CR traffic patterns. Some of these non-CR traffic types are, for example, uniform traffic and hot-spot traffic. The reverse is absolutely of interest as well: whether the presence of CR traffic has a degrading effect on non-CR traffic. Non-CR traffic will be referred to as scalar traffic in the remainder of this chapter.


One example in which vector accesses are interspersed with scalar references can be drawn from a parallel algorithm known as the Gaxpy algorithm. In the shared memory Gaxpy computation, a vector z is computed as the sum of a vector y and the product of a matrix A and a vector x. We assume throughout that A, x, and y reside in shared memory and that the shared memory multiprocessor has n processors. The following example is that of the n-by-n gaxpy problem z = Ax + y and its partitioning. For simplicity we assume that there are n processors and that each processor is allocated one element from each vector and one row of the matrix A. From the algorithm below it can be observed that each processor engages in the fetch of a scalar element yi and two vectors, x and Ai. The vector x is a shared read-only vector which all processors will load into their local memory. The vector Ai, which pertains to a row of the matrix A, is not shared, but different elements of the different rows may still reside in the same memory modules, causing potential access conflicts. (We assume that the matrix is stored in row-major order.) Note that after each processor completes the computation, it returns the value to the shared memory by a store statement.








Local Variables x_local(1:n), z_local, a_local
for id = 1, n do in parallel
    load x into x_local            // vector load //
    load y(id) into z_local        // scalar load //
    for j = 1, n
        load A(id,j) into a_local
        z_local = z_local + a_local * x_local(j)
    end-for
    store z_local into z(id)
end-for
end gaxpy
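For reference, a sequential Python rendering of the same computation (shapes and names are ours); the parallel version runs one iteration of the outer loop per processor:

    import numpy as np

    def gaxpy(A, x, y):
        """Sequential sketch of the per-processor gaxpy work z = Ax + y."""
        n = len(y)
        z = np.empty(n)
        for pid in range(n):        # one "processor" per row id
            x_local = x.copy()      # vector load of the shared x
            z_local = y[pid]        # scalar load
            for j in range(n):
                z_local += A[pid, j] * x_local[j]
            z[pid] = z_local        # store back to shared memory
        return z

    A = np.arange(9.0).reshape(3, 3)
    x = np.ones(3)
    y = np.zeros(3)
    assert np.allclose(gaxpy(A, x, y), A @ x + y)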



The remainder of this chapter is organized as follows. Section 5.1 describes the nature of the vector-scalar interaction. Section 5.2 depicts the model used to evaluate the performance of the vector-scalar interaction. Section 5.3 details the performance results obtained by implementing the Dynamic Priority and Bit-Reverse mapping schemes. Section 5.4 summarizes the findings in this chapter.


5.1 The Effects of Vector/Scalar Interaction


Both vector and scalar request streams can be adversely affected by one another when they interfere with each other for a long period of time. The switching elements of a MIN may experience unbalanced traffic, and in order to understand the nature of this interference we will consider the network behavior caused by elements of scalar streams and CR (vector) streams. This interference has the following characteristics:


1. A sequence of uniform and independent requests (scalars) generated at a source node interacts with a sequence of requests destined to consecutive sink nodes (vectors).

2. Consecutive requests are generated in a continuous fashion. Scalars, on the other hand, have a probability of being generated in a given cycle. The probability that a scalar reference will be made is also referred to as the scalar issue rate.


The effects and performance of uniform traffic in MINs have been studied in the literature. The detrimental effects caused by vector/vector interaction were the subject of the previous chapter, for which the Dynamic Priority and Bit-Reverse schemes were proposed. The first is an alternative to the round robin scheduling strategy, and the second is the Bit-Reverse mapping, which reverses the addresses for unit stride accesses.


When vectors and scalars interact, it would seem likely that the scalar requests would be penalized the most. Uniform scalar accesses will find the paths which they must take to be mostly saturated with requests from vector streams. It is empirically found that when the percentage of processors actively accessing vectors is equal to the percentage of processors engaged in scalar requests, the volume of requests generated by the vector accesses is sufficient to congest the paths which are shared by the scalar requests.

Insertion conflicts in individual switches of the MIN can be handled in various ways. When the round robin strategy is used, vector requests and scalar requests will be interleaved and the consecutive nature of the vector requests will be altered. With larger networks this scalar interaction will cause vector sequences to be broken up and dispersed into a more uniform traffic pattern, which eliminates the congestion normally associated with nonuniform traffic patterns.
















[Figure: switch timing example with a CR sequence on the upper input and a random scalar sequence on the lower input of a first-stage switch.]

Figure 5.1. Vector-Scalar Interaction Example


It can hence be expected that the interaction of scalars with vectors can improve the performance of the vector access, at the cost of the scalar references.


Consider for example the activities in a switch in the first stage when a CR sequence arrives at the upper input port and a scalar sequence arrives at the lower input port with an arrival rate of 0.5 (one arrival every two cycles). For example, in Figure 5.1, the upper input port receives the CR sequence 000, 001, 010, ..., 111 (in an 8 x 8 system) in a continuous fashion, while at the lower input port a random sequence of destinations arrives. A round robin strategy in this switch results in sequences at the output ports which no longer exhibit a long DIR (which, as we conjectured in the previous chapter, is the main cause of the poor performance).
performance).

By implementing the Dynamic Priority scheme, the issue of fairness becomes immediate. Since Dynamic Priority gives priority to a single port until that port is totally exhausted, the port at which the vector sequences arrive will undoubtedly be favored whenever the






current priority-receiving port happens to be the one at which the vector sequence is arriving. Vector sequences, on the other hand, will benefit from the interaction due to the randomization effect, as was explained earlier.

The Bit Reverse mapping has a specific purpose: dispersing consecutive accesses to non-consecutive locations. When this mapping is applied to uniform scalar accesses, the result is another uniform set of accesses. Even after the mapping, the interference between these two types of requests will have no remarkable effect on either one. The scalar requests will be subject to the congestion caused by the volume of vector requests, as was described above, but that is not due to the mapping.
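The mapping itself is simple to state; a minimal sketch (the bit width and names are ours):

    def bit_reverse(addr, bits):
        # Reverse the low-order address bits, e.g. with bits = 3:
        # 001 -> 100.  Consecutive addresses are thereby dispersed to
        # non-consecutive banks; a uniform random address stays uniform.
        out = 0
        for _ in range(bits):
            out = (out << 1) | (addr & 1)
            addr >>= 1
        return out

    print([bit_reverse(a, 3) for a in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]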


5.2 Vector/Scalar Interaction Models


For the discussion of our processing model we make the following assumptions (Table 5.1). We consider only the interaction in the forward network, and thus assume the memory bank buffers to be infinite. This way, memory can handle all the requests, and there will be no queueing in the forward MIN due to busy memory modules.

The simulation model issues vectors of equal length, with unit stride and the same bank offset. The interacting processors are continuously either issuing vectors or scalars (following one of the request models described below). Each measured point of the simulation results represents the average of 200 runs.

Table 5.1. Assumptions of the Vector/Scalar Model

    Simulation Parameters
    No scalar caches
    Infinite memory bank buffers
    Scalar traffic is uniform
    Vectors of equal length
    Vector stride of one
    Same bank offset
    Interleaved memory
    Forward network only

Three request modes are considered:

    Single Request Mode (SRM)
    Dual Dynamic Request Mode (DDRM)
    Single Static Request Mode (SSRM)


The Single Request Mode is the model followed by the Cray X-MP system, where vector instructions cannot be issued as long as scalar memory accesses are still in progress. The SRM model requires barriers to be placed for all participating processors, and it places the most stringent requirements on the network. In a way, the interaction of scalar and vector accesses is not an issue because they will never interact. What is of essence in this model is the latency incurred during both the scalar burst and the vector burst, as this will be the decisive factor. For this reason, we shall not consider this model one in which the desired interaction is manifested.


In the Dual Dynamic Request Mode (DDRM) model, processors can issue both vector and scalar streams, without having to wait for either to have completed before issuing the other. The third model, the Single Static Request Mode (SSRM), is one in which the processors are dedicated to one particular request mode, instead of being general purpose. For the duration of the experiment, a processor has a single request mode which remains unchanged.

The request mode affects the issue rate and the dependencies between requests from the same stream. Scalars are issued according to a burst rate and are independent of one another, whereas individual vector requests are dependent and are issued in a continuous fashion (the CR traffic pattern). Moreover, the number of processors issuing either scalar or vector traffic also has a direct bearing on the performance.

In the DDRM model, processors can request both vectors as well as scalars, but not simultaneously. A processor can start either a vector or a scalar request; the probability with which a processor will do one or the other is a parameter to be varied. The duration of a scalar stream is drawn from a burst interval, and the frequency of scalar requests during this scalar burst is decided by the scalar issue rate. Scalar bursts and vector bursts are interleaved, so a scalar burst is always followed by a vector access and a vector access is always followed by a scalar burst. The processing model assumes that enough buffers are provided to hold all outstanding requests which may still be in the network or are being serviced by memory. The end of a vector access is indicated by the issue of the last vector element, upon which another scalar burst is started. The end of a scalar burst is indicated by the system clock, and any remaining requests which are still pending in the system are allowed to finish. The scalar burst length thus determines the duration of interference, whereas the scalar issue rate directly affects the degree of interference.
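A sketch of one DDRM processor under these assumptions (the parameter names are ours; the defaults follow the values listed later in Table 5.3):

    import random

    def ddrm_trace(rng, cycles, p_vector=0.5, vec_len=64,
                   burst_max=100, issue_rate=0.5, banks=64):
        # Vector accesses (one element per cycle, unit stride) strictly
        # alternate with scalar bursts; the burst length is drawn from
        # [0, burst_max] and a uniform scalar is issued in each burst
        # cycle with probability issue_rate.
        trace = []
        vector_turn = rng.random() < p_vector      # which stream starts
        while len(trace) < cycles:
            if vector_turn:
                for i in range(vec_len):           # CR stream, no gaps
                    trace.append(('V', i % banks))
            else:
                for _ in range(rng.randint(0, burst_max)):
                    dest = rng.randrange(banks) if rng.random() < issue_rate else None
                    trace.append(('S', dest))
            vector_turn = not vector_turn          # bursts are interleaved
        return trace[:cycles]

    trace = ddrm_trace(random.Random(3), 500)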

In the SSRM model, the request mode of each processor is fixed. In contrast with the DDRM model, a vector requesting processor issues vectors which are not followed by scalar requests but instead by an idle period. Scalar issuing processors issue scalars according to a certain issue rate, and for simplicity we will assume that the issue rate is the same for all scalar requesting processors. The interaction is now defined by the locations and the number of the scalar requesters with respect to the vector requesters. The simplest assumption is that the request mode is chosen in an independent and random manner, so the locations of these individual processors are random. Together with the scalar issue rate, the scalar/vector interaction thus has tangible parameters which can be varied and observed in the experiments we shall conduct.
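A sketch of the corresponding SSRM setup (ours): modes are fixed per processor for the whole experiment and chosen independently at random, so the scalar requesters land at random source positions:

    import random

    def ssrm_roles(n_procs, frac_scalar, rng):
        # Fix each processor's request mode for the entire experiment:
        # 'S' (scalar requester) with probability frac_scalar, else 'V'.
        return ['S' if rng.random() < frac_scalar else 'V'
                for _ in range(n_procs)]

    roles = ssrm_roles(64, 0.5, random.Random(7))  # ~1:1 ratio, random placement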


5.2.1 Special Interaction Types


In this section we describe the traffic patterns which may arise from the DDRM model:

* Scalar burst length of 0. The interaction here degenerates to a vector/vector interaction where there are continuous vector references (there are no scalar references between vectors). The traffic generated resembles that of a single vector access.








* Scalar issue rate of 1. This results in the most severe interaction between scalar and vector references. The scalar references are issued continuously but in an independent fashion. They serve primarily as a means to randomize the vector sequence.


Special traffic patterns which arise from the SSRM model are:

* Scalar burst length of 0 or issue rate of 0. The interaction becomes one in which vector/vector interaction takes place but, in addition, there are idle processors (the scalar requesters are idle).

* Scalar rate of 1. The interference caused by a continuous sequence of scalar references and a vector reference is similar to that of the DDRM model with scalar rate of 1. The vector sequence will be randomized by the interference of the uniform traffic.

* Percentage of scalar requesters of 0. This results in a vector/vector interaction model, such as was studied in the previous chapter. The vector sequences will be interleaved with idle periods.


5.3 Performance Evaluation


The simulation model used was a 64 x 64 buffered Omega network. The buffer capacity of the switches was held at 4 messages per port. We assume that a processor can continue issuing messages as long as the input register to the first stage is available. The processing model which we are investigating assumes that the processors in the system are executing parallel tasks which have been previously allocated. Processors execute and request scalars and vectors in an interleaved fashion. We assume that the vectors are of the same length and are accessed with the same stride. The scalar access burst length is selected from a uniform distribution. The time at which the processors start their accesses (vector or scalar) is also varied, from synchronous to asynchronous. We have limited the experiment to vector references which start with the same bank offset.

The model was primarily used for comparison experiments between the conventional switch and the Dynamic Priority and Bit Reverse mapping schemes. The measures of interest were the average scalar throughput and delay, and the average vector throughput and delay. The scalar (vector) throughput is defined to be the number of scalar (vector) messages which arrive at memory per cycle. This chapter concerns itself with the forward network only, and hence the memory is assumed to be able to service all the requests. The average vector delay is defined to be the average number of cycles it takes for an entire vector to reach memory. The results obtained are then compared against the optimum, where the optimum is obtained by assuming that there are no conflicts in the network.
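One consistent reading of this no-conflict optimum for the vector delay (our arithmetic; the text does not spell it out) treats the forward network as a pipeline: assuming 2 x 2 switches, a 64 x 64 network has k = log2(64) = 6 stages, so the first element of a vector of length L = 64 reaches memory after 6 cycles and each remaining element follows one cycle behind:

    D_opt = k + L - 1 = 6 + 64 - 1 = 69 cycles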


5.3.1 Performance of SSRM Model


We assume that, per experiment, the number of processors which request vectors is fixed amongst all vector requesting processors. Unless explicitly given, the parameters used in the SSRM experiments are assumed to be the ones given in Table 5.2.

Table 5.2. SSRM Model Parameters

    Parameter                        Value
    System Size                      64 x 64
    Vector Length                    64
    Buffer Size per Switch           4
    Access Stride                    1
    Scalar Issue Rate                0.50
    Idle Interval                    100
    Ratio Scalar/Vector Processors   1:1
Giving scalar references a higher priority will result in a degradation of vector performance. This can be observed in Figure 5.2, where four schemes use scalar priority: as the scalar issue rate increases, the delay incurred by the vector sequences gets worse. The curves in Figure 5.3 reflect the same information but show the scalar delay for the same schemes as in Figure 5.2. As expected, the highest scalar delays are incurred by the scheme which did not give priority to scalar references.

The number of processors actively engaged in vector requests decides the degree of contention in the system. By giving scalar references a higher priority, all four schemes have more or less the same performance (see Figure 5.4), except when the vector load is low. In that case, the conflicts in the early stages can be removed effectively by using the Bit Reverse mapping; scalar requests then randomize the sequence in the latter stages, and together they provide an excellent turnaround time. Figure 5.5 shows the corresponding scalar delay when scalar priority is used.

















[Figure 5.2. Average Vector Delay, SSRM Model (average vector delay vs. scalar issue rate, vector size 64)]

[Figure 5.3. Average Scalar Delay, SSRM Model (average scalar delay vs. scalar issue rate, vector size 64)]

[Figure 5.4. Average Vector Delay, SSRM Model, Scalar Priority (average vector delay vs. scalar issue rate, vector size 64)]

[Figure 5.5. Average Scalar Delay, SSRM Model, Scalar Priority (average scalar delay vs. scalar issue rate, vector size 64)]


5.3.2 Performance of DDRM Model


The assumptions made are that each processor has an equal probability of starting either a vector or a scalar stream. In reality this will depend on the application; since most vectorized code does not contain many scalar accesses, the proportion of scalars with respect to vectors is perhaps slightly exaggerated here. All vectors were assumed to be of the same length and accessed with the same stride. We assume that the vectors are being accessed from the same bank and that the scalar requests are uniformly distributed over the entire memory spectrum.

The length of the scalar burst is a decisive factor in the number of vector accesses in the model, since the vector accesses are interleaved with scalar bursts. Unless otherwise specified, the assumptions made in the DDRM experiments are the ones listed in Table 5.3.

Table 5.3. DDRM Model Parameters

    Parameter                Value
    System Size              64 x 64
    Vector Length            64
    Buffer Size per Switch   4
    Access Stride            1
    Scalar Issue Rate        0.50
    Scalar Burst Length      0-100

The results in Figures 5.6 and 5.7 were obtained by issuing vectors of length 64 with a scalar request rate of 0.5. The scalar burst length was uniformly drawn from an interval of 0-100 cycles. In this experiment, vector requests were not distinguished from scalar requests; that is, no scalar priority was applied.













[Figure 5.6. Average Vector Performance for All Schemes: average vector delay vs. interstart period (DDRM, vector length 64, scalar burst 0-100, no scalar priority)]


The scalar delay saw little benefit from the use of Dynamic Priority (Figure 5.7). Vector accesses, however, were quite improved compared to the conventional scheme. Figure 5.6 confirms the fact that even with scalar interference, both the Dynamic Priority and Bit Reverse schemes can outperform the conventional scheme, as was found in the previous chapter.


The scalar burst parameter not only has a direct bearing on the vector access, it also allows the study of special cases, such as a scalar burst length equal to zero (hence the study of continuous vector accesses) and a scalar issue rate equal to one (which means the issue rates of scalars and vectors are the same). When the scalar issue rate is one, the experiment can be seen as studying the temporal correlation of CR patterns.
[Figure 5.7. Average Scalar Performance for All Schemes: average scalar delay vs. interstart period (DDRM, vector length 64, scalar burst 0-100, no scalar priority)]


[Figure 5.8. Effect of Scalar Burst Interval (DDRM): average vector delay vs. scalar burst interval length (vector length 64, no scalar priority)]

[Figure 5.9. Effect of Scalar Burst Length on Scalar Delay, No Scalar Priority (DDRM, vector length 64)]


In Figure 5.8 the average vector delay for the conventional scheme becomes increasingly worse even under long scalar burst intervals. Long burst intervals reflect a large number of scalar requests and a long vector/scalar interaction period. The Dynamic Priority, Bit Reverse, and Hybrid schemes can offer a significant improvement in vector performance. As was expected, the Dynamic Priority scheme favors long DIRs, which is not advantageous to the scalar requests, as can be perceived in Figure 5.9. The Bit Reverse and Hybrid schemes offer results similar to the conventional scheme due to the distributing effect of the Bit Reverse mapping.


The scalar issue rate depicts the volume of scalars injected into the system.















[Figure 5.10. Effect of Scalar Issue Rate on Average Vector Delay (DDRM, vector length 64, with scalar priority)]


When the scalar issue rate is zero, the equivalent of having only vector interaction amongst the processors with idle periods between consecutive vector sequences, it can be seen in Figure 5.10 that the conventional scheme is significantly worse than any of the proposed ones.


The scalar delay is slightly better in the proposed schemes as long as the issue rate is less than 0.5; at higher rates, the Bit Reverse and Hybrid schemes perform slightly worse than the conventional one. Due to the scalar priority, a higher scalar issue rate does not affect the average scalar delay dramatically.


5.4 Chapter Summary


The nature of the interaction between Consecutive Requests and scalar references was examined in this chapter. The volume of requests generated by vector accesses congests the paths shared with the scalar requests, causing more overlap of link utilization. Unless priority is given to scalar references, the scalar delay increases due to the volume of CR requests.