Parallel Performance Wizard - Framework and Techniques for Parallel Application Optimization

Permanent Link: http://ufdc.ufl.edu/UFE0042186/00001

Material Information

Title: Parallel Performance Wizard - Framework and Techniques for Parallel Application Optimization
Physical Description: 1 online resource (129 p.)
Language: english
Creator: Su, Hung
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010


Subjects / Keywords: analysis, automatic, bottleneck, framework, mpi, optimization, parallel, performance, pgas, ppw, shmem, tool, upc
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation


Abstract: Developing a high-performance parallel application is difficult. Given the complexity of high-performance parallel programs, developers often must rely on performance analysis tools to help them improve the performance of their applications. While many tools support analysis of message-passing programs, tool support is limited for applications written in other programming models such as those in the partitioned global-address-space (PGAS) family, which is of growing importance. Existing tools that support message-passing models are difficult to extend to support other parallel models because of the differences between the paradigms. In this dissertation, we present work on the Parallel Performance Wizard (PPW) system, the first general-purpose performance system for parallel application optimization. The complete research is divided into three parts. First, we introduce a model-independent PPW performance tool framework for parallel application analysis. Next, we present a new scalable, model-independent PPW analysis system designed to automatically detect, diagnose, and possibly resolve bottlenecks within a parallel application. Finally, we discuss case studies to evaluate the effectiveness of PPW and conclude with contributions and future directions for the PPW project.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Hung Su.
Thesis: Thesis (Ph.D.)--University of Florida, 2010.
Local: Adviser: George, Alan D.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0042186:00001


This item has the following downloads:

Full Text



PARALLEL PERFORMANCE WIZARD - FRAMEWORK AND TECHNIQUES FOR PARALLEL APPLICATION OPTIMIZATION

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

2010 Hung-Hsun Su

I dedicate this to my family for all their love and support


This work is supported in part by the U.S. Department of Defense. Thanks go to my

advisor, Dr. Alan D. George, for his advice and patience over the course of this research,

and to my committee members, Dr. Lam, Dr. Gre-, and Dr. Sanders, for their time and

support. I would also like to acknowledge current and past members of the Unified

Parallel C group at UF, Max Billingsley III, Adam Leko, Hans Sherburne, P. Gopi,

Bryan Golden, Armando Santos, and P. Subramanian, for their involvement in the design

and development of the Parallel Performance Wizard system. Finally, I would like to

express thanks to Dan Bonachea and the Unified Parallel C group members at U.C.

Berkeley and Lawrence Berkeley National Laboratory for their suggestions and

cooperation.



ACKNOWLEDGMENTS ................................. 4

LIST OF TABLES ....................... ............. 8

LIST OF FIGURES ....................... ........... 10



1 INTRODUCTION ...................... .......... 14

2 PARALLEL PROGRAMMING MODELS AND PERFORMANCE TOOLS ........ 16

2.1 Parallel Programming Models ........ ................. 16
2.2 Performance Tools ..... ...................... 18

3 BACKGROUND RESEARCH FINDINGS ........... ....... 22

4 PERFORMANCE ANALYSIS .............. ... .......... 25

4.1 Parallel Programming Event Model ......... ............. 28
4.1.1 Group-Related Operations ..... .................. 32
4.1.2 Data Transfer Operations ........ ........... ... 35
4.1.3 Lock, Wait-On-Value, and Locally Executed Operations ...... 38
4.1.4 Implementation Challenges and Strategies ...... ...... 39
4.2 Instrumentation and Measurement ......... .............. 41
4.2.1 Overview of Instrumentation Techniques .... .......... 41
4.2.2 The Global-Address-Space Performance Interface .... 42
4.2.3 GASP Implementations .................. ..... .. 46
4.3 Automatic Analysis .............. ...... 46
4.4 Data Presentation ............... .... 47
4.5 PPW Extensibility, Overhead, and Storage Requirement ... 48
4.5.1 PPW, TAU, and Scalasca Comparison ..... 51
4.5.2 PPW Tool Scalability ............... .... .. 51
4.6 Conclusions ............... .............. .. 52


5.1 Overview of Automatic Analysis Approaches and Systems ... 56
5.2 PPW Automatic Analysis System Design ... 59
5.2.1 Design Overview. .................... ......... .. 60
5.2.2 Common-Bottleneck Analysis ................ .... .. 61
      Bottleneck detection ..... ......... .. 65
      Cause analysis ............. .. 66
5.2.3 Global Analyses ............... ........ .. 68
      Scalability analysis ................. .. 68
      Revision analysis ................ .... .. 68
      Load-balancing analyses .................. .. 69
      Barrier-redundancy analysis .. ......... 70
      Shared-data analysis ................ .. 70
5.2.4 Frequency Analysis .................. ........ .. 71
5.2.5 Bottleneck Resolution ............... ...... .. 71
5.3 Prototype Development and Evaluation ............. .. .. .. 72
5.3.1 Sequential Prototype .................. ....... .. 75
5.3.2 Threaded Prototype .................. ........ .. 80
5.3.3 Distributed Prototype .................. ..... .. 82
5.3.4 Summary of Prototype Development ................ .. 85
5.4 Conclusions .................. ................ .. 85


6.1 Productivity Study .................. ........... .. 87
6.2 FT Case Study .................. .............. .. 88
6.3 SAR Case Study .................. ............. .. 92
6.4 Conclusions .................. ................ .. 100

7 CONCLUSIONS .................. ................ .. 103

APPENDIX: GASP SPECIFICATION 1.5 .................. ...... 106

A.1 Introduction .................. ................ .. 106
A.1.1 Scope .................. ................ .. 106
A.1.2 Organization .................. ............ 106
A.1.3 Definitions .................. ............. 106
A.2 GASP Overview. .................. ............. 107
A.3 Model-Independent Interface .................. .. ..... 108
A.3.1 Instrumentation Control .................. ... 108
A.3.1.1 User-visible instrumentation control 108
A.3.1.2 Tool-visible instrumentation control 109
A.3.1.3 Interaction with instrumentation, measurement, and user
events .............. .... ... ....... 109
A.3.2 Callback Structure .................. ........ .. 110
A.3.3 Measurement Control .................. ..... .. 113
A.3.4 User Events .................. ............ .. 113
A.3.5 Header Files .................. ............ .. 114
A.4 C Interface ................ ............ .. .. 114
A.4.1 Instrumentation Control .................. ... .. 114
A.4.2 Measurement Control ............... .... .. 114

A.4.3 System Events
A.4.3.1 Function events
A.4.3.2 Memory allocation events
A.4.4 Header Files
A.5 UPC Interface
A.5.1 Instrumentation Control
A.5.2 Measurement Control
A.5.3 User Events
A.5.4 System Events
A.5.4.1 Exit events
A.5.4.2 Synchronization events
A.5.4.3 Work-sharing events
A.5.4.4 Library-related events
A.5.4.5 Blocking shared variable access events
A.5.4.6 Non-blocking shared variable access events
A.5.4.7 Shared variable cache events
A.5.4.8 Collective communication events
A.5.5 Header Files

BIOGRAPHICAL SKETCH ................................ 126



Table page

4-1 Theoretical event model for parallel programs .................. 30

4-2 Mapping of UPC, SHMEM, and MPI 1.x constructs to generic operation types 31

4-3 GASP events for non-blocking UPC communication and synchronization .. 43

4-4 Profiling/tracing file size and overhead for UPC NPB 2.4 benchmark suite 48

4-5 Profiling/tracing file size and overhead for SHMEM APSP, CONV, and SAR
application ............... ...................... .. 50

4-6 Profiling/tracing file size and overhead for MPI corner turn, tachyon, and SAR
application ............... ...................... .. 50

4-7 Profiling/tracing file size and overhead comparison of PPW, TAU, and Scalasca
tools for a 16-PE run of the MPI IS benchmark ................. 51

4-8 Profiling/tracing file size and overhead for medium-scale UPC NPB 2.4 benchmark
suite runs . .. 52

5-1 Summary of existing PPW analyses ................ .... 62

5-2 Common-bottleneck patterns currently supported by PPW and data needed to
perform cause analysis .. .............. ........... 63

5-3 Example resolution techniques to remove parallel bottlenecks .. 73

5-4 Sequential analysis speed of NPB benchmarks on workstation ... 80

5-5 Analysis speed of NPB benchmarks on workstation ............... ..82










5-6 Analysis speed of NPB benchmarks on Ethernet-connected cluster

6-1 Performance comparison of various versions of UPC and SHMEM SAR programs

A-1 User function events

A-2 Memory allocation events

A-3 Exit events

A-4 Synchronization events

A-5 Work-sharing events

A-6 Library-related events

A-7 Blocking shared variable access events

A-8 Non-blocking shared variable access events ............. .. ... .. 121

A-9 Shared variable cache events ............... ......... 122

A-10 Collective communication events ............... .... .. 123


Figure page

2-1 Measure-modify performance analysis approach of performance tool ...... ..19

4-1 PPW-assisted performance analysis process from original source program to
revised (optimized) program ............... ......... .. 26

4-2 Generic-operation-type abstraction to facilitate the support for multiple
programming models ............... ............. .. 27

4-3 Framework of Parallel Performance Wizard organized with respect to stages
of experimental measurement and model dependency (multi-boxed units are
model-dependent) ................ ........... .. .. 28

4-4 Events for group synchronization, group communication, initialization,
termination, and global memory allocation operations (TS = Timestamp.
Only a few of PE Y's events are shown to avoid clutter) .......... .34

4-5 Events for one-sided communication and synchronization operations ...... ..37

4-6 Events for two-sided communication and synchronization operations ...... ..38

4-7 Events for (a) lock mechanisms, wait-on-value and (b) user-defined
function/region, work-sharing, and environment-inquiry operations ...... 39

4-8 Interaction of PGAS application, compiler, and performance tool in
GASP-enabled data collection .................. ... 43

4-9 Specification of gasp_eventnotify callback function .............. 44

4-10 (a) Load-balancing analysis visualization for CG 256-PE run, (b) Experimental
set comparison chart for Camel 4-, 8-, 16-, and 32-PE runs ... 49

4-11 Annotated screenshot of new UPC-specific array distribution visualization showing
physical layout of a 2-D 5x8 array with block size 3 for an 8-PE system 50

4-12 (a) Data transfers visualization showing communication volume between
processing PEs for 256-PE CG benchmark tracing mode run, (b) Zoomed-in
Jumpshot view of 512-PE MG benchmark .................. 53

5-1 Tool-assisted automatic performance analysis process .............. ..56

5-2 PPW automatic analysis system architecture ................ 61

5-3 Example analysis processing system with 3 processing units showing the analyses
each processing unit performs and raw data exchange needed between processing
units . .. ...... 63

5-4 Analysis process flowchart for a processing unit in the system ... 64

5-5 Barrier-redundancy analysis .. ...........

5-6 Frequency analysis .. ...............

5-7 PPW analysis user interface .. ...........

5-8 Annotated PPW scalability-analysis visualization .

5-9 PPW revision-comparison visualization ........

5-10 Annotated PPW high-level analysis visualization .

5-11 PPW event-level load-balance visualization ......

5-12 Annotated PPW analysis table visualization .....

5-13 Annotated PPW analysis summary report ......

5-14 A common use case of PPW where user transfers the data from parallel system
to workstation for analysis .. ............

5-15 Analysis workflow for the (a) sequential prototype, (b) threaded prototype, and
(c) distributed prototype .. .............

5-16 A use case of PPW where analyses are performed on the parallel systems .

5-17 Memory usage of PPW analysis system on cluster .. ..............

6-1 Productivity study result showing (a) method with more bottlenecks identified,
(b) preferred method for bottleneck identification, and (c) preferred method for
program optimization .. ..............

6-2 Annotated PPW Tree Table visualization of original FT showing code regions
yielding part of performance degradation .. ..................

6-3 Annotated Jumpshot view of original FT showing serialized nature of upc_memget
at line 1950 .. ..............

6-4 PPW Tree Table for modified FT with replacement asynchronous bulk transfer
calls .. ..............

6-5 Multi-table analysis visualization for FT benchmark with annotated Jumpshot
visualization .. ..............

6-6 Overview of Synthetic Aperture Radar algorithm .. ..............

6-7 Performance breakdown of UPC SAR baseline version run with 6, 12, and 18
computing PEs annotated to show percentage of execution time associated with
barriers .. ..............


6-8 Timeline view of UPC SAR baseline version run with 6 computing PEs annotated
to highlight execution time taken by barriers ................ 97

6-9 High-level analysis visualization for the original version (vl) of SAR application
with load-imbalance issue .................. ........... .. 98

6-10 Observed execution time for various UPC and SHMEM SAR revisions 98

6-11 Timeline view of UPC SAR flag synchronization version executed on system
with 6 computing PEs annotated to highlight wait time of flags ... 99

6-12 High-level analysis visualization for the F2M version (v5) of SAR application
with no major bottleneck .................. ........... .. 100

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

PARALLEL PERFORMANCE WIZARD - FRAMEWORK AND TECHNIQUES FOR PARALLEL APPLICATION OPTIMIZATION

By

Hung-Hsun Su

August 2010

Chair: Alan D. George
Major: Electrical and Computer Engineering

Developing a high-performance parallel application is difficult. Given the complexity

of high-performance parallel programs, developers often must rely on performance

analysis tools to help them improve the performance of their applications. While

many tools support analysis of message-passing programs, tool support is limited for

applications written in other programming models such as those in the partitioned

global-address-space (PGAS) family, which is of growing importance. Existing tools that

support message-passing models are difficult to extend to support other parallel models

because of the differences between the paradigms. In this dissertation, we present work on

the Parallel Performance Wizard (PPW) system, the first general-purpose performance

system for parallel application optimization. The complete research is divided into

three parts. First, we introduce a model-independent PPW performance tool framework

for parallel application analysis. Next, we present a new scalable, model-independent

PPW analysis system designed to automatically detect, diagnose, and possibly resolve

bottlenecks within a parallel application. Finally, we discuss case studies to evaluate the

effectiveness of PPW and conclude with contributions and future directions for the PPW

project.

CHAPTER 1
INTRODUCTION

Parallel computing has emerged as the dominant high-performance computing

paradigm. To fully support concurrent execution, many computer systems

such as the symmetric multiprocessing system, computer cluster, grid,

and multi-core machines have been developed. In order to take advantage of these

parallel systems, a variety of parallel programming models such as Open Multi-Processing

(OpenMP), Message Passing Interface (MPI), Unified Parallel C (UPC), and SHared

MEMory (SHMEM) library have been created over the years. Using these technologies,

scientists and engineers in research and commercial fields are able to develop

applications that solve difficult science problems more quickly or solve problems

previously thought to be impossible.

Unfortunately, due to the added complexity of parallel systems and programming

models, parallel applications are more difficult to write than sequential ones and even

harder to optimize for performance. Discovery and removal of performance issues require

extensive knowledge on the part of the programmer about the execution environment and involve

a significant amount of effort. Programmers often must undergo a non-trivial, iterative

analysis and optimization process that is cumbersome to perform manually in order

to improve the performance of their applications. To facilitate this optimization process,

many performance analysis tools (referred to hereafter as performance tools) were

developed that support a variety of parallel systems and programming models.

Among the available parallel programming models, MPI has received the majority

of performance tool research and development attention, as it remains the most well-known and

widely used. Most existing parallel performance tools support MPI to

some degree but are limited in supporting other programming models such as OpenMP and

those in the partitioned global-address-space (PGAS) family. Attempts have been

made to support these newer models, but the progress has not kept pace with the

demand. Since most existing tools were first designed to support a particular model

(i.e., MPI), they became too closely associated with that model, and as a result, a

great amount of effort is needed on the developers' part to add new model support.

In this dissertation, we outline our work toward the Parallel Performance Wizard

(PPW) system. The goal is to research and develop a general-purpose performance

tool infrastructure that readily supports multiple parallel programming models and to

develop advanced techniques to enhance tool usability. The remainder of this document

is organized as follows. In Chapter 2, we provide an overview of performance tools and

parallel programming models. In Chapter 3, we describe our research methodology as well

as some background research findings that shaped the development of the PPW

infrastructure. In Chapter 4, we present the PPW framework and provide experimental

results for a functional PPW tool that supports UPC, SHMEM, and MPI. In Chapter 5,

we introduce a new automatic analysis system developed to enhance the usability of the

PPW tool and give experimental results for the sequential, threaded, and distributed

versions of this system. In Chapter 6, we discuss case studies to validate the framework

and techniques developed and conclude the document in Chapter 7.

CHAPTER 2
PARALLEL PROGRAMMING MODELS AND PERFORMANCE TOOLS

In this chapter, we provide an overview of parallel programming models and

performance tools. To avoid confusion, the term processing element (PE) is used

to reference a system component (e.g., a node, a thread) that executes a stream of

instructions.

2.1 Parallel Programming Models

In this section, we provide an overview of parallel programming models and the three

models most relevant to this research: MPI, UPC, and SHMEM.

A parallel programming model is a collection of software technologies that allows

programmers to specify computations and orchestrate interactions among PEs. The

goal is to assist programmers in turning parallel algorithms into executable

applications on parallel computers. Parallel programming models are generally categorized

by how memory is used. In the shared-memory model (e.g., OpenMP, Pthreads

libraries), each PE has direct access to a shared memory space, and communication

between PEs is achieved through reading and writing of variables that reside in this shared

memory. In the message-passing model (e.g., MPI), each PE has access only to its

local memory. A pair of PEs communicates by sending and receiving messages to each

other which transfer the data from the local memory of the sender to the local memory of

the receiver. Finally, the partitioned global-address-space (PGAS) model (e.g., UPC,

SHMEM) presents the programmer with a logical global memory space divided into two

parts: a private portion local to each PE and a global portion which can be physically

partitioned among the PEs. PEs communicate with each other by reading and writing

the global portion of the memory via the use of put and get operations. In terms of

implementation, these models are realized either as libraries, as sequential language

extensions, or as new languages.


Message Passing Interface (MPI) is a communication library used to program parallel

computers with the goals of high performance, scalability and portability [1]. MPI has

become the de facto standard for developing high-performance parallel applications;

virtually every existing parallel system provides some form of support for MPI application

development. There are currently two versions of the standard: MPI-1 (first standardized

in 1994) that uses purely matching send and receive pairs for data transfer and provides

routines for collective communication and synchronization, and MPI-2 (a superset of

MPI-1, first standardized in 1996) which includes additional features such as parallel I/O,

dynamic process management, and some remote memory access (put and get) capabilities.

Unified Parallel C (UPC) is an explicit parallel extension of the ANSI C language

developed beginning in the late 1990s based on experience with several earlier parallel

C-based programming models [2]. UPC exposes the PGAS abstraction to the programmer

by way of several language and library features, including specially typed (shared)

variables for declaring and accessing data shared among PEs, synchronization primitives

such as barriers and locks, a number of collective routines, and a unique, affinity-aware

work sharing construct (upcforall). The organization responsible for the continuing

development and maintenance of the UPC language is a consortium of government,

industry, and academia, which released the latest UPC specification version 1.2 in June

2005. This specification has been implemented in the form of vendor compilers, including

offerings from HP and IBM, as well as open-source compilers such as Berkeley UPC [3]

and the reference Michigan UPC. These provide for UPC support on a number of HPC

platforms, including SMP systems, super-computers such as the Cray XT series, and Linux

clusters using a variety of commodity or high-speed interconnects.

The SHared MEMory (SHMEM) library essentially provides the shared-memory

abstraction typical of multi-threaded sequential programs to developers of high-performance

parallel applications [4]. First created by Cray Research for use on the Cray T3D

supercomputer and now trademarked by SGI, SHMEM allows PEs to read and write

all globally declared variables, including those mapped to memory regions physically

located on other PEs. SHMEM is distinct from a language such as UPC in that it

does not provide intrinsically parallel language features; instead, the shared memory

model is supported by way of a full assortment of API routines (similar to MPI). In

addition to the fundamental remote memory access primitives (get and put), SHMEM

provides routines for collective communication, synchronization, and atomic memory

operations. Implementations of the library are primarily available on systems offered by

SGI and Cray, though versions also exist for clusters using interconnects such as Quadrics

(recently went out of business). At present, no SHMEM standardization exists, so

different implementations tend to support a different set of constructs providing similar

functionalities. However, an effort to create an OpenSHMEM standard [5] is currently underway.


2.2 Performance Tools

Performance tools are software systems that assist programmers in understanding

the runtime behavior of their application on real systems and ultimately in optimizing

the application with respect to execution time, scalability, or resource utilization. To

achieve this goal, the majority of the tools make use of a highly effective experimental

performance analysis approach, based on a measure-modify cycle (Figure 2-1), in which

the programmer conducts an iterative process of performance data collection, data

analysis, data visualization, and optimization until the desired application performance is

achieved [6]. Under this approach, the tool first generates instrumentation code that serves

as entry points for performance data collection (Instrumentation). Next, the application

and instrumentation code are executed on the target platform and raw performance data

are collected at runtime by the tool (Measurement). The tool organizes the raw data

and can optionally perform various automatic analyses to discover and perhaps suggest

1 Alternative approaches include simulation and analytical models.


Figure 2-1. Measure-modify performance analysis approach of performance tool

resolutions to performance bottlenecks (Automatic Analysis). Both the raw and analyzed

data are then presented in more user-friendly forms to the programmer through text-based

or graphical interface (Presentation) to facilitate the manual analysis process (Manual

Analysis). Finally, the tool or programmer applies appropriate optimization techniques

to the program or the execution environment (Optimization) and the whole cycle repeats

until the programmer is satisfied with the performance level.

A tool can use (fixed-interval) sampling-driven instrumentation or event-driven

instrumentation, depending on when and how often performance data are collected. In

sampling-driven tools, data are collected regularly at fixed-time intervals by one or more

concurrently executing threads. At each time step, a predefined, fixed set of metrics

(types of performance data) are recorded regardless of the current program behavior (e.g.,

same metrics recorded regardless of whether the program is performing computation

or communication). The performance of the program is then estimated, often using

only a subset of these metrics. In most cases, the monitoring threads access only a few

hardware counters and registers and perform few calculations, thus introducing

low data collection overhead. As a result, this technique is less apt to cause changes

in execution behavior that may lead to an inaccurate analysis of the program. However,

sampling-driven tools may have greater difficulty in relating the collected data

with respect to the high-level source code, especially when the time interval is large

enough to miss short-lived trends. In contrast, event-driven tools record data only when

specified events (such as the start of a function or communication call) occur during

execution. Together, events and metrics make up the event model that the

tool uses to describe program behavior; the set of recorded events and metrics is

used to reconstruct the behavior of the program in direct relation with high-level source

code, easing the analysis and optimization process. For each event, the tool records a

select number of metrics (e.g., time, PE ID, etc.) relevant to that particular event, but

doing so generally requires more time than accessing a few hardware counters

in the sampling-driven case. As a result, event-driven tools generally introduce higher

data collection overhead than sampling-driven tools and thus have a higher chance of

introducing heisenbugs: bugs (caused by performance perturbation) that disappear or alter

their behavior when one attempts to probe or isolate them. A particular problem is

short-lived events that force substantial instrumentation overhead for relatively little data.

Another common tool classification, tracing versus profiling, distinguishes how a

tool handles the metrics each time instrumentation code is executed. A tool operating

in tracing mode stores metric values calculated at each time instance separately from

one another. From this data, it is possible for the tool to reconstruct the step-by-step

program behavior, enabling analysis in great detail. However, the large amount

of data generated also requires significant storage space for each run, and the sheer

amount of data could be overwhelming if it is not carefully organized and presented to

the user. In addition, due to memory limitations, the tool often must perform file I/O

during runtime, introducing additional data collection overhead on top of the unavoidable

metric calculation overhead. Examples of tools that support the collection and viewing

of trace data include Dimemas/Paraver [7], Intel Cluster Tools [8], MPE/Jumpshot [9],

and MPICL/ParaGraph [10]. In contrast, a tool operating in profiling mode performs

additional on-the-fly calculations2 (min, max, average, count, etc.) after metric values are

calculated at runtime and only statistical (profile) data are kept. This data can usually fit

in memory, avoiding the need to perform file I/O at runtime. However, profiling data often

only provide sufficient information to perform high-level analysis and may be insufficient

for determining the causes of performance bottlenecks. Examples of popular profiling tools

include DynaProf [11], mpiP [12], and SvPablo [13].
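As an illustration of the on-the-fly profile calculations described above, a profiling tool can fold each measured duration into a small running-statistics record, so that no per-event storage or runtime file I/O is needed. This is a hypothetical sketch, not the data structure of any particular tool:

```c
#include <float.h>

/* Running statistics a profiling-mode tool might keep per code region;
 * updated every time instrumentation fires, with no per-event storage. */
typedef struct {
    unsigned long count;    /* number of times the region executed */
    double min, max, total; /* min/max/total duration in seconds */
} profile_entry;

void profile_init(profile_entry *p) {
    p->count = 0;
    p->min = DBL_MAX;
    p->max = 0.0;
    p->total = 0.0;
}

/* Fold one measured duration into the running profile. */
void profile_update(profile_entry *p, double duration) {
    p->count++;
    p->total += duration;
    if (duration < p->min) p->min = duration;
    if (duration > p->max) p->max = duration;
}

double profile_average(const profile_entry *p) {
    return p->count ? p->total / (double)p->count : 0.0;
}
```

Because only these few scalars are kept per region, the profile data easily fits in memory, which is exactly the trade-off against tracing described above.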

Finally, as the system used to execute the application grows in size, the amount of

performance data that a tool must collect and manage grows to a point where it becomes

nearly impossible for users to manually analyze the data even with the help of the tool. To

address this issue, several tools such as HPCToolkit [14], Paradyn [15], Scalasca/KOJAK

[16], and TAU [17] also include mechanisms to have the tool automatically analyze the

collected performance data and point out potential performance bottlenecks within the

application (e.g., scalability analysis, common bottleneck analysis, etc.).

2 For tracing mode, these statistical calculations are often performed after execution.


A substantial background research process has led to the formulation and development

of the PPW system. In this chapter we briefly describe this process and its resulting

findings and insights that have shaped the PPW design.

We began our background research by studying the details of parallel programming

model specifications and implementations in order to identify characteristics important

in analyzing parallel application performance. In parallel, we surveyed existing works on

performance tool research in order to identify characteristics important for the success of a

performance tool [18]. Using the knowledge gained from these studies, we then evaluated

the applicability of existing performance tool techniques to various programming models

and leveraged techniques that could be re-used or adopted. Additionally, we performed

comparisons between related performance analysis techniques, identified characteristics

common to these techniques, and made generalizations based on the commonalities (such

generalization is desirable as it reduces tool complexity). Finally, we recognized new

obstacles pertaining to parallel performance analysis and formulated solutions to handle

these issues.

A helpful performance tool must collect appropriate and accurate performance data.

We found that it is useful for a tool to support both profiling and tracing measurement

modes. Profile data guides users to program segments where they should focus their

tuning efforts, while trace data provides detailed information often needed to determine

the root causes of performance degradations. The tool should also make use of hardware

counter monitoring systems, such as the portable Performance Application Programming

Interface (PAPI) [19], which are valuable in analyzing non-parallel sections of an

application but can also be used on parallel sections. Finally, to avoid performance

perturbation, the data collection overhead introduced by the tool must be minimized. A

general consensus from the literature indicates that a tool with overhead of approximately

a few percent under profiling mode and somewhat more under tracing mode is considered

to be safe from excessive perturbation.

A productive tool must be easy to learn and use. Although performance tools have

proven valuable in troubleshooting performance problems, they are often not used because

of their high learning curve. To avoid this pitfall, a tool should provide an intuitive,

user-friendly interface by following an established standard or adopting visualizations

used by existing popular tools. In addition, since source code is generally the only

level over which users have full control (the average user does not have the knowledge

or authority to alter the execution environment), performance tools should present

performance data with respect to the application source code. This helps users

identify the specific source code regions that delay an application's execution,

making it easier for the user to remove bottlenecks. A tool's ability to provide source-line

correlation and to work close to the source level is thus critical to its success.

To efficiently support multiple programming models, a successful performance

tool design must include mechanisms to resolve difficulties introduced by diverse

models and implementations. We noticed that techniques used in the measurement

and the presentation stages are generally not tied to the programming model. The

types of measurements required and the difficulties one must solve are very similar among

models (i.e., obtaining timestamps, clock synchronization issues, etc.), and the

visualizations developed are usable with, or easily extensible to, a variety of

programming models. Furthermore, we noted that while each programming model supplies

the programmer with a set of constructs to orchestrate work among PEs,

the types of inter-PE interaction supported by these constructs are often similar between

models. For example, both UPC and SHMEM include constructs to put data to and get

data from other PEs and to perform barrier synchronization. It is desirable to take advantage of this

commonality and devise a generalization mechanism to enable the development of system

components that apply to multiple models, helping reduce the complexity of tool design.

In contrast, the choice of the best instrumentation technique is highly dependent

on the strategy used to implement the target compiler. Many diverse implementation

methods are used by compiler developers to enable execution of target model applications.

For instance, all MPI and SHMEM implementations are in the form of linking libraries

while UPC implementations range from a direct compilation system (e.g., Cray UPC) to

a system employing source-to-source translation complemented with extensive runtime

libraries (e.g., Berkeley UPC). We noticed that while it is possible to select an established

instrumentation technique that works well for a particular implementation strategy (for

example, using the wrapper instrumentation approach for linking libraries), none of

these techniques work well for all compiler implementation strategies (see Chapter 4.2

for additional discussion). Thus, any tool that wishes to support a range of models and

compilers must include mechanisms to handle these implementation strategies efficiently.


Parallel Performance Wizard (PPW) is a performance data collection, analysis, and

visualization system for parallel programs. The goal is to provide a performance tool

infrastructure that supports a wide range of parallel programming models with ease; in

particular, we focus on the much-needed support for PGAS models [20].

PPW's high-level architecture is shown in Figure 4-1, with arrows illustrating the

steps involved in the PPW-assisted application optimization process. A user's source

program is first compiled using PPW's commands to generate an instrumented executable.

This executable is then run and either profiling or tracing data (as selected by the user)

is collected and managed by PPW. The user then opens the resulting performance data

file and proceeds to analyze application performance in several ways: examining statistical

performance information via profiling data visualizations supplied by PPW; converting

tracing data to SLOG2 or OTF format for viewing with Jumpshot or Vampir; or using the

PPW analysis system to search for performance bottlenecks.

PPW currently supports the analysis of UPC, SHMEM, and MPI 1.x applications and

is extensible to support other parallel programming models. To facilitate support for a

variety of models, we developed a new concept: the generic-operation-type abstraction. In

the remainder of this section, we discuss the motivations behind and advantages of using

this abstraction.

Existing performance tools are commonly designed to support a specific model or are

completely generic. Model-specific tools interact directly with model-specific constructs

and have the advantage of being able to collect a varying set of operation-specific data

(such as memory address and data transfer size for put and get). However, the cost

of adding support for additional models to these tools is usually high, often requiring

updates to a significant portion of the system, due to the need to re-implement the same

functionalities for each model. In contrast, completely generic tools (such as MPE [9])

[Figure 4-1 legend: application code; PPW-generated data; activities using the GASP interface; activities using event mapping; activities using both GASP and event mapping; other activities.]
Figure 4-1. PPW-assisted performance analysis process from original source program to revised (optimized) program

Figure 4-2. Generic-operation-type abstraction to facilitate the support for multiple
programming models

work with generic program execution states (i.e., the beginning and the end of function

calls) and thus can be easily adopted to support a wide range of models. Unfortunately,

being completely generic forces these tools to collect a standard set of metrics (e.g.,

source line, timestamp) each time data collection occurs, and as a result, these tools

lose the capacity to obtain useful operation-specific metrics1 (e.g., data size for data

transfer operations). To avoid the unnecessary tight-coupling of a tool to its supported

programming models while still enabling the collection of useful operation-specific metrics,

we developed a generic-operation-type abstraction that is a hybrid of the model-specific

and the completely generic approaches. The idea is to first map model-specific constructs

to a set of model-independent generic operation types classified by their functionality.

For each generic operation, the tool can then collect operation-specific events and metrics

and may later analyze and present these data differently depending on the operation type

(Figure 4-2).
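The event-mapping step can be pictured as a lookup from model-specific construct names to generic operation types. The table below is a hypothetical sketch covering a handful of UPC, SHMEM, and MPI constructs, not PPW's actual mapping:

```c
#include <stddef.h>
#include <string.h>

/* Generic operation types (illustrative subset of Table 4-1). */
typedef enum {
    OP_GROUP_SYNC, OP_GROUP_COMM, OP_PUT, OP_GET, OP_LOCK, OP_UNKNOWN
} op_type;

/* Example mapping of model-specific constructs to generic types. */
static const struct { const char *construct; op_type type; } op_map[] = {
    { "upc_notify",        OP_GROUP_SYNC },
    { "shmem_barrier_all", OP_GROUP_SYNC },
    { "mpi_barrier",       OP_GROUP_SYNC },
    { "upc_all_scatter",   OP_GROUP_COMM },
    { "mpi_alltoall",      OP_GROUP_COMM },
    { "upc_memput",        OP_PUT },
    { "shmem_put",         OP_PUT },
    { "upc_memget",        OP_GET },
    { "shmem_get",         OP_GET },
    { "upc_lock",          OP_LOCK },
};

/* Classify a construct; the tool then picks operation-specific
 * events and metrics based on the returned generic type. */
op_type classify(const char *construct) {
    for (size_t i = 0; i < sizeof op_map / sizeof op_map[0]; i++)
        if (strcmp(op_map[i].construct, construct) == 0)
            return op_map[i].type;
    return OP_UNKNOWN;
}
```

Analysis and visualization components can then branch on the generic type alone, so adding a new model reduces to extending the mapping table.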

The generic-operation-type abstraction has influenced the development of many

components of the PPW system, including its event model, instrumentation and

1 Note that it is possible but impractical to collect all metrics each time, as some of the
collected metrics would not be meaningful.

[Figure 4-3 diagram: instrumentation and measurement units connected through the GASP instrumentation-measurement interface, enabling GASP-based data collection.]

Figure 4-3. Framework of Parallel Performance Wizard organized with respect to stages of
experimental measurement and model dependency (multi-boxed units are model-dependent)

measurement approach, and analyses and visualizations. We now describe these

components (Figure 4-3) in the following sections.

4.1 Parallel Programming Event Model

The effectiveness of an event-driven tool is directly impacted by the events and

metrics (i.e., event model) that it uses. Events signify important instances of program

execution when performance data should be gathered, while metrics define the types

of data that should be measured and subsequently stored by the tool for a given event.

In this section, we present the generic parallel event model PPW uses to describe the

behavior of a given parallel application.

PPW focuses on providing detailed information needed to analyze parallel portions

of a program while maintaining sufficient information to enable a high-level sequential

analysis. For this reason, the current PPW event model includes mostly parallel events.

Table 4-1 summarizes the theoretical event model using the generic-operation-type

abstraction to describe the behavior of a given program; this event model is compatible

with both PGAS models and MPI while Table 4-2 shows the mapping of UPC, SHMEM,

and MPI 1.x constructs to generic operation types.

We organized Table 4-1 so that operation types with the same set of relevant events

are shown together as a group. In addition to metrics useful for any event (i.e., calling PE

ID, code location, timestamp, and operation-type identifier), for each event, we provide a

list of additional metrics that would be beneficial to collect. For one-sided communication,

metrics such as the data source and destination2 (PE ID and memory address), the data

transfer size, and the synchronization handler3 (for non-blocking operations

only) provide additional insights on the behavior of these operations. For two-sided

communication, it is necessary to match the PE ID, transfer size, message identifier, and

synchronization handler in order to correlate related operations. For lock acquisition

or release operations and wait-on-value change operations, the lock identifier or the

wait-variable address help prevent false bottleneck detection. For collective global-memory

allocation, the memory address distinguishes one allocation call from another. For group

communication, the data transfer size may help in understanding these operations. Finally,

for group synchronization and communication that do not involve all system PEs, group

member information is useful in distinguishing between distinct but concurrently executing operations.


It is important to point out that event timestamp information is often the most

critical metric to monitor (as performance optimization is usually aimed at minimizing the

observed execution time). With a proper set of events and accurate timing information

for these events, it is possible to calculate (or at least provide a good estimate of) the

2 The source and destination of a transfer may be different from the calling PE.

3 Identifier used by the explicit/implicit synchronization operations to force completion
of a particular put or get.

Table 4-1. Theoretical event model for parallel programs

Generic-operation type                   Events                               Additional metrics (may also include PAPI counters)

Group synchronization                    Enter (Notification_Begin),          Group info (sync/comm), address (memory),
Group communication                      Notification_End, Wait_Begin,        transfer size (comm)
Initialization / termination             Transfers_Received, Exit (Wait_End)
Global memory allocation

Atomic read/write                        Enter, Exit, Transfer_Begin,         Source, destination, transfer size,
Blocking implicit put/get                Transfer_Complete                    synchronization handler (non-blocking)
Blocking explicit put/get
Non-blocking explicit put/get

Explicit communication synchronization   Enter, Exit                          Synchronization handler

Blocking send/receive                    Enter, Exit, Matching_Enter,         Matching PE, transfer size, message identifier,
Non-blocking send/receive                Signal_Received, Wait_Begin,         synchronization handler (non-blocking)
                                         Wait_End

Lock acquisition or release              Enter, Exit, Condition_Fulfilled     Lock identifier (lock), address (wait)
Wait-on-value change

User-defined function/region             Enter, Exit                          (none)
Environment inquiry

Table 4-2. Mapping of UPC, SHMEM, and MPI 1.x constructs to generic operation types

Generic-operation type        UPC                                 SHMEM                            MPI 1.x
Initialization                N/A                                 shmem_init()                     mpi_init()
Environment inquiry           upc_threadof(), ...                 my_pe(), num_pes()               mpi_address(), ...
Group sync.                   upc_notify(), upc_wait(), ...       shmem_barrier_all(), ...         mpi_barrier(), ...
Group comm.                   upc_all_scatter(), ...              shmem_broadcast(), ...           mpi_alltoall(), ...
Global memory allocation      Declaration with shared             shmalloc(), shfree()             N/A
                              keyword, upc_alloc(), ...
Implicit put                  Direct assignment                   N/A                              N/A
                              (sharedint = 1)
Implicit get                  Direct assignment                   N/A                              N/A
                              (x = sharedint)
Explicit put (one-sided)      upc_memput(), upc_memset(),         shmem_put(), shmem_iput(), ...   N/A
                              upc_memcpy()*
Explicit get (one-sided)      upc_memget(), upc_memcpy()*         shmem_get(), shmem_iget(), ...   N/A
Send (two-sided)              N/A                                 N/A                              mpi_bsend(), mpi_irsend(), ...
Receive (two-sided)           N/A                                 N/A                              mpi_recv(), mpi_irecv(), ...
Explicit comm. sync.          upc_fence                           shmem_wait_nb(), ...             mpi_wait(), mpi_waitall(), ...
Lock acquisition or release   upc_lock(), upc_unlock()            shmem_set_lock(), ...            N/A
Atomic comm.                  N/A                                 shmem_swap(), ...                N/A
Wait-on-value change          N/A                                 shmem_wait_until(), ...          N/A

* upc_memcpy() can be either put and/or get depending on the source and destination

duration for various computation, communication, and synchronization calls throughout

program execution. In some cases, it is also possible to calculate program-induced

delays4 (PI delays; delays caused by poor orchestration of parallel code such as uneven

work-distribution, competing data access, or lock acquisition) that point to locations

in the program that can be optimized via source-code modification. By examining the

durations of various operations and identifying PI delays that can be removed, it is much

simpler for a programmer to devise optimization techniques to improve the execution time

of the application.

In the following subsections, we discuss the events and means to calculate operation

duration and PI delay for each of the logical groups of generic-operation types shown

in Table 4-1. For each group, we diagram some typical execution patterns via a set of

operation-specific events (each indicated by an arrow with number at the end) ordered

with respect to time (x-axis) for each of the processing PEs (PE X, Y, Z) involved. We

discuss means to calculate the duration of non-blocking operations (the duration of a

blocking operation is always the time difference between its Enter and Exit events) and the PI delay

with these events, discuss why the inclusion of some events affects the accuracy of the

calculations, and mention how a tool can track these events in practice. In addition, we

point out performance issues typically associated with each operation group.

4.1.1 Group-Related Operations

In Figure 4-4, we illustrate the events for the category of operations that involves a

group of PEs working together, including group synchronization, group communication,

initialization, termination, and global memory allocation operations. The execution

behavior of these operations is commonly described in terms of participating PEs running

in one of two phases. First is the notification phase when the calling PE sends out signals

4 Examples of delays which are not PI delays include data transfer delay due to
network congestion, slowdown due to multiple applications running at the same time, etc.

to all other PEs in the group indicating its readiness in performing the operation. The

second is the wait phase where the calling PE blocks until the arrival of signals from all

PEs before completing the operation. Two versions of these group operations are typically

provided to programmers: the standard (blocking) single-phase version where a single

construct is used to complete both phases and the more flexible (non-blocking) split-phase

version using separate constructs for each phase that allows for overlapping operation

(generally restricted to local computation). With respect to existing programming models,

the single-phase version is available for all operations in this category while the split-phase

version is typically only available for group synchronization and group communication.

PI delays associated with these operations normally mark the existence of load imbalance among the participating PEs.


Events associated with this category of operations are the following:

Enter (Notification_Begin): Event denoting the beginning of cooperative operation
(beginning of notification phase). The calling PE starts sending out Ready signals
(plus data for group communication operations) to all other PEs.

Notification_End: Event denoting the point in time when the calling PE finishes
sending Ready signals (plus data for group communication operations) to all other
PEs (end of notification phase). For the split-phase version, the calling PE is free
to perform overlapping operations after this point until the wait phase. In the
single-phase version, this event is normally not traceable directly but is estimated to
occur a short time after the Enter event.

WaitBegin: Event denoting the beginning of the wait phase (where the calling PE
blocks until Ready signals are received from all other PEs). Normally only traceable
for split-phase version.

TransfersReceived: Event denoting the arrival of Ready signals from all other PEs
on the calling PE. This event is usually not traceable directly but is estimated to
occur a short time after the last participating PE enters the operation.

Exit (Wait_End): Event denoting the completion of the cooperative operation.

An example execution pattern exhibiting bottlenecks (on PE X and Y) caused

by uneven work-distribution for the single-phase version is diagramed in Figure 4-4a.

In this scenario, PE Z entered the operation after it has received a Ready signal from

[Figure 4-4 diagram: timelines for PE X, Y, and Z showing signals or data transfers initiated by each PE, operation intervals, and PI delay. Events: 1. Enter (Notification_Begin), 2. Notification_End, 3. Wait_Begin, 4. Transfers_Received, 5. Exit (Wait_End). Panels: (a) single-phase, (b) split-phase.]

Duration(Single) = TS(Exit) - TS(Enter)
Duration(Split) = [TS(Notification_End) - TS(Enter)] + [TS(Exit) - TS(Wait_Begin)]

PI Delay(Single) = TS(Transfers_Received) - TS(Notification_End)
PI Delay(Split) = TS(Transfers_Received) - TS(Wait_Begin)

Figure 4-4. Events for group synchronization, group communication, initialization,
termination, and global memory allocation operations (TS = Timestamp.
Only a few of PE Y's events are shown to avoid clutter)

both PE X and Y (i.e., on PE Z, a Transfers_Received event occurred before an Enter

event) so it was able to complete the operation without blocking. In contrast, PE X

and Y finished sending out signals before receiving all incoming Ready signals so they

were unable to complete the operation optimally; each PE became idle until it reached

the Transfers_Received event. PI delay for the single-phase version is given by the time

difference between the Transfers_Received and Notification_End events (Figure 4-4, bottom).


Figure 4-4b shows an example execution pattern for the split-phase version. In this

scenario, PE Z received all signals before entering the notification phase so it is free of any

PI delay. PE Y entered the wait phase before receiving all signals so it remained idle for

a period of time before completing its operation (idle time given by the time difference

between the Transfers_Received event and the Wait_Begin event). Finally, PE X shows a

situation where overlapping computation is used to remove potential delays (an advantage of the

split-phase version). PE X entered the notification phase first so it logically required the

longest wait time (i.e., largest difference between the Transfers_Received and the Enter

event). However, by performing sufficient computation, PE X no longer needed to wait

once it entered the wait phase and thus was free of delay. For the split-phase version,

the total operation duration is given by the combined duration of the notification phase

(time difference between the Enter and Notification_End event) and the wait phase (time

difference between the Wait_Begin and Exit event) and the PI delay is given by the time

difference between its Transfers_Received and Wait_Begin event (Figure 4-4, bottom).
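The duration and PI-delay formulas at the bottom of Figure 4-4 amount to simple timestamp arithmetic. A minimal sketch follows (struct and function names are assumed for illustration; timestamps in seconds, with negative delays clamped to zero):

```c
/* Timestamps recorded for one group operation on one PE (seconds). */
typedef struct {
    double enter;              /* Enter (Notification_Begin) */
    double notification_end;   /* Notification_End */
    double wait_begin;         /* Wait_Begin (split-phase only) */
    double transfers_received; /* Transfers_Received */
    double exit_ts;            /* Exit (Wait_End) */
} group_op;

/* Duration(Single) = TS(Exit) - TS(Enter) */
double duration_single(const group_op *o) {
    return o->exit_ts - o->enter;
}

/* Duration(Split) = [TS(Notification_End) - TS(Enter)]
 *                 + [TS(Exit) - TS(Wait_Begin)] */
double duration_split(const group_op *o) {
    return (o->notification_end - o->enter) + (o->exit_ts - o->wait_begin);
}

/* PI Delay(Single) = TS(Transfers_Received) - TS(Notification_End);
 * zero when all Ready signals arrived before notification ended. */
double pi_delay_single(const group_op *o) {
    double d = o->transfers_received - o->notification_end;
    return d > 0.0 ? d : 0.0;
}

/* PI Delay(Split) = TS(Transfers_Received) - TS(Wait_Begin), clamped. */
double pi_delay_split(const group_op *o) {
    double d = o->transfers_received - o->wait_begin;
    return d > 0.0 ? d : 0.0;
}
```

The clamping reflects the PE Z case above: a PE that has already received all signals before notifying (or before entering its wait phase) incurs no PI delay.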

4.1.2 Data Transfer Operations

In Figure 4-5, we illustrate the events for operations relating to one-sided, point-to-point

data transfers such as atomic operations, blocking or non-blocking explicit or implicit put,

and get operations, and explicit communication synchronization (e.g., fence, quiet).

Note that we present the get operation as a reverse put5 (where the calling PE sends

a request to the target PE and the target PE performs a put) since get operations are

often implemented this way in practice in order to improve their performance. For this

class of operations, we are interested in determining the time it takes for the full data

transfer to complete; from beginning of read or write to when data is visible to the

whole system. The precise duration for the non-blocking version could be calculated

if the Transfer_Complete event is available; unfortunately, it is often not possible

for model implementations to supply this event. For such systems, the duration can

only be estimated from the end time of either the explicit or implicit communication

synchronization (Synchronization_End event) that enforces data consistency among

PEs. As illustrated in Figure 4-5c, this estimated duration could be much higher

than the precise duration and as a result compromises the reliability of subsequent

analyses. Furthermore, any PI delays caused by the synchronization operations increase

5 A canonical get would have events similar to those illustrated for a put operation.

the duration time further away from the actual transfer time. Finally, if an explicit

synchronization operation is used to force the completion of multiple data transfers, PI

delay in one transfer will affect the duration calculation for all other transfers as well,

further decreasing the accuracy of the performance information.

To calculate the PI delay, either the Transfer_Begin or the Transfer_Complete event

is needed; they are sometimes obtainable by examining the Network Interface Card (NIC)

status. PI delay for blocking put (get) is the time difference between the Transfer_Begin

and the Enter event. For non-blocking put (get) using implicit synchronization, PI delay is

the time difference between the Transfer_Complete and the Synchronization_Begin event.

For non-blocking put (get) using explicit synchronization, PI delay is the time difference

between the Transfer_Begin and the Synchronization_Begin event. PI delays associated

with these operations often signify the existence of competing data accesses.
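For one-sided transfers, the gap between the precise and the estimated duration, and the PI-delay rules just described, reduce to the following timestamp arithmetic (a sketch with assumed names; timestamps in seconds, negative delays clamped to zero):

```c
/* Timestamps for one non-blocking one-sided put or get (seconds). */
typedef struct {
    double enter, exit_ts;       /* the put/get call itself */
    double transfer_begin;       /* often only visible via NIC status */
    double transfer_complete;    /* frequently unavailable in practice */
    double sync_begin, sync_end; /* the completing synchronization op */
} onesided_op;

/* Precise duration, when Transfer_Complete can be observed. */
double os_duration_actual(const onesided_op *o) {
    return o->transfer_complete - o->enter;
}

/* Fallback: measure to the end of the synchronization call that
 * enforces completion; always an overestimate of the actual time. */
double os_duration_estimated(const onesided_op *o) {
    return o->sync_end - o->enter;
}

/* PI Delay(Non-blocking, implicit) =
 *   TS(Transfer_Complete) - TS(Synchronization_Begin), clamped at 0. */
double os_pi_delay_implicit(const onesided_op *o) {
    double d = o->transfer_complete - o->sync_begin;
    return d > 0.0 ? d : 0.0;
}

/* PI Delay(Non-blocking, explicit) =
 *   TS(Transfer_Begin) - TS(Synchronization_Begin), clamped at 0. */
double os_pi_delay_explicit(const onesided_op *o) {
    double d = o->transfer_begin - o->sync_begin;
    return d > 0.0 ? d : 0.0;
}
```

Since `os_duration_estimated` always runs to `sync_end`, it can only overstate the transfer time, which is the source of the accuracy loss illustrated in Figure 4-5c.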

In Figure 4-6, we illustrate the events for operations (of PE X) relating to two-sided,

point-to-point data transfers such as blocking or non-blocking send and receive operations

and explicit communication synchronization. As with one-sided communication, we

are interested in determining the time it takes for the full data transfer to complete.

The duration for the non-blocking versions is the time difference between the send

(receive) Enter and the Exit event plus the time difference between the Signal_Received

and the Wait_End event. Unfortunately, the Signal_Received event (event denoting

the point in time when the calling PE received a Ready signal from the matching

PE) is nearly impossible to obtain; thus duration can only be estimated from the exit

time of the synchronization call that guarantees data transfer completion, resulting in

much higher than normal duration calculation. PI delay for blocking send (receive) is

the time difference between the Enter and the Matching_Enter events, while the PI delay

for non-blocking send (receive) with matching blocking receive (send) is the time

difference between the Wait_Begin and the Matching_Enter event. For a non-blocking

send (receive) with matching non-blocking receive (send), PI delay cannot be calculated

[Figure 4-5 diagram: timelines showing signals or data transfers initiated by each PE, operation intervals, PI delay, and implicit synchronization. Events: 1. Enter, 2. Exit, 3. Transfer_Begin, 4. Transfer_Complete, 5. Synchronization_Begin, 6. Synchronization_End. Panels: (a) blocking put, (b) blocking get, (c) non-blocking put with implicit synchronization, (d) non-blocking get with implicit synchronization, (e) non-blocking put with explicit comm. synchronization, (f) non-blocking get with explicit comm. synchronization.]

Duration(Blocking) = TS(Exit) - TS(Enter)
Duration(Non-blocking, actual) = TS(Actual_Completion) - TS(Enter)
Duration(Non-blocking, estimated) = TS(Synchronization_End) - TS(Enter)

PI Delay(Blocking) = TS(Transfer_Begin) - TS(Enter)
PI Delay(Non-blocking, implicit) = TS(Transfer_Complete) - TS(Synchronization_Begin)
PI Delay(Non-blocking, explicit) = TS(Transfer_Begin) - TS(Synchronization_Begin)

Figure 4-5. Events for one-sided communication and synchronization operations

[Figure 4-6 diagram: timelines showing signals or data transfers initiated by each PE, operation intervals, PI delay, and implicit synchronization. Events: 1. Enter, 2. Exit, 3. Matching_Enter, 4. Signal_Received, 5. Wait_Begin, 6. Wait_End. Panels: (a) blocking send/recv with matching blocking recv/send, (b) blocking send/recv with matching non-blocking recv/send, (c) non-blocking send/recv with matching blocking recv/send, (d) non-blocking send/recv with matching non-blocking recv/send.]
Duration(Blocking) = TS(Exit) - TS(Enter)
Duration(Non-blocking, actual) = [TS(Exit) - TS(Enter)] + [TS(Wait_End) - TS(Signal_Received)]
Duration(Non-blocking, estimated) = TS(Wait_End) - TS(Enter)

PI Delay(Blocking) = TS(Matching_Enter) - TS(Enter)
PI Delay(Non-blocking with matching blocking) = TS(Matching_Enter) - TS(Wait_Begin)

Figure 4-6. Events for two-sided communication and synchronization operations

(since Signal_Received is not available). These PI delays signify a potentially inefficient

calling sequence of send and receive pairs that typically stem from a load-imbalance prior

to calling of these operations.
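The two-sided formulas in Figure 4-6 follow the same pattern; below is a sketch with assumed names (timestamps in seconds, negative delays clamped to zero):

```c
/* Timestamps for one send or receive operation on one PE (seconds). */
typedef struct {
    double enter, exit_ts;       /* the send/receive call itself */
    double matching_enter;       /* Enter event on the matching PE */
    double signal_received;      /* rarely observable in practice */
    double wait_begin, wait_end; /* the completing wait call */
} twosided_op;

/* Duration(Non-blocking, actual) =
 *   [TS(Exit) - TS(Enter)] + [TS(Wait_End) - TS(Signal_Received)] */
double ts_duration_actual(const twosided_op *o) {
    return (o->exit_ts - o->enter) + (o->wait_end - o->signal_received);
}

/* Duration(Non-blocking, estimated) = TS(Wait_End) - TS(Enter);
 * the only option when Signal_Received cannot be observed. */
double ts_duration_estimated(const twosided_op *o) {
    return o->wait_end - o->enter;
}

/* PI Delay(Blocking) = TS(Matching_Enter) - TS(Enter), clamped. */
double ts_pi_delay_blocking(const twosided_op *o) {
    double d = o->matching_enter - o->enter;
    return d > 0.0 ? d : 0.0;
}

/* PI Delay(Non-blocking with matching blocking) =
 *   TS(Matching_Enter) - TS(Wait_Begin), clamped. */
double ts_pi_delay_nonblocking(const twosided_op *o) {
    double d = o->matching_enter - o->wait_begin;
    return d > 0.0 ? d : 0.0;
}
```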

4.1.3 Lock, Wait-On-Value, and Locally Executed Operations

In Figure 4-7a, we illustrate the events for the lock mechanisms and the wait-on-value

operation. The duration is calculated from the time difference between the Enter and Exit

events. To calculate the PI delay, the Condition_Fulfilled event is needed, which indicates

when the lock becomes available (unlocked) or when a remote PE updates the variable

to have a value satisfying the specified wait condition. This Condition_Fulfilled event

is generally not traceable directly by the tool but instead can be estimated from other

[Figure 4-7 diagram: timelines showing signals or data transfers initiated by each PE, operation intervals, and PI delay. Events: 1. Enter, 2. Exit, 3. Condition_Fulfilled. Panels: (a) lock and wait-on-value-change operations, (b) user-defined, work-sharing, and environment-inquiry operations.]

Duration = TS(Exit) - TS(Enter)

PI Delay(lock, wait-on-value change) = TS(Condition_Fulfilled) - TS(Enter)

Figure 4-7. Events for (a) lock mechanisms, wait-on-value and (b) user-defined
function/region, work-sharing, and environment-inquiry operations

operations' events (i.e., the last unlock or data transfer completed). PI delays associated

with these operations generally stem from poor orchestration among processing PEs (such

as lock competition and late updates of wait variables).

Finally in Figure 4-7b, we illustrate the events for the locally executed operations

such as user-defined function or region, work-sharing, and environment inquiry operations.

Tracking the performance of these operations is important as it facilitates the analysis

of local portions of the program. Since we can consider each to be a blocking operation,

the duration is simply the time difference between the Enter and Exit events. Without

extensive sequential performance tracking and analysis, it is not possible to determine if

any PI delay exists.

4.1.4 Implementation Challenges and Strategies

In this subsection, we briefly discuss the challenges and strategies used to implement

the event model with respect to data collection (instrumentation and measurement),

automatic data analysis, and data presentation. A more detailed discussion of each of

these stages will be given in Chapters 4.2, 5, and 4.4, respectively.

Together, the programming model, chosen instrumentation technique, and tool

design decisions determine the set of events that the tool collects during runtime. The

programming model supplies the Meaningful Event Set to collect as specified by the

event model discussed previously. From this set, a subset of Measurable Event Set that

can be collected directly during runtime given the constraints imposed by the chosen

instrumentation technique is identified. Finally, some tool design decisions may further

limit the Actual Event Set (a subset of the Measurable Event Set) that the tool supports. Once

the Actual Event Set is known, metrics (common metrics plus additional metrics in Table

4-1) associated with events in this set are collected during runtime.
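The successive narrowing from Meaningful to Measurable to Actual Event Set is, in effect, set intersection. A hypothetical sketch using bitmasks over event identifiers (names are illustrative):

```c
/* One bit per event type; a set of events is a bitmask. */
typedef unsigned int event_set;

enum {
    EV_ENTER              = 1u << 0,
    EV_EXIT               = 1u << 1,
    EV_NOTIFICATION_END   = 1u << 2,
    EV_WAIT_BEGIN         = 1u << 3,
    EV_TRANSFERS_RECEIVED = 1u << 4,
};

/* Actual Event Set = Meaningful set intersected with what the
 * instrumentation technique can measure and the tool supports. */
event_set actual_event_set(event_set meaningful,
                           event_set measurable,
                           event_set tool_supported) {
    return meaningful & measurable & tool_supported;
}
```

For example, a single-phase barrier whose Notification_End is not directly traceable simply drops that bit from the measurable mask, and later analyses fall back to estimating it.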

Depending on the Actual Event Set collected, analyses performed during the analysis

phase will differ. For example, to calculate barrier duration and PI delay in MPI and

SHMEM, the single-phase formulas are used, while for Berkeley UPC, the split-phase

formulas are used. Programming model capabilities also play a role in what kind of

analyses are performed. Analyses specific to barriers can be applied to all three models,

while analyses specific to one-sided transfer operations (e.g., PI delay due to competing

put and get) are applicable to both SHMEM and UPC, but not MPI 1.x.
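As a rough sketch of what a single-phase barrier analysis computes (a simplification; the actual formulas account for additional terms), the wait attributable to late arrivals at a PE can be derived from per-PE Enter timestamps:

```c
#include <stddef.h>

/* Sketch of a single-phase barrier analysis: given each PE's Enter
 * timestamp, the wait a PE spends on late arrivers is the gap between
 * its own arrival and the latest arrival across all PEs. */
double barrier_wait(const double *enter_ts, size_t npes, size_t pe) {
    double latest = enter_ts[0];
    for (size_t i = 1; i < npes; i++)
        if (enter_ts[i] > latest)
            latest = enter_ts[i];
    return latest - enter_ts[pe];
}
```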

For the presentation stage, each visualization is equipped to handle each type

of operation defined in the event model. For some visualizations, the operation type

plays no role in how the visualization handles the data, while other visualizations must

include mechanisms to handle each type separately. For example, table-based views

display data for all operation types in a similar fashion, but a grid-based view of data

transferred between PEs makes specific use of communication-related metrics related to

data-transfer operations exclusively. Other visualizations, such as the timeline view of

trace data, operate differently for various operation types; for example, the timeline needs

to appropriately handle two-sided operations, one-sided data transfers (for which a new

approach is needed to handle this operation type), etc.

4.2 Instrumentation and Measurement

In this section, we introduce known approaches for instrumentation, discuss their

strengths and limitations within the context of the goals of PPW, and then present our

data collection solution based on a novel, standardized performance interface called GASP.

4.2.1 Overview of Instrumentation Techniques

While several techniques have proven to be effective in application instrumentation

[21], the differences in compilation6 and execution among the divergent compilation

approaches prevent the selection of a universal instrumentation strategy. With source

instrumentation, the instrumentation code is added as part of the high-level source code

prior to execution time. Because the source code is altered during the instrumentation

process, this technique may prevent compiler optimization and reorganization and

also lacks the means to handle global memory models where some semantic details of

communication are intentionally underspecified at the source level to allow for aggressive

optimization (for example, implicit read or write of a shared variable in UPC is difficult

to handle using source instrumentation, especially under the relaxed memory consistency

mode where a given compiler may reorder the implicit calls to improve performance).

With binary instrumentation, the instrumentation code is added to the machine code

before or during program execution. A direct benefit of modifying the machine code rather than the source code is that recompilation is often not needed after each program

modification. Unfortunately, binary instrumentation is unavailable on some architectures

and yields performance data that is often difficult to correlate back to the relevant

source code, especially for systems employing source-to-source translation. Finally, with

library instrumentation (such as PMPI for MPI), wrappers are placed around functions

implementing operations of interest. During execution time, a call to a function first

executes the appropriate wrapper code that enables data collection and then invokes the

original function. This approach is very easy to use but does not work for programming

model constructs that are not in the form of a function call (such as an implicit put in

UPC) or for compilers that generate code which directly targets hardware instructions or

low-level proprietary interfaces.
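The wrapper pattern can be illustrated without MPI; the sketch below uses a hypothetical original_operation in place of a real library routine, mirroring how a PMPI wrapper intercepts an MPI_* call and forwards to its PMPI_* counterpart:

```c
#include <time.h>

/* Stand-in for a library function of interest (in PMPI, this role is
 * played by the PMPI_-prefixed entry points). */
static long   call_count = 0;
static double total_elapsed = 0.0;

static void original_operation(void) {
    call_count++;  /* real code would perform communication here */
}

/* Library-instrumentation wrapper: measure, then forward the call.
 * PMPI tools intercept MPI_* symbols and forward to PMPI_* this way. */
void wrapped_operation(void) {
    double t0 = (double)clock() / CLOCKS_PER_SEC;
    original_operation();
    total_elapsed += (double)clock() / CLOCKS_PER_SEC - t0;
}
```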

A brute force approach to having a tool simultaneously support multiple programming

models and implementations is simply to select an existing instrumentation technique that

works for each particular model implementation. Unfortunately, this approach forces the

writers of performance tools to be deeply versed in the internal and often changing or

proprietary details of the implementations, which can result in tools that lack portability.

In addition, the use of multiple instrumentation techniques forces the tool to handle each

model implementation disjointly and thus complicates the tool development process.

4.2.2 The Global-Address-Space Performance Interface

The alternative we have pursued is to define an instrumentation-measurement

interface, called the Global-Address-Space Performance (GASP) interface (Appendix

A), that specifies the relationship between programming model implementations and

performance tools (Figure 4-8). This interface defines the events and arguments of

importance for each model construct (see Table 4-3 for GASP events and arguments

related to non-blocking UPC communication and synchronization calls). Insertion of

appropriate instrumentation code is left to the compiler writers who have the best

knowledge about the execution environment, while the tool developers retain full control

of how performance data are gathered. By shifting the instrumentation responsibility

from tool writers to compiler writers, the chance of instrumentation altering the program

behavior is minimized. The simplicity of the interface minimizes the effort required

from the compiler writer to add performance tool support to their system (and once

completed, any tool that supports GASP and recognizes these model constructs can

6 For example, existing UPC implementations include direct, monolithic compilation
systems (GCC-UPC, Cray UPC) and source-to-source translation complemented with
extensive runtime libraries (Berkeley UPC, HP UPC, and Michigan UPC).

Figure 4-8. Interaction of PGAS application, compiler, and performance tool in GASP-enabled data collection

Table 4-3. GASP events for non-blocking UPC communication and synchronization

Operation                      Event type    Arguments
non-blocking get               Enter, Exit   int is_relaxed, void *dst,
                                             gasp_upc_PTS_t *src, size_t n
non-blocking put               Enter, Exit   int is_relaxed, gasp_upc_PTS_t *dst,
                                             void *src, size_t n
non-blocking synchronization   Enter, Exit   gasp_upc_nb_handle_t handle

support application analysis for that compiler). Concomitantly, this approach also greatly

reduces the effort needed for performance tool writers to add support for a variety of

model implementations; a single tool-side GASP implementation is sufficient for all

compilers with GASP support.

enum gasp_event_type { gasp_eventtype_start, gasp_eventtype_end,
                       gasp_eventtype_atomic };

void gasp_event_notify(
    unsigned int event_id,
    enum gasp_event_type event_type,
    const char *source_file,
    unsigned int source_line,
    unsigned int source_col,
    ...);

Figure 4-9. Specification of gasp_event_notify callback function
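As an illustration of what can sit behind this callback on the tool side, the following sketch pairs start and end notifications with a timestamp stack to accumulate inclusive times. This is a simplified, single-threaded stand-in, not PPW's actual measurement unit; the real callback also receives source location and event-specific varargs:

```c
#include <time.h>

enum gasp_event_type { gasp_eventtype_start, gasp_eventtype_end,
                       gasp_eventtype_atomic };

/* Simplified per-thread measurement state: a stack of entry times and
 * an inclusive-time accumulator per event id (small fixed tables here;
 * a real tool would hang this state off its per-thread context object). */
#define MAX_DEPTH  64
#define MAX_EVENTS 64
static double entry_stack[MAX_DEPTH];
static int    depth = 0;
static double inclusive_time[MAX_EVENTS];

static double now(void) { return (double)clock() / CLOCKS_PER_SEC; }

/* Sketch of a tool-side event callback: push the timestamp on a start
 * notification, pop and accumulate on the matching end notification. */
void on_event_notify(unsigned int event_id, enum gasp_event_type type) {
    if (type == gasp_eventtype_start && depth < MAX_DEPTH) {
        entry_stack[depth++] = now();
    } else if (type == gasp_eventtype_end && depth > 0 &&
               event_id < MAX_EVENTS) {
        inclusive_time[event_id] += now() - entry_stack[--depth];
    }
}
```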

The most important entry point in the GASP interface is the event callback function named gasp_event_notify (Figure 4-9), which compilers use to notify the performance tool when events of potential interest occur at runtime and to provide useful information (e.g., event identifier, source code location, and event-related arguments). The tool then decides how

to handle the information and what metrics to record. In addition, the tool is permitted to

make calls to routines that are written in the source programming model or that use the

source library to query model-specific information which may not otherwise be available.

The tool may also consult alternative sources of performance information, such as CPU

hardware counters exposed by PAPI, for monitoring serial aspects of computational and

memory system performance in great detail. The gasp_event_notify callback includes

a per-thread, per-model context pointer to an opaque, tool-provided object created at

initialization time, where the tool can store thread-local performance data.

The GASP specification is designed to be fully thread-safe, supporting model

implementations where arbitrary subsets of programming model threads may be

implemented as threads within a single process and virtual address space. It is highly

extensible by allowing a tool to capture model- and implementation-specific events at

varying levels of detail and to intercept just the subset of events relevant to the current

analysis task. It also allows for mixed-model application analysis whereby a single

performance tool can record and analyze performance data generated by all programming

models in use and present the results in a unified manner. Finally, GASP provides

facilities to create user-defined, explicitly-triggered performance events which allow the

user to give context to performance data. This user-defined context data facilitates phase

profiling and customized instrumentation of specific code segments.

Several user-tunable knobs are also defined by the GASP specification to provide finer control over the data collection process. First, several compilation flags are included so the user can control the event types the tool will collect during runtime. For example, the --inst-local compilation flag is used to request instrumentation of data transfer operations generated by shared local accesses (i.e., one-sided accesses to local data which are not statically known to be local). Because shared local accesses are often as fast as normal local accesses, enabling these events can add a significant runtime overhead to the application, so by default the tool does not collect these data. However, shared local access information is useful in some analyses, particularly those that deal with optimizing data locality (a critical consideration in PGAS programming) and with performing privatization optimizations, and thus may be worth the additional overhead. Second, instrumentation #pragma directives are provided, allowing the user to instruct the compiler to avoid instrumentation overheads for particular regions of code at compile time. Finally, a programmatic control function is provided to toggle performance measurement for selected program phases at runtime.

The complete GASP-enabled data collection process works as follows. First, the

compiler-side GASP implementation (Instrumentation Unit) generates instrumentation

code which is executed together with the application. The tool-side GASP implementation

(Measurement Unit) then intercepts these calls and performs the desired measurement.

Next, the raw data are passed to the Performance Data Manager that is responsible

for storing this raw data, merging data from multiple PEs, and performing simple

post-processing of data (e.g., calculating averages among PEs). These data are then used

by the automatic analysis units and presentation units at the later stages of performance analysis.


4.2.3 GASP Implementations

Here we briefly discuss considerations for the compiler-side implementation of

the GASP interface, focusing on UPC as it is the more interesting case. There are

several UPC compilers with existing GASP implementations: Berkeley UPC, GCC UPC,

and HP UPC [22]. Berkeley UPC translates UPC code to standard C code with calls

to the Berkeley UPC runtime system. As a result, much of the corresponding GASP

implementation consists of appropriate GASP calls made within the runtime system.

However, several features of the GASP specification must be implemented within the

compiler itself, including the #pragma directives for controlling instrumentation of

program regions and support for instrumentation of user-function calls. In addition, to

provide appropriate UPC source code correlation, the compiler must pass source code

information down through the translation process. By contrast, the GCC UPC and HP

UPC compilers both use a direct compilation approach, generating machine code directly

instead of translating UPC into C. With this architecture, the GASP implementation

involves more changes to the compiler itself than with Berkeley UPC. In the case of GCC

UPC, for example, changes were needed in one of the UPC compilation phases (called the "gimplification" phase because intermediate representations of functions are converted to GCC's GIMPLE language) to determine if instrumentation is enabled and generate

appropriate code if so.

4.3 Automatic Analysis

The analysis module aims at providing the tool with automatic performance

bottleneck detection and resolution capabilities. In this section, we briefly describe the

capabilities of analysis units and leave the in-depth discussion on the automatic analysis

system development to Chapter 5.

The High-Level Analysis Unit provides analyses of the overall program performance

not easily associated with a given operation type (e.g., load-balancing analyses) as well

as multiple experiment comparison (e.g., scalability analysis). To provide finer analyses,

the model-independent Bottleneck Detection Unit uses both profiling and tracing data to

identify bottlenecks and determine their cause for a particular execution. Once identified,

Bottleneck Resolution Units then try to provide suggestions on how to remove these

bottlenecks from the application. These units are partially model-dependent, as a given

resolution strategy may not always work for all programming models. For example, a technique to fix the performance degradation stemming from upc_memget, versus from shmem_get, could be different even though they are both classified as one-sided get

operations. Each of the analysis units generates new analysis data that are incorporated

by the Performance Data Manager and later presented by the Visualization Manager.

4.4 Data Presentation

PPW provides both graphical and text-based interfaces to view collected profile data

and generated analysis results. Most of these visualizations have been designed to have

a similar look and feel to those provided by other tools so users already familiar with

other tools can quickly learn and effectively use PPW. The following list summarizes the

visualization-related features (with each supporting source-code correlation whenever

possible) currently provided by PPW:

* A view summarizing the application execution environment (optimization flags used,
machine hostnames, etc.).

* Charts to facilitate the identification of time-consuming application segments (10 longest-executing regions).

* Flat and call-path tables to display high-level statistical performance information.

* A visualization to detect and show event-level load-balancing issues (Figure 4-10a).

* A chart to compare related experimental runs (Figure 4-10b), such as runs of the same program using various system sizes or runs of different versions of the same program.

* A display showing the inter-PE communication volume for all data transfer operations in the program (providing PE-to-PE or PE-to-global-memory communication statistics).

Table 4-4. Profiling/tracing file size and overhead for UPC NPB 2.4 benchmark suite
Benchmark            CG    EP   FT   IS    MG
Size (profile) (KB)  113   840  369  276   195
Size (trace) (MB)    0.15  34   142  1050  4560
Overhead (profile)   < 2.7% for all benchmarks
Overhead (trace)     < 4.3% for all benchmarks

* A unique, UPC-specific Array Distribution display that depicts the physical layout of
shared objects in the application on the target system (Figure 4-11).

* Exports of trace data for viewing with the Jumpshot and Vampir [23] timeline viewers.

4.5 PPW Extensibility, Overhead, and Storage Requirement

We first developed the Parallel Performance Wizard tool to support UPC application analysis, which took close to two years. The tool was later extended to support SHMEM and MPI 1.x, each taking less than six months to complete, with a large portion of that time spent on system configuration. In Tables 4-4, 4-5, and 4-6, we provide experimental data on the data-collection overhead and storage space needed for PPW on an Opteron cluster connected with a QsNet Quadrics interconnect, using the following test programs.

For UPC, we executed the George Washington University's UPC NPB version 2.4 benchmark suite (class B) [24] with Berkeley UPC [3] version 2.6. For SHMEM, we executed Quadrics's APSP and CONV test programs and an in-house SHMEM SAR application (see Section 6.3) with Quadrics SHMEM [4]. For MPI, we used the Tachyon application [25], an in-house MPI SAR application [26], and an in-house Corner Turn application with MPICH2 [27] v1.0.8. All programs were instrumented and monitored using PPW (v2.2), with performance data collected for all UPC/SHMEM/MPI constructs and user functions in each program. In all cases, the data-collection overhead numbers (< 2.7% for profiling, < 4.3% for tracing) are comparable to existing performance tools. Tracing data size is linearly related to the total number of events instrumented by the tool; on average, PPW requires 17 MB of storage space per 1 million trace events.
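This average yields a simple back-of-envelope storage estimator; the helper below is our illustration (assuming a uniform event rate), not part of PPW:

```c
/* Back-of-envelope storage estimate from the measured average of
 * roughly 17 MB of trace data per 1 million trace events. */
double estimated_trace_mb(double events_per_pe_per_sec,
                          int npes, double runtime_sec) {
    double total_events = events_per_pe_per_sec * npes * runtime_sec;
    return 17.0 * (total_events / 1e6);
}
```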

Figure 4-10. (a) Load-balancing analysis visualization for CG 256-PE run, (b) Experimental set comparison chart for Camel 4-, 8-, 16-, and 32-PE runs



I ,,- 1 1,

Node ID 7 .... S.

S------------i ----
U II I ~4 .

1 menu to select
Shared array
- -

M I" I
Properties of the selected
shared array sx
(user is free to change the
blocksize. dimension, and
thread number of selected
shared array to see what the
new physical layout would
be like)

,r lt.JrI.1 -
1 l.. n .:.r..
2 1 L .j *

sx array elements with affinity
to node 3

*I J":. .-* 2 i rL:C I I T 'N I.
.- L I

Figure 4-11. Annotated screenshot of new UPC-specific array distribution visualization
showing physical layout of a 2-D 5x8 array with block size 3 for an 8-PE

Table 4-5. Profiling/tracing file size and overhead for SHMEM APSP, CONV, and SAR
                     APSP    CONV    SAR
Size (profile) (KB)  116     84      256
Size (trace) (MB)    0.14    0.12    0.26
Overhead (profile)   < 0.1%  < 0.1%  < 0.1%
Overhead (trace)     < 0.1%  < 0.1%  < 0.1%

Table 4-6. Profiling/tracing file size and overhead for MPI Corner Turn, Tachyon, and SAR

Table 4-7. Profiling/tracing file size and overhead comparison of PPW, TAU, and Scalasca tools for a 16-PE run of the MPI IS benchmark
                                                 PPW    TAU    Scalasca
Size (profile) (KB)                              133    77.5   56
Size (trace, uncompressed) (MB/million events)   16.16  23.75  11.44
Overhead (profile)                               < 1%   < 1%   < 1%
Overhead (trace)                                 < 1%   < 1%   < 1%

4.5.1 PPW, TAU, and Scalasca Comparison

In Table 4-7, we show a brief overhead and data-size comparison of the PPW, TAU,

and Scalasca tools for a 16-PE run of the MPI IS benchmark on a Quad-core Xeon cluster

using a Gigabit Ethernet interconnect. From this table, we see that the data-collection

overhead for each of the three tools is negligible compared to the total execution time. For

profiling, PPW requires a higher (but still quite small) amount of storage space than TAU

and Scalasca. For ti i-,,.- PPW requires 16.16 MB of uncompressed storage space per one

million trace events generated while TAU and Scalasca require 24.03 MB and 11.44 MB of

uncompressed storage space, respectively.

4.5.2 PPW Tool Scalability

To evaluate the scalability of PPW, we conducted 128-, 256-, and 512-PE runs of

GWU's UPC NPB version 2.4 benchmarks (class B) using Berkeley UPC 2.8 (via the

GASNet MPI conduit using MPICH2 version 1.0.8) on an 80-PE Intel Quad-core Xeon

cluster with a Gigabit Ethernet interconnect (with all model constructs and user functions

instrumented). In Figure 4-12, we show the communication statistics visualization for a

256-PE run of CG (Figure 4-12a) and the zoomed-in Jumpshot view of an MG 512-PE

run (Figure 4-12b). As shown in Table 4-8, the data collection overhead numbers are

higher than with runs on smaller system sizes but are still within the acceptable range

(< 6.3% for profiling, < 5.61% for tracing). In all cases, profile data size remained in

the manageable MB range. In contrast, trace data size for larger runs is significantly

greater for some of these benchmarks that exhibit weak scaling, such as CG and MG (for

benchmarks that exhibit strong scaling, the data size stays relatively constant); this characteristic could become an issue as system size continues to increase.

Table 4-8. Profiling/tracing file size and overhead for medium-scale UPC NPB 2.4 benchmark suite runs
                   128      256
CG  Overhead       0.7%     0.31%
    Data size      4.0 MB   14.1 MB
EP  Overhead       6.3%     0.91%
    Data size      3.2 MB   11.8 MB
FT  Overhead       4.4%     N/A
    Data size      4.9 MB   N/A
IS  Overhead       < 1%     4.0%
    Data size      3.7 MB   12.8 MB
MG  Overhead       4.55%    < 1%
    Data size      8.0 MB   21.3 MB

4.6 Conclusions

The goal of the first part of this research was to investigate, design, develop, and evaluate a model-independent performance tool framework. While many tools support performance analysis of message-passing programs, tool support is limited for applications written in other programming models such as those in the PGAS family. Existing tools were specifically designed to support a particular model (i.e., MPI) and became too tightly coupled with that model. As a result, a significant amount of effort from the tool developers is needed to add new model support. To address this issue, the PPW performance tool framework with two novel concepts was developed. We introduced the generic-operation-type abstraction concept and illustrated how the generic-operation-type-based event model helps in minimizing the dependence of a tool on its supported programming models. We presented the need for the new GASP interface and showed how it simplifies the otherwise cumbersome data collection process. With the inclusion of these two concepts, our PPW tool framework supports and is easily extensible to support a wide range of programming models.


Figure 4-12. (a) Data transfers visualization showing communication volume between PEs for 256-PE CG benchmark tracing mode run, (b) Zoomed-in Jumpshot view of 512-PE MG benchmark


To validate the proposed framework, we developed the PPW prototype tool that originally supported manual analysis of UPC applications and was later extended to support manual analysis of SHMEM and MPI 1.x programs. We showed that while it took over two years to develop the first version, extending the prototype to other programming models was achieved fairly quickly (less than 6 months each for SHMEM and MPI), proving that our proposed framework is highly extensible. In addition, we demonstrated that our PPW prototype incurred overhead (< 2.7% for profiling and < 4.3% for tracing for all supported models) well within the acceptable range, is comparable to other performance tools, and is still usable up to 512 PEs.

Future work on this part of the research includes integrating PPW into the Eclipse development environment; enhancing the scalability of existing PPW visualizations; improving data-collection overhead, management, and storage on larger systems; and collecting lower-level (e.g., model runtime and network-related) performance information using GASP.


Performance tools that collect and visualize raw performance data have proven to

be productive in the application optimization process. However, to be successful in this

manual analysis process, the user must possess a certain degree of expertise to discover and fix performance bottlenecks, thus limiting the usefulness of the tool, as non-expert

programmers often do not have the skill set needed. In addition, as the size of the

performance dataset grows, it becomes nearly impossible to manually analyze the data,

even for expert programmers. One viable solution to this issue is an automatic analysis

system that can detect, diagnose, and potentially resolve bottlenecks.

In this chapter, we present a new automatic analysis system that extends the

capabilities of the PPW performance tool. The proposed system supports a range of

analyses that (to our knowledge) no single existing system provides and uses novel

techniques such as baseline filtering and a parallelized analysis process to improve

execution time and responsiveness of analyses. In addition, because it is based on the

generic-operation-type abstraction introduced earlier, the analysis framework is applicable

to any parallel programming model with constructs that can be mapped to the supported

operation types.

To avoid confusion, we begin by defining some important terms used in the remainder

of this chapter. A performance property (or pattern) defines an execution behavior of

interest within an application. A performance bottleneck is a performance property with

non-optimal behavior. Bottleneck detection (or identification, discovery) is the process of

finding the locations (PE, line of code, etc.) of performance bottlenecks. Cause analysis1 is

the process of discovering the root causes of performance bottlenecks (e.g., late barrier

entrance caused by uneven work distribution). Bottleneck resolution is the process of

identifying potential strategies that may be applied to remove the bottlenecks. Automatic

optimization refers to source code transformation and/or changes in the execution environment made by the tool to improve application performance.

Figure 5-1. Tool-assisted automatic performance analysis process

Finally, a hotspot is a

portion of the application that took a significant percentage of time to execute and thus is

a good candidate for optimization.

5.1 Overview of Automatic Analysis Approaches and Systems

Automatic (or automated) analysis is a tool-initiated process to facilitate the finding

and ultimately the removal of performance bottlenecks within an application. The entire

process may involve the tool, with or without user interaction, performing some or all of

the tasks illustrated in Figure 5-1 on the application under investigation. Note that in

the figure, performance data collection refers to the gathering of additional data on top of

what the tool collects by default. In the remainder of this section, we provide an overview

of existing work relating to automatic analysis.

The APART Specification Language (ASL) [28] is a formal specification model

introduced by the APART [29] working group to describe performance properties via three

components: a set of conditions to identify the existence of the property, a confidence

value to quantify the certainty that the property holds, and a severity measure to describe

the impact of the property on performance. The group used this language to provide a

1 Because bottleneck detection and cause analysis are closely tied to each other, in some
literature they are together referred to as the bottleneck detection process.

list of performance properties for the MPI, OpenMP, and HPF programming models and

noted the possibility of defining a set of base (model-independent) performance properties.
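The three ASL components map naturally onto a small data structure; the sketch below is our illustration of the idea, not ASL's actual syntax:

```c
#include <stddef.h>

/* Sketch of an ASL-style performance property: a condition that
 * detects the property, a confidence in the diagnosis, and a
 * severity measure of its impact on performance. */
typedef struct {
    const char *name;
    /* condition: does the property hold for a measurement? */
    int (*holds)(double measured, double expected);
    /* confidence: certainty that the property holds, 0..1 */
    double confidence;
    /* severity: impact relative to total execution time */
    double (*severity)(double lost_time, double total_time);
} perf_property_t;

/* Example condition and severity functions (thresholds illustrative). */
static int excessive_wait(double measured, double expected) {
    return measured > 2.0 * expected;
}
static double fraction_lost(double lost_time, double total_time) {
    return total_time > 0.0 ? lost_time / total_time : 0.0;
}
```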


HPCToolkit and TAU are examples of tools providing features to evaluate the

scalability of an application using profiling data. HPCToolkit uses the timing information

from two experiments to identify regions of code with scalability behavior that deviates

from the weak or strong scaling expectation [30]. PerfExplorer is an extension of TAU

that generates several types of visualizations that compare the execution time, relative

efficiency, or relative speedup of multiple experiments [31]. In addition, PerfExplorer

includes techniques such as clustering, dimension reduction, and correlation analysis to

reduce the amount of performance data the user must examine.

Periscope, KappaPI-2, and KOJAK are knowledge-based tools that support the

detection of well-known performance bottlenecks defined with respect to the programming

model. The advantage of a knowledge-based system is that little or no expertise is

required of the user to successfully analyze the program. Periscope supports online

detection of MPI, OpenMP, and memory system related bottlenecks (specified using

ASL) through a distributed hierarchy of processing units that evaluate the profiling data

[32]. KappaPI-2 is a post-mortem, centralized, tree-based analysis system that supports

bottleneck detection, cause analysis, and bottleneck resolution (via static source code

analysis) using tracing data [33]. Finally, EXPERT is a part of KOJAK (now known as

Scalasca) that supports post-mortem bottleneck detection and cause analysis of MPI,

OpenMP, and SHMEM bottlenecks (specified using ASL). The developers recently

introduced an event-reply strategy to allow parallel, localized analysis processing which

has been successfully applied to MPI [34], but it remains questionable whether such a

strategy works well for other programming models.

Hercules [35] is a prototype knowledge-based extension of TAU that detects and

analyzes causes of performance bottlenecks with respect to the programming paradigm

(such as master-worker, pipeline, etc.) rather than the programming model. An advantage

of this system is that it can be used to analyze applications written in any programming

model. Unfortunately, the system cannot handle applications developed using a mixture of

paradigms or that do not follow any known paradigm at all, making it somewhat limited

in applicability.

Paradyn's online W3 search model [36] was designed to answer three questions

through iterative refinement: why is it performing poorly, where are the bottlenecks, and

when did the problems occur. The W3 search system analyzes instances of performance

data at runtime, testing a hypothesis which is continually refined along one of the three

question dimensions. The W3 system considers hotspots to be bottlenecks, and since not

all hotspots contribute to performance degradation (they could simply be performing

useful work), the usefulness of this system is somewhat limited.

The main idea behind the design of NoiseMiner [37], a component of the Projections

tool, is that events of similar type should have similar performance under ideal circumstances.

Utilizing this assumption, the system makes a pass through the trace log, assigns an

expected performance value to each event type, and then identifies specific trace events

with performance that do not meet the expectations (i.e., noisy events).

Performance Assertions (PA) is a prototype source code annotation system for the

specification of performance expectations [38]. Once performance assertions are explicitly

added by the user, the PA runtime collects data needed to evaluate these expectations

and selects the appropriate action (e.g., alert the user, save or discard data, call a specific

function, etc.) during runtime. IBM has also developed an automated bottleneck detection

system enabling the detection of arbitrary performance properties within an application

[39]. The system supplies users with an interface to add new performance properties using

pre-existing metrics and to add new metrics needed to formulate the new properties. With

both of the above systems, a certain degree of expertise is required of the user to formulate

meaningful assertions or properties.

Each of the above approaches has made a contribution to the field of automatic

performance analysis. Each also has particular drawbacks that limit its effectiveness or

applicability. In light of the ongoing evolution and ever-increasing complexity of parallel programming models and environments, we have sought to make corresponding progress in effective analysis functionality for a variety of modern programming models.

5.2 PPW Automatic Analysis System Design

The PPW automatic analysis system focuses on optimization of observed application execution time, with the goal of guiding users (and possibly providing hints) to specific regions of code on which to focus their optimization efforts. The proposed system is novel in

several aspects. First, the analysis system makes use of the same (model-independent)

generic-operation-type abstraction underlying our PPW tool. As a result, our analysis

system can be easily adapted to support a wide range of parallel programming models and

naturally supports the analysis of mixed-model applications (i.e., programs written using

two or more models). The use of this abstraction also improves the system's capabilities

by allowing in-depth analysis of some user-defined functions. For example, by simply instructing the system to treat a user-defined upc_user_waituntil function in a UPC program as a wait-on-value-change operation (adding one line in the event type mapping), the system is able to determine the cause of any delays associated with this function.
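The event-type mapping idea described above can be sketched as follows. This is an illustrative sketch only: the actual PPW mapping format is not given in the text, so the table structure, function names, and the generic type labels here are assumptions.

```python
# Model-specific events mapped onto model-independent generic operation types.
# Both the mapping format and the labels below are hypothetical illustrations.
EVENT_TYPE_MAP = {
    "upc_barrier": "group-synchronization",
    "upc_memget":  "one-sided-get",
    "MPI_Send":    "two-sided-send",
    "shmem_put":   "one-sided-put",
}

def classify(event_name: str) -> str:
    """Return the generic operation type for an event; unknown events are
    treated as opaque user functions."""
    return EVENT_TYPE_MAP.get(event_name, "user-function")

# Adding one mapping line re-classifies a user-defined function so the
# analyses can reason about it like a built-in wait operation:
EVENT_TYPE_MAP["user_waituntil"] = "wait-on-value-change"

print(classify("user_waituntil"))  # wait-on-value-change
print(classify("upc_memget"))      # one-sided-get
```

With this one-line change, the same bottleneck analyses that understand built-in wait-on-value operations apply unchanged to the user's function.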

Second, we introduce several techniques, such as a new baseline filtering technique2 that identifies performance bottlenecks via comparison of actual to expected performance values and is generally more accurate than the deviation filtering technique used in NoiseMiner.

Third, our system performs a range of existing and new analyses including scalability

analysis, load-balance analyses, frequency analysis, barrier-redundancy analysis, and

common-bottleneck analysis, whereas other systems support only a few of these analyses.

Finally, we have developed a scalable analysis processing technique to minimize the execution time and maximize the responsiveness of the analyses. This process is designed to allow

multiple localized analyses to take place in parallel (involving minimal data transfers) and

is able to identify specific bottleneck regions using only profile data and determine the

cause of the bottleneck when trace data is available. Compared to other existing parallel analysis systems (such as the event-reply strategy introduced in [34]), our system is inherently portable, since the analysis process is not tied to the execution environment used to run the application.

5.2.1 Design Overview

The high-level architecture of the PPW automatic analysis system is depicted in

Figure 5-2. The analyses supported by the system are categorized into two groups:

application analyses, which deal with performance evaluation of a single run, and experiment set analyses, which compare the performance of related runs. We designed this

system to focus on providing analyses to help both novice and expert users in optimizing

their application via source code modification. In particular, these analyses focus on

finding operations that took longer than expected to run, operations that may be

redundant, and operations that could be transformed into other operations to improve performance.


The parallelized analysis processing mechanism is a peer-to-peer system consisting of up to N processing units (where N is the application system size): up to N-1 non-head processing units, each with (local) access to raw data from one PE or a group of PEs, and one head processing unit that requires access to a small portion of data from all PEs to perform global analyses. This inherently parallel design is intended to support the

analysis of large-scale applications in a reasonable amount of time. In Figure 5-3, we

illustrate the types of analyses conducted and the raw data exchange needed for all

processing units in an example 3-processing unit system.

2 The baseline filtering technique is new for automatic analysis but is routinely used in system performance evaluation.


Figure 5-2. PPW automatic analysis system architecture

The complete analysis process can be broken down into several distinct categories.

Figure 5-4 depicts the analysis workflow for a processing unit in the system, which includes common-bottleneck analysis, global analyses, frequency analysis, and bottleneck resolution. We now describe these categories in more detail in the following sections; a summary of the analyses currently supported by PPW is presented in Table 5-1.

5.2.2 Common-Bottleneck Analysis

The goal of Common-Bottleneck Analysis is to identify commonly encountered performance bottlenecks that application developers have seen over the years; because these bottlenecks occur so often, well-known optimization strategies usually exist that can be applied to remove them. For example, a common optimization strategy to remove

Table 5-1. Summary of existing PPW analyses
Name                        | Purpose                                                     | Required data type            | Related bottlenecks
Scalability Analysis        | Determine scalability of an application                     | Profile data (multiple runs)  | Low application scalability
Revision Analysis           | Compare performance of different revisions                  | Profile data (multiple runs)  |
High-Level Analysis         | Compare comp., comm., sync. among PEs                       | Profile data                  | PE-level load-balancing, low comp/comm ratio
Block-Level Analysis        | Detect load-balancing issues of individual program blocks   | Tracing data                  | Block-level load-balancing
Event-Level Analysis        | Detect load-balancing issues of individual events among PEs | Profile data                  |
Barrier-Redundancy Analysis | Identify unnecessary barrier operations                     | Tracing data (A2A, data xfer) |
Shared-Data Analysis        | Evaluate data affinity efficiency                           | Profile data (data xfer)      | Poor data locality
Frequency Analysis          | Identify short-lived high-frequency operations              | Tracing data                  | Inefficiency relating to multiple small transfers
Bottleneck Detection        | Identify potential bottleneck locations                     | Profile data                  | Delayed operations
Cause Analysis              | Identify causes and types of common-bottlenecks             | Trace data (min. data xfer)   | Bottlenecks listed in Table 5-2


Figure 5-3. Example analysis processing system with 3 processing units, showing the analyses each processing unit performs and the raw data exchange needed between processing units

Table 5-2. Common-bottleneck patterns currently supported by PPW and data needed to perform cause analysis
Bottleneck pattern        | Local data type    | Remote data type    | Request targets
Wait on group sync./comm. | Global sync./comm. | Global sync./comm.  | All other
Wait on lock availability | P2P lock           | P2P unlock          | All other
Wait-on-value change      | P2P wait-on-value  | One-sided data xfer | All other
Competing put/get         | One-sided put/get  | One-sided put/get   | All other
Late sender               | Two-sided send     | Two-sided receive   | Receiver PE
Late receiver             | Two-sided receive  | Two-sided send      | Sender PE


Figure 5-4. Analysis process flowchart for a processing unit in the system


Late Sender, a common bottleneck relating to two-sided data communication, is to move the send call forward in the execution order.

Common-bottleneck analysis is the most substantial and time-consuming analysis supported by the PPW analysis system. Unlike other trace-based systems such as KOJAK, our approach uses both profile and trace data in the analysis and is scalable by design. PPW's Common-Bottleneck Analysis process is separated into two phases: a Bottleneck Detection phase to identify potential bottleneck regions using profile data, followed by a Cause Analysis phase to determine the cause of these bottlenecks using trace data.

Bottleneck detection

The goal of Bottleneck Detection is to identify program regions that, when optimized, could improve the application performance by a noticeable amount. During this detection phase, each processing unit examines its portion of the (profile) data and identifies potential bottleneck entries. For each of the profile entries, the processing unit first checks whether or not that entry's total execution time exceeds a preset percentage of the total application time (i.e., whether it is a hotspot). The purpose of this filtering is to focus the effort on portions of the program that would noticeably improve the performance of the application when optimized.

Next, the processing unit decides if the identified hotspot is a potential bottleneck by applying one of the following two comparison methods. With the baseline comparison method, the processing unit marks the entry as a bottleneck if the ratio of its average execution time to its baseline execution time (the minimal amount of time needed for a given operation to complete its execution under ideal circumstances) exceeds a preset threshold.

If the baseline comparison method is not applicable (e.g., because the entry is a user function or no baseline value has been collected for the operation), the processing unit

uses the alternative deviation evaluation method. With this method we make use of the

following assumption: under ideal circumstances, when an event is executed multiple

times, the performance of each instance should be similar to that of other instances (the

same assumption is used in NoiseMiner). Thus for each hotspot entry, the processing unit

calculates the ratio of its minimal execution time and of its maximum execution time to

its average execution time. If one or both of the ratios exceeds a preset threshold, the

processing unit marks the entry as a potential bottleneck.

Cause analysis

The list of potential bottlenecks identified in the detection phase points application

developers to specific regions of code to focus their attention on but does not contain

sufficient information to determine the causes of the performance issues, which is often needed to devise an appropriate optimization technique. To provide this detailed information, PPW's Cause

Analysis, using available trace data, aims at finding remote events that possibly caused the

bottlenecks identified.

The underlying concept behind our approach is that if some remote events caused

the local event to execute non-optimally (as opposed to delays caused by other factors, such as network congestion, not related to event ordering), then these remote events must have occurred between the start and end time of the local event. It is because of this concept that the amount of data exchanged between processing units is minimized, as only the

related events that occurred during this time range need to be exchanged (compared to

the event-reply strategy introduced in [34] where all relating events must be exchanged).

For example, for a upc_lock event on PE 0 with a start time of 2 ms and an end time of 5 ms, the request entry {PE 0, 2 ms, 5 ms, P2P unlock} would be issued to all processing

units. The logic behind this example is the following: if at the time of the lock request,

another PE holds the lock, the P2P lock operation issued by PE 0 will block until it is

released by the lock holder. To find out which PE(s) held the lock that caused the delay in

the P2P lock operation, we simply look at the P2P unlock operations issued between the

start and end time of the lock operation. If no P2P unlock operation was issued by any

other PEs, we conclude that the delay was caused by uncontrollable factors that cannot

be resolved by the user, such as network congestion due to concurrent execution of other applications.
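The time-window matching at the heart of this example can be sketched in a few lines. The request and trace-record formats below are simplified illustrations; the field names and tuple layouts are assumptions, not PPW's actual data structures.

```python
def find_causes(request, remote_trace):
    """Return remote events of the requested type whose timestamps fall
    inside the blocked operation's [start, end] window."""
    pe, start, end, wanted_type = request
    return [(name, ts) for (name, ts, ev_type) in remote_trace
            if ev_type == wanted_type and start <= ts <= end]

# PE 0 blocked in a P2P lock operation from t=2 ms to t=5 ms and asks the
# other processing units for matching P2P unlock operations in that window.
request = ("PE 0", 2.0, 5.0, "P2P unlock")

# Hypothetical trace fragment held locally by PE 1: (name, timestamp, type).
pe1_trace = [
    ("upc_unlock", 1.2, "P2P unlock"),    # before the window: not a cause
    ("upc_unlock", 4.1, "P2P unlock"),    # inside the window: likely cause
    ("upc_memput", 4.5, "one-sided put"), # wrong type: ignored
]

print(find_causes(request, pe1_trace))  # [('upc_unlock', 4.1)]
```

An empty result corresponds to the case discussed above: no other PE released the lock during the window, so the delay is attributed to uncontrollable factors.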


During the cause analysis phase, each processing unit carries out several activities

using local tracing data in a two-pass scheme. In the first trace-log pass, the processing

unit identifies trace events3 with source location matching any of the profiling entries

discovered in the detection phase. For each matching trace event, the processing unit

generates a request entry containing its name, start time, and end time along with the

event's operation type which is sent to other processing units to retrieve appropriate trace

data (Table 5-2 illustrates the current set of common-bottlenecks -which is currently

hard-coded -supported by our system). At the end of the first pass, the processing unit

sends the requests out to all other processing units and waits for the arrival of requests

from all other processing units.

Next, the processing unit makes a second pass through its trace log, generates the corresponding replies (consisting of {event name, timestamp} tuples), and sends them back to

the requesting processing units. Finally, the processing unit waits for the arrival of replies

and completes the cause analysis by assigning a bottleneck pattern name to each matching

trace event, along with the remote operations that contributed to the delay.

In terms of execution time, we expect bottleneck detection to complete relatively quickly, as the number of profile entries is usually not large (data size in the KB range). By

contrast, we expect cause analysis to take significantly longer to complete due to its use of

trace data; we expect the execution time of cause analysis to be linearly proportional to

the number of trace events in the performance data file.

3 Processing units can choose to apply the filtering techniques on each trace event during this pass to further reduce the amount of data exchange needed between processing units.

5.2.3 Global Analyses

PPW supports several analyses that require a global view of the performance data;

more specifically, the head processing unit needs to have access to some data from all

PEs. Depending on the data types required by the analyses, the time required to carry

out these analyses4 will vary. In the remainder of this section, we briefly discuss these

global analyses which include analyses to compare the performance of multiple related

experiments (scalability analysis and revision analysis) and analyses to evaluate the

performance of a single run (barrier-redundancy analysis, shared-data analysis, and several

analyses to evaluate load-balance among PEs).

Scalability analysis

Scalability of an application is a major concern for developers of parallel applications. With ever-growing parallel system sizes, it is becoming more important for

applications to exhibit good scalability. Using profiling data from two or more experiments

on different system sizes, PPW's Scalability Analysis evaluates an application's scalability

(or more precisely, its parallel efficiency, the ratio of parallel performance improvement

over the size increase). From the experiment with the smallest number of PEs, the head

processing unit calculates the parallel efficiency for all other experiments. An efficiency of

1 indicates that the application exhibits perfect scalability, while a value approaching 0

suggests very poor scalability.

Revision analysis

During the iterative measure-modify process that a user performs to optimize his or her application, the user often produces multiple revisions of the application, with each revision containing code changes aimed at removing or minimizing the performance

issues discovered. To assist in evaluating the performance effects of these code changes,

4 Note that these analyses can be performed at any time during the analysis process or
as part of the other analyses.

PPW's Revision Analysis facilitates performance comparison of the application and the 10

longest-running code regions between revisions; this analysis is used to determine whether

or not code changes improved program performance and if so, what part of the program

was improved.

Load-balancing analyses

Achieving good work distribution among PEs is difficult and often impacts the

performance and scalability of the application significantly. To help in this aspect, PPW

provides several analyses to investigate an application's workload distribution at different

levels. At the highest level, PPW's High-Level Analysis calculates and compares the total

computation, communication, and synchronization time among PEs. Since the PEs with

largest computation time (i.e., highest workload) often determine the overall performance

of the application, this analysis assists in the identification of bottleneck PEs (PEs that

when optimized improve the overall application performance).

Next, the Block-Level Analysis aims at identifying specific program blocks5 with

uneven work distribution and thus further guides users to parts of the program where they

should focus their efforts. By ensuring that all program blocks have good load-balance, the

user essentially achieves good load balance for the entire application.

Finally, at the lowest level, the Event-Level Analysis compares the workload of

individual events (i.e., a specific line of code or code region) among PEs, which is

extremely useful when the event under investigation represents workload that was meant

to be parallelized or is a global synchronization event (as an uneven global synchronization

often stems from uneven workload distribution prior to the synchronization call).
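The per-event comparison described above can be sketched with a simple imbalance metric. Both the metric (max-to-mean ratio of per-PE times) and the sample numbers are illustrative assumptions, not PPW's exact formula.

```python
def imbalance(times_per_pe):
    """Return the max/mean ratio of per-PE execution times for one event;
    values near 1.0 indicate even workload distribution."""
    mean = sum(times_per_pe) / len(times_per_pe)
    return max(times_per_pe) / mean

balanced   = [10.0, 10.2, 9.8, 10.0]   # workload evenly parallelized
unbalanced = [10.0, 10.0, 10.0, 30.0]  # PE 3 carries extra work

print(round(imbalance(balanced), 2))    # 1.02
print(round(imbalance(unbalanced), 2))  # 2.0
```

Applied to a global synchronization event, a high ratio would point at uneven workload before the barrier, as noted in the text.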

5 A program block is defined as a segment of code between one global synchronization and the next, similar to a block in the Bulk Synchronous Parallelism (BSP) computation model.

Figure 5-5. Barrier-redundancy analysis

Barrier-redundancy analysis
To ensure program correctness, programmers may often insert extra global barrier
calls that give rise to the performance degradation depicted in Figure 5-5. To detect
potentially redundant barriers, PPW's Barrier-Redundancy Analysis examines the
shared data accesses between barrier calls and identifies calls with no shared data
accesses between the target call and the one before it as redundant. The idea is that
since barriers are often used to enforce global memory consistency, a barrier call with
no prior shared data accesses may not be needed. The output of this analysis is a list of potentially redundant barrier calls (with source information) that the user may consider removing from the program.

Shared-data analysis
Data affinity (or locality) is a very important factor in parallel programming, and it is often a major deciding factor between a well-performing and a poorly performing parallel application.
To assess an application's data locality efficiency, PPW's Shared-Data Analysis measures
the ratio of local-to-remote access6 for all PEs and combines them into a single data access
ratio that could be used to determine the application locality efficiency (typically, the
higher the ratio, the better the program). This analysis could be refined to analyze the



locality efficiency of a specific shared region (such as UPC shared array) when the tool

knows the specific memory regions that a particular data communication call touches.

In the case of UPC, this refinement is extremely useful as it allows the determination of

the best blocking factor leading to minimized remote data access on all PEs (part of the

Bottleneck Resolution).
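The combined access-ratio idea can be sketched as follows. How PPW actually aggregates per-PE counts is not specified in the text, so the simple summed ratio below is an assumption for illustration.

```python
def locality_ratio(access_counts):
    """access_counts: list of (local_accesses, remote_accesses) per PE.
    Returns an overall local-to-remote access ratio (higher is better)."""
    local = sum(l for l, r in access_counts)
    remote = sum(r for l, r in access_counts)
    return local / remote if remote else float("inf")

good = [(900, 100), (950, 50)]   # PEs touch mostly local shared data
bad  = [(100, 900), (50, 950)]   # PEs touch mostly remote shared data

print(round(locality_ratio(good), 2))  # 12.33
print(round(locality_ratio(bad), 3))   # 0.081
```

The same score, restricted to accesses touching one shared array, is what the blocking-factor refinement mentioned above would optimize.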

5.2.4 Frequency Analysis

The existence of short-lived, high-frequency events (henceforth referred to simply

as high-frequency events) can affect the accuracy of the performance data collected, so

it is useful to identify these high-frequency events that should not be tracked during the

subsequent data collection process. More importantly, high-frequency events sometimes

represent events which are highly beneficial to optimize (since they are called many

times) or in the case of data communication operations, could potentially be transformed

into more efficient bulk transfer operations (a widely known optimization technique, as

illustrated in Figure 5-6). For these reasons, PPW includes a memory-bound Frequency

Analysis aimed at identifying high-frequency events. By making a pass through the trace

data, this analysis identifies a list of high-frequency events for each PE.
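A single-pass sketch of this analysis is shown below. The call-count and duration thresholds are illustrative assumptions; the text does not specify PPW's actual cutoffs.

```python
from collections import defaultdict

def high_frequency_events(trace, min_calls=100, max_avg_time=0.001):
    """trace: iterable of (event_name, duration) records for one PE.
    Returns event names that are both frequent and short-lived."""
    count = defaultdict(int)
    total = defaultdict(float)
    for name, duration in trace:   # single pass over the trace data
        count[name] += 1
        total[name] += duration
    return sorted(name for name in count
                  if count[name] >= min_calls
                  and total[name] / count[name] <= max_avg_time)

# 500 tiny one-sided gets dominate the event count but not the time:
trace = [("upc_memget", 0.0002)] * 500 + [("compute", 2.5)] * 3
print(high_frequency_events(trace))  # ['upc_memget']
```

Events flagged this way are candidates either for exclusion from subsequent data collection or for the bulk-transfer transformation of Figure 5-6.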

5.2.5 Bottleneck Resolution

In the final step of the analysis process, Bottleneck Resolution7, the processing unit

aims at identifying hints useful to the user in removing the bottlenecks identified in one

of the previous analyses (Table 5-3). This process is the only part of the system that

may need to be model-dependent, as a given resolution strategy may not always work for

all programming models. For example, a technique to fix the performance degradation

6 Note that due to factors such as variable aliasing, it may be very difficult to collect performance data relating to local accesses (and it is even more difficult to keep track of the specific memory addresses being accessed); thus it may not be possible to carry out this analysis for some programming model implementations.


Figure 5-6. Frequency analysis

stemming from upc_memget, versus from shmem_get, could be different even though they are both classified as one-sided get operations.

One example of a model-specific resolution technique is the identification of the

best blocking factor to use in declaring a high-affinity UPC shared array. When the

system detects an excessive communication issue associated with a shared array, the

processing unit would try to find an alternative blocking factor that would yield the best

local-to-remote memory access ratio for all PEs in the system.
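The blocking-factor search can be sketched by scoring candidate factors against an observed access pattern. The UPC affinity rule used here (element i of an array with blocking factor B has affinity to thread (i // B) % THREADS) follows the language's cyclic-block layout; the candidate list and scoring function are assumptions for illustration.

```python
def affinity(i, block, threads):
    """Owning thread of element i under blocking factor `block` (UPC rule)."""
    return (i // block) % threads

def best_blocking_factor(accesses, threads, candidates):
    """accesses: list of (accessing_pe, element_index) pairs.
    Returns the candidate factor yielding the most local (same-PE) accesses."""
    def local_count(block):
        return sum(1 for pe, i in accesses
                   if affinity(i, block, threads) == pe)
    return max(candidates, key=local_count)

# 4 PEs, each PE p touching elements 25*p .. 25*p+24 of a 100-element array;
# a blocking factor of 25 makes every access local.
accesses = [(p, 25 * p + k) for p in range(4) for k in range(25)]
print(best_blocking_factor(accesses, threads=4, candidates=[1, 5, 25]))  # 25
```

In this example the default cyclic layout (blocking factor 1) scatters each PE's working set across all threads, while a factor of 25 gives each PE affinity to exactly the elements it touches.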

5.3 Prototype Development and Evaluation

Several analysis system prototypes supporting UPC, SHMEM, and MPI were

developed and integrated into the latest version of the PPW tool. These prototypes

add to PPW a number of analysis components, corresponding to those shown in Figure

5-2, to perform the necessary processing, management of analysis data, and presentation

of analysis results to the tool user. To perform any of the analyses, the user brings up

the analysis user interface (Figure 5-7), selects the desired analysis type, and adjusts any

parameter values (such as the program-percentage threshold that defines the minimum hotspot

percentage) if desired. Once all the analyses are completed, the results are sent to an

analysis visualization manager which generates the appropriate visualizations.

7 Bottleneck resolution is currently an open research area.

Table 5-3. Example resolution techniques to remove parallel bottlenecks
Bottleneck pattern: Potential resolution techniques
Wait on group sync./comm.: Modify the code to achieve better work distribution; use multiple point-to-point synchronization operations
Wait on lock availability: Perform more local computation before the wait-on-lock operation; use multiple locks if appropriate
Wait-on-value change: Perform more local computation before the wait-on-value operation
Competing put/get: Use non-blocking put/get
Late sender: Perform less local computation before the local send operation; perform more local computation before the remote receive operation; use non-blocking receive
Late receiver: Perform less local computation before the local receive operation; perform more local computation before the remote send operation; use non-blocking send
Consecutive blocking data transfers to unrelated targets (i.e., different PEs, different memory addresses on same PE): Use non-blocking data transfers; use bulk transfer operation if appropriate
Multiple small data transfers to same PE: Combine multiple small transfers into a single bulk transfer
Poor data locality: Modify shared data layout (e.g., use different blocking factor in UPC)


Figure 5-7. PPW analysis user interface

To acquire the appropriate baseline values needed for the baseline filtering technique,

we created a set of bottleneck-free benchmark programs for each of the supported models.

These benchmarks are then executed on the target system, and the generated data files

are processed to extract the baseline value for each model construct.

PPW provides several new analysis visualizations to display the generated analysis

results. To facilitate experiment set analyses, a scalability-analysis visualization that

plots the calculated parallel efficiency values against the ideal parallel efficiency (Figure

5-8) and a revision-comparison visualization that facilitates side-by-side comparison of

observed execution times for regions within separate versions of an application (Figure

5-9) are supported. To visualize the analysis results of a single experiment, PPW includes a

high-level analysis visualization displaying the breakdown of computation, communication,

and synchronization time for each PE executing an application to evaluate the workload

distribution at a high level (Figure 5-10), an event-level load-balance visualization to

compare the workload of individual events across PEs (Figure 5-11), and a multi-table


Figure 5-8. Annotated PPW scalability analysis visualization

analysis visualization which displays the results from common-bottleneck detection and cause analysis, supplemented with source-code correlation (Figure 5-12). Finally, PPW generates a text-based report that provides a summary of the analyses performed; this report includes information such as the speed of analysis, the parameter values used, the number and list of bottlenecks found on each PE, and results from several analyses (block-level load-balancing analysis, frequency analysis, barrier-redundancy analysis, shared-data analysis) not displayed in the analysis visualizations just mentioned (Figure 5-13).

In the remainder of this section, we present details of the sequential, threaded, and

distributed prototypes developed and supply experimental results regarding the speed of

these prototypes.

5.3.1 Sequential Prototype

The proposed analysis system was first developed as part of the PPW Java front-end

to reflect a common PPW use case illustrated in Figure 5-14 where the user collects


Figure 5-9. PPW revision-comparison visualization


Figure 5-10. Annotated PPW high-level analysis visualization


Figure 5-11. PPW event-level load-balance visualization

application performance data on the parallel system using the PPW back-end, transfers

the combined performance data file to a personal workstation, and then visualizes the

collected data using the PPW front-end system.

In this initial prototype, a single processing unit was used to conduct all of the

selected analyses in a sequential fashion illustrated in Figure 5-15a, using main memory to

store intermediate (i.e., request and reply) and result data. To validate the correctness of

this prototype, we created a set of test programs written in UPC, SHMEM, and MPI in a

method similar to that discussed in [40]. This analysis test suite consists of control programs

with no bottlenecks and test programs which each contain a bottleneck pattern of interest.

We applied the analysis process on these programs and verified that the system is able to

detect the target bottlenecks correctly.

In Table 5-4, the speed of the analysis for several of the NAS 2.4 benchmarks

executed with 128 or 256 PEs is shown. The testbed for this experiment is an Intel

Core i7 Quad-core (with Hyper-Threading support) 2.66 GHz processor workstation with 6

GB of RAM running 64-bit Windows 7. As expected, the analysis speed for trace-related


Figure 5-12. Annotated PPW analysis table visualization


Figure 5-13. Annotated PPW analysis summary report

Figure 5-14. A common use case of PPW where the user transfers the data from the
parallel system to a workstation for analysis

Table 5-4. Sequential analysis speed of NPB benchmarks on workstation
Benchmark                    FT            MG            EP       EP
System size                  128           128           128      256
Avg. profile entries per PE  36000         23000         37       37
Total trace events           9.24 million  5.72 million  10574    21072
Analysis time                3821 s        1705 s        0.68 s   2.27 s
                             (63.7 min)    (28.4 min)

analyses dominates the overall execution time (profile-based analyses all took less than 1
ms); we observed that the analysis speed is linearly proportional to the number of trace
events (0.15-0.2 million trace events per minute).
5.3.2 Threaded Prototype

While these initial performance results were encouraging, we quickly realized that
the sequential approach would not suffice for two reasons. First, the largest data size that
the sequential prototype could analyze is limited by the amount of memory available; our
attempt to analyze a 128-PE run of the CG benchmark (31.5 million trace events) was
unsuccessful for this reason. Second, the time required to complete the analysis may
become unreasonably long for much larger data sizes; it may take hours or even days before
the user can see the result of the analysis of an experiment with a huge amount of trace data.

Figure 5-15. Analysis workflow for the (a) sequential prototype, (b) threaded prototype, and (c) distributed prototype

Table 5-5. Analysis speed of NPB benchmarks on workstation
Num. threads FT MG EP (128) EP (256)
1 (seq.) 3821 s (63.7 min) 1705 s (28.4 min) 0.68 s 2.27 s
2 2007 s (33.5 min) 1128 s (18.8 min) 0.37 s 1.10 s
4 1263 s (21.1 min) 709 s (11.8 min) 0.41 s 0.78 s
8 1026 s (17.1 min) 603 s (10.1 min) 0.39 s 0.81 s
16 1234 s (20.6 min) 626 s (10.4 min) 0.79 s 1.35 s

Fortunately, since the design of the PPW analysis system is inherently parallel, we were

able to develop parallel versions of our system to address these two issues: a threaded

prototype that takes advantage of the now-prevalent multi-core workstation architecture,

and a fully distributed prototype that can execute on a large cluster.

The modified analysis process of the (Java-based) threaded prototype is illustrated

in Figure 5-15b. In this threaded prototype, each processing unit (1 to K) is assigned a

group of PEs (1 to N) and is responsible for carrying out all the analyses for that group

of PEs. The results produced by the threaded prototype were validated against those

produced by the sequential prototype, and we again ran the analysis of NAS benchmarks

on the Core i7 workstation to measure the analysis speed. The results are shown in Table

5-5; from this table, we see that the analysis speed (for reasonably sized data files) scales

fairly well up to the number of cores (1 to 2 and 4 threads), shows a slight improvement

(4 to 8 threads) using Hyper-Threading, and slows down somewhat when the thread count

exceeded the number of processing units (16 threads). The analysis of the CG benchmark

was again unable to complete as the threaded prototype also uses main memory to store

all intermediate and result data structures.

5.3.3 Distributed Prototype

We have shown in the previous section that a threaded version of the analysis

improves the speed of analysis fairly well up to the number of cores on the workstation.

However, since the number of cores is limited on a single machine, we continued our

prototyping effort to develop a version of the PPW analysis system capable of running

on cluster systems that could contain thousands of PEs. There are several reasons for


Figure 5-16. A use case of PPW where analyses are performed on the parallel systems

developing a distributed version of the analysis system. First, the distributed version is
more scalable than the threaded version; it can support the analysis of larger data runs
and improves the analysis speed further due to the increased number of available processors.
Second, the distributed analysis process can now be executed as a batch job or as part
of the data collection process as shown in Figure 5-16. When running as part of the data
collection process, the result of the analysis could potentially be used to reduce the raw
data size and thus improve the scalability of the PPW tool itself.
The workflow for the distributed prototype (Figure 5-15c) is very similar to that of
the threaded prototype, except that each processing unit is now assigned to process the data of a
single PE (each processing unit has local access to the assigned PE's data) and intermediate
data (requests and replies) must now be exchanged across the network. As shown in Figure
5-17, the amount of memory space required on each processing unit is reduced (from
N×N×M requests and replies to 2×N×M requests and replies), and the system is now able to support
larger data files such as CG (containing 245,974 trace events per PE), which was unable to run
Figure 5-17. Memory usage of PPW analysis system on cluster

Table 5-6. Analysis speed of NPB benchmarks on Ethernet-connected cluster
Benchmark             FT          MG          CG          EP      EP
Num. PEs              128         128         128         128     256
1 processing unit     2113 s      1019 s      N/A         0.15 s  0.85 s
                      (35.2 min)  (17.0 min)
#PE processing units  32.5 s      242 s       16668 s     6.38 s  40.12 s
                      (0.5 min)   (4.0 min)   (4.63 hrs)
Speedup               65.02       4.21        N/A         0.02    0.02

successfully on the Core i7 workstation. The results produced by the distributed prototype

were again validated against those produced by the sequential prototype, and in Table 5-6

we show the analysis speed of the NAS benchmarks on an 80-PE Quad-core Xeon Linux

cluster connected using MPICH-2 1.0.8 over Ethernet.

We made several observations from this data. We saw that the sequential analysis

speed improved by almost a factor of 2 due to the move from a Java-based to a C-based

environment. More importantly, the analysis speed of the parallel version (128 or 256

processing units) is greatly improved for larger data files. We saw that the analysis speed

for EP (82 trace events per PE) worsened but this behavior was expected as there is

simply not enough work to be distributed. In the case of MG (44670 trace events per PE),

the analysis speed improved by a factor of 4. Finally in the case of FT (72230 trace events

per PE, more bottlenecks undergoing cause analysis than MG), the analysis speed was

improved by almost two orders of magnitude, demonstrating the performance benefit of

the distributed approach. We expect the performance improvement to be even more prominent

on systems with high-speed interconnects and for applications with a larger number of

trace events.

5.3.4 Summary of Prototype Development

We have developed several versions of the PPW analysis system and provided
experimental data on the speed of each. We observed that the analysis speed (in
all versions) is dependent not only on the size of the trace data but also on
the number of bottlenecks undergoing cause analysis. We have shown the correctness of
the PPW system design using a synthetic test suite and demonstrated the scalability
of the design through the performance improvement of both the threaded and
distributed versions over the initial sequential version. We noted that, while the
sequential and, to a lesser extent, the threaded prototype exhibit some scalability
issues, they are not without use: for analysis of experiments with a small to moderate amount of
data, the workstation versions are sufficient to complete the analysis in a reasonable
amount of time. However, when the number of trace events per PE exceeds a certain
amount, the user of PPW should use the more scalable distributed version.

5.4 Conclusions

The goal of the second part of this research was to investigate, design, and
evaluate a scalable, model-independent automatic analysis system. Performance-tool-assisted
manual analysis simplifies the cumbersome application optimization process but does not
scale. As the size of the performance dataset grows, it becomes nearly impossible for the
user to manually examine the data and find performance issues using the visualizations
supplied by the tool. This problem exposes the need for an automatic analysis system
that can detect, diagnose, and potentially resolve bottlenecks. While several automatic
approaches have been proposed, each has particular drawbacks that limit its
effectiveness or applicability.

To address this issue, we developed the model-independent PPW automatic
analysis system that supports a variety of programming models. We presented the architecture of
the PPW analysis system, introduced novel techniques such as the expected (baseline) time
filtering technique to improve detection accuracy, and described the scalable processing
mechanism designed to support large-scale application analysis. We showed correctness
and performance results for a sequential version of the system that has been integrated
into the PPW performance tool and then demonstrated the parallel nature of the design
and its performance benefits in the discussion of the threaded and distributed versions.


Future work for this part includes experimental evaluation on larger systems,

enhancements to the existing analyses (e.g., to reduce memory

requirements and to improve analysis speed), additional
analyses such as bottleneck resolution, expansion of the number of common-bottleneck

patterns the system detects, and development of functionality to allow users to define new

bottlenecks themselves.


In this chapter, we present studies used to evaluate the effectiveness of the proposed

PPW framework and automatic analysis system.

6.1 Productivity Study

To assess the usefulness and productivity of PPW, we conducted a study with a group

of 21 graduate students who had a basic understanding of UPC programming but were

unfamiliar with the performance analysis process. Each student was asked to spend several

hours conducting manual (via the insertion of printf statements) and tool-assisted (using

a version of PPW without automatic analysis support) performance analysis with a small

UPC cryptanalysis program called CAMEL (with approximately 1000 lines of code) known

to have several performance bottlenecks. Students were told to concentrate their effort on

finding and resolving only parallel bottlenecks.

The results demonstrated that PPW was useful in helping programmers identify

and resolve performance bottlenecks (Figure 6-1). On average, 1.38 bottlenecks were

found with manual performance analysis while 1.81 bottlenecks were found using PPW.

All students were able to identify at least as many bottlenecks using PPW, and one

third of them identified more bottlenecks using PPW. In addition, most students noted

that they had an easier time pinpointing the bottlenecks using PPW. However, only

six students were able to correctly modify the original code to improve its performance

(with an average performance gain of 38.7%), while the rest either performed incorrect

code transformations or were unable to devise a strategy to fix the issues. This inability

to modify the original code was not surprising, since the students were not familiar

with the algorithms used in the CAMEL program, were novices with respect to parallel

programming, and were asked to spend only a few hours on the task.

Students were also asked to compare the experiences they had with both approaches

in terms of code analysis (bottleneck identification) and optimization. Overall, PPW was

[Figure: three pie charts showing (a) Method with More Bottlenecks Identified, (b) Preferred Bottleneck Identification Method, and (c) Preferred Program Optimization Method]

Figure 6-1. Productivity study result showing (a) method with more bottlenecks identified
(b) preferred method for bottleneck identification (c) preferred method for
program optimization

viewed as a helpful tool by students, with most students preferring PPW over manual

performance analysis for reasons listed below (summarized from student feedback).

* Manual insertion and deletion of timing calls is tedious and time-consuming. While
not significantly difficult in this case, it can potentially be unmanageable for large
applications with tens of thousands of lines of code.

* A significant amount of effort was needed in determining where to insert the timing
calls, a process which was automated in PPW.

* Visualizations provided by the tool were much more effective in pinpointing the
source of bottlenecks and even more so in determining the cause of the bottlenecks.

6.2 FT Case Study

For the first application case study, we ran the Fourier Transform (FT) benchmark

(which implements a Fast Fourier Transform algorithm) from the NAS benchmark suite

version 2.4 using GASP-enabled Berkeley UPC version 2.6. Initially no change was

made to the FT source code, and the performance data were collected for the class B

setting executed using 16 PEs on an Opteron cluster with a Quadrics QsNet high-speed interconnect.


From the Tree Table (Figure 6-2), it was immediately obvious that the fft function

call (3rd row) constituted the bulk of the execution time (18s out of 20s of total execution

time). Further examination of performance data for events within the fft function revealed


Figure 6-2. Annotated PPW Tree Table visualization of original FT showing code regions

yielding part of performance degradation

the upc_barrier operations (represented as upc_notify and upc_wait) in transpose2_global

(6th row) as potential bottleneck locations. We came to this conclusion by observing that

the actual average execution times for upc_barrier at lines 1943 (78.71 ms) and 1953 (1.06 s)

far exceed the expected value of 2ms on our system for 16 PEs (we obtained the expected

value by running a simple benchmark). Looking at the code between the two barriers, we

saw that multiple upc_memget operations were issued and speculated that the bottleneck

was related to these operations. However, we were unable to verify this speculation or

determine the cause of this bottleneck based solely on these statistical data.

Thus, we then converted the trace data into the Jumpshot SLOG-2 format and

looked at the behavior of upc_barrier and upc_memget operations in a timeline view. We

discovered that the upc_barrier at line 1953 was waiting for the upc_memget operation


Figure 6-3. Annotated Jumpshot view of original FT showing serialized nature of
upc_memget at line 1950

to complete. In addition, we saw that upc_memget operations issued from the same PE

were unnecessarily serialized, as shown in the annotated Jumpshot screenshot (Figure 6-3;

note the zigzag pattern for memget operations). Looking at the start and end times of

the upc_memget operations issued from PE 0 to all other PEs (see the info box in Figure

6-3), we saw that the later upc_memget operations must wait for the earlier upc_memget

operations to complete before initiating, even though the data obtained were from different

sources and stored locally at different private memory locations.

A solution to improve the performance of the FT benchmark is to use a non-blocking

(asynchronous) bulk-transfer get such as bupc_memget_async provided by Berkeley UPC.
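The transformation can be sketched as follows. This is a hypothetical UPC sketch, not the benchmark's actual code: the array names, chunk size, and data layout are illustrative, and bupc_memget_async/bupc_waitsync are Berkeley UPC extensions rather than standard UPC (the extension header name may vary by release):

```c
#include <upc.h>
#include <bupc_extensions.h> /* Berkeley UPC non-blocking transfer extensions */

typedef struct { double re, im; } dcomplex;

#define CHUNK 1024
shared [CHUNK] dcomplex src[THREADS][CHUNK]; /* hypothetical source layout */
dcomplex dst[THREADS][CHUNK];                /* private destination buffers */
bupc_handle_t h[THREADS];

/* Blocking version: each get completes before the next begins, producing
 * the serialized zigzag pattern seen in Jumpshot:
 *   for (int t = 0; t < THREADS; t++)
 *       upc_memget(dst[t], src[t], sizeof(dcomplex) * CHUNK);
 */

/* Non-blocking version: initiate all independent gets, then wait once. */
void gather_all(void) {
    for (int t = 0; t < THREADS; t++)
        h[t] = bupc_memget_async(dst[t], src[t], sizeof(dcomplex) * CHUNK);
    for (int t = 0; t < THREADS; t++)
        bupc_waitsync(h[t]); /* transfers may complete in any order */
}
```

Because the destinations are distinct private buffers and the sources live on different PEs, the transfers are independent and can overlap once the blocking semantics are removed.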

[Figure 6-4 screenshot: Tree Table with a source window showing the loop of blocking gets replaced by bupc_memget_async calls at lines 1964-1968]

Figure 6-4. PPW Tree Table for modified FT with replacement asynchronous bulk transfer

When this code transformation1 was made (shown in the lower portion of Figure 6-4), we

were able to improve the performance of the program by over 14% compared to the original version.

We later applied the automatic analysis process to the same FT data file to check

whether or not the analysis system could find the bottlenecks that we identified. Looking

at the multi-table analysis visualization, we saw that the system found 4 bottlenecks,

including the most significant upcbarrier bottleneck. In addition, the system was able

to determine the cause of delay for each occurrence of the barrier operation that took

longer than expected. For example, the system found that the barrier called by PE 7

with a starting time of 2.6s took longer than expected to execute because PEs 8 and 15

entered the barrier later than PE 7 (Figure 6-5, left). As we observed in the annotated

1 Note that while this code transformation is not portable to other compilers, the
optimization strategy of using non-blocking transfer is portable.


Jumpshot view (Figure 6-5, right), this pattern was verified. Switching to the high-level

analysis visualization, we saw that each PE spent 5 to 15% of the total execution time

inside the barrier call, further validating the existence of a barrier-related bottleneck. This

percentage drops to 1 to 5% of the total execution time for the revised version using the

non-blocking get operation.

In this case study, we have shown how PPW was used to optimize a UPC program.

With little knowledge of how the FT benchmark works, we were able to apply the manual

analysis process and remove a major bottleneck in the program within a few hours of

using PPW. In addition, we showed that our automatic analysis system was able to

correctly identify and determine the cause of significant bottlenecks in the FT benchmark.

6.3 SAR Case Study

For the second application case study, we performed analysis of both UPC and

SHMEM in-house implementations of the Synthetic Aperture Radar (SAR) algorithm

using GASP-enabled Berkeley UPC version 2.6 and Quadrics SHMEM on an Opteron

cluster with a Quadrics QsNet interconnect. SAR is a high-resolution, broad-area

imaging processing algorithm used for reconnaissance, surveillance, targeting, navigation,

and other operations requiring highly detailed, terrain-structural information. In this

algorithm, the raw image gathered from the downward-facing radar is first divided into

patches with overlapping boundaries so they can be processed independently from each

other. Each patch then undergoes a two-dimensional, space-variant convolution that can

be decomposed into two domains of pro'. --iir the range and azimuth, to produce the

result for a segment of final image (Figure 6-6)).

The sequential version from Scripps Institution of Oceanography and MPI version

provided by two fellow researchers in our lab [26] were used as the templates for the

development of UPC and SHMEM versions. The MPI version follows the master-worker

approach where the master PE reads patches from the raw image file, distributes patches

for processing, collects result from all PEs, and writes the result to an output file, while

Figure 6-5. Multi-table analysis visualization for FT benchmark with annotated Jumpshot visualization




Figure 6-6. Overview of Synthetic Aperture Radar algorithm

the worker PEs perform the actual range and azimuth computation on the patches (note:

master PEs also perform computation). For this study, we used a raw image file with

parameters set to create 35 patches, each of size 28 MB. While all patches could be

executed in parallel in a single iteration on a system with more than 35 PEs, smaller

systems, such as our 32-PE cluster, execute over multiple iterations (in each iteration, M

patches are processed, where M equals the number of computing PEs). We assume only

sequential I/O is available throughout the study, a fair assumption since neither UPC nor

SHMEM currently includes standardized parallel I/O.

We began this case study by developing a UPC baseline version (which mimics the

MPI version) using a single master PE that handles all the I/O operations and also

performs processing of patches in each iteration. Between consecutive iterations, all-to-all

barrier synchronization is used to enforce the consistency of the data. After verifying the

correctness of this version, we used PPW to analyze the performance on three system

sizes of computing PEs: 6, 12, and 18; these system sizes were chosen so that in each

iteration, at most one worker PE is not performing any patch processing. By examining

several visualizations in PPW (one of which is the Profile Metrics Bar Chart shown in

Figure 6-7), we noticed that with 6 computing PEs, 18.7% of the execution time was spent

inside the barrier and that the percentage increased with the number of computing PEs

(over 20% for 12 PEs, 27.6% for 18 PEs). Using the timeline view to further investigate the

issue (Figure 6-8), we then concluded that the cause of this bottleneck was that worker

PEs must wait until the master PE writes the result from the previous iteration to storage

and sends the next patches of data to all processing PEs before they can exit the barrier.

Similar findings were seen when automatic analysis was applied. From the high-level

analysis visualization (Figure 6-9), we observed that a significant amount of time was lost

performing global synchronization. This observation was reconfirmed by examining the

multi-table analysis visualization, which lists two shmem_barrier_all bottlenecks.

We devised two possible optimization strategies to improve the performance of the

baseline version. The first strategy was the use of dedicated master PE(s) (performing

no patch processing) to ensure that I/O operations could complete as soon as possible.

The second strategy was to replace all-to-all barrier synchronization with point-to-point

flag synchronization (implementing the wait-on-value-change operation) so processing PEs

could work on the patches as early as possible. We expected that the first approach would

yield a small performance improvement while the second approach should greatly alleviate

the issue identified.

We then developed five revisions of the program using one or both of these strategies:

(1) dedicated-master, (2) flag synchronization, (3) dedicated master with flag synchronization,

(4) two dedicated masters (one for read, one for write), and (5) two dedicated masters and

flag synchronization. These revisions were again run on system sizes with 6, 12, and

18 PEs and the performance of the revisions was compared to that of the baseline

version (Figure 6-10). As expected, the dedicated master strategy alone did not improve

the performance of the application. Surprisingly, the flag synchronization strategy by

itself also did not improve the performance as we expected. After some investigation,

we discovered that while we eliminated the barrier wait time, we introduced the same

amount of idle time waiting for the shared flags to be set (Figure 6-11). The combination

of both strategies, however, did improve the performance of the program by a noticeable

amount, especially in the two dedicated masters and flag synchronization version where

the percentage of patch execution time increased (from 77.95%, 78.0%, and 70.71% for the

Figure 6-7. Performance breakdown of UPC SAR baseline version run with 6, 12, and 18 computing PEs annotated to show
percentage of execution time associated with barriers


Figure 6-8. Timeline view of UPC SAR baseline version run with 6 computing PEs
annotated to highlight execution time taken by barriers

baseline version) to 97.05%, 94.3%, and 87.97% of total time for 6, 12, and 18 processing

PEs respectively (the remaining time is mainly spent on unavoidable sequential I/O

and bulk data transfer). This observation was verified when we looked at the high-level

analysis visualization (Figure 6-12) and saw that all PEs spent the majority of their time

performing computation.

This case study was then performed using SHMEM implementations of SAR based

on the same approaches outlined above for the UPC version (performance comparison

for these versions is also shown in Figure 6-10). For SHMEM, we noticed that the

dedicated master strategy improved the performance by a small amount, while the flag

synchronization strategy still did not help. The combination of both strategies again

improved the performance by a noticeable percentage, with the two dedicated masters

and flag synchronization version exhibiting 6.1%, roughly 13%, and roughly 15% improvement over the

baseline version for 6, 12, and 18 PEs, respectively.

Figure 6-9. High-level analysis visualization for the original version (v1) of SAR
application with load-imbalance issue


Figure 6-10. Observed execution time for various UPC and SHMEM SAR revisions


Figure 6-11. Timeline view of UPC SAR flag synchronization version executed on system
with 6 computing PEs annotated to highlight wait time of flags

Additionally, PPW enabled us to observe that the performance of the SHMEM

versions was 15-21% slower than the corresponding UPC versions (Table 6-1). This

observation was surprising since we used the same Quadrics interconnect with communication

libraries built on top of the same low-level network API. We examined the performance of

the data transfers for UPC and SHMEM versions and found that the performance of these

operations is actually better in SHMEM. After some investigation, we determined that

the difference between the two versions came from the significant increase in execution

time for read and write of data and patch processing functions (i.e., the azimuth and

range functions) in the SHMEM versions. We concluded that this behavior is most likely

due to the overhead introduced by the Quadrics SHMEM library to allow access to the

shared memory space, which is incurred even for accesses of data physically residing on the

calling PE. For UPC, a cast of a shared pointer to a local pointer made before entering these


Figure 6-12. High-level analysis visualization for the F2M version (v5) of SAR application
with no major bottleneck

functions eliminates the overhead associated with global memory access (not available in SHMEM).


In this case study, we have shown how PPW was used to facilitate the optimization

process of an in-house SAR application. We were able to use PPW to discover performance

bottlenecks, compare the performance of both UPC and SHMEM versions side-by-side,

and discover properties of the Quadrics SHMEM environment that should be considered

when dealing with global memory access.
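The shared-to-local cast discussed in this case study is a one-line change in UPC. The sketch below is illustrative only (the array name, size, and stand-in computation are hypothetical); it relies on the standard UPC rule that a shared pointer may be cast to an ordinary C pointer when the referenced data has affinity to the calling thread:

```c
#include <upc.h>

#define N 1024
shared [N] double patch[THREADS][N]; /* each thread has affinity to one row */

void process_local_patch(void) {
    /* Casting a shared pointer with local affinity to a private pointer
     * lets the patch functions use direct loads/stores, avoiding the
     * per-access shared-memory overhead described in the text. */
    double *local = (double *)patch[MYTHREAD];
    for (int i = 0; i < N; i++)
        local[i] *= 2.0; /* stand-in for range/azimuth processing */
}
```

SHMEM offers no equivalent cast for its symmetric heap in the version studied here, which is consistent with the overhead difference observed between the two implementations.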

6.4 Conclusions

The goal of the third part of this research was to experimentally evaluate the

PPW-assisted parallel application optimization process. Through several case studies,

we assessed the effectiveness of the PPW framework and the analysis system presented

in previous chapters. A classroom productivity study was conducted and the result

Table 6-1. Performance comparison of various versions of UPC and SHMEM SAR programs
                             6 PEs           12 PEs          18 PEs
Version                      % Diff  UPC     % Diff  UPC
Baseline                     24.6    68.3s   23.6    52.1s
Dedicated master             25.4    68.3s   23.6    51.8s
Flag sync.                   25.8    68.2s   24.9    51.5s
Dedicated master/flag sync.  24.8    65.7s   23.8    49.6s
2 masters                    23.7    68.8s   22.1    53.8s
2 masters/flag sync.         23.3    60.7s   21.9    45.7s

illustrated that most students preferred PPW over the manual performance analysis approach.

In the FT case study, we showed how PPW assisted the manual analysis process and

verified that the analysis system was able to correctly determine and diagnose the causes of

bottlenecks identified during the manual analysis process. In the SAR case study, we

demonstrated how the complete PPW tool (with manual and automatic analysis) was

used in tuning an in-house, inefficient first implementation of SAR to reach an optimized version.


CHAPTER 7
CONCLUSIONS

Researchers from many scientific fields have turned to parallel computing in pursuit
of the highest application performance. Unfortunately, due to the combined
complexity of parallel execution environments and models, parallel applications
must often be analyzed and optimized before the programmer reaches an acceptable
level of performance. Many performance tools were developed to facilitate this non-trivial
analyze-optimize process but have traditionally been limited in programming model
support; these tools were often developed to specifically target MPI and thus are
not easily extensible to support alternative models. To address this need, we
presented work on what we believe to be the first general-purpose performance tool
for parallel application optimization, the Parallel Performance Wizard (PPW) system, in this dissertation.

We first presented the PPW framework and discussed novel concepts to improve tool
extensibility. We introduced the generic-operation-type abstraction and the GASP-enabled
data collection interface developed to minimize the dependence of the tool on its supported
models, making the PPW tool highly extensible to support a range of programming
models. Using this framework, we created the PPW performance tool system that
supports the much-needed PGAS models (i.e., UPC and SHMEM) as well as MPI.
Our experimental studies showed that our PPW system incurred an acceptable level
of overhead and is comparable to other popular performance tools in terms of overhead
and storage requirements. In addition, we demonstrated that our PPW system is scalable
up to at least 512 processing elements.

We next presented a new scalable, model-independent system to automatically

detect and diagnose performance bottlenecks. We introduced new techniques to improve

detection accuracy, discussed a range of new and existing analyses applied to find

performance bottlenecks, and described a parallelized analysis mechanism

designed to support large-scale application analysis. We verified the correctness of our

design using a test suite designed to verify the capability of the system in

detecting specific bottlenecks. We then demonstrated the parallelized nature of the design

by successfully developing both the threaded and distributed versions of the system. We

showed the performance improvement of the parallelized versions over the sequential version;

in one case, we illustrated that the speedup was almost two orders of

magnitude (from minutes to seconds).

Finally, we presented several case studies to evaluate the PPW framework and

the analysis system. In the classroom productivity study, we demonstrated that PPW

was viewed as a useful tool in helping programmers identify and resolve performance

bottlenecks. On average, participants were able to find more bottlenecks using PPW, and

most participants noted that they had an easier time pinpointing bottlenecks using PPW.

In the FT case study, we first demonstrated how PPW was used in the manual analysis

process to improve the performance of the original benchmark by 14% within a few

hours of use, and later showed that our automatic analysis system was able to correctly

identify and determine the cause of significant bottlenecks found during the manual

process. In the SAR case study, we illustrated how the

.etc PRPW

was used to discover p ce bottlenecks

and SIHMEM versions side-by-side, and I

environment that should be considered when

r main contributions of this research i

tool system for par. a- "ication optimizat

With the creation of our PPW tool and

S )rt to the much needed 'C and

to '. performance tool rt to other/d

, ( ed the of both UPC

:ed properties of the 1 SHMEM

dealing with global nernemory access.

include the PPW general-pp )sc

ion and a scalable u automatic analysis

*ucturc, we brought C' :c tool

programming models and made it easier

developing ; models.

In addition, we contributed to the on-going automatic performance analysis research

by developing a scalable and portable automatic analysis system and introducing new

techniques and analyses to find performance bottlenecks faster and more accurately.

There are several future areas of research related to this work. First, we have thus far

tested and evaluated the PPW system on parallel systems up to hundreds of processing

elements. To keep up with the ever-growing parallel system size, we plan to experimentally

evaluate the usefulness of PPW on larger parallel systems (thousands of processing

elements) and develop strategies to resolve potential scalability issues with the PPW

system. Second, we are interested in extending the PPW system to support newer

programming models; we have already started working on enabling support for X10 [41].

Third, we are interested in enhancing the capability of the automatic analysis system by

developing algorithms to further improve the analysis speed and accuracy, investigating

techniques to automatically resolve bottlenecks, and developing mechanisms to allow users

to define bottlenecks themselves. Finally, we will continue to work on improving the

usability of PPW. For example, we are currently working on integrating PPW into the

Eclipse development environment and we are planning to develop mechanisms to provide

lower-level performance information using GASP.


A.1 Introduction

In this Appendix, we include an adapted version of the GASP interface (version

1.5). The authors of this specification are Adam Leko, Hung-Hsun Su, and Alan D. George

from the Electrical and Computer Engineering Department at the University of Florida

and Dan Bonachea from the Computer Science Division at the University of California at Berkeley.


A.1.1 Scope

Due to the wide range of compilers and the lack of a standardized performance tool

interface, writers of performance tools face many challenges when incorporating support

for global address space (GAS) programming models such as Unified Parallel C (UPC),

Titanium, and Co-Array Fortran (CAF). This document presents a Global Address Space

Performance (GASP) tool interface that is flexible enough to be adapted into current

global address space compiler and runtime infrastructures with little effort, while allowing

performance analysis tools to gather much information about the performance of global

address space programs.

A.1.2 Organization

Section A.2 gives a high-level overview of the GASP interface. As GASP can be

used to support many global address space programming models, the interface has been

broken down into model-independent and model-specific sections. Section A.3 presents the

model-independent portions of the GASP interface, and the subsequent sections detail the

model-specific portions of the interface.

A.1.3 Definitions

In this section, we define the terms used throughout this specification.

* Model - a parallel programming language or library, such as UPC or MPI.

* Users - individuals using a GAS model such as UPC.

* Developers - individuals who develop compilers or runtime systems for GAS models such as UPC, CAF, or Titanium.

* Tools - performance analysis tools such as Vampir, TAU, or KOJAK.

* Tool developers - individuals who develop performance analysis tools.

* Tool code - code or library implementing the tool developer's portion of the GASP interface.

* Thread - a thread of control in a GAS program, mapping to UPC's threads or CAF's concept of images.

A.2 GASP Overview

GASP controls the interaction between a user's code, a performance

tool, and a GAS model's compiler and/or runtime. This interaction is event-based

and comes in the form of callbacks to the gasp_event_notify function at runtime. These

callbacks may come from instrumentation code placed directly in an executable,

an instrumented runtime library, or another method; the interface only requires that

gasp_event_notify is called at appropriate times in the manner described in the rest of

this document.

GASP allows tool developers to support GAS models on all platforms

and implementations supporting the interface. The interface is used in the following three steps:




1. Users compile their GAS code using compiler wrapper scripts provided by tool
developers. Users may specify which analysis they wish the tool to perform on their
code through either command-line arguments, environment variables, or through
other tool-specific methods.

2. The compiler wrapper scripts pass appropriate flags to the compiler indicating which
callbacks the tool wishes to receive. During the linking phase, the wrapper scripts link in
code from the performance tool that handles the callbacks at runtime.
Tool-provided code shall be written in C.

3. When a user runs their program, the tool-provided code receives callbacks at runtime
and performs some action such as storing events in a trace file or performing
basic statistical profiling.

The specifics of each step will be discussed in Section A.3. The model-specific

portions of the GASP interface will be discussed in the subsequent sections. A GAS

implementation may exclude any system-level event defined in the model-specific sections

of this document if an application cannot be instrumented for that event (e.g., due to

design limitations or other implementation-specific constraints). Any action resulting

in a violation of this specification shall result in undefined behavior. Tool and model

implementors are strongly encouraged not to deviate from these specifications.

A.3 Model-Independent Interface

A.3.1 Instrumentation Control

Instrumentation control is accomplished through either compilation arguments or

compiler pragmas. Developers may use alternative names for the command-line arguments

if the names specified below do not fit the conventions already used by the compiler.

A.3.1.1 User-visible instrumentation control

If a user wishes to instrument their code for use with a tool using the GASP interface,

they shall pass one of the command-line arguments described in this section to the

compiler wrapper scripts. GASP system events are divided into the following broad

categories, for the purposes of instrumentation control:

* Local access events: events resulting from access to objects or variables
contained in the portion of the global address space which is local to the accessing
thread.

* User function events: events resulting from entry and exit to user-defined
functions, as described in Section A.4.3.

* Other events: any system event which does not fall into the above categories.

The --inst argument specifies that the user's code shall be instrumented for all

system events supported by the GAS model implementation which fall into the final

category of events described above. The --inst-local argument implies --inst, and

additionally requests that user code shall be instrumented to generate local access events

supported by the GAS model implementation. Otherwise, such events need not be

generated. For models lacking a semantic concept of local or remote memory accesses,

--inst shall have the same semantics as --inst-local, implying instrumentation of

all global address space accesses. The --inst-functions argument implies --inst,

and additionally requests that user code shall be instrumented to generate user function

events supported by the GAS model implementation. Otherwise, such events need not be


A.3.1.2 Tool-visible instrumentation control

Compilers supporting the GASP interface shall provide the following command-line

arguments for use by the tool-provided compiler wrapper scripts. The arguments --inst,

--inst-local and --inst-functions have the same semantics as the user-visible

instrumentation flags specified in Section A.3.1.1. An additional argument --inst-only

takes a single argument filename which is a file containing a list of symbolic event names

(as defined in the model-specific sections of this document) separated by newlines.

The file's contents indicate the events for which the performance tool wishes to receive

callbacks. Events in this file may be ignored by the compiler if the events are not

supported by the model implementation. Compiler implementations are encouraged to

avoid any overheads associated with generating events not specified by --inst-only;

however, tools that pass --inst-only must still be prepared to receive and ignore events

which are not included in the --inst-only list.

A.3.1.3 Interaction with instrumentation, measurement, and user events

When code is compiled without an --inst flag, all instrumentation control shall be

ignored and all user event callbacks shall be compiled away. Systems may link "dummy"

versions of gasp_control and gasp_create_event (described in Sections A.3.3 and A.3.4)

for applications that have no code compiled with --inst.

Systems may support compiling parts of an application using one of the --inst flags

and compiling other parts of an application normally; for systems where this scenario is

not possible, this behavior may be prohibited. Applications compiled using an --inst

flag in at least one translation unit shall also pass the --inst flag during the link

phase to the compiler wrapper scripts. Any model-specific instrumentation control shall

not have any effect on user events or on the state of measurement control. As a result,

any model-specific instrumentation controls shall not prevent user events from being

instrumented during compilation (e.g., #pragma pupc off shall not change the behavior of the

pupc_create_event and pupc_event_start functions in UPC programs).

A.3.2 Callback Structure

At runtime, all threads of an instrumented executable shall collectively call the

gasp_init C function at the beginning of program execution, after the model runtime has

finished initialization but before executing the first entry point in a user's code (e.g., main in

UPC). The gasp_init function shall have the following signature:

typedef enum {
    GASP_LANG_UPC,
    GASP_LANG_TITANIUM,
    GASP_LANG_CAF,
    GASP_LANG_MPI,
    GASP_LANG_SHMEM
} gasp_model_t;

struct _gasp_context_S;
typedef struct _gasp_context_S *gasp_context_t;

gasp_context_t gasp_init(gasp_model_t srcmodel,
                         int *argc, char ***argv);

The gasp_init function and an implementation of the _gasp_context_S struct

shall be provided by tool developers. A single running instance of an executable

may call gasp_init multiple times if the executable contains code written in

multiple models (such as a hybrid UPC and CAF program), with at most one call per

model. The gasp_init function returns a pointer to a thread-specific

tool-implemented struct. This pointer shall be passed in all subsequent calls to the tool

developer's code made on behalf of this thread. The pointer shall only be used in event

callbacks for events mapping to the model indicated by the srcmodel argument. Tool

code may modify the contents of the argc and argv pointers to support the processing of

command-line arguments.

After the gasp_init function has been called by each thread of execution, the tool

code shall receive all other callbacks through the two functions whose signatures are

shown below. Both functions may be used interchangeably; the VA variant is provided as a

convenience to tool developers.

typedef enum {
    GASP_START,
    GASP_END,
    GASP_ATOMIC
} gasp_evttype_t;

void gasp_event_notify(gasp_context_t context, unsigned int evttag,
                       gasp_evttype_t evttype, const char *filename,
                       int linenum, int colnum, ...);

void gasp_event_notifyVA(gasp_context_t context, unsigned int evttag,
                         gasp_evttype_t evttype, const char *filename,
                         int linenum, int colnum, va_list varargs);

The gasp_event_notify function shall be written in C, but may make

upcalls to code written in the model specified by the srcmodel argument passed to the

gasp_init function on the thread that received the callback. If upcalls are used, the

gasp_event_notify implementation is responsible for handling re-entrant callbacks.

Any code that is used in upcalls shall be compiled under the same environmental

specifications as the code in a user's application (e.g., gasp_event_notify upcalls shall only

call UPC code compiled under a static threads environment when used with

a UPC program compiled under the static threads environment).

Any user data referenced by pointers passed to gasp_event_notify shall not

be modified by tool code. For the first argument to gasp_event_notify, tool code

shall receive the same gasp_context_t pointer that was returned from the gasp_init

function for this thread. Tool developers may use the context struct to store thread-local

information for each thread. The gasp_event_notify function may need to be thread-safe in

order to support model implementations that make use of pthreads or other thread

libraries.

The evttag argument shall specify the event identifier as described in the

model-specific sections of this document. The evttype argument shall be of

type gasp_evttype_t and shall indicate whether the event is a begin event, end

event, or atomic event.

The filename, linenum, and colnum arguments indicate the line and column

number in the model-level source code most closely associated with the generation of the

event callback. If filename is non-NULL, it references a character string whose contents

must remain valid and unmodified for the remainder of the execution. The same

filename pointer is permitted to be passed in multiple calls and by multiple threads, and

it is also permitted for different filename pointers (passed in different calls) to indicate

the same file name (this scenario implies the tool may store filename pointer values and

use simple pointer comparison of non-NULL values to establish file equality, but not

inequality).

GAS model implementations that do not retain column information during

compilation may pass 0 in place of the colnum parameter. GAS model implementations

that do not retain any source-level information during compilation may pass 0 for the

filename, linenum, and colnum parameters. GAS model implementations are strongly

encouraged to support these arguments unless this information cannot be easily and

accurately obtained through other documented methods. GAS model implementations

that use instrumented runtime libraries for GASP support shall provide dummy

implementations for the gasp_event_notify, gasp_event_notifyVA, and gasp_init functions

and the _gasp_context_S struct to prevent link errors while linking a user's application that

is not being used with any performance tool. The contents of the varargs argument shall

be specific to each event identifier and will be discussed in the model-specific

sections of this document.

A.3.3 Measurement Control

Tool developers shall provide an implementation for the following function:

int gasp_control(gasp_context_t context, int on);

The gasp_control function takes the context argument in the same manner as

the gasp_event_notify function. When the value 0 is passed for the on parameter,

the tool shall cease measuring any performance data associated with subsequent system

or user events generated on the calling thread, until the thread makes a corresponding call to

gasp_control with a nonzero value for the on parameter. The gasp_control function

shall return the last value for the on parameter the function received from this thread, or a

nonzero value if gasp_control has never been called for this thread.

A.3.4 User Events

Tool developers shall provide an implementation for the following function:

unsigned int gasp_create_event(gasp_context_t context,
                               const char *name, const char *desc);

The gasp_create_event function shall return a tool-generated event identifier. Compilers

shall translate the model-specific _create_event functions listed in the

model-specific sections of this document into corresponding gasp_create_event calls.

The semantics of the name and desc arguments and the return value shall be the same

as defined by the _create_event function listed in the model-specific section of this

document corresponding to the model indicated by context.

A.3.5 Header Files

Developers shall distribute a gasp.h C header file with their GAS implementations

that contains at least the following definitions. The gasp.h file shall be installed in a

directory that is included in the compiler's default search path.

* Function prototypes for the gasp_init, gasp_event_notify, gasp_control, and
gasp_create_event functions and associated typedefs, enums, and structs.

* A GASP_VERSION macro that shall be defined to an integral date (coded as
YYYYMMDD) corresponding to the GASP version supported by this GASP
implementation. For implementations that support the version of GASP defined in
this document, this macro shall be set to the integral value 20060914.

* Macro definitions that map the symbolic event names listed in the model-specific
sections of this document to 32-bit unsigned integers.

A.4 C Interface

A.4.1 Instrumentation Control

Instrumentation for the events defined in this section shall be controlled by using the

corresponding instrumentation control mechanisms for UPC code defined in Section A.5.1.

A.4.2 Measurement Control

Measurement for the events defined in this section shall be controlled by using the

corresponding measurement control mechanisms for UPC code defined in Section A.5.2.

A.4.3 System Events

A.4.3.1 Function events

Table A-1 shows system events related to executing user functions. These events

occur upon each call to a user function (after entry into that function), and before exit

from a user function (before returning to the caller as a result of executing a return

statement or reaching the closing brace which terminates the function). The funcsig

Table A-1. User function events
Symbolic name  Event type  vararg arguments
GASP_C_FUNC    Start, End  const char* funcsig

Table A-2. Memory allocation events
Symbolic name   Event type  vararg arguments
GASP_C_MALLOC   Start       size_t nbytes
GASP_C_MALLOC   End         size_t nbytes, void* returnptr
GASP_C_REALLOC  Start       void* ptr, size_t size
GASP_C_REALLOC  End         void* ptr, size_t size, void* returnptr
GASP_C_FREE     Start, End  void* ptr

argument specifies the character string representing the full signature of the user function

that is being entered or exited, or NULL if that information is not available.

If funcsig is non-NULL, it references a character string whose contents must remain

valid and unmodified for the remainder of the program execution. The same funcsig

pointer is permitted to be passed in multiple calls and by multiple threads, and it is also

permitted for different funcsig pointers (passed in different calls) to indicate the same

function signature (this scenario implies the tool may store funcsig pointer values and

use simple pointer comparison of non-NULL values to establish function equality, but not inequality).


A.4.3.2 Memory allocation events

Table A-2 shows system events related to the standard memory allocation functions.

The GASP_C_MALLOC, GASP_C_REALLOC, and GASP_C_FREE events stem directly from the standard

C definitions of malloc, realloc, and free.

A.4.4 Header Files

Supported C system events shall be handled in the same method as UPC events,

which are described in Section A.5.5.

A.5 UPC Interface

A.5.1 Instrumentation Control

Users may insert #pragma pupc on or #pragma pupc off directives in their code

to instruct the compiler to avoid instrumenting lexically-scoped regions of a user's UPC

code. These pragmas may be ignored by the compiler if the compiler cannot control

instrumentation for arbitrary regions of code. When an --inst argument is given to a

compiler or compiler wrapper script, the #pragma pupc shall default to on.

A.5.2 Measurement Control

At runtime, users may call the following functions to control the measurement of

performance data. The pupc_control function shall behave in the same manner as the

gasp_control function defined in Section A.3.3.

int pupc_control(int on);

A.5.3 User Events

unsigned int pupc_create_event (const char *name, const char *desc);

void pupc_event_start (unsigned int evttag, ...);

void pupc_event_end (unsigned int evttag, ...);

void pupc_event_atomic(unsigned int evttag, ...);

The pupc_create_event function shall be automatically translated into a

corresponding gasp_create_event call, as defined in Section A.3.4. The name argument

shall be used to associate a user-specified name with the event, and the desc argument

may contain either NULL or a printf-style format string. The memory referenced by both

arguments need not remain valid once the function returns.

The event identifier returned by pupc_create_event shall be a unique value in

the range from GASP_UPC_USEREVT_START to GASP_UPC_USEREVT_END, inclusive. The

GASP_UPC_USEREVT macros shall be provided in the gasp_upc.h header file described in

Section A.5.5. The value returned is thread-specific. If the unique identifiers are exhausted

for the calling thread, pupc_create_event shall issue a fatal error.

The pupc_event_start, pupc_event_end, and pupc_event_atomic functions may

be called by a user's UPC program at runtime. The evttag argument shall be any value

returned by a prior pupc_create_event function call from the same thread. Users may

pass in any list of values for the ... arguments, provided the argument types match the

printf-style format string supplied in the corresponding pupc_create_event call (according

to the printf format string conventions specified by the target system). Any memory

referenced by ... arguments (e.g., string arguments) need not remain valid once the

function returns. A performance tool may use these values to display performance

information alongside application-specific data captured during runtime to a user. The

UPC implementation shall translate the pupc_event_start, pupc_event_end, and

pupc_event_atomic function calls into corresponding gasp_event_notify function calls.

When a compiler does not receive any --inst arguments, the pupc_event function

calls shall be excluded from the executable or linked against dummy implementations

of these calls. A user's program shall not depend on any side effects that occur from

executing the pupc_event functions. Users shall not pass a shared-qualified pointer as an

argument to the pupc_event functions.

A.5.4 System Events

For the event arguments below, the UPC-specific types upc_flag_t and upc_op_t

shall be converted to C ints. Pointers to shared data shall be passed with an extra level

of indirection, and may only be dereferenced through UPC upcalls. UPC implementations

shall provide two opaque types, gasp_upc_PTS_t and gasp_upc_lock_t, which shall

represent a generic pointer-to-shared (i.e., shared void *) and a UPC lock pointer (i.e.,

upc_lock_t *), respectively. These opaque types shall be typedefed to void to prevent

C code from attempting to dereference them without using a cast in a UPC upcall. The

content of any gasp_upc_PTS_t or gasp_upc_lock_t location passed to an event is only

Table A-3. Exit events
Symbolic name Event type vararg arguments
GASP_UPC_COLLECTIVE_EXIT Start, End int status

Table A-4. Synchronization events
Symbolic name     Event type  vararg arguments
GASP_UPC_NOTIFY   Start, End  int named, int expr
GASP_UPC_WAIT     Start, End  int named, int expr
GASP_UPC_BARRIER  Start, End  int named, int expr
GASP_UPC_FENCE    Start, End  (none)

guaranteed to remain valid for the duration of the gasp_event_notify call, and must not

be modified by the tool.

A.5.4.1 Exit events

Table A-3 shows system events related to the end of a program's execution. The

GASP_UPC_COLLECTIVE_EXIT events shall occur at the end of a program's execution on

each thread when a collective exit occurs. These events correspond to the execution of the

final implicit barrier for UPC programs. The GASP_UPC_NONCOLLECTIVE_EXIT event shall

occur at the end of a program's execution on a single thread when a non-collective exit occurs.


A.5.4.2 Synchronization events

Table A-4 shows events related to synchronization constructs. These events shall

occur before and after execution of the notify, wait, barrier, and fence synchronization

statements. The named argument to the notify, wait, and barrier start events shall be

nonzero if the user has provided an integer expression for the corresponding notify,

wait, and barrier statements. In this case, the expr variable shall be set to the result of

evaluating that integer expression. If the user has not provided an integer expression for

the corresponding notify, wait, or barrier statements, the named argument shall be zero

and the value of expr shall be undefined.

Table A-5. Work-sharing events
Symbolic name    Event type  vararg arguments
GASP_UPC_FORALL  Start, End  (none)

A.5.4.3 Work-sharing events

Table A-5 shows events related to work-sharing constructs. These events shall occur

on each thread before and after upc_forall constructs are executed.

A.5.4.4 Library-related events

Table A-6 shows events related to library functions. These events stem directly from

the UPC library functions defined in the UPC specification. The vararg arguments for

each event callback mirror those defined in the UPC language specification.

A.5.4.5 Blocking shared variable access events

Table A-7 shows events related to blocking shared variable accesses. These events

shall occur whenever shared variables are assigned to or read from using the direct syntax

(not using the upc.h library functions). The arguments to these events mimic those of the

upc_memget and upc_memput event callback arguments, but differ from the ones presented

in the previous section because they only arise from accessing shared variables directly. If

the memory access occurs under the relaxed memory model, the is_relaxed parameter

shall be nonzero; otherwise the is_relaxed parameter shall be zero.

A.5.4.6 Non-blocking shared variable access events

Table A-8 shows events related to direct shared variable accesses implemented

through non-blocking communication. These non-blocking direct shared variable access

events are similar to the regular direct shared variable access events in Section A.5.4.5.

The INIT events shall correspond to the non-blocking communication initiation, the

DATA events shall correspond to when the data starts to arrive and completely arrives

on the destination node (these events may be excluded for most implementations that

use hardware-supported DMA), and the GASP_UPC_NB_SYNC function shall correspond to

Table A-6. Library-related events
Symbolic name            Event type  vararg arguments
GASP_UPC_GLOBAL_ALLOC    Start       size_t nblocks, size_t nbytes
GASP_UPC_GLOBAL_ALLOC    End         size_t nblocks, size_t nbytes, gasp_upc_PTS_t* newshrdptr
GASP_UPC_ALL_ALLOC       Start       size_t nblocks, size_t nbytes
GASP_UPC_ALL_ALLOC       End         size_t nblocks, size_t nbytes, gasp_upc_PTS_t* newshrdptr
GASP_UPC_ALLOC           Start       size_t nbytes
GASP_UPC_ALLOC           End         size_t nbytes, gasp_upc_PTS_t* newshrdptr
GASP_UPC_FREE            Start, End  gasp_upc_PTS_t* shrdptr
GASP_UPC_ALL_LOCK_ALLOC  End         gasp_upc_lock_t* lck
GASP_UPC_LOCK_FREE       Start, End  gasp_upc_lock_t* lck
GASP_UPC_LOCK            Start, End  gasp_upc_lock_t* lck
GASP_UPC_LOCK_ATTEMPT    Start       gasp_upc_lock_t* lck
GASP_UPC_LOCK_ATTEMPT    End         gasp_upc_lock_t* lck, int result
GASP_UPC_UNLOCK          Start, End  gasp_upc_lock_t* lck
GASP_UPC_MEMCPY          Start, End  gasp_upc_PTS_t* dst, gasp_upc_PTS_t* src, size_t n
GASP_UPC_MEMGET          Start, End  void* dst, gasp_upc_PTS_t* src, size_t n
GASP_UPC_MEMPUT          Start, End  gasp_upc_PTS_t* dst, void* src, size_t n
GASP_UPC_MEMSET          Start, End  gasp_upc_PTS_t* dst, int c, size_t n

Table A-7. Blocking shared variable access events
Symbolic name  Event type  vararg arguments
GASP_UPC_GET   Start, End  int is_relaxed, void* dst, gasp_upc_PTS_t* src, size_t n
GASP_UPC_PUT   Start, End  int is_relaxed, gasp_upc_PTS_t* dst, void* src, size_t n

Table A-8. Non-blocking shared variable access events
Symbolic name         Event type  vararg arguments
GASP_UPC_NB_GET_INIT  Start       int is_relaxed, void* dst, gasp_upc_PTS_t* src, size_t n
GASP_UPC_NB_GET_INIT  End         int is_relaxed, void* dst, gasp_upc_PTS_t* src, size_t n, gasp_upc_nb_handle_t handle
GASP_UPC_NB_GET_DATA  Start, End  gasp_upc_nb_handle_t handle
GASP_UPC_NB_PUT_INIT  Start       int is_relaxed, gasp_upc_PTS_t* dst, void* src, size_t n
GASP_UPC_NB_PUT_INIT  End         int is_relaxed, gasp_upc_PTS_t* dst, void* src, size_t n, gasp_upc_nb_handle_t handle
GASP_UPC_NB_PUT_DATA  Start, End  gasp_upc_nb_handle_t handle
GASP_UPC_NB_SYNC      Start, End  gasp_upc_nb_handle_t handle

the synchronization call that blocks until the corresponding data of the non-blocking

operation is no longer in flight. The gasp_upc_nb_handle_t type shall be an opaque

type defined by the UPC implementation. Several outstanding non-blocking get or put

operations may be attached to a single

Table A-9. Shared variable cache events
Symbolic name              Event type  vararg arguments
GASP_UPC_CACHE_MISS        Atomic      size_t n, size_t n_lines
GASP_UPC_CACHE_HIT         Atomic      size_t n
GASP_UPC_CACHE_INVALIDATE  Atomic      size_t n_dirty

gasp_upc_nb_handle_t instance. When a sync callback is received, the tool code shall

assume all get and put operations for the corresponding handle in the sync callback

have been retired. The implementation may pass the handle GASP_NB_TRIVIAL to

GASP_UPC_NB_{PUT,GET}_INIT to indicate the operation was completed synchronously in

the initiation interval. The tool should ignore any DATA or SYNC event callbacks with the GASP_NB_TRIVIAL handle.


A.5.4.7 Shared variable cache events

Table A-9 shows events related to shared variable cache events. The GASP_UPC_CACHE

events may be sent for UPC runtime systems containing a software cache after a

corresponding get or put start event but before a corresponding get or put end event

(including non-blocking communication events). UPC runtimes using write-through cache

systems may send GASP_UPC_CACHE_MISS events for each corresponding put event.

The size_t n argument for the MISS and HIT events shall indicate the amount of

data read from the cache line for the particular cache hit or cache miss. The n_lines

argument of the GASP_UPC_CACHE_MISS event shall indicate the number of bytes brought

into the cache as a result of the miss (in most cases, the line size of the cache). The

n_dirty argument of the GASP_UPC_CACHE_INVALIDATE event shall indicate the number of dirty

cache lines that were written back to shared memory due to a cache line invalidation.

A.5.4.8 Collective communication events

Table A-10 shows events related to collective communication. The events in Table

A-10 stem directly from the UPC collective library functions defined in the UPC

Table A-10. Collective communication events
Symbolic name            Event type  vararg arguments
GASP_UPC_ALL_BROADCAST   Start, End  gasp_upc_PTS_t* dst, gasp_upc_PTS_t* src, size_t nbytes, int upc_flags
GASP_UPC_ALL_SCATTER     Start, End  gasp_upc_PTS_t* dst, gasp_upc_PTS_t* src, size_t nbytes, int upc_flags
GASP_UPC_ALL_GATHER      Start, End  gasp_upc_PTS_t* dst, gasp_upc_PTS_t* src, size_t nbytes, int upc_flags
GASP_UPC_ALL_GATHER_ALL  Start, End  gasp_upc_PTS_t* dst, gasp_upc_PTS_t* src, size_t nbytes, int upc_flags
GASP_UPC_ALL_EXCHANGE    Start, End  gasp_upc_PTS_t* dst, gasp_upc_PTS_t* src, size_t nbytes, int upc_flags
GASP_UPC_ALL_PERMUTE     Start, End  gasp_upc_PTS_t* dst, gasp_upc_PTS_t* src, gasp_upc_PTS_t* perm, size_t nbytes, int upc_flags
GASP_UPC_ALL_REDUCE      Start, End  gasp_upc_PTS_t* dst, gasp_upc_PTS_t* src, int upc_op, size_t nelems, size_t blk_size, void* func, int upc_flags, gasp_upc_reduction_t type


The vararg arguments for each event callback mirror those defined in the UPC language specification.

Table A-10. Collective communication events (Continued)
Symbolic name                Event type   vararg arguments
GASP_UPC_ALL_PREFIX_REDUCE   Start, End   gasp_upc_PTS_t* dst,
                                          gasp_upc_PTS_t* src,
                                          int upc_op,
                                          size_t nelems,
                                          size_t blk_size,
                                          void* func,
                                          int upc_flags,
                                          gasp_upc_reduction_t type

For the reduction functions, the gasp_upc_reduction_t enum shall be provided by a UPC implementation and shall be defined as follows. The suffix to GASP_UPC_REDUCTION denotes the same type as specified in the UPC specification.

typedef enum {
  GASP_UPC_REDUCTION_C,
  GASP_UPC_REDUCTION_UC,
  GASP_UPC_REDUCTION_S,
  GASP_UPC_REDUCTION_US,
  GASP_UPC_REDUCTION_I,
  GASP_UPC_REDUCTION_UI,
  GASP_UPC_REDUCTION_L,
  GASP_UPC_REDUCTION_UL,
  GASP_UPC_REDUCTION_F,
  GASP_UPC_REDUCTION_D,
  GASP_UPC_REDUCTION_LD
} gasp_upc_reduction_t;
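The way a tool's callback might consume the GASP_UPC_ALL_REDUCE vararg list from Table A-10 can be sketched in plain C. This is a standalone illustration, not specification text: the event code value is invented, and void* stands in for the opaque gasp_upc_PTS_t* pointers; only the argument order follows the table.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical event code for the ALL_REDUCE start event. */
#define EVT_UPC_ALL_REDUCE 42

static size_t last_nelems;  /* recorded so the example is observable */

/* Sketch of a tool's event-notify callback unpacking the varargs in the
 * order given by Table A-10 for GASP_UPC_ALL_REDUCE. */
static void event_notify(int evt, ...)
{
    va_list ap;
    va_start(ap, evt);
    if (evt == EVT_UPC_ALL_REDUCE) {
        void  *dst       = va_arg(ap, void *);  /* gasp_upc_PTS_t* dst */
        void  *src       = va_arg(ap, void *);  /* gasp_upc_PTS_t* src */
        int    upc_op    = va_arg(ap, int);
        size_t nelems    = va_arg(ap, size_t);
        size_t blk_size  = va_arg(ap, size_t);
        void  *func      = va_arg(ap, void *);
        int    upc_flags = va_arg(ap, int);
        int    type      = va_arg(ap, int);     /* gasp_upc_reduction_t */
        (void)dst; (void)src; (void)upc_op; (void)blk_size;
        (void)func; (void)upc_flags; (void)type;
        last_nelems = nelems;
    }
    va_end(ap);
}
```

Because enums undergo the default argument promotions, the gasp_upc_reduction_t value is read back with va_arg(ap, int); reading the arguments in exactly the table's order is what makes the callback portable across conforming implementations.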

A.5.5 Header Files

UPC compilers shall distribute a pupc.h C header file with their GAS language

implementations that contains function prototypes for the functions defined in Sections

A.5.2 and A.5.3. The pupc.h file shall be installed in a directory that is included in the

UPC compiler's default search path.

All supported system events and associated gasp_upc_* types shall be defined in a gasp_upc.h file located in the same directory as the gasp.h file. System events not supported by an implementation shall not be included in the gasp_upc.h file. The gasp_upc.h header file may include definitions for implementation-specific events, along with brief documentation embedded in source code comments.

Compilers shall define a compiler-specific integral GASP_UPC_VERSION version number

in gasp_upc.h that may be incremented when new implementation-specific events are

added. Compiler developers are encouraged to use the GASP_X_Y naming convention for

all implementation-specific events, where X is an abbreviation for their compilation system

(such as BUPC) and Y is a short, descriptive name for each event.

Compilers that implement the pupc interface shall predefine the feature macro __UPC_PUPC__ to the value 1. The macro should be predefined whenever applications may safely #include <pupc.h>, invoke the functions it defines, and use the #pragma pupc directives, without causing any translation errors. The feature macro does not guarantee

that GASP instrumentation is actually enabled for a given compilation, as some of the

features might have no effect in non-instrumenting compilations.
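The guard pattern this paragraph implies can be sketched as follows. The sketch is hypothetical application code, not specification text; it assumes only pupc_control(), one of the instrumentation-control functions a pupc.h declares, and supplies a no-op stub so the same source compiles on non-pupc compilers.

```c
#include <assert.h>

/* Guard all pupc usage behind the feature macro, as the text suggests. */
#ifdef __UPC_PUPC__
#include <pupc.h>
#else
/* Fallback stub for compilers without the pupc interface. */
static int pupc_control(int on) { (void)on; return 0; }
#endif

/* Toggle instrumentation around a region of interest; returns the previous
 * state on pupc-capable systems, and 0 from the stub. */
static int toggle_instrumentation(int on)
{
    return pupc_control(on);
}
```

Note that even when __UPC_PUPC__ is defined, the calls may still be no-ops in a non-instrumenting compilation, consistent with the caveat above.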


REFERENCES

[1] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable
implementation of the MPI message passing interface," Tech. Rep., Argonne National
Laboratory, 1996.

[2] UPC Consortium, "UPC language specifications v1.2,"
http://www.gwu.edu/~upc/docs/upc_specs_1.2.pdf, Accessed July 2010.

[3] University of California at Berkeley, "Berkeley UPC website," http://upc.lbl.gov,
Accessed July 2010.

[4] Quadrics Ltd, "Quadrics SHMEM programming manual,"
http://web1.quadrics.com/downloads/documentation/ShmemMan_6.pdf, Accessed
July 2010.

[5] Oak Ridge National Lab, "OpenSHMEM website,"
https://email.ornl.gov/mailman/listinfo/openshmem, Accessed July 2010.

[6] L. De Rose and B. Mohr, "Tutorial: Principles and practice of experimental
performance measurement and analysis of parallel applications," in Supercomputing,
November 15-21 2003.

[7] J. Labarta, S. Girona, V. Pillet, T. Cortes, and L. Gregoris, "DiP: A parallel program
development environment," in 2nd International Euro-Par Conference on Parallel
Processing, August 26-29 1996.

[8] Intel Corporation, "Intel cluster tools website,"
http://www.intel.com/software/products/cluster, Accessed July 2010.

[9] A. Chan, W. Gropp, and E. Lusk, "Scalable log files for parallel program trace data
draft," ftp://ftp.mcs.anl.gov/pub/mpi/slog2/slog2-draft.pdf, Accessed July 2010.

[10] M. T. Heath and J. A. Etheridge, "Visualizing the performance of parallel programs,"
IEEE Software, vol. 8, no. 5, pp. 29-39, 1991.

[11] P. J. Mucci, "Tutorial: Dynaprof," in Supercomputing, November 15-21 2003.

[12] J. S. Vetter and M. O. McCracken, "Statistical scalability analysis of communication
operations in distributed applications," in Principles and Practice of Parallel
Programming, June 18-20 2001.

[13] L. De Rose and D. A. Reed, "SvPablo: A multi-language performance analysis
system," in 10th International Conference on Computer Performance Evaluation:
Modeling Techniques and Tools, September 14-18 1998.

[14] J. Mellor-Crummey, R. Fowler, and G. Marin, "HPCView: A tool for top-down
analysis of node performance," The Journal of Supercomputing, vol. 23, no. 1, pp.
81-104, 2002.

[15] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L.
Karavanic, K. Kunchithapadam, and T. Newhall, "The Paradyn parallel performance
measurement tools," IEEE Computer, vol. 28, no. 11, pp. 37-46, 1995.

[16] B. Mohr, F. Wolf, B. Wylie, and M. Geimer, "KOJAK - a tool set for automatic
performance analysis of parallel programs," in 9th International Euro-Par Conference
on Parallel Processing, August 26-29 2003.

[17] S. S. Shende and A. D. Malony, "The TAU parallel performance system,"
International Journal of High Performance Computing Applications, vol. 20, no. 2,
pp. 287-331, 2006.

[18] A. Leko, H. Sherburne, H. Su, B. Golden, and A. D. George, "Practical experiences
with modern parallel performance analysis tools: an evaluation," Tech. Rep.,
University of Florida, Accessed July 2010.

[19] K. London, S. Moore, P. Mucci, K. Seymour, and R. Luczak, "The PAPI
cross-platform interface to hardware performance counters," in Department of
Defense Users Group Conference, June 18-21 2001.

[20] University of Florida, "Parallel performance wizard (ppw) tool project website,"
http://ppw.hcs.ufl.edu, Accessed July 2010.

[21] S. S. Shende, The Role of Instrumentation and Mapping in Performance Measure-
ment, Ph.D. thesis, University of Oregon, 2001.

[22] Hewlett-Packard Development Company, L.P., "HP UPC website,"
http://h30097.www3.hp.com/upc/, Accessed July 2010.

[23] W. E. Nagel, "Vampir tool website," http://www.vampir.eu, Accessed July 2010.

[24] George Washington University, "GWU UPC NAS 2.4 benchmarks,"
http://www.gwu.edu/~upc/download.html, Accessed July 2010.

[25] J. Stone, "Tachyon parallel / multiprocessor ray tracing system,"
http://jedi.ks.uiuc.edu/~johns/raytracer/, Accessed July 2010.

[26] A. Jacobs, G. Cieslewski, C. Reardon, and A. D. George, "Multi-paradigm computing
for space-based synthetic aperture radar," in International Conference on Engineering
of Reconfigurable Systems and Algorithms, July 14-17 2008.

[27] Argonne National Laboratory, "MPICH2 website,"
http://www.mcs.anl.gov/research/projects/mpich2/, Accessed July 2010.

[28] T. Fahringer, M. Gerndt, B. Mohr, F. Wolf, G. Riley, and J. L. Traff, "Knowledge
specification for automatic performance analysis revised version," Tech. Rep.,
APART Working Group, August 2001.

[29] APART Working Group, "Automatic performance analysis: Real tools (apart) ist
working group website," http://www.fz-juelich.de/apart, Accessed July 2010.

[30] C. Coarfa, J. Mellor-Crummey, N. Froyd, and Y. Dotsenko, "Scalable analysis of
spmd codes using expectations," in International Conference on Supercomputing,
June 16-20 2007.

[31] K. A. Huck and A. D. Malony, "PerfExplorer: A performance data mining framework
for large-scale parallel computing," in Supercomputing, Nov. 12-18 2005.

[32] K. Furlinger and M. Gerndt, "Automated performance analysis using ASL performance
properties," in Workshop on State-of-the-Art in Scientific and Parallel Computing,
June 18-21 2006.

[33] J. Jorba, T. Margalef, and E. Luque, "Search of performance inefficiencies in message
passing applications with KappaPI-2 tool," in Lecture Notes in Computer Science,
2007, number 4699, pp. 409-419.

[34] M. Geimer, F. Wolf, B. J. N. Wylie, and B. Mohr, "Scalable parallel trace-based
performance analysis," in PVM/MPI, Sep. 17-20 2006.

[35] L. Li and A. D. Malony, "Model-based performance diagnosis of master-worker parallel
computations," in Lecture Notes in Computer Science, 2006, number 4128, pp. 35-46.

[36] J. K. Hollingsworth, Finding Bottlenecks in Large Scale Parallel Programs, Ph.D.
thesis, University of Wisconsin-Madison, 1994.

[37] I. Dooley, C. Mei, and L. Kale, "NoiseMiner: An algorithm for scalable automatic
computational noise and software interference detection," in 13th International
Workshop on High-Level Parallel Programming Models and Supportive Environments
of IPDPS, April 14-18 2008.

[38] J. S. Vetter and P. H. Worley, "Asserting performance expectations," in
Supercomputing 02, Nov. 16-22 2002.

[39] I. Chung, G. Cong, and D. Klepacki, "A framework for automated performance
bottleneck detection," in 13th International Workshop on High-Level Parallel
Programming Models and Supportive Environments of IPDPS, April 14-18 2008.

[40] M. Gerndt, B. Mohr, and J. L. Traff, "A test suite for parallel performance analysis
tools," in Concurrency and Computation: Practice and Experience, 2007, number 19,
pp. 1465-1480.

[41] IBM Corporation, "X10 website," http://x10-lang.org/, Accessed July 2010.


BIOGRAPHICAL SKETCH

Hung-Hsun Su is a Ph.D. graduate from the Department of Electrical and Computer Engineering (with a minor from the Department of Computer and Information Science and Engineering) at the University of Florida. He received B.S. degrees in computer science and biochemistry from the University of California, Los Angeles, an M.S. in biochemistry and an M.S. in computer science from the University of Southern California. His research focuses on the development and performance analysis of high-performance parallel applications, the realization of high-performance communication systems, and the performance evaluation of high-performance computing systems.

© 2010 Hung-Hsun Su


I dedicate this to my family for all their love and support.


ACKNOWLEDGMENTS

This work is supported in part by the U.S. Department of Defense. Thanks go to my advisor, Dr. Alan D. George, for his advice and patience over the course of this research, and my committee members, Dr. Herman Lam, Dr. Gregory Stitt, and Dr. Beverly Sanders, for their time and effort. I would also like to acknowledge current and former members of the Unified Parallel C group at UF, Max Billingsley III, Adam Leko, Hans Sherburne, Bryan Golden, Armando Santos, and Balaji Subramanian, for their involvement in the design and development of the Parallel Performance Wizard system. Finally, I would like to express thanks to Dan Bonachea and the Unified Parallel C group members at U.C. Berkeley and Lawrence Berkeley National Laboratory for their helpful suggestions and cooperation.


TABLE OF CONTENTS

ACKNOWLEDGMENTS ... 4
LIST OF TABLES ... 8
LIST OF FIGURES ... 10
ABSTRACT ... 13

CHAPTER
1 INTRODUCTION ... 14
2 OVERVIEW OF PARALLEL PROGRAMMING MODELS AND PERFORMANCE TOOLS ... 16
  2.1 Parallel Programming Models ... 16
  2.2 Performance Tools ... 18
3 BACKGROUND RESEARCH FINDINGS ... 22
4 GENERAL-PURPOSE FRAMEWORK FOR PARALLEL APPLICATION PERFORMANCE ANALYSIS ... 25
  4.1 Parallel Programming Event Model ... 28
    4.1.1 Group-Related Operations ... 32
    4.1.2 Data Transfer Operations ... 35
    4.1.3 Lock, Wait-On-Value, and Locally Executed Operations ... 38
    4.1.4 Implementation Challenges and Strategies ... 39
  4.2 Instrumentation and Measurement ... 41
    4.2.1 Overview of Instrumentation Techniques ... 41
    4.2.2 The Global-Address-Space Performance Interface ... 42
    4.2.3 GASP Implementations ... 46
  4.3 Automatic Analysis ... 46
  4.4 Data Presentation ... 47
  4.5 PPW Extensibility, Overhead, and Storage Requirement ... 48
    4.5.1 PPW, TAU, and Scalasca Comparison ... 51
    4.5.2 PPW Tool Scalability ... 51
  4.6 Conclusions ... 52
5 SYSTEM FOR AUTOMATIC ANALYSIS OF PARALLEL APPLICATIONS ... 55
  5.1 Overview of Automatic Analysis Approaches and Systems ... 56
  5.2 PPW Automatic Analysis System Design ... 59
    5.2.1 Design Overview ... 60
    5.2.2 Common-Bottleneck Analysis ... 61
    5.2.3 Global Analyses ... 68
    5.2.4 Frequency Analysis ... 71
    5.2.5 Bottleneck Resolution ... 71
  5.3 Prototype Development and Evaluation ... 72
    5.3.1 Sequential Prototype ... 75
    5.3.2 Threaded Prototype ... 80
    5.3.3 Distributed Prototype ... 82
    5.3.4 Summary of Prototype Development ... 85
  5.4 Conclusions ... 85
6 EXPERIMENTAL EVALUATION OF PPW-ASSISTED PARALLEL APPLICATION OPTIMIZATION PROCESS ... 87
  6.1 Productivity Study ... 87
  6.2 FT Case Study ... 88
  6.3 SAR Case Study ... 92
  6.4 Conclusions ... 100
7 CONCLUSIONS ... 103

APPENDIX: GASP SPECIFICATION 1.5 ... 106
  A.1 Introduction ... 106
    A.1.1 Scope ... 106
    A.1.2 Organization ... 106
    A.1.3 Definitions ... 106
  A.2 GASP Overview ... 107
  A.3 Model-Independent Interface ... 108
    A.3.1 Instrumentation Control ... 108
      A.3.1.1 User-visible instrumentation control ... 108
      A.3.1.2 Tool-visible instrumentation control ... 109
      A.3.1.3 Interaction with instrumentation, measurement, and user events ... 109
    A.3.2 Callback Structure ... 110
    A.3.3 Measurement Control ... 113
    A.3.4 User Events ... 113
    A.3.5 Header Files ... 114
  A.4 C Interface ... 114
    A.4.1 Instrumentation Control ... 114
    A.4.2 Measurement Control ... 114
    A.4.3 System Events ... 114
      A.4.3.1 Function events ... 114
      A.4.3.2 Memory allocation events ... 115
    A.4.4 Header Files ... 115
  A.5 UPC Interface ... 116
    A.5.1 Instrumentation Control ... 116
    A.5.2 Measurement Control ... 116
    A.5.3 User Events ... 116
    A.5.4 System Events ... 117
      A.5.4.1 Exit events ... 118
      A.5.4.2 Synchronization events ... 118
      A.5.4.3 Work-sharing events ... 119
      A.5.4.4 Library-related events ... 119
      A.5.4.5 Blocking shared variable access events ... 119
      A.5.4.6 Non-blocking shared variable access events ... 119
      A.5.4.7 Shared variable cache events ... 122
      A.5.4.8 Collective communication events ... 122
    A.5.5 Header Files ... 124

REFERENCES ... 126
BIOGRAPHICAL SKETCH ... 129


LIST OF TABLES

4-1 Theoretical event model for parallel programs ... 30
4-2 Mapping of UPC, SHMEM, and MPI 1.x constructs to generic operation types ... 31
4-3 GASP events for non-blocking UPC communication and synchronization ... 43
4-4 Profiling/tracing file size and overhead for UPC NPB 2.4 benchmark suite ... 48
4-5 Profiling/tracing file size and overhead for SHMEM APSP, CONV, and SAR application ... 50
4-6 Profiling/tracing file size and overhead for MPI corner turn, tachyon, and SAR application ... 50
4-7 Profiling/tracing file size and overhead comparison of PPW, TAU, and Scalasca tools for a 16-PE run of the MPI IS benchmark ... 51
4-8 Profiling/tracing file size and overhead for medium-scale UPC NPB 2.4 benchmark suite runs ... 52
5-1 Summary of existing PPW analyses ... 62
5-2 Common-bottleneck patterns currently supported by PPW and data needed to perform cause analysis ... 63
5-3 Example resolution techniques to remove parallel bottlenecks ... 73
5-4 Sequential analysis speed of NPB benchmarks on workstation ... 80
5-5 Analysis speed of NPB benchmarks on workstation ... 82
5-6 Analysis speed of NPB benchmarks on Ethernet-connected cluster ... 84
6-1 Performance comparison of various versions of UPC and SHMEM SAR programs ... 101
A-1 User function events ... 115
A-2 Memory allocation events ... 115
A-3 Exit events ... 118
A-4 Synchronization events ... 118
A-5 Work-sharing events ... 119
A-6 Library-related events ... 120
A-7 Blocking shared variable access events ... 121
A-8 Non-blocking shared variable access events ... 121
A-9 Shared variable cache events ... 122
A-10 Collective communication events ... 123


LIST OF FIGURES

2-1 Measure-modify performance analysis approach of performance tool ... 19
4-1 PPW-assisted performance analysis process from original source program to revised (optimized) program ... 26
4-2 Generic-operation-type abstraction to facilitate the support for multiple programming models ... 27
4-3 Framework of Parallel Performance Wizard organized with respect to stages of experimental measurement and model dependency (multi-boxed units are model-dependent) ... 28
4-4 Events for group synchronization, group communication, initialization, termination, and global memory allocation operations (TS = Timestamp. Only a few of PE Y's events are shown to avoid clutter) ... 34
4-5 Events for one-sided communication and synchronization operations ... 37
4-6 Events for two-sided communication and synchronization operations ... 38
4-7 Events for (a) lock mechanisms, wait-on-value and (b) user-defined function/region, work-sharing, and environment-inquiry operations ... 39
4-8 Interaction of PGAS application, compiler, and performance tool in GASP-enabled data collection ... 43
4-9 Specification of gasp_event_notify callback function ... 44
4-10 (a) Load-balancing analysis visualization for CG 256-PE run, (b) Experimental set comparison chart for Camel 4-, 8-, 16-, and 32-PE runs ... 49
4-11 Annotated screenshot of new UPC-specific array distribution visualization showing physical layout of a 2-D 5x8 array with block size 3 for an 8-PE system ... 50
4-12 (a) Data transfers visualization showing communication volume between processing PEs for 256-PE CG benchmark tracing mode run, (b) Zoomed-in Jumpshot view of 512-PE MG benchmark ... 53
5-1 Tool-assisted automatic performance analysis process ... 56
5-2 PPW automatic analysis system architecture ... 61
5-3 Example analysis processing system with 3 processing units showing the analyses each processing unit performs and raw data exchange needed between processing units ... 63
5-4 Analysis process flowchart for a processing unit in the system ... 64
5-5 Barrier-redundancy analysis ... 70
5-6 Frequency analysis ... 72
5-7 PPW analysis user interface ... 74
5-8 Annotated PPW scalability-analysis visualization ... 75
5-9 PPW revision-comparison visualization ... 76
5-10 Annotated PPW high-level analysis visualization ... 76
5-11 PPW event-level load-balance visualization ... 77
5-12 Annotated PPW analysis table visualization ... 78
5-13 Annotated PPW analysis summary report ... 79
5-14 A common use case of PPW where user transfers the data from parallel system to workstation for analysis ... 80
5-15 Analysis workflow for the (a) sequential prototype (b) threaded prototype (c) distributed prototype ... 81
5-16 A use case of PPW where analyses are performed on the parallel systems ... 83
5-17 Memory usage of PPW analysis system on cluster ... 84
6-1 Productivity study result showing (a) method with more bottlenecks identified (b) preferred method for bottleneck identification (c) preferred method for program optimization ... 88
6-2 Annotated PPW Tree Table visualization of original FT showing code regions yielding part of performance degradation ... 89
6-3 Annotated Jumpshot view of original FT showing serialized nature of upc_memget at line 1950 ... 90
6-4 PPW Tree Table for modified FT with replacement asynchronous bulk transfer calls ... 91
6-5 Multi-table analysis visualization for FT benchmark with annotated Jumpshot visualization ... 93
6-6 Overview of Synthetic Aperture Radar algorithm ... 94
6-7 Performance breakdown of UPC SAR baseline version run with 6, 12, and 18 computing PEs annotated to show percentage of execution time associated with barriers ... 96
6-8 Timeline view of UPC SAR baseline version run with 6 computing PEs annotated to highlight execution time taken by barriers ... 97
6-9 High-level analysis visualization for the original version (v1) of SAR application with load-imbalance issue ... 98
6-10 Observed execution time for various UPC and SHMEM SAR revisions ... 98
6-11 Timeline view of UPC SAR flag synchronization version executed on system with 6 computing PEs annotated to highlight wait time of flags ... 99
6-12 High-level analysis visualization for the F2M version (v5) of SAR application with no major bottleneck ... 100


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PARALLEL PERFORMANCE WIZARD - FRAMEWORK AND TECHNIQUES FOR PARALLEL APPLICATION OPTIMIZATION

By Hung-Hsun Su

August 2010

Chair: Alan D. George
Major: Electrical and Computer Engineering

Developing a high-performance parallel application is difficult. Given the complexity of high-performance parallel programs, developers often must rely on performance analysis tools to help them improve the performance of their applications. While many tools support analysis of message-passing programs, tool support is limited for applications written in other programming models such as those in the partitioned global-address-space (PGAS) family, which is of growing importance. Existing tools that support message-passing models are difficult to extend to support other parallel models because of the differences between the paradigms. In this dissertation, we present work on the Parallel Performance Wizard (PPW) system, the first general-purpose performance system for parallel application optimization. The complete research is divided into three parts. First, we introduce a model-independent PPW performance tool framework for parallel application analysis. Next, we present a new scalable, model-independent PPW analysis system designed to automatically detect, diagnose, and possibly resolve bottlenecks within a parallel application. Finally, we discuss case studies to evaluate the effectiveness of PPW and conclude with contributions and future directions for the PPW project.


CHAPTER 1
INTRODUCTION

Parallel computing has emerged as the dominant high-performance computing paradigm. To fully support concurrent execution, many parallel computer systems such as the symmetric multiprocessing system, computer cluster, computational grid, and multi-core machines have been developed. In addition, to take advantage of these parallel systems, a variety of parallel programming models such as Open Multi-Processing (OpenMP), Message-Passing Interface (MPI), Unified Parallel C (UPC), and the SHared MEMory (SHMEM) library have been created over the years. Using these technologies, programmers from many scientific and commercial fields are able to develop parallel applications that solve difficult problems more quickly or solve complex problems previously thought to be impossible.

Unfortunately, due to the added complexity of the parallel systems and programming models, parallel applications are more difficult to write than sequential ones and even harder to optimize for performance. Discovery and removal of performance issues require extensive knowledge on programmers' part about the execution environment and involve a significant amount of effort. Programmers often must undergo a non-trivial, iterative analysis and optimization process that is cumbersome to perform manually in order to improve the performance of their applications. To facilitate this unwieldy process, many parallel performance analysis tools (henceforth referred to as performance tools) were developed that support a variety of parallel systems and programming models.

Among the available parallel programming models, MPI has received the majority of performance tool research and development as it remains the most well-known and widely used. Most existing parallel performance tools support MPI program analysis to some degree but are limited in supporting other parallel models such as OpenMP and those in the partitioned global-address-space (PGAS) family. While efforts have been made to improve support for these newer models, the progress has not kept up with the


demand. Since most existing tools were specifically designed to support a particular model (i.e., MPI), they became too tightly coupled with that model and, as a result, require a significant amount of effort on the developers' part to add new model support.

In this dissertation, we outline our work toward the Parallel Performance Wizard (PPW) system. The goal is to research and develop a general-purpose performance tool infrastructure that readily supports multiple parallel programming models and to develop advanced techniques to enhance tool usability. The remainder of this document is organized as follows. In Chapter 2, we provide an overview of performance tools and parallel programming models. In Chapter 3, we describe our research methodology as well as some background research findings that shaped the development of the PPW infrastructure. In Chapter 4, we present the PPW framework and provide experimental results for a functional PPW tool that supports UPC, SHMEM, and MPI. In Chapter 5, we introduce a new automatic analysis system developed to enhance the usability of the PPW tool and provide experimental results for the sequential, threaded, and distributed versions of this system. In Chapter 6, we discuss case studies to validate the framework and techniques presented, and conclude the document in Chapter 7.


CHAPTER 2
OVERVIEW OF PARALLEL PROGRAMMING MODELS AND PERFORMANCE TOOLS

In this chapter, we provide an overview of parallel programming models and performance tools. To avoid confusion, the term processing element (PE) is used to reference a system component (e.g., a node, a thread) that executes a stream of instructions.

2.1 Parallel Programming Models

In this section, we provide an overview of parallel programming models and the three models directly relevant to this research, namely MPI, UPC, and SHMEM.

A parallel programming model is a collection of software technologies that allows programmers to explicitly express parallelism and orchestrate interactions among PEs. The goal is to facilitate programmers in turning parallel algorithms into executable applications on parallel computers. Parallel programming models are generally categorized by how memory is used. In the shared-memory model (e.g., OpenMP, explicit threading libraries), each PE has direct access to a shared memory space, and communication between PEs is achieved by reading and writing variables that reside in this shared memory space. In the message-passing model (e.g., MPI), each PE has access only to its local memory. A pair of PEs communicates by sending and receiving messages, which transfers the data from the local memory of the sender to the local memory of the receiver. Finally, the partitioned global-address-space (PGAS) model (e.g., UPC, SHMEM) presents the programmer with a logical global memory space divided into two parts: a private portion local to each PE and a global portion which can be physically partitioned among the PEs. PEs communicate with each other by reading and writing the global portion of the memory via put and get operations. In terms of implementation, these models are realized either as libraries, as sequential language extensions, or as new parallel languages.


Message Passing Interface (MPI) is a communication library used to program parallel computers with the goals of high performance, scalability, and portability [1]. MPI has become the de facto standard for developing high-performance parallel applications; virtually every existing parallel system provides some form of support for MPI application development. There are currently two versions of the standard: MPI-1 (first standardized in 1994), which uses purely matching send and receive pairs for data transfer and provides routines for collective communication and synchronization, and MPI-2 (a superset of MPI-1, first standardized in 1996), which includes additional features such as parallel I/O, dynamic process management, and some remote memory access (put and get) capabilities.

Unified Parallel C (UPC) is an explicit parallel extension of the ANSI C language developed beginning in the late 1990s based on experience with several earlier parallel C-based programming models [2]. UPC exposes the PGAS abstraction to the programmer by way of several language and library features, including specially typed (shared) variables for declaring and accessing data shared among PEs, synchronization primitives such as barriers and locks, a number of collective routines, and a unique, affinity-aware worksharing construct (upc_forall). The organization responsible for the continuing development and maintenance of the UPC language is a consortium of government, industry, and academia, which released the latest UPC specification version 1.2 in June 2005. This specification has been implemented in the form of vendor compilers, including offerings from HP and IBM, as well as open-source compilers such as Berkeley UPC [3] and the reference Michigan UPC. These provide for UPC support on a number of HPC platforms, including SMP systems, supercomputers such as the Cray XT series, and Linux clusters using a variety of commodity or high-speed interconnects.
The SHared MEMory (SHMEM) library essentially provides the shared-memory abstraction typical of multi-threaded sequential programs to developers of high-performance parallel applications [4]. First created by Cray Research for use on the Cray T3D supercomputer and now trademarked by SGI, SHMEM allows PEs to read and write


all globally declared variables, including those mapped to memory regions physically located on other PEs. SHMEM is distinct from a language such as UPC in that it does not provide intrinsically parallel language features; instead, the shared memory model is supported by way of a full assortment of API routines (similar to MPI). In addition to the fundamental remote memory access primitives (get and put), SHMEM provides routines for collective communication, synchronization, and atomic memory operations. Implementations of the library are primarily available on systems offered by SGI and Cray, though versions also exist for clusters using interconnects such as Quadrics (recently went out of business). At present time, no SHMEM standardization exists, so different implementations tend to support different sets of constructs providing similar functionalities. However, an effort to create an OpenSHMEM standard [5] is currently underway.

2.2 Performance Tools

Performance tools are software systems that assist programmers in understanding the runtime behavior of their application on real systems and ultimately in optimizing the application with respect to execution time, scalability, or resource utilization. (Alternative approaches include simulation and analytical models.) To achieve this goal, the majority of the tools make use of a highly effective experimental performance analysis approach based on a measure-modify cycle (Figure 2-1), in which the programmer conducts an iterative process of performance data collection, data analysis, data visualization, and optimization until the desired application performance is achieved [6]. Under this approach, the tool first generates instrumentation code that serves as entry points for performance data collection (Instrumentation). Next, the application and instrumentation code are executed on the target platform, and raw performance data are collected at runtime by the tool (Measurement). The tool organizes the raw data and can optionally perform various automatic analyses to discover and perhaps suggest


resolutions to performance bottlenecks (Automatic Analysis). Both the raw and analyzed data are then presented in more user-friendly forms to the programmer through text-based or graphical interfaces (Presentation) to facilitate the manual analysis process (Manual Analysis). Finally, the tool or programmer applies appropriate optimization techniques to the program or the execution environment (Optimization), and the whole cycle repeats until the programmer is satisfied with the performance level.

Figure 2-1. Measure-modify performance analysis approach of performance tools

A tool can use (fixed-interval) sampling-driven instrumentation or event-driven instrumentation, depending on when and how often performance data are collected. In sampling-driven tools, data are collected regularly at fixed-time intervals by one or more concurrently executing threads. At each time step, a predefined, fixed set of metrics (types of performance data) is recorded regardless of the current program behavior (e.g., the same metrics are recorded whether the program is performing computation or communication). The performance of the program is then estimated, often using only a subset of these metrics. In most cases, the monitoring threads access only a few


hardware counters and registers and perform limited calculations, thus introducing very low data collection overhead. As a result, this technique is less likely to cause changes in execution behavior that may lead to an inaccurate analysis of the program. However, sampling-driven tools typically have greater difficulty in presenting the program behavior with respect to the high-level source code, especially when the time interval is large enough to miss short-lived trends. In contrast, event-driven tools record data only when specified events (such as the start of a function or communication call) occur during program execution. Together, events and metrics make up the event model that the tool uses to describe application behavior; the complete set of events and metrics is used to reconstruct the behavior of the program in direct relation with high-level source code, easing the analysis and optimization process. For each event, the tool records a select number of metrics (e.g., time, PE ID, etc.) relevant to that particular event, but doing so requires significantly more processing time than simply accessing a few hardware counters in the sampling-driven case. As a result, event-driven tools generally introduce higher data collection overhead than sampling-driven tools and thus have a higher chance of introducing heisenbugs: bugs (caused by performance perturbation) that disappear or alter their behavior when one attempts to probe or isolate them. This problem is particularly applicable to frequently occurring, short-lived events that force substantial delay in order to collect performance data.
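As a simple illustration of the event-driven approach just described, the sketch below records Enter/Exit event pairs and derives operation durations from them. The names are hypothetical and, to keep the example deterministic, timestamps are passed in explicitly rather than read from a system clock; a real measurement layer would obtain them from a high-resolution timer.

```c
#include <assert.h>
#include <stddef.h>

/* One record per event occurrence: which operation, which kind
 * of event (Enter or Exit), and when it happened. */
enum event_kind { EV_ENTER, EV_EXIT };

struct event {
    int op_id;            /* identifies the instrumented operation      */
    enum event_kind kind; /* Enter or Exit                              */
    double timestamp;     /* seconds; supplied by the measurement layer */
};

#define MAX_EVENTS 1024
static struct event trace[MAX_EVENTS];
static size_t n_events = 0;

/* Called by instrumentation code at operation entry and exit. */
void record_event(int op_id, enum event_kind kind, double timestamp)
{
    if (n_events < MAX_EVENTS)
        trace[n_events++] = (struct event){ op_id, kind, timestamp };
}

/* Total time spent in the given operation: sum of (Exit - Enter)
 * over matching pairs, assuming non-nested, well-formed pairs. */
double total_duration(int op_id)
{
    double total = 0.0, enter_ts = 0.0;
    int open = 0;
    for (size_t i = 0; i < n_events; i++) {
        if (trace[i].op_id != op_id) continue;
        if (trace[i].kind == EV_ENTER) { enter_ts = trace[i].timestamp; open = 1; }
        else if (open) { total += trace[i].timestamp - enter_ts; open = 0; }
    }
    return total;
}
```

Note that each recorded event costs a store into the trace buffer, which is the per-event overhead (and, at scale, the storage pressure) discussed above.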
Another common tool classification, tracing versus profiling, distinguishes how a tool handles the metrics each time instrumentation code is executed. A tool operating in tracing mode stores the metric values calculated at each time instance separately from one another. From this data, it is possible for the tool to reconstruct the step-by-step program behavior, enabling application analysis in great detail. However, the large amount of data generated also requires significant storage space per program run, and the sheer amount of data could be overwhelming if it is not carefully organized and presented to the user. In addition, due to memory limitations, the tool often must perform file I/O


during runtime, introducing additional data collection overhead on top of the unavoidable metric calculation overhead. Examples of tools that support the collection and viewing of trace data include Dimemas/Paraver [7], Intel Cluster Tools [8], MPE/Jumpshot [9], and MPICL/ParaGraph [10]. In contrast, a tool operating in profiling mode performs additional on-the-fly calculations² (min, max, average, count, etc.) after metric values are calculated at runtime, and only statistical (profile) data are kept. This data can usually fit in memory, avoiding the need to perform file I/O at runtime. However, profiling data often provide only enough information to perform high-level analysis and may be insufficient for determining the causes of performance bottlenecks. Examples of popular profiling tools include DynaProf [11], mpiP [12], and SvPablo [13].

Finally, as the system used to execute the application grows in size, the amount of performance data that a tool must collect and manage grows to a point where it becomes nearly impossible for users to manually analyze the data even with the help of the tool. To address this issue, several tools such as HPCToolkit [14], Paradyn [15], Scalasca/KOJAK [16], and TAU [17] also include mechanisms to have the tool automatically analyze the collected performance data and point out potential performance bottlenecks within the application (e.g., scalability analysis, common bottleneck analysis, etc.).

² For tracing mode, these statistical calculations are often performed after execution time.
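The on-the-fly profile calculations mentioned above (min, max, average, count) can be maintained in constant memory per operation, which is why profile data usually fits in memory. A minimal sketch in plain C (the names are illustrative, not taken from any particular tool):

```c
#include <assert.h>

/* Running statistics for one instrumented operation; updated each
 * time the operation completes, so no per-call record is stored. */
struct profile {
    unsigned long count;
    double sum, min, max;
};

void profile_init(struct profile *p)
{
    p->count = 0;
    p->sum = 0.0;
    p->min = 0.0;
    p->max = 0.0;
}

/* Fold one new measurement (e.g., an operation duration) into the
 * statistics in O(1) time and O(1) space. */
void profile_update(struct profile *p, double value)
{
    if (p->count == 0 || value < p->min) p->min = value;
    if (p->count == 0 || value > p->max) p->max = value;
    p->sum += value;
    p->count++;
}

double profile_mean(const struct profile *p)
{
    return p->count ? p->sum / (double)p->count : 0.0;
}
```

The contrast with tracing is visible in the data structure itself: a trace grows with every event, while this profile occupies a handful of words per operation no matter how long the program runs.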


CHAPTER 3
BACKGROUND RESEARCH FINDINGS

A substantial background research process has led to the formulation and development of the PPW system. In this chapter we briefly describe this process and its resulting findings and insights that have shaped the PPW design.

We began our background research by studying the details of parallel programming model specifications and implementations in order to identify characteristics important in analyzing parallel application performance. In parallel, we surveyed existing work on performance tool research in order to identify characteristics important for the success of a performance tool [18]. Using the knowledge gained from these studies, we then evaluated the applicability of existing performance tool techniques to various programming models and leveraged techniques that could be re-used or adopted. Additionally, we performed comparisons between related performance analysis techniques, identified characteristics common to these techniques, and made generalizations based on the commonalities (such generalization is desirable as it reduces tool complexity). Finally, we recognized new obstacles pertaining to parallel performance analysis and formulated solutions to handle these issues.

A helpful performance tool must collect appropriate and accurate performance data. We found that it is useful for a tool to support both profiling and tracing measurement modes. Profile data guides users to program segments where they should focus their tuning efforts, while trace data provides detailed information often needed to determine the root causes of performance degradations. The tool should also make use of hardware counter monitoring systems, such as the portable Performance Application Programming Interface (PAPI) [19], which are valuable in analyzing non-parallel sections of an application but can also be used on parallel sections. Finally, to avoid performance perturbation, the data collection overhead introduced by the tool must be minimized. A general consensus from the literature indicates that a tool with overhead of approximately


1-5% under profile mode and 1-10% under trace mode is considered to be safe from performance perturbation.

A productive tool must be easy to learn and use. While performance tools have proved effective in troubleshooting performance problems, they are often not used because of their high learning curve. To avoid this pitfall, a tool should provide an intuitive, familiar user interface by following an established standard or adopting visualizations used by existing performance tools. In addition, since source code is generally the only level over which users have direct control (the average user may not have the knowledge or permission to alter the execution environment), performance tools should present performance data with respect to the application source code. This feature helps in identifying the specific source code regions that limit an application's performance, making it easier for the user to remove bottlenecks. A tool's ability to provide source-line correlation and to work closer to the source level is thus critical to its success.

To efficiently support multiple programming models, a successful performance tool design must include mechanisms to resolve difficulties introduced by diverse models and implementations. We noticed that techniques used in the measurement and the presentation stages are generally not tied to the programming model. The types of measurements required and the difficulties one must solve are very similar among programming models (i.e., getting timestamps, clock synchronization issues, etc.), and visualizations developed are usually applicable or easily extensible to a variety of programming models. Furthermore, we noted that while each programming model supplies the programmer with a different set of parallel constructs to orchestrate work among PEs, the types of inter-PE interaction facilitated by these constructs are quite similar between models. For example, both UPC and SHMEM include constructs to perform, among other operations, barrier, put, and get operations. Thus it is desirable to take advantage of this commonality and devise a generalization mechanism to enable the development of system components that apply to multiple models, helping reduce the complexity of tool design.


In contrast, the choice of the best instrumentation technique is highly dependent on the strategy used to implement the target compiler. Many diverse implementation methods are used by compiler developers to enable execution of target model applications. For instance, all MPI and SHMEM implementations are in the form of linking libraries, while UPC implementations range from direct compilation systems (e.g., Cray UPC) to systems employing source-to-source translation complemented with extensive runtime libraries (e.g., Berkeley UPC). We noticed that while it is possible to select an established instrumentation technique that works well for a particular implementation strategy (for example, using the wrapper instrumentation approach for linking libraries), none of these techniques works well for all compiler implementation strategies (see Chapter 4.2 for additional discussion). Thus, any tool that wishes to support a range of models and compilers must include mechanisms to handle these implementation strategies efficiently.
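The wrapper instrumentation approach mentioned above can be illustrated with a small, self-contained sketch in plain C. Here `real_put` is a stand-in for a library routine such as a SHMEM put; the names and the in-process forwarding are hypothetical simplifications, since real wrappers typically rely on link-time interposition (e.g., the name-shifted PMPI profiling interface in MPI).

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the underlying library routine being wrapped. */
static int put_invocations = 0;
static void real_put(void *dst, const void *src, unsigned long nbytes)
{
    memcpy(dst, src, nbytes);   /* pretend this is a remote write */
    put_invocations++;
}

/* Wrapper the application calls instead of real_put(): it records
 * performance data, then forwards to the original routine. */
static int wrapped_calls = 0;
static unsigned long wrapped_bytes = 0;
void put(void *dst, const void *src, unsigned long nbytes)
{
    wrapped_calls++;            /* measurement: call count          */
    wrapped_bytes += nbytes;    /* measurement: total transfer size */
    real_put(dst, src, nbytes); /* invoke the original operation    */
}
```

The limitation noted in the text is visible here: this pattern only works when the construct of interest is a function call that can be intercepted; an implicit shared-variable assignment in UPC never passes through any such wrapper.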


CHAPTER 4
GENERAL-PURPOSE FRAMEWORK FOR PARALLEL APPLICATION PERFORMANCE ANALYSIS

Parallel Performance Wizard (PPW) is a performance data collection, analysis, and visualization system for parallel programs. The goal is to provide a performance tool infrastructure that supports a wide range of parallel programming models with ease; in particular, we focus on the much-needed support for PGAS models [20].

PPW's high-level architecture is shown in Figure 4-1, with arrows illustrating the steps involved in the PPW-assisted application optimization process. A user's source program is first compiled using PPW's commands to generate an instrumented executable. This executable is then run, and either profiling or tracing data (as selected by the user) is collected and managed by PPW. The user then opens the resulting performance data file and proceeds to analyze application performance in several ways: examining statistical performance information via profiling data visualizations supplied by PPW; converting tracing data to SLOG-2 or OTF format for viewing with Jumpshot or Vampir; or using the PPW analysis system to search for performance bottlenecks.

PPW currently supports the analysis of UPC, SHMEM, and MPI 1.x applications and is extensible to support other parallel programming models. To facilitate support for a variety of models, we developed a new concept: the generic-operation-type abstraction. In the remainder of this section, we discuss the motivations behind and advantages of using this abstraction.

Existing performance tools are commonly designed either to support a specific model or to be completely generic. Model-specific tools interact directly with model-specific constructs and have the advantage of being able to collect a varying set of operation-specific data (such as memory address and data transfer size for put and get). However, the cost of adding support for additional models to these tools is usually high, often requiring updates to a significant portion of the system, due to the need to re-implement the same functionalities for each model. In contrast, completely generic tools (such as MPE [9])


Figure 4-1. PPW-assisted performance analysis process from original source program to revised (optimized) program


Figure 4-2. Generic-operation-type abstraction to facilitate the support for multiple programming models

work with generic program execution states (i.e., the beginning and the end of function calls) and thus can be easily adapted to support a wide range of models. Unfortunately, being completely generic forces these tools to collect a standard set of metrics (e.g., source line, timestamp) each time data collection occurs, and as a result, these tools lose the capacity to obtain useful operation-specific metrics¹ (e.g., data size for data transfer operations). To avoid the unnecessary tight coupling of a tool to its supported programming models while still enabling the collection of useful operation-specific metrics, we developed a generic-operation-type abstraction that is a hybrid of the model-specific and the completely generic approaches. The idea is to first map model-specific constructs to a set of model-independent generic operation types classified by their functionality. For each generic operation, the tool can then collect operation-specific events and metrics and may later analyze and present these data differently depending on the operation type (Figure 4-2).

The generic-operation-type abstraction has influenced the development of many components of the PPW system, including its event model, instrumentation and

¹ Note that it is possible but impractical to collect all metrics each time, as part of the collected metrics will not be meaningful.


measurement approach, and analyses and visualizations. We now describe these components (Figure 4-3) in the following sections.

Figure 4-3. Framework of Parallel Performance Wizard organized with respect to stages of experimental measurement and model dependency (multi-boxed units are model-dependent)

4.1 Parallel Programming Event Model

The effectiveness of an event-driven tool is directly impacted by the events and metrics (i.e., the event model) that it uses. Events signify important instances of program execution when performance data should be gathered, while metrics define the types of data that should be measured and subsequently stored by the tool for a given event. In this section, we present the generic parallel event model PPW uses to describe the behavior of a given parallel application.

PPW focuses on providing the detailed information needed to analyze parallel portions of a program while maintaining sufficient information to enable a high-level sequential analysis. For this reason, the current PPW event model includes mostly parallel events. Table 4-1 summarizes the theoretical event model using the generic-operation-type


abstraction to describe the behavior of a given program; this event model is compatible with both PGAS models and MPI, while Table 4-2 shows the mapping of UPC, SHMEM, and MPI 1.x constructs to generic operation types.

We organized Table 4-1 so that operation types with the same set of relevant events are shown together as a group. In addition to metrics useful for any event (i.e., calling PE ID, code location, timestamp, and operation-type identifier), for each event we provide a list of additional metrics that would be beneficial to collect. For one-sided communication, metrics such as the data source and destination² (PE ID and memory address), the size of the data being transferred, and the synchronization handler³ (for non-blocking operations only) provide additional insight into the behavior of these operations. For two-sided communication, it is necessary to match the PE ID, transfer size, message identifier, and synchronization handler in order to correlate related operations. For lock acquisition or release operations and wait-on-value-change operations, the lock identifier or the wait-variable address helps prevent false bottleneck detection. For collective global-memory allocation, the memory address distinguishes one allocation call from another. For group communication, the data transfer size may help in understanding these operations. Finally, for group synchronization and communication that do not involve all system PEs, group member information is useful in distinguishing between distinct but concurrently executing operations.

It is important to point out that event timestamp information is often the most critical metric to monitor (as performance optimization is usually aimed at minimizing the observed execution time). With a proper set of events and accurate timing information for these events, it is possible to calculate (or at least provide a good estimate of) the

² The source and destination of a transfer may be different from the calling PE.
³ Identifier used by the explicit/implicit synchronization operations to force completion of a particular put or get.


Table 4-1. Theoretical event model for parallel programs

Group synchronization; group communication; initialization/termination; global memory allocation
  Events: Enter (Notification_Begin), Notification_End, Wait_Begin, Transfers_Received, Exit (Wait_End)
  Additional metrics (may also include PAPI counters): group info (sync./comm.), address (memory), transfer size (comm.)

Atomic read/write; blocking implicit put/get; blocking explicit put/get; non-blocking explicit put/get; explicit communication synchronization
  Events: Enter, Exit, Transfer_Begin, Transfer_Complete, Synchronization_Begin, Synchronization_End
  Additional metrics: source, destination, transfer size, and synchronization handler (non-blocking) on Enter; synchronization handler on Synchronization_Begin and Synchronization_End

Blocking send/receive; non-blocking send/receive
  Events: Enter, Exit, Matching_Enter, Signal_Received, Wait_Begin, Wait_End
  Additional metrics: matching PE, transfer size, message identifier, and synchronization handler (non-blocking) on Enter; synchronization handler on Wait_Begin

Lock acquisition or release; wait-on-value change
  Events: Enter, Condition_Fulfilled, Exit
  Additional metrics: lock identifier (lock), address (wait)

User-defined function/region; work-sharing; environment inquiry
  Events: Enter, Exit
  Additional metrics: none


Table 4-2. Mapping of UPC, SHMEM, and MPI 1.x constructs to generic operation types

Initialization — UPC: N/A; SHMEM: shmem_init(); MPI 1.x: mpi_init()
Termination — UPC: upc_global_exit(); SHMEM: N/A; MPI 1.x: mpi_finalize()
Environment inquiry — UPC: MYTHREAD, THREADS, upc_threadof(), ...; SHMEM: my_pe(), num_pes(); MPI 1.x: mpi_comm_rank(), mpi_address(), ...
Group sync. — UPC: upc_notify(), upc_wait(), upc_barrier(); SHMEM: shmem_barrier(), shmem_barrier_all(); MPI 1.x: mpi_barrier()
Group comm. — UPC: upc_all_broadcast(), upc_all_scatter(), ...; SHMEM: shmem_broadcast(), shmem_collect(), ...; MPI 1.x: mpi_bcast(), mpi_alltoall(), ...
Global memory management — UPC: declaration with shared keyword, upc_alloc(), ...; SHMEM: shmalloc(), shfree(); MPI 1.x: N/A
Implicit put (one-sided) — UPC: direct assignment (shared_int = 1); SHMEM: N/A; MPI 1.x: N/A
Implicit get (one-sided) — UPC: direct assignment (x = shared_int); SHMEM: N/A; MPI 1.x: N/A
Explicit put (one-sided) — UPC: upc_memput(), upc_memset(), upc_memcpy()ᵃ; SHMEM: shmem_put(), shmem_iput(), ...; MPI 1.x: N/A
Explicit get (one-sided) — UPC: upc_memget(), upc_memcpy()ᵃ; SHMEM: shmem_get(), shmem_iget(), ...; MPI 1.x: N/A
Send (two-sided) — UPC: N/A; SHMEM: N/A; MPI 1.x: mpi_bsend(), mpi_irsend(), ...
Receive (two-sided) — UPC: N/A; SHMEM: N/A; MPI 1.x: mpi_recv(), mpi_irecv(), ...
Explicit comm. synchronization — UPC: upc_fence(); SHMEM: shmem_fence(), shmem_wait_nb(), ...; MPI 1.x: mpi_wait(), mpi_waitall(), ...
Lock acquisition or release — UPC: upc_lock(), upc_unlock(), upc_lock_attempt(); SHMEM: N/A; MPI 1.x: N/A
Atomic comm. — UPC: N/A; SHMEM: shmem_int_fadd(), ...; MPI 1.x: N/A
Work-sharing — UPC: upc_forall(); SHMEM: N/A; MPI 1.x: N/A
Wait-on-value change — UPC: N/A; SHMEM: shmem_wait(), shmem_wait_until(), ...; MPI 1.x: N/A

ᵃ upc_memcpy() can be either put and/or get depending on the source and destination.
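The mapping of Table 4-2 can be sketched as a simple lookup from construct name to generic operation type. The fragment below is a hypothetical illustration in plain C (the enum values and table entries cover only a few rows of the table):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A few of the model-independent generic operation types. */
enum generic_op {
    OP_UNKNOWN, OP_GROUP_SYNC, OP_GROUP_COMM,
    OP_EXPLICIT_PUT, OP_EXPLICIT_GET, OP_LOCK
};

struct mapping { const char *construct; enum generic_op type; };

/* Partial construct-to-type table (UPC, SHMEM, and MPI mixed),
 * following the classification-by-functionality of Table 4-2. */
static const struct mapping op_table[] = {
    { "upc_barrier",       OP_GROUP_SYNC },
    { "shmem_barrier_all", OP_GROUP_SYNC },
    { "mpi_barrier",       OP_GROUP_SYNC },
    { "upc_all_broadcast", OP_GROUP_COMM },
    { "shmem_put",         OP_EXPLICIT_PUT },
    { "shmem_get",         OP_EXPLICIT_GET },
    { "upc_lock",          OP_LOCK },
};

/* Classify a model-specific construct by its generic type, so the
 * rest of the tool can treat all three models uniformly. */
enum generic_op classify(const char *construct)
{
    for (size_t i = 0; i < sizeof op_table / sizeof op_table[0]; i++)
        if (strcmp(op_table[i].construct, construct) == 0)
            return op_table[i].type;
    return OP_UNKNOWN;
}
```

Once a construct is classified, downstream analysis and presentation code can dispatch on the generic type alone, which is the model-independence benefit argued for above.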


duration of various computation, communication, and synchronization calls throughout program execution. In some cases, it is also possible to calculate program-induced delays⁴ (PI delays: delays caused by poor orchestration of parallel code, such as uneven work distribution, competing data access, or lock acquisition) that point to locations in the program that can be optimized via source-code modification. By examining the durations of various operations and identifying PI delays that can be removed, it is much simpler for a programmer to devise optimization techniques to improve the execution time of the application.

In the following subsections, we discuss the events and means to calculate operation duration and PI delay for each of the logical groups of generic-operation types shown in Table 4-1. For each group, we diagram some typical execution patterns via a set of operation-specific events (each indicated by an arrow with a number at the end) ordered with respect to time (x-axis) for each of the processing PEs (PE X, Y, Z) involved. We discuss means to calculate the duration of non-blocking operations (the duration of a blocking operation is always the time difference between its Enter and Exit events) and PI delay with these events, discuss why the inclusion of some events affects the accuracy of the calculations, and mention how a tool can track these events in practice. In addition, we point out performance issues typically associated with each operation group.

4.1.1 Group-Related Operations

In Figure 4-4, we illustrate the events for the category of operations that involve a group of PEs working together, including group synchronization, group communication, initialization, termination, and global memory allocation operations. The execution behavior of these operations is commonly described in terms of participating PEs running in one of two phases. First is the notification phase, when the calling PE sends out signals

⁴ Examples of delays which are not PI delays include data transfer delay due to network congestion, slowdown due to multiple applications running at the same time, etc.


to all other PEs in the group indicating its readiness to perform the operation. The second is the wait phase, where the calling PE blocks until the arrival of signals from all PEs before completing the operation. Two versions of these group operations are typically provided to programmers: the standard (blocking) single-phase version, where a single construct is used to complete both phases, and the more flexible (non-blocking) split-phase version, which uses separate constructs for each phase and thereby allows for overlapping operations (generally restricted to local computation). With respect to existing programming models, the single-phase version is available for all operations in this category, while the split-phase version is typically only available for group synchronization and group communication. PI delays associated with these operations normally mark the existence of load-imbalance issues.

The events associated with this category of operations are the following:

- Enter (Notification_Begin): Event denoting the beginning of the cooperative operation (beginning of the notification phase). The calling PE starts sending out Ready signals (plus data for group communication operations) to all other PEs.
- Notification_End: Event denoting the point in time when the calling PE finishes sending Ready signals (plus data for group communication operations) to all other PEs (end of the notification phase). For the split-phase version, the calling PE is free to perform overlapping operations after this point until the wait phase. In the single-phase version, this event is normally not traceable directly but is estimated to occur a short time after the Enter event.
- Wait_Begin: Event denoting the beginning of the wait phase (where the calling PE blocks until Ready signals are received from all other PEs). Normally only traceable for the split-phase version.
- Transfers_Received: Event denoting the arrival of Ready signals from all other PEs on the calling PE. This event is usually not traceable directly but is estimated to occur a short time after the last participating PE enters the operation.
- Exit (Wait_End): Event denoting the completion of the cooperative operation.
An example execution pattern exhibiting bottlenecks (on PEs X and Y) caused by uneven work distribution for the single-phase version is diagrammed in Figure 4-4a. In this scenario, PE Z entered the operation after it had received a Ready signal from


both PEs X and Y (i.e., on PE Z, a Transfers_Received event occurred before an Enter event), so it was able to complete the operation without blocking. In contrast, PEs X and Y finished sending out signals before receiving all incoming Ready signals, so they were unable to complete the operation optimally; each PE became idle until it reached the Transfers_Received event. The PI delay for the single-phase version is given by the time difference between the Transfers_Received and Notification_End events (Figure 4-4, bottom).

Figure 4-4. Events for group synchronization, group communication, initialization, termination, and global memory allocation operations (TS = timestamp. Only a few of PE Y's events are shown to avoid clutter)

Figure 4-4b shows an example execution pattern for the split-phase version. In this scenario, PE Z received all signals before entering the notification phase, so it is free of any PI delay. PE Y entered the wait phase before receiving all signals, so it remained idle for a period of time before completing its operation (idle time given by the time difference between the Transfers_Received event and the Wait_Begin event). Finally, PE X shows a situation where overlapping computation is used to remove potential delays (the advantage of


the split-phase version). PE X entered the notification phase first, so it logically required the longest wait time (i.e., the largest difference between the Transfers_Received and the Enter event). However, by performing sufficient computation, PE X no longer needed to wait once it entered the wait phase and thus was free of delay. For the split-phase version, the total operation duration is given by the combined duration of the notification phase (time difference between the Enter and Notification_End events) and the wait phase (time difference between the Wait_Begin and Exit events), and the PI delay is given by the time difference between the Transfers_Received and Wait_Begin events (Figure 4-4, bottom).

4.1.2 Data Transfer Operations

In Figure 4-5, we illustrate the events for operations relating to one-sided, point-to-point data transfers such as atomic operations, blocking or non-blocking explicit or implicit put and get operations, and explicit communication synchronization (e.g., fence, quiet). Note that we present the get operation as a reverse put⁵ (where the calling PE sends a request to the target PE and the target PE performs a put), since get operations are often implemented this way in practice in order to improve their performance. For this class of operations, we are interested in determining the time it takes for the full data transfer to complete: from the beginning of the read or write to when the data is visible to the whole system. The precise duration for the non-blocking version could be calculated if the Transfer_Complete event were available; unfortunately, it is often not possible for model implementations to supply this event. For such systems, the duration can only be estimated from the end time of either the explicit or implicit communication synchronization (Synchronization_End event) that enforces data consistency among PEs. As illustrated in Figure 4-5c, this estimated duration could be much higher than the precise duration and as a result compromises the reliability of subsequent analyses. Furthermore, any PI delays caused by the synchronization operations increase

⁵ A canonical get would have events similar to those illustrated for a put operation.


the duration time further away from the actual transfer time. Finally, if an explicit synchronization operation is used to force the completion of multiple data transfers, PI delay in one transfer will affect the duration calculation for all other transfers as well, further decreasing the accuracy of the performance information.

To calculate the PI delay, either the Transfer_Begin or the Transfer_Complete event is needed; these are sometimes obtainable by examining the Network Interface Card (NIC) status. The PI delay for a blocking put (get) is the time difference between the Transfer_Begin and the Enter event. For a non-blocking put (get) using implicit synchronization, the PI delay is the time difference between the Transfer_Complete and the Synchronization_Begin event. For a non-blocking put (get) using explicit synchronization, the PI delay is the time difference between the Transfer_Begin and the Synchronization_Begin event. PI delays associated with these operations often signify the existence of competing data accesses.

In Figure 4-6, we illustrate the events for operations (of PE X) relating to two-sided, point-to-point data transfers such as blocking or non-blocking send and receive operations and explicit communication synchronization. As with one-sided communication, we are interested in determining the time it takes for the full data transfer to complete. The duration for the non-blocking versions is the time difference between the send (receive) Enter and Exit events plus the time difference between the Signal_Received and Wait_End events. Unfortunately, the Signal_Received event (the event denoting the point in time when the calling PE received a Ready signal from the matching PE) is nearly impossible to obtain; thus the duration can only be estimated from the exit time of the synchronization call that guarantees data transfer completion, resulting in a much higher than normal duration calculation. The PI delay for a blocking send (receive) is the time difference between the Enter and Matching_Enter events, while the PI delay for a non-blocking send (receive) with a matching blocking receive (send) is the time difference between the Wait_Begin and Matching_Enter events. For a non-blocking send (receive) with a matching non-blocking receive (send), PI delay cannot be calculated


Figure 4-5. Events for one-sided communication and synchronization operations


(since Signal_Received is not available). These PI delays signify a potentially inefficient calling sequence of send and receive pairs that typically stems from a load imbalance prior to the calling of these operations.

Figure 4-6. Events for two-sided communication and synchronization operations

4.1.3 Lock, Wait-On-Value, and Locally Executed Operations

In Figure 4-7a, we illustrate the events for the lock mechanisms and the wait-on-value operation. The duration is calculated from the time difference between the Enter and Exit events. To calculate the PI delay, the Condition_Fulfilled event is needed, which indicates when the lock becomes available (unlocked) or when a remote PE updates the variable to have a value satisfying the specified wait condition. This Condition_Fulfilled event is generally not traceable directly by the tool but instead can be estimated from other


operations' events (i.e., the last unlock or data transfer completed). PI delays associated with these operations generally stem from poor orchestration among processing PEs (such as lock competition and late updates of wait variables).

Finally, in Figure 4-7b, we illustrate the events for locally executed operations such as user-defined function or region, work-sharing, and environment inquiry operations. Tracking the performance of these operations is important as it facilitates the analysis of local portions of the program. Since we can consider each to be a blocking operation, the duration is simply the time difference between the Enter and Exit events. Without extensive sequential performance tracking and analysis, it is not possible to determine if any PI delay exists.

Figure 4-7. Events for (a) lock mechanisms and wait-on-value and (b) user-defined function/region, work-sharing, and environment-inquiry operations

4.1.4 Implementation Challenges and Strategies

In this subsection, we briefly discuss the challenges and strategies used to implement the event model with respect to data collection (instrumentation and measurement), automatic data analysis, and data presentation. A more detailed discussion of each of these stages will be given in Chapters 4.2, 5, and 4.4, respectively.

Together, the programming model, the chosen instrumentation technique, and tool design decisions determine the set of events that the tool collects during runtime. The


programming model supplies the Meaningful Event Set to collect, as specified by the event model discussed previously. From this set, the Measurable Event Set, the subset that can be collected directly during runtime given the constraints imposed by the chosen instrumentation technique, is identified. Finally, some tool design decisions may further limit the Actual Event Set (a subset of the Measurable Event Set) the tool supports. Once the Actual Event Set is known, the metrics (common metrics plus additional metrics in Table 4-1) associated with events in this set are collected during runtime.

Depending on the Actual Event Set collected, the analyses performed during the analysis phase will differ. For example, to calculate barrier duration and PI delay in MPI and SHMEM, the single-phase formulas are used, while for Berkeley UPC, the split-phase formulas are used. Programming model capabilities also play a role in what kinds of analyses are performed. Analyses specific to barriers can be applied to all three models, while analyses specific to one-sided transfer operations (e.g., PI delay due to competing put and get) are applicable to both SHMEM and UPC, but not MPI 1.x.

For the presentation stage, each visualization is equipped to handle each type of operation defined in the event model. For some visualizations, the operation type plays no role in how the visualization handles the data, while other visualizations must include mechanisms to handle each type separately. For example, table-based views display data for all operation types in a similar fashion, but a grid-based view of data transferred between PEs makes specific use of communication-related metrics of data-transfer operations exclusively. Other visualizations, such as the timeline view of trace data, operate differently for various operation types; for example, the timeline needs to appropriately handle two-sided operations, one-sided data transfers (for which a new approach is needed to handle this operation type), etc.
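The single-phase and split-phase barrier formulas referenced above (from Section 4.1.1) can be written down directly from the event timestamps. The sketch below is a plain-C restatement of those formulas; timestamps are passed in as plain doubles for illustration (a real tool would pull them from its trace records), and negative differences are clamped to zero for PEs that never had to wait, matching the PE Z case in the text.

```c
#include <assert.h>

/* Single-phase group operation (Section 4.1.1, Figure 4-4a):
 * PI delay = TS(Transfers_Received) - TS(Notification_End),
 * i.e., how long this PE sat idle after finishing its own
 * notifications while waiting on slower PEs. */
double single_phase_pi_delay(double notification_end_ts,
                             double transfers_received_ts)
{
    double d = transfers_received_ts - notification_end_ts;
    return d > 0.0 ? d : 0.0;   /* signals arrived first: no delay */
}

/* Split-phase version (Figure 4-4b):
 * duration = (Notification_End - Enter) + (Exit - Wait_Begin). */
double split_phase_duration(double enter_ts, double notification_end_ts,
                            double wait_begin_ts, double exit_ts)
{
    return (notification_end_ts - enter_ts) + (exit_ts - wait_begin_ts);
}

/* Split-phase PI delay = TS(Transfers_Received) - TS(Wait_Begin). */
double split_phase_pi_delay(double wait_begin_ts,
                            double transfers_received_ts)
{
    double d = transfers_received_ts - wait_begin_ts;
    return d > 0.0 ? d : 0.0;   /* overlap hid the wait entirely */
}
```

A tool would select between the two sets of formulas per model implementation, exactly as described above for MPI/SHMEM versus Berkeley UPC.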


4.2 Instrumentation and Measurement

In this section, we introduce known approaches for instrumentation, discuss their strengths and limitations within the context of the goals of PPW, and then present our data collection solution based on a novel, standardized performance interface called GASP.

4.2.1 Overview of Instrumentation Techniques

While several techniques have proven to be effective in application instrumentation [21], the differences in compilation⁶ and execution among the divergent compilation approaches prevent the selection of a universal instrumentation strategy. With source instrumentation, the instrumentation code is added as part of the high-level source code prior to execution time. Because the source code is altered during the instrumentation process, this technique may prevent compiler optimization and reorganization and also lacks the means to handle global memory models where some semantic details of communication are intentionally underspecified at the source level to allow for aggressive optimization (for example, an implicit read or write of a shared variable in UPC is difficult to handle using source instrumentation, especially under the relaxed memory consistency mode, where a given compiler may reorder the implicit calls to improve performance). With binary instrumentation, the instrumentation code is added to the machine code before or during program execution. A direct benefit of modifying the machine code rather than the source code is that recompilation is often not needed after each program modification. Unfortunately, binary instrumentation is unavailable on some architectures and yields performance data that is often difficult to correlate back to the relevant source code, especially for systems employing source-to-source translation. Finally, with library instrumentation (such as PMPI for MPI), wrappers are placed around functions implementing operations of interest. During execution time, a call to a function first executes the appropriate wrapper code that enables data collection and then invokes the original function. This approach is very easy to use but does not work for programming model constructs that are not in the form of a function call (such as an implicit put in


UPC) or for compilers that generate code which directly targets hardware instructions or low-level proprietary interfaces.

A brute-force approach to having a tool simultaneously support multiple programming models and implementations is simply to select an existing instrumentation technique that works for each particular model implementation. Unfortunately, this approach forces the writers of performance tools to be deeply versed in the internal, often changing or proprietary, details of the implementations, which can result in tools that lack portability. In addition, the use of multiple instrumentation techniques forces the tool to handle each model implementation disjointly and thus complicates the tool development process.

4.2.2 The Global-Address-Space Performance Interface

The alternative we have pursued is to define an instrumentation-measurement interface, called the Global-Address-Space Performance (GASP) interface (Appendix A), that specifies the relationship between programming model implementations and performance tools (Figure 4-8). This interface defines the events and arguments of importance for each model construct (see Table 4-3 for GASP events and arguments related to non-blocking UPC communication and synchronization calls). Insertion of appropriate instrumentation code is left to the compiler writers, who have the best knowledge about the execution environment, while the tool developers retain full control of how performance data are gathered. By shifting the instrumentation responsibility from tool writers to compiler writers, the chance of instrumentation altering the program behavior is minimized. The simplicity of the interface minimizes the effort required from the compiler writer to add performance tool support to their system (and once completed, any tool that supports GASP and recognizes these model constructs can

⁶ For example, existing UPC implementations include direct, monolithic compilation systems (GCC-UPC, Cray UPC) and source-to-source translation complemented with extensive runtime libraries (Berkeley UPC, HP UPC, and Michigan UPC).


Figure 4-8. Interaction of PGAS application, compiler, and performance tool in GASP-enabled data collection

Table 4-3. GASP events for non-blocking UPC communication and synchronization

Operation identifier   | Event type  | Arguments
GASP_UPC_NB_GET_INIT   | Enter       | int is_relaxed, void *dst, gasp_upc_PTS_t *src, size_t n
GASP_UPC_NB_GET_INIT   | Exit        | int is_relaxed, void *dst, gasp_upc_PTS_t *src, size_t n, gasp_upc_nb_handle_t handle
GASP_UPC_NB_GET_DATA   | Enter, Exit | gasp_upc_nb_handle_t handle
GASP_UPC_NB_PUT_INIT   | Enter       | int is_relaxed, gasp_upc_PTS_t *dst, void *src, size_t n
GASP_UPC_NB_PUT_INIT   | Exit        | int is_relaxed, gasp_upc_PTS_t *dst, void *src, size_t n, gasp_upc_nb_handle_t handle
GASP_UPC_NB_PUT_DATA   | Enter, Exit | gasp_upc_nb_handle_t handle
GASP_UPC_NB_SYNC       | Enter, Exit | gasp_upc_nb_handle_t handle

support application analysis for that compiler). Concomitantly, this approach also greatly reduces the effort needed for performance tool writers to add support for a variety of model implementations; a single tool-side GASP implementation is sufficient for all compilers with GASP support.
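To make the event ordering in Table 4-3 concrete, the following C sketch shows the sequence of events a GASP-aware runtime might emit around a non-blocking get. This is a simplified illustration only; the logging helper and the exact bracketing are our own, and a real implementation would call gasp_event_notify with the full argument lists from the table.

```c
#include <assert.h>
#include <string.h>

/* Simplified in-memory event log used to illustrate the GASP event
 * ordering of Table 4-3; the emit() helper is hypothetical. */
static const char *event_log[8];
static int event_count = 0;

static void emit(const char *name) { event_log[event_count++] = name; }

/* Sketch of an instrumented non-blocking get: the runtime brackets
 * initiation, data movement, and completion with Enter/Exit events. */
void instrumented_nb_get(void)
{
    emit("GASP_UPC_NB_GET_INIT:Enter");
    /* ... initiate the transfer and obtain a gasp_upc_nb_handle_t ... */
    emit("GASP_UPC_NB_GET_INIT:Exit");
    emit("GASP_UPC_NB_GET_DATA:Enter");   /* payload movement */
    emit("GASP_UPC_NB_GET_DATA:Exit");
    emit("GASP_UPC_NB_SYNC:Enter");       /* wait for completion */
    emit("GASP_UPC_NB_SYNC:Exit");
}
```

A tool observing this sequence can attribute time separately to initiation, data movement, and completion wait, which is exactly why the INIT, DATA, and SYNC events are distinguished.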


Figure 4-9. Specification of the gasp_event_notify callback function

The most important entry point in the GASP interface is the event callback function named gasp_event_notify (Figure 4-9), which compilers use to notify the tool when events of potential interest occur at runtime and to provide useful information (e.g., event identifier, source code location, and event-related arguments) to the performance tool. The tool then decides how to handle the information and what metrics to record. In addition, the tool is permitted to make calls to routines that are written in the source programming model or that use the source library to query model-specific information which may not otherwise be available. The tool may also consult alternative sources of performance information, such as CPU hardware counters exposed by PAPI, for monitoring serial aspects of computational and memory system performance in great detail. The gasp_event_notify callback includes a per-thread, per-model context pointer to an opaque, tool-provided object created at initialization time, where the tool can store thread-local performance data.

The GASP specification is designed to be fully thread-safe, supporting model implementations where arbitrary subsets of programming model threads may be implemented as threads within a single process and virtual address space. It is highly extensible, allowing a tool to capture model- and implementation-specific events at varying levels of detail and to intercept just the subset of events relevant to the current analysis task. It also allows for mixed-model application analysis, whereby a single performance tool can record and analyze performance data generated by all programming models in use and present the results in a unified manner. Finally, GASP provides


facilities to create user-defined, explicitly-triggered performance events which allow the user to give context to performance data. This user-defined context data facilitates phase profiling and customized instrumentation of specific code segments.

Several user-tunable knobs are also defined by the GASP specification to provide finer control over the data collection process. First, several compilation flags are included so the user can control the event types the tool will collect during runtime. For example, the --inst-local compilation flag is used to request instrumentation of data transfer operations generated by shared local accesses (i.e., one-sided accesses to local data which are not statically known to be local). Because shared local accesses are often as fast as normal local accesses, enabling these events can add a significant runtime overhead to the application, so by default the tool does not collect these data. However, shared local access information is useful in some analyses, particularly those that deal with optimizing data locality (a critical consideration in PGAS programming, see Section ) and performing privatization optimizations, and thus may be worth the additional overhead. Second, instrumentation #pragma directives are provided, allowing the user to instruct the compiler to avoid instrumentation overheads for particular regions of code at compile time. Finally, a programmatic control function is provided to toggle performance measurement for selected program phases at runtime.

The complete GASP-enabled data collection process works as follows. First, the compiler-side GASP implementation (Instrumentation Unit) generates instrumentation code which is executed together with the application. The tool-side GASP implementation (Measurement Unit) then intercepts these calls and performs the desired measurement. Next, the raw data are passed to the Performance Data Manager, which is responsible for storing the raw data, merging data from multiple PEs, and performing simple post-processing (e.g., calculating averages among PEs). These data are then used by the automatic analysis units and presentation units at the later stages of performance evaluation.
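As an illustration of the measurement side of this process, the following C sketch shows how a tool-side GASP implementation might handle gasp_event_notify callbacks by pairing Enter/Exit events and accumulating per-operation time. The types and the reduced parameter list here are simplified stand-ins of our own; the real callback signature (with source file, line, and per-event varargs) is defined by the GASP specification.

```c
#include <assert.h>

/* Simplified stand-ins for GASP types; the real definitions come from
 * the GASP specification. */
typedef enum { GASP_START, GASP_END } gasp_evttype_t;

#define MAX_TAGS 256

typedef struct {                       /* tool-provided context object */
    double enter_time[MAX_TAGS];       /* timestamp of pending Enter   */
    double total_time[MAX_TAGS];       /* accumulated time per tag     */
    long   count[MAX_TAGS];            /* completed Enter/Exit pairs   */
} tool_context_t;

/* Tool-side handling of an event notification: pair Enter/Exit events
 * and accumulate per-operation wall time.  A real tool would read a
 * timer itself rather than take `now` as a parameter, and would also
 * record source-correlation information passed by the compiler. */
void tool_event_notify(tool_context_t *ctx, unsigned int evttag,
                       gasp_evttype_t evttype, double now)
{
    if (evttag >= MAX_TAGS) return;    /* ignore events we don't track */
    if (evttype == GASP_START) {
        ctx->enter_time[evttag] = now;
    } else {
        ctx->total_time[evttag] += now - ctx->enter_time[evttag];
        ctx->count[evttag] += 1;
    }
}
```

Because the context object is per-thread and tool-provided, this accumulation requires no locking, which is one reason the specification routes an opaque context pointer through every callback.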


4.2.3 GASP Implementations

Here we briefly discuss considerations for the compiler-side implementation of the GASP interface, focusing on UPC as it is the more interesting case. There are several UPC compilers with existing GASP implementations: Berkeley UPC, GCC UPC, and HP UPC [22]. Berkeley UPC translates UPC code to standard C code with calls to the Berkeley UPC runtime system. As a result, much of the corresponding GASP implementation consists of appropriate GASP calls made within the runtime system. However, several features of the GASP specification must be implemented within the compiler itself, including the #pragma directives for controlling instrumentation of program regions and support for instrumentation of user-function calls. In addition, to provide appropriate UPC source code correlation, the compiler must pass source code information down through the translation process. By contrast, the GCC UPC and HP UPC compilers both use a direct compilation approach, generating machine code directly instead of translating UPC into C. With this architecture, the GASP implementation involves more changes to the compiler itself than with Berkeley UPC. In the case of GCC UPC, for example, changes were needed in one of the UPC compilation phases (called the "gimplification" phase because intermediate representations of functions are converted to GCC's GIMPLE language) to determine whether instrumentation is enabled and, if so, generate appropriate code.

4.3 Automatic Analysis

The analysis module aims at providing the tool with automatic performance bottleneck detection and resolution capabilities. In this section, we briefly describe the capabilities of the analysis units and leave the in-depth discussion of the automatic analysis system development to Chapter 5.

The High-Level Analysis Unit provides analyses of the overall program performance not easily associated with a given operation type (e.g., load-balancing analyses) as well as multiple-experiment comparison (e.g., scalability analysis). To provide finer analyses,


the model-independent Bottleneck Detection Unit uses both profiling and tracing data to identify bottlenecks and determine their cause for a particular execution. Once bottlenecks are identified, Bottleneck Resolution Units then try to provide suggestions on how to remove them from the application. These units are partially model-dependent, as a given resolution strategy may not always work for all programming models. For example, a technique to fix performance degradation stemming from upc_memget, versus from shmem_get, could be different even though they are both classified as one-sided get operations. Each of the analysis units generates new analysis data that are incorporated by the Performance Data Manager and later presented by the Visualization Manager.

4.4 Data Presentation

PPW provides both graphical and text-based interfaces to view collected profile data and generated analysis results. Most of these visualizations have been designed to have a similar look and feel to those provided by other tools, so that users already familiar with other tools can quickly learn and effectively use PPW. The following list summarizes the visualization-related features (each supporting source-code correlation whenever possible) currently provided by PPW:

- A view summarizing the application execution environment (optimization flags used, machine hostnames, etc.).

- Charts to facilitate the identification of time-consuming application segments (10 longest-executing regions).

- Flat and call-path tables to display high-level statistical performance information.

- A visualization to detect and show event-level load-balancing issues (Figure 4-10a; see Chapter for more detail).

- A chart to compare related experimental runs (Figure 4-10b), such as runs of the same program using various system sizes or runs of different versions of the same program.

- A display showing the inter-PE communication volume for all data transfer operations in the program (providing PE-to-PE or PE-to-global-memory communication statistics).


Table 4-4. Profiling/tracing file size and overhead for UPC NPB 2.4 benchmark suite

Benchmark            CG      EP      FT      IS      MG
Size (profile) (KB)  113     840     369     276     195
Size (trace) (MB)    0.15    34      142     1050    4560
Overhead (profile)   <0.1%   2.69%   <0.1%   1.66%   <0.1%
Overhead (trace)     <0.1%   4.30%   <0.1%   2.84%   2.08%

- A unique, UPC-specific Array Distribution display that depicts the physical layout of shared objects in the application on the target system (Figure 4-11).

- Exports of trace data for viewing with the Jumpshot and Vampir [23] timeline viewers.

4.5 PPW Extensibility, Overhead, and Storage Requirements

We first developed the Parallel Performance Wizard tool to support UPC application analysis, which took close to two years. The tool was later extended to support SHMEM and MPI 1.x, each extension taking less than six months to complete, with a large portion of that time spent on system configuration. In Tables 4-4, 4-5, and 4-6, we provide experimental data on the data-collection overhead and storage space needed for PPW on an Opteron cluster connected with a QsNetII Quadrics interconnect, using the following test programs. For UPC, we executed George Washington University's UPC NPB version 2.4 benchmark suite (class B) [24] with Berkeley UPC [3] version 2.6. For SHMEM, we executed the Quadrics APSP and CONV test programs and an in-house SHMEM SAR application (see Chapter 6.3) with Quadrics SHMEM [4]. For MPI, we used Tachyon [25], an in-house MPI SAR application [26], and an in-house Corner Turn application with MPICH2 [27] v1.0.8. All programs were instrumented and monitored using PPW (v2.2), with performance data collected for all UPC/SHMEM/MPI constructs and user functions in each program. In all cases, the data-collection overhead numbers (<2.7% for profiling, <4.3% for tracing) are comparable to existing performance tools. Tracing data size is linearly related to the total number of events instrumented by the tool; on average, PPW requires 17 MB of storage space per 1 million trace events.


Figure 4-10. (a) Load-balancing analysis visualization for CG 256-PE run, (b) Experimental set comparison chart for Camel 4-, 8-, 16-, and 32-PE runs


Figure 4-11. Annotated screenshot of new UPC-specific array distribution visualization showing physical layout of a 2-D 5x8 array with block size 3 for an 8-PE system

Table 4-5. Profiling/tracing file size and overhead for SHMEM APSP, CONV, and SAR applications

Benchmark            APSP    CONV    SAR
Size (profile) (KB)  116     84      256
Overhead (profile)   <0.1%   <0.1%   <0.1%
Overhead (trace)     <0.1%   <0.1%   <0.1%

Table 4-6. Profiling/tracing file size and overhead for MPI Corner Turn, Tachyon, and SAR applications

Benchmark            Corner Turn   Tachyon   SAR
Size (profile) (KB)  150           196       76
Overhead (trace)     0.49%         0.71%     <0.1%


Table 4-7. Profiling/tracing file size and overhead comparison of the PPW, TAU, and Scalasca tools for a 16-PE run of the MPI IS benchmark

                                                PPW      TAU      Scalasca
Size (profile) (KB)                             133      77.5     56
Size (trace, uncompressed) (MB/million events)  16.16    23.75    11.44
Overhead (profile)                              <1%      <1%      <1%
Overhead (trace)                                <1%      <1%      <1%

4.5.1 PPW, TAU, and Scalasca Comparison

In Table 4-7, we show a brief overhead and data-size comparison of the PPW, TAU, and Scalasca tools for a 16-PE run of the MPI IS benchmark on a quad-core Xeon cluster using a Gigabit Ethernet interconnect. From this table, we see that the data-collection overhead for each of the three tools is negligible compared to the total execution time. For profiling, PPW requires a higher (but still quite small) amount of storage space than TAU and Scalasca. For tracing, PPW requires 16.16 MB of uncompressed storage space per one million trace events generated, while TAU and Scalasca require 23.75 MB and 11.44 MB of uncompressed storage space, respectively.

4.5.2 PPW Tool Scalability

To evaluate the scalability of PPW, we conducted 128-, 256-, and 512-PE runs of GWU's UPC NPB version 2.4 benchmarks (class B) using Berkeley UPC 2.8 (via the GASNet MPI conduit with MPICH2 version 1.0.8) on an 80-PE Intel quad-core Xeon cluster with a Gigabit Ethernet interconnect (with all model constructs and user functions instrumented). In Figure 4-12, we show the communication statistics visualization for a 256-PE run of CG (Figure 4-12a) and the zoomed-in Jumpshot view of an MG 512-PE run (Figure 4-12b). As shown in Table 4-8, the data-collection overhead numbers are higher than with runs on smaller system sizes but are still within the acceptable range (<6.34% for profiling, <5.61% for tracing). In all cases, profile data size remained in the manageable MB range. In contrast, trace data size for larger runs is significantly greater for some of these benchmarks that exhibit weak scaling, such as CG and MG (for


Table 4-8. Profiling/tracing file size and overhead for medium-scale UPC NPB 2.4 benchmark suite runs

                       Profiling                       Tracing
                128       256       512        128        256        512
CG  Overhead    0.87%     0.31%     N/A        <1%        <1%        N/A
CG  Data size   4.0 MB    14.1 MB   N/A        8.8 GB     17.6 GB    N/A
EP  Overhead    6.34%     0.91%     <1%        1.34%      2.65%      1.09%
EP  Data size   3.2 MB    11.8 MB   45.1 MB    3.5 MB     12.4 MB    46.3 MB
FT  Overhead    4.44%     N/A       N/A        4.31%      N/A        N/A
FT  Data size   4.9 MB    N/A       N/A        164.6 MB   N/A        N/A
IS  Overhead    <1%       4.06%     N/A        <1%        <1%        N/A
IS  Data size   3.7 MB    12.8 MB   N/A        4.5 GB     4.6 GB     N/A
MG  Overhead    4.55%     <1%       <1%        <1%        5.61%      1.18%
MG  Data size   8.0 MB    21.3 MB   64.1 MB    150.8 MB   304.4 MB   627.3 MB

benchmarks such as IS that exhibit strong scaling, the data size stays relatively constant); this characteristic could become an issue as system size continues to increase.

4.6 Conclusions

The goal of the first part of this research was to investigate, design, develop, and evaluate a model-independent performance tool framework. While many tools support performance analysis of message-passing programs, tool support is limited for applications written in other programming models, such as those in the PGAS family. Existing tools were specifically designed to support a particular model (i.e., MPI), and they became too tightly coupled with that model. As a result, a significant amount of developer effort is needed to add new model support. To address this issue, the PPW performance system, built on two novel concepts, was developed. We introduced the generic-operation-type abstraction, illustrated how the generic-operation-type-based event model helps minimize the dependency of a tool on its supported programming models, and discussed the need for the new GASP interface and how this interface simplifies the otherwise cumbersome data collection process. With the inclusion of these two concepts, our PPW tool framework supports, and is easily extensible to support, a wide range of parallel programming models.


Figure 4-12. (a) Data transfers visualization showing communication volume between PEs for 256-PE CG benchmark tracing-mode run, (b) Zoomed-in Jumpshot view of 512-PE MG benchmark


To validate the proposed framework, we developed the PPW prototype tool, which originally supported manual analysis of UPC applications and was later extended to support manual analysis of SHMEM and MPI 1.x programs. We showed that while it took over two years to develop the first prototype, extending the prototype to support other programming models was achieved fairly quickly (less than six months each for SHMEM and MPI), demonstrating that our proposed framework is highly extensible. In addition, we demonstrated that our PPW prototype incurred overhead (<3% for profiling and <5% for tracing for all supported models) well within the acceptable range, is comparable to other popular performance tools, and remains usable up to 512 PEs.

Future work on this part of the research includes integrating PPW into the Eclipse development environment; enhancing the scalability of existing PPW visualizations; improving data-collection overhead, management, and storage on larger systems; and providing lower-level (e.g., programming model runtime and network-related) performance information using GASP.


CHAPTER 5
SYSTEM FOR AUTOMATIC ANALYSIS OF PARALLEL APPLICATIONS

Performance tools that collect and visualize raw performance data have proven to be productive in the application optimization process. However, to be successful in this manual analysis process, the user must possess a certain degree of expertise to discover and fix performance bottlenecks, which limits the usefulness of the tool, as non-expert programmers often do not have the skill set needed. In addition, as the size of the performance dataset grows, it becomes nearly impossible to manually analyze the data, even for expert programmers. One viable solution to this issue is an automatic analysis system that can detect, diagnose, and potentially resolve bottlenecks.

In this chapter, we present a new automatic analysis system that extends the capabilities of the PPW performance tool. The proposed system supports a range of analyses that (to our knowledge) no single existing system provides and uses novel techniques such as baseline filtering and a parallelized analysis process to improve the execution time and responsiveness of analyses. In addition, because it is based on the generic-operation-type abstraction introduced earlier, the analysis framework is applicable to any parallel programming model with constructs that can be mapped to the supported operation types.

To avoid confusion, we begin by defining some important terms used in the remainder of this chapter. A performance property (or pattern) defines an execution behavior of interest within an application. A performance bottleneck is a performance property with non-optimal behavior. Bottleneck detection (or identification, discovery) is the process of finding the locations (PE, line of code, etc.) of performance bottlenecks. Cause analysis¹ is the process of discovering the root causes of performance bottlenecks (e.g., late barrier entrance caused by uneven work distribution). Bottleneck resolution is the process of identifying potential strategies that may be applied to remove the bottlenecks. Automatic optimization refers to source code transformation and/or changes in the execution


Figure 5-1. Tool-assisted automatic performance analysis process

environment made by the tool to improve application performance. Finally, a hotspot is a portion of the application that took a significant percentage of time to execute and thus is a good candidate for optimization.

5.1 Overview of Automatic Analysis Approaches and Systems

Automatic (or automated) analysis is a tool-initiated process to facilitate the finding and ultimately the removal of performance bottlenecks within an application. The entire process may involve the tool, with or without user interaction, performing some or all of the tasks illustrated in Figure 5-1 on the application under investigation. Note that in the figure, performance data collection refers to the gathering of additional data on top of what the tool collects by default. In the remainder of this section, we provide an overview of existing work relating to automatic analysis.

The APART Specification Language (ASL) [28] is a formal specification model introduced by the APART [29] working group to describe performance properties via three components: a set of conditions to identify the existence of the property, a confidence value to quantify the certainty that the property holds, and a severity measure to describe the impact of the property on performance. The group used this language to provide a

¹ Because bottleneck detection and cause analysis are closely tied to each other, in some literature they are together referred to as the bottleneck detection process.


list of performance properties for the MPI, OpenMP, and HPF programming models and noted the possibility of defining a set of base (model-independent) performance property classes.

HPCToolkit and TAU are examples of tools providing features to evaluate the scalability of an application using profiling data. HPCToolkit uses the timing information from two experiments to identify regions of code with scalability behavior that deviates from the weak or strong scaling expectation [30]. PerfExplorer is an extension of TAU that generates several types of visualizations that compare the execution time, relative efficiency, or relative speedup of multiple experiments [31]. In addition, PerfExplorer includes techniques such as clustering, dimension reduction, and correlation analysis to reduce the amount of performance data the user must examine.

Periscope, KappaPI-2, and KOJAK are knowledge-based tools that support the detection of well-known performance bottlenecks defined with respect to the programming model. The advantage of a knowledge-based system is that little or no expertise is required of the user to successfully analyze the program. Periscope supports online detection of MPI, OpenMP, and memory-system-related bottlenecks (specified using ASL) through a distributed hierarchy of processing units that evaluate the profiling data [32]. KappaPI-2 is a post-mortem, centralized, tree-based analysis system that supports bottleneck detection, cause analysis, and bottleneck resolution (via static source code analysis) using tracing data [33]. Finally, EXPERT is a part of KOJAK (now known as Scalasca) that supports post-mortem bottleneck detection and cause analysis of MPI, OpenMP, and SHMEM bottlenecks (specified using ASL). The developers recently introduced an event-replay strategy to allow parallel, localized analysis processing, which has been successfully applied to MPI [34], but it remains questionable whether such a strategy works well for other programming models.


(suchasmaster-worker,pipeline,etc.)ratherthanthepro grammingmodel.Anadvantage ofthissystemisthatitcanbeusedtoanalyzeapplicationsw ritteninanyprogramming model.Unfortunately,thesystemcannothandleapplication sdevelopedusingamixtureof paradigmsorthatdonotfollowanyknownparadigmatall,mak ingitsomewhatlimited inapplicability. Paradyn'sonlineW 3 searchmodel[ 36 ]wasdesignedtoanswerthreequestions throughiterativerenement:whyisitperformingpoorly,w herearethebottlenecks,and whendidtheproblemsoccur.TheW 3 searchsystemanalyzesinstancesofperformance dataatruntime,testingahypothesiswhichiscontinuallyr enedalongoneofthethree questiondimensions.TheW 3 systemconsidershotspotstobebottlenecks,andsincenot allhotspotscontributetoperformancedegradation(theyc ouldsimplybeperforming usefulwork),theusefulnessofthissystemissomewhatlimi ted. ThemainideabehindthedesignofNoiseMiner[ 37 ],acomponentoftheProjections tool,isthateventsofsimilartypeshouldhavesimilarperf ormanceunderidealcircumstances. Utilizingthisassumption,thesystemmakesapassthroughth etracelog,assignsan expectedperformancevaluetoeacheventtype,andtheniden tiesspecictraceevents withperformancethatdonotmeettheexpectations(i.e.,no isyevents). PerformanceAssertions(PA)isaprototypesourcecodeannota tionsystemforthe specicationofperformanceexpectations[ 38 ].Onceperformanceassertionsareexplicitly addedbytheuser,thePAruntimecollectsdataneededtoeval uatetheseexpectations andselectstheappropriateaction(e.g.,alerttheuser,sa veordiscarddata,callaspecic function,etc.)duringruntime.IBMhasalsodevelopedanau tomatedbottleneckdetection systemenablingthedetectionofarbitraryperformancepro pertieswithinanapplication [ 39 ].Thesystemsuppliesuserswithaninterfacetoaddnewperf ormancepropertiesusing pre-existingmetricsandtoaddnewmetricsneededtoformul atethenewproperties.With bothoftheabovesystems,acertaindegreeofexpertiseisre quiredoftheusertoformulate meaningfulassertionsorproperties. 58


Each of the above approaches has made a contribution to the field of automatic performance analysis. Each also has particular drawbacks that limit its effectiveness or applicability. In light of ongoing progress in, and the ever-increasing complexity of, parallel programming models and environments, we have sought to make corresponding progress in effective analysis functionality for a variety of modern programming models.

5.2 PPW Automatic Analysis System Design

The PPW automatic analysis system focuses on optimization of observed application execution time, with the goal of guiding (and possibly providing hints to) users to specific regions of code on which to focus their optimization efforts. The proposed system is novel in several aspects. First, the analysis system makes use of the same (model-independent) generic-operation-type abstraction underlying our PPW tool. As a result, our analysis system can be easily adapted to support a wide range of parallel programming models and naturally supports the analysis of mixed-model applications (i.e., programs written using two or more models). The use of this abstraction also improves the system's capabilities by allowing in-depth analysis of some user-defined functions. For example, by simply instructing the system to treat a user-defined upc_user_wait_until function in a UPC program as a wait-on-value-change operation (adding one line in the event type mapping), the system is able to determine the cause of any delays associated with this function. Second, we introduce several techniques, such as a new baseline filtering technique², to identify performance bottlenecks via comparison of actual to expected performance values; this is generally more accurate than the deviation filtering technique used in NoiseMiner. Third, our system performs a range of existing and new analyses including scalability analysis, load-balance analyses, frequency analysis, barrier-redundancy analysis, and common-bottleneck analysis, whereas other systems support only a few of these analyses.
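The one-line event-type mapping mentioned above can be sketched as follows in C. The table entries, type names, and lookup helper are illustrative assumptions of our own, not PPW's actual data structures; the point is that classifying a user function only requires adding one mapping entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical excerpt of an event-type mapping: each entry ties an
 * operation name to a generic operation type so that type-specific
 * cause analysis can be applied to it. */
typedef enum { OP_UNKNOWN, OP_GLOBAL_SYNC, OP_ONESIDED_GET,
               OP_WAIT_ON_VALUE } op_type_t;

struct mapping { const char *name; op_type_t type; };

static const struct mapping event_map[] = {
    { "upc_barrier",         OP_GLOBAL_SYNC   },
    { "shmem_get",           OP_ONESIDED_GET  },
    /* the "one added line" that lets the system analyze a user function: */
    { "upc_user_wait_until", OP_WAIT_ON_VALUE },
};

/* Resolve an event name to its generic operation type; unmapped names
 * are treated as opaque user functions. */
op_type_t lookup_op_type(const char *name)
{
    for (size_t i = 0; i < sizeof event_map / sizeof event_map[0]; i++)
        if (strcmp(event_map[i].name, name) == 0)
            return event_map[i].type;
    return OP_UNKNOWN;
}
```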
Finally, we have developed a scalable analysis processing technique to minimize the execution time and maximize the responsiveness of the analyses. This process is designed to allow multiple localized analyses to take place in parallel (involving minimal data transfers) and


is able to identify specific bottleneck regions using only profile data and to determine the cause of a bottleneck when trace data are available. Compared to other parallel analysis systems in existence (such as the event-replay strategy introduced in [34]), our system is inherently portable, since the analysis process is not tied to the execution environment used to run the application.

5.2.1 Design Overview

The high-level architecture of the PPW automatic analysis system is depicted in Figure 5-2. The analyses supported by the system are categorized into two groups: application analyses, which deal with performance evaluation of a single run, and experiment set analyses, which compare the performance of related runs. We designed this system to focus on providing analyses that help both novice and expert users optimize their applications via source code modification. In particular, these analyses focus on finding operations that took longer than expected to run, operations that may be redundant, and operations that could be transformed into other operations to improve performance.

The parallelized analysis processing mechanism is a peer-to-peer system consisting of up to N processing units (where N is the application system size): up to N-1 non-head processing units, each having (local) access to raw data from one or a group of PEs, and one head processing unit that requires access to a small portion of data from all PEs to perform global analyses. This inherently parallel design is intended to support the analysis of large-scale applications in a reasonable amount of time. In Figure 5-3, we illustrate the types of analyses conducted and the raw data exchange needed for all processing units in an example three-processing-unit system.

² The baseline filtering technique is new for automatic analysis but is readily used in system performance evaluation.


Figure 5-2. PPW automatic analysis system architecture

The complete analysis process can be broken down into several distinct categories. Figure 5-4 depicts the analysis workflow for a processing unit in the system, which includes common-bottleneck analysis, global analyses, frequency analysis, and bottleneck resolution. We describe these categories in more detail in the following sections; a summary of analyses currently supported by PPW is presented in Table 5-1.

5.2.2 Common-Bottleneck Analysis

The goal of Common-Bottleneck Analysis is to identify commonly encountered performance bottlenecks which application developers have seen over the years; due to their common occurrence, there are usually well-known optimization strategies that can be applied to remove these issues. For example, a common optimization strategy to remove


Table 5-1. Summary of existing PPW analyses

Name | Purpose | Required data type | Global or local | Related bottlenecks
Scalability Analysis | Determine scalability of an application | Profile data (multiple runs) | Global | Low application scalability
Revision Analysis | Compare performance of different revisions | Profile data (multiple runs) | Global | N/A
High-Level Analysis | Compare comp., comm., sync. among PEs | Profile data (all) | Global | PE-level load-balancing, low comp./comm. ratio
Block-Level Analysis | Detect load-balancing issues of individual program blocks | Tracing data (A2A) | Global | Block-level load-balancing
Event-Level Analysis | Detect load-balancing issues of individual events among PEs | Profile data (all) | Global | Event-level load-balancing
Barrier-Redundancy Analysis | Identify unnecessary barrier operations | Tracing data (A2A, data xfer) | Global | Block-level load-balancing
Shared-Data Analysis | Evaluate data affinity efficiency | Profile data (data xfer) | Global | Poor data locality
Frequency Analysis | Identify short-lived, high-frequency operations | Tracing data (all) | Local | Inefficiency relating to multiple small transfers
Bottleneck Detection | Identify potential bottleneck locations | Profile data (all) | Local | Delayed operations
Cause Analysis | Identify causes and types of common bottlenecks | Trace data (all) (min. data xfer) | Local | Bottlenecks listed in Table 5-2


Figure 5-3. Example analysis processing system with 3 processing units, showing the analyses each processing unit performs and the raw data exchange needed between processing units

Table 5-2. Common-bottleneck patterns currently supported by PPW and data needed to perform cause analysis

Local data type | Bottleneck patterns | Request targets | Remote data type
Global sync./comm. | Wait on group sync./comm. (load imbalance) | All other | Global sync./comm.
P2P lock | Wait on lock availability | All other | P2P unlock
P2P wait-on-value | Wait-on-value change | All other | One-sided data xfer
One-sided put/get | Competing put/get | All other | One-sided put/get
Two-sided send | Late sender | Receiver PE | Two-sided receive
Two-sided receive | Late receiver | Sender PE | Two-sided send


Figure 5-4. Analysis process flowchart for a processing unit in the system


Late Sender, a common-bottleneck pattern relating to two-sided data communication, is to move the send call forward in the execution sequence.

Common-bottleneck analysis is the most substantial and time-consuming analysis supported by the PPW analysis system. Unlike other knowledge-based systems such as KappaPI-2 and KOJAK, our approach uses both profile and trace data in the analysis and is scalable by design. The PPW Common-Bottleneck Analysis process is separated into two phases: a Bottleneck Detection phase to identify specific bottleneck regions using only profile data, followed by a Cause Analysis phase to determine the cause of each bottleneck using trace data.

The goal of Bottleneck Detection is to identify program regions that, when optimized, could improve the application performance by a noticeable amount. During this detection phase, each processing unit examines its portion of the (local) profiling data and identifies bottleneck profiling entries. For each of the profiling entries, the processing unit first checks whether or not that entry's total execution time exceeds a preset percentage of the total application time (i.e., is a hotspot). The purpose of this filtering step is to focus the analysis effort on portions of the program that would noticeably improve the performance of the application when optimized.

Next, the processing unit decides whether the identified hotspot entry is a bottleneck by applying one of the following two comparison methods. With the baseline comparison method, the processing unit marks the entry as a bottleneck if the ratio of its average execution time to its baseline execution time (the minimal amount of time needed by a given operation to complete its execution under ideal circumstances) exceeds a preset threshold.

If the baseline comparison method is not applicable (e.g., because the entry is a user function or no baseline value has been collected for the entry), the processing unit uses the alternative deviation evaluation method. With this method we make use of the


following assumption: under ideal circumstances, when an event is executed multiple times, the performance of each instance should be similar to that of the other instances (the same assumption is used in NoiseMiner). Thus, for each hotspot entry, the processing unit calculates the ratios of its minimal execution time and of its maximum execution time to its average execution time. If one or both of the ratios exceed a preset threshold, the processing unit marks the entry as a potential bottleneck.

The list of potential bottlenecks identified in the detection phase points application developers to specific regions of code on which to focus their attention, but it does not contain sufficient information to determine the causes of the performance issues, which is often needed to devise an appropriate optimization technique. To provide this detailed information, PPW's Cause Analysis uses available trace data and aims at finding the remote events that possibly caused the bottlenecks identified.

The underlying concept behind our approach is that if some remote events caused the local event to execute non-optimally (as opposed to the delay being caused by other factors not related to event ordering, such as network congestion), then these remote events must have occurred between the start and end times of the local event. Because of this, the amount of data exchanged between processing units is minimized, as only the related events that occurred during this time range need to be exchanged (compared to the event-replay strategy introduced in [34], where all related events must be exchanged). For example, for a upc_lock event on PE 0 with a start time of 2 ms and an end time of 5 ms, the request entry {PE0, 2ms, 5ms, P2P unlock} would be issued to all processing units. The logic behind this example is the following: if, at the time of the lock request, another PE holds the lock, the P2P lock operation issued by PE 0 will block until the lock is released by its holder. To find out which PE(s) held the lock that caused the delay in the P2P lock operation, we simply look at the P2P unlock operations issued between the start and end times of the lock operation. If no P2P unlock operation was issued by any


other PE, we conclude that the delay was caused by uncontrollable factors that cannot be resolved by the user, such as network congestion due to the concurrent execution of other applications.

During the cause analysis phase, each processing unit carries out several activities using local tracing data in a two-pass scheme. In the first trace-log pass, the processing unit identifies trace events [3] with source locations matching any of the profiling entries discovered in the detection phase. For each matching trace event, the processing unit generates a request entry containing its name, start time, and end time, along with the event's operation type, which is sent to other processing units to retrieve the appropriate trace data (Table 5-2 illustrates the current set of common bottlenecks, currently hard-coded, supported by our system). At the end of the first pass, the processing unit sends the requests out to all other processing units and waits for the arrival of requests from all other processing units.

Next, the processing unit makes a second pass through its trace log, generates the correct replies (consisting of {event name, timestamp} tuples), and sends them back to the requesting processing units. Finally, the processing unit waits for the arrival of replies and completes the cause analysis by assigning a bottleneck pattern name to each matching trace event, along with the remote operations that contributed to the delay.

In terms of execution time, we expect bottleneck detection to complete relatively quickly, as the amount of profile data is usually not large (data size in the KB range). By contrast, we expect cause analysis to take significantly longer to complete due to its use of trace data; we expect the execution time of cause analysis to be linearly proportional to the number of trace events in the performance data file.

[3] Processing units can choose to apply the filtering techniques on each trace event during this pass to further reduce the amount of data exchange needed between processing units.
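The two phases described above can be summarized in a minimal sketch. The data layout, threshold values, and function names below are illustrative assumptions, not PPW's actual implementation:

```python
def detect_bottlenecks(entries, total_time, hotspot_pct=0.05,
                       baseline_ratio=2.0, deviation_ratio=2.0):
    """Detection phase: flag hotspot profiling entries as potential bottlenecks."""
    flagged = []
    for e in entries:
        if e["total"] < hotspot_pct * total_time:
            continue                      # not a hotspot; skip
        avg = e["total"] / e["count"]
        if e.get("baseline") is not None:
            # Baseline comparison: average vs. ideal-case execution time.
            if avg / e["baseline"] > baseline_ratio:
                flagged.append(e["name"])
        # Deviation evaluation: instance times should cluster around the mean.
        elif e["max"] / avg > deviation_ratio or avg / e["min"] > deviation_ratio:
            flagged.append(e["name"])
    return flagged

def match_request(request, local_events):
    """Cause-analysis reply: local events of the requested operation type
    that fall inside the requester's [start, end] window."""
    return [(ev["name"], ev["time"]) for ev in local_events
            if ev["op"] == request["op"]
            and request["start"] <= ev["time"] <= request["end"]]
```

In the upc_lock example above, the request would carry op "P2P unlock" and the window [2 ms, 5 ms]; each peer answers with its unlock events inside that window.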


5.2.3 Global Analyses

PPW supports several analyses that require a global view of the performance data; more specifically, the head processing unit needs to have access to some data from all PEs. Depending on the data types required by the analyses, the time required to carry out these analyses [4] will vary. In the remainder of this section, we briefly discuss these global analyses, which include analyses to compare the performance of multiple related experiments (scalability analysis and revision analysis) and analyses to evaluate the performance of a single run (barrier-redundancy analysis, shared-data analysis, and several analyses to evaluate load balance among PEs).

Scalability of an application is a major concern for developers of parallel applications. With the ever-growing increase in parallel system size, it is becoming more important for applications to exhibit good scalability. Using profiling data from two or more experiments on different system sizes, PPW's Scalability Analysis evaluates an application's scalability (or more precisely, its parallel efficiency, the ratio of parallel performance improvement over the size increase). Starting from the experiment with the smallest number of PEs, the head processing unit calculates the parallel efficiency for all other experiments. An efficiency of 1 indicates that the application exhibits perfect scalability, while a value approaching 0 suggests very poor scalability.

During the iterative measure-modify process that a user performs to optimize his or her application, multiple revisions of an application are often produced, with each revision containing code changes aimed at removing or minimizing the performance issues discovered. To assist in evaluating the performance effects of these code changes,

[4] Note that these analyses can be performed at any time during the analysis process or as part of the other analyses.


PPW's Revision Analysis facilitates performance comparison of the application and its 10 longest-running code regions between revisions; this analysis is used to determine whether or not code changes improved program performance and, if so, what part of the program was improved.

Achieving good work distribution among PEs is difficult and often significantly impacts the performance and scalability of the application. To help in this respect, PPW provides several analyses to investigate an application's workload distribution at different levels. At the highest level, PPW's High-Level Analysis calculates and compares the total computation, communication, and synchronization time among PEs. Since the PEs with the largest computation time (i.e., highest workload) often determine the overall performance of the application, this analysis assists in the identification of bottleneck PEs (PEs that, when optimized, improve the overall application performance).

Next, the Block-Level Analysis aims at identifying specific program blocks [5] with uneven work distribution and thus further guides users to the parts of the program where they should focus their efforts. By ensuring that all program blocks have good load balance, the user essentially achieves good load balance for the entire application.

Finally, at the lowest level, the Event-Level Analysis compares the workload of individual events (i.e., a specific line of code or code region) among PEs, which is extremely useful when the event under investigation represents workload that was meant to be parallelized or is a global synchronization event (as an uneven global synchronization often stems from uneven workload distribution prior to the synchronization call).

[5] A program block is defined as a segment of code between one global synchronization and the next, similar to a block in the Bulk Synchronous Parallelism (BSP) computation model.
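Two of the global analyses just described reduce to small computations. The sketch below assumes a particular definition of parallel efficiency (speedup over the smallest run divided by the increase in PE count) and an illustrative max-to-mean imbalance test with a made-up 1.5 threshold; PPW's exact formulas may differ:

```python
def parallel_efficiency(runs):
    """Scalability Analysis sketch: speedup relative to the smallest run,
    divided by the growth in PE count. runs: list of (num_pes, exec_time)."""
    base_pes, base_time = min(runs)       # experiment with fewest PEs
    return {n: (base_time / t) / (n / base_pes) for n, t in runs}

def imbalanced(per_pe_times, threshold=1.5):
    """Event-Level Analysis sketch: flag an event whose per-PE times are
    uneven, measured as the max-to-mean ratio across PEs."""
    mean = sum(per_pe_times) / len(per_pe_times)
    return max(per_pe_times) / mean > threshold
```

With this definition, a run that halves its execution time when the PE count doubles scores an efficiency of 1 (perfect scaling), matching the interpretation given above.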


Figure 5-5. Barrier-redundancy analysis

To ensure program correctness, programmers often insert extra global barrier calls, which give rise to the performance degradation depicted in Figure 5-5. To detect potentially redundant barriers, PPW's Barrier-Redundancy Analysis examines the shared-data accesses between barrier calls and identifies as redundant any call with no shared-data accesses between it and the one before it. The idea is that since barriers are often used to enforce global memory consistency, a barrier call with no prior shared-data accesses may not be needed. The output of this analysis is a list of potentially redundant barrier calls (with source information) that the user may consider removing from the program.

Data affinity (or locality) is a very important factor in parallel programming, and it is often a major deciding factor between a good- and a poor-performing parallel application. To assess an application's data locality efficiency, PPW's Shared-Data Analysis measures the ratio of local to remote accesses [6] for all PEs and combines them into a single data-access ratio that can be used to determine the application's locality efficiency (typically, the higher the ratio, the better the program). This analysis can be refined to analyze the


locality efficiency of a specific shared region (such as a UPC shared array) when the tool knows the specific memory regions that a particular data communication call touches. In the case of UPC, this refinement is extremely useful, as it allows the determination of the best blocking factor leading to minimized remote data access on all PEs (part of the Bottleneck Resolution).

5.2.4 Frequency Analysis

The existence of short-lived, high-frequency events (henceforth referred to simply as high-frequency events) can affect the accuracy of the performance data collected, so it is useful to identify these high-frequency events that should not be tracked during the subsequent data collection process. More importantly, high-frequency events sometimes represent events which are highly beneficial to optimize (since they are called many times) or, in the case of data communication operations, could potentially be transformed into more efficient bulk transfer operations (a major known optimization technique, as illustrated in Figure 5-6). For these reasons, PPW includes a memory-bound Frequency Analysis aimed at identifying high-frequency events. By making a pass through the trace data, this analysis identifies a list of high-frequency events for each PE.

5.2.5 Bottleneck Resolution

In the final step of the analysis process, Bottleneck Resolution [7], the processing unit aims at identifying hints useful to the user in removing the bottlenecks identified in one of the previous analyses (Table 5-3). This process is the only part of the system that may need to be model-dependent, as a given resolution strategy may not always work for all programming models. For example, a technique to fix the performance degradation

[6] Note that due to factors such as variable aliasing, it may be very difficult to collect performance data relating to local accesses (and it is even more difficult to keep track of the specific memory addresses being accessed), and thus it is not possible to carry out this analysis for some programming model implementations.


Figure 5-6. Frequency analysis

stemming from upc_memget, versus one stemming from shmem_get, could be different even though both are classified as one-sided get operations.

One example of a model-specific resolution technique is the identification of the best blocking factor to use in declaring a high-affinity UPC shared array. When the system detects an excessive-communication issue associated with a shared array, the processing unit tries to find an alternative blocking factor that would yield the best local-to-remote memory access ratio for all PEs in the system.

5.3 Prototype Development and Evaluation

Several analysis system prototypes supporting UPC, SHMEM, and MPI were developed and integrated into the latest version of the PPW tool. These prototypes add to PPW a number of analysis components, corresponding to those shown in Figure 5-2, to perform the necessary processing, management of analysis data, and presentation of analysis results to the tool user. To perform any of the analyses, the user brings up the analysis user interface (Figure 5-7), selects the desired analysis type, and adjusts any parameter values (such as the program-percentage threshold that defines the minimum hotspot percentage) if desired. Once all the analyses are completed, the results are sent to an analysis visualization manager, which generates the appropriate visualizations.

[7] Bottleneck resolution is currently an open research area.
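The blocking-factor search just described can be sketched as a brute-force scan over candidate factors. The access-trace format and candidate set below are illustrative assumptions; the layout rule itself is standard UPC semantics (element i of an array declared with blocking factor bf has affinity to thread (i / bf) mod THREADS):

```python
def best_blocking_factor(accesses, num_pes, candidates):
    """Return the candidate blocking factor maximizing the fraction of
    accesses that are local. accesses: list of (pe, element_index) pairs."""
    def local_fraction(bf):
        # UPC round-robin block layout: element idx lives on thread
        # (idx // bf) % num_pes.
        local = sum(1 for pe, idx in accesses if (idx // bf) % num_pes == pe)
        return local / len(accesses)
    return max(candidates, key=local_fraction)
```

A resolution hint would then suggest redeclaring the shared array with the winning blocking factor, turning most remote accesses into local ones.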


Table 5-3. Example resolution techniques to remove parallel bottlenecks

Bottleneck pattern                            Potential resolution techniques
Wait on group sync./comm.                     Modify the code to achieve better work distribution
                                              Use multiple point-to-point synchronization operations
Wait on lock availability                     Perform more local computation before the wait-on-lock operation
                                              Use multiple locks if appropriate
Wait-on-value change                          Perform more local computation before the wait-on-value operation
Competing put/get                             Use non-blocking put/get
Late sender                                   Perform less local computation before the local send operation
                                              Perform more local computation before the remote receive operation
                                              Use non-blocking receive
Late receiver                                 Perform less local computation before the local receive operation
                                              Perform more local computation before the remote send operation
                                              Use non-blocking send
Consecutive blocking data transfers           Use non-blocking data transfers
to unrelated targets (i.e., different PEs,    Use bulk transfer operation if appropriate
different memory addresses on same PE)
Multiple small data transfers to same PE      Combine multiple small transfers into a single bulk transfer
Poor data locality                            Modify shared-data layout (e.g., use different blocking factor in UPC)
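Two of the lighter-weight analyses described earlier, the barrier-redundancy check of Section 5.2.3 and the frequency analysis of Section 5.2.4, reduce to single passes over an event stream. The event encodings and the frequency cutoff below are illustrative assumptions, not PPW's internal representation:

```python
from collections import Counter

def redundant_barriers(events):
    """Flag barriers with no shared-data access since the previous barrier.
    events: ordered list of ("barrier", line) or ("shared_access", line)."""
    flagged, seen_access = [], True   # treat program start as an access
    for kind, line in events:
        if kind == "barrier":
            if not seen_access:
                flagged.append(line)  # nothing to make consistent: suspect
            seen_access = False
        else:
            seen_access = True
    return flagged

def high_frequency_events(trace, cutoff):
    """One pass over a PE's trace: report events called more than cutoff times."""
    counts = Counter(name for name, _ in trace)
    return sorted(n for n, c in counts.items() if c > cutoff)
```

The flagged barrier lines correspond to the source-correlated list of potentially redundant calls that the analysis reports to the user.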


Figure 5-7. PPW analysis user interface

To acquire the appropriate baseline values needed for the baseline filtering technique, we created a set of bottleneck-free benchmark programs for each of the supported models. These benchmarks are then executed on the target system, and the generated data files are processed to extract the baseline value for each model construct.

PPW provides several new analysis visualizations to display the generated analysis results. To facilitate experiment-set analyses, a scalability-analysis visualization that plots the calculated parallel-efficiency values against the ideal parallel efficiency (Figure 5-8) and a revision-comparison visualization that facilitates side-by-side comparison of observed execution times for regions within separate versions of an application (Figure 5-9) are supported. To visualize the analysis results of a single experiment, PPW includes a high-level analysis visualization displaying the breakdown of computation, communication, and synchronization time for each PE executing an application to evaluate the workload distribution at a high level (Figure 5-10), an event-level load-balance visualization to compare the workload of individual events across PEs (Figure 5-11), and a multi-table


Figure 5-8. Annotated PPW scalability-analysis visualization

analysis visualization which displays the results from common-bottleneck detection and cause analysis supplemented with source-code correlation (Figure 5-12). Finally, PPW generates a text-based report providing a summary of the analyses performed; this report includes information such as the speed of analysis, the parameter values used, and the number and list of bottlenecks found on each PE, as well as results from several analyses (block-level load-balancing analysis, frequency analysis, barrier-redundancy analysis, shared-data analysis) not displayed in the analysis visualizations just mentioned (Figure 5-13).

In the remainder of this section, we present details of the sequential, threaded, and distributed prototypes developed and supply experimental results regarding the speed of these prototypes.

5.3.1 Sequential Prototype

The proposed analysis system was first developed as part of the PPW Java front-end to reflect a common PPW use case, illustrated in Figure 5-14, where the user collects


Figure 5-9. PPW revision-comparison visualization

Figure 5-10. Annotated PPW high-level analysis visualization


Figure 5-11. PPW event-level load-balance visualization

application performance data on the parallel system using the PPW back-end, transfers the combined performance data file to a personal workstation, and then visualizes the collected data using the PPW front-end system.

In this initial prototype, a single processing unit was used to conduct all of the selected analyses in a sequential fashion, illustrated in Figure 5-15a, using main memory to store intermediate (i.e., request and reply) and result data. To validate the correctness of this prototype, we created a set of test programs written in UPC, SHMEM, and MPI in a method similar to that discussed in [40]. This analysis test suite consists of control programs with no bottlenecks and test programs which each contain a bottleneck pattern of interest. We applied the analysis process to these programs and verified that the system is able to detect the target bottlenecks correctly.

In Table 5-4, the speed of the analysis for several of the NAS 2.4 benchmarks executed with 128 or 256 PEs is shown. The testbed for this experiment is an Intel Core i7 quad-core (with Hyper-Threading support) 2.66 GHz processor workstation with 6 GB of RAM running 64-bit Windows 7. As expected, the analysis speed for trace-related




Figure 5-13. Annotated PPW analysis summary report


Figure 5-14. A common use case of PPW where the user transfers the data from the parallel system to a workstation for analysis

Table 5-4. Sequential analysis speed of NPB benchmarks on workstation
                               FT            MG            EP       EP
System size                    128           128           128      256
Avg. profile entries per PE    36000         23000         37       37
Total trace events             9.24 million  5.72 million  10574    21072
Analysis time                  3821 s        1705 s        0.68 s   2.27 s
                               (63.7 min)    (28.4 min)

analyses dominates the overall execution time (profile-based analyses all took less than 1 ms); we observed that the analysis speed is linearly proportional to the number of trace events (0.15-0.2 million trace events per minute).

5.3.2 Threaded Prototype

While these initial performance results were encouraging, we quickly realized that the sequential approach would not suffice, for two reasons. First, the largest data size that the sequential prototype can analyze is limited by the amount of memory available; our attempt to analyze a 128-PE run of the CG benchmark (31.5 million trace events) was unsuccessful for this reason. Second, the time required to complete the analysis may become unreasonably long for much larger data sizes; it may take hours or even days before the user can see the result of the analysis of an experiment with a huge number of trace events.
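The linear-scaling observation above gives a quick back-of-the-envelope predictor of sequential analysis time; the midpoint throughput used below is an assumption drawn from the measured 0.15-0.2 million trace events per minute:

```python
def estimated_analysis_minutes(trace_events, events_per_minute=175_000):
    """Rough sequential cause-analysis time from the measured throughput
    (0.15-0.2 million trace events/minute); the midpoint is an assumption."""
    return trace_events / events_per_minute

# FT's 9.24 million events predict roughly 53 minutes, the same order of
# magnitude as the 63.7 minutes measured in Table 5-4.
```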


Figure 5-15. Analysis workflow for the (a) sequential prototype, (b) threaded prototype, and (c) distributed prototype


Table 5-5. Analysis speed of NPB benchmarks on workstation
Num. threads  FT                 MG                 EP (128)  EP (256)
1 (seq.)      3821 s (63.7 min)  1705 s (28.4 min)  0.68 s    2.27 s
2             2007 s (33.5 min)  1128 s (18.8 min)  0.37 s    1.10 s
4             1263 s (21.1 min)  709 s (11.8 min)   0.41 s    0.78 s
8             1026 s (17.1 min)  603 s (10.1 min)   0.39 s    0.81 s
16            1234 s (20.6 min)  626 s (10.4 min)   0.79 s    1.35 s

Fortunately, since the design of the PPW analysis system is inherently parallel, we were able to develop parallel versions of our system to address these two issues: a threaded prototype to take advantage of the now-dominant multi-core workstation architecture and a fully distributed prototype that can execute on a large cluster.

The modified analysis process of the (Java-based) threaded prototype is illustrated in Figure 5-15b. In this threaded prototype, each processing unit (1 to K) is assigned a group of PEs (1 to N) and is responsible for carrying out all the analyses for that group of PEs. The results produced by the threaded prototype were validated against those produced by the sequential prototype, and we again ran the analysis of the NAS benchmarks on the Core i7 workstation to measure the analysis speed. The results are shown in Table 5-5; from this table, we see that the analysis speed (for reasonably sized data files) scales fairly well up to the number of cores (1 to 2 and 4 threads), shows a slight improvement (4 to 8 threads) using Hyper-Threading, and slows down somewhat when the thread count exceeds the number of processing units (16 threads). The analysis of the CG benchmark was again unable to complete, as the threaded prototype also uses main memory to store all intermediate and result data structures.

5.3.3 Distributed Prototype

We have shown in the previous section that a threaded version of the analysis improves the speed of analysis fairly well up to the number of cores on the workstation. However, since the number of cores is limited on a single machine, we continued our prototyping effort to develop a version of the PPW analysis system capable of running on cluster systems that could contain thousands of PEs. There are several reasons for


Figure 5-16. A use case of PPW where analyses are performed on the parallel system

developing a distributed version of the analysis system. First, the distributed version is more scalable than the threaded version; it can support the analysis of larger data runs and improves the analysis speed further due to the increased number of available processors. Second, the distributed analysis process can now be executed as a batch job or as part of the data collection process, as shown in Figure 5-16. When running as part of the data collection process, the result of the analysis could potentially be used to reduce the raw data size and thus improve the scalability of the PPW tool itself.

The workflow for the distributed prototype (Figure 5-15c) is very similar to that of the threaded prototype, except that each processing unit is now assigned to process the data of a single PE (each processing unit has local access to the assigned PE's data) and intermediate data (requests and replies) must now be exchanged across the network. As shown in Figure 5-17, the amount of memory space required on each processing unit is reduced (from N x N x M requests and replies to 2 x N x M requests and replies), and the system is now able to support larger data files such as CG (containing 245974 trace events per PE), which was unable to run


Figure 5-17. Memory usage of PPW analysis system on cluster

Table 5-6. Analysis speed of NPB benchmarks on Ethernet-connected cluster
                      FT                 MG                 CG                  EP      EP
Num. PEs              128                128                128                 128     256
1 processing unit     2113 s (35.2 min)  1019 s (17.0 min)  N/A                 0.15 s  0.85 s
#PE processing units  32.5 s (0.5 min)   242 s (4.0 min)    16668 s (4.63 hrs)  6.38 s  40.12 s
Speedup               65.0               4.21               -                   0.02    0.02

successfully on the Core i7 workstation. The results produced by the distributed prototype were again validated against those produced by the sequential prototype, and in Table 5-6 we show the analysis speed of the NAS benchmarks on an 80-PE quad-core Xeon Linux cluster connected using MPICH-2 1.0.8 over Ethernet.

We made several observations from this data. We saw that the sequential analysis speed improved almost by a factor of 2 due to the move from a Java-based to a C-based environment. More importantly, the analysis speed of the parallel version (128 or 256 processing units) is greatly improved for larger data files. We saw that the analysis speed for EP (82 trace events per PE) worsened, but this behavior was expected, as there is simply not enough work to be distributed. In the case of MG (44670 trace events per PE), the analysis speed improved by a factor of 4. Finally, in the case of FT (72230 trace events per PE, with more bottlenecks undergoing cause analysis than MG), the analysis speed was


improved by almost two orders of magnitude, demonstrating the performance benefit of the distributed prototype. We expect the performance improvement to be more apparent on systems with high-speed interconnects and for experiments with a larger number of trace events per PE.

5.3.4 Summary of Prototype Development

We have developed several versions of the PPW analysis system and provided experimental data on the speed of analysis. We observed that the analysis speed (in all versions) is dependent on the size of the trace data, as expected, but is also affected by the number of bottlenecks undergoing cause analysis. We have shown the correctness of the PPW system design using a synthetic analysis test suite and demonstrated the scalability of the design through the analysis speed improvement of both the threaded and distributed prototypes over the initial sequential version. We noted that while the sequential prototype, and to a lesser extent the threaded prototype, exhibits some scalability issues, it is not without use. For analysis of experiments with small to moderate amounts of data, the workstation prototypes are sufficient to complete the analysis in a reasonable amount of time. However, when the number of trace events per PE exceeds a certain amount, users of the PPW analysis system should use the more efficient distributed prototype.

5.4 Conclusions

The goal of the second part of this research was to investigate, design, develop, and evaluate a scalable, model-independent automatic analysis system. Performance-tool-assisted manual analysis facilitates the cumbersome application optimization process but does not scale. As the size of the performance dataset grows, it becomes nearly impossible for the user to manually examine the data and find performance issues using the visualizations provided by the tool. This problem exposes the need for an automatic analysis system that can detect, diagnose, and potentially resolve bottlenecks. While several automatic


analysis approaches have been proposed, each has particular drawbacks that limit its effectiveness or applicability.

To address this issue, we developed the model-independent PPW automatic analysis system, which supports a variety of analyses. We presented the architecture of the PPW analysis system, introduced novel techniques such as the baseline filtering technique to improve detection accuracy, and discussed the scalable analysis processing mechanism designed to support large-scale application analysis. We showed correctness and performance results for a sequential version of the system that has been integrated into the PPW performance tool and then demonstrated the parallel nature of the design and its performance benefits in the discussion of the threaded and distributed versions of the system.

Future work for this system includes experimental evaluation on a larger parallel system, enhancements to the existing analyses (e.g., use of temporary files to reduce memory requirements and faster algorithms to improve the speed of trace analyses), support for additional analyses such as bottleneck resolution, expansion of the number of common-bottleneck patterns the system detects, and development of functionality to allow users to define new bottlenecks themselves.


CHAPTER 6
EXPERIMENTAL EVALUATION OF PPW-ASSISTED PARALLEL APPLICATION OPTIMIZATION PROCESS

In this chapter, we present studies used to evaluate the effectiveness of the proposed PPW framework and automatic analysis system.

6.1 Productivity Study

To assess the usefulness and productivity of PPW, we conducted a study with a group of 21 graduate students who had a basic understanding of UPC programming but were unfamiliar with the performance analysis process. Each student was asked to spend several hours conducting manual (via the insertion of printf statements) and tool-assisted (using a version of PPW without automatic analysis support) performance analysis with a small UPC cryptanalysis program called CAMEL (approximately 1000 lines of code) known to have several performance bottlenecks. Students were told to concentrate their effort on finding and resolving only parallel bottlenecks.

The results demonstrated that PPW was useful in helping programmers identify and resolve performance bottlenecks (Figure 6-1). On average, 1.38 bottlenecks were found with manual performance analysis, while 1.81 bottlenecks were found using PPW. All students were able to identify at least as many bottlenecks using PPW, and one third of them identified more bottlenecks using PPW. In addition, most students noted that they had an easier time pinpointing the bottlenecks using PPW. However, only six students were able to correctly modify the original code to improve its performance (with an average performance gain of 38.7%), while the rest either performed incorrect code transformations or were unable to devise a strategy to fix the issues. This inability to modify the original code was not surprising, since the students were not familiar with the algorithms used in the CAMEL program, were novices with respect to parallel programming, and were asked to spend only a few hours on the task.

Students were also asked to compare the experiences they had with both approaches in terms of code analysis (bottleneck identification) and optimization. Overall, PPW was


Figure 6-1. Productivity study results showing (a) method with more bottlenecks identified, (b) preferred method for bottleneck identification, and (c) preferred method for program optimization

viewed as a helpful tool by students, with most students preferring PPW over manual performance analysis for the reasons listed below (summarized from student feedback).

- Manual insertion and deletion of timing calls is tedious and time-consuming. While not significantly difficult in this case, it can potentially be unmanageable for large applications with tens of thousands of lines of code.
- A significant amount of effort was needed to determine where to insert the timing calls, a process which was automated in PPW.
- Visualizations provided by the tool were much more effective in pinpointing the source of bottlenecks, and even more so in determining the cause of the bottlenecks.

6.2 FT Case Study

For the first application case study, we ran the Fourier Transform (FT) benchmark (which implements a Fast Fourier Transform algorithm) from the NAS benchmark suite version 2.4 using GASP-enabled Berkeley UPC version 2.6. Initially, no change was made to the FT source code, and the performance data were collected for the class B setting executed using 16 PEs on an Opteron cluster with Quadrics QsNet II high-speed interconnects.

From the Tree Table (Figure 6-2), it was immediately obvious that the fft function call (3rd row) constituted the bulk of the execution time (18 s out of 20 s of total execution time). Further examination of performance data for events within the fft function revealed


Figure 6-2. Annotated PPW Tree Table visualization of original FT showing code regions yielding part of performance degradation

the upc_barrier operations (represented as upc_notify and upc_wait) in transpose2_global (6th row) as potential bottleneck locations. We came to this conclusion by observing that the actual average execution times for upc_barrier at lines 1943 (78.71 ms) and 1953 (1.06 s) far exceed the expected value of 2 ms on our system for 16 PEs (we obtained the expected value by running a simple benchmark). Looking at the code between the two barriers, we saw that multiple upc_memget operations were issued and speculated that the bottleneck was related to these operations. However, we were unable to verify this speculation and determine the cause of this bottleneck based solely on this statistical data.

Thus, we converted the trace data into the Jumpshot SLOG-2 format and looked at the behavior of the upc_barrier and upc_memget operations in a timeline view. We discovered that the upc_barrier at line 1953 was waiting for the upc_memget operation


Figure 6-3. Annotated Jumpshot view of original FT showing serialized nature of upc_memget at line 1950

to complete. In addition, we saw that upc_memget operations issued from the same PE were unnecessarily serialized, as shown in the annotated Jumpshot screenshot (Figure 6-3; note the zigzag pattern for memget operations). Looking at the start and end times of the upc_memget operations issued from PE 0 to all other PEs (see the info box in Figure 6-3), we saw that the later upc_memget operations must wait for the earlier upc_memget operations to complete before initiating, even though the data obtained were from different sources and stored locally at different private memory locations.

A solution to improve the performance of the FT benchmark is to use a non-blocking (asynchronous) bulk-transfer get such as bupc_memget_async provided by Berkeley UPC.


Figure 6-4. PPW Tree Table for modified FT with replacement asynchronous bulk-transfer calls

When this code transformation [1] was made (shown in the lower portion of Figure 6-4), we were able to improve the performance of the program by 14.4% over the original version.

We later applied the automatic analysis process to the same FT data file to check whether or not the analysis system could find the bottlenecks that we identified. Looking at the multi-table analysis visualization, we saw that the system found 4 bottlenecks, including the most significant upc_barrier bottleneck. In addition, the system was able to determine the cause of the delay for each occurrence of the barrier operation that took longer than expected. For example, the system found that the barrier called by PE 7 with a starting time of 2.6 s took longer than expected to execute because PEs 8 and 15 entered the barrier later than PE 7 (Figure 6-5, left). As we observed in the annotated

[1] Note that while this code transformation is not portable to other compilers, the optimization strategy of using non-blocking transfers is portable.


Jumpshot view (Figure 6-5, right), this pattern is verified. Switching to the high-level analysis visualization, we saw that each PE spent 5 to 15% of the total execution time inside the barrier call, further validating the existence of a barrier-related bottleneck. This percentage drops to 1 to 2% of the total execution time for the revised version using the non-blocking get operation.

In this case study, we have shown how PPW was used to optimize a UPC program. With little knowledge of how the FT benchmark works, we were able to apply the manual analysis process and remove a major bottleneck in the program within a few hours of using PPW. In addition, we showed that our automatic analysis system was able to correctly identify and determine the cause of significant bottlenecks in the FT benchmark.

6.3 SAR Case Study

For the second application case study, we performed analysis of both UPC and SHMEM in-house implementations of the Synthetic Aperture Radar (SAR) algorithm using GASP-enabled Berkeley UPC version 2.6 and Quadrics SHMEM on an Opteron cluster with a Quadrics QsNet II interconnect. SAR is a high-resolution, broad-area imaging algorithm used for reconnaissance, surveillance, targeting, navigation, and other operations requiring highly detailed, terrain-structural information. In this algorithm, the raw image gathered from the downward-facing radar is first divided into patches with overlapping boundaries so they can be processed independently from each other. Each patch then undergoes a two-dimensional, space-variant convolution that can be decomposed into two domains of processing, the range and azimuth, to produce the result for a segment of the final image (Figure 6-6).

The sequential version from Scripps Institution of Oceanography and the MPI version provided by two fellow researchers in our lab [26] were used as the templates for the development of the UPC and SHMEM versions. The MPI version follows the master-worker approach, where the master PE reads patches from the raw image file, distributes patches for processing, collects results from all PEs, and writes the result to an output file, while


Figure 6-5. Multi-table analysis visualization for FT benchmark with annotated Jumpshot visualization


Figure 6-6. Overview of Synthetic Aperture Radar algorithm

the worker PEs perform the actual range and azimuth computation on the patches (note: master PEs also perform computation). For this study, we used a raw image file with parameters set to create 35 patches, each of size 128 MB. While all patches could be executed in parallel in a single iteration on a system with more than 35 PEs, smaller systems, such as our 32-PE cluster, execute over multiple iterations (in each iteration, M patches are processed, where M equals the number of computing PEs). We assume only sequential I/O is available throughout the study, a fair assumption since neither UPC nor SHMEM currently includes standardized parallel I/O.

We began this case study by developing a UPC baseline version (which mimics the MPI version) using a single master PE that handles all the I/O operations and also performs processing of patches in each iteration. Between consecutive iterations, all-to-all barrier synchronization is used to enforce the consistency of the data. After verifying the correctness of this version, we used PPW to analyze the performance on three system sizes of computing PEs: 6, 12, and 18; these system sizes were chosen so that in each iteration, at most one worker PE is not performing any patch processing. By examining several visualizations in PPW (one of which is the Profile Metrics Bar Chart shown in Figure 6-7), we noticed that with 6 computing PEs, 18.7% of the execution time was spent inside the barrier and that the percentage increased with the number of computing PEs (20.4% for 12 PEs, 27.6% for 18 PEs). Using the timeline view to further investigate the issue (Figure 6-8), we then concluded that the cause of this bottleneck was that worker


PEs must wait until the master PE writes the result from the previous iteration to storage and sends the next patches of data to all processing PEs before they can exit the barrier.

Similar findings were seen when automatic analysis was applied. From the high-level analysis visualization (Figure 6-9), we observed that a significant amount of time is lost performing global synchronization. This observation is confirmed by examining the multi-table analysis visualization, which lists two shmem_barrier_all bottlenecks.

We devised two possible optimization strategies to improve the performance of the baseline version. The first strategy was the use of dedicated master PE(s) (performing no patch processing) to ensure that I/O operations could complete as soon as possible. The second strategy was to replace all-to-all barrier synchronization with point-to-point flag synchronization (implementing the wait-on-value-change operation) so processing PEs could work on the patches as early as possible. We expected that the first approach would yield a small performance improvement, while the second approach should greatly alleviate the issue identified.

We then developed five revisions of the program using one or both of these strategies: (1) dedicated master, (2) flag synchronization, (3) dedicated master with flag synchronization, (4) two dedicated masters (one for read, one for write), and (5) two dedicated masters with flag synchronization. These revisions were again run on system sizes with 6, 12, and 18 PEs, and the performance of the revisions was compared to that of the baseline version (Figure 6-10). As expected, the dedicated-master strategy alone did not improve the performance of the application. Surprisingly, the flag synchronization strategy by itself also did not improve the performance as we expected. After some investigation, we discovered that while we eliminated the barrier wait time, we introduced the same amount of idle time waiting for the shared flags to be set (Figure 6-11). The combination of both strategies, however, did improve the performance of the program by a noticeable amount, especially in the two-dedicated-masters-with-flag-synchronization version, where the percentage of patch execution time increased (from 77.95%, 78.09%, and 70.71% for the


Figure 6-7. Performance breakdown of UPC SAR baseline version run with 6, 12, and 18 computing PEs, annotated to show percentage of execution time associated with barriers


Figure 6-8. Timeline view of UPC SAR baseline version run with 6 computing PEs, annotated to highlight execution time taken by barriers

baseline version) to 97.05%, 94.38%, and 87.97% of total time for 6, 12, and 18 processing PEs respectively (the remaining time is mainly spent on unavoidable sequential I/O and bulk data transfer). This observation was verified when we looked at the high-level analysis visualization (Figure 6-12) and saw that all PEs spent the majority of their time performing computation.

This case study was then performed using SHMEM implementations of SAR based on the same approaches outlined above for the UPC version (performance comparisons for these versions are also shown in Figure 6-10). For SHMEM, we noticed that the dedicated master strategy improved the performance by a small amount, while the flag synchronization strategy still did not help. The combination of both strategies again improved the performance by a noticeable percentage, with the two dedicated masters and flag synchronization version exhibiting 6.1%, 13.6%, and 15.8% improvement over the baseline version for 6, 12, and 18 PEs respectively.
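The wait-on-value-change pattern behind the flag-synchronization revisions can be sketched in plain C. This is a single-process illustration using C11 atomics; in the actual revisions the flag lives in the global address space, with the master PE setting it remotely and the worker spinning locally (in SHMEM, shmem_int_wait_until plays this role). The patch_flag_t type and function names here are illustrative, not taken from the SAR code:

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative stand-in for a per-worker flag that would live in the
 * partitioned global address space in the real UPC/SHMEM revisions. */
typedef struct { atomic_int value; } patch_flag_t;

static void flag_init(patch_flag_t *f) { atomic_init(&f->value, 0); }

/* Master PE side: publish the iteration whose patch data is now ready. */
static void flag_set(patch_flag_t *f, int iteration) {
    atomic_store_explicit(&f->value, iteration, memory_order_release);
}

/* Worker PE side: wait-on-value-change, replacing the all-to-all barrier.
 * Returns the number of spin iterations (0 if the flag was already set). */
static int flag_wait_until(patch_flag_t *f, int iteration) {
    int spins = 0;
    while (atomic_load_explicit(&f->value, memory_order_acquire) < iteration)
        ++spins;
    return spins;
}
```

Because each worker waits only on its own flag, a worker whose patch data has already arrived can begin computing immediately instead of idling until every PE reaches the barrier; as described above, this only pays off once dedicated master PEs keep the flags from being set late.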


Figure 6-9. High-level analysis visualization for the original version (v1) of the SAR application with load-imbalance issue

Figure 6-10. Observed execution time for various UPC and SHMEM SAR revisions


Figure 6-11. Timeline view of UPC SAR flag synchronization version executed on a system with 6 computing PEs, annotated to highlight wait time of flags

Additionally, PPW enabled us to observe that the SHMEM versions were 15-20% slower than the corresponding UPC versions (Table 6-1). This observation was surprising since we used the same Quadrics interconnect, with communication libraries built on top of the same low-level network API. We examined the performance of the data transfers for the UPC and SHMEM versions and found that the performance of these operations is actually better in SHMEM. After some investigation, we determined that the difference between the two versions came from the significant increase in execution time for read and write of data and patch processing functions (i.e., the azimuth and range functions) in the SHMEM versions. We concluded that this behavior is most likely due to the overhead introduced by the Quadrics SHMEM library to allow access to the shared memory space, which is incurred even for accesses of data physically residing on the calling PE. For UPC, a cast of shared pointer to local pointer made before entering these


Figure 6-12. High-level analysis visualization for the F2M version (v5) of the SAR application with no major bottleneck

functions eliminates the overhead associated with global memory access (not available in SHMEM).

In this case study, we have shown how PPW was used to facilitate the optimization process of an in-house SAR application. We were able to use PPW to discover performance bottlenecks, compare the performance of the UPC and SHMEM versions side-by-side, and discover properties of the Quadrics SHMEM environment that should be considered when dealing with global memory access.

6.4 Conclusions

The goal of the third part of this research was to experimentally evaluate the PPW-assisted parallel application optimization process. Through several case studies, we assessed the effectiveness of the PPW framework and the analysis system presented in previous chapters. A classroom productivity study was conducted and the result


Table 6-1. Performance comparison of various versions of UPC and SHMEM SAR programs

                                6 PEs                    12 PEs                   18 PEs
                          UPC    SHMEM  %Diff      UPC    SHMEM  %Diff      UPC    SHMEM  %Diff
Baseline       Total      120.3s 149.8s 24.6       68.3s  84.5s  23.6       52.1s  63.4s  21.8
               Transfer   4.3s   3.0s              6.1s   3.4s              7.8s   3.5s
Master         Total      120.9s 151.6s 25.4       68.3s  84.4s  23.6       51.8s  63.0s  21.6
               Transfer   5.2s   3.7s              6.7s   3.7s              8.2s   3.7s
Flag           Total      120.8s 152.0s 25.8       68.2s  85.3s  24.9       51.5s  63.6s  23.4
               Transfer   4.3s   3.0s              6.1s   3.4s              7.7s   3.5s
Master/Flag    Total      118.7s 148.2s 24.8       65.7s  81.3s  23.8       49.6s  59.6s  20.0
               Transfer   5.2s   3.7s              6.7s   3.7s              8.2s   3.7s
2 masters      Total      121.6s 150.5s 23.7       68.8s  84.0s  22.1       53.8s  61.4s  14.2
               Transfer   5.2s   3.7s              6.7s   3.7s              8.2s   3.7s
2 masters/flag Total      113.7s 140.1s 23.3       60.7s  74.0s  21.9       45.7s  53.2s  16.4
               Transfer   5.2s   3.7s              6.7s   3.7s              8.2s   3.7s


illustrated that most students preferred PPW over printf-style performance analysis. In the FT case study, we showed how PPW assisted the manual analysis process and verified that the analysis system was able to correctly determine and find causes of bottlenecks identified during the manual analysis process. In the SAR case study, we demonstrated how the complete PPW tool (with manual and automatic analysis) was used in tuning an in-house, inefficient first implementation of SAR to yield an optimized application.


CHAPTER 7
CONCLUSIONS

Researchers from many scientific fields have turned to parallel computing in pursuit of the highest possible application performance. Unfortunately, due to the combined complexity of parallel execution environments and programming models, applications must often be analyzed and optimized by the programmer before reaching an acceptable level of performance. Many performance tools were developed to facilitate this non-trivial analyze-optimize process but have traditionally been limited in programming model support; existing tools were often developed to specifically target MPI and thus are not easily extensible to support alternative programming models. To fill this need, we presented work on what we believe to be the first general-purpose performance tool system, the Parallel Performance Wizard (PPW) system, in this dissertation.

We first presented the PPW framework and discussed novel concepts to improve tool extensibility. We introduced the generic-operation-type abstraction and the GASP-enabled data collection process developed to minimize the dependence of the tool on its supported models, making the PPW design highly extensible to support a range of programming models. Using this framework, we created the PPW performance system that fully supports the much-needed PGAS models (i.e., UPC and SHMEM) as well as MPI. Results from our experimental studies showed that our PPW system incurred an acceptable level of overhead and is comparable to other popular performance tools in terms of overhead and storage requirements. In addition, we demonstrated that our PPW system is scalable up to at least 512 processing elements.

We next presented a new scalable, model-independent analysis system to automatically detect and diagnose performance bottlenecks. We introduced new techniques to improve detection accuracy, discussed a range of new and existing analyses developed to find performance bottlenecks, and discussed a parallelized analysis processing mechanism designed to support large-scale application analysis. We validated the correctness of our


analysis system design using a test suite designed to verify the system's capability in detecting specific bottlenecks. We then demonstrated the parallelized nature of the design by successfully developing both threaded and distributed versions of the system. We showed the performance improvement of the parallel versions over the sequential version; in one case, we illustrated that the analysis speed was improved by almost two orders of magnitude (from 35 minutes to 35 seconds).

Finally, we presented several case studies to evaluate the PPW framework and the analysis system. In the classroom productivity study, we demonstrated that PPW was viewed as a useful tool in helping programmers identify and resolve performance bottlenecks. On average, participants were able to find more bottlenecks using PPW, and most participants noted that they had an easier time pinpointing bottlenecks using PPW. In the FT case study, we first demonstrated how PPW was used in the manual analysis process to improve the performance of the original FT benchmark by 14.4% within a few hours of use, and later showed that our automatic analysis system was able to correctly identify and determine the cause of significant bottlenecks found during the manual analysis process. In the SAR case study, we illustrated how the complete PPW system was used to discover performance bottlenecks, compare the performance of the UPC and SHMEM versions side-by-side, and discover properties of the Quadrics SHMEM environment that should be considered when dealing with global memory access.

The main contributions of this research include the PPW general-purpose performance tool system for parallel application optimization and a scalable automatic analysis system. With the creation of our PPW tool and infrastructure, we brought performance tool support to the much-needed UPC and SHMEM programming models and made it easier to bring performance tool support to other/developing programming models. In addition, we contributed to the on-going automatic performance analysis research by developing a scalable and portable automatic analysis system and introducing new techniques and analyses to find performance bottlenecks faster and more accurately.


There are several future areas of research related to this work. First, we have thus far tested and evaluated the PPW system on parallel systems of up to hundreds of processing elements. To keep up with the ever-growing parallel system size, we plan to experimentally evaluate the usefulness of PPW on larger parallel systems (thousands of processing elements) and develop strategies to resolve potential scalability issues with the PPW system. Second, we are interested in extending the PPW system to support newer programming models; we have already started working on enabling support for X10 [41]. Third, we are interested in enhancing the capability of the automatic analysis system by developing algorithms to further improve the analysis speed and accuracy, investigating techniques to automatically resolve bottlenecks, and developing mechanisms to allow users to define bottlenecks themselves. Finally, we will continue to work on improving the usability of PPW. For example, we are currently working on integrating PPW into the Eclipse development environment and we are planning to develop mechanisms to provide lower-level performance information using GASP.


APPENDIX
GASP SPECIFICATION 1.5

A.1 Introduction

In this Appendix, we include an adapted version of the GASP interface (version 1.5). The authors of this specification are Adam Leko, Hung-Hsun Su, and Alan D. George from the Electrical and Computer Engineering Department at the University of Florida and Dan Bonachea from the Computer Science Division at the University of California at Berkeley.

A.1.1 Scope

Due to the wide range of compilers and the lack of a standardized performance tool interface, writers of performance tools face many challenges when incorporating support for global address space (GAS) programming models such as Unified Parallel C (UPC), Titanium, and Co-Array Fortran (CAF). This document presents a Global Address Space Performance (GASP) tool interface that is flexible enough to be adapted into current global address space compiler and runtime infrastructures with little effort, while allowing performance analysis tools to gather much information about the performance of global address space programs.

A.1.2 Organization

Section A.2 gives a high-level overview of the GASP interface. As GASP can be used to support many global address space programming models, the interface has been broken down into model-independent and model-specific sections. Section A.3 presents the model-independent portions of the GASP interface, and the subsequent sections detail the model-specific portions of the interface.

A.1.3 Definitions

In this section, we define the terms used throughout this specification.

Model: a parallel programming language or library, such as UPC or MPI.
Users: individuals using a GAS model such as UPC.


Developers: individuals who write parallel software infrastructure such as UPC, CAF, or Titanium compilers.
Tools: performance analysis tools such as Vampir, TAU, or KOJAK.
Tool developers: individuals who develop performance analysis tools.
Tool code: code or library implementing the tool developer's portion of the GASP interface.
Thread: a thread of control in a GAS program; maps directly to UPC's concept of threads or CAF's concept of images.

A.2 GASP Overview

The GASP interface controls the interaction between a user's code, a performance tool, and a GAS model compiler and/or runtime system. This interaction is event-based and comes in the form of callbacks to the gasp_event_notify function at runtime. The callbacks may come from instrumentation code placed directly in an executable, from an instrumented runtime library, or any other method; the interface only requires that gasp_event_notify is called at appropriate times in the manner described in the rest of this document.

The GASP interface allows tool developers to support GAS models on all platforms and implementations supporting the interface. The interface is used in the following three steps:

1. Users compile their GAS code using compiler wrapper scripts provided by tool developers. Users may specify which analysis they wish the tool to perform on their code through either command-line arguments, environment variables, or other tool-specific methods.

2. The compiler wrapper scripts pass appropriate flags to the compiler indicating which callbacks the tool wishes to receive. During the linking phase, the scripts link in appropriate code from the performance tool that handles the callbacks at runtime. This tool-provided code shall be written in C.

3. When a user runs their program, the tool-provided code receives callbacks at runtime and may perform some action such as storing all events in a trace file or performing basic statistical profiling.
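To make the callback flow concrete, here is a deliberately minimal, single-threaded sketch of the tool-provided code linked in at step 2. The types and entry points follow the definitions given later in Sections A.3.2 and A.3.3; the counter fields in the context struct are our own choice of illustrative payload, not part of the specification:

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Spec-defined types, reproduced so this sketch is self-contained. */
typedef enum { GASP_LANG_UPC, GASP_LANG_TITANIUM, GASP_LANG_CAF,
               GASP_LANG_MPI, GASP_LANG_SHMEM } gasp_model_t;
typedef enum { GASP_START, GASP_END, GASP_ATOMIC } gasp_evttype_t;

/* Tool-implemented opaque per-thread context: here, just counters. */
struct _gasp_context_S {
    gasp_model_t model;
    int measuring;             /* toggled by gasp_control */
    unsigned long num_events;  /* events seen while measuring */
};
typedef struct _gasp_context_S *gasp_context_t;

gasp_context_t gasp_init(gasp_model_t srcmodel, int *argc, char ***argv) {
    static struct _gasp_context_S ctx; /* one context: single-thread sketch */
    (void)argc; (void)argv;
    ctx.model = srcmodel;
    ctx.measuring = 1;
    ctx.num_events = 0;
    return &ctx;
}

void gasp_event_notify(gasp_context_t context, unsigned int evttag,
                       gasp_evttype_t evttype, const char *filename,
                       int linenum, int colnum, ...) {
    (void)evttag; (void)evttype; (void)filename; (void)linenum; (void)colnum;
    if (context->measuring)
        context->num_events++; /* a real tool would record a trace/profile */
}

int gasp_control(gasp_context_t context, int on) {
    int last = context->measuring;
    context->measuring = on;
    return last;
}
```

A real tool would additionally be thread-safe, honor the varargs for each event, and write out profile or trace data, as required by the sections that follow.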


The specifics of each step will be discussed in Section A.3. The model-specific portions of the GASP interface will be discussed in the subsequent sections. A GAS implementation may exclude any system-level event defined in the model-specific sections of this document if an application cannot be instrumented for that event (e.g., due to design limitations or other implementation-specific constraints). Any action resulting in a violation of this specification shall result in undefined behavior. Tool and model implementors are strongly encouraged not to deviate from these specifications.

A.3 Model-Independent Interface

A.3.1 Instrumentation Control

Instrumentation control is accomplished through either compilation arguments or compiler pragmas. Developers may use alternative names for the command-line arguments if the names specified below do not fit the conventions already used by the compiler.

A.3.1.1 User-visible instrumentation control

If a user wishes to instrument their code for use with a tool using the GASP interface, they shall pass one of the command-line arguments described in this section to the compiler wrapper scripts. GASP system events are divided into the following broad categories, for the purposes of instrumentation control:

Local access events: events resulting from access to objects or variables contained in the portion of the global address space which is local to the accessing thread.
User function events: events resulting from entry and exit to user-defined functions, as described in Section A.4.3.
Other events: any system event which does not fall into the above categories.

The --inst argument specifies that the user's code shall be instrumented for all system events supported by the GAS model implementation which fall into the final category of events described above. The --inst-local argument implies --inst, and additionally requests that user code shall be instrumented to generate local access events supported by the GAS model implementation. Otherwise, such events need not be


generated. For models lacking a semantic concept of local or remote memory accesses, --inst shall have the same semantics as --inst-local, implying instrumentation of all global address space accesses. The --inst-functions argument implies --inst and additionally requests that user code shall be instrumented to generate user function events supported by the GAS model implementation. Otherwise, such events need not be generated.

A.3.1.2 Tool-visible instrumentation control

Compilers supporting the GASP interface shall provide the following command-line arguments for use by the tool-provided compiler wrapper scripts. The arguments --inst, --inst-local, and --inst-functions have the same semantics as the user-visible instrumentation flags specified in Section A.3.1.1. An additional argument --inst-only takes a single argument filename, which is a file containing a list of symbolic event names (as defined in the model-specific sections of this document) separated by newlines. The file's contents indicate the events for which the performance tool wishes to receive callbacks. Events in this file may be ignored by the compiler if the events are not supported by the model implementation. Compiler implementations are encouraged to avoid any overheads associated with generating events not specified by --inst-only; however, tools that pass --inst-only must still be prepared to receive and ignore events which are not included in the --inst-only list.

A.3.1.3 Interaction with instrumentation, measurement, and user events

When code is compiled without an --inst flag, all instrumentation control shall be ignored and all user event callbacks shall be compiled away. Systems may link "dummy" versions of gasp_control and gasp_create_event (described in Sections A.3.3 and A.3.4) for applications that have no code compiled with --inst.

Systems may support compiling parts of an application using one of the --inst flags and compiling other parts of an application normally; for systems where this scenario is not possible, this behavior may be prohibited. Applications compiled using an --inst


flag for at least one translation unit shall also pass the --inst flag during the linking phase to the compiler wrapper scripts. Any model-specific instrumentation control shall not have any effect on user events or on the state of measurement control. As a result, any model-specific instrumentation controls shall not prevent user events from being instrumented during compilation (e.g., #pragma pupc shall not change the behavior of the pupc_create_event and pupc_event_start functions in UPC programs).

A.3.2 Callback Structure

At runtime, all threads of an instrumented executable shall collectively call the gasp_init C function at the beginning of program execution, after the model runtime has finished initialization but before executing the entry point in a user's code (e.g., main in UPC). The gasp_init function shall have the following signature:

typedef enum {
  GASP_LANG_UPC,
  GASP_LANG_TITANIUM,
  GASP_LANG_CAF,
  GASP_LANG_MPI,
  GASP_LANG_SHMEM
} gasp_model_t;

struct _gasp_context_S;
typedef struct _gasp_context_S *gasp_context_t;

gasp_context_t gasp_init(gasp_model_t srcmodel, int *argc, char ***argv);

The gasp_init function and an implementation of the _gasp_context_S struct shall be provided by tool developers. A single running instance of an executable may collectively call gasp_init multiple times if the executable contains code written in


multiple models (such as a hybrid UPC and CAF program), with at most one call per model. The gasp_init function returns a pointer to an opaque, thread-specific, tool-implemented struct. This pointer shall be passed in all subsequent calls to the tool developer's code made on behalf of this thread. This pointer shall only be used in event callbacks for events corresponding to the model indicated by the srcmodel argument. Tool code may modify the contents of the argc and argv pointers to support the processing of command-line arguments.

After the gasp_init function has been called by each thread of execution, the tool code shall receive all other callbacks through the two functions whose signatures are shown below. Both functions may be used interchangeably; the VA variant is provided as a convenience to developers.

typedef enum {
  GASP_START,
  GASP_END,
  GASP_ATOMIC
} gasp_evttype_t;

void gasp_event_notify(gasp_context_t context, unsigned int evttag,
                       gasp_evttype_t evttype, const char *filename,
                       int linenum, int colnum, ...);

void gasp_event_notifyVA(gasp_context_t context, unsigned int evttag,
                         gasp_evttype_t evttype, const char *filename,
                         int linenum, int colnum, va_list varargs);

The gasp_event_notify implementation shall be written in C, but may make upcalls to code written in the model specified by the srcmodel argument passed to the gasp_init function on the thread that received the callback. If upcalls are used, the


gasp_event_notify function implementation is responsible for handling re-entrant calls. Additionally, code that is used in upcalls shall be compiled using the same environmental specifications as the code in a user's application (e.g., gasp_event_notify shall only perform upcalls to UPC code compiled under a static threads environment when used with a UPC program compiled under the static threads environment).

Any user data referenced by pointers passed to gasp_event_notify shall not be modified by tool code. For the first argument to gasp_event_notify, tool code shall receive the same gasp_context_t pointer that was returned from the gasp_init function for this thread. Tool developers may use the context struct to store thread-local information for each thread. The gasp_event_notify function shall be thread-safe in order to support model implementations that make use of pthreads or other thread libraries.

The evttag argument shall specify the event identifier as described in the model-specific sections of this document. The evttype argument shall be of type gasp_evttype_t and shall indicate whether the event evttag is a begin event, end event, or atomic event.

The filename, linenum, and colnum arguments shall indicate the line and column number in the model-level source code most closely associated with the generation of the event evttag. If filename is non-NULL, it references a character string whose contents must remain valid and unmodified for the remainder of the program execution. The same filename pointer is permitted to be passed in multiple calls and by multiple threads, and it is also permitted for different filename pointers (passed in different calls) to indicate the same filename (this scenario implies the tool may store filename pointer values and use simple pointer comparison of non-NULL values to establish filename equality, but not inequality).

GAS model implementations that do not retain column information during compilation may pass 0 in place of the colnum parameter. GAS model implementations


that do not retain any source-level information during compilation may pass 0 for the filename, linenum, and colnum parameters. GAS model implementations are strongly encouraged to support these arguments unless this information can be efficiently and accurately obtained through other documented methods. GAS model implementations that use instrumented runtime libraries for GASP support may provide dummy implementations for the gasp_event_notify, gasp_event_notifyVA, and gasp_init functions and the _gasp_context_S struct to prevent link errors while linking a user's application that is not being used with any performance tool. The contents of the varargs argument shall be specific to each event identifier and type and will be discussed in the model-specific sections of this document.

A.3.3 Measurement Control

Tool developers shall provide an implementation for the following function:

int gasp_control(gasp_context_t context, int on);

The gasp_control function takes the context argument in the same manner as the gasp_event_notify function. When the value 0 is passed for the on parameter, the tool shall cease measuring any performance data associated with subsequent system or user events generated on the calling thread, until the thread makes a future call to gasp_control with a nonzero value for the on parameter. The gasp_control function shall return the last value for the on parameter the function received from this thread, or a nonzero value if gasp_control has never been called for this thread.

A.3.4 User Events

Tool developers shall provide an implementation for the following function:

unsigned int gasp_create_event(gasp_context_t context,
                               const char *name, const char *desc);

The gasp_create_event function shall return a tool-generated event identifier. Compilers shall translate the corresponding model-specific _create_event functions listed in the


model-specific sections of this document into corresponding gasp_create_event calls. The semantics of the name and desc arguments and the return value shall be the same as defined by the _create_event function listed in the model-specific section of this document corresponding to the model indicated by context.

A.3.5 Header Files

Developers shall distribute a gasp.h C header file with their GAS implementations that contains at least the following definitions. The gasp.h file shall be installed in a directory that is included in the compiler's default search path.

- Function prototypes for the gasp_init, gasp_event_notify, gasp_control, and gasp_create_event functions and associated typedefs, enums, and structs.
- A GASP_VERSION macro that shall be defined to an integral date (coded as YYYYMMDD) corresponding to the GASP version supported by this GASP implementation. For implementations that support the version of GASP defined in this document, this macro shall be set to the integral value 20060914.
- Macro definitions that map the symbolic event names listed in the model-specific sections of this document to 32-bit unsigned integers.

A.4 C Interface

A.4.1 Instrumentation Control

Instrumentation for the events defined in this section shall be controlled by using the corresponding instrumentation control mechanisms for UPC code defined in Section A.5.1.

A.4.2 Measurement Control

Measurement for the events defined in this section shall be controlled by using the corresponding measurement control mechanisms for UPC code defined in Section A.5.2.

A.4.3 System Events

A.4.3.1 Function events

Table A-1 shows system events related to executing user functions. These events occur upon each call to a user function (after entry into that function), and before exit from a user function (before returning to the caller as a result of executing a return statement or reaching the closing brace which terminates the function). The funcsig


argument specifies the character string representing the full signature of the user function that is being entered or exited, or NULL if that information is not available.

If funcsig is non-NULL, it references a character string whose contents must remain valid and unmodified for the remainder of the program execution. The same funcsig pointer is permitted to be passed in multiple calls and by multiple threads, and it is also permitted for different funcsig pointers (passed in different calls) to indicate the same function signature (this scenario implies the tool may store funcsig pointer values and use simple pointer comparison of non-NULL values to establish function equality, but not inequality).

Table A-1. User function events
Symbolic name   Event type   vararg arguments
GASP_C_FUNC     Start, End   const char *funcsig

A.4.3.2 Memory allocation events

Table A-2 shows system events related to the standard memory allocation functions. The GASP_C_MALLOC, GASP_C_REALLOC, and GASP_C_FREE events stem directly from the standard C definitions of malloc, realloc, and free.

Table A-2. Memory allocation events
Symbolic name    Event type   vararg arguments
GASP_C_MALLOC    Start        size_t nbytes
GASP_C_MALLOC    End          size_t nbytes, void *returnptr
GASP_C_REALLOC   Start        void *ptr, size_t size
GASP_C_REALLOC   End          void *ptr, size_t size, void *returnptr
GASP_C_FREE      Start, End   void *ptr

A.4.4 Header Files

Supported C system events shall be handled in the same method as UPC events, which are described in Section A.5.5.
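As an illustration of how a tool might consume these callbacks, the sketch below tracks live heap bytes from the GASP_C_MALLOC End and GASP_C_FREE Start events. The fixed-size table and helper names are hypothetical; a real tool would dispatch to helpers like these from inside gasp_event_notify after unpacking the varargs listed in Table A-2:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tool-side bookkeeping for the C memory allocation events. */
#define MAX_LIVE 64

typedef struct { void *ptr; size_t nbytes; } alloc_rec_t;

static alloc_rec_t live[MAX_LIVE]; /* zero-initialized: all slots free */
static size_t live_bytes = 0;

/* Called for a GASP_C_MALLOC End event (varargs: nbytes, returnptr). */
static void on_malloc_end(size_t nbytes, void *returnptr) {
    for (int i = 0; i < MAX_LIVE; i++) {
        if (live[i].ptr == NULL) {
            live[i].ptr = returnptr;
            live[i].nbytes = nbytes;
            live_bytes += nbytes;
            return;
        }
    }
}

/* Called for a GASP_C_FREE Start event (vararg: ptr). */
static void on_free_start(void *ptr) {
    for (int i = 0; i < MAX_LIVE; i++) {
        if (live[i].ptr == ptr) {
            live_bytes -= live[i].nbytes;
            live[i].ptr = NULL;
            return;
        }
    }
}

static size_t current_live_bytes(void) { return live_bytes; }
```

Note that the pointers are only compared, never dereferenced, which is all a tool is permitted to do with user data passed through these events.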


A.5 UPC Interface

A.5.1 Instrumentation Control

Users may insert #pragma pupc on or #pragma pupc off directives in their code to instruct the compiler to avoid instrumenting lexically scoped regions of a user's UPC code. These pragmas may be ignored by the compiler if the compiler cannot control instrumentation for arbitrary regions of code. When an --inst argument is given to a compiler or compiler wrapper script, the #pragma pupc shall default to on.

A.5.2 Measurement Control

At runtime, users may call the following function to control the measurement of performance data. The pupc_control function shall behave in the same manner as the gasp_control function defined in Section A.3.3.

int pupc_control(int on);

A.5.3 User Events

unsigned int pupc_create_event(const char *name, const char *desc);
void pupc_event_start(unsigned int evttag, ...);
void pupc_event_end(unsigned int evttag, ...);
void pupc_event_atomic(unsigned int evttag, ...);

The pupc_create_event function shall be automatically translated into a corresponding gasp_create_event call, as defined in Section A.3.4. The name argument shall be used to associate a user-specified name with the event, and the desc argument may contain either NULL or a printf-style format string. The memory referenced by both arguments need not remain valid once the function returns.

The event identifier returned by pupc_create_event shall be a unique value in the range from GASP_UPC_USEREVT_START to GASP_UPC_USEREVT_END, inclusive. The GASP_UPC_USEREVT macros shall be provided in the gasp_upc.h header file described in


Section A.5.5. The value returned is thread-specific. If the unique identifiers are exhausted for the calling thread, pupc_create_event shall issue a fatal error.

The pupc_event_start, pupc_event_end, and pupc_event_atomic functions may be called by a user's UPC program at runtime. The evttag argument shall be any value returned by a prior pupc_create_event function call from the same thread. Users may pass in any list of values for the ... arguments, provided the argument types match the printf-style format string supplied in the corresponding pupc_create_event (according to the printf format string conventions specified by the target system). Any memory referenced by ... arguments (e.g., string arguments) need not remain valid once the function returns. A performance tool may use these values to display performance information alongside application-specific data captured during runtime to a user. The UPC implementation shall translate the pupc_event_start, pupc_event_end, and pupc_event_atomic function calls into corresponding gasp_event_notify function calls.

When a compiler does not receive any --inst arguments, the pupc_event function calls shall be excluded from the executable or linked against dummy implementations of these calls. A user's program shall not depend on any side effects that occur from executing the pupc_event functions. Users shall not pass a shared-qualified pointer as an argument to the pupc_event functions.

A.5.4 System Events

For the event arguments below, the UPC-specific types upc_flag_t and upc_op_t shall be converted to C ints. Pointers to shared data shall be passed with an extra level of indirection, and may only be dereferenced through UPC upcalls. UPC implementations shall provide two opaque types, gasp_upc_PTS_t and gasp_upc_lock_t, which shall represent a generic pointer-to-shared (i.e., shared void *) and a UPC lock pointer (i.e., upc_lock_t *), respectively. These opaque types shall be typedefed to void to prevent C code from attempting to dereference them without using a cast in a UPC upcall. The content of any gasp_upc_PTS_t or gasp_upc_lock_t location passed to an event is only


guaranteed to remain valid for the duration of the gasp_event_notify call, and must not be modified by the tool.

A.5.4.1 Exit events

Table A-3. Exit events
Symbolic name                 Event type   vararg arguments
GASP_UPC_COLLECTIVE_EXIT      Start, End   int status
GASP_UPC_NONCOLLECTIVE_EXIT   Atomic       int status

Table A-3 shows system events related to the end of a program's execution. The GASP_UPC_COLLECTIVE_EXIT events shall occur at the end of a program's execution on each thread when a collective exit occurs. These events correspond to the execution of the final implicit barrier for UPC programs. The GASP_UPC_NONCOLLECTIVE_EXIT event shall occur at the end of a program's execution on a single thread when a non-collective exit occurs.

A.5.4.2 Synchronization events

Table A-4. Synchronization events
Symbolic name      Event type   vararg arguments
GASP_UPC_NOTIFY    Start, End   int named, int expr
GASP_UPC_WAIT      Start, End   int named, int expr
GASP_UPC_BARRIER   Start, End   int named, int expr
GASP_UPC_FENCE     Start, End   (none)

Table A-4 shows events related to synchronization constructs. These events shall occur before and after execution of the notify, wait, barrier, and fence synchronization statements. The named argument to the notify, wait, and barrier start events shall be nonzero if the user has provided an integer expression for the corresponding notify, wait, and barrier statements. In this case, the expr variable shall be set to the result of evaluating that integer expression. If the user has not provided an integer expression for the corresponding notify, wait, or barrier statements, the named argument shall be zero and the value of expr shall be undefined.
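A tool can pair the Start and End callbacks of a barrier to attribute time spent synchronizing, which is essentially the "time inside the barrier" metric reported in the SAR case study of Chapter 6. A minimal sketch (timestamps are passed in by the caller here, and the struct and function names are our own; a real tool would read a clock inside gasp_event_notify when it sees a GASP_UPC_BARRIER event):

```c
#include <assert.h>

/* Per-thread accumulator for time spent between a barrier's
 * Start and End callbacks. */
typedef struct {
    double start_ts;      /* timestamp of the pending Start event */
    int in_barrier;       /* nonzero between Start and End */
    double total_barrier; /* accumulated barrier time */
} barrier_prof_t;

/* is_start corresponds to receiving GASP_START vs. GASP_END for the
 * barrier event; ts is the current timestamp in seconds. */
static void barrier_event(barrier_prof_t *p, int is_start, double ts) {
    if (is_start) {
        p->start_ts = ts;
        p->in_barrier = 1;
    } else if (p->in_barrier) {
        p->total_barrier += ts - p->start_ts;
        p->in_barrier = 0;
    }
}
```

Dividing total_barrier by total execution time yields the per-thread barrier percentages (e.g., the 18.7% for 6 PEs) shown in the case study's bar-chart visualization.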


A.5.4.3 Work-sharing events

Table A-5. Work-sharing events
Symbolic name     Event type   vararg arguments
GASP_UPC_FORALL   Start, End   (none)

Table A-5 shows events related to work-sharing constructs. These events shall occur on each thread before and after upc_forall constructs are executed.

A.5.4.4 Library-related events

Table A-6 shows events related to library functions. These events stem directly from the UPC library functions defined in the UPC specification. The vararg arguments for each event callback mirror those defined in the UPC language specification.

A.5.4.5 Blocking shared variable access events

Table A-7 shows events related to blocking shared variable accesses. These events shall occur whenever shared variables are assigned to or read from using the direct syntax (not using the upc.h library functions). The arguments to these events mimic those of the upc_memget and upc_memput event callback arguments, but differ from the ones presented in the previous section because they only arise from accessing shared variables directly. If the memory access occurs under the relaxed memory model, the is_relaxed parameter shall be nonzero; otherwise the is_relaxed parameter shall be zero.

A.5.4.6 Non-blocking shared variable access events

Table A-8 shows events related to direct shared variable accesses implemented through non-blocking communication. These non-blocking direct shared variable access events are similar to the regular direct shared variable access events in Section A.5.4.5. The INIT events shall correspond to the non-blocking communication initiation, the DATA events shall correspond to when the data starts to arrive and completely arrives on the destination node (these events may be excluded for most implementations that use hardware-supported DMA), and the GASP_UPC_NB_SYNC function shall correspond to


Table A-6. Library-related events
Symbolic name                Event type   vararg arguments
GASP_UPC_GLOBAL_ALLOC        Start        size_t nblocks, size_t nbytes
GASP_UPC_GLOBAL_ALLOC        End          size_t nblocks, size_t nbytes, gasp_upc_PTS_t *newshrd_ptr
GASP_UPC_ALL_ALLOC           Start        size_t nblocks, size_t nbytes
GASP_UPC_ALL_ALLOC           End          size_t nblocks, size_t nbytes, gasp_upc_PTS_t *newshrd_ptr
GASP_UPC_ALLOC               Start        size_t nbytes
GASP_UPC_ALLOC               End          size_t nbytes, gasp_upc_PTS_t *newshrd_ptr
GASP_UPC_FREE                Start, End   gasp_upc_PTS_t *shrd_ptr
GASP_UPC_GLOBAL_LOCK_ALLOC   Start        (none)
GASP_UPC_GLOBAL_LOCK_ALLOC   End          gasp_upc_lock_t *lck
GASP_UPC_ALL_LOCK_ALLOC      Start        (none)
GASP_UPC_ALL_LOCK_ALLOC      End          gasp_upc_lock_t *lck
GASP_UPC_LOCK_FREE           Start, End   gasp_upc_lock_t *lck
GASP_UPC_LOCK                Start, End   gasp_upc_lock_t *lck
GASP_UPC_LOCK_ATTEMPT        Start        gasp_upc_lock_t *lck
GASP_UPC_LOCK_ATTEMPT        End          gasp_upc_lock_t *lck, int result
GASP_UPC_UNLOCK              Start, End   gasp_upc_lock_t *lck
GASP_UPC_MEMCPY              Start, End   gasp_upc_PTS_t *dst, gasp_upc_PTS_t *src, size_t n
GASP_UPC_MEMGET              Start, End   void *dst, gasp_upc_PTS_t *src, size_t n
GASP_UPC_MEMPUT              Start, End   gasp_upc_PTS_t *dst, void *src, size_t n
GASP_UPC_MEMSET              Start, End   gasp_upc_PTS_t *dst, int c, size_t n


Table A-7. Blocking shared variable access events

  Symbolic name   Event type   vararg arguments
  GASP_UPC_GET    Start, End   int is_relaxed, void *dst, gasp_upc_PTS_t *src, size_t n
  GASP_UPC_PUT    Start, End   int is_relaxed, gasp_upc_PTS_t *dst, void *src, size_t n

Table A-8. Non-blocking shared variable access events

  Symbolic name          Event type   vararg arguments
  GASP_UPC_NB_GET_INIT   Start        int is_relaxed, void *dst, gasp_upc_PTS_t *src, size_t n
  GASP_UPC_NB_GET_INIT   End          int is_relaxed, void *dst, gasp_upc_PTS_t *src, size_t n, gasp_upc_nb_handle_t handle
  GASP_UPC_NB_GET_DATA   Start, End   gasp_upc_nb_handle_t handle
  GASP_UPC_NB_PUT_INIT   Start        int is_relaxed, gasp_upc_PTS_t *dst, void *src, size_t n
  GASP_UPC_NB_PUT_INIT   End          int is_relaxed, gasp_upc_PTS_t *dst, void *src, size_t n, gasp_upc_nb_handle_t handle
  GASP_UPC_NB_PUT_DATA   Start, End   gasp_upc_nb_handle_t handle
  GASP_UPC_NB_SYNC       Start, End   gasp_upc_nb_handle_t handle

the final synchronization call that blocks until the corresponding data of the non-blocking operation is no longer in flight.

gasp_upc_nb_handle_t shall be an opaque type defined by the UPC implementation. Several outstanding non-blocking get or put operations may be attached to a single


Table A-9. Shared variable cache events

  Symbolic name               Event type   vararg arguments
  GASP_UPC_CACHE_MISS         Atomic       size_t n, size_t n_lines
  GASP_UPC_CACHE_HIT          Atomic       size_t n
  GASP_UPC_CACHE_INVALIDATE   Atomic       size_t n_dirty

gasp_upc_nb_handle_t instance. When a sync callback is received, the tool code shall assume all get and put operations for the corresponding handle in the sync callback have been retired. The implementation may pass the handle GASP_NB_TRIVIAL to GASP_UPC_NB_{PUT,GET}_INIT to indicate the operation was completed synchronously in the initiation interval. The tool should ignore any DATA or SYNC event callbacks with the handle GASP_NB_TRIVIAL.

A.5.4.7 Shared variable cache events

Table A-9 shows events related to shared variable cache events. The GASP_UPC_CACHE events may be sent for UPC runtime systems containing a software cache after a corresponding get or put start event but before a corresponding get or put end event (including non-blocking communication events). UPC runtimes using write-through cache systems may send GASP_UPC_CACHE_MISS events for each corresponding put event.

The size_t n argument for the MISS and HIT events shall indicate the amount of data read from the cache line for the particular cache hit or cache miss. The n_lines argument of the GASP_UPC_CACHE_MISS event shall indicate the number of bytes brought into the cache as a result of the miss (in most cases, the line size of the cache). The n_dirty argument of the GASP_UPC_CACHE_INVALIDATE shall indicate the number of dirty cache lines that were written back to shared memory due to a cache line invalidation.

A.5.4.8 Collective communication events

Table A-10 shows events related to collective communication. The events in Table A-10 stem directly from the UPC collective library functions defined in the UPC


Table A-10. Collective communication events

  Symbolic name             Event type   vararg arguments
  GASP_UPC_ALL_BROADCAST    Start, End   gasp_upc_PTS_t *dst, gasp_upc_PTS_t *src, size_t nbytes, int upc_flags
  GASP_UPC_ALL_SCATTER      Start, End   gasp_upc_PTS_t *dst, gasp_upc_PTS_t *src, size_t nbytes, int upc_flags
  GASP_UPC_ALL_GATHER       Start, End   gasp_upc_PTS_t *dst, gasp_upc_PTS_t *src, size_t nbytes, int upc_flags
  GASP_UPC_ALL_GATHER_ALL   Start, End   gasp_upc_PTS_t *dst, gasp_upc_PTS_t *src, size_t nbytes, int upc_flags
  GASP_UPC_ALL_EXCHANGE     Start, End   gasp_upc_PTS_t *dst, gasp_upc_PTS_t *src, size_t nbytes, int upc_flags
  GASP_UPC_ALL_PERMUTE      Start, End   gasp_upc_PTS_t *dst, gasp_upc_PTS_t *src, gasp_upc_PTS_t *perm, size_t nbytes, int upc_flags
  GASP_UPC_ALL_REDUCE       Start, End   gasp_upc_PTS_t *dst, gasp_upc_PTS_t *src, int upc_op, size_t nelems, size_t blk_size, void *func, int upc_flags, gasp_upc_reduction_t type

specification. The vararg arguments for each event callback mirror those defined in the UPC language specification.
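The handle semantics of Section A.5.4.6 imply straightforward tool-side bookkeeping: each INIT End callback records an outstanding operation under its handle, a SYNC End callback retires every operation attached to that handle, and GASP_NB_TRIVIAL is ignored throughout. The sketch below is illustrative only (not PPW code): gasp_upc_nb_handle_t is modeled as a plain int, GASP_NB_TRIVIAL as a sentinel value, and the handle table as a fixed-size array.

```c
#include <assert.h>

/* Hypothetical stand-ins: in a real implementation the handle type is
   opaque and GASP_NB_TRIVIAL is supplied by the UPC compiler. */
typedef int gasp_upc_nb_handle_t;
#define GASP_NB_TRIVIAL ((gasp_upc_nb_handle_t)-1)

#define MAX_OPS 64
static gasp_upc_nb_handle_t pending[MAX_OPS];
static int n_pending = 0;

/* Called from the GASP_UPC_NB_{GET,PUT}_INIT End callback: track the
   operation unless it already completed synchronously. */
void on_nb_init_end(gasp_upc_nb_handle_t h) {
    if (h == GASP_NB_TRIVIAL)   /* trivial handle: nothing to track */
        return;
    pending[n_pending++] = h;
}

/* Called from the GASP_UPC_NB_SYNC End callback: all outstanding
   operations attached to this handle must be considered retired.
   Returns the number of operations retired. */
int on_nb_sync_end(gasp_upc_nb_handle_t h) {
    int retired = 0;
    if (h == GASP_NB_TRIVIAL)   /* spec: tool ignores trivial syncs */
        return 0;
    for (int i = 0; i < n_pending; ) {
        if (pending[i] == h) {
            pending[i] = pending[--n_pending];  /* swap-remove */
            retired++;
        } else {
            i++;
        }
    }
    return retired;
}
```

Because several gets and puts may share one handle, the sync callback must sweep the whole table rather than stop at the first match.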


Table A-10. Collective communication events (continued)

  Symbolic name                Event type   vararg arguments
  GASP_UPC_ALL_PREFIX_REDUCE   Start, End   gasp_upc_PTS_t *dst, gasp_upc_PTS_t *src, int upc_op, size_t nelems, size_t blk_size, void *func, int upc_flags, gasp_upc_reduction_t type

For the reduction functions, the gasp_upc_reduction_t enum shall be provided by a UPC implementation and shall be defined as follows. The suffix to GASP_UPC_REDUCTION_ denotes the same type as specified in the UPC specification.

    typedef enum {
        GASP_UPC_REDUCTION_C,
        GASP_UPC_REDUCTION_UC,
        GASP_UPC_REDUCTION_S,
        GASP_UPC_REDUCTION_US,
        GASP_UPC_REDUCTION_I,
        GASP_UPC_REDUCTION_UI,
        GASP_UPC_REDUCTION_L,
        GASP_UPC_REDUCTION_UL,
        GASP_UPC_REDUCTION_F,
        GASP_UPC_REDUCTION_D,
        GASP_UPC_REDUCTION_LD
    } gasp_upc_reduction_t;

A.5.5 Header Files

UPC compilers shall distribute a pupc.h C header file with their GAS language implementations that contains function prototypes for the functions defined in Sections


A.5.2 and A.5.3. The pupc.h file shall be installed in a directory that is included in the UPC compiler's default search path.

All supported system events and associated gasp_upc_* types shall be defined in a gasp_upc.h file located in the same directory as the gasp.h file. System events not supported by an implementation shall not be included in the gasp_upc.h file. The gasp_upc.h header file may include definitions for implementation-specific events, along with brief documentation embedded in source code comments.

Compilers shall define a compiler-specific integral GASP_UPC_VERSION version number in gasp_upc.h that may be incremented when new implementation-specific events are added. Compiler developers are encouraged to use the GASP_X_Y naming convention for all implementation-specific events, where X is an abbreviation for their compilation system (such as BUPC) and Y is a short, descriptive name for each event.

Compilers that implement the pupc interface shall predefine the feature macro __UPC_PUPC__ to the value 1. The macro should be predefined whenever applications may safely #include <pupc.h>, invoke the functions it defines, and use the #pragma pupc directives, without causing any translation errors. The feature macro does not guarantee that GASP instrumentation is actually enabled for a given compilation, as some of the features might have no effect in non-instrumenting compilations.
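A portable application can use the __UPC_PUPC__ feature macro defensively, falling back to no-ops when the compiler does not implement the pupc interface. The sketch below is illustrative: it assumes pupc_control is the instrumentation on/off entry point declared in pupc.h (per the sections above), and the fallback stub exists only so the example builds everywhere.

```c
/* If the compiler implements the pupc interface it predefines
   __UPC_PUPC__ to 1 and pupc.h may be included safely; otherwise
   we substitute no-op stubs so the application still compiles. */
#ifdef __UPC_PUPC__
#include <pupc.h>
#else
/* Hypothetical no-op fallback, for illustration only. */
static void pupc_control(int on) { (void)on; }
#endif

/* Run setup work with measurement suspended, so initialization
   does not pollute the performance profile. */
int run_untimed_setup(void) {
    pupc_control(0);   /* suspend GASP measurement          */
    int work = 42;     /* ... setup we do not want profiled  */
    pupc_control(1);   /* resume GASP measurement            */
    return work;
}
```

Note that, as the text above states, even with __UPC_PUPC__ defined these calls may have no effect in a non-instrumenting compilation, so the guard protects compilation, not measurement.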


REFERENCES

[1] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface," Tech. Rep., Argonne National Laboratory, 1996.

[2] UPC Consortium, "UPC language specifications 1.2," http://www.gwu.edu/~upc/docs/upc_specs_1.2.pdf, accessed July 2010.

[3] University of California at Berkeley, "Berkeley UPC website," http://upc.lbl.gov, accessed July 2010.

[4] Quadrics Ltd, "Quadrics SHMEM programming manual," http://web1.quadrics.com/downloads/documentation/ShmemMan_6.pdf, accessed July 2010.

[5] Oak Ridge National Lab, "OpenSHMEM website," https://email.ornl.gov/mailman/listinfo/openshmem, accessed July 2010.

[6] L. DeRose and B. Mohr, "Tutorial: Principles and practice of experimental performance measurement and analysis of parallel applications," in Supercomputing, November 15-21, 2003.

[7] J. Labarta, S. Girona, V. Pillet, T. Cortes, and L. Gregoris, "Dip: A parallel program development environment," in 2nd International Euro-Par Conference on Parallel Processing, August 26-29, 1996.

[8] Intel Corporation, "Intel cluster tools website," http://www.intel.com/software/products/cluster, accessed July 2010.

[9] A. Chan, W. Gropp, and E. Lusk, "Scalable log files for parallel program trace data - draft," ftp://ftp.mcs.anl.gov/pub/mpi/slog2/slog2-draft.pdf, accessed July 2010.

[10] M. T. Heath and J. A. Etheridge, "Visualizing the performance of parallel programs," IEEE Software, vol. 8, no. 5, pp. 29-39, 1991.

[11] P. J. Mucci, "Tutorial: Dynaprof," in Supercomputing, November 15-21, 2003.

[12] J. S. Vetter and M. O. McCracken, "Statistical scalability analysis of communication operations in distributed applications," in Principles and Practice of Parallel Programming, June 18-20, 2001.

[13] L. DeRose and D. A. Reed, "SvPablo: A multi-language performance analysis system," in 10th International Conference on Computer Performance Evaluation: Modeling Techniques and Tools, September 14-18, 1998.


[14] J. Mellor-Crummey, R. Fowler, and G. Marin, "HPCView: A tool for top-down analysis of node performance," The Journal of Supercomputing, vol. 23, no. 1, pp. 81-104, 2002.

[15] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam, and T. Newhall, "The Paradyn parallel performance measurement tools," IEEE Computer, vol. 28, no. 11, pp. 37-46, 1995.

[16] B. Mohr, F. Wolf, B. Wylie, and M. Geimer, "KOJAK - a tool set for automatic performance analysis of parallel programs," in 9th International Euro-Par Conference on Parallel Processing, August 26-29, 2003.

[17] S. S. Shende and A. D. Malony, "TAU: The TAU parallel performance system," International Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287-331, 2006.

[18] A. Leko, H. Sherburne, H. Su, B. Golden, and A. D. George, "Practical experiences with modern parallel performance analysis tools: an evaluation," Tech. Rep., University of Florida, accessed July 2010.

[19] K. London, S. Moore, P. Mucci, K. Seymour, and R. Luczak, "The PAPI cross-platform interface to hardware performance counters," in Department of Defense Users Group Conference, June 18-21, 2001.

[20] University of Florida, "Parallel Performance Wizard (PPW) tool project website," http://ppw.hcs.ufl.edu, accessed July 2010.

[21] S. S. Shende, The Role of Instrumentation and Mapping in Performance Measurement, Ph.D. thesis, University of Oregon, 2001.

[22] Hewlett-Packard Development Company, L.P., "HP UPC website," http://h30097.www3.hp.com/upc/, accessed July 2010.

[23] W. E. Nagel, "Vampir tool website," http://www.vampir.eu, accessed July 2010.

[24] George Washington University, "GWU UPC NAS 2.4 benchmarks," http://www.gwu.edu/~upc/download.html, accessed July 2010.

[25] J. Stone, "Tachyon parallel/multiprocessor ray tracing system," http://jedi.ks.uiuc.edu/~johns/raytracer/, accessed July 2010.

[26] A. Jacobs, G. Cieslewski, C. Reardon, and A. D. George, "Multiparadigm computing for space-based synthetic aperture radar," in International Conference on Engineering of Reconfigurable Systems and Algorithms, July 14-17, 2008.

[27] Argonne National Laboratory, "MPICH2 website," http://www.mcs.anl.gov/research/projects/mpich2/, accessed July 2010.


[28] T. Fahringer, M. Gerndt, B. Mohr, F. Wolf, G. Riley, and J. L. Traff, "Knowledge specification for automatic performance analysis - revised version," Tech. Rep., APART Working Group, August 2001.

[29] APART Working Group, "Automatic Performance Analysis: Real Tools (APART) IST working group website," http://www.fz-juelich.de/apart, accessed July 2010.

[30] C. Coarfa, J. Mellor-Crummey, N. Froyd, and Y. Dotsenko, "Scalable analysis of SPMD codes using expectations," in International Conference on Supercomputing, June 16-20, 2007.

[31] K. A. Huck and A. D. Malony, "PerfExplorer: A performance data mining framework for large-scale parallel computing," in Supercomputing, November 12-18, 2005.

[32] K. Furlinger and M. Gerndt, "Automated performance analysis using ASL performance properties," in Workshop on State-of-the-Art in Scientific and Parallel Computing, June 18-21, 2006.

[33] J. Jorba, T. Margalef, and E. Luque, "Search of performance inefficiencies in message passing applications with KappaPI-2 tool," in Lecture Notes in Computer Science, 2007, number 4699, pp. 409-419.

[34] M. Geimer, F. Wolf, B. J. N. Wylie, and B. Mohr, "Scalable parallel trace-based performance analysis," in PVM/MPI, September 17-20, 2006.

[35] L. Li and A. D. Malony, "Model-based performance diagnosis of master-worker parallel computations," in Lecture Notes in Computer Science, 2006, number 4128, pp. 35-46.

[36] J. K. Hollingsworth, Finding Bottlenecks in Large Scale Parallel Programs, Ph.D. thesis, University of Wisconsin-Madison, 1994.

[37] I. Dooley, C. Mei, and L. Kale, "NoiseMiner: An algorithm for scalable automatic computational noise and software interference detection," in 13th International Workshop on High-Level Parallel Programming Models and Supportive Environments of IPDPS, April 14-18, 2008.

[38] J. S. Vetter and P. H. Worley, "Asserting performance expectations," in Supercomputing 02, November 16-22, 2002.

[39] I. Chung, G. Cong, and D. Klepacki, "A framework for automated performance bottleneck detection," in 13th International Workshop on High-Level Parallel Programming Models and Supportive Environments of IPDPS, April 14-18, 2008.

[40] M. Gerndt, B. Mohr, and J. L. Traff, "A test suite for parallel performance analysis tools," in Concurrency and Computation: Practice and Experience, 2007, number 19, pp. 1465-1480.

[41] IBM Corporation, "X10 website," http://x10-lang.org/, accessed July 2010.


BIOGRAPHICAL SKETCH

Hung-Hsun Su is a Ph.D. graduate from the Department of Electrical and Computer Engineering (with a minor in computer engineering from the Department of Computer and Information Science and Engineering) at the University of Florida. He received two B.S. degrees, in computer science and biochemistry, from the University of California, Los Angeles in 1996, an M.S. in biochemistry from the University of Southern California in 1999, and an M.S. in computer science from the University of Southern California in 2002. His research focuses on the development and performance analysis of high-performance parallel applications; the realization of high-performance portable communication systems; and performance evaluation of high-performance systems.