Group Title: Parallel performance wizard : a performance analysis tool for partitioned global-address-space programming models
Title: Poster abstract
Full Citation
Permanent Link:
 Material Information
Title: Poster abstract
Physical Description: Archival
Language: English
Creator: Su, H.
Publisher: Su et al.
Place of Publication: Gainesville, Fla.
 Record Information
Bibliographic ID: UF00094705
Volume ID: VID00002
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.


This item has the following downloads: SC06-GASP-abstract (PDF)
Full Text

Parallel Performance Wizard: A Performance Analysis Tool for
Partitioned Global-Address-Space Programming Models
Hung-Hsun Su¹, Adam Leko¹, Dan Bonachea², Bryan Golden¹, Hans Sherburne¹, Max Billingsley III¹, Alan George¹
University of Florida¹ / UC Berkeley²

Scientific programmers must optimize the total time-to-solution, the combination of software development/refinement time and actual
execution time. The increasing complexity at all levels of supercomputing architectures, coupled with advancements in sequential
performance and a growing degree of hardware parallelism, often leads to programs that do not yield an expected performance level.
As a result, programmers must perform an iterative analysis and optimization process, which places the bulk of the time-to-solution
cost into the software development and tuning phase. Performance analysis tools facilitate this process by alleviating the work
required to determine the root cause of performance degradation. While several tools are available to support parallel programs
written using the well-known message-passing model, there is insufficient tool support for programs developed using Global-Address-
Space (GAS) programming models that are gaining popularity. Examples of GAS models include the Unified Parallel C (UPC), Co-
array Fortran, and Titanium languages and the SHMEM library. This poster introduces the Parallel Performance Wizard (PPW) tool
for analysis of GAS applications, along with the Global Address Space Performance (GASP) interface that enables tool support for
programs using GAS models on a variety of machines and implementations.

A major factor in the success of a performance analysis tool is the performance data it gathers and the techniques used to gather this
data in the instrumentation and measurement process. Techniques such as source instrumentation, binary instrumentation, and the use
of wrapper libraries have been successfully deployed for programs based on the message-passing programming model. Unfortunately,
these techniques are not sufficient for programs based on the GAS model, due to the wide range of GAS implementation techniques
and difficulties associated with one-sided memory operations, aggressive parallel compiler optimizations, and other aspects of the
global address memory model. The Global Address Space Performance (GASP) interface resolves this issue by specifying the
interaction between the user program, compiler and the performance analysis tool, and thereby allows GAS model implementers to
avoid optimization interference stemming from instrumentation. GASP permits tool developers to support GAS programs on any platform and
language implementing the interface, which employs a simple callback mechanism applicable to virtually any language.
These calls are then handled by the performance tool to record the desired performance data.
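The callback pattern described above can be sketched as follows. This is a minimal illustration in Python, not the actual GASP C interface: the event names, the `event_notify` signature, and the `ToolCallbacks` class are all assumptions made for the example, standing in for the runtime-issued begin/end callbacks that a real tool would handle.

```python
from collections import defaultdict

# Hypothetical event tags, loosely modeled on one-sided GAS operations;
# the real GASP interface defines its own C-level event identifiers.
GET_START, GET_END = "get_start", "get_end"
PUT_START, PUT_END = "put_start", "put_end"

class ToolCallbacks:
    """A toy performance tool: the runtime invokes event_notify() at
    instrumentation points, and the tool accumulates timing data."""

    def __init__(self):
        self.open_events = {}              # (thread, op) -> start timestamp
        self.totals = defaultdict(float)   # op -> accumulated seconds
        self.counts = defaultdict(int)     # op -> completed-operation count

    def event_notify(self, thread, tag, timestamp):
        op, _, phase = tag.rpartition("_")
        if phase == "start":
            self.open_events[(thread, op)] = timestamp
        else:  # "end": close out the matching start event
            start = self.open_events.pop((thread, op))
            self.totals[op] += timestamp - start
            self.counts[op] += 1

tool = ToolCallbacks()
# Simulate a runtime issuing callbacks around two one-sided operations.
tool.event_notify(0, GET_START, 1.00)
tool.event_notify(0, GET_END,   1.25)
tool.event_notify(0, PUT_START, 2.00)
tool.event_notify(0, PUT_END,   2.10)
print(tool.counts["get"], round(tool.totals["put"], 2))
```

Because the runtime (rather than the tool) decides where the callbacks fire, the compiler remains free to optimize around the instrumentation points, which is the interference-avoidance property noted above.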

GASP support has been successfully implemented in the Berkeley UPC compiler. Figures 1 and 2, respectively, show the profiling and
tracing overhead with Berkeley UPC and the GASP PPW instrumentation module.

PPW exists to maximize user productivity while assisting users in analyzing and correcting performance problems in their parallel
programs. The tool is geared towards programs written in GAS languages, employing support for the GASP interface to provide
functionality for all implementations supporting GASP.

Using data gathered from the GASP interface, PPW provides visualizations and semi-automatic analyses that facilitate the
optimization process. With full support for source-line correlation, the visualizations include: profiling table and call-tree diagram
(Figure 3), which provide a high-level overview of program performance; a timeline view provided by export of data to Jumpshot
(Figure 4) and other popular trace viewers for a detailed view of program performance; and viewers for shared-memory distribution
and communication volume (Figure 5) that aid in understanding the one-sided communication patterns that typify GAS applications.
Furthermore, PPW includes mechanisms for automatic detection of possible performance bottlenecks and in some cases can provide
practical hints on how to remove these bottlenecks.
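One simple heuristic of the kind such bottleneck detection can apply is flagging operations whose worst-case time dwarfs their mean time, a common symptom of load imbalance or contention. The sketch below is illustrative only; the data layout, the threshold, and the operation names are assumptions, not PPW's actual analysis.

```python
def flag_bottlenecks(profile, ratio_threshold=10.0):
    """Flag operations whose max time far exceeds their mean time.

    profile: dict mapping operation name -> (total_seconds, count, max_seconds)
    Returns a list of (operation, max/mean ratio) suspects, worst first.
    """
    suspects = []
    for op, (total, count, max_t) in profile.items():
        if count == 0:
            continue
        mean = total / count
        if mean > 0 and max_t / mean >= ratio_threshold:
            suspects.append((op, max_t / mean))
    return sorted(suspects, key=lambda s: s[1], reverse=True)

# Illustrative data: the wait operation's worst case is ~50x its mean,
# suggesting one call was stalled far longer than the rest.
profile = {
    "upc_memget": (2.0, 1000, 0.004),
    "upc_wait":   (1.0,  100, 0.500),
}
print(flag_bottlenecks(profile))
```

A real tool would combine several such heuristics with source-line correlation so that a flagged operation points the user back to the responsible statement.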

With the Parallel Performance Wizard tool and the Global Address Space Performance interface, users can effectively and
productively analyze the performance of their GAS language programs, ultimately improving the total time-to-solution for their
computing problem.

Figure 1: Berkeley UPC GASP Profiling Overhead. Results are for NAS benchmark 2.4 class A unless otherwise noted. Series: I, I (local), I+M, I+M (PAPI), I+M (PAPI, Class C).

Figure 2: Berkeley UPC GASP Tracing Overhead. I - instrumentation only (empty calls). I+M - actual events recorded through preliminary measurement layer. PAPI - measurement layer records PAPI hardware counter events.

Figure 3: PPW call-tree diagram. This display shows profile data alongside the original source code for a UPC cryptanalysis application.

Figure 5: PPW communication volume viewer. This visualization
graphically illustrates the communication pattern from the UPC
implementation of the NAS FT benchmark (v2.4).


Figure 4: SLOG-2 export viewed in the Jumpshot viewer. Right-clicking on events in the timeline brings up detailed
information about that event, including source-code information.

