Title: Parallel Performance Wizard: an infrastructure and tool for analysis of parallel application performance
Permanent Link: http://ufdc.ufl.edu/UF00094692/00001
 Material Information
Title: Parallel Performance Wizard: an infrastructure and tool for analysis of parallel application performance
Physical Description: Book
Language: English
Creator: Su, Hung-Hsun
Publisher: Su, Hung-Hsun
Place of Publication: Gainesville, Fla.
Copyright Date: 2007
 Record Information
Bibliographic ID: UF00094692
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

Full Text



Parallel Performance Wizard: An Infrastructure and Tool for Analysis of Parallel Application Performance

Presenter: Hung-Hsun Su
Advisor: Dr. Alan D. George

Electrical and Computer Engineering Dept., University of Florida






Need for Parallel Performance Analysis Tool


* Parallel applications are constantly being developed in many scientific fields

* Parallel programming models supply application writers with the means to develop such applications in a given environment

* Unfortunately, the added complexity of the environment and model makes it more difficult to optimize the application to achieve good performance

* Performance Analysis Tools (PATs) increase productivity by making the application optimization process simpler for the user











Role of Performance Analysis Tool


[Figure: role of a performance analysis tool - the original application passes through runtime performance data gathering, then data processing and analysis, and finally data and result presentation.]



Need for Generalized Tool Infrastructure


* Development of a performance analysis tool is a time-consuming process (it takes years to develop one)

* Quite a few performance analysis tools exist

* However, the majority of them support the Message Passing Interface (MPI), with very few supporting other models such as those in the Partitioned Global Address Space (PGAS) family

* One of the reasons for the limited model support is that these tools were designed and developed specifically to target a single model
  - The tool is too tightly coupled to that model, making it cumbersome to add new model support

* A generalized tool infrastructure would help in this aspect








Properties of a Generalized Infrastructure

* Uses a generic operation type abstraction
  - Each model construct is mapped to a generic operation type
  - Tool is designed to work largely with generic operation types
  - Components that only use generic operation types are model-independent (i.e. reusable across models)

* The goal is to minimize the number of model-dependent components




Generic operation types and example constructs in each category:

  Data exchange (P2P):         One-sided (put, get, fence); Two-sided (send, receive, wait)
  Pair-wise synchronization:   Lock manipulation; Wait on remote (spin lock, atomic swap, etc.)
  Group-wise synchronization:  Barrier; Collectives
  Local processing:            Work distribution (for-all); User functions & I/O operations
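To make the mapping concrete, here is a minimal illustrative sketch in C (not PPW source; the function name and construct list are hypothetical) of how model-specific constructs could be classified into generic operation types:

    #include <stdio.h>
    #include <string.h>

    /* Generic operation types used by the model-independent components. */
    typedef enum {
        OP_PUT, OP_GET, OP_SEND, OP_RECEIVE,      /* data exchange (P2P) */
        OP_LOCK, OP_WAIT_REMOTE,                  /* pair-wise sync      */
        OP_BARRIER, OP_COLLECTIVE,                /* group-wise sync     */
        OP_WORK_DISTRIBUTION, OP_USER_FUNCTION    /* local processing    */
    } generic_op_t;

    /* Map a UPC construct name to its generic type; other models (MPI,
     * SHMEM) would supply their own small mapping tables, leaving the
     * rest of the tool untouched. */
    generic_op_t map_upc_construct(const char *name) {
        if (strcmp(name, "upc_memput") == 0)  return OP_PUT;
        if (strcmp(name, "upc_memget") == 0)  return OP_GET;
        if (strcmp(name, "upc_lock") == 0)    return OP_LOCK;
        if (strcmp(name, "upc_barrier") == 0) return OP_BARRIER;
        if (strcmp(name, "upc_forall") == 0)  return OP_WORK_DISTRIBUTION;
        return OP_USER_FUNCTION;   /* default: treat as local processing */
    }

    int main(void) {
        printf("upc_lock maps to type %d\n", (int)map_upc_construct("upc_lock"));
        return 0;
    }

Only the small mapping table is model-dependent; everything written against generic_op_t stays reusable across models.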








PPW High-level Framework


[Figure: PPW high-level framework - Instrumentation feeds the Measurement Unit (MU); the Performance Data Manager (PDM) stores data for Analysis and for Presentation through the Visualization Manager (VM); components are marked as model-independent or model-dependent.]









Instrumentation-Measurement Interface


[Figure: several Instrumentation Units (Berkeley UPC, HP UPC, MPI MPICH, ...) - model-dependent but tool-independent - connect through the Instrumentation-Measurement Interface (GASP) to a single Measurement Unit (MU), which is model-independent but tool-dependent.]

* Different instrumentation techniques (adding code to collect performance data) are applicable to different programming model implementations
* However, one generic measurement unit is sufficient to record the data
* A standardized instrumentation-measurement interface (a.k.a. GASP) facilitates the transition from multiple instrumentation units to a single measurement unit, as in the sketch below
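A minimal sketch of such an interface boundary, using simplified placeholder names rather than the exact published GASP types and signatures: the model implementation emits begin/end event callbacks around each operation, and a single tool-side function records them.

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    #define EVT_PUT_TAG 1                        /* hypothetical event tag */
    typedef enum { EVT_START, EVT_END } evttype_t;

    /* Implemented once by the tool-side measurement unit (MU); this stub
     * just prints the event it would normally timestamp and record. */
    static void perf_event_notify(unsigned int tag, evttype_t type,
                                  const char *file, int line) {
        printf("event %u %s at %s:%d\n", tag,
               type == EVT_START ? "start" : "end", file, line);
    }

    /* Each model implementation surrounds its operations with the same
     * callback pair; e.g. a compiler-instrumented one-sided put: */
    static void instrumented_put(void *dst, const void *src, size_t n) {
        perf_event_notify(EVT_PUT_TAG, EVT_START, __FILE__, __LINE__);
        memcpy(dst, src, n);   /* stands in for the real remote transfer */
        perf_event_notify(EVT_PUT_TAG, EVT_END, __FILE__, __LINE__);
    }

    int main(void) {
        char dst[8], src[8] = "data";
        instrumented_put(dst, src, sizeof src);
        return 0;
    }

Because every instrumentation unit speaks this one callback vocabulary, swapping in a new model implementation does not require touching the measurement unit.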








PPW Model Support
PPW infrastructure was first implemented to support Berkeley UPC
  - Took approximately 1-2 years to develop (sans bottleneck detection)
  - Supports one-sided transfer, global synchronization, lock, and other operation types
Quadrics SHMEM and MPICH MPI support was then quickly added
  - Took about 3-6 months to complete
  - Instrumentation provided via the PSHMEM/PMPI interface with calls to GASP (see the sketch below)
  - Majority of components remained unchanged
  - Minor modifications made to the measurement unit and visualization unit to support collectives and two-sided transfers
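For MPI, the standard profiling interface (PMPI) makes this kind of instrumentation possible without compiler support: the tool redefines an MPI call and forwards to the PMPI entry point. A sketch, assuming an MPI-3 const-correct signature; notify_send_begin/end are hypothetical stand-ins for the GASP event calls:

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical tool-side hooks standing in for GASP event calls. */
    static void notify_send_begin(int dest, int count) {
        printf("send begin: dest=%d count=%d\n", dest, count);
    }
    static void notify_send_end(int dest, int count) {
        printf("send end:   dest=%d count=%d\n", dest, count);
    }

    /* The MPI profiling interface lets a tool re-define MPI_Send and
     * forward to the real implementation via PMPI_Send; linking this
     * file into an MPI application instruments every send with no
     * source changes. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm) {
        notify_send_begin(dest, count);
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        notify_send_end(dest, count);
        return rc;
    }

This is why the SHMEM and MPI ports were quick: only thin wrappers like this had to be written, while the measurement unit and most other components were reused as-is.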




Why do Automatic Analysis of Data?


* With a long-running and/or complex application, the amount of performance data available can be overwhelming
* Automatic performance analysis helps by distilling a small set of useful information out of a much larger data set




[Figure: screenshot of a densely populated trace timeline (time in seconds), illustrating how overwhelming raw performance data can be.]









Automatic Performance Analysis


* A pattern is a description of a program behavior
  - Expert programmers know how to recognize performance patterns where an everyday programmer may not
  - A fair number of performance patterns have been generated over the years by researchers

* Automatic performance analysis of an application
  - Performs a series of tests to recognize patterns
  - Classifies each pattern based on the results of the tests
  - Suggests possible solutions to remove the bottleneck (to some extent)


[Figure: analysis flow for an application - tests to recognize patterns, classification of each pattern based on the test results, then suggestions to remove the bottleneck.]





Pattern Categorization


* Patterns are categorized into one of the following three levels
  - Experiment set level
    - Compares performance across a set of experiments
    - Pattern example: poor scalability of code, detected when Ratio = (Time(p) x p) / (Time(q) x q) < 1 for p < q, i.e. non-ideal application speedup for the application/region; suggested resolution: algorithmic change for the application/region
  - Application level
    - Provides an overview of overall application performance
    - Pattern example: lots of small data transfers, detected when Count(data transfer) / Time(data transfer) >= THRESHOLD; suggested resolution: aggregate data transfers
  - Node level
    - Enables detailed analysis of performance data that helps to pinpoint the exact location and cause of the bottleneck
    - Pattern example: 2nd invocation of send on node 2, line 10 is a late sender

The two ratio tests are written out below.
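Written out in full (the numeric example is hypothetical), the experiment-set-level test compares the total cost of two runs on p < q processors:

\[
\text{Ratio} = \frac{\text{Time}(p) \times p}{\text{Time}(q) \times q} < 1 \;\Rightarrow\; \text{non-ideal application speedup}
\]

For instance, if Time(4) = 30 s and Time(8) = 20 s, then Ratio = (30 x 4) / (20 x 8) = 0.75 < 1: doubling the processor count less than halved the run time. The application-level test flags transfer granularity:

\[
\frac{\text{Count}(\text{data transfer})}{\text{Time}(\text{data transfer})} \ge \text{THRESHOLD} \;\Rightarrow\; \text{lots of small data transfers}
\]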









Node Level Patterns


Each node performs independent analysis using local data and a small amount of data from other nodes when needed
  - Each node tries to minimize its execution time
  - Observed application execution time = execution time of the longest-running node
Pattern tests aim to detect deviation from this optimal situation (see the sketch below)
  - Excessiveness analysis: large number of operation occurrences
    - Frequency evaluation (lots of operations in a short time?)
    - Excessive operation evaluation (could the operation be eliminated?)
  - Delay analysis: long-running operations
    - Baseline approach (actual time >> expected time?)
    - Variant approach (min_time << avg_time or max_time >> avg_time?)
Patterns are defined in terms of the generic operation types and are thus applicable to any model
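A minimal sketch of the two delay tests in C (threshold values and names are hypothetical, not PPW source):

    #include <stdbool.h>
    #include <stddef.h>

    #define DELAY_FACTOR   2.0   /* "actual >> expected" cutoff        */
    #define VARIANT_FACTOR 2.0   /* "max >> avg" / "min << avg" cutoff */

    /* Baseline approach: the operation took much longer than an
     * expected time derived from, e.g., a benchmarked cost model. */
    bool baseline_delay(double actual_ns, double expected_ns) {
        return actual_ns > DELAY_FACTOR * expected_ns;
    }

    /* Variant approach: instance i deviates strongly from the average
     * over all instances of the same operation at the same call site. */
    bool variant_delay(const double *times_ns, size_t n, size_t i) {
        double avg = 0.0;
        for (size_t k = 0; k < n; k++) avg += times_ns[k];
        avg /= (double)n;
        return times_ns[i] > VARIANT_FACTOR * avg ||
               times_ns[i] < avg / VARIANT_FACTOR;
    }

Because the tests take only operation timings as input, they apply unchanged to put/get, send/receive, lock, or collective operations of any model.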








Node Level Detection Mechanism


[Figure: distributed detection mechanism - each node runs local analysis over its own trace data, exchanging trace data (local, all) with other nodes when needed; a global analysis step merges the per-node results into a list of potential bottlenecks.]







Example Node-Level Pattern: Lock Delay


[Figure: annotated timeline of the lock-delay pattern - lock, unlock, send, and receive events on several nodes, with node 2's lock acquisition at line 2 delayed while other nodes hold the lock.]


Node 2's Potential Bottleneck List
  - Locks: Lock at Node 2, Line 2
  - A-2-A

Node 2's Local Trace Records
  - Lock (x) at line 2, Time: 10-100
  - Lock (x) at line 5, Time: 100-120
  - Lock (x) at line 2, Time: 120-140
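As a worked illustration of the variant approach from the node-level pattern tests (using a hypothetical deviation threshold of 2), the three lock acquisitions above last 90, 20, and 20 time units, so the first instance stands out:

\[
\bar{t} = \frac{90 + 20 + 20}{3} \approx 43.3, \qquad \frac{t_1}{\bar{t}} = \frac{90}{43.3} \approx 2.08 > 2
\]

hence "Lock at Node 2, Line 2" lands on the potential bottleneck list.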









Example Analysis Result: Lock Delay


Total Time         = 7.75E07 ns
Computation Time   = 2.51E07 ns, Ratio = 32.45%
Communication Time = 0.00E00 ns, Ratio = 0.00%, Count = 0.00E00,
                     Bandwidth = 0.00 MB/s [low bandwidth means lots of small data transfers]
Global Sync Time   = 9.52E06 ns, Ratio = 12.28%
P2P Sync Time      = 4.28E07 ns, Ratio = 55.27%
Comm / Comp Ratio  = 0.0000 [low number means more work done]
Sync / Comp Ratio  = 2.0821 [low number means low overhead]
Btnk Time          = 5.22E07 ns, Ratio = 67.36% [portion of program with performance bottlenecks]

[Annotation: the ratios show that lots of time was lost to data transfer and synchronization.]

XXXXXXXXXX BOTTLENECKS (S) XXXXXXXXXXX
Found total of 7 filtered bottleneck(s)

--- P2P Operations (S) ---

Node#2, Line#2, upc_lock(UPC), T(avg) = 2.40E6, T(exp) = 5907, 5 Call(s), Ratio = 4.06E2, 15.45% Program, 15.42% Degradation
---> [Instance#1] Ratio = 3.78E2, 0.0020 s --> 0.0043 s, Duration = 0.0022
  --> [Node#0-Line#3] : Wait for lock to be available from specified node
  --> [Node#3-Line#3] : Wait for lock to be available from specified node

XXXXXXXXXX BOTTLENECKS (E) XXXXXXXXXXX

[Annotations: at node 2, upc_lock at line 2 (1st occurrence) executed much slower than expected; further trace analysis reveals node 2 waits on nodes 0 and 3 to release the lock.]








Conclusions


Parallel Performance Wizard is an infrastructure designed to support multiple parallel programming models with ease
  - Uses the generic operation type abstraction, which improves the reusability of the system components

A new automatic performance analysis approach is currently being developed and tested
  - Captures known performance patterns at one of the three levels
  - Employs a distributed detection method to improve execution time and minimize data transfer among nodes
  - Potential to support multi-model and multi-level analysis

A working implementation of PPW is now available for UPC, SHMEM and MPI
  - For more information see http://ppw.hcs.ufl.edu







Acknowledgements


* Department of Defense
  - Funding the PPW project

* Dr. Alan D. George
  - Advisor for my research

* Adam Leko, Max Billingsley, Bryan Golden, Hans Sherburne [U. of Florida]
  - Design discussion of infrastructure, low-level design and implementation, visualization generation

* Dan Bonachea [U.C. Berkeley]
  - Berkeley GASP discussion and implementation










Supplement Slide 1: Instrumentation-Measurement Interface Overhead


[Figure: overhead of the instrumentation-measurement interface - stacked bars for the CG, MG, FT, and IS benchmarks in profile and trace modes, broken down into instrumentation, measurement (profiling), measurement (tracing), and PAPI components.]






Profile Filtering Performance Improvement

[Figure: profile filtering performance improvement - bars comparing the number of checked records and total time (ms) with filtering enabled.]



