
Citation
 Permanent Link: https://ufdc.ufl.edu/UFE0006633/00001

Material Information
 Title: A Parallel Direct Open-Shell Coupled-Cluster Singles and Doubles Algorithm
 Creator: YAU, ANTHONY D. (Author, Primary)
 Copyright Date: 2008

Subjects
 Subjects / Keywords: Algorithms, Bandwidth, Computer memory, Eggshells, Electrons, Input output, Orbitals, Permutations, Real time computing, Simulations (jstor)

Record Information
 Source Institution: University of Florida
 Holding Location: University of Florida
 Rights Management: Copyright Anthony D. Yau. Permission granted to University of Florida to digitize and display this item for nonprofit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
 Embargo Date: 12/18/2004
 Resource Identifier: 57731785 (OCLC)

A PARALLEL DIRECT OPEN-SHELL
COUPLED-CLUSTER SINGLES AND DOUBLES ALGORITHM
By
ANTHONY D. YAU
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2004
Copyright 2004
by
Anthony D. Yau
ACKNOWLEDGMENTS
I thank my family and friends for supporting me throughout the years; my strength
comes from them. I thank the faculty who have instructed me and the members of Rod
Bartlett's research group. I owe particular thanks to Dr. Erik Deumens for continued
motivation and guidance in scientific computing (and for teaching the first parallel
programming class that I attended) and to Dr. Victor Lotrich for insights into
implementing Coupled-Cluster theory. I especially thank Dr. Ajith Perera for extending
so much help (and patience) to me during all of my studies throughout graduate school.
The United States Department of Defense High Performance Computing
Modernization Program (HPCMP) supported all of this work. A grant under the
Programming Environment and Training (PET) program (GSA Contract No.
DAAD05-01C003) funded the initial parallelization effort, and the Common High
Performance Computing Software Support Initiative (CHSSI) program (project CBD03)
funded the research on the fine-grain parallel algorithm.
TABLE OF CONTENTS

Page

ACKNOWLEDGMENTS .......... iii
LIST OF TABLES .......... vii
LIST OF FIGURES .......... viii
ABSTRACT .......... x

CHAPTER

1 INTRODUCTION .......... 1

2 PARALLELIZING A SERIAL CCSD PROGRAM .......... 6

    Serial Profile .......... 6
    Bottleneck Categorization and Parallelization .......... 9
        Particle-Particle Ladders .......... 9
        Matrix Multiplication
        I/O-Dependent Routines
    Parallel Profile
    Top-Down Parallelization Considerations .......... 17

3 SPIN-UNRESTRICTED CCSD EQUATIONS

    Group 1: T2 Virtual Orbitals Coming from V .......... 23
    Group 2: T2 Virtual Orbitals Coming from V and T
        Set 1: One Virtual from V, One Virtual from T1 .......... 24
        Set 2: One Virtual from V, One Virtual from T2 .......... 26
    Group 3: T2 Virtual Orbitals Coming from T .......... 27
        Set 1: Rings and PH Ladders .......... 27
            Subset 1: AC/BD .......... 27
            Subset 2: DB/CA .......... 28
            Subset 3: CB/AD
        Set 2: HH Ladders from OOVV, OOOV, and OOOO .......... 30
    Group 4: T2 into T1
    Group 5: One-Particle Transformations
        Set 1: T1 Contributions .......... 32
        Set 2: T2 Contributions .......... 33

4 PARALLEL ALGORITHM .......... 35

    Virtual Parallel Computer .......... 35
    Communication Networks
    Data Management .......... 38
        Arrays stored in distributed memory .......... 38
        Arrays stored on disk .......... 42
    Parallel UHF CCSD .......... 42
        Direct Integral Phase .......... 44
            Design considerations .......... 44
            Algorithm .......... 46
        Stored Integral Phase
        One Particle Phase .......... 55
        Antisymmetrization and T2 Update .......... 57

5 DISCUSSION AND SIMULATION

    Discussion .......... 69
        General Comments .......... 69
        Load Balancing .......... 70
            Job boards .......... 70
            Data aggregation .......... 70
            Asynchronous I/O .......... 72
        Synchronization Delays
    Simulation .......... 74
        General Overview .......... 76
        Specific Technologies
        Calibration .......... 79
        Theoretical Examples .......... 83
        Runtime Simulations .......... 84
            DCEE* .......... 84
            Serine and phenylalanine .......... 88

6 CONCLUSION .......... 92

    Outlook .......... 93
        Gradients .......... 93
        Higher Excitation Clusters .......... 94
        Orbital Transformations and Space Reductions
    Conclusion

APPENDIX

A ACRONYMS AND ABBREVIATIONS .......... 97

B NOTATION AND CONVENTIONS .......... 99

    Operators and Matrix Elements .......... 99
    Operator Indices .......... 99

C ALGORITHM OVERVIEW

D ANATOMY OF A SHELL PAIR BLOCK .......... 102

E LOGICAL SHELLS AND BIN PACKING .......... 104

LIST OF REFERENCES .......... 110

BIOGRAPHICAL SKETCH .......... 112
LIST OF TABLES

Table                                                                        Page

2-1. The distribution of real time over high-level CCSD subroutines .......... 8
3-1. The RHF CCSD T1 contributions .......... 19
3-2. The RHF CCSD T2 contributions .......... 20
4-1. The direct integral processing loop summary .......... 48
4-2. The Direct Integral phase prologue memory map
4-3. The Direct Integral phase inner loop memory map
4-4. The Direct Integral phase epilogue memory map .......... 62
4-5. The Stored Integral phase Group 3 Set 1 Subset 1 memory map .......... 63
4-6. The Stored Integral phase Group 3 Set 1 Subset 2 memory map .......... 64
4-7. The Stored Integral phase Group 3 Set 1 Subset 3 memory map .......... 65
4-8. The Stored Integral phase Group 3 Set 2 memory map .......... 66
4-9. The One Particle phase FP T2aP memory map .......... 66
4-10. The One Particle phase FP T2aP (and same spin terms) memory map .......... 67
4-11. The One Particle phase FH and FR local transformations .......... 67
4-12. The R2 antisymmetrization and T2 update memory map .......... 68
5-1. The schedule of synchronization points .......... 75
5-2. The latencies and bandwidths of random disk reads with respect to file size .......... 81
5-3. The minimum resources for amino acids with 1024 AOs in UHF CCSD .......... 84
5-4. The number of PEs needed to surpass the initial parallel performance .......... 88
LIST OF FIGURES

Figure                                                                       Page

2-1. The distribution of real time over primary CCSD subroutines .......... 7
2-2. The breakdown of significant UHF subroutines .......... 8
2-3. The column assignments for each of 4 CPUs updating R2 .......... 12
2-4. The distribution scheme for matrix multiplies over CPU threads .......... 13
2-5. The parallel speedup for each major UHF CCSD subroutine and the CCSD
     executable .......... 15
2-6. The runtime profile of parallel CCSD with a UHF reference from DCEE* .......... 15
4-1. Communication network boundaries .......... 39
4-7. The memory heap management for various molecular systems .......... 40
4-2. The T2 and R2 memory layout .......... 43
4-8. The data flow in the Direct Integral phase .......... 49
4-3. The path of Group 3 Set 1 Subset 1 updates per OOVV integral .......... 53
4-4. The LD shell distribution schemes .......... 54
4-9. The data flow in the Stored Integral phase .......... 55
5-10. The binary tree approach to collective communications .......... 78
5-11. The local disk random read calibration data .......... 81
5-12. The calibration data for on-node messaging parameters .......... 82
5-13. The calibration data for off-node messaging parameters
5-14. The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast
      Ethernet .......... 85
5-15. The simulated performance curves for DCEE* on 2 to 128 PEs over Fast
      Ethernet .......... 86
5-16. The simulated performance curves for DCEE* on an SP switch .......... 86
5-17. The relative performance of the initial parallelization attempt compared to the
      new algorithm .......... 87
5-18. The absolute performance of the initial parallelization attempt compared to the
      new algorithm .......... 88
5-19. The performance curves for serine on 16 to 1024 PEs .......... 89
5-20. The performance curves for phenylalanine on 128 to 1024 PEs .......... 89
D-1. The spin blocks in an AO shell pair block
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
A PARALLEL DIRECT OPEN-SHELL
COUPLED-CLUSTER SINGLES AND DOUBLES ALGORITHM
By
Anthony D. Yau
December 2004
Chair: Rodney J. Bartlett
Major Department: Chemistry
Coupled-Cluster Singles and Doubles (CCSD) is a benchmark ab initio electron
correlation method. The biggest factor in preventing its widespread adoption for large
molecular systems is that the theory scales like N^4 in data and N^6 in computation (N
being the number of atomic orbital basis functions). Using simple factorization
techniques and data and task parallelization, a new approach to calculating open-shell
CCSD energies is available. For a molecular system with 50 electrons (25 alpha, 25 beta)
and 500 AO basis functions, this approach has the potential to reduce storage
requirements from 630 GB to 2.3 GB for a traditional UHF CCSD algorithm in a basis
set of molecular orbitals. The number of operations will increase, but the natural
parallelization of the data provides a clear and concise framework for parallelizing the
computations.
CHAPTER 1
INTRODUCTION
Coupled-Cluster (CC) theory is an ab initio electron correlation method that can be
thought of as infinite-order perturbation theory or a size-extensive approximation to the
full configuration interaction (CI) analysis [Bar95]. Given a one-particle reference
wavefunction Φ0, the CC wavefunction is defined as an exponential expansion of
excitation clusters, T, from the occupied orbitals to the unoccupied (virtual) orbitals. The
T operator is a particle excitation operator that naturally truncates at the number of
particles, P, in the system (Equation 1-2). Occupied orbitals (the hole space) are denoted
with the letters i, j, k, and l, while virtual orbitals (the particle space) are denoted with the
letters a, b, c, and d. Appendix A lists acronyms and abbreviations, and Appendix B
summarizes index conventions and matrix elements.
|\Psi_{CC}\rangle = \exp(T)\,|\Phi_0\rangle    (1-1)

T = \sum_{p=1}^{P} T_p, \qquad
T_p = \frac{1}{(p!)^2} \sum_{\substack{ij\ldots \\ ab\ldots}} t_{ij\ldots}^{ab\ldots}\, a^\dagger i\, b^\dagger j \cdots    (1-2)
The difficulty in calculating the full CI is that the data requirement scales like N^2P
and the computational requirement scales like N^(2P+2), in which N is the size of the basis
set space. To retain size extensivity and computational tractability, the excitation operator is
usually truncated at a small number of excitations. The most common truncation is at double
excitations, such that T = T1 + T2. As computers become more capable of massive
calculations, the number of excitations can increase, but the size of the molecular system
must decrease much more rapidly. Currently, the highest hand-coded truncation is at
pentuple excitations (i.e., T = T1 + T2 + ... + T5) [Mus02], but efforts to automate this process
are becoming more promising as distribution techniques for data and computations
become available [Bau02].
Using the standard two-particle electronic Hamiltonian, the CC energy is calculated
by contracting the elements of the Fock matrix (f_pq) and the two-electron Coulomb
integrals (⟨pq|rs⟩) with the T1 and T2 amplitudes. Uppercase orbital indices correspond
to alpha spin orbitals, and lowercase orbital indices correspond to beta spin orbitals. The
double-bar integrals in Equation 1-4 are antisymmetrized Coulomb-exchange integrals
(Appendix B). (This derivation assumes that the reference orbitals are orthonormal.)
E_{CC} = \langle \Phi_0 |\, H \exp(T)\, | \Phi_0 \rangle    (1-3)

E_{CC} = E_0 + \sum_{IA} f_{IA}\, t_I^A + \sum_{ia} f_{ia}\, t_i^a
+ \frac{1}{4} \sum_{IJAB} \langle IJ \| AB \rangle \left( t_{IJ}^{AB} + t_I^A t_J^B - t_I^B t_J^A \right)
+ \sum_{IjAb} \langle Ij | Ab \rangle \left( t_{Ij}^{Ab} + t_I^A t_j^b \right)
+ \frac{1}{4} \sum_{ijab} \langle ij \| ab \rangle \left( t_{ij}^{ab} + t_i^a t_j^b - t_i^b t_j^a \right)    (1-4)
The T amplitudes are calculated by iterating over a coupled set of equations that
project the Hamiltonian and wavefunction into a set of p-particle excitation spaces. The
CCSD approximation truncates T at double excitations, and is one of the benchmark
correlated methods in quantum chemistry today. The T1 and T2 equations come from the
projections in Equations 1-6 and 1-7, in which H_N is the normal-ordered Hamiltonian
that does not include the reference energy. Only the β and ββ spin equations are shown,
but the α, αα, and αβ spin equations are analogous to the former.
\exp(-T)\, H_N \exp(T)\, |\Phi_0\rangle = \Delta E_{CC}\, |\Phi_0\rangle    (1-5)

\langle \Phi_i^a |\, \exp(-T)\, H_N \exp(T)\, | \Phi_0 \rangle = 0    (1-6)

\langle \Phi_{ij}^{ab} |\, \exp(-T)\, H_N \exp(T)\, | \Phi_0 \rangle = 0    (1-7)
The Baker-Campbell-Hausdorff formula [Hal00] has the effect of reducing the T
equations to those terms that 1) contract directly with the Hamiltonian and 2) have no
more than four excitation amplitudes. The first effect is from disconnected terms being
cancelled by the commutation, and the second effect is from the two-particle Hamiltonian
having at most four indices from the two-electron integrals.
Computationally, the diagonal elements of the Fock matrix that contract with T are
grouped together, and define the input amplitudes for the next iteration. The T2 ββ
equations are shown in Equations 1-8 through 1-11. If T is initialized to 0, then the next
iteration will produce T2 = ⟨ab||ij⟩/D2 and T1 = 0, so for all intents and purposes, T2 can
be initialized to the second-order wavefunction and T1 can be set to 0.
0 = \langle ab \| ij \rangle + \sum_{c} f_{ac}\, t_{ij}^{cb} + \sum_{c} f_{bc}\, t_{ij}^{ac}
- \sum_{k} f_{ki}\, t_{kj}^{ab} - \sum_{k} f_{kj}\, t_{ik}^{ab} + \ldots    (1-8)

0 = \langle ab \| ij \rangle + \left( f_{aa} + f_{bb} - f_{ii} - f_{jj} \right) t_{ij}^{ab}
+ \sum_{c \neq a} f_{ac}\, t_{ij}^{cb} + \sum_{c \neq b} f_{bc}\, t_{ij}^{ac}
- \sum_{k \neq i} f_{ki}\, t_{kj}^{ab} - \sum_{k \neq j} f_{kj}\, t_{ik}^{ab} + \ldots    (1-9)

D_{ij}^{ab} = f_{ii} + f_{jj} - f_{aa} - f_{bb}    (1-10)

D_{ij}^{ab}\, t_{ij}^{ab} = \langle ab \| ij \rangle
+ \sum_{c \neq a} f_{ac}\, t_{ij}^{cb} + \sum_{c \neq b} f_{bc}\, t_{ij}^{ac}
- \sum_{k \neq i} f_{ki}\, t_{kj}^{ab} - \sum_{k \neq j} f_{kj}\, t_{ik}^{ab} + \ldots    (1-11)
Since the T amplitudes differentiate between occupied and virtual orbitals, it is
common to group integrals into the various space combinations such as occupied-
occupied (OO), occupied-virtual (OV), and virtual-virtual (VV). There are five such
combinations for the two-electron integrals: OOOO, OOOV, OOVV, OVVV, and
VVVV. If the virtual orbitals outnumber the occupied orbitals by a factor of 3, then the
VVVV and AO integrals will outnumber the OOOO integrals by factors of 81 and 256,
respectively. Most CC implementations calculate and store the OOOO, OOOV, and
usually OOVV integrals before starting the CC iterations. Due to programming
complexity, most implementations also store the OVVV integrals.
The VVVV integrals are used in only one CCSD term, and how they are processed
illustrates a fundamental technique that the parallel CCSD algorithm uses in almost every
contraction. The data requirement scales like N^4 and the computational requirement
scales like N^6, and this term is usually the rate-determining step of the CC iteration. The
standard MO expression is shown in Equation 1-12. The "+=" operator is shorthand for
"incremented by" (i.e., "x += y" means "x = x + y"), and the r amplitude is the residual that
collects all of the contributions before dividing by the D denominator and producing the
next set of T amplitudes.
r_{ij}^{ab} \mathrel{+}= \frac{1}{2} \sum_{cd} \langle ab \| cd \rangle\, t_{ij}^{cd}    (1-12)
Forward transforming T into the AO basis allows the AO integrals to contribute
directly to the VVVV terms. Conceptually, the AO coefficients from the reference
wavefunction are "removed" from the integral and "attached" to the amplitude. The
primes on the AO indices in T denote a forward transformation. This distinction must be
made since another useful transformation (a backward transformation) has the same
dimensionality (viz., ρσ), but the AO overlap matrix is used in the contraction in addition
to the AO coefficients.
t_{ij}^{\rho'\sigma'} = \sum_{cd} c_{\rho c}\, c_{\sigma d}\, t_{ij}^{cd}    (1-13)

r_{ij}^{\mu\nu} \mathrel{+}= \frac{1}{2} \sum_{\rho\sigma} \langle \mu\nu \| \rho\sigma \rangle\, t_{ij}^{\rho'\sigma'}    (1-14)
While this transformation removes the need to store the VVVV MO integrals, it
increases the computational workload substantially. In fact, since each integral
contributes to each amplitude, the R2 array would be updated N^4 times! However,
implementing this technique presents a more optimistic view of calculating large systems.
The R2 amplitudes are back-transformed from MOs to AOs before being incremented by
the AO integral terms. Once the integrals have been processed, the R2 array is forward
transformed into the MO basis.
r_{ij}^{\mu\nu} = \sum_{cd} (Sc)_{\mu c}\, (Sc)_{\nu d}\, r_{ij}^{cd}    (1-15)

r_{ij}^{ab} \mathrel{+}= \sum_{\mu\nu} c_{\mu a}\, c_{\nu b}\, r_{ij}^{\mu\nu}    (1-16)
The method of calculating amplitude contributions in the AO basis instead of the
virtual MO basis has the potential to reduce the amount of storage space needed to
calculate the aforementioned VVVV contribution. This is particularly true for open-shell
systems, which need to store all three spin combinations of the VVVV integrals
compared to storing one set of AO integrals. If direct integral generation is used, then the
storage requirement for this term is completely removed. In either the AO or MO
scenario, parallelization involves distributing the data and operations of a large matrix
multiplication over the processing elements.
CHAPTER 2
PARALLELIZING A SERIAL CCSD PROGRAM
Wide-scale reformulation of computational algorithms should only be performed if
currently existing formulations are inadequate, but determining whether an algorithm is
inadequate depends heavily on its intended use and the computers that are meant to
execute it. Any of a number of runtime parameters could contribute to low performance:
compiler efficiency; net CPU speed; memory or disk storage requirements and
performance; parallel scalability and efficiency; etc. In the case of general computational
chemistry algorithms, how the number of operations scales with respect to the size of the
chemical system will also determine whether an algorithm should be reformulated.
The ACES II Program System [ACES] is one of the more popular software
packages that calculates CCSD electron correlation energies. Although it was written to
make use of CPU vectorization, parallelization is considered to be one of the most
important ways to decrease the run time on modern supercomputers. Traditional methods
for parallelizing existing software begin with execution profiling. Once the main
bottlenecks are identified and analyzed, parallel formulations of those sections are
implemented and profiled.
Serial Profile
Profiling the CC code in ACES II was performed with four test systems:
* Alanine ("ala"; C3H7NO2; 48 e-) in cc-pVDZ (119 AO basis functions)
* Octane ("oct"; C8H18; 66 e-) in cc-pVDZ (202 AOs)
* β,β'-Dichloroethyl ether ("DCEE"; C4H8Cl2O; 74 e-) in cc-pVDZ (146 AOs)
* DCEE radical ("DCEE*"; C4H7Cl2O; 73 e-) in cc-pVDZ (141 AOs).
The first three systems had an RHF reference, the fourth had UHF, and no spatial point
group symmetry was used for any system. All of the systems used the AO basis
formulation for the VVVV*T2 contribution, which was described in Chapter 1. The
profiled code used the C system function gettimeofday() to record the real-time clock
before and after large code sections. Any subroutine (or group of related subroutines) that
consumed more than 3% of the real time became a potential target for parallelization.
Table 2-1 lists the average percentage of the total real time for each high-level
subroutine measured across the three closed-shell test systems (RHF column), and it lists
the percentage for each routine in the DCEE* test system (UHF column). Routines that
contributed more than 3% are in bold. The maximum variance of any set of
percentages for the closed-shell systems was 0.004, which suggests that the three profiles
have a consistent and reliable distribution of times. The table entry for "~AOLAD"
corresponds to the time spent in AOLAD that is not due to RDAO, even though the latter
is a dependent of the former. Figure 2-1 shows the amount of the total time consumed by
the major subroutines and the next most significant routine.
Figure 2-1. The distribution of real time over primary CCSD subroutines.
Table 2-1. The distribution of real time over high-level CCSD subroutines

Routine    RHF    UHF       Routine     RHF    UHF
ZERSYM     0.2%   0.1%      T1INT1      0.4%   0.2%
T1RAAAA    1.4%   0.6%      LADAA       0.0%   0.1%
T1RABBA    0.8%   0.6%      LADAB       1.2%   0.5%
T1RABAB    0.8%   0.6%      AOLADLST    0.0%   0.0%
CNTRCT     7.7%   9.8%      T2TOAO      0.7%   0.3%
SUMSYM     0.1%   0.2%      RDAO        58.3%  65.1%
F2TAU      0.0%   0.0%      ~AOLAD      0.6%   0.3%
GETLST     1.1%   0.8%      Z2TOMO      0.9%   0.7%
QUAD1      1.3%   0.5%      ZIAAO       0.1%   0.2%
QUAD2      0.5%   0.3%      T12INT2     0.8%   1.1%
QUAD3      0.3%   0.2%      RNGDRV      9.4%   11.0%
MAKFME     0.1%   0.1%      SSTOx1      0.0%   0.0%
T1W1AA     0.0%   0.6%      SUMRNG      0.2%   0.1%
T1W1AB     3.9%   2.1%      ERNGAA      0.0%   0.0%
T1INW2     0.1%   0.0%      ERNGAB      0.5%   0.1%
FEACONT    0.7%   0.4%      SUMSYM      0.2%   0.0%
FMICONT    0.3%   0.3%      E4S         0.6%   0.2%
FMECONT    0.5%   0.2%      NEWT2       0.3%   0.2%
T1INT2A    0.3%   0.2%      PHASE3      0.6%   0.2%
T1INT2B    0.8%   0.5%      DIIS        2.6%   1.0%
It is clear from the table that four routines consume most of the execution time:
CNTRCT, T1W1AB, RDAO, and RNGDRV. Together, they cost 79.3% and 88.1% for
RHF and UHF, respectively. Figure 2-2 shows the four major subroutines and any
individual routine that consumes more than 1% from the DCEE* profile.
Figure 2-2. The breakdown of significant UHF subroutines. Primary values are grouped
on the left, secondary values (greater than 1%) are grouped on the right.
Bottleneck Categorization and Parallelization
Particle-Particle Ladders
The single most time-consuming routine in either RHF or UHF calculations is
RDAO, which increments T2 by contributions from the VVVV integrals in the AO basis.
Integral-stored schemes, such as the current implementation in ACES II, read the AO
integrals into memory from disk, multiply them by the T2 amplitudes from the previous
iteration, and add the product to the T2 residual amplitudes. Neglecting integral sparsity,
the number of permutationally unique AO integrals is shown in Equation 2-1. If there are
at least 100 AO basis functions, then the number of unique integrals can be approximated
by N_AO^4/8 with more than 98% accuracy; with 200 or more functions, the approximation is
more than 99% accurate.
N_{\mathrm{int}} = \frac{P(P+1)}{2}, \qquad P = \frac{N_{\mathrm{AO}}(N_{\mathrm{AO}}+1)}{2}    (2-1)
Many AO-based implementations have a user-defined threshold for discarding AO
integrals that would not have a significant contribution to the T2 residual. The integral
program in ACES II stores the integrals in one long vector and discards individual
integrals that are less than the threshold. For DCEE* and a threshold of 10^-14, ACES II
stored almost 41.1 million integrals from a complete set of 50.1 million, which means the
two-electron integrals were 18.0% sparse.
The RDAO algorithm reads in a batch of integrals and processes the integrals
individually. For a particular integral, the value might contribute 1, 2, 4, or 8 times
depending on the permutational symmetry of the orbitals. The RHF case only has 1, 2, or
4 contributions and post-processes the residual for the final permutation. The general
structure of the algorithm can be described by the following:
do while (more batches)
   read batch
   do for each integral <μν|ρσ>
      do for each permutation
         do for all oo
            R2(oo,μν) += T2(oo,ρσ)*<μν|ρσ>
         end do
      end do
   end do
end do
This algorithm is not particularly efficient since each permutation requires a vector-
vector operation (R2(*,μν) += V*T2(*,ρσ)). Experience in computational algorithms
suggests that matrix-matrix operations (using integral blocks) would have much higher
computational throughput. For example, the RDAO update averaged 286.7 MFLOPS out
of a theoretical maximum of 1500 MFLOPS on the 375 MHz IBM POWER3 CPU. A
600x600 matrix multiplication using the ESSL library (a tuned linear algebra library for
IBM architectures) averages 1278 MFLOPS, which corresponds to a computational
efficiency of 85.2% compared to 19.1% for RDAO. However, parallel algorithms for
block-wise implementations are virtually identical to element-wise implementations, so
block optimizations were ignored in the parallel strategy and could be implemented later.
For any DCEE* CC iteration, 9.6 seconds out of 5952 seconds were spent reading
integrals from disk (i.e., I/O takes about 0.16% of the total RDAO time). Before the
multiplication loop, there was a small sorting section, which took close to 13.6 seconds
(also insignificant); therefore, parallelizing the actual integral updates was expected to
have the largest impact on performance.
There are multiple levels at which the current algorithm can be parallelized:
batches, integrals, permutations, or rows of OO pairs. Distributing the operations over
OO pairs might decrease performance since the pairs are contiguous in memory, and the
cost of adding more rows is much less than the cost of adding more loops to process
them. The problem with distributing permutations is that the number of permutations for
a random integral is not always 8. This might introduce load-balancing problems
depending on how the integrals are sorted on disk; furthermore, each integral has a
limited number of tasks (permutations), which might limit overall scalability above 8
processors.
The two remaining loops (unique integrals and integral batches) are ideal for
distributing over processors. The number of integrals is substantial enough to support
scalability over a wide number of processors, and each processor will need to store only
those integrals for which it is assigned to calculate contributions. The current ACES II
implementation expands the permutations into the "unique" integral list and processes
each one separately, so this expanded list was distributed over processors instead of the
truly unique list.
For maximum software portability, the parallel algorithm needed to distinguish
between processors on the same physical computer and processors on remote computers.
The strategy was to distribute integral batches over physical computers (nodes) and to
distribute the integral list from each batch over processors on the same computer. After
the nodes finished processing their batches of integrals, the R2 contributions were
reduced and distributed back to all of the nodes. To minimize memory requirements on a
single node, only one integral list, T2 array, and R2 array were stored, and CPU threads
were created to process the integrals. The R2 array was divided into blocks of columns
(μν pairs), and each CPU was assigned to calculate all of the terms that contributed to a
particular set of blocks. This technique reduced false cache conflicts (when two CPUs
attempt to write to different parts of the same cache line at the same time), and it
increased the probability that a CPU would have a target R2 column in cache memory,
since multiple CPUs would be caching different columns. Figure 2-3 shows the
CPU/column assignments in a 4-CPU computer with 4 columns per block.
Figure 2-3. The column assignments for each of 4 CPUs updating R2. In this example,
there are 49 μν pairs (from 7 AOs).
For an arbitrary blocking factor (bf) and number of threads (Nth), the thread
assignment (from 0) of a column index (mn) can be determined with the following
FORTRAN function:

thread_id(mn,bf,Nth) = mod(mn-1,bf*Nth)/bf

If bf*Nth is a power of 2, then the mod function can be replaced by a bitwise AND with
the mask (bf*Nth)-1, and if the blocking factor is a power of 2, then the division can be
replaced by bit shifting.
Matrix Multiplication
If the VVVV contributions are calculated in the MO basis, then LADAA and
LADAB become the bottleneck routines. These perform large two-dimensional matrix
multiplies, as do the other three bottlenecks: CNTRCT, T1W1AB, and RNGDRV.
Incidentally, the routine T12INT2 (1.1% of the UHF time) is also bound by matrix
multiplications. Parallelizing these routines involved distributing the columns of the
output matrix over CPU threads. Since the input and output matrices were all kept in
memory, the columns were divided once, unlike block cycling in the RDAO scheme,
which had irregular data access patterns. Figure 2-4 shows the distribution pattern for a
matrix multiply over 4 threads.
Figure 2-4. The distribution scheme for matrix multiplies over CPU threads. Every
thread can read the entire A matrix, but each thread is assigned a block of the
B matrix and consequently writes to a unique block of the C matrix.
The multiplications were not distributed over nodes since the parallel computer that
ran the profile had 16 CPUs per node. The time penalty incurred from the global
reduction (after the multiplication) would have greatly lessened the gains from
parallelization; however, nodes with fewer CPUs might benefit from distributing the rows
of the A matrix. All nodes but one would then have to initialize the C matrix to 0 (if the
operation increments preexisting data), and the results would have to be reduced and
broadcast after the multiplication.
I/O-Dependent Routines
None of the major subroutines that required parallelization were heavily dependent
on I/O operations; however, many parallel computers (including the one that ran the
profiling calculations) have shared disk storage devices. This introduces competition
between some or all of the compute nodes when each one performs file read or write
operations. The DIIS routine (1% of the UHF time) is heavily dependent on I/O
performance, and the rate-determining operation can be described by the following:
do for each i
   do for each (x,y)
      B(x,y) += Tall(x,i)*Tall(y,i)
   end do
end do
The number of rows in Tall (and the size of each dimension of B) is usually 5 but could be
any value (preferably a small one). The i index spans all of the T amplitudes, both T1 and
T2. Obviously, the construction of a 5x5 matrix from a single 5x1 vector is negligible
compared to the time spent reading from and writing to disk. This operation was not
parallelized; instead, only one node was permitted to perform it.
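The loop above amounts to accumulating B = T·Tᵀ one amplitude index at a time, with each step dominated by a disk access. A small Python sketch of the same accumulation (array names follow the pseudocode; the sizes are illustrative):

```python
import numpy as np

nvec = 5      # stored amplitude vectors (rows of Tall, dimension of B)
namp = 1000   # i spans all of the T1 and T2 amplitudes

Tall = np.random.rand(nvec, namp)

# Rate-determining DIIS operation: fetch the amplitudes for each i
# (one disk read in the real code) and accumulate the tiny B matrix.
B = np.zeros((nvec, nvec))
for i in range(namp):
    col = Tall[:, i]          # the 5x1 vector read from disk
    B += np.outer(col, col)   # the 5x5 increment; negligible arithmetic
```

The arithmetic per step is trivial; the cost is entirely in the per-i reads, which is why the operation was confined to a single node rather than parallelized.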
I/O competition is a low-level problem inherent to shared storage architectures. In
many cases, extensive software changes are required to overcome the increasing penalties
from read and write delays that serial programs treat as constant. These penalties
manifest themselves in almost every routine that uses disk storage in the parallel profile.
Parallel Profile
The DCEE* test case was run with the ACES II CCSD program that had the parallel
modifications described in the previous section. The parallel performance of the major
subroutines is shown in Figure 2-5. The real time spent in each high-level routine is
shown in Figure 2-6. Only RDAO was distributed over every CPU in the calculation and
was 20.6% efficient in the 256-CPU (16x16) calculation. The three other high-level
routines were only distributed over the 16 CPUs in each node and had average threading
efficiencies of 70.1% (CNTRCT), 25.4% (T1W1AB), and 28.8% (RNGDRV).
[Figure 2-5 bar chart: parallel speedup of the CCSD executable and each major
subroutine (T1INT2A, T1INT2B, T1INT1, LADAA, LADAB, T2TOAO, RDAO, AOLAD,
Z2TOMO, Z1AAO, T12INT2, RNGDRV, SUMRNG, ERNGAB, E4S, NEWT2, PHASE3,
DIIS) for the 1x1, 2x16, 4x16, 8x16, and 16x16 CPU configurations, with the speedup
axis running from 0 to 60 and CNTRCT, T1W1AB, RDAO (AOLAD), and RNGDRV
highlighted.]
Figure 2-5. The parallel speedup for each major UHF CCSD subroutine and the CCSD
executable. (AOLAD) contains the RDAO parallelization and the global
reduction afterward.
[Figure 2-6 chart: real time, from 0 to 7200 seconds, spent in each high-level routine
(ZERSYM, T1RAAAA, T1RABBA, T1RABAB, CNTRCT, SUMSYM, GETLST,
QUAD1, QUAD2, QUAD3, MAKFME, T1W1AA, T1W1AB, T1INW2, FEACONT,
FMICONT, FMECONT) at each CPU configuration.]
Figure 2-6. The run-time profile of parallel CCSD with a UHF reference from DCEE*.
The times are measured in seconds.
The overall parallel efficiency of the UHF CCSD calculation on 256 CPUs was
1.8%, with a speedup of 4.62. The calculation with the highest speedup was 4x16 (64
CPUs), with an efficiency of 8.8%. The reason for such poor performance can be attributed
to I/O contention among all of the routines that were insignificant in the serial calculation.
Most of the routines listed in Figure 2-6 increase in time even though they had no parallel
modifications. While further parallel optimizations could be made to the current CCSD
program, two major drawbacks would still plague the parallel code: problem scalability
and expensive modify-profile development cycles (i.e., trial and error).
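Parallel efficiency here is simply speedup divided by CPU count, so the figures quoted above follow directly. A quick check in Python (the helper name is illustrative):

```python
def efficiency(speedup, ncpu):
    """Parallel efficiency, as a percentage."""
    return 100.0 * speedup / ncpu

# 256-CPU (16x16) run: a speedup of 4.62 gives 1.8% efficiency.
print(round(efficiency(4.62, 256), 1))
# 4x16 (64-CPU) run: 8.8% efficiency implies a speedup of about 5.6.
print(round(8.8 / 100 * 64, 1))
```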
The worst aspect of the current approach is that tractability does not increase as
CPUs, memory, and disks are added to the pool of resources (i.e., the largest molecule
that can be studied is limited by the resources of a single node). There is no mechanism
for distributing data across memory or disks; however, if the current limitations are
adequate, then this model should be able to reduce the run time to a reasonable level.
Parallelizing existing serial software usually takes the bottom-up approach, in
which significant bottlenecks are parallelized and profiled. As new bottlenecks present
themselves, they too are parallelized and profiled. Unfortunately, by the time this cycle
ends (or becomes too expensive), so much of the software might have been rewritten that
it would have been cheaper to reformulate the algorithm using a top-down parallelization
approach. In the current CCSD program, no matter how well the algorithm is parallelized,
the software is still limited to performing calculations on systems that are calculable by a
single node. It would be more efficient in the long term to reformulate the CCSD
algorithm from the top down (reusing existing software wherever possible) and to include
proper data distribution and small-task parallelization as design fundamentals.
Top-Down Parallelization Considerations
Although a few programs already calculate the CCSD energy in parallel, most of
them are limited to closed-shell references [Koc96,Kob97,Per03]. The exception is the
general tensor contractor of Hirata [Hir03], but the final program requires stored
integrals. The goal of the top-down parallelization approach will be to produce an open-
shell CCSD program capable of handling molecular systems with at least 500 AO basis
functions (although the internal target is 1000) and of scaling up to a few hundred (if
not a thousand) CPUs. These constraints almost necessitate calculating amplitude
contributions from AO integrals and distributing amplitudes across CPUs.
In addition to the VVVV integrals, the algorithm should also allow for direct
generation of the VVVO integrals. The consequence of calculating amplitude contributions
in the AO basis instead of the MO basis is that the virtual orbitals are never formally
recognized by the equations, except to create complementary density matrices such as
the one shown in Equation 2-2. These matrices are very important since they define how
amplitudes transition from backward- to forward-transformed relative to the MOs.

Q_μν = Σ_e c_μ^e c_ν^e   (2-2)
The first step of the top-down approach involves cataloging the integral
contributions and identifying which output condition on the R amplitudes (forward- or
backward-transformed) would be best suited for their additions. Finding a data and task
distribution then becomes a matter of understanding how the integrals are processed and
which terms can be grouped together given a common output condition on the residual.
CHAPTER 3
SPIN-UNRESTRICTED CCSD EQUATIONS
The T1 and T2 equations in spin-restricted Hartree-Fock (HF) CCSD have 11 and
29 contributions, respectively. If the reference wavefunction is not HF but the spin
orbitals are still restricted, then the T1 and T2 equations have 14 and 31 contributions,
respectively. The spin-unrestricted non-HF CCSD amplitude equations have a total of
222 contributions. Table 3-1 lists the spin-restricted T1 amplitude contributions and the
number of corresponding spin-unrestricted terms. Table 3-2 does the same for the T2
contributions [Cra98]. P(pq) is the antisymmetric permutation operator (Appendix B).
There are many ways to group and factorize the amplitude contributions. The
efficiency of each approach largely depends on the following:
* The chemical system being examined
* The CPU speed and efficiency
* The amount of local memory per process
* Local and remote (shared or exclusive) storage capacities
* Local and remote I/O latencies and bandwidths
* Interprocess message latency and bandwidth.
For example, if the CPU speed were incredibly slow compared to the speed of I/O
reads and writes, then it might be faster to group as many subterms together as possible
and accumulate the intermediate product to disk. Consider the sum X*A*B + X*C*D +
X*E*F, which factorizes to X*I provided I = A*B + C*D + E*F (first accumulated on
disk). In most cases, however, millions (if not billions) of CPU operations are performed
before a single byte of data is retrieved from disk, so this approach is relegated to those
systems whose CPU speeds are unusually slow compared to other hardware capabilities.
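The factorization in this example is easy to verify numerically. A NumPy sketch, with matrix names following the text:

```python
import numpy as np

n = 20
X, A, B, C, D, E, F = (np.random.rand(n, n) for _ in range(7))

# Unfactorized: three full chained products involving X.
direct = X @ A @ B + X @ C @ D + X @ E @ F

# Factorized: accumulate the intermediate I first (on disk, in the
# scenario described above), then multiply by X once.
I = A @ B + C @ D + E @ F
factorized = X @ I
```

The factorized form trades two of the three multiplications by X for the cost of storing and retrieving I, which only pays off when I/O is cheap relative to computation.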
Table 3-1. The RHF CCSD T1 contributions
[Table body garbled in extraction: each row pairs one spin-restricted T1 contribution
(with its diagram) with the number of corresponding UHF terms, either 2 or 4 per
contribution.]
The most common factorization strategy involves searching all possible
combinations of terms from either amplitude equation and finding the set of subterms that
is used most frequently. This implicitly assumes that the cost of recomputing the data is
greater than the cost of storing and retrieving the data from disk. In the MO basis, this is
certainly true (i.e., the amount of computation needed to generate all of the MO integrals
from AO integrals is quite high), but the cost of calculating AO integrals on demand is
negligible when multiple processes are in competition for shared I/O resources.
Table 3-2. The RHF CCSD T2 contributions
[Table body garbled in extraction, including its continuation: each row pairs one
spin-restricted T2 contribution (with its diagrams) with the number of corresponding
UHF terms, ranging from 1 to 11 per contribution.]
The following assumptions guided the development of the factorization strategy
described in this chapter:
* There are at least 3 times as many virtual orbitals as occupied orbitals for each spin
* The two-particle data are too large to store in the memory of one process
* The VVVV and VVVO integrals are not transformed and stored on disk
* The CPU operations are several orders of magnitude faster than I/O latencies
* Latency is at least an order of magnitude more significant than bandwidth
* Each process is able to retrieve (GET) and increment (ACCUMULATE) data that
are stored by another process without interrupting the other process.
The first assumption follows from a study of basis sets that noted that cc-pVTZ is the
minimum basis set needed to describe electron correlation accurately [DunOO]. For
chlorine, there are 17 electrons and 34 contracted basis functions. This produces an
almost 1:3 occupied/virtual split in molecular orbitals, since roughly half of the electrons
have alpha spin and half have beta spin.
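The 1:3 estimate for chlorine follows from a little arithmetic. A quick check in Python, with the cc-pVTZ function count taken from the text:

```python
import math

electrons = 17          # chlorine
basis_functions = 34    # contracted cc-pVTZ functions for Cl

# Roughly half the electrons are alpha and half beta.
occ_alpha = math.ceil(electrons / 2)   # 9
occ_beta = electrons // 2              # 8

for occ in (occ_alpha, occ_beta):
    virt = basis_functions - occ
    print(occ, virt, round(virt / occ, 2))
```

The virtual/occupied ratios per spin come out near 2.8 and 3.3, i.e., an almost 1:3 occupied/virtual split.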
Designing the algorithm such that two-particle terms (viz., VVOO) can be
distributed over PEs is necessary because the parallel program should be capable of
treating systems with 50 electrons and 500 basis functions (50/500), although 100
electrons and 1000 basis functions is the target (100/1000). For 50/500 systems, the three
VVOO spin combinations require 1.55 GB of memory. Considering that the algorithm
needs at least two such arrays (for T2 and R2) held in core, a processor with 1 GB of
memory will not be able to store the minimum amount of data without overflowing to
disk storage. Furthermore, the VVVO and VVVV terms will require 59.8 and 568 GB of
storage, respectively. (This assumes a conversion factor of 1024.) The parallel algorithm
should then at least allow for the possibility that the terms that use these MO integrals
can be calculated from AO integrals generated on the fly.
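The storage figures quoted above can be reproduced by counting elements for a 50/500 system, assuming 8-byte reals, packed same-spin index pairs, and the stated conversion factor of 1024. The packing convention is inferred here rather than stated explicitly in the text, but it reproduces all three numbers:

```python
def gib(n_elements):
    """Convert a count of 8-byte reals to GB (1024-based)."""
    return n_elements * 8 / 1024**3

o, v = 25, 475            # occupied/virtual per spin for 50/500
op = o * (o - 1) // 2     # packed same-spin occupied pairs
vp = v * (v - 1) // 2     # packed same-spin virtual pairs

# Same-spin (aa and bb) blocks use packed pairs; the ab block does not.
vvoo = 2 * vp * op + (v * o) ** 2
vvvo = 2 * vp * v * o + 2 * v**3 * o
vvvv = 2 * vp**2 + v**4

print(round(gib(vvoo), 2))   # ~1.55 GB
print(round(gib(vvvo), 1))   # ~59.8 GB
print(round(gib(vvvv)))      # ~568 GB
```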
Today, the fastest disk storage devices are able to seek for random data in
approximately 3.7 milliseconds (measured on a Maxtor Atlas 15K 73 GB Ultra320 SCSI
hard drive). The slowest desktop CPU (to within reasonable expectations of performance)
operates at 1 GHz. (The Intel Pentium 4 line of CPUs can operate at 3.4 GHz.) Assuming
that most floating-point operations in CCSD require multiplications and additions and
that each operation completes in one clock cycle, the maximum theoretical
multiply/add (fma) throughput is somewhere around 500 million fma/s. Together, these
data indicate that a slow CPU could theoretically deliver 1.85 million fma/seek, which at
least supports the assumption that computation is several orders of magnitude faster than
I/O latency. On the same hard drive, bandwidth bursts measure 73.7 MB/s, which
suggests that roughly 280 kB of data could be transferred in the time it takes to locate
the data.
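These throughput estimates reduce to simple unit conversions, sketched below in Python. (The quoted ~280 kB is a round-up; the raw product comes out near 273 kB with 1000-byte kilobytes.)

```python
seek_s = 3.7e-3    # average seek time of the fast SCSI drive, in seconds
fma_rate = 500e6   # multiply/adds per second on a slow 1 GHz CPU
burst = 73.7e6     # burst bandwidth, in bytes per second

fma_per_seek = fma_rate * seek_s
kb_per_seek = burst * seek_s / 1000

print(f"{fma_per_seek:.3g} fma/seek")   # about 1.85 million
print(f"{kb_per_seek:.0f} kB/seek")
```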
The final assumption follows from the latest version of the most popular message-
passing library, MPI (Message Passing Interface) [Gro99]. MPI-2 allows PEs in the same
communication network to retrieve (GET), increment (ACCUMULATE), and overwrite
(PUT) data that are stored by another CPU without communicating with it. In addition to
their own proprietary communication libraries, most vendors of parallel computers offer
implementations of the standard message-passing libraries (like MPI) to ensure maximum
compatibility with the myriad parallel programs that depend on industry standards.
The main consequences of the aforementioned assumptions are that there must be
more than one process, that recomputation is better than reading from disk, and that
messages should be as large as possible. The primary goals of this parallelization effort
are scalability and addressability (i.e., the ability to treat a wide range of molecular
systems), so the grouping and factorization of terms should maximize message size and
minimize message frequency while minimizing the local scratch memory requirements.
Group 1: T2 Virtual Orbitals Coming from V
One of the smallest groups of terms consists of T2 contributions in which both
virtual orbitals come from the two-electron integrals. Equation 3-1 shows the ββ terms;
the αα terms are analogous. The two-electron integrals in these terms will be generated
on the fly in the AO basis, so the expression must be written in terms of AO integrals.
Before doing so, the second term must be split and rearranged to allow for better
factorization.

r_ij^ab += <ab||ij> + P(ij) Σ_c <ab||cj> t_i^c + (1/2) P(ij) Σ_cd <ab||cd> t_i^c t_j^d + (1/2) Σ_cd <ab||cd> t_ij^cd   (3-1)

[Equations 3-2 through 3-8 garbled in extraction: they split the second term of Equation
3-1, rewrite the ket indices in the uncontracted AO basis, and factorize the ββ and αβ
increments so that each AO integral contracts with half-transformed quantities of the
form (c + t1).]
Group 2: T2 Virtual Orbitals Coming from V and T
Set 1: One Virtual from V, One Virtual from T1
This set of T2 contributions is very similar to the Group 1 terms. The ββ increment
is shown (Equation 3-9). Since every term antisymmetrizes the contraction with respect
to ab, the permutation operator is removed and the increment is identified as non-
antisymmetric. The residual amplitudes (Appendix B) will be antisymmetrized with
respect to both ab and AB virtual pairs at the end of the iteration. Splitting the second
term, as in Equation 3-1, is also required.

[Equations 3-9 through 3-12 garbled in extraction: the ββ MO-basis terms with one
virtual index from V and one from T1, the split second term, and the factorized AO-basis
expressions.]
The αβ increments (Equation 3-13) are grouped by the bra-side T1 and then
expressed in terms of AO integrals and factorized.

[Equations 3-13 through 3-15b garbled in extraction: the grouped αβ MO-basis terms
and their factorized AO-basis forms.]
Set 2: One Virtual from V, One Virtual from T2
The ββ set contains four T2 contributions that are ring contractions. Equations 3-16
through 3-18 show these terms and derive their AO-basis expressions.

[Equations 3-16 through 3-18 garbled in extraction.]
Each half of the αβ increments contains four ring terms and two particle-hole
ladder terms. The MO and AO expressions are shown in Equations 3-19 and 3-20.

[Equations 3-19a, 3-19b, 3-20a, and 3-20b garbled in extraction.]
Group 3: T2 Virtual Orbitals Coming from T
Set 1: Rings and PH Ladders
There are 31 UHF T2 terms that belong to this set. Each process loads a block of
V integrals and calculates all of the contributions to the range of R2 amplitudes that it
is assigned to store. Grouping and factorizing the terms for performance depends heavily
on the manner in which the old T2 amplitudes are stored and accessed.
The terms were first cataloged and sorted into subsets according to their spin pairs.
For example, if the a index is stride 1, then a ring term must step down in A and
down in B for a given cd pair in V. To minimize the number of messages, any other ring
terms that have the same access pattern should be calculated at the same time. From the
31 UHF terms, three access patterns emerged. Any term that did not prefer one pattern to
another was added to the pattern that had the most similar intermediate products.
Subset 1: AC/BD
Subset 1 contains 13 terms that step through T2 in the order AC/BD, which
corresponds to the order of the two bra indices. More specifically, for a given CD pair of
ket shells, the first (or outer) loop runs over A and the second (or inner) loop runs over B.
All of the αα T2 contributions are performed in this series.

[Equations 3-21 through 3-24 garbled in extraction.]

* V notation can be ambiguous with respect to whether the AO integrals are antisymmetrized. Both electrons in
the terms in this section have at least one occupied MO index, so V notation is used to save equation width.

In terms of uncontracted ket indices of V, the partial contributions reduce to
Equations 3-25 through 3-27, in which each equation corresponds to a "left-side"
intermediate enclosed in parentheses.

[Equations 3-25 through 3-27 garbled in extraction.]
Subset 2: DB/CA
The DB/CA subset calculates 12 terms by looping first over B and then over A.

[Equations 3-28 through 3-31 garbled in extraction.]

The MO equations can be reduced to four equations with four "right-side"
intermediates when they are expressed in terms of V with uncontracted ket indices.

[Equations 3-32 through 3-35 garbled in extraction.]
Subset 3: CB/AD
The remaining terms are primarily PH ladder terms, and the outer loop is over B. If
the bra shells are thought of as a square VV matrix, then the main difference in loops
between this subset and Subset 2 is that the inner loop A steps down the D column, while
the inner loop A of Subset 2 steps across the C row.

[Equations 3-36 through 3-38 garbled in extraction.]

Factorizing the MO expressions in terms of right-side intermediates and
uncontracted ket indices yields Equations 3-39 through 3-41.

[Equations 3-39 through 3-41 garbled in extraction.]
Set 2: HH Ladders from OOVV, OOOV, and OOOO
The ladder term that is quadratic in T2 can be calculated in one of two ways: one
with a HH ladder intermediate and one with a PP ladder intermediate. Opting for the HH
ladder intermediate saves a large amount of memory and memory writes. Three other
terms have HH ladder intermediates from the OOVV integrals and can be factorized with
the quadratic term. The ββ contributions are shown (Equation 3-42).

[Equation 3-42 garbled in extraction.]

The four ladder-like terms that come from OOOO and OOOV integrals can be
combined with the OOVV ladders when the ket orbitals are from the plain AO integral.
Equation 3-43 shows the MO contributions.

[Equation 3-43 garbled in extraction.]

Combining the two equations, incrementing pre-antisymmetrized amplitudes, and
expressing the ket indices in the uncontracted AO basis yields Equation 3-45.

[Equations 3-44 and 3-45 garbled in extraction.]

The αβ terms and the factorized increment are shown in Equations 3-46 through 3-48.

[Equations 3-46 through 3-48 garbled in extraction.]
Group 4: T2 into T1
The T2 amplitudes contribute to the R1 amplitudes in 14 terms. These terms can be
grouped together such that they define three one-particle intermediate operators, which
effectively reduce the particle number of the perturbation. Equations 3-49 and 3-50 show
the terms that define the IP (particle) and IH (hole) intermediates, respectively. Equation
3-51 contains both spin cases of the IR (ring) intermediate.

[Equations 3-49 through 3-51 garbled in extraction.]

The terms that define the intermediates have been isolated in Equations 3-52
through 3-54. Both spin cases of the IR intermediate are shown because they both
contribute to the β R1 amplitudes. (The terms with the Fock matrix elements are not part
of the intermediate.) The α spin cases of IP and IH are analogous to the β cases.

[Equations 3-52 through 3-54b garbled in extraction.]
Group 5: OneParticle Transformations
Set 1: T1 Contributions
The 12 remaining T1 contributions can be loosely described as one-particle
transformations. The α terms are analogous to the β terms, which are shown. These are
then drastically simplified with the IR intermediate, and further consolidation yields a
non-HF set in Equation 3-61 and an IR set in Equation 3-62.

[Equations 3-55 through 3-64a garbled in extraction: they define the one-particle F
intermediates in terms of Fock matrix elements, T1, and the IR intermediate, and give
the consolidated non-HF and IR contributions to R1.]
Set 2: T2 Contributions
Two one-particle transformations describe the final T2 ββ contributions.

[The defining equations for this second set of one-particle intermediates, through
Equation 3-64b, are garbled in extraction.]
Similar to the F intermediates used to calculate the T1 contributions, the one-
particle transformation matrix for T2 into R2 can be defined in terms of IP, IH, and IR,
and with these definitions, the ββ and αβ increments can be trivially reduced to four
terms (Equations 3-67 and 3-68).

[Equations 3-65 through 3-68 garbled in extraction: the ββ and αβ increments each
contract an F-type intermediate with T2 over the virtual indices and another over the
occupied indices.]
CHAPTER 4
PARALLEL ALGORITHM
Virtual Parallel Computer
Before any algorithm can be parallelized, the type of parallel computer that the
algorithm will run on must be identified. Is remote memory accessible through one-sided
operations, or must each process explicitly interrupt and request data from another
process? How much data will be needed during each section of the algorithm, and can
those data be held by a single process? Does each process have the ability to read data
from a storage device, and if so, what is the penalty for multiple processes reading data
simultaneously? These are the types of questions that must be answered before making
assumptions, which will shape the plan for parallelization.
The most basic parallel computer is a network of workstations, in which each node
is a single-CPU computer with a private memory space and a storage device, usually a
hard disk. The most common connection fabric for these networks is TCP/IP over
Ethernet. This allows any node to communicate with any other node in the network,
and the network hardware or operating system software will handle delivering the
messages. (As opposed to each process having to maintain a table of which nodes are
directly accessible and which nodes must be reached through other nodes.) The cost of
maintaining these networks is not linearly proportional to their size, so it is common for
500+ node networks to remove the most failure-prone element, the hard disk, and provide
other means for each node to conduct I/O operations.
The extreme opposite of this picture is a single computer with multiple CPUs,
which are all able to access the same memory space and hard drive(s) with equal
performance. Each CPU is identical to the others, so these are commonly referred to as
Symmetric Multi-Processor (SMP) computers; however, there are tremendous costs
associated with adding each additional CPU, so it is unusual to have more than 8 CPUs in
a truly SMP machine. Some manufacturers build mini-networks within a single computer
that allow them to increase the number of CPUs without making the machine purely SMP.
To an application programmer, these machines are programmed in the same way as
Symmetric Multi-Processors, but the truer label is Shared Memory Programming (also
SMP!) even though the CPUs are not identical in terms of distance to memory, cache, and
possibly hard drives. Henceforth, there will be no distinction between the meanings of
SMP, since the parallel CCSD algorithm will be implemented as a user process and not a
system or kernel process.
The balance between cost and performance lies somewhere between the two
extreme types of parallel computers. The primary offerings of the major vendors of
supercomputers are clusters of SMP machines. For fine-grain parallelization, each node
has multiple CPUs accessing the same memory, and for coarse-grain parallelization, each
node has physically separate resources. The virtual parallel computer that the parallel
CCSD algorithm relies on is this type of cluster of SMP workstations; however, program
portability and the ability to tune the network parameters are very desirable.
The only way to build portability and tunability into the program without having
divergent paths through the software is to model the physical computer in the boundaries
of network communication. By properly defining certain subnetworks, no process will
attempt to communicate through slow network channels when faster ones may exist;
furthermore, messages pertaining to a certain context can be limited to regions of the
network, which might reduce network contention on some architectures. This allows the
scope of communications to be set once at program run time (or compile time) and the
algorithm will transparently acknowledge the underlying hardware up to the efficiency of
the partitioning scheme.
Communication Networks
In some sense, parallelizing algorithms that handle massive amounts of data and
computation is akin to managing disaster recovery. The moment the event occurs, there
are many diverse and unused workers who have no assignment or direction. As the
situation is assessed, teams of workers are sent to handle different aspects of the recovery,
from the time-critical, like search and rescue of survivors, to the time-forgiving, like
public relations and cost reimbursement.
Wildfire firefighters have some of the most extensive experience in rapid
deployment and management of labor units. In the late 1970s, the Incident Command
System (ICS) was created in response to a series of wildfire disasters in southern
California [Fed98]. It has since been adopted and refined by several federal and state
agencies and provides the basis for most emergency and non-emergency incident
responses. The parallel CCSD algorithm borrows the following aspects of its effectiveness:
* Modular/scalable organization
* Manageable span of control
* Unit designations.
The lowest level of the organization is the single process (normally assigned to a
single CPU). Every process must have its own memory heap, and the combination of a
process and its memory is called a processor element (PE). PEs recognize other PEs on
the same physical node and form squads (SQD). Squads can be grouped together to form
platoons (PLT) based on algorithm requirements, and platoons can be grouped together to
form companies (CO), which execute large-scale algorithms.
Although the units were listed in increasing size, their division and recognition
happen in decreasing size. The commanding process evaluates the available resources
and divides the group into subgroups accordingly. For example, the Company
Commander (CO_CDR) would catalog the total distributed memory and channels to disk
and form platoons based on how the company algorithm uses those resources. The squads
would then self-assemble within the platoons based on which node the PEs are on.
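This top-down division can be sketched in a few lines of Python. The node names, PE count, and the simple two-platoon split are invented for illustration; a real commander would weigh distributed memory and disk channels:

```python
from collections import defaultdict

# Each PE reports its rank and the physical node it runs on
# (here, 16 PEs on 4 four-CPU nodes).
pes = [{"rank": r, "node": f"node{r // 4}"} for r in range(16)]

# Commander-level division happens first: split the company into
# platoons (naively halved in this sketch).
platoons = [pes[:8], pes[8:]]

def form_squads(platoon):
    """Squads self-assemble within a platoon by physical node."""
    squads = defaultdict(list)
    for pe in platoon:
        squads[pe["node"]].append(pe["rank"])
    return dict(squads)

platoon_squads = [form_squads(plt) for plt in platoons]
print(platoon_squads)
```

The decreasing-size order matters: the platoon boundaries are imposed from above, and only then do the squads form from whatever PEs share a node inside each platoon.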
Each division has its own communication space that does not interfere with
messages in another division, and there are special communication spaces for the
commanders of the subdivisions. This allows platoons to synchronize independently from
each other, and more importantly, it allows company algorithms to disregard the
messages of other companies. The company that performs the CCSD iterations and
calculates the T amplitudes is called the Amplitude Company. Figure 4-1 illustrates the
communication boundaries for a company and its subdivisions.
Data Management
Arrays stored in distributed memory
The fundamental data elements involved in standard CC theory are the T
amplitudes, the integrals, and the reference wavefunction. The predominant method for
solving the T amplitude equations is iterating over a coupled set of equations for T1 and
T2. This makes the amplitudes stateful at each iteration, in contrast to the two-electron
integrals, which are stateless. Thus, the minimum amount of information that must be
stored during the CC iterations is TOLD (T1 and T2), TNEW (R1 and R2), and any
information that defines the reference wavefunction. The integrals do not need to be
stored since they can be generated on the fly.
[Figure 4-1 diagram: nested communication networks (CO_CDR_NET, CO_NET,
PLT_CDR_NET, PLT_NET, SQD_LDR_NET, SQD_NET) enclosing the Amplitude
Company commander, the platoon commanders, the squad leaders, and the individual
PEs.]
Figure 4-1. Communication network boundaries. Rounded rectangles represent
communication networks (in italics). Square rectangles represent physical
entities. Dashed square rectangles could be redundant entities (i.e., a platoon
commander will probably be the leader of the first squad).
It is assumed that a single PE cannot store a full OOVV quantity. Even for SMP
machines, a single OOVV array might exceed the total physical memory of the node, so
the old and new T2 arrays must be distributed over a set of PEs. It is also assumed that
each PE is able to retrieve and increment remote memory with one-sided communications
(i.e., without interrupting the process that allocated the memory). If the total distributed
memory is many times larger than the amount of data that must be stored, then
subdividing the Amplitude Company into platoons such that multiple copies of the
amplitudes are available for reading will decrease the average distance between a PE and
the distributed data that it attempts to access.
In this algorithm, the number of PEs in the Amplitude Company is almost
irrelevant, but the total amount of distributed memory is critical. The Company must be
able to store TOLD, TNEW (R), and the larger of T2 or the amount of scratch space needed
per PE, which depends on the chemical system (Figure 4-7). Each PE will have its own
local copy of small, frequently accessed arrays, which include the basis set definitions, AO
coefficients, T1 and R1 amplitudes, and other one-particle intermediates. While the PEs
are calculating contributions to the R amplitudes, only T2 and R2 need to be accessed in
addition to local scratch memory. Between certain sections, R2 must be transformed,
which requires enough scratch memory to store a third distributed OOVV array.
[Heap layout for Examples 1 and 2: reserved memory at the top, followed by the distributed T2 block and the distributed R2 block, with the remainder used as scratch. In Example 1 the minimum scratch is smaller than an R2 block; in Example 2 it is larger.]
Figure 4-7. The memory heap management for various molecular systems. In one scenario, the minimum scratch memory needed for amplitude contributions is less than the amount of memory needed to store a section of R2 during a transformation. The other scenario is when the opposite occurs.
Example 1. The user runs a calculation that requires 128 MB of reserved data per
PE, such as basis set information, oneparticle matrix elements, etc. The calculation
requires another 128 MB (minimum) of scratch memory per PE for calculating various
quantities during the CCSD iteration, and the OOVV quantities require 4096 MB of
distributed memory from all of the PEs. If the user submits the job with 1024 MB of heap
per process, what is the minimum number of PEs needed to perform the calculation?
Alternatively, does the calculation have enough resources if the user submits the job with
X PEs, each with 1024 MB of memory?
Solution 1. The amount of non-reserved memory is 1024 - 128 MB/PE (i.e., 896
MB is available for scratch and distributed data on each PE). Three OOVV arrays require
12,288 MB. Dividing this amount by 896 MB/PE and rounding up yields a minimum of
14 PEs. Distributing each OOVV array over 14 PEs requires each PE to store
approximately 293 MB of data per array. Since this amount is larger than the minimum
amount of scratch memory needed (128 MB), the calculation can proceed with 14 or
more PEs.
Example 2. Using the same system data as above, but increasing the minimum
required amount of scratch memory to 512 MB, what is the minimum number of PEs
needed to run the calculation?
Solution 2. This time, the regular scratch memory is larger than the distributed
OOVV scratch array. In essence, the scratch memory becomes unavailable for distributed
data, so the effective amount of memory per PE for the OOVV arrays is 896 - 512 = 384 MB. Since the
third OOVV array can fit in the scratch memory, only two OOVV arrays, totaling 8192
MB, need to be distributed over the Amplitude Company. 8192 MB divided by 384
MB/PE yields 22 PEs (rounded up). This means each PE will store approximately 187
MB of each OOVV array, and the calculation can proceed if there are at least 22 PEs.
The minimum PE test can be described by the following pseudocode:
mem_dist  = mem_heap - mem_reserved
min_PEs   = ceiling( 3 * mem_oovv / mem_dist )
mem_block = mem_oovv / min_PEs
if (mem_block < mem_minscr) then
   mem_dist = mem_dist - mem_minscr
   min_PEs  = ceiling( 2 * mem_oovv / mem_dist )
end if
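The test above is small enough to run directly. A sketch in Python (all sizes in MB, names taken from the pseudocode), checked against both worked examples:

```python
import math

def min_pes(mem_heap, mem_reserved, mem_oovv, mem_minscr):
    """Minimum number of PEs needed to hold the distributed OOVV arrays.

    mem_heap and mem_reserved are per-PE sizes, mem_oovv is the size of one
    OOVV array, and mem_minscr is the minimum scratch needed per PE.
    """
    mem_dist = mem_heap - mem_reserved           # usable memory per PE
    n = math.ceil(3 * mem_oovv / mem_dist)       # try to distribute all 3 arrays
    if mem_oovv / n < mem_minscr:
        # An OOVV block fits inside the scratch region, so only two arrays
        # need distributing; the scratch is carved out of the heap first.
        mem_dist = mem_dist - mem_minscr
        n = math.ceil(2 * mem_oovv / mem_dist)
    return n

print(min_pes(1024, 128, 4096, 128))   # Example 1 -> 14
print(min_pes(1024, 128, 4096, 512))   # Example 2 -> 22
```

Both calls reproduce the numbers derived in Solutions 1 and 2.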
If the user runs the job with more than twice the minimum number of PEs, then the
Amplitude Company can divide into multiple platoons, each with its own copy of T_OLD
and T_NEW. One of the benefits of this approach is that fewer PEs will access each block of
distributed memory, but the main advantage is that knowledge of the underlying
hardware can be programmed into the partitioning scheme and the algorithm will
transparently align itself with the hardware boundaries.
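One plausible partitioning rule (an assumption; the text does not give a formula) is that each platoon must be at least the minimum size, so the platoon count is the integer quotient:

```python
def platoon_count(n_pes, min_pes_needed):
    """Number of amplitude copies (platoons) the Company can support.

    Each platoon must hold a full copy of the distributed amplitudes, so it
    needs at least min_pes_needed members; leftover PEs are spread over the
    platoons (making one of them the largest, "Heavy", platoon).
    This rule is a sketch, not taken from the dissertation.
    """
    return max(1, n_pes // min_pes_needed)

print(platoon_count(50, 14))   # -> 3 platoons, each with its own T_OLD/T_NEW
print(platoon_count(20, 14))   # -> 1 platoon (fewer than twice the minimum)
```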
Arrays stored on disk
The VVVV and VVVO integrals will not be transformed and stored on disk, so
their contributions must be calculated on the fly in the AO basis. The OOOO and OOOV
integrals will not be stored on disk either, not because they are too large, but because
their contributions can be accounted for with intermediates from the direct integrals and
from the stored OOVV half-transformed integrals. The stored integrals can be distributed
over and individually processed by the platoons of the Amplitude Company.
Parallel UHF CCSD
Each CC iteration begins with initializing the global intermediates and residual
amplitudes to zero. The T_OLD amplitudes are defined by Equation 4-1. This formulation
allows the amplitudes to contract directly with AO integrals. The T2 residual amplitudes
will have different definitions depending on which part of the algorithm is executing. In
the Direct Integral (DI) phase, one of the virtual indices comes from an uncontracted
AO integral, so the residual is related to the MO residual amplitudes by Equation 4-2.
t_i^μ = Σ_a c_a^μ t_i^a    (4-1a)

t_ij^{μν} = Σ_{ab} c_a^μ c_b^ν t_ij^{ab}    (4-1b)

r_Ij^{Aν} = Σ_b c_b^ν r_Ij^{Ab}    (4-2)
The Stored Integral (SI) and One Particle (OP) phases calculate residual
contributions with both virtual orbitals coming from T, so the DI residual is transformed
according to Equation 4-3.
r_Ij^{μν} = Σ_A c_A^μ r_Ij^{Aν}    (4-3)
Throughout the iteration, each PE stores all OO pairs (ket indices) for a range of
AO shell pairs (bra indices). The amplitudes are distributed over the platoon (Figure 4-2),
and one-sided communications retrieve and increment all OO pairs for a block of AO
shell pairs.
[Layout: for MN shell pairs 1, 2, 3, ... the spin blocks of each pair are stored contiguously, with consecutive ranges of MN pairs assigned to PE 0, PE 1, and so on.]
Figure 4-2. The T2 and R2 memory layout. Each square represents the four-dimensional array of amplitudes that have the designated spin pair and whose AO bra indices belong to the MN shell pair. The arrows represent the order in memory.
After all of the contributions have been calculated, the residual is antisymmetrized
with respect to the two virtual indices and backtransformed to the MO basis. The new T
amplitudes are calculated by dividing each amplitude by the diagonal elements of the
Fock matrix.
Direct Integral Phase
Design considerations
Analyzing the T2 contribution in Equation 4-4 can summarize the DI phase. Since
the VVVO integrals are not transformed and stored prior to the CC iterations, the terms
must be evaluated in the AO basis.
r_Ij^{Ab} += Σ_{KCd} t_K^A ⟨Kb|Cd⟩ t_Ij^{Cd} + Σ_{lCd} t_l^b ⟨Al|Cd⟩ t_Ij^{Cd}    (4-4)

r_Ij^{μν} += Σ_{Kλσ} t_K^μ ⟨Kν|λσ⟩ t_Ij^{λσ} + Σ_{lλσ} t_l^ν ⟨μl|λσ⟩ t_Ij^{λσ}    (4-5)
The biggest problem with this situation is that every AO integral causes a global
update to R2 (i.e., one integral contributes to every R2 amplitude). The best way to reduce
the number of operations is to precondition R2 such that a smaller set of VVVO
contributions is calculated. R2 would then be post-transformed after the contributions
have been accumulated (Equations 4-6 through 4-8).
r_Ij^{Aν} = Σ_b c_b^ν r_Ij^{Ab}    (4-6)

r_Ij^{Aν} += Σ_{Kλσ} t_K^A ⟨Kν|λσ⟩ t_Ij^{λσ}    (4-7)

r_Ij^{Ab} = Σ_ν c_b^ν r_Ij^{Aν}    (4-8)
For each integral, only an IjA strip of R2 amplitudes is incremented, compared to
the full set without the backward and forward transformations of R. The difficulty with
the αβ amplitudes is that while the β side can get away with 1/V of the updates, the α side still
requires a global update per integral (Equation 4-9). The μ orbital must be forward-transformed
into A, but the ν orbital must be transformed into b by T1, which must then
be back-transformed by C. One solution is to calculate the two spin cases in separate
passes (while transforming R between them). Although this requires generating the AO
integrals a second time, the performance of a naturally parallel local computation is much
greater than a global update per N^4 matrix element.
r_Ij^{Aν} += Σ_{μb} c_A^μ c_b^ν Σ_{lλσ} t_l^b ⟨μl|λσ⟩ t_Ij^{λσ}    (4-9)
A second performance issue is the antisymmetrization of the same-spin integrals.
Equation 4-10 shows one of the αβ ring contributions in MOs, and Equation 4-11 shows
the terms appropriately transformed into the AO basis.
r_Ij^{Ab} += Σ_{kcd} ⟨kb||cd⟩ t_Ik^{Ac} t_j^d    (4-10)

r_Ij^{Aν} += Σ_{kλσ} ⟨kν||λσ⟩ t_Ik^{Aλ} t_j^σ    (4-11)
The AO integrals are not antisymmetrized by the integral generator, so one of two
approaches must be taken to account for both contributions. The first approach involves
two updates per integral. The benefit of this method is that the PE needs only enough memory to store the AO
integral batch and the T2 batch; however, two updates per integral is more costly than two
reads per integral, which can be achieved by antisymmetrizing the ρσ pair.
This requires either two temporary T2 blocks or two serial loads. If there is enough
memory, then the exchange integral can be calculated with the regular Coulomb integral
and the antisymmetrized contribution can be calculated with one read and one update.
Since shell quadruplets of integrals are typically not more than a few hundred kilobytes,
the algorithm will assume this can be done. The permutational symmetry of the
antisymmetrized integrals is not the same as that of the plain Coulomb integrals, so they
must be generated again in the middle of the loop over permutations. Although two
updates or two reads are avoided, the integrals need to be calculated three times per AO
shell quadruplet. Considering the cost of messages compared to the speed of locally
computing a shell quad, this is a favorable tradeoff.
From these considerations, it is clear that the DI phase must calculate the unique
integrals six times even though the AO integrals have 8-fold permutational symmetry.
There are two passes (one for αβ and one for βα) and there are three generations per
pass: one Coulomb and two exchange integrals. This might seem expensive, but the cost
of doubling or quadrupling the number of messages does not scale linearly with the
number of PEs and far outweighs the cost of the naturally parallel integral computation.
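The six-generation count rests on the familiar 8-fold symmetry of a real two-electron integral; a small sketch (chemists' notation (MN|RS); the helper name is ours) enumerating the symmetry-equivalent index quadruplets:

```python
def equivalent_quads(M, N, R, S):
    """The 8 index quadruplets that share the value of the real AO
    integral (MN|RS): bra swap, ket swap, and bra-ket interchange."""
    return [
        (M, N, R, S), (N, M, R, S), (M, N, S, R), (N, M, S, R),
        (R, S, M, N), (S, R, M, N), (R, S, N, M), (S, R, N, M),
    ]

quads = equivalent_quads(1, 2, 3, 4)
print(len(set(quads)))   # -> 8 distinct quadruplets when all shells differ
```

Generating each unique quadruplet six times (two spin passes, three generations per pass) therefore costs less than storing or re-communicating the full 8-fold-redundant set.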
Algorithm
A critical aspect of this algorithm, one that makes or breaks its performance, is one-sided
communication. A one-sided GET of a T2 block can be in flight while the PE
calculates terms that do not involve the T2 amplitudes. Immediately after the amplitudes
have been read, the next block of amplitudes can be requested, even if the PE is not ready
to use them. This technique is also used to hide the communication of R2 updates. The
main difference between the two arrays is that the T2 amplitudes are read once per GET,
while the R2 amplitudes are incremented two or three times per ACCUMULATE. Ideally,
there would be two R2 buffers, one active and one pending, to increase the time
available for each ACCUMULATE. If the user must run the calculation in limited
memory, then it is still possible to benefit from one-sided ACCUMULATEs, but the PE
might have to wait for a while before initializing the buffer for the next R2 contribution.
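The active/pending double-buffer pattern can be sketched in Python, with a background thread standing in for a one-sided GET (the function names are ours, not the algorithm's API):

```python
from concurrent.futures import ThreadPoolExecutor

def process_blocks(fetch, compute, block_ids):
    """Overlap communication and computation with two logical buffers:
    while one block is being computed on, the GET for the next block is
    already in flight. `fetch` stands in for a one-sided GET of a T2
    block; `compute` consumes the block."""
    if not block_ids:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, block_ids[0])   # first GET
        for i, bid in enumerate(block_ids):
            block = pending.result()                 # WAIT on active buffer
            if i + 1 < len(block_ids):
                pending = pool.submit(fetch, block_ids[i + 1])  # next GET
            compute(bid, block)                      # overlaps the pending GET

done = []
process_blocks(lambda b: b * 10, lambda bid, blk: done.append((bid, blk)), [1, 2, 3])
print(done)   # -> [(1, 10), (2, 20), (3, 30)]
```

The same shape, with an extra pending buffer on the write side, hides the R2 ACCUMULATE latency described above.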
The main driver for the DI phase loops over AO shell quadruplets of integrals. The
MN shell pairs are distributed over all of the PEs in the Amplitude Company. This
distribution can happen in a number of ways: round robin, blocked, block cyclic, job board,
master/slave, etc. How a PE processes each MN shell pair is independent of the shell-pair
assignment, so this can be tweaked outside the main processing loop. Ideally, there would
be a globally accessible integer that corresponds to the next MN pair that must be
processed. The integer must be atomically read and incremented by each PE to get its
next assignment. This balances the load better than a fixed distribution algorithm like
round robin or block cyclic, and it performs the same function as a master process
sending out new tasks without actually needing a process to sit idle waiting for requests.
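A minimal sketch of such a job-board counter, using a lock to stand in for the atomic read-and-increment (the class name is ours; a real implementation would use a remote fetch-and-add):

```python
import threading

class ShellPairCounter:
    """Global job board: each PE atomically claims the next MN shell pair."""

    def __init__(self, n_pairs):
        self.n_pairs = n_pairs
        self.next_pair = 0
        self.lock = threading.Lock()

    def claim(self):
        """Atomically read and increment; returns None when no work is left."""
        with self.lock:
            if self.next_pair >= self.n_pairs:
                return None
            mn = self.next_pair
            self.next_pair += 1
            return mn

counter = ShellPairCounter(5)
claimed = []
while (mn := counter.claim()) is not None:
    claimed.append(mn)        # a fast PE simply comes back here more often
print(claimed)                # -> [0, 1, 2, 3, 4]
```

PEs that finish early return to `claim()` more often, which is exactly the self-balancing behavior the paragraph describes.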
Once an MN shell pair is identified, the PE is responsible for generating and
processing all unique RS shell pairs that correspond to the MN upper bound. This
prevents any PE from processing Coulomb or exchange integrals that another PE is
responsible for. It also allows for a single generation of the integrals for each pass if there
is enough scratch memory (i.e., the PE would generate all RS pairs and internally
antisymmetrize them before beginning the processing loop). Typically, however, there
will be enough memory for two shell quads and the scratch buffers for contracting terms.
A shell quadruplet of integrals will contribute to all of the Group 1 terms, the
Group 2 terms, and the Group 4 intermediates. Table 41 summarizes the stages of the
processing loop and which elements are accessible at each stage.
Table 4-1. The direct integral processing loop summary
Loop section   Contributions                 Description
prologue       (IH, I2) += f(V,C,T1,T2)      all Group 1 terms and the IH intermediate
stage 1        R2 += f(V,C,T1,T2,I2)         all contributions to R2 from Group 2 and Group 1
stage 2        IP += f(V,C,T2)               the IP intermediate
epilogue       IR += f(V,C,T1)               the IR intermediate
The Group 1 terms are saved in a two-particle intermediate during the loop over the
T1 amplitudes from the Group 2 terms. The I2 intermediate is then contracted with Q
(C*C) and the Group 1 contribution is sent with the Group 2 contribution to increment
R2. If there is not enough scratch memory to store the I2 intermediate, then it is possible
to calculate these terms in a single direct-integral pass separately from the Group 2 terms
or the IP or IR intermediates. Table 4-2 has the detailed order of operations for the
prologue of processing a shell quad. The w one-particle matrix element is defined in
Equation 4-14.
w_i^μ = c_i^μ + t_i^μ    (4-14)
When the prologue finishes, there are two important quantities stored in memory:
the Group 1 contribution and the T2 amplitudes that generated it. Since the Group 2
contributions loop over all L shells for a given S shell, it is convenient to begin the loop
with L=R. Table 4-3 lists the detailed order of operations for the L loop. The x one-particle
matrix element is defined in Equation 4-15.
x^{μν} = Σ_j c_j^μ t_j^ν    (4-15)
After the last value of L, there is still a pending R2 ACCUMULATE, which must
be completed. Rather than WAIT immediately after the loop, the IR intermediate, which
requires only T1, C, and V, can be calculated. Furthermore, if there are more
permutations of V, then the PE can GET the next T2(RS) block and permute the integrals.
Table 4-4 lists the detailed order of operations for the permutation epilogue.
The first four permutations of V involve the same two sets of shell quads.
The second set of four permutations requires a different exchange shell quad, so the
integral generator is invoked a third time per MNRS configuration.
Figure 4-8 shows the movement of data needed to calculate the contributions to R2
from an integral shell pair.
[Data flow: each PE reads and increments the global shell pair counter, GETs the T2 blocks held by the squads of both platoons, and ACCUMULATEs its contributions into the distributed R2.]
Figure 4-8. The data flow in the Direct Integral phase. Solid arrows represent one-sided messages, and dashed arrows represent reading and incrementing the global shell pair counter. Each PE retrieves all T2 amplitudes that contract with the integral and accumulates the contributions into the R2 array.
When the Amplitude Company has calculated all of the MN shell pairs, the R2
distributed arrays are transformed for the phase that processes the stored OOVV
integrals. The one-particle global intermediates could be reduced here for use in the One
Particle phase. The following pseudocode lists the entire flow of control from the
beginning of the CC iteration up to the start of the SI phase.
zero(R2, R1, I2)
W1 = C + T1
do spin = 2, 1, -1
   X1 = Q * C * T1
   SYNCHRONIZE(CO_NET)
   do while (more MN)
      do all RS
         GET T2(RS)
         G = intgen(M,N,R,S)
         H = G - intgen(M,N,S,R)
         do perm = 1, 8
            calc Prologue()
            nLeft = Nshells
            L = R
            do while (nLeft != 0)
               calc InnerLoop()
               nLeft = nLeft - 1
               L = 1 + mod(L, Nshells)
            end do nLeft
            calc Epilogue()
         end do perm
      end do RS
   end do MN
   SYNCHRONIZE(CO_NET)
   if (spin == 2) then
      global R2(Lα,Nβ) = Qβ(Nβ,Dβ) * R2(Dβ,Mα) * Qα(Mα,Lα)
   else
      global R2(L,D) = R2(L,N) * Qα(N,D)
   end if
end do spin
REDUCE(I2, CO_CDR)
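The wrap-around inner loop (start at L = R, visit every shell once, cycling past Nshells back to shell 1) can be sketched as a generator; the helper name is ours:

```python
def circular_shells(R, nshells):
    """Visit every shell exactly once, starting from L = R and wrapping
    from nshells back around to shell 1 (shells are numbered from 1),
    mirroring the  L = 1 + mod(L, Nshells)  update in the driver."""
    L = R
    for _ in range(nshells):
        yield L
        L = 1 + (L % nshells)

print(list(circular_shells(3, 5)))   # -> [3, 4, 5, 1, 2]
```

Starting at L = R (rather than L = 1 for every PE) is what lets the Group 2 loop reuse the T2(RS) block already fetched in the prologue.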
Stored Integral Phase
The SI phase calculates all of the Group 3 terms from the stored integrals with two
occupied and two virtual indices (the αα, αβ, and ββ spin cases), collectively referred to as OOVV. Each PE will load or receive an
integral block, which contains all spin pairs for a particular ρσ shell pair. The mechanism
for obtaining integrals depends on the performance and architecture of the I/O system. If
only one PE per node (or squad) is allowed to read from disk (assuming this is an
operating-system constraint), then the other PEs in the squad will have to request the
blocks from that I/O PE. The I/O PE will then have to juggle its own calculations while
acting as an I/O proxy; otherwise, it could opt not to store R2 amplitudes, but then the
system takes a performance hit as a whole because there is one relatively idle CPU per
node.
Most computers will allow any process to perform I/O and shift the performance hit
of I/O contention to the program. The technique the parallel CCSD algorithm uses to
lessen this penalty is asynchronous file reads. The logistics are the same as onesided
operations in the DI phase in that each PE issues an AIO READ immediately after the
last integral has been read but before other computation has been done. If asynchronous
I/O operations are not available and the user elects not to designate a special I/O PE, then
the algorithm can wait for the integrals up front with a blocking read or it can
synchronize the squad PEs by broadcasting the current block and waiting for all of them
to finish before reading the next block.
Resource competition, such as parallel blocking reads, suffers scaling penalties,
meaning the cost per read increases as the number of PEs per squad increases. The upshot
is that there is still the possibility that some PEs are computing while others are waiting
on I/O. Locked step algorithms, however, usually suffer far worse than resource
competition algorithms. Locked step guarantees that no one is working while a read is
pending and that no one is reading while the PEs are computing. Even if the I/O PE used
asynchronous reads, the locked step approach would suffer from expensive
synchronization delays compared to pure resource competition. From these
considerations, each PE will be responsible for reading its next block of integrals from
disk. If asynchronous reads are not supported, then the PE will default to blocking reads.
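The read-ahead discipline, with its blocking-read fallback, can be sketched as a generator; the thread pool stands in for an AIO READ, and the function names are ours:

```python
from concurrent.futures import ThreadPoolExecutor

def integral_blocks(read_block, shell_pairs, async_io=True):
    """Yield (shell_pair, block) for each RS shell pair, issuing the read
    for the next block before the current one is consumed. With
    async_io=False this degenerates to plain blocking reads."""
    if not shell_pairs:
        return
    if not async_io:
        for rs in shell_pairs:          # fallback: blocking read per pair
            yield rs, read_block(rs)
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(read_block, shell_pairs[0])
        for i, rs in enumerate(shell_pairs):
            block = fut.result()        # WAIT for the pending read
            if i + 1 < len(shell_pairs):
                fut = pool.submit(read_block, shell_pairs[i + 1])
            yield rs, block             # compute while the next read runs

for rs, blk in integral_blocks(lambda rs: rs * 2, [1, 2, 3]):
    pass                                # Group 3 contractions would go here
```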
The OOVV integrals will be distributed over platoons by block-cycling the AO
shell pair kets. Every PE in a platoon must read and process all of the integrals assigned
to its platoon. The integrals will only contribute to the R2 amplitudes that the PE stores,
but the T2 amplitudes must be read (with one-sided GETs) from other PEs in the platoon.
Every shell pair of integrals results in four updates to all of the local R2 amplitudes,
one for each subset of Group 3 Set 1 and one for Group 3 Set 2. Since Group 3 involves
contributions in which all virtual orbitals come from T, the virtual orbitals of the integrals
must be the internal summation indices. The general approach to calculating the Set 1
subsets is to loop over λ and σ for a fixed ρσ pair. Each subset corresponds to a different
configuration of these nested loops.
Figure 4-3 shows the path of updates through the local set of R2 amplitudes. Table
4-5 shows the detailed order of operations for calculating Group 3 Set 1 Subset 1
(AC/BD is LR/DS in AO shells).
Accessing memory in non-unit stride is highly unfavorable when performance is
the utmost concern; however, each square in Figure 4-3 represents three four-index arrays
such as those shown in Figure 4-2, and those updates can be performed in unit stride.
Furthermore, the LD loop order is only used in Subset 1. Tables 4-6, 4-7, and 4-8 show
the operations for Subsets 2 and 3 of Set 1 and for Set 2, respectively. Each steps through the
LD pairs in DL order.
Figure 4-3. The path of Group 3 Set 1 Subset 1 updates per OOVV integral. Dashed
lines show memory continuity and solid arrows show the loop path.
There is some flexibility in choosing the order in which the Group 3 sections are
calculated. The most important design consideration is choosing which section will be
performed last, since it will AIO READ the next RS shell pair of OOVV integrals from
disk. Subset 2 of Set 1 has the largest number of operations after the last integral is used,
so this section should be performed last.
Choosing a section to run first depends on how many operations can complete
during the first GET before reading the T2 amplitudes. Subsets 1 and 3 of Set 1 and Set 2
all use the amplitudes almost immediately, but each requires a different initial shell pair.
Set 2 reads only one shell pair, T2(RS), and neither shell contributes to the final R2
amplitudes, so the probability that T2 is local is inversely proportional to the number of
PEs in the platoon. The Subsets of Set 1 fare much better since at least one shell of T2
contributes directly to a shell of R2. If the amplitudes are distributed in bands of L, then
Subset 3 has a good chance of T2 being local, and if the amplitudes are distributed in
bands of D, then Subset 1 is better off executing first. Figure 44 illustrates the two
possibilities.
[Two layouts of the LD shell-pair grid: contiguous bands of L shells versus contiguous bands of D shells.]
Figure 4-4. The LD shell distribution schemes. Each PE stores both T2 and R2
amplitudes for its block of LD shell pairs. Set 1 Subsets 2 and 3 favor L
bands, while Subset 1 favors D bands.
The parallel algorithm uses L bands since two of the three subsets favor them as far
as having local T2 amplitudes per R2 update. Subset 2 is already designated to run last, so
subset 3 should be performed first because it has the highest chance of having no wait
time for the initial T2 GET. The decision to run Set 2 before or after Subset 1 can be
made on the same argument. Since Set 2 updates the entire set of local LD pairs after
reading the T2(RS) amplitudes, it offers the most time for the initial GET of Set 1 Subset
1 to complete before reading them. Therefore, the calculation order of the Group 3 terms
is Set 1 Subset 3 (S1S3), Set 2, S1S1, and S1S2.
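With L bands, ownership of an LD shell pair depends only on L, which is why T2(L, anything) is local whenever L falls in a PE's band. One plausible mapping (the formula is an assumption; the text does not give one):

```python
def owner_of(L, nshells, n_pes):
    """Owner PE of the T2/R2 block for shell pair (L, D) under an L-band
    distribution: contiguous bands of L shells map to consecutive PEs,
    and D never affects ownership. Shells are numbered from 1."""
    band = -(-nshells // n_pes)        # ceil(nshells / n_pes) shells per PE
    return (L - 1) // band

# 10 shells over 3 PEs: L = 1..4 -> PE 0, L = 5..8 -> PE 1, L = 9..10 -> PE 2
print([owner_of(L, 10, 3) for L in range(1, 11)])
```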
Figure 4-9 shows the flow of data in the SI phase. Each PE loops over all of the
shell pairs that were assigned to its platoon and calculates only the terms that contribute
to the local R2 amplitudes. The following pseudocode lists the entire flow of control
from the end of the DI phase up to the start of the One Particle phase.
AIO_READ V(RS0)
GET T2(RD0)
do RS = RS0, RS1
   WAIT V(RS)
   calc G3S1S3( GET T2(RS) )
   calc G3S2( GET T2(L0R) )
   calc G3S1S1( GET T2(SD0) )
   calc G3S1S2( AIO_READ next V(RS); GET next T2(RD) )
end do RS
[Data flow: each platoon stores its assigned shell pairs of OOVV integrals on disk; every PE reads the integral blocks of its platoon, GETs T2 from squad members, and updates its local block of R2.]
Figure 4-9. The data flow in the Stored Integral phase. Each PE processes all of the
integrals that are stored by its platoon and updates its local block of R2.
One Particle Phase
The OP phase is responsible for contracting the Fock matrix elements of the
reference wavefunction (f_pq) and the one-particle intermediates (IP, IH, and IR) with the T
amplitudes. There are three types of contractions: T2 into R1 (Group 4), T1 into R1 (Group
5 Set 1), and T2 into R2 (Group 5 Set 2). Only one PE must do the T1-into-R1 contractions,
provided it has the complete one-particle intermediate (it should, since the contributions were
reduced at the end of the DI phase). These transformations, shown in Equation 4-16 for
T1, are insignificant performance-wise, and little is gained by parallelizing them.
r_I^A += Σ_D f_AD t_I^D - Σ_L f_LI t_L^A    (4-16a)

r_I^A += Σ_D IP_AD t_I^D - Σ_L IH_LI t_L^A + Σ_{ld} IR_ld t_Il^{Ad}    (4-16b)
The T2-into-R1 contributions can be distributed over the PEs in one platoon, since they
are limited to local reads and local writes; however, while the T2-into-R2 terms also have
local writes, they have both local and remote reads. The remote reads come from the
terms in Equations 4-17 and 4-18. The β part of FP is defined in Equation 4-19. (The α
part is analogous.)
r_Ij^{Aν} += Σ_σ FP_σ^ν t_Ij^{Aσ}    (4-17)

r_Ij^{Aν} += Σ_μ FP_μ^A t_Ij^{μν}    (4-18)

FP_σ^ν = Σ_{ad} c_a^ν f_ad c_d^σ - Σ_{ld} t_l^ν (f_ld + IR_ld) c_d^σ    (4-19)
Parallelization of these particle terms should reduce the number of messages a PE
must send as the number of PEs in the company increases. This implies distributing the
summation index (ρ or σ) over the platoons; however, since the SI phase prefers L bands
(the ρ index), it is likely that most (if not all) of the T2(RD) GETs are local, so it is not as
critical to performance to distribute ρ over the platoons as it is for σ. For the σ loop,
each PE must get data from multiple target PEs that contribute to its local R2 set (Table
4-9). Table 4-10 shows the other FP terms.
The T2-into-R2 hole terms and the T2-into-R1 terms (along with the latter FP terms) should
be performed by the largest platoon (HVY_PLT), even if that platoon has only one more
PE. During this time, all of the other platoons will send their contributions via
ACCUMULATE to the Heavy Platoon. Once these operations are complete, the only
remaining tasks are antisymmetrizing the same-spin R2 amplitudes and updating the old
T2 amplitudes with R2. Table 4-11 shows the operations of the Heavy Platoon while the
other platoons are sending their contributions, and Equations 4-20 and 4-21 show the
terms that contribute to FH and FR, respectively.
FH_L^I = f_LI + Σ_d t_I^d f_Ld + Σ_ν c_I^ν IH_Lν + Σ_{νσ} c_I^ν IR_νσ w_I^σ    (4-20)

FR_A^d = f_Ad + Σ_μ c_A^μ IP_μd    (4-21)
The following pseudocode lists the operations that occur from the end of the SI
phase up to the start of the antisymmetrization and update phase.
if (CO_CDR) then
   make FP, FH, FR
   BROADCAST(F1, CO_NET)
   calc T1_into_R1()
else
   RECEIVE(F1, CO_NET)
end if
calc FP_T2up()
if (HVY_PLT) then
   calc other_FT2()
else
   ACCUMULATE(R2, HVY_PLT)
end if
REDUCE(R1, CO_CDR)
Antisymmetrization and T2 Update
The αα and ββ R2 amplitudes must be antisymmetrized with respect to AO shell
pairs. Equation 4-22 shows this operation. All of the R amplitudes must then be divided
by the diagonal elements of the Fock matrix in order to generate the next set of T2
amplitudes. Each PE in the Heavy Platoon, having collected all of the R2 amplitudes that
it is assigned to store, will antisymmetrize and transform its amplitudes into T2. The
R2-into-T2 transformation is shown in Equation 4-23, and the diagonal matrix elements of f
are shown in Equation 4-24.
r_IJ^{μν} = r_IJ^{μν} - r_IJ^{νμ}    (4-22)

D_Ij^{Ab} = f_II + f_jj - f_AA - f_bb    (4-23a)

t_Ij^{Ab} = r_Ij^{Ab} / D_Ij^{Ab}    (4-23b)

f_AA = Σ_{μν} c_A^μ f_{μν} c_A^ν    (4-24)
The R1 amplitudes must be transformed in the same way in order to generate the
next set of T1 amplitudes. One PE should do this since the transformation is trivial
(Equation 4-25).

t_I^A = r_I^A / (f_II - f_AA)    (4-25)
Table 4-12 lists the operations of the Heavy Platoon for antisymmetrizing and
transforming R2. The code listing that follows shows the operations of the Amplitude
Company from the end of the OP phase up to the next CC iteration.
if (HVY_PLT) then
   make T2
end if
if (CO_CDR) then
   make T1
   BROADCAST(T1, CO_NET)
else
   RECEIVE(T1, CO_NET)
end if
if (not HVY_PLT) then
   GET T2
end if
The order of the loop nesting, LD versus DL, does not affect the results, but the
performance of the loops might differ from one computer architecture to the next. The LD
ordering issues multiple GETs to the same PE (down the L column) before cycling up
to the next column. If every PE begins with L=1, then the first PE in the platoon will be
bombarded with the GETs of every PE at once. Normally, GET operations do not require
exclusive locks, in which only one PE at a time is allowed to access the data, but the efficiency
of the messaging subsystem might be a factor in performance. This would be less
important in the DL ordering.
Another way in which the messaging subsystem could affect performance is by
"remembering" which PE was accessed last. In this situation, once a GET is established
between two PEs, any successive GET to the remote PE is less expensive than switching
to another PE. This would actually favor the LD ordering and penalize the DL ordering.
In either case, the code can be trivially designed to switch between the two depending on
the architecture the program is running on.
Table 4-2. The Direct Integral phase prologue memory map
[Grid of the prologue operations (rows) against the V (G and H), T2, I2, and W2 buffers (columns), marking each buffer as written, read, or incremented at each step; the WAIT on T2(RS) falls in the middle of the sequence.]
Each operation is implicitly summed over all shared indices. The G column of V
corresponds to the set of Coulomb integrals, and the H column is the set of
antisymmetrized (Coulomb - Exchange) integrals. The values w, r, and i stand
for write, read, and increment, respectively. Shaded cells are involved in pending
communications and may not be read from or written to (without risking data
inconsistencies). Temporary arrays (W2) are outlined to show their scope of dependence.
Table 4-3. The Direct Integral phase inner loop memory map
[Grid of the inner-loop operations against the V (G, H), T2, R2(1), R2(2), I2, W2, and X2 buffers, marking reads, writes, and increments; includes the conditional WAITs and GETs on T2(LS) and T2(LR).]
The R2(1) buffer is the active write buffer, and the R2(2) buffer is the access-forbidden
buffer from the previous L shell. The loop starts with L=R, continues up to Nshells, wraps
back to L=1, and continues up to L=R-1.
One of the two R2 buffers will be access-forbidden. The last permutation does not need to
GET another T2(RS) block; otherwise, the GET should be issued as soon as possible. If
the permutation index is 4, then a different exchange integral will be generated directly.
The (w) cells are write-optional, meaning the operation might or might not need the
scratch memory.
Table 4-4. The Direct Integral phase permutation epilogue memory map
[Grid of the epilogue operations against the V (G, H), T2, R2(1), R2(2), and scratch buffers:]
if (perm == 8) then
   IR += f(C, T1, G)
   WAIT R2
else
   GET next T2(RS)
   IR += f(C, T1, G)
   permute V
end if
Table 4-5. The Stored Integral phase Group 3 Set 1 Subset 1 memory map
[Grid of the Subset 1 loop structure against the V, T2(1), T2(2), and intermediate buffers: GET T2(L0R); loop over L with WAIT T2(LR); inner GET T2(D0S) and loop over D with WAIT T2(DS) and a prefetch GET of the next block; intermediate contractions and local R2 increments.]
During the loop over D, the T2 buffers alternate active/inactive, but the prologue in the L
loop uses the second buffer as scratch memory. The GET in the inner loop over D
requests T2(DS) unless D=D1, in which case it requests the next T2(LR) (provided L is not L1).
Table 4-6. The Stored Integral phase Group 3 Set 1 Subset 2 memory map
[Grid of the Subset 2 loop structure against the V, T2(1), T2(2), and intermediate buffers: GET T2(SD0); loop over D with WAIT T2(SD); GET T2(RL0); prologue contractions; the asynchronous read of the next integral block issued when D=D1; loop over L with WAIT T2(RL), a prefetch GET of the next block, and increments of the local R2.]
The inner loop of this subset has more operations than either of the other two subsets, so
it is calculated last and the asynchronous read is issued during the last iteration.
Table 4-7. The Stored Integral phase Group 3 Set 1 Subset 3 memory map
[Grid of the Subset 3 loop structure against the V, T2(1), T2(2), and intermediate buffers: GET T2(RD0); loop over D with WAIT T2(RD); GET T2(L0S); prologue contractions; loop over L with WAIT T2(LS), a prefetch GET of the next block, and increments of the local R2.]
Table 4-8. The Stored Integral phase Group 3 Set 2 memory map
[Grid of the Set 2 operations against the V, T2, and R2 buffers, repeated per spin case: a W intermediate built from T1 products, its contraction with the integrals over the OO bra pairs, and a loop over the local LD pairs updating R2.]
All of the T2(LD) reads and R2(LD) updates are local operations. If there is not enough
memory for I2, then the Ij/ij/IJ bra pairs can be processed in blocks.
Table 4-9. The One Particle phase FP (β part) into T2 memory map
[Loop structure against the two T2 buffers: GET T2(LS0); loop over S and L with WAIT T2(LS) and a prefetch GET of the next T2(LS); inner loop over D accumulating the FP contraction into the local R2.]
Table 4-10. The One Particle phase FP (α part, and same-spin terms) into T2 memory map
[Loop structure against the two T2 buffers: GET T2(RD0); loop over D and R with WAIT T2(RD) and a prefetch GET of the next T2(RD); inner loop over L accumulating the contributions into the local R2.]
Table 4-11. The One Particle phase FH and FR local transformations
do LD = LD0, LD1
   [build FH and FR from f, T1, and the one-particle intermediates,
   then contract them with the local T2(LD) block into R2(LD)]
end do LD
Table 4-12. The R2 antisymmetrization and T2 update memory map
[Loop structure against the two R2 buffers: GET R2(D0L0); loop over L and D with WAIT R2(DL) and a prefetch GET of the next R2(DL); antisymmetrize the same-spin blocks and divide by the Fock denominators to form the new T2.]
CHAPTER 5
DISCUSSION AND SIMULATION
Discussion
General Comments
The parallel CCSD algorithm distributes almost every operation over the number of
PEs in the Amplitude Company. The few operations that are not distributed are
insignificant to overall performance. Specifically, the worst one-particle transformation
would be a multiplication of two NAO×NAO matrices, which scales operationally like NAO3.
In contrast, the Direct Integral phase multiplies two matrices in a way that scales operationally like
O2NAO4. For a system with 100 electrons (50 α and 50 β) and 500 basis functions, there
are approximately 125 million operations for the serial multiply and over 309 trillion
operations for the parallel multiply. Assuming that every operation takes the same
amount of time and that the parallel multiply is 100% efficient, the maximum theoretical
speedup is 2.48 million times faster (for this system).
lim(NPE→∞) SpeedUp = (tserial + tparallel)/tserial = (1.25×10^8 + 3.09×10^14)/(1.25×10^8) ≈ 2.48×10^6    (5-1)
Obviously this limit is unachievable, and there are multiple operations that slow
down the parallel portions of the CCSD algorithm. Disk I/O, network messages, and
synchronizations all become more expensive as the number of PEs increases. The two
primary techniques that help to reduce these costs are real-time load balancing and
asynchronous operations (i.e., overlapping computation and communication for both disk
reads and remote memory operations).
Load Balancing
Job boards
In this context, a job board is a global structure that any PE can write to or read
from at will. The Direct Integral phase must have a global counter that registers the next
AO shell pair waiting to be processed. PEs that finish their tasks quickly will read and
increment the counter more often than PEs that lag behind the others. Natural load
imbalance or CPU contention from other processes can cause this lag (or drift). CPU
contention frequently happens when the operating system attempts to run more compute-intensive
processes than there are installed CPUs. Even if the parallel program executes
exclusively and the number of PEs equals the number of CPUs, the operating
system kernel itself will compete for CPU cycles and detract from overall scalability.
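As a concrete illustration, the fetch-and-increment job board can be sketched in Python (threads stand in for PEs, and a lock-protected counter stands in for the one-sided global counter; all names here are illustrative, not the dissertation's code):

```python
import threading

class JobBoard:
    """Global counter standing in for the job board: each PE atomically
    fetches and increments it to claim the next AO shell pair."""

    def __init__(self, n_tasks):
        self.next_task = 0
        self.n_tasks = n_tasks
        self.lock = threading.Lock()

    def claim(self):
        # Atomic fetch-and-increment; None means the work is exhausted.
        with self.lock:
            if self.next_task >= self.n_tasks:
                return None
            task = self.next_task
            self.next_task += 1
            return task

def pe_worker(board, done):
    # Fast PEs loop back sooner and therefore claim more shell pairs.
    while (task := board.claim()) is not None:
        done.append(task)  # stand-in for processing one shell pair

board = JobBoard(100)
done = []
workers = [threading.Thread(target=pe_worker, args=(board, done)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

However unevenly the PEs drift, every shell pair is claimed exactly once, which is the property that makes the job board self-balancing.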
Job boards can relieve natural load imbalance in the Direct Integral phase, but the
Stored Integral phase cannot easily use them (if at all). The SI phase has two levels of
parallelization: the OOVV integrals are distributed over platoons and the R2 updates are
distributed over PEs within a platoon. Any attempt to balance the load in real time will
incur multiple synchronization delays since the PEs in a platoon have close to but not the
same number of operations. Instead of real-time load balancing, the AO shells are
grouped together to balance the work needed to process each shell.
Data aggregation
All uses of the word "shell" that correspond to CC data processing refer to
collections of physical AO shells, which are the smallest entities that direct integral
generation programs can calculate. Since the number of operations scales like O2N4, the
difference between the smallest and largest physical shell is a direct measure of the
imbalance in computations per message. The hydrogen STO-3G basis set has 1
contracted orbital in the s shell. Larger basis sets will typically run up to about 16 orbitals
per shell (although the iodine PBS basis has 55 orbitals in the d shell). The hydrogen
shell and the average shell are computationally uneven by a factor of 65536 for the DI
phase. If the messages pertaining to these two shells cost the same, which is a fair
assessment since message latency is an order of magnitude more important than message
size, then one PE will have 65536 times more work to do than another. Simply grouping
as many physical shells together as possible might not be the best solution since it is the
size variability that affects load balancing.
The parallel CCSD algorithm, before running the first iteration, will examine the
collection of physical AO shells and search for the set of logical shells that minimizes the
variance of shell sizes. This is a bounded two-dimensional search along the dimensions
MaxShellSize and NLogShells. The maximum number of logical shells is simply the number
of physical shells. The fewest number of logical shells is 1, and the size of that shell is
equal to the sum of all of the physical shell sizes; however, given a maximum shell size,
the fewest number of shells is defined in Equation 52.
NLogShells ≥ (Σ physical shell sizes) / MaxShellSize    (5-2)
Physical AO shells should not be broken up since generating direct integrals costs
less when orbitals of the same angular momentum and atomic center are calculated at
once. Equation 5-2 is actually an inequality because there might not be a perfect grouping
of physical shells such that every logical shell has the same size. The search algorithm
loops over size limits (MaxShellSize) starting with the largest physical shell size and
ending with a predefined limit or the sum of all physical shell sizes, whichever is smaller.
For every potential MaxShellSize, the algorithm also loops over the number of logical
shells starting from the theoretical minimum up to the number of physical shells.
The problem of putting objects with mass (physical shells with orbitals) into groups
with a mass limit (logical shells with up to MaxShellSize orbitals) is well known in the
field of Computer Science as the Bin Packing Problem (BPP) [Cof97]. This is an NP-hard
optimization and there are many algorithms for finding solution candidates. Rather
than solve the BPP, the CC algorithm uses the Worst Fit Decreasing (WFD) offline
algorithm limited to a known set of bins (logical shells) and measures the variance of
sizes (Appendix E). The partitioning scheme with the lowest variance should have the
best balance between computation and messaging. WFD is ideal for finding possible
partitions for this particular problem since it adds the next smallest physical shell to the
logical shell with the fewest orbitals.
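The grouping pass described above can be sketched as follows (a Python illustration, not the dissertation's code; the shells are read in decreasing order, so each one added is the next-smaller shell, and the sample sizes and bin count are made up):

```python
from statistics import pvariance

def wfd_partition(shell_sizes, n_logical):
    """Worst Fit Decreasing with a fixed number of bins: place each
    physical shell, largest first, into the logical shell that
    currently holds the fewest orbitals; report the size variance."""
    bins = [[] for _ in range(n_logical)]
    loads = [0] * n_logical
    for size in sorted(shell_sizes, reverse=True):
        i = loads.index(min(loads))  # logical shell with fewest orbitals
        bins[i].append(size)
        loads[i] += size
    return bins, pvariance(loads)

# Hypothetical physical shell sizes (orbitals per shell):
bins, var = wfd_partition([1, 1, 3, 3, 6, 6, 10], 3)
```

The outer search then repeats this for every candidate (MaxShellSize, NLogShells) pair and keeps the partition with the lowest variance.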
Asynchronous I/O
Most parallel programs are hampered by the increasing cost of messages as the
number of PEs increases. One-sided operations allow PEs to process data while messages
are moving through the network. The two limits of this scheme are: 1) the CCSD
calculation is free because the messages take so long to arrive and 2) the program scales
almost linearly with the number of PEs because global data is available the moment it is
read. Given the complex task and data distributions, it is likely that some portions of the
algorithm will be bound by communication costs and others by computations.
Reading the OOVV integrals from disk should not be a performance bottleneck
considering the amount of computation that each PE will do while the AIO READ is
being fulfilled. The inner loops of the SI phase can become heavily dependent on T2 GET
operations if the logical shells are not large enough to saturate the retrieval with
computation. The easiest way to increase the number of operations per message is to
increase the size of a logical AO shell pair. The difficulty in doing this is that the local
scratch memory requirements scale like N4 with respect to shell size. As more scratch
memory is needed, the less heap there is for distributed data and ultimately more PEs will
be needed. The problem of balancing shell size, scratch memory, and data distribution is
not solved, but as the algorithm has been stated, the data distribution schemes can be tried
independently of the data contraction kernels. For example, if more scratch memory is
needed but there is no possibility to increase the heap or the number of PEs, then the DI
and SI phases can be further divided to allow for fewer intermediates at the cost of more
integral processing.
Synchronization Delays
An attractive feature of the parallel CCSD algorithm is that no PE is required to
stall in a wait call for another PE unless the state of their globally accessible data changes
(i.e., if R2 is transformed or T2 is updated). A barrier in message passing programs is a
routine that causes every PE to wait until all of the PEs in a communication network have
entered the barrier. Certain collective operations can function like a barrier and some
pairs of functions act as a true barrier. Reductions involve every PE sending some data to
one PE, which will combine all the data into one element. Broadcasts involve one PE
sending some data to every PE. Individually, certain PEs may continue after the operation
whether or not other PEs have finished their operations, but properly configured, a
reduction and a broadcast can together function like a barrier. No PE will proceed until it
has received data from one PE, which will not send the reduced element before receiving
the contributions from all of the other PEs.
The concept of a fence comes from the MPI-2 standard, which defines it as a
collective barrier that ensures all one-sided communications on a shared object are
complete before leaving the barrier. Synchronizations of any sort occur essentially 10
times in each CC iteration:
1. *initialize R and synchronize T before the βα DI phase → CO fence (Appendix A)
2. complete the R2(βα) contributions from the βα DI phase → PLT fence
3. complete the first R2 transformation into scratch → PLT fence
4. copy scratch into R2 before the αβ DI phase → PLT fence
5. *complete the R2(αβ) contributions from the αβ DI phase → CO fence
6. reduce F1 contributions onto the CO CDR → CO reduce
7. receive F1 from the CO CDR → CO bcast
8. complete the second R2 transformation into scratch → PLT fence
9. copy scratch into R2 before the SI phase → PLT fence
10. complete SI and OP contributions → CO fence
11. complete the R2→T2 update → CO fence
12. receive T1 from the CO CDR → CO bcast
In the steady state of iterating over T, the first and last synchronizations can be
overlapped (i.e., point 1 is the same as point 12 in the list). Point 5 can be overlapped
with point 6 in that reducing F1 signals the end of the αβ DI phase (Table 5-1).
Collectively, the number of synchronizations does not scale with the problem size;
however, the cost of each synchronization point will increase as the number of PEs
increases.
Simulation
The million-dollar question that every parallel algorithm must answer is, "Does it
scale?" Amdahl's Law [Amd67], which neglects all process contention and performance
degradation, describes the maximum theoretical speedup for a fixed problem size running
on a fixed number of processors. The simplest expression of this law is shown in
Equation 5-3.
Each platoon independently transforms R2 and separates each step with a PLT fence
(dashed lines in Table 5-1). The shaded rows correspond to collective communications across the
Company. r, w, i, and 0 stand for read, write, increment, and initialize to zero, respectively.
The largest platoon calculates T1 while the other platoons update their copies of T2.
SpeedUp = (tserial + tparallel) / (tserial + tparallel/NPE)    (5-3)
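A quick numeric check of Equation 5-3 (a Python sketch; the operation counts reused below are the ones from Equation 5-1, standing in for times):

```python
def amdahl_speedup(t_serial, t_parallel, n_pe):
    """Amdahl's Law: the serial part stays fixed while the
    parallel part divides over the PEs."""
    return (t_serial + t_parallel) / (t_serial + t_parallel / n_pe)

# With the Equation 5-1 operation counts as stand-ins for times,
# the speedup approaches 1 + t_parallel/t_serial as n_pe grows.
s = amdahl_speedup(1.25e8, 3.09e14, 10**9)
```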
However, the parallel CCSD algorithm does not spend the same percentage of time in the
parallel parts relative to the serial parts for every molecular system. As the problem size
increases, the percentage spent in the serial part decreases; thus, the parallel speedup will
increase with system size. This is loosely related to the Gustafson-Barsis Law, which
observes that, for data-parallel algorithms, a parallel computer can do bigger calculations
relative to the serial computer as the number of processors increases [Gus88]. The
parallel CCSD algorithm leaves N3-type operations for one CPU (in the form of N×N
matrix multiplications). A few one-particle N5 operations and the N4 T2 update are
distributed over the largest platoon, but most N5 and all N6 operations are distributed over
the entire Amplitude Company. Furthermore, remote data is retrieved as early as possible
Table 5-1. The schedule of synchronizations
Phase                 Heavy Platoon           Light Platoon
                      T1  R1  T2  R2  Scr     T1  T2  R2  Scr
DI:βα                 r       r   i   w/r     r   r   i   w/r
R(D,M)→S(L,N)                     r   w               r   w
S(L,N)→R(L,N)                     w   r               w   r
DI:αβ                 r       r   i   w/r     r   r   i   w/r
Broadcast F1          (collective across the Company)
R(L,N)→S(L,D)                     r   w               r   w
S(L,D)→R(L,D)                     w   r               w   r
SI; OP: FpP·T2
OP: FPu, FH, FR
antisym; 1/D                      w   r               0
update T2             w       w   0               w
and used as late as possible. This provides an ideal countermeasure for variable network
performance.
General Overview
A simulator is used to test a design before building a prototype or the final product.
Simulating software performance (scalability in the case of parallel CCSD) usually
involves estimating computation and data transfer times, which requires hardware
parameters such as CPU speed and efficiency; memory, disk, and network I/O latencies
and bandwidths; etc. These parameters can be measured and will be different for various
computer architectures; however, the goal of the software simulator is not to emulate
hardware. By using a variety of parameters, the simulator should be able to expose
general strengths and weaknesses of an algorithm that might need to be redesigned.
The simulator used to test the parallel CCSD algorithm constructs a parallel
network with an array of floatingpoint numbers that represents the clocks of the PEs.
The program assumes the identity of a PE and estimates how long it would take to
perform a set of operations. That time is added to the clock and the simulator switches to
the next PE. For the Direct Integral phase, the simulator maintains a counter that
corresponds to the next integral shell pair, and becomes the PE with the lowest clock (i.e.,
the one that would increment the counter next). Once the counter exceeds the number of
shell pairs, the clocks are synchronized to the highest value and the R2 amplitudes are
transformed. The process repeats itself for the second DI phase.
Binary trees are used to estimate reductions and broadcasts. The Stored Integral and
One Particle phases do not use real-time load balancing, so the simulator cycles through
each PE and adjusts the clock accordingly. Finally, the antisymmetrization and update
phases are estimated with all of the appropriate synchronizations and computations.
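The DI-phase bookkeeping described above can be sketched as (a Python illustration; the per-task times are arbitrary stand-ins for the estimated shell-pair costs):

```python
def simulate_di_phase(n_pes, task_times):
    """Event-driven sketch of the DI phase: the PE with the lowest
    clock claims the next shell pair from the global counter."""
    clocks = [0.0] * n_pes
    counter = 0
    while counter < len(task_times):
        pe = clocks.index(min(clocks))  # the PE that would increment next
        clocks[pe] += task_times[counter]
        counter += 1
    # Synchronize every clock to the slowest PE before transforming R2.
    return [max(clocks)] * n_pes

clocks = simulate_di_phase(4, [3.0, 1.0, 1.0, 1.0, 2.0])
```

The fifth task lands on whichever PE finished its first task soonest, exactly as the job board would assign it.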
Specific Technologies
Local computation. Accurately estimating local mathematical operations is
essential to simulating both parallel and serial performance. Each operation involves
reading data from memory, operating on the data, and writing the results back to memory
(Equation 5-4). S is the data size, B is the memory bandwidth, Nops is the number of
operations, C is the CPU speed, and e is an efficiency modifier. Compiler optimization,
loop ordering, and vendor-supplied mathematical libraries all affect performance, so
multiple efficiencies may be needed to estimate different parts of the algorithm.
tcomp = Sin/Brd + Sout/Bwr + Nops/(C·e)    (5-4)
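In code form, Equation 5-4 is simply (a Python sketch; the sample call reuses this chapter's calibration values purely as an illustration):

```python
def t_comp(s_in, s_out, n_ops, b_rd, b_wr, cpu_rate, eff):
    """Equation 5-4: time to read the inputs, do the work, and write
    the results (sizes in bytes, rates in B/s and ops/s)."""
    return s_in / b_rd + s_out / b_wr + n_ops / (cpu_rate * eff)

# E.g., a local dgemm-like step with the measured memory bandwidths
# (821.59 MB/s read, 266.27 MB/s write) and 750e6 fma/s at 88.9% eff.
t = t_comp(8e6, 8e6, 5e8, 861497664.0, 279201344.0, 750e6, 0.889)
```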
Asynchronous I/O. Disk operations can be estimated with latency, bandwidth, and
the number of channels (Equation 5-5). Latency is usually the time it takes for a disk
head to move to the appropriate position on the disk, Nch is the number of dedicated I/O
channels, and Nmsg is the number of simultaneous disk operations. Disk contention in the
simulator is estimated by multiplying S by the number of PEs on the node (i.e., if one PE is
reading data, then another probably is as well).
trd/wr = tlat + S / (B · min(Nch, Nmsg))    (5-5)
When an asynchronous call is placed, the simulator estimates how long it would
take for the data to arrive and saves the time at which it should be available. The PE must
then call a wait routine, which will compare the current time to the time the data is
useable. Depending on the difference, the simulator will increase the clock (the PE must
wait) or return silently (the data arrived while computing).
One-sided communication. Sending and receiving messages through a network is
very similar to writing to and reading from disk. If two-sided messaging is used, then the
clocks of the origin and target must be modified. Blocking messages require the origin
and target to have the same clock, while non-blocking messages allow for drift. The
parallel CCSD algorithm only uses non-blocking one-sided messages, which are very
similar to I/O operations in that only the origin is affected by the time it takes for data to
arrive. The simulator allows for on-node and off-node messaging parameters so that PEs
in the same squad can communicate at different speeds than those on different nodes.
Some MPI implementations require copying data into and out of system buffers, so
memory bandwidth might also affect the time.
Collective communication. Barriers, reductions, and broadcasts are simulated with
binary trees (Figure 5-10), and the number of stages depends on the number of PEs
involved in the operation (Equation 5-6). (Ceiling brackets mean round up to the nearest
integer.) Equation 5-7 shows the number of messages being sent in each stage i.
Nstages = ⌈log2(NPE)⌉    (5-6)
Nmsg,i = min(NPE − 2^(i−1), 2^(i−1))    (5-7)
Figure 5-10. The binary tree approach to collective communications. 12 PEs require 4
stages, and the last stage only has 4 messages.
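Equations 5-6 and 5-7 can be checked directly (a Python sketch; the form of Equation 5-7 is reconstructed here under the assumption that it must reproduce the 12-PE example of Figure 5-10):

```python
from math import ceil, log2

def tree_stages(n_pe):
    """Equation 5-6: number of binary-tree stages."""
    return ceil(log2(n_pe))

def stage_messages(n_pe, i):
    """Equation 5-7: messages sent in stage i (1-indexed):
    2**(i-1) senders, capped by the PEs not yet reached."""
    return min(n_pe - 2 ** (i - 1), 2 ** (i - 1))

msgs = [stage_messages(12, i) for i in range(1, tree_stages(12) + 1)]
```

For 12 PEs this gives 4 stages with 1, 2, 4, and 4 messages: 11 messages in all, one per PE that must be reached.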
Calibration
Memory bandwidth is measured by allocating a large heap and then flushing it with
data and reading the data back in. The average of 7 sweeps is used for the memory
reading and writing bandwidth. Henceforth, all data will pertain to an IBM POWER3
workstation with four 375 MHz CPUs. The following output was taken from a calibration
run:
512 MB heap at 0x3348f518
wr was 1.924009 > 266.11 MB/s; rd was 0.623147 > 821.64 MB/s
wr was 1.924412 > 266.06 MB/s; rd was 0.623328 > 821.40 MB/s
wr was 1.922241 > 266.36 MB/s; rd was 0.623201 > 821.56 MB/s
wr was 1.922671 > 266.30 MB/s; rd was 0.623183 > 821.59 MB/s
wr was 1.922141 > 266.37 MB/s; rd was 0.623107 > 821.69 MB/s
wr was 1.922495 > 266.32 MB/s; rd was 0.623170 > 821.61 MB/s
wr was 1.922200 > 266.36 MB/s; rd was 0.623147 > 821.64 MB/s
bw_memwr = 266.27 MB/s (279201344.00 B/s)
bw_memrd = 821.59 MB/s (861497664.00 B/s)
CPU throughput is rarely the clock speed of the processor, so efficiency modifiers
are used to estimate the time it would take to compute a set of operations. The two
primary mathematical constructs that the CCSD algorithm uses are two-dimensional
matrix multiplications and N6 array contractions. Many vendors sell highly optimized
linear algebra libraries that are tuned for specific computer architectures. Using these
libraries for standard matrix multiplies can dramatically increase performance if the
program is dependent on such operations. The simulator measures the efficiency of the
double-precision general matrix multiply (DGEMM) by multiplying two 625x625
matrices. The matrices are then recast as two 25x25x25x25 four-dimensional arrays and a
hand-coded six-loop array contraction is timed, wherein the array contraction performs
the same operations as the two-dimensional matrix multiply. The following results were
taken from a calibration run:
CALIBRATION FOR POWER 375 MHz CPU
(2 fma units per CPU)
750000000 fma/s reference
Disk latency and bandwidth are difficult to measure in an isolated environment.
Most operating systems use aggressive file caching, which can have a noticeable effect
on performance curves that compare message size to bandwidth. The parallel CCSD
algorithm is somewhat forgiving of exact performance because all disk reads are
asynchronous; nonetheless, calibrating the I/O performance attempts to bypass the CPU
and memory caches.
The benchmarking program iozone [Nor02] was used to measure the disk latency
and bandwidth, and all files were created with the descriptors O_DIRECT (direct I/O),
O_SYNC (synchronous writes), and O_RSYNC (synchronous reads). Comparing file
( 625 x 625 ) ( 625 x 625 )
dgemm took .387582 s > 88.1849% eff
dgemm took .384157 s > 89.0108% eff
dgemm took .384016 s > 89.0451% eff
dgemm took .383978 s > 89.0543% eff
dgemm took .384088 s > 89.0276% eff
dgemm took .384009 s > 89.0468% eff
dgemm took .384046 s > 89.0378% eff
loops took 1.785683 s > 18.4198% eff
loops took 1.786396 s > 18.4123% eff
loops took 1.785581 s > 18.4208% eff
loops took 1.785778 s > 18.4188% eff
loops took 1.786151 s > 18.4149% eff
loops took 1.785726 s > 18.4193% eff
loops took 1.785444 s > 18.4223% eff
dgemm eff = 0.889153
loops eff = 0.184183
vendor is 4.83 times more efficient
The results are then compared to another set of larger multiplications and contractions to
test for predictability. The simulator uses the latter set of parameters if the results are
precise. The following results were generated using the previous data:
( 30 x 30, 40 x 40 ) x ( 40 x 40, 40 x 40 )
DGEMM: real: 3.549262 s (est: 3.53) eff: .8892
DGEMM: real: 3.547543 s (est: 3.53) eff: .8892
DGEMM: real: 3.547254 s (est: 3.53) eff: .8892
LOOPS: real: 22.458201 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.448650 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.446517 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.450999 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.447324 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.445928 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.449582 s (est: 16.76) 6-loop eff: .1373
record length to the time it took to read a random record produced data that were used to
describe the local disk performance (Figure 511 and Table 52).
Figure 5-11. The local disk random read calibration data. The sets of data are almost
identical no matter what file size (kB) was used between 2 and 512 MB.
Table 5-2. The latencies and bandwidths of random disk reads with respect to file size
File Size   Brd    tlat     File Size   Brd    tlat     File Size   Brd    tlat
2048        29.7   3.5      16384       30.4   3.5      131072      31.2   4.3
4096        30.4   4.0      32768       30.5   2.9      262144      31.5   4.9
8192        30.0   3.2      65536       31.1   4.0      524288      31.4   4.8
File sizes are in kB, bandwidths (Brd) are in MB/s, and latencies (tlat) are in ms. For each
file size, the bandwidth is the inverse of the slope of the linear regression of record
lengths vs. time to read, and the latency is the intercept. (The minimum R2 was 0.999.)
The average bandwidth and latency for all 9 sets were 30.7 MB/s and 3.9 ms,
respectively.
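The regression step behind Table 5-2 can be sketched as (a Python illustration; the sample points below are synthetic, not the measured data):

```python
def fit_latency_bandwidth(record_sizes_mb, times_s):
    """Least-squares fit of t = slope*S + intercept: the bandwidth is
    1/slope (MB/s) and the latency is the intercept (s)."""
    n = len(record_sizes_mb)
    mx = sum(record_sizes_mb) / n
    my = sum(times_s) / n
    sxx = sum((x - mx) ** 2 for x in record_sizes_mb)
    sxy = sum((x - mx) * (y - my) for x, y in zip(record_sizes_mb, times_s))
    slope = sxy / sxx
    return 1.0 / slope, my - slope * mx

# Synthetic reads at 30 MB/s with 4 ms latency:
sizes = [1.0, 2.0, 4.0, 8.0]               # record lengths (MB)
times = [0.004 + s / 30.0 for s in sizes]  # time to read each record (s)
bw, lat = fit_latency_bandwidth(sizes, times)
```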
Measuring network messaging performance was kept relatively simple. The
mpptest program, which ships with the MPICH library [Gro96], was used to measure the
point-to-point (P2P) and bisection bandwidths for messages ranging from 16 to 64 kB.
P2P involves only two PEs, while bisection involves all PEs. Figure 5-12 shows the
on-node P2P and bisection bandwidths measured on the IBM workstation. Latency is taken
from the intercept of the linear regression, not the zero-size transfer time. The regression
analysis for both curves suggests that both pairs of CPUs can communicate without
degradation, so the simulator set the number of on-node communication channels to 2
(bisect: t = 9.8093·S + 209.7, R2 = 0.9982; P2P: t = 9.9477·S + 210.69, R2 = 0.9973)
and used the approximate values of 210 μs and 99 MB/s for latency and bandwidth,
respectively. Three off-node network architectures were tested: switched Fast Ethernet
and two 150 MB/s full-duplex, redundant-path IBM SP switches (with different
configurations). Only P2P tests were used to measure off-node latency and bandwidth
(Figure 5-13), and Equation 5-5 was used to estimate degradation.
Figure 5-12. The calibration data for on-node messaging parameters. Bandwidth is the
inverse of the regression slope, and latency is the intercept. The performance
does not change with one or two pairs of CPUs communicating at the same
time, which implies two independent messaging channels.
Figure 5-13. The calibration data for off-node messaging parameters. The regression
equations are Fast Ethernet: t = 86.987·S + 553.42 (R2 = 0.9998); SP switch 1:
t = 15.304·S + 265.18 (R2 = 0.9915); and SP switch 2: t = 11.942·S + 180.88
(R2 = 0.9970). Bandwidths are 11.2, 63.8, and 81.8 MB/s, respectively.
As in the case of asynchronous I/O, most messages in the parallel CCSD algorithm
are one-sided and will hide behind computation. If the amount of computation is less than
the estimated time spent in communication, then a more accurate messaging function
should be used.
Theoretical Examples
With reasonable hardware parameters and timing algorithms, simulations of real
molecular systems can be used to estimate the performance of the parallel CCSD
algorithm. Table 5-3 lists the 20 amino acids and the corresponding numbers of electrons
and spin pairs. Also listed are the sizes of the open-shell OOVV integrals (and thus the
minimum disk requirements, with 1 GB equal to 1024^3 bytes) and the minimum
number of PEs with 1 GB of heap needed to calculate the UHF CCSD energy if each
system had 1024 AO basis functions. Energy calculations of the anion, cation, and
zwitterion are certainly achievable with the resources of current networks of workstations
and clusters of SMP machines.
For 1024 AO basis functions, the size of the two-electron integrals is 8 TB.
Approximately one-eighth of these need to be calculated on the fly since the others can
be generated by permutations; however, assuming the integrals are dense, the full 8 TB
array must be generated and processed twice (once for βα and once for αβ). Systems of
this size should be executed on hundreds if not a thousand PEs or more. The initial
parallelized software cannot scale this high, much less store the amount of data needed by
each node. The parallel CCSD algorithm overcomes this with fine-grain task distribution
and by distributing large arrays over as few PEs as possible. The net effect is that large
systems can be calculated with few resources at the expense of computing time.
Table 5-3. The minimum resources for each amino acid (assuming 1024 AO basis
functions for the CCSD calculation)
Amino acid      C   H   N   O   S    ne    nOO    Min PEs   Disk (GB)   Disk savings
Alanine 3 7 1 2 0 48 1128 27 8.8 99.92%
Cysteine 3 7 1 2 1 64 2016 48 15.8 99.86%
Aspartic acid 4 7 1 4 0 70 2415 57 18.9 99.84%
Glutamic acid 5 9 1 4 0 78 3003 71 23.5 99.79%
Phenylalanine 9 11 1 2 0 88 3828 90 29.9 99.73%
Glycine 2 5 1 2 0 40 780 19 6.1 99.95%
Histidine 6 9 3 2 0 82 3321 78 25.9 99.77%
Isoleucine 6 13 1 2 0 72 2556 60 20.0 99.83%
Lysine 6 14 2 2 0 80 3160 75 24.7 99.78%
Leucine 6 13 1 2 0 72 2556 60 20.0 99.83%
Methionine 5 11 1 2 1 80 3160 75 24.7 99.78%
Asparagine 4 8 2 3 0 70 2415 57 18.9 99.84%
Proline 5 9 1 2 0 62 1891 45 14.8 99.87%
Glutamine 5 10 2 3 0 78 3003 71 23.5 99.79%
Arginine 6 14 4 2 0 94 4371 103 34.1 99.69%
Serine 3 7 1 3 0 56 1540 37 12.0 99.90%
Threonine 4 9 1 3 0 64 2016 48 15.8 99.86%
Valine 5 11 1 2 0 64 2016 48 15.8 99.86%
Tryptophan 11 12 2 2 0 108 5778 136 45.1 99.59%
Tyrosine 9 11 1 3 0 96 4560 107 35.6 99.68%
The data assume the scratch memory needed by each PE is less than the amount of data
needed to store the third OOVV array during an R2 transformation. Every example saves
over 99.5% of the disk space required by a traditional MO-based CCSD algorithm.
Runtime Simulations
DCEE*
A simulation of the parallel CCSD algorithm was performed on the DCEE* system
described in Chapter 2. Two network scenarios were tested: Fast Ethernet (F.Eth) and the
faster of the two IBM SP switches (henceforth referred to as x3 sw). Each node in the
network was configured as if it were the aforementioned 4-CPU IBM workstation. All
data are compared to 100% and 50% theoretical efficiencies (both are technically linear,
but "linear" scaling commonly refers to 100% theoretical efficiency). The parallel
algorithm requires at least two 1-GB PEs, so all data are compared to this as the reference
point.
Fast Ethernet. The F.Eth simulation shows acceptable parallel performance up to
64 PEs, but beyond that, efficiency drops below 50%, which is considered
unacceptable (Figures 5-14 and 5-15). However, even up to 1024 PEs, the scalability
curve is still increasing. Efficiency shows normal behavior for parallel programs (starting
out high and decreasing asymptotically).
Figure 5-14. The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast
Ethernet. Xth (100%) and Xth (50%) are the theoretical speedup curves
with 100% and 50% efficiencies, respectively. X pred. is the predicted
speedup for the simulation. Efficiency*128 is the efficiency scaled by 128.
Beyond the standard analysis of speedup and efficiency, users might concern
themselves with the idea of cost-to-performance. Another way of stating the idea: "At
what point does the cost of using more PEs outweigh the gain in speedup they might
contribute?" The "cost" function is considered to be the efficiency, and the metric for
"bang per buck" is efficacy (Equation 5-8) [Gho91].
η(p) = E(p) · S(p)    (5-8)
Efficiency times speedup can have multiple maxima, but only the first maximum value
(with respect to the number of PEs) corresponds to the most cost-effective use of PEs in a
parallel run. The F.Eth simulation of DCEE* peaks at 64 PEs (η = 10.2), but the curve is
so shallow that one could argue anywhere between 32 and 96 PEs is still cost effective.
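Equation 5-8 and the first-maximum criterion can be sketched as (a Python illustration; the speedup values below are made up, not the DCEE* data):

```python
def efficacy(speedups):
    """Equation 5-8: eta(p) = E(p) * S(p), with efficiency E(p) = S(p)/p."""
    return {p: (s / p) * s for p, s in speedups.items()}

def first_peak(eta):
    """The first maximum (in PE count) marks the most cost-effective
    number of PEs."""
    ps = sorted(eta)
    for a, b in zip(ps, ps[1:]):
        if eta[b] < eta[a]:
            return a
    return ps[-1]

eta = efficacy({2: 2.0, 32: 26.0, 64: 36.0, 128: 40.0})
best = first_peak(eta)
```

Here the speedup keeps rising through 128 PEs, but efficacy already peaks at 32, illustrating how the two metrics can disagree.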
Figure 5-15. The simulated performance curves for DCEE* on 2 to 128 PEs over Fast
Ethernet. The curves correspond to the same series as in Figure 5-14 with
the exception that efficiency has been scaled by 32, not 128.
SP switch. The x3 sw simulation shows much better performance (Figure 5-16).
Predicted speedup never drops below 50% efficiency up to 1024 PEs, and efficacy has
not peaked. The simulation suggests that the parallel performance for this system
(DCEE*) is highly dependent on communication speed.
Figure 5-16. The simulated performance curves for DCEE* on an SP switch.
Comparison to serial ACES II. Efficacy is a relative metric that depends on the
machine and (molecular) system, but it can be used to compare the effectiveness of
different parallel programs running the same system on the same computer. Figure 5-17
compares the speedup and efficacy of the new algorithm to the previous parallel CCSD
algorithm. The new algorithm scales better and is much more cost effective per PE;
however, the comparison of real times does not fare so well (Figure 5-18).
Figure 5-17. The relative performance of the initial parallelization attempt compared to
the new algorithm. The discontinuity of the previous parallel program comes
from allocating whole nodes with 16 CPUs each.
The new algorithm achieves high scalability by decomposing the equations to such
a degree that the PEs are swamped with calculations, in an attempt to use as little disk
and memory as possible. The current serial ACES II CCSD program is 31.4 times faster
than the new parallel algorithm, but it also requires much more disk space and does not
scale as well. Nonetheless, Table 5-4 lists the numbers of PEs that the new algorithm
would need to match the time achieved by the initial parallelization attempt.
It is clear that the CCSD energy can be calculated quickly with few PEs for
molecular systems that have small data requirements (DCEE* used on the order of 2 GB
of disk space per node). The current ACES II program and the initial parallelization
attempt are more than adequate for reducing the time the new parallel algorithm would
take; however, the new algorithm requires very little disk space and can calculate
energies for very large systems.
Figure 5-18. The absolute performance of the initial parallelization attempt compared to
the new algorithm. The dark lines are the factors of speedup achieved by
using the previous code compared to the reference calculation of the new
algorithm.
Table 5-4. The number of PEs needed to surpass the initial parallel performance
Previous PEs   New PEs   Estimated speedup
1              76        31.8
32             409       146
64             514       176
128            424       151
256            405       145
The simulated speedup curve was fitted with a cubic regression, which was used to
predict how many PEs the new algorithm needs to finish a CC iteration in the same time
that the initial parallelization attempt did.
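The fit-and-invert procedure can be sketched as follows. The sampled speedup values below are placeholders, not the dissertation's simulated data, and `numpy.polyfit` stands in for whatever regression tool was actually used:

```python
import numpy as np

# Hypothetical (PEs, simulated speedup) samples standing in for the
# simulated speedup curve of the new algorithm.
pe_counts = np.array([64.0, 128.0, 256.0, 512.0, 1024.0])
speedups = np.array([60.0, 115.0, 210.0, 370.0, 600.0])

# Cubic regression s(p): least-squares fit of a degree-3 polynomial.
coeffs = np.polyfit(pe_counts, speedups, 3)

def pes_needed(target_speedup, lo=64, hi=4096):
    """Smallest PE count in [lo, hi] whose fitted speedup meets the target."""
    for p in range(lo, hi + 1):
        if np.polyval(coeffs, p) >= target_speedup:
            return p
    return None  # target unreachable within the scanned range

print(pes_needed(200.0))
```

Scanning the fitted curve rather than solving the cubic analytically avoids having to sort through complex or out-of-range roots.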
Serine and phenylalanine
Using the x3 sw network parameters, the simulator examined serine (Ser) and
phenylalanine (Phe), both in the cc-pVQZ basis set. Ser, with 56 electrons, had 595 AO
basis functions, and Phe (88 e-) had 990 AOs. The current ACES II UHF CCSD program,
using the AO algorithm for VVVV integrals and assuming 20% of the AO integrals are
discarded, would require over 200 GB of local disk storage for Ser and over 1.5 TB for
Phe. The new parallel algorithm would need only 2.8 and 19.2 GB of distributed storage,
respectively.
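The scale of the stored-integral footprint can be sanity-checked from the standard count of permutationally unique two-electron AO integrals, P(P+1)/2 with P = N(N+1)/2. The sketch below counts only the raw 8-byte integral values, so it understates the full disk figures quoted in the text, which also cover other stored lists:

```python
def unique_ao_integrals(n_ao):
    """Permutationally unique two-electron AO integrals: P(P+1)/2, P=n(n+1)/2."""
    p = n_ao * (n_ao + 1) // 2
    return p * (p + 1) // 2

def ao_value_storage_gb(n_ao, kept_fraction=0.8, bytes_per_value=8):
    """Disk space for the kept AO integral values alone (no index labels)."""
    return unique_ao_integrals(n_ao) * kept_fraction * bytes_per_value / 1e9

# Ser (595 AOs) and Phe (990 AOs), discarding 20% of the integrals.
for name, n in (("Ser", 595), ("Phe", 990)):
    print(f"{name}: ~{ao_value_storage_gb(n):.0f} GB of integral values")
```

Even this lower bound dwarfs the 2.8 and 19.2 GB distributed footprints of the new algorithm.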
Figures 5-19 and 5-20 show the simulated parallel performance of Ser and Phe. Ser
requires a minimum of 14 1-GB PEs, and Phe requires 95 PEs. In either system, parallel
efficiency never drops below 96%. With respect to tractability, a single Ser UHF CCSD
iteration is expected to take 2.8 days on 1024 PEs. Phe requires approximately 2 months
per iteration on 1024 PEs; if the efficiency curve is to be believed, this can be dropped
to about 2 weeks on over 4000 PEs.
[Figure 5-19 plot: theoretical speedup at 100% and 50% efficiency, predicted speedup, and scaled efficiency and efficacy curves versus the number of PEs (0 to 1024).]
Figure 5-19. The performance curves for serine on 16 to 1024 PEs.
[Figure 5-20 plot: theoretical speedup at 100% and 50% efficiency, predicted speedup, and scaled efficiency and efficacy curves versus the number of PEs (0 to 1024).]
Figure 5-20. The performance curves for phenylalanine on 128 to 1024 PEs.
The simulator is only able to model local computations with high accuracy. These
are generally highly ordered, well-defined sequences of commands executing on a
consistent processor (barring process contention in the operating system). After local
computations, the performance of local disk operations is the next most predictable, even
with I/O contention from other PEs. The least predictable aspect of a parallel program's
performance is network communication. Modeling it seems to be an N! correlation
problem in its own right (never mind coupled-cluster theory), but certain models are able
to predict the statistical nature of network traffic rather well.
The current implementation of the simulator treats the network as a single, flat
interconnected fabric in which every PE can reduce the network bandwidth available to
every other PE. In principle, this should predict the worst-case performance, since
multilayered network hierarchies favor the local communications that the platoons are
designed to exploit. However, the penalty function for estimating network contention
does not scale properly with the number of PEs (i.e., adding PEs reduces the accuracy of
the predicted network effects). This should be corrected with a better model, such as the
hyperbolic model for layered environments [Sto96].
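The flat-fabric assumption can be captured in a toy latency/bandwidth cost model in which every concurrently sending PE divides the shared bandwidth, reproducing the pessimistic single-layer behavior described above. All parameter values here are hypothetical:

```python
def message_time(nbytes, senders=1, latency=20e-6, bandwidth=1.0e8):
    """Time to deliver one message on a flat shared fabric (seconds).

    latency   -- per-message startup cost (hypothetical value)
    bandwidth -- fabric bandwidth in bytes/s, split evenly among all PEs
                 sending at the same time (the flat-fabric contention penalty)
    """
    return latency + nbytes * senders / bandwidth

# The same 1 MB message degrades as more PEs contend for the fabric.
for n in (1, 16, 256):
    print(f"{n:3d} senders: {message_time(1_000_000, senders=n) * 1e3:8.1f} ms")
```

Because the penalty grows linearly with the number of concurrent senders, a model like this overstates contention at large PE counts, which is the inaccuracy the hyperbolic model is meant to fix.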
At a more fundamental level, the simulator assumes every message can be delivered
as a single eager message. This means each PE can send and receive data without
dividing the information into multiple messages; more importantly, it completely
eliminates the need for the target PE to acknowledge or allocate space for the incoming
data. In practice this is rarely the case, and the rendezvous protocol must be modeled.
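A sketch of how the eager/rendezvous distinction could enter the cost model: a rendezvous send pays an extra handshake round trip before the payload moves. The threshold and timing parameters are hypothetical, not taken from any particular messaging implementation:

```python
LATENCY = 20e-6       # per-message startup cost, seconds (hypothetical)
BANDWIDTH = 1.0e8     # bytes per second (hypothetical)
EAGER_LIMIT = 16384   # bytes; larger messages fall back to rendezvous

def send_time(nbytes):
    """Eager sends cost one transfer; rendezvous prepends a request/ack
    handshake (two latency-bound control messages) so the receiver can
    acknowledge and allocate space before the data arrives."""
    transfer = LATENCY + nbytes / BANDWIDTH
    if nbytes <= EAGER_LIMIT:
        return transfer
    return 2 * LATENCY + transfer

print(send_time(1024), send_time(1_000_000))
```

The handshake cost matters most for the many mid-sized messages a fine-grain algorithm generates, which is why ignoring it makes the simulator optimistic.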
The simulator is also limited to sharing I/O bandwidth among PEs on a single node.
Many high-performance parallel supercomputers use storage area networks to increase

Full Text 
PAGE 1
A PARALLEL DIRECT OPENSHELL COUPLEDCLUSTER SINGLES AND DOUBLES ALGORITHM By ANTHONY D. YAU A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORID A IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2004
PAGE 2
Copyright 2004 by Anthony D. Yau
PAGE 3
iii ACKNOWLEDGMENTS I thank my family and friends for supporting me throughout the years, my strength comes from them. I thank the faculty who ha ve instructed me and the members of Rod Bartlett's research group. I owe particular th anks to Dr. Erik Deumens for continued motivation and guidance in scientific com puting (and for teaching the first parallel programming class that I attended) and to Dr. Victor Lotrich for insights into implementing CoupledCluster theory. I especi ally thank Dr. Ajith Perera for extending so much help (and patience) to me during all of my studies throughout graduate school. The United States Department of Defense High Performance Computing Modernization Program (HPCMP) supported all of this work. A grant under the Programming Environment and Training (P ET) program (GSA Contract No. DAAD0501C003) funded the initial pa rallelization effort, and th e Common High Performance Computing Software Support Initiative (CH SSI) program (project CBD03) funded the research on the finegra in parallel algorithm.
PAGE 4
iv TABLE OF CONTENTS Page ACKNOWLEDGMENTS.................................................................................................iii LIST OF TABLES............................................................................................................vii LIST OF FIGURES.........................................................................................................viii ABSTRACT....................................................................................................................... ..x CHAPTER 1INTRODUCTION......................................................................................................1 2PARALLELIZING A SERIAL CCSD PROGRAM.................................................6 Serial Profile...............................................................................................................6 Bottleneck Categorization and Parallelization...........................................................9 ParticleParticle Ladders...................................................................................9 Matrix Multiplication......................................................................................12 I/ODependent Routines.................................................................................13 Parallel Profile..........................................................................................................14 TopDown Parallelization Considerations...............................................................17 3SPINUNRESTRICTED CCSD EQUATIONS.......................................................18 Group 1: T2 Virtual Orbitals Coming from V..........................................................23 Group 2: T2 Virtual Orbitals Coming from V and T................................................24 Set 1: One Virtual from V, One Virtual from T1............................................24 Set 2: One Virtual from V, One Virtual from 
T2............................................26 Group 3: T2 Virtual Orbitals Coming from T..........................................................27 Set 1: Rings and PH Ladders..........................................................................27 Subset 1: AC/BD...................................................................................27 Subset 2: DB/CA...................................................................................28 Subset 3: CB/AD...................................................................................29 Set 2: HH Ladders from OOVV, OOOV, and OOOO...................................30 Group 4: T2 into T1...................................................................................................31 Group 5: OneParticle Transformations...................................................................32 Set 1: T1 Contributions...................................................................................32 Set 2: T2 Contributions...................................................................................33
PAGE 5
v 4PARALLEL ALGORITHM....................................................................................35 Virtual Parallel Computer........................................................................................35 Communication Networks..............................................................................37 Data Management...........................................................................................38 Arrays stored in distributed memory.....................................................38 Arrays stored on disk............................................................................42 Parallel UHF CCSD.................................................................................................42 Direct Integral Phase.......................................................................................44 Design considerations...........................................................................44 Algorithm..............................................................................................46 Stored Integral Phase......................................................................................50 One Particle Phase..........................................................................................55 Antisymmetrization and T2 Update................................................................57 5DISCUSSION AND SIMULATION.......................................................................69 Discussion................................................................................................................69 General Comments.........................................................................................69 Load Balancing...............................................................................................70 Job boards.............................................................................................70 Data 
aggregation...................................................................................70 Asynchronous I/O...........................................................................................72 Synchronization Delays..................................................................................73 Simulation................................................................................................................74 General Overview...........................................................................................76 Specific Technologies.....................................................................................77 Calibration......................................................................................................79 Theoretical Examples.....................................................................................83 Runtime Simulations......................................................................................84 DCEE*..................................................................................................84 Serine and phenylalanine......................................................................88 6CONCLUSION........................................................................................................92 Outlook.....................................................................................................................93 Gradients.........................................................................................................93 Higher Excitation Clusters..............................................................................94 Orbital Transformations and Space Reductions.............................................95 Conclusion................................................................................................................95 APPENDIX AACRONYMS AND ABBREVIATIONS................................................................97
PAGE 6
vi BNOTATION AND CONVENTIONS......................................................................99 Operators and Matrix Elements................................................................................99 Operator Indices.......................................................................................................99 CALGORITHM OVERVIEW..................................................................................101 DANATOMY OF A SHELL PAIR BLOCK...........................................................102 ELOGICAL SHELLS A ND BIN PACKING..........................................................104 LIST OF REFERENCES.................................................................................................110 BIOGRAPHICAL SKETCH...........................................................................................112
PAGE 7
vii LIST OF TABLES Table page 21.The distribution of real time over highlevel CCSD subroutines............................8 31.The RHF CCSD T1 contributions..........................................................................19 32.The RHF CCSD T2 contributions..........................................................................20 41.The direct integral processing loop summary........................................................48 42.The Direct Integral pha se prologue memory map.................................................60 43.The Direct Integral pha se inner loop memory map...............................................61 44.The Direct Integral pha se epilogue memory map..................................................62 45.The Stored Integral phase Group 3 Set 1 Subset 1 memory map..........................63 46.The Stored Integral phase Group 3 Set 1 Subset 2 memory map..........................64 47.The Stored Integral phase Group 3 Set 1 Subset 3 memory map..........................65 48.The Stored Integral phase Group 3 Set 2 memory map.........................................66 49.The One Particle phase FP T2 memory map.......................................................66 410.The One Particle phase FP T2 (and same spin terms) memory map...................67 411.The One Particle phase FH and FR local transformations......................................67 412.The R2 antisymmetrization and T2 update memory map.......................................68 51.The schedule of synchronization points.................................................................75 52.The latencies and bandwidths of random disk reads with respect to file size.......81 53.The minimum resources for amino acids with 1024 AOs in UHF CCSD.............84 54.The number of PEs needed to surpas s the initial parallel performance.................88
PAGE 8
viii LIST OF FIGURES Figure page 21.The distribution of real time over primary CCSD subroutines................................7 22.The breakdown of significant UHF subroutines......................................................8 23.The column assignments for each of 4 CPUs updating R2....................................12 24.The distribution scheme for matrix multiplies over CPU threads.........................13 25.The parallel speedup for each major UHF CCSD subroutine and the CCSD executable..............................................................................................................15 26.The runtime profile of parallel CCS D with a UHF reference from DCEE*.........15 41.Communication network boundaries.....................................................................39 47.The memory heap management for various molecular systems............................40 42.The T2 and R2 memory layout...............................................................................43 48.The data flow in the Direct Integral phase.............................................................49 43.The path of Group 3 Set 1 Subset 1 updates per OOVV integral..........................53 44.The LD shell distribution schemes........................................................................54 49.The data flow in th e Stored Integral phase............................................................55 510.The binary tree approach to collective communications.......................................78 511.The local disk random read calibration data..........................................................81 512.The calibration data for onnode messaging parameters.......................................82 513.The calibration data for offnode messaging parameters.......................................82 514.The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast 
Ethernet..................................................................................................................85
PAGE 9
ix 515.The simulated performance curves for DCEE* on 2 to 128 PEs over Fast Ethernet..................................................................................................................86 516.The simulated performance curves for DCEE* on an SP switch..........................86 517.The relative performance of the initial parallelization attempt compared to the new algorithm........................................................................................................87 518.The absolute performance of the initia l parallelization attempt compared to the new algorithm........................................................................................................88 519.The performance curves for serine on 16 to 1024 PEs..........................................89 520.The performance curves for phenylalanine on 128 to 1024 PEs...........................89 D1.The spin blocks in an AO shell pair block...........................................................103
PAGE 10
x Abstract of Dissertation Pres ented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy A PARALLEL DIRECT OPENSHELL COUPLEDCLUSTER SINGLES AND DOUBLES ALGORITHM By Anthony D. Yau December 2004 Chair: Rodney J. Bartlett Major Department: Chemistry CoupledCluster Singles and Doubles (CCSD) is a benchmark ab initio electron correlation method. The biggest factor in pr eventing its widespread adoption for large molecular systems is that the theory scales like N4 in data and N6 in computation (N being the number of atomic orbital basis functions). Using simple factorization techniques and data and task parallelization, a new appro ach to calculating openshell CCSD energies is available. For a molecular system with 50 electrons (25 alpha, 25 beta) and 500 AO basis functions, this approach has the potential to reduce storage requirements from 630 GB to 2.3 GB for a traditional UHF CCSD algorithm in a basis set of molecular orbitals. The number of operations will increase, but the natural parallelization of the data provides a clear and concise framework for parallelizing the computations.
PAGE 11
1 CHAPTER 1 INTRODUCTION CoupledCluster (CC) theory is an ab initio electron correlation method that can be thought of as infiniteorder perturbation theory or a sizeextensive approximation to the full configuration inter action (CI) analysis [ Bar95 ]. Given a oneparticle reference wavefunction 0, the CC wavefunction is defined as an exponential expansion of excitation clusters, T, from the occupied orbi tals to the unoccupied (virtual) orbitals. The T operator is a particle excitation operator that naturally trunc ates at the number of particles, P, in the system (Equation 12). O ccupied orbitals (the hole space) are denoted with the letters i, j, k, and l, while virtual orbitals (the particle space) are denoted with the letters a, b, c, and d. Appendix A lists acronyms and abbreviations, and Appendix B summarizes index conventions and matrix elements. 0exp TCC(11) ,... , ,... , Â† Â† Â† ... ... 2 1... 1 ;c b a k j i abc ijk p P p pk jc ib a t p T T T(12) The difficulty in calculating the full CI is that the data requirement scales like N2Pand the computational requirement scales like N2P+2, in which N is the size of basis set space. To retain size extensivity and comput ational tractability, th e excitation operator is usually truncated at a small number of ex citations. The most common is at double excitations, such that T=T1+T2. As computers become more capable of massive calculations, the number of excitations can increase, but the size of the molecular system
PAGE 12
2 must decrease much more rapidly. Currentl y, the highest handcoded truncation is at pentuple excitations (i.e., T=T1+T2+Â…+T5) [ Mus02 ], but efforts to automate this process are becoming more promising as distribu tion techniques for data and computations become available [ Bau02 ]. Using the standard twoparticle electron ic Hamiltonian, the CC energy is calculated by contracting the elements of the Fock matrix (fpq) and the twoelectron Coulomb integrals ( ) with the T1 and T2 amplitudes. Uppercase orbital indices correspond to alpha spin orbitals, and lo wercase orbital indices correspond to beta spin orbitals. The doublebar integrals in Equation 14 are anti symmetrized Coulombexchange integrals (Appendix B). (This derivation assumes that the reference orbitals are orthonormal.) 0 0) exp( Âˆ T H ECC(13) ijab b i a j b j a i ab ij ia a i ia IJAB B I A J B J A I AB IJ IA A I IA IjAb b j A I Ab Ij CCt t t t t ab ij t f t t t t t AB IJ t f t t t Ab Ij E E4 1 4 1 0(14) The T amplitudes are calculated by iterati ng over a coupled set of equations that project the Hamiltonian and wavefunction into a set of pparticle excitation spaces. The CCSD approximation truncates T at double exci tations, and is one of the benchmark correlated methods in quantum chemistry today. The T1 and T2 equations come from the projections in Equations 16 and 17, in which NH Âˆ is the normalordered Hamiltonian that does not include the reference energy. Only the and spin equations are shown, but the , and spin equations are analogous to the former. 0 0) exp( Âˆ exp CC NE T H T(15)
PAGE 13
3 0 ) exp( Âˆ exp0 T H TN a i(16) 0 ) exp( Âˆ exp0 T H TN ab ij(17) The BakerCampbellHausdorff formula [ Hal00 ] has the effect of reducing the T equations to those terms that 1) contract directly with the Hamiltonian and 2) have no more than four excitation amplitudes. The fi rst effect is from disconnected terms being cancelled by the commutation, and the second effect is from the twoparticle Hamiltonian having at most four indices from the twoelectron integrals. Computationally, the diagonal elements of th e Fock matrix that contract with T are grouped together, and define the input amplitudes for the next iteration. The T2 equations are shown in Equations 18 through 111. If T is initialized to 0, then the next iteration will produce T2=/D2 and T1=0, so for all intents and purposes, T2 can be initialized to the secondorder wavefunction and T1 can be set to 0. ... 0 k ab ik jk k ab kj ik c ac ij bc c cb ij act f t f t f t f (18) ... 0 j k ab ik jk i k ab kj ik b c ac ij bc a c cb ij ac ab ij jj ii bb aat f t f t f t f t f f f f (19) bb aa jj ii ab ijf f f f D (110) ... j k ab ik jk i k ab kj ik b c ac ij bc a c cb ij ac ab ij ab ijt f t f t f t f t D (111) Since the T amplitudes differentiate between occupied and virtual orbitals, it is common to group integrals into the various space combinations such as occupiedoccupied (OO), occupiedvirtu al (OV), and virtualvirtua l (VV). There are five such combinations for the twoelectron in tegrals: OOOO, OOOV, OOVV, OVVV, and VVVV. If the virtual orbitals outnumber the occ upied orbitals by a factor of 3, then the
PAGE 14
4 VVVV and AO integrals will outnumber th e OOOO integrals by factors of 81 and 256, respectively. Most CC implementations calculate and store the OOOO, OOOV, and usually OOVV integrals before starting the CC iterations. Due to programming complexity, most implementations also store the OVVV integrals. The VVVV integrals are used in only one CCSD term and how they are processed illustrates a fundamental technique that the parallel CCSD algorithm uses in almost every contraction. The data re quirement scales like N4 and the computational requirement scales like N6, and this term is usually the ratedetermining step of the CC iteration. The standard MO expression is s hown in Equation 112. The "+=" operator is shorthand for "incremented by" (i.e., "x+=y" means "x=x+y") and the r amplitude is the residual that collects all of the contributions before di viding by the D denominator and producing the next set of T amplitudes. cd cd ij ab ijt cd ab r2 1(112) Forward transforming T into the AO basis allows the AO integrals to contribute directly to the VVVV terms. Conceptually, the AO coefficients from the reference wavefunction are "removed" from the integr al and "attached" to the amplitude. The primes on the AO indices in T denote a forwar d transformation. This distinction must be made since another useful transformation (a backward transformation) has the same dimensionality ( viz. ), but the AO overlap matrix is used in the contraction in addition to the AO coefficients. cd cd ij d c b a ab ijt c c c c r 2 1(113) ij b a ab ijt c c r2 1(114)
PAGE 15
5 While this transformation removes the n eed to store the VVVV MO integrals, it increases the computational workload subs tantially. In fact, since each integral contributes to each amplitude, the R2 array would be updated N4 times! However, implementing this technique presents a more optimistic view of calculating large systems. The R2 amplitudes are backtransformed from MOs to AOs before being incremented by the AO integral terms. Once the integrals have been processed, the R2 array is forwardtransformed into the MO basis. c c cd cd ij d c ijc S c r c c r;(115) ij ijt r2 1(116) ij b a ab ijr c c r(117) The method of calculating amplitude contri butions in the AO basis instead of the virtual MO basis has the potential to re duce the amount of storage space needed to calculate the aforementioned VVVV contribution. This is particularly true for openshell systems, which need to store all three spin combinations of the VVVV integrals compared to storing one set of AO integrals. If direct integral generation is used, then the storage requirement for this term is comp letely removed. In either the AO or MO scenario, parallelization invo lves distributing the data a nd operations of a large matrix multiplication over the processing elements.
PAGE 16
6 CHAPTER 2 PARALLELIZING A SERIAL CCSD PROGRAM Widescale reformulation of computationa l algorithms should only be performed if currently existing formulations are inadequa te, but determining whether an algorithm is inadequate depends heavily on its intended use and the computers that are meant to execute it. Any of a number of runtime para meters could contribute to low performance: compiler efficiency; net CPU speed; memory or disk storage requirements and performance; parallel scalability and efficien cy; etc. In the case of general computational chemistry algorithms, how the number of operations scales with respect to the size of the chemical system will also determine whether an algorithm should be reformulated. The ACES II Program System [ ACES ] is one of the more popular software packages that calculates CCSD electron correlation ener gies. Although it was written to make use of CPU vectorization, paralleliza tion is considered to be one of the most important ways to decrease the run time on modern supercomputers. Traditional methods for parallelizing existing software begi n with execution profiling. Once the main bottlenecks are identified and analyzed, para llel formulations of those sections are implemented and profiled. Serial Profile Profiling the CC code in ACES II was performed with four test systems: Alanine ("ala"; C3H7NO2; 48 e) in ccpVDZ (119 AO basis functions) Octane ("oct"; C8H18; 66 e) in ccpVDZ (202 AOs) ,'dichloroethyl ether ("DCEE"; C4H8Cl2O; 74 e) in ccpVDZ (146 AOs) DCEE radical ("DCEE*"; C4H7Cl2O; 73 e) in ccpVDZ (141 AOs).
PAGE 17
7 The first three systems had an RHF referen ce, the fourth had UHF, and no spatial point group symmetry was used for any system. All of the systems used the AO basis formulation for the VVVV*T2 contribution, which was described in Chapter 1. The profiled code used the C system function ge ttimeofday() to record the realtime clock before and after large code sections. Any subroutine (or gr oup of related s ubroutines) that consumed more than 3% of the real time became a potential target for parallelization. Table 21 lists the average percentage of th e total real time for each highlevel subroutine measured across the three closeds hell test systems (RHF column), and it lists the percentage for each routine in the DCEE* test system (UHF column). Routines that contributed more than 3% are in bold case. The maximum variance of any set of percentages for the closedshell systems was 0.004, which suggests that the three profiles have a consistent and reliable distribu tion of times. The table entry for "~AOLAD" corresponds to the time spent in AOLAD that is not due to RDAO, even though the latter is a dependent of the former. Figure 21 shows the amount of the total time consumed by the major subroutines and the next most significant routine. 0% 10% 20% 30% 40% 50% 60% 70% CNTRCTT1W1ABRDAORNGDRVnext RHF UHF Figure 21.The distribution of real time over primary CCSD subroutines.
PAGE 18
8 Table 21.The distribution of real time over highlevel CCSD subroutines Routine RHF UHF Routine RHF UHF ZERSYM 0.2% 0.1% T1INT1 0.4% 0.2% T1RAAAA 1.4% 0.6% LADAA 0.0% 0.1% T1RABBA 0.8% 0.6% LADAB 1.2% 0.5% T1RABAB 0.8% 0.6% AOLADLST 0.0% 0.0% CNTRCT 7.7% 9.8% T2TOAO 0.7% 0.3% SUMSYM 0.1% 0.2% RDAO 58.3% 65.1% F2TAU 0.0% 0.0% ~AOLAD 0.6% 0.3% GETLST 1.1% 0.8% Z2TOMO 0.9% 0.7% QUAD1 1.3% 0.5% ZIAAO 0.1% 0.2% QUAD2 0.5% 0.3% T12INT2 0.8% 1.1% QUAD3 0.3% 0.2% RNGDRV 9.4% 11.0% MAKFME 0.1% 0.1% SST0xI 0.0% 0.0% T1W1AA 0.0% 0.6% SUMRNG 0.2% 0.1% T1W1AB 3.9% 2.1% ERNGAA 0.0% 0.0% T1INW2 0.1% 0.0% ERNGAB 0.5% 0.1% FEACONT 0.7% 0.4% SUMSYM 0.2% 0.0% FMICONT 0.3% 0.3% E4S 0.6% 0.2% FMECONT 0.5% 0.2% NEWT2 0.3% 0.2% T1INT2A 0.3% 0.2% PHASE3 0.6% 0.2% T1INT2B 0.8% 0.5% DIIS 2.6% 1.0% It is clear from the table that four r outines consume most of the execution time: CNTRCT, T1W1AB, RDAO, and RNGDRV. Toge ther, they cost 79.3% and 88.1% for RHF and UHF, respectively. Figure 22 shows the four major subroutines and any individual routine that consumes more than 1% from the DCEE* profile. RDAO CNTRCT T1W1AB T12INT2 DIIS RNGDRV else Figure 22.The breakdown of significant UHF subroutines. Primary values are grouped on the left, secondary values (great er than 1%) are grouped on the right.
PAGE 19
9 Bottleneck Categorization and Parallelization ParticleParticle Ladders The single most time consuming routine in either RHF or UHF calculations is RDAO, which increments T2 by contributions from the VVV V integrals in the AO basis. Integralstored schemes, such as the curr ent implementation in ACES II, read the AO integrals into memory from disk, multiply them by the T2 amplitudes from the previous iteration, and add the product to the T2 residual amplitudes. Neglecting integral sparsity, the number of permutationally unique AO integral s is shown in Equation 21. If there are at least 100 AO basis functions, then the numbe r of unique integrals can be approximated by NAO 4/8 with more than 98% accuracy; 200 or more functions and the approximation is more than 99% accurate. 2 1 ; 2 1 AO AO IntsN N P P P N (21) Many AObased implementations have a userdefined threshold for discarding AO integrals that would not have a significant contri bution to the T2 residual. The integral program in ACES II stores the integrals in one long vector and discards individual integrals that are less than the thres hold. For DCEE* and a threshold of 1014, ACES II stored almost 41.1 million integrals from a complete set of 50.1 million, which means the twoelectron integrals were 18.0% sparse. The RDAO algorithm reads in a batch of integrals and proces ses the integrals individually. For a particular integral, the value might cont ribute 1, 2, 4, or 8 times depending on the permutational symmetry of th e orbitals. The RHF case only has 1, 2, or 4 contributions and postprocesses the residu al for the final permutation. The general structure of the algorithm can be described by the following:
do while (more batches)
   read batch
   do for each integral
      do for each permutation
         do for all OO
            R2(OO,μν) += T2(OO,λσ) * <μν|λσ>
         end do
      end do
   end do
end do

This algorithm is not particularly efficient since each permutation requires a vector-vector operation (R2(*,μν) += V * T2(*,λσ)). Experience in computational algorithms suggests that matrix-matrix operations (using integral blocks) would have much higher computational throughput. For example, the RDAO update averaged 286.7 MFLOPS out of a theoretical maximum of 1500 MFLOPS on the 375 MHz IBM POWER3 CPU. A 600x600 matrix multiplication using the ESSL library (a tuned linear algebra library for IBM architectures) averages 1278 MFLOPS, which corresponds to a computational efficiency of 85.2%, compared to 19.1% for RDAO. However, parallel algorithms for block-wise implementations are virtually identical to element-wise implementations, so block optimizations were ignored in the parallel strategy and could be implemented later.

For any DCEE* CC iteration, 9.6 seconds out of 5952 seconds were spent reading integrals from disk (i.e., I/O takes about 0.16% of the total RDAO time). Before the multiplication loop, there was a small sorting section, which took close to 13.6 seconds (also insignificant); therefore, parallelizing the actual integral updates was expected to have the largest impact on performance.

There are multiple levels at which the current algorithm can be parallelized: batches, integrals, permutations, or rows of OO pairs. Distributing the operations over OO pairs might decrease performance since the pairs are contiguous in memory, and the
cost of adding more rows is much less than the cost of adding more loops to process them. The problem with distributing permutations is that the number of permutations for a random integral is not always 8. This might introduce load-balancing problems depending on how the integrals are sorted on disk; furthermore, each integral has a limited number of tasks (permutations), which might limit overall scalability above 8 processors.

The two remaining loops (unique integrals and integral batches) are ideal for distributing over processors. The number of integrals is substantial enough to support scalability over a wide number of processors, and each processor will need to store only those integrals for which it is assigned to calculate contributions. The current ACES II implementation expands the permutations into the "unique" integral list and processes each one separately, so this expanded list was distributed over processors instead of the truly unique list.

For maximum software portability, the parallel algorithm needed to distinguish between processors on the same physical computer and processors on remote computers. The strategy was to distribute integral batches over physical computers (nodes) and to distribute the integral list from each batch over processors on the same computer. After the nodes finished processing their batches of integrals, the R2 contributions were reduced and distributed back to all of the nodes.

To minimize memory requirements on a single node, only one integral list, T2 array, and R2 array were stored, and CPU threads were created to process the integrals. The R2 array was divided into blocks of columns (λσ pairs), and each CPU was assigned to calculate all of the terms that contributed to a particular set of blocks. This technique reduced false cache conflicts (when two CPUs
attempt to write to different parts of the same cache line at the same time), and it increased the probability that a CPU would have a target R2 column in cache memory, since multiple CPUs would be caching different columns. Figure 2-3 shows the CPU/column assignments in a 4-CPU computer with 4 columns per block.

Figure 2-3. The column assignments for each of 4 CPUs (0 through 3) updating R2. In this example, there are 49 λσ pairs (from 7 AOs).

For an arbitrary blocking factor (bf) and number of threads (Nth), the thread assignment (from 0) of a column index (mn) can be determined with the following FORTRAN function:

   thread_id(mn,bf,Nth) = mod(mn-1, bf*Nth)/bf

If bf*Nth is a power of 2, then the mod function can be replaced by a bitwise AND with the mask (bf*Nth)-1, and if the blocking factor is a power of 2, then the division can be replaced by bit shifting.

Matrix Multiplication

If the VVVV contributions are calculated in the MO basis, then LADAA and LADAB become the bottleneck routines. These perform large two-dimensional matrix multiplies, as do the other three bottlenecks: CNTRCT, T1W1AB, and RNGDRV.
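The thread-assignment formula above can be sketched and checked against the Figure 2-3 layout (a sketch; Python floor division stands in for FORTRAN integer division):

```python
def thread_id(mn, bf, nth):
    """Thread (0-based) that owns 1-based column mn, with blocking
    factor bf and nth threads: mod(mn-1, bf*nth) / bf."""
    return ((mn - 1) % (bf * nth)) // bf

def thread_id_bitwise(mn, bf, nth):
    # Valid when bf*nth and bf are powers of 2: the mod becomes a
    # bitwise AND with the mask (bf*nth)-1 and the division a shift.
    return ((mn - 1) & (bf * nth - 1)) >> (bf.bit_length() - 1)

# Reproduce Figure 2-3: 49 lambda-sigma pairs, 4 CPUs, 4 columns per block.
assignments = [thread_id(mn, 4, 4) for mn in range(1, 50)]
```

Columns 1-4 go to thread 0, 5-8 to thread 1, and so on, cycling back to thread 0 at column 17, exactly the block-cyclic pattern in the figure.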
Incidentally, the routine T12INT2 (1.1% of the UHF time) is also bound by matrix multiplications. Parallelizing these routines involved distributing the columns of the output matrix over CPU threads. Since the input and output matrices were all kept in memory, the columns were divided once, unlike the block cycling in the RDAO scheme, which had irregular data access patterns. Figure 2-4 shows the distribution pattern for a matrix multiply over 4 threads.

Figure 2-4. The distribution scheme for matrix multiplies (C = A*B) over CPU threads. Every thread can read the entire A matrix, but each thread is assigned a block of the B matrix and consequently writes to a unique block of the C matrix.

The multiplications were not distributed over nodes since the parallel computer that ran the profile had 16 CPUs per node. The time penalty incurred from the global reduction (after the multiplication) would have greatly lessened the gains from parallelization; however, nodes with fewer CPUs might benefit from distributing the rows of the A matrix. All nodes but one would then have to initialize the C matrix to 0 (if the operation increments preexisting data), and the results would have to be reduced and broadcast after the multiplication.

I/O-Dependent Routines

None of the major subroutines that required parallelization were heavily dependent on I/O operations; however, many parallel computers (including the one that ran the profiling calculations) have shared disk storage devices. This introduces competition between some or all of the compute nodes when each one performs file read or write
operations. The DIIS routine (1% of the UHF time) is heavily dependent on I/O performance, and the rate-determining operation can be described by the following:

do for each i
   read Tall(*,i)
   Tall(row,i) = T(i)
   do for each (x,y)
      B(x,y) += Tall(x,i)*Tall(y,i)
   end do
   write Tall(*,i)
end do

The number of rows in Tall (and the size of each dimension of B) is usually 5 but could be any value (preferably a small one). The i index spans all of the T amplitudes, both T1 and T2. Obviously, the construction of a 5x5 matrix from a single 5x1 vector is negligible compared to the time spent reading from and writing to disk. This operation was not parallelized; instead, only one node was permitted to perform it.

I/O competition is a low-level problem inherent to shared storage architectures. In many cases, extensive software changes are required to overcome the increasing penalties from read and write delays that serial programs consider to be constant. These penalties manifest themselves in almost every routine that uses disk storage in the parallel profile.

Parallel Profile

The DCEE* test case was run with the ACES II CCSD program that had parallel modifications as described in the previous section. The parallel performance of the major subroutines is shown in Figure 2-5. The real time spent in each high-level routine is shown in Figure 2-6. Only RDAO was distributed over every CPU in the calculation and was 20.6% efficient in the 256-CPU (16x16) calculation. The three other high-level routines were only distributed over the 16 CPUs in each node and had average threading efficiencies of 70.1% (CNTRCT), 25.4% (T1W1AB), and 28.8% (RNGDRV).
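The rate-determining DIIS loop shown above is just a Gram-matrix accumulation, B = Tall * Tall^T, built one amplitude column at a time. A minimal sketch with plain lists (sizes are hypothetical):

```python
def diis_b_matrix(t_all):
    """Accumulate B(x,y) = sum_i Tall(x,i)*Tall(y,i), i.e. B = Tall @ Tall.T.
    t_all is a list of n history rows, each holding n_amp amplitudes."""
    n = len(t_all)
    b = [[0.0] * n for _ in range(n)]
    for i in range(len(t_all[0])):      # i spans all T1 and T2 amplitudes
        col = [t_all[x][i] for x in range(n)]
        for x in range(n):
            for y in range(n):
                b[x][y] += col[x] * col[y]
    return b

b = diis_b_matrix([[1.0, 2.0], [3.0, 4.0]])
```

The arithmetic per column is trivial (n^2 multiply-adds for an n x n B), which is why the disk reads and writes of Tall dominate the routine.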
Figure 2-5. The parallel speed-up for each major UHF CCSD subroutine (CNTRCT, T1W1AB, RDAO, (AOLAD), and RNGDRV) and the CCSD executable, for the 1x1, 2x16, 4x16, 8x16, and 16x16 runs. (AOLAD) contains the RDAO parallelization and the global reduction afterward.

Figure 2-6. The run-time profile of parallel CCSD with a UHF reference from DCEE*, for the 1x1, 2x16, 4x16, 8x16, and 16x16 runs. The times are measured in seconds.
The overall parallel efficiency of the UHF CCSD calculation on 256 CPUs was 1.8%, with a speed-up of 4.62. The calculation with the highest speed-up was 4x16 (64) CPUs, with an efficiency of 8.8%. The reason for such poor performance can be attributed to I/O contention by all of the routines that were insignificant in the serial calculation. Most of the routines listed in Figure 2-6 increase in time even though they had no parallel modifications.

While further parallel optimizations could be made to the current CCSD program, two major drawbacks would still plague the parallel code: problem scalability and expensive modify-profile development cycles (i.e., trial and error). The worst aspect of the current approach is that tractability does not increase as CPUs, memory, and disks are added to the pool of resources (i.e., the largest molecule that can be studied is limited by the resources of a single node). There is no mechanism for distributing data across memory or disks; however, if the current limitations are adequate, then this model should be able to reduce the run time to a reasonable amount.

Parallelizing existing serial software usually takes the bottom-up approach, in which significant bottlenecks are parallelized and profiled. As new bottlenecks present themselves, they too are parallelized and profiled. Unfortunately, by the time this cycle ends (or becomes too expensive), so much of the software might have been rewritten that it would have been cheaper to reformulate the algorithm using a top-down parallelization approach. In the current CCSD program, no matter how well the algorithm is parallelized, the software is still limited to performing calculations on systems that are calculable by a single node. It would be more efficient in the long term to reformulate the CCSD algorithm from the top down (reusing existing software wherever possible) and to include proper data distribution and small-task parallelization as design fundamentals.
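The efficiency figures quoted above follow directly from efficiency = speed-up / CPU count; a quick check:

```python
def parallel_efficiency(speedup, ncpu):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / ncpu

# 256 CPUs (16x16) with a speed-up of 4.62 -> about 1.8% efficiency
eff_256 = parallel_efficiency(4.62, 256)

# 64 CPUs (4x16) at 8.8% efficiency -> a speed-up of about 5.6
speedup_64 = 0.088 * 64
```

Note that the 64-CPU run, although less "efficient-sounding" per the text's 8.8%, actually delivers a larger speed-up than the 256-CPU run, which is why adding more CPUs past 64 was counterproductive.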
Top-Down Parallelization Considerations

Although a few programs already calculate the CCSD energy in parallel, most of them are limited to closed-shell references [Koc96, Kob97, Per03]. The exception is the general tensor contractor of Hirata [Hir03], but the final program requires stored integrals. The goal of the top-down parallelization approach will be to produce an open-shell CCSD program capable of handling molecular systems with at least 500 AO basis functions (although the internal target is 1000) and that can scale up to a few hundred (if not a thousand) CPUs. These constraints almost necessitate calculating amplitude contributions from AO integrals and distributing amplitudes across CPUs. In addition to the VVVV integrals, the algorithm should also allow for direct generation of the VVVO integrals.

A consequence of calculating amplitude contributions in the AO basis instead of the MO basis is that the virtual orbitals are never formally recognized by the equations other than to create complementary density matrices such as that shown in Equation 2-2. These matrices are very important since they define how amplitudes transition from backward- to forward-transformed relative to the MOs.

Q_μν = Σ_a c_μ^a c_ν^a    (2-2)

The first step of the top-down approach involves cataloging the integral contributions and identifying which output condition on the R amplitudes (forward- or backward-transformed) would be best suited for their additions. Finding a data and task distribution then becomes a matter of understanding how the integrals are processed and which terms can be grouped together given a common output condition on the residual.
CHAPTER 3
SPIN-UNRESTRICTED CCSD EQUATIONS

The T1 and T2 equations in spin-restricted Hartree-Fock (HF) CCSD have 11 and 29 contributions, respectively. If the reference wavefunction is not HF but the spin orbitals are still restricted, then the T1 and T2 equations have 14 and 31 contributions, respectively. The spin-unrestricted non-HF CCSD amplitude equations have a total of 222 contributions. Table 3-1 lists the spin-restricted T1 amplitude contributions and the number of corresponding spin-unrestricted terms. Table 3-2 does the same for the T2 contributions [Cra98]. P̂(pq) is the antisymmetric permutation operator (Appendix B).

There are many ways to group and factorize the amplitude contributions. The efficiency of each approach largely depends on the following:
- The chemical system being examined
- The CPU speed and efficiency
- The amount of local memory per process
- Local and remote (shared or exclusive) storage capacities
- Local and remote I/O latencies and bandwidths
- Interprocess message latency and bandwidth.

For example, if the CPU speed were incredibly slow compared to the speed of the I/O reads and writes, then it might be faster to group as many subterms together as possible and accumulate the intermediate product to disk. Consider the sum X*A*B + X*C*D + X*E*F, which factorizes to X*I provided I = A*B + C*D + E*F (first accumulated on disk). In most cases, however, millions (if not billions) of CPU operations are performed before a single byte of data is retrieved from disk, so this approach is relegated to those systems whose CPU speeds are unusually slow compared to other hardware capabilities.
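The X*A*B + X*C*D + X*E*F factorization can be verified directly; the factored form needs four matrix multiplies instead of six (a sketch with small illustrative matrices):

```python
def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matadd(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# X*A*B + X*C*D + X*E*F  ==  X*I  with  I = A*B + C*D + E*F
X = [[1, 2], [3, 4]]
A, B = [[1, 0], [0, 1]], [[2, 1], [1, 2]]
C, D = [[0, 1], [1, 0]], [[1, 1], [1, 1]]
E, F = [[1, 1], [0, 1]], [[1, 0], [1, 1]]

naive = matadd(matadd(matmul(X, matmul(A, B)),
                      matmul(X, matmul(C, D))),
               matmul(X, matmul(E, F)))            # six multiplies
I = matadd(matadd(matmul(A, B), matmul(C, D)), matmul(E, F))
factored = matmul(X, I)                            # four multiplies
```

Distributivity guarantees the two results agree; whether the saved multiplies are worth the disk traffic for I is exactly the trade-off discussed above.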
Table 3-1. The RHF CCSD T1 contributions

Contribution                                             Number of UHF terms
(1/2) Σ_kcd ⟨ka||cd⟩ t_ki^cd                                       4
-(1/2) Σ_klc ⟨kl||ci⟩ t_kl^ca                                      4
f_ai                                                               2
Σ_kc ⟨ka||ci⟩ t_k^c                                                4
Σ_c f_ac t_i^c                                                     2
Σ_kcd ⟨ka||cd⟩ t_k^c t_i^d                                         4
-(1/2) Σ_klcd ⟨kl||cd⟩ t_i^d t_kl^ca                               4
-Σ_k f_ki t_k^a                                                    2
-Σ_klc ⟨kl||ci⟩ t_k^c t_l^a                                        4
-(1/2) Σ_klcd ⟨kl||cd⟩ t_l^a t_ki^cd                               4
-Σ_kc f_kc t_i^c t_k^a                                             2
-Σ_klcd ⟨kl||cd⟩ t_k^c t_i^d t_l^a                                 4
Σ_kc f_kc t_ik^ac                                                  4
Σ_klcd ⟨kl||cd⟩ t_k^c t_li^da                                      8

The most common factorization strategy involves searching all possible combinations of terms from either amplitude equation and finding the set of subterms that is used most frequently. This implicitly assumes that the cost of recomputing the data is greater than the cost of storing and retrieving the data from disk. In the MO basis, this is certainly true (i.e., the amount of computation needed to generate all of the MO integrals from AO integrals is quite high), but the cost of calculating AO integrals on demand is negligible when multiple processes are in competition for shared I/O resources.
Table 3-2. The RHF CCSD T2 contributions

Contribution                                             Number of UHF terms
⟨ab||ij⟩                                                           3
P̂(ij) Σ_c t_i^c ⟨ab||cj⟩                                           4
(1/2) P̂(ij) Σ_cd t_i^c t_j^d ⟨ab||cd⟩                              3
-P̂(ab) Σ_k t_k^a ⟨kb||ij⟩                                          4
-P̂(ij) P̂(ab) Σ_kc t_k^a t_j^c ⟨kb||ic⟩                             4
-(1/2) P̂(ij) P̂(ab) Σ_kcd t_i^c t_k^a t_j^d ⟨kb||cd⟩                4
(1/2) Σ_cd t_ij^cd ⟨ab||cd⟩                                        3
-(1/2) P̂(ab) Σ_kcd t_k^a t_ij^cd ⟨kb||cd⟩                          4
P̂(ij) P̂(ab) Σ_kc t_ik^ac ⟨kb||cj⟩                                  10
-P̂(ij) P̂(ab) Σ_kcd t_i^d t_jk^bc ⟨ak||dc⟩                          10
(1/2) P̂(ab) Σ_kl t_k^a t_l^b ⟨kl||ij⟩                              3
(1/2) Σ_kl t_kl^ab ⟨kl||ij⟩                                        3
(1/2) P̂(ij) P̂(ab) Σ_klc t_i^c t_k^a t_l^b ⟨kl||cj⟩                 4
(1/2) P̂(ij) Σ_klc t_i^c t_kl^ab ⟨kl||cj⟩                           4
-P̂(ij) P̂(ab) Σ_klc t_l^a t_jk^bc ⟨kl||ic⟩                          10
-P̂(ij) P̂(ab) Σ_klcd t_i^c t_l^b t_kj^ad ⟨kl||cd⟩                   10
(1/2) P̂(ij) P̂(ab) Σ_klcd t_ik^ac t_lj^db ⟨kl||cd⟩                  11
(1/4) Σ_klcd t_ij^cd t_kl^ab ⟨kl||cd⟩                              3
(1/4) P̂(ij) Σ_klcd t_i^c t_j^d t_kl^ab ⟨kl||cd⟩                    3
(1/4) P̂(ab) Σ_klcd t_k^a t_l^b t_ij^cd ⟨kl||cd⟩                    3
(1/4) P̂(ij) P̂(ab) Σ_klcd t_i^c t_k^a t_j^d t_l^b ⟨kl||cd⟩          3
Table 3-2. Continued

Contribution                                             Number of UHF terms
-P̂(ij) Σ_k f_kj t_ik^ab                                            4
-P̂(ij) Σ_kc f_kc t_i^c t_jk^ab                                     4
-P̂(ij) Σ_klc t_k^c t_lj^ab ⟨kl||ci⟩                                8
-P̂(ij) Σ_klcd t_k^c t_i^d t_lj^ab ⟨kl||cd⟩                         8
-(1/2) P̂(ij) Σ_klcd t_ik^ab t_jl^cd ⟨kl||cd⟩                       8
P̂(ab) Σ_c f_bc t_ij^ac                                             4
-P̂(ab) Σ_kc f_kc t_k^a t_ij^bc                                     4
P̂(ab) Σ_kcd t_k^c t_ij^db ⟨ka||cd⟩                                 8
-P̂(ab) Σ_klcd t_k^c t_l^a t_ij^db ⟨kl||cd⟩                         8
-(1/2) P̂(ab) Σ_klcd t_ij^ac t_kl^bd ⟨kl||cd⟩                       8

The following assumptions guided the development of the factorization strategy described in this chapter:
- There are at least 3 times as many virtual orbitals as occupied orbitals for each spin
- The two-particle data are too large to store in the memory of one process
- The VVVV and VVVO integrals are not transformed and stored on disk
- The CPU operations are several orders of magnitude faster than I/O latencies
- Latency is at least an order of magnitude more significant than bandwidth
- Each process is able to retrieve (GET) and increment (ACCUMULATE) data that are stored by another process without interrupting the other process.

The first assumption follows from a study of basis sets that noted that cc-pVTZ is the minimum basis set needed to describe electron correlation accurately [Dun00]. For chlorine, there are 17 electrons and 34 contracted basis functions. This produces an almost 1:3 occupied/virtual split in molecular orbitals, since half of the electrons have alpha spin and half have beta spin.
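The chlorine cc-pVTZ example works out as follows (a quick check; the occupied count is averaged over the two spins):

```python
n_basis = 34          # contracted cc-pVTZ functions on Cl
n_electrons = 17

occ = n_electrons / 2     # average occupied orbitals per spin (8.5)
virt = n_basis - occ      # average virtual orbitals per spin (25.5)
ratio = virt / occ        # the "almost 1:3" occupied/virtual split
```

Averaged over spins the split is exactly 1:3; the individual alpha (9 occupied, 25 virtual) and beta (8 occupied, 26 virtual) cases straddle it, which is why the text says "almost."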
Designing the algorithm such that two-particle terms (viz. VVOO) can be distributed over PEs is necessary because the parallel program should be capable of treating systems with 50 electrons and 500 basis functions (50/500), although 100 electrons and 1000 basis functions is the target (100/1000). For 50/500 systems, the three VVOO spin combinations require 1.55 GB of memory. Considering the algorithm needs at least two such arrays (for T2 and R2) held in core, a processor with 1 GB of memory will not be able to store the minimum amount of data without overflowing to disk storage. Furthermore, the VVVO and VVVV terms will require 59.8 and 568 GB of storage, respectively. (This assumes a conversion factor of 1024.) The parallel algorithm should then at least allow for the possibility that the terms that use these MO integrals can be calculated from AO integrals generated on the fly.

Today, the fastest disk storage devices are able to seek for random data in approximately 3.7 milliseconds (measured on a Maxtor Atlas 15k 73 GB Ultra320 SCSI hard drive). The slowest desktop CPU (to within reasonable expectations of performance) operates at 1 GHz. (The Intel Pentium 4 line of CPUs can operate at 3.4 GHz.) Assuming that most floating-point operations in CCSD require multiplications and additions and that each operation completes in one clock cycle, the maximum theoretical multiply/add (fma) throughput is somewhere around 500 million fma/s. Together, the data indicate that a slow CPU could theoretically deliver 1.85 million fma/seek, which at least supports the assumption that computation is several orders of magnitude faster than I/O latency. On the same hard drive, bandwidth bursts measure 73.7 MB/s, which suggests that 280 kB of data could be transferred during the same time it took to locate the data.
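The storage figures quoted above can be reproduced by counting amplitudes and integrals for the 50/500 case (a sketch assuming 25 occupied and 475 virtual orbitals per spin, 8-byte reals, antisymmetry-packed same-spin blocks, and binary 1024-based gigabytes):

```python
from math import comb

O, V = 25, 475                 # occupied/virtual orbitals per spin (50/500)
DOUBLES_PER_GB = 1024 ** 3 / 8

# VVOO (T2) amplitudes: alpha-alpha and beta-beta (pair-packed) plus alpha-beta
vvoo = 2 * comb(V, 2) * comb(O, 2) + V * V * O * O

# VVVO integrals: two same-spin blocks plus two mixed-spin blocks
vvvo = 2 * comb(V, 2) * V * O + 2 * V ** 3 * O

# VVVV integrals: two same-spin blocks plus one mixed-spin block
vvvv = 2 * comb(V, 2) ** 2 + V ** 4

sizes_gb = [n / DOUBLES_PER_GB for n in (vvoo, vvvo, vvvv)]
```

With these assumptions the three totals come out to about 1.55, 59.8, and 568 GB, matching the text.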
The final assumption follows from the latest version of the most popular message-passing library, MPI (Message Passing Interface) [Gro99]. MPI-2 allows PEs in the same communication network to retrieve (GET), increment (ACCUMULATE), and overwrite (PUT) data that are stored by another CPU without communicating with it. In addition to their own proprietary communication libraries, most vendors of parallel computers offer implementations of the standard message-passing libraries (like MPI) to ensure maximum compatibility with the myriad parallel programs that depend on industry standards.

The main consequences of the aforementioned assumptions are that there must be more than one process, that recomputation is better than reading from disk, and that messages should be as large as possible. The primary goals of this parallelization effort are scalability and addressability (i.e., the ability to treat a wide range of molecular systems), so the grouping and factorization of terms should maximize message size and minimize message frequency while minimizing the local scratch memory requirements.

Group 1: T2 Virtual Orbitals Coming from V

One of the smallest groups of terms consists of T2 contributions in which both virtual orbitals come from the two-electron integrals. Equation 3-1 shows one same-spin case; the other is analogous. The integrals will be generated on the fly in the AO basis, so the expression must be written in terms of AO integrals. Before doing so, the second term must be split and rearranged to allow for better factorization.

r_ij^ab <- ⟨ab||ij⟩ + P̂(ij) Σ_c ⟨ab||cj⟩ t_i^c + (1/2) P̂(ij) Σ_cd ⟨ab||cd⟩ t_i^c t_j^d + (1/2) Σ_cd ⟨ab||cd⟩ t_ij^cd    (3-1)

r_ij^ab <- (1/2) P̂(ij) ⟨ab||ij⟩ + (1/2) P̂(ij) Σ_c ⟨ab||cj⟩ t_i^c + (1/2) P̂(ij) Σ_c ⟨ab||ic⟩ t_j^c + (1/2) P̂(ij) Σ_cd ⟨ab||cd⟩ t_i^c t_j^d + (1/2) Σ_cd ⟨ab||cd⟩ t_ij^cd    (3-2)
r_ij^ab <- (1/2) P̂(ij) Σ_λσ ⟨ab||λσ⟩ c_λ^i c_σ^j + (1/2) P̂(ij) Σ_λσc ⟨ab||λσ⟩ c_λ^c t_i^c c_σ^j + (1/2) P̂(ij) Σ_λσc ⟨ab||λσ⟩ c_λ^i c_σ^c t_j^c + (1/2) P̂(ij) Σ_λσcd ⟨ab||λσ⟩ c_λ^c t_i^c c_σ^d t_j^d + (1/2) Σ_λσ ⟨ab||λσ⟩ t_ij^λσ    (3-3)

Collecting the T1-dressed occupied coefficients, c̃_λ^i = c_λ^i + Σ_c c_λ^c t_i^c, condenses the first four terms into one product:

r_ij^ab <- (1/2) Σ_λσ ⟨ab||λσ⟩ [ P̂(ij) c̃_λ^i c̃_σ^j + t_ij^λσ ]    (3-4)

r_ij^ab <- (1/2) Σ_μνλσ c_μ^a c_ν^b ⟨μν||λσ⟩ [ P̂(ij) c̃_λ^i c̃_σ^j + t_ij^λσ ]    (3-5)

The opposite-spin terms are manipulated in the same fashion but do not require splitting any terms (Equations 3-6 through 3-8).

r_Ij^Ab <- ⟨Ab|Ij⟩ + Σ_C ⟨Ab|Cj⟩ t_I^C + Σ_c ⟨Ab|Ic⟩ t_j^c + Σ_Cd ⟨Ab|Cd⟩ t_I^C t_j^d + Σ_Cd ⟨Ab|Cd⟩ t_Ij^Cd    (3-6)

r_Ij^Ab <- Σ_λσ ⟨Ab|λσ⟩ [ c̃_λ^I c̃_σ^j + t_Ij^λσ ]    (3-7)

r_Ij^Ab <- Σ_μνλσ c_μ^A c_ν^b ⟨μν|λσ⟩ [ c̃_λ^I c̃_σ^j + t_Ij^λσ ]    (3-8)

Group 2: T2 Virtual Orbitals Coming from V and T

Set 1: One Virtual from V, One Virtual from T1

This set of T2 contributions is very similar to the Group 1 terms. The same-spin increment is shown in Equation 3-9. Since every term antisymmetrizes the contraction with respect to ab, the permutation operator is removed and the increment is identified as non-antisymmetric. The r̃ residual amplitudes (Appendix B) will be antisymmetrized with respect to both ab and AB virtual pairs at the end of the iteration. Splitting the second term, as in Equation 3-1, is also required.
r_ij^ab <- -P̂(ab) Σ_k t_k^a ⟨kb||ij⟩ - P̂(ab) P̂(ij) Σ_kc t_k^a t_j^c ⟨kb||ic⟩ - (1/2) P̂(ab) P̂(ij) Σ_kcd t_k^a t_i^c t_j^d ⟨kb||cd⟩ - (1/2) P̂(ab) Σ_kcd t_k^a t_ij^cd ⟨kb||cd⟩    (3-9)

r̃_ij^ab <- -(1/2) P̂(ij) Σ_k t_k^a ⟨kb||ij⟩ - (1/2) P̂(ij) Σ_kc t_k^a t_i^c ⟨kb||cj⟩ - (1/2) P̂(ij) Σ_kc t_k^a t_j^c ⟨kb||ic⟩ - (1/2) P̂(ij) Σ_kcd t_k^a t_i^c t_j^d ⟨kb||cd⟩ - (1/2) Σ_kcd t_k^a t_ij^cd ⟨kb||cd⟩    (3-10)

r̃_ij^ab <- -(1/2) Σ_k t_k^a Σ_λσ ⟨kb||λσ⟩ [ P̂(ij) c̃_λ^i c̃_σ^j + t_ij^λσ ]    (3-11)

With the T1-backtransformed coefficients t_μ^a = Σ_k c_μ^k t_k^a, this becomes

r̃_ij^ab <- -(1/2) Σ_μνλσ t_μ^a c_ν^b ⟨μν||λσ⟩ [ P̂(ij) c̃_λ^i c̃_σ^j + t_ij^λσ ]    (3-12)

The opposite-spin increments (Equation 3-13) are grouped by the bra-side T1 and then expressed in terms of AO integrals and factorized.

r_Ij^Ab <- -Σ_K t_K^A ⟨Kb|Ij⟩ - Σ_k t_k^b ⟨Ak|Ij⟩ - Σ_KC t_K^A t_I^C ⟨Kb|Cj⟩ - Σ_kC t_k^b t_I^C ⟨Ak|Cj⟩ - Σ_Kc t_K^A t_j^c ⟨Kb|Ic⟩ - Σ_kc t_k^b t_j^c ⟨Ak|Ic⟩ - Σ_KCd t_K^A t_I^C t_j^d ⟨Kb|Cd⟩ - Σ_kCd t_k^b t_I^C t_j^d ⟨Ak|Cd⟩ - Σ_KCd t_K^A t_Ij^Cd ⟨Kb|Cd⟩ - Σ_kCd t_k^b t_Ij^Cd ⟨Ak|Cd⟩    (3-13)

r_Ij^Ab <- -Σ_K t_K^A [ ⟨Kb|Ij⟩ + Σ_C t_I^C ⟨Kb|Cj⟩ + Σ_c t_j^c ⟨Kb|Ic⟩ + Σ_Cd t_I^C t_j^d ⟨Kb|Cd⟩ + Σ_Cd t_Ij^Cd ⟨Kb|Cd⟩ ]    (3-14a)

r_Ij^Ab <- -Σ_k t_k^b [ ⟨Ak|Ij⟩ + Σ_C t_I^C ⟨Ak|Cj⟩ + Σ_c t_j^c ⟨Ak|Ic⟩ + Σ_Cd t_I^C t_j^d ⟨Ak|Cd⟩ + Σ_Cd t_Ij^Cd ⟨Ak|Cd⟩ ]    (3-14b)

r_Ij^Ab <- -Σ_μνλσ t_μ^A c_ν^b ⟨μν|λσ⟩ [ c̃_λ^I c̃_σ^j + t_Ij^λσ ]    (3-15a)

r_Ij^Ab <- -Σ_μνλσ c_μ^A t_ν^b ⟨μν|λσ⟩ [ c̃_λ^I c̃_σ^j + t_Ij^λσ ]    (3-15b)
Set 2: One Virtual from V, One Virtual from T2

The same-spin set contains four T2 contributions that are ring contractions. Equations 3-16 through 3-18 show these terms and derive their AO basis expressions.

r_ij^ab <- P̂(ab) P̂(ij) Σ_kc [ ⟨kb||cj⟩ + Σ_d ⟨kb||cd⟩ t_j^d ] t_ik^ac + P̂(ab) P̂(ij) Σ_KC [ ⟨Kb|Cj⟩ + Σ_d ⟨Kb|Cd⟩ t_j^d ] t_iK^aC    (3-16)

r̃_ij^ab <- P̂(ij) Σ_k,λσ ⟨kb||λσ⟩ c̃_σ^j t_ik^aλ + P̂(ij) Σ_K,λσ ⟨Kb|λσ⟩ c̃_σ^j t_iK^aλ ,  with t_ik^aλ = Σ_c c_λ^c t_ik^ac    (3-17)

r̃_ij^ab <- P̂(ij) Σ_μνλσ c_ν^b c̃_σ^j ⟨μν|λσ⟩ t_iμ^aλ ,  with t_iμ^aλ the T2 amplitudes backtransformed on one occupied and one virtual index    (3-18)

Each half of the opposite-spin increments contains four ring terms and two particle-hole ladder terms. The MO and AO expressions are shown in Equations 3-19 and 3-20.

r_Ij^Ab <- Σ_kc t_Ik^Ac ⟨kb||cj⟩ + Σ_KC t_IK^AC ⟨Kb|Cj⟩ + Σ_Kc t_Kj^Ac ⟨Kb|Ic⟩ + Σ_kcd t_Ik^Ac t_j^d ⟨kb||cd⟩ + Σ_KCd t_IK^AC t_j^d ⟨Kb|Cd⟩ + Σ_KcD t_Kj^Ac t_I^D ⟨Kb|Dc⟩    (3-19a)

r_Ij^Ab <- Σ_KC t_jK^bC ⟨KA|CI⟩ + Σ_kc t_jk^bc ⟨kA|cI⟩ + Σ_kC t_kI^bC ⟨kA|jC⟩ + Σ_KCD t_jK^bC t_I^D ⟨KA|CD⟩ + Σ_kcD t_jk^bc t_I^D ⟨kA|cD⟩ + Σ_kCd t_kI^bC t_j^d ⟨kA|dC⟩    (3-19b)

r_Ij^Ab <- Σ_μνλσ c_ν^b c̃_σ^j ⟨μν|λσ⟩ t_Iμ^Aλ    (3-20a)

r_Ij^Ab <- Σ_μνλσ c_μ^A c̃_λ^I ⟨μν|λσ⟩ t_jν^bσ    (3-20b)
Group 3: T2 Virtual Orbitals Coming from T

Set 1: Rings and PH Ladders

There are 31 UHF T2 terms that belong to this set. Each process loads a block of occupied-occupied (OO) integrals with AO ket indices¹ and calculates all of the contributions to the range of R2 amplitudes that it is assigned to store. Grouping and factorizing the terms for performance is heavily dependent on the manner in which the old T2 amplitudes are stored and accessed.

The terms were first cataloged and sorted into subsets according to spin pairs. For example, if the first virtual index is stride 1, then the term Σ t_Ik^Ac V_cd^kl t_Jl^Bd must step down in A and down in B for a given cd pair in V. To minimize the number of messages, any other ring terms that have the same access pattern should be calculated at the same time. From the 31 UHF terms, three access patterns emerged. Any term that did not prefer one pattern to another was added to the pattern that had the most similar intermediate products.

Subset 1: AC/BD

Subset 1 contains 13 terms that step through T2 in the order AC/BD, which corresponds to the order of the two bra indices. More specifically, for a given CD pair of ket shells, the first (or outer) loop runs over A and the second (or inner) loop runs over B. All of the T2 contributions are performed in this series.

r_IJ^AB <- P̂(AB) P̂(IJ) [ (1/2) Σ_KLCD t_IK^AC V_CD^KL t_LJ^DB + Σ_KLD t_K^A V_ID^KL t_LJ^DB + Σ_KLCD t_K^A t_I^C V_CD^KL t_LJ^DB ]    (3-21)

r_IJ^AB <- (1/2) P̂(AB) P̂(IJ) Σ_klcd t_Ik^Ac V_cd^kl t_lJ^dB + P̂(AB) P̂(IJ) [ Σ_KlCd t_IK^AC V_Cd^Kl t_lJ^dB + Σ_Kld t_K^A V_Id^Kl t_lJ^dB + Σ_KlCd t_K^A t_I^C V_Cd^Kl t_lJ^dB ]    (3-22)

¹ V notation can be ambiguous with respect to whether AO integrals are antisymmetrized. Both electrons in the terms in this section have at least one occupied MO index, so V notation is used to save equation width.
r_Ij^Ab <- Σ_klcd t_Ik^Ac V_cd^kl t_lj^db + Σ_KlCd t_IK^AC V_Cd^Kl t_lj^db    (3-23)

r_Ij^Ab <- Σ_klc t_Ik^Ac V_cj^kl t_l^b + Σ_klcd t_Ik^Ac V_cd^kl t_j^d t_l^b + Σ_KlC t_IK^AC V_Cj^Kl t_l^b + Σ_KlCd t_IK^AC V_Cd^Kl t_j^d t_l^b    (3-24)

In terms of uncontracted ket indices of V, the partial contributions reduce to Equations 3-25 through 3-27, in which each equation corresponds to a "left-side" intermediate enclosed in parentheses.

r̃_IJ^AB <- P̂(IJ) Σ_Lσ ( (1/2) Σ_Kλ t_IK^Aλ V_λσ^KL + Σ_Kλ t_K^A c̃_λ^I V_λσ^KL ) t_JL^Bσ    (3-25)

r̃_IJ^AB <- P̂(IJ) Σ_lσ ( (1/2) Σ_kλ t_Ik^Aλ V_λσ^kl + Σ_Kλ t_IK^Aλ V_λσ^Kl + Σ_Kλ t_K^A c̃_λ^I V_λσ^Kl ) t_Jl^Bσ    (3-26)

r̃_Ij^Ab <- Σ_lσ ( Σ_kλ t_Ik^Aλ V_λσ^kl + Σ_Kλ t_IK^Aλ V_λσ^Kl ) ( t_jl^bσ + t_l^b c̃_σ^j )    (3-27)

Subset 2: DB/CA

The DB/CA subset calculates 12 terms by looping first over B and then over A.

r_ij^ab <- P̂(ab) P̂(ij) [ (1/2) Σ_klcd t_ik^ac V_cd^kl t_lj^db + Σ_klc t_ik^ac V_cj^kl t_l^b + Σ_klcd t_ik^ac V_cd^kl t_j^d t_l^b ]    (3-28)

r_ij^ab <- (1/2) P̂(ab) P̂(ij) Σ_KLCD t_iK^aC V_CD^KL t_Lj^Db + P̂(ab) P̂(ij) [ Σ_KlCd t_iK^aC V_Cd^Kl t_lj^db + Σ_KCl t_iK^aC V_Cj^Kl t_l^b + Σ_KlCd t_iK^aC V_Cd^Kl t_j^d t_l^b ]    (3-29)

r_Ij^Ab <- Σ_KLCD t_IK^AC V_CD^KL t_Lj^Db    (3-30)

r_Ij^Ab <- Σ_KLD t_K^A V_ID^KL t_Lj^Db + Σ_KLCD t_K^A t_I^C V_CD^KL t_Lj^Db + Σ_Kld t_K^A V_Id^Kl t_lj^db + Σ_KlCd t_K^A t_I^C V_Cd^Kl t_lj^db    (3-31)
The MO equations can be reduced into four equations with four "right-side" intermediates when they are expressed in terms of V with uncontracted ket indices.

r̃_ij^ab <- P̂(ij) Σ_kλ t_ki^aλ ( (1/2) Σ_lσ V_λσ^kl t_lj^bσ + Σ_lσ V_λσ^kl c̃_σ^j t_l^b )    (3-32)

r̃_ij^ab <- P̂(ij) Σ_Kλ t_Ki^aλ ( (1/2) Σ_Lσ V_λσ^KL t_Lj^σb + Σ_lσ V_λσ^Kl t_lj^bσ + Σ_lσ V_λσ^Kl c̃_σ^j t_l^b )    (3-33)

r_Ij^Ab <- Σ_Kλ t_KI^Aλ Σ_Lσ V_λσ^KL t_Lj^σb    (3-34)

r_Ij^Ab <- Σ_Kλ t_K^A c̃_λ^I ( Σ_Lσ V_λσ^KL t_Lj^σb + Σ_lσ V_λσ^Kl t_lj^bσ )    (3-35)

Subset 3: CB/AD

The remaining terms are primarily PH ladder terms, and the outer loop is over B. If the bra shells are thought of as a square VV matrix, then the main difference in loops between this subset and Subset 2 is that the inner loop A steps down the D column, while the inner loop A of Subset 2 steps across the C row.

r_Ij^Ab <- Σ_KlCd t_Il^Ad V_Cd^Kl t_Kj^Cb    (3-36)

r_Ij^Ab <- Σ_KlCd t_Kj^Ad V_Cd^Kl t_Il^Cb + Σ_Kld t_Kj^Ad V_Id^Kl t_l^b + Σ_KlCd t_Kj^Ad V_Cd^Kl t_I^C t_l^b    (3-37)

r_Ij^Ab <- Σ_KlC t_K^A V_Cj^Kl t_Il^Cb + Σ_KlCd t_K^A t_j^d V_Cd^Kl t_Il^Cb    (3-38)

Factorizing the MO expressions in terms of right-side intermediates and uncontracted ket indices yields Equations 3-39 through 3-41.

r_Ij^Ab <- Σ_lλ t_Il^Aλ Σ_Kσ V_λσ^Kl t_Kj^σb    (3-39)
r_Ij^Ab <- Σ_Kλ t_Kj^Aλ ( Σ_lσ V_λσ^Kl t_Il^σb + Σ_lσ V_λσ^Kl c̃_σ^I t_l^b )    (3-40)

r_Ij^Ab <- Σ_Kσ t_K^A c̃_σ^j Σ_lλ V_λσ^Kl t_Il^λb    (3-41)

Set 2: HH Ladders from OOVV, OOOV, and OOOO

The ladder term that is quadratic in T2 can be calculated in one of two ways: with a HH ladder intermediate or with a PP ladder intermediate. Opting for the HH ladder intermediate saves a large amount of memory and memory writes. Three other terms have HH ladder intermediates from the OOVV integral and can be factorized with the quadratic term. The same-spin contributions are shown (Equation 3-42).

r_ij^ab <- (1/4) Σ_klcd ⟨kl||cd⟩ t_ij^cd t_kl^ab + (1/4) P̂(ij) Σ_klcd ⟨kl||cd⟩ t_i^c t_j^d t_kl^ab + (1/4) P̂(ab) Σ_klcd ⟨kl||cd⟩ t_ij^cd t_k^a t_l^b + (1/4) P̂(ij) P̂(ab) Σ_klcd ⟨kl||cd⟩ t_i^c t_j^d t_k^a t_l^b    (3-42)

The four ladder-like terms that come from OOOO and OOOV integrals can be combined with the OOVV ladders when the ket orbitals are from the plain AO integral. Equation 3-43 shows the MO contributions.

r_ij^ab <- (1/2) Σ_kl ⟨kl||ij⟩ t_kl^ab + (1/2) P̂(ij) Σ_klc ⟨kl||cj⟩ t_i^c t_kl^ab + (1/2) P̂(ab) Σ_kl ⟨kl||ij⟩ t_k^a t_l^b + (1/2) P̂(ij) P̂(ab) Σ_klc ⟨kl||cj⟩ t_i^c t_k^a t_l^b    (3-43)

Combining the two equations, incrementing pre-antisymmetrized amplitudes, and expressing the ket indices in the uncontracted AO basis yields Equation 3-45.

r̃_ij^ab <- Σ_kl,λσ ⟨kl|λσ⟩ [ (1/4) t_ij^λσ + (1/2) P̂(ij) c̃_λ^i c̃_σ^j ] t_kl^ab    (3-44a)
r̃_ij^ab <- Σ_kl,λσ ⟨kl|λσ⟩ [ (1/4) t_ij^λσ + (1/2) P̂(ij) c̃_λ^i c̃_σ^j ] t_k^a t_l^b    (3-44b)

r̃_ij^ab <- Σ_kl,λσ ⟨kl|λσ⟩ [ (1/4) t_ij^λσ + (1/2) P̂(ij) c̃_λ^i c̃_σ^j ] ( t_kl^ab + t_k^a t_l^b )    (3-45)

The opposite-spin terms and factorized increment are shown in Equations 3-46 through 3-48.

r_Ij^Ab <- Σ_KlCd ⟨Kl|Cd⟩ [ t_Ij^Cd t_Kl^Ab + t_I^C t_j^d t_Kl^Ab + t_Ij^Cd t_K^A t_l^b + t_I^C t_j^d t_K^A t_l^b ]    (3-46)

r_Ij^Ab <- Σ_Kl ⟨Kl|Ij⟩ t_Kl^Ab + Σ_Kl ⟨Kl|Ij⟩ t_K^A t_l^b + Σ_KlC ⟨Kl|Cj⟩ t_I^C t_Kl^Ab + Σ_KlC ⟨Kl|Cj⟩ t_I^C t_K^A t_l^b + Σ_Klc ⟨Kl|Ic⟩ t_j^c t_Kl^Ab + Σ_Klc ⟨Kl|Ic⟩ t_j^c t_K^A t_l^b    (3-47)

r_Ij^Ab <- Σ_Kl,λσ ⟨Kl|λσ⟩ [ t_Ij^λσ + c̃_λ^I c̃_σ^j ] ( t_Kl^Ab + t_K^A t_l^b )    (3-48)

Group 4: T2 into T1

The T2 amplitudes contribute to the R1 amplitudes in 14 terms. These terms can be grouped together such that they define three one-particle intermediate operators, which effectively reduce the particle number of the perturbation. Equations 3-49 and 3-50 show the terms that define the IP (particle) and IH (hole) intermediates, respectively. Equation 3-51 contains both spin cases of the IR (ring) intermediate.

r_i^a <- -(1/2) Σ_klc ⟨kl||ci⟩ t_kl^ca - Σ_KlC ⟨Kl|Ci⟩ t_Kl^Ca - (1/2) Σ_klcd ⟨kl||cd⟩ t_i^d t_kl^ca - Σ_KlCd ⟨Kl|Cd⟩ t_i^d t_Kl^Ca    (3-49)

r_i^a <- (1/2) Σ_kcd ⟨ka||cd⟩ t_ki^cd + Σ_KCd ⟨Ka|Cd⟩ t_Ki^Cd - (1/2) Σ_klcd ⟨kl||cd⟩ t_l^a t_ki^cd - Σ_KlCd ⟨Kl|Cd⟩ t_l^a t_Ki^Cd    (3-50)
r_i^a <- Σ_ld f_ld t_li^da + Σ_klcd ⟨kl||cd⟩ t_k^c t_li^da + Σ_KlCd ⟨Kl|Cd⟩ t_K^C t_li^da + Σ_LD f_LD t_Li^Da + Σ_KLCD ⟨KL||CD⟩ t_K^C t_Li^Da + Σ_kLcD ⟨kL|cD⟩ t_k^c t_Li^Da    (3-51)

The terms that define the intermediates have been isolated in Equations 3-52 through 3-54. Both spin cases of the IR intermediate are shown because they both contribute to R1 amplitudes. (The terms with the Fock matrix elements are not part of the intermediate.) The remaining spin cases of IP and IH are analogous to the cases shown.

I_P:  r_i^a <- -Σ_σ ( (1/2) Σ_klc t_kl^ca ⟨kl||cσ⟩ + Σ_KlC t_Kl^Ca ⟨Kl|Cσ⟩ ) c̃_σ^i    (3-52)

I_H:  r_i^a <- Σ_μ ( c_μ^a - t_μ^a ) ( (1/2) Σ_kcd ⟨kμ||cd⟩ t_ki^cd + Σ_KCd ⟨Kμ|Cd⟩ t_Ki^Cd )    (3-53)

I_R:  r_i^a <- Σ_LD ( Σ_KC t_K^C ⟨KL||CD⟩ + Σ_kc t_k^c ⟨kL|cD⟩ ) t_Li^Da + Σ_LD f_LD t_Li^Da    (3-54a)

I_R:  r_i^a <- Σ_ld ( Σ_kc t_k^c ⟨kl||cd⟩ + Σ_KC t_K^C ⟨Kl|Cd⟩ ) t_li^da + Σ_ld f_ld t_li^da    (3-54b)

Group 5: One-Particle Transformations

Set 1: T1 Contributions

The 12 remaining T1 contributions can be loosely described as one-particle transformations. The remaining spin terms are analogous to the terms shown. These are then drastically simplified with the IR intermediate, and further consolidation yields a non-HF set in Equation 3-61 and an IR set in Equation 3-62.

F_o^v:  r_i^a <- f_ai + Σ_kc t_k^c ⟨ka||ci⟩ + Σ_KC t_K^C ⟨Ka|Ci⟩    (3-55)

F_v^v:  r_i^a <- Σ_d f_ad t_i^d + Σ_kcd ⟨ka||cd⟩ t_k^c t_i^d + Σ_KCd ⟨Ka|Cd⟩ t_K^C t_i^d    (3-56)
$F_{oo}:\quad r^{a}_{i} \mathrel{+}= f_{li}\,t^{a}_{l} + f_{ld}\,t^{d}_{i}t^{a}_{l} + \langle kl\|ci\rangle\,t^{c}_{k}t^{a}_{l} + \langle Kl|Ci\rangle\,t^{C}_{K}t^{a}_{l} + \langle kl\|cd\rangle\,t^{c}_{k}t^{d}_{i}t^{a}_{l} + \langle Kl|Cd\rangle\,t^{C}_{K}t^{d}_{i}t^{a}_{l}$ (3-57)

$F_{ov}:\quad r^{a}_{i} \mathrel{+}= f_{ai} + c_{a}\,I_R\,c_{i}$ (3-58)

$F_{vv}:\quad r^{a}_{i} \mathrel{+}= f_{ad}\,t^{d}_{i} + c_{a}\,I_R\,t_{i}$ (3-59)

$F_{oo}:\quad r^{a}_{i} \mathrel{+}= f_{li}\,t^{a}_{l} + f_{ld}\,t^{d}_{i}t^{a}_{l} + t_{a}\,I_R\,(c_{i}+t_{i})$ (3-60)

$r^{a}_{i} \mathrel{+}= f_{ai} + f_{ad}\,t^{d}_{i} + f_{li}\,t^{a}_{l} + f_{ld}\,t^{d}_{i}t^{a}_{l}$ (3-61)

$r^{a}_{i} \mathrel{+}= (c_{a}+t_{a})\,I_R\,(c_{i}+t_{i})$ (3-62)

Set 2: T2 Contributions

Two one-particle transformations describe the final T2 contributions.

$F_{oo}(1):\quad r^{ab}_{ij} \mathrel{+}= \hat P(ij)\,\big[f_{li} + f_{ld}\,t^{d}_{i}\big]\,t^{ab}_{lj} + \hat P(ij)\,\big[\tfrac12\langle kl\|cd\rangle\,t^{cd}_{ki} + \langle Kl|Cd\rangle\,t^{Cd}_{Ki}\big]\,t^{ab}_{lj}$ (3-63a)

$F_{oo}(2):\quad r^{ab}_{ij} \mathrel{+}= \hat P(ij)\,\big[\langle kl\|ci\rangle\,t^{c}_{k} + \langle Kl|Ci\rangle\,t^{C}_{K}\big]\,t^{ab}_{lj} + \hat P(ij)\,\big[\langle kl\|cd\rangle\,t^{c}_{k}t^{d}_{i} + \langle Kl|Cd\rangle\,t^{C}_{K}t^{d}_{i}\big]\,t^{ab}_{lj}$ (3-63b)

$F_{vv}(1):\quad r^{ab}_{ij} \mathrel{+}= \hat P(ab)\,\big[f_{ad} + f_{ld}\,t^{a}_{l}\big]\,t^{db}_{ij} + \hat P(ab)\,\big[\tfrac12\langle kl\|cd\rangle\,t^{ca}_{kl} + \langle Kl|Cd\rangle\,t^{Ca}_{Kl}\big]\,t^{db}_{ij}$ (3-64a)
$F_{vv}(2):\quad r^{ab}_{ij} \mathrel{+}= \hat P(ab)\,\big[\langle ka\|cd\rangle\,t^{c}_{k} + \langle Ka|Cd\rangle\,t^{C}_{K}\big]\,t^{db}_{ij} + \hat P(ab)\,\big[\langle kl\|cd\rangle\,t^{c}_{k}t^{a}_{l} + \langle Kl|Cd\rangle\,t^{C}_{K}t^{a}_{l}\big]\,t^{db}_{ij}$ (3-64b)

Similar to the F1 intermediates used to calculate the T1 contributions, the one-particle transformation matrix for T2 into R2 can be defined in terms of IP, IH, and IR, and using these definitions, the same-spin and mixed-spin increments can be trivially reduced to four terms (Equations 3-67 and 3-68).

$F_{li} = f_{li} + f_{ld}\,t^{d}_{i} + c_{l}\,I_H + c_{l}\,I_R\,(c_{i}+t_{i})$ (3-65)

$F_{ad} = f_{ad} + f_{ld}\,t^{a}_{l} + I_P\,c_{d} + (c_{a}+t_{a})\,I_R\,c_{d}$ (3-66)

$r^{ab}_{ij} \mathrel{+}= F_{ac}\,t^{cb}_{ij} + F_{bc}\,t^{ac}_{ij} + F_{ki}\,t^{ab}_{kj} + F_{kj}\,t^{ab}_{ik}$ (3-67)

$r^{Ab}_{Ij} \mathrel{+}= F_{AC}\,t^{Cb}_{Ij} + F_{bc}\,t^{Ac}_{Ij} + F_{KI}\,t^{Ab}_{Kj} + F_{kj}\,t^{Ab}_{Ik}$ (3-68)
CHAPTER 4
PARALLEL ALGORITHM

Virtual Parallel Computer

Before any algorithm can be parallelized, the type of parallel computer that the algorithm will run on must be identified. Is remote memory accessible through one-sided operations, or must each process explicitly interrupt and request data from another process? How much data will be needed during each section of the algorithm, and can that data be held by a single process? Does each process have the ability to read data from a storage device, and if so, what is the penalty for multiple processes reading data simultaneously? These are the types of questions that must be answered before making assumptions, which will shape the plan for parallelization.

The most basic parallel computer is a network of workstations, in which each node is a single-CPU computer with a private memory space and a storage device, usually a hard disk. The most common connection fabric for these networks is TCP/IP over Ethernet. This allows any node to communicate with any other node in the network, and the network hardware or operating system software will handle delivering the messages. (As opposed to each process having to maintain a table of which nodes are directly accessible and which nodes must be reached through other nodes.) The cost of maintaining these networks is not linearly proportional to their size, so it is common for 500+ node networks to remove the most failure-prone element, the hard disk, and provide other means for each node to conduct I/O operations.
The extreme opposite of this picture is a single computer with multiple CPUs, which are all able to access the same memory space and hard drive(s) with equal performance. Each CPU is identical to the others, so these are commonly referred to as Symmetric Multi-Processor (SMP) computers; however, there are tremendous costs associated with adding each additional CPU, so it is unusual to have more than 8 CPUs in a truly SMP machine. Some manufacturers have mini-networks within a single computer that allow them to increase the number of CPUs without making them pure SMP. To an application programmer, these machines are programmed in the same way as Symmetric Multi-Processors, but the true label is Shared Memory Programming (also SMP!) even though the CPUs are not identical in terms of distance to memory, cache, and possibly hard drives. Henceforth, there will be no distinction between the meanings of SMP since the parallel CCSD algorithm will be implemented as a user process and not a system or kernel process.

The balance between cost and performance lies somewhere between the two extreme types of parallel computers. The primary offerings of the major vendors of supercomputers are clusters of SMP machines. For fine-grain parallelization, each node has multiple CPUs accessing the same memory, and for coarse-grain parallelization, each node has physically separate resources. The virtual parallel computer that the parallel CCSD algorithm relies on is this type of cluster of SMP workstations; however, program portability and the ability to tune the network parameters are very desirable.

The only way to build portability and tunability into the program without having divergent paths through the software is to model the physical computer in the boundaries of network communication. By properly defining certain subnetworks, no process will
attempt to communicate through slow network channels when faster ones may exist; furthermore, messages pertaining to a certain context can be limited to regions of the network, which might reduce network contention on some architectures. This allows the scope of communications to be set once at program run time (or compile time), and the algorithm will transparently acknowledge the underlying hardware up to the efficiency of the partitioning scheme.

Communication Networks

In some sense, parallelizing algorithms that handle massive amounts of data and computations is akin to managing disaster recovery. The moment the event occurs, there are many diverse and unused workers who have no assignment or direction. As the situation is assessed, teams of workers are sent to handle different aspects of the recovery, from the time critical, like search and rescue of survivors, to the time forgiving, like public relations and cost reimbursement.

Wildfire firefighters have some of the most extensive experience in rapid deployment and management of labor units. In the late 1970s, the Incident Command System (ICS) was created in response to a series of wildfire disasters in southern California [Fed98]. It has since been adopted and refined by several federal and state agencies and provides the basis for most emergency and non-emergency incident responses. The parallel CCSD algorithm uses the following aspects of its effectiveness:

Modular/scalable organization
Manageable span of control
Unit designations.

The lowest level of the organization is the single process (normally assigned to a single CPU). Every process must have its own memory heap, and the combination of a
process and its memory is called a processor element (PE). PEs recognize other PEs on the same physical node and form squads (SQD). Squads can be grouped together to form platoons (PLT) based on algorithm requirements, and platoons can be grouped together to form companies (CO), which execute large-scale algorithms.

Although the units were listed in increasing size, their division and recognition happen in decreasing size. The commanding process evaluates the available resources and divides the group into subgroups accordingly. For example, the Company Commander (CO_CDR) would catalog the total distributed memory and channels to disk and form platoons based on how the company algorithm uses those resources. The squads would then self-assemble within the platoons based on which node the PEs are on.

Each division has its own communication space that does not interfere with messages in another division, and there are special communication spaces for the commanders of the subdivisions. This allows platoons to synchronize independently from each other, and more importantly, it allows company algorithms to disregard the messages of other companies. The company that performs the CCSD iterations and calculates the T amplitudes is called the Amplitude Company. Figure 4-1 illustrates the communication boundaries for a company and its subdivisions.

Data Management

Arrays stored in distributed memory

The fundamental data elements involved in standard CC theory are T amplitudes, integrals, and the reference wavefunction. The predominant method for solving the T amplitude equations is iterating over a coupled set of equations for T1 and T2. This makes the amplitudes stateful at each iteration compared to the two-electron integrals, which are stateless. Thus, the minimum amount of information that must be stored during the CC
iterations is TOLD (T1 and T2), TNEW (R1 and R2), and any information that defines the reference wavefunction. The integrals do not need to be stored since they can be generated on the fly.

Figure 4-1. Communication network boundaries. Rounded rectangles represent communication networks (in italics). Square rectangles represent physical entities. Dashed square rectangles could be redundant entities (i.e., a platoon commander will probably be the leader of the first squad).

It is assumed that a single PE cannot store a full OOVV quantity. Even for SMP machines, a single OOVV array might exceed the total physical memory of the node, so the T2 old and new arrays must be distributed over a set of PEs. It is also assumed that each PE is able to retrieve and increment remote memory with one-sided communications (i.e., without interrupting the process that allocated the memory). If the total distributed memory is many times larger than the amount of data that must be stored, then
subdividing the Amplitude Company into platoons such that multiple copies of the amplitudes are available for reading will decrease the average distance between a PE and the distributed data that it attempts to access.

In this algorithm, the number of PEs in the Amplitude Company is almost irrelevant, but the total amount of distributed memory is critical. The Company must be able to store TOLD, TNEW (R), and the larger of T2 or the amount of scratch space needed per PE, which depends on the chemical system (Figure 4-7). Each PE will have its own local copy of small frequently accessed arrays, which include basis set definitions, AO coefficients, T1 and R1 amplitudes, and other one-particle intermediates. While the PEs are calculating contributions to the R amplitudes, only T2 and R2 need to be accessed in addition to local scratch memory. Between certain sections, R2 must be transformed, which requires enough scratch memory for storing the third distributed OOVV array.

Figure 4-7. The memory heap management for various molecular systems. In one scenario, the minimum scratch memory needed for amplitude contributions is less than the amount of memory needed to store a section of R2 during a transformation. The other scenario is when the opposite occurs.

Example 1. The user runs a calculation that requires 128 MB of reserved data per PE, such as basis set information, one-particle matrix elements, etc. The calculation requires another 128 MB (minimum) of scratch memory per PE for calculating various
quantities during the CCSD iteration, and the OOVV quantities require 4096 MB of distributed memory from all of the PEs. If the user submits the job with 1024 MB of heap per process, what is the minimum number of PEs needed to perform the calculation? Alternatively, does the calculation have enough resources if the user submits the job with X PEs, each with 1024 MB of memory?

Solution 1. The amount of non-reserved memory is 1024 - 128 MB/PE (i.e., 896 MB is available for scratch and distributed data on each PE). Three OOVV arrays require 12,288 MB. Dividing this amount by 896 MB/PE and rounding up yields a minimum of 14 PEs. Distributing each OOVV array over 14 PEs requires each PE to store approximately 293 MB of data per array. Since this amount is larger than the minimum amount of scratch memory needed (128 MB), the calculation can proceed with 14 or more PEs.

Example 2. Using the same system data as above, but increasing the minimum required amount of scratch memory to 512 MB, what is the minimum number of PEs needed to run the calculation?

Solution 2. This time, the regular scratch memory is larger than the distributed OOVV scratch array. In essence, this becomes unavailable for distributed data, so the effective amount of memory per PE for the OOVV arrays is (896 - 512) = 384 MB. Since the third OOVV array can fit in the scratch memory, only two OOVV arrays, totaling 8192 MB, need to be distributed over the Amplitude Company. 8192 MB divided by 384 MB/PE yields 22 PEs (rounded up). This means each PE will store approximately 187 MB of each OOVV array, and the calculation can proceed if there are at least 22 PEs.

The minimum PE test can be described by the following pseudocode:
mem_dist  = mem_heap - mem_reserved
min_PEs   = ceiling( 3 * mem_OOVV / mem_dist )
mem_block = mem_OOVV / min_PEs
if (mem_block < mem_minscr) then
   mem_dist = mem_dist - mem_minscr
   min_PEs  = ceiling( 2 * mem_OOVV / mem_dist )
end if

If the user runs the job with more than twice the minimum number of PEs, then the Amplitude Company can divide into multiple platoons, each with its own copy of TOLD and TNEW. One of the benefits of this approach is that fewer PEs will access each block of distributed memory, but the main advantage is that knowledge of the underlying hardware can be programmed into the partitioning scheme and the algorithm will transparently align itself with the hardware boundaries.

Arrays stored on disk

The VVVV and VVVO integrals will not be transformed and stored on disk, so their contributions must be calculated on the fly in the AO basis. The OOOO and OOOV integrals will not be stored on disk either, not because they are too large, but because their contributions can be accounted for with intermediates from the direct integrals and from the stored OOVV half-transformed integrals. The stored integrals can be distributed over and individually processed by the platoons of the Amplitude Company.

Parallel UHF CCSD

Each CC iteration begins with initializing the global intermediates and residual amplitudes to zero. The TOLD amplitudes are defined by Equation 4-1. This formulation allows the amplitudes to contract directly with AO integrals. The T2 residual amplitudes will have different definitions depending on which part of the algorithm is executing. In the Direct Integral (DI) phase, one of the virtual indices is coming from an uncontracted AO integral, so the residual is related to the MO residual amplitudes by Equation 4-2.
$t^{\mu}_{i} = c^{\mu}_{a}\,t^{a}_{i};\qquad t^{\mu\nu}_{ij} = c^{\mu}_{a}c^{\nu}_{b}\,t^{ab}_{ij}$ (4-1a)

$t^{\mu\nu}_{Ij} = c^{\mu}_{A}c^{\nu}_{b}\,t^{Ab}_{Ij}$ (4-1b)

$r^{Ab}_{Ij} = c^{A}_{\mu}c^{b}_{\nu}\,\tilde r^{\mu\nu}_{Ij}$ (4-2)

The Stored Integral (SI) and One Particle (OP) phases calculate residual contributions with both virtual orbitals coming from T, so the DI residual is transformed according to Equation 4-3.

$\tilde r^{\mu\nu}_{Ij} = c^{\nu}_{b}c^{\sigma}_{b}\,\tilde r^{\mu\sigma}_{Ij} = Q^{\nu\sigma}\,\tilde r^{\mu\sigma}_{Ij}$ (4-3)

Throughout the iteration, each PE stores all OO pairs (ket indices) for a range of AO shell pairs (bra indices). The amplitudes are distributed over the platoon (Figure 4-2), and one-sided communications retrieve and increment all OO pairs for a block of AO shell pairs.

Figure 4-2. The T2 and R2 memory layout. Each square represents the four-dimensional array of amplitudes that have the designated spin pair and whose AO bra indices belong to the MN shell pair. The arrows represent the order in memory.
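The exact function mapping a bra (MN) shell pair to the PE that stores it is not spelled out in the text, so the helper below (names hypothetical) sketches one contiguous block distribution consistent with the layout described above: each PE owns a compact range of shell pairs, and the first few PEs absorb the remainder when the count does not divide evenly.

```python
def block_owner(pair_index, n_pairs, n_pes):
    """Map a bra (MN) shell-pair index to (owning PE, local offset) under a
    contiguous block distribution: the first n_pairs % n_pes PEs each hold
    one extra pair, so the layout stays balanced to within one pair."""
    base, extra = divmod(n_pairs, n_pes)
    cut = extra * (base + 1)          # pairs covered by the "large" PEs
    if pair_index < cut:
        return pair_index // (base + 1), pair_index % (base + 1)
    rest = pair_index - cut
    return extra + rest // base, rest % base
```

With a closed-form owner function like this, a PE can compute the target of a one-sided GET or ACCUMULATE locally, without any lookup traffic on the network.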
After all of the contributions have been calculated, the residual is antisymmetrized with respect to the two virtual indices and back-transformed to the MO basis. The new T amplitudes are calculated by dividing each amplitude by the diagonal elements of the Fock matrix.

Direct Integral Phase

Design considerations

Analyzing the T2 contribution in Equation 4-4 can summarize the DI phase. Since the VVVO integrals are not transformed and stored prior to the CC iterations, the terms must be evaluated in the AO basis.

$r^{Ab}_{Ij} \mathrel{+}= \langle Kb|Cd\rangle\,t^{A}_{K}t^{Cd}_{Ij} + \langle Al|Cd\rangle\,t^{b}_{l}t^{Cd}_{Ij}$ (4-4)

$r^{Ab}_{Ij} \mathrel{+}= (c^{A}_{\mu}+t^{A}_{\mu})(c^{b}_{\nu}+t^{b}_{\nu})\,\tilde t^{\mu\nu}_{Ij}$ (4-5)

The biggest problem with this situation is that every AO integral causes a global update to R2 (i.e., one integral contributes to every R2 amplitude). The best way to reduce the number of operations is to precondition R2 such that a smaller set of VVVO contributions is calculated. R2 would then be post-transformed after the contributions have been accumulated (Equations 4-6 through 4-8).

$r^{A\nu}_{Ij} = c^{\nu}_{b}\,r^{Ab}_{Ij}$ (4-6)

$r^{A\nu}_{Ij} \mathrel{+}= t^{A}_{\mu}\,\tilde t^{\mu\nu}_{Ij}$ (4-7)

$r^{Ab}_{Ij} = c^{b}_{\nu}\,r^{A\nu}_{Ij}$ (4-8)

For each integral, only an IjA strip of R2 amplitudes is incremented compared to the full set without the backward and forward transformations of R. The difficulty with
the mixed-spin amplitudes is that while one spin side can get away with 1/V updates, the other side still requires a global update per integral (Equation 4-9). One orbital must be forward transformed into A, but the other orbital must be transformed into b by T1, which must then be back-transformed by c1. One solution is to calculate the two spin cases in separate passes (while transforming R between them). Although this requires generating the AO integrals a second time, the performance of a naturally parallel local computation is much greater than a global update per N4 matrix element.

$r^{A\nu}_{Ij} \mathrel{+}= c^{A}_{\mu}\,c^{\nu}_{b}\,t^{b}_{\sigma}\,\tilde t^{\mu\sigma}_{Ij}$ (4-9)

A second performance issue is antisymmetrization of the same-spin integrals. Equation 4-10 shows one of the ring contributions in MOs, and Equation 4-11 shows the terms appropriately transformed into the AO basis.

$r^{Ab}_{Ij} \mathrel{+}= \langle kb\|cd\rangle\,t^{Ac}_{Ik}t^{d}_{j}$ (4-10)

$\tilde r^{\mu\nu}_{Ij} \mathrel{+}= \langle \lambda\nu\|\sigma\delta\rangle\,\tilde t^{\mu\sigma}_{I\lambda}\,\tilde t^{\delta}_{j}$ (4-11)

The AO integrals are not antisymmetrized by the integral generator, so one of two approaches must be taken to account for both contributions. The first approach involves two updates per integral:

$\tilde r^{\mu\nu}_{Ij} \mathrel{+}= \langle \lambda\nu|\sigma\delta\rangle\,\tilde t^{\mu\sigma}_{I\lambda}\,\tilde t^{\delta}_{j};\qquad \tilde r^{\mu\nu}_{Ij} \mathrel{-}= \langle \lambda\nu|\delta\sigma\rangle\,\tilde t^{\mu\sigma}_{I\lambda}\,\tilde t^{\delta}_{j}$ (4-12)

The benefit of this method is that the PE needs only enough memory to store the AO integral batch and the T2 batch; however, two updates per integral is more costly than two reads per integral, which can be achieved by antisymmetrizing the pair:
$\tilde r^{\mu\nu}_{Ij} \mathrel{+}= \big[\langle \lambda\nu|\sigma\delta\rangle - \langle \lambda\nu|\delta\sigma\rangle\big]\,\tilde t^{\mu\sigma}_{I\lambda}\,\tilde t^{\delta}_{j}$ (4-13)

This requires either two temporary T2 blocks or two serial loads. If there is enough memory, then the exchange integral can be calculated with the regular Coulomb integral, and the antisymmetrized contribution can be calculated with one read and one update. Since shell quadruplets of integrals are typically not more than a few hundred kilobytes, the algorithm will assume this can be done. The permutational symmetry of the antisymmetrized integrals is not the same as that of the plain Coulomb integrals, so they must be generated again in the middle of the loop over permutations. Although two updates or two reads are avoided, the integrals need to be calculated three times per AO shell quadruplet. Considering the cost of messages compared to the speed of locally computing a shell quad, this is a favorable tradeoff.

From these considerations, it is clear that the DI phase must calculate the unique integrals six times even though the AO integrals have 8-fold permutational symmetry. There are two passes (one for each spin case), and there are three generations per pass: one Coulomb and two exchange integrals. This might seem expensive, but the cost of doubling or quadrupling the number of messages does not scale linearly with the number of PEs and far outweighs the cost of the naturally parallel integral computation.

Algorithm

A critical aspect of this algorithm that makes or breaks the performance is one-sided communication. A one-sided GET of a T2 block can be in flight while the PE calculates terms that do not involve the T2 amplitudes. Immediately after the amplitudes have been read, the next block of amplitudes can be requested, even if the PE is not ready to use them. This technique is also used to hide the communication of R2 updates. The
main difference between the two arrays is that the T2 amplitudes are read once per GET, while the R2 amplitudes are incremented two or three times per ACCUMULATE. Ideally, there would be two R2 buffers, one active and one pending, that increase the time available for each ACCUMULATE. If the user must run the calculation in limited memory, then it is possible to still benefit from one-sided ACCUMULATEs, but the PE might have to wait for a while before initializing the buffer for the next R2 contribution.

The main driver for the DI phase loops over AO shell quadruplets of integrals. The MN shell pairs are distributed over all of the PEs in the Amplitude Company. This distribution can happen a number of ways: round robin, blocked, block cyclic, job board, master/slave, etc. How a PE processes each MN shell pair is independent of the shell pair assignment, so this can be tweaked outside the main processing loop. Ideally, there would be a globally accessible integer that corresponds to the next MN pair that must be processed. The integer must be atomically read and incremented by each PE to get its next assignment. This balances the load better than a fixed distribution algorithm like round robin or block cyclic, and it performs the same function as a master process sending out new tasks without actually needing a process to sit idle waiting for requests.

Once an MN shell pair is identified, the PE is responsible for generating and processing all unique RS shell pairs that correspond to the MN upper bound. This prevents any PE from processing Coulomb or exchange integrals that another PE is responsible for. It also allows for a single generation of the integrals for each pass if there is enough scratch memory (i.e., the PE would generate all RS pairs and internally antisymmetrize them before beginning the processing loop). Typically, however, there will be enough memory for two shell quads and the scratch buffers for contracting terms.
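The dynamic assignment just described, a globally accessible integer that each PE atomically reads and increments, can be sketched with threads standing in for PEs. In the real code the counter would live in remote memory and be updated with a one-sided fetch-and-increment (e.g., an RMA fetch-and-op); the names below are hypothetical.

```python
import threading

class ShellPairCounter:
    """Shared next-assignment counter; fetch_and_add hands out unique indices."""
    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def fetch_and_add(self):
        with self._lock:                   # models the atomic read-and-increment
            idx = self._next
            self._next += 1
            return idx

def worker(counter, n_pairs, out):
    while True:
        mn = counter.fetch_and_add()       # claim the next MN shell pair
        if mn >= n_pairs:
            break                          # no work left
        out.append(mn)                     # stand-in for processing the pair

counter = ShellPairCounter()
processed = []                             # append is atomic under CPython's GIL
threads = [threading.Thread(target=worker, args=(counter, 100, processed))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each of the 100 pair indices is handed out exactly once, so a fast PE simply claims more pairs than a slow one; that self-balancing property is what the text credits over fixed round-robin or block-cyclic assignment.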
A shell quadruplet of integrals will contribute to all of the Group 1 terms, the Group 2 terms, and the Group 4 intermediates. Table 4-1 summarizes the stages of the processing loop and which elements are accessible at each stage.

Table 4-1. The direct integral processing loop summary
  L loop section | Scope                      | Contributions
  prologue       | {IHs, I2} += f(V,C,T1,T2)  | all Group 1 terms and the IH intermediate
  stage 1        | R2c += f(V,C,T1,T2,I2)     | all contributions to R2 from Group 2 and Group 1
  stage 2        | IPs += f(V,C,T2)           | the IP intermediate
  epilogue       | IR += f(V,C,T1)            | the IR intermediate

The Group 1 terms are saved in a two-particle intermediate during the loop over the T1 amplitudes from the Group 2 terms. The I2 intermediate is then contracted with Q (C*C), and the Group 1 contribution is sent with the Group 2 contribution to increment R2. If there is not enough scratch memory to store the I2 intermediate, then it is possible to calculate these terms in a single direct integral pass separately from the Group 2 terms or the IP or IR intermediates.

Table 4-2 has the detailed order of operations for the prologue of processing a shell quad. The w one-particle matrix element is defined in Equation 4-14.

$w^{\mu}_{I} = c^{\mu}_{I} + t^{\mu}_{I}$ (4-14)

When the prologue finishes, there are two important quantities stored in memory: the Group 1 contribution and the T2 amplitudes that generated it. Since the Group 2 contributions loop over all L shells for a given S shell, it is convenient to begin the loop with L=R. Table 4-3 lists the detailed order of operations for the L loop. The x one-particle matrix element is defined in Equation 4-15.

$x^{\mu}_{A} = c^{\mu}_{A} - \tfrac12\,c^{\mu}_{I}\,t^{A}_{I}$ (4-15)
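The T1-dressed occupied coefficients of Equation 4-14 amount to two small matrix operations; the sketch below assumes conventional shapes (AO by occupied for c, virtual by occupied for the MO t1) and hypothetical names, with the AO-basis t1 formed as in Equation 4-1a.

```python
import numpy as np

def dressed_occupied(c_occ, c_vir, t1):
    """w[mu, I] = c[mu, I] + c[mu, A] * t1[A, I]  (Eq. 4-14; the second term is
    the AO-basis t1 of Eq. 4-1a). Shapes: c_occ (nao, no), c_vir (nao, nv),
    t1 (nv, no)."""
    return c_occ + c_vir @ t1
```

Because w is formed once per iteration and reused by every PE, it belongs with the small, locally replicated one-particle arrays mentioned earlier rather than in distributed memory.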
After the last value of L, there is still a pending R2 ACCUMULATE, which must be completed. Rather than WAIT immediately after the loop, the IR intermediate, which requires only T1, C, and V, can be calculated. Furthermore, if there are more permutations of V, then the PE can GET the next T2(RS) block and permute the integrals. Table 4-4 lists the detailed order of operations for the permutation epilogue.

The first four permutations of V involve the same two sets of shell quads. The second set of four permutations requires a different exchange shell quad, so the integral generator is invoked a third time per MNRS configuration.

Figure 4-8 shows the movement of data needed to calculate the contributions to R2 from an integral shell pair.

Figure 4-8. The data flow in the Direct Integral phase. Solid arrows represent one-sided messages, and dashed arrows represent reading and incrementing the global shell pair counter. Each PE retrieves all T2 amplitudes that contract with the integral and accumulates the contributions into the R2 array.
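The overlap of one-sided traffic with computation that this phase depends on is essentially a two-slot prefetch pipeline: the request for block k+1 is in flight while block k is being consumed. The sketch below (hypothetical names) mimics that pattern with a single background worker standing in for the communication layer.

```python
from concurrent.futures import ThreadPoolExecutor

def prefetch_pipeline(fetch, compute, n_blocks):
    """Issue the GET for the next block before computing on the current one,
    so communication and computation overlap (one pending request at a time)."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(fetch, 0)              # first GET in flight
        for k in range(n_blocks):
            block = pending.result()                 # WAIT for the pending GET
            if k + 1 < n_blocks:
                pending = comm.submit(fetch, k + 1)  # next GET overlaps compute
            results.append(compute(block))
    return results
```

The double-buffered R2 ACCUMULATE described earlier is the mirror image of this loop: the outgoing increment for block k is posted before the contribution for block k+1 is computed.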
When the Amplitude Company has calculated all of the MN shell pairs, the R2 distributed arrays are transformed for the phase that processes the stored OOVV integrals. The one-particle global intermediates could be reduced here for use in the One Particle phase. The following pseudocode lists the entire flow of control from the beginning of the CC iteration up to the start of the SI phase.

zero(R2, R1, I1)
W1 = C + T1
do spin = 2, 1, -1
   X1 = Q1 - C*T1
   SYNCHRONIZE(CO_NET)
   do while (more MN)
      do all RS
         GET T2(RS)
         G = intgen(M,N,R,S)
         H = G - intgen(M,N,S,R)
         do perm = 1, 8
            calc Prologue()
            nLeft = NShells
            L = R
            do while (nLeft != 0)
               calc InnerLoop()
               nLeft = nLeft - 1
               L = 1 + mod(L,NShells)
            end do nLeft
            calc Epilogue()
         end do perm
      end do RS
   end do MN
   SYNCHRONIZE(CO_NET)
   if (spin==2) then
      global R2(L,N) = Q1^-1(N,D) * R2(D,M) * Q1(M,L)
   else
      global R2(L,D) = R2(L,N) * Q1(N,D)
   end if
end do spin
REDUCE(I1, CO_CDR)

Stored Integral Phase

The SI phase calculates all of the Group 3 terms from the three spin cases of stored integrals, collectively referred to as OOVV. Each PE will load or receive an integral block, which contains all spin pairs for a particular shell pair. The mechanism
for obtaining integrals depends on the performance and architecture of the I/O system. If only one PE per node (or squad) is allowed to read from disk (assuming this is an operating system constraint), then the other PEs in the squad will have to request the blocks from that I/O PE. The I/O PE will then have to juggle its own calculations while acting as an I/O proxy; otherwise it could opt not to store R2 amplitudes, but then the system takes a performance hit as a whole because there is one relatively idle CPU per node.

Most computers will allow any process to perform I/O and shift the performance hit of I/O contention to the program. The technique the parallel CCSD algorithm uses to lessen this penalty is asynchronous file reads. The logistics are the same as one-sided operations in the DI phase in that each PE issues an AIO_READ immediately after the last integral has been read but before other computation has been done. If asynchronous I/O operations are not available and the user elects not to designate a special I/O PE, then the algorithm can wait for the integrals up front with a blocking read, or it can synchronize the squad PEs by broadcasting the current block and waiting for all of them to finish before reading the next block.

Resource competition, such as parallel blocking reads, suffers scaling penalties, meaning the cost per read increases as the number of PEs per squad increases. The upshot is that there is still the possibility that some PEs are computing while others are waiting on I/O. Locked step algorithms, however, usually suffer far worse than resource competition algorithms. Locked step guarantees that no one is working while a read is pending and that no one is reading while the PEs are computing. Even if the I/O PE used asynchronous reads, the locked step approach would suffer from expensive
synchronization delays compared to pure resource competition. From these considerations, each PE will be responsible for reading its next block of integrals from disk. If asynchronous reads are not supported, then the PE will default to blocking reads.

The OOVV integrals will be distributed over platoons by block cycling the AO shell pair kets. Every PE in a platoon must read and process all of the integrals assigned to its platoon. The integrals will only contribute to the R2 amplitudes that the PE stores, but the T2 amplitudes must be read (with one-sided GETs) from other PEs in the platoon.

Every shell pair of integrals results in four updates to all of the local R2 amplitudes, one for each subset of Group 3 Set 1 and one for Group 3 Set 2. Since Group 3 involves contributions in which all virtual orbitals come from T, the virtual orbitals of the integrals must be the internal summation indices. The general approach to calculating the Set 1 subsets is to loop over L and D for a fixed RS pair. Each subset corresponds to a different configuration of these nested loops.

Figure 4-3 shows the path of updates through the local set of R2 amplitudes. Table 4-5 shows the detailed order of operations for calculating Group 3 Set 1 Subset 1 (AC/BD is LR/DS in AO shells). Accessing memory in non-unit stride is highly unfavorable when performance is the utmost concern; however, each square in Figure 4-3 represents three four-index arrays such as those shown in Figure 4-2, and those updates can be performed in unit stride. Furthermore, the LD loop order is only used in Subset 1. Tables 4-6, 4-7, and 4-8 show the operations for Subsets 2 and 3 of Set 1 and Set 2, respectively. Each steps through the LD pairs in DL order.
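The platoon-level layout described above, block cycling the AO shell-pair kets over the platoons, reduces to a one-line owner function; the block size and platoon count below are illustrative, not taken from the dissertation.

```python
def block_cyclic_platoon(ket_index, block_size, n_platoons):
    """Platoon that stores a given RS ket shell pair under block-cyclic dealing:
    consecutive runs of block_size pairs go to successive platoons, wrapping
    around so every platoon receives a similar share."""
    return (ket_index // block_size) % n_platoons
```

Block cycling keeps neighboring kets together (good for batched reads) while still spreading the total integral file evenly across platoons.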
Figure 4-3. The path of Group 3 Set 1 Subset 1 updates per OOVV integral. Dashed lines show memory continuity and solid arrows show the loop path.

There is some flexibility in choosing the order in which the Group 3 sections are calculated. The most important design consideration is choosing which section will be performed last, since it will AIO_READ the next RS shell pair of OOVV integrals from disk. Subset 2 of Set 1 has the largest number of operations after the last integral is used, so this section should be performed last.

Choosing a section to run first depends on how many operations can complete during the first GET before reading the T2 amplitudes. Subsets 1 and 3 of Set 1 and Set 2 all use the amplitudes almost immediately, but each requires a different initial shell pair. Set 2 reads only one shell pair, T2(RS), and neither shell contributes to the final R2 amplitudes, so the probability that T2 is local is inversely proportional to the number of PEs in the platoon. The Subsets of Set 1 fare much better since at least one shell of T2 contributes directly to a shell of R2. If the amplitudes are distributed in bands of L, then Subset 3 has a good chance of T2 being local, and if the amplitudes are distributed in
bands of D, then Subset 1 is better off executing first. Figure 4-4 illustrates the two possibilities.

Figure 4-4. The LD shell distribution schemes. Each PE stores both T2 and R2 amplitudes for its block of LD shell pairs. Set 1 Subsets 2 and 3 favor L bands, while Subset 1 favors D bands.

The parallel algorithm uses L bands since two of the three subsets favor them as far as having local T2 amplitudes per R2 update. Subset 2 is already designated to run last, so Subset 3 should be performed first because it has the highest chance of having no wait time for the initial T2 GET. The decision to run Set 2 before or after Subset 1 can be made on the same argument. Since Set 2 updates the entire set of local LD pairs after reading the T2(RS) amplitudes, it offers the most time for the initial GET of Set 1 Subset 1 to complete before reading them. Therefore, the calculation order of the Group 3 terms is Set 1 Subset 3 (S1S3), Set 2, S1S1, and S1S2.

Figure 4-9 shows the flow of data in the SI phase. Each PE loops over all of the shell pairs that were assigned to its platoon and calculates only the terms that contribute to the local R2 amplitudes. The following pseudocode lists the entire flow of control from the end of the DI phase up to the start of the One Particle phase.

AIO_READ V(RS0)
GET T2(RD0)
do RS = RS0, RS1
  WAIT V(RS)
  calc G3S1S3(GET T2(RS))
  calc G3S2(GET T2(L0R))
  calc G3S1S1(GET T2(SD0))
  calc G3S1S2(AIO_READ next V(RS); GET next T2(RD))
end do RS

Figure 4-9. The data flow in the Stored Integral phase. Each PE processes all of the integrals that are stored by its platoon and updates its local block of R2.

One Particle Phase

The OP phase is responsible for contracting the Fock matrix elements of the reference wavefunction (f_pq) and the I1 intermediates (I^P, I^H, and I^R) with the T amplitudes. There are three types of contractions: T2 into R1 (Group 4), T1 into R1 (Group 5 Set 1), and T2 into R2 (Group 5 Set 2). Only one PE must do the T1-R1 contractions provided it has the complete I1 intermediate (it should, since the contributions were reduced at the end of the DI phase). These transformations, shown in Equation 4-16 for T1, are insignificant performance-wise, and little is gained by parallelizing them.

r1 <- contractions of the I1 intermediates (I^P, I^H, I^R) with the T1 amplitudes (4-16a)

r1 <- contractions of the Fock matrix elements with the T1 amplitudes (4-16b)
The T2-R1 contributions can be distributed over the PEs in one platoon, since they are limited to local reads and local writes; however, while the T2-R2 terms also have local writes, they have both local and remote reads. The remote reads come from the terms in Equations 4-17 and 4-18. The alpha part of F_P is defined in Equation 4-19. (The beta part is analogous.)

r~_IJ += F_P t_IJ ; r~_ij += F_P t_ij (4-17)

r_Ij += F_P(alpha) t_Ij + F_P(beta) t_Ij (4-18)

F_P = (particle intermediate built from c, f, t, Q, and I^R) (4-19)

Parallelization of these particle terms should reduce the number of messages a PE must send as the number of PEs in the company increases. This implies distributing the summation index over the platoons; however, since the SI phase prefers L bands, it is likely that most (if not all) of the T2(RD) GETs are local, so it is not as critical to performance to distribute them over the platoons as it is for the other index. For that loop, each PE must get data from multiple target PEs that contribute to its local R2 set (Table 4-9). Table 4-10 shows the other F_P terms.

The T2-R2 hole terms and the T2-R1 terms (along with the latter F_P terms) should be performed by the largest platoon (HVY_PLT) even if that platoon has only one more PE. During this time, all of the other platoons will send their contributions via ACCUMULATE to the Heavy Platoon. Once these operations are complete, the only remaining tasks are antisymmetrizing the same-spin R2 amplitudes and updating the old T2 amplitudes with R2. Table 4-11 shows the operations of the Heavy Platoon while the
other platoons are sending their contributions, and Equations 4-20 and 4-21 show the terms that contribute to F_H and F_R, respectively.

F_H = (hole intermediate built from c, f, t, I^H, and I^R) (4-20)

F_R = (intermediate built from c, f, and I^R) (4-21)

The following pseudocode lists the operations that occur from the end of the SI phase up to the start of the antisymmetrization and update phase.

if (CO_CDR) then
  make FP, FH, FR
  BROADCAST(F1,CO_NET)
  calc T1_into_R1()
else
  RECEIVE(F1,CO_NET)
end if
calc FPT2()
if (HVY_PLT) then
  calc other_F1T2()
else
  ACCUMULATE(R2,HVY_PLT)
end if
REDUCE(R1,CO_CDR)

Antisymmetrization and T2 Update

The same-spin R2 amplitudes must be antisymmetrized with respect to AO shell pairs. Equation 4-22 shows this operation. All of the R amplitudes must then be divided by the diagonal elements of the Fock matrix in order to generate the next set of T2 amplitudes. Each PE in the Heavy Platoon, having collected all of the R2 amplitudes that it is assigned to store, will antisymmetrize and transform its amplitudes into T2. The R2-T2 transformation is shown in Equation 4-23, and the diagonal matrix elements of f are shown in Equation 4-24.

r_ij = r~_ij - r~_ji (4-22)
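A small sketch of the antisymmetrization and update steps of Equations 4-22 and 4-23, for a single square block of same-spin amplitudes. The Fock diagonals are hypothetical numbers, and the virtual-space part of the denominator is collapsed into one constant for brevity; this is an illustration of the operation, not the dissertation's code.

```python
def antisymmetrize(r):
    """Equation 4-22: r_ij <- r~_ij - r~_ji for a square block."""
    n = len(r)
    return [[r[i][j] - r[j][i] for j in range(n)] for i in range(n)]

def update_t2(r, f_occ, denom_virt):
    """Equation 4-23 style update: divide each residual by its
    Fock-diagonal denominator (virtual part simplified to a constant)."""
    n = len(r)
    return [[r[i][j] / (f_occ[i] + f_occ[j] - denom_virt)
             for j in range(n)] for i in range(n)]

r = [[1.0, 4.0], [2.0, 3.0]]        # hypothetical residual block
ra = antisymmetrize(r)              # [[0.0, 2.0], [-2.0, 0.0]]
t2 = update_t2(ra, [-1.0, -2.0], 1.0)
```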
t_IJ = r_IJ / (f_II + f_JJ - f~_AA - f~_BB) ; t_ij = r_ij / (f_ii + f_jj - f~_aa - f~_bb) (4-23a)

t_Ij = r_Ij / (f_II + f_jj - f~_AA - f~_bb) (4-23b)

f~_aa = sum_d c_ad f_ad c_da (4-24)

The R1 amplitudes must be transformed in the same way in order to generate the next set of T1 amplitudes. One PE should do this, since the transformation is trivial (Equation 4-24).

t_I = r_I / (f_II - f~_AA) ; t_j = r_j / (f_jj - f~_aa) (4-24)

Table 4-12 lists the operations of the Heavy Platoon for antisymmetrizing and transforming R2. The code listing that follows shows the operations of the Amplitude Company from the end of the OP phase up to the next CC iteration.

if (HVY_PLT) then
  make T2
end if
if (CO_CDR) then
  make T1
  BROADCAST(T1,CO_NET)
else
  RECEIVE(T1,CO_NET)
end if
if (not HVY_PLT) then
  GET T2
end if

The order of the loop nesting, L-D versus D-L, does not affect the results, but the performance of the loops might differ from one computer architecture to the next. The L-D ordering issues multiple GETs to the same PE (down the L column) before cycling up to the next column. If every PE begins with L=1, then the first PE in the platoon will be
bombarded with the GETs of every PE at once. Normally, GET operations do not require exclusive locks, in which only one PE is allowed to access the data at a time, but the efficiency of the messaging subsystem might be a factor in performance. This would be less important in the D-L ordering.

Another way in which the messaging subsystem could affect performance is by "remembering" which PE was accessed last. In this situation, once a GET is established between two PEs, any successive GET to the same remote PE is less expensive than switching to another PE. This would actually favor the L-D ordering and penalize the D-L ordering. In either case, the code can be trivially designed to switch between the two depending on the architecture on which the program is running.
60 Table 42.The Direct Integral phase prologue memory map Order of V T2 I2 W2 operations G H I Iw H W r w J I IJ IJw W P I Âˆ w r I Iw G W r w j I Ijw W I w r WAIT T2(RS) IJ IJt H W2 1 r r w I IJ J Hc W I r IJ IJW I i r Ij Ijt G W r r w I Ij j Hc W I r Ij IjW I i r Each operation is implicitly summed over all shared indices. The G column of V corresponds to the set of <> Coulomb integrals, and the H column is the set of <> antisymmetrized (CoulombExchange) integrals. The values w, r, and i stand for write, read, and increment, respectively. Shaded cells are involved in pending communications and may not be read fr om or written to (without risking data inconsistencies). Temporary arrays (W2) are outlined to show their scope of dependence.
61 Table 43.The Direct Integral phase inner loop memory map V T2 R2 (1) R2 (2) I2 W2 X2 Order of operations G H if (L!=R) WAIT T2(LS) Ij I jt c W r w if (R!=S) GET T2(LR) j jW G X r r w j I IjX w R w r Ij IjI x R i r IJ IJI x R w r if (R!=S) WAIT T2(LR) IJ J It c W r w Ij j It c W r w if (more L) GET next T2(LS) H W XI I r r w G W XI I r r i j I Ijw X R i r H W XI I r r w I I Pc X I 2 1 r G W XI I r r i J I IJ IJw X P RÂˆ i r if (L!=R) WAIT R2 ACCUMULATE R2 I I Pc X I r The R2 (1) buffer is the active write buffer, and the R2 (2) buffer is the accessforbidden buffer from the previous L shell. Th e loop starts with L=R, counts up to NShells, cycles back to L=1, and continues up to L=(R1).
62 Table 44.The Direct Integral phase epilogue memory map V T2 R2 (1) R2 (2) scr Order of operations G H if (perm==8) then G t c H t c Ii i I I R r r (w) WAIT R2 else GET next T2(RS) G t c H t c Ii i I I R r r (w) permute V w w (w) end if One of the two R2 buffers will be accessforbidden. The last permutation does not need to GET another T2(RS) block; otherwise, the GET shoul d be issued as soon as possible. If the permutation index is 4, then a different exchange integral will be generated directly. The (w) cells are writeoptional, meaning the operation might or might not need the scratch memory.
63 Table 45.The Stored Integral phase Group 3 Set 1 Subset 1 memory map V T2 (1) T2 (2) IA IB IC Order of operations GET T2(L0R) do L = L0, L1 Mn M I n I BG t w I ) ( r w WAIT T2(LR) Mn IM n I CG t I ) ( r r w M I IM IMt w t W2 1 r w mn Im n IH t W 2 1 r r w GET T2(D0S) MN IM N I AH W I ) ( r r w ) ( ) (C n I n I BI W I r i r n I n I CW I 2) ( r i do D = D0, D1 WAIT T2(DS) GET next T2(??) JN N I A IJ IJt I P r) (Âˆ ~ r r Jn n I B IJ IJt I P r) (Âˆ ~ r r n j jnt w t i jn n I C Ijt I r) ( r r end do D end do L During the loop over D, the T2 buffers alternate active/inactiv e, but the prologue in the L loop uses the second buffer as scratch memory. The GET in the inner loop over D requests T2(DS) unless D=D1, in which case it requests the next T2(LR) (provided LL1).
64 Table 46.The Stored Integral phase Group 3 Set 1 Subset 2 memory map V T2 (1) T2 (2) IA IB IC ID Order of operations GET T2(SD0) do D = D0, D1 n j Mn M j Bt w G I) ( r w WAIT T2(SD) GET T2(RL0) nj Mn M j Dt G I) ( r r w n j nj njt w t t2 1 w nj mn m j At H I) ( r r w Nj MN M j Ct H I) ( r r w if (D==D1) then AIO_READ next V end if ) ( ) ( 2 1 ) (D C BI I I i r r ) ( ) (C D I I r i do L = L0, L1 M j D M I IjI t w r) ( r WAIT T2(RL) GET next T2(??) m j A mi ij ijI t P r) (Âˆ ~ r r M j B Mi ij ijI t P r) (Âˆ ~ r r M j C MI IjI t r) ( r r end do L end do D The inner loop of this subset has more operatio ns than either of th e other two subsets, so it is calculated last and the asynchronous r ead is issued during the last iteration.
65 Table 47.The Stored Integral phase Group 3 Set 1 Subset 3 memory map V T2 (1) T2 (2) IA IB IC Order of operations GET T2(RD0) do D = D0, D1 n I Mn M I Bt w G I) ( r w WAIT T2(RD) GET T2(L0S) Mj Mn n j At G I) ( r r w In Mn M I Ct G I) ( r r w ) ( ) (C B I I i r do L = L0, L1 M I C M j IjI t w r) ( r WAIT T2(LS) GET next T2(??) n j A In IjI t r ) ( r r M I B Mj IjI t r) ( r r end do L end do D
66 Table 48.The Stored Integral phase Group 3 Set 2 memory map V T2 Order of operations W2 I2 j I Ij Ijw w t W r w Kl Ij Kl IjV W I r r w do LD = LD0, LD1 l K Kl Klt t t W w Kl Kl Ij IjW I r r r end do LD do spin = 2, 1 j i ij ij ijw w P t WÂˆ (r) r w kl ij kl ijV W I 4 1 (r) r r w do LD = LD0, LD1 l k kl klt t t W2 1 w kl kl ij ijW I r ~ r r end do LD end do spin All of the T2(LD) reads and R2(LD) updates are local opera tions. If there is not enough memory for I2, then the Ij/ij/IJ bra pairs can be processed in blocks. Table 49.The One Particle phase FP T2 memory map T2 (1) T2 (2) Order of operations GET T2(LS0) do S = S0, S1 do L = L0, L1 WAIT T2(LS) GET next T2(LS) do D = D0, D1 Ij P Ijt F r r end do D end do L end do S
67 Table 410.The One Particle phase FP T2 (and same spin terms) memory map T2 (1) T2 (2) Order of operations GET T2(1D0) do D = D0, D1 do R = 1, NShells WAIT T2(RD) GET next T2(RD) do L = L0, L1 IJ P IJt F r~ r Ij P Ijt F r r ij P ijt F r~ r end do L end do R end do D Table 411.The One Particle phase FH and FR local transformations Order of operations do LD = LD0, LD1 K KJ K I H IJ IJt F P r Âˆ ~2 1 k Ik k j H K Kj K I H Ijt F t F r k kj k i H ij ijt F P r Âˆ ~2 1 k Ik k R K IK K R It F t F r K Kj K R k kj k R jt F t F r end do LD
68 Table 412.The R2 antisymmetrization and T2 update memory map R2 (1) R2 (2) Order of operations GET R2(D0L0) do L = L0, L1 do D = D0, D1 WAIT R2(DL) GET next R2(DL) IJ IJ IJr r t ~ ~ r ij ij ijr r t ~ ~ r f f f f tJJ II IJ f f f f tjj ii ij f f f f r tjj II Ij Ij end do D end do L
CHAPTER 5
DISCUSSION AND SIMULATION

Discussion

General Comments

The parallel CCSD algorithm distributes almost every operation over the number of PEs in the Amplitude Company. The few operations that are not distributed are insignificant to overall performance. Specifically, the worst one-particle transformation would be a multiplication of two NAO x NAO matrices, which scales operationally like NAO^3. In contrast, the Direct Integral phase multiplies two matrices that scale operationally like O^2 NAO^4. For a system with 100 electrons (50 alpha and 50 beta) and 500 basis functions, there are approximately 125 million operations for the serial multiply and over 309 trillion operations for the parallel multiply. Assuming that every operation takes the same amount of time and that the parallel multiply is 100% efficient, the maximum theoretical speedup is 2.48 million times faster (for this system).

SpeedUp = lim_{n_procs -> inf} (t_ser + t_par) / t_ser = (1.25x10^8 + 3.09x10^14) / (1.25x10^8) = 2.48x10^6 (5-1)

Obviously this limit is unachievable, and there are multiple operations that slow down the parallel portions of the CCSD algorithm. Disk I/O, network messages, and synchronizations all become more expensive as the number of PEs increases. The two primary techniques that help to reduce these costs are real-time load balancing and asynchronous operations (i.e., overlapping computation and communication for both disk reads and remote memory operations).
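The limit in Equation 5-1 follows from letting n_procs grow without bound, so that the t_par/n_procs term vanishes. A quick check of the arithmetic with the operation counts quoted above:

```python
# Operation counts from the text: ~125 million serial operations and
# ~309 trillion perfectly parallelizable operations, with equal per-op cost.
t_ser = 1.25e8
t_par = 3.09e14

# Equation 5-1: as n_procs -> infinity, t_par / n_procs -> 0.
max_speedup = (t_ser + t_par) / t_ser
# about 2.47 million, i.e. the "2.48 million" quoted after rounding
```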
Load Balancing

Job boards

In this context, a job board is a global structure that any PE can write to or read from at will. The Direct Integral phase must have a global counter that registers the next AO shell pair waiting to be processed. PEs that finish their tasks quickly will read and increment the counter more often than PEs that lag behind the others. Natural load imbalance or CPU contention from other processes can cause this lag (or drift). CPU contention frequently happens if the operating system attempts to run more compute-intensive processes than installed CPUs. Even if the parallel program executes exclusively but the number of PEs is the same as the number of CPUs, then the operating system kernel itself will compete for CPU cycles and detract from overall scalability.

Job boards can relieve natural load imbalance in the Direct Integral phase, but the Stored Integral phase cannot easily use them (if at all). The SI phase has two levels of parallelization: the OOVV integrals are distributed over platoons, and the R2 updates are distributed over PEs within a platoon. Any attempt to balance the load in real time will incur multiple synchronization delays, since the PEs in a platoon have close to but not the same number of operations. Instead of real-time load balancing, the AO shells are grouped together to balance the work needed to process each shell.

Data aggregation

All uses of the word "shell" that correspond to CC data processing refer to collections of physical AO shells, which are the smallest entities that direct integral generation programs can calculate. Since the number of operations scales like O^2 N^4, the difference between the smallest and largest physical shell is a direct measure of the imbalance in computations per message. The hydrogen STO-3G basis set has 1
contracted orbital in the s shell. Larger basis sets will typically run up to about 16 orbitals per shell (although the iodine PBS basis has 55 orbitals in the d shell). The hydrogen shell and the average shell are computationally uneven by a factor of 65536 for the DI phase. If the messages pertaining to these two shells cost the same, which is a fair assessment since message latency is an order of magnitude more important than message size, then one PE will have 65536 times more work to do than another. Simply grouping as many physical shells together as possible might not be the best solution, since it is the size variability that affects load balancing.

The parallel CCSD algorithm, before running the first iteration, will examine the collection of physical AO shells and search for the set of logical shells that minimizes the variance of shell sizes. This is a bounded two-dimensional search along the dimensions MaxShellSize and NLogShells. The maximum number of logical shells is simply the number of physical shells. The fewest number of logical shells is 1, and the size of that shell is equal to the sum of all of the physical shell sizes; however, given a maximum shell size, the fewest number of shells is defined in Equation 5-2.

N_LogShells >= ceil( sum_{i=1}^{NPhysShells} PhysShellSize_i / MaxShellSize ) (5-2)

Physical AO shells should not be broken up, since generating direct integrals costs less when orbitals of the same angular momentum and atomic center are calculated at once. Equation 5-2 is actually an inequality because there might not be a perfect grouping of physical shells such that every logical shell has the same size. The search algorithm loops over size limits (MaxShellSize) starting with the largest physical shell size and ending with a predefined limit or the sum of all physical shell sizes, whichever is smaller.
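An illustrative sketch of this variance-minimizing search, using the worst-fit-decreasing packing that the algorithm adopts (discussed below). The shell sizes are hypothetical, and the real code also scans MaxShellSize values rather than taking a single fixed limit:

```python
import math
import statistics

def worst_fit_decreasing(phys_sizes, n_bins):
    """Place each physical shell (largest first) into the logical shell
    that currently holds the fewest orbitals."""
    totals = [0] * n_bins
    for size in sorted(phys_sizes, reverse=True):
        totals[totals.index(min(totals))] += size
    return totals

def best_partition(phys_sizes, max_shell_size):
    """Scan logical-shell counts from the Equation 5-2 minimum upward and
    keep the packing with the lowest variance of logical-shell sizes."""
    n_min = math.ceil(sum(phys_sizes) / max_shell_size)
    best = None
    for n in range(n_min, len(phys_sizes) + 1):
        totals = worst_fit_decreasing(phys_sizes, n)
        if max(totals) > max_shell_size:
            continue  # this packing violates the size limit
        var = statistics.pvariance(totals)
        if best is None or var < best[1]:
            best = (n, var, totals)
    return best

# hypothetical orbital counts per physical shell
sizes = [1, 1, 3, 5, 6, 6, 10, 16]
n_shells, variance, totals = best_partition(sizes, 16)
```

For this input the search finds 3 logical shells of 16 orbitals each, a perfectly balanced (zero-variance) partition.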
For every potential MaxShellSize, the algorithm also loops over the number of logical shells starting from the theoretical minimum up to the number of physical shells.

The problem of putting objects with mass (physical shells with orbitals) into groups with a mass limit (logical shells with up to MaxShellSize orbitals) is well known in the field of computer science as the Bin Packing Problem (BPP) [Cof97]. This is an NP-hard optimization, and there are many algorithms for finding solution candidates. Rather than solve the BPP, the CC algorithm uses the Worst Fit Decreasing (WFD) offline algorithm limited to a known set of bins (logical shells) and measures the variance of sizes (Appendix E). The partitioning scheme with the lowest variance should have the best balance between computation and messaging. WFD is ideal for finding possible partitions for this particular problem since it adds the next smallest physical shell to the logical shell with the fewest orbitals.

Asynchronous I/O

Most parallel programs are hampered by the increasing cost of messages as the number of PEs increases. One-sided operations allow PEs to process data while messages are moving through the network. The two limits of this scheme are: 1) the CCSD calculation is free because the messages take so long to arrive, and 2) the program scales almost linearly with the number of PEs because global data is available the moment it is read. Given the complex task and data distributions, it is likely that some portions of the algorithm will be bound by communication costs and others by computations.

Reading the OOVV integrals from disk should not be a performance bottleneck considering the amount of computation that each PE will do while the AIO_READ is being fulfilled. The inner loops of the SI phase can become heavily dependent on T2 GET operations if the logical shells are not large enough to saturate the retrieval with
computation. The easiest way to increase the number of operations per message is to increase the size of a logical AO shell pair. The difficulty in doing this is that the local scratch memory requirements scale like N^4 with respect to shell size. As more scratch memory is needed, the less heap there is for distributed data, and ultimately more PEs will be needed. The problem of balancing shell size, scratch memory, and data distribution is not solved, but as the algorithm has been stated, the data distribution schemes can be tried independently of the data contraction kernels. For example, if more scratch memory is needed but there is no possibility to increase the heap or the number of PEs, then the DI and SI phases can be further divided to allow for fewer intermediates at the cost of more integral processing.

Synchronization Delays

An attractive feature of the parallel CCSD algorithm is that no PE is required to stall in a wait call for another PE unless the state of their globally accessible data changes (i.e., if R2 is transformed or T2 is updated). A barrier in message passing programs is a routine that causes every PE to wait until all of the PEs in a communication network have entered the barrier. Certain collective operations can function like a barrier, and some pairs of functions act as a true barrier. Reductions involve every PE sending some data to one PE, which will combine all the data into one element. Broadcasts involve one PE sending some data to every PE. Individually, certain PEs may continue after the operation whether or not other PEs have finished their operations, but properly configured, a reduction and a broadcast can together function like a barrier. No PE will proceed until it has received data from one PE, which will not send the reduced element before receiving the contributions from all of the other PEs.
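A minimal sketch of this reduce-then-broadcast barrier, using Python threads in place of PEs and integer ranks as the contributions; the class and names are illustrative, not part of the CCSD code:

```python
import threading

class ReduceBroadcastBarrier:
    """Reduce all contributions onto a root, then broadcast the result.

    No thread returns from sync() before every thread has contributed,
    so the reduction/broadcast pair acts as a barrier."""
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.total = 0
        self.result = None
        self.cond = threading.Condition()

    def sync(self, value):
        with self.cond:
            self.total += value          # the "reduction" onto the root
            self.count += 1
            if self.count == self.n:     # last contributor: "broadcast"
                self.result = self.total
                self.cond.notify_all()
            else:
                while self.result is None:
                    self.cond.wait()
            return self.result

results = []
barrier = ReduceBroadcastBarrier(4)

def worker(rank):
    results.append(barrier.sync(rank))

threads = [threading.Thread(target=worker, args=(r,)) for r in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# every worker received the same reduced value: 0 + 1 + 2 + 3 = 6
```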
The concept of a fence comes from the MPI-2 standard, which defines it as a collective barrier that ensures all one-sided communications on a shared object are complete before leaving the barrier. Synchronizations of any sort occur essentially 10 times in each CC iteration:

1. *initialize R and synchronize T before the DI phase (CO fence, Appendix A)
2. complete the R2 contributions from the DI phase (PLT fence)
3. complete the R2 -> R2' transformation into scratch (PLT fence)
4. copy scratch into R2 before the DI phase (PLT fence)
5. *complete the R2 contributions from the DI phase (CO fence)
6. reduce I1 contributions onto CO_CDR (CO reduce)
7. receive F1 from CO_CDR (CO bcast)
8. complete the R2 -> R2' transformation into scratch (PLT fence)
9. copy scratch into R2 before the SI phase (PLT fence)
10. complete SI and OP contributions (CO fence)
11. complete the R2-T2 update (CO fence)
12. receive T1 from CO_CDR (CO bcast)

In the steady state of iterating over T, the first and last synchronizations can be overlapped (i.e., point 1 is the same as point 12 in the list). Point 5 can be overlapped with point 6 in that reducing I1 signals the end of the DI phase (Table 5-1). Collectively, the number of synchronizations does not scale with the problem size; however, the cost of each synchronization point will increase as the number of PEs increases.

Simulation

The million-dollar question that every parallel algorithm must answer is, "Does it scale?" Amdahl's Law [Amd67], which neglects all process contention and performance degradation, describes the maximum theoretical speedup for a fixed problem size running on a fixed number of processors. The simplest expression of this law is shown in Equation 5-3.
Table 5-1. The schedule of synchronization points. The phases, in order, are: DI; R(D,M) -> S(L,N); S(L,N) -> R(L,N); DI; Reduce I1; Broadcast F1; R(L,N) -> S(L,D); S(L,D) -> R(L,D); SI and OP:FP; OP:FP, FH, FR; antisymmetrize and 1/D; update T2; Broadcast T1. Each platoon independently transforms R2 and separates each step with a PLT fence. Shaded rows correspond to collective communications across the Company. r, w, i, and 0 stand for read, write, increment, and initialize to 0, respectively. The largest platoon calculates T1 while the other platoons update their copies of T2.

SpeedUp = (t_ser + t_par) / (t_ser + t_par/n_procs) (5-3)

However, the parallel CCSD algorithm does not spend the same percentage of time in the parallel parts relative to the serial parts for every molecular system. As the problem size increases, the percentage of time spent in the serial part decreases; thus, the parallel speedup will increase with system size. This is loosely related to the Gustafson-Barsis Law, which makes the observation that, for data-parallel algorithms, a parallel computer can do bigger calculations relative to the serial computer as the number of processors increases [Gus88].

The parallel CCSD algorithm leaves N^3-type operations for one CPU (in the form of NxN matrix multiplications). A few one-particle N^5 operations and the N^4 T2 update are distributed over the largest platoon, but most N^5 and all N^6 operations are distributed over the entire Amplitude Company. Furthermore, remote data is retrieved as early as possible
and used as late as possible. This provides an ideal countermeasure for variable network performance.

General Overview

A simulator is used to test a design before building a prototype or the final product. Simulating software performance (scalability in the case of parallel CCSD) usually involves estimating computation and data transfer times, which requires hardware parameters such as CPU speed and efficiency; memory, disk, and network I/O latencies and bandwidths; etc. These parameters can be measured and will be different for various computer architectures; however, the goal of the software simulator is not to emulate hardware. By using a variety of parameters, the simulator should be able to expose general strengths and weaknesses of an algorithm that might need to be redesigned.

The simulator used to test the parallel CCSD algorithm constructs a parallel network with an array of floating-point numbers that represents the clocks of the PEs. The program assumes the identity of a PE and estimates how long it would take to perform a set of operations. That time is added to the clock, and the simulator switches to the next PE. For the Direct Integral phase, the simulator maintains a counter that corresponds to the next integral shell pair, and becomes the PE with the lowest clock (i.e., the one that would increment the counter next). Once the counter exceeds the number of shell pairs, the clocks are synchronized to the highest value and the R2 amplitudes are transformed. The process repeats itself for the second DI phase. Binary trees are used to estimate reductions and broadcasts. The Stored Integral and One Particle phases do not use real-time load balancing, so the simulator cycles through each PE and adjusts the clock accordingly. Finally, the antisymmetrization and update phases are estimated with all of the appropriate synchronizations and computations.
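The clock-array mechanism for the dynamically balanced DI phase can be sketched in a few lines; the per-task costs and PE count below are hypothetical:

```python
def simulate_dynamic_phase(task_costs, n_pes):
    """Event-driven sketch of the DI-phase simulator.

    Each PE has a floating-point clock; the PE with the lowest clock
    claims the next shell pair from the global counter and advances its
    clock by the task's estimated cost. A final fence synchronizes all
    clocks to the slowest PE."""
    clocks = [0.0] * n_pes
    for cost in task_costs:             # the global job-board counter
        pe = clocks.index(min(clocks))  # PE that would grab the counter next
        clocks[pe] += cost
    sync = max(clocks)                  # fence: everyone waits for the slowest
    return [sync] * n_pes

# hypothetical per-shell-pair costs, 3 PEs
clocks = simulate_dynamic_phase([5.0, 3.0, 2.0, 2.0, 1.0], 3)
```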
Specific Technologies

Local computation. Accurately estimating local mathematical operations is essential to simulating both parallel and serial performance. Each operation involves reading data from memory, operating on the data, and writing the results back to memory (Equation 5-4). S is the data size, B is the memory bandwidth, N_ops is the number of operations, C is the CPU speed, and e is an efficiency modifier. Compiler optimization, loop ordering, and vendor-supplied mathematical libraries will affect performance, so multiple efficiencies may be needed to estimate different parts of the algorithm.

t_comp = S_in/B_rd + S_out/B_wr + N_ops/(C*e) (5-4)

Asynchronous I/O. Disk operations can be estimated with latency, bandwidth, and the number of channels (Equation 5-5). Latency is usually the time it takes for a disk head to move to the appropriate position on the disk, N_ch is the number of dedicated I/O channels, and N_msg is the number of simultaneous disk operations. Disk contention in the simulator is estimated by multiplying S by the number of PEs on the node (i.e., if one PE is reading data, then another probably is as well).

t_I/O = t_lat + S/(B*min(N_ch, N_msg)) (5-5)

When an asynchronous call is placed, the simulator estimates how long it would take for the data to arrive and saves the time at which it should be available. The PE must then call a wait routine, which will compare the current time to the time the data is usable. Depending on the difference, the simulator will increase the clock (the PE must wait) or return silently (the data arrived while computing).
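The Equation 5-4 estimate and the asynchronous wait bookkeeping can be sketched as follows; all parameter values here are hypothetical:

```python
def t_comp(s_in, s_out, n_ops, b_rd, b_wr, cpu, eff):
    """Equation 5-4: memory read time + write time + compute time."""
    return s_in / b_rd + s_out / b_wr + n_ops / (cpu * eff)

class AsyncRead:
    """Sketch of the simulator's asynchronous-read bookkeeping."""
    def __init__(self, issue_time, t_io):
        self.ready_at = issue_time + t_io  # when the data becomes usable

    def wait(self, clock):
        # Return the clock after the wait call: either the data already
        # arrived while computing (no stall), or the PE stalls until ready.
        return max(clock, self.ready_at)

# a read issued at t=10.0 that takes 2.0 s to be fulfilled
rd = AsyncRead(10.0, 2.0)
no_stall = rd.wait(13.0)   # data arrived during computation
stall = rd.wait(10.5)      # PE must wait until t=12.0
```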
One-sided communication. Sending and receiving messages through a network is very similar to writing to and reading from disk. If two-sided messaging is used, then the clocks of the origin and target must be modified. Blocking messages require the origin and target to have the same clock, while non-blocking messages allow for drift. The parallel CCSD algorithm only uses non-blocking one-sided messages, which are very similar to I/O operations in that only the origin is affected by the time it takes for data to arrive. The simulator allows for on-node and off-node messaging parameters so that PEs in the same squad can communicate at different speeds than those on different nodes. Some MPI implementations require copying data into and out of system buffers, so memory bandwidth might also affect the time.

Collective communication. Barriers, reductions, and broadcasts are simulated with binary trees (Figure 5-10), and the number of stages depends on the number of PEs involved in the operation (Equation 5-6). (Ceiling brackets mean round up to the nearest integer.) Equation 5-7 shows the number of messages being sent in each stage i.

N_stages = ceil( log2 N_PE ) (5-6)

N_msg = min( 2^(i-1), N_PE - 2^(i-1) ) (5-7)

Figure 5-10. The binary tree approach to collective communications. 12 PEs require 4 stages, and the last stage only has 4 messages.
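Equations 5-6 and 5-7 can be checked with a few lines of Python, reproducing the 12-PE example in the figure caption:

```python
import math

def broadcast_schedule(n_pe):
    """Equations 5-6 and 5-7: the number of stages in a binary-tree
    collective, and the number of messages sent in each stage i (1-based)."""
    n_stages = math.ceil(math.log2(n_pe))
    msgs = [min(2 ** (i - 1), n_pe - 2 ** (i - 1))
            for i in range(1, n_stages + 1)]
    return n_stages, msgs

stages, msgs = broadcast_schedule(12)
# 12 PEs: 4 stages with 1, 2, 4, and 4 messages (11 messages in all)
```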
Calibration

Memory bandwidth is measured by allocating a large heap and then flushing it with data and reading the data back in. The average of 7 sweeps is used for the memory reading and writing bandwidth. Henceforth, all data will pertain to an IBM POWER3 workstation with four 375 MHz CPUs. The following output was taken from a calibration run:

512 MB heap at 0x3348f518
wr was 1.924009 > 266.11 MB/s; rd was 0.623147 > 821.64 MB/s
wr was 1.924412 > 266.06 MB/s; rd was 0.623328 > 821.40 MB/s
wr was 1.922241 > 266.36 MB/s; rd was 0.623201 > 821.56 MB/s
wr was 1.922671 > 266.30 MB/s; rd was 0.623183 > 821.59 MB/s
wr was 1.922141 > 266.37 MB/s; rd was 0.623107 > 821.69 MB/s
wr was 1.922495 > 266.32 MB/s; rd was 0.623170 > 821.61 MB/s
wr was 1.922200 > 266.36 MB/s; rd was 0.623147 > 821.64 MB/s
bw_memwr = 266.27 MB/s (279201344.00 B/s)
bw_memrd = 821.59 MB/s (861497664.00 B/s)

CPU throughput is rarely the clock speed of the processor, so efficiency modifiers are used to estimate the time it would take to compute a set of operations. The two primary mathematical constructs that the CCSD algorithm uses are two-dimensional matrix multiplications and N^6 array contractions. Many vendors sell highly optimized linear algebra libraries that are tuned for specific computer architectures. Using these libraries for standard matrix multiplies can dramatically increase performance if the program is dependent on such operations. The simulator measures the efficiency of the double precision general matrix multiply (DGEMM) by multiplying two 625x625 matrices. The matrices are then recast as two 25x25x25x25 four-dimensional arrays, and a hand-coded six-loop array contraction is timed, wherein the array contraction performs the same operations as the two-dimensional matrix multiply. The following results were taken from a calibration run:

CALIBRATION FOR POWER3 375 MHz CPU (2 fma units per CPU)
750000000 fma/s reference
( 625 x 625 ) ( 625 x 625 )
dgemm took .387582 s > 88.1849% eff
dgemm took .384157 s > 89.0108% eff
dgemm took .384016 s > 89.0451% eff
dgemm took .383978 s > 89.0543% eff
dgemm took .384088 s > 89.0276% eff
dgemm took .384009 s > 89.0468% eff
dgemm took .384046 s > 89.0378% eff
loops took 1.785683 s > 18.4198% eff
loops took 1.786396 s > 18.4123% eff
loops took 1.785581 s > 18.4208% eff
loops took 1.785778 s > 18.4188% eff
loops took 1.786151 s > 18.4149% eff
loops took 1.785726 s > 18.4193% eff
loops took 1.785444 s > 18.4223% eff
dgemm eff = 0.889153
loops eff = 0.184183
vendor is 4.83 times more efficient

The results are then compared to another set of larger multiplications and contractions to test for predictability. The simulator uses the latter set of parameters if the results are precise. The following results were generated using the previous data:

( 30 x 30 40 x 40 ) x ( 40 x 40 40 x 40 )
DGEMM: real: 3.549262 s (est: 3.53) eff: .8892
DGEMM: real: 3.547543 s (est: 3.53) eff: .8892
DGEMM: real: 3.547254 s (est: 3.53) eff: .8892
LOOPS: real: 22.458201 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.448650 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.446517 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.450999 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.447324 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.445928 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.449582 s (est: 16.76) 6-loop eff: .1373

Disk latency and bandwidth are difficult to measure in an isolated environment. Most operating systems use aggressive file caching, which can have a noticeable effect on performance curves that compare message size to bandwidth. The parallel CCSD algorithm is somewhat forgiving of exact performance because all disk reads are asynchronous; nonetheless, calibrating the I/O performance attempts to bypass the CPU and memory caches.
The benchmarking program iozone [Nor02] was used to measure the disk latency and bandwidth, and all files were created with the descriptors O_DIRECT (direct I/O), O_SYNC (synchronous writes), and O_RSYNC (synchronous reads). Comparing file record length to the time it took to read a random record produced data that were used to describe the local disk performance (Figure 5-11 and Table 5-2).

Figure 5-11. The local disk random read calibration data. The sets of data are almost identical no matter what file size (kB) was used between 2 and 512 MB.

Table 5-2. The latencies and bandwidths of random disk reads with respect to file size

    File Size    Brd    tlat
    2048         29.7    3.5
    4096         30.4    4.0
    8192         30.0    3.2
    16384        30.4    3.5
    32768        30.5    2.9
    65536        31.1    4.0
    131072       31.2    4.3
    262144       31.5    4.9
    524288       31.4    4.8

File sizes are in kB, bandwidths (Brd) are in MB/s, and latencies (tlat) are in ms. For each file size, the bandwidth is the inverse of the slope of the linear regression of record lengths vs. time to read, and the latency is the intercept. (The minimum R^2 was 0.999.) The average bandwidth and latency for all 9 sets were 30.7 MB/s and 3.9 ms, respectively.

Measuring network messaging performance was kept relatively simple. The mpptest program, which ships with the MPICH library [Gro96], was used to measure the point-to-point (P2P) and bisection bandwidths for messages ranging from 16 to 64 kB. P2P involves only two PEs, while bisection involves all PEs. Figure 5-12 shows the on-node P2P and bisection bandwidths measured on the IBM workstation. Latency is taken from the intercept of the linear regression, not the zero-size transfer time. The regression analysis for both curves suggests that both pairs of CPUs can communicate without degradation, so the simulator set the number of on-node communication channels to 2 and used the approximate values of 210 us and 99 MB/s for latency and bandwidth, respectively. Three off-node network architectures were tested: switched Fast Ethernet and two 150 MB/s full duplex, redundant path IBM SP switches (with different configurations). Only P2P tests were used to measure off-node latency and bandwidth (Figure 5-13), and Equation 5-5 was used to estimate degradation.

Figure 5-12. The calibration data for on-node messaging parameters. The regression equations are P2P: t = 9.9477*S + 210.69 (R^2 = 0.9973) and bisection: t = 9.8093*S + 209.7 (R^2 = 0.9982). Bandwidth is the inverse of the regression slope, and latency is the intercept. The performance does not change with one or two pairs of CPUs communicating at the same time, which implies two independent messaging channels.

Figure 5-13. The calibration data for off-node messaging parameters. The regression equations are Fast Ethernet: t = 86.987*S + 553.42 (R^2 = 0.9998); SP switch 1: t = 15.304*S + 265.18 (R^2 = 0.9915); and SP switch 2: t = 11.942*S + 180.88 (R^2 = 0.9970). Bandwidths are 11.2, 63.8, and 81.8 MB/s, respectively.
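Every latency/bandwidth pair quoted above is extracted the same way: an ordinary least-squares line through (size, time) points, with the intercept read as the latency and the inverse slope as the bandwidth. A sketch of that extraction on synthetic data (the helper name is ours):

```python
def fit_latency_bandwidth(sizes_kb, times_ms):
    """Least-squares fit of time = size/bandwidth + latency.

    Returns (latency_ms, bandwidth_mb_per_s); 1 MB = 1024 kB, as in the text.
    """
    n = float(len(sizes_kb))
    mx = sum(sizes_kb) / n
    my = sum(times_ms) / n
    sxx = sum((x - mx) ** 2 for x in sizes_kb)
    sxy = sum((x - mx) * (y - my) for x, y in zip(sizes_kb, times_ms))
    slope = sxy / sxx                              # ms per kB
    latency = my - slope * mx                      # intercept, in ms
    bandwidth = (1.0 / slope) * 1000.0 / 1024.0    # kB/ms -> MB/s
    return latency, bandwidth

# Synthetic disk curve: 30 MB/s (= 30.72 kB/ms) with a 4 ms seek latency.
sizes = [64.0, 256.0, 1024.0, 4096.0, 16384.0]
times = [s / 30.72 + 4.0 for s in sizes]
```

The same routine serves both the disk (record length vs. read time) and the network (message size vs. transfer time) calibrations.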
As in the case of asynchronous I/O, most messages in the parallel CCSD algorithm are one-sided and will hide behind computation. If the amount of computation is less than the estimated time spent in communication, then a more accurate messaging function should be used.

Theoretical Examples

With reasonable hardware parameters and timing algorithms, simulations of real molecular systems can be used to estimate the performance of the parallel CCSD algorithm. Table 5-3 lists the 20 amino acids and the corresponding numbers of electrons and spin pairs. Also listed are the sizes of the open-shell OOVV integrals, and thus the minimum disk requirements (with 1 GB equal to 1024^3 Bytes), and the minimum number of PEs with 1 GB of heap needed to calculate the UHF CCSD energy if each system had 1024 AO basis functions. Energy calculations of the anion, cation, and zwitterion are certainly achievable with the resources of current networks of workstations and clusters of SMP machines.

For 1024 AO basis functions, the size of the two-electron integrals is 8 TB. Approximately one-eighth of these need to be calculated on the fly since the others can be generated by permutations; however, assuming the integrals are dense, the full 8 TB array must be generated and processed twice (once for each of the two spin cases). Systems of this size should be executed on hundreds if not a thousand PEs or more. The initial parallelized software cannot scale this high, much less store the amount of data needed by each node. The parallel CCSD algorithm overcomes this with fine-grain task distribution and by distributing large arrays over as few PEs as possible. The net effect is that large systems can be calculated with few resources at the expense of computing time.
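The 8 TB total and the one-eighth fraction follow from a direct count: nAO^4 double-precision values, reduced by the 8-fold permutational redundancy of real two-electron integrals. A sketch (names are ours; binary units as in the text):

```python
def two_electron_storage_tb(n_ao):
    """(total, unique) storage in TB for the AO two-electron integrals.

    Real integrals (mu nu | rho sigma) carry an ~8-fold permutational
    redundancy, so only about 1/8 must be computed explicitly.
    """
    total_bytes = n_ao ** 4 * 8       # every integral as an 8-byte double
    tb = 1024 ** 4
    return total_bytes / tb, total_bytes / (8 * tb)
```

For 1024 basis functions this reproduces the 8 TB figure quoted above, with roughly 1 TB of symmetry-unique values.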
Table 5-3. The minimum resources for amino acids with 1024 AOs in UHF CCSD

    Amino acid      C  H  N  O  S   ne   nOO  Min PEs  Disk (GB)  Disk savings
    Alanine         3  7  1  2  0   48  1128       27        8.8        99.92%
    Cysteine        3  7  1  2  1   64  2016       48       15.8        99.86%
    Aspartic acid   4  7  1  4  0   70  2415       57       18.9        99.84%
    Glutamic acid   5  9  1  4  0   78  3003       71       23.5        99.79%
    Phenylalanine   9 11  1  2  0   88  3828       90       29.9        99.73%
    Glycine         2  5  1  2  0   40   780       19        6.1        99.95%
    Histidine       6  9  3  2  0   82  3321       78       25.9        99.77%
    Isoleucine      6 13  1  2  0   72  2556       60       20.0        99.83%
    Lysine          6 14  2  2  0   80  3160       75       24.7        99.78%
    Leucine         6 13  1  2  0   72  2556       60       20.0        99.83%
    Methionine      5 11  1  2  1   80  3160       75       24.7        99.78%
    Asparagine      4  8  2  3  0   70  2415       57       18.9        99.84%
    Proline         5  9  1  2  0   62  1891       45       14.8        99.87%
    Glutamine       5 10  2  3  0   78  3003       71       23.5        99.79%
    Arginine        6 14  4  2  0   94  4371      103       34.1        99.69%
    Serine          3  7  1  3  0   56  1540       37       12.0        99.90%
    Threonine       4  9  1  3  0   64  2016       48       15.8        99.86%
    Valine          5 11  1  2  0   64  2016       48       15.8        99.86%
    Tryptophan     11 12  2  2  0  108  5778      136       45.1        99.59%
    Tyrosine        9 11  1  3  0   96  4560      107       35.6        99.68%

The data assume the scratch memory needed by each PE is less than the amount of data needed to store the third OOVV array during an R2 transformation. Every example saves over 99.5% of the disk space required by a traditional MO-based CCSD algorithm.

Runtime Simulations

DCEE*

A simulation of the parallel CCSD algorithm was performed on the DCEE* system described in Chapter 2. Two network scenarios were tested: Fast Ethernet (F.Eth) and the faster of the two IBM SP switches (henceforth referred to as x3sw). Each node in the network was configured as if it were the aforementioned 4-CPU IBM workstation. All data are compared to 100% and 50% theoretical efficiencies (both are technically linear, but "linear" scaling commonly refers to 100% theoretical efficiency). (The parallel algorithm requires at least two 1-GB PEs, so all data are compared to this as the reference calculation.)
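The derived columns of Table 5-3 can be reproduced from the electron count alone, assuming nOO = ne(ne-1)/2 spin pairs, OOVV storage of nOO * nAO^2 doubles, and the minimum-PE rule implied by the table footnote (three OOVV-sized arrays per 1 GB heap). The helper below is our reconstruction, not code from the dissertation:

```python
import math

def table_5_3_row(n_electrons, n_ao=1024, heap_gb=1.0):
    """Reconstruct the nOO, Disk (GB), and Min PEs columns of Table 5-3."""
    n_oo = n_electrons * (n_electrons - 1) // 2         # spin pairs
    disk_gb = n_oo * n_ao ** 2 * 8 / 1024 ** 3          # OOVV integral storage
    min_pes = math.ceil(3 * disk_gb / heap_gb)          # three OOVV arrays held at once
    return n_oo, disk_gb, min_pes

# Alanine, 48 electrons: (1128, ~8.8 GB, 27 PEs), matching the table row.
```

The same three-line model reproduces every row of the table, which is a useful sanity check on the footnote's stated assumption.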
Fast Ethernet. The F.Eth simulation shows acceptable parallel performance up to 64 PEs, but beyond that, efficiency drops below 50%, which is considered to be unacceptable (Figures 5-14 and 5-15). However, even up to 1024 PEs, the scalability curve is still increasing. Efficiency shows normal behavior for parallel programs (starting out high and decreasing asymptotically).

Figure 5-14. The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast Ethernet. Xth (100%) and Xth (50%) are the theoretical speedup curves with 100% and 50% efficiencies, respectively. X pred. is the predicted speedup for the simulation. Efficiency*128 is the efficiency scaled by 128.

Beyond the standard analysis of speedup and efficiency, users might concern themselves with the idea of cost-to-performance. Another way of stating the idea: "At what point does the cost of using more PEs outweigh the gain in speedup they might contribute?" The "cost" function is considered to be the efficiency, and the metric for "bang per buck" is efficacy (Equation 5-8) [Gho91].

    eta(p) = E(p) * S(p)                                    (5-8)

Efficiency times speedup can have multiple maxima, but only the first maximum value (with respect to the number of PEs) corresponds to the most cost-effective use of PEs in a parallel run. The F.Eth simulation of DCEE* peaks at 64 PEs (eta = 10.2), but the curve is so shallow that one could argue anywhere between 32 and 96 PEs is still cost effective.

Figure 5-15. The simulated performance curves for DCEE* on 2 to 128 PEs over Fast Ethernet. The curves correspond to the same series as in Figure 5-14 with the exception that efficiency has been scaled by 32, not 128.

SP switch. The x3sw simulation shows much better performance (Figure 5-16). Predicted speedup never drops below 50% efficiency up to 1024 PEs, and efficacy has not peaked. The simulation suggests that the parallel performance for this system (DCEE*) is highly dependent on communication speed.

Figure 5-16. The simulated performance curves for DCEE* on an SP switch.
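Locating the Equation 5-8 sweet spot is mechanical once the simulated speedups are known: compute eta(p) = E(p)*S(p) and take its first maximum with respect to the PE count. A sketch with our own helper names (the 2-PE reference matches the text's reference calculation):

```python
def efficacy_curve(speedups, ref_pes=2):
    """eta(p) = E(p) * S(p), with E(p) = S(p) / (p / ref_pes).

    speedups maps PE count -> speedup relative to the ref_pes reference run.
    """
    return {p: (s * ref_pes / p) * s for p, s in speedups.items()}

def first_peak(curve):
    """The first local maximum of efficacy with respect to increasing PE count."""
    pts = sorted(curve.items())
    for (p0, e0), (p1, e1) in zip(pts, pts[1:]):
        if e1 < e0:
            return p0
    return pts[-1][0]

# A shallow, illustrative curve of the F.Eth flavor: efficacy peaks, then sags.
demo = {2: 1.0, 16: 6.9, 32: 12.0, 64: 18.1, 128: 24.0, 256: 28.0}
```

Because the rule takes the first maximum, a still-rising speedup curve (as in the x3sw run) simply returns the largest simulated PE count, signaling that the cost-effective limit has not yet been reached.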
Comparison to serial ACES II. Efficacy is a relative metric that depends on the machine and (molecular) system, but it can be used to compare the effectiveness of different parallel programs running the same system on the same computer. Figure 5-17 compares the speedup and efficacy of the new algorithm to the previous parallel CCSD algorithm. The new algorithm scales better and is much more cost effective per PE; however, the comparison of real times does not fare so well (Figure 5-18).

Figure 5-17. The relative performance of the initial parallelization attempt compared to the new algorithm. The discontinuity of the previous parallel program comes from allocating whole nodes with 16 CPUs each.

The new algorithm achieves high scalability by decomposing the equations to such a degree that the PEs are swamped with calculations in an attempt to use as little disk and memory as possible. The current serial ACES II CCSD program is 31.4 times faster than the new parallel algorithm, but it also requires much more disk space and does not scale as well. Nonetheless, Table 5-4 lists the numbers of PEs that the new algorithm would need to match the time achieved by the initial parallelization attempt.

It is clear that the CCSD energy can be calculated quickly with few PEs for molecular systems that have small data requirements (DCEE* used on the order of 2 GB of disk space per node). The current ACES II program and the initial parallelization attempt are more than adequate for reducing the time the new parallel algorithm would take; however, the new algorithm requires very little disk space and can calculate energies for very large systems.

Figure 5-18. The absolute performance of the initial parallelization attempt compared to the new algorithm. The dark lines are the factors of speedup achieved by using the previous code compared to the reference calculation of the new algorithm.

Table 5-4. The number of PEs needed to surpass the initial parallel performance

    Previous PEs  New PEs  Extrapolated speedup
               1       76                  31.8
              32      409                   146
              64      514                   176
             128      424                   151
             256      405                   145

The simulated speedup curve was fitted with a cubic regression, which was used to predict how many PEs the new algorithm needs to finish a CC iteration in the same time that the initial parallelization attempt did.

Serine and phenylalanine

Using the x3sw network parameters, the simulator examined serine (Ser) and phenylalanine (Phe), both in the cc-pVQZ basis set. Ser, with 56 electrons, had 595 AO basis functions and Phe (88 electrons) had 990 AOs. The current ACES II UHF CCSD program, using the AO algorithm for VVVV integrals and assuming 20% of the AO integrals are discarded, would require over 200 GB of local disk storage for Ser and over 1.5 TB for Phe. The new parallel algorithm would need only 2.8 and 19.2 GB of distributed storage, respectively.

Figures 5-19 and 5-20 show the simulated parallel performance of Ser and Phe. Ser requires a minimum of 14 1-GB PEs, and Phe requires 95 PEs. In either system, parallel efficiency never drops below 96%. With respect to tractability, a single Ser UHF CCSD iteration is expected to take 2.8 days on 1024 PEs. Phe requires approximately 2 months per iteration on 1024 PEs. If the efficiency curve is to be believed, then this can be dropped to about 2 weeks on over 4000 PEs.

Figure 5-19. The performance curves for serine on 16 to 1024 PEs.

Figure 5-20. The performance curves for phenylalanine on 128 to 1024 PEs.
The simulator is only able to model local computations with high accuracy. These are generally highly ordered, well-defined sequences of commands executing on a consistent processor (barring process contention in the operating system). After these, the performance of local disk operations is somewhat predictable, even with I/O contention from other PEs. The least predictable performance aspect of a parallel program is network communications. This seems to be an N! correlation problem in its own right (never mind coupled-cluster theory), but certain models are able to predict the statistical nature of network traffic rather well.

The current implementation of the simulator treats the network as a single, flat interconnected fabric in which every PE can reduce network bandwidth for every other PE. In principle, this should then predict the worst performance, since multilayered network hierarchies favor local communications, which the platoons are supposed to capitalize on. However, the penalty function for estimating network contention lacks the ability to scale up with the number of PEs (i.e., more PEs reduces the accuracy of the predicted network effects). This should be corrected with a better model such as the hyperbolic model for layered environments [Sto96].

At a more fundamental level, the simulator assumes all messages can be delivered in one eager message. This means each PE can send and receive data without bothering to divide the information into multiple messages, but more importantly, it completely eliminates the need for the target PE to acknowledge or create space for the incoming data. In practice, this is rarely the case, and this rendezvous protocol must be modeled.

The simulator is also limited to sharing I/O bandwidth among PEs on a single node. Many high-performance parallel supercomputers use storage area networks to increase capacity and reduce operating and maintenance costs. The consequence of this is that I/O contention must then take into account all of the PEs and not those on a single node.

Despite the crudeness, the simulator is able to predict that the computational workload will be distributed very well over the parallel environment, in part because the number of operations is many orders of magnitude greater than the number of PEs. Overlapping one-sided communication with computation and reading from disk asynchronously both ensure a high degree of scalability for large molecular systems distributed over hundreds of PEs.

Beyond predicting parallel performance, a simulator of this nature is able to recommend areas in which capital improvements yield the highest returns. The simulation of serine/cc-pVQZ on 64 4-CPU workstations (with 375 MHz POWER3 CPUs) connected with Fast Ethernet predicts a runtime on the order of 310 hours per iteration. Upgrading the network to the SP switch (with the x3sw parameters) reduces the runtime to approximately 264 hr/iter; however, doubling the CPU speed but keeping the F.Eth network (assuming the same operating efficiencies) reduces the runtime to approximately 216 hr/iter. Of course, any such recommendations are biased toward this parallel algorithm, but the simulator is able to gauge the effect of hardware upgrades without installing test machines (provided it has accurate calibration data).
CHAPTER 6
CONCLUSION

The predicted success of the parallel CCSD algorithm is due to four design principles:

- PEs can compute while waiting for data retrieval to complete
- PE coordination can eliminate explicit PE communication
- Most messages can be localized to a small subset of the network
- Few (if any) operations require global communication.

Despite the high parallel performance, overall computational cost is still rather high for small to medium sized molecular systems compared to traditional disk-based approaches of serial algorithms. The surest way to decrease overall cost is to repartition the most expensive phases of the algorithm such that reused contractions are stored in local scratch, in distributed memory, or on disk, rather than be recomputed by each PE on demand. If a user has already allocated 256 PEs, then the objective of a serine/cc-pVQZ calculation should be to finish as quickly as possible and not necessarily to finish as efficiently as possible with respect to the minimum PE configuration. Considering that only 14 PEs are needed to calculate serine in cc-pVQZ, having fewer platoons with more expensive messages is acceptable if the total time can be decreased by an amount that is greater than that returned by the maximum parallel speedup. However, the algorithm is able to calculate CCSD results for systems that are beyond the reach of even the fastest serial implementation. In this regard, the new parallel CCSD algorithm is the method of last resort that will produce answers in finite time.
Outlook

CCSD energies are not the final goal for a robust treatment of electron correlation. Two paths remain to be investigated before a widely applicable and generally useful program can be released: parallel CCSD analytical energy gradients and parallel noniterative triples corrections (to both the energy and the gradient).

Gradients

Without energy gradients, structural analysis and force constants must be estimated using grids of finite displacements and single-point energy calculations. First order finite difference grids require 2M displacements to calculate the next higher derivative. (M is the number of vibrational modes, i.e., M = 3*NAtoms - 6 for nonlinear systems.) Second order finite difference grids require 1 + M + M^2 displacements. For example, a numerical Hessian calculation, which produces internal force constants, from total energies alone for a system with 15 atoms requires 1561 displacements. If analytical gradients were available, then only 78 gradients would be needed to estimate the Hessian matrix, and if each gradient calculation costs roughly 2.5 times that of an energy calculation, then the analytical gradient calculation would be approximately 88% cheaper than the numerical gradient calculation.

CCSD gradients require two main computing sections: forming the Lambda de-excitation amplitudes and contracting the derivatives of the AO integrals with the two-particle density matrix [Sal89]. The Lambda amplitudes can be determined in much the same way as the T amplitudes, so the techniques described for parallelizing the T equations are identical to those needed for the Lambda equations. The derivative integral portion is different enough from the T equations that making decent use of data and task parallelism when forming the actual energy gradient will need to be investigated further.
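The displacement counts above can be checked directly, and the 88% figure follows once a gradient is costed at 2.5 energy evaluations. A sketch with our own function names:

```python
def vibrational_modes(n_atoms, linear=False):
    """M = 3N - 6 for nonlinear molecules (3N - 5 if linear)."""
    return 3 * n_atoms - (5 if linear else 6)

def hessian_energy_points(n_atoms):
    """Single-point energies for a second-order finite-difference Hessian."""
    m = vibrational_modes(n_atoms)
    return 1 + m + m ** 2

def gradient_route_savings(n_atoms, grad_cost=2.5):
    """Fractional cost saved by building the Hessian from 2M gradients instead."""
    m = vibrational_modes(n_atoms)
    return 1.0 - (2 * m * grad_cost) / hessian_energy_points(n_atoms)

# 15 atoms: M = 39, so 1561 energy points versus 78 gradients (~88% cheaper).
```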
Higher Excitation Clusters

Many weakly bound and dissociative molecular systems require a treatment of triple excitations. Even the most basic treatment can give accurate, predictive answers when CCSD is insufficient. The most widely used treatment of triples is CCSD(T), in which the converged T2 amplitudes from a CCSD calculation are contracted with OOOV and VVVO integrals to generate third-order triple excitation amplitudes, as shown in Equation 6-1 for the T3 residual amplitudes.

    r_ijk^abc = P(i|jk) P(a|bc) [ Sum_e t_jk^ec <ab||ie> - Sum_m t_mk^bc <am||ij> ]    (6-1)

    P(p|qr) f(p,q,r) = f(p,q,r) - f(q,p,r) - f(r,q,p)                                  (6-2)

The data scale like O^3 V^3 and the computation scales like N^7. Since the T3 amplitudes do not contribute to any quantity other than the energy, traditional methods for calculating them involve looping over as few fixed indices as possible, generating the T3 amplitudes that have those indices in common, calculating the energy contribution, and purging them from memory.

Rendell and coworkers have been persistent in pushing the boundaries of CCSD(T), and they also find that if the integrals are stored, the parallel efficiency decreases dramatically [Ren92, Ren93, Ren94, Kob97]. The difficulty in calculating these terms is that the VVVO integrals are not stored and must be generated directly. It is almost a certainty that the number of integral generations will scale with some power of O, the number of electrons; however, it is not obvious how to parallelize the task and data distribution such that not every integral shell quadruplet will contribute to the full set of T3 amplitudes. Although this seems contradictory, the DI phase for R2 bypasses this by calculating the contributions in uncontracted AO integrals and then transforming them with two N^5 steps after the integral loop.

Orbital Transformations and Space Reductions

There are a few mathematical manipulations that the two-particle space can undergo such that fewer computations are needed during the CC iterations, especially for the PP ladder term. Cholesky decomposition, singular value decomposition, and frozen natural orbitals are three methods for reducing the size of the virtual space or the number of terms that must be calculated. The problem with these methods is that they shift the work from computation to data.

1024 AO basis functions require 8 TB of storage for the two-electron integrals, but this matrix can be generated in parallel and on the fly from the basis set information. A Cholesky or singular value decomposition of the two-electron integral matrix will require at least 8 TB of storage just for the analysis, and since fewer matrix elements will be more important, there will be less work to parallelize and scalability will suffer. Linear combinations of the virtual space, like frozen natural orbitals, will only affect the efficiency of a CC algorithm that uses MO integrals. The proposed parallel CCSD algorithm never formally recognizes that a virtual space exists other than to create the complementary density matrix, Q. As such, MO-based CC algorithms have difficulties with feeding data to the CPUs fast enough if that data must be streamed through I/O.

Conclusion

The iterative CC algorithm is highly amenable to parallelization because a read-only data set (the TOLD amplitudes) must map into an accumulate-only data set (the TNEW amplitudes) through a massive number of computations. The primary difficulties in finding an efficient parallel approach are figuring out how to get distributed data to a CPU for processing and deciding when the CPU should compute nonexistent data. As molecular systems increase in size, the computational and storage requirements that CCSD demands quickly become unreasonable for a single computer. The parallel CCSD algorithm described in this dissertation attempts to maximize the use of idle CPUs and distributed memory in such a way that a CCSD treatment of electron correlation can now be considered for systems that were believed to be treatable only by one-particle correlation theories such as Density Functional Theory.
APPENDIX A
ACRONYMS AND ABBREVIATIONS

AO      Atomic Orbital
bcast   Broadcast
BPP     Bin Packing Problem
CC      Coupled-Cluster
CCSD    Coupled-Cluster Singles and Doubles
CDR     Commander
CI      Configuration Interaction
CO      Company
CPU     Central Processing Unit
DI      Direct Integral
F.Eth   Fast Ethernet
fma     Fused Multiply-Add
HF      Hartree-Fock
HH      Hole-Hole
I/O     Input/Output
IP      Internet Protocol
LDR     Leader
MO      Molecular Orbital
OOOO    integrals with four occupied indices
OOOV    integrals with three occupied indices and one virtual index
OOVV    integrals with two occupied and two virtual indices
OP      One Particle
OVVV    see VVVO
P2P     Point-to-Point (or Peer-to-Peer)
PE      Processor Element (CPU and memory heap)
PH      Particle-Hole
PLT     Platoon
PP      Particle-Particle
RHF     Restricted Hartree-Fock
SCF     Self Consistent Field
SI      Stored Integral
SMP     Symmetric MultiProcessor or Shared Memory Program
TCP     Transmission Control Protocol
UHF     Unrestricted Hartree-Fock
VOOO    see OOOV
VVOO    see OOVV
VVVO    integrals with three virtual indices and one occupied index
VVVV    integrals with four virtual indices
WF/WFD  Worst Fit (Decreasing)
x3sw    IBM SP switch with approximately 81.8 MB/s throughput
APPENDIX B
NOTATION AND CONVENTIONS

Operators and Matrix Elements

<pq|rs>          two-electron Coulomb integrals
<pq||rs>         antisymmetrized two-electron Coulomb-exchange integrals
f_pq             Fock matrix elements from the one-particle reference wavefunction
V_Rs^Pq          integrals for particles with different spins
V_rs^pq          integrals for same-spin particles
c_p              AO coefficients from the single-determinant reference wavefunction
t_O^V, t_OO^VV   one- and two-particle T excitation amplitudes, respectively
r_O^V, r_OO^VV   one- and two-particle T excitation residual amplitudes, respectively (Dt = r)
P(pq)            antisymmetric permutation operator: f(p,q) - f(q,p)
r~_OO^VV         two-particle residual amplitude before VV antisymmetrization

Operator Indices

I, J, K, L       occupied alpha molecular orbitals
i, j, k, l       occupied beta molecular orbitals
A, B, C, D       virtual alpha molecular orbitals
a, b, c, d       virtual beta molecular orbitals
p, q, r, s       arbitrary molecular orbitals
M, N, R, S       atomic orbital shells
mu, nu           atomic orbitals
p_s, q_s, ...    molecular orbitals from spin s projected into the AO basis
APPENDIX C
ALGORITHM OVERVIEW

Initialization:
    r_I = 0;  r_i = 0;  r~_IJ = 0;  r_jI = 0;  r~_ji = 0;  I1 = 0

DI Phase (first spin case):
    r_jI, r~_ji  <-  f(I1_direct, V, T1, T2);   r_Ij <- Q r_jI

DI Phase (second spin case):
    r_Ij, r~_IJ  <-  f(I1_direct, V, T1, T2)
    r~_IJ <- Q r~_IJ;   r_Ij <- Q r_Ij;   r~_ij <- Q r~_ij

SI OOVV Ring Pass 1:
    r~_IJ, r_Ij  <-  f(V_OOVV, T1, T2)

SI OOVV HH Ladder:
    r~_IJ, r_Ij, r~_ij  <-  f(V_OOVV, T1, T2)

SI OOVV Ring Pass 2:
    r_Ij, r~_ij  <-  f(V_OOVV, T1, T2)

SI OOVV Ring Pass 3:
    r_Ij  <-  f(V_OOVV, T1, T2)

One Particle:
    r_I, r_i, r~_IJ, r_Ij, r~_ij  <-  f(I1, T1, T2)

AntiSym:
    r_IJ^AB = r~_IJ^AB - r~_IJ^BA;   r_ij^ab = r~_ij^ab - r~_ij^ba

T update:
    t_I^A <- t_I^A + r_I^A / (f_II - f_AA);
    t_IJ^AB <- t_IJ^AB + r_IJ^AB / (f_II + f_JJ - f_AA - f_BB);  ...
APPENDIX D
ANATOMY OF A SHELL PAIR BLOCK

The T2, R2, and OOVV integrals are stored in blocks of AO shell pairs. Each shell pair block contains three spin blocks: alpha-alpha (IJ), alpha-beta (Ij), and beta-beta (ij). The nested loops that address each element in the MN shell block (called BLOCK) are shown in the following FORTRAN code listing. A pictorial representation is shown in Figure D-1.

      index = 1
c   AA block for < IJ | mu nu >
      do nu = 1, nAO(N)
        do mu = 1, nAO(M)
          do J = 1, nOcc(1)
            do I = 1, J-1
              print *, I,J,mu,nu,' = ',BLOCK(index)
              index = index + 1
            end do
          end do
        end do
      end do
c   AB block for < Ij | mu nu >
      do nu = 1, nAO(N)
        do mu = 1, nAO(M)
          do j = 1, nOcc(2)
            do I = 1, nOcc(1)
              print *, I,j,mu,nu,' = ',BLOCK(index)
              index = index + 1
            end do
          end do
        end do
      end do
c   BB block for < ij | mu nu >
      do nu = 1, nAO(N)
        do mu = 1, nAO(M)
          do j = 1, nOcc(2)
            do i = 1, j-1
              print *, i,j,mu,nu,' = ',BLOCK(index)
              index = index + 1
            end do
          end do
        end do
      end do
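The loop nest fixes the linear layout of a block: a triangular AA section, a full AB section, and a triangular BB section, each with the occupied indices running fastest. The implied element count can be written down directly (a sketch; the helper name is ours):

```python
def shell_pair_block_size(n_ao_m, n_ao_n, n_occ_a, n_occ_b):
    """Elements in one MN shell-pair block: AA + AB + BB spin sections."""
    pair_aa = n_occ_a * (n_occ_a - 1) // 2   # I < J triangle
    pair_ab = n_occ_a * n_occ_b              # full (I, j) rectangle
    pair_bb = n_occ_b * (n_occ_b - 1) // 2   # i < j triangle
    return (pair_aa + pair_ab + pair_bb) * n_ao_m * n_ao_n
```

This is the quantity a PE must reserve per shell pair, which is why the logical-shell balancing of Appendix E matters for memory as well as load.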
APPENDIX E
LOGICAL SHELLS AND BIN PACKING

Bin packing problems (BPP) involve grouping objects together such that each group satisfies a constraint placed on the collection of elements in the group. A prototypical example would be to partition an arbitrary set of numbers such that the sum of the numbers in each group does not exceed a predetermined value. Most of the interest in studying these problems involves optimizing the partition with respect to some parameter, and usually the objective is to minimize the number of groups in the partition.

There are four primary algorithms for finding solutions to the BPP: first fit (FF), best fit (BF), next fit (NF), and worst fit (WF). First fit puts the next ungrouped object in the first group that can accept it. Although "best" is a subjective word, best fit algorithms imply putting the next object in the group that will have the least available space after adding it. Alternatively, worst fit finds the group with the most available space before adding the next object.

Next fit algorithms are usually only applied to online scenarios. An offline scenario is one in which the set of objects to group is finite and known before the packing process; all other scenarios are considered online. Next fit also implies "closing" groups such that no other object can be added, no matter how optimal the new group might be, so a next fit algorithm will close the current group and form a new one if the next object cannot fit the open one. Offline scenarios allow for the objects to be sorted prior to the packing procedure, and this adds two qualifiers to the type of algorithm: increasing and decreasing.
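The worst fit decreasing rule can be stated very compactly; this Python sketch (ours) mirrors the packing step used by the FORTRAN search program that follows: sort the items by decreasing size, then drop each one into the lightest group that can still accept it.

```python
def worst_fit_decreasing(masses, bin_limit):
    """Pack masses into groups of capacity bin_limit using worst fit decreasing."""
    groups = []
    for m in sorted(masses, reverse=True):
        # Worst fit: among groups with room, pick the one with the most free space.
        candidates = [g for g in groups if sum(g) + m <= bin_limit]
        if candidates:
            min(candidates, key=sum).append(m)
        else:
            groups.append([m])   # nothing fits; open a new group
    return groups

# The C20/aug-cc-pVQZ example below repeats the shell sizes 6, 15, 20, 21, 18.
shells = [6, 15, 20, 21, 18] * 10
```

The FORTRAN program below wraps this step in an outer search over group counts and capacity limits to minimize the variance of the group masses, which this sketch does not attempt.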
In the parallel CCSD algorithm, processor elements work on data segments in units of AO shells. The number of shells is irrelevant, but the balance of the shells will directly affect the overall load balancing. The physical AO shells are quite uneven in size, and the largest difference in sizes can have a considerable effect on performance, so the physical shells are grouped into similarly sized logical shells.

The search algorithm loops over maximum logical shell sizes and numbers of logical shells, and attempts to minimize the shell variance. The ideal packing algorithm for this scenario is worst fit decreasing (WFD), which places a shell smaller than or equal to the previous shell in the logical shell with the fewest orbitals. The following original FORTRAN code implements a WFD packing algorithm starting from the largest physical shell size and searching up to a user-defined limit or the total number of shells, whichever is smaller.

c This program implements a worst fit decreasing bin packing
c algorithm. It searches over the number of possible bins and the
c range of maximum bin sizes and minimizes the group mass
c variance.
c isort and dependencies come from:
c http://www.netlib.org/slatec/src/isort.f
      program main
      implicit none
      integer max_cells
      parameter (max_cells=300)
      integer nCells, mass(max_cells), MasLim
      integer TotMass, lrg, sml
      integer GrpMax, nGrps, pGrp0, SmlGrp
      integer m(max_cells), GrpNdx(max_cells), GrpMass(max_cells)
      integer FinGrps, FinNdx(max_cells), FinMass(max_cells)
      double precision avg, FinAvg, var, FinVar, dTmp
      logical bValid
      integer i, j
c
      print *, 'Enter the number of cells:'
      read(*,*) nCells
      if (nCells.lt.1) stop 'Nothing to do!'
      print *, 'Enter the cell masses:'
      read(*,*) (mass(i),i=1,nCells)
      TotMass = 0
      lrg = 0
      sml = 2**30
      do i = 1, nCells
         if (mass(i).lt.1) stop 'Cells must have a mass > 0.'
         TotMass = TotMass + mass(i)
         lrg = max(lrg,mass(i))
         sml = min(sml,mass(i))
         m(i) = mass(i)
      end do
      print *, 'Enter the maximum group mass:'
      read(*,*) MasLim
      if (MasLim.lt.lrg) then
         stop 'The group limit is already exceeded.'
      end if
c
c      print *, 'The cell masses are: ',(mass(i),i=1,nCells)
c      print *, 'The total mass is ',TotMass
c      print *, 'The largest mass is ',lrg
c      print *, 'The smallest mass is ',sml
c
c   o sort the masses in descending order
      call isort(m,i,nCells,-1)
c   o initialize the final groups to the individual cells
      do i = 1, nCells
         FinNdx(i) = i
         FinMass(i) = m(i)
      end do
      FinGrps = nCells
      call stat(FinMass,FinGrps,FinAvg,FinVar)
      print *, 'searching limits from ',lrg,' to ',
     &         min(MasLim,TotMass)
      do GrpMax = lrg, min(MasLim,TotMass)
      do nGrps = (TotMass+GrpMax-1)/GrpMax, nCells
c      o fill in the largest cells
c        (pGrp0 points to the first group to scan from)
         pGrp0 = 1
         do i = 1, nGrps
            GrpNdx(i) = i
            GrpMass(i) = m(i)
            if (m(i).gt.GrpMax-sml) pGrp0 = pGrp0 + 1
         end do
         bValid=(pGrp0.le.nGrps.or.nGrps.eq.nCells)
c      o assign the remaining cells
         do i = nGrps+1, nCells
c         o find the smallest group
            SmlGrp = pGrp0
            do j = pGrp0+1, nGrps
               if (GrpMass(j).lt.GrpMass(SmlGrp)) SmlGrp = j
            end do
            GrpNdx(i) = SmlGrp
            GrpMass(SmlGrp) = GrpMass(SmlGrp) + m(i)
            if (GrpMass(SmlGrp).gt.GrpMax) bValid = .false.
         end do
         if (bValid) then
c         o find the variance of the group masses
            call stat(GrpMass,nGrps,avg,var)
c            print '(2(i3,a),f6.2,a,f8.4)',
c     &            nGrps,' groups <= ',GrpMax,
c     &            ': avg = ',avg,'; var = ',var
c         o compare this grouping with the best known grouping
            if (var.lt.FinVar) then
               print *, '> lower var at ',nGrps,' groups under ',
     &                  GrpMax
               do i = 1, nCells
                  FinNdx(i) = GrpNdx(i)
                  FinMass(i) = GrpMass(i)
               end do
               FinAvg = avg
               FinVar = var
               FinGrps = nGrps
            end if
         else
c            print '(2(i3,a),a)', nGrps,' groups <= ',GrpMax,
c     &            ' is invalid'
         end if
      end do
      end do
      call isort(FinNdx,m,nCells,2)
      print '()'
      print *, 'Cell masses: ',(m(i),i=1,nCells)
c   o print the original scheme
      call stat(m,nCells,avg,var)
      dTmp = (1.d0*lrg)/sml
      print '(2(a,i3))', 'max = ',lrg,'; min = ',sml
      print '(2(a,f8.4))', 'avg = ',avg,'; variance = ',var
      print '(a,f8.2)', 'The cells are uneven by a factor of ',
     &      dTmp**4
      print '()'
c   o print the new results
      lrg = 0
      sml = 2**30
      do i = 1, FinGrps
         lrg = max(lrg,FinMass(i))
         sml = min(sml,FinMass(i))
      end do
      dTmp = (1.d0*lrg)/sml
      print *, 'There are ',FinGrps,' groups'
      print *, 'Group ndx: ',(FinNdx(i),i=1,nCells)
      print *, 'Grp masses: ',(FinMass(i),i=1,FinGrps)
      print '(2(a,i3))', 'max = ',lrg,'; min = ',sml
      print '(2(a,f8.4))', 'avg = ',FinAvg,'; variance = ',FinVar
      print '(a,f8.2)', 'The cells are uneven by a factor of ',
     &      dTmp**4
      end
c
c   m(1:n) is considered the population, not a sample.
      subroutine stat(m,n,avg,var)
      implicit none
      double precision avg, var, a, v, dTmp
      integer m(*), n, i
      if (n.eq.1) then
         avg = m(1)
         var = 0.d0
         return
      end if
      a = m(1)
      do i = 2, n
         dTmp = 1.d0/i
         a = a + (1.d0*m(i)-a)*dTmp
      end do
      dTmp = 1.d0*m(1)-a
      v = dTmp*dTmp
      do i = 2, n
         dTmp = 1.d0*m(i)-a
         v = v*(i-1)/i
         v = v + dTmp*dTmp/i
      end do
      var = v
      avg = a
      return
      end

The following example was taken from C20 in the aug-cc-pVQZ basis set with spherical harmonics. (Spherical harmonics have 2L+1 orbitals per contracted basis function, where L is the angular momentum of the shell, while redundant Cartesian orbitals have (L+1)*(L+2)/2 orbitals per basis function.) The molecule is rather pathological in the size of each AO shell. If the shells are too large and the scratch memory requirements are too high, then they can be broken up at the expense of recalculating recursion data in the direct integral code.

binpack> cat stdin.c20.augccpvqz
100
06 15 20 21 18  06 15 20 21 18  06 15 20 21 18  06 15 20 21 18
06 15 20 21 18  06 15 20 21 18  06 15 20 21 18  06 15 20 21 18
06 15 20 21 18  06 15 20 21 18
PAGE 119
109 06 15 20 21 18 06 15 20 21 18 06 15 20 21 18 06 15 20 21 18 06 15 20 21 18 06 15 20 21 18 06 15 20 21 18 06 15 20 21 18 06 15 20 21 18 06 15 20 21 18 30 binpack> ./a.out < stdin.c20.augccpvqz Enter the number of cells: Enter the cell masses: Enter the maximum group mass: searching limits from 21 to 30 > lower var at 80 groups under 21 Cell masses: 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 15 6 6 15 15 6 15 6 6 15 6 15 6 15 15 6 15 6 6 15 15 6 6 15 15 6 6 15 6 15 15 6 6 15 6 15 6 15 15 6 max = 21; min = 6 avg = 16.0000; variance = 29.2000 The cells are uneven by a factor of 150.06 There are 80 groups Group ndx: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 61 62 62 63 63 64 64 65 65 66 66 67 67 68 68 69 69 70 70 71 71 72 72 73 73 74 74 75 75 76 76 77 77 78 78 79 79 80 80 Grp masses: 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 max = 21; min = 18 avg = 20.0000; variance = 1.5000 The cells are uneven by a factor of 1.85 The "unevenness" is calculated from the ratio of the largest to smallest shell size raised to the fourth power, (L/S)4, which is due to the twoelectron integrals. Each shell quadruplet requires N4 storage and is used in an O2N4 contraction with T2.
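A compact Python sketch may help clarify the search. This is an illustrative translation, not the dissertation's production code: the names wfd_pack, variance, and best_grouping are my own, and validity is checked after packing rather than during it as the FORTRAN version does. Under those assumptions it reproduces the C20/AUG-CC-PVQZ result above (80 logical shells, variance 1.5):

```python
def wfd_pack(masses, n_groups):
    """Worst fit decreasing: sort descending, then always add the next
    mass to the currently lightest bin."""
    bins = [0] * n_groups
    for m in sorted(masses, reverse=True):
        bins[bins.index(min(bins))] += m
    return bins

def variance(xs):
    """Population variance, as in the stat subroutine."""
    avg = sum(xs) / len(xs)
    return sum((x - avg) ** 2 for x in xs) / len(xs)

def best_grouping(masses, mass_limit):
    """Scan group counts and maximum group masses; keep the packing with
    the smallest variance whose largest bin respects the mass limit."""
    total, largest = sum(masses), max(masses)
    # start from the trivial grouping: every cell in its own group
    best = (variance(masses), len(masses), sorted(masses, reverse=True))
    for grp_max in range(largest, min(mass_limit, total) + 1):
        for n in range(-(-total // grp_max), len(masses) + 1):
            bins = wfd_pack(masses, n)
            v = variance(bins)
            if max(bins) <= grp_max and v < best[0]:
                best = (v, n, bins)
    return best

if __name__ == "__main__":
    shells = [6, 15, 20, 21, 18] * 20   # 20 carbon atoms, 5 shells each
    var, n_groups, bins = best_grouping(shells, 30)
    print(n_groups, var)                # 80 1.5
    print(round((max(bins) / min(bins)) ** 4, 2))  # 1.85
```

The expression -(-total // grp_max) is ceiling division, the counterpart of the FORTRAN (TotMass+GrpMax-1)/GrpMax lower bound on the number of groups.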
LIST OF REFERENCES

ACES   ACES II is a program product of the Quantum Theory Project, University of Florida. Authors: J. F. Stanton, J. Gauss, J. D. Watts, M. Nooijen, N. Oliphant, S. A. Perera, P. G. Szalay, W. J. Lauderdale, S. A. Kucharski, S. R. Gwaltney, S. Beck, A. Balková, D. E. Bernholdt, K. K. Baeck, P. Rozyczko, H. Sekino, C. Hober, and R. J. Bartlett. Integral packages included are: VMOL (J. Almlöf and P. R. Taylor); VPROPS (P. Taylor); and ABACUS (T. Helgaker, H. J. Aa. Jensen, P. Jørgensen, J. Olsen, and P. R. Taylor).

Amd67  G. M. Amdahl, AFIPS Conference Proceedings 30 (1967) 483-485.

Bar95  R. J. Bartlett, in Modern Electronic Structure Theory, Part I, edited by D. R. Yarkony (World Scientific Publishing Co., Singapore, 1995) 1047-1131.

Bau02  G. Baumgartner, D. E. Bernholdt, D. Cociorva, R. J. Harrison, S. Hirata, C. Lam, M. Nooijen, R. M. Pitzer, J. Ramanujam, and P. Sadayappan, in Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (IEEE Computer Society Press, Baltimore, Maryland, 2002) 1-10.

Cof97  E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson, in Approximation Algorithms for NP-Hard Problems, edited by D. Hochbaum (PWS Publishing, Boston, 1997) 46-93.

Cra98  T. D. Crawford and H. F. Schaefer III, in Reviews in Computational Chemistry, Vol. 14, edited by K. B. Lipkowitz and D. B. Boyd (Wiley-VCH, New York, 2000) 33-136. [URL: http://zopyros.ccqc.uga.edu/lec_top/cc/html/review.html accessed 26 June 2004]

Dun00  T. H. Dunning, Jr., J. Phys. Chem. A 104 (2000) 9062-9080.

Fed98  Federal Emergency Management Agency Emergency Management Institute, IS-195 Incident Command System, Federal Emergency Management Agency (ONLINE, 1998). [URL: http://training.fema.gov/EMIWeb/IS/is195lst.asp accessed 30 June 2004]

Gho91  D. Ghosal, G. Serazzi, and S. K. Tripathi, IEEE Trans. Soft. Eng. 17 (1991) 443-453.

Gro96  W. Gropp, E. Lusk, N. Doss, and A. Skjellum, Parallel Computing 22 (1996) 789-828. [URL: http://www-unix.mcs.anl.gov/mpi/mpich accessed 21 July 2004]

Gro99  W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the Message Passing Interface (The MIT Press, Cambridge, Massachusetts, 1999).

Gus88  J. L. Gustafson, Communications of the ACM 31 (1988) 532-533.

Hal00  B. C. Hall, An Elementary Introduction to Groups and Representations (ONLINE, 2000). [URL: http://arxiv.org/abs/math-ph/0005032 accessed 3 July 2004]

Hir03  S. Hirata, J. Phys. Chem. A 107 (2003) 9887-9897.

Kob97  R. Kobayashi and A. P. Rendell, Chem. Phys. Lett. 265 (1997) 1-11.

Koc96  H. Koch, A. S. de Merás, T. Helgaker, and O. Christiansen, J. Chem. Phys. 104 (1996) 4157-4165.

Mus02  M. Musial, S. A. Kucharski, and R. J. Bartlett, J. Chem. Phys. 116 (2002) 4382-4388.

Nor02  W. Norcott and D. Capps, Iozone ver. 3.221 (COMPUTER SOFTWARE, 2002). [URL: http://www.iozone.org accessed 21 July 2004]

Per03  M. Pernpointner and L. Visscher, J. Comput. Chem. 24 (2003) 754-759.

Ren92  A. P. Rendell, T. J. Lee, and R. Lindh, Chem. Phys. Lett. 194 (1992) 84-94.

Ren93  A. P. Rendell, T. J. Lee, A. Komornicki, and S. Wilson, Theor. Chim. Acta 84 (1993) 271-287.

Ren94  A. P. Rendell and T. J. Lee, J. Chem. Phys. 101 (1994) 400-408.

Sal89  E. A. Salter, G. W. Trucks, and R. J. Bartlett, J. Chem. Phys. 90 (1989) 1752-1766.

Sto96  I. Stoica, F. Sultan, and D. E. Keyes, J. Parallel Dist. Comput. 39 (1996) 29-45.
BIOGRAPHICAL SKETCH

Anthony D. Yau graduated from Miami Palmetto Senior High School in the Spring of 1993. He attended the University of Florida as an undergraduate student from August 1993 to December 1996, and graduated with a B.S. in Chemistry. After surveying the possibilities for graduate studies in Theoretical Chemistry, he reapplied to the University of Florida as a graduate student in January 1997, and worked with Dr. Rodney J. Bartlett in the Quantum Theory Project, a world-class consortium of faculty in the Chemistry and Physics departments interested in developing, implementing, and applying quantum theories.

