A PARALLEL DIRECT OPEN-SHELL
COUPLED-CLUSTER SINGLES AND DOUBLES ALGORITHM
ANTHONY D. YAU
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
Anthony D. Yau
I thank my family and friends for supporting me throughout the years; my strength
comes from them. I thank the faculty who have instructed me and the members of Rod
Bartlett's research group. I owe particular thanks to Dr. Erik Deumens for continued
motivation and guidance in scientific computing (and for teaching the first parallel
programming class that I attended) and to Dr. Victor Lotrich for insights into
implementing Coupled-Cluster theory. I especially thank Dr. Ajith Perera for extending
so much help (and patience) to me during all of my studies throughout graduate school.
The United States Department of Defense High Performance Computing
Modernization Program (HPCMP) supported all of this work. A grant under the
Programming Environment and Training (PET) program (GSA Contract No. DAAD05-
01-C-003) funded the initial parallelization effort, and the Common High Performance
Computing Software Support Initiative (CHSSI) program (project CBD-03) funded the
research on the fine-grain parallel algorithm.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1 INTRODUCTION

2 PARALLELIZING A SERIAL CCSD PROGRAM
    Serial Profile
    Bottleneck Categorization and Parallelization
        Particle-Particle Ladders
        Matrix Multiplication
        I/O-Dependent Routines
    Parallel Profile
    Top-Down Parallelization Considerations

3 SPIN-UNRESTRICTED CCSD EQUATIONS
    Group 1: T2 Virtual Orbitals Coming from V
    Group 2: T2 Virtual Orbitals Coming from V and T
        Set 1: One Virtual from V, One Virtual from T1
        Set 2: One Virtual from V, One Virtual from T2
    Group 3: T2 Virtual Orbitals Coming from T
        Set 1: Rings and PH Ladders
            Subset 1: AC/BD
            Subset 2: DB/CA
            Subset 3: CB/AD
        Set 2: HH Ladders from OOVV, OOOV, and OOOO
    Group 4: T2 into T1
    Group 5: One-Particle Transformations
        Set 1: T1 Contributions
        Set 2: T2 Contributions

4 PARALLEL ALGORITHM
    Virtual Parallel Computer
    Communication Networks
    Data Management
        Arrays stored in distributed memory
        Arrays stored on disk
    Parallel UHF CCSD
        Direct Integral Phase
            Design considerations
            Algorithm
        Stored Integral Phase
        One Particle Phase
        Antisymmetrization and T2 Update

5 DISCUSSION AND SIMULATION
    Discussion
        General Comments
        Load Balancing
            Job boards
            Data aggregation
            Asynchronous I/O
        Synchronization Delays
    Simulation
        General Overview
        Specific Technologies
        Calibration
        Theoretical Examples
    Runtime Simulations
        DCEE*
        Serine and phenylalanine

6 CONCLUSION
    Outlook
        Gradients
        Higher Excitation Clusters
        Orbital Transformations and Space Reductions

A ACRONYMS AND ABBREVIATIONS

B NOTATION AND CONVENTIONS
    Operators and Matrix Elements
    Operator Indices

C ALGORITHM OVERVIEW

D ANATOMY OF A SHELL PAIR BLOCK

E LOGICAL SHELLS AND BIN PACKING

LIST OF REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES
2-1. The distribution of real time over high-level CCSD subroutines
3-1. The RHF CCSD T1 contributions
3-2. The RHF CCSD T2 contributions
4-1. The direct integral processing loop summary
4-2. The Direct Integral phase prologue memory map
4-3. The Direct Integral phase inner loop memory map
4-4. The Direct Integral phase epilogue memory map
4-5. The Stored Integral phase Group 3 Set 1 Subset 1 memory map
4-6. The Stored Integral phase Group 3 Set 1 Subset 2 memory map
4-7. The Stored Integral phase Group 3 Set 1 Subset 3 memory map
4-8. The Stored Integral phase Group 3 Set 2 memory map
4-9. The One Particle phase FP*T2 (αβ) memory map
4-10. The One Particle phase FP*T2 (and same-spin terms) memory map
4-11. The One Particle phase FH and FR local transformations
4-12. The R2 antisymmetrization and T2 update memory map
5-1. The schedule of synchronization points
5-2. The latencies and bandwidths of random disk reads with respect to file size
5-3. The minimum resources for amino acids with 1024 AOs in UHF CCSD
5-4. The number of PEs needed to surpass the initial parallel performance
LIST OF FIGURES
2-1. The distribution of real time over primary CCSD subroutines
2-2. The breakdown of significant UHF subroutines
2-3. The column assignments for each of 4 CPUs updating R2
2-4. The distribution scheme for matrix multiplies over CPU threads
2-5. The parallel speedup for each major UHF CCSD subroutine and the CCSD executable
2-6. The runtime profile of parallel CCSD with a UHF reference from DCEE*
4-1. Communication network boundaries
4-2. The T2 and R2 memory layout
4-3. The path of Group 3 Set 1 Subset 1 updates per OOVV integral
4-4. The LD shell distribution schemes
4-7. The memory heap management for various molecular systems
4-8. The data flow in the Direct Integral phase
4-9. The data flow in the Stored Integral phase
5-10. The binary tree approach to collective communications
5-11. The local disk random read calibration data
5-12. The calibration data for on-node messaging parameters
5-13. The calibration data for off-node messaging parameters
5-14. The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast Ethernet
5-15. The simulated performance curves for DCEE* on 2 to 128 PEs over Fast Ethernet
5-16. The simulated performance curves for DCEE* on an SP switch
5-17. The relative performance of the initial parallelization attempt compared to the new algorithm
5-18. The absolute performance of the initial parallelization attempt compared to the new algorithm
5-19. The performance curves for serine on 16 to 1024 PEs
5-20. The performance curves for phenylalanine on 128 to 1024 PEs
D-1. The spin blocks in an AO shell pair block
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
A PARALLEL DIRECT OPEN-SHELL
COUPLED-CLUSTER SINGLES AND DOUBLES ALGORITHM
Anthony D. Yau
Chair: Rodney J. Bartlett
Major Department: Chemistry
Coupled-Cluster Singles and Doubles (CCSD) is a benchmark ab initio electron
correlation method. The biggest factor preventing its widespread adoption for large
molecular systems is that the theory scales like N^4 in data and N^6 in computation (N
being the number of atomic orbital basis functions). Using simple factorization
techniques and data and task parallelization, a new approach to calculating open-shell
CCSD energies is available. For a molecular system with 50 electrons (25 alpha, 25 beta)
and 500 AO basis functions, this approach has the potential to reduce the storage
requirements of a traditional UHF CCSD algorithm in a basis set of molecular orbitals
from 630 GB to 2.3 GB. The number of operations will increase, but the natural
parallelization of the data provides a clear and concise framework for parallelizing the
method.

INTRODUCTION

Coupled-Cluster (CC) theory is an ab initio electron correlation method that can be
thought of as infinite-order perturbation theory or a size-extensive approximation to the
full configuration interaction (CI) analysis [Bar95]. Given a one-particle reference
wavefunction Φ0, the CC wavefunction is defined as an exponential expansion of
excitation clusters, T, from the occupied orbitals to the unoccupied (virtual) orbitals. The
T operator is a particle excitation operator that naturally truncates at the number of
particles, P, in the system (Equation 1-2). Occupied orbitals (the hole space) are denoted
with the letters i, j, k, and l, while virtual orbitals (the particle space) are denoted with the
letters a, b, c, and d. Appendix A lists acronyms and abbreviations, and Appendix B
summarizes index conventions and matrix elements.
$$|\Psi_{CC}\rangle = \exp(T)\,|\Phi_0\rangle \tag{1-1}$$

$$T = \sum_{p=1}^{P} T_p; \qquad T_p = \left(\frac{1}{p!}\right)^2 \sum_{\substack{ij\ldots \\ ab\ldots}} t_{ij\ldots}^{ab\ldots}\; a^{\dagger} i\; b^{\dagger} j \cdots \tag{1-2}$$
The difficulty in calculating the full CI is that the data requirement scales like N^(2P)
and the computational requirement scales like N^(2P+2), in which N is the size of the basis set
space. To retain size extensivity and computational tractability, the excitation operator is
usually truncated at a small number of excitations. The most common is at double
excitations, such that T=T1+T2. As computers become more capable of massive
calculations, the number of excitations can increase, but the size of the molecular system
must decrease much more rapidly. Currently, the highest hand-coded truncation is at
pentuple excitations (i.e., T= T1+T2+...+T5) [Mus02], but efforts to automate this process
are becoming more promising as distribution techniques for data and computations
become available [Bau02].
Using the standard two-particle electronic Hamiltonian, the CC energy is calculated
by contracting the elements of the Fock matrix (f_pq) and the two-electron Coulomb
integrals with the T amplitudes (Equations 1-3 and 1-4). Uppercase orbital indices correspond
to alpha spin orbitals, and lowercase orbital indices correspond to beta spin orbitals. The
double-bar integrals in Equation 1-4 are antisymmetrized Coulomb-exchange integrals
(Appendix B). (This derivation assumes that the reference orbitals are orthonormal.)
$$E_{CC} = \langle\Phi_0|\, H \exp(T)\,|\Phi_0\rangle \tag{1-3}$$

$$E_{CC} = E_0 + \sum_{IA} f_{IA}\, t_I^A + \sum_{ia} f_{ia}\, t_i^a + \sum_{IjAb} \langle Ij|Ab\rangle \left(t_{Ij}^{Ab} + t_I^A t_j^b\right) + \frac{1}{4}\sum_{IJAB} \langle IJ\|AB\rangle \left(t_{IJ}^{AB} + t_I^A t_J^B - t_I^B t_J^A\right) + \frac{1}{4}\sum_{ijab} \langle ij\|ab\rangle \left(t_{ij}^{ab} + t_i^a t_j^b - t_i^b t_j^a\right) \tag{1-4}$$
The T amplitudes are calculated by iterating over a coupled set of equations that
project the Hamiltonian and wavefunction into a set of p-particle excitation spaces. The
CCSD approximation truncates T at double excitations, and is one of the benchmark
correlated methods in quantum chemistry today. The T1 and T2 equations come from the
projections in Equations 1-6 and 1-7, in which H_N is the normal-ordered Hamiltonian
that does not include the reference energy. Only the β and ββ spin equations are shown,
but the α, αα, and αβ spin equations are analogous.
$$e^{-T} H_N\, e^{T}\, |\Phi_0\rangle = \Delta E_{CC}\, |\Phi_0\rangle \tag{1-5}$$

$$\langle \Phi_i^a|\, e^{-T} H_N\, e^{T}\, |\Phi_0\rangle = 0 \tag{1-6}$$

$$\langle \Phi_{ij}^{ab}|\, e^{-T} H_N\, e^{T}\, |\Phi_0\rangle = 0 \tag{1-7}$$
The Baker-Campbell-Hausdorff formula [Hal00] has the effect of reducing the T
equations to those terms that 1) contract directly with the Hamiltonian and 2) have no
more than four excitation amplitudes. The first effect is from disconnected terms being
cancelled by the commutation, and the second effect is from the two-particle Hamiltonian
having at most four indices from the two-electron integrals.
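For reference, the commutator expansion of the similarity-transformed Hamiltonian terminates exactly at the fourth nested commutator for a two-particle Hamiltonian:

$$e^{-T} H_N\, e^{T} = H_N + [H_N,T] + \frac{1}{2!}\big[[H_N,T],T\big] + \frac{1}{3!}\big[\big[[H_N,T],T\big],T\big] + \frac{1}{4!}\Big[\big[\big[[H_N,T],T\big],T\big],T\Big]$$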
Computationally, the diagonal elements of the Fock matrix that contract with T are
grouped together and define the input amplitudes for the next iteration. The T2 ββ
equations are shown in Equations 1-8 through 1-11. If T is initialized to 0, then the next
iteration will produce the first-order T2 amplitudes; alternatively, T2 can
be initialized to the second-order wavefunction and T1 can be set to 0.
$$0 = \langle ij\|ab\rangle + \sum_{c}\left(f_{ac}\, t_{ij}^{cb} + f_{bc}\, t_{ij}^{ac}\right) - \sum_{k}\left(f_{ki}\, t_{kj}^{ab} + f_{kj}\, t_{ik}^{ab}\right) + \ldots \tag{1-8}$$

$$0 = \left(f_{aa} + f_{bb} - f_{ii} - f_{jj}\right) t_{ij}^{ab} + \langle ij\|ab\rangle + \sum_{c \neq a} f_{ac}\, t_{ij}^{cb} + \sum_{c \neq b} f_{bc}\, t_{ij}^{ac} - \sum_{k \neq i} f_{ki}\, t_{kj}^{ab} - \sum_{k \neq j} f_{kj}\, t_{ik}^{ab} + \ldots \tag{1-9}$$

$$D_{ij}^{ab} = f_{ii} + f_{jj} - f_{aa} - f_{bb} \tag{1-10}$$

$$D_{ij}^{ab}\, t_{ij}^{ab} = \langle ij\|ab\rangle + \sum_{c \neq a} f_{ac}\, t_{ij}^{cb} + \sum_{c \neq b} f_{bc}\, t_{ij}^{ac} - \sum_{k \neq i} f_{ki}\, t_{kj}^{ab} - \sum_{k \neq j} f_{kj}\, t_{ik}^{ab} + \ldots \tag{1-11}$$
Since the T amplitudes differentiate between occupied and virtual orbitals, it is
common to group integrals into the various space combinations such as occupied-
occupied (OO), occupied-virtual (OV), and virtual-virtual (VV). There are five such
combinations for the two-electron integrals: OOOO, OOOV, OOVV, OVVV, and
VVVV. If the virtual orbitals outnumber the occupied orbitals by a factor of 3, then the
VVVV and AO integrals will outnumber the OOOO integrals by factors of 81 and 256,
respectively. Most CC implementations calculate and store the OOOO, OOOV, and
usually OOVV integrals before starting the CC iterations. Due to programming
complexity, most implementations also store the OVVV integrals.
The VVVV integrals are used in only one CCSD term and how they are processed
illustrates a fundamental technique that the parallel CCSD algorithm uses in almost every
contraction. The data requirement scales like N^4 and the computational requirement
scales like N^6, and this term is usually the rate-determining step of the CC iteration. The
standard MO expression is shown in Equation 1-12. The "+=" operator is shorthand for
"incremented by" (i.e., "x += y" means "x = x + y"), and the r amplitude is the residual that
collects all of the contributions before dividing by the D denominator and producing the
next set of T amplitudes.
$$r_{ij}^{ab} \mathrel{+}= \frac{1}{2}\sum_{cd} \langle ab\|cd\rangle\, t_{ij}^{cd} \tag{1-12}$$
Forward transforming T into the AO basis allows the AO integrals to contribute
directly to the VVVV terms. Conceptually, the AO coefficients from the reference
wavefunction are "removed" from the integral and "attached" to the amplitude. The
primes on the AO indices in T denote a forward transformation. This distinction must be
made since another useful transformation (a backward transformation) has the same
dimensionality (viz. po), but the AO overlap matrix is used in the contraction in addition
to the AO coefficients.
$$r_{ij}^{\mu\nu} \mathrel{+}= \sum_{\rho\sigma} \langle \mu\nu|\rho\sigma\rangle\, t_{ij}^{\rho'\sigma'}, \qquad t_{ij}^{\rho'\sigma'} = \sum_{cd} c_{\rho}^{\,c}\, c_{\sigma}^{\,d}\, t_{ij}^{cd} \tag{1-13}$$
While this transformation removes the need to store the VVVV MO integrals, it
increases the computational workload substantially. In fact, since each integral
contributes to each amplitude, the R2 array would be updated N^4 times! However,
implementing this technique presents a more optimistic view of calculating large systems.
The R2 amplitudes are back-transformed from MOs to AOs before being incremented by
the AO integral terms. Once the integrals have been processed, the R2 array is forward-
transformed into the MO basis.
4""e =u _(' cd n S' (1-15)
I4""+= pr per t "" (1-16)
The method of calculating amplitude contributions in the AO basis instead of the
virtual MO basis has the potential to reduce the amount of storage space needed to
calculate the aforementioned VVVV contribution. This is particularly true for open-shell
systems, which need to store all three spin combinations of the VVVV integrals
compared to storing one set of AO integrals. If direct integral generation is used, then the
storage requirement for this term is completely removed. In either the AO or MO
scenario, parallelization involves distributing the data and operations of a large matrix
multiplication over the processing elements.
PARALLELIZING A SERIAL CCSD PROGRAM
Wide-scale reformulation of computational algorithms should only be performed if
currently existing formulations are inadequate, but determining whether an algorithm is
inadequate depends heavily on its intended use and the computers that are meant to
execute it. Any of a number of runtime parameters could contribute to low performance:
compiler efficiency; net CPU speed; memory or disk storage requirements and
performance; parallel scalability and efficiency; etc. In the case of general computational
chemistry algorithms, how the number of operations scales with respect to the size of the
chemical system will also determine whether an algorithm should be reformulated.
The ACES II Program System [ACES] is one of the more popular software
packages that calculates CCSD electron correlation energies. Although it was written to
make use of CPU vectorization, parallelization is considered to be one of the most
important ways to decrease the run time on modern supercomputers. Traditional methods
for parallelizing existing software begin with execution profiling. Once the main
bottlenecks are identified and analyzed, parallel formulations of those sections are
implemented and profiled.
Profiling the CC code in ACES II was performed with four test systems:
* Alanine ("ala"; C3H7NO2; 48 e-) in cc-pVDZ (119 AO basis functions)
* Octane ("oct"; C8H18; 66 e-) in cc-pVDZ (202 AOs)
* β,β'-dichloroethyl ether ("DCEE"; C4H8Cl2O; 74 e-) in cc-pVDZ (146 AOs)
* DCEE radical ("DCEE*"; C4H7Cl2O; 73 e-) in cc-pVDZ (141 AOs).
The first three systems had an RHF reference, the fourth had UHF, and no spatial point
group symmetry was used for any system. All of the systems used the AO basis
formulation for the VVVV*T2 COntribution, which was described in Chapter 1. The
profiled code used the C system function gettimeofday() to record the real-time clock
before and after large code sections. Any subroutine (or group of related subroutines) that
consumed more than 3% of the real time became a potential target for parallelization.
Table 2-1 lists the average percentage of the total real time for each high-level
subroutine measured across the three closed-shell test systems (RHF column), and it lists
the percentage for each routine in the DCEE* test system (UHF column). Routines that
contributed more than 3% are in boldface. The maximum variance of any set of
percentages for the closed-shell systems was 0.004, which suggests that the three profiles
have a consistent and reliable distribution of times. The table entry for "~AOLAD"
corresponds to the time spent in AOLAD that is not due to RDAO, even though the latter
is a dependent of the former. Figure 2-1 shows the amount of the total time consumed by
the major subroutines and the next most significant routine.
Figure 2-1. The distribution of real time over primary CCSD subroutines.
Table 2-1. The distribution of real time over high-level CCSD subroutines
Routine    RHF    UHF        Routine     RHF    UHF
ZERSYM     0.2%   0.1%       T1INT1      0.4%   0.2%
T1RAAAA    1.4%   0.6%       LADAA       0.0%   0.1%
T1RABBA    0.8%   0.6%       LADAB       1.2%   0.5%
T1RABAB    0.8%   0.6%       AOLADLST    0.0%   0.0%
CNTRCT     7.7%   9.8%       T2TOAO      0.7%   0.3%
SUMSYM     0.1%   0.2%       RDAO        58.3%  65.1%
F2TAU      0.0%   0.0%       ~AOLAD      0.6%   0.3%
GETLST     1.1%   0.8%       Z2TOMO      0.9%   0.7%
QUAD1      1.3%   0.5%       ZIAAO       0.1%   0.2%
QUAD2      0.5%   0.3%       T12INT2     0.8%   1.1%
QUAD3      0.3%   0.2%       RNGDRV      9.4%   11.0%
MAKFME     0.1%   0.1%       SSTOX1      0.0%   0.0%
T1W1AA     0.0%   0.6%       SUMRNG      0.2%   0.1%
T1W1AB     3.9%   2.1%       ERNGAA      0.0%   0.0%
T1INW2     0.1%   0.0%       ERNGAB      0.5%   0.1%
FEACONT    0.7%   0.4%       SUMSYM      0.2%   0.0%
FMICONT    0.3%   0.3%       E4S         0.6%   0.2%
FMECONT    0.5%   0.2%       NEWT2       0.3%   0.2%
T1INT2A    0.3%   0.2%       PHASE3      0.6%   0.2%
T1INT2B    0.8%   0.5%       DIIS        2.6%   1.0%
It is clear from the table that four routines consume most of the execution time:
CNTRCT, T1W1AB, RDAO, and RNGDRV. Together, they cost 79.3% and 88.1% for
RHF and UHF, respectively. Figure 2-2 shows the four major subroutines and any
individual routine that consumes more than 1% from the DCEE* profile.
Figure 2-2. The breakdown of significant UHF subroutines. Primary values are grouped
on the left, secondary values (greater than 1%) are grouped on the right.
Bottleneck Categorization and Parallelization
The single most time consuming routine in either RHF or UHF calculations is
RDAO, which increments T2 by contributions from the VVVV integrals in the AO basis.
Integral-stored schemes, such as the current implementation in ACES II, read the AO
integrals into memory from disk, multiply them by the T2 amplitudes from the previous
iteration, and add the product to the T2 residual amplitudes. Neglecting integral sparsity,
the number of permutationally unique AO integrals is shown in Equation 2-1. If there are
at least 100 AO basis functions, then the number of unique integrals can be approximated
by NAO4/8 with more than 98% accuracy; 200 or more functions and the approximation is
more than 99% accurate.
$$N_{\mathrm{int}} = \frac{P(P+1)}{2}; \qquad P = \frac{N_{AO}(N_{AO}+1)}{2} \tag{2-1}$$
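The quoted accuracy of the N_AO^4/8 approximation is easy to verify; a minimal sketch:

    def unique_integrals(nao: int) -> int:
        # Permutationally unique two-electron integrals (Equation 2-1).
        pairs = nao * (nao + 1) // 2
        return pairs * (pairs + 1) // 2

    for nao in (100, 200):
        exact = unique_integrals(nao)
        print(nao, exact, nao**4 / 8 / exact)  # ratios of about 0.980 and 0.990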
Many AO-based implementations have a user-defined threshold for discarding AO
integrals that would not have a significant contribution to the T2 residual. The integral
program in ACES II stores the integrals in one long vector and discards individual
integrals that are less than the threshold. For DCEE* and a threshold of 10^-14, ACES II
stored almost 41.1 million integrals from a complete set of 50.1 million, which means the
two-electron integrals were 18.0% sparse.
The RDAO algorithm reads in a batch of integrals and processes the integrals
individually. For a particular integral, the value might contribute 1, 2, 4, or 8 times
depending on the permutational symmetry of the orbitals. The RHF case only has 1, 2, or
4 contributions and post-processes the residual for the final permutation. The general
structure of the algorithm can be described by the following:
do while (more batches)
   do for each integral
      do for each permutation
         do for all OO pairs
            R2(OO,μν) += V * T2(OO,ρσ)
This algorithm is not particularly efficient since each permutation requires a vector-
vector operation (R2(*,μν) += V*T2(*,ρσ)). Experience in computational algorithms
suggests that matrix-matrix operations (using integral blocks) would have much higher
computational throughput. For example, the RDAO update averaged 286.7 MFLOPS out
of a theoretical maximum of 1500 MFLOPS on the 375 MHz IBM POWER3 CPU. A
600x600 matrix multiplication using the ESSL library (a tuned linear algebra library for
IBM architectures) averages 1278 MFLOPS, which corresponds to a computational
efficiency of 85.2% compared to 19.1% for RDAO. However, parallel algorithms for
block-wise implementations are virtually identical to element-wise implementations, so
block optimizations were ignored in the parallel strategy and could be implemented later.
For any DCEE* CC iteration, 9.6 seconds out of 5952 seconds were spent reading
integrals from disk (i.e., I/O takes about 0.16% of the total RDAO time). Before the
multiplication loop, there was a small sorting section, which took close to 13.6 seconds
(also insignificant); therefore, parallelizing the actual integral updates was expected to
have the largest impact on performance.
There are multiple levels at which the current algorithm can be parallelized:
batches, integrals, permutations, or rows of OO pairs. Distributing the operations over
OO pairs might decrease performance since the pairs are contiguous in memory, and the
cost of adding more rows is much less than the cost of adding more loops to process
them. The problem with distributing permutations is that the number of permutations for
a random integral is not always 8. This might introduce load-balancing problems
depending on how the integrals are sorted on disk; furthermore, each integral has a
limited number of tasks (permutations), which might limit overall scalability above 8 CPUs.
The two remaining loops (unique integrals and integral batches) are ideal for
distributing over processors. The number of integrals is substantial enough to support
scalability over a wide number of processors, and each processor will need to store only
those integrals for which it is assigned to calculate contributions. The current ACES II
implementation expands the permutations into the "unique" integral list and processes
each one separately, so this expanded list was distributed over processors instead of the
truly unique list.
For maximum software portability, the parallel algorithm needed to distinguish
between processors on the same physical computer and processors on remote computers.
The strategy was to distribute integral batches over physical computers (nodes) and to
distribute the integral list from each batch over processors on the same computer. After
the nodes finished processing their batches of integrals, the R2 contributions were
reduced and distributed back to all of the nodes. To minimize memory requirements on a
single node, only one integral list, T2 array, and R2 array were stored, and CPU threads
were created to process the integrals. The R2 array was divided into blocks of columns
(μν pairs), and each CPU was assigned to calculate all of the terms that contributed to a
particular set of blocks. This technique reduced false cache conflicts (when two CPUs
attempt to write to different parts of the same cache line at the same time), and it
increased the probability that a CPU would have a target R2 column in cache memory,
since multiple CPUs would be caching different columns. Figure 2-3 shows the
CPU/column assignments in a 4-CPU computer with 4 columns per block.
Figure 2-3. The column assignments for each of 4 CPUs updating R2. In this example,
there are 49 μν pairs (from 7 AOs).
For an arbitrary blocking factor (bf) and number of threads (Nth), the thread
assignment (from 0) of a column index (mn) can be determined with the following
expression, using integer division:
thread_id(mn, bf, Nth) = mod(mn-1, bf*Nth) / bf
If bf*Nth is a power of 2, then the mod function can be replaced by a bit-wise AND with
the mask (bf*Nth)-1, and if the blocking factor is a power of 2, then the division can be
replaced by bit shifting.
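A minimal sketch of this assignment and its power-of-two specialization (names are illustrative):

    def thread_id(mn: int, bf: int, nth: int) -> int:
        # Column mn (1-based) is cycled over nth threads in blocks of bf columns.
        return ((mn - 1) % (bf * nth)) // bf

    def thread_id_pow2(mn: int, bf_bits: int, mask: int) -> int:
        # mask = bf*nth - 1 and bf_bits = log2(bf), valid when both are powers of 2.
        return ((mn - 1) & mask) >> bf_bits

    bf, nth = 4, 4
    assert all(thread_id(mn, bf, nth) == thread_id_pow2(mn, 2, bf * nth - 1)
               for mn in range(1, 50))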
If the VVVV contributions are calculated in the MO basis, then LADAA and
LADAB become the bottleneck routines. These perform large two-dimensional matrix
multiplies, as do the other three bottlenecks: CNTRCT, T1W1AB, and RNGDRV.
Incidentally, the routine T12INT2 (1.1% of the UHF time) is also bound by matrix
multiplications. Parallelizing these routines involved distributing the columns of the
output matrix over CPU threads. Since the input and output matrices were all kept in
memory, the columns were divided once, unlike block cycling in the RDAO scheme,
which had irregular data access patterns. Figure 2-4 shows the distribution pattern for a
matrix multiply over 4 threads.
Figure 2-4. The distribution scheme for matrix multiplies over CPU threads. Every
thread can read the entire A matrix, but each thread is assigned a block of the
B matrix and consequently writes to a unique block of the C matrix.
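A sketch of this distribution over CPU threads (illustrative only; the production code operates on in-memory Fortran arrays):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def blocked_matmul(a, b, nthreads=4):
        # Every thread reads all of A; each owns a column block of B and writes
        # a disjoint column block of C, so no reduction is needed afterwards.
        c = np.empty((a.shape[0], b.shape[1]))
        blocks = np.array_split(np.arange(b.shape[1]), nthreads)
        def work(cols):
            c[:, cols] = a @ b[:, cols]
        with ThreadPoolExecutor(nthreads) as pool:
            list(pool.map(work, blocks))
        return c

    a, b = np.random.rand(64, 48), np.random.rand(48, 32)
    assert np.allclose(blocked_matmul(a, b), a @ b)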
The multiplications were not distributed over nodes since the parallel computer that
ran the profile had 16 CPUs per node. The time penalty incurred from the global
reduction (after the multiplication) would have greatly lessened the gains from
parallelization; however, nodes with fewer CPUs might benefit from distributing the rows
of the A matrix. All nodes but one would then have to initialize the C matrix to 0 (if the
operation increments pre-existing data), and the results would have to be reduced and
broadcast after the multiplication.
None of the maj or subroutines that required parallelization were heavily dependent
on I/O operations; however, many parallel computers (including the one that ran the
profiling calculations) have shared disk storage devices. This introduces competition
between some or all of the compute nodes when each one performs file read or write
operations. The DIIS routine (1% of the UHF time) is heavily dependent on I/O
performance, and the rate-determining operation can be described by the following:
do for each i
   do for each (x,y)
      B(x,y) += Tall(x,i) * Tall(y,i)
The number of rows in Tall (and the size of each dimension of B) is usually 5 but could be
any value (preferably a small one). The i index spans all of the T amplitudes, both T1 and
T2. Obviously the construction of a 5×5 matrix from a single 5×1 vector is negligible
compared to the time spent reading from and writing to disk. This operation was not
parallelized; instead, only one node was permitted to perform it.
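A small sketch of this accumulation, streaming the amplitude history in chunks since i spans every T1 and T2 amplitude (all sizes hypothetical):

    import numpy as np

    nhist, namp, chunk = 5, 10**6, 10**4
    b = np.zeros((nhist, nhist))
    rng = np.random.default_rng(0)
    for start in range(0, namp, chunk):
        t_chunk = rng.standard_normal((nhist, chunk))  # stand-in for a disk read
        b += t_chunk @ t_chunk.T                       # B(x,y) += sum_i T(x,i)*T(y,i)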
I/O competition is a low-level problem that shared storage architectures have. In
many cases, extensive software changes are required to overcome the increasing penalties
from read and write delays that serial programs consider to be constant. These penalties
manifest themselves in almost every routine that uses disk storage in the parallel profile.
The DCEE* test case was run with the ACES II CCSD program that had parallel
modifications as described in the previous section. The parallel performance of the major
subroutines is shown in Figure 2-5. The real time spent in each high-level routine is
shown in Figure 2-6. Only RDAO was distributed over every CPU in the calculation and
was 20.6% efficient in the 256-CPU (16×16) calculation. The three other high-level
routines were only distributed over the 16 CPUs in each node and had average threading
efficiencies of 70.1% (CNTRCT), 25.4% (T1W1AB), and 28.8% (RNGDRV).
Figure 2-5. The parallel speedup for each major UHF CCSD subroutine and the CCSD
executable. ~AOLAD contains the RDAO parallelization and the global reduction.
Figure 2-6. The runtime profile of parallel CCSD with a UHF reference from DCEE*.
The times are measured in seconds.
The overall parallel efficiency of the UHF CCSD calculation on 256 CPUs was
1.8% with a speed-up of 4.62. The calculation with the highest speed-up was 4×16 (64)
CPUs with an efficiency of 8.8%. The reason for such poor performance can be attributed
to I/O contention by all of the routines that were insignificant in the serial calculation.
Most of the routines listed in Figure 2-6 increase in time even though they had no parallel
modifications. While further parallel optimizations could be made to the current CCSD
program, two major drawbacks would still plague the parallel code: problem scalability
and expensive modify-profile development cycles (i.e., trial and error).
The worst aspect of the current approach is that tractability does not increase as
CPUs, memory, and disks are added to the pool of resources (i.e., the largest molecule
that can be studied is limited by the resources of a single node). There is no mechanism
for distributing data across memory or disks; however, if the current limitations are
adequate, then this model should be able to reduce the runtime to a reasonable amount.
Parallelizing existing serial software usually takes the bottom-up approach, in
which significant bottlenecks are parallelized and profiled. As new bottlenecks present
themselves, they too are parallelized and profiled. Unfortunately, by the time this cycle
ends (or becomes too expensive), so much of the software might have been rewritten that
it would have been cheaper to reformulate the algorithm using a top-down parallelization
approach. In the current CCSD program, no matter how well the algorithm is parallelized,
the software is still limited to performing calculations on systems that are calculable by a
single node. It would be more efficient in the long term to reformulate the CCSD
algorithm from the top down (reusing existing software wherever possible) and to include
proper data distribution and small task parallelization as design fundamentals.
Top-Down Parallelization Considerations
Although a few programs already calculate the CCSD energy in parallel, most of
them are limited to closed-shell references [Koc96,Kob97,Per03]. The exception is the
general tensor contractor of Hirata [Hir03], but the final program requires stored
integrals. The goal of the top-down parallelization approach will be to produce an open-
shell CCSD program capable of handling molecular systems with at least 500 AO basis
functions (although the internal target is 1000) and that can scale up to a few hundred (if
not a thousand) CPUs. These constraints almost necessitate calculating amplitude
contributions from AO integrals and distributing amplitudes across CPUs.
In addition to the VVVV integrals, the algorithm should also allow for direct
generation of the VVVO integrals. The extent of calculating amplitude contributions in
the AO basis instead of the MO basis is that the virtual orbitals are never formally
recognized by the equations other than to create complementary density matrices such as
that shown in Equation 2-2. These matrices are very important since they define how
amplitudes transition from backward- to forward-transformed relative to the MOs.
Q" = [efc (2-2)
The first step of the top-down approach involves cataloging the integral
contributions and identifying which output condition on the R amplitudes (forward- or
backward-transformed) would be best suited for their additions. Finding a data and task
distribution then becomes a matter of understanding how the integrals are processed and
which terms can be grouped together given a common output condition on the residual.
SPIN-UNRESTRICTED CCSD EQUATIONS
The T1 and T2 equations in spin-restricted Hartree-Fock (HF) CCSD have 11 and
29 contributions, respectively. If the reference wavefunction is not HF but the spin
orbitals are still restricted, then the T1 and T2 equations have 14 and 31 contributions,
respectively. The spin-unrestricted non-HF CCSD amplitude equations have a total of
222 contributions. Table 3-1 lists the spin-restricted T1 amplitude contributions and the
number of corresponding spin-unrestricted terms. Table 3-2 does the same for the T2
contributions [Cra98]. P(pq) is the antisymmetric permutation operator (Appendix B).
There are many ways to group and factorize the amplitude contributions. The
efficiency of each approach largely depends on the following:
* The chemical system being examined
* The CPU speed and efficiency
* The amount of local memory per process
* Local and remote (shared or exclusive) storage capacities
* Local and remote I/O latencies and bandwidths
* Interprocess message latency and bandwidth.
For example, if the CPU speed were incredibly slow compared to the speed of the I/O
reads and writes, then it might be faster to group as many subterms together as possible,
and accumulate the intermediate product to disk. Consider the sum X*A*B + X*C*D +
X*E*F, which factorizes to X*I provided I = A*B + C*D + E*F (first accumulated on
disk). In most cases, however, millions (if not billions) of CPU operations are performed
before a single byte of data is retrieved from disk, so this approach is relegated to those
systems whose CPU speeds are unusually slow compared to other hardware capabilities.
Table 3-1. The RHF CCSD T1 contributions (each spin-restricted contribution, its
diagram, and the number of corresponding spin-unrestricted terms)
The most common factorization strategy involves searching all possible
combinations of terms from either amplitude equation and finding the set of subterms that
is used most frequently. This implicitly assumes that the cost of recomputing the data is
greater than the cost of storing and retrieving the data from disk. In the MO basis, this is
certainly true (i.e., the amount of computation needed to generate all of the MO integrals
from AO integrals is quite high), but the cost of calculating AO integrals on demand is
negligible when multiple processes are in competition for shared I/O resources.
Table 3-2. The RHF CCSD T2 contributions (each spin-restricted contribution, its
diagram(s), and the number of corresponding spin-unrestricted terms)
The following assumptions guided the development of the factorization strategy
described in this chapter:
*There are at least 3 times as many virtual orbitals as occupied orbitals for each spin
*The two-particle data are too large to store in the memory of one process
*The VVVV and VVVO integrals are not transformed and stored on disk
*The CPU operations are several orders of magnitude faster than I/O latencies
*Latency is at least an order of magnitude more significant than bandwidth
*Each process is able to retrieve (GET) and increment (ACCUMULATE) data that
are stored by another process without interrupting the other process.
The first assumption follows from a study of basis sets that noted that cc-pVTZ is the
minimum basis set needed to describe electron correlation accurately [DunOO]. For
chlorine, there are 17 electrons and 34 contracted basis functions. This produces an
almost 1:3 occupied/virtual split in molecular orbitals, since half of the electrons have
alpha spin and half have beta spin.
Designing the algorithm such that two-particle terms (viz., VVOO) can be
distributed over PEs is necessary because the parallel program should be capable of
treating systems with 50 electrons and 500 basis functions (50/500), although 100
electrons and 1000 basis functions is the target (100/1000). For 50/500 systems, the three
VVOO spin combinations require 1.55 GB of memory. Considering the algorithm needs
at least two (for T2 and R2) held in core, a processor with 1 GB of memory will not be
able to store the minimum amount of data without overflowing to disk storage.
Furthermore, the VVVO and VVVV terms will require 59.8 and 568 GB of storage,
respectively. (This assumes a conversion factor of 1024.) The parallel algorithm should
then at least allow for the possibility that the terms that use these MO integrals can be
calculated from AO integrals generated on the fly.
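A back-of-the-envelope check of the 1.55 GB figure, assuming 8-byte reals, packed same-spin pairs, and the stated 1024-based conversion:

    GIB = 1024**3
    occ, virt = 25, 475                               # 50/500 system, per spin

    def pairs(n: int) -> int:
        return n * (n - 1) // 2                       # antisymmetric pair count

    same_spin = pairs(virt) * pairs(occ) * 8          # alpha-alpha = beta-beta
    mixed_spin = (virt * occ) ** 2 * 8                # alpha-beta block, unpacked
    print((2 * same_spin + mixed_spin) / GIB)         # about 1.55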
Today, the fastest disk storage devices are able to seek for random data in
approximately 3.7 milliseconds (measured on a Maxtor Atlas 15k 73 GB Ultra320 SCSI
hard drive). The slowest desktop CPU (to within reasonable expectations of performance)
operates at 1 GHz. (The Intel Pentium 4 line of CPUs can operate at 3.4 GHz.) Assuming
that most floating-point operations in CCSD require multiplications and additions and
that each operation completes in one clock cycle, then the maximum theoretical
multiply/add (fma) throughput is somewhere around 500 million fma/s. Together, the
data indicate that a slow CPU could theoretically deliver 1.85 million fma/seek, which at
least supports the assumption that computation is several orders of magnitude faster than
I/O latency. On the same hard drive, bandwidth bursts measure 73.7 MB/s, which
suggests that 280 kB of data could be transferred during the same time it took to locate the data.
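The arithmetic behind these figures, as a quick check:

    seek = 3.7e-3               # seconds per random seek
    fma_rate = 500e6            # fma/s for the slow 1 GHz CPU assumed above
    burst = 73.7e6              # burst bandwidth in bytes/s
    print(seek * fma_rate)      # about 1.85e6 fma per seek
    print(seek * burst)         # about 2.7e5 bytes per seek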
The final assumption follows from the latest version of the most popular message-
passing library, MPI (Message Passing Interface) [Gro99]. MPI-2 allows PEs in the same
communication network to retrieve (GET), increment (ACCUMULATE), and overwrite
(PUT) data that are stored by another CPU without communicating with it. In addition to
their own proprietary communication libraries, most vendors of parallel computers offer
implementations of the standard message passing libraries (like MPI) to ensure maximum
compatibility with the myriad parallel programs that depend on industry standards.
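A hedged sketch of this one-sided model in mpi4py (a modern MPI binding; this is not the dissertation's code): each PE exposes a window over local memory, and another PE may GET or ACCUMULATE without the owner posting a receive.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    local = np.zeros(4)                          # memory exposed to other PEs
    win = MPI.Win.Create(local, comm=comm)

    if comm.rank == 0 and comm.size > 1:
        buf = np.ones(4)
        win.Lock(1)                              # passive-target epoch on PE 1
        win.Accumulate(buf, 1, op=MPI.SUM)       # remote increment (ACCUMULATE)
        win.Unlock(1)
        win.Lock(1)
        win.Get(buf, 1)                          # remote read (GET)
        win.Unlock(1)

    comm.Barrier()
    win.Free()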
The main consequences of the aforementioned assumptions are that there must be
more than one process, that recomputation is better than reading from disk, and that
messages should be as large as possible. The primary goals of this parallelization effort
are scalability and addressability (i.e., the ability to treat a wide range of molecular
systems), so the grouping and factorization of terms should maximize message size and
minimize message frequency while minimizing the local scratch memory requirements.
Group 1: T2 Virtual Orbitals Coming from V
One of the smallest groups of terms consists of T2 contributions in which both
virtual orbitals come from the two-electron integrals. Equation 3-1 shows the ββ terms;
the αα terms are analogous. The V integrals are generated on the
fly in the AO basis, so the expression must be written in terms of AO integrals. Before
doing so, the second term must be split and rearranged to allow for better factorization.
$$r_{ij}^{ab} \mathrel{+}= \langle ab\|ij\rangle + P(ij)\sum_{c}\langle ab\|cj\rangle\, t_i^c + \tfrac{1}{2}P(ij)\sum_{cd}\langle ab\|cd\rangle\, t_i^c t_j^d + \tfrac{1}{2}\sum_{cd}\langle ab\|cd\rangle\, t_{ij}^{cd} \tag{3-1}$$

$$P(ij)\sum_{c}\langle ab\|cj\rangle\, t_i^c = \tfrac{1}{2}P(ij)\sum_{c}\left(\langle ab\|cj\rangle\, t_i^c + \langle ab\|ic\rangle\, t_j^c\right) \tag{3-2}$$

$$r_{ij}^{ab} \mathrel{+}= \tfrac{1}{2}P(ij)\sum_{\rho\sigma}\langle ab\|\rho\sigma\rangle\left(c_\rho^{\,i} c_\sigma^{\,j} + t_i^{\rho'} c_\sigma^{\,j} + c_\rho^{\,i} t_j^{\sigma'} + t_i^{\rho'} t_j^{\sigma'}\right) + \tfrac{1}{2}\sum_{\rho\sigma}\langle ab\|\rho\sigma\rangle\, t_{ij}^{\rho'\sigma'} \tag{3-3}$$

$$r_{ij}^{ab} \mathrel{+}= \tfrac{1}{2}\sum_{\rho\sigma}\langle ab\|\rho\sigma\rangle\left[P(ij)\left(c_\rho^{\,i} + t_i^{\rho'}\right)\left(c_\sigma^{\,j} + t_j^{\sigma'}\right) + t_{ij}^{\rho'\sigma'}\right] \tag{3-4}$$

$$r_{Ij}^{Ab} \mathrel{+}= \langle Ab|Ij\rangle + \sum_{C}\langle Ab|Cj\rangle\, t_I^C + \sum_{c}\langle Ab|Ic\rangle\, t_j^c + \sum_{Cd}\langle Ab|Cd\rangle\, t_I^C t_j^d + \sum_{Cd}\langle Ab|Cd\rangle\, t_{Ij}^{Cd} \tag{3-6}$$

$$r_{Ij}^{Ab} \mathrel{+}= \sum_{\rho\sigma}\langle Ab|\rho\sigma\rangle\left(c_\rho^{\,I} c_\sigma^{\,j} + t_I^{\rho'} c_\sigma^{\,j} + c_\rho^{\,I} t_j^{\sigma'} + t_I^{\rho'} t_j^{\sigma'}\right) + \sum_{\rho\sigma}\langle Ab|\rho\sigma\rangle\, t_{Ij}^{\rho'\sigma'} \tag{3-7}$$

$$r_{Ij}^{Ab} \mathrel{+}= \sum_{\rho\sigma}\langle Ab|\rho\sigma\rangle\left[\left(c_\rho^{\,I} + t_I^{\rho'}\right)\left(c_\sigma^{\,j} + t_j^{\sigma'}\right) + t_{Ij}^{\rho'\sigma'}\right] \tag{3-8}$$
Group 2: T2 Virtual Orbitals Coming from V and T
Set 1: One Virtual from V, One Virtual from T1
This set of T2 contributions is very similar to the Group 1 terms. The ββ increment
is shown (Equation 3-9). Since every term antisymmetrizes the contraction with respect
to ab, the permutation operator is removed and the increment is identified as non-
antisymmetric. The r̄ residual amplitudes (Appendix B) will be antisymmetrized with
respect to both ab and AB virtual pairs at the end of the iteration. Splitting the second
term, as in Equation 3-1, is also required.
$$r_{ij}^{ab} = P(ab)\sum_{k}\langle kb\|ij\rangle\, t_k^a + P(ab)P(ij)\sum_{kc}\langle kb\|ic\rangle\, t_k^a t_j^c + \tfrac{1}{2}P(ab)P(ij)\sum_{kcd}\langle kb\|cd\rangle\, t_k^a t_i^c t_j^d + \tfrac{1}{2}P(ab)\sum_{kcd}\langle kb\|cd\rangle\, t_k^a t_{ij}^{cd} \tag{3-9}$$

$$\bar r_{ij}^{ab} = \sum_{k}\langle kb\|ij\rangle\, t_k^a + P(ij)\sum_{kc}\langle kb\|ic\rangle\, t_k^a t_j^c + \tfrac{1}{2}P(ij)\sum_{kcd}\langle kb\|cd\rangle\, t_k^a t_i^c t_j^d + \tfrac{1}{2}\sum_{kcd}\langle kb\|cd\rangle\, t_k^a t_{ij}^{cd} \tag{3-10}$$

$$\bar r_{ij}^{ab} = \sum_{k} t_k^a \sum_{\rho\sigma}\langle kb|\rho\sigma\rangle\left[\tfrac{1}{2}P(ij)\left(c_\rho^{\,i}+t_i^{\rho'}\right)\left(c_\sigma^{\,j}+t_j^{\sigma'}\right) + \tfrac{1}{2}\, t_{ij}^{\rho'\sigma'}\right] \tag{3-11}$$

$$\bar r_{ij}^{ab} = \sum_{\kappa} t_{\kappa'}^{\,a} \sum_{\rho\sigma}\langle \kappa b|\rho\sigma\rangle\left[\tfrac{1}{2}P(ij)\left(c_\rho^{\,i}+t_i^{\rho'}\right)\left(c_\sigma^{\,j}+t_j^{\sigma'}\right) + \tfrac{1}{2}\, t_{ij}^{\rho'\sigma'}\right] \tag{3-12}$$
The αβ increments (Equation 3-13) are grouped by the bra-side T1 and then
expressed in terms of AO integrals and factorized.
$$r_{Ij}^{Ab} \mathrel{-}= \sum_{K} t_K^A \left[\langle Kb|Ij\rangle + \sum_{C}\langle Kb|Cj\rangle\, t_I^C + \sum_{c}\langle Kb|Ic\rangle\, t_j^c + \sum_{Cd}\langle Kb|Cd\rangle\left(t_I^C t_j^d + t_{Ij}^{Cd}\right)\right] \tag{3-13}$$

$$r_{Ij}^{Ab} \mathrel{-}= \sum_{k} t_k^b \left[\langle Ak|Ij\rangle + \sum_{C}\langle Ak|Cj\rangle\, t_I^C + \sum_{c}\langle Ak|Ic\rangle\, t_j^c + \sum_{Cd}\langle Ak|Cd\rangle\left(t_I^C t_j^d + t_{Ij}^{Cd}\right)\right] \tag{3-14}$$

$$r_{Ij}^{Ab} \mathrel{-}= \sum_{K} t_K^A \sum_{\rho\sigma}\langle Kb|\rho\sigma\rangle\left[\left(c_\rho^{\,I}+t_I^{\rho'}\right)\left(c_\sigma^{\,j}+t_j^{\sigma'}\right) + t_{Ij}^{\rho'\sigma'}\right] \tag{3-15a}$$

$$r_{Ij}^{Ab} \mathrel{-}= \sum_{k} t_k^b \sum_{\rho\sigma}\langle Ak|\rho\sigma\rangle\left[\left(c_\rho^{\,I}+t_I^{\rho'}\right)\left(c_\sigma^{\,j}+t_j^{\sigma'}\right) + t_{Ij}^{\rho'\sigma'}\right] \tag{3-15b}$$
Set 2: One Virtual from V, One Virtual from T2
The ββ set contains four T2 contributions that are ring contractions. Equations 3-16
through 3-18 show these terms and derive their AO basis expressions.
$$r_{ij}^{ab} \mathrel{+}= P(ab)P(ij)\left[\sum_{kc}\langle kb\|cj\rangle\, t_{ik}^{ac} + \sum_{KC}\langle Kb|Cj\rangle\, t_{iK}^{aC} + \sum_{kcd}\langle kb\|cd\rangle\, t_j^d\, t_{ik}^{ac} + \sum_{KCd}\langle Kb|Cd\rangle\, t_j^d\, t_{iK}^{aC}\right] \tag{3-16}$$

$$r_{ij}^{ab} \mathrel{+}= P(ab)P(ij)\sum_{\kappa\rho}\left[\sum_{\sigma}\langle \kappa b|\rho\sigma\rangle\left(c_\sigma^{\,j}+t_j^{\sigma'}\right)\right] t_{i\kappa}^{a\rho'} \tag{3-18}$$
Each half of the αβ increments contains four ring terms and two particle-hole
ladder terms. The MO and AO expressions are shown in Equations 3-19 and 3-20.
[Equations 3-19 and 3-20: the MO- and AO-basis expressions for the two halves of the αβ ring and PH-ladder increments.]
Group 3: T2 Virtual Orbitals Coming from T
Set 1: Rings and PH Ladders
There are 31 UHF T2 terms that belong to this set. Each process loads a block of
V integrals and calculates all of the contributions to the range of R2 amplitudes that it
is assigned to store. Grouping and factorizing the terms for performance is heavily
dependent on the manner in which the old T2 amplitudes are stored and accessed.

The terms were first cataloged and sorted into subsets according to αβ spin pairs.
For example, if the a index is stride 1, then a ring term of the form t^{AC}·V·t^{DB} must
step down in A and down in B for a given cd pair in V. To minimize the number of messages,
any other ring terms that have the same access pattern should be calculated at the same
time. From the 31 UHF terms, three access patterns emerged. Any term that did not prefer
one pattern to another was added to the pattern that had the most similar intermediate products.
Subset 1: AC/BD
Subset 1 contains 13 terms that step through T2 in the order AC/BD, which
corresponds to the order of the two bra indices. More specifically, for a given CD pair of
ket shells, the first (or outer) loop runs over A and the second (or inner) loop runs over B.
All of the αα T2 contributions are performed in this series.
[Equations 3-21 and 3-22: the αα R2 contributions of the AC/BD subset.]
V notation can be ambiguous with respect to whether AO integrals are antisymmetrized. Both electrons in
the terms in this section have at least one occupied MO index, so V notation is used to save equation width.
[Equations 3-23 and 3-24: the αβ contributions of the AC/BD subset.]
In terms of uncontracted ket indices of V, the partial contributions reduce to
Equations 3-25 through 3-27, in which each equation corresponds to a "left-side"
intermediate enclosed in parentheses.
[Equations 3-25 through 3-27: the AC/BD partial contributions, one "left-side" intermediate per equation.]
Subset 2: DB/CA
The DB/CA subset calculates 12 terms by looping first over B and then over A.
[Equations 3-28 through 3-31: the ββ, αα, and αβ contributions of the DB/CA subset.]
The MO equations can be reduced into four equations with four "right-side"
intermediates when they are expressed in terms of V with uncontracted ket indices.
[Equations 3-32 through 3-35: the DB/CA updates expressed with four "right-side" intermediates.]
Subset 3: CB/AD
The remaining terms are primarily PH ladder terms and the outer loop is over B. If
the bra shells are thought of as a square VV matrix, then the main difference in loops
between this subset and Subset 2 is that the inner loop A steps down the D column, while
the inner loop A of Subset 2 steps across the C row.
[Equations 3-36 through 3-38: the PH-ladder contributions of the CB/AD subset.]
Factorizing the MO expressions in terms of right-side intermediates and
uncontracted ket indices yields Equations 3-39 through 3-41.
[Equations 3-39 through 3-41: the factorized CB/AD updates.]
Set 2: HH Ladders from OOVV, OOOV, and OOOO
The ladder term that is quadratic in T2 can be calculated in one of two ways: one
with a HH ladder intermediate and one with a PP ladder intermediate. Opting for the HH
ladder intermediate saves a large amount of memory and memory writes. Three other
terms have HH ladder intermediates from the OOVV integral and can be factorized with the
quadratic term. The ββ contributions are shown (Equation 3-42).
$$r_{ij}^{ab} \mathrel{+}= \tfrac{1}{4}\sum_{klcd}\langle kl\|cd\rangle\, t_{kl}^{ab}\, t_{ij}^{cd} + \tfrac{1}{4}P(ij)\sum_{klcd}\langle kl\|cd\rangle\, t_i^c t_j^d\, t_{kl}^{ab} + \tfrac{1}{4}P(ab)\sum_{klcd}\langle kl\|cd\rangle\, t_k^a t_l^b\, t_{ij}^{cd} + \tfrac{1}{4}P(ij)P(ab)\sum_{klcd}\langle kl\|cd\rangle\, t_i^c t_j^d t_k^a t_l^b \tag{3-42}$$
The four ladder-like terms that come from OOOO and OOOV integrals can be
combined with the OOVV ladders when the ket orbitals are from the plain AO integral.
Equation 3-43 shows the MO contributions.
$$r_{ij}^{ab} \mathrel{+}= \tfrac{1}{2}\sum_{kl}\langle kl\|ij\rangle\, t_{kl}^{ab} + \tfrac{1}{2}P(ij)\sum_{klc}\langle kl\|cj\rangle\, t_i^c\, t_{kl}^{ab} + \tfrac{1}{2}P(ab)\sum_{kl}\langle kl\|ij\rangle\, t_k^a t_l^b + \tfrac{1}{2}P(ij)P(ab)\sum_{klc}\langle kl\|cj\rangle\, t_i^c t_k^a t_l^b \tag{3-43}$$
Combining the two equations, incrementing pre-antisymmetrized amplitudes, and
expressing the ket indices in the uncontracted AO basis yields Equation 3-45.
$$\bar r_{ij}^{ab} \mathrel{+}= \tfrac{1}{4}\sum_{kl}\left[\sum_{\rho\sigma}\langle kl|\rho\sigma\rangle\left(P(ij)\left(c_\rho^{\,i}+t_i^{\rho'}\right)\left(c_\sigma^{\,j}+t_j^{\sigma'}\right)+t_{ij}^{\rho'\sigma'}\right)\right]\left(t_{kl}^{ab}+t_k^a t_l^b - t_k^b t_l^a\right) \tag{3-45}$$
The αβ terms and factorized increment are shown in Equations 3-46 through 3-48.

[Equations 3-46 and 3-47: the MO-basis αβ HH-ladder terms and their regrouping.]

$$r_{Ij}^{Ab} \mathrel{+}= \sum_{Kl}\left[\sum_{\rho\sigma}\langle Kl|\rho\sigma\rangle\left(\left(c_\rho^{\,I}+t_I^{\rho'}\right)\left(c_\sigma^{\,j}+t_j^{\sigma'}\right)+t_{Ij}^{\rho'\sigma'}\right)\right]\left(t_{Kl}^{Ab}+t_K^A t_l^b\right) \tag{3-48}$$
Group 4: T2 into T1
The T2 amplitudes contribute to the R1 amplitudes in 14 terms. These terms can be
grouped together such that they define three one-particle intermediate operators, which
effectively reduce the particle number of the perturbation. Equations 3-49 and 3-50 show
the terms that define the IP (particle) and IH (hole) intermediates, respectively. Equation
3-51 contains both spin cases of the IR (ring) intermediate.
[Equations 3-49 through 3-51: the T2-into-R1 contributions grouped to define the IP, IH, and IR intermediates.]
The terms that define the intermediates have been isolated in Equations 3-52
through 3-54. Both spin cases of the IR intermediate are shown because they both
contribute to the β R1 amplitudes. (The terms with the Fock matrix elements are not part of
the intermediate.) The α spin cases of IP and IH are analogous to the β cases.

[Equations 3-52 through 3-54: the isolated definitions of the IP, IH, and IR intermediates.]
Group 5: One-Particle Transformations
Set 1: T1 Contributions
The 12 remaining T1 contributions can be loosely described as one-particle
transformations. The α terms are analogous to the β terms, which are shown. These are
then drastically simplified with the IR intermediate, and further consolidation yields a
non-HF set in Equation 3-61 and an IR set in Equation 3-62.
[Equations 3-55 through 3-62: the β-spin one-particle T1 transformations, including the Foo and Fvv forms, the non-HF set (Equation 3-61), and the IR set (Equation 3-62).]
Set 2: T2 Contributions
Two one-particle transformations describe the final T2 ββ contributions.

[Equations 3-63 and 3-64: the Foo(2)- and Fvv(2)-type contractions with T2.]
Similar to the F′ intermediates used to calculate the T1 contributions, the one-
particle transformation matrix for T2 into R2 can be defined in terms of IP, IH, and IR, and
using these definitions, the ββ and αβ increments can be trivially reduced to four terms
(Equations 3-67 and 3-68).
[Equations 3-65 and 3-66: the F′ transformation matrices defined in terms of IP, IH, and IR.]

$$\bar r_{ij}^{ab} \mathrel{+}= \sum_{c} F'_{bc}\, t_{ij}^{ac} - \sum_{k} F'_{kj}\, t_{ik}^{ab} \tag{3-67}$$

$$r_{Ij}^{Ab} \mathrel{+}= \sum_{C} F'_{AC}\, t_{Ij}^{Cb} + \sum_{c} F'_{bc}\, t_{Ij}^{Ac} - \sum_{K} F'_{KI}\, t_{Kj}^{Ab} - \sum_{k} F'_{kj}\, t_{Ik}^{Ab} \tag{3-68}$$
PARALLEL ALGORITHM

Virtual Parallel Computer
Before any algorithm can be parallelized, the type of parallel computer that the
algorithm will run on must be identified. Is remote memory accessible through one-sided
operations or must each process explicitly interrupt and request data from another
process? How much data will be needed during each section of the algorithm and can that
data be held by a single process? Does each process have the ability to read data from a
storage device, and if so, what is the penalty for multiple processes reading data
simultaneously? These are the types of questions that must be answered before making
assumptions, which will shape the plan for parallelization.
The most basic parallel computer is a network of workstations, in which each node
is a single CPU computer with a private memory space and a storage device--usually a
hard disk. The most common connection fabric for these networks is TCP/IP over
Ethernet. This allows for any node to communicate with any other node in the network,
and the network hardware or operating system software will handle delivering the
messages (as opposed to each process having to maintain a table of which nodes are
directly accessible and which nodes must be reached through other nodes). The cost of
maintaining these networks is not linearly proportional to their size, so it is common for
500+ node networks to remove the most failure-prone element, the hard disk, and provide
other means for each node to conduct I/O operations.
The extreme opposite of this picture is a single computer with multiple CPUs,
which are all able to access the same memory space and hard drive(s) with equal
performance. Each CPU is identical to the others, so these are commonly referred to as
Symmetric Multi-Processor (SMP) computers; however, there are tremendous costs
associated with adding each additional CPU, so it is unusual to have more than 8 CPUs in a
truly SMP machine. Some manufacturers have mini-networks within a single computer
that allow them to increase the number of CPUs without making them pure SMP. To an
application programmer, these machines are programmed in the same way as Symmetric
Multi-Processors, but the true label is Shared Memory Programming (also SMP!) even
though the CPUs are not identical in terms of distance to memory, cache, and possibly
hard drives. Henceforth, there will be no distinction between the meanings of SMP since
the parallel CCSD algorithm will be implemented as a user process and not a system or
kernel process.
The balance between cost and performance lies somewhere between the two
extreme types of parallel computers. The primary offerings of the major vendors of
supercomputers are clusters of SMP machines. For fine-grain parallelization, each node
has multiple CPUs accessing the same memory, and for coarse-grain parallelization, each
node has physically separate resources. The virtual parallel computer that the parallel
CCSD algorithm relies on is this type of cluster of SMP workstations; however, program
portability and the ability to tune the network parameters are very desirable.
The only way to build portability and tunability into the program without having
divergent paths through the software is to model the physical computer in the boundaries
of network communication. By properly defining certain sub-networks, no process will
attempt to communicate through slow network channels when faster ones may exist;
furthermore, messages pertaining to a certain context can be limited to regions of the
network, which might reduce network contention on some architectures. This allows the
scope of communications to be set once at program run time (or compile time) and the
algorithm will transparently acknowledge the underlying hardware up to the efficiency of
the partitioning scheme.
In some sense, parallelizing algorithms that handle massive amounts of data and
computations is akin to managing disaster recovery. The moment the event occurs, there
are many diverse and unused workers who have no assignment or direction. As the
situation is assessed, teams of workers are sent to handle different aspects of the recovery
from the time critical, like search and rescue of survivors, to the time forgiving, like
public relations and cost reimbursement.
Wildfire firefighters have some of the most extensive experience in rapid
deployment and management of labor units. In the late 1970s, the Incident Command
System (ICS) was created in response to a series of wildfire disasters in southern
California [Fed98]. It has since been adopted and refined by several federal and state
agencies and provides the basis for most emergency and non-emergency incident
responses. The parallel CCSD algorithm uses the following aspects of its effectiveness:
* Modular/scalable organization
* Manageable span of control
* Unit designations.
The lowest level of the organization is the single process (normally assigned to a
single CPU). Every process must have its own memory heap, and the combination of a
process and its memory is called a processor element (PE). PEs recognize other PEs on
the same physical node and form squads (SQD). Squads can be grouped together to form
platoons (PLT) based on algorithm requirements, and platoons can be grouped together to
form companies (CO), which execute large-scale algorithms.
Although the units were listed in increasing size, their division and recognition
happen in decreasing size. The commanding process evaluates the available resources
and divides the group into subgroups accordingly. For example, the Company
Commander (CO_CDR) would catalog the total distributed memory and channels to disk
and form platoons based on how the company algorithm uses those resources. The squads
would then self-assemble within the platoons based on which node the PEs are on.
Each division has its own communication space that does not interfere with
messages in another division, and there are special communication spaces for the
commanders of the subdivisions. This allows platoons to synchronize independently from
each other, and more importantly, it allows company algorithms to disregard the
messages of other companies. The company that performs the CCSD iterations and
calculates the T amplitudes is called the Amplitude Company. Figure 4-1 illustrates the
communication boundaries for a company and its subdivisions.
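The nested communication spaces map naturally onto MPI communicators. The following is a minimal sketch (not the dissertation's implementation) of how a company could be carved into platoons and squads with mpi4py; the platoon count, the hostname-based squad detection, and all variable names are illustrative assumptions.

    from mpi4py import MPI

    world = MPI.COMM_WORLD
    co_net = world.Dup()                        # the Amplitude Company network

    # Platoons: divide the company ranks into n_plt groups (resource-driven).
    n_plt = 2
    plt_color = co_net.Get_rank() * n_plt // co_net.Get_size()
    plt_net = co_net.Split(color=plt_color, key=co_net.Get_rank())

    # Squads: PEs on the same physical node recognize each other by hostname.
    host = MPI.Get_processor_name()
    hosts = plt_net.allgather(host)
    sqd_net = plt_net.Split(color=sorted(set(hosts)).index(host),
                            key=plt_net.Get_rank())

    # A commanders-only space: rank 0 of each platoon joins a leaders network;
    # everyone else receives MPI.COMM_NULL.
    is_cdr = plt_net.Get_rank() == 0
    plt_cdr_net = co_net.Split(color=0 if is_cdr else MPI.UNDEFINED,
                               key=co_net.Get_rank())

Because each Split produces an independent communication context, platoon-level fences and company-level collectives cannot interfere with one another, which is the property the text relies on.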
Arrays stored in distributed memory
The fundamental data elements involved in standard CC theory are T amplitudes,
integrals, and the reference wavefunction. The predominant method for solving the T
amplitude equations is iterating over a coupled set of equations for T1 and T2. This makes
the amplitudes stateful at each iteration compared to the two-electron integrals, which are
stateless. Thus, the minimum amount of information that must be stored during the CC
iterations is TOLD (T1 and T2), TNEW (R1 and R2), and any information that defines the
reference wavefunction. The integrals do not need to be stored since they can be
generated on the fly.
Figure 4-1. Communication network boundaries. Rounded rectangles represent
communication networks (in italics). Square rectangles represent physical
entities. Dashed square rectangles could be redundant entities (i.e., a platoon
commander will probably be the leader of the first squad).
It is assumed that a single PE cannot store a full OOVV quantity. Even for SMP
machines, a single OOVV array might exceed the total physical memory of the node, so
the T2 old and new arrays must be distributed over a set of PEs. It is also assumed that
each PE is able to retrieve and increment remote memory with one-sided communications
(i.e., without interrupting the process that allocated the memory). If the total distributed
memory is many times larger than the amount of data that must be stored, then
subdividing the Amplitude Company into platoons such that multiple copies of the
amplitudes are available for reading will decrease the average distance between a PE and
the distributed data that it attempts to access.
In this algorithm, the number of PEs in the Amplitude Company is almost
irrelevant, but the total amount of distributed memory is critical. The Company must be
able to store TOLD, TNEW (R), and the larger of T2 or the amount of scratch space needed
per PE, which depends on the chemical system (Figure 4-7). Each PE will have its own
local copy of small frequently accessed arrays, which include basis set definitions, AO
coefficients, T1 and R1 amplitudes, and other one-particle intermediates. While the PEs
are calculating contributions to the R amplitudes, only T2 and R2 need to be accessed in
addition to local scratch memory. Between certain sections, R2 must be transformed,
which requires enough scratch memory for storing the third distributed OOVV array.
Figure 4-7. The memory heap management for various molecular systems. Each PE's
heap holds reserved memory plus distributed blocks of T2 and R2. In one
scenario (Example 1), the minimum scratch memory needed for amplitude
contributions is less than the amount of memory needed to store a section of
R2 during a transformation (min scr < R2 block). The other scenario
(Example 2) is when the opposite occurs (min scr > R2 block).
Example 1. The user runs a calculation that requires 128 MB of reserved data per
PE, such as basis set information, one-particle matrix elements, etc. The calculation
requires another 128 MB (minimum) of scratch memory per PE for calculating various
quantities during the CCSD iteration, and the OOVV quantities require 4096 MB of
distributed memory from all of the PEs. If the user submits the job with 1024 MB of heap
per process, what is the minimum number of PEs needed to perform the calculation?
Alternatively, does the calculation have enough resources if the user submits the job with
X PEs, each with 1024 MB of memory?
Solution 1. The amount of non-reserved memory is 1024-128 MB/PE (i.e., 896
MB is available for scratch and distributed data on each PE). Three OOVV arrays require
12,288 MB. Dividing this amount by 896 MB/PE and rounding up yields a minimum of
14 PEs. Distributing each OOVV array over 14 PEs requires each PE to store
approximately 293 MB of data per array. Since this amount is larger than the minimum
amount of scratch memory needed (128 MB), the calculation can proceed with 14 or
more PEs.
Example 2. Using the same system data as above, but increasing the minimum
required amount of scratch memory to 512 MB, what is the minimum number of PEs
needed to run the calculation?
Solution 2. This time, the regular scratch memory is larger than the distributed
OOVV scratch array. In essence, this memory becomes unavailable for distributed data,
so the effective amount of memory per PE for the OOVV arrays is (896-512) = 384 MB. Since the
third OOVV array can fit in the scratch memory, only two OOVV arrays, totaling 8192
MB, need to be distributed over the Amplitude Company. 8192 MB divided by 384
MB/PE yields 22 PEs (rounded up). This means each PE will store approximately 187
MB of each OOVV array, and the calculation can proceed if there are at least 22 PEs.
The minimum PE test can be described by the following pseudo-code:
mem_dist = mem_heap - mem_reserved
min_PEs = ceiling( 3 * mem_oovv / mem_dist )
mem_block = mem_oovv / min_PEs
if (mem_block < mem_minscr) then
    mem_dist -= mem_minscr
    min_PEs = ceiling( 2 * mem_oovv / mem_dist )
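As a check on the two worked examples, the pseudo-code translates directly into a few lines of Python (a sketch; the function and variable names are mine, and all sizes are in MB):

    import math

    def min_PEs(mem_heap, mem_reserved, mem_oovv, mem_minscr):
        mem_dist = mem_heap - mem_reserved           # heap left per PE
        n = math.ceil(3 * mem_oovv / mem_dist)       # distribute all three arrays
        if mem_oovv / n < mem_minscr:                # R2 block fits inside scratch,
            mem_dist -= mem_minscr                   # so reserve scratch up front
            n = math.ceil(2 * mem_oovv / mem_dist)   # and distribute only two
        return n

    print(min_PEs(1024, 128, 4096, 128))             # Example 1: 14 PEs
    print(min_PEs(1024, 128, 4096, 512))             # Example 2: 22 PEs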
If the user runs the job with more than twice the minimum number of PEs, then the
Amplitude Company can divide into multiple platoons, each with its own copy of TOLD
and TNEw. One of the benefits of this approach is that fewer PEs will access each block of
distributed memory, but the main advantage is that knowledge of the underlying
hardware can be programmed into the partitioning scheme and the algorithm will
transparently align itself with the hardware boundaries.
Arrays stored on disk
The VVVV and VVVO integrals will not be transformed and stored on disk, so
their contributions must be calculated on the fly in the AO basis. The OOOO and OOOV
integrals will not be stored on disk either, not because they are too large, but because
their contributions can be accounted for with intermediates from the direct integrals and
from the stored OOVV half-transformed integrals. The stored integrals can be distributed
over and individually processed by the platoons of the Amplitude Company.
Parallel UHF CCSD
Each CC iteration begins with initializing the global intermediates and residual
amplitudes to zero. The TOLD amplitudes are defined by Equation 4-1. This formulation
allows the amplitudes to contract directly with AO integrals. The T2 residual amplitudes
will have different definitions depending on which part of the algorithm is executing. In
the Direct Integral (DI) phase, one of the virtual indices is coming from an uncontracted
AO integral, so the residual is related to the MO residual amplitudes by Equation 4-2.
$t^{\mu}_{i} = \sum_{a} c^{a}_{\mu}\, t^{a}_{i}; \qquad t^{\mu\nu}_{ij} = \sum_{ab} c^{a}_{\mu}\, c^{b}_{\nu}\, t^{ab}_{ij}$   (4-1)

$r^{Ab}_{Ij} = \sum_{\nu} c^{b}_{\nu}\, r^{A\nu}_{Ij}$   (4-2)
The Stored Integral (SI) and One Particle (OP) phases calculate residual
contributions with both virtual orbitals coming from T, so the DI residual is transformed
according to Equation 4-3.
Throughout the iteration, each PE stores all OO pairs (ket indices) for a range of
AO shell pairs (bra indices). The amplitudes are distributed over the platoon (Figure 4-2),
and one-sided communications retrieve and increment all OO pairs for a block of AO
shell pairs.
Figure 4-2. The T2 and R2 memory layout. Each square represents the four-dimensional
array of amplitudes that have the designated spin pair and whose AO bra
indices belong to the MN shell pair. The arrows represent the order in which
the blocks are laid out in memory.
After all of the contributions have been calculated, the residual is antisymmetrized
with respect to the two virtual indices and back-transformed to the MO basis. The new T
amplitudes are calculated by dividing each amplitude by the diagonal elements of the
Fock matrix.
Direct Integral Phase
Analyzing the T2 contribution in Equation 4-4 can summarize the DI phase. Since
the VVVO integrals are not transformed and stored prior to the CC iterations, the terms
must be evaluated in the AO basis.
$r^{Ab}_{Ij} \mathrel{+}= -\sum_{K,Cd} t^{A}_{K}\, \langle Kb \mid Cd \rangle\, t^{Cd}_{Ij} - \sum_{l,Cd} t^{b}_{l}\, \langle Al \mid Cd \rangle\, t^{Cd}_{Ij}$   (4-4)

$r^{A\nu}_{Ij} \mathrel{+}= -\sum_{\kappa} t^{A}_{\kappa} \sum_{\lambda\sigma} (\kappa\nu \mid \lambda\sigma)\, t^{\lambda\sigma}_{Ij} - \sum_{\mu} c^{A}_{\mu} \sum_{\kappa,\lambda\sigma} t^{\nu}_{\kappa}\, (\mu\kappa \mid \lambda\sigma)\, t^{\lambda\sigma}_{Ij}$   (4-5)
The biggest problem with this situation is that every AO integral causes a global
update to R2 (i.e., one integral contributes to every R2 amplitude). The best way to reduce
the number of operations is to precondition R2 such that a smaller set of VVVO
contributions is calculated. R2 would then be post-transformed after the contributions
have been accumulated (Equations 4-6 through 4-8).

$r^{A\nu}_{Ij} = \sum_{b} c^{b}_{\nu}\, r^{Ab}_{Ij}$   (4-6)

$r^{A\nu}_{Ij} \mathrel{+}= -\sum_{\kappa} t^{A}_{\kappa} \sum_{\lambda\sigma} (\kappa\nu \mid \lambda\sigma)\, t^{\lambda\sigma}_{Ij}$   (4-7)

$r^{Ab}_{Ij} = \sum_{\nu} c^{b}_{\nu}\, r^{A\nu}_{Ij}$   (4-8)
For each integral, only an IjA strip of R2 amplitudes is incremented compared to
the full set without the backward and forward transformations of R. The difficulty with
the αβ amplitudes is that while the β side can get away with 1/V updates, the α side still
requires a global update per integral (Equation 4-9). The μ orbital must be forward
transformed into A, but the ν orbital must be transformed into b by T1, which must then
be back-transformed by c. One solution is to calculate the two spin cases in separate
passes (while transforming R between them). Although this requires generating the AO
integrals a second time, the performance of a naturally parallel local computation is much
greater than a global update per N4 matrix element.

$r^{\mu\nu}_{iJ} \mathrel{+}= -\sum_{\kappa} \Big( \sum_{b} c^{b}_{\nu}\, t^{b}_{\kappa} \Big) \sum_{\lambda\sigma} (\mu\kappa \mid \lambda\sigma)\, t^{\lambda\sigma}_{iJ}$   (4-9)
A second performance issue is antisymmetrization of the same-spin integrals.
Equation 4-10 shows one of the up ring contributions in MOs and Equation 4-11 shows
the terms appropriately transformed into the AO basis.
rlb + = b cd tjetfd (4-10)
rIf'"+= prp jt"(4-11)
The AO integrals are not antisymmetrized by the integral generator, so one of two
approaches must be taken to account for both contributions. The first approach involves
two updates per integral, one for the Coulomb contribution and one for the exchange
contribution.
The benefit of this method is that the PE needs only enough memory to store the AO
integral batch and the T2 batch; however, two updates per integral is more costly than two
reads per integral, which can be achieved by antisymmetrizing the ρσ pair of integrals.
This requires either two temporary T2 blocks or two serial loads. If there is enough
memory, then the exchange integral can be calculated with the regular Coulomb integral
and the antisymmetrized contribution can be calculated with one read and one update.
Since shell quadruplets of integrals are typically not more than a few hundred kilobytes,
the algorithm will assume this can be done. The permutational symmetry of the
antisymmetrized integrals is not the same as that of the plain Coulomb integrals so they
must be generated again in the middle of the loop over permutations. Although two
updates or two reads are avoided, the integrals need to be calculated three times per AO
shell quadruplet. Considering the cost of messages compared to the speed of locally
computing a shell quad, this is a favorable tradeoff.
From these considerations, it is clear that the DI phase must calculate the unique
integrals six times even though the AO integrals have 8-fold permutational symmetry.
There are two passes (one for αβ and one for βα) and there are three generations per
pass--one Coulomb and two exchange integrals. This might seem expensive, but the cost
of doubling or quadrupling the number of messages, which does not scale linearly with
the number of PEs, far outweighs the cost of the naturally parallel integral computation.
A critical aspect of this algorithm that makes or breaks the performance is one-
sided communications. A one-sided GET of a T2 block can be in flight while the PE
calculates terms that do not involve the T2 amplitudes. Immediately after the amplitudes
have been read, the next block of amplitudes can be requested, even if the PE is not ready
to use them. This technique is also used to hide the communication of R2 updates. The
main difference between the two arrays is that the T2 amplitudes are read once per GET,
while the R2 amplitudes are incremented two or three times per ACCUMULATE. Ideally,
there would be two R2 buffers, one active and one pending, that increase the time
available for each ACCUMULATE. If the user must run the calculation in limited
memory, then it is possible to still benefit from one-sided ACCUMULATEs, but the PE
might have to wait for a while before initializing the buffer for the next R2 contribution.
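The request-ahead pattern can be sketched with MPI one-sided reads. This is an illustration of the double-buffering idea, not the actual program; the block-to-owner map, the block sizes, and the contraction step are placeholders.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    nblk, blk = 8, 1024                     # T2 block count, doubles per block
    nlocal = -(-nblk // comm.Get_size())    # ceil: blocks stored on this PE
    local = np.zeros(nlocal * blk)          # this PE's slice of distributed T2
    win = MPI.Win.Create(local, disp_unit=8, comm=comm)

    owner = lambda b: b % comm.Get_size()          # placeholder block->rank map
    disp = lambda b: (b // comm.Get_size()) * blk  # placeholder block offset

    buf = [np.empty(blk), np.empty(blk)]    # active and pending buffers
    win.Lock_all()                          # passive-target access epoch
    req = win.Rget(buf[0], owner(0), target=(disp(0), blk, MPI.DOUBLE))
    for b in range(nblk):
        req.Wait()                          # free if the GET already landed
        if b + 1 < nblk:                    # request the next block before
            req = win.Rget(buf[(b + 1) % 2], owner(b + 1),
                           target=(disp(b + 1), blk, MPI.DOUBLE))
        t2 = buf[b % 2]
        t2.sum()                            # stand-in for the real contraction
    win.Unlock_all()
    win.Free()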
The main driver for the DI phase loops over AO shell quadruplets of integrals. The
MN shell pairs are distributed over all of the PEs in the Amplitude Company. This
distribution can happen a number of ways: round robin, blocked, block cyclic, j ob board,
master/slave, etc. How a PE processes each MN shell pair is independent of the shell pair
assignment, so this can be tweaked outside the main processing loop. Ideally, there would
be a globally accessible integer that corresponds to the next MN pair that must be
processed. The integer must be atomically read and incremented by each PE to get its
next assignment. This balances the load better than a fixed distribution algorithm like
round robin or block cyclic, and it performs the same function as a master process
sending out new tasks without actually needing a process to sit idle waiting for requests.
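A globally accessible, atomically incremented integer is exactly what MPI_Fetch_and_op provides. The following is a small assumed sketch of the counter hosted on rank 0, with a hypothetical process_MN_pair worker standing in for the shell-pair processing loop.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    n_pairs = 100                                   # number of MN shell pairs

    counter = np.zeros(1, dtype='i') if comm.Get_rank() == 0 else None
    win = MPI.Win.Create(counter, comm=comm)        # rank 0 hosts the counter

    def process_MN_pair(mn):                        # hypothetical worker:
        pass                                        # generate RS pairs, contract

    one = np.ones(1, dtype='i')
    mine = np.empty(1, dtype='i')
    while True:
        win.Lock(0)                                 # exclusive lock on rank 0
        win.Fetch_and_op(one, mine, 0, 0, MPI.SUM)  # mine = old value; counter += 1
        win.Unlock(0)
        if mine[0] >= n_pairs:
            break                                   # all MN pairs handed out
        process_MN_pair(mine[0])
    win.Free()

Fast PEs simply loop back to the counter more often, which gives the master/slave load balancing without a dedicated idle master process.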
Once an MN shell pair is identified, the PE is responsible for generating and
processing all unique RS shell pairs that correspond to the MN upper bound. This
prevents any PE from processing Coulomb or exchange integrals that another PE is
responsible for. It also allows for a single generation of the integrals for each pass if there
is enough scratch memory (i.e., the PE would generate all RS pairs and internally
antisymmetrize them before beginning the processing loop). Typically, however, there
will be enough memory for two shell quads and the scratch buffers for contracting terms.
A shell quadruplet of integrals will contribute to all of the Group 1 terms, the
Group 2 terms, and the Group 4 intermediates. Table 4-1 summarizes the stages of the
processing loop and which elements are accessible at each stage.
Table 4-1. The direct integral processing loop summary
Stage      Operation                    Produces
prologue   (IH, I2) += f(V,C,T1,T2)     all Group 1 terms and the IH intermediate
stage 1    R2 += f(V,C,T1,T2,I2)        all contributions to R2 from Group 2 and Group 1
stage 2    IP += f(V,C,T2)              the IP intermediate
epilogue   IR += f(V,C,T1)              the IR intermediate
The Group 1 terms are saved in a two-particle intermediate during the loop over the
T1 amplitudes from the Group 2 terms. The I2 intermediate is then contracted with Q
(C*C) and the Group 1 contribution is sent with the Group 2 contribution to increment
R2. If there is not enough scratch memory to store the I2 intermediate, then it is possible
to calculate these terms in a single direct integral pass separately from the Group 2 terms
or the IP or IR intermediates. Table 4-2 has the detailed order of operations for the
prologue of processing a shell quad. The w one-particle matrix element is defined in
Equation 4-14.

$w^{i}_{\lambda} = c^{i}_{\lambda} + t^{i}_{\lambda}$   (4-14)
When the prologue finishes, there are two important quantities stored in memory:
the Group 1 contribution and the T2 amplitudes that generated it. Since the Group 2
contributions loop over all L shells for a given S shell, it is convenient to begin the loop
with L=R. Table 4-3 lists the detailed order of operations for the L loop. The x one-
particle matrix element is defined in Equation 4-15.
$x^{j}_{\mu} = \sum_{\nu} q_{\mu\nu}\, t^{j}_{\nu}$   (4-15)
After the last value of L, there is still a pending R2 ACCUMULATE, which must
be completed. Rather than WAIT immediately after the loop, the IR intermediate, which
requires only T1, C, and V, can be calculated. Furthermore, if there are more
permutations of V, then the PE can GET the next T2(RS) block and permute the integrals.
Table 4-4 lists the detailed order of operations for the permutation epilogue.
The first four permutations of V involve the same two sets of shell quads.
The second set of four permutations requires a different exchange shell quad, so the
integral generator is invoked a third time per MNRS configuration.
Figure 4-8 shows the movement of data needed to calculate the contributions to R2
from an integral shell pair.
Figure 4-8. The data flow in the Direct Integral phase. Solid arrows represent one-sided
messages, and dashed arrows represent reading and incrementing the global
shell pair counter. Each PE retrieves all T2 amplitudes that contract with the
integral and accumulates the contributions into the R2 array.
When the Amplitude Company has calculated all of the MN shell pairs, the R2
distributed arrays are transformed for the phase that processes the stored OOVV
integrals. The one-particle global intermediates could be reduced here for use in the One
Particle phase. The following pseudo-code lists the entire flow of control from the
beginning of the CC iteration up to the start of the SI phase.
zero(R2, R1, I1)
W1 = C + T1
do spin = 2, 1, -1
    X1 = Q * C*T1
    do while (more MN)
        do all RS
            G = intgen(M,N,R,S)
            H = G - intgen(M,N,S,R)
            do perm = 1, 8
                nLeft = Nshells
                do while (nLeft != 0)
                    nLeft -= 1
                    L = 1 + mod(L, Nshells)
                end do nLeft
            end do perm
        end do RS
    end do MN
    if (spin == 2) then
        global R2(L,N) = Q(N,D) * R2(D,M) * Q(M,L)
    else
        global R2(L,D) = R2(L,N) * Q(N,D)
    end if
end do spin
REDUCE(I1, CO_CDR)
Stored Integral Phase
The SI phase calculates all of the Group 3 terms from the stored OOVV integrals.
Each PE reads one integral block at a time, which contains all spin pairs for a particular
ρσ shell pair. The mechanism
for obtaining integrals depends on the performance and architecture of the I/O system. If
only one PE per node (or squad) is allowed to read from disk--assuming this is an
operating system constraint--then the other PEs in the squad will have to request the
blocks from that I/O PE. The I/O PE will then have to juggle its own calculations while
acting as an I/O proxy; otherwise it could opt not to store R2 amplitudes, but then the
system takes a performance hit as a whole because there is one relatively idle CPU per
node.
Most computers will allow any process to perform I/O and shift the performance hit
of I/O contention to the program. The technique the parallel CCSD algorithm uses to
lessen this penalty is asynchronous file reads. The logistics are the same as one-sided
operations in the DI phase in that each PE issues an AIO READ immediately after the
last integral has been read but before other computation has been done. If asynchronous
I/O operations are not available and the user elects not to designate a special I/O PE, then
the algorithm can wait for the integrals up front with a blocking read or it can
synchronize the squad PEs by broadcasting the current block and waiting for all of them
to finish before reading the next block.
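The effect of AIO_READ can be imitated with a single background reader thread. This sketch shows only the read-ahead structure; the block size, the file layout, and the contraction callback are assumptions.

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    BLOCK = 4 * 1024 * 1024                   # bytes per integral block (assumed)

    def read_block(f, n):                     # blocking read of block n
        f.seek(n * BLOCK)
        return np.frombuffer(f.read(BLOCK), dtype=np.float64)

    def process_file(path, nblocks, contract):
        with open(path, 'rb') as f, ThreadPoolExecutor(max_workers=1) as io:
            pending = io.submit(read_block, f, 0)      # issue the first read
            for n in range(nblocks):
                block = pending.result()               # stalls only if in flight
                if n + 1 < nblocks:                    # next read proceeds while
                    pending = io.submit(read_block, f, n + 1)   # we compute
                contract(block)                        # work on the current block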
Resource competition, such as parallel blocking reads, suffers scaling penalties,
meaning the cost per read increases as the number of PEs per squad increases. The upshot
is that there is still the possibility that some PEs are computing while others are waiting
on I/O. Locked step algorithms, however, usually suffer far worse than resource
competition algorithms. Locked step guarantees that no one is working while a read is
pending and that no one is reading while the PEs are computing. Even if the I/O PE used
asynchronous reads, the locked step approach would suffer from expensive
synchronization delays compared to pure resource competition. From these
considerations, each PE will be responsible for reading its next block of integrals from
disk. If asynchronous reads are not supported, then the PE will default to blocking reads.
The OOVV integrals will be distributed over platoons by block cycling the AO
shell pair kets. Every PE in a platoon must read and process all of the integrals assigned
to its platoon. The integrals will only contribute to the R2 amplitudes that the PE stores,
but the T2 amplitudes must be read (with one-sided GETs) from other PEs in the platoon.
Every shell pair of integrals results in four updates to all of the local R2 amplitudes,
one for each subset of Group 3 Set 1 and one for Group 3 Set 2. Since Group 3 involves
contributions in which all virtual orbitals come from T, the virtual orbitals of the integrals
must be the internal summation indices. The general approach to calculating the Set 1
subsets is to loop over the λ and δ shells for a fixed ρσ pair. Each subset corresponds to a
different configuration of these nested loops.
Figure 4-3 shows the path of updates through the local set of R2 amplitudes. Table
4-5 shows the detailed order of operations for calculating Group 3 Set 1 Subset 1
(AC/BD is LR/DS in AO shells).
Accessing memory in non-unit stride is highly unfavorable when performance is
the utmost concern; however, each square in Figure 4-3 represents three four-index arrays
such as those shown in Figure 4-2, and those updates can be performed in unit stride.
Furthermore, the L-D loop order is only used in Subset 1. Tables 4-6, 4-7, and 4-8 show
the operations for Subsets 2 and 3 of Set 1 and Set 2, respectively. Each steps through the
LD pairs in D-L order.
Figure 4-3. The path of Group 3 Set 1 Subset 1 updates per OOVV integral. Dashed
lines show memory continuity and solid arrows show the loop path.
There is some flexibility in choosing the order in which the Group 3 sections are
calculated. The most important design consideration is choosing which section will be
performed last since it will AIO_READ the next RS shell pair of OOVV integrals from
disk. Subset 2 of Set 1 has the largest number of operations after the last integral is used,
so this section should be performed last.
Choosing a section to run first depends on how many operations can complete
during the first GET before reading the T2 amplitudes. Subsets 1 and 3 of Set 1 and Set 2
all use the amplitudes almost immediately, but each requires a different initial shell pair.
Set 2 reads only one shell pair, T2(RS), and neither shell contributes to the final R2
amplitudes, so the probability that T2 is local is inversely proportional to the number of
PEs in the platoon. The Subsets of Set 1 fare much better since at least one shell of T2
contributes directly to a shell of R2. If the amplitudes are distributed in bands of L, then
Subset 3 has a good chance of T2 being local, and if the amplitudes are distributed in
bands of D, then Subset 1 is better off executing first. Figure 4-4 illustrates the two
distribution schemes.
Figure 4-4. The LD shell distribution schemes. Each PE stores both T2 and R2
amplitudes for its block of LD shell pairs. Set 1 Subsets 2 and 3 favor L
bands, while Subset 1 favors D bands.
The parallel algorithm uses L bands since two of the three subsets favor them as far
as having local T2 amplitudes per R2 update. Subset 2 is already designated to run last, so
subset 3 should be performed first because it has the highest chance of having no wait
time for the initial T2 GET. The decision to run Set 2 before or after Subset 1 can be
made on the same argument. Since Set 2 updates the entire set of local LD pairs after
reading the T2(RS) amplitudes, it offers the most time for the initial GET of Set 1 Subset
1 to complete before reading them. Therefore, the calculation order of the Group 3 terms
is Set 1 Subset 3 (S1S3), Set 2, S1S1, and S1S2.
Figure 4-9 shows the flow of data in the SI phase. Each PE loops over all of the
shell pairs that were assigned to its platoon and calculates only the terms that contribute
to the local R2 amplitudes. The following pseudo-code lists the entire flow of control
from the end of the DI phase up to the start of the One Particle phase.
do RS = RS0, RS1
    calc G3S1S3(GET T2(RS))
    calc G3S2(GET T2(L0R))
    calc G3S1S1(GET T2(SD0))
    calc G3S1S2(AIO_READ next V(RS); GET next T2(RD))
end do RS
Figure 4-9. The data flow in the Stored Integral phase. Each PE processes all of the
integrals that are stored by its platoon and updates its local block of R2.
One Particle Phase
The OP phase is responsible for contracting the Fock matrix elements of the
reference wavefunction (f) and the I1 intermediates (IP, IH, and IR) with the T
amplitudes. There are three types of contractions: T2 into R1 (Group 4), T1 into R1 (Group
5 Set 1), and T2 into R2 (Group 5 Set 2). Only one PE must do the T1-into-R1 contractions
provided it has the complete I1 intermediate (it should since the contributions were
reduced at the end of the DI phase). These transformations, shown in Equation 4-16 for
T1, are insignificant performance-wise and little is gained by parallelizing them.

$r^{a}_{I} \mathrel{+}= f^{a}_{I} + \sum_{c} f^{a}_{c}\, t^{c}_{I} - \sum_{k} f^{k}_{I}\, t^{a}_{k}$   (4-16a)

$r^{a}_{I} \mathrel{+}= -\sum_{k,c} f^{c}_{k}\, t^{c}_{I}\, t^{a}_{k} - \tfrac{1}{2} \sum_{k,c} I^{c}_{k}\, t^{c}_{I}\, t^{a}_{k}$   (4-16b)
The T2-into-R1 contributions can be distributed over the PEs in one platoon, since they
are limited to local reads and local writes; however, while the T2-into-R2 terms also have
local writes, they have both local and remote reads. The remote reads come from the
terms in Equations 4-17 and 4-18. The β part of FP is defined in Equation 4-19. (The α
part is analogous.)

$r^{\mu\nu}_{Ij} \mathrel{+}= \sum_{\sigma} F^{\nu}_{\sigma}\, t^{\mu\sigma}_{Ij}$   (4-17)

$F^{\nu}_{\sigma} = \sum_{a,d} c^{a}_{\nu}\, f^{a}_{d}\, c^{d}_{\sigma} - \sum_{l} t^{\nu}_{l}\, I^{l}_{\sigma} + \sum_{l,\delta} I^{l\nu}_{\delta\sigma}\, t^{\delta}_{l}$   (4-19)
Parallelization of these particle terms should reduce the number of messages a PE
must send as the number of PEs in the company increases. This implies distributing the
summation index (ρ or σ) over the platoons; however, since the SI phase prefers L bands
(the ρ index), it is likely that most (if not all) of the T2(RD) GETs are local, so it is not as
critical to performance to distribute them over the platoons as it is for σ. For the σ loop,
each PE must get data from multiple target PEs that contribute to its local R2 set (Table 4-
9). Table 4-10 shows the other FP terms.
The T2-into-R2 hole terms and the T2-into-R1 terms (along with the latter FP terms) should
be performed by the largest platoon (HVY_PLT), even if that platoon has only one more
PE. During this time, all of the other platoons will send their contributions via
ACCUMULATE to the Heavy Platoon. Once these operations are complete, the only
remaining tasks are antisymmetrizing the same-spin R2 amplitudes and updating the old
T2 amplitudes with R2. Table 4-11 shows the operations of the Heavy Platoon while the
other platoons are sending their contributions, and Equations 4-20 and 4-21 show the
terms that contribute to FH and FR, respectively.

$F^{k}_{\lambda} = \sum_{\delta} f^{k}_{\delta}\, t^{\delta}_{\lambda} + \sum_{l,\delta} I^{kl}_{\lambda\delta}\, t^{\delta}_{l} + \sum_{\nu} c^{k}_{\nu} \big( I^{\nu}_{\lambda} + \sum_{\sigma} I^{\nu}_{\sigma}\, t^{\sigma}_{\lambda} \big)$   (4-20)

$F^{d}_{\lambda} = I^{d}_{\lambda} + \sum_{\mu} c^{d}_{\mu}\, I^{\mu}_{\lambda}$   (4-21)
The following pseudo-code lists the operations that occur from the end of the SI
phase up to the start of the antisymmetrization and update phase.
if (CO_CDR) then
    make FP, FH, FR
    BROADCAST(F1, CO_NET)
RECEIVE(F1, CO_NET)
calc FP * T2 terms
if (HVY_PLT) then
    calc FH and FR terms (Table 4-11)
else
    ACCUMULATE(R2, HVY_PLT)
REDUCE(R1, CO_CDR)
Antisymmetrization and T2 Update
The αα and ββ R2 amplitudes must be antisymmetrized with respect to AO shell
pairs. Equation 4-22 shows this operation. All of the R amplitudes must then be divided
by the diagonal elements of the Fock matrix in order to generate the next set of T2
amplitudes. Each PE in the Heavy Platoon, having collected all of the R2 amplitudes that
it is assigned to store, will antisymmetrize and transform its amplitudes into T2. The
R2-into-T2 transformation is shown in Equation 4-23 and the diagonal matrix elements of
the transformed Fock matrix are shown in Equation 4-24.

$t^{\mu\nu}_{ij} = r^{\mu\nu}_{ij} \,/\, \big( f^{i}_{i} + f^{j}_{j} - f^{\mu}_{\mu} - f^{\nu}_{\nu} \big)$   (4-23)

$f^{\mu}_{\mu} = \sum_{a} c^{a}_{\mu}\, f^{a}_{a}\, c^{a}_{\mu}$   (4-24)
The R1 amplitudes must be transformed in the same way in order to generate the
next set of T1 amplitudes. One PE should do this since the transformation is trivial
(Equation 4-25).

$t^{a}_{i} = r^{a}_{i} \,/\, \big( f^{i}_{i} - f^{a}_{a} \big)$   (4-25)
Table 4-12 lists the operations of the Heavy Platoon for antisymmetrizing and
transforming R2. The code listing that follows shows the operations of the Amplitude
Company from the end of the OP phase up to the next CC iteration.
if (HVY_PLT) then
    antisymmetrize R2 and update T2 (Table 4-12)
if (CO_CDR) then
    BROADCAST(T1, CO_NET)
RECEIVE(T1, CO_NET)
if (not HVY_PLT) then
    update the local copy of T2
The order of the loop nesting, L-D versus D-L, does not affect the results, but the
performance of the loops might differ from one computer architecture to the next. The L-
D ordering issues multiple GETs to the same PE (down the L column) before cycling up
to the next column. If every PE begins with L=1, then the first PE in the platoon will be
bombarded with the GETs of every PE at once. Normally, GET operations do not require
exclusive locks, in which only one PE at a time is allowed to access the data, but the
efficiency of the messaging subsystem might be a factor in performance. This would be less
important in the D-L ordering.
Another way in which the messaging subsystem could affect performance is by
"remembering" which PE was accessed last. In this situation, once a GET is established
between two PEs, any successive GET to the remote PE is less expensive than switching
to another PE. This would actually favor the L-D ordering and penalize the D-L ordering.
In either case, the code can be trivially designed to switch between the two depending on
which architecture the program is running on.
Table 4-2. The Direct Integral phase prologue memory map
(Columns: V (G and H), T2, I2, and the W2 temporaries, by spin case, with a flag per
operation. The body is a sequence of W2-building contractions of V with w, alternating
with increments of IH and I2 by W2.)
Each operation is implicitly summed over all shared indices. The G column of V
corresponds to the set of Coulomb integrals, and the H column corresponds to the
antisymmetrized integrals. w, r, and i stand
for write, read, and increment, respectively. Shaded cells are involved in pending
communications and may not be read from or written to (without risking data
inconsistencies). Temporary arrays (W2) are outlined to show their scope of dependence.
Table 4-3. The Direct Integral phase inner loop memory map
(Columns: V (G, H), T2, R2(1), R2(2), I2, W2, and X2 with r/w/i flags.)
if (L != R) WAIT T2(LS)
if (R != S) GET T2(LR)
    (build X2 from G, C, and W2; increment R2 and I2)
if (R != S) WAIT T2(LR)
if (more L) GET next T2(LS)
    (build X2 from H and W2; increment R2 and I2)
The R2(1) buffer is the active write buffer, and the R2(2) buffer is the access-forbidden
buffer from the previous L shell. The loop starts with L=R, continues up to Nshells, cycles
back to L=1, and continues up to L=(R-1).
One of the two R2 buffers will be access-forbidden. The last permutation does not need to
GET another T2(RS) block; otherwise, the GET should be issued as soon as possible. If
the permutation index is 4, then a different exchange integral will be generated directly.
The (w) cells are write-optional, meaning the operation might or might not need the
buffer.
Table 4-4. The Direct Integral phase permutation epilogue memory map
(Columns: V (G, H), T2, R2(1), R2(2), and scratch, by spin case, with r/w/i flags.)
if (perm == 8) then (last permutation: no further GET)
IR += f(C, T1, G)
GET next T2(RS)
IR += f(C, T1, G')
permute V
Table 4-5. The Stored Integral phase Group 3 Set 1 Subset 1 memory map
(Columns: V, T2(1), T2(2), and the IA, IB, and IC intermediates with r/w/i flags.)
GET T2(L0R)
do L = L0, L1
    (build intermediates from V and T2(LR))
    GET T2(D0S)
    do D = D0, D1
        WAIT T2(DS)
        GET next T2
        (increment the local R2 blocks from the intermediates)
    end do D
end do L
During the loop over D, the T2 buffers alternate active/inactive, but the prologue in the L
loop uses the second buffer as scratch memory. The GET in the inner loop over D
requests T2(DS) unless D=D1, in which case it requests the next T2(LR) (provided L != L1).
Table 4-6. The Stored Integral phase Group 3 Set 1 Subset 2 memory map
Orde ofV T2(1) T2(2) IA IB C ID
operations d cl l d cl cl d cl cl cl cl c
GE T T~t SDn)I
ile D = Dn,. D,
WAI.~T Tt SDI
GE T T( RLn)I
I = (; i r r n I
I' = \i
I = H _i r r n
I' ~ r I = H i r r n
Ift'D==D.) I thn
ile L = Ln,. L,
I = is / I' r
GET nextr T(")'
I/ + = I' i '"
i~ + = I' i I' r
I + =I i I' r r I
tndl ile L
tndl ile D
T Te inner loop of this subset has more operations than either of the other two subse s, so
it is calculated last and the asynchronous read is issued during the last iteration.
Table 4-7. The Stored Integral phase Group 3 Set 1 Subset 3 memory map
(Columns: V, T2(1), T2(2), and the IA, IB, and IC intermediates with r/w/i flags.)
GET T2(RD0)
do D = D0, D1
    (build intermediates from V and T2(RD))
    do L = L0, L1
        WAIT T2(LS)
        GET next T2
        (increment the local R2 blocks from the intermediates)
    end do L
end do D
Table 4-8. The Stored Integral phase Group 3 Set 2 memory map
(Columns: V and T2 by spin case (αα, αβ, ββ, OO) with r/w/i flags.)
W2 = T2 + T1*T1
IK += W2 * V
do LD = LD0, LD1
    (increment the local R2(LD) blocks from IK)
end do LD
W2 = T2 + T1*T1 (second spin case)
IK += W2 * V
do LD = LD0, LD1
    (increment the local R2(LD) blocks from IK)
end do LD
end do spin
All of the T2(LD) reads and R2(LD) updates are local operations. If there is not enough
memory for I2, then the Ij/ij/IJ bra pairs can be processed in blocks.
Table 4-9. The One Particle phase FP(β)*T2(αβ) memory map
(Columns: T2(1) and T2(2), αβ spin case, with r/w/i flags.)
do S = S0, S1
    do L = L0, L1
        GET next T2(LS)
        do D = D0, D1
            (increment the local R2 from FP and T2)
        end do D
    end do L
end do S
Table 4-10. The One Particle phase FP(α)*T2(αβ) (and same-spin terms) memory map
(Columns: T2(1) and T2(2) by spin case with r/w/i flags.)
GET T2(RD0)
do D = D0, D1
    do R = 1, Nshells
        WAIT T2(RD)
        GET next T2(RD)
        do L = L0, L1
            (two increments of the local R2 from FP and T2)
        end do L
    end do R
end do D
Table 4-11. The One Particle phase FH and FR local transformations
do LD = LD0, LD1
    (contract FH with the local T2(LD) into R2)
    (contract FR with the local T2(LD) into R1)
end do LD
Table 4-12. The R2 antisymmetrization and T2 update memory map
(Columns: R2(1) and R2(2) by spin case (αα, αβ, ββ, OO) with r/w/i flags.)
GET R2(D0L0)
do L = L0, L1
    do D = D0, D1
        WAIT R2(DL)
        GET next R2(DL)
        (antisymmetrize the same-spin blocks and divide by the Fock denominators)
    end do D
end do L
DISCUSSION AND SIMULATION
The parallel CCSD algorithm distributes almost every operation over the number of
PEs in the Amplitude Company. The few operations that are not distributed are
insignificant to overall performance. Specifically, the worst one-particle transformation
would be a multiplication of two NAO x NAO matrices, which scales operationally like
NAO^3. In contrast, the Direct Integral phase performs matrix multiplications that scale
operationally like O^2*NAO^4. For a system with 100 electrons (50 α and 50 β) and 500
basis functions, there
are approximately 125 million operations for the serial multiply and over 309 trillion
operations for the parallel multiply. Assuming that every operation takes the same
amount of time and that the parallel multiply is 100% efficient, the maximum theoretical
speedup is 2.48 million times faster (for this system).
$\lim_{N_{proc} \to \infty} \mathrm{SpeedUp} = \frac{t_{ser} + t_{par}}{t_{ser}} = \frac{1.25 \times 10^{8} + 3.09 \times 10^{14}}{1.25 \times 10^{8}} \approx 2.48 \times 10^{6}$   (5-1)
Obviously this limit is unachievable, and there are multiple operations that slow
down the parallel portions of the CCSD algorithm. Disk I/O, network messages, and
synchronizations all become more expensive as the number of PEs increases. The two
primary techniques that help to reduce these costs are real-time load balancing and
asynchronous operations (i.e., overlapping computation and communication for both disk
reads and remote memory operations).
In this context, a job board is a global structure that any PE can write to or read
from at will. The Direct Integral phase must have a global counter that registers the next
AO shell pair waiting to be processed. PEs that finish their tasks quickly will read and
increment the counter more often than PEs that lag behind the others. Natural load
imbalance or CPU contention from other processes can cause this lag (or drift). CPU
contention frequently happens if the operating system attempts to run more
compute-intensive processes than there are installed CPUs. Even if the parallel program
executes exclusively, when the number of PEs is the same as the number of CPUs the
operating system kernel itself will compete for CPU cycles and detract from overall
scalability.
Job boards can relieve natural load imbalance in the Direct Integral phase, but the
Stored Integral phase cannot easily use them (if at all). The SI phase has two levels of
parallelization: the OOVV integrals are distributed over platoons and the R2 updates are
distributed over PEs within a platoon. Any attempt to balance the load in real time will
incur multiple synchronization delays since the PEs in a platoon have close to but not the
same number of operations. Instead of real-time load balancing, the AO shells are
grouped together to balance the work needed to process each shell.
All uses of the word shell" that correspond to CC data processing refer to
collections of physical AO shells, which are the smallest entities that direct integral
generation programs can calculate. Since the number of operations scales like O2N4, the
difference between the smallest and largest physical shell is a direct measure of the
imbalance in computations per message. The hydrogen STO-3G basis set has 1
contracted orbital in the s shell. Larger basis sets will typically run up to about 16 orbitals
per shell (although the iodine PBS basis has 55 orbitals in the d shell). The hydrogen
shell and a 16-orbital shell are computationally uneven by a factor of 65536 (= 16^4) for
the DI
phase. If the messages pertaining to these two shells cost the same, which is a fair
assessment since message latency is an order of magnitude more important than message
size, then one PE will have 65536 times more work to do than another. Simply grouping
as many physical shells together as possible might not be the best solution since it is the
size variability that affects load balancing.
The parallel CCSD algorithm, before running the first iteration, will examine the
collection of physical AO shells and search for the set of logical shells that minimizes the
variance of shell sizes. This is a bounded two-dimensional search along the dimensions
MaxShellSize and NLogShells. The maximum number of logical shells is simply the number
of physical shells. The fewest number of logical shells is 1, and the size of that shell is
equal to the sum of all of the physical shell sizes; however, given a maximum shell size,
the fewest number of shells is defined in Equation 5-2.
$N_{LogShells} \;\geq\; \frac{N_{orbitals}}{MaxShellSize}$   (5-2)
Physical AO shells should not be broken up since generating direct integrals costs
less when orbitals of the same angular momentum and atomic center are calculated at
once. Equation 5-2 is actually an inequality because there might not be a perfect grouping
of physical shells such that every logical shell has the same size. The search algorithm
loops over size limits (MaxShellSize) starting with the largest physical shell size and
ending with a predefined limit or the sum of all physical shell sizes, whichever is smaller.
For every potential MaxShellSize, the algorithm also loops over the number of logical
shells starting from the theoretical minimum up to the number of physical shells.
The problem of putting objects with mass (physical shells with orbitals) into groups
with a mass limit (logical shells with up to MaxShellSize orbitals) is well known in the
field of Computer Science as the Bin Packing Problem (BPP) [Cof97]. This is an NP-
hard optimization and there are many algorithms for finding solution candidates. Rather
than solve the BPP, the CC algorithm uses the Worst Fit Decreasing (WFD) off-line
algorithm limited to a known set of bins (logical shells) and measures the variance of
sizes (Appendix E). The partitioning scheme with the lowest variance should have the
best balance between computation and messaging. WFD is ideal for finding possible
partitions for this particular problem since it sorts the physical shells by decreasing size
and adds each one to the logical shell with the fewest orbitals.
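A compact version of the WFD pass with the variance measurement might look as follows; the shell sizes are illustrative, and the real search loops this over MaxShellSize and NLogShells as described above.

    def wfd_variance(sizes, n_bins, max_size):
        bins = [0] * n_bins                      # orbital count per logical shell
        for s in sorted(sizes, reverse=True):    # largest physical shell first
            i = min(range(n_bins), key=bins.__getitem__)   # emptiest shell
            if bins[i] + s > max_size:
                return None                      # no packing under this limit
            bins[i] += s
        mean = sum(bins) / n_bins
        return sum((b - mean) ** 2 for b in bins) / n_bins

    # e.g., seven physical shells packed into three logical shells of <= 16 orbitals
    print(wfd_variance([1, 1, 3, 5, 6, 10, 16], n_bins=3, max_size=16))  # 2.0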
Most parallel programs are hampered by the increasing cost of messages as the
number of PEs increases. One-sided operations allow PEs to process data while messages
are moving through the network. The two limits of this scheme are: 1) the CCSD
calculation is free because the messages take so long to arrive and 2) the program scales
almost linearly with the number of PEs because global data is available the moment it is
read. Given the complex task and data distributions, it is likely that some portions of the
algorithm will be bound by communication costs and others by computations.
Reading the OOVV integrals from disk should not be a performance bottleneck
considering the amount of computation that each PE will do while the AIO READ is
being fulfilled. The inner loops of the SI phase can become heavily dependent on T2 GET
operations if the logical shells are not large enough to saturate the retrieval with
computation. The easiest way to increase the number of operations per message is to
increase the size of a logical AO shell pair. The difficulty in doing this is that the local
scratch memory requirements scale like N4 with respect to shell size. As more scratch
memory is needed, less heap remains for distributed data, and ultimately more PEs will
be needed. The problem of balancing shell size, scratch memory, and data distribution is
not solved, but as the algorithm has been stated, the data distribution schemes can be tried
independently of the data contraction kernels. For example, if more scratch memory is
needed but there is no possibility to increase the heap or the number of PEs, then the DI
and SI phases can be further divided to allow for fewer intermediates at the cost of more
passes over the integrals.
An attractive feature of the parallel CCSD algorithm is that no PE is required to
stall in a wait call for another PE unless the state of their globally accessible data changes
(i.e., if R2 is transformed or T2 is updated). A barrier in message passing programs is a
routine that causes every PE to wait until all of the PEs in a communication network have
entered the barrier. Certain collective operations can function like a barrier and some
pairs of functions act as a true barrier. Reductions involve every PE sending some data to
one PE, which will combine all the data into one element. Broadcasts involve one PE
sending some data to every PE. Individually, certain PEs may continue after the operation
whether or not other PEs have finished their operations, but properly configured, a
reduction and a broadcast can together function like a barrier. No PE will proceed until it
has received data from one PE, which will not send the reduced element before receiving
the contributions from all of the other PEs.
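In mpi4py the pairing looks like the sketch below; the payload is a dummy, and the only point is that no rank can pass the broadcast before the root has completed the reduction.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    contrib = np.array([1.0])                 # this PE's I1-style contribution
    total = np.empty(1)

    comm.Reduce(contrib, total, op=MPI.SUM, root=0)   # all PEs feed the root
    comm.Bcast(total, root=0)                         # the root feeds all PEs
    # Reaching this line means every PE entered both calls: a de facto barrier.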
The concept of a fence comes from the MPI-2 standard, which defines it as a
collective barrier that ensures all one-sided communications on a shared object are
complete before leaving the barrier. Synchronizations of any sort occur essentially 10
times in each CC iteration:
1. *initialize R and synchronize T before the βα DI phase -> CO fence (Appendix A)
2. complete the R2(βα) contributions from the βα DI phase -> PLT fence
3. complete the R2(D,M)->S(L,N) transformation into scratch -> PLT fence
4. copy scratch into R2 before the αβ DI phase -> PLT fence
5. *complete the R2(αβ) contributions from the αβ DI phase -> CO fence
6. reduce I1 contributions onto CO_CDR -> CO reduce
7. receive F1 from CO_CDR -> CO bcast
8. complete the R2(L,N)->S(L,D) transformation into scratch -> PLT fence
9. copy scratch into R2 before the SI phase -> PLT fence
10. complete SI and OP contributions -> CO fence
11. complete the R2-into-T2 update -> CO fence
12. receive T1 from CO_CDR -> CO bcast
In the steady state of iterating over T, the first and last synchronizations can be
overlapped (i.e., point 1 is the same as point 12 in the list). Point 5 can be overlapped
with point 6 in that reducing I1 signals the end of the αβ DI phase (Table 5-1).
Collectively, the number of synchronizations does not scale with the problem size;
however, the cost of each synchronization point will increase as the number of PEs
increases.
The million-dollar question that every parallel algorithm must answer is, "Does it
scale?" Amdahl's Law [Amd67], which neglects all process contention and performance
degradation, describes the maximum theoretical speedup for a fixed problem size running
on a fixed number of processors. The simplest expression of this law is shown in
Equation 5-3.

$\mathrm{SpeedUp} = \frac{t_{ser} + t_{par}}{t_{ser} + t_{par} / N_{PE}}$   (5-3)
However, the parallel CCSD algorithm does not spend the same percentage of time in the
parallel parts relative to the serial parts for every molecular system. As the problem size
increases, the percent spent in the serial part decreases; thus, the parallel speedup will
increase with system size. This is loosely related to the Gustafson-Barsis Law that makes
the observation that a parallel computer can do bigger calculations relative to the serial
computer for data parallel algorithms as the number of processors increases [Gus88]. The
parallel CCSD algorithm leaves N3-type operations for one CPU (in the form of NxN
matrix multiplications). A few one-particle N5 operations and the N4 T2 update are
distributed over the largest platoon, but most N5 and all N6 Operations are distributed over
the entire Amplitude Company. Furthermore, remote data is retrieved as early as possible
and used as late as possible.

Table 5-1. The schedule of synchronizations
Phase              Heavy Platoon              Light Platoon
                   T1  R1  T2  R2  Scr        T1  T2  R2  Scr
DI: βα             r       r   i   w/r        r   r   i   w/r
R(D,M)->S(L,N)                 r   w              r   w
S(L,N)->R(L,N)                 w   r              w   r
DI: αβ             r       r   i   w/r        r   r   i   w/r
R(L,N)->S(L,D)                 r   w              r   w
S(L,D)->R(L,D)                 w   r              w   r
SI; OP: FP*T2      r       r   i   w/r        r   r   i   w/r
antisym; 1/D               w   r   0
update T2          w       w   0              w
Each platoon independently transforms R2 and separates each step with a PLT fence
(dashed lines). The shaded rows correspond to collective communications across the
Company. r, w, i, and 0 stand for read, write, increment, and initialize to 0, respectively.
The largest platoon calculates T1 while the other platoons update their copies of T2.
This provides an ideal countermeasure for variable network performance.
A simulator is used to test a design before building a prototype or the final product.
Simulating software performance (scalability in the case of parallel CCSD) usually
involves estimating computation and data transfer times, which requires hardware
parameters such as CPU speed and efficiency; memory, disk, and network I/O latencies
and bandwidths; etc. These parameters can be measured and will be different for various
computer architectures; however, the goal of the software simulator is not to emulate
hardware. By using a variety of parameters, the simulator should be able to expose
general strengths and weaknesses of an algorithm that might need to be redesigned.
The simulator used to test the parallel CCSD algorithm constructs a parallel
network with an array of floating-point numbers that represents the clocks of the PEs.
The program assumes the identity of a PE and estimates how long it would take to
perform a set of operations. That time is added to the clock and the simulator switches to
the next PE. For the Direct Integral phase, the simulator maintains a counter that
corresponds to the next integral shell pair, and becomes the PE with the lowest clock (i.e.,
the one that would increment the counter next). Once the counter exceeds the number of
shell pairs, the clocks are synchronized to the highest value and the R2 amplitudes are
transformed. The process repeats itself for the second DI phase.
Binary trees are used to estimate reductions and broadcasts. The Stored Integral and
One Particle phases do not use real-time load balancing so the simulator cycles through
each PE and adjusts the clock accordingly. Finally, the antisymmetrization and update
phases are estimated with all of the appropriate synchronizations and computations.
Local computation. Accurately estimating local mathematical operations is
essential to simulating both parallel and serial performance. Each operation involves
reading data from memory, operating on the data, and writing the results back to memory
(Equation 5-4). S is the data size, B is the memory bandwidth, Nops is the number of
operations, C is the CPU speed, and e is an efficiency modifier. Compiler optimization,
loop ordering, and vendor-supplied mathematical libraries will affect performance so
multiple efficiencies may be needed to estimate different parts of the algorithm.
$t_{comp} = \frac{S_{in}}{B_{rd}} + \frac{S_{out}}{B_{wr}} + \frac{N_{ops}}{C \cdot e}$   (5-4)
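As a sanity check, Equation 5-4 reproduces the 625x625 DGEMM timing reported later in this chapter when fed the measured memory bandwidths, the 750e6 fma/s reference rate, and the 0.889 efficiency; treating one fused multiply-add as one operation is my assumption.

    def t_comp(S_in, S_out, N_ops, B_rd, B_wr, C, e):
        return S_in / B_rd + S_out / B_wr + N_ops / (C * e)   # Equation 5-4

    MB = 1024.0 ** 2
    est = t_comp(2 * 625 * 625 * 8 / MB,    # read two input matrices (MB)
                 625 * 625 * 8 / MB,        # write the product (MB)
                 625 ** 3,                  # fused multiply-adds
                 821.59, 266.27,            # memory read/write bandwidth (MB/s)
                 750e6, 0.889)              # fma/s and measured DGEMM efficiency
    print(est)                              # ~0.385 s vs. the measured ~0.388 s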
Asynchronous I/O. Disk operations can be estimated with latency, bandwidth, and
the number of channels (Equation 5-5). Latency is usually the time it takes for a disk
head to move to the appropriate position on the disk, Nch is the number of dedicated I/O
channels, and Nmsg is the number of simultaneous disk operations. Disk contention in the
simulator is estimated by multiplying S by the number of PEs on the node (i.e., if one PE
is reading data, then another probably is as well).
$t_{rd,wr} = t_{lat} + \frac{S}{B \cdot \min(N_{ch}, N_{msg})}$   (5-5)
When an asynchronous call is placed, the simulator estimates how long it would
take for the data to arrive and saves the time at which it should be available. The PE must
then call a wait routine, which will compare the current time to the time the data is
useable. Depending on the difference, the simulator will increase the clock (the PE must
wait) or return silently (the data arrived while computing).
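The bookkeeping just described fits in a few lines; the following sketch (class and method names are mine) shows the issue/wait pair with the disk parameters measured later in this chapter.

    class PEClock:
        def __init__(self):
            self.t = 0.0                     # this PE's simulated clock (s)
        def compute(self, dt):
            self.t += dt                     # local work advances the clock
        def issue(self, t_lat, size, bw):
            return self.t + t_lat + size / bw    # when the data becomes usable
        def wait(self, ready_at):
            if ready_at > self.t:            # still in flight: the PE stalls;
                self.t = ready_at            # else it arrived while computing

    pe = PEClock()
    handle = pe.issue(t_lat=3.9e-3, size=4.0, bw=30.7)  # a 4 MB disk read
    pe.compute(0.25)                  # contract the previous block meanwhile
    pe.wait(handle)                   # no stall here: ~0.134 s < 0.25 s
    print(pe.t)                       # 0.25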
One-sided communication. Sending and receiving messages through a network is
very similar to writing and reading from disk. If two-sided messaging is used, then the
clocks of the origin and target must be modified. Blocking messages require the origin
and target to have the same clock, while non-blocking messages allow for drift. The
parallel CCSD algorithm only uses non-blocking one-sided messages, which are very
similar to I/O operations in that only the origin is affected by the time it takes for data to
arrive. The simulator allows for on-node and off-node messaging parameters so that PEs
in the same squad can communicate at different speeds than those on different nodes.
Some MPI implementations require copying data into and out of system buffers, so
memory bandwidth might also affect the time.
Collective communication. Barriers, reductions, and broadcasts are simulated with
binary trees (Figure 5-10), and the number of stages depends on the number of PEs
involved in the operation (Equation 5-6). (Ceiling brackets mean round up to the nearest
integer.) Equation 5-7 shows the number of messages being sent in each stage i.
$N_{stages} = \lceil \log_{2} N_{PE} \rceil$   (5-6)

$N_{msg,i} = \min(N_{PE}, 2^{i}) - 2^{i-1}$   (5-7)
Figure 5-10. The binary tree approach to collective communications. 12 PEs require 4
stages, and the last stage only has 4 messages.
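Equations 5-6 and 5-7 are easy to check in a few lines; the 12-PE case reproduces Figure 5-10 (a throwaway sketch):

    import math

    def tree_schedule(n_pe):
        n_stages = math.ceil(math.log2(n_pe))     # Equation 5-6
        return [min(n_pe, 2 ** i) - 2 ** (i - 1)  # Equation 5-7, stage i
                for i in range(1, n_stages + 1)]

    print(tree_schedule(12))    # [1, 2, 4, 4]: 4 stages, last stage 4 messages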
Memory bandwidth is measured by allocating a large heap and then flushing it with
data and reading the data back in. The average of 7 sweeps is used for the memory
reading and writing bandwidth. Henceforth, all data will pertain to an IBM POWER3
workstation with four 375 MHz CPUs. The following output was taken from a calibration
run:
512 MB heap at 0x3348f518
wr was 1.924009 -> 266.11 MB/s; rd was 0.623147 -> 821.64 MB/s
wr was 1.924412 -> 266.06 MB/s; rd was 0.623328 -> 821.40 MB/s
wr was 1.922241 -> 266.36 MB/s; rd was 0.623201 -> 821.56 MB/s
wr was 1.922671 -> 266.30 MB/s; rd was 0.623183 -> 821.59 MB/s
wr was 1.922141 -> 266.37 MB/s; rd was 0.623107 -> 821.69 MB/s
wr was 1.922495 -> 266.32 MB/s; rd was 0.623170 -> 821.61 MB/s
wr was 1.922200 -> 266.36 MB/s; rd was 0.623147 -> 821.64 MB/s
bw_memwr = 266.27 MB/s (279201344.00 B/s)
bw_memrd = 821.59 MB/s (861497664.00 B/s)
CPU throughput is rarely the clock speed of the processor so efficiency modifiers
are used to estimate the time it would take to compute a set of operations. The two
primary mathematical constructs that the CCSD algorithm uses are two-dimensional
matrix multiplications and N6 array contractions. Many vendors sell highly optimized
linear algebra libraries that are tuned for specific computer architectures. Using these
libraries for standard matrix multiplies can dramatically increase performance if the
program is dependent on such operations. The simulator measures the efficiency of the
double precision general matrix multiply (DGEMM) by multiplying two 625x625
matrices. The matrices are then recast as two 25x25x25x25 four-dimensional arrays and a
hand-coded six-loop array contraction is timed, wherein the array contraction performs
the same operations as the two-dimensional matrix multiply. The following results were
taken from a calibration run:
CALIBRATION FOR POWER 375 MHz CPU
(2 fma units per CPU)
750000000 fma/s reference
( 625 x 625 ) x ( 625 x 625 )
dgemm took .387582 s -> 88.1849% eff
dgemm took .384157 s -> 89.0108% eff
dgemm took .384016 s -> 89.0451% eff
dgemm took .383978 s -> 89.0543% eff
dgemm took .384088 s -> 89.0276% eff
dgemm took .384009 s -> 89.0468% eff
dgemm took .384046 s -> 89.0378% eff
loops took 1.785683 s -> 18.4198% eff
loops took 1.786396 s -> 18.4123% eff
loops took 1.785581 s -> 18.4208% eff
loops took 1.785778 s -> 18.4188% eff
loops took 1.786151 s -> 18.4149% eff
loops took 1.785726 s -> 18.4193% eff
loops took 1.785444 s -> 18.4223% eff
dgemm eff = 0.889153
loops eff = 0.184183
vendor is 4.83 times more efficient
The results are then compared to another set of larger multiplications and contractions to
test for predictability. The simulator uses the latter set of parameters if the results are
precise. The following results were generated using the previous data:
( 30 x 30 ,40 x 40 ) x ( 40 x 40 ,40 x 40 )
DGEMM: real: 3.549262 s (est: 3.53)
DGEMM: real: 3.547543 s (est: 3.53)
DGEMM: real: 3.547254 s (est: 3.53)
LOOPS: real: 22.458201 s (est: 16.76)
LOOPS: real: 22.448650 s (est: 16.76)
LOOPS: real: 22.446517 s (est: 16.76)
LOOPS: real: 22.450999 s (est: 16.76)
LOOPS: real: 22.447324 s (est: 16.76)
LOOPS: real: 22.445928 s (est: 16.76)
LOOPS: real: 22.449582 s (est: 16.76)
Disk latency and bandwidth are difficult to measure in an isolated environment.
Most operating systems use aggressive file caching, which can have a noticeable effect
on performance curves that compare message size to bandwidth. The parallel CCSD
algorithm is somewhat forgiving of exact performance because all disk reads are
asynchronous; nonetheless, calibrating the I/O performance attempts to bypass the CPU
and memory caches.
The benchmarking program iozone [Nor02] was used to measure the disk latency
and bandwidth, and all files were created with the descriptors: O_DIRECT (direct I/O),
O_SYNC (synchronous writes), and O_RSYNC (synchronous reads). Comparing file
record length to the time it took to read a random record produced data that were used to
describe the local disk performance (Figure 5-11 and Table 5-2).
[Figure: random disk read time vs. record size (kB), with one series per file size]
Figure 5-11. The local disk random read calibration data. The sets of data are almost
identical no matter what file size was used between 2 and 512 MB.
Table 5-2. The latencies and bandwidths of random disk reads with respect to file size
File Size    Brd    tlat      File Size    Brd    tlat      File Size    Brd    tlat
   2048      29.7   3.5        16384       30.4   3.5        131072      31.2   4.3
   4096      30.4   4.0        32768       30.5   2.9        262144      31.5   4.9
   8192      30.0   3.2        65536       31.1   4.0        524288      31.4   4.8
File sizes are in kB, bandwidths (Brd) are in MB/s, and latencies (tlat) are in ms. For
each file size, the bandwidth is the inverse of the slope of the linear regression of
record lengths vs. time to read, and the latency is the intercept. (The minimum R2 was
0.999.) The average bandwidth and latency for all 9 sets were 30.7 MB/s and 3.9 ms,
respectively.
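Each entry of Table 5-2 reduces to a two-parameter linear regression of read time
against record length. A minimal sketch with fabricated timings shaped to the averages
above (about 30.7 MB/s and 3.9 ms); only the fitting procedure, not the data, reflects
the calibration:

import numpy as np

record_mb = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0])    # record lengths (MB)
read_s = record_mb / 30.7 + 0.0039                       # t = S/B + t_lat
read_s += np.random.normal(0.0, 1e-4, record_mb.size)    # measurement noise

slope, intercept = np.polyfit(record_mb, read_s, 1)
print(f"B_rd  = {1.0 / slope:.1f} MB/s")     # bandwidth: inverse of the slope
print(f"t_lat = {intercept * 1e3:.1f} ms")   # latency: the intercept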
Measuring network messaging performance was kept relatively simple. The
mpptest program, which ships with the MPICH library [Gro96], was used to measure the
point-to-point (P2P) and bisection bandwidths for messages ranging from 16 to 64 kB.
P2P involves only two PEs, while bisection involves all PEs. Figure 5-12 shows the on-
node P2P and bisection bandwidths measured on the IBM workstation. Latency is taken
from the intercept of the linear regression, not the zero-size transfer time. The regression
analysis for both curves suggests that both pairs of CPUs can communicate without
degradation, so the simulator set the number of on-node communication channels to 2
and used the approximate values of 210 μs and 99 MB/s for latency and bandwidth,
respectively. Three off-node network architectures were tested: switched Fast Ethernet
and two 150 MB/s full duplex, redundant path IBM SP switches (with different
configurations). Only P2P tests were used to measure off-node latency and bandwidth
(Figure 5-13), and Equation 5-5 was used to estimate degradation.
[Figure: on-node message time vs. message size (kB), with linear fits P2P:
t = 9.9477*S + 210.69 and bisection: t = 9.8093*S + 209.7 (R2 = 0.9982)]
Figure 5-12. The calibration data for on-node messaging parameters. Bandwidth is the
inverse of the regression slope, and latency is the intercept. The performance
does not change with one or two pairs of CPUs communicating at the same
time, which implies two independent messaging channels.
[Figure: off-node message time vs. message size (kB) for Fast Ethernet and the two SP
switches, each with a linear fit]
Figure 5-13. The calibration data for off-node messaging parameters. The regression
equations are Fast Ethernet: t=86.987*S+553.42 (R2=0.9998); SP switch 1:
t=15.304*S+265.18 (R2=0.9915); and SP switch 2: t=11.942*S+180.88
(R2=0.9970). Bandwidths are 11.2, 63.8, and 81.8 MB/s, respectively.
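As a quick consistency check, the regression slopes (in microseconds per kB) convert
directly into the quoted bandwidths, assuming 1 kB = 1024 bytes and 1 MB = 1024 kB. A
minimal Python sketch:

fits = {                                # network: (slope us/kB, intercept us)
    "Fast Ethernet": (86.987, 553.42),
    "SP switch 1":   (15.304, 265.18),
    "SP switch 2":   (11.942, 180.88),
}
for name, (slope, intercept) in fits.items():
    bw_mb_s = 1e6 / (slope * 1024.0)    # 1 kB per (slope us) -> MB/s
    print(f"{name}: latency = {intercept:.0f} us, bandwidth = {bw_mb_s:.1f} MB/s")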
As in the case of asynchronous I/O, most messages in the parallel CCSD algorithm
are one-sided and will hide behind computation. If the estimated computation time is
less than the estimated communication time, then a more accurate messaging function
should be used.
With reasonable hardware parameters and timing algorithms, simulations of real
molecular systems can be used to estimate the performance of the parallel CCSD
algorithm. Table 5-3 lists the 20 amino acids and the corresponding numbers of electrons
and spin pairs. Also listed are the sizes of the open-shell OOVV integrals--and thus the
minimum disk requirements (with 1 GB equal to 1024^3 bytes)--and the minimum
number of PEs with 1 GB of heap needed to calculate the UHF CCSD energy if each
system had 1024 AO basis functions. Energy calculations of the anion, cation, and
zwitterion are certainly achievable with the resources of current networks of workstations
and clusters of SMP machines.
For 1024 AO basis functions, the size of the two-electron integrals is 8 TB.
Approximately one-eighth of these need to be calculated on the fly since the others can
be generated by permutations; however, assuming the integrals are dense, the full 8 TB
array must be generated and processed twice (once for the α spin case and once for the
β spin case). Systems of this size should be executed on hundreds if not a thousand PEs
or more. The initial parallelized software cannot scale this high, much less store the
amount of data needed by each node. The parallel CCSD algorithm overcomes this with
fine-grain task distribution and by distributing large arrays over as few PEs as
possible. The net effect is that large
systems can be calculated with few resources at the expense of computing time.
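The 8 TB figure follows directly from the array dimensions. A one-line check, taking
1 TB as 1024^4 bytes:

NAO = 1024
full_tb = NAO**4 * 8 / 2**40    # all <pq|rs> as 8-byte reals -> 8.0 TB
on_the_fly_tb = full_tb / 8     # eightfold permutational symmetry -> 1.0 TB
print(full_tb, on_the_fly_tb)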
Table 5-3. The minimum resources required to compute the UHF CCSD energies of the 20
amino acids with 1024 AO basis functions
Amino acid       C   H   N   O   S    ne    noo   Min PEs   Disk (GB)   Disk savings
Alanine          3   7   1   2   0    48   1128      27        8.8        99.92%
Cysteine         3   7   1   2   1    64   2016      48       15.8        99.86%
Aspartic acid    4   7   1   4   0    70   2415      57       18.9        99.84%
Glutamic acid    5   9   1   4   0    78   3003      71       23.5        99.79%
Phenylalanine    9  11   1   2   0    88   3828      90       29.9        99.73%
Glycine          2   5   1   2   0    40    780      19        6.1        99.95%
Histidine        6   9   3   2   0    82   3321      78       25.9        99.77%
Isoleucine       6  13   1   2   0    72   2556      60       20.0        99.83%
Lysine           6  14   2   2   0    80   3160      75       24.7        99.78%
Leucine          6  13   1   2   0    72   2556      60       20.0        99.83%
Methionine       5  11   1   2   1    80   3160      75       24.7        99.78%
Asparagine       4   8   2   3   0    70   2415      57       18.9        99.84%
Proline          5   9   1   2   0    62   1891      45       14.8        99.87%
Glutamine        5  10   2   3   0    78   3003      71       23.5        99.79%
Arginine         6  14   4   2   0    94   4371     103       34.1        99.69%
Serine           3   7   1   3   0    56   1540      37       12.0        99.90%
Threonine        4   9   1   3   0    64   2016      48       15.8        99.86%
Valine           5  11   1   2   0    64   2016      48       15.8        99.86%
Tryptophan      11  12   2   2   0   108   5778     136       45.1        99.59%
Tyrosine         9  11   1   3   0    96   4560     107       35.6        99.68%
The data assume the scratch memory needed by each PE is less than the amount of data
needed to store the third OOVV array during an R2 transformation. Every example saves
over 99.5% of the disk space required by a traditional MO-based CCSD algorithm.
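Most of Table 5-3 can be regenerated from the molecular formulas alone. The sketch
below assumes noo = ne(ne-1)/2 spin pairs, an OOVV block of noo x NAO^2
double-precision words, and, per the footnote, three OOVV-sized arrays setting the
minimum PE count; these assumptions reproduce the tabulated values:

import math

Z = {"C": 6, "H": 1, "N": 7, "O": 8, "S": 16}
NAO = 1024                     # AO basis functions assumed in Table 5-3
HEAP_GB = 1.0                  # heap per PE

def table_row(formula):
    n_e = sum(Z[atom] * count for atom, count in formula.items())
    n_oo = n_e * (n_e - 1) // 2                  # occupied spin-orbital pairs
    disk_gb = n_oo * NAO**2 * 8 / 2**30          # OOVV block, 8-byte reals
    min_pes = math.ceil(3 * disk_gb / HEAP_GB)   # three OOVV-sized arrays
    return n_e, n_oo, min_pes, round(disk_gb, 1)

print(table_row({"C": 3, "H": 7, "N": 1, "O": 2}))    # alanine:    48, 1128, 27, 8.8
print(table_row({"C": 11, "H": 12, "N": 2, "O": 2}))  # tryptophan: 108, 5778, 136, 45.1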
A simulation of the parallel CCSD algorithm was performed on the DCEE* system
described in Chapter 2. Two network scenarios were tested: Fast Ethernet (F.Eth) and the
faster of the two IBM SP switches (henceforth referred to as x3 sw). Each node in the
network was configured as if it were the aforementioned 4-CPU IBM workstation. All
data are compared to 100% and 50% theoretical efficiencies (both are technically linear,
but "linear" scaling commonly refers to 100% theoretical efficiency). (The parallel
algorithm requires at least two 1-GB PEs, so all data are compared to this as the reference
Fast Ethernet. The F.Eth simulation shows acceptable parallel performance up to
64 PEs, but beyond that, efficiency drops below 50%, which is considered to be
unacceptable (Figures 5-14 and 5-15). However, even up to 1024 PEs, the scalability
curve is still increasing. Efficiency shows normal behavior for parallel programs (starting
out high and decreasing asymptotically).
[Figure: Xth (100%), Xth (50%), X pred., and Efficiency*128 vs. number of PEs, 0 to 1024]
Figure 5-14. The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast
Ethernet. Xth (100%) and Xth (50%) are the theoretical speed-up curves
with 100% and 50% efficiencies, respectively. X pred. is the predicted
speed-up for the simulation. Efficiency*128 is the efficiency scaled by 128.
Beyond the standard analysis of speed-up and efficiency, users might concern
themselves with cost-to-performance. Another way of stating the idea is to ask, "At
what point does the cost of using more PEs outweigh the gain in speed-up they might
contribute?" The "cost" function is considered to be the efficiency, and the metric for
"bang per buck" is efficacy (Equation 5-8) [Gho91].
η(p) = E(p) × S(p) (5-8)
Efficiency times speed-up can have multiple maxima, but only the first maximum value
(with respect to the number of PEs) corresponds to the most cost-effective use of PEs in a
parallel run. The F.Eth simulation of DCEE* peaks at 64 PEs (η = 10.2), but the curve is
so shallow that one could argue anywhere between 32 and 96 PEs is still cost effective.
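A sketch of the efficacy analysis, assuming efficiency is measured against the two-PE
reference run; the data points here are illustrative, not the simulator's:

import numpy as np

def efficacy(pes, speedup, ref_pes=2):
    # Equation 5-8: eta(p) = E(p) * S(p), with E(p) = S(p) / (p / ref_pes).
    pes = np.asarray(pes, dtype=float)
    speedup = np.asarray(speedup, dtype=float)
    return (speedup / (pes / ref_pes)) * speedup

def first_peak(pes, eta):
    # Only the first local maximum marks the most cost-effective PE count.
    for i in range(1, len(eta) - 1):
        if eta[i - 1] <= eta[i] >= eta[i + 1]:
            return pes[i], eta[i]
    return pes[-1], eta[-1]

pes = [2, 16, 32, 64, 96, 128]
eta = efficacy(pes, [1.0, 6.5, 11.0, 16.3, 19.8, 22.1])
print(first_peak(pes, eta))   # first peak at 64 PEs with these sample data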
Figure 5-15. The simulated performance curves for DCEE* on 2 to 128 PEs over Fast
Ethernet. The curves correspond to the same series as in Figure 5-14 with
the exception that efficiency has been scaled by 32, not 128.
SP switch. The x3 sw simulation shows much better performance (Figure 5-16).
Predicted speed-up never drops below 50% efficiency up to 1024 PEs, and efficacy has
not peaked. The simulation suggests that the parallel performance for this system
(DCEE*) is highly dependent on communication speed.
[Figure: Xth (100%), Xth (50%), X pred., and Efficiency*300 vs. number of PEs, 0 to 1024]
Figure 5-16. The simulated performance curves for DCEE* on an SP switch.
Comparison to serial ACES II. Efficacy is a relative metric that depends on the
machine and (molecular) system, but it can be used to compare the effectiveness of
different parallel programs running the same system on the same computer. Figure 5-17
compares the speed-up and efficacy of the new algorithm to the previous parallel CCSD
algorithm. The new algorithm scales better and is much more cost effective per PE;
however, the comparison of real times does not fare so well (Figure 5-18).
[Figure: new and previous speed-up (X) and efficacy vs. number of PEs, 0 to 256]
Figure 5-17. The relative performance of the initial parallelization attempt compared to
the new algorithm. The discontinuity of the previous parallel program comes
from allocating whole nodes with 16 CPUs each.
The new algorithm achieves high scalability by decomposing the equations to such
a degree that the PEs are swamped with calculations in an attempt to use as little disk
and memory as possible. The current serial ACES II CCSD program is 31.4 times faster
than the new parallel algorithm, but it also requires much more disk space and does not
scale as well. Nonetheless, Table 5-4 lists the numbers of PEs that the new algorithm
would need to match the time achieved by the initial parallelization attempt.
It is clear that the CCSD energy can be calculated quickly with few PEs for
molecular systems that have small data requirements (DCEE* used on the order of 2 GB
of disk space per node). For systems of this size, the current ACES II program and the
initial parallelization attempt take far less time than the new parallel algorithm;
however, the new algorithm requires very little disk space and can calculate
energies for very large systems.
[Figure: X new, serial ACES II, and X prev vs. number of PEs, 0 to 1024]
Figure 5-18. The absolute performance of the initial parallelization attempt compared to
the new algorithm. The dark lines are the factors of speed-up achieved by
using the previous code compared to the reference calculation of the new
algorithm.
Table 5-4. The number of PEs needed to surpass the initial parallel performance
Previous PEs    New PEs    Speed-up needed
      1            76           31.8
     32           409          146
     64           514          176
    128           424          151
    256           405          145
The simulated speed-up curve was fitted with a cubic regression, which was used to
predict how many PEs the new algorithm needs to finish a CC iteration in the same time
that the initial parallelization attempt did.
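A sketch of that procedure, with an illustrative (not the dissertation's) speed-up
curve; a cubic fit is inverted to find the PE count that reaches a target speed-up:

import numpy as np

pes = np.array([2, 64, 128, 256, 512, 1024], dtype=float)
speedup = np.array([1.0, 25.0, 44.0, 75.0, 120.0, 180.0])   # illustrative

coeffs = np.polyfit(pes, speedup, 3)          # cubic regression S(p)

def pes_needed(target):
    c = coeffs.copy()
    c[-1] -= target                           # solve S(p) - target = 0
    roots = np.roots(c)
    real = roots[np.isreal(roots)].real
    return real[real > 0].min()               # smallest positive real root

print(f"{pes_needed(31.8):.0f} PEs")          # cf. the first row of Table 5-4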
Serine and phenylalanine. Using the x3 sw network parameters, the simulator
examined serine (Ser) and
phenylalanine (Phe), both in the cc-pVQZ basis set. Ser, with 56 electrons, had 595 AO
basis functions and Phe (88 electrons) had 990 AOs. The current ACES II UHF CCSD program,
using the AO algorithm for VVVV integrals and assuming 20% of the AO integrals are
discarded, would require over 200 GB of local disk storage for Ser and over 1.5 TB for
Phe. The new parallel algorithm would need only 2.8 and 19.2 GB of distributed storage,
respectively.
Figures 5-19 and 5-20 show the simulated parallel performance of Ser and Phe. Ser
requires a minimum of 14 1-GB PEs, and Phe requires 95 PEs. In either system, parallel
efficiency never drops below 96%. With respect to tractability, a single Ser UHF CCSD
iteration is expected to take 2.8 days on 1024 PEs. Phe requires approximately 2 months
per iteration on 1024 PEs. If the efficiency curve is to be believed, then this can be
dropped to about 2 weeks on over 4000 PEs.
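The two-week figure is consistent with simple arithmetic at near-constant efficiency:

t_1024_days = 60.0                        # ~2 months per Phe iteration on 1024 PEs
t_4096_days = t_1024_days * 1024 / 4096   # ~96% efficiency -> near-linear scaling
print(f"{t_4096_days:.0f} days")          # ~15 days, i.e., about 2 weeks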
[Figure: Xth (100%), Xth (50%), and predicted speed-up vs. number of PEs, 0 to 1024]
Figure 5-19. The performance curves for serine on 16 to 1024 PEs.
[Figure: Xth (100%), Xth (50%), and predicted speed-up vs. number of PEs, 0 to 1024]
Figure 5-20. The performance curves for phenylalanine on 128 to 1024 PEs.
The simulator is only able to model local computations with high accuracy. These
are generally highly ordered, well-defined sequences of commands executing on a
consistent processor (barring process contention in the operating system). Next in
predictability is the performance of local disk operations, which remains fairly regular
even with I/O contention from other PEs. The least predictable performance aspect of a
parallel program is network communications. This seems to be an N! correlation problem
in its own right (never mind coupled-cluster theory), but certain models are able to
predict the statistical
nature of network traffic rather well.
The current implementation of the simulator treats the network as a single, flat
interconnected fabric in which every PE can reduce network bandwidth for every other
PE. In principle, this should then predict the worst performance since multi-layered
network hierarchies favor local communications, which the platoons are supposed to
capitalize on. However, the penalty function for estimating network contention lacks the
ability to scale up with the number of PEs (i.e., more PEs reduce the accuracy of the
predicted network effects). This should be corrected with a better model, such as the
hyperbolic model for layered environments [Sto96].
At a more fundamental level, the simulator assumes all messages can be delivered
in one eager message. This means each PE can send and receive data without bothering to
divide the information into multiple messages, but more importantly, it completely
eliminates the need for the target PE to acknowledge or create space for the incoming
data. In practice, this is rarely the case, and the rendezvous protocol must be modeled.
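A toy refinement of the messaging model, using the on-node calibration numbers from
above (210 μs latency, roughly 9.9 μs per kB) and a hypothetical eager/rendezvous
threshold; none of these constants come from the simulator itself:

def message_time_us(size_kb, latency_us=210.0, us_per_kb=9.9, eager_limit_kb=16.0):
    # Below the threshold the message is delivered eagerly; above it, add a
    # rendezvous handshake (request + acknowledgment) before the payload moves.
    t = latency_us + us_per_kb * size_kb
    if size_kb > eager_limit_kb:
        t += 2.0 * latency_us
    return t

print(message_time_us(8.0))    # eager:      ~289 us
print(message_time_us(64.0))   # rendezvous: ~1264 us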
The simulator is also limited to sharing I/O bandwidth among PEs on a single node.
Many high-performance parallel supercomputers use storage area networks to increase