
A Parallel Direct Open-Shell Coupled-Cluster Singles and Doubles Algorithm



A PARALLEL DIRECT OPEN-SHELL COUPLED-CLUSTER SINGLES AND DOUBLES ALGORITHM

By

ANTHONY D. YAU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2004


Copyright 2004 by Anthony D. Yau


ACKNOWLEDGMENTS

I thank my family and friends for supporting me throughout the years; my strength comes from them. I thank the faculty who have instructed me and the members of Rod Bartlett's research group. I owe particular thanks to Dr. Erik Deumens for continued motivation and guidance in scientific computing (and for teaching the first parallel programming class that I attended) and to Dr. Victor Lotrich for insights into implementing Coupled-Cluster theory. I especially thank Dr. Ajith Perera for extending so much help (and patience) to me during all of my studies throughout graduate school.

The United States Department of Defense High Performance Computing Modernization Program (HPCMP) supported all of this work. A grant under the Programming Environment and Training (PET) program (GSA Contract No. DAAD0501-C-003) funded the initial parallelization effort, and the Common High Performance Computing Software Support Initiative (CHSSI) program (project CBD-03) funded the research on the fine-grain parallel algorithm.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 PARALLELIZING A SERIAL CCSD PROGRAM
    Serial Profile
    Bottleneck Categorization and Parallelization
        Particle-Particle Ladders
        Matrix Multiplication
        I/O-Dependent Routines
    Parallel Profile
    Top-Down Parallelization Considerations

3 SPIN-UNRESTRICTED CCSD EQUATIONS
    Group 1: T2 Virtual Orbitals Coming from V
    Group 2: T2 Virtual Orbitals Coming from V and T
        Set 1: One Virtual from V, One Virtual from T1
        Set 2: One Virtual from V, One Virtual from T2
    Group 3: T2 Virtual Orbitals Coming from T
        Set 1: Rings and PH Ladders
            Subset 1: AC/BD
            Subset 2: DB/CA
            Subset 3: CB/AD
        Set 2: HH Ladders from OOVV, OOOV, and OOOO
    Group 4: T2 into T1
    Group 5: One-Particle Transformations
        Set 1: T1 Contributions
        Set 2: T2 Contributions


4 PARALLEL ALGORITHM
    Virtual Parallel Computer
        Communication Networks
        Data Management
            Arrays stored in distributed memory
            Arrays stored on disk
    Parallel UHF CCSD
        Direct Integral Phase
            Design considerations
            Algorithm
        Stored Integral Phase
        One Particle Phase
        Antisymmetrization and T2 Update

5 DISCUSSION AND SIMULATION
    Discussion
        General Comments
        Load Balancing
            Job boards
            Data aggregation
        Asynchronous I/O
        Synchronization Delays
    Simulation
        General Overview
        Specific Technologies
        Calibration
        Theoretical Examples
        Runtime Simulations
            DCEE*
            Serine and phenylalanine

6 CONCLUSION
    Outlook
        Gradients
        Higher Excitation Clusters
        Orbital Transformations and Space Reductions
    Conclusion

APPENDIX

A ACRONYMS AND ABBREVIATIONS


B NOTATION AND CONVENTIONS
    Operators and Matrix Elements
    Operator Indices

C ALGORITHM OVERVIEW

D ANATOMY OF A SHELL PAIR BLOCK

E LOGICAL SHELLS AND BIN PACKING

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

2-1. The distribution of real time over high-level CCSD subroutines
3-1. The RHF CCSD T1 contributions
3-2. The RHF CCSD T2 contributions
4-1. The direct integral processing loop summary
4-2. The Direct Integral phase prologue memory map
4-3. The Direct Integral phase inner loop memory map
4-4. The Direct Integral phase epilogue memory map
4-5. The Stored Integral phase Group 3 Set 1 Subset 1 memory map
4-6. The Stored Integral phase Group 3 Set 1 Subset 2 memory map
4-7. The Stored Integral phase Group 3 Set 1 Subset 3 memory map
4-8. The Stored Integral phase Group 3 Set 2 memory map
4-9. The One Particle phase FP T2 memory map
4-10. The One Particle phase FP T2 (and same spin terms) memory map
4-11. The One Particle phase FH and FR local transformations
4-12. The R2 antisymmetrization and T2 update memory map
5-1. The schedule of synchronization points
5-2. The latencies and bandwidths of random disk reads with respect to file size
5-3. The minimum resources for amino acids with 1024 AOs in UHF CCSD
5-4. The number of PEs needed to surpass the initial parallel performance


LIST OF FIGURES

2-1. The distribution of real time over primary CCSD subroutines
2-2. The breakdown of significant UHF subroutines
2-3. The column assignments for each of 4 CPUs updating R2
2-4. The distribution scheme for matrix multiplies over CPU threads
2-5. The parallel speedup for each major UHF CCSD subroutine and the CCSD executable
2-6. The runtime profile of parallel CCSD with a UHF reference from DCEE*
4-1. Communication network boundaries
4-7. The memory heap management for various molecular systems
4-2. The T2 and R2 memory layout
4-8. The data flow in the Direct Integral phase
4-3. The path of Group 3 Set 1 Subset 1 updates per OOVV integral
4-4. The LD shell distribution schemes
4-9. The data flow in the Stored Integral phase
5-10. The binary tree approach to collective communications
5-11. The local disk random read calibration data
5-12. The calibration data for on-node messaging parameters
5-13. The calibration data for off-node messaging parameters
5-14. The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast Ethernet


5-15. The simulated performance curves for DCEE* on 2 to 128 PEs over Fast Ethernet
5-16. The simulated performance curves for DCEE* on an SP switch
5-17. The relative performance of the initial parallelization attempt compared to the new algorithm
5-18. The absolute performance of the initial parallelization attempt compared to the new algorithm
5-19. The performance curves for serine on 16 to 1024 PEs
5-20. The performance curves for phenylalanine on 128 to 1024 PEs
D-1. The spin blocks in an AO shell pair block


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

A PARALLEL DIRECT OPEN-SHELL COUPLED-CLUSTER SINGLES AND DOUBLES ALGORITHM

By

Anthony D. Yau

December 2004

Chair: Rodney J. Bartlett
Major Department: Chemistry

Coupled-Cluster Singles and Doubles (CCSD) is a benchmark ab initio electron correlation method. The biggest factor preventing its widespread adoption for large molecular systems is that the theory scales like N^4 in data and N^6 in computation (N being the number of atomic orbital basis functions). Using simple factorization techniques together with data and task parallelization, a new approach to calculating open-shell CCSD energies is available. For a molecular system with 50 electrons (25 alpha, 25 beta) and 500 AO basis functions, this approach has the potential to reduce the storage requirement of a traditional UHF CCSD algorithm in a molecular orbital basis from 630 GB to 2.3 GB. The number of operations will increase, but the natural parallelization of the data provides a clear and concise framework for parallelizing the computations.


CHAPTER 1
INTRODUCTION

Coupled-Cluster (CC) theory is an ab initio electron correlation method that can be thought of as infinite-order perturbation theory or a size-extensive approximation to the full configuration interaction (CI) analysis [Bar95]. Given a one-particle reference wavefunction \Phi_0, the CC wavefunction is defined as an exponential expansion of excitation clusters, T, from the occupied orbitals to the unoccupied (virtual) orbitals. The T operator is a particle excitation operator that naturally truncates at the number of particles, P, in the system (Equation 1-2). Occupied orbitals (the hole space) are denoted with the letters i, j, k, and l, while virtual orbitals (the particle space) are denoted with the letters a, b, c, and d. Appendix A lists acronyms and abbreviations, and Appendix B summarizes index conventions and matrix elements.

    \Psi_{\mathrm{CC}} = \exp(T)\, \Phi_0    (1-1)

    T = \sum_{p=1}^{P} T_p ; \qquad T_p = \left(\frac{1}{p!}\right)^{2} \sum_{ijk\ldots,\, abc\ldots} t_{ijk\ldots}^{abc\ldots}\; a^{\dagger}\hat{i}\, b^{\dagger}\hat{j}\, c^{\dagger}\hat{k} \cdots    (1-2)

The difficulty in calculating the full CI is that the data requirement scales like N^{2P} and the computational requirement scales like N^{2P+2}, in which N is the size of the basis set space. To retain size extensivity and computational tractability, the excitation operator is usually truncated at a small number of excitations. The most common truncation is at double excitations, such that T = T_1 + T_2. As computers become more capable of massive calculations, the number of excitations can increase, but the size of the molecular system


must decrease much more rapidly. Currently, the highest hand-coded truncation is at pentuple excitations (i.e., T = T_1 + T_2 + ... + T_5) [Mus02], but efforts to automate this process are becoming more promising as distribution techniques for data and computations become available [Bau02].

Using the standard two-particle electronic Hamiltonian, the CC energy is calculated by contracting the elements of the Fock matrix (f_{pq}) and the two-electron Coulomb integrals (\langle pq|rs\rangle) with the T1 and T2 amplitudes. Uppercase orbital indices correspond to alpha spin orbitals, and lowercase orbital indices correspond to beta spin orbitals. The double-bar integrals in Equation 1-4 are antisymmetrized Coulomb-exchange integrals (Appendix B). (This derivation assumes that the reference orbitals are orthonormal.)

    E_{\mathrm{CC}} = \langle \Phi_0 | \hat{H} \exp(T) | \Phi_0 \rangle    (1-3)

    E_{\mathrm{CC}} = E_0 + \sum_{IA} f_{IA}\, t_I^A + \tfrac{1}{4}\sum_{IJAB} \langle IJ \| AB \rangle \left( t_{IJ}^{AB} + t_I^A t_J^B - t_I^B t_J^A \right) + \sum_{ia} f_{ia}\, t_i^a + \tfrac{1}{4}\sum_{ijab} \langle ij \| ab \rangle \left( t_{ij}^{ab} + t_i^a t_j^b - t_i^b t_j^a \right) + \sum_{IjAb} \langle Ij \| Ab \rangle \left( t_{Ij}^{Ab} + t_I^A t_j^b \right)    (1-4)

The T amplitudes are calculated by iterating over a coupled set of equations that project the Hamiltonian and wavefunction into a set of p-particle excitation spaces. The CCSD approximation truncates T at double excitations, and is one of the benchmark correlated methods in quantum chemistry today. The T1 and T2 equations come from the projections in Equations 1-6 and 1-7, in which \hat{H}_N is the normal-ordered Hamiltonian that does not include the reference energy. Only one spin case of each equation is shown; the remaining spin cases are analogous.

    \Delta E_{\mathrm{CC}} = \langle \Phi_0 | \exp(-T)\, \hat{H}_N\, \exp(T) | \Phi_0 \rangle    (1-5)


    0 = \langle \Phi_i^a | \exp(-T)\, \hat{H}_N\, \exp(T) | \Phi_0 \rangle    (1-6)

    0 = \langle \Phi_{ij}^{ab} | \exp(-T)\, \hat{H}_N\, \exp(T) | \Phi_0 \rangle    (1-7)

The Baker-Campbell-Hausdorff formula [Hal00] has the effect of reducing the T equations to those terms that 1) contract directly with the Hamiltonian and 2) have no more than four excitation amplitudes. The first effect is from disconnected terms being cancelled by the commutation, and the second effect is from the two-particle Hamiltonian having at most four indices from the two-electron integrals.

Computationally, the diagonal elements of the Fock matrix that contract with T are grouped together, and define the input amplitudes for the next iteration. The T2 equations are shown in Equations 1-8 through 1-11. If T is initialized to 0, then the next iteration will produce T_2 = \langle ab\|ij\rangle / D_2 and T_1 = 0, so for all intents and purposes, T2 can be initialized to the second-order wavefunction and T1 can be set to 0.

    0 = \sum_{c} f_{ac}\, t_{ij}^{cb} + \sum_{c} f_{bc}\, t_{ij}^{ac} - \sum_{k} f_{ki}\, t_{kj}^{ab} - \sum_{k} f_{kj}\, t_{ik}^{ab} + \cdots    (1-8)

    0 = \left( f_{aa} + f_{bb} - f_{ii} - f_{jj} \right) t_{ij}^{ab} + \sum_{c \neq a} f_{ac}\, t_{ij}^{cb} + \sum_{c \neq b} f_{bc}\, t_{ij}^{ac} - \sum_{k \neq i} f_{ki}\, t_{kj}^{ab} - \sum_{k \neq j} f_{kj}\, t_{ik}^{ab} + \cdots    (1-9)

    D_{ij}^{ab} = f_{ii} + f_{jj} - f_{aa} - f_{bb}    (1-10)

    D_{ij}^{ab}\, t_{ij}^{ab} = \sum_{c \neq a} f_{ac}\, t_{ij}^{cb} + \sum_{c \neq b} f_{bc}\, t_{ij}^{ac} - \sum_{k \neq i} f_{ki}\, t_{kj}^{ab} - \sum_{k \neq j} f_{kj}\, t_{ik}^{ab} + \cdots    (1-11)

Since the T amplitudes differentiate between occupied and virtual orbitals, it is common to group integrals into the various space combinations such as occupied-occupied (OO), occupied-virtual (OV), and virtual-virtual (VV). There are five such combinations for the two-electron integrals: OOOO, OOOV, OOVV, OVVV, and VVVV. If the virtual orbitals outnumber the occupied orbitals by a factor of 3, then the


VVVV and AO integrals will outnumber the OOOO integrals by factors of 81 and 256, respectively. Most CC implementations calculate and store the OOOO, OOOV, and usually OOVV integrals before starting the CC iterations. Due to programming complexity, most implementations also store the OVVV integrals.

The VVVV integrals are used in only one CCSD term, and how they are processed illustrates a fundamental technique that the parallel CCSD algorithm uses in almost every contraction. The data requirement scales like N^4 and the computational requirement scales like N^6, and this term is usually the rate-determining step of the CC iteration. The standard MO expression is shown in Equation 1-12. The "+=" operator is shorthand for "incremented by" (i.e., "x+=y" means "x=x+y"), and the r amplitude is the residual that collects all of the contributions before dividing by the D denominator and producing the next set of T amplitudes.

    r_{ij}^{ab} \mathrel{+}= \tfrac{1}{2} \sum_{cd} \langle ab \| cd \rangle\, t_{ij}^{cd}    (1-12)

Forward transforming T into the AO basis allows the AO integrals to contribute directly to the VVVV terms. Conceptually, the AO coefficients from the reference wavefunction are "removed" from the integral and "attached" to the amplitude. The primes on the AO indices in T denote a forward transformation. This distinction must be made since another useful transformation (a backward transformation) has the same dimensionality, but the AO overlap matrix is used in the contraction in addition to the AO coefficients.

    r_{ij}^{ab} \mathrel{+}= \tfrac{1}{2} \sum_{\mu\nu\lambda\sigma} c_{\mu}^{a} c_{\nu}^{b} \langle \mu\nu \| \lambda\sigma \rangle \sum_{cd} c_{\lambda}^{c} c_{\sigma}^{d}\, t_{ij}^{cd}    (1-13)

    r_{ij}^{ab} \mathrel{+}= \tfrac{1}{2} \sum_{\mu\nu} c_{\mu}^{a} c_{\nu}^{b} \sum_{\lambda'\sigma'} \langle \mu\nu \| \lambda'\sigma' \rangle\, t_{ij}^{\lambda'\sigma'}    (1-14)


While this transformation removes the need to store the VVVV MO integrals, it increases the computational workload substantially. In fact, since each integral contributes to each amplitude, the R2 array would be updated N^4 times! However, implementing this technique presents a more optimistic view of calculating large systems. The R2 amplitudes are back-transformed from MOs to AOs before being incremented by the AO integral terms. Once the integrals have been processed, the R2 array is forward-transformed into the MO basis.

    \bar{r}_{ij}^{\mu\nu} = \sum_{cd} \bar{c}_{\mu}^{c}\, \bar{c}_{\nu}^{d}\, r_{ij}^{cd} ; \qquad \bar{c}_{\mu}^{c} \equiv \sum_{\nu} S_{\mu\nu} c_{\nu}^{c}    (1-15)

    \bar{r}_{ij}^{\mu\nu} \mathrel{+}= \tfrac{1}{2} \sum_{\lambda'\sigma'} \langle \mu\nu \| \lambda'\sigma' \rangle\, t_{ij}^{\lambda'\sigma'}    (1-16)

    r_{ij}^{ab} \mathrel{+}= \sum_{\mu\nu} c_{\mu}^{a} c_{\nu}^{b}\, \bar{r}_{ij}^{\mu\nu}    (1-17)

The method of calculating amplitude contributions in the AO basis instead of the virtual MO basis has the potential to reduce the amount of storage space needed to calculate the aforementioned VVVV contribution. This is particularly true for open-shell systems, which need to store all three spin combinations of the VVVV integrals compared to storing one set of AO integrals. If direct integral generation is used, then the storage requirement for this term is completely removed. In either the AO or MO scenario, parallelization involves distributing the data and operations of a large matrix multiplication over the processing elements.
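As an illustration of the cycle in Equations 1-15 through 1-17, the sketch below back-transforms one (i,j) block of R2 to the AO basis, adds a precontracted AO-integral increment, and forward-transforms the result with BLAS DGEMM calls. It assumes orthonormal MOs; the routine and array names are assumptions made for this sketch, not the ACES II interface.

    ! Illustrative only: one (i,j) pair of the AO-basis ladder update.
    ! s     - AO overlap matrix                     (nao x nao)
    ! cv    - virtual MO coefficients               (nao x nvrt)
    ! vao   - precontracted AO increment of Eq 1-16 (nao x nao)
    ! rmo   - R2 block for this (i,j) pair          (nvrt x nvrt), updated in place
    subroutine ao_ladder_update(nao, nvrt, s, cv, vao, rmo)
      implicit none
      integer, intent(in) :: nao, nvrt
      real(8), intent(in) :: s(nao,nao), cv(nao,nvrt), vao(nao,nao)
      real(8), intent(inout) :: rmo(nvrt,nvrt)
      real(8), allocatable :: cbar(:,:), tmp1(:,:), rao(:,:), tmp2(:,:)
      allocate (cbar(nao,nvrt), tmp1(nao,nvrt), rao(nao,nao), tmp2(nvrt,nao))
      ! Eq 1-15: dress the coefficients with the overlap, cbar = S*c
      call dgemm('N','N', nao, nvrt, nao, 1.0d0, s, nao, cv, nao, 0.0d0, cbar, nao)
      ! Eq 1-15: back-transform the residual block, rao = cbar * rmo * cbar^T
      call dgemm('N','N', nao, nvrt, nvrt, 1.0d0, cbar, nao, rmo, nvrt, 0.0d0, tmp1, nao)
      call dgemm('N','T', nao, nao, nvrt, 1.0d0, tmp1, nao, cbar, nao, 0.0d0, rao, nao)
      ! Eq 1-16: increment by the AO-integral term (already contracted with
      ! the forward-transformed T2 outside this sketch)
      rao = rao + vao
      ! Eq 1-17: forward-transform back to the virtual MO basis, rmo = c^T * rao * c
      call dgemm('T','N', nvrt, nao, nao, 1.0d0, cv, nao, rao, nao, 0.0d0, tmp2, nvrt)
      call dgemm('N','N', nvrt, nvrt, nao, 1.0d0, tmp2, nvrt, cv, nao, 0.0d0, rmo, nvrt)
      deallocate (cbar, tmp1, rao, tmp2)
    end subroutine ao_ladder_update

Because c^T S c = 1 for orthonormal MOs, the final forward transformation recovers the original MO-basis residual plus the transformed AO increments.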


CHAPTER 2
PARALLELIZING A SERIAL CCSD PROGRAM

Wide-scale reformulation of computational algorithms should only be performed if currently existing formulations are inadequate, but determining whether an algorithm is inadequate depends heavily on its intended use and the computers that are meant to execute it. Any of a number of runtime parameters could contribute to low performance: compiler efficiency; net CPU speed; memory or disk storage requirements and performance; parallel scalability and efficiency; etc. In the case of general computational chemistry algorithms, how the number of operations scales with respect to the size of the chemical system will also determine whether an algorithm should be reformulated.

The ACES II Program System [ACES] is one of the more popular software packages that calculates CCSD electron correlation energies. Although it was written to make use of CPU vectorization, parallelization is considered to be one of the most important ways to decrease the run time on modern supercomputers. Traditional methods for parallelizing existing software begin with execution profiling. Once the main bottlenecks are identified and analyzed, parallel formulations of those sections are implemented and profiled.

Serial Profile

Profiling the CC code in ACES II was performed with four test systems:

    Alanine ("ala"; C3H7NO2; 48 e-) in cc-pVDZ (119 AO basis functions)
    Octane ("oct"; C8H18; 66 e-) in cc-pVDZ (202 AOs)
    β,β'-dichloroethyl ether ("DCEE"; C4H8Cl2O; 74 e-) in cc-pVDZ (146 AOs)
    DCEE radical ("DCEE*"; C4H7Cl2O; 73 e-) in cc-pVDZ (141 AOs)


The first three systems had an RHF reference, the fourth had UHF, and no spatial point group symmetry was used for any system. All of the systems used the AO basis formulation for the VVVV*T2 contribution, which was described in Chapter 1. The profiled code used the C system function gettimeofday() to record the real-time clock before and after large code sections. Any subroutine (or group of related subroutines) that consumed more than 3% of the real time became a potential target for parallelization.

Table 2-1 lists the average percentage of the total real time for each high-level subroutine measured across the three closed-shell test systems (RHF column), and it lists the percentage for each routine in the DCEE* test system (UHF column). Routines that contributed more than 3% are in bold. The maximum variance of any set of percentages for the closed-shell systems was 0.004, which suggests that the three profiles have a consistent and reliable distribution of times. The table entry for "~AOLAD" corresponds to the time spent in AOLAD that is not due to RDAO, even though the latter is a dependent of the former. Figure 2-1 shows the amount of the total time consumed by the major subroutines and the next most significant routine.

[Figure 2-1: bar chart of the RHF and UHF real-time fractions (0% to 70%) for CNTRCT, T1W1AB, RDAO, RNGDRV, and the next most significant routine.]
Figure 2-1. The distribution of real time over primary CCSD subroutines.
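The wall-clock instrumentation itself is simple to reproduce. The sketch below is an illustrative Fortran wrapper around the same C gettimeofday() call mentioned above; the module and routine names are hypothetical and are not the instrumentation used in ACES II.

    module walltime_mod
      use iso_c_binding, only: c_int, c_long, c_ptr, c_null_ptr
      implicit none
      ! Mirrors struct timeval from <sys/time.h>.
      type, bind(c) :: timeval_t
        integer(c_long) :: tv_sec
        integer(c_long) :: tv_usec
      end type timeval_t
      interface
        function gettimeofday(tv, tz) bind(c, name="gettimeofday") result(ierr)
          import :: timeval_t, c_ptr, c_int
          type(timeval_t), intent(out) :: tv
          type(c_ptr), value :: tz
          integer(c_int) :: ierr
        end function gettimeofday
      end interface
    contains
      ! Seconds (microsecond resolution) since the epoch; call before and
      ! after a code section and subtract to get the real time spent in it.
      function wall_seconds() result(t)
        real(8) :: t
        type(timeval_t) :: tv
        integer(c_int) :: ierr
        ierr = gettimeofday(tv, c_null_ptr)
        t = real(tv%tv_sec, 8) + 1.0d-6*real(tv%tv_usec, 8)
      end function wall_seconds
    end module walltime_mod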


Table 2-1. The distribution of real time over high-level CCSD subroutines

    Routine    RHF    UHF         Routine    RHF    UHF
    ZERSYM     0.2%   0.1%        T1INT1     0.4%   0.2%
    T1RAAAA    1.4%   0.6%        LADAA      0.0%   0.1%
    T1RABBA    0.8%   0.6%        LADAB      1.2%   0.5%
    T1RABAB    0.8%   0.6%        AOLADLST   0.0%   0.0%
    CNTRCT     7.7%   9.8%        T2TOAO     0.7%   0.3%
    SUMSYM     0.1%   0.2%        RDAO       58.3%  65.1%
    F2TAU      0.0%   0.0%        ~AOLAD     0.6%   0.3%
    GETLST     1.1%   0.8%        Z2TOMO     0.9%   0.7%
    QUAD1      1.3%   0.5%        ZIAAO      0.1%   0.2%
    QUAD2      0.5%   0.3%        T12INT2    0.8%   1.1%
    QUAD3      0.3%   0.2%        RNGDRV     9.4%   11.0%
    MAKFME     0.1%   0.1%        SST0xI     0.0%   0.0%
    T1W1AA     0.0%   0.6%        SUMRNG     0.2%   0.1%
    T1W1AB     3.9%   2.1%        ERNGAA     0.0%   0.0%
    T1INW2     0.1%   0.0%        ERNGAB     0.5%   0.1%
    FEACONT    0.7%   0.4%        SUMSYM     0.2%   0.0%
    FMICONT    0.3%   0.3%        E4S        0.6%   0.2%
    FMECONT    0.5%   0.2%        NEWT2      0.3%   0.2%
    T1INT2A    0.3%   0.2%        PHASE3     0.6%   0.2%
    T1INT2B    0.8%   0.5%        DIIS       2.6%   1.0%

It is clear from the table that four routines consume most of the execution time: CNTRCT, T1W1AB, RDAO, and RNGDRV. Together, they cost 79.3% and 88.1% for RHF and UHF, respectively. Figure 2-2 shows the four major subroutines and any individual routine that consumes more than 1% from the DCEE* profile.

[Figure 2-2: pie charts of the UHF profile with segments for RDAO, CNTRCT, T1W1AB, RNGDRV, T12INT2, DIIS, and "else".]
Figure 2-2. The breakdown of significant UHF subroutines. Primary values are grouped on the left, secondary values (greater than 1%) are grouped on the right.


Bottleneck Categorization and Parallelization

Particle-Particle Ladders

The single most time consuming routine in either RHF or UHF calculations is RDAO, which increments T2 by contributions from the VVVV integrals in the AO basis. Integral-stored schemes, such as the current implementation in ACES II, read the AO integrals into memory from disk, multiply them by the T2 amplitudes from the previous iteration, and add the product to the T2 residual amplitudes. Neglecting integral sparsity, the number of permutationally unique AO integrals is shown in Equation 2-1. If there are at least 100 AO basis functions, then the number of unique integrals can be approximated by N_{AO}^4/8 with more than 98% accuracy; with 200 or more functions the approximation is more than 99% accurate.

    N_{\mathrm{Ints}} = \tfrac{1}{2} P (P+1) ; \qquad P = \tfrac{1}{2} N_{AO} (N_{AO}+1)    (2-1)

Many AO-based implementations have a user-defined threshold for discarding AO integrals that would not have a significant contribution to the T2 residual. The integral program in ACES II stores the integrals in one long vector and discards individual integrals that are less than the threshold. For DCEE* and a threshold of 10^-14, ACES II stored almost 41.1 million integrals from a complete set of 50.1 million, which means the two-electron integrals were 18.0% sparse.

The RDAO algorithm reads in a batch of integrals and processes the integrals individually. For a particular integral, the value might contribute 1, 2, 4, or 8 times depending on the permutational symmetry of the orbitals. The RHF case only has 1, 2, or 4 contributions and post-processes the residual for the final permutation. The general structure of the algorithm can be described by the following:


    do while (more batches)
       read batch
       do for each integral <mu nu|lam sig>
          do for each permutation
             do for all OO
                R2(OO,mu nu) += T2(OO,lam sig) * <mu nu|lam sig>
             end do
          end do
       end do
    end do

This algorithm is not particularly efficient since each permutation requires a vector-vector operation (R2(*,mu nu) += V * T2(*,lam sig)). Experience in computational algorithms suggests that matrix-matrix operations (using integral blocks) would have much higher computational throughput. For example, the RDAO update averaged 286.7 MFLOPS out of a theoretical maximum of 1500 MFLOPS on the 375 MHz IBM POWER3 CPU. A 600x600 matrix multiplication using the ESSL library (a tuned linear algebra library for IBM architectures) averages 1278 MFLOPS, which corresponds to a computational efficiency of 85.2% compared to 19.1% for RDAO. However, parallel algorithms for block-wise implementations are virtually identical to element-wise implementations, so block optimizations were ignored in the parallel strategy and could be implemented later.

For any DCEE* CC iteration, 9.6 seconds out of 5952 seconds were spent reading integrals from disk (i.e., I/O takes about 0.16% of the total RDAO time). Before the multiplication loop, there was a small sorting section, which took close to 13.6 seconds (also insignificant); therefore, parallelizing the actual integral updates was expected to have the largest impact on performance.

There are multiple levels at which the current algorithm can be parallelized: batches, integrals, permutations, or rows of OO pairs. Distributing the operations over OO pairs might decrease performance since the pairs are contiguous in memory, and the


cost of adding more rows is much less than the cost of adding more loops to process them. The problem with distributing permutations is that the number of permutations for a random integral is not always 8. This might introduce load-balancing problems depending on how the integrals are sorted on disk; furthermore, each integral has a limited number of tasks (permutations), which might limit overall scalability above 8 processors.

The two remaining loops (unique integrals and integral batches) are ideal for distributing over processors. The number of integrals is substantial enough to support scalability over a wide number of processors, and each processor will need to store only those integrals for which it is assigned to calculate contributions. The current ACES II implementation expands the permutations into the "unique" integral list and processes each one separately, so this expanded list was distributed over processors instead of the truly unique list.

For maximum software portability, the parallel algorithm needed to distinguish between processors on the same physical computer and processors on remote computers. The strategy was to distribute integral batches over physical computers (nodes) and to distribute the integral list from each batch over processors on the same computer. After the nodes finished processing their batches of integrals, the R2 contributions were reduced and distributed back to all of the nodes. To minimize memory requirements on a single node, only one integral list, T2 array, and R2 array were stored, and CPU threads were created to process the integrals. The R2 array was divided into blocks of columns (AO pairs), and each CPU was assigned to calculate all of the terms that contributed to a particular set of blocks. This technique reduced false cache conflicts (when two CPUs


attempt to write to different parts of the same cache line at the same time), and it increased the probability that a CPU would have a target R2 column in cache memory, since multiple CPUs would be caching different columns. Figure 2-3 shows the CPU/column assignments in a 4-CPU computer with 4 columns per block.

[Figure 2-3: the R2 columns laid out in blocks of 4, cycled over CPUs 0 through 3.]
Figure 2-3. The column assignments for each of 4 CPUs updating R2. In this example, there are 49 AO pairs (from 7 AOs).

For an arbitrary blocking factor (bf) and number of threads (Nth), the thread assignment (from 0) of a column index (mn) can be determined with the following FORTRAN function:

    thread_id(mn,bf,Nth) = mod(mn-1,bf*Nth)/bf

If bf*Nth is a power of 2, then the mod function can be replaced by a bit-wise AND with the mask (bf*Nth)-1, and if the blocking factor is a power of 2, then the division can be replaced by bit shifting.

Matrix Multiplication

If the VVVV contributions are calculated in the MO basis, then LADAA and LADAB become the bottleneck routines. These perform large two-dimensional matrix multiplies, as do the other three bottlenecks: CNTRCT, T1W1AB, and RNGDRV.
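Both the RDAO column blocks above and the threaded matrix multiplies described next assign columns of the output array to CPU threads. The sketch below shows the ownership test with OpenMP; the variable names and the flattened integral-list layout are assumptions made for this sketch, not the ACES II data structures.

    ! Illustrative only: every thread scans the same expanded integral list,
    ! but applies an update only when it owns the target column block.
    subroutine update_r2_threaded(noo, ncol, nint, bf, t2, r2, val, kcol, bcol)
      use omp_lib
      implicit none
      integer, intent(in) :: noo, ncol, nint, bf
      real(8), intent(in) :: t2(noo,ncol), val(nint)
      integer, intent(in) :: kcol(nint), bcol(nint)   ! ket and bra AO-pair columns
      real(8), intent(inout) :: r2(noo,ncol)
      integer :: n, mn, me, nth
    !$omp parallel private(n, mn, me, nth)
      me  = omp_get_thread_num()
      nth = omp_get_num_threads()
      do n = 1, nint
         mn = bcol(n)                          ! column of R2 touched by this update
         if (mod(mn-1, bf*nth)/bf == me) then  ! the thread_id test from the text
            r2(:,mn) = r2(:,mn) + val(n)*t2(:,kcol(n))
         end if
      end do
    !$omp end parallel
    end subroutine update_r2_threaded

Because each column block is written by exactly one thread, no locking is needed, and the blocked cycling keeps each thread reusing a small working set of R2 columns.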


Incidentally, the routine T12INT2 (1.1% of the UHF time) is also bound by matrix multiplications. Parallelizing these routines involved distributing the columns of the output matrix over CPU threads. Since the input and output matrices were all kept in memory, the columns were divided once, unlike block cycling in the RDAO scheme, which had irregular data access patterns. Figure 2-4 shows the distribution pattern for a matrix multiply over 4 threads.

[Figure 2-4: C = A x B with the columns of B and C divided into 4 thread blocks.]
Figure 2-4. The distribution scheme for matrix multiplies over CPU threads. Every thread can read the entire A matrix, but each thread is assigned a block of the B matrix and consequently writes to a unique block of the C matrix.

The multiplications were not distributed over nodes since the parallel computer that ran the profile had 16 CPUs per node. The time penalty incurred from the global reduction (after the multiplication) would have greatly lessened the gains from parallelization; however, nodes with fewer CPUs might benefit from distributing the rows of the A matrix. All nodes but one would then have to initialize the C matrix to 0 (if the operation increments pre-existing data), and the results would have to be reduced and broadcast after the multiplication.

I/O-Dependent Routines

None of the major subroutines that required parallelization were heavily dependent on I/O operations; however, many parallel computers (including the one that ran the profiling calculations) have shared disk storage devices. This introduces competition between some or all of the compute nodes when each one performs file read or write


operations. The DIIS routine (1% of the UHF time) is heavily dependent on I/O performance, and the rate-determining operation can be described by the following:

    do for each i
       read Tall(*,i)
       Tall(row,i) = T(i)
       do for each (x,y)
          B(x,y) += Tall(x,i)*Tall(y,i)
       end do
       write Tall(*,i)
    end do

The number of rows in Tall (and the size of each dimension of B) is usually 5 but could be any value (preferably a small one). The i index spans all of the T amplitudes, both T1 and T2. Obviously the construction of a 5x5 matrix from a single 5x1 vector is negligible compared to the time spent reading from and writing to disk. This operation was not parallelized, but only one node was permitted to perform the operation.

I/O competition is a low-level problem that shared storage architectures have. In many cases, extensive software changes are required to overcome the increasing penalties from read and write delays that serial programs consider to be constant. These penalties manifest themselves in almost every routine that uses disk storage in the parallel profile.

Parallel Profile

The DCEE* test case was run with the ACES II CCSD program that had parallel modifications as described in the previous section. The parallel performance of the major subroutines is shown in Figure 2-5. The real time spent in each high-level routine is shown in Figure 2-6. Only RDAO was distributed over every CPU in the calculation and was 20.6% efficient in the 256-CPU (16x16) calculation. The three other high-level routines were only distributed over the 16 CPUs in each node and had average threading efficiencies of 70.1% (CNTRCT), 25.4% (T1W1AB), and 28.8% (RNGDRV).


[Figure 2-5: speed-up (0 to 60) of CNTRCT, T1W1AB, RDAO, (AOLAD), RNGDRV, and the CCSD executable for the 1x1, 2x16, 4x16, 8x16, and 16x16 runs.]
Figure 2-5. The parallel speedup for each major UHF CCSD subroutine and the CCSD executable. (AOLAD) contains the RDAO parallelization and the global reduction afterward.

[Figure 2-6: real time (0 to 7200 seconds) of every high-level routine, from ZERSYM through DIIS, for the 1x1, 2x16, 4x16, 8x16, and 16x16 runs.]
Figure 2-6. The runtime profile of parallel CCSD with a UHF reference from DCEE*. The times are measured in seconds.


The overall parallel efficiency of the UHF CCSD calculation on 256 CPUs was 1.8% with a speed-up of 4.62. The calculation with the highest speed-up was 4x16 (64) CPUs with an efficiency of 8.8%. The reason for such poor performance can be attributed to I/O contention by all of the routines that were insignificant in the serial calculation. Most of the routines listed in Figure 2-6 increase in time even though they had no parallel modifications.

While further parallel optimizations could be made to the current CCSD program, two major drawbacks would still plague the parallel code: problem scalability and expensive modify-profile development cycles (i.e., trial and error). The worst aspect of the current approach is that tractability does not increase as CPUs, memory, and disks are added to the pool of resources (i.e., the largest molecule that can be studied is limited by the resources of a single node). There is no mechanism for distributing data across memory or disks; however, if the current limitations are adequate, then this model should be able to reduce the runtime to a reasonable amount.

Parallelizing existing serial software usually takes the bottom-up approach, in which significant bottlenecks are parallelized and profiled. As new bottlenecks present themselves, they too are parallelized and profiled. Unfortunately, by the time this cycle ends (or becomes too expensive), so much of the software might have been rewritten that it would have been cheaper to reformulate the algorithm using a top-down parallelization approach. In the current CCSD program, no matter how well the algorithm is parallelized, the software is still limited to performing calculations on systems that are calculable by a single node. It would be more efficient in the long term to reformulate the CCSD algorithm from the top down (reusing existing software wherever possible) and to include proper data distribution and small task parallelization as design fundamentals.


Top-Down Parallelization Considerations

Although a few programs already calculate the CCSD energy in parallel, most of them are limited to closed-shell references [Koc96, Kob97, Per03]. The exception is the general tensor contractor of Hirata [Hir03], but the final program requires stored integrals. The goal of the top-down parallelization approach will be to produce an open-shell CCSD program capable of handling molecular systems with at least 500 AO basis functions (although the internal target is 1000) and that can scale up to a few hundred (if not a thousand) CPUs. These constraints almost necessitate calculating amplitude contributions from AO integrals and distributing amplitudes across CPUs. In addition to the VVVV integrals, the algorithm should also allow for direct generation of the VVVO integrals.

A consequence of calculating amplitude contributions in the AO basis instead of the MO basis is that the virtual orbitals are never formally recognized by the equations other than to create complementary density matrices such as that shown in Equation 2-2. These matrices are very important since they define how amplitudes transition from backward- to forward-transformed relative to the MOs.

    Q_{\mu\nu} = \sum_{a} c_{\mu}^{a} c_{\nu}^{a}    (2-2)

The first step of the top-down approach involves cataloging the integral contributions and identifying which output condition on the R amplitudes (forward- or backward-transformed) would be best suited for their additions. Finding a data and task distribution then becomes a matter of understanding how the integrals are processed and which terms can be grouped together given a common output condition on the residual.


CHAPTER 3
SPIN-UNRESTRICTED CCSD EQUATIONS

The T1 and T2 equations in spin-restricted Hartree-Fock (HF) CCSD have 11 and 29 contributions, respectively. If the reference wavefunction is not HF but the spin orbitals are still restricted, then the T1 and T2 equations have 14 and 31 contributions, respectively. The spin-unrestricted non-HF CCSD amplitude equations have a total of 222 contributions. Table 3-1 lists the spin-restricted T1 amplitude contributions and the number of corresponding spin-unrestricted terms. Table 3-2 does the same for the T2 contributions [Cra98]. \hat{P}(pq) is the antisymmetric permutation operator (Appendix B).

There are many ways to group and factorize the amplitude contributions. The efficiency of each approach largely depends on the following:

    The chemical system being examined
    The CPU speed and efficiency
    The amount of local memory per process
    Local and remote (shared or exclusive) storage capacities
    Local and remote I/O latencies and bandwidths
    Interprocess message latency and bandwidth.

For example, if the CPU speed were incredibly slow compared to the speed of the I/O reads and writes, then it might be faster to group as many subterms together as possible and accumulate the intermediate product to disk. Consider the sum X*A*B + X*C*D + X*E*F, which factorizes to X*I provided I = A*B + C*D + E*F (first accumulated on disk). In most cases, however, millions (if not billions) of CPU operations are performed before a single byte of data is retrieved from disk, so this approach is relegated to those systems whose CPU speeds are unusually slow compared to other hardware capabilities.


Table 3-1. The RHF CCSD T1 contributions

    Contribution                                                              Number of UHF terms
    \tfrac{1}{2}\sum_{kcd} \langle ka\|cd\rangle\, t_{ki}^{cd}                 4
    -\tfrac{1}{2}\sum_{klc} \langle kl\|ci\rangle\, t_{kl}^{ca}                4
    f_{ai}                                                                     2
    \sum_{kc} \langle ka\|ci\rangle\, t_{k}^{c}                                4
    \sum_{c} f_{ac}\, t_{i}^{c}                                                2
    \sum_{kcd} \langle ka\|cd\rangle\, t_{k}^{c} t_{i}^{d}                     4
    -\tfrac{1}{2}\sum_{klcd} \langle kl\|cd\rangle\, t_{kl}^{ca} t_{i}^{d}     4
    -\sum_{k} f_{ki}\, t_{k}^{a}                                               2
    -\sum_{klc} \langle kl\|ci\rangle\, t_{k}^{c} t_{l}^{a}                    4
    -\tfrac{1}{2}\sum_{klcd} \langle kl\|cd\rangle\, t_{ki}^{cd} t_{l}^{a}     4
    -\sum_{kc} f_{kc}\, t_{i}^{c} t_{k}^{a}                                    2
    -\sum_{klcd} \langle kl\|cd\rangle\, t_{k}^{c} t_{i}^{d} t_{l}^{a}         4
    \sum_{kc} f_{kc}\, t_{ik}^{ac}                                             4
    \sum_{klcd} \langle kl\|cd\rangle\, t_{k}^{c} t_{li}^{da}                  8

The most common factorization strategy involves searching all possible combinations of terms from either amplitude equation and finding the set of subterms that is used most frequently. This implicitly assumes that the cost of recomputing the data is greater than the cost of storing and retrieving the data from disk. In the MO basis, this is certainly true (i.e., the amount of computation needed to generate all of the MO integrals from AO integrals is quite high), but the cost of calculating AO integrals on demand is negligible when multiple processes are in competition for shared I/O resources.


20 Table 3-2.The RHF CCSD T2 contributions Contribution Diagram(s) Number of UHF terms ij ab 3 c c it cj ab ij P ) ( ˆ 4 cd d j c it t cd ab ij P2 1) ( ˆ 3 k a kt ij kb ab P ) ( ˆ 4 kc c j a kt t ic kb ab P ij P ) ( ˆ ) ( ˆ 4 kcd d j a k c it t t cd kb ab P ij P2 1) ( ˆ ) ( ˆ 4 cd cd ijt cd ab2 1 3 kcd cd ij a kt t cd kb ab P2 1) ( ˆ 4 kc ac ikt cj kb ab P ij P ) ( ˆ ) ( ˆ or 10 kcd bc jk d it t dc ak ab P ij P ) ( ˆ ) ( ˆ or 10 kl b l a kt t ij kl ab P2 1) ( ˆ 3 kl ab klt ij kl2 1 3 klc b l a k c it t t cj kl ab P ij P2 1) ( ˆ ) ( ˆ 4 klc ab kl c it t cj kl ij P2 1) ( ˆ 4 klc bc jk a lt t ic kl ab P ij P ) ( ˆ ) ( ˆ or 10 klcd ad kj b l c it t t cd kl ab P ij P ) ( ˆ ) ( ˆ or 10 klcd db lj ac ikt t cd kl ab P ij P ) ( ˆ ) ( ˆ2 1 or 11 klcd ab kl cd ijt t cd kl4 1 3 klcd ab kl d j c it t t cd kl ij P4 1) ( ˆ 3 klcd cd ij b l a kt t t cd kl ab P4 1) ( ˆ 3 klcd b l d j a k c it t t t cd kl ab P ij P4 1) ( ˆ ) ( ˆ 3


Table 3-2. Continued

Contribution Diagram(s) Number of UHF terms k ab ik kjt f ij P ) ( ˆ 4 kc ab jk c i kct t f ij P ) ( ˆ 4 klc ab lj c kt t ci kl ij P ) ( ˆ 8 klcd ab lj d i c kt t t cd kl ij P ) ( ˆ 8 klcd cd jl ab ikt t cd kl ij P2 1) ( ˆ 8 c ac ij bct f ab P ) ( ˆ 4 kc bc ij a k kct t f ab P ) ( ˆ 4 kcd db ij c kt t cd ka ab P ) ( ˆ 8 klcd db ij a l c kt t t cd kl ab P ) ( ˆ 8 klcd bd kl ac ijt t cd kl ab P2 1) ( ˆ 8

The following assumptions guided the development of the factorization strategy described in this chapter:

    There are at least 3 times as many virtual orbitals as occupied orbitals for each spin
    The two-particle data are too large to store in the memory of one process
    The VVVV and VVVO integrals are not transformed and stored on disk
    The CPU operations are several orders of magnitude faster than I/O latencies
    Latency is at least an order of magnitude more significant than bandwidth
    Each process is able to retrieve (GET) and increment (ACCUMULATE) data that are stored by another process without interrupting the other process.

The first assumption follows from a study of basis sets that noted that cc-pVTZ is the minimum basis set needed to describe electron correlation accurately [Dun00]. For chlorine, there are 17 electrons and 34 contracted basis functions. This produces an almost 1:3 occupied/virtual split in molecular orbitals, since half of the electrons have alpha spin and half have beta spin.
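Spelling out the ratio behind the "almost 1:3" claim (the 9 alpha / 8 beta occupation for neutral chlorine is an assumption of this restatement, not a number given in the text):

    \text{alpha: } \frac{N_{\mathrm{virt}}}{N_{\mathrm{occ}}} = \frac{34-9}{9} \approx 2.8 , \qquad \text{beta: } \frac{34-8}{8} \approx 3.3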


Designing the algorithm such that two-particle terms (viz. VVOO) can be distributed over PEs is necessary because the parallel program should be capable of treating systems with 50 electrons and 500 basis functions (50/500), although 100 electrons and 1000 basis functions is the target (100/1000). For 50/500 systems, the three VVOO spin combinations require 1.55 GB of memory. Considering the algorithm needs at least two (for T2 and R2) held in core, a processor with 1 GB of memory will not be able to store the minimum amount of data without overflowing to disk storage. Furthermore, the VVVO and VVVV terms will require 59.8 and 568 GB of storage, respectively. (This assumes a conversion factor of 1024.) The parallel algorithm should then at least allow for the possibility that the terms that use these MO integrals can be calculated from AO integrals generated on the fly.

Today, the fastest disk storage devices are able to seek for random data in approximately 3.7 milliseconds (measured on a Maxtor Atlas 15k 73 GB Ultra320 SCSI hard drive). The slowest desktop CPU (to within reasonable expectations of performance) operates at 1 GHz. (The Intel Pentium 4 line of CPUs can operate at 3.4 GHz.) Assuming that most floating-point operations in CCSD require multiplications and additions and that each operation completes in one clock cycle, the maximum theoretical multiply/add (fma) throughput is somewhere around 500 million fma/s. Together, the data indicate that a slow CPU could theoretically deliver 1.85 million fma/seek, which at least supports the assumption that computation is several orders of magnitude faster than I/O latency. On the same hard drive, bandwidth bursts measure 73.7 MB/s, which suggests that 280 kB of data could be transferred during the same time it took to locate the data.
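The quoted figures follow from simple arithmetic. The check below uses the 1024-based conversion mentioned above; the packed-storage counting of the same-spin VVOO blocks is an assumption about how the 1.55 GB number was obtained, not a statement from the original text.

    \text{fma per seek} \approx (3.7\times10^{-3}\,\mathrm{s}) \times (500\times10^{6}\,\mathrm{fma/s}) \approx 1.85\times10^{6}

    \text{data per seek time} \approx (73.7\,\mathrm{MB/s}) \times (3.7\times10^{-3}\,\mathrm{s}) \approx 0.27\,\mathrm{MB} \approx 280\,\mathrm{kB}

    \text{VVOO storage (50/500)} \approx \left[ 2\binom{475}{2}\binom{25}{2} + 475^{2}\cdot 25^{2} \right] \times 8\,\mathrm{B} \approx 2.09\times10^{8} \times 8\,\mathrm{B} \approx 1.55\,\mathrm{GB}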


The final assumption follows from the latest version of the most popular message passing library MPI (Message Passing Interface) [Gro99]. MPI-2 allows PEs in the same communication network to retrieve (GET), increment (ACCUMULATE), and overwrite (PUT) data that are stored by another CPU without communicating with it. In addition to their own proprietary communication libraries, most vendors of parallel computers offer implementations of the standard message passing libraries (like MPI) to ensure maximum compatibility with the myriad parallel programs that depend on industry standards.

The main consequences of the aforementioned assumptions are that there must be more than one process, that recomputation is better than reading from disk, and that messages should be as large as possible. The primary goals of this parallelization effort are scalability and addressability (i.e., the ability to treat a wide range of molecular systems), so the grouping and factorization of terms should maximize message size and minimize message frequency while minimizing the local scratch memory requirements.

Group 1: T2 Virtual Orbitals Coming from V

One of the smallest groups of terms consists of T2 contributions in which both virtual orbitals come from the two-electron integrals. Equation 3-1 shows the ββ terms; the αα terms are analogous. The VVVV and VVVO integrals will be generated on the fly in the AO basis, so the expression must be written in terms of AO integrals. Before doing so, the second term must be split and rearranged to allow for better factorization.

    r_{ij}^{ab} \mathrel{+}= \langle ab\|ij\rangle + \hat{P}(ij)\sum_{c}\langle ab\|cj\rangle\, t_{i}^{c} + \tfrac{1}{2}\hat{P}(ij)\sum_{cd}\langle ab\|cd\rangle\, t_{i}^{c} t_{j}^{d} + \tfrac{1}{2}\sum_{cd}\langle ab\|cd\rangle\, t_{ij}^{cd}    (3-1)

    r_{ij}^{ab} \mathrel{+}= \tfrac{1}{2}\hat{P}(ij)\langle ab\|ij\rangle + \tfrac{1}{2}\hat{P}(ij)\sum_{c}\langle ab\|cj\rangle\, t_{i}^{c} + \tfrac{1}{2}\hat{P}(ij)\sum_{c}\langle ab\|ic\rangle\, t_{j}^{c} + \tfrac{1}{2}\hat{P}(ij)\sum_{cd}\langle ab\|cd\rangle\, t_{i}^{c} t_{j}^{d} + \tfrac{1}{2}\sum_{cd}\langle ab\|cd\rangle\, t_{ij}^{cd}    (3-2)


24 cd cd ij d c cd d j c i d c c c j c i c c i j c j i ab ijt c c ab t t c c ab ij P t c c ab ij P t c c ab ij P c c ab ij P r2 1 2 1 2 1 2 1 2 1) ( ˆ ) ( ˆ ) ( ˆ ) ( ˆ (3-3) ij j i j i j i j i ab ijt ab t t t c c t c c ab ij P r2 1 2 1) ( ˆ(3-4) ij j j i i b a ab ijt t c t c ij P c c r ) ( ˆ2 1(3-5) The terms are manipulated in the same fa shion, but do not re quire splitting any terms (Equations 3-6 through 3-8.) Cd Cd Ij Cd d j C I c c j C C I Ab Ijt Cd Ab t t Cd Ab t Ic Ab t Cj Ab Ij Ab r (3-6) Ij j I j I j I j I Ab Ijt Ab t t t c c t c c Ab r(3-7) Ij j j I I b A Ab Ijt t c t c c c r(3-8) Group 2: T2 Virtual Orbitals Coming from V and T Set 1: One Virtual from V, One Virtual from T1 This set of T2 contributions is very sim ilar to the Group 1 terms. The increment is shown (Equation 3-9). Since every term an tisymmetrizes the contraction with respect to ab, the permutation operator is remove d and the increment is identified as nonantisymmetric. The r ~ residual amplitudes (Appendix B) will be antisymmetrized with respect to both ab and AB virtual pairs at the end of the iteration. Splitting the second term, as in Equation 3-1, is also required.


25 kcd cd ij a k kcd d j c i a k kc c j a k k a k ab ijt t cd kb ab P t t t cd kb ij P ab P t t ic kb ij P ab P t ij kb ab P r ) ( ˆ ) ( ˆ ) ( ˆ ) ( ˆ ) ( ˆ ) ( ˆ2 1 2 1(3-9) kcd cd ij a k kcd d j c i a k kc c j a k kc c i a k k a k ab ijt t cd kb t t t cd kb ij P t t ic kb ij P t t cj kb ij P t ij kb ij P r2 1 2 1 2 1 2 1 2 1) ( ˆ ) ( ˆ ) ( ˆ ) ( ˆ ~ (3-10) k ij a k k j i j i j i j i a k ab ijt kb t t t t c c t c c kb t ij P r2 1 2 1) ( ˆ ~ (3-11) ij j j i i b a ab ijt t c t c ij P c t r ) ( ˆ ~ 2 1(3-12) The increments (Equation 3-13) are grouped by the bra-side T1 and then expressed in terms of AO integrals and factorized. kCd Cd Ij b k KCd Cd Ij A K kCd d j C I b k KCd d j C I A K kc c j b k Kc c j A K kC C I b k KC C I A K k b k K A K Ab Ijt t Cd Ak t t Cd Kb t t t Cd Ak t t t Cd Kb t t Ic Ak t t Ic Kb t t Cj Ak t t Cj Kb t Ij Ak t Ij Kb r (3-13) Cd Cd Ij A K KCd d j C I A K Kc c j A K KC C I A K K A K Ab Ijt Cd Kb t t t Cd Kb t t Ic Kb t t Cj Kb t Ij Kb t r(3-14a) kCd Cd Ij b k kCd d j C I b k kc c j b k kC C I b k k b k Ab Ijt Cd Ak t t t Cd Ak t t Ic Ak t t Cj Ak t Ij Ak t r(3-14b) Ij j j I I b A Ab Ijt t c t c c t r(3-15a) Ij j j I I b A Ab Ijt t c t c t c r(3-15b)


26 Set 2: One Virtual from V, One Virtual from T2 The set contains four T2 contributions that are ring contractions. Equations 3-16 through 3-18 show these terms and de rive their AO basis expressions. KCd d j aC iK KC aC iK kcd d j ac ik kc ac ik ab ijt t Cd Kb t Cj Kb ij P ab P t t cd kb t cj kb ij P ab P r ) ( ˆ ) ( ˆ ) ( ˆ ) ( ˆ(3-16) j a i b a i j b j a i b a i j b ab ijt t c t c c ij P t t c t c c ij P r ) ( ˆ ) ( ˆ ~ (3-17) j j a i a i b ab ijt c t t c ij P r ) ( ˆ ~ (3-18) Each half of the increments contains four ri ng terms and two particle-hole ladder terms. The MO and AO expressions are shown in Equations 3-19 and 3-20. KcD D I Ac Kj KCd d j AC IK kcd d j Ac Ik Kc Ac Kj KC AC IK kc Ac Ik Ab Ijt t Dc Kb t t Cd Kb t t cd kb t Ic Kb t Cj Kb t cj kb r(3-19a) kCd d j bC kI kcD D I bc jk KCD D I bC jK kC bC kI kc bc jk KC bC jK Ab Ijt t dC kA t t cD kA t t CD KA t jC kA t cI kA t CI KA r(3-19b) I I A j b j j A I A I b Ab Ijt c t c t c t t c r (3-20a) j j b I A I I b j b j A Ab Ijt c t c t c t t c r (3-20b)


27 Group 3: T2 Virtual Orbitals Coming from T Set 1: Rings and PH Ladders There are 31 UHF T2 terms that belong to this set. Each process loads a block ofOO VVV integrals1, and calculates all of the c ontributions to the range of R2 amplitudes that it is assigned to store. Grouping and factoriz ing the terms for performance is heavily dependent on the manner in which the old T2 amplitudes are stored and accessed. The terms were first cataloged and sorted into subsets according to spin pairs. For example, if the index is stride 1, then the term Bd Jl kl cd Ac Ikt V t must step down in A and down in B for a given cd pair in V. To minimize the number of messages, any other ring terms that have the same access pattern should be calculated at the same time. From the 31 UHF terms, three access patterns emerged. A ny term that did not prefer one pattern to another was added to the pattern that ha d the most similar intermediate products. Subset 1: AC/BD Subset 1 contains 13 terms that step through T2 in the order AC/BD, which corresponds to the order of the two bra indices. More specifically, for a given CD pair of ket shells, the first (or oute r) loop runs over A and the sec ond (or inner) loop runs over B. All of the T2 contributions are performed in this series. KLCD DB LJ KL CD C I A K KLD DB LJ KL ID A K KLCD DB LJ KL CD AC IK AB IJt V t t t V t t V t IJ P AB P r2 1) ( ˆ ) ( ˆ (3-21) KlCd dB lJ Kl Cd C I A K Kld dB lJ Kl Id A K KlCd dB lJ Kl Cd AC IK klcd dB lJ kl cd Ac Ik AB IJt V t t t V t t V t IJ P AB P t V t IJ P AB P r ) ( ˆ ) ( ˆ ) ( ˆ ) ( ˆ2 1(3-22) 1 V notation can be ambiguous with respect to whether AO integrals are antisymmetrized. Both electrons in the terms in this section have at least one occupied MO index, so V notation is used to save equation width.


28 KlCd db lj Kl Cd AC IK klcd db lj kl cd Ac Ik Ab Ijt V t t V t r (3-23) KlCd b l d j Kl Cd AC IK KlC b l Kl Cj AC IK klcd b l d j kl cd Ac Ik klc b l kl cj Ac Ik Ab Ijt t V t t V t t t V t t V t r(3-24) In terms of uncontracted ket indices of V, the partial c ontributions reduce to Equations 3-25 through 3-27, in which each equation corresponds to a "left-side" intermediate enclosed in parentheses. L B JL K KL I I A K K KL A IK AB IJt V t c t V t IJ P r2 1) ( ˆ ~(3-25) l B Jl K Kl I I A K K Kl A IK k kl A Ik AB IJt V t c t V t V t IJ P r2 1) ( ˆ ~(3-26) l j j b l b jl K Kl A IK k kl A Ik Ab Ijt c t t V t V t r(3-27) Subset 2: DB/CA The DB/CA subset calculates 12 terms by looping first over B and then over A. klcd b l d j kl cd ac ik kcl b l kl cj ac ik klcd db lj kl cd ac ik ab ijt t V t t V t t V t ij P ab P r2 1) ( ˆ ) ( ˆ (3-28) KlCd b l d j Kl Cd aC iK KCl b l Kl Cj aC iK KlCd db lj Kl Cd aC iK KLCD Db Lj KL CD aC iK ab ijt t V t t V t t V t ij P ab P t V t ij P ab P r ) ( ˆ ) ( ˆ ) ( ˆ ) ( ˆ2 1(3-29) KLCD Db Lj KL CD AC IK Ab Ijt V t r (3-30) KlCd db lj Kl Cd C I A K Kld db lj Kl Id A K KLCD Db Lj KL CD C I A K KLD Db Lj KL ID A K Ab Ijt V t t t V t t V t t t V t r(3-31)


29 The MO equations can be reduced into four equations with four "right-side" intermediates when they are expressed in terms of V with uncontracted ket indices. kl b l j j kl l b lj kl a ki ab ijt t c V t V t ij P r2 1) ( ˆ ~ (3-32) Kl b l j j Kl l b lj Kl L b Lj KL a Ki ab ijt t c V t V t V t ij P r2 1) ( ˆ ~ (3-33) KL b Lj KL A KI Ab Ijt V t r (3-34) Kl b lj Kl L b Lj KL A K I I Ab Ijt V t V t t c r (3-35) Subset 3: CB/AD The remaining terms are primarily PH ladder terms and the outer loop is over B. If the bra shells are thought of as a square VV matrix, then the main difference in loops between this subset and Subset 2 is that the inner loop A steps down the D column, while the inner loop A of Subset 2 steps across the C row. KlCd Cb Kj Kl Cd Ad Il Ab Ijt V t r (3-36) KlCd b l C I Kl Cd Ad Kj Kld b l Kl Id Ad Kj KlCd Cb Il Kl Cd Ad Kj Ab Ijt t V t t V t t V t r (3-37) KlCd Cb Il Kl Cd d j A K KlC Cb Il Kl Cj A K Ab Ijt V t t t V t r (3-38) Factorizing the MO expressions in te rms of right-side intermediates and uncontracted ket indices yiel ds Equations 3-39 through 3-41. lK b Kj Kl A Il Ab Ijt V t r(3-39)


30 Kl b l I I Kl l b Il Kl A Kj Ab Ijt t c V t V t r(3-40) Kl b Il Kl j j A K Ab Ijt V t c t r(3-41) Set 2: HH Ladders from OOVV, OOOV, and OOOO The ladder term that is quadratic in T2 can be calculated in one of two ways: one with a HH ladder intermediate and one with a PP ladder intermediate. Opting for the HH ladder intermediate saves a large amount of memory and memory writes. Three other terms have HH ladder intermediates from the OO VVV integral and can be factorized with the quadratic term. The contributions are shown (Equation 3-42). klcd b l a k d j c i klcd b l a k cd ij klcd ab kl d j c i klcd ab kl cd ij ab ijt t t t cd kl ab P ij P t t t cd kl ab P t t t cd kl ij P t t cd kl r ) ( ˆ ) ( ˆ ) ( ˆ ) ( ˆ4 1 4 1 4 1 4 1(3-42) The four ladder-like terms that come from OOOO and OOOV integrals can be combined with the OOVV ladders when the ket orbitals are from the plain AO integral. Equation 3-43 shows the MO contributions. klc b l a k c i kl b l a k klc ab kl c i kl ab kl ab ijt t t cj kl ab P ij P t t ij kl ab P t t cj kl ij P t ij kl r ) ( ˆ ) ( ˆ ) ( ˆ ) ( ˆ2 1 2 1 2 1 2 1(3-43) Combining the two equations, incrementi ng pre-antisymmetrized amplitudes, and expressing the ket indices in the uncont racted AO basis yields Equation 3-45. kl ab kl i kl ab kl i kl ab kl j i kl ab kl ij ab ijt t j kl ij P t c j kl ij P t t t kl ij P t t kl r2 1 2 1 2 1 2 1 2 1 2 1 4 1 2 1 4 1) ( ˆ ) ( ˆ ) ( ˆ ~ (3-44a)


31 kl b l a k i kl b l a k i kl b l a k j i kl b l a k ij ab ijt t t j kl ij P t t c j kl ij P t t t t kl ij P t t t kl r ) ( ˆ ) ( ˆ ) ( ˆ ~ 2 1 2 1 2 1 4 1 4 1(3-44b) kl b l a k ab kl j j i i ij ab ijt t t t c t c ij P t kl r2 1 4 1) ( ˆ ~ (3-45) The terms and factorized increment are shown in Equations 3-46 through 3-48. KlCd b l A K d j C I KlCd b l A K Cd Ij KlCd Ab Kl d j C I KlCd Ab Kl Cd Ij Ab Ijt t t t Cd Kl t t t Cd Kl t t t Cd Kl t t Cd Kl r(3-46) Kl b l A K Kl Ab Kl Klc b l A K c j Klc Ab Kl c j KlC b l A K C I KlC Ab Kl C I Ab Ijt t Ij Kl t Ij Kl t t t Ic Kl t t Ic Kl t t t Cj Kl t t Cj Kl r(3-47) Kl b l A K Ab Kl j j I I Ij Ab Ijt t t t c t c t Kl r(3-48) Group 4: T2 into T1 The T2 amplitudes contribute to the R1 amplitudes in 14 terms. These terms can be grouped together such that they define thr ee one-particle intermed iate operators, which effectively reduce the particle number of th e perturbation. Equations 3-49 and 3-50 show the terms that define the IP (particle) and IH (hole) intermediates, respectively. Equation 3-51 contains both spin cases of the IR (ring) intermediate. KlCd d i Ca Kl klcd d i ca kl KlC Ca Kl klc ca kl a it t Cd Kl t t cd kl t Ci Kl t ci kl r2 1 2 1(3-49) KlCd a l Cd Ki klcd a l cd ki KCd Cd Ki kcd cd ki a it t Cd Kl t t cd kl t Cd Ka t cd ka r2 1 2 1(3-50)


32 ld da li ld KlCd da li C K klcd da li c k LD Da Li LD kLcD Da Li c k KLCD Da Li C K a it f t t Cd Kl t t cd kl t f t t cD kL t t CD KL r(3-51) The terms that define the intermediates have been isolated in Equations 3-52 through 3-54. Both spin cases of the IR intermediate are shown because they both contribute to R1 amplitudes. (The terms with the Fock matrix elements are not part of the intermediate.) The spin cases of IP and IH are analogous to the cases. i i KlC Ca Kl klc ca kl a i Pt c t C Kl t c kl r I2 1:(3-52) a a KCd Cd Ki kcd cd ki a i Ht c t Cd K t cd k r I2 1:(3-53) LD Da Li LD a i kc c k KC C K a i Rt f t t c k t C K r I :(3-54a) ld da li ld a i KC C K kc c k a i Rt f t t C K t c k r I :(3-54b) Group 5: One-Particle Transformations Set 1: T1 Contributions The 12 remaining T1 contributions can be loosel y described as one-particle transformations. The terms are analogous to the terms, which are shown. These are then drastically simplified with the IR intermediate, and further consolidation yields a non-HF set in Equation 3-61 and an IR set in Equation 3-62. KC C K kc c k ai a i v ot Ci Ka t ci ka f r F:(3-55) KCd d i C K kcd d i c k a d d i ad a i v vt t Cd Ka t t cd ka t f r F:(3-56)


33 KlCd a l d i C K klcd a l d i c k KlC a l C K klc a l c k ld a l d i ld i l a l li a i o ot t t Cd Kl t t t cd kl t t Ci Kl t t ci kl t t f t f r F :(3-57) i R a ai a i v oc I c f r F:(3-58) i R a a d d i ad a i v vt I c t f r F:(3-59) i i R a ld a l d i ld i l a l li a i o ot c I t t t f t f r F:(3-60) ld a l d i ld i l a l li a d d i ad ai a it t f t f t f f r(3-61) i i R a a a it c I t c r(3-62) Set 2: T2 Contributions Two one-particle transformati ons describe the final T2 contributions. KlCd ab lj Cd Ki klcd ab lj cd ki ld ab lj d i ld i l ab lj li ab ij o ot t Cd Kl t t cd kl ij P t t f t f ij P r F2 1) ( ˆ ) ( ˆ : ) 1 ((3-63a) KlCd ab lj d i C K klcd ab lj d i c k KlC ab lj C K klc ab lj c k ab ij o ot t t Cd Kl t t t cd kl ij P t t Ci Kl t t ci kl ij P r F ) ( ˆ ) ( ˆ : ) 2 ((3-63b) KlCd db ij Ca Kl klcd db ij ca kl ld db ij a l ld a d db ij ad ab ij v vt t Cd Kl t t cd kl ab P t t f t f ab P r F2 1) ( ˆ ) ( ˆ : ) 1 ((3-64a)


34 KlCd db ij a l C K klcd db ij a l c k KCd db ij C K kcd db ij c k ab ij v vt t t Cd Kl t t t cd kl ab P t t Cd Ka t t cd ka ab P r F ) ( ˆ ) ( ˆ : ) 2 ((3-64b) Similar to the F1 intermediates used to calculate the T1 contributions, the oneparticle transformation matrix for T2 into R2 can be defined in terms of IP, IH, and IR, and using these definitions, the and increments can be trivially reduced to four terms (Equations 3-67 and 3-68). i i R l i H l ld d i ld i l li l it c I c I c t f f F(3-65) d R a a d a P ld a l ld a d ad a dc I t c c I t f f F(3-66) k ab ik k j k ab kj k i c ac ij b c c cb ij a c ab ijt F t F t F t F r(3-67) k Ab Ik k j K Ab Kj K I c Ac Ij b c C Cb Ij A C Ab Ijt F t F t F t F r(3-68)
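To make the factorization pattern of this chapter concrete, the sketch below evaluates one quadratic ring-type term with the index shape of Equation 3-30, first building an intermediate and then contracting it with the remaining T2 block. The dimensions are placeholders, signs and prefactors are omitted, and the array names are illustrative rather than taken from the implementation.

# Sketch: evaluate a term of the shape r(A,b,I,j) <- sum over K,L,C,D of
# t(A,C,I,K) * V(K,L,C,D) * t(D,b,L,j) via an intermediate, as the
# factorizations in this chapter do.  Dimensions, signs, and prefactors
# are placeholders.
import numpy as np

no, nv = 4, 6                                   # token occupied/virtual sizes
t2_aa = np.random.rand(nv, nv, no, no)          # t^{AC}_{IK}
t2_ab = np.random.rand(nv, nv, no, no)          # t^{Db}_{Lj}
v_oovv = np.random.rand(no, no, nv, nv)         # V^{KL}_{CD}

# Two-step factorization: intermediate I(A,I,L,D), then contraction with t2_ab.
inter = np.einsum('ACIK,KLCD->AILD', t2_aa, v_oovv)
r2_ab = np.einsum('AILD,DbLj->AbIj', inter, t2_ab)

# One-shot reference contraction to check the factorized result.
ref = np.einsum('ACIK,KLCD,DbLj->AbIj', t2_aa, v_oovv, t2_ab)
assert np.allclose(r2_ab, ref)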


CHAPTER 4
PARALLEL ALGORITHM

Virtual Parallel Computer

Before any algorithm can be parallelized, the type of parallel computer that the algorithm will run on must be identified. Is remote memory accessible through one-sided operations, or must each process explicitly interrupt another process and request data from it? How much data will be needed during each section of the algorithm, and can that data be held by a single process? Does each process have the ability to read data from a storage device, and if so, what is the penalty for multiple processes reading data simultaneously? These are the types of questions that must be answered before making the assumptions that will shape the plan for parallelization.

The most basic parallel computer is a network of workstations, in which each node is a single-CPU computer with a private memory space and a storage device, usually a hard disk. The most common connection fabric for these networks is TCP/IP over Ethernet. This allows any node to communicate with any other node in the network, and the network hardware or operating system software will handle delivering the messages (as opposed to each process having to maintain a table of which nodes are directly accessible and which must be reached through other nodes). The cost of maintaining these networks is not linearly proportional to their size, so it is common for networks of 500 or more nodes to remove the most failure-prone element, the hard disk, and provide other means for each node to conduct I/O operations.


The extreme opposite of this picture is a single computer with multiple CPUs, all of which can access the same memory space and hard drive(s) with equal performance. Each CPU is identical to the others, so these are commonly referred to as Symmetric Multi-Processor (SMP) computers; however, there are tremendous costs associated with each additional CPU, so it is unusual to have more than 8 CPUs in a truly SMP machine. Some manufacturers build mini-networks within a single computer that allow them to increase the number of CPUs without keeping the machine purely SMP. To an application programmer, these machines are programmed in the same way as Symmetric Multi-Processors, but the more accurate label is Shared Memory Programming (also SMP!), even though the CPUs are not identical in terms of distance to memory, cache, and possibly hard drives. Henceforth, no distinction will be made between the two meanings of SMP, since the parallel CCSD algorithm will be implemented as a user process and not a system or kernel process.

The balance between cost and performance lies somewhere between the two extreme types of parallel computers. The primary offerings of the major supercomputer vendors are clusters of SMP machines. For fine-grain parallelization, each node has multiple CPUs accessing the same memory, and for coarse-grain parallelization, each node has physically separate resources. The virtual parallel computer that the parallel CCSD algorithm relies on is this type of cluster of SMP workstations; however, program portability and the ability to tune the network parameters are very desirable.

The only way to build portability and tunability into the program without creating divergent paths through the software is to model the physical computer in terms of network communication boundaries. By properly defining certain sub-networks, no process will


attempt to communicate through slow network channels when faster ones may exist; furthermore, messages pertaining to a certain context can be limited to regions of the network, which might reduce network contention on some architectures. This allows the scope of communications to be set once at program run time (or compile time), and the algorithm will transparently acknowledge the underlying hardware up to the efficiency of the partitioning scheme.

Communication Networks

In some sense, parallelizing algorithms that handle massive amounts of data and computation is akin to managing disaster recovery. The moment the event occurs, there are many diverse and unused workers who have no assignment or direction. As the situation is assessed, teams of workers are sent to handle different aspects of the recovery, from the time-critical, like search and rescue of survivors, to the time-forgiving, like public relations and cost reimbursement.

Wildfire firefighters have some of the most extensive experience in rapid deployment and management of labor units. In the late 1970s, the Incident Command System (ICS) was created in response to a series of wildfire disasters in southern California [Fed98]. It has since been adopted and refined by several federal and state agencies and provides the basis for most emergency and non-emergency incident responses. The parallel CCSD algorithm borrows the following aspects of its effectiveness:

Modular/scalable organization
Manageable span of control
Unit designations.

The lowest level of the organization is the single process (normally assigned to a single CPU). Every process must have its own memory heap, and the combination of a


process and its memory is called a processor element (PE). PEs recognize other PEs on the same physical node and form squads (SQD). Squads can be grouped together to form platoons (PLT) based on algorithm requirements, and platoons can be grouped together to form companies (CO), which execute large-scale algorithms.

Although the units were listed in increasing size, their division and recognition happen in decreasing size. The commanding process evaluates the available resources and divides the group into subgroups accordingly. For example, the Company Commander (CO_CDR) would catalog the total distributed memory and the channels to disk and form platoons based on how the company algorithm uses those resources. The squads would then self-assemble within the platoons based on which node the PEs are on.

Each division has its own communication space that does not interfere with messages in another division, and there are special communication spaces for the commanders of the subdivisions. This allows platoons to synchronize independently of each other, and more importantly, it allows company algorithms to disregard the messages of other companies. The company that performs the CCSD iterations and calculates the T amplitudes is called the Amplitude Company. Figure 4-1 illustrates the communication boundaries for a company and its subdivisions.

Data Management

Arrays stored in distributed memory

The fundamental data elements involved in standard CC theory are the T amplitudes, the integrals, and the reference wavefunction. The predominant method for solving the T amplitude equations is iterating over a coupled set of equations for T1 and T2. This makes the amplitudes stateful at each iteration, in contrast to the two-electron integrals, which are stateless. Thus, the minimum amount of information that must be stored during the CC


iterations is TOLD (T1 and T2), TNEW (R1 and R2), and any information that defines the reference wavefunction. The integrals do not need to be stored since they can be generated on the fly.

Figure 4-1. Communication network boundaries. Rounded rectangles represent communication networks (in italics). Square rectangles represent physical entities. Dashed square rectangles could be redundant entities (i.e., a platoon commander will probably be the leader of the first squad). (The figure shows the Amplitude Company commander with CO_NET and CO_CDR_NET, platoon commanders with PLT_NET and PLT_CDR_NET, and squads of PEs with SQD_NET and SQD_LDR_NET.)

It is assumed that a single PE cannot store a full OOVV quantity. Even for SMP machines, a single OOVV array might exceed the total physical memory of the node, so the old and new T2 arrays must be distributed over a set of PEs. It is also assumed that each PE is able to retrieve and increment remote memory with one-sided communications (i.e., without interrupting the process that allocated the memory). If the total distributed memory is many times larger than the amount of data that must be stored, then


subdividing the Amplitude Company into platoons, such that multiple copies of the amplitudes are available for reading, will decrease the average distance between a PE and the distributed data that it attempts to access.

In this algorithm, the number of PEs in the Amplitude Company is almost irrelevant, but the total amount of distributed memory is critical. The Company must be able to store TOLD, TNEW (R), and the larger of T2 or the amount of scratch space needed per PE, which depends on the chemical system (Figure 4-7). Each PE will have its own local copy of small, frequently accessed arrays, which include basis set definitions, AO coefficients, T1 and R1 amplitudes, and other one-particle intermediates. While the PEs are calculating contributions to the R amplitudes, only T2 and R2 need to be accessed in addition to local scratch memory. Between certain sections, R2 must be transformed, which requires enough scratch memory to store the third distributed OOVV array.

Figure 4-7. The memory heap management for various molecular systems. In one scenario, the minimum scratch memory needed for amplitude contributions is less than the amount of memory needed to store a section of R2 during a transformation (min scr < R2 block); the other scenario is when the opposite occurs (min scr > R2 block). (The figure contrasts the heap layout of reserved memory, distributed T2, and distributed R2 for the two examples below.)

Example 1. The user runs a calculation that requires 128 MB of reserved data per PE, such as basis set information, one-particle matrix elements, etc. The calculation requires another 128 MB (minimum) of scratch memory per PE for calculating various


quantities during the CCSD iteration, and the OOVV quantities require 4096 MB of distributed memory from all of the PEs. If the user submits the job with 1024 MB of heap per process, what is the minimum number of PEs needed to perform the calculation? Alternatively, does the calculation have enough resources if the user submits the job with X PEs, each with 1024 MB of memory?

Solution 1. The amount of non-reserved memory is 1024-128 MB/PE (i.e., 896 MB is available for scratch and distributed data on each PE). Three OOVV arrays require 12,288 MB. Dividing this amount by 896 MB/PE and rounding up yields a minimum of 14 PEs. Distributing each OOVV array over 14 PEs requires each PE to store approximately 293 MB of data per array. Since this amount is larger than the minimum amount of scratch memory needed (128 MB), the calculation can proceed with 14 or more PEs.

Example 2. Using the same system data as above, but increasing the minimum required amount of scratch memory to 512 MB, what is the minimum number of PEs needed to run the calculation?

Solution 2. This time, the regular scratch memory is larger than the distributed OOVV scratch array. In essence, this memory becomes unavailable for distributed data, so the effective amount of memory per PE for the OOVV arrays is (896-512) = 384 MB. Since the third OOVV array can fit in the scratch memory, only two OOVV arrays, totaling 8192 MB, need to be distributed over the Amplitude Company. 8192 MB divided by 384 MB/PE yields 22 PEs (rounded up). This means each PE will store approximately 187 MB of each OOVV array, and the calculation can proceed if there are at least 22 PEs.

The minimum PE test can be described by the following pseudo-code:


mem_dist = mem_heap - mem_reserved
min_PEs = ceiling( 3 * mem_OOVV / mem_dist )
mem_block = mem_OOVV / min_PEs
if (mem_block < mem_minscr) then
   mem_dist -= mem_minscr
   min_PEs = ceiling( 2 * mem_OOVV / mem_dist )
end if

If the user runs the job with more than twice the minimum number of PEs, then the Amplitude Company can divide into multiple platoons, each with its own copy of TOLD and TNEW. One of the benefits of this approach is that fewer PEs will access each block of distributed memory, but the main advantage is that knowledge of the underlying hardware can be programmed into the partitioning scheme and the algorithm will transparently align itself with the hardware boundaries.

Arrays stored on disk

The VVVV and VVVO integrals will not be transformed and stored on disk, so their contributions must be calculated on the fly in the AO basis. The OOOO and OOOV integrals will not be stored on disk either, not because they are too large, but because their contributions can be accounted for with intermediates from the direct integrals and from the stored half-transformed OOVV integrals. The stored integrals can be distributed over, and individually processed by, the platoons of the Amplitude Company.

Parallel UHF CCSD

Each CC iteration begins with initializing the global intermediates and residual amplitudes to zero. The TOLD amplitudes are defined by Equation 4-1. This formulation allows the amplitudes to contract directly with AO integrals. The T2 residual amplitudes will have different definitions depending on which part of the algorithm is executing. In the Direct Integral (DI) phase, one of the virtual indices comes from an uncontracted AO integral, so the residual is related to the MO residual amplitudes by Equation 4-2.


t^{\mu}_{i} = c_{\mu a}\, t^{a}_{i}\,, \qquad t^{\mu\nu}_{ij} = c_{\mu a}\, c_{\nu b}\, t^{ab}_{ij}    (4-1a)

t^{\mu\nu}_{Ij} = c_{\mu A}\, c_{\nu b}\, t^{Ab}_{Ij}    (4-1b)

r^{\mu\nu}_{Ij} = c_{\mu A}\, c_{\nu b}\, r^{Ab}_{Ij}    (4-2)

The Stored Integral (SI) and One Particle (OP) phases calculate residual contributions with both virtual orbitals coming from T, so the DI residual is transformed according to Equation 4-3.

r^{\mu\nu}_{Ij} = Q_{\nu\lambda}\, r^{\mu\lambda}_{Ij} = c_{\nu b}\, c_{\lambda b}\, r^{\mu\lambda}_{Ij}    (4-3)

Throughout the iteration, each PE stores all OO pairs (ket indices) for a range of AO shell pairs (bra indices). The amplitudes are distributed over the platoon (Figure 4-2), and one-sided communications retrieve and increment all OO pairs for a block of AO shell pairs.

Figure 4-2. The T2 and R2 memory layout. Each square represents the four-dimensional array of amplitudes that have the designated spin pair (IJ, Ij, or ij) and whose AO bra indices belong to the MN shell pair. The arrows represent the order in memory. (The figure shows MN shell pairs 1 through 6 distributed over PE 0 and PE 1.)
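A minimal sketch of one way to map an MN shell-pair index to the PE that owns its T2/R2 block is shown below. The simple blocked mapping and the function name are illustrative assumptions, not the distribution actually used by the program.

# Sketch: a blocked mapping from an MN shell-pair index to the PE that
# owns its T2/R2 block and the local slot on that PE.  The real code may
# distribute shell pairs differently (e.g., in bands of one shell index);
# this only illustrates the bookkeeping.
def owner_of_pair(pair_index, n_pairs, n_pes):
    block = -(-n_pairs // n_pes)            # ceil(n_pairs / n_pes)
    pe = pair_index // block
    local_slot = pair_index % block
    return pe, local_slot

# Example: with 6 MN shell pairs over 2 PEs, pairs 0-2 live on PE 0 and
# pairs 3-5 on PE 1.
for mn in range(6):
    print(mn, owner_of_pair(mn, n_pairs=6, n_pes=2))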


44 After all of the contributions have been calculated, the residual is antisymmetrized with respect to the two virtual indices and b ack-transformed to the MO basis. The new T amplitudes are calculated by dividing each am plitude by the diagonal elements of the Fock matrix. Direct Integral Phase Design considerations Analyzing the T2 contribution in Equation 4-4 can summarize the DI phase. Since the VVVO integrals are not transformed and stor ed prior to the CC it erations, the terms must be evaluated in the AO basis. lCd Cd Ij b l KCd Cd Ij A K Ab Ijt Cd Al t t Cd Kb t r(4-4) Ij b A Ij b A Ab Ijt t c t c t r(4-5) The biggest problem with this situation is that every AO integral causes a global update to R2 (i.e., one integral c ontributes to every R2 amplitude). The best way to reduce the number of operations is to precondition R2 such that a smaller set of VVVO contributions is calculated. R2 would then be post-transf ormed after the contributions have been accumulated (Equations 4-6 through 4-8). b Ab Ij b A Ijr c r (4-6) Ij A A Ijt t r(4-7) A Ij b Ab Ijr c r(4-8) For each integral, only an IjA strip of R2 amplitudes is incremented compared to the full set without the backward and forward transformations of R. The difficulty with


45 the amplitudes is that while the side can get away with 1/V updates, the side still requires a global update per in tegral (Equation 4-9). The orbital must be forward transformed into A, but the orbital must be transformed into b by T1, which must then be back-transformed by c-1. One solution is to calculate the two spin cases in separate passes (while transforming R between them ). Although this requires generating the AO integrals a second time, the performance of a naturally parallel local computation is much greater than a gl obal update per N4 matrix element. b Ij b b A A Ijt t c c r (4-9) A second performance issue is antisymm etrization of the same-spin integrals. Equation 4-10 shows one of the ring contributions in MO s and Equation 4-11 shows the terms appropriately transformed into the AO basis. kcd d j Ac Ik Ab Ijt t cd kb r(4-10) j I Ijt t r(4-11) The AO integrals are not antisymmetrized by the integral generator, so one of two approaches must be taken to account for bo th contributions. The first approach involves two updates per integral: j I Ij j I Ijt t r t t r ;(4-12) The benefit of this method is that the PE needs only enough memory to store the AO integral batch and the T2 batch; however, two updates per inte gral is more costly than two reads per integral, which can be achieved by antisymmetrizing the pair:


46 j I Ij j I Ijt t r t t r ;(4-13) This requires either two temporary T2 blocks or two serial loads. If there is enough memory, then the exchange integral can be calculated with the regular Coulomb integral and the antisymmetrized contribution can be calculated with one read and one update. Since shell quadruplets of integrals are typi cally not more than a few hundred kilobytes, the algorithm will assume this can be done. The permutational symmetry of the antisymmetrized integrals is not the same as that of the plain Coulomb integrals so they must be generated again in the middle of the loop over permutations. Although two updates or two reads are avoide d, the integrals need to be calculated three times per AO shell quadruplet. Considering the cost of messages compared to the speed of locally computing a shell quad, this is a favorable tradeoff. From these considerations, it is clear that the DI phase must calculate the unique integrals six times even though the AO integr als have 8-fold permutational symmetry. There are two passes (one for and one for ) and there are three generations per pass—one Coulomb and two exchange integrals. This might seem expensive, but the cost of doubling or quadrupling the number of me ssages does not scale linearly with the number of PEs and far outweighs the cost of the naturally parallel integral computation. Algorithm A critical aspect of this algorithm that makes or breaks the performance is onesided communications. A one-sided GET of a T2 block can be in flight while the PE calculates terms that do not involve the T2 amplitudes. Immediately after the amplitudes have been read, the next block of amplitudes can be requested, even if the PE is not ready to use them. This technique is also used to hide the communication of R2 updates. The


main difference between the two arrays is that the T2 amplitudes are read once per GET, while the R2 amplitudes are incremented two or three times per ACCUMULATE. Ideally, there would be two R2 buffers, one active and one pending, to increase the time available for each ACCUMULATE. If the user must run the calculation in limited memory, then it is still possible to benefit from one-sided ACCUMULATEs, but the PE might have to wait for a while before initializing the buffer for the next R2 contribution.

The main driver for the DI phase loops over AO shell quadruplets of integrals. The MN shell pairs are distributed over all of the PEs in the Amplitude Company. This distribution can happen in a number of ways: round robin, blocked, block cyclic, job board, master/slave, etc. How a PE processes each MN shell pair is independent of the shell pair assignment, so this can be tweaked outside the main processing loop. Ideally, there would be a globally accessible integer that corresponds to the next MN pair that must be processed. The integer is atomically read and incremented by each PE to get its next assignment. This balances the load better than a fixed distribution algorithm like round robin or block cyclic, and it performs the same function as a master process sending out new tasks without actually needing a process to sit idle waiting for requests.

Once an MN shell pair is identified, the PE is responsible for generating and processing all unique RS shell pairs that correspond to the MN upper bound. This prevents any PE from processing Coulomb or exchange integrals that another PE is responsible for. It also allows for a single generation of the integrals for each pass if there is enough scratch memory (i.e., the PE would generate all RS pairs and internally antisymmetrize them before beginning the processing loop). Typically, however, there will be enough memory for two shell quads and the scratch buffers for contracting terms.
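The globally accessible counter described above can be sketched as follows. This is a minimal Python illustration of the job-board idea, modeling the shared integer with a lock; the production code instead performs an atomic one-sided read-and-increment on a remote integer, and the names ShellPairBoard and claim_task are hypothetical.

# Minimal sketch of job-board self-scheduling over MN shell pairs.
# The shared counter is modeled with a lock; the real program uses an
# atomic one-sided read-and-increment on a globally accessible integer.
import threading

class ShellPairBoard:
    def __init__(self, n_pairs):
        self.n_pairs = n_pairs      # total number of MN shell pairs
        self.next_pair = 0          # index of the next unprocessed pair
        self.lock = threading.Lock()

    def claim_task(self):
        """Atomically read and increment the counter; return -1 when done."""
        with self.lock:
            if self.next_pair >= self.n_pairs:
                return -1
            task = self.next_pair
            self.next_pair += 1
            return task

def worker(board, pe_rank, work):
    while True:
        mn = board.claim_task()
        if mn < 0:
            break                   # no MN pairs left; fall through to the fence
        work(pe_rank, mn)           # generate/process all RS pairs for this MN pair

board = ShellPairBoard(n_pairs=120)
worker(board, pe_rank=0, work=lambda rank, mn: None)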


48 A shell quadruplet of integrals will cont ribute to all of the Group 1 terms, the Group 2 terms, and the Group 4 intermediates. Table 4-1 summarizes the stages of the processing loop and which elements are accessible at each stage. Table 4-1.The direct integr al processing loop summary L loop section Scope Contributions prologue {IHs, I2} += f(V,C,T1,T2) all Group 1 terms and the IHintermediate stage 1 R2c += f(V,C,T1,T2,I2) all contributions to R2 from Group 2 and Group 1 stage 2 IPs += f(V,C,T2) the IP intermediate epilogue IR += f(V,C,T1) the IR intermediate The Group 1 terms are saved in a two-partic le intermediate duri ng the loop over the T1 amplitudes from the Group 2 terms. The I2 intermediate is then contracted with Q (C*C) and the Group 1 contribution is sent with the Group 2 contribution to increment R2. If there is not enough scra tch memory to store the I2 intermediate, then it is possible to calculate these terms in a single direct integral pass sepa rately from the Group 2 terms or the IP or IR intermediates. Table 4-2 has the detailed order of operations for the prologue of processing a shell quad. The w one-particle matrix element is defined in Equation 4-14. I I It c w(4-14) When the prologue finishes, there are two important quantities stored in memory: the Group 1 contribution and the T2 amplitudes that genera ted it. Since the Group 2 contributions loop over all L shells for a given S shell, it is convenient to begin the loop with L=R. Table 4-3 lists the detailed or der of operations for the L loop. The x oneparticle matrix element is defined in Equation 4-15. I I I A A Ac t c c x 2 1(4-15)


After the last value of L, there is still a pending R2 ACCUMULATE, which must be completed. Rather than WAIT immediately after the loop, the IR intermediate, which requires only T1, C, and V, can be calculated. Furthermore, if there are more permutations of V, then the PE can GET the next T2(RS) block and permute the integrals. Table 4-4 lists the detailed order of operations for the permutation epilogue.

The first four permutations of V involve the same two sets of shell quads (the Coulomb quad J and the exchange quad K). The second set of four permutations requires a different exchange shell quad, so the integral generator is invoked a third time per MNRS configuration. Figure 4-8 shows the movement of data needed to calculate the contributions to R2 from an integral shell pair.

Figure 4-8. The data flow in the Direct Integral phase. Solid arrows represent one-sided messages, and dashed arrows represent reading and incrementing the global shell pair counter. Each PE retrieves all T2 amplitudes that contract with the integral and accumulates the contributions into the R2 array. (The figure shows the PEs of two platoons, each with two squads, pulling T2 blocks and accumulating R2 blocks while sharing a single global shell pair counter.)
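The eight permutations referred to above follow from the usual 8-fold permutational symmetry of a real two-electron integral (MN|RS). The sketch below simply enumerates them; exactly which four the implementation groups into each set is an assumption here, since the text only states that they split into two sets of four.

# Enumerate the 8-fold permutational symmetry of a real two-electron
# integral (MN|RS).  The split into two sets of four is illustrative;
# the text states only that the second set needs a new exchange quad.
def shell_quad_permutations(M, N, R, S):
    return [
        (M, N, R, S), (N, M, R, S), (M, N, S, R), (N, M, S, R),  # first set of four
        (R, S, M, N), (S, R, M, N), (R, S, N, M), (S, R, N, M),  # second set of four
    ]

# Example: the unique quadruplet (3, 1, 7, 2) stands for eight
# equivalent Coulomb integrals.
for perm in shell_quad_permutations(3, 1, 7, 2):
    print(perm)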


When the Amplitude Company has calculated all of the MN shell pairs, the R2 distributed arrays are transformed for the phase that processes the stored OOVV integrals. The one-particle global intermediates could be reduced here for use in the One Particle phase. The following pseudo-code lists the entire flow of control from the beginning of the CC iteration up to the start of the SI phase.

zero(R2,R1,I1)
W1 = C + T1
do spin = 2, 1, -1
   X1 = Q1 - C*T1
   SYNCHRONIZE(CO_NET)
   do while (more MN)
      do all RS
         GET T2(RS)
         G = intgen(M,N,R,S)
         H = G - intgen(M,N,S,R)
         do perm = 1, 8
            calc Prologue()
            nLeft = NShells
            L = R
            do while (nLeft != 0)
               calc InnerLoop()
               nLeft -= 1
               L = 1 + mod(L,NShells)
            end do nLeft
            calc Epilogue()
         end do perm
      end do RS
   end do MN
   SYNCHRONIZE(CO_NET)
   if (spin==2) then
      global R2(L,N) = Q1^-1(N,D)*R2(D,M)*Q1(M,L)
   else
      global R2(L,D) = R2(L,N)*Q1(N,D)
   end if
end do spin
REDUCE(I1,CO_CDR)

Stored Integral Phase

The SI phase calculates all of the Group 3 terms from the stored half-transformed integrals with IJ, Ij, and ij occupied bra pairs, collectively referred to as OOVV. Each PE will load or receive an integral block, which contains all spin pairs for a particular shell pair. The mechanism


51 for obtaining integrals depends on the performance and archit ecture of the I/O system. If only one PE per node (or squad) is allowe d to read from disk—assuming this is an operating system constraint—the n the other PEs in the squa d will have to request the blocks from that I/O PE. The I/O PE will then have to juggle its own calculations while acting as an I/O proxy; otherwise it could opt not to store R2 amplitudes, but then the system takes a performance hit as whole be cause there is one relatively idle CPU per node. Most computers will allow any process to perform I/O and shift the performance hit of I/O contention to the program. The t echnique the parallel CCSD algorithm uses to lessen this penalty is asynchronous file read s. The logistics are the same as one-sided operations in the DI phase in that each PE issues an AIO_READ immediately after the last integral has been read but before other computation has been done. If asynchronous I/O operations are not available and the user elects not to de signate a special I/O PE, then the algorithm can wait for the integrals up front with a bloc king read or it can synchronize the squad PEs by broadcasting the current block and waiting for all of them to finish before r eading the next block. Resource competition, such as parallel bl ocking reads, suffers scaling penalties, meaning the cost per read increases as the number of PEs per squad increases. The upshot is that there is still the possibility that so me PEs are computing while others are waiting on I/O. Locked step algorithms, however, usually suffer far worse than resource competition algorithms. Locked step guarantees that no one is working while a read is pending and that no one is reading while the PE s are computing. Even if the I/O PE used asynchronous reads, the locked step a pproach would suffer from expensive


52 synchronization delays compared to pur e resource competition. From these considerations, each PE will be responsible fo r reading its next block of integrals from disk. If asynchronous reads are not supported, then the PE will default to blocking reads. The OOVV integrals will be distributed over platoons by block cycling the AO shell pair kets. Every PE in a platoon must r ead and process all of the integrals assigned to its platoon. The integrals will only contribute to the R2 amplitudes that the PE stores, but the T2 amplitudes must be read (with one-sid ed GETs) from other PEs in the platoon. Every shell pair of integrals results in four updates to all of the local R2 amplitudes, one for each subset of Group 3 Set 1 and one for Group 3 Set 2. Since Group 3 involves contributions in which all virtual orbitals come from T, the virtual orbitals of the integrals must be the internal summation indices. Th e general approach to calculating the Set 1 subsets is to loop over and for a fixed pair. Each subset corresponds to a different configuration of these nested loops. Figure 4-3 shows the path of updates through the loca l set of R2 amplitudes. Table 4-5 shows the detailed order of operations for calculating Group 3 Set 1 Subset 1 (AC/BD is LR/DS in AO shells). Accessing memory in non-unit stride is highly unfavorable when performance is the utmost concern; however each square in Figure 4-3 represents three four-index arrays such as those shown in Figure 4-2 and those updates can be performed in unit stride. Furthermore, the L-D loop order is only used in Subset 1. Tables 4-6 4-7 and 4-8 show the operations for Subsets 2 and 3 of Set 1 a nd Set 2, respectively. Each steps through the LD pairs in D-L order.


53 L0 L1 L2 L3 D0 D1 D2 Figure 4-3.The path of Group 3 Set 1 S ubset 1 updates per OOVV integral. Dashed lines show memory continuity and solid arrows show the loop path. There is some flexibility in choosing th e order in which the Group 3 sections are calculated. The most important design consid eration is choosing which section will be performed last since it will AIO_READ the next RS shell pair of OOVV integrals from disk. Subset 2 of Set 1 has the largest number of operations after the last integral is used, so this section should be performed last. Choosing a section to run first depe nds on how many operations can complete during the first GET before reading the T2 amplitudes. Subsets 1 and 3 of Set 1 and Set 2 all use the amplitudes almost immediately, but each requires a different initial shell pair. Set 2 reads only one shell pair, T2(RS), and neither shell contributes to the final R2amplitudes, so the probability that T2 is local is inversely proportional to the number of PEs in the platoon. The Subsets of Set 1 fare much better since at least one shell of T2contributes directly to a shell of R2. If the amplitudes are distributed in bands of L, then Subset 3 has a good chance of T2 being local, and if the amplitudes are distributed in


54 bands of D, then Subset 1 is better off executing first. Figure 4-4 illustrates the two possibilities. L bands D bands L D D L Figure 4-4.The LD shell distribution schemes. Each PE stores both T2 and R2amplitudes for its block of LD shell pairs. Set 1 Subsets 2 and 3 favor L bands, while Subset 1 favors D bands. The parallel algorithm uses L bands since tw o of the three subsets favor them as far as having local T2 amplitudes per R2 update. Subset 2 is alrea dy designated to run last, so subset 3 should be performe d first because it has the hi ghest chance of having no wait time for the initial T2 GET. The decision to run Set 2 before or after Subset 1 can be made on the same argument. Since Set 2 update s the entire set of local LD pairs after reading the T2(RS) amplitudes, it offers the most time for the initial GET of Set 1 Subset 1 to complete before reading them. Therefor e, the calculation orde r of the Group 3 terms is Set 1 Subset 3 (S1S3), Set 2, S1S1, and S1S2. Figure 4-9 shows the flow of data in the SI phase. Each PE loops over all of the shell pairs that were assigned to its platoon and calculates only the terms that contribute to the local R2 amplitudes. The following pseudo-code lists the entire flow of control from the end of the DI phase up to the start of the One Particle phase. AIO_READ V(RS0) GET T2(RD0) do RS = RS0, RS1


55 WAIT V(RS) calc G3S1S3(GET T2(RS)) calc G3S2(GET T2(L0R)) calc G3S1S1(GET T2(SD0)) calc G3S1S2(AIO_REA D next V(RS); GET next T2(RD)) end do RS 1 2 3 4 5 6 7 8 9 10 11 12 T R T T R T R T 1st Squad 2nd Squad 1st Squad 2nd Squad 1st Platoon 2nd Platoon Platoon Shell Pairs of Platoon Shell Pairs of Figure 4-9.The data flow in the Stored Inte gral phase. Each PE processes all of the integrals that are stored by its pl atoon and updates its local block of R2. One Particle Phase The OP phase is responsible for contracting the Fock matrix elements of the reference wavefunction (fpq) and the I1 intermediates (IP, IH, and IR), with the T amplitudes. There are three types of contractions: T2 into R1 (Group 4), T1 into R1 (Group 5 Set 1), and T2 into R2 (Group 5 Set 2). Only one PE must do the T1R1 contractions provided it has the complete I1 intermediate (it should si nce the contributions were reduced at the end of the DI phase). Thes e transformations, shown in Equation 4-16 for T1 are insignificant performance-wise and little is gained by parallelizing them. i i P i i R i H it c I t c I t Q I t Q r(4-16a) ld i d ld l i l li l ad i d ad a a ai a it c f t f t t c f c f c r (4-16b)


56 The T2R1 contributions can be distributed ove r the PEs in one platoon, since they are limited to local reads and lo cal writes; however, while the T2R2 terms also have local writes, they have both local and remote reads. The remote reads come from the terms in Equations 4-17 and 4-18. The part of FP is defined in Equation 4-19. (The part is analogous.) ij P ij IJ P IJt F r t F r~ ; ~ (4-17) Ij P Ij P Ijt F t F r(4-18) R P ld d ld l ad d ad a PI t Q I c f t c f c F(4-19) Parallelization of these pa rticle terms should reduce the number of messages a PE must send as the number of PEs in the comp any increases. This implies distributing the summation index ( or ) over the platoons; however, sin ce the SI phase prefers L bands (the index), it is likely that most (if not all) of the T2(RD) GETs are local, so it is not as critical to performance to distribute them over the platoons as it is for For the loop, each PE must get data from multiple targ et PEs that contribute to its local R2 set (Table 49 ). Table 4-10 shows the other FP terms. The T2R2 hole terms and the T2R1 terms (along with the latter FP terms) should be performed by the largest platoon (HVY_PLT) even if that platoon has only one more PE. During this time, all of the other platoons will send their contributions via ACCUMULATE to the Heavy Platoon. Once th ese operations are complete, the only remaining tasks are antisymmetrizing the same-spin R2 amplitudes and updating the old T2 amplitudes with R2. Table 4-11 shows the operations of the Heavy Platoon while the


57 other platoons are sending thei r contributions, and Equations 4-20 and 4-21 show the terms that contribute to FH and FR, respectively. i i R l i H l ld i d ld k i ki k i Ht c I c I c t c f f F(4-20) R k d d kd k RI c c f F(4-21) The following pseudo-code lists the operati ons that occur from the end of the SI phase up to the start of the an tisymmetrization and update phase. if (CO_CDR) then make FP, FH, FR BROADCAST(F1,CO_NET) calc T1_into_R1() else RECEIVE(F1,CO_NET) end if calc FPT2() if (HVY_PLT) then calc other_F1T2() else ACCUMULATE(R2,HVY_PLT) end if REDUCE(R1,CO_CDR) Antisymmetrization and T2 Update The and R2 amplitudes must be antisymmetrized with respect to AO shell pairs. Equation 4-22 shows this operation. All of the R amplitudes must then be divided by the diagonal elements of the Fock matrix in order to generate the next set of T2amplitudes. Each PE in the Heavy Pl atoon, having collect ed all of the R2 amplitudes that it is assigned to store, will antisymmet rize and transform its amplitudes into T2. The R2T2 transformation is shown in Equation 4-23 and the diagonal matrix elements of f are shown in Equation 4-24. ij ij ijr r r ~ ~ (4-22)


58 f f f f r t f f f f r tjj ii ij ij JJ II IJ IJ;(4-23a) f f f f r tjj II Ij Ij(4-23b) ad d ad ac f c f (4-24) The R1 amplitudes must be transformed in the same way in order to generate the next set of T1 amplitudes. One PE should do this since the transfor mation is trivial (Equation 4-24). f f r t f f r tjj j j II I I;(4-24) Table 4-12 lists the operations of the He avy Platoon for antisymmetrizing and transforming R2. The code listing that follows s hows the operations of the Amplitude Company from the end of the OP phase up to the next CC iteration. if (HVY_PLT) then make T2 end if if (CO_CDR) then make T1 BROADCAST(T1,CO_NET) else RECEIVE(T1,CO_NET) end if if (not HVY_PLT) then GET T2 end if The order of the loop nesting, L-D versus D-L, does not affect the results, but the performance of the loops might differ from one computer architecture to the next. The LD ordering issues multiple GETs to the same PE (down the L column) before cycling up to the next column. If every PE begins with L=1, then the first PE in the platoon will be


bombarded with the GETs of every PE at once. Normally, GET operations do not require exclusive locks, in which only one PE is allowed to access the data at a time, but the efficiency of the messaging subsystem might be a factor in performance. This would be less important in the D-L ordering.

Another way in which the messaging subsystem could affect performance is by "remembering" which PE was accessed last. In this situation, once a GET is established between two PEs, any successive GET to the same remote PE is less expensive than switching to another PE. This would actually favor the L-D ordering and penalize the D-L ordering. In either case, the code can be trivially designed to switch between the two orderings depending on the architecture on which the program is running.
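A minimal sketch of such a run-time switch is given below. The flag name loop_order and the optional staggered starting shell are illustrative assumptions, not details of the actual implementation.

# Sketch: walk the (L, D) shell pairs in either L-D or D-L order,
# selected at run time.  The optional staggered start (offsetting the
# outer loop by the PE rank) is one way to keep every PE from issuing
# its first GET to the same target; it is an illustration only.
def ld_pairs(l_shells, d_shells, loop_order="L-D", pe_rank=0, stagger=False):
    outer, inner = (l_shells, d_shells) if loop_order == "L-D" else (d_shells, l_shells)
    n = len(outer)
    start = pe_rank % n if stagger else 0
    for i in range(n):
        o = outer[(start + i) % n]
        for inn in inner:
            yield (o, inn) if loop_order == "L-D" else (inn, o)

# Example: PE 2 visits the same (L, D) pairs as PE 0 but begins at a
# different outer shell, so its first GETs land on a different target.
for L, D in ld_pairs(range(4), range(3), loop_order="D-L", pe_rank=2, stagger=True):
    pass  # issue the GET for T2 block (L, D), then process it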


60 Table 4-2.The Direct Integral phase prologue memory map Order of V T2 I2 W2 operations G H I Iw H W r w J I IJ IJw W P I ˆ w r I Iw G W r w j I Ijw W I w r WAIT T2(RS) IJ IJt H W2 1 r r w I IJ J Hc W I r IJ IJW I i r Ij Ijt G W r r w I Ij j Hc W I r Ij IjW I i r Each operation is implicitly summed over all shared indices. The G column of V corresponds to the set of <|> Coulomb integrals, and the H column is the set of <||> antisymmetrized (Coulomb-Exchange) integrals. The values w, r, and i stand for write, read, and increment, respectively. Shaded cells are involved in pending communications and may not be read fr om or written to (without risking data inconsistencies). Temporary arrays (W2) are outlined to show their scope of dependence.


61 Table 4-3.The Direct Integral phase inner loop memory map V T2 R2 (1) R2 (2) I2 W2 X2 Order of operations G H if (L!=R) WAIT T2(LS) Ij I jt c W r w if (R!=S) GET T2(LR) j jW G X r r w j I IjX w R w r Ij IjI x R i r IJ IJI x R w r if (R!=S) WAIT T2(LR) IJ J It c W r w Ij j It c W r w if (more L) GET next T2(LS) H W XI I r r w G W XI I r r i j I Ijw X R i r H W XI I r r w I I Pc X I 2 1 r G W XI I r r i J I IJ IJw X P Rˆ i r if (L!=R) WAIT R2 ACCUMULATE R2 I I Pc X I r The R2 (1) buffer is the active write buffer, and the R2 (2) buffer is the access-forbidden buffer from the previous L shell. Th e loop starts with L=R, counts up to NShells, cycles back to L=1, and continues up to L=(R-1).


62 Table 4-4.The Direct Integral phase epilogue memory map V T2 R2 (1) R2 (2) scr Order of operations G H if (perm==8) then G t c H t c Ii i I I R r r (w) WAIT R2 else GET next T2(RS) G t c H t c Ii i I I R r r (w) permute V w w (w) end if One of the two R2 buffers will be access-forbidden. The last permutation does not need to GET another T2(RS) block; otherwise, the GET shoul d be issued as soon as possible. If the permutation index is 4, then a different exchange integral will be generated directly. The (w) cells are write-optional, meaning the operation might or might not need the scratch memory.


63 Table 4-5.The Stored Integral phase Group 3 Set 1 Subset 1 memory map V T2 (1) T2 (2) IA IB IC Order of operations GET T2(L0R) do L = L0, L1 Mn M I n I BG t w I ) ( r w WAIT T2(LR) Mn IM n I CG t I ) ( r r w M I IM IMt w t W2 1 r w mn Im n IH t W 2 1 r r w GET T2(D0S) MN IM N I AH W I ) ( r r w ) ( ) (C n I n I BI W I r i r n I n I CW I 2) ( r i do D = D0, D1 WAIT T2(DS) GET next T2(??) JN N I A IJ IJt I P r) (ˆ ~ r r Jn n I B IJ IJt I P r) (ˆ ~ r r n j jnt w t i jn n I C Ijt I r) ( r r end do D end do L During the loop over D, the T2 buffers alternate active/inactiv e, but the prologue in the L loop uses the second buffer as scratch memory. The GET in the inner loop over D requests T2(DS) unless D=D1, in which case it requests the next T2(LR) (provided LL1).


64 Table 4-6.The Stored Integral phase Group 3 Set 1 Subset 2 memory map V T2 (1) T2 (2) IA IB IC ID Order of operations GET T2(SD0) do D = D0, D1 n j Mn M j Bt w G I) ( r w WAIT T2(SD) GET T2(RL0) nj Mn M j Dt G I) ( r r w n j nj njt w t t2 1 w nj mn m j At H I) ( r r w Nj MN M j Ct H I) ( r r w if (D==D1) then AIO_READ next V end if ) ( ) ( 2 1 ) (D C BI I I i r r ) ( ) (C D I I r i do L = L0, L1 M j D M I IjI t w r) ( r WAIT T2(RL) GET next T2(??) m j A mi ij ijI t P r) (ˆ ~ r r M j B Mi ij ijI t P r) (ˆ ~ r r M j C MI IjI t r) ( r r end do L end do D The inner loop of this subset has more operatio ns than either of th e other two subsets, so it is calculated last and the asynchronous r ead is issued during the last iteration.


65 Table 4-7.The Stored Integral phase Group 3 Set 1 Subset 3 memory map V T2 (1) T2 (2) IA IB IC Order of operations GET T2(RD0) do D = D0, D1 n I Mn M I Bt w G I) ( r w WAIT T2(RD) GET T2(L0S) Mj Mn n j At G I) ( r r w In Mn M I Ct G I) ( r r w ) ( ) (C B I I i r do L = L0, L1 M I C M j IjI t w r) ( r WAIT T2(LS) GET next T2(??) n j A In IjI t r ) ( r r M I B Mj IjI t r) ( r r end do L end do D


66 Table 4-8.The Stored Integral phase Group 3 Set 2 memory map V T2 Order of operations W2 I2 j I Ij Ijw w t W r w Kl Ij Kl IjV W I r r w do LD = LD0, LD1 l K Kl Klt t t W w Kl Kl Ij IjW I r r r end do LD do spin = 2, 1 j i ij ij ijw w P t Wˆ (r) r w kl ij kl ijV W I 4 1 (r) r r w do LD = LD0, LD1 l k kl klt t t W2 1 w kl kl ij ijW I r ~ r r end do LD end do spin All of the T2(LD) reads and R2(LD) updates are local opera tions. If there is not enough memory for I2, then the Ij/ij/IJ bra pairs can be processed in blocks. Table 4-9.The One Particle phase FP T2 memory map T2 (1) T2 (2) Order of operations GET T2(LS0) do S = S0, S1 do L = L0, L1 WAIT T2(LS) GET next T2(LS) do D = D0, D1 Ij P Ijt F r r end do D end do L end do S


67 Table 4-10.The One Particle phase FP T2 (and same spin terms) memory map T2 (1) T2 (2) Order of operations GET T2(1D0) do D = D0, D1 do R = 1, NShells WAIT T2(RD) GET next T2(RD) do L = L0, L1 IJ P IJt F r~ r Ij P Ijt F r r ij P ijt F r~ r end do L end do R end do D Table 4-11.The One Particle phase FH and FR local transformations Order of operations do LD = LD0, LD1 K KJ K I H IJ IJt F P r ˆ ~2 1 k Ik k j H K Kj K I H Ijt F t F r k kj k i H ij ijt F P r ˆ ~2 1 k Ik k R K IK K R It F t F r K Kj K R k kj k R jt F t F r end do LD


68 Table 4-12.The R2 antisymmetrization and T2 update memory map R2 (1) R2 (2) Order of operations GET R2(D0L0) do L = L0, L1 do D = D0, D1 WAIT R2(DL) GET next R2(DL) IJ IJ IJr r t ~ ~ r ij ij ijr r t ~ ~ r f f f f tJJ II IJ f f f f tjj ii ij f f f f r tjj II Ij Ij end do D end do L
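Table 4-12 ends with the division of the antisymmetrized residuals by sums of diagonal Fock elements (Equations 4-23 and 4-24). The numpy sketch below shows the shape of that update; the array names are illustrative, and MO indices are used throughout for simplicity even though the algorithm applies the division to AO-labeled virtual indices.

# Sketch of the R2 -> T2 update: divide each residual by
# (f_ii + f_jj - f_aa - f_bb) built from diagonal Fock elements.
# Shapes and names are illustrative.
import numpy as np

no, nv = 4, 6
f_oo = np.random.rand(no)                 # diagonal occupied Fock elements
f_vv = 2.0 + np.random.rand(nv)           # diagonal virtual Fock elements
r2 = np.random.rand(nv, nv, no, no)       # antisymmetrized residual r^{ab}_{ij}

denom = (f_oo[None, None, :, None] + f_oo[None, None, None, :]
         - f_vv[:, None, None, None] - f_vv[None, :, None, None])
t2_new = r2 / denom                       # next-iteration T2 amplitudes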


CHAPTER 5
DISCUSSION AND SIMULATION

Discussion

General Comments

The parallel CCSD algorithm distributes almost every operation over the number of PEs in the Amplitude Company. The few operations that are not distributed are insignificant to overall performance. Specifically, the worst one-particle transformation would be a multiplication of two N_AO x N_AO matrices, which scales operationally like N_AO^3. In contrast, the Direct Integral phase performs matrix multiplications that scale operationally like O^2 N_AO^4. For a system with 100 electrons (50 alpha and 50 beta) and 500 basis functions, there are approximately 125 million operations for the serial multiply and over 309 trillion operations for the parallel multiply. Assuming that every operation takes the same amount of time and that the parallel multiply is 100% efficient, the maximum theoretical speedup is 2.48 million times faster (for this system).

\mathrm{SpeedUp} = \lim_{n_{procs} \to \infty} \frac{t_{ser} + t_{par}}{t_{ser}} = \frac{1.25 \times 10^{8} + 3.09 \times 10^{14}}{1.25 \times 10^{8}} = 2.48 \times 10^{6}    (5-1)

Obviously this limit is unachievable, and there are multiple operations that slow down the parallel portions of the CCSD algorithm. Disk I/O, network messages, and synchronizations all become more expensive as the number of PEs increases. The two primary techniques that help to reduce these costs are real-time load balancing and asynchronous operations (i.e., overlapping computation and communication for both disk reads and remote memory operations).
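The limiting speedup in Equation 5-1 can be reproduced from the quoted operation counts with a few lines; any small difference in the third significant figure comes only from rounding of the counts quoted in the text.

# Reproduce the limiting speedup of Equation 5-1 from the operation
# counts quoted in the text: N_AO^3 serial work versus the O^2 N_AO^4
# parallel work (quoted as 3.09e14 operations).
n_ao = 500
t_ser = float(n_ao ** 3)        # ~1.25e8 serial operations
t_par = 3.09e14                 # parallel operation count quoted in the text
speedup_limit = (t_ser + t_par) / t_ser
print(f"{speedup_limit:.3g}")   # ~2.47e6, i.e. the roughly 2.48 million quoted above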


Load Balancing

Job boards

In this context, a job board is a global structure that any PE can write to or read from at will. The Direct Integral phase must have a global counter that registers the next AO shell pair waiting to be processed. PEs that finish their tasks quickly will read and increment the counter more often than PEs that lag behind the others. Natural load imbalance or CPU contention from other processes can cause this lag (or drift). CPU contention frequently happens if the operating system attempts to run more compute-intensive processes than there are installed CPUs. Even if the parallel program executes exclusively and the number of PEs equals the number of CPUs, the operating system kernel itself will compete for CPU cycles and detract from overall scalability.

Job boards can relieve natural load imbalance in the Direct Integral phase, but the Stored Integral phase cannot easily use them (if at all). The SI phase has two levels of parallelization: the OOVV integrals are distributed over platoons, and the R2 updates are distributed over the PEs within a platoon. Any attempt to balance the load in real time will incur multiple synchronization delays, since the PEs in a platoon have close to, but not exactly, the same number of operations. Instead of real-time load balancing, the AO shells are grouped together to balance the work needed to process each shell.

Data aggregation

All uses of the word "shell" that correspond to CC data processing refer to collections of physical AO shells, which are the smallest entities that direct integral generation programs can calculate. Since the number of operations scales like O^2 N^4, the difference between the smallest and largest physical shell is a direct measure of the imbalance in computations per message. The hydrogen STO-3G basis set has 1


contracted orbital in the s shell. Larger basis sets will typically run up to about 16 orbitals per shell (although the iodine PBS basis has 55 orbitals in the d shell). The hydrogen shell and the average shell are computationally uneven by a factor of 65536 for the DI phase. If the messages pertaining to these two shells cost the same, which is a fair assessment since message latency is an order of magnitude more important than message size, then one PE will have 65536 times more work to do than another. Simply grouping as many physical shells together as possible might not be the best solution, since it is the size variability that affects load balancing.

The parallel CCSD algorithm, before running the first iteration, will examine the collection of physical AO shells and search for the set of logical shells that minimizes the variance of shell sizes. This is a bounded two-dimensional search along the dimensions MaxShellSize and NLogShells. The maximum number of logical shells is simply the number of physical shells. The fewest number of logical shells is 1, and the size of that shell is equal to the sum of all of the physical shell sizes; however, given a maximum shell size, the fewest number of shells is defined in Equation 5-2.

N_{LogShells} \ge \frac{\sum_{i=1}^{N_{PhysShells}} \mathrm{PhysShellSize}_{i}}{\mathrm{MaxShellSize}}    (5-2)

Physical AO shells should not be broken up, since generating direct integrals costs less when orbitals of the same angular momentum and atomic center are calculated at once. Equation 5-2 is actually an inequality because there might not be a perfect grouping of physical shells such that every logical shell has the same size. The search algorithm loops over size limits (MaxShellSize) starting with the largest physical shell size and ending with a predefined limit or the sum of all physical shell sizes, whichever is smaller.
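A minimal sketch of the grouping step for one candidate MaxShellSize is shown below. It uses the standard Worst Fit Decreasing heuristic that the following discussion adopts and scores each candidate partition by the variance of the logical shell sizes; the shell sizes and function names are illustrative.

# Sketch: group physical AO shell sizes into n logical shells with the
# Worst Fit Decreasing heuristic, then score each grouping by the
# variance of the logical shell sizes.  The sizes below are hypothetical.
def worst_fit_decreasing(phys_sizes, n_logical):
    bins = [0] * n_logical
    for size in sorted(phys_sizes, reverse=True):
        bins[bins.index(min(bins))] += size   # emptiest logical shell gets the next shell
    return bins

def size_variance(bins):
    mean = sum(bins) / len(bins)
    return sum((b - mean) ** 2 for b in bins) / len(bins)

phys = [1, 1, 3, 3, 5, 6, 10, 16]     # hypothetical physical shell sizes
max_shell_size = 16                   # candidate MaxShellSize being tested

candidates = []
for n in range(1, len(phys) + 1):
    bins = worst_fit_decreasing(phys, n)
    if max(bins) <= max_shell_size:   # respect the size limit for this candidate
        candidates.append((size_variance(bins), n, bins))
print(min(candidates))                # lowest-variance partition for this MaxShellSize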


For every potential MaxShellSize, the algorithm also loops over the number of logical shells, starting from the theoretical minimum up to the number of physical shells.

The problem of putting objects with mass (physical shells with orbitals) into groups with a mass limit (logical shells with up to MaxShellSize orbitals) is well known in the field of Computer Science as the Bin Packing Problem (BPP) [Cof97]. This is an NP-hard optimization, and there are many algorithms for finding solution candidates. Rather than solve the BPP, the CC algorithm uses the Worst Fit Decreasing (WFD) off-line algorithm limited to a known set of bins (logical shells) and measures the variance of the sizes (Appendix E). The partitioning scheme with the lowest variance should have the best balance between computation and messaging. WFD is ideal for finding possible partitions for this particular problem since it adds the next smallest physical shell to the logical shell with the fewest orbitals.

Asynchronous I/O

Most parallel programs are hampered by the increasing cost of messages as the number of PEs increases. One-sided operations allow PEs to process data while messages are moving through the network. The two limits of this scheme are: 1) the CCSD calculation is effectively free because the messages take so long to arrive, and 2) the program scales almost linearly with the number of PEs because global data is available the moment it is read. Given the complex task and data distributions, it is likely that some portions of the algorithm will be bound by communication costs and others by computation.

Reading the OOVV integrals from disk should not be a performance bottleneck, considering the amount of computation that each PE will do while the AIO_READ is being fulfilled. The inner loops of the SI phase can become heavily dependent on T2 GET operations if the logical shells are not large enough to saturate the retrieval with


computation. The easiest way to increase the number of operations per message is to increase the size of a logical AO shell pair. The difficulty in doing this is that the local scratch memory requirements scale like N^4 with respect to shell size. As more scratch memory is needed, less heap is available for distributed data, and ultimately more PEs will be needed. The problem of balancing shell size, scratch memory, and data distribution is not solved, but as the algorithm has been stated, the data distribution schemes can be tried independently of the data contraction kernels. For example, if more scratch memory is needed but there is no possibility of increasing the heap or the number of PEs, then the DI and SI phases can be further divided to allow for fewer intermediates at the cost of more integral processing.

Synchronization Delays

An attractive feature of the parallel CCSD algorithm is that no PE is required to stall in a wait call for another PE unless the state of their globally accessible data changes (i.e., if R2 is transformed or T2 is updated). A barrier in message passing programs is a routine that causes every PE to wait until all of the PEs in a communication network have entered the barrier. Certain collective operations can function like a barrier, and some pairs of functions act as a true barrier. Reductions involve every PE sending some data to one PE, which combines all of the data into one element. Broadcasts involve one PE sending some data to every PE. Individually, certain PEs may continue after the operation whether or not other PEs have finished their operations, but properly configured, a reduction and a broadcast can together function like a barrier. No PE will proceed until it has received data from one PE, which will not send the reduced element before receiving the contributions from all of the other PEs.
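A minimal mpi4py sketch of this reduce-plus-broadcast pattern is shown below. The dissertation's implementation is not mpi4py; this is only an illustration of the pattern described above.

# Illustration of a reduction followed by a broadcast acting as a
# barrier: no rank can pass the bcast until the root has received every
# rank's contribution.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local_i1 = rank + 1.0                                # stand-in for a local I1 contribution
total = comm.reduce(local_i1, op=MPI.SUM, root=0)    # every rank sends to rank 0
total = comm.bcast(total, root=0)                    # rank 0 sends the reduced value back
# Reaching this line implies every rank has contributed, so the pair of
# collectives has the same effect as a barrier.

Run with, for example, mpiexec -n 4 python reduce_bcast.py to see all ranks arrive with the same reduced value.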

The concept of a fence comes from the MPI-2 standard, which defines it as a collective barrier that ensures all one-sided communications on a shared object are complete before leaving the barrier. Synchronizations of any sort occur essentially 10 times in each CC iteration:

1. *Initialize R and synchronize T before the first DI phase: CO fence (Appendix A)
2. Complete the R2 contributions from the first DI phase: PLT fence
3. Complete the R2 -> R2' transformation into scratch: PLT fence
4. Copy scratch into R2 before the second DI phase: PLT fence
5. *Complete the R2 contributions from the second DI phase: CO fence
6. Reduce I1 contributions onto CO_CDR: CO reduce
7. Receive F1 from CO_CDR: CO bcast
8. Complete the R2 -> R2' transformation into scratch: PLT fence
9. Copy scratch into R2 before the SI phase: PLT fence
10. Complete SI and OP contributions: CO fence
11. Complete the R2 -> T2 update: CO fence
12. Receive T1 from CO_CDR: CO bcast

In the steady state of iterating over T, the first and last synchronizations can be overlapped (i.e., point 1 is the same as point 12 in the list). Point 5 can be overlapped with point 6 in that reducing I1 signals the end of the DI phase (Table 5-1). Collectively, the number of synchronizations does not scale with the problem size; however, the cost of each synchronization point will increase as the number of PEs increases.

Simulation

The million-dollar question that every parallel algorithm must answer is, "Does it scale?" Amdahl's Law [Amd67], which neglects all process contention and performance degradation, describes the maximum theoretical speedup for a fixed problem size running on a fixed number of processors. The simplest expression of this law is shown in Equation 5-3.

Table 5-1. The schedule of synchronization points

 Phase                Heavy Platoon (T1 R1 T2 R2 Scr) / Light Platoon (T1 T2 R2 Scr)
 DI                   r r i w/r  /  r r i w/r
 R(D,M) -> S(L,N)     r w  /  r w
 S(L,N) -> R(L,N)     w r  /  w r
 DI                   r r i w/r  /  r r i w/r
 Reduce I1
 Broadcast F1
 R(L,N) -> S(L,D)     r w  /  r w
 S(L,D) -> R(L,D)     w r  /  w r
 SI; OP:FP            r r i w/r  /  r r i w/r
 OP:FP,FH,FR          r i r i w/r  /  r
 antisym; 1/D         w r 0
 update T2            w w 0 w
 Broadcast T1         w  /  w

Each platoon independently transforms R2 and separates each step with a PLT fence (dashed lines in the original table). The shaded rows correspond to collective communications across the Company. r, w, i, and 0 stand for read, write, increment, and initialize to 0, respectively. The largest platoon calculates T1 while the other platoons update their copies of T2.

    \mathrm{SpeedUp} = \frac{t_{ser} + t_{par}}{t_{ser} + t_{par}/n_{procs}}    (5-3)

However, the parallel CCSD algorithm does not spend the same percentage of time in the parallel parts relative to the serial parts for every molecular system. As the problem size increases, the percentage spent in the serial part decreases; thus, the parallel speedup will increase with system size. This is loosely related to the Gustafson-Barsis Law, which observes that a parallel computer can do bigger calculations relative to the serial computer for data-parallel algorithms as the number of processors increases [Gus88].

The parallel CCSD algorithm leaves N^3-type operations for one CPU (in the form of N x N matrix multiplications). A few one-particle N^5 operations and the N^4 T2 update are distributed over the largest platoon, but most N^5 and all N^6 operations are distributed over the entire Amplitude Company. Furthermore, remote data is retrieved as early as possible and used as late as possible. This provides an ideal countermeasure for variable network performance.

General Overview

A simulator is used to test a design before building a prototype or the final product. Simulating software performance (scalability in the case of parallel CCSD) usually involves estimating computation and data transfer times, which requires hardware parameters such as CPU speed and efficiency; memory, disk, and network I/O latencies and bandwidths; etc. These parameters can be measured and will be different for various computer architectures; however, the goal of the software simulator is not to emulate hardware. By using a variety of parameters, the simulator should be able to expose general strengths and weaknesses of an algorithm that might need to be redesigned.

The simulator used to test the parallel CCSD algorithm constructs a parallel network with an array of floating-point numbers that represents the clocks of the PEs. The program assumes the identity of a PE and estimates how long it would take to perform a set of operations. That time is added to the clock, and the simulator switches to the next PE. For the Direct Integral phase, the simulator maintains a counter that corresponds to the next integral shell pair and becomes the PE with the lowest clock (i.e., the one that would increment the counter next). Once the counter exceeds the number of shell pairs, the clocks are synchronized to the highest value and the R2 amplitudes are transformed. The process repeats itself for the second DI phase. Binary trees are used to estimate reductions and broadcasts. The Stored Integral and One Particle phases do not use real-time load balancing, so the simulator cycles through each PE and adjusts the clock accordingly. Finally, the antisymmetrization and update phases are estimated with all of the appropriate synchronizations and computations.
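A minimal sketch of the clock-array idea for the DI phase is shown below. The routine and the cost function est_work are placeholders invented for illustration; they are not names from the actual simulator.

c     Sketch of the simulator clock array for a DI phase.
c     est_work is a placeholder for the per-shell-pair cost estimate.
      subroutine simulate_di(nPE, nPairs, clock)
      implicit none
      integer nPE, nPairs, pair, pe, i
      double precision clock(nPE), tmax, est_work
      external est_work
      pair = 1
      do while (pair .le. nPairs)
c        become the PE with the lowest clock (the one that would
c        grab the next shell pair from the shared counter first)
         pe = 1
         do i = 2, nPE
            if (clock(i) .lt. clock(pe)) pe = i
         end do
c        estimate the time to process this shell pair and advance
c        that PE's clock
         clock(pe) = clock(pe) + est_work(pe, pair)
         pair = pair + 1
      end do
c     once the counter is exhausted, synchronize every clock to the
c     slowest PE (the fence at the end of the phase)
      tmax = clock(1)
      do i = 2, nPE
         tmax = max(tmax, clock(i))
      end do
      do i = 1, nPE
         clock(i) = tmax
      end do
      return
      end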

Specific Technologies

Local computation. Accurately estimating local mathematical operations is essential to simulating both parallel and serial performance. Each operation involves reading data from memory, operating on the data, and writing the results back to memory (Equation 5-4). S is the data size, B is the memory bandwidth, N_ops is the number of operations, C is the CPU speed, and e is an efficiency modifier. Compiler optimization, loop ordering, and vendor-supplied mathematical libraries will affect performance, so multiple efficiencies may be needed to estimate different parts of the algorithm.

    t_{comp} = \frac{S_{in}}{B_{rd}} + \frac{S_{out}}{B_{wr}} + \frac{N_{ops}}{C\,e}    (5-4)

Asynchronous I/O. Disk operations can be estimated with latency, bandwidth, and the number of channels (Equation 5-5). Latency is usually the time it takes for a disk head to move to the appropriate position on the disk, N_ch is the number of dedicated I/O channels, and N_msg is the number of simultaneous disk operations. Disk contention in the simulator is estimated by multiplying S by the number of PEs on the node (i.e., if one PE is reading data, then another probably is as well).

    t_{I/O} = t_{lat} + \frac{S}{B\,\min(N_{ch}, N_{msg})}    (5-5)

When an asynchronous call is placed, the simulator estimates how long it would take for the data to arrive and saves the time at which it should be available. The PE must then call a wait routine, which will compare the current time to the time the data is useable. Depending on the difference, the simulator will increase the clock (the PE must wait) or return silently (the data arrived while computing).
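Equations 5-4 and 5-5 translate directly into two small cost functions. The Fortran below is an illustrative transcription of the formulas; the function and argument names are assumptions, not the simulator's actual code.

c     Sketch of the cost estimates in Equations 5-4 and 5-5.
      double precision function t_comp(s_in, s_out, n_ops,
     &                                 bw_rd, bw_wr, cpu, eff)
      implicit none
      double precision s_in, s_out, n_ops, bw_rd, bw_wr, cpu, eff
c     read the operands, compute, and write the results (Eq. 5-4)
      t_comp = s_in/bw_rd + s_out/bw_wr + n_ops/(cpu*eff)
      return
      end

      double precision function t_io(s, t_lat, bw, n_ch, n_msg)
      implicit none
      double precision s, t_lat, bw
      integer n_ch, n_msg
c     seek once, then stream S bytes over the active channels (Eq. 5-5)
      t_io = t_lat + s/(bw*min(n_ch, n_msg))
      return
      end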

One-sided communication. Sending and receiving messages through a network is very similar to writing and reading from disk. If two-sided messaging is used, then the clocks of the origin and target must be modified. Blocking messages require the origin and target to have the same clock, while non-blocking messages allow for drift. The parallel CCSD algorithm only uses non-blocking one-sided messages, which are very similar to I/O operations in that only the origin is affected by the time it takes for data to arrive. The simulator allows for on-node and off-node messaging parameters so that PEs in the same squad can communicate at different speeds than those on different nodes. Some MPI implementations require copying data into and out of system buffers, so memory bandwidth might also affect the time.

Collective communication. Barriers, reductions, and broadcasts are simulated with binary trees (Figure 5-10), and the number of stages depends on the number of PEs involved in the operation (Equation 5-6). (Ceiling brackets mean round up to the nearest integer.) Equation 5-7 shows the number of messages being sent in each stage i.

    N_{stages} = \lceil \log_2 N_{PE} \rceil    (5-6)

    N_{msg} = \min\left( 2^{\,i-1},\; N_{PE} - 2^{\,i-1} \right)    (5-7)

Figure 5-10. The binary tree approach to collective communications. 12 PEs require 4 stages, and the last stage only has 4 messages.
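Equations 5-6 and 5-7 can be turned into a small estimator for a tree-based collective. The sketch below is illustrative; the routine name and the single per-stage message time t_msg are assumptions.

c     Sketch: stage count, message count, and time for a binary-tree
c     collective (Equations 5-6 and 5-7).
      subroutine tree_cost(nPE, t_msg, nStages, nMsgTot, t_coll)
      implicit none
      integer nPE, nStages, nMsgTot, i
      double precision t_msg, t_coll
c     number of stages = ceiling(log2(nPE))              (Eq. 5-6)
      nStages = 0
      do while (2**nStages .lt. nPE)
         nStages = nStages + 1
      end do
c     messages in stage i = min(2**(i-1), nPE - 2**(i-1)) (Eq. 5-7);
c     within a stage they proceed concurrently, so the time is one
c     message time per stage
      nMsgTot = 0
      do i = 1, nStages
         nMsgTot = nMsgTot + min(2**(i-1), nPE - 2**(i-1))
      end do
      t_coll = nStages*t_msg
      return
      end

For 12 PEs this gives 4 stages and 1 + 2 + 4 + 4 = 11 messages, with 4 messages in the last stage, matching Figure 5-10.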

Calibration

Memory bandwidth is measured by allocating a large heap and then flushing it with data and reading the data back in. The average of 7 sweeps is used for the memory reading and writing bandwidth. Henceforth, all data will pertain to an IBM POWER3 workstation with four 375 MHz CPUs. The following output was taken from a calibration run:

   512 MB heap at 0x3348f518
   wr was 1.924009 -> 266.11 MB/s; rd was 0.623147 -> 821.64 MB/s
   wr was 1.924412 -> 266.06 MB/s; rd was 0.623328 -> 821.40 MB/s
   wr was 1.922241 -> 266.36 MB/s; rd was 0.623201 -> 821.56 MB/s
   wr was 1.922671 -> 266.30 MB/s; rd was 0.623183 -> 821.59 MB/s
   wr was 1.922141 -> 266.37 MB/s; rd was 0.623107 -> 821.69 MB/s
   wr was 1.922495 -> 266.32 MB/s; rd was 0.623170 -> 821.61 MB/s
   wr was 1.922200 -> 266.36 MB/s; rd was 0.623147 -> 821.64 MB/s
   bw_memwr = 266.27 MB/s (279201344.00 B/s)
   bw_memrd = 821.59 MB/s (861497664.00 B/s)

CPU throughput is rarely the clock speed of the processor, so efficiency modifiers are used to estimate the time it would take to compute a set of operations. The two primary mathematical constructs that the CCSD algorithm uses are two-dimensional matrix multiplications and N^6 array contractions. Many vendors sell highly optimized linear algebra libraries that are tuned for specific computer architectures. Using these libraries for standard matrix multiplies can dramatically increase performance if the program is dependent on such operations. The simulator measures the efficiency of the double precision general matrix multiply (DGEMM) by multiplying two 625x625 matrices. The matrices are then recast as two 25x25x25x25 four-dimensional arrays, and a hand-coded six-loop array contraction is timed, wherein the array contraction performs the same operations as the two-dimensional matrix multiply. The following results were taken from a calibration run:

   CALIBRATION FOR POWER3 375 MHz CPU (2 fma units per CPU)
   750000000 fma/s reference
   ( 625 x 625 ) ( 625 x 625 )
   dgemm took .387582 s -> 88.1849% eff
   dgemm took .384157 s -> 89.0108% eff
   dgemm took .384016 s -> 89.0451% eff
   dgemm took .383978 s -> 89.0543% eff
   dgemm took .384088 s -> 89.0276% eff
   dgemm took .384009 s -> 89.0468% eff
   dgemm took .384046 s -> 89.0378% eff
   loops took 1.785683 s -> 18.4198% eff
   loops took 1.786396 s -> 18.4123% eff
   loops took 1.785581 s -> 18.4208% eff
   loops took 1.785778 s -> 18.4188% eff
   loops took 1.786151 s -> 18.4149% eff
   loops took 1.785726 s -> 18.4193% eff
   loops took 1.785444 s -> 18.4223% eff
   dgemm eff = 0.889153
   loops eff = 0.184183
   vendor is 4.83 times more efficient

The results are then compared to another set of larger multiplications and contractions to test for predictability. The simulator uses the latter set of parameters if the results are precise. The following results were generated using the previous data:

   ( 30 x 30 40 x 40 ) x ( 40 x 40 40 x 40 )
   DGEMM: real: 3.549262 s (est: 3.53) eff: .8892
   DGEMM: real: 3.547543 s (est: 3.53) eff: .8892
   DGEMM: real: 3.547254 s (est: 3.53) eff: .8892
   LOOPS: real: 22.458201 s (est: 16.76) 6-loop eff: .1373
   LOOPS: real: 22.448650 s (est: 16.76) 6-loop eff: .1373
   LOOPS: real: 22.446517 s (est: 16.76) 6-loop eff: .1373
   LOOPS: real: 22.450999 s (est: 16.76) 6-loop eff: .1373
   LOOPS: real: 22.447324 s (est: 16.76) 6-loop eff: .1373
   LOOPS: real: 22.445928 s (est: 16.76) 6-loop eff: .1373
   LOOPS: real: 22.449582 s (est: 16.76) 6-loop eff: .1373

Disk latency and bandwidth are difficult to measure in an isolated environment. Most operating systems use aggressive file caching, which can have a noticeable effect on performance curves that compare message size to bandwidth. The parallel CCSD algorithm is somewhat forgiving of exact performance because all disk reads are asynchronous; nonetheless, calibrating the I/O performance attempts to bypass the CPU and memory caches.

The benchmarking program iozone [Nor02] was used to measure the disk latency and bandwidth, and all files were created with the descriptors O_DIRECT (direct I/O), O_SYNC (synchronous writes), and O_RSYNC (synchronous reads). Comparing file record length to the time it took to read a random record produced data that were used to describe the local disk performance (Figure 5-11 and Table 5-2).

Figure 5-11. The local disk random read calibration data. The sets of data are almost identical no matter what file size (kB) was used between 2 and 512 MB.

Table 5-2. The latencies and bandwidths of random disk reads with respect to file size

 File Size (kB)   Brd (MB/s)   tlat (ms)
 2048             29.7         3.5
 4096             30.4         4.0
 8192             30.0         3.2
 16384            30.4         3.5
 32768            30.5         2.9
 65536            31.1         4.0
 131072           31.2         4.3
 262144           31.5         4.9
 524288           31.4         4.8

File sizes are in kB, bandwidths (Brd) are in MB/s, and latencies (tlat) are in ms. For each file size, the bandwidth is the inverse of the slope of the linear regression of record lengths vs. time to read, and the latency is the intercept. (The minimum R^2 was 0.999.) The average bandwidth and latency for all 9 sets were 30.7 MB/s and 3.9 ms, respectively.

Measuring network messaging performance was kept relatively simple. The mpptest program, which ships with the MPICH library [Gro96], was used to measure the point-to-point (P2P) and bisection bandwidths for messages ranging from 16 to 64 kB. P2P involves only two PEs, while bisection involves all PEs. Figure 5-12 shows the on-node P2P and bisection bandwidths measured on the IBM workstation. Latency is taken from the intercept of the linear regression, not the zero-size transfer time. The regression analysis for both curves suggests that both pairs of CPUs can communicate without degradation, so the simulator set the number of on-node communication channels to 2 and used the approximate values of 210 µs and 99 MB/s for latency and bandwidth, respectively. Three off-node network architectures were tested: switched Fast Ethernet and two 150 MB/s full duplex, redundant path IBM SP switches (with different configurations). Only P2P tests were used to measure off-node latency and bandwidth (Figure 5-13), and Equation 5-5 was used to estimate degradation.

Figure 5-12. The calibration data for on-node messaging parameters. The regression equations are P2P: t = 9.9477*S + 210.69 (R^2 = 0.9973) and bisection: t = 9.8093*S + 209.7 (R^2 = 0.9982), with S in kB and t in µs. Bandwidth is the inverse of the regression slope, and latency is the intercept. The performance does not change with one or two pairs of CPUs communicating at the same time, which implies two independent messaging channels.

Figure 5-13. The calibration data for off-node messaging parameters. The regression equations are Fast Ethernet: t = 86.987*S + 553.42 (R^2 = 0.9998); SP switch 1: t = 15.304*S + 265.18 (R^2 = 0.9915); and SP switch 2: t = 11.942*S + 180.88 (R^2 = 0.9970). Bandwidths are 11.2, 63.8, and 81.8 MB/s, respectively.
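The bandwidths quoted in the figure captions follow from the regression slopes by a unit conversion; as a check (assuming the binary megabyte, 1 MB = 2^20 B, which reproduces the quoted values), the SP switch 2 slope of 11.942 µs/kB gives

    B = \frac{1024\ \mathrm{B}}{11.942\ \mu\mathrm{s}} \approx 85.7\ \mathrm{B}/\mu\mathrm{s} \approx 81.8\ \mathrm{MB/s},

with the 180.88 µs intercept taken as the latency.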

As in the case of asynchronous I/O, most messages in the parallel CCSD algorithm are one-sided and will hide behind computation. If the amount of computation is less than the estimated time spent in communication, then a more accurate messaging function should be used.

Theoretical Examples

With reasonable hardware parameters and timing algorithms, simulations of real molecular systems can be used to estimate the performance of the parallel CCSD algorithm. Table 5-3 lists the 20 amino acids and the corresponding numbers of electrons and spin pairs. Also listed are the sizes of the open-shell OOVV integrals (and thus the minimum disk requirements, with 1 GB equal to 1024^3 bytes) and the minimum number of PEs with 1 GB of heap needed to calculate the UHF CCSD energy if each system had 1024 AO basis functions. Energy calculations of the anion, cation, and zwitterion are certainly achievable with the resources of current networks of workstations and clusters of SMP machines.

For 1024 AO basis functions, the size of the two-electron integrals is 8 TB. Approximately one-eighth of these need to be calculated on the fly since the others can be generated by permutations; however, assuming the integrals are dense, the full 8 TB array must be generated and processed twice (once for each spin case). Systems of this size should be executed on hundreds if not a thousand PEs or more. The initial parallelized software cannot scale this high, much less store the amount of data needed by each node. The parallel CCSD algorithm overcomes this with fine-grain task distribution and by distributing large arrays over as few PEs as possible. The net effect is that large systems can be calculated with few resources at the expense of computing time.
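The entries in Table 5-3 can be traced back to the electron count. The following check for glycine is inferred from the tabulated numbers rather than stated in the text, and the factor of three in the PE estimate (T2, R2, and one OOVV-sized scratch array, each distributed over 1-GB heaps) is an assumption consistent with the table caption:

    n_{OO} = \tfrac{1}{2}\, n_e (n_e - 1) = \tfrac{1}{2}(40)(39) = 780
    \mathrm{Disk} = n_{OO}\, N_{AO}^2 \times 8\ \mathrm{B} = 780 \times 1024^2 \times 8\ \mathrm{B} \approx 6.1\ \mathrm{GB}
    \mathrm{Min\ PEs} \approx \lceil 3 \times 6.1 \rceil = 19

Both values match the glycine row of the table.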

Table 5-3. The minimum resources for amino acids with 1024 AOs in UHF CCSD

 Amino acid       C  H  N  O  S   ne    nOO   Min PEs  Disk (GB)  Disk savings
 Alanine          3  7  1  2  0   48   1128    27       8.8       99.92%
 Cysteine         3  7  1  2  1   64   2016    48      15.8       99.86%
 Aspartic acid    4  7  1  4  0   70   2415    57      18.9       99.84%
 Glutamic acid    5  9  1  4  0   78   3003    71      23.5       99.79%
 Phenylalanine    9 11  1  2  0   88   3828    90      29.9       99.73%
 Glycine          2  5  1  2  0   40    780    19       6.1       99.95%
 Histidine        6  9  3  2  0   82   3321    78      25.9       99.77%
 Isoleucine       6 13  1  2  0   72   2556    60      20.0       99.83%
 Lysine           6 14  2  2  0   80   3160    75      24.7       99.78%
 Leucine          6 13  1  2  0   72   2556    60      20.0       99.83%
 Methionine       5 11  1  2  1   80   3160    75      24.7       99.78%
 Asparagine       4  8  2  3  0   70   2415    57      18.9       99.84%
 Proline          5  9  1  2  0   62   1891    45      14.8       99.87%
 Glutamine        5 10  2  3  0   78   3003    71      23.5       99.79%
 Arginine         6 14  4  2  0   94   4371   103      34.1       99.69%
 Serine           3  7  1  3  0   56   1540    37      12.0       99.90%
 Threonine        4  9  1  3  0   64   2016    48      15.8       99.86%
 Valine           5 11  1  2  0   64   2016    48      15.8       99.86%
 Tryptophan      11 12  2  2  0  108   5778   136      45.1       99.59%
 Tyrosine         9 11  1  3  0   96   4560   107      35.6       99.68%

The data assume the scratch memory needed by each PE is less than the amount of data needed to store the third OOVV array during an R2 transformation. Every example saves over 99.5% of the disk space required by a traditional MO-based CCSD algorithm.

Runtime Simulations

DCEE*

A simulation of the parallel CCSD algorithm was performed on the DCEE* system described in Chapter 2. Two network scenarios were tested: Fast Ethernet (F.Eth) and the faster of the two IBM SP switches (henceforth referred to as x3sw). Each node in the network was configured as if it were the aforementioned 4-CPU IBM workstation. All data are compared to 100% and 50% theoretical efficiencies (both are technically linear, but "linear" scaling commonly refers to 100% theoretical efficiency). (The parallel algorithm requires at least two 1-GB PEs, so all data are compared to this as the reference calculation.)

Fast Ethernet. The F.Eth simulation shows acceptable parallel performance up to 64 PEs, but beyond that, efficiency drops below 50%, which is considered to be unacceptable (Figures 5-14 and 5-15). However, even up to 1024 PEs, the scalability curve is still increasing. Efficiency shows normal behavior for parallel programs (starting out high and decreasing asymptotically).

Figure 5-14. The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast Ethernet. Xth (100%) and Xth (50%) are the theoretical speed-up curves with 100% and 50% efficiencies, respectively. X pred. is the predicted speed-up for the simulation. Efficiency*128 is the efficiency scaled by 128.

Beyond the standard analysis of speed-up and efficiency, users might concern themselves with the idea of cost-to-performance. Another way of stating the idea: "At what point does the cost of using more PEs outweigh the gain in speed-up they might contribute?" The "cost" function is considered to be the efficiency, and the metric for "bang per buck" is efficacy (Equation 5-8) [Gho91].

    \eta(p) = E(p)\, S(p)    (5-8)

Efficiency times speed-up can have multiple maxima, but only the first maximum value (with respect to the number of PEs) corresponds to the most cost-effective use of PEs in a parallel run. The F.Eth simulation of DCEE* peaks at 64 PEs (η = 10.2), but the curve is so shallow that one could argue anywhere between 32 and 96 PEs is still cost effective.

Figure 5-15. The simulated performance curves for DCEE* on 2 to 128 PEs over Fast Ethernet. The curves correspond to the same series as in Figure 5-14 with the exception that efficiency has been scaled by 32, not 128.

SP switch. The x3sw simulation shows much better performance (Figure 5-16). Predicted speed-up never drops below 50% efficiency up to 1024 PEs, and efficacy has not peaked. The simulation suggests that the parallel performance for this system (DCEE*) is highly dependent on communication speed.

Figure 5-16. The simulated performance curves for DCEE* on an SP switch. (The series are the same as in Figure 5-14, with efficiency scaled by 300.)
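As a purely arithmetical illustration of Equation 5-8 (the numbers are round values chosen for clarity, not read from the figures): a run on p = 64 PEs with an efficiency of E(64) = 0.40 corresponds to a speed-up of S(64) = 0.40 x 64 = 25.6, and therefore

    \eta(64) = E(64)\, S(64) = 0.40 \times 25.6 \approx 10.2,

which is the size of the peak value quoted above for the F.Eth simulation of DCEE*.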

Comparison to serial ACES II. Efficacy is a relative metric that depends on the machine and (molecular) system, but it can be used to compare the effectiveness of different parallel programs running the same system on the same computer. Figure 5-17 compares the speed-up and efficacy of the new algorithm to the previous parallel CCSD algorithm. The new algorithm scales better and is much more cost effective per PE; however, the comparison of real times does not fare so well (Figure 5-18).

Figure 5-17. The relative performance of the initial parallelization attempt compared to the new algorithm. The discontinuity of the previous parallel program comes from allocating whole nodes with 16 CPUs each.

The new algorithm achieves high scalability by decomposing the equations to such a degree that the PEs are swamped with calculations in an attempt to use as little disk and memory as possible. The current serial ACES II CCSD program is 31.4 times faster than the new parallel algorithm, but it also requires much more disk space and does not scale as well. Nonetheless, Table 5-4 lists the numbers of PEs that the new algorithm would need to match the time achieved by the initial parallelization attempt.

It is clear that the CCSD energy can be calculated quickly with few PEs for molecular systems that have small data requirements (DCEE* used on the order of 2 GB of disk space per node). The current ACES II program and the initial parallelization attempt are more than adequate for reducing the time the new parallel algorithm would take; however, the new algorithm requires very little disk space and can calculate energies for very large systems.

Figure 5-18. The absolute performance of the initial parallelization attempt compared to the new algorithm. The dark lines are the factors of speed-up achieved by using the previous code compared to the reference calculation of the new algorithm.

Table 5-4. The number of PEs needed to surpass the initial parallel performance

 Previous PEs   New PEs   Extrapolated speed-up
 1              76        31.8
 32             409       146
 64             514       176
 128            424       151
 256            405       145

The simulated speed-up curve was fitted with a cubic regression, which was used to predict how many PEs the new algorithm needs to finish a CC iteration in the same time that the initial parallelization attempt did.

Serine and phenylalanine

Using the x3sw network parameters, the simulator examined serine (Ser) and phenylalanine (Phe), both in the cc-pVQZ basis set. Ser, with 56 electrons, had 595 AO basis functions, and Phe (88 e-) had 990 AOs. The current ACES II UHF CCSD program, using the AO algorithm for VVVV integrals and assuming 20% of the AO integrals are discarded, would require over 200 GB of local disk storage for Ser and over 1.5 TB for Phe. The new parallel algorithm would need only 2.8 and 19.2 GB of distributed storage, respectively.

Figures 5-19 and 5-20 show the simulated parallel performance of Ser and Phe. Ser requires a minimum of 14 1-GB PEs, and Phe requires 95 PEs. In either system, parallel efficiency never drops below 96%. With respect to tractability, a single Ser UHF CCSD iteration is expected to take 2.8 days on 1024 PEs. Phe requires approximately 2 months per iteration on 1024 PEs. If the efficiency curve is to be believed, then this can be dropped to about 2 weeks on over 4000 PEs.

Figure 5-19. The performance curves for serine on 16 to 1024 PEs. (The series are the same as in Figure 5-14, with efficiency scaled by 64.)

Figure 5-20. The performance curves for phenylalanine on 128 to 1024 PEs. (The series are the same as in Figure 5-14, with efficiency scaled by 8.)

The simulator is only able to model local computations with high accuracy. These are generally highly ordered, well-defined sequences of commands executing on a consistent processor (barring process contention in the operating system). After these, the performance of local disk operations is somewhat predictable, even with I/O contention from other PEs. The least predictable performance aspect of a parallel program is network communications. This seems to be an N! correlation problem in its own right (never mind coupled-cluster theory), but certain models are able to predict the statistical nature of network traffic rather well.

The current implementation of the simulator treats the network as a single, flat interconnected fabric in which every PE can reduce network bandwidth for every other PE. In principle, this should then predict the worst performance since multi-layered network hierarchies favor local communications, which the platoons are supposed to capitalize on. However, the penalty function for estimating network contention lacks the ability to scale up with the number of PEs (i.e., more PEs reduces the accuracy of the predicted network effects). This should be corrected with a better model such as the hyperbolic model for layered environments [Sto96].

At a more fundamental level, the simulator assumes all messages can be delivered in one eager message. This means each PE can send and receive data without bothering to divide the information into multiple messages, but more importantly, it completely eliminates the need for the target PE to acknowledge or create space for the incoming data. In practice, this is rarely the case, and this rendezvous protocol must be modeled.

The simulator is also limited to sharing I/O bandwidth among PEs on a single node. Many high-performance parallel supercomputers use storage area networks to increase capacity and reduce operating and maintenance costs. The consequence of this is that I/O contention must then take into account all of the PEs and not those on a single node.

Despite the crudeness, the simulator is able to predict that the computational workload will be distributed very well over the parallel environment, in part because the number of operations is many orders of magnitude greater than the number of PEs. Overlapping one-sided communication and computation and reading from disk asynchronously both ensure a high degree of scalability for large molecular systems distributed over hundreds of PEs.

Beyond predicting parallel performance, a simulator of this nature is able to recommend areas in which capital improvements yield the highest returns. The simulation of serine/cc-pVQZ on 64 4-CPU workstations (with 375 MHz POWER3 CPUs) connected with Fast Ethernet predicts a runtime on the order of 310 hours per iteration. Upgrading the network to the SP switch (with the x3sw parameters) reduces the runtime to approximately 264 hr/iter; however, doubling the CPU speed but keeping the F.Eth network (assuming the same operating efficiencies) reduces the runtime to approximately 216 hr/iter. Of course, any such recommendations are biased toward this parallel algorithm, but the simulator is able to gauge the effect of hardware upgrades without installing test machines (provided it has accurate calibration data).

CHAPTER 6
CONCLUSION

The predicted success of the parallel CCSD algorithm is due to four design principles:

1. PEs can compute while waiting for data retrieval to complete
2. PE coordination can eliminate explicit PE communication
3. Most messages can be localized to a small subset of the network
4. Few (if any) operations require global communication.

Despite the high parallel performance, overall computational cost is still rather high for small to medium sized molecular systems compared to traditional disk-based approaches of serial algorithms. The surest way to decrease overall cost is to repartition the most expensive phases of the algorithm such that reused contractions are stored in local scratch, in distributed memory, or on disk, rather than be recomputed by each PE on demand.

If a user has already allocated 256 PEs, then the objective of a serine/cc-pVQZ calculation should be to finish as quickly as possible and not necessarily to finish as efficiently as possible with respect to the minimum PE configuration. Considering that only 14 PEs are needed to calculate serine in cc-pVQZ, having fewer platoons with more expensive messages is acceptable if the total time can be decreased by an amount that is greater than that returned by the maximum parallel speed-up. However, the algorithm is able to calculate CCSD results for systems that are beyond the reach of even the fastest serial implementation. In this regard, the new parallel CCSD algorithm is the method of last resort that will produce answers in finite time.

Outlook

CCSD energies are not the final goal for a robust treatment of electron correlation. Two paths remain to be investigated before a widely applicable and generally useful program can be released: parallel CCSD analytical energy gradients and parallel non-iterative triples corrections (to both the energy and the gradient).

Gradients

Without energy gradients, structural analysis and force constants must be estimated using grids of finite displacements and single-point energy calculations. First order finite difference grids require 2M displacements to calculate the next higher derivative. (M is the number of vibrational modes, i.e., M = 3N_Atoms - 6 for nonlinear systems.) Second order finite difference grids require 1 + M + M^2 displacements. For example, a numerical Hessian calculation, which produces internal force constants, from total energies alone for a system with 15 atoms requires 1561 displacements.

If analytical gradients were available, then only 78 gradients would be needed to estimate the Hessian matrix, and if each gradient calculation costs roughly 2.5 times that of an energy calculation, then the analytical gradient calculation would be approximately 88% cheaper than the numerical gradient calculation.

CCSD gradients require two main computing sections: forming the Λ de-excitation amplitudes and contracting the derivatives of the AO integrals with the two-particle density matrix [Sal89]. The Λ amplitudes can be determined in much the same way as the T amplitudes, so the techniques described for parallelizing the T equations are identical to those needed for the Λ equations. The derivative integral portion is different enough from the T equations that making decent use of data and task parallelism when forming the actual energy gradient will need to be investigated further.
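The displacement counts quoted above can be verified directly:

    M = 3(15) - 6 = 39, \qquad 1 + M + M^2 = 1 + 39 + 1521 = 1561, \qquad 2M = 78,

and 78 gradients at roughly 2.5 energy-equivalents each cost about 195 energy calculations, i.e., 195/1561 \approx 12.5\% of the numerical-Hessian cost, or roughly 88% cheaper.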

Higher Excitation Clusters

Many weakly bound and dissociative molecular systems require a treatment of triple excitations. Even the most basic treatment can give accurate, predictive answers when CCSD is insufficient. The most widely used treatment of triples is CCSD(T), in which the converged T2 amplitudes from a CCSD calculation are contracted with OOOV and VVVO integrals to generate third-order triple excitation amplitudes, as shown in Equation 6-1 for one spin case of the T3 amplitudes.

    r_{ijk}^{abc} = \hat{P}(i/jk)\,\hat{P}(a/bc)\left[ \sum_{e} \langle ab \| ie \rangle\, t_{jk}^{ec} - \sum_{m} \langle am \| ij \rangle\, t_{mk}^{bc} \right]    (6-1)

    \hat{P}(p/qr)\, f(p,q,r) = f(p,q,r) - f(q,p,r) - f(r,q,p)    (6-2)

The data scale like O^3 V^3 and the computation scales like N^7. Since the T3 amplitudes do not contribute to any quantity other than the energy, traditional methods for calculating them involve looping over as few fixed indices as possible, generating the T3 amplitudes that have those indices in common, calculating the energy contribution, and purging them from memory.

Rendell and coworkers have been persistent in pushing the boundaries of CCSD(T), and they also find that if the integrals are stored, the parallel efficiency decreases dramatically [Ren92, Ren93, Ren94, Kob97]. The difficulty in calculating these terms is that the VVVO integrals are not stored and must be generated directly. It is almost a certainty that the number of integral generations will scale with some power of O, the number of electrons; however, it is not obvious how to parallelize the task and data distribution such that not every integral shell quadruplet will contribute to the full set of T3 amplitudes. Although this seems contradictory, the DI phase for R2 bypasses this by calculating the contributions in uncontracted AO integrals and then transforming them with two N^5 steps after the integral loop.

Orbital Transformations and Space Reductions

There are a few mathematical manipulations that the two-particle space can undergo such that fewer computations are needed during the CC iterations, especially for the PP ladder term. Cholesky decomposition, singular value decomposition, and frozen natural orbitals are three methods for reducing the size of the virtual space or the number of terms that must be calculated. The problem with these methods is that they shift the work from computation to data.

1024 AO basis functions require 8 TB of storage for the two-electron integrals, but this matrix can be generated in parallel and on the fly from the basis set information. A Cholesky or singular value decomposition of the two-electron integral matrix will require at least 8 TB of storage just for the analysis, and since fewer matrix elements will be more important, there will be less work to parallelize and scalability will suffer. Linear combinations of the virtual space, like frozen natural orbitals, will only affect the efficiency of a CC algorithm that uses MO integrals. The proposed parallel CCSD algorithm never formally recognizes that a virtual space exists other than to create the complementary density matrix, Q. As such, MO-based CC algorithms have difficulties with feeding data to the CPUs fast enough if that data must be streamed through I/O.

Conclusion

The iterative CC algorithm is highly amenable to parallelization because a read-only data set (the TOLD amplitudes) must map into an accumulate-only data set (the TNEW amplitudes) through a massive number of computations. The primary difficulties in finding an efficient parallel approach are figuring out how to get distributed data to a CPU for processing and deciding when the CPU should compute data that does not yet exist. As molecular systems increase in size, the computational and storage requirements that CCSD demands quickly become unreasonable for a single computer. The parallel CCSD algorithm described in this dissertation attempts to maximize use of idle CPUs and distributed memory in such a way that a CCSD treatment of electron correlation can now be considered for systems that were believed to be treatable only by one-particle correlation theories such as Density Functional Theory.

APPENDIX A
ACRONYMS AND ABBREVIATIONS

AO       Atomic Orbital
bcast    Broadcast
BPP      Bin Packing Problem
CC       Coupled-Cluster
CCSD     Coupled-Cluster Singles and Doubles
CDR      Commander
CI       Configuration Interaction
CO       Company
CPU      Central Processing Unit
DI       Direct Integral
F.Eth    Fast Ethernet
fma      Fused Multiply-Add
HF       Hartree-Fock
HH       Hole-Hole
I/O      Input/Output
IP       Internet Protocol
LDR      Leader
MO       Molecular Orbital
OOOO     integrals with all four indices occupied (all spin cases)
OOOV     integrals with three occupied indices and one virtual index (all spin cases)
OOVV     integrals with two occupied and two virtual indices (all spin cases)
OP       One Particle
OVVV     see VVVO
P2P      Point-to-Point (or Peer-to-Peer)
PE       Processor Element (CPU and memory heap)
PH       Particle-Hole
PLT      Platoon
PP       Particle-Particle
RHF      Restricted Hartree-Fock
SCF      Self Consistent Field
SI       Stored Integral
SMP      Symmetric Multi-Processor or Shared Memory Program
TCP      Transmission Control Protocol
UHF      Unrestricted Hartree-Fock
VOOO     see OOOV
VVOO     see OOVV
VVVO     integrals with three virtual indices and one occupied index (all spin cases)
VVVV     integrals with all four indices virtual (all spin cases)
WF/WFD   Worst Fit (Decreasing)
x3sw     IBM SP switch with approximately 81.8 MB/s throughput

APPENDIX B
NOTATION AND CONVENTIONS

Operators and Matrix Elements

< pq | rs >             two-electron Coulomb integrals
< pq || rs >            antisymmetrized two-electron Coulomb-exchange integrals, < pq | rs > - < pq | sr >
f_pq                    Fock matrix elements from the one-particle reference wavefunction
V^{Pq}_{Rs}             integrals for particles with different spins
V^{pq}_{rs}             integrals for same-spin particles
c_p                     AO coefficients from the single-determinant reference wavefunction
t_O^V; t_{OO}^{VV}      one- and two-particle T excitation amplitudes, respectively
r_O^V; r_{OO}^{VV}      one- and two-particle T excitation residual amplitudes, respectively (D t = r)
\hat{P}(p,q)            antisymmetric permutation operator, f(p,q) - f(q,p)
\tilde{r}_{OO}^{VV}     two-particle residual amplitude before VV antisymmetrization

Operator Indices

I, J, K, L              occupied alpha molecular orbitals
i, j, k, l              occupied beta molecular orbitals
A, B, C, D              virtual alpha molecular orbitals
a, b, c, d              virtual beta molecular orbitals
p, q, r, s              arbitrary molecular orbitals
M, N, R, S              atomic orbital shells
mu, nu                  atomic orbitals
(spin-projected labels) molecular orbitals from spin s projected into the AO basis

APPENDIX C
ALGORITHM OVERVIEW

Initialization:        r_I = 0;  r_i = 0;  r_IJ = 0;  r_jI = 0;  r_ji = 0;  I_1 = 0

DI phase (pass 1):     r_jI, r_ji, I_1  <--  f, V(direct), T1, T2
                       Q-transform the accumulated mixed-spin block (first half-transformation)

DI phase (pass 2):     r_IJ, r_Ij, I_1  <--  f, V(direct), T1, T2
                       Q-transform each block:  r_IJ <-- Q r_IJ;  r_Ij <-- Q r_Ij;  r_ij <-- Q r_ij

SI OOVV Ring Pass 1:   r_IJ, r_Ij  <--  f, V(OOVV), T1, T2

SI OOVV HH Ladder:     r_IJ, r_Ij, r_ij  <--  f, V(OOVV), T1, T2

SI OOVV Ring Pass 2:   r_Ij, r_ij  <--  f, V(OOVV), T1, T2

SI OOVV Ring Pass 3:   r_Ij  <--  f, V(OOVV), T1, T2

One Particle:          r_I, r_i, r_IJ, r_Ij, r_ij  <--  f, I_1, T1, T2

AntiSym:               r_{IJ}^{AB} = \tilde{r}_{IJ}^{AB} - \tilde{r}_{IJ}^{BA};   r_{ij}^{ab} = \tilde{r}_{ij}^{ab} - \tilde{r}_{ij}^{ba}

T update:              t_I^A = r_I^A / (f_{II} - f_{AA});   t_{IJ}^{AB} = r_{IJ}^{AB} / (f_{II} + f_{JJ} - f_{AA} - f_{BB})
                       (and analogously for the remaining spin cases)

APPENDIX D
ANATOMY OF A SHELL PAIR BLOCK

The T2, R2, and OOVV integrals are stored in blocks of AO shell pairs. Each shell pair block contains three spin blocks: αα, αβ, and ββ. The nested loops that address each element in the MN shell block (called BLOCK) are shown in the following FORTRAN code listing. A pictorial representation is shown in Figure D-1.

      index = 1
c --- AA block for < I<J | mu nu >
      do nu = 1, nAO(N)
        do mu = 1, nAO(M)
          do J = 1, nOcc(1)
            do I = 1, J-1
              print *, I,J,mu,nu,' = ',BLOCK(index)
              index = index + 1
            end do
          end do
        end do
      end do
c --- AB block for < Ij | mu nu >
      do nu = 1, nAO(N)
        do mu = 1, nAO(M)
          do j = 1, nOcc(2)
            do I = 1, nOcc(1)
              print *, I,j,mu,nu,' = ',BLOCK(index)
              index = index + 1
            end do
          end do
        end do
      end do
c --- BB block for < i<j | mu nu >
      do nu = 1, nAO(N)
        do mu = 1, nAO(M)
          do j = 1, nOcc(2)
            do i = 1, j-1
              print *, i,j,mu,nu,' = ',BLOCK(index)
              index = index + 1
            end do
          end do
        end do
      end do

Figure D-1. Pictorial representation of the MN shell pair block (the αα, αβ, and ββ spin blocks in storage order).

APPENDIX E
LOGICAL SHELLS AND BIN PACKING

Bin packing problems (BPP) involve grouping objects together such that each group satisfies a constraint placed on the collection of elements in the group. A prototypical example would be to partition an arbitrary set of numbers such that the sum of the numbers in each group does not exceed a predetermined value. Most of the interest in studying these problems involves optimizing the partition with respect to some parameter, and usually the objective is to minimize the number of groups in the partition.

There are four primary algorithms for finding solutions to the BPP: first fit (FF), best fit (BF), next fit (NF), and worst fit (WF). First fit puts the next ungrouped object in the first group that can accept it. Although "best" is a subjective word, best fit algorithms imply putting the next object in the group that will have the least available space after adding it. Alternatively, worst fit finds the group with the most available space before adding the next object. Next fit algorithms are usually only applied to on-line scenarios.

An off-line scenario is one in which the set of objects to group is finite and known before the packing process; all other scenarios are considered on-line. Next fit also implies "closing" groups such that no other object can be added, no matter how optimal the new group might be, so a next fit algorithm will close the current group and form a new one if the next object cannot fit the open one. Off-line scenarios allow for the objects to be sorted prior to the packing procedure, and this adds two qualifiers to the type of algorithm: increasing and decreasing.

In the parallel CCSD algorithm, processor elements work on data segments in units of AO shells. The number of shells is irrelevant, but the balance of the shells will directly affect the overall load balancing. The physical AO shells are quite uneven in size, and the largest difference in sizes can have a considerable effect on performance, so the physical shells are grouped into similarly sized logical shells.

The search algorithm loops over maximum logical shell sizes and numbers of logical shells, and attempts to minimize the shell variance. The ideal packing algorithm for this scenario is worst fit decreasing (WFD), which places a shell smaller than or equal to the previous shell in the logical shell with the fewest orbitals. The following original FORTRAN code implements a WFD packing algorithm starting from the largest physical shell size and searching up to a user-defined limit or the total number of shells, whichever is smaller.

c     This program implements a worst fit decreasing bin packing
c     algorithm.  It searches over the number of possible bins and the
c     range of maximum bin sizes and minimizes the group mass
c     variance.
c     isort and dependencies come from:
c     http://www.netlib.org/slatec/src/isort.f

      program main
      implicit none

      integer max_cells
      parameter (max_cells=300)

      integer nCells, mass(max_cells), MasLim
      integer TotMass, lrg, sml
      integer GrpMax, nGrps, pGrp0, SmlGrp
      integer m(max_cells), GrpNdx(max_cells), GrpMass(max_cells)
      integer FinGrps, FinNdx(max_cells), FinMass(max_cells)
      double precision avg, FinAvg, var, FinVar, dTmp
      logical bValid
      integer i, j

c --------------------------------------------------------------

      print *, 'Enter the number of cells:'
      read(*,*) nCells
      if (nCells.lt.1) stop 'Nothing to do!'

      print *, 'Enter the cell masses:'
      read(*,*) (mass(i),i=1,nCells)
      TotMass = 0
      lrg = 0
      sml = 2**30
      do i = 1, nCells
         if (mass(i).lt.1) stop 'Cells must have a mass > 0.'
         TotMass = TotMass + mass(i)
         lrg = max(lrg,mass(i))
         sml = min(sml,mass(i))
         m(i) = mass(i)
      end do

      print *, 'Enter the maximum group mass:'
      read(*,*) MasLim
      if (MasLim.lt.lrg) then
         stop 'The group limit is already exceeded.'
      end if

c --------------------------------------------------------------
c      print *, 'The cell masses are: ',(mass(i),i=1,nCells)
c      print *, 'The total mass is ',TotMass
c      print *, 'The largest mass is ',lrg
c      print *, 'The smallest mass is ',sml
c --------------------------------------------------------------

c   o sort the masses in descending order
      call isort(m,i,nCells,-1)

c   o initialize the final groups to the individual cells
      do i = 1, nCells
         FinNdx(i) = i
         FinMass(i) = m(i)
      end do
      FinGrps = nCells
      call stat(FinMass,FinGrps,FinAvg,FinVar)

      print *, 'searching limits from ',lrg,' to ',
     &         min(MasLim,TotMass)
      do GrpMax = lrg, min(MasLim,TotMass)
      do nGrps = (TotMass+GrpMax-1)/GrpMax, nCells

c      o fill in the largest cells
c        (pGrp0 points to the first group to scan from)
         pGrp0 = 1
         do i = 1, nGrps
            GrpNdx(i) = i
            GrpMass(i) = m(i)
            if (m(i).gt.GrpMax-sml) pGrp0 = pGrp0 + 1
         end do
         bValid=(pGrp0.le.nGrps.or.nGrps.eq.nCells)

c      o assign the remaining cells
         do i = nGrps+1, nCells
c         o find the smallest group
            SmlGrp = pGrp0
            do j = pGrp0+1, nGrps
               if (GrpMass(j).lt.GrpMass(SmlGrp)) SmlGrp = j
            end do
            GrpNdx(i) = SmlGrp
            GrpMass(SmlGrp) = GrpMass(SmlGrp) + m(i)
            if (GrpMass(SmlGrp).gt.GrpMax) bValid = .false.
         end do

         if (bValid) then
c         o find the variance of the group masses
            call stat(GrpMass,nGrps,avg,var)
c            print '(2(i3,a),f6.2,a,f8.4)',
c     &            nGrps,' groups <= ',GrpMax,
c     &            ': avg = ',avg,'; var = ',var
c         o compare this grouping with the best known grouping
            if (var.lt.FinVar) then
               print *, '> lower var at ',nGrps,' groups under ',
     &                  GrpMax
               do i = 1, nCells
                  FinNdx(i) = GrpNdx(i)
                  FinMass(i) = GrpMass(i)
               end do
               FinAvg = avg
               FinVar = var
               FinGrps = nGrps
            end if
         else
c            print '(2(i3,a),a)', nGrps,' groups <= ',GrpMax,
c     &            ' is invalid'
         end if

      end do
      end do

      call isort(FinNdx,m,nCells,2)

      print '()'
      print *, 'Cell masses: ',(m(i),i=1,nCells)

c   o print the original scheme
      call stat(m,nCells,avg,var)
      dTmp = (1.d0*lrg)/sml
      print '(2(a,i3))', 'max = ',lrg,'; min = ',sml
      print '(2(a,f8.4))', 'avg = ',avg,'; variance = ',var
      print '(a,f8.2)', 'The cells are uneven by a factor of ',
     &                  dTmp**4
      print '()'

c   o print the new results
      lrg = 0
      sml = 2**30
      do i = 1, FinGrps
         lrg = max(lrg,FinMass(i))
         sml = min(sml,FinMass(i))
      end do
      dTmp = (1.d0*lrg)/sml
      print *, 'There are ',FinGrps,' groups'
      print *, 'Group ndx: ',(FinNdx(i),i=1,nCells)
      print *, 'Grp masses: ',(FinMass(i),i=1,FinGrps)
      print '(2(a,i3))', 'max = ',lrg,'; min = ',sml
      print '(2(a,f8.4))', 'avg = ',FinAvg,'; variance = ',FinVar
      print '(a,f8.2)', 'The cells are uneven by a factor of ',
     &                  dTmp**4

      end

c --------------------------------------------------------------
c     m(1:n) is considered the population, not a sample.
      subroutine stat(m,n,avg,var)
      implicit none
      double precision avg, var, a, v, dTmp
      integer m(*), n, i
      if (n.eq.1) then
         avg = m(1)
         var = 0.d0
         return
      end if
      a = m(1)
      do i = 2, n
         dTmp = 1.d0/i
         a = a + (1.d0*m(i)-a)*dTmp
      end do
      dTmp = 1.d0*m(1)-a
      v = dTmp*dTmp
      do i = 2, n
         dTmp = 1.d0*m(i)-a
         v = v*(i-1)/i
         v = v + dTmp*dTmp/i
      end do
      var = v
      avg = a
      return
      end

The following example was taken from C20 in the AUG-CC-PVQZ basis set with spherical harmonics. (Spherical harmonics have 2L+1 orbitals per contracted basis function, where L is the angular momentum of the shell, while redundant Cartesian orbitals have (L+1)*(L+2)/2 orbitals per basis function.) The molecule is rather pathological in the size of each AO shell. If the shells are too large and the scratch memory requirements are too high, then they can be broken up at the expense of recalculating recursion data in the direct integral code.

   binpack> cat stdin.c20.aug-cc-pvqz
   100
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   06 15 20 21 18
   30

   binpack> ./a.out < stdin.c20.aug-cc-pvqz
   Enter the number of cells:
   Enter the cell masses:
   Enter the maximum group mass:
   searching limits from 21 to 30
   > lower var at 80 groups under 21

   Cell masses: 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 15 6 6 15 15 6 15 6 6 15 6 15 6 15 15 6 15 6 6 15 15 6 6 15 15 6 6 15 6 15 15 6 6 15 6 15 6 15 15 6
   max = 21; min = 6
   avg = 16.0000; variance = 29.2000
   The cells are uneven by a factor of 150.06

   There are 80 groups
   Group ndx: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 61 62 62 63 63 64 64 65 65 66 66 67 67 68 68 69 69 70 70 71 71 72 72 73 73 74 74 75 75 76 76 77 77 78 78 79 79 80 80
   Grp masses: 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21
   max = 21; min = 18
   avg = 20.0000; variance = 1.5000
   The cells are uneven by a factor of 1.85

The "unevenness" is calculated from the ratio of the largest to smallest shell size raised to the fourth power, (L/S)^4, which is due to the two-electron integrals. Each shell quadruplet requires N^4 storage and is used in an O^2 N^4 contraction with T2.

LIST OF REFERENCES

ACES    ACES II is a program product of the Quantum Theory Project, University of Florida. Authors: J. F. Stanton, J. Gauss, J. D. Watts, M. Nooijen, N. Oliphant, S. A. Perera, P. G. Szalay, W. J. Lauderdale, S. A. Kucharski, S. R. Gwaltney, S. Beck, A. Balková, D. E. Bernholdt, K. K. Baeck, P. Rozyczko, H. Sekino, C. Hober, and R. J. Bartlett. Integral packages included are: VMOL (J. Almlöf and P. R. Taylor); VPROPS (P. Taylor); and ABACUS (T. Helgaker, H. J. Aa. Jensen, P. Jørgensen, J. Olsen, and P. R. Taylor).

Amd67   G. M. Amdahl, AFIPS Conference Proceedings 30 (1967) 483-485.

Bar95   R. J. Bartlett, in Modern Electronic Structure Theory, Part I, edited by D. R. Yarkony (World Scientific Publishing Co., Singapore, 1995) 1047-1131.

Bau02   G. Baumgartner, D. E. Bernholdt, D. Cociorva, R. J. Harrison, S. Hirata, C. Lam, M. Nooijen, R. M. Pitzer, J. Ramanujam, and P. Sadayappan, in Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (IEEE Computer Society Press, Baltimore, Maryland, 2002) 1-10.

Cof97   E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson, in Approximation Algorithms for NP-Hard Problems, edited by D. Hochbaum (PWS Publishing, Boston, 1997) 46-93.

Cra98   T. D. Crawford and H. F. Schaefer III, in Reviews in Computational Chemistry, Vol. 14, edited by K. B. Lipkowitz and D. B. Boyd (Wiley-VCH, New York, 2000) 33-136. [URL: http://zopyros.ccqc.uga.edu/lec_top/cc/html/review.html accessed 26 June 2004]

Dun00   T. H. Dunning, Jr., J. Phys. Chem. A 104 (2000) 9062-9080.

Fed98   Federal Emergency Management Agency Emergency Management Institute, IS195 Incident Command System, Federal Emergency Management Agency (ONLINE, 1998). [URL: http://training.fema.gov/EMIWeb/IS/is195lst.asp accessed 30 June 2004]

Gho91   D. Ghosal, G. Serazzi, and S. K. Tripathi, IEEE Trans. Soft. Eng. 17 (1991) 443-453.

Gro96   W. Gropp, E. Lusk, N. Doss, and A. Skjellum, Parallel Computing 22 (1996) 789-828. [URL: http://www-unix.mcs.anl.gov/mpi/mpich accessed 21 July 2004]

Gro99   W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the Message Passing Interface (The MIT Press, Cambridge, Massachusetts, 1999).

Gus88   J. L. Gustafson, Communications of the ACM 31 (1988) 532-533.

Hal00   B. C. Hall, An Elementary Introduction to Groups and Representations (ONLINE, 2000). [URL: http://arxiv.org/abs/math-ph/0005032 accessed 3 July 2004]

Hir03   S. Hirata, J. Phys. Chem. A 107 (2003) 9887-9897.

Kob97   R. Kobayashi and A. P. Rendell, Chem. Phys. Lett. 265 (1997) 1-11.

Koc96   H. Koch, A. S. de Merás, T. Helgaker, and O. Christiansen, J. Chem. Phys. 104 (1996) 4157-4165.

Mus02   M. Musial, S. A. Kucharski, and R. J. Bartlett, J. Chem. Phys. 116 (2002) 4382-4388.

Nor02   W. Norcott and D. Capps, Iozone ver. 3.221 (COMPUTER SOFTWARE, 2002). [URL: http://www.iozone.org accessed 21 July 2004]

Per03   M. Pernpointner and L. Visscher, J. Comput. Chem. 24 (2003) 754-759.

Ren92   A. P. Rendell, T. J. Lee, and R. Lindh, Chem. Phys. Lett. 194 (1992) 84-94.

Ren93   A. P. Rendell, T. J. Lee, A. Komornicki, and S. Wilson, Theor. Chim. Acta 84 (1993) 271-287.

Ren94   A. P. Rendell and T. J. Lee, J. Chem. Phys. 101 (1994) 400-408.

Sal89   E. A. Salter, G. W. Trucks, and R. J. Bartlett, J. Chem. Phys. 90 (1989) 1752-1766.

Sto96   I. Stoica, F. Sultan, and D. E. Keyes, J. Parallel Dist. Comput. 39 (1996) 29-45.

BIOGRAPHICAL SKETCH

Anthony D. Yau graduated from Miami Palmetto Senior High School in the Spring of 1993. He attended the University of Florida as an undergraduate student from August 1993 to December 1996, and graduated with a B.S. in Chemistry. After surveying the possibilities for graduate studies in Theoretical Chemistry, he reapplied to the University of Florida as a graduate student in January 1997, and worked with Dr. Rodney J. Bartlett in the Quantum Theory Project, a world-class consortium of faculty in the Chemistry and Physics departments interested in developing, implementing, and applying quantum theories.


Permanent Link: http://ufdc.ufl.edu/UFE0006633/00001

Material Information

Title: A Parallel Direct Open-Shell Coupled-Cluster Singles and Doubles Algorithm
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0006633:00001

Permanent Link: http://ufdc.ufl.edu/UFE0006633/00001

Material Information

Title: A Parallel Direct Open-Shell Coupled-Cluster Singles and Doubles Algorithm
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0006633:00001


This item has the following downloads:


Full Text











A PARALLEL DIRECT OPEN-SHELL
COUPLED-CLUSTER SINGLES AND DOUBLES ALGORITHM

















By

ANTHONY D. YAU


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA INT PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2004
































Copyright 2004

by

Anthony D. Yau
















ACKNOWLEDGMENT S

I thank my family and friends for supporting me throughout the years, my strength

comes from them. I thank the faculty who have instructed me and the members of Rod

Bartlett's research group. I owe particular thanks to Dr. Erik Deumens for continued

motivation and guidance in scientific computing (and for teaching the first parallel

programming class that I attended) and to Dr. Victor Lotrich for insights into

implementing Coupled-Cluster theory. I especially thank Dr. Ajith Perera for extending

so much help (and patience) to me during all of my studies throughout graduate school.

The United States Department of Defense High Performance Computing

Modernization Program (HPCMP) supported all of this work. A grant under the

Programming Environment and Training (PET) program (GSA Contract No. DAAD05-

01-C-003) funded the initial parallelization effort, and the Common High Performance

Computing Software Support Initiative (CHSSI) program (proj ect CBD-03) funded the

research on the fine-grain parallel algorithm.




















TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1  INTRODUCTION

2  PARALLELIZING A SERIAL CCSD PROGRAM
   Serial Profile
   Bottleneck Categorization and Parallelization
      Particle-Particle Ladders
      Matrix Multiplication
      I/O-Dependent Routines
   Parallel Profile
   Top-Down Parallelization Considerations

3  SPIN-UNRESTRICTED CCSD EQUATIONS
   Group 1: T2 Virtual Orbitals Coming from V
   Group 2: T2 Virtual Orbitals Coming from V and T
      Set 1: One Virtual from V, One Virtual from T1
      Set 2: One Virtual from V, One Virtual from T2
   Group 3: T2 Virtual Orbitals Coming from T
      Set 1: Rings and PH Ladders
         Subset 1: AC/BD
         Subset 2: DB/CA
         Subset 3: CB/AD
      Set 2: HH Ladders from OOVV, OOOV, and OOOO
   Group 4: T2 into T1
   Group 5: One-Particle Transformations
      Set 1: T1 Contributions
      Set 2: T2 Contributions

4  PARALLEL ALGORITHM
   Virtual Parallel Computer
   Communication Networks
   Data Management
      Arrays stored in distributed memory
      Arrays stored on disk
   Parallel UHF CCSD
      Direct Integral Phase
         Design considerations
         Algorithm
      Stored Integral Phase
      One Particle Phase
      Antisymmetrization and T2 Update

5  DISCUSSION AND SIMULATION
   Discussion
      General Comments
      Load Balancing
         Job boards
         Data aggregation
         Asynchronous I/O
      Synchronization Delays
   Simulation
      General Overview
      Specific Technologies
      Calibration
      Theoretical Examples
      Runtime Simulations
         DCEE*
         Serine and phenylalanine

6  CONCLUSION
   Outlook
      Gradients
      Higher Excitation Clusters
      Orbital Transformations and Space Reductions
   Conclusion

APPENDIX

A  ACRONYMS AND ABBREVIATIONS

B  NOTATION AND CONVENTIONS
   Operators and Matrix Elements
   Operator Indices

C  ALGORITHM OVERVIEW

D  ANATOMY OF A SHELL PAIR BLOCK

E  LOGICAL SHELLS AND BIN PACKING

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


















LIST OF TABLES


2-1. The distribution of real time over high-level CCSD subroutines
3-1. The RHF CCSD T1 contributions
3-2. The RHF CCSD T2 contributions
4-1. The direct integral processing loop summary
4-2. The Direct Integral phase prologue memory map
4-3. The Direct Integral phase inner loop memory map
4-4. The Direct Integral phase epilogue memory map
4-5. The Stored Integral phase Group 3 Set 1 Subset 1 memory map
4-6. The Stored Integral phase Group 3 Set 1 Subset 2 memory map
4-7. The Stored Integral phase Group 3 Set 1 Subset 3 memory map
4-8. The Stored Integral phase Group 3 Set 2 memory map
4-9. The One Particle phase FP T2(αβ) memory map
4-10. The One Particle phase FP T2(αβ) (and same spin terms) memory map
4-11. The One Particle phase FH and FR local transformations
4-12. The R2 antisymmetrization and T2 update memory map
5-1. The schedule of synchronization points
5-2. The latencies and bandwidths of random disk reads with respect to file size
5-3. The minimum resources for amino acids with 1024 AOs in UHF CCSD
5-4. The number of PEs needed to surpass the initial parallel performance



















LIST OF FIGURES


2-1. The distribution of real time over primary CCSD subroutines
2-2. The breakdown of significant UHF subroutines
2-3. The column assignments for each of 4 CPUs updating R2
2-4. The distribution scheme for matrix multiplies over CPU threads
2-5. The parallel speedup for each major UHF CCSD subroutine and the CCSD executable
2-6. The runtime profile of parallel CCSD with a UHF reference from DCEE*
4-1. Communication network boundaries
4-7. The memory heap management for various molecular systems
4-2. The T2 and R2 memory layout
4-8. The data flow in the Direct Integral phase
4-3. The path of Group 3 Set 1 Subset 1 updates per OOVV integral
4-4. The LD shell distribution schemes
4-9. The data flow in the Stored Integral phase
5-10. The binary tree approach to collective communications
5-11. The local disk random read calibration data
5-12. The calibration data for on-node messaging parameters
5-13. The calibration data for off-node messaging parameters
5-14. The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast Ethernet
5-15. The simulated performance curves for DCEE* on 2 to 128 PEs over Fast Ethernet
5-16. The simulated performance curves for DCEE* on an SP switch
5-17. The relative performance of the initial parallelization attempt compared to the new algorithm
5-18. The absolute performance of the initial parallelization attempt compared to the new algorithm
5-19. The performance curves for serine on 16 to 1024 PEs
5-20. The performance curves for phenylalanine on 128 to 1024 PEs
D-1. The spin blocks in an AO shell pair block
















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

A PARALLEL DIRECT OPEN-SHELL
COUPLED-CLUSTER SINGLES AND DOUBLES ALGORITHM

By

Anthony D. Yau

December 2004

Chair: Rodney J. Bartlett
Major Department: Chemistry

Coupled-Cluster Singles and Doubles (CCSD) is a benchmark ab initio electron correlation method. The biggest factor preventing its widespread adoption for large molecular systems is that the theory scales like N^4 in data and N^6 in computation (N being the number of atomic orbital basis functions). Using simple factorization techniques together with data and task parallelization, a new approach to calculating open-shell CCSD energies is available. For a molecular system with 50 electrons (25 alpha, 25 beta) and 500 AO basis functions, this approach has the potential to reduce the storage requirements of a traditional UHF CCSD algorithm in a molecular orbital basis from 630 GB to 2.3 GB. The number of operations will increase, but the natural parallelization of the data provides a clear and concise framework for parallelizing the computations.















CHAPTER 1
INTRODUCTION

Coupled-Cluster (CC) theory is an ab initio electron correlation method that can be

thought of as infinite-order perturbation theory or a size-extensive approximation to the

full configuration interaction (CI) analysis [Bar95]. Given a one-particle reference

wavefunction Φ0, the CC wavefunction is defined as an exponential expansion of

excitation clusters, T, from the occupied orbitals to the unoccupied (virtual) orbitals. The

T operator is a particle excitation operator that naturally truncates at the number of

particles, P, in the system (Equation 1-2). Occupied orbitals (the hole space) are denoted

with the letters i, j, k, and l, while virtual orbitals (the particle space) are denoted with the

letters a, b, c, and d. Appendix A lists acronyms and abbreviations, and Appendix B

summarizes index conventions and matrix elements.


|\Psi_{CC}\rangle = \exp(T)\,|\Phi_0\rangle   (1-1)

T = \sum_{p=1}^{P} T_p, \qquad T_p = \left(\frac{1}{p!}\right)^{2} \sum_{ijk\ldots} \sum_{abc\ldots} t_{ijk\ldots}^{abc\ldots}\, a^{\dagger} i\, b^{\dagger} j\, c^{\dagger} k \cdots   (1-2)

The difficulty in calculating the full CI is that the data requirement scales like N^{2P} and the computational requirement scales like N^{2P+2}, in which N is the size of the basis set

space. To retain size extensivity and computational tractability, the excitation operator is

usually truncated at a small number of excitations. The most common is at double

excitations, such that T=T1+T2. As computers become more capable of massive

calculations, the number of excitations can increase, but the size of the molecular system









must decrease much more rapidly. Currently, the highest hand-coded truncation is at

pentuple excitations (i.e., T= T1+T2+...+T5) [Mus02], but efforts to automate this process

are becoming more promising as distribution techniques for data and computations

become available [Bau02].

Using the standard two-particle electronic Hamiltonian, the CC energy is calculated

by contracting the elements of the Fock matrix (f_{pq}) and the two-electron Coulomb integrals (\langle pq|rs\rangle) with the T1 and T2 amplitudes. Uppercase orbital indices correspond

to alpha spin orbitals, and lowercase orbital indices correspond to beta spin orbitals. The

double-bar integrals in Equation 1-4 are antisymmetrized Coulomb-exchange integrals

(Appendix B). (This derivation assumes that the reference orbitals are orthonormal.)


E_{CC} = \langle \Phi_0 |\, H \exp(T)\, | \Phi_0 \rangle   (1-3)

E_{CC} = E_0 + \sum_{IjAb} \langle Ij|Ab\rangle \left( t_{Ij}^{Ab} + t_I^A t_j^b \right)
       + \sum_{IA} f_{IA}\, t_I^A + \tfrac{1}{4} \sum_{IJAB} \langle IJ\|AB\rangle \left( t_{IJ}^{AB} + t_I^A t_J^B - t_I^B t_J^A \right)
       + \sum_{ia} f_{ia}\, t_i^a + \tfrac{1}{4} \sum_{ijab} \langle ij\|ab\rangle \left( t_{ij}^{ab} + t_i^a t_j^b - t_i^b t_j^a \right)   (1-4)

The T amplitudes are calculated by iterating over a coupled set of equations that

proj ect the Hamiltonian and wavefunction into a set of p-particle excitation spaces. The

CCSD approximation truncates T at double excitations, and is one of the benchmark

correlated methods in quantum chemistry today. The T1 and T2 equations come from the

projections in Equations 1-6 and 1-7, in which H_N is the normal-ordered Hamiltonian that does not include the reference energy. Only the β and ββ spin equations are shown; the α, αα, and αβ spin equations are analogous.

\exp(-T)\, H_N \exp(T)\, |\Phi_0\rangle = \Delta E_{CC}\, |\Phi_0\rangle   (1-5)










\langle \Phi_i^a |\, \exp(-T)\, H_N \exp(T)\, |\Phi_0\rangle = 0   (1-6)

\langle \Phi_{ij}^{ab} |\, \exp(-T)\, H_N \exp(T)\, |\Phi_0\rangle = 0   (1-7)


The Baker-Campbell-Hausdorff formula [Hal00] has the effect of reducing the T

equations to those terms that 1) contract directly with the Hamiltonian and 2) have no

more than four excitation amplitudes. The first effect is from disconnected terms being

cancelled by the commutation, and the second effect is from the two-particle Hamiltonian

having at most four indices from the two-electron integrals.

Computationally, the diagonal elements of the Fock matrix that contract with T are grouped together and define the input amplitudes for the next iteration. The ββ T2 equations are shown in Equations 1-8 through 1-11. If T is initialized to 0, then the next iteration will produce t_{ij}^{ab} = \langle ij\|ab\rangle / D_{ij}^{ab} and T1 = 0, so for all intents and purposes, T2 can be initialized to the second-order wavefunction and T1 can be set to 0.

0 = \sum_{c} f_{ac}\, t_{ij}^{cb} + \sum_{c} f_{bc}\, t_{ij}^{ac} - \sum_{k} f_{ki}\, t_{kj}^{ab} - \sum_{k} f_{kj}\, t_{ik}^{ab} + \cdots   (1-8)

0 = \left( f_{aa} + f_{bb} - f_{ii} - f_{jj} \right) t_{ij}^{ab} + \sum_{c \neq a} f_{ac}\, t_{ij}^{cb} + \sum_{c \neq b} f_{bc}\, t_{ij}^{ac} - \sum_{k \neq i} f_{ki}\, t_{kj}^{ab} - \sum_{k \neq j} f_{kj}\, t_{ik}^{ab} + \cdots   (1-9)

D_{ij}^{ab} = f_{ii} + f_{jj} - f_{aa} - f_{bb}   (1-10)

D_{ij}^{ab}\, t_{ij}^{ab} = \sum_{c \neq a} f_{ac}\, t_{ij}^{cb} + \sum_{c \neq b} f_{bc}\, t_{ij}^{ac} - \sum_{k \neq i} f_{ki}\, t_{kj}^{ab} - \sum_{k \neq j} f_{kj}\, t_{ik}^{ab} + \cdots   (1-11)

Since the T amplitudes differentiate between occupied and virtual orbitals, it is

common to group integrals into the various space combinations such as occupied-

occupied (OO), occupied-virtual (OV), and virtual-virtual (VV). There are five such

combinations for the two-electron integrals: OOOO, OOOV, OOVV, OVVV, and

VVVV. If the virtual orbitals outnumber the occupied orbitals by a factor of 3, then the









VVVV and AO integrals will outnumber the OOOO integrals by factors of 81 and 256,

respectively. Most CC implementations calculate and store the OOOO, OOOV, and

usually OOVV integrals before starting the CC iterations. Due to programming

complexity, most implementations also store the OVVV integrals.

The VVVV integrals are used in only one CCSD term and how they are processed

illustrates a fundamental technique that the parallel CCSD algorithm uses in almost every

contraction. The data requirement scales like N4 and the computational requirement

scales like N6, and this term is usually the rate-determining step of the CC iteration. The

standard MO expression is shown in Equation 1-12. The "+=" operator is shorthand for "incremented by" (i.e., "x += y" means "x = x + y"), and the r amplitude is the residual that

collects all of the contributions before dividing by the D denominator and producing the

next set of T amplitudes.

r_{ij}^{ab} += \tfrac{1}{2} \sum_{cd} \langle ab\|cd\rangle\, t_{ij}^{cd}   (1-12)


Forward transforming T into the AO basis allows the AO integrals to contribute

directly to the VVVV terms. Conceptually, the AO coefficients from the reference

wavefunction are "removed" from the integral and "attached" to the amplitude. The

primes on the AO indices in T denote a forward transformation. This distinction must be

made since another useful transformation (a backward transformation) has the same

dimensionality (viz. po), but the AO overlap matrix is used in the contraction in addition

to the AO coefficients.

t_{ij}^{\rho'\sigma'} = \sum_{cd} c_{\rho}^{c}\, c_{\sigma}^{d}\, t_{ij}^{cd}, \qquad \tfrac{1}{2} \sum_{cd} \langle ab\|cd\rangle\, t_{ij}^{cd} = \tfrac{1}{2} \sum_{\mu\nu} c_{\mu}^{a} c_{\nu}^{b} \sum_{\rho\sigma} \langle \mu\nu\|\rho\sigma\rangle\, t_{ij}^{\rho'\sigma'}   (1-13)









While this transformation removes the need to store the VVVV MO integrals, it

increases the computational workload substantially. In fact, since each integral

contributes to each amplitude, the R2 array would be updated N^4 times! However,

implementing this technique presents a more optimistic view of calculating large systems.

The R2 amplitudes are back-transformed from MOs to AOs before being incremented by

the AO integral terms. Once the integrals have been processed, the R2 array is forward-

transformed into the MO basis.

4""e =u _(' cd n S' (1-15)
cd


I4""+= pr per t "" (1-16)





The method of calculating amplitude contributions in the AO basis instead of the

virtual MO basis has the potential to reduce the amount of storage space needed to

calculate the aforementioned VVVV contribution. This is particularly true for open-shell

systems, which need to store all three spin combinations of the VVVV integrals

compared to storing one set of AO integrals. If direct integral generation is used, then the

storage requirement for this term is completely removed. In either the AO or MO

scenario, parallelization involves distributing the data and operations of a large matrix

multiplication over the processing elements.
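To make that data flow concrete, the sketch below casts the AO-driven VVVV contribution as a single dense matrix multiplication, r(μν, ij) += Σ_ρσ <μν||ρσ> t(ρ'σ', ij). It is only an illustration under simplifying assumptions (everything held in memory, no permutational packing, no integral sparsity); the program, array names, and dimensions are invented for the example and are not part of any existing code.

   ! Minimal sketch: the AO-basis ladder update as one matrix multiplication.
   ! Dimensions and array names are illustrative only.
   program ao_ladder_sketch
      implicit none
      integer, parameter :: nao = 10, npair = nao*nao, noo = 6
      real(8) :: v(npair, npair)   ! AO integral block <mu nu || rho sigma>
      real(8) :: t(npair, noo)     ! forward-transformed T2 amplitudes
      real(8) :: r(npair, noo)     ! AO-basis residual contribution
      call random_number(v)
      call random_number(t)
      r = 0.0d0
      ! one matrix product replaces an element-wise loop over integrals
      r = r + matmul(v, t)
      print *, 'r(1,1) =', r(1,1)
   end program ao_ladder_sketch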















CHAPTER 2
PARALLELIZING A SERIAL CCSD PROGRAM

Wide-scale reformulation of computational algorithms should only be performed if

currently existing formulations are inadequate, but determining whether an algorithm is

inadequate depends heavily on its intended use and the computers that are meant to

execute it. Any of a number of runtime parameters could contribute to low performance:

compiler efficiency; net CPU speed; memory or disk storage requirements and

performance; parallel scalability and efficiency; etc. In the case of general computational

chemistry algorithms, how the number of operations scales with respect to the size of the

chemical system will also determine whether an algorithm should be reformulated.

The ACES II Program System [ACES] is one of the more popular software

packages that calculates CCSD electron correlation energies. Although it was written to

make use of CPU vectorization, parallelization is considered to be one of the most

important ways to decrease the run time on modern supercomputers. Traditional methods

for parallelizing existing software begin with execution profiling. Once the main bottlenecks are identified and analyzed, parallel formulations of those sections are implemented and profiled.

Serial Profile

Profiling the CC code in ACES II was performed with four test systems:

* Alanine ("ala"; C3H7NO2; 48 e-) in cc-pVDZ (119 AO basis functions)
* Octane ("oct"; C8H18; 66 e-) in cc-pVDZ (202 AOs)
* β,β'-dichloroethyl ether ("DCEE"; C4H8Cl2O; 74 e-) in cc-pVDZ (146 AOs)
* DCEE radical ("DCEE*"; C4H7Cl2O; 73 e-) in cc-pVDZ (141 AOs).










The first three systems had an RHF reference, the fourth had UHF, and no spatial point

group symmetry was used for any system. All of the systems used the AO basis

formulation for the VVVV*T2 contribution, which was described in Chapter 1. The

profiled code used the C system function gettimeofday() to record the real-time clock

before and after large code sections. Any subroutine (or group of related subroutines) that

consumed more than 3% of the real time became a potential target for parallelization.
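The wall-clock bracketing can be pictured with the short sketch below. The original code calls the C routine gettimeofday(); here Fortran's system_clock intrinsic is used as a stand-in, and the timed work routine is a placeholder invented for the example.

   ! Sketch of bracketing a code section with the real-time clock.
   program profile_sketch
      implicit none
      integer(8) :: c0, c1, rate
      real(8)    :: elapsed
      call system_clock(count_rate=rate)
      call system_clock(c0)
      call do_work()                          ! stands in for a high-level CCSD routine
      call system_clock(c1)
      elapsed = dble(c1 - c0) / dble(rate)
      print '(a,f10.3,a)', 'section took ', elapsed, ' s of real time'
   contains
      subroutine do_work()
         integer :: i
         real(8) :: x
         x = 0.0d0
         do i = 1, 10000000
            x = x + sqrt(dble(i))
         end do
         if (x < 0.0d0) print *, x            ! keep the loop from being optimized away
      end subroutine do_work
   end program profile_sketch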

Table 2-1 lists the average percentage of the total real time for each high-level

subroutine measured across the three closed-shell test systems (RHF column), and it lists

the percentage for each routine in the DCEE* test system (UHF column). Routines that

contributed more than 3% are in bold case. The maximum variance of any set of

percentages for the closed-shell systems was 0.004, which suggests that the three profiles

have a consistent and reliable distribution of times. The table entry for "~AOLAD"

corresponds to the time spent in AOLAD that is not due to RDAO, even though the latter

is a dependent of the former. Figure 2-1 shows the amount of the total time consumed by

the major subroutines and the next most significant routine.


Figure 2-1. The distribution of real time over primary CCSD subroutines (RHF and UHF bars for T1W1AB, RNGDRV, CNTRCT, RDAO, and the next most significant routine).










Table 2-1. The distribution of real time over high-level CCSD subroutines
Routine RHF UHF Routine RHF UHF
ZERSYM 0.2% 0.1% T1INT1 0.4% 0.2%
T1RAAAA 1.4% 0.6% LADAA 0.0% 0.1%
T1RABBA 0.8% 0.6% LADAB 1.2% 0.5%
T1RABAB 0.8% 0.6% AOLADLST 0.0% 0.0%
CNTRCT 7.7% 9.8% T2TOAO 0.7% 0.3%
SUMSYM 0.1% 0.2% RDAO 58.3% 65.1%
F2TAU 0.0% 0.0% ~AOLAD 0.6% 0.3%
GETLST 1.1% 0.8% Z2TOMO 0.9% 0.7%
QUAD1 1.3% 0.5% ZIAAO 0.1% 0.2%
QUAD2 0.5% 0.3% T12INT2 0.8% 1.1%
QUAD3 0.3% 0.2% RNGDRV 9.4% 11.0%
MAKFME 0.1% 0.1% SSTOxl 0.0% 0.0%
T1W1AA 0.0% 0.6% SUMRNG 0.2% 0.1%
T1W1AB 3.9% 2.1% ERNGAA 0.0% 0.0%
T1INW2 0.1% 0.0% ERNGAB 0.5% 0.1%
FEACONT 0.7% 0.4% SUMSYM 0.2% 0.0%
FMICONT 0.3% 0.3% E4S 0.6% 0.2%
FMECONT 0.5% 0.2% NEWT2 0.3% 0.2%
T1INT2A 0.3% 0.2% PHASE3 0.6% 0.2%
T1INT2B 0.8% 0.5% DIIS 2.6% 1.0%

It is clear from the table that four routines consume most of the execution time:

CNTRCT, T1W1AB, RDAO, and RNGDRV. Together, they cost 79.3% and 88.1% for

RHF and UHF, respectively. Figure 2-2 shows the four major subroutines and any

individual routine that consumes more than 1% from the DCEE* profile.


Figure 2-2. The breakdown of significant UHF subroutines. Primary values (RDAO, RNGDRV, CNTRCT, T1W1AB) are grouped on the left; secondary values greater than 1% (T12INT2, DIIS) are grouped on the right.









Bottleneck Categorization and Parallelization

Particle-Particle Ladders

The single most time consuming routine in either RHF or UHF calculations is

RDAO, which increments T2 by contributions from the VVVV integrals in the AO basis.

Integral-stored schemes, such as the current implementation in ACES II, read the AO

integrals into memory from disk, multiply them by the T2 amplitudes from the previous

iteration, and add the product to the T2 residual amplitudes. Neglecting integral sparsity,

the number of permutationally unique AO integrals is shown in Equation 2-1. If there are

at least 100 AO basis functions, then the number of unique integrals can be approximated

by N_AO^4/8 with more than 98% accuracy; 200 or more functions and the approximation is

more than 99% accurate.


N_{int} = \frac{P(P+1)}{2}, \qquad P = \frac{N_{AO}(N_{AO}+1)}{2}   (2-1)
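The accuracy of the N_AO^4/8 approximation quoted above can be checked with a few lines of Fortran (an illustrative check only, not part of ACES II):

   ! Quick check of Eq. 2-1 against the N**4/8 approximation.
   program count_unique_integrals
      implicit none
      integer(8) :: n, p, n_int
      do n = 100, 200, 100
         p     = n*(n+1)/2                 ! unique AO pairs
         n_int = p*(p+1)/2                 ! unique AO integrals (Eq. 2-1)
         print '(a,i4,a,i12,a,f6.3)', 'N=', int(n), '  N_int=', n_int, &
               '  (N**4/8)/N_int=', dble(n)**4 / 8.0d0 / dble(n_int)
      end do
   end program count_unique_integrals

For N = 100 the ratio is 0.980 and for N = 200 it is 0.990, consistent with the accuracies stated in the text.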

Many AO-based implementations have a user-defined threshold for discarding AO

integrals that would not have a significant contribution to the T2 TOSidual. The integral

program in ACES II stores the integrals in one long vector and discards individual

integrals that are less than the threshold. For DCEE* and a threshold of 10^-14, ACES II stored almost 41.1 million integrals from a complete set of 50.1 million, which means the two-electron integrals were 18.0% sparse.

The RDAO algorithm reads in a batch of integrals and processes the integrals

individually. For a particular integral, the value might contribute 1, 2, 4, or 8 times

depending on the permutational symmetry of the orbitals. The RHF case only has 1, 2, or

4 contributions and post-processes the residual for the final permutation. The general

structure of the algorithm can be described by the following:









do while (more batches)
   read batch
   do for each integral <mu nu | rho sigma>
      do for each permutation
         do for all oo pairs
            R2(oo, mu nu) += T2(oo, rho sigma) * <mu nu | rho sigma>
         end do
      end do
   end do
end do

This algorithm is not particularly efficient since each permutation requires a vector-vector operation (R2(*,μν) += V * T2(*,ρσ)). Experience in computational algorithms

suggests that matrix-matrix operations (using integral blocks) would have much higher

computational throughput. For example, the RDAO update averaged 286.7 MFLOPS out

of a theoretical maximum of 1500 MFLOPS on the 375 MHz IBM POWER3 CPU. A

600x600 matrix multiplication using the ESSL library (a tuned linear algebra library for

IBM architectures) averages 1278 MFLOPS, which corresponds to a computational

efficiency of 85.2% compared to 19.1% for RDAO. However, parallel algorithms for

block-wise implementations are virtually identical to element-wise implementations, so

block optimizations were ignored in the parallel strategy and could be implemented later.
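For reference, the block-wise alternative alluded to above could look like the sketch below, in which one block of integrals is applied with a single DGEMM call from a linked BLAS (or ESSL) library. The routine name, array names, and block sizes are illustrative; this is not the RDAO implementation itself.

   ! Sketch: apply one integral block to R2 with a single DGEMM.
   subroutine rdao_block(noo, nblk, vblock, t2blk, r2blk)
      implicit none
      integer, intent(in)    :: noo, nblk
      real(8), intent(in)    :: vblock(nblk, nblk)   ! integral block <mu nu || rho sigma>
      real(8), intent(in)    :: t2blk(noo, nblk)     ! T2(oo, rho sigma) columns
      real(8), intent(inout) :: r2blk(noo, nblk)     ! R2(oo, mu nu) columns
      ! r2blk(oo, mu nu) += sum_{rho sigma} t2blk(oo, rho sigma) * vblock(rho sigma, mu nu)
      call dgemm('N', 'N', noo, nblk, nblk, 1.0d0, t2blk, noo, &
                 vblock, nblk, 1.0d0, r2blk, noo)
   end subroutine rdao_block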

For any DCEE* CC iteration, 9.6 seconds out of 5952 seconds were spent reading

integrals from disk (i.e., I/O takes about 0.16% of the total RDAO time). Before the

multiplication loop, there was a small sorting section, which took close to 13.6 seconds

(also insignificant); therefore, parallelizing the actual integral updates was expected to

have the largest impact on performance.

There are multiple levels at which the current algorithm can be parallelized:

batches, integrals, permutations, or rows of OO pairs. Distributing the operations over

OO pairs might decrease performance since the pairs are contiguous in memory, and the









cost of adding more rows is much less than the cost of adding more loops to process

them. The problem with distributing permutations is that the number of permutations for

a random integral is not always 8. This might introduce load-balancing problems

depending on how the integrals are sorted on disk; furthermore, each integral has a

limited number of tasks (permutations), which might limit overall scalability above 8

processors.

The two remaining loops (unique integrals and integral batches) are ideal for

distributing over processors. The number of integrals is substantial enough to support

scalability over a wide number of processors, and each processor will need to store only

those integrals for which it is assigned to calculate contributions. The current ACES II

implementation expands the permutations into the "unique" integral list and processes

each one separately, so this expanded list was distributed over processors instead of the

truly unique list.

For maximum software portability, the parallel algorithm needed to distinguish

between processors on the same physical computer and processors on remote computers.

The strategy was to distribute integral batches over physical computers (nodes) and to

distribute the integral list from each batch over processors on the same computer. After

the nodes finished processing their batches of integrals, the R2 COntributions were

reduced and distributed back to all of the nodes. To minimize memory requirements on a

single node, only one integral list, T2 array, and R2 array WeTO Stored, and CPU threads

were created to process the integrals. The R2 array WAS divided into blocks of columns

(Cly pairs), and each CPU was assigned to calculate all of the terms that contributed to a

particular set of blocks. This technique reduced false cache conflicts (when two CPUs







attempt to write to different parts of the same cache line at the same time), and it
increased the probability that a CPU would have a target R2 column in cache memory,
since multiple CPUs would be caching different columns. Figure 2-3 shows the
CPU/column assignments in a 4-CPU computer with 4 columns per block.

Figure 2-3. The column assignments for each of 4 CPUs (CPU 0 through 3) updating R2. In this example, there are 49 μν pairs (from 7 AOs).

For an arbitrary blocking factor (bf) and number of threads (Nth), the thread
assignment (from 0) of a column index (mn) can be determined with the following
FORTRAN function:
thread_id(mn,bf,Nth) = mod(mn-1,bf*Nth)/bf
If bf*Nth is a power of 2, then the mod function can be replaced by a bit-wise AND with
the mask (bf*Nth)-1, and if the blocking factor is a power of 2, then the division can be
replaced by bit shifting.
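A small self-check of the assignment function and its bit-wise variants is sketched below (illustrative only; bf*Nth = 16 as in Figure 2-3, so the shift count of 2 corresponds to log2(bf)).

   ! Verify that mod/div and iand/ishft give the same thread assignment.
   program thread_id_check
      implicit none
      integer, parameter :: bf = 4, nth = 4      ! blocking factor, thread count
      integer :: mn, t_mod, t_bit
      do mn = 1, 49                              ! the 49 mu-nu pairs of Figure 2-3
         t_mod = mod(mn-1, bf*nth) / bf          ! original form
         t_bit = ishft(iand(mn-1, bf*nth-1), -2) ! iand replaces mod, shift replaces the division
         if (t_mod /= t_bit) stop 'mismatch'
      end do
      print *, 'mod/div and iand/ishft assignments agree'
   end program thread_id_check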
Matrix Multiplication
If the VVVV contributions are calculated in the MO basis, then LADAA and
LADAB become the bottleneck routines. These perform large two-dimensional matrix
multiplies, as do the other three bottlenecks: CNTRCT, T1W1AB, and RNGDRV.









Incidentally, the routine T12INT2 (1.1% of the UHF time) is also bound by matrix

multiplications. Parallelizing these routines involved distributing the columns of the

output matrix over CPU threads. Since the input and output matrices were all kept in

memory, the columns were divided once, unlike block cycling in the RDAO scheme,

which had irregular data access patterns. Figure 2-4 shows the distribution pattern for a

matrix multiply over 4 threads.

Figure 2-4. The distribution scheme for matrix multiplies (A*B = C) over CPU threads. Every thread can read the entire A matrix, but each thread is assigned a block of the B matrix and consequently writes to a unique block of the C matrix.

The multiplications were not distributed over nodes since the parallel computer that

ran the profile had 16 CPUs per node. The time penalty incurred from the global

reduction (after the multiplication) would have greatly lessened the gains from

parallelization; however, nodes with fewer CPUs might benefit from distributing the rows

of the A matrix. All nodes but one would then have to initialize the C matrix to 0 (if the

operation increments pre-existing data), and the results would have to be reduced and

broadcast after the multiplication.
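The column-block scheme of Figure 2-4 can be sketched with OpenMP threads standing in for the CPU threads used in the actual code; the routine and variable names are invented for the example, and a tuned GEMM would replace the matmul intrinsic in practice.

   ! Sketch: each thread owns a contiguous block of columns of B and C.
   subroutine threaded_gemm(m, n, k, nth, a, b, c)
      use omp_lib, only: omp_get_thread_num
      implicit none
      integer, intent(in)  :: m, n, k, nth
      real(8), intent(in)  :: a(m,k), b(k,n)
      real(8), intent(out) :: c(m,n)
      integer :: tid, ncol, j0, j1
      !$omp parallel private(tid, ncol, j0, j1) num_threads(nth)
      tid  = omp_get_thread_num()
      ncol = (n + nth - 1) / nth              ! columns per thread block
      j0   = tid*ncol + 1
      j1   = min(n, j0 + ncol - 1)
      ! every thread reads all of A, but writes only its own columns of C
      if (j0 <= j1) c(:, j0:j1) = matmul(a, b(:, j0:j1))
      !$omp end parallel
   end subroutine threaded_gemm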

I/O-Dependent Routines

None of the major subroutines that required parallelization were heavily dependent

on I/O operations; however, many parallel computers (including the one that ran the

profiling calculations) have shared disk storage devices. This introduces competition

between some or all of the compute nodes when each one performs file read or write









operations. The DIIS routine (1% of the UHF time) is heavily dependent on I/O

performance, and the rate-determining operation can be described by the following:

do for each i
   do for each (x,y)
      B(x,y) += Tall(x,i) * Tall(y,i)
   end do
end do

The number of rows in Tall (and the size of each dimension of B) is usually 5 but could be

any value (preferably a small one). The i index spans all of the T amplitudes, both T1 and

T2. Obviously the construction of a 5x5 matrix from a single 5x1 vector is negligible

compared to the time spent reading from and writing to disk. This operation was not

parallelized, but only one node was permitted to perform the operation.
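One way to batch the amplitude reads so that the 5x5 construction stays negligible is sketched below. The routine and array names are hypothetical, and the on-disk amplitude history is represented by an in-memory array for brevity; in the real code each batch would come from a disk read.

   ! Sketch: accumulate the DIIS B matrix from batches of amplitudes.
   subroutine diis_bmatrix(nvec, namp, nbatch, tall, b)
      implicit none
      integer, intent(in)  :: nvec, namp, nbatch
      real(8), intent(in)  :: tall(namp, nvec)   ! stands in for the on-disk history
      real(8), intent(out) :: b(nvec, nvec)
      integer :: i0, i1
      b = 0.0d0
      do i0 = 1, namp, nbatch                    ! a batch loop replaces per-element I/O
         i1 = min(namp, i0 + nbatch - 1)
         ! B(x,y) += sum_i T(i,x)*T(i,y) over this batch of amplitudes
         b = b + matmul(transpose(tall(i0:i1, :)), tall(i0:i1, :))
      end do
   end subroutine diis_bmatrix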

I/O competition is a low-level problem that shared storage architectures have. In

many cases, extensive software changes are required to overcome the increasing penalties

from read and write delays that serial programs consider to be constant. These penalties

manifest themselves in almost every routine that uses disk storage in the parallel profile.

Parallel Profile

The DCEE* test case was run with the ACES II CCSD program that had parallel

modifications as described in the previous section. The parallel performance of the major subroutines is shown in Figure 2-5. The real time spent in each high-level routine is shown in Figure 2-6. Only RDAO was distributed over every CPU in the calculation and was 20.6% efficient in the 256-CPU (16x16) calculation. The three other high-level routines were only distributed over the 16 CPUs in each node and had average threading efficiencies of 70.1% (CNTRCT), 25.4% (T1W1AB), and 28.8% (RNGDRV).





Figure 2-5. The parallel speedup for each major UHF CCSD subroutine (CNTRCT, T1W1AB, RDAO/(AOLAD), RNGDRV) and the CCSD executable, for the 1x1, 2x16, 4x16, 8x16, and 16x16 node-by-CPU configurations. (AOLAD) contains the RDAO parallelization and the global reduction afterward.


Figure 2-6. The runtime profile of parallel CCSD with a UHF reference from DCEE*, broken down by high-level routine. The times are measured in seconds.









The overall parallel efficiency of the UHF CCSD calculation on 256 CPUs was

1.8% with a speed-up of 4.62. The calculation with the highest speed-up was 4x16 (64)

CPUs with an efficiency of 8.8%. The reason for such poor performance can be attributed

to I/O contention by all of the routines that were insignificant in the serial calculation.

Most of the routines listed in Figure 2-6 increase in time even though they had no parallel

modifications. While further parallel optimizations could be made to the current CCSD

program, two maj or drawbacks would still plague the parallel code: problem scalability

and expensive modify-profile development cycles (i.e., trial and error).

The worst aspect of the current approach is that tractability does not increase as

CPUs, memory, and disks are added to the pool of resources (i.e., the largest molecule

that can be studied is limited by the resources of a single node). There is no mechanism

for distributing data across memory or disks; however, if the current limitations are

adequate, then this model should be able to reduce the runtime to a reasonable amount.

Parallelizing existing serial software usually takes the bottom-up approach, in

which significant bottlenecks are parallelized and profiled. As new bottlenecks present

themselves, they too are parallelized and profiled. Unfortunately, by the time this cycle

ends (or becomes too expensive), so much of the software might have been rewritten that

it would have been cheaper to reformulate the algorithm using a top-down parallelization

approach. In the current CCSD program, no matter how well the algorithm is parallelized,

the software is still limited to performing calculations on systems that are calculable by a

single node. It would be more efficient in the long term to reformulate the CCSD

algorithm from the top down (reusing existing software wherever possible) and to include

proper data distribution and small task parallelization as design fundamentals.









Top-Down Parallelization Considerations

Although a few programs already calculate the CCSD energy in parallel, most of

them are limited to closed-shell references [Koc96,Kob97,Per03]. The exception is the

general tensor contractor of Hirata [Hir03], but the final program requires stored

integrals. The goal of the top-down parallelization approach will be to produce an open-

shell CCSD program capable of handling molecular systems with at least 500 AO basis

functions (although the internal target is 1000) and that can scale up to a few hundred (if

not a thousand) CPUs. These constraints almost necessitate calculating amplitude

contributions from AO integrals and distributing amplitudes across CPUs.

In addition to the VVVV integrals, the algorithm should also allow for direct

generation of the VVVO integrals. A consequence of calculating amplitude contributions in the AO basis instead of the MO basis is that the virtual orbitals are never formally

recognized by the equations other than to create complementary density matrices such as

that shown in Equation 2-2. These matrices are very important since they define how

amplitudes transition from backward- to forward-transformed relative to the MOs.

Q" = [efc (2-2)


The first step of the top-down approach involves cataloging the integral

contributions and identifying which output condition on the R amplitudes (forward- or

backward-transformed) would be best suited for their additions. Finding a data and task

distribution then becomes a matter of understanding how the integrals are processed and

which terms can be grouped together given a common output condition on the residual.















CHAPTER 3
SPIN-UNRESTRICTED CCSD EQUATIONS

The T1 and T2 equations in spin-restricted Hartree-Fock (HF) CCSD have 11 and

29 contributions, respectively. If the reference wavefunction is not HF but the spin

orbitals are still restricted, then the T1 and T2 equations have 14 and 31 contributions,

respectively. The spin-unrestricted non-HF CCSD amplitude equations have a total of

222 contributions. Table 3-1 lists the spin-restricted T1 amplitude contributions and the

number of corresponding spin-unrestricted terms. Table 3-2 does the same for the T2

contributions [Cra98]. P(pq) is the antisymmetric permutation operator (Appendix B).

There are many ways to group and factorize the amplitude contributions. The

efficiency of each approach largely depends on the following:

* The chemical system being examined
* The CPU speed and efficiency
* The amount of local memory per process
* Local and remote (shared or exclusive) storage capacities
* Local and remote I/O latencies and bandwidths
* Interprocess message latency and bandwidth.

For example, if the CPU speed were incredibly slow compared to the speed of the I/O

reads and writes, then it might be faster to group as many subterms together as possible,

and accumulate the intermediate product to disk. Consider the sum X*A*B + X*C*D +

X*E*F, which factorizes to X*I provided I = A*B + C*D + E*F (first accumulated on

disk). In most cases, however, millions (if not billions) of CPU operations are performed

before a single byte of data is retrieved from disk, so this approach is relegated to those

systems whose CPU speeds are unusually slow compared to other hardware capabilities.










Table 3-1. The RHF CCSD T1 contributions
Number of
Contribution Diagram
UHF terms

+Cltlclfkcd V -

1kl c tt 4

+ L,, /-- 2

+C(ka ci~f t 4

+IL,(--x 2

+ ka edt~ tid 4
kcd



1 kl c iC tXII tza 4



-tl k/ i t 4

Skled





+ fkc,/tttpe --x 4

+ / ttda _I _~ _


The most common factorization strategy involves searching all possible

combinations of terms from either amplitude equation and finding the set of subterms that

is used most frequently. This implicitly assumes that the cost of recomputing the data is

greater than the cost of storing and retrieving the data from disk. In the MO basis, this is

certainly true (i.e., the amount of computation needed to generate all of the MO integrals

from AO integrals is quite high), but the cost of calculating AO integrals on demand is

negligible when multiple processes are in competition for shared I/O resources.







Table 3-2. The RHF CCSD T2 contributions
Number of
Contribution Diagram(s) H em
+ abh if) 3
+P(!i)C' hab jtae 4
+ (ij) C (abh cd tacd -- 3
P(ab) ~kb if) t \/
-P (ij)P (ab) kb icrl tV --
(ij) (ab) ~kb cd tot;tdi 4I
+ arb cd) t c 3


+iUP~ijh)P(a)kb cj~ to or 10
+P~j)Po~ab) ak~ djidt)~ k0~jh 10

+P+) (kli if )~i t 3

+Pij)P3(ab)klCi cj otVt 4


+ (ij) (ab)' Ckl ic ft t 1


+ P~P~ij)P~(ab), k/ )t~ 11

+ (~ij)- k/ J tal~ctdtab 3
+P(ab)-n(l~: I~k/ +P(ij)P~a b) k/









Table 3-2. Continued
Number of
Contribution Diagram(s)
UHF terms


~ g k t --- 4


P(ij>) kl ci fttz 8

-P(ij)C(~, k/\J t:ftdtb 8
kled



+Pli(ab) fbct \f ---x 4




kcd
h>(ab) L~~k/tlkj JI tftatb 8
i(a kle d f: t;



The following assumptions guided the development of the factorization strategy

described in this chapter:

*There are at least 3 times as many virtual orbitals as occupied orbitals for each spin
*The two-particle data are too large to store in the memory of one process
*The VVVV and VVVO integrals are not transformed and stored on disk
*The CPU operations are several orders of magnitude faster than I/O latencies
*Latency is at least an order of magnitude more significant than bandwidth
*Each process is able to retrieve (GET) and increment (ACCUMULATE) data that are stored by another process without interrupting the other process.

The first assumption follows from a study of basis sets that noted that cc-pVTZ is the

minimum basis set needed to describe electron correlation accurately [DunOO]. For

chlorine, there are 17 electrons and 34 contracted basis functions. This produces an

almost 1:3 occupied/virtual split in molecular orbitals, since half of the electrons have

alpha spin and half have beta spin.









Designing the algorithm such that two-particle terms (viz., VVOO) can be

distributed over PEs is necessary because the parallel program should be capable of

treating systems with 50 electrons and 500 basis functions (50/500), although 100

electrons and 1000 basis functions is the target (100/1000). For 50/500 systems, the three

VVOO spin combinations require 1.55 GB of memory. Considering the algorithm needs

at least two such arrays (for T2 and R2) held in core, a processor with 1 GB of memory will not be

able to store the minimum amount of data without overflowing to disk storage.

Furthermore, the VVVO and VVVV terms will require 59.8 and 568 GB of storage,

respectively. (This assumes a conversion factor of 1024.) The parallel algorithm should

then at least allow for the possibility that the terms that use these MO integrals can be

calculated from AO integrals generated on the fly.
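As a check on the quoted figures for the 50/500 case, assume 25 alpha and 25 beta occupied orbitals, 475 virtual orbitals of each spin, 8-byte reals, same-spin pairs packed by antisymmetry, and 1024-based units (assumptions consistent with the numbers above). Then the three VVOO spin blocks hold

   (2 x (475*474/2) x (25*24/2) + 475^2 x 25^2) = 208,560,625 doubles, or 208,560,625 x 8 bytes ≈ 1.55 GB,

and the analogous counts for the VVVO and VVVV spin blocks reproduce the 59.8 GB and 568 GB figures.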

Today, the fastest disk storage devices are able to seek for random data in

approximately 3.7 milliseconds (measured on a Maxtor Atlas 15k 73 GB Ultra320 SCSI

hard drive). The slowest desktop CPU (to within reasonable expectations of performance)

operates at 1 GHz. (The Intel Pentium 4 line of CPUs can operate at 3.4 GHz.) Assuming

that most floating-point operations in CCSD require multiplications and additions and

that each operation completes in one clock cycle, then the maximum theoretical

multiply/add (fma) throughput is somewhere around 500 million fma/s. Together, the

data indicate that a slow CPU could theoretically deliver 1.85 million fma/seek, which at

least supports the assumption that computation is several orders of magnitude faster than

I/O latency. On the same hard drive, bandwidth bursts measure 73.7 MB/s, which

suggests that 280 kB of data could be transferred during the same time it took to locate

the data.









The final assumption follows from the latest version of the most popular message

passing library MPI (Message Passing Interface) [Gro99]. MPI-2 allows PEs in the same

communication network to retrieve (GET), increment (ACCUMULATE), and overwrite

(PUT) data that are stored by another CPU without communicating with it. In addition to

their own proprietary communication libraries, most vendors of parallel computers offer

implementations of the standard message passing libraries (like MPI) to ensure maximum

compatibility with the myriad parallel programs that depend on industry standards.
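A minimal sketch of the one-sided access pattern assumed here is shown below, using the MPI-2 calls named above (GET and ACCUMULATE). The window layout, buffer names, and the choice of fence synchronization are illustrative, error handling is omitted, and the program expects at least two PEs; it is not the parallel CCSD code itself.

   ! Sketch: expose a local block and access a remote PE's block one-sidedly.
   program onesided_sketch
      use mpi
      implicit none
      integer, parameter :: n = 1000
      integer :: ierr, rank, nproc, win
      integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
      real(8) :: local(n), buf(n)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
      local = dble(rank)
      winsize = 8_MPI_ADDRESS_KIND * n
      call MPI_Win_create(local, winsize, 8, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr)
      disp = 0
      call MPI_Win_fence(0, win, ierr)
      if (rank == 0 .and. nproc > 1) then
         ! GET: fetch PE 1's block without interrupting PE 1
         call MPI_Get(buf, n, MPI_DOUBLE_PRECISION, 1, disp, n, &
                      MPI_DOUBLE_PRECISION, win, ierr)
      end if
      call MPI_Win_fence(0, win, ierr)
      if (rank == 0 .and. nproc > 1) then
         ! ACCUMULATE: add a contribution into PE 1's block
         call MPI_Accumulate(local, n, MPI_DOUBLE_PRECISION, 1, disp, n, &
                             MPI_DOUBLE_PRECISION, MPI_SUM, win, ierr)
      end if
      call MPI_Win_fence(0, win, ierr)
      call MPI_Win_free(win, ierr)
      call MPI_Finalize(ierr)
   end program onesided_sketch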

The main consequences of the aforementioned assumptions are that there must be

more than one process, that recomputation is better than reading from disk, and that

messages should be as large as possible. The primary goals of this parallelization effort

are scalability and addressability (i.e., the ability to treat a wide range of molecular

systems), so the grouping and factorization of terms should maximize message size and

minimize message frequency while minimizing the local scratch memory requirements.

Group 1: T2 Virtual Orbitals Coming from V

One of the smallest groups of terms consists of T2 contributions, in which both virtual orbitals come from the two-electron integrals. Equation 3-1 shows the ββ terms; the αα terms are analogous. The <ab||cd> and <ab||cj> integrals will be generated on the fly in the AO basis, so the expression must be written in terms of AO integrals. Before doing so, the second term must be split and rearranged to allow for better factorization.

r_{ij}^{ab} += \langle ab\|ij\rangle + P(ij) \sum_{c} \langle ab\|cj\rangle\, t_{i}^{c} + \tfrac{1}{2} P(ij) \sum_{cd} \langle ab\|cd\rangle\, t_{i}^{c} t_{j}^{d} + \tfrac{1}{2} \sum_{cd} \langle ab\|cd\rangle\, t_{ij}^{cd}   (3-1)


+ =C (bIiftLe(blct (3-2)

+ P< ,c ab lcd tot(+ ab cd /,lrJtCd
cd cd










sab+= BJPI)Cf abpa c,"c,


cpa I cpa 33

+ i(ij) [ ab pa c, c~tactd + ap c c~td
cdpa cdpa

ab+ = J(g) b pa ~cc c + tP'~ c" c"' t" + t 't" lp)+ abp t,"" (3 -4)












I;:=Ab+= +(Abb + ACj tc~ +(Abllc to +e ~`Ab JCdtcd l/ +fAbJdtlCd (3-6)
C c Cd Cd


riA+ = C Ab pa (c;"c +t" c +c"t"' +t 't,')+ (Ab po tz'"" (3 -7)


~rib+ ,Ab p t, Xe" +r" ) 1 (3-8)


Group 2: T2 Virtual Orbitals Coming from V and T

Set 1: One Virtual from V, One Virtual from T1

This set of T2 contributions is very similar to the Group 1 terms. The ββ increment is shown (Equation 3-9). Since every term antisymmetrizes the contraction with respect to ab, the permutation operator is removed and the increment is identified as non-antisymmetric. The residual amplitudes (Appendix B) will be antisymmetrized with respect to both ab and AB virtual pairs at the end of the iteration. Splitting the second

term, as in Equation 3-1, is also required.










r,7b = P~(ab)C kbl ilf t +P~ab)Plij)f jkb ic tf t
k kc (3-9)
+ J (ab) (j)(ib kb d tP t~ctd + (a) kbc tcd
kcd kcd

; = 1 (ii) (kb it + (ij) ~bljkb c tP t~C + (j bi
k L ckc 1 (3-10)
+ P(ij) kbL ci~~~d tLtotf + kb cdt~t
kcd kcd


C = Pijftkb poj c,"c" +t c"L~~IfP + ct, +t(p,")t+ t kb pa t,""tl~ (3-11)
kpa kpa


= t ,c, irv pa li(ij) c:' +tXc, ) t (3-12)


The αβ increments (Equation 3-13) are grouped by the bra-side T1 and then

expressed in terms of AO integrals and factorized.

rl b -= [Kb lj tj + f Ak lj t(
K k

+ Kb Cd tit~ + A( k Cd~ttC tetf c)ttrt
KCd kCdK k
+o Kb Cd t td +f Ak Cd)t~tCd
KCd kCd


+ ft Kb ~~~Cd t / bedtb~
KCd d

rl b- = ftAk ~Ij) + /C\UI t Ak y\Cj to + t@Ak Ich~te + ~ t A(k Cd ttf;
k kC c k~d(3-14b)
+ ft Alk 7llCd te
kCd


FI be _e t' ce trj t (3 -15a)





rlb b ~ oj~p+ tz' Xc: terj tper (3-15b)
u per










Set 2: One Virtual from V, One Virtual from T2

The ββ set contains four T2 contributions that are ring contractions. Equations 3-16

through 3-18 show these terms and derive their AO basis expressions.


ab+ = (ab)(g), k cj oii +[ .ckb lcd tractf
kc kcd
(3-16)

KC KCd




(3-17)
+P(i) c 1 ppA c rtap' + c r a ,



-ab+=Pg rp +( pr p~~a t c" t, (3-18)



Each half of the αβ increments contains four ring terms and two particle-hole

ladder terms. The MO and AO expressions are shown in Equations 3-19 and 3-20.

rl b+= /. kb cj tol +C Kb7Cj tcKC -f KbI~ c tcI
kc KC Kc (3-19a)
+ klb I~tcd tie + Kby~ dr~tfKC t- Kbd l, tc otD
kcd KCd KcD

Ab bCyll~tb k~lAl~b cIbc- kAj
KC kc kC (3-19b)

+ (KA CD tb/CtlD + kD tb tlD- kd tdit
KCD kcD kCd



~uvp~(3-20a)





~uvp~(3-20b)









Group 3: T2 Virtual Orbitals Coming from T

Set 1: Rings and PH Ladders

There are 31 UHF T2 terms that belong to this set. Each process loads a block of

V," integrals and calculates all of the contributions to the range of R2 amplitudes that it

is assigned to store. Grouping and factorizing the terms for performance is heavily

dependent on the manner in which the old T2 amplitudes are stored and accessed.

The terms were first cataloged and sorted into subsets according to up spin pairs.

For example, if the a index is stride 1, then the term t, ckgt~d must step down in A and

down in B for a given cd pair in V. To minimize the number of messages, any other ring

terms that have the same access pattern should be calculated at the same time. From the

31 UHF terms, three access patterns emerged. Any term that did not prefer one pattern to

another was added to the pattern that had the most similar intermediate products.

Subset 1: AC/BD

Subset 1 contains 13 terms that step through T2 in the order AC/BD, which

corresponds to the order of the two bra indices. More specifically, for a given CD pair of

ket shells, the first (or outer) loop runs over A and the second (or inner) loop runs over B.

All of the at T2 COntributions are performed in this series.

r, + (BPU tfKC 1.Vt"" It VtDB fttccf" (3-21)
'- PiHP~) KLCD KLD KLCD


rf" += P(AB)li(lU)Ctf tZ ~c '-VtW?
klcd

+ (AB) ~u(C )[tfKCVIKltd -C t VIK~ ity ft tcVKlt?(322
KlCd Kld KlCd


SV notation can be ambiguous with respect to whether AO integrals are antisymmetrized. Both electrons in
the terms in this section have at least one occupied MO index, so V notation is used to save equation width.










rzAb -7 t~~tz [ fC bi, (3-23)
kled KlCd

Ab- tlAcV/tf b, t,AcT/ktdtb
klc kled(3 -24)
+ ft fC VKlb t? + ,A t KC ctd tb
KlC KlCd

In terms of uncontracted ket indices of V, the partial contributions reduce to

Equations 3-25 through 3-27, in which each equation corresponds to a "left-side"

intermediate enclosed in parentheses.


-A+ = P(IJ)C /,tfK"V t c;"+ /K\" t (3-25)
La Kp Kp


--+= (IJ tAP' 1VJ+ [tfK"V -f c+t" t"" (3-26)
la kp Kp Kp


rPb+= tAp'Vji + tyfK^ h'- "+,"(-7

la kp Kp n

Subset 2: DB/CA

The DB/CA subset calculates 12 terms by looping first over B and then over A.


rr +=P~b)~g)[ty tdb l b rVt Ct,. ty~4tdt (3-28)
kled kcl kled


~b+= P~ab)Pi(g) ftPC (Vct/b
KLCD
h Ct~(3-29)

KlCd KCl KlCd

rAb+ t IAKC VKtob ( 3- 3 0)
KLCD


FIb AVIK 0t b Atic'VCKDb
KLD KLCD ( 31
+ t VIKltdb + t ticc iKtif
Kld KlCd










The MO equations can be reduced into four equations with four "right-side"

intermediates when they are expressed in terms of V with uncontracted ket indices.


+ = (tij) tpb YkV t(- V c,+"(-2



,h b+= ft V t" (3-34)
Kp La ]~


--briK b p p' V "K + VKl z, (3 -3 5)
Kp La la





Subset 3: CB/AD

The remaining terms are primarily PH ladder terms and the outer loop is over B. If

the bra shells are thought of as a square VV matrix, then the main difference in loops

between this subset and Subset 2 is that the inner loop A steps down the D column, while

the inner loop A of Subset 2 steps across the C row.

r b ddTc iCb (3-36)
KlCd

rf _AdViC5l b AdVTKltib + ftAd~citct? (3-37)
KlCd Kld KlCd

rl~b V t~ C~ltfb A t~dTcitCb (3-38)
KlC KlCd

Factorizing the MO expressions in terms of right-side intermediates and

uncontracted ket indices yields Equations 3-39 through 3-41.



r~ tAU 1V t (3-39)










Ob- t tA( +f V n~c" +t"' ( (3-40)
KU l lp



KU lp 1(-1

Set 2: HH Ladders from OOVV, OOOV, and OOOO

The ladder term that is quadratic in T2 can be calculated in one of two ways: one with a HH ladder intermediate and one with a PP ladder intermediate. Opting for the HH ladder intermediate saves a large amount of memory and memory writes. Three other terms have HH ladder intermediates from the OOVV integrals and can be factorized with the quadratic term. The ββ contributions are shown (Equation 3-42).

rGab Itdtilb 1 Ie/tactdt"b
kled kled (3 -42)
+ P (ab I k/ jfitcd ati + (g)~b I .k .Ijtctd katl
kled kled

The four ladder-like terms that come from OOOO and OOOV integrals can be

combined with the OOVV ladders when the ket orbitals are from the plain AO integral.

Equation 3-43 shows the MO contributions.

r;: + = il (kl i ta'b + () l cj) tact"
kl klc (3-43)
+JP\ab)I kl ilftot?+(Pg)b)I(kll cij~tctatb
kl klc

Combining the two equations, incrementing pre-antisymmetrized amplitudes, and

expressing the ket indices in the uncontracted AO basis yields Equation 3-45.

anb + = 1 [ lpe)/ t' + (g)[kl pr \t(t' 1 t"
klpo klps
(3-44a)
+(PJg f (kll pjcj' 1 ta +(PtiiC ~kl d)pj 1 tab
klp klp










ga+= 4 C kl pa i tot1 +P~iij)C kll pa tP t"'tat1

klpw klp

ab+=JE l p i P~i) c" +,"'c,"+t ab t?) (3-45b)
klp kl



The αβ terms and factorized increment are shown in Equations 3-46 through 3-48.


rl b+= Kd d ,Ctf +f KCdttdt Ah
Kl~d Kl~d(3 -46)
+ (Kl CJd~tCdtf tf + lK'll ~Cdtdt t?
KlCd KlCd


riAb+= (KIl~jttf fK:jttt
KlC KlC

+ (Kl Ictof~f + KlI totht (3-47)
Klc Klc

+C(Kl Ijt /CK +f K~L Ij t t
Kl Kl


rlb=[Klp + c;" +t PXC, r t +t rt (3-48)
Klps

Group 4: T2 into T1

The T2 amplitudes contribute to the R1 amplitudes in 14 terms. These terms can be

grouped together such that they define three one-particle intermediate operators, which

effectively reduce the particle number of the perturbation. Equations 3-49 and 3-50 show

the terms that define the IP (particle) and IH (hole) intermediates, respectively. Equation

3-5 1 contains both spin cases of the IR (ring) intermediate.

I + = k/bl c; i t -[ lC t to

klc KlC(3-49)
k/ Jtiatzd ~K~l l~Cd atd
kled KlCd


+= ka dtd+ Ka~).d
kcd K~d(3-50)
k/ kled KlCd










a+= KL D Cta kL c tt LDt~
KLCD kcD LD(3-5 1)
+C(" k/ ')t ~tida + Kl Cd~tctda~ fId Iu
kled KlCd Id

The terms that define the intermediates have been isolated in Equations 3-52 through 3-54. Both spin cases of the IR intermediate are shown because they both contribute to the β R1 amplitudes. (The terms with the Fock matrix elements are not part of the intermediate.) The α spin cases of IP and IH are analogous to the β cases.


I, : Sa+-C Ckl ca: i Kl Co t ,"+ (3-52)
Sklc KlC


IH : y(+ = iCk @ cd td + Kv Cd t~ ;Cd a -t) (3-53)
vkcd KCd


I" : r;+=CVr(!'', ~~K@Co kve tkc a LCr t ,a (3-54a)



If : 5"+=Cc CkVIca)CK Ct + kCc tob a dtid (3-54b)
vakcKC Id

Group 5: One-Particle Transformations

Set 1: T1 Contributions

The 12 remaining T1 contributions can be loosely described as one-particle transformations. The α terms are analogous to the β terms, which are shown. These are then drastically simplified with the IR intermediate, and further consolidation yields a non-HF set in Equation 3-61 and an IR set in Equation 3-62.



F~~kc KC -I(-5


FI' : 5+= 1/adtd +( I kaed t td + KalCd t~ td (3-56)
dea kdKCd












(3-57)




(3-58)



(3-59)


(3-60)


(3-61)



(3-62)


n"+= 1c;


) 12 ab I~td z ab

+ ij)(1Cx l k I/ Jtk la +B [ Kl d t tla
kled KlCd


li(ij) kcittab + (Kl Ci ttla~b
Sklc KlC

+ P(i) k Jt tid tab + KlC "I b
kled KlCd


Foo (1): ab


(3-63a)


Foo (2): rzaib


(3-63b)


F,'(1 :ab+ = P(~r~dab) fad


(3-64a)


lar Id klc KlC

+ [ k/ J tf~tidt+ C Kl CdjtC t tdta
kled KlCd


Fo: Ca* fo + calf "



dea ver


Fo ra _~ = 12ta + 1d sdt t I c,"+")
lar Id ver


n"+ L 1JadtId 12l I dbC -dz 1
dea 1 2 Id


t )rf (c," + t"')


Set 2: T2 Contributions

Two one-particle transformations describe the final ββ T2 contributions.


id


icl(ab)i: }[ k/ J./ + K Cd).
kled KlCd











F,'(2) : ga a)[k dt td aC "I"
kcd KCd

-P (b) k/\ txllf tzatzb + C Kl Cd tf~ tlajd
kled KlCd


(3-64b)


Similar to the F1 intermediates used to calculate the T1 contributions, the one-particle transformation matrix for T2 into R2 can be defined in terms of IP, IH, and IR, and using these definitions, the ββ and αβ increments can be trivially reduced to four terms (Equations 3-67 and 3-68).


It; f;l, +fldCCrP d z" +~ t~) (3-65)


Id v ver


Fd dda Idl (("
Id ver

ab eat' + Fe t kb Fktab


(3-66)


(3 -67)


(3-68)


k k


CFIKt, Fkt
K k


nAb ~+FCbt~bFe c
Cc















CHAPTER 4
PARALLEL ALGORITHM

Virtual Parallel Computer

Before any algorithm can be parallelized, the type of parallel computer that the

algorithm will run on must be identified. Is remote memory accessible through one-sided

operations or must each process explicitly interrupt and request data from another

process? How much data will be needed during each section of the algorithm and can that

data be held by a single process? Does each process have the ability to read data from a

storage device, and if so, what is the penalty for multiple processes reading data

simultaneously? These are the types of questions that must be answered before making

assumptions, which will shape the plan for parallelization.

The most basic parallel computer is a network of workstations, in which each node

is a single CPU computer with a private memory space and a storage device--usually a

hard disk. The most common connection fabric for these networks is TCP/IP over

Ethernet. This allows for any node to communicate with any other node in the network,

and the network hardware or operating system software will handle delivering the

messages. (As opposed to each process having to maintain a table of which nodes are

directly accessible and which nodes must be reached through other nodes.) The cost of

maintaining these networks is not linearly proportional to their size, so it is common for

500+ node networks to remove the most failure prone element, the hard disk, and provide

other means for each node to conduct I/O operations.









The extreme opposite of this picture is a single computer with multiple CPUs,

which are all able to access the same memory space and hard drive(s) with equal

performance. Each CPU is identical to the others, so these are commonly referred to as

Symmetric Multi-Processor (SMP) computers; however, there are tremendous costs

associated with adding each additional CPU, so it is unusual to have more than 8 CPUs in a

truly SMP machine. Some manufacturers have mini-networks within a single computer

that allow them to increase the number of CPUs without making them pure SMP. To an

application programmer, these machines are programmed in the same way as Symmetric

Multi-Processors, but the true label is Shared Memory Programming (also SMP!) even

though the CPUs are not identical in terms of distance to memory, cache, and possible

hard drives. Henceforth, there will be no distinction between the meanings of SMP since

the parallel CCSD algorithm will be implemented as a user process and not a system or

kernel process.

The balance between cost and performance lies somewhere between the two

extreme types of parallel computers. The primary offerings of the major vendors of

supercomputers are clusters of SMP machines. For fine-grain parallelization, each node

has multiple CPUs accessing the same memory, and for coarse-grain parallelization, each

node has physically separate resources. The virtual parallel computer that the parallel

CCSD algorithm relies on is this type of cluster of SMP workstations; however, program

portability and the ability to tune the network parameters are very desirable.

The only way to build portability and tunability into the program without having

divergent paths through the software is to model the physical computer in the boundaries

of network communication. By properly defining certain sub-networks, no process will










attempt to communicate through slow network channels when faster ones may exist;

furthermore, messages pertaining to a certain context can be limited to regions of the

network, which might reduce network contention on some architectures. This allows the

scope of communications to be set once at program run time (or compile time) and the

algorithm will transparently acknowledge the underlying hardware up to the efficiency of

the partitioning scheme.

Communication Networks

In some sense, parallelizing algorithms that handle massive amounts of data and

computations is akin to managing disaster recovery. The moment the event occurs, there

are many diverse and unused workers who have no assignment or direction. As the

situation is assessed, teams of workers are sent to handle different aspects of the recovery

from the time critical, like search and rescue of survivors, to the time forgiving, like

public relations and cost reimbursement.

Wildfire firefighters have some of the most extensive experience in rapid

deployment and management of labor units. In the late 1970s, the Incident Command

System (ICS) was created in response to a series of wildfire disasters in southern

California [Fed98]. It has since been adopted and refined by several federal and state

agencies and provides the basis for most emergency and non-emergency incident

responses. The parallel CCSD algorithm uses the following aspects of its effectiveness:

* Modular/scalable organization
* Manageable span of control
* Unit designations.

The lowest level of the organization is the single process (normally assigned to a

single CPU). Every process must have its own memory heap, and the combination of a










process and its memory is called a processor element (PE). PEs recognize other PEs on

the same physical node and form squads (SQD). Squads can be grouped together to form

platoons (PLT) based on algorithm requirements, and platoons can be grouped together to

form companies (CO), which execute large-scale algorithms.

Although the units were listed in increasing size, their division and recognition

happen in decreasing size. The commanding process evaluates the available resources

and divides the group into subgroups accordingly. For example, the Company

Commander (CO_CDR) would catalog the total distributed memory and channels to disk

and form platoons based on how the company algorithm uses those resources. The squads

would then self-assemble within the platoons based on which node the PEs are on.

Each division has its own communication space that does not interfere with

messages in another division, and there are special communication spaces for the

commanders of the subdivisions. This allows platoons to synchronize independently from

each other, and more importantly, it allows company algorithms to disregard the

messages of other companies. The company that performs the CCSD iterations and

calculates the T amplitudes is called the Amplitude Company. Figure 4-1 illustrates the

communication boundaries for a company and its subdivisions.
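A minimal sketch of how these nested communication spaces could be realized with MPI communicators is given below. The splitting policy (a block distribution of PEs over platoons and a host-name hash for squads) and the routine name are illustrative assumptions, not the scheme used by the production code.

/* Sketch: splitting MPI_COMM_WORLD into company, platoon, and squad
 * communicators.  The platoon count and the node-detection scheme are
 * assumptions made for illustration only. */
#include <mpi.h>

void build_units(int n_platoons,
                 MPI_Comm *co_net, MPI_Comm *plt_net, MPI_Comm *sqd_net)
{
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* The whole Amplitude Company: here simply a duplicate of the world. */
    MPI_Comm_dup(MPI_COMM_WORLD, co_net);

    /* Platoons: block distribution of PEs over n_platoons. */
    int plt_color = world_rank / ((world_size + n_platoons - 1) / n_platoons);
    MPI_Comm_split(*co_net, plt_color, world_rank, plt_net);

    /* Squads: PEs that share a physical node.  A hash of the host name
     * stands in for real node detection. */
    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);
    unsigned sqd_color = 0;
    for (int i = 0; i < len; ++i)
        sqd_color = 31u * sqd_color + (unsigned char)host[i];
    MPI_Comm_split(*plt_net, (int)(sqd_color & 0x7fffffff), world_rank, sqd_net);
}

Messages scoped to plt_net or sqd_net then never interfere with traffic in another platoon or squad, which is the property described above.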

Data Management

Arrays stored in distributed memory

The fundamental data elements involved in standard CC theory are T amplitudes,

integrals, and the reference wavefunction. The predominant method for solving the T

amplitude equations is iterating over a coupled set of equations for T1 and T2. This makes

the amplitudes stateful at each iteration compared to the two-electron integrals, which are

stateless. Thus, the minimum amount of information that must be stored during the CC









iterations is TOLD (T1 and T2), TNEW (R1 and R2), and any information that defines the

reference wavefunction. The integrals do not need to be stored since they can be

generated on the fly.


Figure 4-1. Communication network boundaries. Rounded rectangles represent
communication networks (in italics). Square rectangles represent physical
entities. Dashed square rectangles could be redundant entities (i.e., a platoon
commander will probably be the leader of the first squad).


It is assumed that a single PE cannot store a full OOVV quantity. Even for SMP

machines, a single OOVV array might exceed the total physical memory of the node, so

the old and new T2 arrays must be distributed over a set of PEs. It is also assumed that

each PE is able to retrieve and increment remote memory with one-sided communications

(i.e., without interrupting the process that allocated the memory). If the total distributed

memory is many times larger than the amount of data that must be stored, then










subdividing the Amplitude Company into platoons such that multiple copies of the

amplitudes are available for reading will decrease the average distance between a PE and

the distributed data that it attempts to access.
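The following sketch shows one way the local block of a distributed OOVV array could be exposed for such one-sided reads and increments using MPI-2 windows. The helper names, the flat double-precision layout, and the displacement arithmetic are assumptions made for illustration.

/* Sketch: exposing each PE's block of a distributed OOVV array (T2 or R2)
 * through an MPI-2 window so any PE in the platoon can GET or ACCUMULATE
 * remote blocks.  Sizes and layout are placeholders. */
#include <mpi.h>

MPI_Win expose_block(MPI_Comm plt_net, MPI_Aint n_local_doubles,
                     double **local_block)
{
    MPI_Win win;
    /* MPI_Alloc_mem often returns memory better suited for RMA. */
    MPI_Alloc_mem(n_local_doubles * (MPI_Aint)sizeof(double),
                  MPI_INFO_NULL, local_block);
    for (MPI_Aint i = 0; i < n_local_doubles; ++i) (*local_block)[i] = 0.0;

    MPI_Win_create(*local_block, n_local_doubles * (MPI_Aint)sizeof(double),
                   sizeof(double), MPI_INFO_NULL, plt_net, &win);
    return win;
}

/* One-sided read of a remote block: all OO pairs for one AO shell pair. */
void get_shell_pair(MPI_Win win, int owner_rank, MPI_Aint disp,
                    double *buf, int count)
{
    MPI_Win_lock(MPI_LOCK_SHARED, owner_rank, 0, win);
    MPI_Get(buf, count, MPI_DOUBLE, owner_rank, disp, count, MPI_DOUBLE, win);
    MPI_Win_unlock(owner_rank, win);   /* completes the GET */
}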

In this algorithm, the number of PEs in the Amplitude Company is almost

irrelevant, but the total amount of distributed memory is critical. The Company must be

able to store TOLD, TNEW (R), and the larger of a third T2 array or the amount of scratch space needed

per PE, which depends on the chemical system (Figure 4-7). Each PE will have its own

local copy of small frequently accessed arrays, which include basis set definitions, AO

coefficients, T1 and R1 amplitudes, and other one-particle intermediates. While the PEs

are calculating contributions to the R amplitudes, only T2 and R2 need to be accessed in

addition to local scratch memory. Between certain sections, R2 must be transformed,

which requires enough scratch memory for storing the third distributed OOVV array.

Figure 4-7. The memory heap management for various molecular systems. In one
scenario, the minimum scratch memory needed for amplitude contributions
is less than the amount of memory needed to store a section of R2 during a
transformation. The other scenario is when the opposite occurs.


Example 1. The user runs a calculation that requires 128 MB of reserved data per

PE, such as basis set information, one-particle matrix elements, etc. The calculation

requires another 128 MB (minimum) of scratch memory per PE for calculating various










quantities during the CCSD iteration, and the OOVV quantities require 4096 MB of

distributed memory from all of the PEs. If the user submits the job with 1024 MB of heap

per process, what is the minimum number of PEs needed to perform the calculation?

Alternatively, does the calculation have enough resources if the user submits the job with

X PEs, each with 1024 MB of memory?

Solution 1. The amount of non-reserved memory is 1024-128 MB/PE (i.e., 896

MB is available for scratch and distributed data on each PE). Three OOVV arrays require

12,288 MB. Dividing this amount by 896 MB/PE and rounding up yields a minimum of

14 PEs. Distributing each OOVV array over 14 PEs requires each PE to store

approximately 293 MB of data per array. Since this amount is larger than the minimum

amount of scratch memory needed (128 MB), the calculation can proceed with 14 or

more PEs.

Example 2. Using the same system data as above, but increasing the minimum

required amount of scratch memory to 512 MB, what is the minimum number of PEs

needed to run the calculation?

Solution 2. This time, the regular scratch memory is larger than the distributed

OOVV scratch array. In essence, that scratch memory becomes unavailable for distributed data, so the

effective amount of memory per PE for the OOVV arrays is (896 - 512) = 384 MB. Since the

third OOVV array can fit in the scratch memory, only two OOVV arrays, totaling 8192

MB, need to be distributed over the Amplitude Company. 8192 MB divided by 384

MB/PE yields 22 PEs (rounded up). This means each PE will store approximately 187

MB of each OOVV array, and the calculation can proceed if there are at least 22 PEs.

The minimum PE test can be described by the following pseudo-code:









mem_dist  = mem_heap - mem_reserved
min_PEs   = ceiling( 3 * mem_oovv / mem_dist )
mem_block = mem_oovv / min_PEs
if (mem_block < mem_minscr) then
    mem_dist = mem_dist - mem_minscr
    min_PEs  = ceiling( 2 * mem_oovv / mem_dist )
end if

If the user runs the job with more than twice the minimum number of PEs, then the

Amplitude Company can divide into multiple platoons, each with its own copy of TOLD

and TNEW. One of the benefits of this approach is that fewer PEs will access each block of

distributed memory, but the main advantage is that knowledge of the underlying

hardware can be programmed into the partitioning scheme and the algorithm will

transparently align itself with the hardware boundaries.

Arrays stored on disk

The VVVV and VVVO integrals will not be transformed and stored on disk, so

their contributions must be calculated on the fly in the AO basis. The OOOO and OOOV

integrals will not be stored on disk either, not because they are too large, but because

their contributions can be accounted for with intermediates from the direct integrals and

from the stored OOVV half-transformed integrals. The stored integrals can be distributed

over and individually processed by the platoons of the Amplitude Company.

Parallel UHF CCSD

Each CC iteration begins with initializing the global intermediates and residual

amplitudes to zero. The TOLD amplitudes are defined by Equation 4-1. This formulation

allows the amplitudes to contract directly with AO integrals. The T2 residual amplitudes

will have different definitions depending on which part of the algorithm is executing. In

the Direct Integral (DI) phase, one of the virtual indices is coming from an uncontracted

AO integral, so the residual is related to the MO residual amplitudes by Equation 4-2.










$t_i^{\mu} = \sum_a c_a^{\mu} t_i^a , \qquad t_{ij}^{\mu\nu} = \sum_{ab} c_a^{\mu} c_b^{\nu} t_{ij}^{ab}$    (4-1a)

$t_{Ij}^{\mu\nu} = \sum_{Ab} c_A^{\mu} c_b^{\nu} t_{Ij}^{Ab}$    (4-1b)

$r_{Ij}^{Ab} = \sum_{\nu} c_b^{\nu}\, r_{Ij}^{A\nu}$    (4-2)


The Stored Integral (SI) and One Particle (OP) phases calculate residual

contributions with both virtual orbitals coming from T, so the DI residual is transformed

according to Equation 4-3.


$r_{Ij}^{\mu\nu} = \sum_{A} c_A^{\mu}\, r_{Ij}^{A\nu}$    (4-3)


Throughout the iteration, each PE stores all OO pairs (ket indices) for a range of

AO shell pairs (bra indices). The amplitudes are distributed over the platoon (Figure 4-2),

and one-sided communications retrieve and increment all OO pairs for a block of AO

shell pairs.

MN = 1 2 3 4 5 6

13 13 13 I 13 13 13











PE 0 PE 1

Figure 4-2. The T2 and R2 memory layout. Each square represents the four-dimensional
array of amplitudes that have the designated spin pair and whose AO bra
indices belong to the MN shell pair. The arrows represent the order in
memory.









After all of the contributions have been calculated, the residual is antisymmetrized

with respect to the two virtual indices and back-transformed to the MO basis. The new T

amplitudes are calculated by dividing each amplitude by the diagonal elements of the

Fock matrix.

Direct Integral Phase

Design considerations

Analyzing the T2 contribution in Equation 4-4 summarizes the DI phase. Since

the VVVO integrals are not transformed and stored prior to the CC iterations, the terms

must be evaluated in the AO basis.

r Ab + = C tA~ (Kb ICd() td +Clb (Al Cd') td (4-4)
KCd ICd

riA tA b~~ t"" + cA b t"~' (4-5)


The biggest problem with this situation is that every AO integral causes a global

update to R2 (i.e., one integral contributes to every R2 amplitude). The best way to reduce

the number of operations is to precondition R2 such that a smaller set of VVVO

contributions is calculated. R2 would then be post-transformed after the contributions

have been accumulated (Equations 4-6 through 4-8).

r-v b Ab (4-6)



rI~v+ =C ~ tAv pr per)tf" (4-7)


r Ab b Av (4-8)


For each integral, only an IjA strip of R2 amplitudes is incremented compared to

the full set without the backward and forward transformations of R. The difficulty with









the αβ amplitudes is that while the β side can get away with 1/V updates, the α side still

requires a global update per integral (Equation 4-9). The μ orbital must be forward

transformed into A, but the ν orbital must be transformed into b by T1, which must then

be back-transformed by C. One solution is to calculate the two spin cases in separate

passes (while transforming R between them). Although this requires generating the AO

integrals a second time, the performance of a naturally parallel local computation is much

greater than a global update per N4 matrix element.

rf "+= fc cfb/,t, (pv fpeb" (4-9)



A second performance issue is antisymmetrization of the same-spin integrals.

Equation 4-10 shows one of the αβ ring contributions in MOs, and Equation 4-11 shows

the terms appropriately transformed into the AO basis.

rlb + = b cd tjetfd (4-10)
kcd

rIf'"+= prp jt"(4-11)


The AO integrals are not antisymmetrized by the integral generator, so one of two

approaches must be taken to account for both contributions. The first approach involves

two updates per integral:




The benefit of this method is that the PE needs only enough memory to store the AO

integral batch and the T2 batch; however, two updates per integral is more costly than two

reads per integral, which can be achieved by antisymmetrizing the ρσ pair:













This requires either two temporary T2 blocks or two serial loads. If there is enough

memory, then the exchange integral can be calculated with the regular Coulomb integral

and the antisymmetrized contribution can be calculated with one read and one update.

Since shell quadruplets of integrals are typically not more than a few hundred kilobytes,

the algorithm will assume this can be done. The permutational symmetry of the

antisymmetrized integrals is not the same as that of the plain Coulomb integrals so they

must be generated again in the middle of the loop over permutations. Although two

updates or two reads are avoided, the integrals need to be calculated three times per AO

shell quadruplet. Considering the cost of messages compared to the speed of locally

computing a shell quad, this is a favorable tradeoff. There are two passes (one for αβ and

one for βα), and there are three integral generations per pass.

From these considerations, it is clear that the DI phase must calculate the unique

integrals six times even though the AO integrals have 8-fold permutational symmetry.

There are two passes (one for up and one for Pa) and there are three generations per

pass--one Coulomb and two exchange integrals. This might seem expensive, but the cost

of doubling or quadrupling the number of messages does not scale linearly with the

number of PEs and far outweighs the cost of the naturally parallel integral computation.

Algorithm

A critical aspect of this algorithm that makes or breaks the performance is one-

sided communications. A one-sided GET of a T2 block can be in flight while the PE

calculates terms that do not involve the T2 amplitudes. Immediately after the amplitudes

have been read, the next block of amplitudes can be requested, even if the PE is not ready

to use them. This technique is also used to hide the communication of R2 updates. The









main difference between the two arrays is that the T2 amplitudes are read once per GET,

while the R2 amplitudes are incremented two or three times per ACCUMULATE. Ideally,

there would be two R2 buffers, one active and one pending, that increase the time

available for each ACCUMULATE. If the user must run the calculation in limited

memory, then it is possible to still benefit from one-sided ACCUMULATEs, but the PE

might have to wait for a while before initializing the buffer for the next R2 contribution.
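A sketch of this double-buffering idea is given below, using MPI-3 request-based one-sided operations (MPI_Rget and MPI_Raccumulate) as a modern stand-in for whatever one-sided layer the production code targets; the compute_* routines are placeholders for the actual contraction kernels.

/* Sketch: overlap a one-sided GET of the next T2 block with work that does
 * not need it, and keep two R2 buffers so one ACCUMULATE can be in flight
 * while the other buffer is filled.  The compute_* calls are placeholders. */
#include <mpi.h>

void compute_without_t2(void);                       /* placeholder kernels */
void compute_with_t2(const double *t2, double *r2);

void di_block(MPI_Win t2_win, MPI_Win r2_win,
              double *t2_buf, double *r2_buf[2], int count,
              int t2_owner, MPI_Aint t2_disp,
              int r2_owner, MPI_Aint r2_disp)
{
    MPI_Request get_req;
    MPI_Request acc_req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };
    int active = 0;

    MPI_Win_lock_all(0, t2_win);
    MPI_Win_lock_all(0, r2_win);

    /* Start fetching T2 early ... */
    MPI_Rget(t2_buf, count, MPI_DOUBLE, t2_owner, t2_disp,
             count, MPI_DOUBLE, t2_win, &get_req);

    compute_without_t2();                 /* terms that do not need T2 */
    MPI_Wait(&get_req, MPI_STATUS_IGNORE);

    /* ... fill the active R2 buffer, then let the increment fly while the
     * other buffer is prepared for the next contribution. */
    compute_with_t2(t2_buf, r2_buf[active]);
    MPI_Raccumulate(r2_buf[active], count, MPI_DOUBLE, r2_owner, r2_disp,
                    count, MPI_DOUBLE, MPI_SUM, r2_win, &acc_req[active]);

    active = 1 - active;
    MPI_Wait(&acc_req[active], MPI_STATUS_IGNORE);   /* buffer free to reuse */

    MPI_Wait(&acc_req[1 - active], MPI_STATUS_IGNORE); /* drain the last one */
    MPI_Win_unlock_all(r2_win);
    MPI_Win_unlock_all(t2_win);
}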

The main driver for the DI phase loops over AO shell quadruplets of integrals. The

MN shell pairs are distributed over all of the PEs in the Amplitude Company. This

distribution can happen a number of ways: round robin, blocked, block cyclic, job board,

master/slave, etc. How a PE processes each MN shell pair is independent of the shell pair

assignment, so this can be tweaked outside the main processing loop. Ideally, there would

be a globally accessible integer that corresponds to the next MN pair that must be

processed. The integer must be atomically read and incremented by each PE to get its

next assignment. This balances the load better than a fixed distribution algorithm like

round robin or block cyclic, and it performs the same function as a master process

sending out new tasks without actually needing a process to sit idle waiting for requests.
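A sketch of such a counter is shown below, using the MPI-3 atomic MPI_Fetch_and_op as a stand-in for the one-sided read-and-increment the algorithm requires; the convention that rank 0 hosts the counter is an assumption of this illustration.

/* Sketch: a globally accessible shell-pair counter that every PE reads and
 * increments atomically.  Rank 0 of the communicator hosts the counter in
 * ctr_win (an assumption for this sketch). */
#include <mpi.h>

/* Returns the next MN shell-pair index, or -1 when all pairs are done. */
int next_shell_pair(MPI_Win ctr_win, int n_pairs)
{
    const int one = 1;
    int mine;
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, ctr_win);
    MPI_Fetch_and_op(&one, &mine, MPI_INT, 0, 0, MPI_SUM, ctr_win);
    MPI_Win_unlock(0, ctr_win);
    return (mine < n_pairs) ? mine : -1;
}

Each PE would simply call next_shell_pair() until it returns -1, which reproduces the job-board behavior without dedicating a master process to handing out tasks.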

Once an MN shell pair is identified, the PE is responsible for generating and

processing all unique RS shell pairs that correspond to the MN upper bound. This

prevents any PE from processing Coulomb or exchange integrals that another PE is

responsible for. It also allows for a single generation of the integrals for each pass if there

is enough scratch memory (i.e., the PE would generate all RS pairs and internally

antisymmetrize them before beginning the processing loop). Typically, however, there

will be enough memory for two shell quads and the scratch buffers for contracting terms.









A shell quadruplet of integrals will contribute to all of the Group 1 terms, the

Group 2 terms, and the Group 4 intermediates. Table 4-1 summarizes the stages of the

processing loop and which elements are accessible at each stage.

Table 4-1. The direct integral processing loop summary
Loop section   Operations                      Contributions
prologue       (IH, I2) += f(V, C, T1, T2)     all Group 1 terms and the IH intermediate
stage 1        R2 += f(V, C, T1, T2, I2)       all contributions to R2 from Group 2 and Group 1
stage 2        IP += f(V, C, T2)               the IP intermediate
epilogue       IR += f(V, C, T1)               the IR intermediate

The Group 1 terms are saved in a two-particle intermediate during the loop over the

T1 amplitudes from the Group 2 terms. The I2 intermediate is then contracted with Q

(C*C) and the Group 1 contribution is sent with the Group 2 contribution to increment

R2. If there is not enough scratch memory to store the I2 intermediate, then it is possible

to calculate these terms in a single direct integral pass separately from the Group 2 terms

or the IP or IR intermediates. Table 4-2 has the detailed order of operations for the

prologue of processing a shell quad. The w one-particle matrix element is defined in

Equation 4-14.

$w_I^{\mu} = c_I^{\mu} + t_I^{\mu}$    (4-14)

When the prologue finishes, there are two important quantities stored in memory:

the Group 1 contribution and the T2 amplitudes that generated it. Since the Group 2

contributions loop over all L shells for a given S shell, it is convenient to begin the loop

with L=R. Table 4-3 lists the detailed order of operations for the L loop. The x one-

particle matrix element is defined in Equation 4-15.

x~ =~ [ jc t (4-15)





After the last value of L, there is still a pending R2 ACCUMULATE, which must

be completed. Rather than WAIT immediately after the loop, the IR intermediate, which

requires only T1, C, and V, can be calculated. Furthermore, if there are more

permutations of V, then the PE can GET the next T2(RS) block and permute the integrals.

Table 4-4 lists the detailed order of operations for the permutation epilogue.

The first four permutations of V involve the same two sets of shell quads.







The second set of four permutations requires a different exchange shell quad, so the

integral generator is invoked a third time per MNRS configuration.







Figure 4-8 shows the movement of data needed to calculate the contributions to R2

from an integral shell pair.

Figure 4-8. The data flow in the Direct Integral phase. Solid arrows represent one-sided
messages, and dashed arrows represent reading and incrementing the global
shell pair counter. Each PE retrieves all T2 amplitudes that contract with the
integral and accumulates the contributions into the R2 array.









When the Amplitude Company has calculated all of the MN shell pairs, the R2

distributed arrays are transformed for the phase that processes the stored OOVV

integrals. The one-particle global intermediates could be reduced here for use in the One

Particle phase. The following pseudo-code lists the entire flow of control from the

beginning of the CC iteration up to the start of the SI phase.

zero(R2, R1, I1)
W1 = C + T1
do spin = 2, 1, -1
    X1 = f(Q1, C, T1)
    SYNCHRONIZE(CO_NET)
    do while (more MN)
        do all RS
            GET T2(RS)
            G = intgen(M,N,R,S)
            H = G - intgen(M,N,S,R)
            do perm = 1, 8
                calc_Prologue()
                nLeft = Nshells
                L = R
                do while (nLeft != 0)
                    calc_InnerLoop()
                    nLeft -= 1
                    L = 1 + mod(L, Nshells)
                end do nLeft
                calc_Epilogue()
            end do perm
        end do RS
    end do MN
    SYNCHRONIZE(CO_NET)
    if (spin == 2) then
        global R2(Lα,Nβ) = Qβ(Nβ,Dβ) * R2(Dβ,Mα) * Qα(Mα,Lα)
    else
        global R2(L,D) = R2(L,N) * Q(N,D)
    end if
end do spin
REDUCE(I1, CO_CDR)

Stored Integral Phase

The SI phase calculates all of the Group 3 terms from the stored integrals of the three

spin types, collectively referred to as OOVV. Each PE will load or receive an

integral block, which contains all spin pairs for a particular ρσ shell pair. The mechanism









for obtaining integrals depends on the performance and architecture of the I/O system. If

only one PE per node (or squad) is allowed to read from disk--assuming this is an

operating system constraint--then the other PEs in the squad will have to request the

blocks from that I/O PE. The I/O PE will then have to juggle its own calculations while

acting as an I/O proxy; otherwise it could opt not to store R2 amplitudes, but then the

system takes a performance hit as a whole because there is one relatively idle CPU per

node.

Most computers will allow any process to perform I/O and shift the performance hit

of I/O contention to the program. The technique the parallel CCSD algorithm uses to

lessen this penalty is asynchronous file reads. The logistics are the same as one-sided

operations in the DI phase in that each PE issues an AIO READ immediately after the

last integral has been read but before other computation has been done. If asynchronous

I/O operations are not available and the user elects not to designate a special I/O PE, then

the algorithm can wait for the integrals up front with a blocking read or it can

synchronize the squad PEs by broadcasting the current block and waiting for all of them

to finish before reading the next block.

Resource competition, such as parallel blocking reads, suffers scaling penalties,

meaning the cost per read increases as the number of PEs per squad increases. The upshot

is that there is still the possibility that some PEs are computing while others are waiting

on I/O. Locked step algorithms, however, usually suffer far worse than resource

competition algorithms. Locked step guarantees that no one is working while a read is

pending and that no one is reading while the PEs are computing. Even if the I/O PE used

asynchronous reads, the locked step approach would suffer from expensive









synchronization delays compared to pure resource competition. From these

considerations, each PE will be responsible for reading its next block of integrals from

disk. If asynchronous reads are not supported, then the PE will default to blocking reads.
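One possible realization of these asynchronous reads is the POSIX AIO interface sketched below; the fixed-size record layout and the single outstanding request are simplifying assumptions of this illustration.

/* Sketch: prefetch the next OOVV integral block with POSIX asynchronous
 * I/O while the current block is being processed.  The record layout is a
 * placeholder. */
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

static struct aiocb cb;

/* Issue the read for block iblk of fixed size blk_bytes. */
void start_read(int fd, long iblk, size_t blk_bytes, double *buf)
{
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = blk_bytes;
    cb.aio_offset = (off_t)iblk * (off_t)blk_bytes;
    aio_read(&cb);                       /* returns immediately */
}

/* Block only if the data has not arrived yet, then return bytes read. */
ssize_t wait_read(void)
{
    const struct aiocb *list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);      /* sleep until completion */
    return aio_return(&cb);
}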

The OOVV integrals will be distributed over platoons by block cycling the AO

shell pair kets. Every PE in a platoon must read and process all of the integrals assigned

to its platoon. The integrals will only contribute to the R2 amplitudes that the PE stores,

but the T2 amplitudes must be read (with one-sided GETs) from other PEs in the platoon.

Every shell pair of integrals results in four updates to all of the local R2 amplitudes,

one for each subset of Group 3 Set 1 and one for Group 3 Set 2. Since Group 3 involves

contributions in which all virtual orbitals come from T, the virtual orbitals of the integrals

must be the internal summation indices. The general approach to calculating the Set 1

subsets is to loop over λ and δ for a fixed ρσ pair. Each subset corresponds to a different

configuration of these nested loops.

Figure 4-3 shows the path of updates through the local set of R2 amplitudes. Table

4-5 shows the detailed order of operations for calculating Group 3 Set 1 Subset 1

(AC/BD is LR/DS in AO shells).

Accessing memory in non-unit stride is highly unfavorable when performance is

the utmost concern; however each square in Figure 4-3 represents three four-index arrays

such as those shown in Figure 4-2, and those updates can be performed in unit stride.

Furthermore, the L-D loop order is only used in Subset 1. Tables 4-6, 4-7, and 4-8 show

the operations for Subsets 2 and 3 of Set 1 and Set 2, respectively. Each steps through the

LD pairs in D-L order.





























Figure 4-3. The path of Group 3 Set 1 Subset 1 updates per OOVV integral. Dashed
lines show memory continuity and solid arrows show the loop path.


There is some flexibility in choosing the order in which the Group 3 sections are

calculated. The most important design consideration is choosing which section will be

performed last since it will AIO_READ the next RS shell pair of OOVV integrals from

disk. Subset 2 of Set 1 has the largest number of operations after the last integral is used,

so this section should be performed last.

Choosing a section to run first depends on how many operations can complete

during the first GET before reading the T2 amplitudes. Subsets 1 and 3 of Set 1 and Set 2

all use the amplitudes almost immediately, but each requires a different initial shell pair.

Set 2 reads only one shell pair, T2(RS), and neither shell contributes to the final R2

amplitudes, so the probability that T2 is local is inversely proportional to the number of

PEs in the platoon. The Subsets of Set 1 fare much better since at least one shell of T2

contributes directly to a shell of R2. If the amplitudes are distributed in bands of L, then

Subset 3 has a good chance of T2 being local, and if the amplitudes are distributed in









bands of D, then Subset 1 is better off executing first. Figure 4-4 illustrates the two

possibilities.

Figure 4-4. The LD shell distribution schemes. Each PE stores both T2 and R2
amplitudes for its block of LD shell pairs. Set 1 Subsets 2 and 3 favor L
bands, while Subset 1 favors D bands.


The parallel algorithm uses L bands since two of the three subsets favor them as far

as having local T2 amplitudes per R2 update. Subset 2 is already designated to run last, so

subset 3 should be performed first because it has the highest chance of having no wait

time for the initial T2 GET. The decision to run Set 2 before or after Subset 1 can be

made on the same argument. Since Set 2 updates the entire set of local LD pairs after

reading the T2(RS) amplitudes, it offers the most time for the initial GET of Set 1 Subset

1 to complete before reading them. Therefore, the calculation order of the Group 3 terms

is Set 1 Subset 3 (S1S3), Set 2, S1S1, and S1S2.

Figure 4-9 shows the flow of data in the SI phase. Each PE loops over all of the

shell pairs that were assigned to its platoon and calculates only the terms that contribute

to the local R2 amplitudes. The following pseudo-code lists the entire flow of control

from the end of the DI phase up to the start of the One Particle phase.

AIO_READ V(RS0)
GET T2(RD0)
do RS = RS0, RS1
    WAIT V(RS)
    calc_G3S1S3( GET T2(RS) )
    calc_G3S2( GET T2(L0R) )
    calc_G3S1S1( GET T2(SD0) )
    calc_G3S1S2( AIO_READ next V(RS); GET next T2(RD) )
end do RS

Figure 4-9. The data flow in the Stored Integral phase. Each PE processes all of the
integrals that are stored by its platoon and updates its local block of R2.


One Particle Phase

The OP phase is responsible for contracting the Fock matrix elements of the

reference wavefunction (f) and the I1 intermediates (IP, IH, and IR) with the T

amplitudes. There are three types of contractions: T2 into R1 (Group 4), T1 into R1 (Group

5 Set 1), and T2 into R2 (Group 5 Set 2). Only one PE must do the T1→R1 contractions,

provided it has the complete I1 intermediate (it should, since the contributions were

reduced at the end of the DI phase). These transformations, shown in Equation 4-16 for

T1, are insignificant performance-wise, and little is gained by parallelizing them.

+~ = t c"+t lf (4- 1 6a)


I" +=C Ic far + ca fad C@tz -1 tz1 -/2 1 t%/d" t"cr (4-16b)
aad I~ Id










The T2→R1 contributions can be distributed over the PEs in one platoon, since they

are limited to local reads and local writes; however, while the T2→R2 terms also have

local writes, they have both local and remote reads. The remote reads come from the

terms in Equations 4-17 and 4-18. The β part of FP is defined in Equation 4-19. (The α

part is analogous.)

? += FPa t; +=C FP t""' (4-17)







F' = C cn"fadc -C t + IdGi Iv" t (4-19)
ad IS Id

Parallelization of these particle terms should reduce the number of messages a PE

must send as the number of PEs in the company increases. This implies distributing the

summation index (ρ or σ) over the platoons; however, since the SI phase prefers L bands

(the ρ index), it is likely that most (if not all) of the T2(RD) GETs are local, so it is not as

critical to performance to distribute them over the platoons as it is for σ. For the σ loop,

each PE must get data from multiple target PEs that contribute to its local R2 set (Table 4-

9). Table 4-10 shows the other FP terms.

The T2→R2 hole terms and the T2→R1 terms (along with the latter FP terms) should

be performed by the largest platoon (HVY_PLT), even if that platoon has only one more

PE. During this time, all of the other platoons will send their contributions via

ACCUMULATE to the Heavy Platoon. Once these operations are complete, the only

remaining tasks are antisymmetrizing the same-spin R2 amplitudes and updating the old

T2 amplitudes with R2. Table 4-11 shows the operations of the Heavy Platoon while the









other platoons are sending their contributions, and Equations 4-20 and 4-21 show the

terms that contribute to FH and FR, respectively.


FZ = ft tak, Ideft,"r + c'I: +C c,Ifcp"+t,") (4-20)
Id v vor

FL = r. Ad+C~P (4-21)


The following pseudo-code lists the operations that occur from the end of the SI

phase up to the start of the antisymmetrization and update phase.

if (CO_CDR) then
    make FP, FH, FR
    BROADCAST(F1, CO_NET)
    calc_T1_into_R1()
else
    RECEIVE(F1, CO_NET)
end if
calc_FP_T2()
if (HVY_PLT) then
    calc_other_F_T2()
else
    ACCUMULATE(R2, HVY_PLT)
end if
REDUCE(R1, CO_CDR)

Antisymmetrization and T2 Update

The αα and ββ R2 amplitudes must be antisymmetrized with respect to AO shell

pairs. Equation 4-22 shows this operation. All of the R amplitudes must then be divided

by the diagonal elements of the Fock matrix in order to generate the next set of T2

amplitudes. Each PE in the Heavy Platoon, having collected all of the R2 amplitudes that

it is assigned to store, will antisymmetrize and transform its amplitudes into T2. The

R2→T2 transformation is shown in Equation 4-23, and the diagonal matrix elements of the transformed Fock matrix

are shown in Equation 4-24.


$r_{ij}^{\mu\nu} = r_{ij}^{\mu\nu} - r_{ij}^{\nu\mu}$    (4-22)










$t_{Ij}^{\mu\nu} = \frac{r_{Ij}^{\mu\nu}}{f_{II} + f_{jj} - f'_{\mu\mu} - f'_{\nu\nu}}$    (4-23a)

$t_{ij}^{\mu\nu} = \frac{r_{ij}^{\mu\nu}}{f_{ii} + f_{jj} - f'_{\mu\mu} - f'_{\nu\nu}}$    (4-23b)

$f'_{\mu\mu} = \sum_{ad} c_a^{\mu} f_{ad}\, c_d^{\mu}$    (4-24)


The R1 amplitudes must be transformed in the same way in order to generate the

next set of T1 amplitudes. One PE should do this since the transformation is trivial

(Equation 4-24).

$t_I^A = \frac{r_I^A}{f_{II} - f_{AA}}$    (4-24)


Table 4-12 lists the operations of the Heavy Platoon for antisymmetrizing and

transforming R2. The code listing that follows shows the operations of the Amplitude

Company from the end of the OP phase up to the next CC iteration.

if (HVY_PLT) then
    make T2
end if
if (CO_CDR) then
    make T1
    BROADCAST(T1, CO_NET)
else
    RECEIVE(T1, CO_NET)
end if
if (not HVY_PLT) then
    GET T2
end if

The order of the loop nesting, L-D versus D-L, does not affect the results, but the

performance of the loops might differ from one computer architecture to the next. The L-

D ordering issues multiple GETs to the same PE (down the L column) before cycling up

to the next column. If every PE begins with L=1, then the first PE in the platoon will be









bombarded with the GETs of every PE at once. Normally, GET operations do not require

exclusive locks, in which one PE is allowed to access the data at a time, but the efficiency

of the messaging subsystem might be a factor in performance. This would be less

important in the D-L ordering.

Another way in which the messaging subsystem could affect performance is by

"remembering" which PE was accessed last. In this situation, once a GET is established

between two PEs, any successive GET to the remote PE is less expensive than switching

to another PE. This would actually favor the L-D ordering and penalize the D-L ordering.

In either case, the code can be trivially designed to switch between the two depending on

which architecture the program is running on.
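A sketch of how the traversal order could be selected at run time is shown below; the flag name and the callback interface are illustrative only.

/* Sketch: choose L-D or D-L traversal of the LD shell-pair grid at run
 * time so the GET pattern can be matched to the messaging subsystem. */
void traverse_ld(int n_l, int n_d, int l_outer,
                 void (*visit)(int L, int D))
{
    if (l_outer) {                       /* L-D ordering: L is the outer index */
        for (int L = 0; L < n_l; ++L)
            for (int D = 0; D < n_d; ++D)
                visit(L, D);
    } else {                             /* D-L ordering: D is the outer index */
        for (int D = 0; D < n_d; ++D)
            for (int L = 0; L < n_l; ++L)
                visit(L, D);
    }
}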










Table 4-2. The Direct Integral phase prologue memory map. Each operation is implicitly
summed over all shared indices. The G column of V corresponds to the set of Coulomb
integrals, and the H column is the set of antisymmetrized (Coulomb-Exchange) integrals.
The values w, r, and i stand for write, read, and increment, respectively. Shaded cells are
involved in pending communications and may not be read from or written to (without
risking data inconsistencies). Temporary arrays (W2) are outlined to show their scope of
dependence.








Table 4-3. The Direct Integral phase inner loop memory map. The R2(1) buffer is the
active buffer, and the R2(2) buffer is the access-forbidden buffer from the previous L
shell. The loop starts with L=R, continues up to Nshells, cycles back to L=1, and
continues up to L=(R-1).























Table 4-4. The Direct Integral phase permutation epilogue memory map
    if (perm == 8) then
        IR += f(V, C, T1)
        WAIT R2
    else
        GET next T2(RS)
        IR += f(V, C, T1)
        permute V
    end if
One of the two R2 buffers will be access-forbidden. The last permutation does not need to
GET another T2(RS) block; otherwise, the GET should be issued as soon as possible. If
the permutation index is 4, then a different exchange integral will be generated directly.
The (w) cells are write-optional, meaning the operation might or might not need the
scratch memory.


Table 4-5. The Stored Integral phase Group 3 Set 1 Subset 1 memory map. During the
loop over D, the T2 buffers alternate active/inactive, but the prologue in the L loop uses
the second buffer as scratch memory. The GET in the inner loop over D requests T2(DS)
unless D=D1, in which case it requests the next T2(LR) (provided L is not L1).


Table 4-6. The Stored Integral phase Group 3 Set 1 Subset 2 memory map. The inner
loop of this subset has more operations than either of the other two subsets, so it is
calculated last and the asynchronous read is issued during the last iteration.










Table 4-7. The Stored Integral phase Group 3 Set 1 Subset 3 memory map.


Table 4-8. The Stored Integral phase Group 3 Set 2 memory map. All of the T2(LD)
reads and R2(LD) updates are local operations. If there is not enough memory for I2,
then the Ij/ij/IJ bra pairs can be processed in blocks.


Table 4-9. The One Particle phase FP-T2 (αβ) memory map
    GET T2(LS0)
    do S = S0, S1
        do L = L0, L1
            WAIT T2(LS)
            GET next T2(LS)
            do D = D0, D1
                R2 += f(FP, T2)
            end do D
        end do L
    end do S


Table 4-10. The One Particle phase FP-T2 (αβ and same-spin terms) memory map
    GET T2(RD0)
    do D = D0, D1
        do R = 1, Nshells
            WAIT T2(RD)
            GET next T2(RD)
            do L = L0, L1
                R2 += f(FP, T2)
            end do L
        end do R
    end do D


Table 4-11. The One Particle phase FH and FR local transformations
    do LD = LD0, LD1
        R2 += f(FH, T2)
        R2 += f(FR, T2)
    end do LD


Table 4-12. The R2 antisymmetrization and T2 update memory map
    GET R2(D0L0)
    do L = L0, L1
        do D = D0, D1
            WAIT R2(DL)
            GET next R2(DL)
            antisymmetrize the local block and divide by the denominators
        end do D
    end do L















CHAPTER 5
DISCUSSION AND SIMULATION

Discussion

General Comments

The parallel CCSD algorithm distributes almost every operation over the number of

PEs in the Amplitude Company. The few operations that are not distributed are

insignificant to overall performance. Specifically, the worst one-particle transformation

would be a multiplication of two NAO x NAO matrices, which scales operationally like NAO^3.

In contrast, the Direct Integral phase multiplies two matrices that scale operationally like

O^2 NAO^4. For a system with 100 electrons (50 α and 50 β) and 500 basis functions, there

are approximately 125 million operations for the serial multiply and over 309 trillion

operations for the parallel multiply. Assuming that every operation takes the same

amount of time and that the parallel multiply is 100% efficient, the maximum theoretical

speedup is 2.48 million times faster (for this system).

$\lim_{N_{proc}\to\infty} \mathrm{SpeedUp} = \frac{t_{ser} + t_{par}}{t_{ser}} = \frac{1.25\times 10^{8} + 3.09\times 10^{14}}{1.25\times 10^{8}} \approx 2.48\times 10^{6}$    (5-1)

Obviously this limit is unachievable, and there are multiple operations that slow

down the parallel portions of the CCSD algorithm. Disk I/O, network messages, and

synchronizations all become more expensive as the number of PEs increases. The two

primary techniques that help to reduce these costs are real-time load balancing and

asynchronous operations (i.e., overlapping computation and communication for both disk

reads and remote memory operations).









Load Balancing

Job boards

In this context, a job board is a global structure that any PE can write to or read

from at will. The Direct Integral phase must have a global counter that registers the next

AO shell pair waiting to be processed. PEs that finish their tasks quickly will read and

increment the counter more often than PEs that lag behind the others. Natural load

imbalance or CPU contention from other processes can cause this lag (or drift). CPU

contention frequently happens if the operating system attempts to run more compute

intensive processes than installed CPUs. Even if the parallel program executes

exclusively, when the number of PEs equals the number of CPUs the operating

system kernel itself will compete for CPU cycles and detract from overall scalability.

Job boards can relieve natural load imbalance in the Direct Integral phase, but the

Stored Integral phase cannot easily use them (if at all). The SI phase has two levels of

parallelization: the OOVV integrals are distributed over platoons and the R2 updates are

distributed over PEs within a platoon. Any attempt to balance the load in real time will

incur multiple synchronization delays since the PEs in a platoon have close to but not the

same number of operations. Instead of real-time load balancing, the AO shells are

grouped together to balance the work needed to process each shell.

Data aggregation

All uses of the word "shell" that correspond to CC data processing refer to

collections of physical AO shells, which are the smallest entities that direct integral

generation programs can calculate. Since the number of operations scales like O2N4, the

difference between the smallest and largest physical shell is a direct measure of the

imbalance in computations per message. The hydrogen STO-3G basis set has 1









contracted orbital in the s shell. Larger basis sets will typically run up to about 16 orbitals

per shell (although the iodine PBS basis has 55 orbitals in the d shell). The hydrogen

shell and the average shell are computationally uneven by a factor of 65536 for the DI

phase. If the messages pertaining to these two shells cost the same, which is a fair

assessment since message latency is an order of magnitude more important than message

size, then one PE will have 65536 times more work to do than another. Simply grouping

as many physical shells together as possible might not be the best solution since it is the

size variability that affects load balancing.

The parallel CCSD algorithm, before running the first iteration, will examine the

collection of physical AO shells and search for the set of logical shells that minimizes the

variance of shell sizes. This is a bounded two-dimensional search along the dimensions

MaxShellSize and NLogShells. The maximum number of logical shells is simply the number

of physical shells. The fewest number of logical shells is 1, and the size of that shell is

equal to the sum of all of the physical shell sizes; however, given a maximum shell size,

the fewest number of shells is defined in Equation 5-2.



$N_{LogShells} \ge \left\lceil \frac{\sum_{s} N_{orb}(s)}{\mathrm{MaxShellSize}} \right\rceil$    (5-2)


Physical AO shells should not be broken up since generating direct integrals costs

less when orbitals of the same angular momentum and atomic center are calculated at

once. Equation 5-2 is actually an inequality because there might not be a perfect grouping

of physical shells such that every logical shell has the same size. The search algorithm

loops over size limits (MaxShellSize) starting with the largest physical shell size and

ending with a predefined limit or the sum of all physical shell sizes, whichever is smaller.









For every potential MaxShellSize, the algorithm also loops over the number of logical

shells starting from the theoretical minimum up to the number of physical shells.

The problem of putting objects with mass (physical shells with orbitals) into groups

with a mass limit (logical shells with up to MaxShellSize orbitals) is well known in the

field of Computer Science as the Bin Packing Problem (BPP) [Cof97]. This is an NP-

hard optimization and there are many algorithms for finding solution candidates. Rather

than solve the BPP, the CC algorithm uses the Worst Fit Decreasing (WFD) off-line

algorithm limited to a known set of bins (logical shells) and measures the variance of

sizes (Appendix E). The partitioning scheme with the lowest variance should have the

best balance between computation and messaging. WFD is ideal for finding possible

partitions for this particular problem since it adds the next smallest physical shell to the

logical shell with the fewest orbitals.
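The sketch below shows a generic Worst Fit Decreasing grouping together with the size variance used to rank candidate partitions. It is a stand-in for the Appendix E routine, and it follows the usual WFD convention of placing shells in decreasing size order into the currently emptiest logical shell.

/* Sketch: Worst Fit Decreasing grouping of physical AO shells into a fixed
 * number of logical shells, plus the variance of the resulting sizes.
 * A generic stand-in, not the dissertation's Appendix E implementation. */
#include <stdlib.h>

static int desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;   /* sort sizes descending */
}

/* sizes[n_phys] are physical shell sizes; bins[n_log] receives the orbital
 * count of each logical shell.  Returns the variance of the bin sizes. */
double wfd_pack(const int *sizes, int n_phys, int *bins, int n_log)
{
    int *sorted = malloc((size_t)n_phys * sizeof *sorted);
    for (int i = 0; i < n_phys; ++i) sorted[i] = sizes[i];
    qsort(sorted, (size_t)n_phys, sizeof *sorted, desc);

    for (int b = 0; b < n_log; ++b) bins[b] = 0;
    for (int i = 0; i < n_phys; ++i) {          /* largest shell first ... */
        int emptiest = 0;
        for (int b = 1; b < n_log; ++b)
            if (bins[b] < bins[emptiest]) emptiest = b;
        bins[emptiest] += sorted[i];            /* ... into the emptiest bin */
    }
    free(sorted);

    double mean = 0.0, var = 0.0;
    for (int b = 0; b < n_log; ++b) mean += bins[b];
    mean /= n_log;
    for (int b = 0; b < n_log; ++b)
        var += (bins[b] - mean) * (bins[b] - mean);
    return var / n_log;
}

The two-dimensional search described above would call wfd_pack() for each candidate (MaxShellSize, NLogShells) pair and keep the partition with the lowest returned variance.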

Asynchronous I/O

Most parallel programs are hampered by the increasing cost of messages as the

number of PEs increases. One-sided operations allow PEs to process data while messages

are moving through the network. The two limits of this scheme are: 1) the CCSD

calculation is free because the messages take so long to arrive and 2) the program scales

almost linearly with the number of PEs because global data is available the moment it is

read. Given the complex task and data distributions, it is likely that some portions of the

algorithm will be bound by communication costs and others by computations.

Reading the OOVV integrals from disk should not be a performance bottleneck

considering the amount of computation that each PE will do while the AIO READ is

being fulfilled. The inner loops of the SI phase can become heavily dependent on T2 GET

operations if the logical shells are not large enough to saturate the retrieval with










computation. The easiest way to increase the number of operations per message is to

increase the size of a logical AO shell pair. The difficulty in doing this is that the local

scratch memory requirements scale like N4 with respect to shell size. As more scratch

memory is needed, the less heap there is for distributed data and ultimately more PEs will

be needed. The problem of balancing shell size, scratch memory, and data distribution is

not solved, but as the algorithm has been stated, the data distribution schemes can be tried

independently of the data contraction kernels. For example, if more scratch memory is

needed but there is no possibility to increase the heap or the number of PEs, then the DI

and SI phases can be further divided to allow for fewer intermediates at the cost of more

integral processing.

Synchronization Delays

An attractive feature of the parallel CCSD algorithm is that no PE is required to

stall in a wait call for another PE unless the state of their globally accessible data changes

(i.e., if R2 is transformed or T2 is updated). A barrier in message passing programs is a

routine that causes every PE to wait until all of the PEs in a communication network have

entered the barrier. Certain collective operations can function like a barrier and some

pairs of functions act as a true barrier. Reductions involve every PE sending some data to

one PE, which will combine all the data into one element. Broadcasts involve one PE

sending some data to every PE. Individually, certain PEs may continue after the operation

whether or not other PEs have finished their operations, but properly configured, a

reduction and a broadcast can together function like a barrier. No PE will proceed until it

has received data from one PE, which will not send the reduced element before receiving

the contributions from all of the other PEs.
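The sketch below shows this reduce-then-broadcast pairing with standard MPI collectives; the buffer names are illustrative.

/* Sketch: a reduction followed by a broadcast acting as a barrier.  No PE
 * leaves reduce_bcast() before the root has combined contributions from
 * every PE and redistributed the result. */
#include <mpi.h>

void reduce_bcast(double *contrib, double *global, int n, MPI_Comm comm)
{
    MPI_Reduce(contrib, global, n, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Bcast(global, n, MPI_DOUBLE, 0, comm);
    /* Equivalent in effect to MPI_Allreduce, which also synchronizes the
     * participants as a side effect. */
}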









The concept of a fence comes from the MPI-2 standard, which defines it as a

collective barrier that ensures all one-sided communications on a shared object are

complete before leaving the barrier. Synchronizations of any sort occur essentially 10

times in each CC iteration:

1. *initialize R and synchronize T before the βα DI phase → CO fence (Appendix A)
2. complete the R2(βα) contributions from the βα DI phase → PLT fence
3. complete the R2(βα)→R2(αβ) transformation into scratch → PLT fence
4. copy scratch into R2 before the αβ DI phase → PLT fence
5. *complete the R2(αβ) contributions from the αβ DI phase → CO fence
6. reduce I1 contributions onto CO_CDR → CO reduce
7. receive F1 from CO_CDR → CO bcast
8. complete the R2 transformation into scratch → PLT fence
9. copy scratch into R2 before the SI phase → PLT fence
10. complete SI and OP contributions → CO fence
11. complete the R2→T2 update → CO fence
12. receive T1 from CO_CDR → CO bcast

In the steady state of iterating over T, the first and last synchronizations can be

overlapped (i.e., point 1 is the same as point 12 in the list). Point 5 can be overlapped

with point 6 in that reducing I1 signals the end of the αβ DI phase (Table 5-1).

Collectively, the number of synchronizations does not scale with the problem size;

however, the cost of each synchronization point will increase as the number of PEs

increases.

Simulation

The million-dollar question that every parallel algorithm must answer is, "Does it

scale?" Amdahl's Law [Amd67], which neglects all process contention and performance

degradation, describes the maximum theoretical speedup for a fixed problem size running

on a fixed number of processors. The simplest expression of this law is shown in

Equation 5-3.





























$\mathrm{SpeedUp} = \frac{t_{ser} + t_{par}}{t_{ser} + t_{par}/N_{proc}}$    (5-3)


However, the parallel CCSD algorithm does not spend the same percentage of time in the

parallel parts relative to the serial parts for every molecular system. As the problem size

increases, the percent spent in the serial part decreases; thus, the parallel speedup will

increase with system size. This is loosely related to the Gustafson-Barsis Law that makes

the observation that a parallel computer can do bigger calculations relative to the serial

computer for data parallel algorithms as the number of processors increases [Gus88]. The

parallel CCSD algorithm leaves N3-type operations for one CPU (in the form of NxN

matrix multiplications). A few one-particle N5 operations and the N4 T2 update are

distributed over the largest platoon, but most N5 and all N6 operations are distributed over

the entire Amplitude Company. Furthermore, remote data is retrieved as early as possible


Table 5-1. The schedule of synchronizations (Heavy Platoon columns: T1 R1 T2 R2 Scr;
Light Platoon columns: T1 T2 R2 Scr)
DI: βα              r r i w/r      r r i w/r
R(D,M)→S(L,N)       r w            r w
S(L,N)→R(L,N)       w r            w r
DI: αβ              r r i w/r      r r i w/r
Reduce I1; Broadcast F1
R(L,N)→S(L,D)       r w            r w
S(L,D)→R(L,D)       w r            w r
SI; OP: FP-T2
OP: FP, FH, FR
antisym; 1/D        w r 0
update T2           w w 0 w
Each platoon independently transforms R2 and separates each step with a PLT fence
(dashed lines). The shaded rows correspond to collective communications across the
Company. r, w, i, and 0 stand for read, write, increment, and initialize to 0, respectively.
The largest platoon calculates T1 while the other platoons update their copies of T2.










and used as late as possible. This provides an ideal countermeasure for variable network

performance.

General Overview

A simulator is used to test a design before building a prototype or the final product.

Simulating software performance (scalability in the case of parallel CCSD) usually

involves estimating computation and data transfer times, which requires hardware

parameters such as CPU speed and efficiency; memory, disk, and network I/O latencies

and bandwidths; etc. These parameters can be measured and will be different for various

computer architectures; however, the goal of the software simulator is not to emulate

hardware. By using a variety of parameters, the simulator should be able to expose

general strengths and weaknesses of an algorithm that might need to be redesigned.

The simulator used to test the parallel CCSD algorithm constructs a parallel

network with an array of floating-point numbers that represents the clocks of the PEs.

The program assumes the identity of a PE and estimates how long it would take to

perform a set of operations. That time is added to the clock and the simulator switches to

the next PE. For the Direct Integral phase, the simulator maintains a counter that

corresponds to the next integral shell pair and always assumes the identity of the PE with

the lowest clock (i.e., the one that would increment the counter next). Once the counter exceeds the number of

shell pairs, the clocks are synchronized to the highest value and the R2 amplitudes are

transformed. The process repeats itself for the second DI phase.

Binary trees are used to estimate reductions and broadcasts. The Stored Integral and

One Particle phases do not use real-time load balancing so the simulator cycles through

each PE and adjusts the clock accordingly. Finally, the antisymmetrization and update

phases are estimated with all of the appropriate synchronizations and computations.










Specific Technologies

Local computation. Accurately estimating local mathematical operations is

essential to simulating both parallel and serial performance. Each operation involves

reading data from memory, operating on the data, and writing the results back to memory

(Equation 5-4). S is the data size, B is the memory bandwidth, Nops IS the number of

operations, C is the CPU speed, and e is an efficiency modifier. Compiler optimization,

loop ordering, and vendor-supplied mathematical libraries will affect performance so

multiple efficiencies may be needed to estimate different parts of the algorithm.

$t_{comp} = \frac{S_{in}}{B_{rd}} + \frac{S_{out}}{B_{wr}} + \frac{N_{ops}}{C \cdot e}$    (5-4)


Asynchronous I/O. Disk operations can be estimated with latency, bandwidth, and

the number of channels (Equation 5-5). Latency is usually the time it takes for a disk

head to move to the appropriate position on the disk, Nch is the number of dedicated I/O

channels, and Nmsg IS the number of simultaneous disk operations. Disk contention in the

simulator is estimated by multiplying S by the number PEs on the node (i.e., if one PE is

reading data, then another probably is as well).


$t_{rd/wr} = t_{lat} + \frac{S}{B\,\min(N_{ch}, N_{msg})}$    (5-5)

When an asynchronous call is placed, the simulator estimates how long it would

take for the data to arrive and saves the time at which it should be available. The PE must

then call a wait routine, which will compare the current time to the time the data is

useable. Depending on the difference, the simulator will increase the clock (the PE must

wait) or return silently (the data arrived while computing).
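The clock bookkeeping behind this scheme can be sketched as follows; the structure fields and the simple latency-plus-bandwidth cost model are assumptions of this illustration rather than the simulator's exact internals.

/* Sketch: the simulator's model of an asynchronous operation.  Posting an
 * operation records when its data will be ready; the matching wait either
 * advances the PE clock or returns at once. */
typedef struct {
    double clock;        /* simulated time of this PE            */
    double data_ready;   /* when the posted transfer will finish */
} sim_pe;

/* Post a disk read or one-sided message of s bytes. */
void sim_post(sim_pe *pe, double s, double t_lat, double bw, int n_ch, int n_msg)
{
    double eff_bw = bw * (n_msg < n_ch ? n_msg : n_ch);
    pe->data_ready = pe->clock + t_lat + s / eff_bw;
}

/* Wait: if the data is not there yet, the PE stalls; otherwise it arrived
 * while the PE was computing and the wait costs nothing. */
void sim_wait(sim_pe *pe)
{
    if (pe->data_ready > pe->clock)
        pe->clock = pe->data_ready;
}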









One-sided communication. Sending and receiving messages through a network is

very similar to writing and reading from disk. If two-sided messaging is used, then the

clocks of the origin and target must be modified. Blocking messages require the origin

and target to have the same clock, while non-blocking messages allow for drift. The

parallel CCSD algorithm only uses non-blocking one-sided messages, which are very

similar to I/O operations in that only the origin is affected by the time it takes for data to

arrive. The simulator allows for on-node and off-node messaging parameters so that PEs

in the same squad can communicate at different speeds than those on different nodes.

Some MPI implementations require copying data into and out of system buffers, so

memory bandwidth might also affect the time.

Collective communication. Barriers, reductions, and broadcasts are simulated with

binary trees (Figure 5-10), and the number of stages depends on the number of PEs

involved in the operation (Equation 5-6). (Ceiling brackets mean round up to the nearest

integer.) Equation 5-7 shows the number of messages being sent in each stage i.

$N_{stages} = \left\lceil \log_2 N_{PE} \right\rceil$    (5-6)

$N_{msg}^{(i)} = \min\!\left(N_{PE} - 2^{i-1},\; 2^{i-1}\right)$    (5-7)


Figure 5-10. The binary tree approach to collective communications. 12 PEs require 4
stages, and the last stage only has 4 messages.
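The sketch below turns Equations 5-6 and 5-7 into a simple cost estimate; charging each stage one latency plus one transfer assumes the messages within a stage proceed concurrently, which is an assumption of this illustration rather than the simulator's exact model.

/* Sketch: estimate a binary-tree reduction or broadcast from Eqs. 5-6/5-7. */
#include <math.h>

double sim_collective(int n_pe, double msg_bytes, double t_lat, double bw,
                      int *total_msgs)
{
    int n_stages = (int)ceil(log2((double)n_pe));               /* Eq. 5-6 */
    double t = 0.0;
    *total_msgs = 0;
    for (int i = 1; i <= n_stages; ++i) {
        int half  = 1 << (i - 1);
        int n_msg = (n_pe - half < half) ? n_pe - half : half;  /* Eq. 5-7 */
        *total_msgs += n_msg;          /* sums to n_pe - 1 over all stages */
        t += t_lat + msg_bytes / bw;   /* stage messages run concurrently  */
    }
    return t;
}

For 12 PEs this gives 4 stages and 1, 2, 4, and 4 messages per stage, matching Figure 5-10.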










Calibration

Memory bandwidth is measured by allocating a large heap and then flushing it with

data and reading the data back in. The average of 7 sweeps is used for the memory

reading and writing bandwidth. Henceforth, all data will pertain to an IBM POWER3

workstation with four 375 MHz CPUs. The following output was taken from a calibration

run:

512 MB heap at 0x3348f518
wr was 1.924009 -> 266.11 MB/s; rd was 0.623147 -> 821.64 MB/s
wr was 1.924412 -> 266.06 MB/s; rd was 0.623328 -> 821.40 MB/s
wr was 1.922241 -> 266.36 MB/s; rd was 0.623201 -> 821.56 MB/s
wr was 1.922671 -> 266.30 MB/s; rd was 0.623183 -> 821.59 MB/s
wr was 1.922141 -> 266.37 MB/s; rd was 0.623107 -> 821.69 MB/s
wr was 1.922495 -> 266.32 MB/s; rd was 0.623170 -> 821.61 MB/s
wr was 1.922200 -> 266.36 MB/s; rd was 0.623147 -> 821.64 MB/s
bw_memwr = 266.27 MB/s (279201344.00 B/s)
bw_memrd = 821.59 MB/s (861497664.00 B/s)

CPU throughput is rarely the clock speed of the processor so efficiency modifiers

are used to estimate the time it would take to compute a set of operations. The two

primary mathematical constructs that the CCSD algorithm uses are two-dimensional

matrix multiplications and N6 array contractions. Many vendors sell highly optimized

linear algebra libraries that are tuned for specific computer architectures. Using these

libraries for standard matrix multiplies can dramatically increase performance if the

program is dependent on such operations. The simulator measures the efficiency of the

double precision general matrix multiply (DGEMM) by multiplying two 625x625

matrices. The matrices are then recast as two 25x25x25x25 four-dimensional arrays and a

hand-coded six-loop array contraction is timed, wherein the array contraction performs

the same operations as the two-dimensional matrix multiply. The following results were

taken from a calibration run:

CALIBRATION FOR POWER 375 MHz CPU
(2 fma units per CPU)
750000000 fma/s reference

( 625 x 625 ) x ( 625 x 625 )
dgemm took .387582 s -> 88.1849% eff
dgemm took .384157 s -> 89.0108% eff
dgemm took .384016 s -> 89.0451% eff
dgemm took .383978 s -> 89.0543% eff
dgemm took .384088 s -> 89.0276% eff
dgemm took .384009 s -> 89.0468% eff
dgemm took .384046 s -> 89.0378% eff
loops took 1.785683 s -> 18.4198% eff
loops took 1.786396 s -> 18.4123% eff
loops took 1.785581 s -> 18.4208% eff
loops took 1.785778 s -> 18.4188% eff
loops took 1.786151 s -> 18.4149% eff
loops took 1.785726 s -> 18.4193% eff
loops took 1.785444 s -> 18.4223% eff
dgemm eff = 0.889153
loops eff = 0.184183
vendor is 4.83 times more efficient

The results are then compared to another set of larger multiplications and contractions to
test for predictability. The simulator uses the latter set of parameters if the results are
precise. The following results were generated using the previous data:

( 30 x 30, 40 x 40 ) x ( 40 x 40, 40 x 40 )
DGEMM: real:  3.549262 s (est:  3.53) eff:        .8892
DGEMM: real:  3.547543 s (est:  3.53) eff:        .8892
DGEMM: real:  3.547254 s (est:  3.53) eff:        .8892
LOOPS: real: 22.458201 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.448650 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.446517 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.450999 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.447324 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.445928 s (est: 16.76) 6-loop eff: .1373
LOOPS: real: 22.449582 s (est: 16.76) 6-loop eff: .1373
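The structure of this calibration can be pictured with the C sketch below: a CBLAS DGEMM on two 625x625 matrices is timed, then a hand-coded six-loop contraction over the same data viewed as 25x25x25x25 arrays. The use of CBLAS, the initialization values, and the efficiency definition (fma count over the 750 Mfma/s reference) are assumptions; this is not the simulator's calibration code.

/* Sketch of the compute calibration: vendor DGEMM versus a hand-coded
 * six-loop contraction that performs the same multiply.  Link with the
 * vendor BLAS or -lcblas. */
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 625           /* matrix dimension                     */
#define D 25            /* N = D*D for the four-dimensional view */
#define FMA_RATE 750e6  /* reference fma/s (2 units x 375 MHz)  */

static double now_s(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    double *A = calloc((size_t)N * N, sizeof(double));
    double *B = calloc((size_t)N * N, sizeof(double));
    double *C = calloc((size_t)N * N, sizeof(double));
    if (!A || !B || !C) return 1;
    for (long i = 0; i < (long)N * N; ++i) { A[i] = 1e-3 * i; B[i] = 2e-3 * i; }

    double t_ideal = (double)N * N * N / FMA_RATE;   /* ideal time at peak fma */

    /* vendor library: C = A * B (row-major 625x625) */
    double t0 = now_s();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);
    double t_dgemm = now_s() - t0;

    /* six-loop contraction: C[ij,kl] = sum_{mn} A[ij,mn] * B[mn,kl],
     * i.e. the same multiply with each index split into two of length 25 */
    t0 = now_s();
    for (int i = 0; i < D; ++i)
    for (int j = 0; j < D; ++j)
    for (int k = 0; k < D; ++k)
    for (int l = 0; l < D; ++l) {
        double sum = 0.0;
        for (int m = 0; m < D; ++m)
        for (int n = 0; n < D; ++n)
            sum += A[(i*D + j) * N + (m*D + n)] * B[(m*D + n) * N + (k*D + l)];
        C[(i*D + j) * N + (k*D + l)] = sum;
    }
    double t_loops = now_s() - t0;

    printf("dgemm %.6f s -> %.2f%% eff\n", t_dgemm, 100.0 * t_ideal / t_dgemm);
    printf("loops %.6f s -> %.2f%% eff\n", t_loops, 100.0 * t_ideal / t_loops);
    free(A); free(B); free(C);
    return 0;
}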


Disk latency and bandwidth are difficult to measure in an isolated environment.
Most operating systems use aggressive file caching, which can have a noticeable effect
on performance curves that compare message size to bandwidth. The parallel CCSD
algorithm is somewhat forgiving of exact performance because all disk reads are
asynchronous; nonetheless, calibrating the I/O performance attempts to bypass the CPU
and memory caches.

The benchmarking program iozone [Nor02] was used to measure the disk latency
and bandwidth, and all files were created with the descriptors O_DIRECT (direct I/O),
O_SYNC (synchronous writes), and O_RSYNC (synchronous reads). Comparing file
record length to the time it took to read a random record produced data that were used to

describe the local disk performance (Figure 5-11 and Table 5-2).



Figure 5-11. The local disk random read calibration data. The sets of data are almost
identical for every file size between 2 MB and 512 MB.


Table 5-2. The latencies and bandwidths of random disk reads with respect to file size
File size    Brd    tlat      File size    Brd    tlat      File size    Brd    tlat
2048         29.7   3.5       16384        30.4   3.5       131072       31.2   4.3
4096         30.4   4.0       32768        30.5   2.9       262144       31.5   4.9
8192         30.0   3.2       65536        31.1   4.0       524288       31.4   4.8
File sizes are in kB, bandwidths (Brd) are in MB/s, and latencies (tlat) are in ms. For each
file size, the bandwidth is the inverse of the slope of the linear regression of record
lengths vs. time to read, and the latency is the intercept. (The minimum R^2 was 0.999.)
The average bandwidth and latency for all 9 sets were 30.7 MB/s and 3.9 ms,
respectively.
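Extracting the bandwidth and latency is a straightforward linear least-squares fit of read time against record size: the bandwidth is the inverse of the slope and the latency is the intercept. The C sketch below shows the arithmetic on illustrative data points (not the measured iozone records) chosen to give a bandwidth near 30.7 MB/s.

/* Sketch of the fit time = S/B + t_lat used to describe the local disk. */
#include <stdio.h>

int main(void)
{
    /* record sizes (MB) and times to read one random record (ms);
     * illustrative values only, not the measured iozone data        */
    double S[] = {  1.0,  2.0,   4.0,   8.0,  16.0 };
    double T[] = { 36.0, 69.0, 134.0, 264.0, 525.0 };
    int n = 5;

    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; ++i) {
        sx += S[i]; sy += T[i]; sxx += S[i] * S[i]; sxy += S[i] * T[i];
    }
    double slope     = (n * sxy - sx * sy) / (n * sxx - sx * sx); /* ms per MB */
    double intercept = (sy - slope * sx) / n;                     /* ms        */

    printf("bandwidth = %.1f MB/s\n", 1000.0 / slope);  /* inverse of the slope */
    printf("latency   = %.1f ms\n", intercept);
    return 0;
}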

Measuring network messaging performance was kept relatively simple. The

mpptest program, which ships with the MPICH library [Gro96], was used to measure the

point-to-point (P2P) and bisection bandwidths for messages ranging from 16 to 64 kB.

P2P involves only two PEs, while bisection involves all PEs. Figure 5-12 shows the on-

node P2P and bisection bandwidths measured on the IBM workstation. Latency is taken

from the intercept of the linear regression, not the zero-size transfer time. The regression

analysis for both curves suggests that both pairs of CPUs can communicate without

degradation, so the simulator set the number of on-node communication channels to 2
and used the approximate values of 210 µs and 99 MB/s for latency and bandwidth,
respectively. Three off-node network architectures were tested: switched Fast Ethernet
and two 150 MB/s full-duplex, redundant-path IBM SP switches (with different
configurations). Only P2P tests were used to measure off-node latency and bandwidth
(Figure 5-13), and Equation 5-5 was used to estimate degradation.


Figure 5-12. The calibration data for on-node messaging parameters. Bandwidth is the
inverse of the regression slope, and latency is the intercept. The regression
equations are P2P: t=9.9477*S+210.69 (R^2=0.9973) and bisection:
t=9.8093*S+209.7 (R^2=0.9982). The performance does not change with one
or two pairs of CPUs communicating at the same time, which implies two
independent messaging channels.




Figure 5-13. The calibration data for off-node messaging parameters. The regression
equations are Fast Ethernet: t=86.987*S+553.42 (R^2=0.9998); SP switch 1:
t=15.304*S+265.18 (R^2=0.9915); and SP switch 2: t=11.942*S+180.88
(R^2=0.9970). Bandwidths are 11.2, 63.8, and 81.8 MB/s, respectively.









As in the case of asynchronous I/O, most messages in the parallel CCSD algorithm

are one-sided and will hide behind computation. If the amount of computation is less than

the estimated time spent in communication, then a more accurate messaging function

should be used.

Theoretical Examples

With reasonable hardware parameters and timing algorithms, simulations of real

molecular systems can be used to estimate the performance of the parallel CCSD

algorithm. Table 5-3 lists the 20 amino acids and the corresponding numbers of electrons

and spin pairs. Also listed are the sizes of the open-shell OOVV integrals--and thus the

minimum disk requirements (with 1 GB equal to 1024^3 bytes)--and the minimum

number of PEs with 1 GB of heap needed to calculate the UHF CCSD energy if each

system had 1024 AO basis functions. Energy calculations of the anion, cation, and

zwitterion are certainly achievable with the resources of current networks of workstations

and clusters of SMP machines.

For 1024 AO basis functions, the size of the two-electron integrals is 8 TB.

Approximately one-eighth of these need to be calculated on the fly since the others can

be generated by permutations; however, assuming the integrals are dense, the full 8 TB

array must be generated and processed twice (once for Pα and once for Pβ). Systems of

this size should be executed on hundreds if not a thousand PEs or more. The initial

parallelized software cannot scale this high, much less store the amount of data needed by

each node. The parallel CCSD algorithm overcomes this with fine-grain task distribution

and distributing large arrays over as few PEs as possible. The net effect is that large

systems can be calculated with few resources at the expense of computing time.
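The 8 TB figure and the one-eighth estimate follow directly from the basis size; the short C program below reproduces the arithmetic for a dense N^4 array of 8-byte integrals and the roughly eight-fold permutational symmetry of the AO integrals.

/* Reproduce the integral-storage arithmetic quoted above for 1024 AOs. */
#include <stdio.h>

int main(void)
{
    double n = 1024.0;                       /* AO basis functions          */
    double full   = n * n * n * n * 8.0;     /* bytes in a dense N^4 array  */
    double unique = full / 8.0;              /* ~1/8 by permutation symmetry */

    printf("full  : %.1f TB\n", full   / (1024.0 * 1024 * 1024 * 1024));
    printf("unique: %.1f TB\n", unique / (1024.0 * 1024 * 1024 * 1024));
    return 0;
}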
































































Table 5-3. The minimum resources needed to compute the UHF CCSD energies of the
amino acids with 1024 AO basis functions

Amino acid        C   H   N   O   S    ne    noo    Min PEs   Disk (GB)   Disk savings
Alanine           3   7   1   2   0    48    1128      27        8.8        99.92%
Cysteine          3   7   1   2   1    64    2016      48       15.8        99.86%
Aspartic acid     4   7   1   4   0    70    2415      57       18.9        99.84%
Glutamic acid     5   9   1   4   0    78    3003      71       23.5        99.79%
Phenylalanine     9  11   1   2   0    88    3828      90       29.9        99.73%
Glycine           2   5   1   2   0    40     780      19        6.1        99.95%
Histidine         6   9   3   2   0    82    3321      78       25.9        99.77%
Isoleucine        6  13   1   2   0    72    2556      60       20.0        99.83%
Lysine            6  14   2   2   0    80    3160      75       24.7        99.78%
Leucine           6  13   1   2   0    72    2556      60       20.0        99.83%
Methionine        5  11   1   2   1    80    3160      75       24.7        99.78%
Asparagine        4   8   2   3   0    70    2415      57       18.9        99.84%
Proline           5   9   1   2   0    62    1891      45       14.8        99.87%
Glutamine         5  10   2   3   0    78    3003      71       23.5        99.79%
Arginine          6  14   4   2   0    94    4371     103       34.1        99.69%
Serine            3   7   1   3   0    56    1540      37       12.0        99.90%
Threonine         4   9   1   3   0    64    2016      48       15.8        99.86%
Valine            5  11   1   2   0    64    2016      48       15.8        99.86%
Tryptophan       11  12   2   2   0   108    5778     136       45.1        99.59%
Tyrosine          9  11   1   3   0    96    4560     107       35.6        99.68%

The data assume the scratch memory needed by each PE is less than the amount of data
needed to store the third OOVV array during an R2 transformation. Every example saves
over 99.5% of the disk space required by a traditional MO-based CCSD algorithm.

Runtime Simulations

DCEE*

A simulation of the parallel CCSD algorithm was performed on the DCEE* system

described in Chapter 2. Two network scenarios were tested: Fast Ethernet (F.Eth) and the

faster of the two IBM SP switches (henceforth referred to as x3 sw). Each node in the

network was configured as if it were the aforementioned 4-CPU IBM workstation. All

data are compared to 100% and 50% theoretical efficiencies (both are technically linear,

but "linear" scaling commonly refers to 100% theoretical efficiency). (The parallel

algorithm requires at least two 1-GB PEs, so all data are compared to this as the reference
calculation.)










Fast Ethernet. The F.Eth simulation shows acceptable parallel performance up to

64 PEs, but beyond that, efficiency drops below 50%, which is considered to be

unacceptable (Figures 5-14 and 5-15). However, even up to 1024 PEs, the scalability

curve is still increasing. Efficiency shows normal behavior for parallel programs (starting

out high and decreasing asymptotically).







Figure 5-14. The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast
Ethernet. Xth (100%) and Xth (50%) are the theoretical speed-up curves
with 100% and 50% efficiencies, respectively. X pred. is the predicted
speed-up for the simulation. Efficiency*128 is the efficiency scaled by 128.


Beyond the standard analysis of speed-up and efficiency, users might concern

themselves with the idea of cost-to-performance. Another way of stating the idea is to ask,
"At what point does the cost of using more PEs outweigh the gain in speed-up they might
contribute?" The "cost" function is considered to be the efficiency, and the metric for
"bang per buck" is efficacy (Equation 5-8) [Gho91].

η(p) = E(p) * S(p)                                                           (5-8)

Efficiency times speed-up can have multiple maxima, but only the first maximum value

(with respect to the number of PEs) corresponds to the most cost-effective use of PEs in a










parallel run. The F.Eth simulation of DCEE* peaks at 64 PEs (η = 10.2), but the curve is

so shallow that one could argue anywhere between 32 and 96 PEs is still cost effective.
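The search for the most cost-effective PE count can be sketched in a few lines of C: compute E(p) relative to the 2-PE reference, form the efficacy of Equation 5-8, and stop at the first maximum. The speed-up values below are hypothetical placeholders, not the DCEE* data.

/* Scan efficacy eta(p) = E(p) * S(p) for its first maximum (Equation 5-8). */
#include <stdio.h>

int main(void)
{
    /* PE counts and hypothetical predicted speed-ups relative to 2 PEs */
    int    pes[]     = { 2, 8, 16, 32, 64, 96, 128 };
    double speedup[] = { 1.0, 3.8, 7.2, 12.9, 20.5, 24.0, 26.0 };
    int n = 7;

    int    best_i   = 0;
    double best_eta = 0.0;
    for (int i = 0; i < n; ++i) {
        double eff = speedup[i] * pes[0] / pes[i];  /* E(p) vs. 2-PE reference */
        double eta = eff * speedup[i];              /* Eq. 5-8                 */
        printf("%4d PEs: E = %.3f  eta = %.2f\n", pes[i], eff, eta);
        if (eta > best_eta) { best_eta = eta; best_i = i; }
        else break;                                 /* past the first maximum  */
    }
    printf("first efficacy maximum at %d PEs\n", pes[best_i]);
    return 0;
}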




Figure 5-15. The simulated performance curves for DCEE* on 2 to 128 PEs over Fast
Ethernet. The curves correspond to the same series as in Figure 5-14 with
the exception that efficiency has been scaled by 32, not 128.


SP switch. The x3 sw simulation shows much better performance (Figure 5-16).

Predicted speed-up never drops below 50% efficiency up to 1024 PEs, and efficacy has

not peaked. The simulation suggests that the parallel performance for this system

(DCEE*) is highly dependent on communication speed.




Figure 5-16. The simulated performance curves for DCEE* on an SP switch.










Comparison to serial ACES II. Efficacy is a relative metric that depends on the

machine and (molecular) system, but it can be used to compare the effectiveness of

different parallel programs running the same system on the same computer. Figure 5-17

compares the speed-up and efficacy of the new algorithm to the previous parallel CCSD

algorithm. The new algorithm scales better and is much more cost effective per PE;

however, the comparison of real times does not fare so well (Figure 5-18).



Figure 5-17. The relative performance of the initial parallelization attempt compared to
the new algorithm. The discontinuity of the previous parallel program comes
from allocating whole nodes with 16 CPUs each.


The new algorithm achieves high scalability by decomposing the equations to such

a degree that the PEs are swamped with calculations in an attempt to use as little disk

and memory as possible. The current serial ACES II CCSD program is 31.4 times faster

than the new parallel algorithm, but it also requires much more disk space and does not

scale as well. Nonetheless, Table 5-4 lists the numbers of PEs that the new algorithm

would need to match the time achieved by the initial parallelization attempt.

It is clear that the CCSD energy can be calculated quickly with few PEs for

molecular systems that have small data requirements (DCEE* used on the order of 2 GB










of disk space per node). The current ACES II program and the initial parallelization

attempt are more than adequate for reducing the time the new parallel algorithm would

take; however, the new algorithm requires very little disk space and can calculate

energies for very large systems.



Figure 5-18. The absolute performance of the initial parallelization attempt compared to
the new algorithm. The dark lines are the factors of speed-up achieved by
using the previous code compared to the reference calculation of the new
algorithm.


Table 5-4. The number of PEs needed to surpass the initial parallel performance

Previous PEs    New PEs    Estimated speed-up
      1             76            31.8
     32            409           146
     64            514           176
    128            424           151
    256            405           145
The simulated speed-up curve was fitted with a cubic regression, which was used to
predict how many PEs the new algorithm needs to finish a CC iteration in the same time
that the initial parallelization attempt did.

Serine and phenylalanine

Using the x3 sw network parameters, the simulator examined serine (Ser) and

phenylalanine (Phe), both in the cc-pVQZ basis set. Ser, with 56 electrons, had 595 AO

basis functions and Phe (88 e-) had 990 AOs. The current ACES II UHF CCSD program,










using the AO algorithm for VVVV integrals and assuming 20% of the AO integrals are

discarded, would require over 200 GB of local disk storage for Ser and over 1.5 TB for

Phe. The new parallel algorithm would need only 2.8 and 19.2 GB of distributed storage,

respectively.

Figures 5-19 and 5-20 show the simulated parallel performance of Ser and Phe. Ser

requires a minimum of 14 1-GB PEs, and Phe requires 95 PEs. In either system, parallel

efficiency never drops below 96%. With respect to tractability, a single Ser UHF CCSD

iteration is expected to take 2.8 days on 1024 PEs. Phe requires approximately 2 months

per iteration on 1024 PEs. If the efficiency curve is to be believed, then this can be

dropped to about 2 weeks on over 4000 PEs.





Figure 5-19. The performance curves for serine on 16 to 1024 PEs.






Figure 5-20. The performance curves for phenylalanine on 128 to 1024 PEs.









The simulator is only able to model local computations with high accuracy. These

are generally highly ordered, well-defined sequences of commands executing on a

consistent processor (barring process contention in the operating system). After local
computation, the performance of local disk operations is somewhat predictable, even with I/O contention

from other PEs. The least predictable performance aspect of a parallel program is

network communications. This seems to be an N! correlation problem in its own right

(never mind coupled-cluster theory), but certain models are able to predict the statistical

nature of network traffic rather well.

The current implementation of the simulator treats the network as a single, flat

interconnected fabric in which every PE can reduce network bandwidth for every other

PE. In principle, this should then predict the worst performance since multi-layered

network hierarchies favor local communications, which the platoons are supposed to

capitalize on. However, the penalty function for estimating network contention lacks the

ability to scale up with the number of PEs (i.e., using more PEs reduces the accuracy of the

predicted network effects). This should be corrected with a better model such as the

hyperbolic model for layered environments [Sto96].

At a more fundamental level, the simulator assumes all messages can be delivered

in one eager message. This means each PE can send and receive data without bothering to

divide the information into multiple messages, but more importantly, it completely

eliminates the need for the target PE to acknowledge or create space for the incoming

data. In practice, this is rarely the case, and the rendezvous protocol must be modeled.

The simulator is also limited to sharing I/O bandwidth among PEs on a single node.

Many high-performance parallel supercomputers use storage area networks to increase