A PARALLEL DIRECT OPEN-SHELL COUPLED-CLUSTER SINGLES AND DOUBLES ALGORITHM

By

ANTHONY D. YAU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2004

Copyright 2004 by Anthony D. Yau

ACKNOWLEDGMENTS

I thank my family and friends for supporting me throughout the years; my strength comes from them. I thank the faculty who have instructed me and the members of Rod Bartlett's research group. I owe particular thanks to Dr. Erik Deumens for continued motivation and guidance in scientific computing (and for teaching the first parallel programming class that I attended) and to Dr. Victor Lotrich for insights into implementing Coupled-Cluster theory. I especially thank Dr. Ajith Perera for extending so much help (and patience) to me during all of my studies throughout graduate school.

The United States Department of Defense High Performance Computing Modernization Program (HPCMP) supported all of this work. A grant under the Programming Environment and Training (PET) program (GSA Contract No. DAAD05 01C003) funded the initial parallelization effort, and the Common High Performance Computing Software Support Initiative (CHSSI) program (project CBD03) funded the research on the fine-grain parallel algorithm.

TABLE OF CONTENTS

                                                                            Page

ACKNOWLEDGMENTS  iii

LIST OF TABLES  vii

LIST OF FIGURES  viii

ABSTRACT  x

CHAPTER

1 INTRODUCTION  1

2 PARALLELIZING A SERIAL CCSD PROGRAM  6

  Serial Profile  6
  Bottleneck Categorization and Parallelization  9
    Particle-Particle Ladders  9
    Matrix Multiplication
    I/O-Dependent Routines
  Parallel Profile
  Top-Down Parallelization Considerations  17

3 SPIN-UNRESTRICTED CCSD EQUATIONS

  Group 1: T2 Virtual Orbitals Coming from V  23
  Group 2: T2 Virtual Orbitals Coming from V and T
    Set 1: One Virtual from V, One Virtual from T1  24
    Set 2: One Virtual from V, One Virtual from T2  26
  Group 3: T2 Virtual Orbitals Coming from T  27
    Set 1: Rings and PH Ladders  27
      Subset 1: AC/BD  27
      Subset 2: DB/CA  28
      Subset 3: CB/AD
    Set 2: HH Ladders from OOVV, OOOV, and OOOO  30
  Group 4: T2 into T1
  Group 5: One-Particle Transformations
    Set 1: T1 Contributions  32
    Set 2: T2 Contributions  33

4 PARALLEL ALGORITHM  35

  Virtual Parallel Computer  35
  Communication Networks
  Data Management  38
    Arrays stored in distributed memory  38
    Arrays stored on disk  42
  Parallel UHF CCSD  42
    Direct Integral Phase  44
      Design considerations  44
      Algorithm  46
    Stored Integral Phase
    One Particle Phase  55
    Antisymmetrization and T2 Update  57

5 DISCUSSION AND SIMULATION

  Discussion  69
    General Comments  69
    Load Balancing  70
      Job boards  70
      Data aggregation  70
    Asynchronous I/O  72
    Synchronization Delays
  Simulation  74
    General Overview  76
    Specific Technologies
    Calibration  79
    Theoretical Examples  83
    Runtime Simulations  84
      DCEE*  84
      Serine and phenylalanine  88

6 CONCLUSION  92

  Outlook  93
    Gradients  93
    Higher Excitation Clusters  94
    Orbital Transformations and Space Reductions
  Conclusion

APPENDIX

A ACRONYMS AND ABBREVIATIONS  97

B NOTATION AND CONVENTIONS  99

  Operators and Matrix Elements  99
  Operator Indices  99

C ALGORITHM OVERVIEW

D ANATOMY OF A SHELL PAIR BLOCK  102

E LOGICAL SHELLS AND BIN PACKING  104

LIST OF REFERENCES  110
BIOGRAPHICAL SKETCH  112

LIST OF TABLES

Table                                                                       page

2-1. The distribution of real time over high-level CCSD subroutines  8
3-1. The RHF CCSD T1 contributions  19
3-2. The RHF CCSD T2 contributions  20
4-1. The direct integral processing loop summary  48
4-2. The Direct Integral phase prologue memory map
4-3. The Direct Integral phase inner loop memory map
4-4. The Direct Integral phase epilogue memory map  62
4-5. The Stored Integral phase Group 3 Set 1 Subset 1 memory map  63
4-6. The Stored Integral phase Group 3 Set 1 Subset 2 memory map  64
4-7. The Stored Integral phase Group 3 Set 1 Subset 3 memory map  65
4-8. The Stored Integral phase Group 3 Set 2 memory map  66
4-9. The One Particle phase FP·T2αβ memory map  66
4-10. The One Particle phase FP·T2αβ (and same-spin terms) memory map  67
4-11. The One Particle phase FH and FP local transformations  67
4-12. The R2 antisymmetrization and T2 update memory map  68
5-1. The schedule of synchronization points  75
5-2. The latencies and bandwidths of random disk reads with respect to file size  81
5-3. The minimum resources for amino acids with 1024 AOs in UHF CCSD  84
5-4. The number of PEs needed to surpass the initial parallel performance  88

LIST OF FIGURES

Figure                                                                      page

2-1. The distribution of real time over primary CCSD subroutines  7
2-2. The breakdown of significant UHF subroutines  8
2-3. The column assignments for each of 4 CPUs updating R2  12
2-4. The distribution scheme for matrix multiplies over CPU threads  13
2-5. The parallel speedup for each major UHF CCSD subroutine and the CCSD executable  15
2-6. The runtime profile of parallel CCSD with a UHF reference from DCEE*  15
4-1. Communication network boundaries  39
4-7. The memory heap management for various molecular systems  40
4-2. The T2 and R2 memory layout  43
4-8. The data flow in the Direct Integral phase  49
4-3. The path of Group 3 Set 1 Subset 1 updates per OOVV integral  53
4-4. The LD shell distribution schemes  54
4-9. The data flow in the Stored Integral phase  55
5-10. The binary tree approach to collective communications  78
5-11. The local disk random read calibration data  81
5-12. The calibration data for on-node messaging parameters  82
5-13. The calibration data for off-node messaging parameters
5-14. The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast Ethernet  85
5-15. The simulated performance curves for DCEE* on 2 to 128 PEs over Fast Ethernet  86
5-16. The simulated performance curves for DCEE* on an SP switch  86
5-17. The relative performance of the initial parallelization attempt compared to the new algorithm  87
5-18. The absolute performance of the initial parallelization attempt compared to the new algorithm  88
5-19. The performance curves for serine on 16 to 1024 PEs  89
5-20. The performance curves for phenylalanine on 128 to 1024 PEs  89
D-1. The spin blocks in an AO shell pair block

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

A PARALLEL DIRECT OPEN-SHELL COUPLED-CLUSTER SINGLES AND DOUBLES ALGORITHM

By

Anthony D. Yau

December 2004

Chair: Rodney J. Bartlett
Major Department: Chemistry

Coupled-Cluster Singles and Doubles (CCSD) is a benchmark ab initio electron correlation method. The biggest factor preventing its widespread adoption for large molecular systems is that the theory scales like N^4 in data and N^6 in computation (N being the number of atomic orbital basis functions). Using simple factorization techniques and data and task parallelization, a new approach to calculating open-shell CCSD energies is available. For a molecular system with 50 electrons (25 alpha, 25 beta) and 500 AO basis functions, this approach has the potential to reduce the storage requirement of a traditional UHF CCSD algorithm in a molecular orbital basis from 630 GB to 2.3 GB. The number of operations will increase, but the natural parallelization of the data provides a clear and concise framework for parallelizing the computations.

CHAPTER 1
INTRODUCTION

Coupled-Cluster (CC) theory is an ab initio electron correlation method that can be thought of as infinite-order perturbation theory or a size-extensive approximation to the full configuration interaction (CI) analysis [Bar95]. Given a one-particle reference wavefunction Φ0, the CC wavefunction is defined as an exponential expansion of excitation clusters, T, from the occupied orbitals to the unoccupied (virtual) orbitals.
The T operator is a particle excitation operator that naturally truncates at the number of particles, P, in the system (Equation 1-2). Occupied orbitals (the hole space) are denoted with the letters i, j, k, and l, while virtual orbitals (the particle space) are denoted with the letters a, b, c, and d. Appendix A lists acronyms and abbreviations, and Appendix B summarizes index conventions and matrix elements.

    |\Psi_{CC}\rangle = \exp(T)\,|\Phi_0\rangle                                  (1-1)

    T = \sum_{p=1}^{P} T_p ; \quad
    T_p = \left(\frac{1}{p!}\right)^2
          \sum_{ij\ldots} \sum_{ab\ldots} t_{ij\ldots}^{ab\ldots}\,
          a^\dagger i\, b^\dagger j \ldots                                       (1-2)

The difficulty in calculating the full CI is that the data requirement scales like N^{2P} and the computational requirement scales like N^{2P+2}, in which N is the size of the basis set space. To retain size extensivity and computational tractability, the excitation operator is usually truncated at a small number of excitations. The most common is at double excitations, such that T = T1 + T2. As computers become more capable of massive calculations, the number of excitations can increase, but the size of the molecular system must decrease much more rapidly. Currently, the highest hand-coded truncation is at pentuple excitations (i.e., T = T1 + T2 + ... + T5) [Mus02], but efforts to automate this process are becoming more promising as distribution techniques for data and computations become available [Bau02].

Using the standard two-particle electronic Hamiltonian, the CC energy is calculated by contracting the elements of the Fock matrix (f_pq) and the two-electron Coulomb integrals with the T amplitudes (Equation 1-4). Capital orbital indices correspond to alpha spin orbitals, and lowercase orbital indices correspond to beta spin orbitals. The double-bar integrals in Equation 1-4 are antisymmetrized Coulomb-exchange integrals (Appendix B). (This derivation assumes that the reference orbitals are orthonormal.)
    E_{CC} = \langle \Phi_0 |\, H \exp(T)\, | \Phi_0 \rangle                     (1-3)

    E_{CC} = E_0 + \sum_{IA} f_{IA} t_I^A + \sum_{ia} f_{ia} t_i^a
           + \tfrac{1}{4} \sum_{IJAB} \langle IJ \| AB \rangle
             \left( t_{IJ}^{AB} + t_I^A t_J^B - t_I^B t_J^A \right)
           + \tfrac{1}{4} \sum_{ijab} \langle ij \| ab \rangle
             \left( t_{ij}^{ab} + t_i^a t_j^b - t_i^b t_j^a \right)
           + \sum_{IjAb} \langle Ij | Ab \rangle
             \left( t_{Ij}^{Ab} + t_I^A t_j^b \right)                            (1-4)

The T amplitudes are calculated by iterating over a coupled set of equations that project the Hamiltonian and wavefunction into a set of p-particle excitation spaces. The CCSD approximation truncates T at double excitations and is one of the benchmark correlated methods in quantum chemistry today. The T1 and T2 equations come from the projections in Equations 1-6 and 1-7, in which H_N is the normal-ordered Hamiltonian that does not include the reference energy. Only the beta and beta-beta spin equations are shown, but the alpha, alpha-alpha, and alpha-beta spin equations are analogous to the former.

    \langle \Phi_0 |\, \exp(-T)\, H_N \exp(T)\, | \Phi_0 \rangle = \Delta E_{CC} (1-5)

    \langle \Phi_i^a |\, \exp(-T)\, H_N \exp(T)\, | \Phi_0 \rangle = 0           (1-6)

    \langle \Phi_{ij}^{ab} |\, \exp(-T)\, H_N \exp(T)\, | \Phi_0 \rangle = 0     (1-7)

The Baker-Campbell-Hausdorff formula [Hal00] has the effect of reducing the T equations to those terms that 1) contract directly with the Hamiltonian and 2) have no more than four excitation amplitudes. The first effect is from disconnected terms being cancelled by the commutation, and the second effect is from the two-particle Hamiltonian having at most four indices from the two-electron integrals. Computationally, the diagonal elements of the Fock matrix that contract with T are grouped together, and they define the input amplitudes for the next iteration. The T2 beta-beta equations are shown in Equations 1-8 through 1-11. If T is initialized to 0, then the next iteration will produce the first-order T2 amplitudes; alternatively, T2 can be initialized to the second-order wavefunction and T1 can be set to 0.

    0 = \langle ab \| ij \rangle
      + \sum_c f_{ac} t_{ij}^{cb} + \sum_c f_{bc} t_{ij}^{ac}
      - \sum_k f_{ki} t_{kj}^{ab} - \sum_k f_{kj} t_{ik}^{ab} + \ldots           (1-8)

    0 = \langle ab \| ij \rangle
      + (f_{aa} + f_{bb} - f_{ii} - f_{jj})\, t_{ij}^{ab}
      + \sum_{c \neq a} f_{ac} t_{ij}^{cb} + \sum_{c \neq b} f_{bc} t_{ij}^{ac}
      - \sum_{k \neq i} f_{ki} t_{kj}^{ab} - \sum_{k \neq j} f_{kj} t_{ik}^{ab}
      + \ldots                                                                   (1-9)

    D_{ij}^{ab} = f_{ii} + f_{jj} - f_{aa} - f_{bb}                              (1-10)

    D_{ij}^{ab}\, t_{ij}^{ab}(\text{new}) = \langle ab \| ij \rangle
      + \sum_{c \neq a} f_{ac} t_{ij}^{cb} + \sum_{c \neq b} f_{bc} t_{ij}^{ac}
      - \sum_{k \neq i} f_{ki} t_{kj}^{ab} - \sum_{k \neq j} f_{kj} t_{ik}^{ab}
      + \ldots                                                                   (1-11)

Since the T amplitudes differentiate between occupied and virtual orbitals, it is common to group integrals into the various space combinations, such as occupied-occupied (OO), occupied-virtual (OV), and virtual-virtual (VV).
There are five such combinations for the two-electron integrals: OOOO, OOOV, OOVV, OVVV, and VVVV. If the virtual orbitals outnumber the occupied orbitals by a factor of 3, then the VVVV and AO integrals will outnumber the OOOO integrals by factors of 81 and 256, respectively. Most CC implementations calculate and store the OOOO, OOOV, and usually OOVV integrals before starting the CC iterations. Due to programming complexity, most implementations also store the OVVV integrals. The VVVV integrals are used in only one CCSD term, and how they are processed illustrates a fundamental technique that the parallel CCSD algorithm uses in almost every contraction. The data requirement scales like N^4 and the computational requirement scales like N^6, and this term is usually the rate-determining step of the CC iteration. The standard MO expression is shown in Equation 1-12. The "+=" operator is shorthand for "incremented by" (i.e., "x += y" means "x = x + y"), and the r amplitude is the residual that collects all of the contributions before dividing by the D denominator and producing the next set of T amplitudes.

    r_{ij}^{ab} \mathrel{+}= \tfrac{1}{2} \sum_{cd}
        \langle ab \| cd \rangle\, t_{ij}^{cd}                                   (1-12)

Forward transforming T into the AO basis allows the AO integrals to contribute directly to the VVVV terms. Conceptually, the AO coefficients from the reference wavefunction are "removed" from the integral and "attached" to the amplitude. The primes on the AO indices in T denote a forward transformation. This distinction must be made since another useful transformation (a backward transformation) has the same dimensionality (viz. ρσ), but the AO overlap matrix is used in the contraction in addition to the AO coefficients.

    r_{ij}^{ab} \mathrel{+}= \tfrac{1}{2} \sum_{\mu\nu} c_{\mu a} c_{\nu b}
        \sum_{\rho\sigma} \langle \mu\nu \| \rho\sigma \rangle\,
        t_{ij}^{\rho'\sigma'}                                                    (1-13)

    t_{ij}^{\rho'\sigma'} = \sum_{cd} c_{\rho c}\, c_{\sigma d}\, t_{ij}^{cd}    (1-14)

While this transformation removes the need to store the VVVV MO integrals, it increases the computational workload substantially.
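The equivalence underlying Equations 1-13 and 1-14 — contracting AO integrals with a forward-transformed T2 and then projecting the residual onto the virtual MOs reproduces the MO-basis contraction of Equation 1-12 — can be checked numerically on a toy problem. The sketch below (Python rather than the dissertation's FORTRAN, purely illustrative) uses a 2-function "basis", an orthogonal coefficient matrix so the overlap matrix is the identity, fixed stand-in data, and drops the spectator occupied indices ij:

```python
import itertools, math

n = 2  # two basis functions keep the 4-index loops tiny

# Illustrative orthogonal MO coefficient matrix C[mu][p] (orthonormal AOs assumed).
th = 0.3
C = [[math.cos(th), -math.sin(th)],
     [math.sin(th),  math.cos(th)]]

# Arbitrary fixed AO-basis two-electron integrals and MO-basis T2 amplitudes.
V_ao = [[[[0.1 * (mu + 2*nu + 3*rho + 5*sig + 1)
           for sig in range(n)] for rho in range(n)]
         for nu in range(n)] for mu in range(n)]
t_mo = [[0.7, -0.2], [0.4, 0.1]]

idx = list(itertools.product(range(n), repeat=2))

# MO-basis integrals <ab|cd> built from the AO integrals.
def v_mo(a, b, c, d):
    return sum(C[mu][a]*C[nu][b]*V_ao[mu][nu][rho][sig]*C[rho][c]*C[sig][d]
               for mu, nu in idx for rho, sig in idx)

# Route 1: contract directly in the MO basis (Equation 1-12 without the 1/2).
r_direct = [[sum(v_mo(a, b, c, d)*t_mo[c][d] for c, d in idx)
             for b in range(n)] for a in range(n)]

# Route 2: forward-transform T2 to the AO basis (Equation 1-14), contract with
# the AO integrals, then project the residual back onto the virtual MOs.
t_ao = [[sum(C[rho][c]*C[sig][d]*t_mo[c][d] for c, d in idx)
         for sig in range(n)] for rho in range(n)]
r_ao = [[sum(V_ao[mu][nu][rho][sig]*t_ao[rho][sig] for rho, sig in idx)
         for nu in range(n)] for mu in range(n)]
r_via_ao = [[sum(C[mu][a]*C[nu][b]*r_ao[mu][nu] for mu, nu in idx)
             for b in range(n)] for a in range(n)]

for a, b in idx:
    assert abs(r_direct[a][b] - r_via_ao[a][b]) < 1e-12
```

The agreement is exact (up to rounding) by linearity; the AO coefficients really are just "moved" from the integral to the amplitude.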
In fact, since each integral contributes to each amplitude, the R2 array would be updated N^4 times! However, implementing this technique presents a more optimistic view of calculating large systems. The R2 amplitudes are backward transformed from MOs to AOs before being incremented by the AO integral terms. Once the integrals have been processed, the R2 array is forward transformed into the MO basis. (Double primes mark backward-transformed AO indices.)

    r_{ij}^{\mu''\nu''} = \sum_{cd} (Sc)_{\mu c}\, (Sc)_{\nu d}\, r_{ij}^{cd}    (1-15)

    r_{ij}^{\mu''\nu''} \mathrel{+}= \sum_{\rho\sigma}
        \langle \mu\nu \| \rho\sigma \rangle\, t_{ij}^{\rho'\sigma'}             (1-16)

The method of calculating amplitude contributions in the AO basis instead of the virtual MO basis has the potential to reduce the amount of storage space needed to calculate the aforementioned VVVV contribution. This is particularly true for open-shell systems, which need to store all three spin combinations of the VVVV integrals, compared to storing one set of AO integrals. If direct integral generation is used, then the storage requirement for this term is removed completely. In either the AO or MO scenario, parallelization involves distributing the data and operations of a large matrix multiplication over the processing elements.

CHAPTER 2
PARALLELIZING A SERIAL CCSD PROGRAM

Wide-scale reformulation of computational algorithms should only be performed if currently existing formulations are inadequate, but determining whether an algorithm is inadequate depends heavily on its intended use and the computers that are meant to execute it. Any of a number of runtime parameters could contribute to low performance: compiler efficiency; net CPU speed; memory or disk storage requirements and performance; parallel scalability and efficiency; etc. In the case of general computational chemistry algorithms, how the number of operations scales with respect to the size of the chemical system will also determine whether an algorithm should be reformulated.

The ACES II Program System [ACES] is one of the more popular software packages that calculates CCSD electron correlation energies.
Although it was written to make use of CPU vectorization, parallelization is considered to be one of the most important ways to decrease the run time on modern supercomputers. Traditional methods for parallelizing existing software begin with execution profiling. Once the main bottlenecks are identified and analyzed, parallel formulations of those sections are implemented and profiled.

Serial Profile

Profiling the CC code in ACES II was performed with four test systems:

* Alanine ("ala"; C3H7NO2; 48 e-) in cc-pVDZ (119 AO basis functions)
* Octane ("oct"; C8H18; 66 e-) in cc-pVDZ (202 AOs)
* β,β'-dichloroethyl ether ("DCEE"; C4H8Cl2O; 74 e-) in cc-pVDZ (146 AOs)
* DCEE radical ("DCEE*"; C4H7Cl2O; 73 e-) in cc-pVDZ (141 AOs)

The first three systems had an RHF reference, the fourth had UHF, and no spatial point group symmetry was used for any system. All of the systems used the AO basis formulation for the VVVV*T2 contribution, which was described in Chapter 1. The profiled code used the C system function gettimeofday() to record the real-time clock before and after large code sections. Any subroutine (or group of related subroutines) that consumed more than 3% of the real time became a potential target for parallelization.

Table 2-1 lists the average percentage of the total real time for each high-level subroutine measured across the three closed-shell test systems (RHF column), and it lists the percentage for each routine in the DCEE* test system (UHF column). Routines that contributed more than 3% are in bold case. The maximum variance of any set of percentages for the closed-shell systems was 0.004, which suggests that the three profiles have a consistent and reliable distribution of times. The table entry for "~AOLAD" corresponds to the time spent in AOLAD that is not due to RDAO, even though the latter is a dependent of the former. Figure 2-1 shows the amount of the total time consumed by the major subroutines and the next most significant routine.
Figure 2-1. The distribution of real time over primary CCSD subroutines. [Bar chart of RHF and UHF percentages for T1W1AB, RNGDRV, CNTRCT, RDAO, and the next most significant routine; chart not reproduced.]

Table 2-1. The distribution of real time over high-level CCSD subroutines

  Routine    RHF     UHF        Routine     RHF     UHF
  ZERSYM     0.2%    0.1%       T1INT1      0.4%    0.2%
  T1RAAAA    1.4%    0.6%       LADAA       0.0%    0.1%
  T1RABBA    0.8%    0.6%       LADAB       1.2%    0.5%
  T1RABAB    0.8%    0.6%       AOLADLST    0.0%    0.0%
  CNTRCT     7.7%    9.8%       T2TOAO      0.7%    0.3%
  SUMSYM     0.1%    0.2%       RDAO        58.3%   65.1%
  F2TAU      0.0%    0.0%       ~AOLAD      0.6%    0.3%
  GETLST     1.1%    0.8%       Z2TOMO      0.9%    0.7%
  QUAD1      1.3%    0.5%       ZIAAO       0.1%    0.2%
  QUAD2      0.5%    0.3%       T12INT2     0.8%    1.1%
  QUAD3      0.3%    0.2%       RNGDRV      9.4%    11.0%
  MAKFME     0.1%    0.1%       SSTOx1      0.0%    0.0%
  T1W1AA     0.0%    0.6%       SUMRNG      0.2%    0.1%
  T1W1AB     3.9%    2.1%       ERNGAA      0.0%    0.0%
  T1INW2     0.1%    0.0%       ERNGAB      0.5%    0.1%
  FEACONT    0.7%    0.4%       SUMSYM      0.2%    0.0%
  FMICONT    0.3%    0.3%       E4S         0.6%    0.2%
  FMECONT    0.5%    0.2%       NEWT2       0.3%    0.2%
  T1INT2A    0.3%    0.2%       PHASE3      0.6%    0.2%
  T1INT2B    0.8%    0.5%       DIIS        2.6%    1.0%

It is clear from the table that four routines consume most of the execution time: CNTRCT, T1W1AB, RDAO, and RNGDRV. Together, they cost 79.3% and 88.1% for RHF and UHF, respectively. Figure 2-2 shows the four major subroutines and any individual routine that consumes more than 1% from the DCEE* profile.

Figure 2-2. The breakdown of significant UHF subroutines. Primary values are grouped on the left, secondary values (greater than 1%) are grouped on the right. [Chart of RNGDRV, RDAO, CNTRCT, and T1W1AB with T12INT2 and DIIS; chart not reproduced.]

Bottleneck Categorization and Parallelization

Particle-Particle Ladders

The single most time consuming routine in either RHF or UHF calculations is RDAO, which increments T2 by contributions from the VVVV integrals in the AO basis. Integral-stored schemes, such as the current implementation in ACES II, read the AO integrals into memory from disk, multiply them by the T2 amplitudes from the previous iteration, and add the product to the T2 residual amplitudes. Neglecting integral sparsity, the number of permutationally unique AO integrals is shown in Equation 2-1.
If there are at least 100 AO basis functions, then the number of unique integrals can be approximated by N_AO^4/8 with more than 98% accuracy; with 200 or more functions, the approximation is more than 99% accurate.

    N_{int} = \frac{P(P+1)}{2}, \qquad P = \frac{N_{AO}(N_{AO}+1)}{2}            (2-1)

Many AO-based implementations have a user-defined threshold for discarding AO integrals that would not have a significant contribution to the T2 residual. The integral program in ACES II stores the integrals in one long vector and discards individual integrals that are less than the threshold. For DCEE* and a threshold of 10^-14, ACES II stored almost 41.1 million integrals from a complete set of 50.1 million, which means the two-electron integrals were 18.0% sparse.

The RDAO algorithm reads in a batch of integrals and processes the integrals individually. For a particular integral, the value might contribute 1, 2, 4, or 8 times depending on the permutational symmetry of the orbitals. The RHF case only has 1, 2, or 4 contributions and post-processes the residual for the final permutation. The general structure of the algorithm can be described by the following:

    do while (more batches)
       read batch
       do for each int
          do for each perm
             do for all oo
                R2(oo,μν) += V*T2(oo,ρσ)
             end do
          end do
       end do
    end do

This algorithm is not particularly efficient since each permutation requires a vector-vector operation (R2(*,μν) += V*T2(*,ρσ)). Experience in computational algorithms suggests that matrix-matrix operations (using integral blocks) would have much higher computational throughput. For example, the RDAO update averaged 286.7 MFLOPS out of a theoretical maximum of 1500 MFLOPS on the 375 MHz IBM POWER3 CPU. A 600x600 matrix multiplication using the ESSL library (a tuned linear algebra library for IBM architectures) averages 1278 MFLOPS, which corresponds to a computational efficiency of 85.2% compared to 19.1% for RDAO.
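The counting claims around Equation 2-1 are easy to verify directly; a quick sketch (illustrative Python, function names not from ACES II):

```python
def unique_ao_integrals(n_ao):
    """Permutationally unique AO integrals, Equation 2-1."""
    p = n_ao * (n_ao + 1) // 2   # unique basis-function pairs
    return p * (p + 1) // 2      # unique pairs of pairs

def approx(n_ao):
    """The N_AO^4/8 estimate quoted in the text."""
    return n_ao**4 / 8.0

# The approximation under-counts by less than 2% at 100 AOs and less
# than 1% at 200 AOs, exactly as stated.
for n, bound in ((100, 0.98), (200, 0.99)):
    ratio = approx(n) / unique_ao_integrals(n)
    assert bound < ratio < 1.0

# DCEE* has 141 AOs; the complete integral set quoted is 50.1 million.
assert round(unique_ao_integrals(141) / 1e6, 1) == 50.1
```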
However, parallel algorithms for blockwise implementations are virtually identical to elementwise implementations, so block optimizations were ignored in the parallel strategy and could be implemented later. For any DCEE* CC iteration, 9.6 seconds out of 5952 seconds were spent reading integrals from disk (i.e., I/O takes about 0.16% of the total RDAO time). Before the multiplication loop, there was a small sorting section, which took close to 13.6 seconds (also insignificant); therefore, parallelizing the actual integral updates was expected to have the largest impact on performance.

There are multiple levels at which the current algorithm can be parallelized: batches, integrals, permutations, or rows of OO pairs. Distributing the operations over OO pairs might decrease performance since the pairs are contiguous in memory, and the cost of adding more rows is much less than the cost of adding more loops to process them. The problem with distributing permutations is that the number of permutations for a random integral is not always 8. This might introduce load-balancing problems depending on how the integrals are sorted on disk; furthermore, each integral has a limited number of tasks (permutations), which might limit overall scalability above 8 processors.

The two remaining loops (unique integrals and integral batches) are ideal for distributing over processors. The number of integrals is substantial enough to support scalability over a wide number of processors, and each processor will need to store only those integrals for which it is assigned to calculate contributions. The current ACES II implementation expands the permutations into the "unique" integral list and processes each one separately, so this expanded list was distributed over processors instead of the truly unique list.

For maximum software portability, the parallel algorithm needed to distinguish between processors on the same physical computer and processors on remote computers.
The strategy was to distribute integral batches over physical computers (nodes) and to distribute the integral list from each batch over processors on the same computer. After the nodes finished processing their batches of integrals, the R2 contributions were reduced and distributed back to all of the nodes. To minimize memory requirements on a single node, only one integral list, T2 array, and R2 array were stored, and CPU threads were created to process the integrals.

The R2 array was divided into blocks of columns (μν pairs), and each CPU was assigned to calculate all of the terms that contributed to a particular set of blocks. This technique reduced false cache conflicts (when two CPUs attempt to write to different parts of the same cache line at the same time), and it increased the probability that a CPU would have a target R2 column in cache memory, since multiple CPUs would be caching different columns. Figure 2-3 shows the CPU/column assignments in a 4-CPU computer with 4 columns per block.

Figure 2-3. The column assignments for each of 4 CPUs updating R2. In this example, there are 49 μν pairs (from 7 AOs). [Diagram not reproduced.]

For an arbitrary blocking factor (bf) and number of threads (Nth), the thread assignment (from 0) of a column index (mn) can be determined with the following FORTRAN function:

    thread_id(mn,bf,Nth) = mod(mn-1,bf*Nth)/bf

If bf*Nth is a power of 2, then the mod function can be replaced by a bitwise AND with the mask (bf*Nth)-1, and if the blocking factor is a power of 2, then the division can be replaced by bit shifting.

Matrix Multiplication

If the VVVV contributions are calculated in the MO basis, then LADAA and LADAB become the bottleneck routines. These perform large two-dimensional matrix multiplies, as do the other three bottlenecks: CNTRCT, T1W1AB, and RNGDRV. Incidentally, the routine T12INT2 (1.1% of the UHF time) is also bound by matrix multiplications.
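Returning briefly to the RDAO column assignments: the block-cyclic formula and its power-of-two shortcuts can be checked in a few lines. The sketch below mirrors the FORTRAN one-liner in Python (integer division plays the role of FORTRAN's truncating divide; names are illustrative):

```python
def thread_id(mn, bf, nth):
    # mod(mn-1, bf*Nth)/bf with truncating division; columns count from 1.
    return ((mn - 1) % (bf * nth)) // bf

def thread_id_bits(mn, bf, nth):
    # Valid only when bf and bf*nth are powers of 2: a bitwise AND with the
    # mask (bf*nth)-1 replaces mod, and a right shift replaces the divide.
    mask = bf * nth - 1
    shift = bf.bit_length() - 1
    return ((mn - 1) & mask) >> shift

# The Figure 2-3 example: 4 threads, 4 columns per block, 49 columns.
# Columns 1-4 go to thread 0, 5-8 to thread 1, ..., 17-20 wrap to thread 0.
assign = [thread_id(mn, 4, 4) for mn in range(1, 50)]
assert assign[:8] == [0, 0, 0, 0, 1, 1, 1, 1]
assert assign[16] == 0  # column 17 starts the second cycle

# The bit-trick version agrees everywhere.
assert all(thread_id(mn, 4, 4) == thread_id_bits(mn, 4, 4)
           for mn in range(1, 50))
```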
Parallelizing these routines involved distributing the columns of the output matrix over CPU threads. Since the input and output matrices were all kept in memory, the columns were divided once, unlike block cycling in the RDAO scheme, which had irregular data access patterns. Figure 2-4 shows the distribution pattern for a matrix multiply over 4 threads.

Figure 2-4. The distribution scheme for matrix multiplies (A*B = C) over CPU threads. Every thread can read the entire A matrix, but each thread is assigned a block of the B matrix and consequently writes to a unique block of the C matrix.

The multiplications were not distributed over nodes since the parallel computer that ran the profile had 16 CPUs per node. The time penalty incurred from the global reduction (after the multiplication) would have greatly lessened the gains from parallelization; however, nodes with fewer CPUs might benefit from distributing the rows of the A matrix. All nodes but one would then have to initialize the C matrix to 0 (if the operation increments preexisting data), and the results would have to be reduced and broadcast after the multiplication.

I/O-Dependent Routines

None of the major subroutines that required parallelization were heavily dependent on I/O operations; however, many parallel computers (including the one that ran the profiling calculations) have shared disk storage devices. This introduces competition between some or all of the compute nodes when each one performs file read or write operations. The DIIS routine (1% of the UHF time) is heavily dependent on I/O performance, and the rate-determining operation can be described by the following:

    do for each i
       do for each (x,y)
          B(x,y) += Tall(x,i)*Tall(y,i)
       end do
    end do

The number of rows in Tall (and the size of each dimension of B) is usually 5 but could be any value (preferably a small one). The i index spans all of the T amplitudes, both T1 and T2. Obviously the construction of a 5x5 matrix from a single 5x1 vector is negligible compared to the time spent reading from and writing to disk. This operation was not parallelized, but only one node was permitted to perform the operation.

I/O competition is a low-level problem that shared storage architectures have. In many cases, extensive software changes are required to overcome the increasing penalties from read and write delays that serial programs consider to be constant. These penalties manifest themselves in almost every routine that uses disk storage in the parallel profile.
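The DIIS loop above is a rank-1 accumulation per amplitude index: streaming the amplitude history from disk one chunk of i at a time yields exactly the same B matrix as holding everything in memory, which is why a single node can perform it while the rest skip the disk traffic. A minimal sketch (pure Python; Tall contents and the chunk size are illustrative stand-ins):

```python
# Amplitude history: 5 expansion vectors, each of length n_amp (T1 and T2
# amplitudes concatenated).
n_vec, n_amp = 5, 1000
tall = [[((v + 1) * (i + 3)) % 7 - 3.0 for i in range(n_amp)]
        for v in range(n_vec)]  # tall[v][i], deterministic stand-in data

# In-memory reference: B(x,y) = sum_i Tall(x,i)*Tall(y,i)
b_ref = [[sum(tall[x][i] * tall[y][i] for i in range(n_amp))
          for y in range(n_vec)] for x in range(n_vec)]

# Streamed version: read i in chunks (as if paging from disk) and accumulate.
b = [[0.0] * n_vec for _ in range(n_vec)]
chunk = 64
for lo in range(0, n_amp, chunk):
    hi = min(lo + chunk, n_amp)
    for x in range(n_vec):
        for y in range(n_vec):
            b[x][y] += sum(tall[x][i] * tall[y][i] for i in range(lo, hi))

assert all(abs(b[x][y] - b_ref[x][y]) < 1e-9
           for x in range(n_vec) for y in range(n_vec))
```

Since only the tiny B matrix survives the loop, the cost is entirely in reading the amplitude files, exactly as the text observes.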
Obviously the construction of a 5x5 matrix from a single 5xl vector is negligible compared to the time spent reading from and writing to disk. This operation was not parallelized, but only one node was permitted to perform the operation. I/O competition is a lowlevel problem that shared storage architectures have. In many cases, extensive software changes are required to overcome the increasing penalties from read and write delays that serial programs consider to be constant. These penalties manifest themselves in almost every routine that uses disk storage in the parallel profile. Parallel Profile The DCEE* test case was run with the ACES II CCSD program that had parallel modifications as described in the previous section. The parallel performance of the maj or subroutines is shown in Figure 25. The real time spent in each highlevel routine is shown in Figure 26. Only RDAO was distributed over every CPU in the calculation and was 20.6% efficient in the 256CPU (16xl6) calculation. The three other highlevel routines were only distributed over the 16 CPUs in each node and had average threading efficiencies of 70. 1% (CNTRCT), 25.4% (T1W1AB), and 28.8% (RNGDRV). O lxl 0 2x l6 4x l6 8xl6  16xl6 CCSD II~ II II 1~~ 1 I IIII T1INT2A T1INT2B I I I T 1INT 1 I I I LADAA II I II I LADAB T2TOAO I I I I RDAO ~AOLAD I I I Z2TOMO I I I ZIAAO I I I I I I T12INT2 RNGDRV I SUMRNG II II II II ERNGAB II II II II E4S I I I I I I I NEWT2 I I I I I I I PHASE3 I I I I I I I DIISI I I I I I 60  50  S40  S30  ,a20  10  0 CNTRCT T1W1AB RDAO (AOLAD) RNGDRV Figure 25. The parallel speedup for each maj or UHF CCSD subroutine and the CCSD executable. (AOLAD) contains the RDAO parallelization and the global reduction afterward. O 1800 3600 5400 7200 0 1800 3600 5400 7200 ZERSYM T1RAAAA T1RABBA T1RABAB CNTRCT SUMSYM GETLST QUAD1 QUAD2 QUAD3 MAKFME T1W1AA T1W1AB T1INW2 FEACONT FMICONT FMECONT Figure 26. The runtime profile of parallel CCSD with a UHF reference from DCEE*. 
The times are measured in seconds. The overall parallel efficiency of the UHF CCSD calculation on 256 CPUs was 1.8% with a speedup of 4.62. The calculation with the highest speedup was 4x16 (64) CPUs with an efficiency of 8.8%. The reason for such poor performance can be attributed to I/O contention by all of the routines that were insignificant in the serial calculation. Most of the routines listed in Figure 2-6 increase in time even though they had no parallel modifications. While further parallel optimizations could be made to the current CCSD program, two major drawbacks would still plague the parallel code: problem scalability and expensive modify-profile development cycles (i.e., trial and error). The worst aspect of the current approach is that tractability does not increase as CPUs, memory, and disks are added to the pool of resources (i.e., the largest molecule that can be studied is limited by the resources of a single node). There is no mechanism for distributing data across memory or disks; however, if the current limitations are adequate, then this model should be able to reduce the runtime to a reasonable amount. Parallelizing existing serial software usually takes the bottom-up approach, in which significant bottlenecks are parallelized and profiled. As new bottlenecks present themselves, they too are parallelized and profiled. Unfortunately, by the time this cycle ends (or becomes too expensive), so much of the software might have been rewritten that it would have been cheaper to reformulate the algorithm using a top-down parallelization approach. In the current CCSD program, no matter how well the algorithm is parallelized, the software is still limited to performing calculations on systems that are calculable by a single node.
It would be more efficient in the long term to reformulate the CCSD algorithm from the top down (reusing existing software wherever possible) and to include proper data distribution and small-task parallelization as design fundamentals.

Top-Down Parallelization Considerations

Although a few programs already calculate the CCSD energy in parallel, most of them are limited to closed-shell references [Koc96,Kob97,Per03]. The exception is the general tensor contractor of Hirata [Hir03], but the final program requires stored integrals. The goal of the top-down parallelization approach will be to produce an open-shell CCSD program capable of handling molecular systems with at least 500 AO basis functions (although the internal target is 1000) and that can scale up to a few hundred (if not a thousand) CPUs. These constraints almost necessitate calculating amplitude contributions from AO integrals and distributing amplitudes across CPUs. In addition to the VVVV integrals, the algorithm should also allow for direct generation of the VVVO integrals. The consequence of calculating amplitude contributions in the AO basis instead of the MO basis is that the virtual orbitals are never formally recognized by the equations other than to create complementary density matrices such as that shown in Equation 2-2. These matrices are very important since they define how amplitudes transition from backward- to forward-transformed relative to the MOs.

Q_{\mu\nu} = \sum_{e} c_{\mu}^{e} c_{\nu}^{e}    (2-2)

The first step of the top-down approach involves cataloging the integral contributions and identifying which output condition on the R amplitudes (forward- or backward-transformed) would be best suited for their additions. Finding a data and task distribution then becomes a matter of understanding how the integrals are processed and which terms can be grouped together given a common output condition on the residual.
CHAPTER 3
SPIN-UNRESTRICTED CCSD EQUATIONS

The T1 and T2 equations in spin-restricted Hartree-Fock (HF) CCSD have 11 and 29 contributions, respectively. If the reference wavefunction is not HF but the spin orbitals are still restricted, then the T1 and T2 equations have 14 and 31 contributions, respectively. The spin-unrestricted non-HF CCSD amplitude equations have a total of 222 contributions. Table 3-1 lists the spin-restricted T1 amplitude contributions and the number of corresponding spin-unrestricted terms. Table 3-2 does the same for the T2 contributions [Cra98]. P(pq) is the antisymmetric permutation operator (Appendix B). There are many ways to group and factorize the amplitude contributions. The efficiency of each approach largely depends on the following:

* The chemical system being examined
* The CPU speed and efficiency
* The amount of local memory per process
* Local and remote (shared or exclusive) storage capacities
* Local and remote I/O latencies and bandwidths
* Interprocess message latency and bandwidth.

For example, if the CPU speed were incredibly slow compared to the speed of the I/O reads and writes, then it might be faster to group as many subterms together as possible and accumulate the intermediate product to disk. Consider the sum X*A*B + X*C*D + X*E*F, which factorizes to X*I provided I = A*B + C*D + E*F (first accumulated on disk). In most cases, however, millions (if not billions) of CPU operations are performed before a single byte of data is retrieved from disk, so this approach is relegated to those systems whose CPU speeds are unusually slow compared to other hardware capabilities.

Table 3-1.
The RHF CCSD T1 contributions

[Table body: the 11 RHF T1 diagrammatic contributions, each listed with the number of corresponding UHF terms.]

The most common factorization strategy involves searching all possible combinations of terms from either amplitude equation and finding the set of subterms that is used most frequently. This implicitly assumes that the cost of recomputing the data is greater than the cost of storing and retrieving the data from disk. In the MO basis, this is certainly true (i.e., the amount of computation needed to generate all of the MO integrals from AO integrals is quite high), but the cost of calculating AO integrals on demand is negligible when multiple processes are in competition for shared I/O resources.

Table 3-2. The RHF CCSD T2 contributions

[Table body, first part: the RHF T2 diagrammatic contributions, each listed with the number of corresponding UHF terms.]

Table 3-2.
Continued

[Table body, remainder: the RHF T2 diagrammatic contributions, each listed with the number of corresponding UHF terms.]

The following assumptions guided the development of the factorization strategy described in this chapter:

* There are at least 3 times as many virtual orbitals as occupied orbitals for each spin
* The two-particle data are too large to store in the memory of one process
* The VVVV and VVVO integrals are not transformed and stored on disk
* The CPU operations are several orders of magnitude faster than I/O latencies
* Latency is at least an order of magnitude more significant than bandwidth
* Each process is able to retrieve (GET) and increment (ACCUMULATE) data that are stored by another process without interrupting the other process.

The first assumption follows from a study of basis sets that noted that cc-pVTZ is the minimum basis set needed to describe electron correlation accurately [Dun00]. For chlorine, there are 17 electrons and 34 contracted basis functions. This produces an almost 1:3 occupied/virtual split in molecular orbitals, since half of the electrons have alpha spin and half have beta spin. Designing the algorithm such that two-particle terms (viz., VVOO) can be distributed over PEs is necessary because the parallel program should be capable of treating systems with 50 electrons and 500 basis functions (50/500), although 100 electrons and 1000 basis functions is the target (100/1000). For 50/500 systems, the three VVOO spin combinations require 1.55 GB of memory. Considering that the algorithm needs at least two such arrays (for T2 and R2) held in core, a processor with 1 GB of memory will not be able to store the minimum amount of data without overflowing to disk storage. Furthermore, the VVVO and VVVV terms will require 59.8 and 568 GB of storage, respectively. (This assumes a conversion factor of 1024.)
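The 1.55 GB figure can be checked by counting amplitudes directly. The sketch below assumes packed antisymmetrized index pairs for the same-spin blocks and 8-byte words; both are assumptions consistent with the quoted number, not statements from the text:

```python
from math import comb

def vvoo_bytes(nocc_a, nvrt_a, nocc_b, nvrt_b, word=8):
    """Total storage for the three VVOO spin blocks (aa, bb, ab).

    Same-spin blocks are stored with packed antisymmetrized index
    pairs (a < b, i < j); the mixed-spin block is stored in full.
    """
    aa = comb(nvrt_a, 2) * comb(nocc_a, 2)
    bb = comb(nvrt_b, 2) * comb(nocc_b, 2)
    ab = nvrt_a * nvrt_b * nocc_a * nocc_b
    return (aa + bb + ab) * word

# 50 electrons / 500 basis functions: 25 occupied and 475 virtual per spin.
gb = vvoo_bytes(25, 475, 25, 475) / 1024**3   # approximately 1.55
```

The mixed-spin block dominates: it alone holds about 141 million amplitudes, more than twice the two packed same-spin blocks combined.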
The parallel algorithm should then at least allow for the possibility that the terms that use these MO integrals can be calculated from AO integrals generated on the fly. Today, the fastest disk storage devices are able to seek for random data in approximately 3.7 milliseconds (measured on a Maxtor Atlas 15k 73 GB Ultra320 SCSI hard drive). The slowest desktop CPU (to within reasonable expectations of performance) operates at 1 GHz. (The Intel Pentium 4 line of CPUs can operate at 3.4 GHz.) Assuming that most floating-point operations in CCSD require multiplications and additions and that each operation completes in one clock cycle, the maximum theoretical multiply/add (fma) throughput is somewhere around 500 million fma/s. Together, the data indicate that a slow CPU could theoretically deliver 1.85 million fma/seek, which at least supports the assumption that computation is several orders of magnitude faster than I/O latency. On the same hard drive, bandwidth bursts measure 73.7 MB/s, which suggests that 280 kB of data could be transferred during the same time it took to locate the data. The final assumption follows from the latest version of the most popular message passing library, MPI (Message Passing Interface) [Gro99]. MPI-2 allows PEs in the same communication network to retrieve (GET), increment (ACCUMULATE), and overwrite (PUT) data that are stored by another CPU without communicating with it. In addition to their own proprietary communication libraries, most vendors of parallel computers offer implementations of the standard message passing libraries (like MPI) to ensure maximum compatibility with the myriad parallel programs that depend on industry standards. The main consequences of the aforementioned assumptions are that there must be more than one process, that recomputation is better than reading from disk, and that messages should be as large as possible.
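The fma-per-seek and bytes-per-seek estimates above follow from two one-line products; the sketch below reproduces the quoted figures (binary unit conversion assumed):

```python
seek_s = 3.7e-3              # random-access seek time of the drive (s)
fma_rate = 500e6             # multiply/add throughput of a 1 GHz CPU (fma/s)
burst_bps = 73.7 * 1024**2   # burst bandwidth of the same drive (bytes/s)

fma_per_seek = fma_rate * seek_s          # ~1.85 million fma per seek
kb_per_seek = burst_bps * seek_s / 1024   # ~280 kB transferable per seek time
```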
The primary goals of this parallelization effort are scalability and addressability (i.e., the ability to treat a wide range of molecular systems), so the grouping and factorization of terms should maximize message size and minimize message frequency while minimizing the local scratch memory requirements.

Group 1: T2 Virtual Orbitals Coming from V

One of the smallest groups of terms consists of T2 contributions in which both virtual orbitals come from the two-electron integrals. Equation 3-1 shows the ββ terms; the αα terms are analogous. The VVVV integrals are generated on the fly in the AO basis, so the expression must be written in terms of AO integrals. Before doing so, the second term must be split and rearranged to allow for better factorization.

[Equations 3-1 through 3-8: the ββ and αβ contributions, first in the MO basis and then rewritten in terms of AO integrals and T1-dressed coefficients.]

Group 2: T2 Virtual Orbitals Coming from V and T

Set 1: One Virtual from V, One Virtual from T1

This set of T2 contributions is very similar to the Group 1 terms. The ββ increment is shown (Equation 3-9). Since every term antisymmetrizes the contraction with respect to ab, the permutation operator is removed and the increment is identified as non-antisymmetric. The residual amplitudes (Appendix B) will be antisymmetrized with respect to both ab and AB virtual pairs at the end of the iteration. Splitting the second term, as in Equation 3-1, is also required.
[Equations 3-9 through 3-12: the ββ increments expressed in terms of AO integrals and factorized.]

The αβ increments (Equation 3-13) are grouped by the bra-side T1 and then expressed in terms of AO integrals and factorized.

[Equations 3-13 through 3-15.]

Set 2: One Virtual from V, One Virtual from T2

The ββ set contains four T2 contributions that are ring contractions. Equations 3-16 through 3-18 show these terms and derive their AO-basis expressions.

[Equations 3-16 through 3-18.]

Each half of the αβ increments contains four ring terms and two particle-hole ladder terms. The MO and AO expressions are shown in Equations 3-19 and 3-20.

[Equations 3-19 and 3-20.]

Group 3: T2 Virtual Orbitals Coming from T

Set 1: Rings and PH Ladders

There are 31 UHF T2 terms that belong to this set. Each process loads a block of V integrals and calculates all of the contributions to the range of R2 amplitudes that it is assigned to store. Grouping and factorizing the terms for performance depends heavily on the manner in which the old T2 amplitudes are stored and accessed.
The terms were first cataloged and sorted into subsets according to spin pairs. For example, if the a index is stride 1, then a ring term contracting T2 with V must step down in A and down in B for a given cd pair in V. To minimize the number of messages, any other ring terms that have the same access pattern should be calculated at the same time. From the 31 UHF terms, three access patterns emerged. Any term that did not prefer one pattern to another was added to the pattern that had the most similar intermediate products.

Subset 1: AC/BD

Subset 1 contains 13 terms that step through T2 in the order AC/BD, which corresponds to the order of the two bra indices. More specifically, for a given CD pair of ket shells, the first (or outer) loop runs over A and the second (or inner) loop runs over B. All of the αα T2 contributions are performed in this series.

[Equations 3-21 and 3-22.]

The V notation can be ambiguous with respect to whether AO integrals are antisymmetrized. Both electrons in the terms in this section have at least one occupied MO index, so V notation is used to save equation width.

[Equations 3-23 and 3-24.]

In terms of uncontracted ket indices of V, the partial contributions reduce to Equations 3-25 through 3-27, in which each equation corresponds to a "left-side" intermediate enclosed in parentheses.

[Equations 3-25 through 3-27.]

Subset 2: DB/CA

The DB/CA subset calculates 12 terms by looping first over B and then over A.
[Equations 3-28 through 3-31: the Subset 2 ring contributions.]

The MO equations can be reduced into four equations with four "right-side" intermediates when they are expressed in terms of V with uncontracted ket indices.

[Equations 3-32 through 3-35.]

Subset 3: CB/AD

The remaining terms are primarily PH ladder terms, and the outer loop is over B. If the bra shells are thought of as a square VV matrix, then the main difference in loops between this subset and Subset 2 is that the inner loop A steps down the D column, while the inner loop A of Subset 2 steps across the C row.

[Equations 3-36 through 3-38.]

Factorizing the MO expressions in terms of right-side intermediates and uncontracted ket indices yields Equations 3-39 through 3-41.

[Equations 3-39 through 3-41.]

Set 2: HH Ladders from OOVV, OOOV, and OOOO

The ladder term that is quadratic in T2 can be calculated in one of two ways: one with a HH ladder intermediate and one with a PP ladder intermediate. Opting for the HH ladder intermediate saves a large amount of memory and memory writes. Three other terms have HH ladder intermediates from the OOOV integral and can be factorized with the quadratic term. The ββ contributions are shown (Equation 3-42).

[Equation 3-42.]

The four ladder-like terms that come from OOOO and OOOV integrals can be combined with the OOVV ladders when the ket orbitals are from the plain AO integral. Equation 3-43 shows the MO contributions.
[Equation 3-43.]

Combining the two equations, incrementing pre-antisymmetrized amplitudes, and expressing the ket indices in the uncontracted AO basis yields Equation 3-45.

[Equations 3-44 and 3-45.]

The αβ terms and the factorized increment are shown in Equations 3-46 through 3-48.

[Equations 3-46 through 3-48.]

Group 4: T2 into T1

The T2 amplitudes contribute to the R1 amplitudes in 14 terms. These terms can be grouped together such that they define three one-particle intermediate operators, which effectively reduce the particle number of the perturbation. Equations 3-49 and 3-50 show the terms that define the IP (particle) and IH (hole) intermediates, respectively. Equation 3-51 contains both spin cases of the IR (ring) intermediate.

[Equations 3-49 through 3-51.]

The terms that define the intermediates have been isolated in Equations 3-52 through 3-54. Both spin cases of the IR intermediate are shown because they both contribute to β R1 amplitudes. (The terms with the Fock matrix elements are not part of the intermediate.) The α spin cases of IP and IH are analogous to the β cases.
[Equations 3-52 through 3-54.]

Group 5: One-Particle Transformations

Set 1: T1 Contributions

The 12 remaining T1 contributions can be loosely described as one-particle transformations. The α terms are analogous to the β terms, which are shown. These are then drastically simplified with the IR intermediate, and further consolidation yields a non-HF set in Equation 3-61 and an IR set in Equation 3-62.

[Equations 3-55 through 3-62, including the Foo and Fvv one-particle intermediates.]

Set 2: T2 Contributions

Two one-particle transformations describe the final T2 ββ contributions.

[Equations 3-63 and 3-64.]

Similar to the F intermediates used to calculate the T1 contributions, the one-particle transformation matrix for T2 into R2 can be defined in terms of IP, IH, and IR, and using these definitions, the ββ and αβ increments can be trivially reduced to four terms (Equations 3-67 and 3-68).

[Equations 3-65 through 3-68.]

CHAPTER 4
PARALLEL ALGORITHM

Virtual Parallel Computer

Before any algorithm can be parallelized, the type of parallel computer that the algorithm will run on must be identified.
Is remote memory accessible through one-sided operations, or must each process explicitly interrupt and request data from another process? How much data will be needed during each section of the algorithm, and can that data be held by a single process? Does each process have the ability to read data from a storage device, and if so, what is the penalty for multiple processes reading data simultaneously? These are the types of questions that must be answered before making the assumptions that will shape the plan for parallelization.

The most basic parallel computer is a network of workstations, in which each node is a single-CPU computer with a private memory space and a storage device, usually a hard disk. The most common connection fabric for these networks is TCP/IP over Ethernet. This allows any node to communicate with any other node in the network, and the network hardware or operating system software will handle delivering the messages (as opposed to each process having to maintain a table of which nodes are directly accessible and which nodes must be reached through other nodes). The cost of maintaining these networks is not linearly proportional to their size, so it is common for 500+ node networks to remove the most failure-prone element, the hard disk, and provide other means for each node to conduct I/O operations.

The extreme opposite of this picture is a single computer with multiple CPUs, all able to access the same memory space and hard drive(s) with equal performance. Each CPU is identical to the others, so these are commonly referred to as Symmetric Multi-Processor (SMP) computers; however, there are tremendous costs associated with adding each additional CPU, so it is unusual to have more than 8 CPUs in a truly SMP machine. Some manufacturers have mini-networks within a single computer that allow them to increase the number of CPUs without making them pure SMP.
To an application programmer, these machines are programmed in the same way as Symmetric Multi-Processors, but the truer label is Shared Memory Programming (also SMP!) even though the CPUs are not identical in terms of distance to memory, cache, and possibly hard drives. Henceforth, there will be no distinction between the two meanings of SMP since the parallel CCSD algorithm will be implemented as a user process and not a system or kernel process. The balance between cost and performance lies somewhere between the two extreme types of parallel computers. The primary offerings of the major vendors of supercomputers are clusters of SMP machines. For fine-grain parallelization, each node has multiple CPUs accessing the same memory, and for coarse-grain parallelization, each node has physically separate resources. The virtual parallel computer that the parallel CCSD algorithm relies on is this type of cluster of SMP workstations; however, program portability and the ability to tune the network parameters are very desirable. The only way to build portability and tunability into the program without having divergent paths through the software is to model the physical computer in the boundaries of network communication. By properly defining certain subnetworks, no process will attempt to communicate through slow network channels when faster ones exist; furthermore, messages pertaining to a certain context can be limited to regions of the network, which might reduce network contention on some architectures. This allows the scope of communications to be set once at program run time (or compile time), and the algorithm will transparently acknowledge the underlying hardware up to the efficiency of the partitioning scheme.

Communication Networks

In some sense, parallelizing algorithms that handle massive amounts of data and computations is akin to managing disaster recovery. The moment the event occurs, there are many diverse and unused workers who have no assignment or direction.
As the situation is assessed, teams of workers are sent to handle different aspects of the recovery, from the time-critical, like search and rescue of survivors, to the time-forgiving, like public relations and cost reimbursement. Wildfire firefighters have some of the most extensive experience in rapid deployment and management of labor units. In the late 1970s, the Incident Command System (ICS) was created in response to a series of wildfire disasters in southern California [Fed98]. It has since been adopted and refined by several federal and state agencies and provides the basis for most emergency and non-emergency incident responses. The parallel CCSD algorithm borrows the following aspects of its effectiveness:

* Modular/scalable organization
* Manageable span of control
* Unit designations.

The lowest level of the organization is the single process (normally assigned to a single CPU). Every process must have its own memory heap, and the combination of a process and its memory is called a processor element (PE). PEs recognize other PEs on the same physical node and form squads (SQD). Squads can be grouped together to form platoons (PLT) based on algorithm requirements, and platoons can be grouped together to form companies (CO), which execute large-scale algorithms. Although the units were listed in increasing size, their division and recognition happen in decreasing size. The commanding process evaluates the available resources and divides the group into subgroups accordingly. For example, the Company Commander (CO_CDR) would catalog the total distributed memory and channels to disk and form platoons based on how the company algorithm uses those resources. The squads would then self-assemble within the platoons based on which node the PEs are on. Each division has its own communication space that does not interfere with messages in another division, and there are special communication spaces for the commanders of the subdivisions.
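The squad self-assembly step can be sketched as a simple grouping of PE ranks by physical node; the function and data below are illustrative, not the program's actual interfaces:

```python
def form_squads(pe_nodes):
    """Group PE ranks into squads, one squad per physical node.

    pe_nodes maps a PE's global rank to the name of its node; the
    returned dict maps each node to the sorted list of member ranks,
    with the lowest rank acting as squad leader.
    """
    squads = {}
    for rank, node in sorted(pe_nodes.items()):
        squads.setdefault(node, []).append(rank)
    return squads

# 6 PEs on 2 nodes, e.g. as reported by the runtime environment.
pes = {0: "node0", 1: "node0", 2: "node0", 3: "node1", 4: "node1", 5: "node1"}
squads = form_squads(pes)
leaders = [members[0] for members in squads.values()]
```

In an MPI-2 implementation, the same division would typically be expressed by splitting communicators, one per squad, so that each unit's messages stay within its own communication space.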
This allows platoons to synchronize independently from each other, and more importantly, it allows company algorithms to disregard the messages of other companies. The company that performs the CCSD iterations and calculates the T amplitudes is called the Amplitude Company. Figure 4-1 illustrates the communication boundaries for a company and its subdivisions.

Data Management

Arrays stored in distributed memory

The fundamental data elements involved in standard CC theory are T amplitudes, integrals, and the reference wavefunction. The predominant method for solving the T amplitude equations is iterating over a coupled set of equations for T1 and T2. This makes the amplitudes stateful at each iteration, in contrast to the two-electron integrals, which are stateless. Thus, the minimum amount of information that must be stored during the CC iterations is TOLD (T1 and T2), TNEW (R1 and R2), and any information that defines the reference wavefunction. The integrals do not need to be stored since they can be generated on the fly.
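The stateful/stateless distinction above implies a double-buffered fixed-point loop: read TOLD, accumulate into TNEW, then swap. The toy sketch below iterates a simple linear stand-in for the amplitude equations, not the CCSD equations themselves:

```python
def iterate(v, w, d, tol=1e-10, max_iter=200):
    """Solve t[i] = (v[i] + sum_j w[i][j]*t[j]) / d[i] by Jacobi iteration.

    t_old and t_new play the roles of TOLD and TNEW: contributions are
    accumulated into t_new while t_old is read, then the buffers swap.
    """
    n = len(v)
    t_old = [0.0] * n
    for _ in range(max_iter):
        t_new = [(v[i] + sum(w[i][j] * t_old[j] for j in range(n))) / d[i]
                 for i in range(n)]
        if max(abs(t_new[i] - t_old[i]) for i in range(n)) < tol:
            return t_new
        t_old = t_new
    return t_old

# Toy 2x2 system with small coupling, so the iteration converges quickly.
v = [1.0, 2.0]
w = [[0.0, 0.1], [0.2, 0.0]]
d = [2.0, 4.0]
t = iterate(v, w, d)
```

Because the new amplitudes are built only from the old ones, the two buffers never need to be valid at the same instant anywhere but locally, which is what makes the distributed one-sided accumulation safe.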
If the total distributed memory is many times larger than the amount of data that must be stored, then subdividing the Amplitude Company into platoons such that multiple copies of the amplitudes are available for reading will decrease the average distance between a PE and the distributed data that it attempts to access. In this algorithm, the number of PEs in the Amplitude Company is almost irrelevant, but the total amount of distributed memory is critical. The Company must be able to store TOLD, TNEw (R), and the larger of T2 Or the amount of scratch space needed per PE, which depends on the chemical system (Figure 47). Each PE will have its own local copy of small frequently accessed arrays, which include basis set definitions, AO coefficients, T1 and R1 amplitudes, and other oneparticle intermediates. While the PEs are calculating contributions to the R amplitudes, only T2 and R2 need to be accessed in addition to local scratch memory. Between certain sections, R2 must be transformed, which requires enough scratch memory for storing the third distributed OOVV array. Example 1 Example 2 f reserved memory + 4 T2 distributed 4 4 R2 distributed 4 .    .   m in scr < R 2 block     min scr > R2 block Figure 47. The memory heap management for various molecular systems. In one scenario, the minimum scratch memory needed for amplitude contributions is less than the amount of memory needed to store a section of R2 during a transformation. The other scenario is when the opposite occurs. Example 1. The user runs a calculation that requires 128 MB of reserved data per PE, such as basis set information, oneparticle matrix elements, etc. The calculation requires another 128 MB (minimum) of scratch memory per PE for calculating various quantities during the CCSD iteration, and the OOVV quantities require 4096 MB of distributed memory from all of the PEs. 
If the user submits the job with 1024 MB of heap per process, what is the minimum number of PEs needed to perform the calculation? Alternatively, does the calculation have enough resources if the user submits the job with X PEs, each with 1024 MB of memory?

Solution 1. The amount of non-reserved memory is 1024 - 128 MB/PE (i.e., 896 MB is available for scratch and distributed data on each PE). Three OOVV arrays require 12,288 MB. Dividing this amount by 896 MB/PE and rounding up yields a minimum of 14 PEs. Distributing each OOVV array over 14 PEs requires each PE to store approximately 293 MB of data per array. Since this amount is larger than the minimum amount of scratch memory needed (128 MB), the calculation can proceed with 14 or more PEs.

Example 2. Using the same system data as above, but increasing the minimum required amount of scratch memory to 512 MB, what is the minimum number of PEs needed to run the calculation?

Solution 2. This time, the regular scratch memory is larger than the distributed OOVV scratch array. In essence, this memory becomes unavailable for distributed data, so the effective amount of memory per PE for the OOVV arrays is (896 - 512) = 384 MB. Since the third OOVV array can fit in the scratch memory, only two OOVV arrays, totaling 8192 MB, need to be distributed over the Amplitude Company. 8192 MB divided by 384 MB/PE yields 22 PEs (rounded up). This means each PE will store approximately 187 MB of each OOVV array, and the calculation can proceed if there are at least 22 PEs.

The minimum-PE test can be described by the following pseudocode:

mem_dist = mem_heap - mem_reserved
min_PEs  = ceiling( 3 * mem_oovv / mem_dist )
mem_block = mem_oovv / min_PEs
if (mem_block < mem_minscr) then
   mem_dist = mem_dist - mem_minscr
   min_PEs  = ceiling( 2 * mem_oovv / mem_dist )
end if

If the user runs the job with more than twice the minimum number of PEs, then the Amplitude Company can divide into multiple platoons, each with its own copy of TOLD and TNEW.
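The pseudocode and both worked examples can be combined into a short executable check (a sketch with MB units assumed throughout):

```python
from math import ceil

def min_pes(mem_heap, mem_reserved, mem_minscr, mem_oovv):
    """Minimum number of PEs needed to distribute the OOVV arrays (MB units).

    If one OOVV block fits inside the per-PE scratch region, only two of
    the three arrays need distributed storage, but the scratch region is
    then unavailable for distributed data.
    """
    mem_dist = mem_heap - mem_reserved
    n = ceil(3 * mem_oovv / mem_dist)
    mem_block = mem_oovv / n
    if mem_block < mem_minscr:
        mem_dist -= mem_minscr
        n = ceil(2 * mem_oovv / mem_dist)
    return n

# Example 1: 1024 MB heap, 128 MB reserved, 128 MB scratch -> 14 PEs.
# Example 2: scratch raised to 512 MB -> 22 PEs.
```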
One of the benefits of this approach is that fewer PEs will access each block of distributed memory, but the main advantage is that knowledge of the underlying hardware can be programmed into the partitioning scheme, and the algorithm will transparently align itself with the hardware boundaries.

Arrays stored on disk

The VVVV and VVVO integrals will not be transformed and stored on disk, so their contributions must be calculated on the fly in the AO basis. The OOOO and OOOV integrals will not be stored on disk either, not because they are too large, but because their contributions can be accounted for with intermediates from the direct integrals and from the stored OOVV half-transformed integrals. The stored integrals can be distributed over and individually processed by the platoons of the Amplitude Company.

Parallel UHF CCSD

Each CC iteration begins with initializing the global intermediates and residual amplitudes to zero. The T_OLD amplitudes are defined by Equation 4-1. This formulation allows the amplitudes to contract directly with AO integrals. The T2 residual amplitudes will have different definitions depending on which part of the algorithm is executing. In the Direct Integral (DI) phase, one of the virtual indices comes from an uncontracted AO integral, so the residual is related to the MO residual amplitudes by Equation 4-2.

t_i^μ = Σ_A c_A^μ t_i^A    (4-1a)
t_ij^{μν} = Σ_{Ab} c_A^μ c_b^ν t_ij^{Ab}    (4-1b)
r_ij^{Ab} = Σ_ν c_b^ν r_ij^{Aν}    (4-2)

The Stored Integral (SI) and One Particle (OP) phases calculate residual contributions with both virtual orbitals coming from T, so the DI residual is transformed according to Equation 4-3.

r_ij^{μν} = Σ_A c_A^μ r_ij^{Aν}    (4-3)

Throughout the iteration, each PE stores all OO pairs (ket indices) for a range of AO shell pairs (bra indices). The amplitudes are distributed over the platoon (Figure 4-2), and one-sided communications retrieve and increment all OO pairs for a block of AO shell pairs.

Figure 4-2. The T2 and R2 memory layout.
Each square represents the four-dimensional array of amplitudes that have the designated spin pair and whose AO bra indices belong to the MN shell pair. The arrows represent the order in memory.

After all of the contributions have been calculated, the residual is antisymmetrized with respect to the two virtual indices and back-transformed to the MO basis. The new T amplitudes are calculated by dividing each amplitude by the diagonal elements of the Fock matrix.

Direct Integral Phase

Design considerations

Analyzing the T2 contribution in Equation 4-4 can summarize the DI phase. Since the VVVO integrals are not transformed and stored prior to the CC iterations, these terms must be evaluated in the AO basis (Equation 4-5). The biggest problem with this situation is that every AO integral causes a global update to R2 (i.e., one integral contributes to every R2 amplitude). The best way to reduce the number of operations is to precondition R2 such that a smaller set of VVVO contributions is calculated. R2 would then be post-transformed after the contributions have been accumulated (Equations 4-6 through 4-8).

r_ij^{μb} = Σ_A c_A^μ r_ij^{Ab}    (4-6)
r_ij^{μν} += Σ_{ρσ} ⟨μν|ρσ⟩ t_ij^{ρσ}    (4-7)
r_ij^{Ab} = Σ_{μν} c̄_μ^A c̄_ν^b r_ij^{μν}    (4-8)

For each integral, only an IjA strip of R2 amplitudes is incremented, compared to the full set without the backward and forward transformations of R. The difficulty with the αβ amplitudes is that while the β side can get away with 1/V updates, the α side still requires a global update per integral (Equation 4-9). The α orbital must be forward transformed into A, but the ν orbital must be transformed into b by T1, which must then be back-transformed. One solution is to calculate the two spin cases in separate passes (while transforming R between them). Although this requires generating the AO integrals a second time, the performance of a naturally parallel local computation is much greater than a global update per N^4 matrix element.
A second performance issue is the antisymmetrization of the same-spin integrals. Equation 4-10 shows one of the αβ ring contributions in MOs, and Equation 4-11 shows the terms appropriately transformed into the AO basis. The AO integrals are not antisymmetrized by the integral generator, so one of two approaches must be taken to account for both contributions. The first approach involves two updates per integral. The benefit of this method is that the PE needs only enough memory to store the AO integral batch and the T2 batch; however, two updates per integral is more costly than two reads per integral, which can be achieved by antisymmetrizing the ρσ pair. This requires either two temporary T2 blocks or two serial loads. If there is enough memory, then the exchange integral can be calculated with the regular Coulomb integral and the antisymmetrized contribution can be calculated with one read and one update. Since shell quadruplets of integrals are typically not more than a few hundred kilobytes, the algorithm will assume this can be done. The permutational symmetry of the antisymmetrized integrals is not the same as that of the plain Coulomb integrals, so they must be generated again in the middle of the loop over permutations. Although two updates or two reads are avoided, the integrals need to be calculated three times per AO shell quadruplet. Considering the cost of messages compared to the speed of locally computing a shell quad, this is a favorable tradeoff. From these considerations, it is clear that the DI phase must calculate the unique integrals six times even though the AO integrals have 8-fold permutational symmetry. There are two passes (one for αβ and one for βα), and there are three generations per pass: one Coulomb and two exchange integrals.
This might seem expensive, but the cost of doubling or quadrupling the number of messages does not scale linearly with the number of PEs and far outweighs the cost of the naturally parallel integral computation.

Algorithm

A critical aspect of this algorithm that makes or breaks the performance is one-sided communication. A one-sided GET of a T2 block can be in flight while the PE calculates terms that do not involve those T2 amplitudes. Immediately after the amplitudes have been read, the next block of amplitudes can be requested, even if the PE is not ready to use them. This technique is also used to hide the communication of R2 updates. The main difference between the two arrays is that the T2 amplitudes are read once per GET, while the R2 amplitudes are incremented two or three times per ACCUMULATE. Ideally, there would be two R2 buffers, one active and one pending, that increase the time available for each ACCUMULATE. If the user must run the calculation in limited memory, then it is still possible to benefit from one-sided ACCUMULATEs, but the PE might have to wait for a while before initializing the buffer for the next R2 contribution.

The main driver for the DI phase loops over AO shell quadruplets of integrals. The MN shell pairs are distributed over all of the PEs in the Amplitude Company. This distribution can happen a number of ways: round robin, blocked, block cyclic, job board, master/slave, etc. How a PE processes each MN shell pair is independent of the shell pair assignment, so this can be tweaked outside the main processing loop. Ideally, there would be a globally accessible integer that corresponds to the next MN pair that must be processed. The integer must be atomically read and incremented by each PE to get its next assignment.
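The global counter can be mimicked with threads standing in for PEs; in the real distributed setting an atomic fetch-and-increment primitive (e.g., MPI_Fetch_and_op) would play the role of the lock in this illustrative sketch:

```python
import threading

class JobBoard:
    """Global shell-pair counter: each PE atomically fetches its next task.

    Threads stand in for PEs here; the lock emulates the atomic
    read-and-increment on a globally accessible integer.
    """
    def __init__(self, n_tasks):
        self._next = 0
        self._n = n_tasks
        self._lock = threading.Lock()

    def next_task(self):
        with self._lock:                 # atomic read-and-increment
            task, self._next = self._next, self._next + 1
        return task if task < self._n else None  # None: no MN pairs left

board = JobBoard(n_tasks=100)
done = []

def pe_loop():
    while (mn := board.next_task()) is not None:
        done.append(mn)                  # process MN shell pair here

workers = [threading.Thread(target=pe_loop) for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(len(done), len(set(done)))         # every task claimed exactly once
```

Fast workers naturally claim more tasks than slow ones, which is exactly the self-balancing behavior described next.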
This balances the load better than a fixed distribution algorithm like round robin or block cyclic, and it performs the same function as a master process sending out new tasks without actually needing a process to sit idle waiting for requests. Once an MN shell pair is identified, the PE is responsible for generating and processing all unique RS shell pairs that correspond to the MN upper bound. This prevents any PE from processing Coulomb or exchange integrals that another PE is responsible for. It also allows for a single generation of the integrals for each pass if there is enough scratch memory (i.e., the PE would generate all RS pairs and internally antisymmetrize them before beginning the processing loop). Typically, however, there will be enough memory for two shell quads and the scratch buffers for contracting terms.

A shell quadruplet of integrals will contribute to all of the Group 1 terms, the Group 2 terms, and the Group 4 intermediates. Table 4-1 summarizes the stages of the processing loop and which elements are accessible at each stage.

Table 4-1. The direct integral processing loop summary
Section  | Update                     | Contributions
prologue | (I^H, I2) += f(V,C,T1,T2)  | all Group 1 terms and the I^H intermediate
stage 1  | R2 += f(V,C,T1,T2,I2)      | all contributions to R2 from Group 2 and Group 1
stage 2  | I^P += f(V,C,T2)           | the I^P intermediate
epilogue | I^R += f(V,C,T1)           | the I^R intermediate

The Group 1 terms are saved in a two-particle intermediate during the loop over the T1 amplitudes from the Group 2 terms. The I2 intermediate is then contracted with Q (C*C), and the Group 1 contribution is sent with the Group 2 contribution to increment R2. If there is not enough scratch memory to store the I2 intermediate, then it is possible to calculate these terms in a single direct-integral pass separately from the Group 2 terms or the I^P and I^R intermediates. Table 4-2 has the detailed order of operations for the prologue of processing a shell quad.
The w one-particle matrix element is defined in Equation 4-14.

w_i^μ = c_i^μ + Σ_A c_A^μ t_i^A    (4-14)

When the prologue finishes, there are two important quantities stored in memory: the Group 1 contribution and the T2 amplitudes that generated it. Since the Group 2 contributions loop over all L shells for a given S shell, it is convenient to begin the loop with L=R. Table 4-3 lists the detailed order of operations for the L loop. The x one-particle matrix element is defined in Equation 4-15.

x_A^μ = Σ_j c_j^μ t_j^A    (4-15)

After the last value of L, there is still a pending R2 ACCUMULATE, which must be completed. Rather than WAIT immediately after the loop, the I^R intermediate, which requires only T1, C, and V, can be calculated. Furthermore, if there are more permutations of V, then the PE can GET the next T2(RS) block and permute the integrals. Table 4-4 lists the detailed order of operations for the permutation epilogue. The first four permutations of V involve the same two sets of shell quads. The second set of four permutations requires a different exchange shell quad, so the integral generator is invoked a third time per MNRS configuration. Figure 4-8 shows the movement of data needed to calculate the contributions to R2 from an integral shell pair.

Figure 4-8. The data flow in the Direct Integral phase. Solid arrows represent one-sided messages, and dashed arrows represent reading and incrementing the global shell pair counter.

Each PE retrieves all T2 amplitudes that contract with the integral and accumulates the contributions into the R2 array. When the Amplitude Company has calculated all of the MN shell pairs, the R2 distributed arrays are transformed for the phase that processes the stored OOVV integrals. The one-particle global intermediates could be reduced here for use in the One Particle phase.
The following pseudocode lists the entire flow of control from the beginning of the CC iteration up to the start of the SI phase.

zero(R2, R1, I2)
W1 = C + T1
do spin = 2, 1, -1
    X1 = Q * (C * T1)
    SYNCHRONIZE(CO_NET)
    do while (more MN)
        do all RS
            GET T2(RS)
            G = intgen(M,N,R,S)
            H = G - intgen(M,N,S,R)
            do perm = 1, 8
                calc_Prologue()
                nLeft = Nshells
                L = R
                do while (nLeft != 0)
                    calc_InnerLoop()
                    nLeft = nLeft - 1
                    L = 1 + mod(L, Nshells)
                end do nLeft
                calc_Epilogue()
            end do perm
        end do RS
    end do MN
    SYNCHRONIZE(CO_NET)
    if (spin == 2) then
        global R2(L,D) = Q(N,D) * R2(M,N) * Q(M,L)
    else
        global R2(L,D) = R2(L,N) * Q(N,D)
    end if
end do spin
REDUCE(I2, CO_CDR)

Stored Integral Phase

The SI phase calculates all of the Group 3 terms from the stored OOVV integrals, one integral block at a time, where a block contains all spin pairs for a particular ρσ shell pair. The mechanism for obtaining integrals depends on the performance and architecture of the I/O system. If only one PE per node (or squad) is allowed to read from disk (assuming this is an operating system constraint), then the other PEs in the squad will have to request the blocks from that I/O PE. The I/O PE will then have to juggle its own calculations while acting as an I/O proxy; otherwise it could opt not to store R2 amplitudes, but then the system takes a performance hit as a whole because there is one relatively idle CPU per node. Most computers will allow any process to perform I/O and shift the performance hit of I/O contention to the program. The technique the parallel CCSD algorithm uses to lessen this penalty is asynchronous file reads. The logistics are the same as the one-sided operations in the DI phase in that each PE issues an AIO_READ immediately after the last integral has been read but before other computation has been done.
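The overlap of AIO_READ or one-sided GET with computation amounts to a double-buffered pipeline: keep one request in flight while working on the block already in hand. A minimal thread-based sketch (the helper names are illustrative stand-ins, not the actual messaging layer):

```python
from concurrent.futures import ThreadPoolExecutor

def process_blocks(blocks, get_block, compute):
    """Double-buffered pipeline: keep one fetch in flight while computing.

    get_block stands in for a one-sided GET (or AIO_READ) and compute
    for the local contraction work done on each block.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(get_block, blocks[0])      # first fetch in flight
        for nxt in blocks[1:]:
            data = pending.result()                      # WAIT on current block
            pending = pool.submit(get_block, nxt)        # prefetch the next block...
            results.append(compute(data))                # ...while computing this one
        results.append(compute(pending.result()))        # last block
    return results

print(process_blocks([1, 2, 3], lambda b: b * 10, lambda d: d + 1))  # [11, 21, 31]
```

As long as each compute step takes at least as long as a fetch, the communication cost is fully hidden.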
If asynchronous I/O operations are not available and the user elects not to designate a special I/O PE, then the algorithm can wait for the integrals up front with a blocking read, or it can synchronize the squad PEs by broadcasting the current block and waiting for all of them to finish before reading the next block. Resource competition, such as parallel blocking reads, suffers scaling penalties, meaning the cost per read increases as the number of PEs per squad increases. The upshot is that there is still the possibility that some PEs are computing while others are waiting on I/O. Locked-step algorithms, however, usually suffer far worse than resource-competition algorithms. Locked step guarantees that no one is working while a read is pending and that no one is reading while the PEs are computing. Even if the I/O PE used asynchronous reads, the locked-step approach would suffer from expensive synchronization delays compared to pure resource competition. From these considerations, each PE will be responsible for reading its next block of integrals from disk. If asynchronous reads are not supported, then the PE will default to blocking reads.

The OOVV integrals will be distributed over platoons by block-cycling the AO shell pair kets. Every PE in a platoon must read and process all of the integrals assigned to its platoon. The integrals will only contribute to the R2 amplitudes that the PE stores, but the T2 amplitudes must be read (with one-sided GETs) from other PEs in the platoon. Every shell pair of integrals results in four updates to all of the local R2 amplitudes, one for each subset of Group 3 Set 1 and one for Group 3 Set 2. Since Group 3 involves contributions in which all virtual orbitals come from T, the virtual orbitals of the integrals must be the internal summation indices. The general approach to calculating the Set 1 subsets is to loop over λ and δ for a fixed ρσ pair. Each subset corresponds to a different configuration of these nested loops.
Figure 4-3 shows the path of updates through the local set of R2 amplitudes. Table 4-5 shows the detailed order of operations for calculating Group 3 Set 1 Subset 1 (AC/BD is LR/DS in AO shells). Accessing memory in non-unit stride is highly unfavorable when performance is the utmost concern; however, each square in Figure 4-3 represents three four-index arrays such as those shown in Figure 4-2, and those updates can be performed in unit stride. Furthermore, the LD loop order is only used in Subset 1. Tables 4-6, 4-7, and 4-8 show the operations for Subsets 2 and 3 of Set 1 and for Set 2, respectively. Each steps through the LD pairs in DL order.

Figure 4-3. The path of Group 3 Set 1 Subset 1 updates per OOVV integral. Dashed lines show memory continuity and solid arrows show the loop path.

There is some flexibility in choosing the order in which the Group 3 sections are calculated. The most important design consideration is choosing which section will be performed last, since it will AIO_READ the next RS shell pair of OOVV integrals from disk. Subset 2 of Set 1 has the largest number of operations after the last integral is used, so this section should be performed last. Choosing a section to run first depends on how many operations can complete during the first GET before reading the T2 amplitudes. Subsets 1 and 3 of Set 1 and Set 2 all use the amplitudes almost immediately, but each requires a different initial shell pair. Set 2 reads only one shell pair, T2(RS), and neither shell contributes to the final R2 amplitudes, so the probability that T2 is local is inversely proportional to the number of PEs in the platoon. The subsets of Set 1 fare much better since at least one shell of T2 contributes directly to a shell of R2. If the amplitudes are distributed in bands of L, then Subset 3 has a good chance of T2 being local, and if the amplitudes are distributed in bands of D, then Subset 1 is better off executing first. Figure 4-4 illustrates the two possibilities.
Figure 4-4. The LD shell distribution schemes. Each PE stores both T2 and R2 amplitudes for its block of LD shell pairs. Set 1 Subsets 2 and 3 favor L bands, while Subset 1 favors D bands.

The parallel algorithm uses L bands since two of the three subsets favor them as far as having local T2 amplitudes per R2 update. Subset 2 is already designated to run last, so Subset 3 should be performed first because it has the highest chance of having no wait time for the initial T2 GET. The decision to run Set 2 before or after Subset 1 can be made on the same argument. Since Set 2 updates the entire set of local LD pairs after reading the T2(RS) amplitudes, it offers the most time for the initial GET of Set 1 Subset 1 to complete before reading them. Therefore, the calculation order of the Group 3 terms is Set 1 Subset 3 (S1S3), Set 2, S1S1, and S1S2. Figure 4-9 shows the flow of data in the SI phase. Each PE loops over all of the shell pairs that were assigned to its platoon and calculates only the terms that contribute to the local R2 amplitudes. The following pseudocode lists the entire flow of control from the end of the DI phase up to the start of the One Particle phase.

AIO_READ V(RS0)
GET T2(RD0)
do RS = RS0, RS1
    WAIT V(RS)
    calc_G3S1S3(GET T2(RS))
    calc_G3S2(GET T2(L0R))
    calc_G3S1S1(GET T2(SD0))
    calc_G3S1S2(AIO_READ next V(RS); GET next T2(RD))
end do RS

Figure 4-9. The data flow in the Stored Integral phase. Each PE processes all of the integrals that are stored by its platoon and updates its local block of R2.

One Particle Phase

The OP phase is responsible for contracting the Fock matrix elements of the reference wavefunction (f_pq) and the one-particle intermediates (I^P, I^H, and I^R) with the T amplitudes. There are three types of contractions: T2 into R1 (Group 4), T1 into R1 (Group 5 Set 1), and T2 into R2 (Group 5 Set 2).
Only one PE must do the T1-into-R1 contractions, provided it has the complete one-particle intermediates (it should, since the contributions were reduced at the end of the DI phase). These transformations, shown in Equation 4-16 for T1, are insignificant performance-wise, and little is gained by parallelizing them. The T2-into-R1 contributions can be distributed over the PEs in one platoon, since they are limited to local reads and local writes; however, while the T2-into-R2 terms also have local writes, they have both local and remote reads. The remote reads come from the terms in Equations 4-17 and 4-18. The ρ part of F^P is defined in Equation 4-19. (The σ part is analogous.)

r_ij^{μν} += Σ_ρ F^P_{μρ} t_ij^{ρν}    (4-17)
r_ij^{μν} += Σ_σ F^P_{νσ} t_ij^{μσ}    (4-18)

Parallelization of these particle terms should reduce the number of messages a PE must send as the number of PEs in the company increases. This implies distributing the summation index (ρ or σ) over the platoons; however, since the SI phase prefers L bands (the ρ index), it is likely that most (if not all) of the T2(RD) GETs are local, so it is not as critical to performance to distribute them over the platoons as it is for σ. For the σ loop, each PE must get data from multiple target PEs that contribute to its local R2 set (Table 4-9). Table 4-10 shows the other F^P terms. The T2-into-R2 hole terms and the T2-into-R1 terms (along with the latter F^P terms) should be performed by the largest platoon (HVY_PLT), even if that platoon has only one more PE. During this time, all of the other platoons will send their contributions via ACCUMULATE to the Heavy Platoon. Once these operations are complete, the only remaining tasks are antisymmetrizing the same-spin R2 amplitudes and updating the old T2 amplitudes with R2.
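That final update divides each residual amplitude by a Fock-diagonal energy denominator. A minimal pure-Python sketch (the function name and the tiny example values are hypothetical):

```python
def update_t2(r2, f_occ, f_vir):
    """Divide each residual amplitude r2[i][j][a][b] by the denominator
    f_ii + f_jj - f_aa - f_bb to form the next set of T2 amplitudes."""
    return [[[[r2[i][j][a][b] / (f_occ[i] + f_occ[j] - f_vir[a] - f_vir[b])
               for b in range(len(f_vir))]
              for a in range(len(f_vir))]
             for j in range(len(f_occ))]
            for i in range(len(f_occ))]

# One occupied (f_ii = -1.0) and one virtual (f_aa = 0.5) orbital:
t2 = update_t2([[[[0.6]]]], [-1.0], [0.5])
print(t2[0][0][0][0])  # 0.6 / (-1.0 - 1.0 - 0.5 - 0.5) = -0.2
```

Because every amplitude is updated independently, this step is trivially parallel over the locally stored R2 blocks.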
Table 4-11 shows the operations of the Heavy Platoon while the other platoons are sending their contributions, and Equations 4-20 and 4-21 show the terms that contribute to F^H and F^R, respectively. The following pseudocode lists the operations that occur from the end of the SI phase up to the start of the antisymmetrization and update phase.

if (CO_CDR) then
    make FP, FH, FR
    BROADCAST(F1, CO_NET)
    calc_T1_into_R1()
else
    RECEIVE(F1, CO_NET)
end if
calc_FP_T2()
if (HVY_PLT) then
    calc_other_F_T2()
else
    ACCUMULATE(R2, HVY_PLT)
end if
REDUCE(R1, CO_CDR)

Antisymmetrization and T2 Update

The αα and ββ R2 amplitudes must be antisymmetrized with respect to AO shell pairs. Equation 4-22 shows this operation. All of the R amplitudes must then be divided by the diagonal elements of the Fock matrix in order to generate the next set of T2 amplitudes. Each PE in the Heavy Platoon, having collected all of the R2 amplitudes that it is assigned to store, will antisymmetrize and transform its amplitudes into T2. The R2-into-T2 transformation is shown in Equation 4-23, and the diagonal matrix elements of the Fock matrix are shown in Equation 4-24.

r_IJ^{μν} <- r_IJ^{μν} - r_IJ^{νμ}    (4-22)
t_IJ^{AB} = r_IJ^{AB} / (f_II + f_JJ - f_AA - f_BB)    (4-23a)
t_ij^{ab} = r_ij^{ab} / (f_ii + f_jj - f_aa - f_bb)    (4-23b)
f_AA = Σ_{μν} c_A^μ f_{μν} c_A^ν    (4-24)

The R1 amplitudes must be transformed in the same way in order to generate the next set of T1 amplitudes. One PE should do this since the transformation is trivial:

t_I^A = r_I^A / (f_II - f_AA)

Table 4-12 lists the operations of the Heavy Platoon for antisymmetrizing and transforming R2. The code listing that follows shows the operations of the Amplitude Company from the end of the OP phase up to the next CC iteration.
if (HVY_PLT) then
    make T2
end if
if (CO_CDR) then
    make T1
    BROADCAST(T1, CO_NET)
else
    RECEIVE(T1, CO_NET)
end if
if (not HVY_PLT) then
    GET T2
end if

The order of the loop nesting, LD versus DL, does not affect the results, but the performance of the loops might differ from one computer architecture to the next. The LD ordering issues multiple GETs to the same PE (down the L column) before cycling up to the next column. If every PE begins with L=1, then the first PE in the platoon will be bombarded with the GETs of every PE at once. Normally, GET operations do not require exclusive locks, in which only one PE is allowed to access the data at a time, but the efficiency of the messaging subsystem might be a factor in performance. This would be less important in the DL ordering. Another way in which the messaging subsystem could affect performance is by "remembering" which PE was accessed last. In this situation, once a GET is established between two PEs, any successive GET to the same remote PE is less expensive than switching to another PE. This would actually favor the LD ordering and penalize the DL ordering. In either case, the code can be trivially designed to switch between the two depending on the architecture on which the program is running.

Table 4-2. The Direct Integral phase prologue memory map. Each operation is implicitly summed over all shared indices. The G and H columns of V correspond to the Coulomb and antisymmetrized integral sets, and the w, r, and i symbols stand for write, read, and increment, respectively. Shaded cells are involved in pending communications and may not be read from or written to (without risking data inconsistencies). Temporary arrays (W2) are outlined to show their scope of dependence.

Table 4-3. The Direct Integral phase inner loop memory map. The R2(1) buffer is the active buffer, and the R2(2) buffer is the access-forbidden buffer from the previous L shell. The loop starts with L=R, continues up to Nshells, cycles back to L=1, and continues up to L=R-1.

Table 4-4. The Direct Integral phase permutation epilogue memory map. One of the two R2 buffers will be access-forbidden. The last permutation does not need to GET another T2(RS) block; otherwise, the GET should be issued as soon as possible. If the permutation index is 4, then a different exchange integral will be generated directly. The (w) cells are write-optional, meaning the operation might or might not need the scratch memory.

Table 4-5. The Stored Integral phase Group 3 Set 1 Subset 1 memory map. During the loop over D, the T2 buffers alternate active/inactive, but the prologue in the L loop uses the second buffer as scratch memory. The GET in the inner loop over D requests T2(DS) unless D=D1, in which case it requests the next T2(LR) (provided L is not L1).

Table 4-6. The Stored Integral phase Group 3 Set 1 Subset 2 memory map. The inner loop of this subset has more operations than either of the other two subsets, so it is calculated last and the asynchronous read is issued during the last iteration.

Table 4-7. The Stored Integral phase Group 3 Set 1 Subset 3 memory map.

Table 4-8. The Stored Integral phase Group 3 Set 2 memory map. All of the T2(LD) reads and R2(LD) updates are local operations. If there is not enough memory for I2, then the Ij/ij/IJ bra pairs can be processed in blocks.

Table 4-9. The One Particle phase F^P(ρ) T2 memory map.

Table 4-10. The One Particle phase F^P(σ) T2 (and same-spin terms) memory map.

Table 4-11. The One Particle phase F^H and F^R local transformations.

Table 4-12.
The R2 antisymmetrization and T2 update memory map.

CHAPTER 5
DISCUSSION AND SIMULATION

Discussion

General Comments

The parallel CCSD algorithm distributes almost every operation over the number of PEs in the Amplitude Company. The few operations that are not distributed are insignificant to overall performance. Specifically, the worst one-particle transformation would be a multiplication of two N_AO x N_AO matrices, which scales operationally like N_AO^3. In contrast, the Direct Integral phase multiplies two matrices in a contraction that scales operationally like O^2 N_AO^4. For a system with 100 electrons (50 α and 50 β) and 500 basis functions, there are approximately 125 million operations for the serial multiply and over 309 trillion operations for the parallel multiply. Assuming that every operation takes the same amount of time and that the parallel multiply is 100% efficient, the maximum theoretical speedup is 2.48 million times faster (for this system).

lim_{N_PE -> inf} SpeedUp = (t_ser + t_par) / t_ser = (1.25e8 + 3.09e14) / 1.25e8 ~ 2.48e6    (5-1)

Obviously this limit is unachievable, and there are multiple operations that slow down the parallel portions of the CCSD algorithm. Disk I/O, network messages, and synchronizations all become more expensive as the number of PEs increases. The two primary techniques that help to reduce these costs are real-time load balancing and asynchronous operations (i.e., overlapping computation and communication for both disk reads and remote memory operations).

Load Balancing

Job boards

In this context, a job board is a global structure that any PE can write to or read from at will. The Direct Integral phase must have a global counter that registers the next AO shell pair waiting to be processed.
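The speedup ceiling in Equation 5-1 above is a simple ratio and can be checked with the operation counts quoted in the text (the function name is illustrative):

```python
def speedup_limit(serial_ops, parallel_ops):
    """Ceiling on speedup when the parallel part becomes free
    (Equation 5-1): total work divided by the serial remainder."""
    return (serial_ops + parallel_ops) / serial_ops

# 1.25e8 serial and 3.09e14 parallel operations, as in the text:
print(speedup_limit(1.25e8, 3.09e14))  # roughly 2.5 million
```

With the rounded inputs the ratio lands just under 2.5 million, consistent with the 2.48-million estimate in the text.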
PEs that finish their tasks quickly will read and increment the counter more often than PEs that lag behind the others. Natural load imbalance or CPU contention from other processes can cause this lag (or drift). CPU contention frequently happens if the operating system attempts to run more compute-intensive processes than installed CPUs. Even if the parallel program executes exclusively, when the number of PEs is the same as the number of CPUs, the operating system kernel itself will compete for CPU cycles and detract from overall scalability. Job boards can relieve natural load imbalance in the Direct Integral phase, but the Stored Integral phase cannot easily use them (if at all). The SI phase has two levels of parallelization: the OOVV integrals are distributed over platoons, and the R2 updates are distributed over PEs within a platoon. Any attempt to balance the load in real time will incur multiple synchronization delays since the PEs in a platoon have close to, but not exactly, the same number of operations. Instead of real-time load balancing, the AO shells are grouped together to balance the work needed to process each shell.

Data aggregation

All uses of the word "shell" that correspond to CC data processing refer to collections of physical AO shells, which are the smallest entities that direct integral generation programs can calculate. Since the number of operations scales like O^2 N^4, the difference between the smallest and largest physical shell is a direct measure of the imbalance in computations per message. The hydrogen STO-3G basis set has 1 contracted orbital in the s shell. Larger basis sets will typically run up to about 16 orbitals per shell (although the iodine PBS basis has 55 orbitals in the d shell). The hydrogen shell and the average shell are computationally uneven by a factor of 65536 for the DI phase.
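Since the per-shell work grows like the fourth power of the shell size, the 65536 figure follows directly (a one-line check; the function name is illustrative):

```python
def di_work(shell_size):
    """Relative DI-phase work for one shell: the O^2 N^4 scaling means
    the per-shell cost grows like (orbitals per shell)**4."""
    return shell_size ** 4

# Hydrogen STO-3G s shell (1 orbital) vs. a typical 16-orbital shell:
print(di_work(16) // di_work(1))  # 65536-fold work imbalance per message
```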
If the messages pertaining to these two shells cost the same, which is a fair assessment since message latency is an order of magnitude more important than message size, then one PE will have 65536 times more work to do than another. Simply grouping as many physical shells together as possible might not be the best solution, since it is the size variability that affects load balancing. The parallel CCSD algorithm, before running the first iteration, will examine the collection of physical AO shells and search for the set of logical shells that minimizes the variance of shell sizes. This is a bounded two-dimensional search along the dimensions MaxShellSize and NLogShells. The maximum number of logical shells is simply the number of physical shells. The fewest number of logical shells is 1, and the size of that shell is equal to the sum of all of the physical shell sizes; however, given a maximum shell size, the fewest number of shells is defined in Equation 5-2.

N_{LogShells} \geq \left\lceil \frac{\sum_{i} S_{PhysShell,i}}{\mathrm{MaxShellSize}} \right\rceil    (5-2)

Physical AO shells should not be broken up, since generating direct integrals costs less when orbitals of the same angular momentum and atomic center are calculated at once. Equation 5-2 is actually an inequality because there might not be a perfect grouping of physical shells such that every logical shell has the same size. The search algorithm loops over size limits (MaxShellSize), starting with the largest physical shell size and ending with a predefined limit or the sum of all physical shell sizes, whichever is smaller. For every potential MaxShellSize, the algorithm also loops over the number of logical shells, starting from the theoretical minimum up to the number of physical shells. The problem of putting objects with mass (physical shells with orbitals) into groups with a mass limit (logical shells with up to MaxShellSize orbitals) is well known in the field of Computer Science as the Bin Packing Problem (BPP) [Cof97].
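The Worst Fit Decreasing heuristic described below can be sketched in a few lines. This is an illustration only: the function names, the example shell sizes, and the cap handling are not taken from the original code.

```python
import statistics

def worst_fit_decreasing(phys_shell_sizes, n_bins, max_size):
    """Group physical AO shells into n_bins logical shells.

    WFD: take physical shells from largest to smallest and add each to
    the logical shell (bin) that currently holds the fewest orbitals,
    subject to the MaxShellSize cap. Returns None if some shell cannot
    be placed without exceeding the cap.
    """
    bins = [[] for _ in range(n_bins)]
    for size in sorted(phys_shell_sizes, reverse=True):
        # Pick the emptiest bin that can still accept this shell whole.
        candidates = [b for b in bins if sum(b) + size <= max_size]
        if not candidates:
            return None
        min(candidates, key=sum).append(size)
    return bins

def size_variance(bins):
    """Variance of logical-shell sizes; lower means better balance."""
    return statistics.pvariance([sum(b) for b in bins])

# Example: 10 physical shells with 1 to 16 orbitals each.
sizes = [1, 1, 3, 3, 5, 6, 6, 10, 15, 16]
packing = worst_fit_decreasing(sizes, n_bins=4, max_size=20)
assert packing is not None
assert sorted(s for b in packing for s in b) == sorted(sizes)  # nothing lost
```

The search over MaxShellSize and NLogShells simply reruns a packing like this for each candidate pair and keeps the result with the lowest variance.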
Bin packing is an NP-hard optimization problem, and there are many algorithms for finding solution candidates. Rather than solve the BPP, the CC algorithm uses the Worst Fit Decreasing (WFD) offline algorithm limited to a known set of bins (logical shells) and measures the variance of sizes (Appendix E). The partitioning scheme with the lowest variance should have the best balance between computation and messaging. WFD is ideal for finding possible partitions for this particular problem since it adds each physical shell, in decreasing size order, to the logical shell with the fewest orbitals.

Asynchronous I/O

Most parallel programs are hampered by the increasing cost of messages as the number of PEs increases. One-sided operations allow PEs to process data while messages are moving through the network. The two limits of this scheme are: 1) the CCSD computation is effectively free because the messages take so long to arrive, and 2) the program scales almost linearly with the number of PEs because global data is available the moment it is read. Given the complex task and data distributions, it is likely that some portions of the algorithm will be bound by communication costs and others by computations. Reading the OOVV integrals from disk should not be a performance bottleneck considering the amount of computation that each PE will do while the AIO read is being fulfilled. The inner loops of the SI phase can become heavily dependent on T2 GET operations if the logical shells are not large enough to saturate the retrieval with computation. The easiest way to increase the number of operations per message is to increase the size of a logical AO shell pair. The difficulty in doing this is that the local scratch memory requirements scale like N^4 with respect to shell size. As more scratch memory is needed, less heap remains for distributed data, and ultimately more PEs will be needed.
The problem of balancing shell size, scratch memory, and data distribution is not solved, but as the algorithm has been stated, the data distribution schemes can be tried independently of the data contraction kernels. For example, if more scratch memory is needed but there is no possibility of increasing the heap or the number of PEs, then the DI and SI phases can be further divided to allow for fewer intermediates at the cost of more integral processing.

Synchronization Delays

An attractive feature of the parallel CCSD algorithm is that no PE is required to stall in a wait call for another PE unless the state of their globally accessible data changes (i.e., if R2 is transformed or T2 is updated). A barrier in message-passing programs is a routine that causes every PE to wait until all of the PEs in a communication network have entered the barrier. Certain collective operations can function like a barrier, and some pairs of functions act as a true barrier. Reductions involve every PE sending some data to one PE, which combines all the data into one element. Broadcasts involve one PE sending some data to every PE. Individually, certain PEs may continue after the operation whether or not other PEs have finished their operations; but properly configured, a reduction and a broadcast can together function like a barrier. No PE will proceed until it has received data from one PE, which will not send the reduced element before receiving the contributions from all of the other PEs. The concept of a fence comes from the MPI-2 standard, which defines it as a collective barrier that ensures all one-sided communications on a shared object are complete before leaving the barrier. Synchronizations of any sort occur essentially 10 times in each CC iteration:

1. *Initialize R and synchronize T before the βα DI phase (CO fence; Appendix A)
2. Complete the R2(βα) contributions from the βα DI phase (PLT fence)
3. Complete the R2(βα)→R2(αβ) transformation into scratch (PLT fence)
4.
Copy scratch into R2 before the αβ DI phase (PLT fence)
5. *Complete the R2(αβ) contributions from the αβ DI phase (CO fence)
6. Reduce F1 contributions onto the CO CDR (CO reduce)
7. Receive F1 from the CO CDR (CO bcast)
8. Complete the R2(αβ)→R2 transformation into scratch (PLT fence)
9. Copy scratch into R2 before the SI phase (PLT fence)
10. Complete the SI and OP contributions (CO fence)
11. Complete the R2→T2 update (CO fence)
12. Receive T1 from the CO CDR (CO bcast)

In the steady state of iterating over T, the first and last synchronizations can be overlapped (i.e., point 1 is the same as point 12 in the list). Point 5 can be overlapped with point 6 in that reducing F1 signals the end of the αβ DI phase (Table 5-1). Collectively, the number of synchronizations does not scale with the problem size; however, the cost of each synchronization point will increase as the number of PEs increases.

Simulation

The million-dollar question that every parallel algorithm must answer is, "Does it scale?" Amdahl's Law [Amd67], which neglects all process contention and performance degradation, describes the maximum theoretical speedup for a fixed problem size running on a fixed number of processors. The simplest expression of this law is shown in Equation 5-3.

\mathrm{SpeedUp} = \frac{t_{ser} + t_{par}}{t_{ser} + t_{par}/N_{proc}}    (5-3)

However, the parallel CCSD algorithm does not spend the same percentage of time in the parallel parts relative to the serial parts for every molecular system. As the problem size increases, the percent spent in the serial part decreases; thus, the parallel speedup will increase with system size.
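The limiting speedup of Equation 5-1 and Amdahl's Law of Equation 5-3 can be checked with a short script. This is an illustration only; the function name is not taken from the original code, and the operation counts are those quoted for the 100-electron, 500-basis-function example.

```python
def amdahl_speedup(t_ser, t_par, n_proc):
    """Amdahl's Law: speedup of a fixed-size problem on n_proc processors."""
    return (t_ser + t_par) / (t_ser + t_par / n_proc)

# Operation counts for 100 electrons / 500 basis functions (Equation 5-1):
t_ser = 1.25e8   # serial N_AO^3 multiply
t_par = 3.09e14  # parallel O^2 N_AO^4 contraction

# The infinite-processor limit approaches (t_ser + t_par) / t_ser:
limit = (t_ser + t_par) / t_ser
print(f"{limit:.3g}")  # prints 2.47e+06, on the order of Equation 5-1's 2.48 million

# Finite processor counts fall well short of that limit:
for n in (64, 1024, 10**6):
    print(n, round(amdahl_speedup(t_ser, t_par, n), 1))
```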
This is loosely related to the Gustafson-Barsis Law, which observes that, for data-parallel algorithms, a parallel computer can do bigger calculations relative to the serial computer as the number of processors increases [Gus88]. The parallel CCSD algorithm leaves N^3-type operations for one CPU (in the form of NxN matrix multiplications). A few one-particle N^5 operations and the N^4 T2 update are distributed over the largest platoon, but most N^5 and all N^6 operations are distributed over the entire Amplitude Company. Furthermore, remote data is retrieved as early as possible and used as late as possible. This provides an ideal countermeasure for variable network performance.

Table 5-1. The schedule of R2 transformations and data accesses during a CC iteration. [Table rows: DI:βα; R(D,M)→S(L,N); S(L,N)→R(L,N); DI:αβ; Broadcast F1; R(L,N)→S(L,D); S(L,D)→R(L,D); SI and OP phases; antisymmetrization and 1/D; update T2. Columns: the T1, R1, T2, R2, and scratch arrays of the heavy platoon and the T1, T2, R2, and scratch arrays of a light platoon.] Each platoon independently transforms R2 and separates each step with a PLT fence (dashed lines). The shaded rows correspond to collective communications across the Company. r, w, i, and 0 stand for read, write, increment, and initialize to 0, respectively. The largest platoon calculates T1 while the other platoons update their copies of T2.

General Overview

A simulator is used to test a design before building a prototype or the final product. Simulating software performance (scalability in the case of parallel CCSD) usually involves estimating computation and data transfer times, which requires hardware parameters such as CPU speed and efficiency; memory, disk, and network I/O latencies and bandwidths; etc. These parameters can be measured and will be different for various computer architectures; however, the goal of the software simulator is not to emulate hardware. By using a variety of parameters, the simulator should be able to expose general strengths and weaknesses of an algorithm that might need to be redesigned. The simulator used to test the parallel CCSD algorithm constructs a parallel network with an array of floating-point numbers that represents the clocks of the PEs. The program assumes the identity of a PE and estimates how long it would take to perform a set of operations.
That time is added to the clock, and the simulator switches to the next PE. For the Direct Integral phase, the simulator maintains a counter that corresponds to the next integral shell pair and becomes the PE with the lowest clock (i.e., the one that would increment the counter next). Once the counter exceeds the number of shell pairs, the clocks are synchronized to the highest value and the R2 amplitudes are transformed. The process repeats itself for the second DI phase. Binary trees are used to estimate reductions and broadcasts. The Stored Integral and One-Particle phases do not use real-time load balancing, so the simulator cycles through each PE and adjusts the clock accordingly. Finally, the antisymmetrization and update phases are estimated with all of the appropriate synchronizations and computations.

Specific Technologies

Local computation. Accurately estimating local mathematical operations is essential to simulating both parallel and serial performance. Each operation involves reading data from memory, operating on the data, and writing the results back to memory (Equation 5-4). S is the data size, B is the memory bandwidth, N_ops is the number of operations, C is the CPU speed, and e is an efficiency modifier. Compiler optimization, loop ordering, and vendor-supplied mathematical libraries will affect performance, so multiple efficiencies may be needed to estimate different parts of the algorithm.

t_{comp} = \frac{S_{in}}{B_{rd}} + \frac{S_{out}}{B_{wr}} + \frac{N_{ops}}{C \cdot e}    (5-4)

Asynchronous I/O. Disk operations can be estimated with latency, bandwidth, and the number of channels (Equation 5-5). Latency is usually the time it takes for a disk head to move to the appropriate position on the disk, N_ch is the number of dedicated I/O channels, and N_msg is the number of simultaneous disk operations. Disk contention in the simulator is estimated by multiplying S by the number of PEs on the node (i.e., if one PE is reading data, then another probably is as well).
t_{I/O} = t_{lat} + \frac{S}{B \cdot \min(N_{ch}, N_{msg})}    (5-5)

When an asynchronous call is placed, the simulator estimates how long it would take for the data to arrive and saves the time at which it should be available. The PE must then call a wait routine, which compares the current time to the time the data becomes usable. Depending on the difference, the simulator will increase the clock (the PE must wait) or return silently (the data arrived while computing).

One-sided communication. Sending and receiving messages through a network is very similar to writing to and reading from disk. If two-sided messaging is used, then the clocks of the origin and target must be modified. Blocking messages require the origin and target to have the same clock, while non-blocking messages allow for drift. The parallel CCSD algorithm only uses non-blocking one-sided messages, which are very similar to I/O operations in that only the origin is affected by the time it takes for data to arrive. The simulator allows for on-node and off-node messaging parameters so that PEs in the same squad can communicate at different speeds than those on different nodes. Some MPI implementations require copying data into and out of system buffers, so memory bandwidth might also affect the time.

Collective communication. Barriers, reductions, and broadcasts are simulated with binary trees (Figure 5-10), and the number of stages depends on the number of PEs involved in the operation (Equation 5-6). (Ceiling brackets mean round up to the nearest integer.) Equation 5-7 shows the number of messages being sent in each stage i.

N_{stages} = \lceil \log_{2} N_{PE} \rceil    (5-6)

N_{msg,i} = \min(N_{PE} - 2^{i-1}, 2^{i-1})    (5-7)

Figure 5-10. The binary tree approach to collective communications. 12 PEs require 4 stages, and the last stage only has 4 messages.

Calibration

Memory bandwidth is measured by allocating a large heap and then flushing it with data and reading the data back in. The average of 7 sweeps is used for the memory reading and writing bandwidth.
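Returning to the collective-communication model, Equations 5-6 and 5-7 can be checked numerically. The function names below are illustrative; the 12-PE case reproduces the Figure 5-10 example.

```python
import math

def n_stages(n_pe):
    """Equation 5-6: stages in a binary-tree collective over n_pe PEs."""
    return math.ceil(math.log2(n_pe))

def n_messages(n_pe, stage):
    """Equation 5-7: messages sent in a given stage (1-indexed)."""
    return min(n_pe - 2 ** (stage - 1), 2 ** (stage - 1))

# The Figure 5-10 example: 12 PEs need 4 stages, and the last stage
# carries only 4 messages (the other 4 potential targets do not exist).
assert n_stages(12) == 4
assert [n_messages(12, s) for s in range(1, 5)] == [1, 2, 4, 4]
# Total messages equal N_PE - 1: every PE except the root receives once.
assert sum(n_messages(12, s) for s in range(1, 5)) == 11
```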
Henceforth, all data will pertain to an IBM POWER3 workstation with four 375 MHz CPUs. The following output was taken from a calibration run:

512 MB heap at 0x3348f518
wr was 1.924009 -> 266.11 MB/s; rd was 0.623147 -> 821.64 MB/s
wr was 1.924412 -> 266.06 MB/s; rd was 0.623328 -> 821.40 MB/s
wr was 1.922241 -> 266.36 MB/s; rd was 0.623201 -> 821.56 MB/s
wr was 1.922671 -> 266.30 MB/s; rd was 0.623183 -> 821.59 MB/s
wr was 1.922141 -> 266.37 MB/s; rd was 0.623107 -> 821.69 MB/s
wr was 1.922495 -> 266.32 MB/s; rd was 0.623170 -> 821.61 MB/s
wr was 1.922200 -> 266.36 MB/s; rd was 0.623147 -> 821.64 MB/s
bw_memwr = 266.27 MB/s (279201344.00 B/s)
bw_memrd = 821.59 MB/s (861497664.00 B/s)

CPU throughput is rarely the clock speed of the processor, so efficiency modifiers are used to estimate the time it would take to compute a set of operations. The two primary mathematical constructs that the CCSD algorithm uses are two-dimensional matrix multiplications and N^6 array contractions. Many vendors sell highly optimized linear algebra libraries that are tuned for specific computer architectures. Using these libraries for standard matrix multiplies can dramatically increase performance if the program is dependent on such operations. The simulator measures the efficiency of the double-precision general matrix multiply (DGEMM) by multiplying two 625x625 matrices. The matrices are then recast as two 25x25x25x25 four-dimensional arrays, and a hand-coded six-loop array contraction is timed, wherein the array contraction performs the same operations as the two-dimensional matrix multiply. The following results were taken from a calibration run:

CALIBRATION FOR POWER 375 MHz CPU (2 fma units per CPU)
750000000 fma/s reference

Disk latency and bandwidth are difficult to measure in an isolated environment. Most operating systems use aggressive file caching, which can have a noticeable effect on performance curves that compare message size to bandwidth.
The parallel CCSD algorithm is somewhat forgiving of exact performance because all disk reads are asynchronous; nonetheless, calibrating the I/O performance attempts to bypass the CPU and memory caches. The benchmarking program iozone [Nor02] was used to measure the disk latency and bandwidth, and all files were created with the descriptors O_DIRECT (direct I/O), O_SYNC (synchronous writes), and O_RSYNC (synchronous reads).

( 625 x 625 ) x ( 625 x 625 )
dgemm took 0.387582 s -> 88.1849% eff
dgemm took 0.384157 s -> 89.0108% eff
dgemm took 0.384016 s -> 89.0451% eff
dgemm took 0.383978 s -> 89.0543% eff
dgemm took 0.384088 s -> 89.0276% eff
dgemm took 0.384009 s -> 89.0468% eff
dgemm took 0.384046 s -> 89.0378% eff
loops took 1.785683 s -> 18.4198% eff
loops took 1.786396 s -> 18.4123% eff
loops took 1.785581 s -> 18.4208% eff
loops took 1.785778 s -> 18.4188% eff
loops took 1.786151 s -> 18.4149% eff
loops took 1.785726 s -> 18.4193% eff
loops took 1.785444 s -> 18.4223% eff
dgemm eff = 0.889153
loops eff = 0.184183
vendor is 4.83 times more efficient

The results are then compared to another set of larger multiplications and contractions to test for predictability. The simulator uses the latter set of parameters if the results are precise.
The following results were generated using the previous data:

( 30 x 30, 40 x 40 ) x ( 40 x 40, 40 x 40 )
DGEMM: real: 3.549262 s (est: 3.53) eff: 0.8892
DGEMM: real: 3.547543 s (est: 3.53) eff: 0.8892
DGEMM: real: 3.547254 s (est: 3.53) eff: 0.8892
LOOPS: real: 22.458201 s (est: 16.76) 6-loop eff: 0.1373
LOOPS: real: 22.448650 s (est: 16.76) 6-loop eff: 0.1373
LOOPS: real: 22.446517 s (est: 16.76) 6-loop eff: 0.1373
LOOPS: real: 22.450999 s (est: 16.76) 6-loop eff: 0.1373
LOOPS: real: 22.447324 s (est: 16.76) 6-loop eff: 0.1373
LOOPS: real: 22.445928 s (est: 16.76) 6-loop eff: 0.1373
LOOPS: real: 22.449582 s (est: 16.76) 6-loop eff: 0.1373

Comparing the record length to the time it took to read a random record produced data that were used to describe the local disk performance (Figure 5-11 and Table 5-2).

Figure 5-11. The local disk random read calibration data. The sets of data are almost identical no matter what file size (kB) was used between 2 and 512 MB.

Table 5-2. The latencies and bandwidths of random disk reads with respect to file size
File Size   B_rd   t_lat     File Size   B_rd   t_lat     File Size   B_rd   t_lat
2048        29.7   3.5       16384       30.4   3.5       131072      31.2   4.3
4096        30.4   4.0       32768       30.5   2.9       262144      31.5   4.9
8192        30.0   3.2       65536       31.1   4.0       524288      31.4   4.8
File sizes are in kB, bandwidths (B_rd) are in MB/s, and latencies (t_lat) are in ms. For each file size, the bandwidth is the inverse of the slope of the linear regression of record lengths vs. time to read, and the latency is the intercept. (The minimum R^2 was 0.999.) The average bandwidth and latency for all 9 sets were 30.7 MB/s and 3.9 ms, respectively.

Measuring network messaging performance was kept relatively simple. The mpptest program, which ships with the MPICH library [Gro96], was used to measure the point-to-point (P2P) and bisection bandwidths for messages ranging from 16 to 64 kB.
P2P involves only two PEs, while bisection involves all PEs. Figure 5-12 shows the on-node P2P and bisection bandwidths measured on the IBM workstation. Latency is taken from the intercept of the linear regression, not the zero-size transfer time. The regression analysis for both curves (bisect: t = 9.8093*S + 209.7, R^2 = 0.9982; P2P: t = 9.9477*S + 210.69, R^2 = 0.9973) suggests that both pairs of CPUs can communicate without degradation, so the simulator set the number of on-node communication channels to 2 and used the approximate values of 210 μs and 99 MB/s for latency and bandwidth, respectively.

Figure 5-12. The calibration data for on-node messaging parameters. Bandwidth is the inverse of the regression slope, and latency is the intercept. The performance does not change with one or two pairs of CPUs communicating at the same time, which implies two independent messaging channels.

Three off-node network architectures were tested: switched Fast Ethernet and two 150 MB/s full-duplex, redundant-path IBM SP switches (with different configurations). Only P2P tests were used to measure off-node latency and bandwidth (Figure 5-13), and Equation 5-5 was used to estimate degradation.

Figure 5-13. The calibration data for off-node messaging parameters. The regression equations are Fast Ethernet: t = 86.987*S + 553.42 (R^2 = 0.9998); SP switch 1: t = 15.304*S + 265.18 (R^2 = 0.9915); and SP switch 2: t = 11.942*S + 180.88 (R^2 = 0.9970). Bandwidths are 11.2, 63.8, and 81.8 MB/s, respectively.

As in the case of asynchronous I/O, most messages in the parallel CCSD algorithm are one-sided and will hide behind computation.
If the amount of computation is less than the estimated time spent in communication, then a more accurate messaging function should be used.

Theoretical Examples

With reasonable hardware parameters and timing algorithms, simulations of real molecular systems can be used to estimate the performance of the parallel CCSD algorithm. Table 5-3 lists the 20 amino acids and the corresponding numbers of electrons and spin pairs. Also listed are the sizes of the open-shell OOVV integrals (and thus the minimum disk requirements, with 1 GB equal to 1024^3 Bytes) and the minimum number of PEs with 1 GB of heap needed to calculate the UHF CCSD energy if each system had 1024 AO basis functions. Energy calculations of the anion, cation, and zwitterion are certainly achievable with the resources of current networks of workstations and clusters of SMP machines. For 1024 AO basis functions, the size of the two-electron integrals is 8 TB. Approximately one-eighth of these need to be calculated on the fly since the others can be generated by permutations; however, assuming the integrals are dense, the full 8 TB array must be generated and processed twice (once for βα and once for αβ). Systems of this size should be executed on hundreds if not a thousand PEs or more. The initial parallelized software cannot scale this high, much less store the amount of data needed by each node. The parallel CCSD algorithm overcomes this with fine-grain task distribution and by distributing large arrays over as few PEs as possible. The net effect is that large systems can be calculated with few resources at the expense of computing time.
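The 8 TB figure and the electron-pair counts in Table 5-3 follow from simple counting. A quick check; the 8-byte double-precision element size is an assumption consistent with the binary-gigabyte convention above:

```python
def pair_count(n_electrons):
    """Number of distinct electron pairs: n(n - 1)/2."""
    return n_electrons * (n_electrons - 1) // 2

def two_electron_integral_tb(n_ao, bytes_per_element=8):
    """Size of the full, dense N_AO^4 two-electron integral array in binary TB."""
    return n_ao ** 4 * bytes_per_element / 1024 ** 4

# Glycine (40 electrons) and alanine (48 electrons) from Table 5-3:
assert pair_count(40) == 780
assert pair_count(48) == 1128

# A dense 1024-AO two-electron integral array occupies 8 TB, as stated:
assert two_electron_integral_tb(1024) == 8.0
```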
Table 5-3. The minimum resource requirements for UHF CCSD calculations of the 20 amino acids with 1024 AO basis functions
Amino acid       C   H   N   O   S   n_e   n_oo   Min PEs   Disk (GB)   Disk savings
Alanine          3   7   1   2   0    48   1128        27         8.8        99.92%
Cysteine         3   7   1   2   1    64   2016        48        15.8        99.86%
Aspartic acid    4   7   1   4   0    70   2415        57        18.9        99.84%
Glutamic acid    5   9   1   4   0    78   3003        71        23.5        99.79%
Phenylalanine    9  11   1   2   0    88   3828        90        29.9        99.73%
Glycine          2   5   1   2   0    40    780        19         6.1        99.95%
Histidine        6   9   3   2   0    82   3321        78        25.9        99.77%
Isoleucine       6  13   1   2   0    72   2556        60        20.0        99.83%
Lysine           6  14   2   2   0    80   3160        75        24.7        99.78%
Leucine          6  13   1   2   0    72   2556        60        20.0        99.83%
Methionine       5  11   1   2   1    80   3160        75        24.7        99.78%
Asparagine       4   8   2   3   0    70   2415        57        18.9        99.84%
Proline          5   9   1   2   0    62   1891        45        14.8        99.87%
Glutamine        5  10   2   3   0    78   3003        71        23.5        99.79%
Arginine         6  14   4   2   0    94   4371       103        34.1        99.69%
Serine           3   7   1   3   0    56   1540        37        12.0        99.90%
Threonine        4   9   1   3   0    64   2016        48        15.8        99.86%
Valine           5  11   1   2   0    64   2016        48        15.8        99.86%
Tryptophan      11  12   2   2   0   108   5778       136        45.1        99.59%
Tyrosine         9  11   1   3   0    96   4560       107        35.6        99.68%
The data assume the scratch memory needed by each PE is less than the amount of data needed to store the third OOVV array during an R2 transformation. Every example saves over 99.5% of the disk space required by a traditional MO-based CCSD algorithm.

Runtime Simulations

DCEE*

A simulation of the parallel CCSD algorithm was performed on the DCEE* system described in Chapter 2. Two network scenarios were tested: Fast Ethernet (F.Eth) and the faster of the two IBM SP switches (henceforth referred to as x3 sw). Each node in the network was configured as if it were the aforementioned 4-CPU IBM workstation. All data are compared to 100% and 50% theoretical efficiencies (both are technically linear, but "linear" scaling commonly refers to 100% theoretical efficiency). (The parallel algorithm requires at least two 1 GB PEs, so all data are compared to this as the reference calculation.)

Fast Ethernet.
The F.Eth simulation shows acceptable parallel performance up to 64 PEs, but beyond that, efficiency drops below 50%, which is considered to be unacceptable (Figures 5-14 and 5-15). However, even up to 1024 PEs, the scalability curve is still increasing. Efficiency shows normal behavior for parallel programs (starting out high and decreasing asymptotically).

Figure 5-14. The simulated performance curves for DCEE* on 2 to 1024 PEs over Fast Ethernet. Xth (100%) and Xth (50%) are the theoretical speedup curves with 100% and 50% efficiencies, respectively. X pred. is the predicted speedup for the simulation. Efficiency*128 is the efficiency scaled by 128.

Beyond the standard analysis of speedup and efficiency, users might concern themselves with the idea of cost-to-performance. Another way of stating the idea: "At what point does the cost of using more PEs outweigh the gain in speedup they might contribute?" The "cost" function is considered to be the efficiency, and the metric for "bang per buck" is efficacy (Equation 5-8) [Gho91].

\eta(p) = E(p) \cdot S(p)    (5-8)

Efficiency times speedup can have multiple maxima, but only the first maximum value (with respect to the number of PEs) corresponds to the most cost-effective use of PEs in a parallel run. The F.Eth simulation of DCEE* peaks at 64 PEs (η = 10.2), but the curve is so shallow that one could argue anywhere between 32 and 96 PEs is still cost-effective.

Figure 5-15. The simulated performance curves for DCEE* on 2 to 128 PEs over Fast Ethernet. The curves correspond to the same series as in Figure 5-14 with the exception that efficiency has been scaled by 32, not 128.

SP switch. The x3 sw simulation shows much better performance (Figure 5-16). Predicted speedup never drops below 50% efficiency up to 1024 PEs, and efficacy has not peaked.
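Equation 5-8 and the first-maximum rule can be sketched as follows. The speedup data here are synthetic, for illustration only; they merely have the same shape as the F.Eth run (speedup keeps rising while efficiency decays).

```python
def efficacy(speedups):
    """Equation 5-8: eta(p) = E(p) * S(p), with E(p) = S(p)/p.

    `speedups` maps a PE count p to its simulated speedup S(p).
    """
    return {p: (s / p) * s for p, s in speedups.items()}

def first_peak(eta):
    """First local maximum of efficacy with respect to the PE count."""
    counts = sorted(eta)
    for prev, cur in zip(counts, counts[1:]):
        if eta[cur] < eta[prev]:
            return prev  # efficacy started to fall; prev is the first peak
    return counts[-1]    # still rising, like the x3 sw simulation

# Synthetic data: efficacy rises, peaks at a moderate PE count, then falls.
speedups = {2: 2.0, 16: 14.0, 32: 24.0, 64: 36.0, 128: 44.0, 256: 50.0}
eta = efficacy(speedups)
print(first_peak(eta))  # -> 64
```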
The simulation suggests that the parallel performance for this system (DCEE*) is highly dependent on communication speed.

Figure 5-16. The simulated performance curves for DCEE* on an SP switch.

Comparison to serial ACES II. Efficacy is a relative metric that depends on the machine and (molecular) system, but it can be used to compare the effectiveness of different parallel programs running the same system on the same computer. Figure 5-17 compares the speedup and efficacy of the new algorithm to the previous parallel CCSD algorithm. The new algorithm scales better and is much more cost-effective per PE; however, the comparison of real times does not fare so well (Figure 5-18).

Figure 5-17. The relative performance of the initial parallelization attempt compared to the new algorithm. The discontinuity of the previous parallel program comes from allocating whole nodes with 16 CPUs each.

The new algorithm achieves high scalability by decomposing the equations to such a degree that the PEs are swamped with calculations in an attempt to use as little disk and memory as possible. The current serial ACES II CCSD program is 31.4 times faster than the new parallel algorithm, but it also requires much more disk space and does not scale as well. Nonetheless, Table 5-4 lists the numbers of PEs that the new algorithm would need to match the time achieved by the initial parallelization attempt. It is clear that the CCSD energy can be calculated quickly with few PEs for molecular systems that have small data requirements (DCEE* used on the order of 2 GB of disk space per node).
The current ACES II program and the initial parallelization attempt are more than adequate for reducing the time the new parallel algorithm would take; however, the new algorithm requires very little disk space and can calculate energies for very large systems.

Figure 5-18. The absolute performance of the initial parallelization attempt compared to the new algorithm. The dark lines are the factors of speedup achieved by using the previous code compared to the reference calculation of the new algorithm.

Table 5-4. The number of PEs needed to surpass the initial parallel performance
Previous PEs   New PEs   Speedup
1                   76      31.8
32                 409       146
64                 514       176
128                424       151
256                405       145
The simulated speedup curve was fitted with a cubic regression, which was used to predict how many PEs the new algorithm needs to finish a CC iteration in the same time that the initial parallelization attempt did.

Serine and phenylalanine

Using the x3 sw network parameters, the simulator examined serine (Ser) and phenylalanine (Phe), both in the cc-pVQZ basis set. Ser, with 56 electrons, had 595 AO basis functions, and Phe (88 electrons) had 990 AOs. The current ACES II UHF CCSD program, using the AO algorithm for VVVV integrals and assuming 20% of the AO integrals are discarded, would require over 200 GB of local disk storage for Ser and over 1.5 TB for Phe. The new parallel algorithm would need only 2.8 and 19.2 GB of distributed storage, respectively. Figures 5-19 and 5-20 show the simulated parallel performance of Ser and Phe. Ser requires a minimum of 14 1-GB PEs, and Phe requires 95 PEs. In either system, parallel efficiency never drops below 96%. With respect to tractability, a single Ser UHF CCSD iteration is expected to take 2.8 days on 1024 PEs. Phe requires approximately 2 months per iteration on 1024 PEs. If the efficiency curve is to be believed, then this can be dropped to about 2 weeks on over 4000 PEs.
Figure 5-19. The performance curves for serine on 16 to 1024 PEs.

Figure 5-20. The performance curves for phenylalanine on 128 to 1024 PEs.

The simulator is only able to model local computations with high accuracy. These are generally highly ordered, well-defined sequences of commands executing on a consistent processor (barring process contention in the operating system). After these, the performance of local disk operations is somewhat predictable, even with I/O contention from other PEs. The least predictable performance aspect of a parallel program is network communications. This seems to be an N! correlation problem in its own right (never mind Coupled-Cluster theory), but certain models are able to predict the statistical nature of network traffic rather well. The current implementation of the simulator treats the network as a single, flat, interconnected fabric in which every PE can reduce network bandwidth for every other PE. In principle, this should predict the worst performance, since multilayered network hierarchies favor local communications, which the platoons are supposed to capitalize on. However, the penalty function for estimating network contention lacks the ability to scale up with the number of PEs (i.e., more PEs reduce the accuracy of the predicted network effects). This should be corrected with a better model, such as the hyperbolic model for layered environments [Sto96]. At a more fundamental level, the simulator assumes all messages can be delivered in one eager message. This means each PE can send and receive data without bothering to divide the information into multiple messages, but more importantly, it completely eliminates the need for the target PE to acknowledge or create space for the incoming data.
In practice, this is rarely the case, and this rendezvous protocol must be modeled. The simulator is also limited to sharing I/O bandwidth among PEs on a single node. Many high-performance parallel supercomputers use storage area networks to increase