7th International Conference on Multiphase Flow,
ICMF 2010, Tampa, FL, May 30 - June 4, 2010
NEPTUNE_CFD High Parallel Computing Performances for Particle-Laden Reactive Flows

H. Neau*†, J. Laviéville‡ and O. Simonin*†
* Université de Toulouse; INPT, UPS; IMFT; Allée Camille Soula, F-31400 Toulouse, France
† CNRS; Institut de Mécanique des Fluides de Toulouse; F-31400 Toulouse, France
‡ EDF Recherche & Développement; MFEE; F-78401 Chatou, France
neau@imft.fr
Keywords: High Performance Computing, Parallel computation, Multiphase flows, Euler multifluid approach
Abstract
This paper presents the high performance computing capabilities of NEPTUNE_CFD V1.07@Tlse. NEPTUNE_CFD is an unstructured parallelized code (MPI) using an unsteady Eulerian multi-fluid approach for dilute and dense particle-laden reactive flows. Three-dimensional numerical simulations of two test cases have been carried out. The first one, a uniform granular shear flow, exhibits an excellent scalability of NEPTUNE_CFD up to 1,024 cores and demonstrates the good agreement between the parallel simulation results and the analytical solutions. Strong scaling and weak scaling benchmarks have been performed. The second test case, a realistic dense fluidized bed, shows the code computing performances on an industrial geometry.
Nomenclature

Roman symbols
CPU  Central Processing Unit
E_f  Efficiency
E_f,weak  Weak scaling efficiency
g  Gravitational constant (m.s^-2)
H_bed  Bed height (m)
HPC  High Performance Computing
MPI  Message Passing Interface
P_g  Mean gas pressure (N.m^-2)
q_p  Mean particle agitation (m^2.s^-2)
Sp_n  Speedup
T_ref  Job execution time on one node (8 cores)
T_n  Job execution time on n nodes (n x 8 cores)
U_k,i  Mean velocity of phase k (m.s^-1)
u'_k,i  Fluctuating velocity of phase k (m.s^-1)

Greek symbols
mu_g  Gas viscosity (kg.m^-1.s^-1)
rho_k  Density of phase k (kg.m^-3)

Subscripts
g  Gas
p  Particle
i  ith component of a vector

Introduction

Dilute and dense particle-laden reactive flows are used in a wide range of industrial applications such as coal combustion, catalytic polymerization or uranium fluorination. A few years ago, three-dimensional numerical simulations of such flows, especially fluidized beds, were restricted to coarse meshes due to the limitation of computing power (CPU clock speed, memory, hard drive). Recent improvements in high performance computing overcome this limitation. On the hardware side, processor frequencies have remained almost flat in the last years, but the development of multi-core technology and the increase of memory cache size and of memory and other interconnect frequencies have led to very highly parallel and efficient computing platforms. On the software side, this computational power has been exploited through the parallelization of codes (message passing). Hence, we are able to solve previously intractable problems. The strong development of HPC allows the use of finer and larger meshes and the increase of numerical simulation accuracy.

For a few years now, three-dimensional realistic numerical simulations of industrial configurations have been realized with the unsteady Eulerian multi-fluid modeling approach for monodispersed or polydispersed reactive particle mixtures (Randrianarivelo et al. (2007)). This approach
has been developed and implemented by IMFT (Institut de Mécanique des Fluides de Toulouse) in the NEPTUNE_CFD V1.07@Tlse version. NEPTUNE_CFD is an unstructured parallelized multiphase flow software developed in the framework of the NEPTUNE project, financially supported by CEA (Commissariat à l'Energie Atomique), EDF (Electricité de France), IRSN (Institut de Radioprotection et de Sûreté Nucléaire) and AREVA NP. This code takes into account complex phenomena: particle mixture, particle-fluid interaction, particle-particle and particle-wall collisions, heat and mass transfers and chemical reactions.
In this study, three-dimensional numerical simulations of two test cases have been carried out: a uniform granular shear flow and a dense gas-solid fluidized bed at industrial scale. These simulations enable the evaluation of the high computing performances of NEPTUNE_CFD on three clusters with different technologies and architectures and with different message passing libraries. Parallel structuring of the computational procedure, speedups, efficiency and their dependence upon the size of the problem and the number of cores are discussed.
NEPTUNE_CFD overview
NEPTUNE_CFD is dedicated to the calculation of multiphase or multi-field flows, at the local scale and in geometries that may be complex. The software has the following main characteristics and functions:

- flow systems processed: 1 to 20 fluid fields (phases or fields); processing of water/steam flows with actual thermodynamic laws;
- numerical methods: meshes with all types of cells (elements) and non-conforming connections; "cell-center" type finite volume method; calculation of co-localized gradients with reconstruction methods; distributed-memory parallelism by domain splitting;
- physical models: interfacial momentum transfer terms, interfacial energy transfer terms, turbulence, head losses and porosity, additional scalars;
- architecture: interfacing with the Code_Saturne Enveloppe module for management of the pre-processing operations on the mesh, of the parallelism and of the post-processing files; written mostly in Fortran 77 and in C (ANSI 1989) (procedural programming); ready for Unix and Linux platforms.

Figure 1: NEPTUNE_CFD modules.
The method of parallel computing used by NEPTUNE_CFD is known as domain decomposition, in which the geometry and associated fields are broken into pieces and allocated to separate cores. The process of parallel computation involves decomposition of the mesh and fields, running the application in parallel, and post-processing. The parallel run can use public or vendor implementations of the message passing library (MPI). NEPTUNE_CFD is composed of two main elements:

- the Kernel module is the numerical solver;
- the Enveloppe module is in charge of mesh data reading, mesh pasting allowing non-matching meshes, domain decomposition for parallel computing using METIS (Karypis and Kumar (1999)), definition of periodicity boundary conditions and post-processing.

NEPTUNE_CFD also relies on one compulsory library (under LGPL license): BFT (Base Functions and Types) for memory and input/output management as well as specific utilities.
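The decomposition process described above can be sketched compactly. The block partition and ghost-cell search below are an illustrative 1-D stand-in for the METIS graph partitioning the Enveloppe module actually performs; all names are hypothetical.

```python
# Sketch of domain decomposition with ghost ("halo") cells.  A 1-D row of
# cells is split into contiguous blocks, one per core; each core then needs
# copies of the neighbouring cells owned by other cores.  Illustrative
# stand-in only, not the METIS partitioning used by NEPTUNE_CFD.

def decompose(n_cells, n_ranks):
    """Assign cells to ranks as contiguous blocks of near-equal size."""
    base, extra = divmod(n_cells, n_ranks)
    parts, start = [], 0
    for r in range(n_ranks):
        size = base + (1 if r < extra else 0)
        parts.append(list(range(start, start + size)))
        start += size
    return parts

def halo(parts, rank):
    """Ghost cells of a rank: 1-D neighbours owned by other ranks."""
    n_cells = sum(len(p) for p in parts)
    owned = set(parts[rank])
    ghosts = {nb for c in owned for nb in (c - 1, c + 1)
              if 0 <= nb < n_cells and nb not in owned}
    return sorted(ghosts)

parts = decompose(10, 3)
print(parts)           # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(halo(parts, 1))  # [3, 7]: the data rank 1 must receive from neighbours
```

At every iteration the ghost values are refreshed by MPI point-to-point exchanges, which is where the communication overhead discussed later comes from.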
Mathematical models and numerical methods
The unsteady Eulerian multi-fluid approach is derived from a joint fluid-particle PDF equation, which allows the derivation of the transport equations for the mass, momentum and agitation of particle phases (Simonin (2000)).
In the proposed modeling approach, separate mean
transport equations (mass, momentum, and fluctuating
kinetic energy) are solved for each phase and coupled
through interphase transfer terms.
In the following, subscript k = g refers to the gas and k = p to the particulate phase. The mass balance equation is:

d/dt(alpha_k rho_k) + d/dx_j(alpha_k rho_k U_k,j) = 0,  (1)

where alpha_k is the kth phase volume fraction, rho_k the density and U_k,j the jth component of the velocity. In (1) the right-hand side is equal to zero since no mass transfer takes place.
The mean momentum transport equation takes the fol
lowing expression:
The momentum balance equations are solved with a semi-implicit method. They are split into fractional steps: explicit balance, implicit velocity increment prediction, "alpha-pressure-energy" implicit increment prediction, final velocity correction.
The "alpha-pressure-energy" step stops after the mass conservation sub-step, when volume conservation holds. The user can adapt the criterion parameter eps_alpha, but it remains very severe as it applies to a maximum value over the whole domain:

max over all cells of |sum_k alpha_k - 1| < eps_alpha.

The standard value of eps_alpha is 10^-5.
Because of the implicit formulation and three-dimensional unstructured meshes, we use iterative solvers:

- for the pressure: conjugate gradient or bi-conjugate gradient stabilized (BiCGSTAB),
- for the volume fraction: BiCGSTAB or Jacobi,
- for the velocity: Jacobi.
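As an illustration, the simplest of these solvers, the Jacobi iteration used for the velocity, can be sketched as follows (a toy dense-matrix version in Python; the production Fortran solver works on the unstructured-mesh matrices):

```python
# Minimal Jacobi iteration: each unknown is updated from the previous
# iterate of the others; it converges for diagonally dominant systems
# such as those produced by the implicit velocity step.

def jacobi(A, b, tol=1e-10, max_iter=1000):
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i))
                 / A[i][i] for i in range(n)]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x

A = [[4.0, 1.0], [2.0, 5.0]]   # diagonally dominant 2x2 example
b = [9.0, 12.0]
x = jacobi(A, b)               # converges to [11/6, 5/3]
```

In parallel, each core updates only its own cells and exchanges ghost values between sweeps, so every solver iteration implies MPI traffic.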
Parallel computer description
The numerical simulations have been carried out on 3
clusters:
C1: SGI Altix ICE 8200 based on Intel Xeon Quad
Core processors (Harpertown technology),
C2: SGI Altix ICE 8200 EX based on Intel Xeon
X5560 Quad Core processors (Nehalem EP tech
nology),
C3: cluster based on AMD Shanghai Quad-Core processors.
The complete description of the clusters is given in table 1. These clusters, among the most efficient in France, are based on different technologies, different file systems and different interconnects. The aim is to evaluate NEPTUNE_CFD parallel performances on different architectures in terms of speedup and efficiency, and not to compare cluster HPC (elapsed time). For clusters C1 and C2, we test 2 implementations of MPI, Intel MPI and SGI MPT (C1a, C1b, C2a and C2b), and we use the Intel Fortran and C compilers. Numerical simulations on cluster C3 are carried out with the GNU Fortran and C compilers and OpenMPI. All clusters were in production.
d/dt(alpha_k rho_k U_k,i) + d/dx_j(alpha_k rho_k U_k,j U_k,i) = -alpha_k dP_g/dx_i + alpha_k rho_k g_i + I_k,i + d/dx_j(-alpha_k rho_k <u'_k,i u'_k,j> + Theta_k,ij),  (2)

where u'_k,i is the fluctuating part of the instantaneous velocity of phase k, P_g is the gas pressure, g_i the ith component of the gravity acceleration and I_k,i the mean gas-particle interphase momentum transfer without the mean gas pressure contribution. Finally, Theta_k,ij is for k = g the molecular viscous tensor and for k = p the collisional particle stress tensor. For details on dilute flows we refer to Simonin (1991), and on fluidized beds we refer to Balzer et al. (1995) and Gobin et al. (2003).
The partial differential equations are discretized with a second-order centered scheme and the solution is time-advanced by a first-order scheme.
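This combination of orders can be illustrated on a toy 1-D diffusion problem, with a second-order centered stencil in space advanced by a first-order explicit Euler step; this is only a structured-grid sketch, not the unstructured finite-volume implementation of the code:

```python
# Second-order centered Laplacian in space, first-order (explicit Euler)
# advancement in time, for 1-D diffusion with fixed-value boundaries.

def step(u, nu, dx, dt):
    """Advance one time step; boundary values are held fixed."""
    return [u[0]] + [
        u[i] + nu * dt / dx**2 * (u[i + 1] - 2.0 * u[i] + u[i - 1])
        for i in range(1, len(u) - 1)
    ] + [u[-1]]

dx, dt, nu = 0.1, 0.001, 1.0       # stable: nu*dt/dx^2 = 0.1 <= 0.5
u = [0.0] * 5 + [1.0] + [0.0] * 5  # initial concentration spike
for _ in range(100):
    u = step(u, nu, dx, dt)        # the spike spreads symmetrically
```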
The model and the numerical method are adapted to the handling of n phases (in fact n fields), including the single-phase frame. The models and Eulerian transport equations represent an extension of the classical two-phase flow model developed in the frame of the ESTET-ASTRID project and include water-steam closure laws and correlations.
The algorithm is based on an original elliptic fractional step method (see Mechitoua et al. (2002) and Mechitoua et al. (2003)) that leads either to the use of linear solvers or to direct nphas x nphas matrix inversion. The main interest of the method is the so-called "alpha-pressure-energy" step that ensures conservativity of mass and energy and allows strong interface source-term coupling. Mass, momentum and energy equations are coupled with the help of a pressure correction equation within the iterative "alpha-pressure-energy" step. The algorithm allows density variation according to pressure and enthalpy during the computation.
Table 1: Cluster description.

                            C1 (C1a/C1b)            C2 (C2a/C2b)            C3
Computational center        CINES                   CALMIP                  Renault F1 Team
Model                       SGI Altix ICE 8200      SGI Altix ICE 8200 EX   HP
Number of cores             12,288                  2,816                   1,024
Number of nodes             1,536                   352                     128
Processors/node             2                       2                       2
RAM/core (GB)               4                       4.5                     4
Peak performance (TFlop/s)  147                     31.5                    9.4
Processor                   Intel Xeon E5472        Intel Xeon X5560        AMD Shanghai
Processor frequency (GHz)   3.0                     2.8                     2.3
Cores/processor             4                       4                       4
L2 cache                    2 x (6 MB per 2 cores)  4 x 256 KB              4 x 512 KB
L3 cache                    -                       8 MB per 4 cores        6 MB per 4 cores
Network fabric              Infiniband              Infiniband              Infiniband
File system                 Lustre                  Lustre                  LFS
C compiler                  Intel icc 11.1          Intel icc 11.1          gcc 4.1.2
Fortran compiler            Intel ifort 11.1        Intel ifort 11.1        gfortran 4.1.2
MPI                         C1a: Intel MPI 3.2.2    C2a: Intel MPI 3.2.2    OpenMPI 1.2.6
                            C1b: SGI MPT 1.25       C2b: SGI MPT 1.25
Simulation of a three-dimensional granular uniform shear flow

The purpose of the test case is a three-dimensional isothermal granular shear flow. For such a simple configuration, the mathematical model has an analytical solution which can be compared with the numerical solution given by NEPTUNE_CFD.

The computational domain is a cubic box of length Hcube = 0.01 m (Fig. 2). The powder and gas material properties are given in table 2. The particulate phase has the following values:

- particle fluctuating kinetic energy: q_p = 2.66 x 10^-5 m^2.s^-2,
- particle kinetic viscosity: mu_kin,p = 6.21 x 10^-4 kg.m^-1.s^-1,
- particle collisional viscosity: mu_col,p = 9.88 x 10^-4 kg.m^-1.s^-1.

Figure 2: Sketch of the three-dimensional granular uniform shear flow.

The gas phase is laminar and the particle agitation model is accomplished by solving a system of two transport equations: one for the particle turbulent agitation and one for the fluid-particle covariance. According to the large particle to gas density ratio, only the drag force is taken into account. It must be pointed out that gravity is also neglected in this case. Finally, for this test case we solve 10 equations (mass balance, momentum transport, particle turbulence).

Two different meshes have been employed to evaluate the performances: 1,000,000 cells from 1 up to 32 nodes (8 up to 256 cores) and 10,503,459 cells from 16 up to 128 nodes (128 up to 1,024 cores). Fig. 3 shows the standard mesh composed of 1,000,000 hexahedra with uniform cell size (dx = dy = dz = 0.1 mm). The second mesh is also a regular mesh of 10,503,459 hexahedra (219 cells per direction).

The configuration is detailed in Fig. 4. Three kinds of boundary conditions are used:
Table 2: Powder and gas properties.

                          Gas    Particles
Density (kg.m^-3)         1      1,000
Viscosity x 10^5 (Pa.s)   1      -
Pressure (bar)            1      -
Diameter (µm)             -      350
Solid volume fraction     -      0.3
Restitution coefficient   -      0.8
Figure 4: Schematic of boundary conditions.
Table 3: Analytical and numerical result comparison.

                           Analytical    NEPTUNE_CFD
                                         minimum        maximum
alpha_p                    0.30          0.29989        0.30071
q_p (m^2.s^-2)             2.66 x 10^-5  2.6432 x 10^-5  2.6639 x 10^-5
mu_kin,p (kg/m/s)          6.21 x 10^-4  6.1872 x 10^-4  6.2116 x 10^-4
mu_col,p (kg/m/s)          9.88 x 10^-4  9.8784 x 10^-4  9.9085 x 10^-4
An adaptive time step is used (computed from Courant and Fourier criteria), typically about 5 x 10^-4 s. The following iterative solvers have been selected: Jacobi for the velocity, conjugate gradient for the pressure and BiCGSTAB for the volume fraction. The criterion parameter eps_alpha of the "alpha-pressure" step is fixed to 10^-6 and the maximum number of cycles in the "alpha-pressure" step is 50.
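A hedged sketch of such an adaptive step: the 1-D formulas and safety factors below are illustrative assumptions, not the NEPTUNE_CFD implementation, but with the shear-flow orders of magnitude they reproduce the quoted time step:

```python
# Time step limited by Courant (convective) and Fourier (diffusive)
# criteria; the coefficients 1.0 and 0.5 are illustrative choices.

def adaptive_dt(u_max, nu, dx, courant=1.0, fourier=0.5):
    dt_courant = courant * dx / u_max        # u * dt / dx    <= courant
    dt_fourier = fourier * dx**2 / nu        # nu * dt / dx^2 <= fourier
    return min(dt_courant, dt_fourier)

# dx = 0.1 mm cells, |u| ~ 0.1 m/s, gas kinematic viscosity ~ 1e-5 m^2/s:
dt = adaptive_dt(u_max=0.1, nu=1e-5, dx=1e-4)
print(dt)   # ~5e-4 s, the order of the 5 x 10^-4 s quoted above
```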
Results and discussion
After a transient phase, the numerical simulations converge to a stationary state. Fig. 5 shows the temporal evolution of the solid volume fraction between 0.5 s and 2.0 s. The three-dimensional unsteady numerical simulation results agree with the analytical solutions (table 3). The accuracy of the numerical results is independent of the parallelization and of the core number. NEPTUNE_CFD performances are evaluated on a restart simulation of 1,000 additional iterations after the transient step. The "CPU_TIME" intrinsic Fortran subroutine has been used to measure the CPU times per core of the main routines of the code, the CPU times per iteration per core and the effective CPU time per core of the simulation. The measurements of the different times are repeated 3 times for
Figure 3: Mesh containing 1,000,000 hexahedra.
- right side and left side: inlet-outlet condition for particles and gas with imposed velocity U_x = 20 (y - Hcube/2);
- front side and back side: symmetry condition for particles and gas;
- top side and bottom side: moving wall with U_wall = 20 (y_wall - Hcube/2).

The moving wall is a no-slip wall boundary for particles and gas. The inlet boundary condition values for the particle fluctuating kinetic energy and the fluid-particle fluctuating velocity covariance are q_p = 7.5 x 10^-5 m^2.s^-2 and q_gp = 1 x 10^-4 m^2.s^-2.
Table 4: Effect of core number on effective core CPU time for 1,000 iterations with the standard mesh.

number of    Effective CPU time/core (s)
cores        C1a      C1b      C2a      C2b      C3
8            35,621   35,325   12,150   12,462   15,469
16           14,999   14,936   5,628    5,639    7,087
32           5,959    5,877    2,598    2,567    3,254
64           1,571    1,323    1,200    1,127    1,548
128          803      751      563      583      895
256          580      757      449      485      1,269
512          -        -        498      493      -
(a) 0.5 s. (b) 1.0 s.
(c) 1.5 s. (d) 2.0 s.
Figure 5: Temporal convergence of the solid volume fraction between 0.5 s and 2.0 s.
each case to remove accidental and statistical error. No significant difference of measured times is observed. The following analysis is based on the averaged values of the 3 different runs.
First, we focus on strong scaling, which is defined as how the solution time varies with the number of cores for a fixed total problem size. Second, we are interested in weak scaling, that is to say how the solution time varies with the number of processors for a fixed problem size per processor (strong scaling is about solving the same problem in a shorter time; weak scaling is about solving a larger problem in the same time). It is hard to achieve good strong scaling at larger process counts since the communication overhead increases in proportion to the number of processes used.
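The distinction can be made concrete with a toy timing model (the cost coefficients are invented for illustration): compute time divides by n while a communication term grows with n, so strong-scaling speedup stalls while weak-scaling efficiency stays near one:

```python
# Toy model: T(n) = (cells / n) * t_cell + t_comm * sqrt(n).
# t_cell and t_comm are illustrative constants, not measured values.

def run_time(cells, n, t_cell=1e-4, t_comm=0.05):
    return cells / n * t_cell + t_comm * n ** 0.5

work = 1_000_000
t1 = run_time(work, 1)
for n in (8, 64, 512):
    strong = t1 / run_time(work, n)       # same problem on n processes
    weak = t1 / run_time(work * n, n)     # problem grown with n
    print(n, round(strong, 2), round(weak, 3))
# strong speedup falls ever further below n; weak efficiency stays near 1
```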
Figure 6: NEPTUNE_CFD effective CPU time per core
(standard mesh 1,000 iterations).
Figure 8: NEPTUNE_CFD efficiency with respect to T_ref of each cluster (standard mesh - 1,000 iterations).
demonstrates that NEPTUNE_CFD exploits the parallelism very efficiently. From 16 up to 128 cores (2 up to 16 nodes), all the speedups are greater than or equal to the linear speedup (that is, superlinear speedup) and the efficiencies are also greater than or equal to 1.0 (Fig. 8). However, in Fig. 6, we can observe very long effective CPU times on C1a and C1b for low core numbers (up to 32). The ratio between C1 and C2 CPU times is about 1.3 for high core numbers and about 3 for low core numbers. This effect may be attributed to the Harpertown technology and cache limitation. The excess effective CPU time per core using 8 cores leads to overestimated speedup and efficiency: using 128 cores, a speedup of 45 is not realistic when the linear speedup is 16.
To avoid this problem, we define an "absolute" speedup and efficiency with respect to T_ref of C2a (ref = 1 node = 8 cores) as follows:
Figure 7: NEPTUNE_CFD speedup with respect to T_ref of each cluster (standard mesh - 1,000 iterations).
Table 4 shows the effect of core number on effective CPU time per core with the standard mesh; the computation elapsed time is close to this time. The speedup is defined as the ratio between the elapsed time to execute a program on one node (ref = 1 node = 8 cores) and on n concurrent nodes (n x 8 cores), and the efficiency is defined as the ratio between speedup and n:

Sp_n = T_ref / T_n,    E_f = Sp_n / n,
Sp_abs,n = T_ref,C2a / T_n,    E_f,abs = Sp_abs,n / n,

where T_ref,C2a is the execution time of the parallel program on one node (8 cores) on C2a and T_n is the elapsed time of the parallel program on n nodes (n x 8 cores) on the different clusters.
When we use this kind of definition of speedup, the speedup values themselves are less interesting than the evolution of speedup per cluster with core number. The changing slope indicates the optimum core number. Figs. 9 and 10 show the evolution of absolute speedup and efficiency. On all clusters, parallel performances are very good from 8 up to 128 cores. On C1, we again find low performances for few cores. Over 128 cores, the efficiency degradation is significant, especially for C3 where the effective CPU time increases. Moreover, for C1 and C2,
Sp_n = T_ref / T_n,    E_f = Sp_n / n,

where T_ref is the effective core CPU time of the parallel job on one node (8 cores) and T_n is the effective core CPU time of the parallel job on n nodes (n x 8 cores).
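Applying these definitions to the Table 4 measurements for cluster C2a gives, for instance, the following quickly reproducible check (times in seconds from Table 4):

```python
# Speedup Sp_n = T_ref / T_n and efficiency E_f = Sp_n / n computed from
# the Table 4 effective CPU times per core for cluster C2a.

t_c2a = {8: 12150, 16: 5628, 32: 2598, 64: 1200, 128: 563, 256: 449}

t_ref = t_c2a[8]                  # reference: 1 node = 8 cores
for cores, t_n in t_c2a.items():
    n = cores // 8                # number of nodes
    sp = t_ref / t_n
    print(cores, round(sp, 2), round(sp / n, 2))
# e.g. 16 cores: Sp = 2.16, E_f = 1.08 -> superlinear, as Fig. 7 shows
```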
NEPTUNE_CFD effective CPU time per core,
speedup and efficiency of the calculation are depicted
in Fig. 6, 7 and 8 respectively. The continuous line
without symbol indicates linear speedup. The Fig. 7
Figure 9: NEPTUNE_CFD absolute speedup with respect to T_ref of C2a (standard mesh - 1,000 iterations).
Table 5: Effect of core number on core CPU time for
500 iterations with the refined mesh on C2a.
number of cores CPU time/core (s.)
8 109,220
16 55,205
32 28,877
64 14,707
128 7,351
256 3,721
512 1,780
768 1,220
1024 962
while the cell number per core is greater than 10,000, that is to say while the core number is lower than or equal to 128 for a mesh of 1,000,000 cells. On the Nehalem cluster, performances are excellent up to 196 cores. It is noteworthy that the restitution time decreases up to 256 cores (5,000 cells per core). NEPTUNE_CFD scalability is excellent on all platforms, with linear speedup on Harpertown and Shanghai technologies and superlinear speedup on Nehalem technology.
To improve the NEPTUNE_CFD performance evaluation, we performed numerical simulations with the refined mesh (10,503,459 hexahedra) using from 1 up to 128 nodes (8 up to 1,024 cores). As the CPU times are large, these numerical simulations have been realized only on cluster C2a (the most recent cluster) with the Intel MPI implementation. Performances are evaluated on a restart simulation of 500 additional iterations after the transient phase. Time measurements are still done with the "CPU_TIME" intrinsic Fortran subroutine and each run is repeated 3 times. Table 5 shows the effect of core number on effective CPU time per core with the refined mesh. The speedup and the efficiency are defined as previously.
NEPTUNE_CFD effective CPU time per core, speedup and efficiency of the calculation are depicted in Fig. 11, 12 and 13 respectively. The solid line represents the linear speedup. Fig. 12 demonstrates that NEPTUNE_CFD exploits the parallelism very efficiently even for large core numbers. The speedup is superlinear up to 768 cores and linear for 1,024 cores, and the efficiencies are also greater than or equal to 1.0 (Fig. 13). Parallel performances are very good up to 128 nodes. As in the previous case, performance degradation begins when communications become preponderant over computation. We can assume that with this mesh the speedup would be low using 2,048 cores.
This refined mesh study confirms that NEP
TUNE_CFD performances are excellent on Nehalem
Figure 10: NEPTUNE_CFD absolute efficiency with respect to T_ref of C2a (standard mesh - 1,000 iterations).
we can note an MPI implementation effect: SGI MPT seems more efficient for few cores and Intel MPI is better for numerous cores.
The degradation for high core numbers is mainly caused by load imbalance and communication overhead. The restitution time of a simulation is divided into computation time, communication and waiting time (MPI send/receive/barrier) and I/O time. In this case, I/O time is almost zero. An increase of core number corresponds to a decrease of cell number per core, that is to say a decrease of computation time per core. Moreover, more cores imply more communications between cores. When communication time increases faster than calculation time decreases, the core limit number is reached. Thus there is an optimal number of cores for a given mesh and set of equations.
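A toy cost model of this trade-off (the coefficients are invented, not measured) shows why such an optimum exists:

```python
# Elapsed time = computation (~1/n) + communication (grows with n); its
# minimum over n is the "core limit number" discussed above.

def elapsed(cells, n, t_cell=1e-4, t_msg=2e-3):
    return cells / n * t_cell + t_msg * n

cells = 1_000_000
best = min(range(1, 2049), key=lambda n: elapsed(cells, n))
print(best)   # close to sqrt(cells * t_cell / t_msg) ~ 224
```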
For this numerical simulation of a three-dimensional granular uniform shear flow solving 10 equations, NEPTUNE_CFD performances are excellent on all clusters
Figure 13: NEPTUNE_CFD efficiency with ref = 1
node (refined mesh 500 iterations).
Figure 14: MPINSIDE profiling on C2a (standard mesh
1,000 iterations).
is over 90 % of the effective core CPU time. The time in the barrier before MPI_Allreduce (b_allred) is small, and the MPI_Allreduce time (allred) and the Waitall time are almost nil. Other MPI times (overhead, send, receive, ...) are negligible. As the core number increases, the barrier before MPI_Allreduce time, the MPI_Allreduce time and the Waitall time increase while the computational time decreases. Using 64 cores with the standard mesh, computational time is only about 70 % of the core CPU time, and using 512 cores it is about 10 %. With the refined mesh, using 256 cores, computational time is only about 70 % of the core CPU time and using 1,024 cores it is about 25 %.
MPINSIDE gives information about MPI function samples. Using 16 cores on the refined mesh, during 500 iterations, the number of barriers before MPI_Allreduce is about 55,904,240. Using 1,024 cores, it is about 3,574,704,128. Indeed, the NEPTUNE_CFD iterative solvers require a lot of MPI communications. Thus, a lot of messages are exchanged between the cores many times in every iteration. For a high
Figure 11: NEPTUNE_CFD effective CPU time per
core (refined mesh 500 iterations).
Figure 12: NEPTUNE_CFD speedup with ref = 1 node
(refined mesh 500 iterations).
cluster (C2) while the cell number per core is higher than 10,000, that is to say, for a mesh of 10,503,459 cells, a core number lower than or equal to 1,024. Hence, NEPTUNE_CFD strong scaling is excellent up to 1,024 cores provided the minimum cell number per core is respected.
To understand these performances and how the time divides according to core number, NEPTUNE_CFD profiling is necessary. We use MPINSIDE to evaluate the MPI communications and PerfSuite to get NEPTUNE_CFD timing information.
MPINSIDE slightly increases the elapsed time of the numerical simulations but gives the right percentages of elapsed time used by computation and by the different MPI processes. The main MPI function times and the computational time are depicted in Fig. 14 and 15 for the standard and refined meshes for different core numbers.
Using 8 cores, with the two meshes, computation time
Figure 15: MPINSIDE profiling on C2a (refined mesh 
500 iterations).
Figure 16: PerfSuite routine profiling on C2a (refined
mesh 500 iterations).
core number, communication time is important. MPINSIDE confirms that the MPI communications due to the iterative solvers and the "alpha-pressure" step will be the main limitations to NEPTUNE_CFD massively parallel computation.
PerfSuite is used to identify the most frequently called routines and functions (sample profiling). Fig. 16 shows the distribution of calls for the main "routines" as a function of core number. The term "routines" here covers the Intel MPI functions, the function "parcom" (call of MPI functions in the Fortran code), scheduling, the routine "prodsc" (Fortran scalar product) and the routine "gradco". For low core numbers, the percentage of NEPTUNE_CFD routine samples for "gradco" and "prodsc" is about 90 %. For high core numbers (above 256), the percentage of Intel MPI functions and scheduling is significant and becomes predominant for 1,024 cores.
PerfSuite enables following the number of calls to the code, to the "libc" library and to the Intel MPI library (Fig. 17). The increase of core number strongly reduces the percentage of the NEPTUNE_CFD call number. This sample profiling can be compared with a time profiling included in NEPTUNE_CFD. This profiling
Figure 17: PerfSuite module profiling on C2a (refined
mesh 500 iterations).
Table 6: NEPTUNE_CFD integrated profiling: details of one iteration with the refined mesh using 16 cores on C2a.

step                      CPU time (s)
post-processing           0.00
initialization            1.35
velocity prediction       5.41
"alpha-pressure"          109.70
  gradco routine          102.92
  jacobi routine          1.61
  bicgstab routine        1.19
Total time for iteration  120
gives, per iteration and per core, the CPU time and the main algorithm step times. MPI process times are included in the routine calls. In table 6, we can see that with the refined mesh using 16 cores, one iteration requires a core CPU time of 120 s, mainly spent in the "alpha-pressure" step (109 s), which is itself mainly spent in the conjugate gradient (103 s). For the same iteration using 1,024 cores, the core CPU time of 2.14 s is mainly spent in the "alpha-pressure" step (1.88 s), which is mainly spent in the conjugate gradient (1.76 s). Between 80 % and 90 % of the effective CPU time is spent in the matrix-inversion step (iterative solvers).
A complementary aspect of performance evaluation is weak scaling. To respect a cell number per core of about 20,000, we have created seven regular meshes to evaluate performance from 8 up to 512 cores. Table 7 details the different meshes according to the core number. NEPTUNE_CFD performances are evaluated on a restart simulation of 2,500 additional iterations after the transient step on C2a.
In Fig. 18, the weak scaling of NEPTUNE_CFD is shown. The effective CPU times per core are almost constant. A weak scaling efficiency E_f,weak can be de-
Table 7: NEPTUNE_CFD weak scaling: different meshes according to core number on C2a.

hexahedra number    Core number    Cells per core
166,375             8              20,797
328,509             16             20,532
658,503             32             20,578
1,295,029           64             20,235
2,515,456           128            19,652
5,177,717           256            20,225
10,503,459          512            20,515
Figure 19: NEPTUNE_CFD weak scaling efficiency for 2,500 iterations (20,000 cells per core).
Table 8: Powder and gas properties of the industrial reactor.

                          Gas     Particles
Density (kg/m^3)          21.11   850
Viscosity x 10^5 (Pa.s)   1.54    -
Pressure (bar)            20      -
Vf (m.s^-1)               0.67    -
Mean diameter (µm)        -       1,600
Solid mass (kg)           -       80,000
Restitution coefficient   -       0.9
meters wide as shown by Fig. 21. This bulb is a hemispherical dome with two smokestacks for the gas outlet. The powder properties and operating points are given in table 8. In the industrial process, the particle phase is polydispersed; however, the numerical simulations have been carried out with monodisperse particles having the median diameter.
For the gas turbulence modeling, we use a standard k-epsilon model extended to multiphase flows, accounting for additional source terms due to the interfacial interactions. For the dispersed phase, a coupled system of transport equations is solved for the particle fluctuating kinetic energy and the fluid-particle fluctuating velocity covariance. According to the large particle to gas density ratio, only the drag force is accounted as acting on the particles. Finally, we solve 12 equations (mass balance, momentum transport, gas and particle turbulence).
The three-dimensional mesh is shown by Fig. 20. The mesh, based on an O-grid technique, contains 3,150,716 hexahedra with approximately dx = dy = 30 mm and dz = 90 mm. At the bottom, the fluidization grid is an inlet for the gas with an imposed superficial velocity corresponding to the fluidization velocity Vf = 0.67 m.s^-1. For the particles this section is a wall. At the top of the fluidized
Figure 18: NEPTUNE_CFD effective CPU time per
core for 2,500 iterations (20,000 cells per core)
fined as following:
TTre TsCOreS
T, ,
Ef weal
where Tre; is the elapsed time of the parallel job on one
node with smallest mesh and T, is the elapsed time of
the parallel job on n nodes with the respective meshes.
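For instance, with made-up effective CPU times for three of the Table 7 meshes (the real values are only plotted in Fig. 18), the definition reads:

```python
# Weak scaling efficiency E_f,weak = T_ref / T_n; the times below are
# hypothetical placeholders, not the measured Fig. 18 values.

t_weak = {8: 10000, 64: 9800, 512: 9500}   # cores -> elapsed time (s)

t_ref = t_weak[8]
for cores, t_n in t_weak.items():
    print(cores, round(t_ref / t_n, 3))
# an efficiency above 1 means the larger run finished relatively faster,
# consistent with the "larger than 1" values reported for Fig. 19
```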
Fig. 19 shows that this weak scaling efficiency is very good up to 512 cores (larger than 1). This test case of granular shear flow exhibits the accuracy of the NEPTUNE_CFD results and the excellent scalability of the code for large enough problem sizes.
Simulation of an industrial gas-solid fluidized bed

The purpose of this test case is a three-dimensional unsteady dense fluidized bed at industrial scale. NEPTUNE_CFD V1.07@Tlse is dedicated to this kind of numerical simulation (Gobin et al. (2003)). The industrial reactor is composed of a cylinder and a bulb at the top. The reactor is about 30 meters high and 5
Figure 20: Three-dimensional mesh containing 3,150,716 hexahedra.
bed reactor, we defined two free outlets for both the gas and the particles. The wall-type boundary condition is friction for the gas and slip for the particles.
Results and discussion
Fig. 22(a) shows the automatic domain decomposition performed by METIS on the industrial reactor. Fig. 22(b) shows the instantaneous solid volume fraction field at 16.5 s.
NEPTUNE_CFD performances are evaluated on a restart simulation of 250 additional iterations after a transient phase corresponding to the destabilization of the fluidized bed (15 s). Numerical simulations have been carried out only on cluster C2a (Nehalem architecture with Intel MPI).
As for the shear flow, the "CPU_TIME" Fortran intrinsic subroutine has been used to measure the different times per core.
Table 9 shows the effect of core number on effective CPU time per core.
An adaptive time step is used (computed from Courant and Fourier criteria), typically about 1.10^-5 s. The following iterative solvers have been selected: Jacobi for the velocity, conjugate gradient for the pressure and BiCGStab for the volume fraction. The convergence criterion of the "Alpha-Pressure" step is fixed to 10^-6 and the maximum number of cycles in the "Alpha-Pressure" step is 50.

Figure 21: Sketch of the industrial polymerization reactor.

Table 9: Effect of core number on core CPU time on C2a (industrial fluidized bed).

number of cores    CPU time/core (s)
      8                68,479
     16                40,885
     32                19,233
     64                 8,625
    128                 4,129
    256                 1,707
    384                 1,275
    512                 1,173
    768                 1,190
   1024                 1,220

As described above, we calculate speedup and efficiency. NEPTUNE_CFD effective CPU time per core, speedup and efficiency of this calculation are depicted in Fig. 23, 24 and 25 respectively. The continuous line without symbol indicates linear speedup and ideal efficiency. Fig. 24 demonstrates that NEPTUNE_CFD exploits the parallelism very efficiently. From 16 up to 384 cores (48 nodes), the speedup is superlinear and the efficiencies are greater than 1.0 (Fig. 25).
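As a cross-check, speedup and efficiency can be recomputed from the Table 9 values with a short Python sketch. Two assumptions are made here: the reference is one node (8 cores), as in Fig. 24, and Table 9 lists effective CPU time per core, which may differ slightly from the elapsed times behind Fig. 24 and 25.

```python
# Effective CPU time per core (s) from Table 9, C2a cluster.
cpu_time = {8: 68479, 16: 40885, 32: 19233, 64: 8625, 128: 4129,
            256: 1707, 384: 1275, 512: 1173, 768: 1190, 1024: 1220}

REF_CORES = 8  # one node on C2a, the reference of Fig. 24 and 25

def speedup(n: int) -> float:
    """Speedup relative to the one-node (8-core) reference run."""
    return cpu_time[REF_CORES] / cpu_time[n]

def efficiency(n: int) -> float:
    """Parallel efficiency: measured speedup divided by ideal speedup."""
    return speedup(n) / (n / REF_CORES)

for n in sorted(cpu_time):
    print(f"{n:5d} cores: speedup {speedup(n):6.1f}, efficiency {efficiency(n):5.2f}")
```

With these per-core CPU times, the efficiency computed for 256 and 384 cores comes out above 1, consistent with the superlinear speedup reported in Fig. 24.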
Parallel performances are very good up to 384 cores, and the restitution time keeps decreasing up to 512 cores. Beyond this value, increasing the core number brings no benefit, although the restitution time does not increase significantly: the increase of communication time is balanced by the decrease of computational time.
Figure 22: Industrial fluidized bed. (a) Automatic domain decomposition using METIS. (b) Instantaneous solid volume fraction field at 16.5 s.

In the same way, for this realistic industrial dense fluidized bed solving 12 equations, NEPTUNE_CFD HPC performance is excellent on the Nehalem cluster (C2a) as long as the cell number per core is higher than 10,000, that is to say, for the mesh of 3,150,716 hexahedra, a core number lower than or equal to 384. If we consider the "cost" of the numerical simulation in terms of CPU time, the optimum performances are obtained with 256 cores.
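The 256-core optimum can be checked directly from the Table 9 timings, taking the simulation "cost" as core count times CPU time per core (a simplifying assumption for the cost metric):

```python
# CPU time per core (s) from Table 9 (industrial fluidized bed, C2a).
cpu_time = {8: 68479, 16: 40885, 32: 19233, 64: 8625, 128: 4129,
            256: 1707, 384: 1275, 512: 1173, 768: 1190, 1024: 1220}

# Total "cost" in core-seconds: cores x CPU time per core.
cost = {n: n * t for n, t in cpu_time.items()}

best = min(cost, key=cost.get)
print(f"cheapest run: {best} cores ({cost[best]:,} core-seconds)")
```

The minimum indeed falls at 256 cores (436,992 core-seconds), matching the optimum reported above.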
This test case illustrates the excellent NEPTUNE_CFD scalability performances on an industrial geometry. Strong scaling is good for large enough problem sizes (up to 384 cores with this mesh).
Conclusion
Three-dimensional numerical simulations of two different cases of dense gas-particle flows have been performed to evaluate NEPTUNE_CFD parallel performances. The first one demonstrates the accuracy of the numerical results and the high NEPTUNE_CFD parallel computing performances up to 1,024 cores. The second one, a realistic fluidized bed, shows NEPTUNE_CFD efficiency and scalability on an industrial geometry. The results show linear speedup using Harpertown and \II.IIg.sI technologies and superlinear speedup with Nehalem technology. For the two test cases, a minimum of 10,000 mesh cells per core is needed and 20,000 is recommended. A next step of the evaluation of NEPTUNE_CFD performances will focus on the effect of the number of phases (fluids) and of the number of solved equations on the
minimum cell number per core. On the other hand, we will evaluate the performances using several thousands of cores, as realized on the open source CFD code Code_Saturne by Sunderland et al. (2008).

Figure 24: NEPTUNE_CFD speedup with ref = 1 node on C2a (250 iterations) on an industrial reactor.
Moreover, the profilings have shown that most of the CPU time is spent in the matrix inversion step of the iterative approach. The upcoming implementation of multigrid in NEPTUNE_CFD should strongly improve the global performances of the code.
Acknowledgments
This work was granted access to the HPC resources of CALMIP (Calcul en Midi-Pyrénées) under the allocation P0111. This work was also granted access to the HPC resources of CINES under the allocation 2010-026012 made by GENCI (Grand Équipement National de Calcul Intensif).
Figure 25: NEPTUNE_CFD efficiency with ref = 1 node on C2a (250 iterations) on an industrial reactor.
References
G. Balzer, A. Boelle, and O. Simonin. Eulerian gas-solid flow modelling of dense fluidized bed. Fluidization VIII, Proc. International Symposium of the Engineering Foundation, J.F. Large and C. Laguérie (Editors), pages 409-418, 1995.

A. Gobin, H. Neau, O. Simonin, J.R. Llinas, V. Reiling, and J.L. Sélo. Fluid dynamic numerical simulation of a gas phase polymerization reactor. Int. J. for Num. Methods in Fluids, 43, pages 1199-1220, 2003.

G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, Vol. 20, No. 1, pages 359-392, 1999.

N. Mechitoua, M. Boucker, S. Mimouni, S. Pigny, and G. Serre. Numerical simulation of multiphase flow with an elliptic oriented fractional step method. Third International Symposium on Finite Volumes for Complex Applications, Porquerolles, France, 2002.

N. Mechitoua, M. Boucker, J. Laviéville, J.M. Hérard, S. Pigny, and G. Serre. An unstructured finite volume solver for two-phase water/vapour flows modelling based on elliptic oriented fractional step method. NURETH-10, Seoul, South Korea, 2003.

T. Randrianarivelo, H. Neau, O. Simonin, and F. Nicolas. 3D unsteady polydispersed simulation of uranium hexafluoride production in a fluidized bed pilot. In Proc. 6th Int. Conf. on Multiphase Flow, Leipzig (Germany), paper S6_Thu_46, 2007.

O. Simonin. Prediction of the dispersed phase turbulence in particle-laden jets. Gas-Solid Flows-1991, ASME FED, Vol. 121, pages 197-206, 1991.
O. Simonin. Statistical and continuum modelling of turbulent reactive particulate flows. Theoretical and Experimental Modeling of Particulate Flows, Lecture Series 2000-06, Von Karman Institute for Fluid Dynamics, Rhode-Saint-Genèse, Belgium, 2000.

A.G. Sunderland, M. Ashworth, N. Li, C. Moulinec, Y. Fournier, and J. Uribe. Towards petascale computing with parallel CFD codes. Parallel CFD 2008, Lyon, France, 2008.
