Group Title: 7th International Conference on Multiphase Flow - ICMF 2010 Proceedings
Title: P1.57 - NEPTUNE_CFD High Parallel Computing Performances for Particle-Laden Reactive Flows
Permanent Link: http://ufdc.ufl.edu/UF00102023/00456
 Material Information
Title: P1.57 - NEPTUNE_CFD High Parallel Computing Performances for Particle-Laden Reactive Flows Computational Techniques for Multiphase Flows
Series Title: 7th International Conference on Multiphase Flow - ICMF 2010 Proceedings
Physical Description: Conference Papers
Creator: Neau, H.
Laviéville, J.
Simonin, O.
Publisher: International Conference on Multiphase Flow (ICMF)
Publication Date: June 4, 2010
 Subjects
Subject: high performance computing
parallel computation
multiphase flow
Euler multi-fluid approach
 Notes
Abstract: This paper presents the high performance computing capabilities of NEPTUNE_CFD V1.07@Tlse. NEPTUNE_CFD is an unstructured parallelized code (MPI) using an unsteady Eulerian multi-fluid approach for dilute and dense particle-laden reactive flows. Three-dimensional numerical simulations of two test cases have been carried out. The first one, a uniform granular shear flow, exhibits the excellent scalability of NEPTUNE_CFD up to 1,024 cores and demonstrates the good agreement between the parallel simulation results and the analytical solutions. Strong scaling and weak scaling benchmarks have been performed. The second test case, a realistic dense fluidized bed, shows the code's computing performances on an industrial geometry.
General Note: The International Conference on Multiphase Flow (ICMF) first was held in Tsukuba, Japan in 1991 and the second ICMF took place in Kyoto, Japan in 1995. During this conference, it was decided to establish an International Governing Board which oversees the major aspects of the conference and makes decisions about future conference locations. Due to the great importance of the field, it was furthermore decided to hold the conference every three years successively in Asia including Australia, Europe including Africa, Russia and the Near East and America. Hence, ICMF 1998 was held in Lyon, France, ICMF 2001 in New Orleans, USA, ICMF 2004 in Yokohama, Japan, and ICMF 2007 in Leipzig, Germany. ICMF-2010 is devoted to all aspects of Multiphase Flow. Researchers from all over the world gathered in order to introduce their recent advances in the field and thereby promote the exchange of new ideas, results and techniques. The conference is a key event in Multiphase Flow and supports the advancement of science in this very important field. The major research topics relevant for the conference are as follows: Bio-Fluid Dynamics; Boiling; Bubbly Flows; Cavitation; Colloidal and Suspension Dynamics; Collision, Agglomeration and Breakup; Computational Techniques for Multiphase Flows; Droplet Flows; Environmental and Geophysical Flows; Experimental Methods for Multiphase Flows; Fluidized and Circulating Fluidized Beds; Fluid Structure Interactions; Granular Media; Industrial Applications; Instabilities; Interfacial Flows; Micro and Nano-Scale Multiphase Flows; Microgravity in Two-Phase Flow; Multiphase Flows with Heat and Mass Transfer; Non-Newtonian Multiphase Flows; Particle-Laden Flows; Particle, Bubble and Drop Dynamics; Reactive Multiphase Flows
 Record Information
Bibliographic ID: UF00102023
Volume ID: VID00456
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: P157-Neau-ICMF2010.pdf

Full Text



7th International Conference on Multiphase Flow,
ICMF 2010, Tampa, FL, May 30 - June 4, 2010


NEPTUNE_CFD High Parallel Computing Performances for Particle-Laden
Reactive Flows


H. Neau*, J. Laviéville†, O. Simonin*

* Université de Toulouse; INPT, UPS; IMFT; Allée Camille Soula, F-31400 Toulouse, France

CNRS; Institut de Mécanique des Fluides de Toulouse; F-31400 Toulouse, France

† EDF Recherche & Développement; DMFEE; F-78401 Chatou, France

neau@imft.fr

Keywords: High Performance Computing, Parallel computation, Multiphase flows, Euler multi-fluid approach




Abstract

This paper presents the high performance computing capabilities of NEPTUNE_CFD V1.07@Tlse. NEPTUNE_CFD is an unstructured parallelized code (MPI) using an unsteady Eulerian multi-fluid approach for dilute and dense particle-laden reactive flows. Three-dimensional numerical simulations of two test cases have been carried out. The first one, a uniform granular shear flow, exhibits the excellent scalability of NEPTUNE_CFD up to 1,024 cores and demonstrates the good agreement between the parallel simulation results and the analytical solutions. Strong scaling and weak scaling benchmarks have been performed. The second test case, a realistic dense fluidized bed, shows the code's computing performances on an industrial geometry.


Nomenclature

Roman symbols
CPU      Central Processing Unit
Ef       Efficiency
Ef_weak  Weak scaling efficiency
g        Gravity acceleration (m.s-2)
H_bed    Bed height (m)
HPC      High Performance Computing
MPI      Message Passing Interface
P_g      Mean gas pressure (N.m-2)
q_p      Mean particle agitation (m2.s-2)
Sp       Speedup
T_ref    Job execution time on one node (8 cores)
T_n      Job execution time on n nodes (n x 8 cores)
U_k,i    Mean velocity of phase k (m.s-1)
u'_k,i   Fluctuating velocity of phase k (m.s-1)

Greek symbols
μ_g      Gas viscosity (kg.m-1.s-1)
ρ_k      Density of phase k (kg.m-3)

Subscripts
g        Gas
p        Particle
i        ith component of a vector


Introduction

Dilute and dense particle-laden reactive flows are used in a wide range of industrial applications such as coal combustion, catalytic polymerization or uranium fluorination. A few years ago, three-dimensional numerical simulations of such flows, especially fluidized beds, were restricted to coarse meshes because of the limitations of computing power (CPU clock speed, memory, hard drive). Recent improvements in high performance computing have removed this limitation. On the hardware side, processor frequencies have remained almost flat in recent years, but the development of multi-core technology and the increase of memory cache size and of memory and interconnect frequencies have led to highly parallel and efficient computing platforms. On the software side, this computational power is exploited through the parallelization of codes (message passing). Hence, previously intractable problems can now be solved. The strong development of HPC allows the use of finer and larger meshes and increases the accuracy of numerical simulations.

For a few years now, realistic three-dimensional numerical simulations of industrial configurations have been carried out with the unsteady Eulerian multi-fluid modeling approach for monodispersed or polydispersed reactive particle mixtures (Randrianarivelo et al. (2007)).


This approach has been developed and implemented by IMFT (Institut de Mécanique des Fluides de Toulouse) in the NEPTUNE_CFD V1.07@Tlse version. NEPTUNE_CFD is an unstructured parallelized multiphase flow software developed in the framework of the NEPTUNE project, financially supported by CEA (Commissariat à l'Énergie Atomique), EDF (Électricité de France), IRSN (Institut de Radioprotection et de Sûreté Nucléaire) and AREVA-NP. This code takes into account complex phenomena: particle mixtures, particle-fluid interactions, particle-particle and particle-wall collisions, heat and mass transfers and chemical reactions.

In this study, three-dimensional numerical simulations of two test cases have been carried out: a uniform granular shear flow and a dense gas-solid fluidized bed at industrial scale. These simulations enable the evaluation of the high performance computing capabilities of NEPTUNE_CFD on three clusters with different technologies and architectures, and with different message passing libraries. The parallel structuring of the computational procedure, the speedups, the efficiency and their dependence upon the size of the problem and the number of cores are discussed.

NEPTUNE_CFD overview

NEPTUNE_CFD is dedicated to calculating multiphase
or multi-field flows, at the local scale and in geometries
that may be complex. The software has the following
main characteristics and functions:

flow systems processed:
1 to 20 fluid fields (phases or fields),
processing of water/steam flows with actual
thermodynamic laws;

numerical methods:
meshes with all types of cell (element), non-
conforming connections,
"cell-center" type finite volume method,
calculation of co-localized gradients with re-
construction methods,
distributed-memory parallelism by domain
splitting;

physical models:
interfacial momentum transfer terms,
interfacial energy transfer terms,
turbulence,
head losses and porosity,
additional scalar;

architecture:
interfacing with the Code_Saturne Enveloppe module for management of the preprocessing operations on the mesh, of the parallelism and of the post-processing files,
written in Fortran 77 (the majority) and C (ANSI 1989) (procedural programming),
ready for Unix and Linux platforms.

Figure 1: NEPTUNE_CFD modules.

The method of parallel computing used by NEPTUNE_CFD is known as domain decomposition, in which the geometry and the associated fields are broken into pieces and allocated to separate cores. The process of parallel computation involves the decomposition of the mesh and fields, running the application in parallel, and post-processing. The parallel run can use public or vendor implementations of the message passing library (MPI).
NEPTUNE_CFD is composed of two main elements:

the Kernel module, which is the numerical solver;

the Enveloppe module, which is in charge of mesh data reading, mesh pasting (allowing non-matching meshes), domain decomposition for parallel computing using METIS (Karypis and Kumar (1999)), definition of periodicity boundary conditions, and post-processing.

NEPTUNE_CFD also relies on one compulsory library
(under LGPL license): BFT (Base Functions and Types)
for memory and input/output management as well as
specific utilities.
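To make the decomposition-plus-message-passing pattern described above concrete, the minimal sketch below (in C, with hypothetical names; it is not NEPTUNE_CFD source) shows how each MPI rank owns one block of cells while global quantities are assembled by a collective reduction.

```c
/* Minimal sketch of the domain-decomposition pattern: each MPI rank owns a
 * contiguous block of cells and global quantities are obtained by reduction.
 * Hypothetical example, not NEPTUNE_CFD code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const long n_global = 1000000;      /* total number of cells (standard mesh) */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block partition of the global cell index range; in the real code this
     * role is played by the METIS partitioning of the unstructured mesh. */
    long begin = rank * n_global / size;
    long end   = (rank + 1) * n_global / size;

    /* Each rank works only on its own cells ... */
    double local_sum = 0.0;
    for (long c = begin; c < end; ++c)
        local_sum += 1.0;               /* placeholder for a per-cell residual */

    /* ... and any global information (e.g. a convergence criterion over the
     * whole domain) requires a collective communication. */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("cells per core: %ld, global sum: %g\n", end - begin, global_sum);

    MPI_Finalize();
    return 0;
}
```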

Mathematical models and numerical methods

The unsteady Eulerian multi-fluid approach is derived from a joint fluid-particle PDF equation, which yields the transport equations for the mass, momentum and agitation of the particle phases (Simonin (2000)).









In the proposed modeling approach, separate mean transport equations (mass, momentum and fluctuating kinetic energy) are solved for each phase and coupled through inter-phase transfer terms.

In the following, the subscript k = g refers to the gas phase and k = p to the particulate phase. The mass balance equation is

\frac{\partial}{\partial t}(\alpha_k \rho_k) + \frac{\partial}{\partial x_j}(\alpha_k \rho_k U_{k,j}) = 0,    (1)

where \alpha_k is the kth phase volume fraction, \rho_k its density and U_{k,j} the jth component of its mean velocity. In (1) the right-hand side is equal to zero since no mass transfer takes place.

The mean momentum transport equation takes the following form:

\alpha_k \rho_k \left[ \frac{\partial U_{k,i}}{\partial t} + U_{k,j} \frac{\partial U_{k,i}}{\partial x_j} \right] = -\alpha_k \frac{\partial P_g}{\partial x_i} + \alpha_k \rho_k g_i + I_{k,i} + \frac{\partial}{\partial x_j}\left[ -\alpha_k \rho_k \langle u'_{k,i} u'_{k,j} \rangle + \Theta_{k,ij} \right],    (2)

where u'_{k,i} is the fluctuating part of the instantaneous velocity of phase k, P_g is the mean gas pressure, g_i the ith component of the gravity acceleration and I_{k,i} the mean gas-particle interphase momentum transfer without the mean gas pressure contribution. Finally, \Theta_{k,ij} is the molecular viscous tensor for k = g and the collisional particle stress tensor for k = p. For details on dilute flows we refer to Simonin (1991), and on fluidized beds to Balzer et al. (1995) and Gobin et al. (2003).

The partial differential equations are discretized with a second-order centered scheme and the solution is advanced in time by a first-order scheme.

The model and the numerical method are adapted to the handling of n phases (in fact n fields), including the single-phase frame. The models and Eulerian transport equations represent an extension of the classical two-phase flow model developed in the frame of the ESTET-ASTRID project and include water-steam closure laws and correlations.

The algorithm is based on an original elliptic fractional step method (see Mechitoua et al. (2002) and Mechitoua et al. (2003)) that leads either to the use of linear solvers or to a direct nphas x nphas matrix inversion. The main interest of the method is the so-called "alpha-pressure-energy" step, which ensures the conservativity of mass and energy and allows strong interface source term coupling. The mass, momentum and energy equations are coupled with the help of a pressure correction equation within the iterative "alpha-pressure-energy" step. The algorithm allows density variations according to pressure and enthalpy during the computation.

The momentum balance equations are solved with a semi-implicit method. They are split into fractional steps: explicit balance, velocity implicit increment prediction, "alpha-pressure-energy" implicit increment prediction, and final velocity correction.

The "alpha-pressure-energy" step stops after the mass conservation sub-step, when the volume conservation holds. The user can adapt the criterion parameter \varepsilon, but this criterion remains very severe as it applies to a maximum value over the whole domain:

\max_{\text{cells}} \left| \sum_k \alpha_k - 1 \right| < \varepsilon.    (3)

The standard value of \varepsilon is 10^{-5}.

Because of the implicit formulation and the three-dimensional unstructured meshes, we use iterative solvers:

for the pressure: conjugate gradient or bi-conjugate gradient stabilized (bi-cgstab),

for the volume fraction: bi-conjugate gradient stabilized or Jacobi,

for the velocity: Jacobi.
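As an illustration of the simplest of these solvers, the sketch below shows a generic Jacobi sweep for a sparse linear system stored in CSR form. It is a textbook version with our own naming (jacobi_csr, a separately stored diagonal), not the NEPTUNE_CFD implementation.

```c
/* Generic Jacobi iteration for A x = b with A in CSR format (hypothetical
 * helper, not NEPTUNE_CFD source). Iterates until the largest update is
 * below tol or max_iter sweeps have been done. */
#include <math.h>
#include <stdlib.h>

void jacobi_csr(int n, const int *row_ptr, const int *col_idx,
                const double *val, const double *diag,
                const double *b, double *x, double tol, int max_iter)
{
    double *x_new = malloc((size_t)n * sizeof(double));

    for (int it = 0; it < max_iter; ++it) {
        double max_diff = 0.0;
        for (int i = 0; i < n; ++i) {
            double sigma = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
                if (col_idx[k] != i)               /* off-diagonal contributions */
                    sigma += val[k] * x[col_idx[k]];
            x_new[i] = (b[i] - sigma) / diag[i];   /* requires a non-zero diagonal */
            double d = fabs(x_new[i] - x[i]);
            if (d > max_diff) max_diff = d;
        }
        for (int i = 0; i < n; ++i) x[i] = x_new[i];
        if (max_diff < tol) break;                 /* converged */
    }
    free(x_new);
}
```

In a distributed-memory run, the convergence test would itself require a global reduction (MPI_Allreduce) at every sweep, which is one reason why the solver steps dominate the MPI traffic discussed in the profiling section below.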


Parallel computer description

The numerical simulations have been carried out on 3 clusters:

C1: SGI Altix ICE 8200 based on Intel Xeon quad-core processors (Harpertown technology),

C2: SGI Altix ICE 8200 EX based on Intel Xeon X5560 quad-core processors (Nehalem EP technology),

C3: a cluster based on AMD Shanghai quad-core processors.

The complete description of the clusters is given in table 1. These clusters, among the most efficient in France, are based on different technologies, different file systems and different interconnects. The aim is to evaluate the NEPTUNE_CFD parallel performances (speedup and efficiency) on different architectures, not to compare the clusters themselves in terms of elapsed time. For clusters C1 and C2, we test two implementations of MPI, Intel MPI and SGI MPT (C1a, C1b, C2a and C2b), and we use the Intel Fortran and C compilers. The numerical simulations on cluster C3 are carried out with the GNU Fortran and C compilers and OpenMPI. All clusters were in production.













Table 1: Cluster description.
                              C1a / C1b                  C2a / C2b                  C3
Computational center          CINES                      CALMIP                     Renault F1 Team
Model                         SGI Altix ICE 8200         SGI Altix ICE 8200 EX      HP
Number of cores               12,288                     2,816                      1,024
Number of nodes               1,536                      352                        128
Processors/node               2                          2                          2
RAM/core (GB)                 4                          4.5                        4
Peak performance (TFlop/s)    147                        31.5                       9.4
Processor                     Intel Xeon E5472           Intel Xeon X5560           AMD Shanghai
Processor frequency (GHz)     3.0                        2.8                        2.3
Cores/processor               4                          4                          4
L2 cache                      2 x (6 MB per 2 cores)     4 x 256 KB                 4 x 512 KB
L3 cache                      -                          8 MB per 4 cores           6 MB per 4 cores
Network fabric                Infiniband                 Infiniband                 Infiniband
File system                   Lustre                     Lustre                     LFS
C compiler                    Intel icc 11.1             Intel icc 11.1             gcc 4.1.2
Fortran compiler              Intel ifort 11.1           Intel ifort 11.1           gfortran 4.1.2
MPI                           Intel MPI 3.2.2 (C1a),     Intel MPI 3.2.2 (C2a),     OpenMPI 1.2.6
                              SGI MPT 1.25 (C1b)         SGI MPT 1.25 (C2b)
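As a side check of table 1 (our own arithmetic, assuming 4 double-precision floating-point operations per core and per clock cycle, which is consistent with the three processor generations listed), the peak performances can be recovered from the core counts and frequencies:

```c
/* Cross-check of the peak performances quoted in table 1, assuming 4
 * double-precision flops per core and per cycle (SSE add + multiply). */
#include <stdio.h>

int main(void)
{
    const char  *name[]  = { "C1 (Harpertown)", "C2 (Nehalem EP)", "C3 (Shanghai)" };
    const double ghz[]   = { 3.0, 2.8, 2.3 };      /* processor frequency, GHz */
    const int    cores[] = { 12288, 2816, 1024 };  /* total cores per cluster */

    for (int i = 0; i < 3; ++i) {
        double tflops = cores[i] * ghz[i] * 4.0 / 1000.0;  /* 4 flops/cycle/core */
        printf("%-16s %8.1f TFlop/s peak\n", name[i], tflops);
    }
    /* Gives about 147.5, 31.5 and 9.4 TFlop/s, in line with table 1. */
    return 0;
}
```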




Simulation of a three-dimensional granular uniform shear flow

The purpose of this test case is a three-dimensional isothermal granular shear flow. For such a simple configuration, the mathematical model has an analytical solution which can be compared with the numerical solution given by NEPTUNE_CFD.

The computational domain is a cubic box of length H_cube = 0.01 m (Fig. 2). The powder and gas material properties are given in table 2. The particulate phase has the following values:

particle fluctuating kinetic energy: q_p = 2.66 x 10^-5 m2.s-2,

particle kinetic viscosity: μ_kin,p = 6.21 x 10^-4 kg.m-1.s-1,

particle collisional viscosity: μ_col,p = 9.88 x 10^-4 kg.m-1.s-1.

Figure 2: Sketch of the three-dimensional granular uniform shear flow (cubic box of side 0.01 m).

Table 2: Powder and gas properties.
                            Gas      Particles
Density (kg.m-3)            1        1,000
Viscosity x 10^5 (Pa.s)     1        -
Pressure (bar)              1        -
Diameter (μm)               -        350
Solid volume fraction       -        0.3
Restitution coefficient     -        0.8

The gas phase is laminar and the particle agitation is modeled by solving a system of two transport equations: one for the particle turbulent agitation and one for the fluid-particle covariance. Owing to the large particle to gas density ratio, only the drag force is taken into account. It must be pointed out that gravity is also neglected in this case. Finally, for this test case we solve 10 equations (mass balance, momentum transport, particle turbulence).

Two different meshes have been employed to evaluate the performances: 1,000,000 cells from 1 up to 32 nodes, corresponding to 8 up to 256 cores, and 10,503,459 cells from 16 up to 128 nodes, corresponding to 128 up to 1,024 cores. Fig. 3 shows the standard mesh, composed of 1,000,000 hexahedra with uniform cell size (Δx = Δy = Δz = 0.1 mm). The second mesh is also a regular mesh of 10,503,459 hexahedra (219 cells per direction).

Figure 3: Mesh containing 1,000,000 hexahedra.

The configuration is detailed in Fig. 4. Three kinds of boundary conditions are used:

right side and left side: inlet-outlet condition for particles and gas with the imposed velocity U = 20 (y - H_cube/2);

front side and back side: symmetry condition for particles and gas;

top side and bottom side: moving wall, with U_wall = 20 (y_wall - H_cube/2), i.e. walls moving at 0.1 m/s.

The moving walls are no-slip boundaries for both the particles and the gas. The inlet boundary condition values for the particle fluctuating kinetic energy and the fluid-particle fluctuating velocity covariance are q_p = 7.5 x 10^-5 m2.s-2 and q_gp = 1.0 x 10^-4 m2.s-2.

Figure 4: Schematic of the boundary conditions (moving walls at U = 0.1 m/s).

An adaptive time step is used (computed from the Courant and Fourier criteria), typically around 5 x 10^-4 s. The following iterative solvers have been selected: Jacobi for the velocity, conjugate gradient for the pressure and bi-cgstab for the volume fraction. The criterion parameter of the "alpha-pressure" step is fixed to 10^-6 and the maximum number of cycles in the "alpha-pressure" step is 50.

Results and discussion

After a transient phase, the numerical simulations converge to a stationary state. Fig. 5 shows the temporal evolution of the solid volume fraction between 0.5 s and 2.0 s. The three-dimensional unsteady numerical simulation results agree with the analytical solutions (table 3). The accuracy of the numerical results is independent of the parallelization and of the core number.

Table 3: Analytical and numerical result comparison.
                        Analytical      NEPTUNE_CFD minimum    NEPTUNE_CFD maximum
α_p (-)                 0.30            0.29989                0.30071
q_p (m2.s-2)            2.66 x 10^-5    2.6432 x 10^-5         2.6639 x 10^-5
μ_kin,p (kg/m/s)        6.21 x 10^-4    6.1872 x 10^-4         6.2116 x 10^-4
μ_col,p (kg/m/s)        9.88 x 10^-4    9.8784 x 10^-4         9.9085 x 10^-4

Figure 5: Temporal convergence of the solid volume fraction between 0.5 s and 2.0 s ((a) 0.5 s, (b) 1.0 s, (c) 1.5 s, (d) 2.0 s).

NEPTUNE_CFD performances are evaluated on a restart simulation of 1,000 additional iterations after the transient step. The "CPU_TIME" intrinsic Fortran subroutine has been used to measure the CPU times per core of the main routines of the code, the CPU time per iteration per core and the effective CPU time per core of the simulation. The measurements of the different times are repeated 3 times for each case to remove accidental and statistical errors. No significant difference between the measured times is observed. The following analysis is based on the averaged values of the 3 runs.

First, we focus on strong scaling, which is defined as how the solution time varies with the number of cores for a fixed total problem size. Secondly, we are interested in weak scaling, that is to say how the solution time varies with the number of processors for a fixed problem size per processor (strong scaling aims at the same problem size in a shorter time, weak scaling at a larger problem size in the same time). It is hard to achieve good strong scaling at large process counts, since the communication overhead increases in proportion to the number of processes used.

Table 4: Effect of core number on effective core CPU time for 1,000 iterations with the standard mesh.
Number of     Effective CPU time/core (s)
cores         C1a       C1b       C2a       C2b       C3
8             35,621    35,325    12,150    12,462    15,469
16            14,999    14,936    5,628     5,639     7,087
32            5,959     5,877     2,598     2,567     3,254
64            1,571     1,323     1,200     1,127     1,548
128           803       751       563       583       895
256           580       757       449       485       1,269
512           498       493       -         -         -








Figure 6: NEPTUNE_CFD effective CPU time per core (standard mesh, 1,000 iterations).

Figure 7: NEPTUNE_CFD speed-up with respect to T_ref of each cluster (standard mesh, 1,000 iterations).

Figure 8: NEPTUNE_CFD efficiency with respect to T_ref of each cluster (standard mesh, 1,000 iterations).

Table 4 shows the effect of the core number on the effective CPU time per core with the standard mesh; the computation elapsed time is close to this time. The speedup is defined as the ratio between the elapsed time to execute the program on one node (ref = 1 node = 8 cores) and on a set of n concurrent nodes (n x 8 cores), and the efficiency is defined as the ratio between the speedup and n:

Sp_n = T_ref / T_n,    Ef_n = Sp_n / n,

where T_ref is the effective core CPU time of the parallel job on one node (8 cores) and T_n is the effective core CPU time of the parallel job on n nodes (n x 8 cores).
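As a worked example of these definitions (our own arithmetic, using the C1a column of table 4), the short program below recomputes Sp_n and Ef_n; on 128 cores it returns the apparent speedup of about 45 that is discussed below.

```c
/* Strong-scaling metrics recomputed from the table 4 values for cluster C1a
 * (standard mesh, 1,000 iterations): Sp_n = T_ref / T_n, Ef_n = Sp_n / n,
 * with n the number of nodes (8 cores per node). */
#include <stdio.h>

int main(void)
{
    const int    nodes[] = { 1, 2, 4, 8, 16 };              /* 8 ... 128 cores */
    const double t_c1a[] = { 35621.0, 14999.0, 5959.0, 1571.0, 803.0 };
    const double t_ref   = t_c1a[0];                        /* 1 node = 8 cores */

    printf("cores  speedup  efficiency\n");
    for (int i = 0; i < 5; ++i) {
        double sp = t_ref / t_c1a[i];
        double ef = sp / nodes[i];
        printf("%5d  %7.1f  %10.2f\n", nodes[i] * 8, sp, ef);
    }
    /* The last line gives Sp of about 44 on 128 cores, the apparently
     * super-linear value that motivates the "absolute" speedup referenced
     * to cluster C2a introduced below. */
    return 0;
}
```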


























NEPTUNE_CFD effective CPU time per core, speedup and efficiency of the calculation are depicted in Fig. 6, 7 and 8 respectively. The continuous line without symbol indicates the linear speedup. Fig. 7 demonstrates that NEPTUNE_CFD exploits the parallelism very efficiently: from 16 up to 128 cores (2 up to 16 nodes), all the speedups are greater than or equal to the linear speedup (i.e. super-linear speedup) and the efficiencies are also greater than or equal to 1.0 (Fig. 8).

However, in Fig. 6, we can observe very long effective CPU times on C1a and C1b for low core numbers (up to 32). The ratio between the C1 and C2 CPU times is about 1.3 for high core numbers and about 3 for low core numbers. This effect may be attributed to the Harpertown technology and its cache limitation. The excess of effective CPU time per core using 8 cores leads to an over-estimate of the speedup and efficiency: using 128 cores, a speedup of 45 is not realistic, the linear speedup being 16.

To avoid this problem, we define an "absolute" speedup and efficiency with respect to T_ref of C2a (ref = 1 node = 8 cores) as follows:

Sp_abs,n = T_ref,C2a / T_n,    Ef_abs,n = Sp_abs,n / n,

where T_ref,C2a is the execution time of the parallel program on one node (8 cores) on C2a and T_n is the elapsed time of the parallel program on n nodes (n x 8 cores) on the different clusters.

With this kind of definition, the speedup values themselves are less interesting than the evolution of the speedup per cluster with the core number: the changing slope indicates the optimum core number. Fig. 9 and 10 show the evolution of the absolute speedup and efficiency. On all clusters, the parallel performances are very good from 8 up to 128 cores. On C1, we find again low performances for few cores. Over 128 cores, the efficiency degradation is significant, especially for C3 where the effective CPU time increases. Moreover, for C1 and C2, we can note an MPI implementation effect: SGI MPT seems more efficient for few cores and Intel MPI is better for numerous cores.

Figure 9: NEPTUNE_CFD absolute speed-up with respect to T_ref of C2a (standard mesh, 1,000 iterations).

Figure 10: NEPTUNE_CFD absolute efficiency with respect to T_ref of C2a (standard mesh, 1,000 iterations).

The degradation for high core numbers is mainly caused by load unbalancing and communication overhead. The restitution time of a simulation is divided into computation time, communication and waiting time (send/receive/barrier MPI) and I/O time; in this case, the I/O time is almost zero. An increase of the core number corresponds to a decrease of the cell number per core, that is to say a decrease of the computation time per core. Moreover, more cores imply more communications between cores. When the communication time increases faster than the calculation time decreases, the core limit number is reached. Thus there is an optimal number of cores for a given mesh and set of equations.

For this numerical simulation of a three-dimensional granular uniform shear flow solving 10 equations, NEPTUNE_CFD performances are excellent on all clusters while the cell number per core is greater than 10,000, that is to say while the core number is lower than or equal to 128 for a mesh of 1,000,000 cells. On the Nehalem cluster, performances are excellent up to 196 cores, and it is noteworthy that the restitution time decreases up to 256 cores (5,000 nodes per core). NEPTUNE_CFD scalability is excellent on all platforms, with linear speedup on the Harpertown and Shanghai technologies and super-linear speedup on the Nehalem technology.

To improve the NEPTUNE_CFD performance evaluation, we perform numerical simulations with the refined mesh (10,503,459 hexahedra) using from 1 up to 128 nodes (8 up to 1,024 cores). As the CPU times are large, these numerical simulations have been realized only on cluster C2a (the most recent cluster) with the Intel MPI implementation. Performances are evaluated on a restart simulation of 500 additional iterations after the transient phase. Time measurements are still done with the "CPU_TIME" intrinsic Fortran subroutine and each run is repeated 3 times.

Table 5: Effect of core number on core CPU time for 500 iterations with the refined mesh on C2a.
Number of cores    CPU time/core (s)
8                  109,220
16                 55,205
32                 28,877
64                 14,707
128                7,351
256                3,721
512                1,780
768                1,220
1024               962

Table 5 shows the effect of the core number on the effective CPU time per core with the refined mesh. The speedup and the efficiency are defined as previously.

NEPTUNE_CFD effective CPU time per core, speedup and efficiency of the calculation are depicted in Fig. 11, 12 and 13 respectively. The solid line represents the linear speedup. Fig. 12 demonstrates that NEPTUNE_CFD exploits the parallelism very efficiently even for large core numbers. The speedup is super-linear up to 768 cores and linear for 1,024 cores, and the efficiencies are also greater than or equal to 1.0 (Fig. 13). Parallel performances are very good up to 128 nodes. As in the previous case, performance degradation begins when communications become preponderant compared to computation; we can assume that with this mesh the speedup would be low using 2,048 cores.

This refined mesh study confirms that NEPTUNE_CFD performances are excellent on the Nehalem cluster (C2) while the cell number per core is higher than 10,000, that is to say, for a mesh of 10,503,459 cells, a core number lower than or equal to 1,024. Hence, the NEPTUNE_CFD strong scaling is excellent up to 1,024 cores, provided the minimum cell number per core is respected.














Figure 11: NEPTUNE_CFD effective CPU time per core (refined mesh, 500 iterations).

Figure 12: NEPTUNE_CFD speed-up with ref = 1 node (refined mesh, 500 iterations).

Figure 13: NEPTUNE_CFD efficiency with ref = 1 node (refined mesh, 500 iterations).

To understand these performances and how the time is divided as the core number changes, a profiling of NEPTUNE_CFD is necessary. We use MPINSIDE to evaluate the MPI communications and PerfSuite to get NEPTUNE_CFD timing information. MPINSIDE slightly increases the elapsed time of the numerical simulations but gives the right percentages of elapsed time used by the computation and by the different MPI processes. The main MPI function times and the computational time are depicted in Fig. 14 and 15 for the standard and the refined meshes and for different core numbers.

Figure 14: MPINSIDE profiling on C2a (standard mesh, 1,000 iterations): breakdown of the core CPU time into computation, barrier before Allreduce, Allreduce, Waitall and other MPI times.

Using 8 cores, with the two meshes, the computation time is over 90% of the effective core CPU time: the barrier time before MPI_Allreduce is small, the MPI_Allreduce and Waitall times are almost nil, and the other MPI times (overhead, send, receive, ...) are negligible. When the core number increases, the barrier-before-Allreduce, Allreduce and Waitall times increase while the computational time decreases. Using 64 cores with the standard mesh, the computational time is only about 70% of the core CPU time, and using 512 cores it is about 10%. With the refined mesh, using 256 cores the computational time is only about 70% of the core CPU time, and using 1,024 cores it is about 25%.

MPINSIDE also gives information about the MPI function call counts. Using 16 cores on the refined mesh, during 500 iterations, the number of barriers before MPI_Allreduce is about 55,904,240; using 1,024 cores, it is about 3,574,704,128. Indeed, the NEPTUNE_CFD iterative solvers require a lot of MPI communications, so a lot of messages are exchanged between the cores many times in every iteration. For a high core number, the communication time is therefore important. MPINSIDE confirms that the MPI communications due to the iterative solvers and the "alpha-pressure" step will be the main limitation for NEPTUNE_CFD massively parallel computation.
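The kind of breakdown reported by MPINSIDE can be approximated by hand-timing the collectives. The sketch below is our own illustration (not MPINSIDE or NEPTUNE_CFD code) of how the waiting time in a barrier before a reduction can be separated from the reduction itself.

```c
/* Hand-rolled timing of the communication pattern discussed above
 * (illustrative only): separate the synchronization cost (barrier, where
 * load imbalance shows up) from the reduction itself. */
#include <mpi.h>
#include <stdio.h>

static double t_barrier = 0.0, t_allreduce = 0.0;

double timed_allreduce(double local, MPI_Comm comm)
{
    double global, t0, t1, t2;

    t0 = MPI_Wtime();
    MPI_Barrier(comm);                 /* waiting time before the collective */
    t1 = MPI_Wtime();
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    t2 = MPI_Wtime();

    t_barrier   += t1 - t0;
    t_allreduce += t2 - t1;
    return global;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double residual = 1.0;                 /* placeholder local value */
    for (int it = 0; it < 1000; ++it)      /* e.g. one call per solver sweep */
        residual = timed_allreduce(residual, MPI_COMM_WORLD);

    if (rank == 0)
        printf("barrier: %.3f s, allreduce: %.3f s\n", t_barrier, t_allreduce);

    MPI_Finalize();
    return 0;
}
```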







Figure 15: MPINSIDE profiling on C2a (refined mesh, 500 iterations).

PerfSuite is used to identify the most frequently called routines and functions (sample profiling). Fig. 16 shows the distribution of calls for the main "routines" as a function of the core number. The term "routines" here covers the Intel MPI functions, the function "parcom" (call of MPI functions from the Fortran code), scheduling, the routine "prodsc" (Fortran scalar product) and the routine "gradco". For a low core number, the percentage of NEPTUNE_CFD routine samples spent in "gradco" and "prodsc" is about 90%. For a high core number (above 256), the percentage of Intel MPI functions and scheduling is significant and becomes predominant at 1,024 cores.

Figure 16: PerfSuite routine profiling on C2a (refined mesh, 500 iterations): Intel MPI functions, scheduling, prodsc, gradco.

PerfSuite also makes it possible to follow the number of calls to the code itself, to the "libc" library and to the Intel MPI library (Fig. 17). Increasing the core number strongly reduces the share of the calls belonging to NEPTUNE_CFD itself.

Figure 17: PerfSuite module profiling on C2a (refined mesh, 500 iterations).

This sample profiling can be compared with a time profiling included in NEPTUNE_CFD, which gives, per iteration and per core, the CPU time and the times of the main algorithm steps (the MPI time being included in the calling routines).

Table 6: NEPTUNE_CFD integrated profiling: details of one iteration with the refined mesh using 16 cores on C2a.
Step                        CPU time (s)
Post-processing             0.00
Initialization              1.35
Velocity prediction         5.41
"Alpha-pressure"            109.70
  gradco routine            102.92
  jacobi routine            1.61
  bi-cgstab routine         1.19
Total time for iteration    120

In table 6, we can see that with the refined mesh and using 16 cores, one iteration requires a core CPU time of 120 s, mainly spent in the "alpha-pressure" step (109 s), which is itself mainly spent in the conjugate gradient (103 s). For the same iteration using 1,024 cores, the core CPU time of 2.14 s is mainly spent in the "alpha-pressure" step (1.88 s), which is again mainly spent in the conjugate gradient (1.76 s). Between 80% and 90% of the effective CPU time is spent in the matrix-inversion step (iterative solvers).
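As a quick check of these shares (our own arithmetic on the table 6 values):

```c
/* Shares of one iteration derived from table 6: the "alpha-pressure" step
 * and the gradco (conjugate gradient) routine relative to the total core
 * CPU time of one iteration on 16 cores with the refined mesh. */
#include <stdio.h>

int main(void)
{
    const double total  = 120.0;    /* total core CPU time per iteration (s) */
    const double alpha  = 109.70;   /* "alpha-pressure" step (s) */
    const double gradco = 102.92;   /* conjugate gradient inside that step (s) */

    printf("alpha-pressure step: %.0f %% of the iteration\n", 100.0 * alpha / total);
    printf("gradco (solver):     %.0f %% of the iteration\n", 100.0 * gradco / total);
    /* About 91 and 86 percent: consistent with the 80-90 %% of CPU time
     * reported for the matrix-inversion (iterative solver) step. */
    return 0;
}
```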



A complementary aspect of the performance evaluation is weak scaling. To respect a cell number per core of about 20,000, we have created seven regular meshes to evaluate the performance from 8 up to 512 cores; table 7 details the different meshes according to the core number. NEPTUNE_CFD performances are evaluated on a restart simulation of 2,500 additional iterations after the transient step on C2a.

The weak scaling of NEPTUNE_CFD is shown in Fig. 18: the effective CPU times per core are almost constant. A weak scaling efficiency Ef_weak can be defined as

Ef_weak,n = T_ref / T_n,

where T_ref is the elapsed time of the parallel job on one node with the smallest mesh and T_n is the elapsed time of the parallel job on n nodes with the respective meshes. Fig. 19 shows that this weak scaling efficiency is very good up to 512 cores (larger than 1).

This test case of a granular shear flow exhibits the accuracy of the NEPTUNE_CFD results and the excellent scalability of the code for large enough problem sizes.
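A quick consistency check of this weak-scaling set-up (our own arithmetic, based on table 7 below): every mesh is a regular cube, and its size divided by the core count stays close to the targeted 20,000 cells per core.

```c
/* Consistency check of the weak-scaling meshes of table 7: cube root of the
 * cell count (cells per direction) and cells per core for each case. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const long cells[] = { 166375, 328509, 658503, 1295029,
                           2515456, 5177717, 10503459 };
    const int  cores[] = { 8, 16, 32, 64, 128, 256, 512 };

    for (int i = 0; i < 7; ++i) {
        double side = cbrt((double)cells[i]);             /* cells per direction */
        printf("%9ld cells = %3.0f^3, %4d cores -> %6.0f cells/core\n",
               cells[i], side, cores[i], (double)cells[i] / cores[i]);
    }
    /* The last mesh is the 219^3 refined mesh already used for strong scaling. */
    return 0;
}
```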













Table 7: NEPTUNE_CFD weak scaling: different meshes according to core number on C2a.
Hexahedra number    Core number    Cells per core
166,375             8              20,797
328,509             16             20,532
658,503             32             20,578
1,295,029           64             20,235
2,515,456           128            19,652
5,177,717           256            20,225
10,503,459          512            20,515






Figure 18: NEPTUNE_CFD effective CPU time per core for 2,500 iterations (20,000 cells per core).

Figure 19: NEPTUNE_CFD weak scaling efficiency for 2,500 iterations (20,000 cells per core).


Simulation of an industrial gas-solid fluidized bed

The purpose of this test case is a three-dimensional unsteady dense fluidized bed at industrial scale. NEPTUNE_CFD V1.07@Tlse is dedicated to this kind of numerical simulation (Gobin et al. (2003)).

The industrial reactor is composed of a cylinder and a bulb at the top. The reactor is about 30 meters high and 5 meters wide, as shown in Fig. 21. The bulb is a hemispherical dome with two smokestacks for the gas outlet. The powder properties and the operating point are given in table 8. In the industrial process the particle phase is polydispersed; however, the numerical simulations have been carried out with monodisperse particles having the median diameter.

Table 8: Powder and gas properties of the industrial reactor.
                            Gas      Particles
Density (kg/m3)             21.11    850
Viscosity x 10^5 (Pa.s)     1.54     -
Pressure (bar)              20       -
V_f (m.s-1)                 0.67     -
Mean diameter (μm)          -        1,600
Solid mass (kg)             -        80,000
Restitution coefficient     -        0.9

For the gas turbulence modeling, we use a standard k-ε model extended to multiphase flows, accounting for additional source terms due to the interfacial interactions. For the dispersed phase, a coupled system of transport equations is solved for the particle fluctuating kinetic energy and the fluid-particle fluctuating velocity covariance. Owing to the large particle to gas density ratio, only the drag force is accounted for as acting on the particles. Finally, we solve 12 equations (mass balance, momentum transport, gas and particle turbulence).

The three-dimensional mesh is shown in Fig. 20. The mesh, based on an O-grid technique, contains 3,150,716 hexahedra with approximately Δx = Δy = 30 mm and Δz = 90 mm.

At the bottom, the fluidization grid is an inlet for the gas with an imposed superficial velocity corresponding to the fluidization velocity V_f = 0.67 m.s-1; for the particles this section is a wall. At the top of the fluidized bed reactor, we define two free outlets for both the gas and the particles. The wall-type boundary condition is friction for the gas and slip for the particles.

An adaptive time step is used (computed from the Courant and Fourier criteria), typically around 1 x 10^-5 s. The following iterative solvers have been selected: Jacobi for the velocity, conjugate gradient for the pressure and bi-cgstab for the volume fraction. The criterion parameter of the "alpha-pressure" step is fixed to 10^-6 and the maximum number of cycles in the "alpha-pressure" step is 50.
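As an illustration of such an adaptive time step, the sketch below computes a generic Courant/Fourier-limited dt. The coefficient names and the uniform cell size are our own simplifying assumptions, not the NEPTUNE_CFD routine.

```c
/* Generic adaptive time-step computation from Courant and Fourier limits
 * (a textbook sketch under our own assumptions):
 * dt = min over cells of ( CFL * dx / |u| , Fo * dx^2 / nu ). */
#include <math.h>

double adaptive_dt(int n_cells, const double *u_mag, const double *nu,
                   double dx, double cfl_max, double fo_max, double dt_max)
{
    double dt = dt_max;
    for (int c = 0; c < n_cells; ++c) {
        double dt_courant = cfl_max * dx / fmax(u_mag[c], 1e-12);  /* convection limit */
        double dt_fourier = fo_max * dx * dx / fmax(nu[c], 1e-12); /* diffusion limit */
        double dt_cell = fmin(dt_courant, dt_fourier);
        if (dt_cell < dt) dt = dt_cell;
    }
    /* In a domain-decomposed run, the minimum must then be taken over all
     * MPI ranks (e.g. with MPI_Allreduce and MPI_MIN). */
    return dt;
}
```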


















































Figure 20: Three-dimensional mesh containing 3,150,716 hexahedra.

Figure 21: Sketch of the industrial polymerization reactor (about 30 m high and 5 m wide; gas inlet at the bottom, gas outlets at the top of the dome).

Results and discussion

Fig. 22(a) shows the automatic domain decomposition realized by METIS on the industrial reactor. Fig. 22(b) shows the instantaneous solid volume fraction at 16.5 s.

NEPTUNE_CFD performances are evaluated on a restart simulation of 250 additional iterations after a transient phase corresponding to the destabilization of the fluidized bed (15 s). The numerical simulations have been carried out only on cluster C2a (Nehalem architecture with Intel MPI). As for the shear flow, the "CPU_TIME" intrinsic Fortran subroutine has been used to measure the different times per core.

Table 9: Effect of core number on core CPU time on C2a (industrial fluidized bed).
Number of cores    CPU time/core (s)
8                  68,479
16                 40,885
32                 19,233
64                 8,625
128                4,129
256                1,707
384                1,275
512                1,173
768                1,190
1024               1,220

Table 9 shows the effect of the core number on the effective CPU time per core. As described above, we calculate the speedup and the efficiency. NEPTUNE_CFD effective CPU time per core, speedup and efficiency of this calculation are depicted in Fig. 23, 24 and 25 respectively; the continuous line without symbol indicates the linear speedup and the ideal efficiency. Fig. 24 demonstrates that NEPTUNE_CFD exploits the parallelism very efficiently: from 16 up to 384 cores (48 nodes), the speedup is super-linear and the efficiencies are greater than 1.0 (Fig. 25).

Parallel performances are very good up to 384 cores, and the restitution time decreases up to 512 cores. Over this value, an increase of the core number is no longer worthwhile, but the restitution time does not increase significantly: the increase of the communication time is balanced by the decrease of the computational time.

In the same way, for this realistic industrial dense fluidized bed solving 12 equations, NEPTUNE_CFD HPC is excellent on the Nehalem cluster (C2) while the cell number per core is higher than 10,000, that is to say, for a mesh of 3,150,716 hexahedra, a core number lower than or equal to 384. If we consider the "cost" of the numerical simulation in terms of total CPU time, the optimum is obtained with 256 cores.

This test case illustrates the excellent NEPTUNE_CFD scalability on an industrial geometry: strong scaling is good for large enough problem sizes (up to 384 cores with this mesh).
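The "cost" argument can be made explicit with the table 9 values (our own arithmetic): the total CPU consumption, cores times CPU time per core, is minimal at 256 cores.

```c
/* Total CPU cost of the industrial fluidized bed simulation derived from
 * table 9: cost = cores x CPU time per core; the minimum locates the most
 * economical core count. */
#include <stdio.h>

int main(void)
{
    const int    cores[]  = { 8, 16, 32, 64, 128, 256, 384, 512, 768, 1024 };
    const double time_s[] = { 68479, 40885, 19233, 8625, 4129,
                              1707, 1275, 1173, 1190, 1220 };
    int best = 0;

    for (int i = 0; i < 10; ++i) {
        double cost = cores[i] * time_s[i];              /* core-seconds */
        printf("%5d cores: %12.0f core.s\n", cores[i], cost);
        if (cores[i] * time_s[i] < cores[best] * time_s[best]) best = i;
    }
    printf("minimum total CPU cost at %d cores\n", cores[best]);  /* 256 cores */
    return 0;
}
```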




Figure 22: Industrial fluidized bed. (a) Automatic domain decomposition using METIS. (b) Instantaneous solid volume fraction field at t = 16.5 s.


Figure 23: NEPTUNE_CFD effective CPU time per core on C2a (250 iterations) on an industrial reactor.

Figure 24: NEPTUNE_CFD speed-up with ref = 1 node on C2a (250 iterations) on an industrial reactor.

Figure 25: NEPTUNE_CFD efficiency with ref = 1 node on C2a (250 iterations) on an industrial reactor.


Conclusion

Three-dimensional numerical simulations of two different cases of dense gas-particle flows have been performed to evaluate the NEPTUNE_CFD parallel performances. The first one exhibits the accuracy of the numerical results and demonstrates the high NEPTUNE_CFD parallel computing performances up to 1,024 cores. The second one, a realistic fluidized bed, shows the NEPTUNE_CFD efficiency and scalability on an industrial geometry. The results show linear speed-up with the Harpertown and Shanghai technologies and super-linear speedup with the Nehalem technology. For the two test cases, a minimum of 10,000 mesh cells per core is needed and 20,000 cells per core are recommended.

A next step in the evaluation of NEPTUNE_CFD performances will focus on the effect of the number of phases (fluids) and of the number of solved equations on the minimum cell number per core. On the other hand, we will evaluate the performances using several thousands of cores, as realized for the open source CFD code Code_Saturne by Sunderland et al. (2008).

Moreover, the profilings have shown that the most important part of the CPU time is spent in the matrix-inversion step performed by the iterative solvers. The forthcoming implementation of multigrid in NEPTUNE_CFD should strongly improve the global performances of the code.




Acknowledgments


This work was granted access to the HPC resources of CALMIP under the allocation P0111 (Calcul en Midi-Pyrénées). This work was also granted access to the HPC resources of CINES under the allocation 2010-026012 made by GENCI (Grand Équipement National de Calcul Intensif).




References

G. Balzer, A. Boelle, and O. Simonin. Eulerian gas-solid flow modelling of dense fluidized bed. FLUIDIZATION VIII, Proc. International Symposium of the Engineering Foundation, J.F. Large and C. Laguérie (Editors), pages 409-418, 1995.

A. Gobin, H. Neau, O. Simonin, J.R. Llinas, V. Reiling, and J.L. Sélo. Fluid dynamic numerical simulation of a gas phase polymerization reactor. Int. J. for Num. Methods in Fluids, 43, pages 1199-1220, 2003.

G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, Vol. 20, No. 1, pages 359-392, 1999.

N. Mechitoua, M. Boucker, S. Mimouni, S. Pigny, and G. Serre. Numerical simulation of multiphase flow with an elliptic oriented fractional step method. Third International Symposium on Finite Volumes for Complex Applications, Porquerolles, France, 2002.

N. Mechitoua, M. Boucker, J. Lavieville, J.M. Herard, S. Pigny, and G. Serre. An unstructured finite volume solver for two-phase water/vapour flows modelling based on elliptic oriented fractional step method. NURETH-10, Seoul, South Korea, 2003.

T. Randrianarivelo, H. Neau, O. Simonin, and F. Nicolas. 3D unsteady polydispersed simulation of uranium hexafluoride production in a fluidized bed pilot. In Proc. 6th Int. Conf. on Multiphase Flow, Leipzig (Germany), paper S6_Thu_46, 2007.

O. Simonin. Prediction of the dispersed phase turbulence in particle-laden jets. Gas-Solid Flows-1991, ASME FED, Vol. 121, pages 197-206, 1991.

O. Simonin. Statistical and continuum modelling of turbulent reactive particulate flows. Theoretical and Experimental Modeling of Particulate Flows, Lecture Series 2000-06, von Karman Institute for Fluid Dynamics, Rhode-Saint-Genèse, Belgium, 2000.

A.G. Sunderland, M. Ashworth, N. Li, C. Moulinec, Y. Fournier, and J. Uribe. Towards petascale computing with parallel CFD codes. Parallel CFD 2008, Lyon, France, 2008.



