# GROMACS Tutorial: Parallel Performance of GROMACS for Spatially Inhomogeneous Systems (Slab Structures): Domain Decomposition and PME Node Settings

## Translator's Note

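1. Hardware: faster and better CPUs and GPUs, with support for special optimizations
2. Settings used when compiling GROMACS: efficient libraries, optimizations enabled
3. Settings in the mdp file: time step, cut-offs, output frequency
4. mdrun run settings: parallelization settings, node partitioning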

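1. Determine the minimum cell size for domain decomposition, $$d$$, from the results of a test run.
2. From the box dimensions $$a$$ and $$b$$ along x and y, determine the maximum possible number of cells along x and y: $$N_x=a/d$$, $$N_y=b/d$$ (rounded down).
3. For a slab configuration the box is generally not divided along z, so the maximum possible number of cells for the system is $$N_x \times N_y \times 1$$.
4. When setting the domain decomposition with -dd, the number of cells in each direction must not exceed the maximum obtained above; otherwise mdrun stops with an error.
5. When setting the number of PME nodes with -npme, the value must be less than half of the total number of nodes. The exact value can be chosen from test results; 1/3 or 1/4 of the total is typical.
6. When GROMACS sets up the domain decomposition automatically, it may fail for certain node counts. In that case, set the domain decomposition manually with -dd.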
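The checklist above can be sketched as a small pre-flight check. This is a minimal sketch, not part of GROMACS: the names `max_cells` and `check_setup` are hypothetical, and the PME rule is implemented as "no more than half of the nodes", following the mdrun documentation quoted in section 3. The numbers in the usage lines are taken from section 6.

```python
import math

def max_cells(box_length_nm, min_cell_nm):
    """Largest whole number of cells of at least min_cell_nm that fit along one box edge."""
    return math.floor(box_length_nm / min_cell_nm)

def check_setup(a, b, d, dd, npme, nodes):
    """Return a list of problems with a proposed -dd / -npme setting (empty if none found).

    a, b  : box lengths along x and y (nm)
    d     : minimum cell size from a test run (nm)
    dd    : (nx, ny, nz) as passed to -dd
    npme  : value passed to -npme
    nodes : total number of nodes
    """
    nx, ny, nz = dd
    problems = []
    if nx > max_cells(a, d):
        problems.append(f"too many cells along x: {nx} > {max_cells(a, d)}")
    if ny > max_cells(b, d):
        problems.append(f"too many cells along y: {ny} > {max_cells(b, d)}")
    if nz != 1:
        problems.append("slab systems are normally not divided along z")
    if nx * ny * nz + npme != nodes:
        problems.append("PP cells plus PME nodes do not add up to the node count")
    if 2 * npme > nodes:
        problems.append("more than half of the nodes are PME nodes")
    return problems

print(check_setup(5.99, 5.99, 0.65, (9, 9, 1), 15, 96))   # no problems found
print(check_setup(5.99, 5.99, 0.65, (12, 7, 1), 12, 96))  # x direction is over-divided
```

The check only covers the rules listed above; mdrun applies further limits (for example from -dds and -rcon) that are not modelled here.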

## 2. The System Studied

$\Delta z_\text{vacuum} \ge 3 \max(x,y)$

## 3. Relevant Notes from the GROMACS mdrun Documentation

When mdrun is started with more than 1 rank, parallelization with domain decomposition is used.

With domain decomposition, the spatial decomposition can be set with option -dd. By default mdrun selects a good decomposition. The user only needs to change this when the system is very inhomogeneous. Dynamic load balancing is set with the option -dlb, which can give a significant performance improvement, especially for inhomogeneous systems. The only disadvantage of dynamic load balancing is that runs are no longer binary reproducible, but in most cases this is not important. By default the dynamic load balancing is automatically turned on when the measured performance loss due to load imbalance is 5% or more. At low parallelization these are the only important options for domain decomposition. At high parallelization the options in the next two sections could be important for increasing the performance.

When PME is used with domain decomposition, separate ranks can be assigned to do only the PME mesh calculation; this is computationally more efficient starting at about 12 ranks, or even fewer when OpenMP parallelization is used. The number of PME ranks is set with option -npme, but this cannot be more than half of the ranks. By default mdrun makes a guess for the number of PME ranks when the number of ranks is larger than 16. With GPUs, using separate PME ranks is not selected automatically, since the optimal setup depends very much on the details of the hardware. In all cases, you might gain performance by optimizing -npme. Performance statistics on this issue are written at the end of the log file. For good load balancing at high parallelization, the PME grid x and y dimensions should be divisible by the number of PME ranks (the simulation will run correctly also when this is not the case).

This section lists all options that affect the domain decomposition.

Option -rdd can be used to set the required maximum distance for inter charge-group bonded interactions. Communication for two-body bonded interactions below the non-bonded cut-off distance always comes for free with the non-bonded communication. Atoms beyond the non-bonded cut-off are only communicated when they have missing bonded interactions; this means that the extra cost is minor and nearly independent of the value of -rdd. With dynamic load balancing option -rdd also sets the lower limit for the domain decomposition cell sizes. By default -rdd is determined by mdrun based on the initial coordinates. The chosen value will be a balance between interaction range and communication cost.

When inter charge-group bonded interactions are beyond the bonded cut-off distance, mdrun terminates with an error message. For pair interactions and tabulated bonds that do not generate exclusions, this check can be turned off with the option -noddcheck.

When constraints are present, option -rcon influences the cell size limit as well. Atoms connected by NC constraints, where NC is the LINCS order plus 1, should not be beyond the smallest cell size. An error message is generated when this happens and the user should change the decomposition or decrease the LINCS order and increase the number of LINCS iterations. By default mdrun estimates the minimum cell size required for P-LINCS in a conservative fashion. For high parallelization it can be useful to set the distance required for P-LINCS with the option -rcon.

The -dds option sets the minimum allowed x, y and/or z scaling of the cells with dynamic load balancing. mdrun will ensure that the cells can scale down by at least this factor. This option is used for the automated spatial decomposition (when not using -dd) as well as for determining the number of grid pulses, which in turn sets the minimum allowed cell size. Under certain circumstances the value of -dds might need to be adjusted to account for high or low spatial inhomogeneity of the system.

The option -gcom can be used to only do global communication every n steps. This can improve performance for highly parallel simulations where this global communication step becomes the bottleneck. For a global thermostat and/or barostat the temperature and/or pressure will also only be updated every -gcom steps. By default it is set to the minimum of nstcalcenergy and nstlist.

## 4. Performance Tests

mdrun_mpi -maxh $MAXH -deffnm $tpr -dlb auto


Initializing Domain Decomposition on 32 nodes
Will sort the charge groups at every domain (re)decomposition

NOTE: Periodic molecules: can not easily determine the required minimum bonded cut-off, using half the non-bonded cut-off

Minimum cell size due to bonded interactions: 0.650 nm
Guess for relative PME load: 0.23
Will use 24 particle-particle and 8 PME only nodes
This is a guess, check the performance at the end of the log file
Using 8 separate PME nodes
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 24 cells with a minimum initial size of 0.812 nm
Ewald_geometry=3dc: assuming inhomogeneous particle distribution in z, will not decompose in z.
The maximum allowed number of cells is: X 7 Y 7 Z 1
Domain decomposition grid 6 x 4 x 1, separate PME nodes 8
PME domain decomposition: 8 x 1 x 1
Interleaving PP and PME nodes
This is a particle-particle only node


NOTE: 13.4 % performance was lost because the PME nodes
had less work to do than the PP nodes.
You might want to decrease the number of PME nodes
or decrease the cut-off and the grid spacing.


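| Nodes | -dd X Y Z | -npme | Performance (ns/day) | Performance (hour/ns) | Performance loss (*) |
|---|---|---|---|---|---|
| 32 | auto (6 4 1) | auto (8) | 5.061 | 4.742 | 13.4% |
| 64 | auto (48) | auto (16) | | | |
| 64 | 8 6 1 | 16 | 9.325 | 2.574 | 11.6% |
| 64 | 7 7 1 | 15 | 9.836 | 2.440 | 10.2% |
| 64 | 9 6 1 | 10 | 9.945 | 2.413 | 5.3% |
| 96 | 9 9 1 | 15 | 14.470 | 1.659 | 5.2% |
| 96 | 12 7 1 | 12 | error | | |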
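The two performance columns carry the same information in different units. As a quick consistency check, a minimal sketch (the helper name `hours_per_ns` is hypothetical) converts the ns/day figures from the table into hours per nanosecond:

```python
def hours_per_ns(ns_per_day):
    """Wall-clock hours needed to simulate 1 ns, given throughput in ns/day."""
    return 24.0 / ns_per_day

# ns/day values copied from the table above
for ns_day in (5.061, 9.325, 9.836, 9.945, 14.470):
    print(f"{ns_day:7.3f} ns/day -> {hours_per_ns(ns_day):.3f} hour/ns")
```

The printed values reproduce the hour/ns column to three decimals.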

## 5. Error Messages

Initializing Domain Decomposition on 64 nodes
Will sort the charge groups at every domain (re)decomposition

NOTE: Periodic molecules: can not easily determine the required minimum bonded cut-off, using half the non-bonded cut-off

Minimum cell size due to bonded interactions: 0.650 nm
Guess for relative PME load: 0.23
Will use 48 particle-particle and 16 PME only nodes
This is a guess, check the performance at the end of the log file
Using 16 separate PME nodes
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 48 cells with a minimum initial size of 0.812 nm
Ewald_geometry=3dc: assuming inhomogeneous particle distribution in z, will not decompose in z.
The maximum allowed number of cells is: X 7 Y 7 Z 1

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.5.5
Source code file: domdec.c, line: 6436

Fatal error:
There is no domain decomposition for 48 nodes that is compatible with the given box and a minimum cell size of 0.8125 nm
Change the number of nodes or mdrun option -rdd or -dds
Look in the log file for details on the domain decomposition
website at http://www.gromacs.org/Documentation/Errors
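Why the automatic setup fails for 48 PP nodes but works for 24 can be illustrated by listing the possible grids. The sketch below is an illustration of the arithmetic in the log, not GROMACS's actual algorithm: the function name `dd_grids` is hypothetical, the box length of 5.99 nm is the value quoted in section 6, and the minimum cell size is the bonded limit scaled by 1/0.8 as reported in the log.

```python
import math

def dd_grids(n_pp, box_x, box_y, min_cell):
    """All (nx, ny, 1) grids with nx * ny == n_pp whose cell counts stay within
    the per-direction maximum floor(box / min_cell). No division along z."""
    max_x = math.floor(box_x / min_cell)
    max_y = math.floor(box_y / min_cell)
    return [(nx, n_pp // nx, 1)
            for nx in range(1, max_x + 1)
            if n_pp % nx == 0 and n_pp // nx <= max_y]

min_cell = 0.650 / 0.8   # bonded limit scaled by 1/-dds, about 0.8125 nm
box = 5.99               # nm, x and y box length (assumed from section 6)

print(math.floor(box / min_cell))          # 7 cells at most along x and y
print(dd_grids(24, box, box, min_cell))    # [(4, 6, 1), (6, 4, 1)]
print(dd_grids(48, box, box, min_cell))    # []
```

With at most 7 cells per direction, 48 would need a factor pair such as 6 x 8, which exceeds the limit, so no grid exists. For 24 PP nodes the 6 x 4 x 1 grid chosen in the earlier log is among the valid options.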


mdrun_mpi -maxh $MAXH -deffnm $tpr -dlb auto -dd 12 7 1 -npme 12

#++++++++++++++++++++++++++++++++++++++++++

Initializing Domain Decomposition on 96 nodes
Will sort the charge groups at every domain (re)decomposition

NOTE: Periodic molecules: can not easily determine the required minimum bonded cut-off, using half the non-bonded cut-off

Minimum cell size due to bonded interactions: 0.650 nm
ERROR: The initial cell size (0.491903) is smaller than the cell size limit (0.650000)

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.5.5
Source code file: domdec.c, line: 6413

Fatal error:
The initial cell size (0.491903) is smaller than the cell size limit (0.650000), change options -dd, -rdd or -rcon, see the log file for details
website at http://www.gromacs.org/Documentation/Errors
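The second error can be anticipated by estimating the cell size a manual -dd setting implies. This is a rough sketch assuming a box length of 5.99 nm in x and y (the value quoted in section 6) and a hypothetical helper `cell_sizes`; the estimate for x (about 0.499 nm) is close to, but not identical with, the 0.491903 nm reported by mdrun, whose exact box length is not given here.

```python
def cell_sizes(box_xy, dd_xy):
    """Estimated initial cell size (nm) per direction: box length / number of cells."""
    return [length / n for length, n in zip(box_xy, dd_xy)]

limit = 0.650                                  # nm, cell size limit from the log
sizes = cell_sizes((5.99, 5.99), (12, 7))      # x and y parts of "-dd 12 7 1"

for axis, size in zip("xy", sizes):
    status = "ok" if size >= limit else "too small"
    print(f"{axis}: {size:.3f} nm ({status})")
```

Only the x direction falls below the limit, which is why reducing the number of cells along x (for example 9 instead of 12) avoids the error.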


## 6. How to Obtain the Best Domain Decomposition Settings

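• Box dimensions along x and y: 5.99 nm
• Minimum cell size from bonded interactions: 0.65 nm
• Number of nodes used: must be a multiple of 32, owing to the configuration policy of the HECToR machine used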

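• Maximum number of domains along x: $$N_x=5.99/0.65 \approx 9$$
• Maximum number of domains along y: $$N_y=5.99/0.65 \approx 9$$
• Maximum number of domains along z: $$N_z=1$$, because the system is a slab
• Maximum number of PP nodes: $$N_{PP}=N_x \times N_y \times N_z=9 \times 9 \times 1=81$$
• Maximum fraction of the total nodes that can be used as PME nodes: $$N_{PME} \lt {8 \over 32} N_{node}=0.25 N_{node}$$
(based on the initial domain decomposition with 8 PME nodes and $$6 \times 4 \times 1=24$$ PP nodes, for which the performance loss was 13.4%)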

$N_{PP}+N_{PME}=N_{node}$

$N_{node}=32 u, \;\; u \in \mathbb{N}$

$N_{PP}+0.25 N_{node} \gt N_{node}$

$81+0.25 \cdot 32 \cdot u \gt 32 \cdot u$

$u \lt {81 \over 0.75 \cdot 32} = 3.375$

mdrun_mpi -maxh $MAXH -deffnm $tpr -dlb auto -dd 9 9 1 -npme 15
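The derivation above can be checked numerically. The following is a minimal sketch under the same assumptions as the text (at most 81 PP nodes, a PME share below 0.25, node counts in multiples of 32); the function name `largest_node_count` is hypothetical.

```python
import math

def largest_node_count(max_pp, pme_fraction, block):
    """Largest N = block * u (u a positive integer) that satisfies the strict
    inequality max_pp + pme_fraction * N > N."""
    bound = max_pp / ((1.0 - pme_fraction) * block)  # u must be strictly below this
    u = math.ceil(bound) - 1                         # largest integer strictly below bound
    return u * block

nodes = largest_node_count(81, 0.25, 32)
npme = nodes - 81            # N_PP + N_PME = N_node with the full 9 x 9 x 1 grid

print(nodes, npme, npme / nodes)   # 96 15 0.15625
```

So the largest usable allocation is 96 nodes, and with the full 9 x 9 x 1 grid the remaining 15 nodes go to PME, which stays below the 0.25 share and matches the 96-node row of the table in section 4.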


1. I.C. Yeh and M.L. Berkowitz. Ewald summation for systems with slab geometry. J. Chem. Phys., 111:3155–3162, 1999

2. One of the tools in ASE (Atomic Simulation Environment), a simulation toolkit developed by CAMd. https://wiki.fysik.dtu.dk/ase/epydoc/ase.structure-module.html
