VASP makes use of parallel machines by splitting the calculation into many tasks that communicate with each other using MPI. This is because, for many complex problems, a single core is not enough to finish the calculation in a reasonable time.
==Theory==
===Basic parallelization===
:By default, VASP distributes the number of bands ({{TAG|NBANDS}}) over the available MPI ranks. But it is often beneficial to add parallelization of the FFTs ({{TAG|NCORE}}), parallelization over '''k''' points ({{TAG|KPAR}}), and parallelization over separate calculations ({{TAG|IMAGES}}). All these tags default to 1 and divide the number of MPI ranks among the parallelization options. Additionally, there are some parallelization options for specific algorithms in VASP, e.g., {{TAG|NOMEGAPAR}}. In summary, VASP parallelizes with
 
:::<math>
\text{total ranks} = \text{ranks parallelizing bands} \times \text{NCORE} \times \text{KPAR} \times \text{IMAGES} \times \text{other algorithm-dependent tags}.
</math>
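:For illustration, a hypothetical job running on 128 MPI ranks could be split as in the following INCAR fragment; the values are chosen only to make the arithmetic concrete, not as a recommendation for any particular system.
<pre>
! Hypothetical INCAR fragment for a run on 128 MPI ranks (IMAGES left at its default of 1)
KPAR  = 4    ! 4 k-point groups of 128/4 = 32 ranks each
NCORE = 8    ! within each group, 8 ranks share the FFTs of one band
! remaining band parallelization: 128 / (KPAR x NCORE) = 4 bands treated simultaneously per group
</pre>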


:In addition to the parallelization using MPI, VASP can make use of [[Combining MPI and OpenMP|OpenMP threading]] and/or [[OpenACC_GPU_port_of_VASP|OpenACC (for the GPU port)]]. Note that running on multiple OpenMP threads and/or GPUs switches off the {{TAG|NCORE}} parallelization.
 
===MPI setup===


:The MPI setup determines the placement of the ranks onto the nodes. VASP assumes the ranks first fill up a node before the next node is occupied. As an example, when running with 8 ranks on two nodes, VASP expects ranks 1–4 on node 1 and ranks 5–8 on node 2. If the ranks are placed differently, communication between the nodes occurs for every parallel FFT. Because FFTs are essential to VASP's speed, this inhibits the performance of the calculation. A typical manifestation is an increase in computing time when the number of nodes is increased from 1 to 2. If {{TAG|NCORE}} is not used, this issue is less severe but still reduces the performance.


:To address this issue, please check the setup of the MPI library and the submitted job script. It is usually possible to overwrite the placement by setting environment variables or command-line arguments. When in doubt, contact the HPC administration of your machine to investigate the behavior.
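:As a sketch only (the exact options depend on the batch system and MPI library, so consult the documentation of your machine), a Slurm job script for the 8-ranks-on-2-nodes example above could request the expected block-wise placement explicitly:
<pre>
#!/bin/bash
#SBATCH --nodes=2             # two nodes ...
#SBATCH --ntasks-per-node=4   # ... with 4 MPI ranks each

# A block distribution keeps consecutive ranks on the same node
# (the first 4 ranks on node 1, the next 4 on node 2), which is the placement VASP expects.
srun --distribution=block ./vasp_std

# With Open MPI, "mpirun --report-bindings" prints the actual placement,
# so the rank-to-node mapping can be verified in the job output.
</pre>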


==How to==
===Optimizing the parallelization===
:The performance of a specific parallelization depends on the system, i.e., the number of ions, the elements, the size of the cell, etc., as well as the algorithms, e.g., whether it is a [[:Category:Electronic minimization|density-functional-theory]] calculation, a [[:Category:Many-body perturbation theory|many-body&ndash;perturbation&ndash;theory]] calculation, or a [[:Category:Molecular dynamics|molecular-dynamics]] simulation using [[:Category:Machine-learned force fields|machine-learned force fields]]. To obtain trustworthy and publishable results, many projects require performing many similar calculations, i.e., calculations with similar input and using the same algorithms. Therefore, we recommend optimizing the parallelization to make the most of the available compute time.
{{NB|tip|Run a few test calculations varying the parallel setup, and use the optimal choice of parameters for the rest of the calculations.|:}}
:For more detailed advice, check the following:
:* How to [[Optimizing the parallelization|optimize the parallelization]] in a nutshell
<!--:* How to efficiently parallelize a <math>GW</math> calculation
:* How to parallelize a molecular-dynamics calculation using machine-learned force fields -->


===OpenMP/OpenACC===
:Both [[Combining MPI and OpenMP|OpenMP]] and [[OpenACC_GPU_port_of_VASP|OpenACC]] parallelize the FFTs and therefore disregard any conflicting specification of {{TAG|NCORE}}. When combining these methods, OpenACC takes precedence, but any code not ported to OpenACC benefits from the additional OpenMP threads. This approach is relevant because the recommended NVIDIA Collective Communications Library requires a single MPI rank per GPU.
:* How to [[Combining MPI and OpenMP|parallelize with multiple OpenMP threads per MPI rank]]
:* How to [[OpenACC_GPU_port_of_VASP|run on GPUs]]
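:As an example, a minimal hybrid MPI/OpenMP launch could look like the following sketch, assuming Slurm, an OpenMP-enabled VASP build, and nodes with at least 16 cores; adapt the numbers and the launcher to your machine.
<pre>
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4   # 4 MPI ranks per node ...
#SBATCH --cpus-per-task=4     # ... with 4 cores reserved for each rank

export OMP_NUM_THREADS=4      # 4 OpenMP threads per MPI rank
export OMP_PLACES=cores       # pin each thread to its own core
export OMP_PROC_BIND=close    # keep the threads of a rank on neighboring cores

# NCORE is ignored in this mode; the FFT work is shared by the OpenMP threads instead.
srun ./vasp_std
</pre>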


==Additional parallelization options==
; {{TAG|KPAR}}: For Laplace-transformed MP2 this tag [[LTMP2_-_Tutorial#Parallelization|has a different meaning]].
; {{TAG|NCORE_IN_IMAGE1}}: Defines how many ranks work on the first image in the thermodynamic coupling constant integration ({{TAG|VCAIMAGES}}).
; {{TAG|NOMEGAPAR}}: Parallelize over imaginary frequency points in <math>GW</math> and RPA calculations.
; {{TAG|NTAUPAR}}: Parallelize over imaginary time points in <math>GW</math> and RPA calculations.
 
[[Category:VASP|parallelization]][[Category:Performance]]
