Benchmarking NAMD workflows for GPU containers
Benchmarking NAMD2 workloads for GPU containers on CCAST
Stephen Szwiec
Rasulev Computational Chemistry Research Group
North Dakota State University
26 July 2022
ApoA1 information
- Simulates a bloodstream lipoprotein particle
- Physical stats
  - 92224 atoms
  - 70660 bonds
  - 74136 angles
  - 74130 dihedrals
  - 1402 impropers
  - 32992 hydrogen groups
  - 553785 amu total mass
- Energy stats
  - 300 K initial temperature
  - -14 e total charge
- Simulation stats
  - Consists of a startup process followed by 500 steps of simulation for the benchmark timing
  - GPU workload modified to use CUDA FFTW and CUDA integration
  - GPU workload modified to continue the simulation for 10000 steps (see the sketch after this list)
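The exact input edits are not listed on this page; the following is a minimal Python sketch of how a GPU-benchmark variant of the config might be generated, assuming the stock apoa1.namd path, its numsteps directive, and the NAMD 3.0 alpha CUDASOAintegrate flag for GPU-resident integration (all of these names are assumptions, not taken from the benchmark above).

```python
# Hypothetical sketch: derive the GPU-benchmark input from the stock ApoA1 config.
# Directive names (numsteps, CUDASOAintegrate) and file paths are assumptions about
# the NAMD 3.0 alpha input format; they are not taken from this page.
import re
from pathlib import Path

base = Path("apoa1/apoa1.namd").read_text()

# Extend the run from the benchmark's 500 steps to 10000 steps.
gpu_cfg = re.sub(r"(?m)^numsteps\s+\d+", "numsteps           10000", base)

# Ask for GPU-resident (CUDA) integration; flag name assumed for NAMD 3.0 alpha.
gpu_cfg += "\nCUDASOAintegrate   on\n"

Path("apoa1/apoa1_gpu.namd").write_text(gpu_cfg)
```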
NAMD non-GPU run information
- Charmrun used with one node (condo02)
- Machine topology: 2 sockets x 64 cores x 1 PU = 128-way SMP possible
- 20 processes, 20 cores, 1 physical node, and 16 GB memory specified
- NAMD 2.14 used
NAMD GPU run information
- NAMD used within the Singularity container system
- namd:3.0-alpha11 image used to generate the Singularity container
- Machine topology: 2 sockets x 64 cores x 1 PU = 128-way SMP possible
- 2 CPUs, 2 GPUs, 1 OpenMP thread, 1 physical node, and 16 GB memory specified
- Each GPU bound to one NVIDIA A10 with 22731 MB memory
- NAMD 3.0-alpha11 used
Benchmark findings
- Startup wall time: CPU 3.4346 s, GPU 0.0963 s
- Simulation wall time: CPU 38.6991 s, GPU 5.59571 s
- Wall time per step: CPU 0.0918308 s/step, GPU 0.0110155 s/step
- Days per nanosecond of simulation: CPU 0.999022 days/ns, GPU 0.0660301 days/ns (raw speedups checked in the snippet below)
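The raw CPU-to-GPU speedups implied by these figures can be checked directly; the quick sketch below uses only the numbers in the table:

```python
# Raw CPU -> GPU speedups implied by the benchmark figures above.
cpu_s_per_step, gpu_s_per_step = 0.0918308, 0.0110155
cpu_days_per_ns, gpu_days_per_ns = 0.999022, 0.0660301

print(f"speedup by s/step:  {cpu_s_per_step / gpu_s_per_step:.2f}x")    # ~8.34x
print(f"speedup by days/ns: {cpu_days_per_ns / gpu_days_per_ns:.2f}x")  # ~15.13x
```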
Additional information
- Extending the ApoA1 workload with CUDA led to step 3500 being reached in roughly the same wall time in which step 500 was reached on 20 CPUs.
- Per-core speedup with GPU acceleration is ~151.297x: the raw ~15.13x days/ns speedup scaled by the 20 CPU cores of the non-GPU run versus the 2 cores driving the GPUs (see the check after this list).
- Caveat: one CPU core must supervise and await output from each GPU because of how NAMD is written.
- The CUDA run performed 10000 steps and wrote output in 136.315155 s total wall time.
- The CPU run performed 500 steps and wrote output in 76.477325 s total wall time.
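The derived figures in this list follow from the timings reported above; a short check, using only numbers from this page plus the 20-core vs. 2-core accounting described in the run information:

```python
# Reproduce the derived figures quoted above from the reported timings.
cpu_s_per_step, gpu_s_per_step = 0.0918308, 0.0110155
cpu_days_per_ns, gpu_days_per_ns = 0.999022, 0.0660301
cpu_sim_wall = 38.6991  # 500-step CPU benchmark, simulation wall time (s)

# Per-core speedup: raw days/ns speedup scaled by 20 CPU cores vs. 2 GPU-host cores.
per_core = (cpu_days_per_ns / gpu_days_per_ns) * (20 / 2)
print(f"per-core speedup: ~{per_core:.3f}x")  # ~151.297x

# GPU steps completed within the CPU run's 500-step benchmark window.
print(f"GPU steps in {cpu_sim_wall} s: ~{cpu_sim_wall / gpu_s_per_step:.0f}")  # ~3513

# Effective wall time per step, including output, for the two full runs.
print(f"GPU 10000-step run: {136.315155 / 10000:.4f} s/step")  # ~0.0136 s/step
print(f"CPU   500-step run: {76.477325 / 500:.4f} s/step")     # ~0.1530 s/step
```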