<span id="benchmarking-namd2-workloads-for-gpu-containers-on-ccast"></span>
== Benchmarking NAMD2 workloads for GPU containers on CCAST ==

<span id="stephen-szwiec"></span>
Stephen Szwiec

Rasulev Computational Chemistry Research Group

North Dakota State University

26 July 2022
-----
<span id="apoa1-information"></span>
==== ApoA1 information ====
* simulates a bloodstream lipoprotein particle
* '''physical stats'''
** 92224 atoms
** 70660 bonds
** 74136 angles
** 74130 dihedrals
** 1402 impropers
** 32992 hydrogen groups
** 553785 amu total mass
* '''energy stats'''
** 300 K initial temperature
** -14 e total charge
* '''simulation stats'''
** consists of a startup process followed by 500 simulation steps used for benchmark timing
** GPU workload modified to use CUDA FFTW and CUDA integration
** GPU workload modified to continue the simulation for 10000 steps (see the sketch after this list for the simulated time these step counts cover)
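To put the 500-step and 10000-step counts in context, the sketch below converts them into simulated time. The timestep is not stated on this page; a 1 fs timestep (the value typically used by the stock ApoA1 benchmark input) is assumed here, so treat the output as an estimate rather than a property of the actual runs.

<syntaxhighlight lang="python">
# Rough simulated-time estimate for the step counts above.
# Assumption: a 1 fs timestep, as typically used by the stock ApoA1
# benchmark input; the actual CCAST input file may differ.
TIMESTEP_FS = 1.0

def simulated_ns(steps, timestep_fs=TIMESTEP_FS):
    """Convert a step count into simulated nanoseconds (1 ns = 1e6 fs)."""
    return steps * timestep_fs * 1e-6

print(f"  500 steps ~ {simulated_ns(500):.4f} ns simulated")    # ~0.0005 ns
print(f"10000 steps ~ {simulated_ns(10000):.3f} ns simulated")  # ~0.010 ns
</syntaxhighlight>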
<span id="namd-non-gpu-run-information"></span>
==== NAMD non-GPU run information ====
* Charmrun used with one node (condo02); a sketch of an equivalent launch command follows this list
* machine topology: 2 sockets x 64 cores x 1 PU = 128-way SMP possible
* 20 processes, 20 cores, 1 physical node, and 16 GB memory specified
* NAMD 2.14 used
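For reference, a minimal sketch of how such a 20-process Charmrun launch of NAMD 2.14 could be driven is shown below. The input and log file names, the ++local flag, and the use of a Python wrapper are illustrative assumptions; the actual CCAST PBS job script is not reproduced on this page.

<syntaxhighlight lang="python">
# Hypothetical sketch of the CPU-only run described above: 20 Charm++
# processes of NAMD 2.14 on a single node via charmrun. File names and
# the ++local flag are assumptions, not taken from the actual job script.
import subprocess

cmd = [
    "charmrun", "+p20", "++local",  # 20 processes, all on the local node
    "namd2", "apoa1.namd",          # NAMD 2.14 binary and ApoA1 config
]
with open("apoa1_cpu.log", "w") as log:
    subprocess.run(cmd, stdout=log, check=True)
</syntaxhighlight>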
<span id="namd-gpu-run-information"></span>
==== NAMD GPU run information ====
* NAMD run within the Singularity container system; a sketch of an equivalent launch command follows this list
** namd:3.0-alpha11 image used to generate the Singularity container
* machine topology: 2 sockets x 64 cores x 1 PU = 128-way SMP possible
* 2 CPUs, 2 GPUs, 1 OpenMP thread, 1 physical node, and 16 GB memory specified
** each GPU bound to one NVIDIA A10 with 22731 MB of memory
* NAMD 3.0-alpha11 used
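A minimal sketch of how the containerized GPU run could be launched is given below. It assumes a local .sif built from the namd:3.0-alpha11 image, the namd3 binary name, and commonly used flags (--nv for GPU passthrough, +p for CPU threads, +devices for CUDA device selection); the actual CCAST job script and file names may differ.

<syntaxhighlight lang="python">
# Hypothetical sketch of the containerized GPU run described above:
# NAMD 3.0-alpha11 inside Singularity with both A10 GPUs visible.
# Image, input, and log file names are assumptions for illustration.
import subprocess

cmd = [
    "singularity", "exec", "--nv", "namd_3.0-alpha11.sif",
    "namd3", "+p2", "+devices", "0,1",  # 2 CPU threads driving 2 GPUs
    "apoa1_gpu.namd",                   # ApoA1 config modified for CUDA
]
with open("apoa1_gpu.log", "w") as log:
    subprocess.run(cmd, stdout=log, check=True)
</syntaxhighlight>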
-----
<span id="benchmark-findings"></span>
=== Benchmark findings ===
<span id="startup-wall-time"></span>
==== Startup Wall Time ====
'''CPU''' 3.4346 s '''GPU''' 0.0963 s
<span id="simulation-wall-time"></span>
==== Simulation Wall Time ====
'''CPU''' 38.6991 s '''GPU''' 5.59571 s
<span id="wall-time-per-step"></span>
==== Wall Time Per Step ====
'''CPU''' 0.0918308 s/step '''GPU''' 0.0110155 s/step
<span id="days-per-nanosecond-simulation"></span>
==== Days Per Nanosecond Simulation ====
'''CPU''' 0.999022 days/ns '''GPU''' 0.0660301 days/ns
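The four metric pairs above reduce directly to speedup ratios; the short sketch below does exactly that and nothing more, using only the numbers reported on this page.

<syntaxhighlight lang="python">
# Speedup ratios derived from the benchmark figures above
# (2-GPU container run vs. 20-core CPU run); no new measurements.
cpu = {"startup_s": 3.4346, "sim_wall_s": 38.6991,
       "s_per_step": 0.0918308, "days_per_ns": 0.999022}
gpu = {"startup_s": 0.0963, "sim_wall_s": 5.59571,
       "s_per_step": 0.0110155, "days_per_ns": 0.0660301}

for metric in cpu:
    print(f"{metric:12s}: GPU run {cpu[metric] / gpu[metric]:.1f}x faster")
</syntaxhighlight>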
-----
<span id="additional-information"></span>
==== Additional information ====
* Extending the ApoA1 workload with CUDA led to step 3500 being reached in roughly the same wall time in which step 500 was reached on 20 CPUs.
* Per-core speedup with GPU acceleration is ~151.297 times (see the sketch after this list for how this figure follows from the days/ns results).
** caveat: one CPU core must supervise and await output from each GPU because of how NAMD is written
* The CUDA run performed 10000 steps and wrote output in 136.315155 s of total wall time.
* The CPU run performed 500 steps and wrote output in 76.477325 s of total wall time.
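The two headline claims above can be reproduced from the benchmark figures already listed on this page; the sketch below shows the arithmetic. It is a consistency check on the reported numbers, not an additional measurement.

<syntaxhighlight lang="python">
# Consistency check on the claims above, using only numbers reported
# earlier on this page.
cpu_days_per_ns, gpu_days_per_ns = 0.999022, 0.0660301
cpu_cores, gpu_run_cpus = 20, 2

# Per-core speedup: raw days/ns ratio, normalized by CPUs used per run.
raw_speedup = cpu_days_per_ns / gpu_days_per_ns    # ~15.13x
per_core = raw_speedup * cpu_cores / gpu_run_cpus  # ~151.297x
print(f"raw speedup ~{raw_speedup:.3f}x, per-core ~{per_core:.3f}x")

# GPU step count reachable within the CPU run's 500-step simulation wall time.
cpu_sim_wall_s, gpu_startup_s, gpu_s_per_step = 38.6991, 0.0963, 0.0110155
gpu_steps = (cpu_sim_wall_s - gpu_startup_s) / gpu_s_per_step
print(f"GPU reaches roughly step {gpu_steps:.0f} in that time")  # ~3504
</syntaxhighlight>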