Benchmarking NAMD workflows for GPU containers


Benchmarking NAMD2 workloads for GPU containers on CCAST

Stephen Szwiec

Rasulev Computational Chemistry Research Group

North Dakota State University

26 July 2022


ApoA1 information

  • simulates a bloodstream lipoprotein particle
  • physical stats
    • 92224 atoms
    • 70660 bonds
    • 74136 angles
    • 74130 dihedrals
    • 1402 impropers
    • 32992 hydrogen groups
    • 553785 amu total mass
  • energy stats
    • 300 K initial temperature
    • -14 e total charge
  • simulation stats
    • Consists of a startup phase followed by 500 simulation steps, which are timed for the benchmark
    • GPU workload modified to use CUDA FFTW, CUDA integration
    • GPU workload modified to continue simulation for 10000 steps

NAMD non-GPU run information

  • Charmrun used with one node (condo02)
  • machine topology: 2 sockets x 64 cores x 1 PU = 128-way SMP possible
  • 20 processes, 20 cores, 1 physical node, and 16GB memory specified
  • NAMD 2.14 used

NAMD GPU run information

  • NAMD run inside a Singularity container
    • namd:3.0-alpha11 image used to generate the Singularity container
  • machine topology: 2 sockets x 64 cores x 1 PU = 128-way SMP possible
  • 2 CPUs, 2 GPUs, 1 OpenMP thread, 1 physical node, and 16GB memory specified
    • each GPU bound to one NVIDIA A10 with 22731 MB of memory
  • NAMD 3.0alpha11 used

Benchmark findings

Metric                      CPU                 GPU
Startup wall time           3.4346 s            0.0963 s
Simulation wall time        38.6991 s           5.59571 s
Wall time per step          0.0918308 s/step    0.0110155 s/step
Days per nanosecond         0.999022 days/ns    0.0660301 days/ns
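
The CPU-to-GPU ratios implied by the figures above can be worked out directly. The following is a minimal Python sketch that uses only the reported values; the names BENCHMARKS, speedup, and speedups are illustrative and not part of any NAMD tooling.

```python
# Benchmark values copied from the findings above, as (CPU, GPU) pairs.
BENCHMARKS = {
    "startup wall time (s)": (3.4346, 0.0963),
    "simulation wall time (s)": (38.6991, 5.59571),
    "wall time per step (s/step)": (0.0918308, 0.0110155),
    "days per nanosecond (days/ns)": (0.999022, 0.0660301),
}


def speedup(cpu_value, gpu_value):
    """Return how many times smaller the GPU figure is than the CPU figure."""
    return cpu_value / gpu_value


def speedups(benchmarks=BENCHMARKS):
    """Compute the CPU/GPU ratio for every metric in the table."""
    return {name: speedup(cpu, gpu) for name, (cpu, gpu) in benchmarks.items()}


if __name__ == "__main__":
    for name, ratio in speedups().items():
        print(f"{name}: {ratio:.2f}x")
```

The ratios differ considerably between metrics (from roughly 7x for simulation wall time to roughly 36x for startup), so any single speedup figure depends on which measurement it is based on.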


Additional information

  • Extending the ApoA1 workload with CUDA led to step 3500 being reached in roughly the same wall time that 20 CPUs needed to reach step 500.
  • Per-core speedup with GPU acceleration is roughly 151.3 times (the days/ns ratio scaled by the 20:2 ratio of CPU cores between the two runs)
    • caveat: NAMD's design requires one CPU core to supervise each GPU and wait for its output
  • The CUDA run performed 10000 steps and wrote output in 136.315155 s total wall time
  • The CPU run performed 500 steps and wrote output in 76.477325 s total wall time
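
The figures in this list can be checked against the benchmark values with a short Python sketch. It assumes that the per-core factor is the days/ns ratio scaled by the 20:2 ratio of CPU cores between the two runs, an interpretation that reproduces the quoted value to within rounding. All function and constant names are illustrative.

```python
# Values taken from the run descriptions and benchmark findings on this page.
CPU_RUN_CORES = 20            # CPU cores in the CPU-only run
GPU_RUN_CORES = 2             # CPU cores supervising the 2 GPUs
CPU_DAYS_PER_NS = 0.999022
GPU_DAYS_PER_NS = 0.0660301
CPU_SIM_WALL_S = 38.6991      # 500 benchmark steps on 20 CPU cores
GPU_S_PER_STEP = 0.0110155


def per_core_speedup():
    """Days/ns ratio scaled by the core-count ratio (assumed interpretation)."""
    return (CPU_DAYS_PER_NS / GPU_DAYS_PER_NS) * (CPU_RUN_CORES / GPU_RUN_CORES)


def gpu_steps_in_cpu_time():
    """Steps the GPU run completes in the CPU run's simulation wall time."""
    return CPU_SIM_WALL_S / GPU_S_PER_STEP


def steps_per_second(steps, total_wall_s):
    """End-to-end throughput, including startup and output writing."""
    return steps / total_wall_s


if __name__ == "__main__":
    print(f"per-core speedup: {per_core_speedup():.1f}x")
    print(f"GPU steps in CPU benchmark time: {gpu_steps_in_cpu_time():.0f}")
    print(f"GPU end-to-end: {steps_per_second(10000, 136.315155):.2f} steps/s")
    print(f"CPU end-to-end: {steps_per_second(500, 76.477325):.2f} steps/s")
```

The second function gives a little over 3500 steps, which agrees with the observation about step 3500. The end-to-end throughput figures include startup and output time, so they are lower than the per-step benchmark rates would suggest.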