We’ve been talking about how baselining and benchmarking help optimize the performance of your HPC system. Last time, we discussed the importance of maintaining a consistent set of benchmarks so you can analyze the performance of your HPC system throughout its lifetime. In this post, we’ll cover some specific baselining tips.
Baselining should always start with hardware and move up the ladder from there. Spend the first few weeks or months on a new system understanding the optimal settings for your hardware. Start with compute node hardware, move on to the interconnect network, and finish with cluster storage.
For compute nodes and servers, BIOS/firmware tweaks are likely needed to dial performance up or down. Consider running memory bandwidth tests coupled with high-intensity CPU checks like Linpack or DGEMM. Develop scripts to run these tests on individual nodes as the backbone of your hardware testing efforts. This approach lets you analyze the performance of every compute node in the cluster individually. From there, pick out the low performers and compare them to the top performers; perhaps a firmware setting needs to be adjusted.
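As a minimal sketch of that comparison step, suppose you’ve collected per-node DGEMM results into a simple mapping (the node names and GFLOPS values here are made up, and the 95%-of-median cutoff is just one reasonable choice):

```python
from statistics import median

def flag_low_performers(results, threshold=0.95):
    """Return nodes whose score falls below threshold * cluster median."""
    cutoff = threshold * median(results.values())
    return sorted(node for node, gflops in results.items() if gflops < cutoff)

# Hypothetical per-node DGEMM results in GFLOPS.
results = {
    "node001": 612.4, "node002": 608.9, "node003": 611.7,
    "node004": 512.3,  # suspiciously low -- check BIOS/firmware settings
    "node005": 610.2,
}
print(flag_low_performers(results))  # -> ['node004']
```

Using the median rather than the mean keeps one badly broken node from dragging the cutoff down with it.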
Once you’ve established a norm for hardware performance, define it! Your Linpack configuration files don’t need to be a model of perfection. The important thing is to establish a minimum level of acceptable performance, not to prove that you’re a Linpack master (though the bragging rights would be cool).
Create scripts that log each hardware benchmark run per system with a time/date stamp and the results. When a compute node is suspected of running slow, run your hardware benchmarks and compare the results to the dozen or so runs executed in the past. A 20-minute benchmark can save a day or more of troubleshooting simply by ruling a hardware problem in or out.
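One way that logging-and-comparing can look, sketched in Python (the log path, JSON-lines format, 12-run window, and 95% tolerance are all assumptions to adapt to your site):

```python
import json
import time
from pathlib import Path

LOG = Path("/var/log/hpc-baselines/dgemm.jsonl")  # hypothetical location

def log_run(node, gflops, log=LOG):
    """Append one timestamped benchmark result as a JSON line."""
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a") as f:
        f.write(json.dumps({"ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
                            "node": node, "gflops": gflops}) + "\n")

def looks_slow(node, gflops, log=LOG, window=12, tolerance=0.95):
    """Compare a fresh result against the node's recent history."""
    history = [json.loads(line) for line in log.read_text().splitlines()]
    past = [r["gflops"] for r in history if r["node"] == node][-window:]
    if not past:
        return False  # no history yet -- nothing to compare against
    return gflops < tolerance * (sum(past) / len(past))
```

The append-only format means the same file doubles as your historical record and your comparison baseline.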
As preventive maintenance and system updates are applied to the server hardware or operating system, re-execute your compute node benchmarks to ensure performance doesn’t slide. If it does, perhaps a new BIOS/firmware parameter was introduced for “power savings,” or perhaps the update reset your firmware settings.
For low-latency interconnects, things quickly get sticky. Start by testing network bandwidth between compute nodes on a common switch until you can attain a value close to theoretical peak (this is similar in spirit to the single-node server checks). The interconnect is one of the most difficult pieces of your HPC system on which to manage performance.
In general, InfiniBand networks are configured as a fat-tree, and the larger they are, the more of a problem congestion becomes. Accelerated project schedules for implementing new clusters rarely allow enough time for more than a cursory check of the InfiniBand network. Work with your system installer to develop an adequate system-level performance benchmark based on your network type and topology. This can leverage MPI bandwidth checks between groups of nodes or, in some cases, all nodes in the system to validate system performance. Log these results to a file as well for historical reference.
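To make those group-level checks exercise the fat-tree rather than just a single leaf switch, you can pair up nodes so every test flow crosses switch boundaries. Here’s one simple sketch (the leaf/node layout is hypothetical; the actual point-to-point test on each pair could be an MPI bandwidth tool such as osu_bw, which is left out here):

```python
def cross_switch_pairs(switches):
    """Pair nodes from different leaf switches so that test traffic
    must cross the upper tiers of the fat-tree.  `switches` maps a
    leaf-switch name to the list of nodes attached to it."""
    groups = list(switches.values())
    half = len(groups) // 2
    pairs = []
    # Pair each switch in the first half with one in the second half,
    # then pair their nodes positionally.
    for a, b in zip(groups[:half], groups[half:]):
        pairs.extend(zip(a, b))
    return pairs

# Hypothetical two-leaf layout; every resulting pair spans both
# switches, so running bandwidth tests on all pairs at once stresses
# the leaf-to-spine links instead of staying inside one switch.
leaves = {"leaf1": ["n001", "n002"], "leaf2": ["n003", "n004"]}
print(cross_switch_pairs(leaves))  # -> [('n001', 'n003'), ('n002', 'n004')]
```

Running all of the pairs concurrently is what surfaces congestion; one pair at a time will usually look fine.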
Cluster filesystems need a multi-pronged approach. Most cluster filesystems are created on storage array volumes, which are themselves backed by groups of individual disks. Consider carefully the tuning parameters selected for your storage arrays and how they interact with your filesystem. At a minimum, during cluster installation, execute IOR across a predefined group of nodes; the result is a cluster-wide read/write throughput value.
While IOR checks the read/write performance of your cluster, consider running mdtest to validate the metadata performance of your filesystem as well. The tuning and iterations required may take some tweaking, but once you’ve found the test size and node count that reach optimal performance, set that as the standard (and yes, log the results of each run to a file for historical analysis).
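Once you’ve settled on parameters, freeze them in a script so every run is identical. A sketch of that idea, building the IOR and mdtest command lines from a few knobs (the node counts, paths, transfer/block sizes, and file counts below are illustrative assumptions, not recommendations; check the IOR and mdtest documentation for your versions):

```python
def ior_cmd(nodes, ppn, path, xfer="1m", block="4g"):
    """Assemble a fixed, repeatable IOR command line.
    -w/-r run write then read phases, -F is file-per-process,
    -t/-b set transfer and block sizes; tune once, then freeze."""
    np = nodes * ppn
    return ["mpirun", "-np", str(np),
            "ior", "-w", "-r", "-F", "-t", xfer, "-b", block,
            "-o", f"{path}/ior.testfile"]

def mdtest_cmd(nodes, ppn, path, files_per_proc=1000, iterations=3):
    """Assemble a fixed mdtest command line for metadata baselining.
    -n is items per process, -i is iterations, -d the working dir."""
    np = nodes * ppn
    return ["mpirun", "-np", str(np),
            "mdtest", "-n", str(files_per_proc), "-i", str(iterations),
            "-d", f"{path}/mdtest"]

# Hypothetical 8-node, 4-process-per-node baseline run:
print(" ".join(ior_cmd(8, 4, "/mnt/scratch")))
```

Returning the command as a list (rather than a shell string) makes it easy to hand straight to `subprocess.run` and to log alongside the results.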
So now we’ve run through compute node hardware, network, and storage baselines. Let’s pull these together into a system-wide baseline with an industry-standard benchmark like HPCC, which includes Linpack, DGEMM, STREAM, PTRANS, RandomAccess, FFT, and a communication test suite. HPCC’s range of tests lets you exercise specific facets of your system as well as its performance as a whole.
Finally, and most importantly, we reach the application baselines. What test should this be? Take care in picking a local application benchmark. The application needs to stay fixed in place for years, or at a minimum, the copy you execute for the baseline needs to be separate from the one your users run. As the application team improves and changes the code behind the application baseline, discuss locking in a new version of the application at a specific code base and incorporating a separate run of the new application benchmark.
As I said in Part One of this post, HPC systems are complex beasts. But a diligent process of benchmarking and baselining can help ensure changes to your system don’t negatively affect performance. We’d be happy to answer your questions or discuss your own situation more deeply. Just reach out.
Keep connected—subscribe to our blog by email.