Scaling HPC applications on your hardware is a vital task in supercomputing. If an application can’t use all of the hardware devoted to it, then you have expensive equipment sitting idle and not doing anyone any good.
When looking at scalability, it is important to first ensure that the problem itself can scale. In other words, is the problem big enough to be broken into many component parts that can be solved in parallel? Assuming it can, the application will have a scalability bottleneck of one sort or another, and most likely a series of them. If there is a preponderance of serial code in your application, for example, or other dependencies that force serial execution, the code may need to be refactored to either eliminate the serial portions or hide them by performing other work at the same time.
The classic approach to application scalability is to divide the problem into many sub-domains for MPI-based parallelization. Each MPI process then gets a chunk of the problem to work on, distributing the work across the MPI domain.
It is also possible to divide the problem into compact, largely independent kernels that can be off-loaded to an accelerator, such as a GPU or FPGA, to crunch. Accelerators have many more compute cores than CPUs and can run large numbers of calculations much faster than a traditional CPU, but to achieve the best results they require single instruction, multiple data (SIMD) workloads, which is not always the type of problem being solved. SIMD operations are also often referred to as vectorization. Most processors today have at least some vector calculation ability, but accelerators provide true vectorization.
The bottleneck(s) in a given application may already be known, but if they're not obvious, you can use a profiling tool such as TAU to identify them.
Let’s look at the common scaling barriers in turn and discuss how to conquer (or at least work around) each:
Making an application truly scalable will require multiple iterations through this process because as soon as one barrier is overcome, another will appear. However, each iteration will result in faster, more efficient code and will boost performance, throughput, and hardware utilization. You will be pushing your hardware closer and closer to its limits and thus getting your money’s worth out of it.
For a deeper discussion on scaling your own HPC system, reach out.