While public clouds have been used for HPC for at least the past decade, key challenges such as interconnect latency, virtualization overhead, and data movement have traditionally slowed the adoption of cloud computing for HPC.
Recently, the HPC offerings of public cloud vendors have begun to mature into viable platforms for virtually any workload, whether traditional MPI-based HPC applications or machine learning/AI workflows. A few key developments mark a sea change in cloud HPC capabilities:
Low-latency interconnects: The availability of InfiniBand or accelerated Ethernet options is a game-changer for running latency-sensitive distributed applications.
Whole-node virtualization: Cloud providers now offer virtual machine instances that have exclusive use of the entire physical machine, with reduced virtualization overhead. In some cases, special processors offload virtualization onto dedicated hardware, leaving the host’s full core count to the operating system and user workloads. Current offerings also include compute nodes with GPU resources, which are frequently used for machine learning.
Node affinity: For multi-node distributed applications, node affinity, or colocation of compute nodes both topologically (on the same network subnet) and physically (e.g., in the same rack), is important for achieving low-latency communication. Isolation from other workloads and network traffic is also desirable, since a single slow connection can dramatically degrade application performance. Major cloud vendors offer HPC-friendly virtual cluster deployment features that help enforce node affinity.
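As one concrete illustration, AWS exposes node affinity through “cluster” placement groups; other vendors have analogous features. The commands below are a sketch only, with a placeholder AMI ID, and assume a configured AWS CLI:

```shell
# Illustrative: a "cluster" placement group asks EC2 to pack instances
# close together for low-latency, high-throughput communication.
aws ec2 create-placement-group \
    --group-name hpc-cluster-pg \
    --strategy cluster

# Launch instances into the placement group so they are colocated.
# <ami-id> is a placeholder; a real machine image ID is required.
aws ec2 run-instances \
    --image-id <ami-id> \
    --instance-type c5n.18xlarge \
    --count 4 \
    --placement GroupName=hpc-cluster-pg
```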
High-performance storage options: The availability of high-performance compute, disk, and interconnect options in the cloud enables roll-your-own HPC storage. However, several vendors offer preconfigured high-performance storage using Lustre, IBM Spectrum Scale, BeeGFS, or other parallel file systems. Some vendors offer “storage as a service” options that can dynamically create a given amount of high-performance storage on demand, while enabling Hierarchical Storage Management (HSM) features to back it up to a less expensive persistent file or object storage tier in the cloud.
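Once such a managed parallel file system has been provisioned, attaching it to a compute node is typically a single mount. For Lustre, the client-side command looks like the following (the MGS address and file system name are placeholders):

```shell
# Illustrative Lustre client mount; replace the address and fsname
# with the values reported by your storage service.
sudo mkdir -p /mnt/lustre
sudo mount -t lustre 10.0.0.10@tcp:/fsname /mnt/lustre
```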
Easy deployment: Several vendors have templates for deploying virtual clusters with a preconfigured workload manager (such as Slurm) in a single command or click. This approach deploys an environment familiar to HPC users, easing adoption through familiarity and portability. Coordination between the workload manager and the virtualization back-end can automatically provision or deprovision resources based on demand.
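AWS ParallelCluster is one example of such a template-driven deployment. A minimal sketch of a configuration and the single command that builds the cluster might look like the following; the values are illustrative placeholders, not a complete working configuration:

```shell
# Illustrative sketch using AWS ParallelCluster (v3).
cat > cluster.yaml <<'EOF'
Region: us-east-1
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: <subnet-id>        # placeholder
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: nodes
          InstanceType: c5n.18xlarge
          MinCount: 0            # scale to zero when idle
          MaxCount: 16           # provisioned on demand by Slurm
EOF

# One command deploys the head node, compute fleet, and Slurm configuration.
pcluster create-cluster --cluster-name demo --cluster-configuration cluster.yaml
```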
Flexible job submission options: Vendors offer SSH access to a “head node” in the cloud, APIs to launch and interact with jobs from one’s laptop or workstation, or “cloudbursting,” where users submit jobs to a local compute cluster that forwards the job to cloud resources as needed.
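In the SSH model, submission looks just like it would on an on-prem Slurm cluster; only the hostname changes. A sketch, with placeholder names throughout:

```shell
# Submit a batch job to the cloud head node over SSH (placeholders throughout).
ssh user@headnode.example.com 'sbatch --nodes=4 --ntasks-per-node=36 job.sbatch'

# Check queue state remotely.
ssh user@headnode.example.com 'squeue -u "$USER"'
```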
These developments broaden the array of workloads that may successfully be run in the cloud, and make cloud resources much easier to utilize.
Several models exist for utilizing cloud resources for HPC:
Long-term deployment: Instantiate compute and storage resources and leave them available for users over the course of a project or period of peak use. While this simple approach is most like a traditional cluster, it may not be cost-efficient, as in many scenarios one pays by the hour for resources, whether they are used or not.
Per-job deployment: Submit a job to run in the cloud through an API or Web interface, and the required compute, storage and networking resources will automatically be provisioned and configured for running it. While this approach helps to reduce cloud charges for idle periods, the setup for this approach to work seamlessly and efficiently is more complicated, and adds a significant amount of startup and tear-down time to the job (which incurs cloud charges).
Cloudbursting: In a traditional HPC batch environment, one can “burst” workload to cloud resources when demand requires additional compute capacity. In this case, the local workload manager makes an API call to dynamically provision virtual instances and storage in the cloud, runs the job in the cloud, and retrieves results back to the local cluster. Virtual resources that are no longer needed can be automatically deprovisioned to reduce costs.
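Slurm’s power-saving/cloud-node machinery is one common way to implement this elasticity: nodes are declared in a CLOUD state, and Slurm invokes site-provided scripts to create and destroy instances as demand changes. A hedged sketch of the relevant slurm.conf settings (paths, node names, and counts are site-specific placeholders):

```
# slurm.conf excerpt (illustrative). ResumeProgram/SuspendProgram are
# site-provided scripts that call the cloud provider's API.
ResumeProgram=/opt/slurm/bin/cloud_resume.sh     # placeholder path
SuspendProgram=/opt/slurm/bin/cloud_suspend.sh   # placeholder path
SuspendTime=300          # idle seconds before a node is deprovisioned
ResumeTimeout=600        # seconds allowed for an instance to boot

NodeName=cloud[001-016] State=CLOUD CPUs=36
PartitionName=burst Nodes=cloud[001-016] MaxTime=INFINITE State=UP
```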
Cloud computing’s advantages for HPC are similar to those for other use cases: pay only for the resources you use, scale capacity up and down on demand, and avoid the lead time, facilities, and budgeting overhead of owning hardware.
Although much progress has been made in supporting HPC in the cloud, key challenges remain, particularly the availability of low-latency interconnects. In some cases, custom and updated software stacks are required to utilize a vendor’s high-performance networking option. Also, users accustomed to always-available physical infrastructure and data storage may need to rethink job submission, workflows, and data staging in a cloud environment.
To determine how public cloud providers can best support your HPC mission, a number of questions and considerations must be weighed. Chief among them: to what extent can cloud providers partially or completely replace private “on-prem” resources? In general, it is often cheaper to own than to rent over the long term, but deploying on-prem solutions can involve significant lead time, facilities work, and budgeting hurdles. Deciding how much of your workload can move to the cloud, when this is appropriate, and how it will be achieved depends on a complex mixture of cost, performance, and feasibility considerations.
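The own-versus-rent trade-off can be made concrete with a simple break-even calculation. The sketch below uses entirely hypothetical prices and utilization figures, not vendor quotes; the point is the shape of the comparison, not the numbers:

```python
# Hypothetical break-even comparison between an owned cluster and
# on-demand cloud instances. All figures are illustrative assumptions.

ONPREM_CAPEX = 250_000.0          # purchase price of a small cluster (assumed)
ONPREM_OPEX_PER_MONTH = 4_000.0   # power, cooling, admin (assumed)
CLOUD_COST_PER_NODE_HOUR = 3.0    # on-demand rate per node (assumed)
HOURS_PER_MONTH = 730

def monthly_cloud_cost(nodes, utilization):
    """Cloud spend if instances run only while jobs are active."""
    return nodes * utilization * HOURS_PER_MONTH * CLOUD_COST_PER_NODE_HOUR

def breakeven_months(nodes, utilization):
    """Months after which owning becomes cheaper than renting.

    Returns None when the cloud is cheaper indefinitely, i.e. when the
    monthly cloud bill never exceeds on-prem operating costs.
    """
    margin = monthly_cloud_cost(nodes, utilization) - ONPREM_OPEX_PER_MONTH
    if margin <= 0:
        return None
    return ONPREM_CAPEX / margin

# At 60% utilization of 16 nodes, owning pays off in roughly 15 months;
# at very low utilization, renting stays cheaper.
print(breakeven_months(16, 0.60))
print(breakeven_months(16, 0.05))
```

The crossover point is highly sensitive to utilization, which is why bursty or seasonal workloads tend to favor the cloud while steadily loaded clusters tend to favor ownership.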
When developing your strategy for a user-ready HPC deployment to the cloud, it is helpful to have someone with deep HPC and cloud experience to guide your decisions. If you are interested in transitioning HPC workload to the cloud, reach out to us to see how we can help.
Keep connected—subscribe to our blog by email.