Large-Scale HPC Workflow Management at NCEP

By Terrence McGuinness in Operations & Maintenance on March 3, 2017

Editor’s Note: RedLine Senior Program Analyst Kit Menlove contributed to this post.

For industry and academia alike, managing workflows for large-scale, data-intensive computational processes is a constant challenge. As every industry and scientific discipline comes to rely on increasingly complex and robust computational solutions, the ability to create repeatable, agile end-to-end processes becomes a high priority.

National numerical weather prediction centers face a unique set of challenges: they require highly available, accurate, monitorable, and flexible systems while simultaneously supporting critical research and development efforts.

Here, we will outline current workflow requirements in production and research at NOAA’s National Centers for Environmental Prediction (NCEP) and discuss what we are doing to meet these requirements with available software.

NCEP has adopted two technologies for HPC workflows: Rocoto, from NOAA's Developmental Testbed Center, and ecFlow, from the European Centre for Medium-Range Weather Forecasts (ECMWF), to support its needs in research and production, respectively. Both have proven effective at reliably managing the complexities of large-scale weather-related workflows, with each solving distinct challenges in its domain. Both are freely available for download and use under the Apache License, Version 2.0.

Operations

The operations team at NCEP maintains a suite of weather models and applications with the mission to deliver accurate, highly reliable meteorological products on time to regional weather centers, federal agencies, the armed forces, and the public at large. This mission requires a set of tools to enable the operations team to monitor, troubleshoot, and interact with the production suite in a transparent fashion. It also demands that backup hardware (also used by the research team) and failover mechanisms be in place and always available should fatal hardware errors arise. To ensure accuracy, the ability to test, onboard, and disseminate new and upgraded workflows in a streamlined way is essential.

ecFlow is an event-driven system with a feature-rich graphical interface for real-time monitoring. It is based on a client-server paradigm where tasks, organized hierarchically into suites and families, are initiated from the ecFlow server and send signals back to the server indicating their progress, completion, or abnormal termination.

These signals, sent via commands embedded in each task's code, update the task's state on the receiving server, and that state information can then trigger downstream tasks. The continuously running server also enables real-time notification when a task's submission or run time exceeds the norm. The trade-off is that ecFlow works best where daemon processes are allowed to run for weeks at a time, something that is often restricted in shared computing environments.
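The trigger mechanism can be illustrated with a minimal Python sketch. This is not the ecFlow API: the `Task` class and its `trigger` attribute are invented for illustration, though the state names mirror ecFlow's documented task states.

```python
from enum import Enum

class State(Enum):
    QUEUED = "queued"
    SUBMITTED = "submitted"
    ACTIVE = "active"
    COMPLETE = "complete"
    ABORTED = "aborted"

class Task:
    """Toy model of an ecFlow-style task whose state changes drive triggers."""
    def __init__(self, name, trigger=None):
        self.name = name
        self.state = State.QUEUED
        self.trigger = trigger  # upstream task that must complete first

    def ready(self):
        # A task with no trigger is always eligible; otherwise its
        # upstream dependency must have signalled normal completion.
        return self.trigger is None or self.trigger.state is State.COMPLETE

# The downstream task becomes eligible only when its upstream
# dependency signals COMPLETE back to the server.
fetch = Task("fetch_obs")
assim = Task("assimilate", trigger=fetch)

assert not assim.ready()      # upstream still queued
fetch.state = State.ACTIVE    # task signals it has started
fetch.state = State.COMPLETE  # task signals normal termination
assert assim.ready()          # trigger now fires
```

In real ecFlow suites, those state changes are reported by client commands embedded in the job scripts themselves, so the server always holds the authoritative picture of the suite.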

Research

Supporting research and development workflows differs from operations in that the emphasis falls on enabling individual researchers to reconfigure and re-run complex, novel experiments with minimal effort. There is also greater demand to support multiple platforms with varying system configurations.

Rocoto's strength is its simplicity: it works from a concise XML description of a workflow and operates entirely within a user's environment. Because a centralized server is not available on all research platforms, it relies instead on a simple SQLite database file. This file is created when the user first executes the main rocotorun utility and is updated on each subsequent run (usually via cron for long-running workflows).

On each invocation, the utility queries the system's scheduler, reconciles the results with the database to establish the workflow's current state, and determines the next appropriate task. The job script for that task is then created on the fly and submitted automatically to the scheduler. Since the workflow description is organized around time-based cycles, this process repeats until a specified end date is reached. Because the scheduler must be polled regularly for state information, the approach works best when the number of Rocoto instances running on a system is limited.
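The poll-and-record pattern behind each rocotorun pass can be sketched in a few lines of Python using the standard sqlite3 module. The table layout, state names, and `query_scheduler` stand-in are illustrative only; they do not reflect Rocoto's actual schema or internals.

```python
import sqlite3

# Illustrative schema, not Rocoto's actual one: one row per task per cycle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (task TEXT, cycle TEXT, state TEXT)")
conn.execute("INSERT INTO jobs VALUES ('fetch', '2017030300', 'SUCCEEDED')")
conn.execute("INSERT INTO jobs VALUES ('forecast', '2017030300', 'QUEUED')")

def query_scheduler(task, cycle):
    """Stand-in for polling the batch scheduler (e.g. qstat or squeue)."""
    return "RUNNING" if task == "forecast" else "DONE"

def rocotorun_pass(conn):
    """One pass: refresh each unfinished job's state from the scheduler
    and record it in the database before choosing the next task."""
    for task, cycle, state in list(conn.execute("SELECT * FROM jobs")):
        if state not in ("SUCCEEDED", "DEAD"):
            new_state = query_scheduler(task, cycle)
            conn.execute("UPDATE jobs SET state=? WHERE task=? AND cycle=?",
                         (new_state, task, cycle))
    conn.commit()

rocotorun_pass(conn)
states = dict(conn.execute("SELECT task, state FROM jobs"))
# states now reflects the scheduler's view of the unfinished job
```

Because all state lives in a single file, the workflow only advances when a pass like this runs, which is why rocotorun is typically driven from cron.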

We met the researchers' configurability and ease-of-use requirements by building a front-end tool for the XML workflow description, which generates an experiment on demand from a set of user-configurable settings. Because the Rocoto framework internally supports each system's specific scheduling interface, supporting multiple platforms is seamless. It was then straightforward to develop interfaces that interact with and control the workflow in real time through the utilities Rocoto provides.
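A front-end of this kind boils down to translating a flat settings structure into workflow XML. The sketch below is a hypothetical generator, not our actual tool: the settings dictionary and its keys are invented, and while tag names like workflow, cycledef, task, and command follow Rocoto's XML vocabulary, the emitted document is a simplified illustration rather than a valid Rocoto input.

```python
import xml.etree.ElementTree as ET

def build_workflow(settings):
    """Generate simplified Rocoto-style workflow XML from a settings dict.
    The dict keys and overall shape are illustrative, not a real schema."""
    wf = ET.Element("workflow", realtime="F", scheduler=settings["scheduler"])
    cyc = ET.SubElement(wf, "cycledef")
    cyc.text = settings["cycledef"]  # start, stop, and step of the cycles
    for name, cmd in settings["tasks"].items():
        task = ET.SubElement(wf, "task", name=name)
        ET.SubElement(task, "command").text = cmd
    return ET.tostring(wf, encoding="unicode")

# Hypothetical user-configurable settings for one experiment.
settings = {
    "scheduler": "slurm",
    "cycledef": "201703030000 201703040000 06:00:00",
    "tasks": {"fetch": "fetch.sh", "forecast": "fcst.sh"},
}
xml_doc = build_workflow(settings)
```

Keeping the user-facing settings separate from the generated XML is what lets researchers re-run a reconfigured experiment without hand-editing the workflow description.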

Given the strengths of each, active use of both ecFlow and Rocoto is likely to continue for the foreseeable future. To make transitions between the two as seamless as possible, an effort is underway to bridge them with a newly developed specification language currently used in the research division's testing framework. Together with a set of utilities, the language will derive the requisite Rocoto and ecFlow workflow inputs on the fly. The remaining challenge is to incorporate a standard way of using data availability as a trigger in ecFlow (which Rocoto supports natively) while maintaining workflow-level variables and events in Rocoto (which ecFlow supports natively). This approach will greatly reduce the difficulty of meeting the center's research needs while still satisfying delivery requirements for the production systems.

For a deeper discussion of how to effectively deploy HPC workflow management solutions within your organization, reach out.

Comparing Rocoto and ecFlow

| Rocoto | ecFlow |
| --- | --- |
| Workflow advances only when the rocotorun utility is executed (usually via the user's cron) | Server daemon runs continuously |
| Must query the system's scheduler to maintain workflow state information | Maintains its own job state information |
| Natively supports intercycle task, data, and time dependencies | Natively supports task, event, and time dependencies |
| Workflows are defined in a custom XML language | Workflows are defined in a custom definition format |
| System-specific job cards are created automatically | System-specific job cards are provided by the user |
| Some interactive visual interfaces are available | Fully supported, feature-rich GUI available |

Written by

Terrence McGuinness

Terry joined RedLine Performance Solutions in 2015 as a Senior HPC Analyst. He is a technical computing specialist with expertise in computational science, applied mathematics, and scientific programming. He currently works in the Engineering and Implementation Branch at NOAA's National Centers for Environmental Prediction in College Park. Prior to joining RedLine, Terry served […]


All rights reserved. Copyright 2014 RedLine Performance Solutions, LLC.