Having spent most of my HPC career as a system administrator at a large government agency, I’ve experienced more than my fair share of problems related to change. Change is an essential part of IT operations, and HPC is no exception. Although there is no magic bullet that makes change risk-free, good change management will reduce the mean time to repair (MTTR) when a change introduces problems into your HPC system.
So what makes a change management process work? The most important aspect is the people. The best change management policies, procedures, and tools are useless unless people adhere to them. System administrators, and particularly HPC system administrators, are typically very smart folks. Unfortunately, some view adherence to a change management process as an unnecessary burden, particularly when it comes to making small and seemingly “obvious” changes to fix a problem.
What is often not fully considered are the upstream and downstream ramifications of a change, which in a complex HPC system are seldom obvious even to the most experienced sysadmin. When things start failing, it’s usually not the person who made the change who gets called at 3 a.m. I know that when I get that call, the first thing I want to know is what changed. Too much time is wasted on problem determination when the actual fix is relatively simple. Knowing what changed, when it changed, why it changed, and how to back it out are some of the most valuable benefits system administrators derive from change management.
The value of baselines and benchmarks before and after a change cannot be overstated. Changes often have an impact that is not obvious. How many times have we received calls from our end users saying, “the system seems slower today”? Running your baselines and benchmarks prior to a change confirms your system is healthy; running those same baselines and benchmarks after the change allows you to assess its impact. In addition, if your pre-change baselines or benchmarks fail, strongly consider cancelling your change and remediating your system “as is” before applying “more change.”
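The pre/post comparison can be automated so the assessment is not left to gut feel. Here is a minimal sketch: the metric names, numbers, and 5% tolerance are all hypothetical placeholders; substitute your site’s own health checks (STREAM, IOR, a short MPI job, or whatever you baseline today).

```python
def within_tolerance(baseline, current, tolerance=0.05):
    """Return True if the current result is within `tolerance`
    (as a fraction) of the recorded baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(current - baseline) / baseline <= tolerance

def assess_change(pre, post, tolerance=0.05):
    """Compare post-change benchmark numbers against pre-change
    baselines; return the names of any metrics that regressed."""
    return [name for name, base in pre.items()
            if not within_tolerance(base, post.get(name, 0.0), tolerance)]

# Illustrative numbers only: bandwidth in GB/s for two benchmarks
pre  = {"stream_triad": 180.0, "ior_write": 42.0}
post = {"stream_triad": 179.2, "ior_write": 33.5}
print(assess_change(pre, post))  # ior_write dropped ~20%, so it is flagged
```

Run the same comparison against your pre-change baseline before the change window opens: if metrics are already out of tolerance, that is your cue to cancel and remediate first.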
Providing access to change management data for help desk and on-call personnel is another essential piece of an organization’s change management process. Many IT departments have multiple teams, such as storage, network, systems, security, and applications. Each of these teams can, and likely does, operate and make changes independently of the others. Knowing whether a change has or has not been made by another group is as important as knowing whether your own group has made one. The key word here is knowing: when you know about changes, it means the entire organization has bought into change management.
And that brings me to my final point: There should be zero tolerance for unauthorized change.
Enforcing a policy of zero unauthorized changes is how we know change management is being strictly followed. Dealing with change-related problems comes with the territory for an HPC system administrator, but dealing with unauthorized changes is unacceptable. The problem determination process requires sysadmins to make logical deductions based on available information. When changes are unauthorized or undocumented, MTTR suffers.
Configuration management tools can be deployed to detect and even correct unauthorized change. (For the purposes of this article, I consider configuration management to be a subset of change management.)
In the HPC world, we often deploy stateless images on our compute nodes. Making changes to a stateless image is a more deliberate process than it is with stateful images. Cluster management tools like xCAT and Bright Cluster Manager make deploying and updating stateless images relatively easy. Returning nodes to a known and consistent state is as simple as rebooting a node. Detecting unauthorized changes to system images is very manageable. Enterprise configuration management tools like Puppet and Chef can be configured to detect and, if appropriate, overwrite unauthorized changes for both stateless and stateful nodes.
These tools have been successfully deployed and utilized in HPC environments. Tools like Tripwire or OSSEC, known mostly for intrusion detection, provide automated detection and reporting of changes. However, in the end, IT staff must be trusted to adhere to change management policies and practices or find a new line of work.
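At their core, integrity checkers like Tripwire and OSSEC automate a simple idea: record a hash baseline of the files you care about, then re-scan and report anything that differs. The sketch below illustrates that idea only; the paths and manifest format are made up, and a real deployment would protect and sign the manifest rather than trust a plain dictionary.

```python
import hashlib
import os

def sha256_of(path):
    """Hash a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(paths):
    """Record a hash baseline for a set of files (the 'known good' state)."""
    return {p: sha256_of(p) for p in paths if os.path.isfile(p)}

def detect_changes(manifest):
    """Return files whose current hash differs from the baseline,
    or which have disappeared since the baseline was taken."""
    changed = []
    for path, digest in manifest.items():
        if not os.path.isfile(path) or sha256_of(path) != digest:
            changed.append(path)
    return changed
```

Taking a fresh baseline as part of every approved change, and running the detector from cron between changes, makes any hit in the report an unauthorized or undocumented change by definition.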
Change is inevitable, and it will break your system. I believe that more firmly than in death and taxes. And yes, sitting through a change management meeting to get approval for a simple update is a real pain, but ultimately it will allow you and your fellow sysadmins to occasionally get a good night’s sleep.
Our team at RedLine is charged with delivering more than just uptime; we have to deliver on time. So we know that when managing a complex HPC system, even the smallest of changes can cause unforeseen problems. Reach out for more advice on how to enforce adherence to change management.