Assessing Your HPC System Data Archiving Needs

Posted in System Administration on June 7, 2017

Most organizations need some level and type of archive, whether it’s small or large. Sometimes it’s a legal requirement, like for companies that keep medical or financial records. Firms utilizing HPC systems often need to store large data sets long-term in order to analyze them later.

Organizations with small archive requirements, say less than a petabyte, might simply use backup software to build and maintain their long-term archives. But a backup is usually just a snapshot in time—an indiscriminate copy of all of the data in the organization. Backup software can only take you so far, usually to the 2- to 4-petabyte level, before reaching its limit in scalability.

An archive, on the other hand, is where an organization keeps data that’s very important, but not used all of the time. And the more computing a firm performs, the more data it generates, and the more data that organization will find itself needing to archive.

This is especially true for those using high-performance computing systems. For such organizations, it’s vital to develop a carefully considered archiving strategy that addresses the question of whether to archive data on spinning disks, tape, or (as we usually recommend) some combination of both.

Tape vs. Disks

Tape has the advantage of being low-cost in terms of data density, physical density, and power consumption. In other words, you can store a huge amount of data in a relatively small footprint that only sips electricity when compared to large disk arrays.

But one downside of tape is that data access times are quite a bit slower than spinning disks, which can be a problem in some organizations, particularly when dealing with small files.

This is why such organizations typically use a mix of tape and hard drives in their archive solution. Because writing anything other than large files directly to tape is inefficient, a tiered storage archive uses disk as a caching area for data. The archive system can then stream that data to tape in a manner that is efficient for tape: multiple small files, for instance, can be aggregated into a single, more efficient write. Data remains on disk until a second copy is made, possibly offsite; policy ultimately dictates how long data stays on disk.
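To make the aggregation step concrete, here is a minimal sketch in Python. It is illustrative only: the function name and the use of a tar container are assumptions for the example, not the behavior of any particular archive product.

```python
import io
import tarfile

def bundle_small_files(files):
    """Aggregate (name, bytes) pairs into a single tar stream so the
    tape drive sees one large sequential write instead of many small ones."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

A real hierarchical storage manager performs this aggregation internally; the point is simply that one large sequential write is far friendlier to tape than thousands of tiny ones.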

The proper balance between tape and hard drives in a particular archive solution depends on specific organizational requirements. There is no "one architecture fits all" solution. Several factors must be taken into account to design a solution that fits an organization's needs today yet scales adequately to handle future growth. A good archive solution will allow for the required flexibility and scalability.

Key Factors That Influence Archiving Decisions

Some of the basic considerations are elementary: how much archive data is generated daily, how long it must be kept, and how often archived data needs to be accessed. The answers to these questions dictate the initial archive size, future scalability needs, and the ratio between tape and disk.

Another consideration is average file size and the sheer number of files. Large numbers of small files (under 16 KB, for example) put a lot of stress on a tape-based archive system. While these files don't take up much space, accessing them is expensive: tape is read sequentially, unlike disk, and seeking to a tape mark can take considerably longer than actually reading the file. This is why some data may not be suitable for a tape-based archive. Some archives allow policies under which all files smaller than a given size remain on disk and are never purged. They can still be written to tape as an aggregate, however, which satisfies the requirement for multiple or offsite copies.
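A hedged sketch of such a placement policy might look like the following. The 16 KB threshold and the placement labels are assumptions for illustration, not any vendor's defaults.

```python
# Assumed policy threshold: keep files under 16 KB disk-resident.
SMALL_FILE_LIMIT = 16 * 1024

def place_file(size_bytes):
    """Decide where a file's primary copy lives. Small files stay on disk
    (never purged); their tape copies are made as part of an aggregate,
    which still satisfies the multiple/offsite-copy requirement."""
    if size_bytes < SMALL_FILE_LIMIT:
        return {"primary": "disk", "tape_copy": "aggregated", "purge_from_disk": False}
    return {"primary": "tape", "tape_copy": "direct", "purge_from_disk": True}
```

In a production archive this decision is expressed as a policy rule rather than code, but the logic is the same: the threshold trades disk capacity against tape seek costs.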

Yet another important factor is the number of people who will need to access the same data simultaneously. Organizations should consider how much collaborative work will be performed on particular data sets to help guide archiving decisions. It's best to design your archive policy so that frequently retrieved files are stored on disk (because it's faster) and those retrieved less often remain on tape (because it's more cost-effective).

That's one reason the best archive solutions are hierarchies, in which users initially store data on disk, where it can be quickly and easily accessed. Over time, or when other conditions like frequency of access are met, the data is migrated to tape. This provides the best balance between speed of access and total cost of ownership.
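The hierarchy described above can be sketched as a simple age-based rule. The 30-day residency window below is an assumed example; real policies may also weigh access frequency, file size, and available disk capacity.

```python
SECONDS_PER_DAY = 86400
DISK_RESIDENCY_DAYS = 30  # assumed policy window, not a recommendation

def should_migrate_to_tape(last_access_epoch, now_epoch):
    """Return True once a file has gone unread on disk longer than the
    residency window, making it a candidate for migration to tape."""
    age_days = (now_epoch - last_access_epoch) / SECONDS_PER_DAY
    return age_days > DISK_RESIDENCY_DAYS
```
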

Archiving seems like a relatively simple proposition when the archive is small. But archives only grow larger. That’s why it’s important to plan your archive strategy today, before it becomes unmanageable. For help with that, or for a deeper conversation about your own archiving needs, get in touch.

Editor’s Note: RedLine Vice President and Chief Technology Officer Don Avart contributed to this article.


All rights reserved. Copyright 2020 RedLine Performance Solutions, LLC.