GPFS Engineer

Job Description:

RedLine Performance Solutions (RedLine) has been in the HPC solutions engineering services business for approximately 17 years and is consistently determined to keep the “bar of excellence” quite high for new hires. This enables RedLine to accomplish what other firms cannot and promotes a high level of staff retention. We offer services ranging from full life cycle HPC systems engineering to remote managed services to HPC program analysis. We are located in the Washington, DC area and are looking for a GPFS Engineer to join us for our NASA NACS High Performance Computing contract.

US citizenship and the ability to obtain a Public Trust security clearance are mandatory requirements for this position. The position is located at a customer site in Greenbelt, MD. Preference is for local candidates, but we will consider relocation as well.

This position will interact with the Program Manager, Site Lead, Customer, and site staff, attending regularly scheduled customer meetings to keep the customer informed of activities and progress and to answer customer inquiries concerning all aspects of the program. An individual at this skill level should have demonstrated his/her problem-solving ability in the appropriate area of expertise with numerous technical publications and formal technical presentations, and should have some experience in mentoring and leading others in small team environments.

Job Responsibilities:

  • Design (architect), implement and troubleshoot large-scale (tens of Petabytes) storage systems. This includes developing technical drawings including all required cables and connectivity to existing systems, and communicating with key stakeholders.
  • Serve as a GPFS SME for the Discover HPC team as well as other teams running GPFS both within and outside of the immediate organization.
  • Develop and execute test plans for filesystem upgrades, and resolve issues, working with vendors where necessary.
  • Resolve user-reported application issues (e.g., filesystem, RDMA interconnect, kernel, operating system, MPI middleware).
  • Provide support to the applications team in installing user-requested applications.
  • Evaluate and test proposed changes to the Discover supercomputer’s production operating environment (e.g., MPI upgrades, OS patches, kernel parameter changes) and develop upgrade and backout plans.
  • Maintain the Discover Test and Development System (TDS), keeping it as close as reasonably possible to the production system configuration.
  • Provide 24×7 on-call support as required.

Required Skills/Experience:

  • Bachelor’s degree in Computer Science, Management Information Systems or other technical discipline plus 5 years of experience, or equivalent.
  • At least five years of experience as a High-Performance Computing parallel filesystem Storage Administrator, with experience with IBM Spectrum Scale (GPFS), Lustre, or equivalent. Experience with optimizing for performance, reliability, and security.
  • In-depth knowledge of HPC parallel filesystems and the ability to troubleshoot complex problems. Must be comfortable with monitoring and managing clustered filesystems, and be able to examine GPL driver code when required.
  • Experience with deploying parallel filesystem upgrades in a rolling fashion with no overall system downtime.
  • In-depth knowledge of Linux NFS server/client implementation and ability to troubleshoot NFS issues.
  • In-depth knowledge of SAN technologies (e.g., FC, FCoE, RoCE, NVMoF, iSER, SRP) and awareness of high-level protocol function, management approaches, and performance tuning.
  • Deep experience with InfiniBand or OmniPath high-speed fabrics, including subnet management, IPoIB and/or IPoOPA mechanisms, fabric topology and health monitoring, and integration with MPI.
  • Knowledge of Ethernet networking (VLANs, etc.).
  • In-depth knowledge of MPI Implementations (Intel MPI, MVAPICH2, OpenMPI, HPE/SGI MPT) and troubleshooting MPI application stability and performance problems.
  • Experience with debugging issues with the Linux kernel. Ability to produce patches to fix issues, as required.
  • Experience with applying patches and building custom kernels as required to implement functionality or address security concerns.
  • Experience deploying and managing large HPC clusters using an image-based cluster management tool such as xCAT.
  • Experience in building, installing and debugging scientific applications (e.g., MPI, NetCDF, HDF, WRF).
  • Experience in submitting parallel applications to a batch scheduler (ideally SLURM).
  • Knowledge of configuration management tools (e.g., Puppet, CFEngine).
  • Working knowledge of scripting and programming languages such as C, C++, Fortran, Bash, CSH, TCSH, Perl, Python, Ruby.
  • Good organization skills to balance and prioritize work, and ability to multitask.
  • Good communication skills to communicate with support personnel, customers, and managers.

Desirable Skills:

  • Experience with cloud technologies (AWS, Azure, GCP), OpenStack, and Kubernetes.
  • Experience with GPFS Cluster Export Services, Clustered NFS, GPFS Multi-cluster.
  • Broad knowledge of distributed file systems and object stores such as Lustre, HDFS, BeeGFS, LizardFS, Gluster, Ceph, Swift.
  • Experience with revision control via Git.
  • Familiarity with time-series databases and associated tools (such as InfluxDB, Graphite, Grafana, Elasticsearch, Kibana).
  • Knowledge of virtualization technologies (particularly qemu/kvm) and managing large numbers of virtual machines.

__________

Please email hr@redlineperf.com with your resume if this opportunity is of interest to you.

All rights reserved. Copyright 2014 RedLine Performance Solutions, LLC.