HPC Production Engineer: Scale Compute & Ops

Company: Jump Trading
Location: London
Job Description:

What You’ll Do

  • Design, implement, maintain, and support high performance compute and storage systems
  • Implement and support performance monitoring and fault monitoring systems
  • Monitor systems and storage performance, up to and including network components
  • Build tooling to compile, package, install, and upgrade software and operating system components at scale
  • Collaborate with team members and across teams to write code and testing infrastructures spanning both new and existing codebases in multiple programming languages
  • Develop and improve systems and user documentation
  • Participate in large, coordinated maintenance operations, including during evenings and weekends
  • Work on global projects across a wide range of infrastructure
  • Collaborate directly with researchers to optimize their use of HPC infrastructure
  • Develop and monitor the tools used to maintain a production computing environment
  • Provide operational support on a rotating basis and as needed
  • Manage relationships with outside vendors, including traveling both domestically and internationally to meet with current and potential vendors
  • Adhere to all company cybersecurity and IT policies, including performing all work using only approved hardware and software
  • Other duties as assigned or needed

Skills You’ll Need

  • 5+ years of professional experience in high performance computing (HPC), including parallel filesystems (e.g., Lustre, GPFS) and batch systems (e.g., Slurm, Grid Engine)
  • Experience with high-performance network interconnects is a plus, but not required
  • 5+ years of experience with Linux systems administration
  • High proficiency with at least one programming/scripting language (e.g., Go, Python, C)
  • Extensive experience designing, building, and maintaining complicated, interdependent, and distributed systems
  • Extensive experience profiling and debugging application stacks using debuggers and profilers
  • Experience with system configuration management tools (SaltStack, Ansible, Puppet, etc.)
  • A compulsion to perform root cause analysis
  • Reliable and predictable availability

Benefits

  • Private Medical, Vision and Dental Insurance
  • Travel Medical Insurance
  • Group Pension Scheme
  • Group Life Assurance and Income Protection Schemes
  • Paid Parental Leave
  • Parking and Commuter Benefits

Posted: May 27th, 2026