Company: Jump Trading

Role: HPC Production Engineer: Scale Compute & Ops

Location: London

Job Description:

Design, implement, maintain, and support high performance compute and storage systems
Implement and support performance monitoring and fault monitoring systems
Monitor systems and storage performance, up to and including network components
Build tooling to compile, package, install, and upgrade software and operating system components at scale
Collaborate with team members and across teams to write code and testing infrastructures spanning both new and existing codebases in multiple programming languages
Develop and improve systems and user documentation
Participate in large, coordinated maintenance operations, including during evenings and weekends
Work on global projects across a wide range of infrastructure
Collaborate directly with researchers to optimize their use of HPC infrastructure
Develop and monitor the tools used to maintain a production computing environment
Provide operational support on a rotating basis and as needed
Manage relationships with outside vendors, including traveling both domestically and internationally to meet with current and potential vendors
Adhere to all company cybersecurity and IT policies, including performing all work using only approved hardware and software
Other duties as assigned or needed

5+ years of professional experience in high performance computing (HPC); experience with parallel filesystems (e.g., Lustre, GPFS), batch systems (e.g., Slurm, Grid Engine), and high-performance network interconnects is a plus, but not required
5+ years of experience with Linux systems administration
High proficiency with at least one programming/scripting language (e.g., Go, Python, C)
Extensive experience designing, building, and maintaining complicated, interdependent, and distributed systems
Extensive experience profiling and debugging application stacks (debuggers and profilers)
Experience with system configuration management tools (SaltStack, Ansible, Puppet, etc.)
A compulsion to perform root cause analysis
Reliable and predictable availability

Posted: May 27th, 2026