Salary: Competitive + package (depending on experience)
Type: Full-time
A leading consulting and technology organisation is looking to hire a number of HPC Systems Administrators / Consultants to join a growing High Performance Compute operations team supporting next-generation AI infrastructure projects across the UK.
This role will focus on the design, deployment and operation of high-density compute environments, supporting advanced GPU clusters and AI model training platforms. The successful candidate will work with cutting-edge compute stacks and play a key role in enabling high-performance AI workloads. Due to the nature of the work, this role will involve secure and sensitive environments.
Key Responsibilities
- Design, deploy and manage HPC infrastructures, including GPU clusters and parallel computing environments
- Support AI model training platforms by maintaining compute resources and optimising workload scheduling
- Monitor, analyse and optimise system performance, identifying bottlenecks and improving efficiency
- Develop and maintain automation scripts and operational tooling (Python, PowerShell, Bash)
- Maintain clear documentation covering architecture, configurations, operational procedures and incident resolution
- Support incident management processes, including root cause analysis and post-incident reviews
- Work closely with cross-functional teams to ensure reliability, performance and security across HPC environments
Required Experience
- Strong experience working in High Performance Computing (HPC) environments
- Experience managing GPU clusters (e.g. NVIDIA or AMD)
- Familiarity with workload schedulers such as SLURM or PBS
- Experience supporting AI/ML model training frameworks and toolkits such as TensorFlow, PyTorch or CUDA
- Solid understanding of Linux and Windows server environments, networking and storage platforms
- Strong troubleshooting and performance optimisation skills within compute-heavy environments
- Experience with automation, scripting and monitoring tools (Python, PowerShell, Bash)
- Excellent communication skills and ability to work with both technical and non-technical stakeholders
