A leading AI research organization in London is seeking a Training Runtime: Process Management Engineer to develop high-performance distributed systems using Rust and Python. This role involves overseeing machine learning workloads across supercomputers, focusing on reliability, performance, and observability. The ideal candidate has experience in distributed systems, a solid understanding of performance analysis, and strong software engineering skills. The position offers a hybrid work model and requires a proactive approach to meet dynamic system needs.
