CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025. Learn more at www.coreweave.com.
We’re proud to be an accredited Living Wage Employer.
What You’ll Do
The Data Science team is focused on developing an advanced reliability platform. This system covers various aspects of data processing and analysis, including data intake, deriving meaningful metrics, identifying unusual patterns, predicting potential issues, finding slow processes in distributed systems, and using automated analysis to determine causes. We collaborate closely with internal teams like Fleet, Infrastructure, and AI Platform to enhance system stability, optimize resource use, shorten resolution times, and maintain service availability and financial performance.
About The Role
As a Senior Data & MLOps Engineer, you will design and scale the infrastructure supporting the GPU Intelligence Platform. This involves building pipelines for handling data, features, model training, and delivering insights and predictions for system health and optimization. You will transition the system from initial prototypes to a production environment operating across the fleet, focusing on scalability, separating real‑time service from periodic processing, and dynamic resource management based on system load and data frequency. You will architect and deploy these scalable distributed services using orchestration technologies.
Key Responsibilities
- Design and implement scalable data ingestion pipelines.
- Build feature processing and baseline computation systems.
- Productionize models for prediction and detection.
- Develop and operate low‑latency services and robust offline workflows.
- Architect horizontally scalable services with clear separation between components, leveraging orchestration for distribution.
- Implement monitoring and feedback loops for continuous model and signal improvement.
- Collaborate with Platform teams to integrate operational signals into monitoring and diagnostics.
- Implement a scalable solution for mitigation and structured analysis.
Who You Are
- 7+ years of experience in data engineering, distributed systems, MLOps, or infrastructure ML roles in production environments.
- Proven experience building high‑throughput streaming or telemetry pipelines (e.g., Kafka, Pulsar, Kinesis, or equivalent).
- Strong experience designing time‑series feature pipelines and operating large‑scale observability systems.
- Experience building and maintaining feature stores and ensuring offline/online feature parity.
- Hands‑on experience deploying ML models to production, including versioning, monitoring, rollback, and drift detection.
- Experience designing scalable microservices deployed in Kubernetes‑based environments.
- Strong proficiency in Python and at least one systems language (Go, Rust, or C++).
- Experience working with distributed compute or training systems (e.g., NCCL, PyTorch Distributed, Spark, Ray, Slurm).
- Familiarity with GPU telemetry systems such as NVML or DCGM and hardware‑level monitoring concepts.
- Demonstrated experience scaling systems from Proof‑of‑Concept to production‑grade, fleet‑level deployments.
Preferred
- Experience working on GPU fleet management, hyperscale infrastructure, or AI training clusters.
- Experience building anomaly detection or failure prediction systems for hardware or distributed systems.
- Experience implementing distributed straggler detection or collective‑level performance analysis systems.
- Experience developing agentic or LLM‑powered reasoning systems for diagnostics or operational intelligence.
- Background in reliability engineering or SRE practices.
Wondering if you’re a good fit?
- You love building systems that turn raw infrastructure telemetry into actionable intelligence.
- You’re curious about distributed systems failure modes, GPU performance pathologies, and reliability engineering at scale.
- You’re excited by the idea of moving from anomaly detection to prediction to autonomous root cause reasoning.
- You enjoy designing platforms that protect uptime, revenue, and customer trust through proactive systems thinking.
Why CoreWeave?
At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper‑growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning.
- Be Curious at Your Core
- Act Like an Owner
- Empower Employees
- Deliver Best‑in‑Class Client Experiences
- Achieve More Together
What We Offer
- Family‑level Medical Insurance
- Family‑level Dental Insurance
- Generous Pension Contribution
- Life Assurance at 4x Salary
- Critical Illness Cover
- Employee Assistance Programme
- Tuition Reimbursement
- Work culture focused on innovative disruption
Our Workplace
While we prioritize a hybrid work environment, remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month. Teams also gather quarterly to support collaboration.
CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information.
Export Control Compliance
This position requires access to export controlled information. To conform to U.S. Government export regulations applicable to that information, applicant must either be (A) a U.S. person, defined as a (i) U.S. citizen or national, (ii) U.S. lawful permanent resident (green card holder), (iii) refugee under 8 U.S.C. 1157, or (iv) asylee under 8 U.S.C. 1158, (B) eligible to access the export controlled information without a required export authorization, or (C) eligible and reasonably likely to obtain the required export authorization from the applicable U.S. government agency. CoreWeave may, for legitimate business reasons, decline to pursue any export licensing process.
