Site Reliability Engineer | Remote
Company: Crossing Hurdles
Location: Remote
Posted: May 9th, 2026
Responsibilities
- Design, implement, and maintain scalable infrastructure using Linux and Kubernetes.
- Monitor system performance using Prometheus and address potential issues proactively.
- Automate operational processes to improve system reliability and efficiency.
- Respond to incidents, perform root cause analysis, and implement improvements.
- Collaborate with development teams to ensure smooth deployments and high availability.
- Create and maintain documentation, runbooks, and operational guidelines.
- Promote best practices in reliability, security, and system performance.
Requirements
- Strong experience with Linux system administration and troubleshooting.
- Strong expertise in Kubernetes cluster management and orchestration.
- Strong experience using Prometheus for monitoring and alerting.
- Proficiency in scripting languages such as Bash or Python.
- Strong problem-solving and incident management skills.
- Excellent written and verbal communication skills.
- Ability to work independently in a remote, fast-paced environment.