Responsibilities
- Design, implement, and maintain scalable, reliable infrastructure.
- Monitor system health and performance, proactively identifying issues and driving improvements.
- Collaborate with development teams to ensure new services and features are reliable and scalable.
- Automate operational and repetitive tasks to improve efficiency and reduce manual effort.
- Build and maintain monitoring, alerting, and incident response mechanisms.
- Conduct incident investigations, perform root cause analysis, and implement preventive actions.
- Participate in on‑call rotations, providing 24/7 support for critical systems.
- Maintain clear documentation of processes, procedures, and best practices.
- Develop, tune, and manage detection rules aligned with organizational SIEM standards.
- Perform patching and upgrades to keep SIEM components up to date.
- Ensure data sources are healthy, troubleshoot logging issues, and restore data flows promptly.
Required Skills and Experience
- Bachelor’s degree in Computer Science, Engineering, or a related discipline.
- 4+ years of proven experience as a Site Reliability Engineer or in a similar role.
- Strong expertise in Elastic-based systems, including Elasticsearch, Logstash, and Kibana.
- Hands‑on experience with SIEM technologies and security applications.
- Experience with containerization and orchestration tools such as Docker and Kubernetes.
- Strong background in incident management, debugging, and root cause analysis.
- Proficiency in scripting languages such as Python and Bash.
- Experience with infrastructure as code tools, including Terraform and Ansible.
- Familiarity with infrastructure and system monitoring tools.
- Excellent problem‑solving skills, attention to detail, and ability to work under pressure.
- Strong communication and collaboration skills.
