About the Role
Key Responsibilities
Site Reliability & Operations
- Manage and improve the reliability, availability, and operational excellence of the SHIP-HATS platform
- Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
- Lead incident management, troubleshooting, root cause analysis, and post-mortem reviews
- Drive continuous improvements to reduce operational toil and prevent recurring incidents
- Perform capacity planning, performance tuning, and system optimisation
- Design and implement observability solutions across logging, metrics, and distributed tracing
- Build dashboards, alerts, and monitoring strategies to provide deep visibility into platform health
- Manage and maintain monitoring stacks such as Prometheus, Grafana, ELK, or equivalent tools
- Develop and ma...