About the Role
Responsibilities
- Navigate, troubleshoot, and recover dynamic infrastructure and long-running processes in real time using command-line tools.
- Master and manage highly containerized environments, including orchestrating Dockerized sandboxes and CI/CD workflows.
- Build, maintain, and optimize systems for AI model training and high-throughput compute environments.
- Respond swiftly to system errors, replanning and recovering mid-operation as conditions change.
- Collaborate with engineering and AI teams to ensure seamless integration, reliability, and performance.
- Document system architectures, incident responses, and recovery protocols with meticulous clarity.
Requirements
- Demonstrate expert proficiency in terminal environments for system builds, server administration, and infrastructure management.
- Possess advanced problem-solving skills for multi-step troubleshooting, f...