Senior Site Reliability Engineer in Toronto

Boson AI • Winnipeg, Canada

📍 Location

Winnipeg

⏰ Job Type

Full-time

📅 Posted

June 15, 2026

About the Role

Elevate your career as a Senior Site Reliability Engineer in Toronto, managing cutting-edge HPC infrastructure with NVIDIA GPUs. Join a dynamic team focusing on advanced AI and ML clusters.

You will oversee the lifecycle of our high-performance computing (HPC) infrastructure. This role requires hands-on experience in planning, deploying, and maintaining resilient systems. Collaborate with engineering and research teams to optimize operations and ensure seamless performance.

Key Responsibilities:
• Manage and optimize operations of HPC clusters
• Deploy and maintain infrastructure-as-code solutions
• Support research teams by optimizing cluster usage
• Operate and troubleshoot Ceph storage clusters
• Develop tooling and automation for efficiency

Requirements:
• 5+ years of experience in SRE or HPC operations
• Proficiency in Linux systems (Ubuntu/Debian)
• Experience with Kubernetes container orchestration
• Knowledge of Ceph deployments over 1 PB
• Skilled in Pyt...

Ready to Join Through a Referral?

Apply now and get connected directly with the hiring team