Senior SRE: AI/ML HPC Infra & GPU Cluster

Boson AI • Toronto, Canada

📍 Location

Toronto

⏰ Job Type

Full-time

📅 Posted

May 30, 2026

About the Role

A technology company in Toronto seeks a Senior Site Reliability Engineer to manage and optimize its HPC infrastructure. In this role, you'll ensure smooth operations of a powerful GPU cluster, deploy infrastructure-as-code solutions, and support ML teams. Candidates should have extensive SRE experience, proficiency in Linux, and familiarity with Kubernetes and Ceph storage. This position offers the chance to work with cutting-edge technology in a collaborative environment, perfect for problem-solvers who love learning.
