Site Reliability Engineer, AI/ML Infrastructure

Boson AI • Toronto, Canada

📍 Location

Toronto

⏰ Job Type

Full-time

📅 Posted

June 01, 2026

About the Role

Overview

We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter, packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers.

You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building, testing, deploying, and keeping everything running smoothly. That means troubleshooting issues as they arise, monitoring performance, developing automation to make our lives easier, and working closely with engineering and science teams to ensure they have what they need. You'll also help us plan for future capacity and evaluate new technologies as we continue to scale.

Responsibilities

Manage and optimize HPC cluster operations
Deploy and maintain infrastructure-as-code solutions 
Support ML/research teams with cluster usage optimization 
Operat...
