
Senior Software Engineer, AI Resiliency

📍 Location: Redmond
⏰ Job Type: Full-time
📅 Posted: May 31, 2026

About the Role

We are now looking for a Senior Software Engineer for AI Resiliency!


At NVIDIA, we are pushing the boundaries of what’s possible in AI. We are seeking a Senior Software Engineer to lead the development of AI software resiliency for the most powerful AI supercomputers in the world. As a member of our AI Software Resiliency team, you will define and implement critical resiliency features for AI supercomputers at a scale of 100,000+ GPUs. Your expertise will help drive cluster downtime toward zero and keep our AI systems robust and reliable.


What You’ll Be Doing:
+ Develop AI Software Resiliency Features: Implement and optimize software features that improve AI system reliability at a massive scale, such as fast checkpoint-recovery, error detection, error isolation, and straggler/hang detection.
+ Hands-On Coding & Optimization: Contribute to large-scale distributed syst...
