Senior Cloud Architect, SRE - DGX Cloud: Shaping the Future of Cloud Computing and AI Infrastructure

Remote, USA Full-time
Join the Ranks of the World's Most Innovative Technology Company

NVIDIA is at the forefront of technological advancements, driving innovations in AI, computing, and beyond. We're seeking a highly skilled and experienced Senior Cloud Architect to join our DGX Cloud Site Reliability Engineering (SRE) team. As a Senior Cloud Architect, SRE - DGX Cloud, you will play a pivotal role in designing, building, and maintaining large-scale production systems that power NVIDIA's GPU cloud services. This is an exceptional opportunity to apply your technical expertise, creativity, and passion for cloud computing to shape the future of AI infrastructure.

About the Role

The Senior Cloud Architect, SRE - DGX Cloud role is a key position within NVIDIA's SRE team, responsible for ensuring the reliability, efficiency, and scalability of our DGX Cloud solutions. You will lead the technical architecture for DGX Cloud solutions on top of cloud service providers like AWS, GCP, Azure, and OCI, and work closely with cross-functional teams to design, implement, and support the operational and reliability aspects of large-scale GPU training clusters.

Key Responsibilities

- Lead technical architecture for DGX Cloud solutions on top of cloud service providers like AWS, GCP, Azure, and OCI.
- Provide fast and creative solutions for complex problems and write effective, clear, and reliable architecture specifications.
- Design, implement, and support operational and reliability aspects of large-scale GPU training clusters with a focus on performance at scale, real-time monitoring, logging, and alerting.
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
- Support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.

Requirements and Qualifications

To be successful in this role, you should possess a strong technical background with a focus on cloud computing, distributed systems, and site reliability engineering.

Essential Qualifications

- B.Sc./M.Sc./Ph.D. degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
- 8+ years of proven experience in cloud computing, distributed systems, or a related field.
- Experience with infrastructure automation and distributed systems design, and experience designing and developing tools for running large-scale private or public cloud systems in production.
- Experience in one or more of the following: Python, Go.
- In-depth knowledge of Linux, networking, and cloud native technologies.

Preferred Qualifications

- Interest in crafting, analyzing, and fixing large-scale distributed systems.
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
- Ability to debug and optimize code and automate routine tasks.
- Experience in using or running large private and public cloud systems based on Kubernetes or Slurm.

What We Offer

NVIDIA is committed to providing a comprehensive compensation and benefits package that reflects our employees' skills, experience, and contributions. The base salary range for this role is $220,000 - $419,750 USD. You will also be eligible for equity and benefits. We accept applications on an ongoing basis, so we encourage you to apply as soon as possible.

Our Culture and Work Environment

At NVIDIA, we pride ourselves on fostering a diverse and inclusive work environment that encourages creativity, innovation, and collaboration. Our SRE team is no exception, with a culture that values intellectual curiosity, problem-solving, and openness. We promote self-direction, allowing our engineers to work on meaningful projects while providing the support and mentorship needed to learn and grow. As a remote team, we offer the flexibility to work from anywhere, at any time, as long as you're committed to delivering exceptional results. We're committed to building a community that is diverse, inclusive, and respectful, where everyone can thrive and grow.

Career Growth and Development

At NVIDIA, we're committed to helping our employees grow and develop their careers. As a Senior Cloud Architect, SRE - DGX Cloud, you will have the opportunity to work on complex, challenging projects that will help you develop your technical skills and expertise. You will also have access to our comprehensive training and development programs, designed to help you stay up-to-date with the latest technologies and trends.

Join Our Team!

If you're a motivated, talented, and experienced Senior Cloud Architect looking to shape the future of cloud computing and AI infrastructure, we want to hear from you! Apply today to join our team and be part of a community that is driving innovation and excellence in the tech industry.

NVIDIA is an equal opportunity employer and welcomes applications from diverse candidates. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status, or any other characteristic protected by law.
