Sr. Reliability Engineer

Jobright.ai
San Jose, CA Full Time
POSTED ON 8/5/2025 CLOSED ON 9/2/2025

What are the responsibilities and job description for the Sr. Reliability Engineer position at Jobright.ai?

Verified Job On Employer Career Site


Job Summary:

Supermicro is a leading provider of advanced server, storage, and networking solutions for various computing environments. The Cloud Reliability Engineer will be responsible for deploying, scaling, and ensuring the high availability and performance of AI cloud platforms, while also bridging Dev and Ops through automation and observability practices.


Responsibilities:

• Cloud Infra Automation: Design and provision cloud infrastructure using Infrastructure as Code (Terraform, Ansible, or Helm) on bare metal or cloud platforms. Develop custom automation and tooling in Python or Go to extend deployment workflows and streamline operations.

• Platform Reliability: Deploy, scale, maintain, and optimize uptime for AI cloud services including GPU clusters, Kubernetes (K8s), and storage systems (e.g., Ceph, BeeGFS, or Weka). Understand the tools required to benchmark and assure consistent application performance.

• Monitoring & Alerting: Implement observability tools (e.g., Prometheus, Grafana, ELK, Loki, Fluentd) to monitor system health and alert on anomalies or performance degradation.

• Capacity Planning: Analyze usage trends and forecast infrastructure needs to support AI workloads and large-scale model training/inference.

• Incident Management: Lead root cause analysis and resolution for system outages or degraded performance. Define and maintain service level objectives (SLOs), indicators (SLIs), and agreements (SLAs) aligned with uptime and performance goals.

• CI/CD Integration: Collaborate with DevOps and MLOps teams to ensure reliable delivery pipelines using GitLab CI/CD, ArgoCD, or similar tools.

• Security & Compliance: Harden Linux systems, manage TLS certificates, and enforce secure access controls via Role-Based Access Control (RBAC), LDAP-integrated SSO, and network segmentation policies.

• Documentation & Playbooks: Maintain clear, version-controlled documentation, including architecture diagrams, runbooks, and incident response playbooks to support cross-team knowledge transfer and rapid onboarding.


Qualifications:


Required:

• Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience, plus 8 years of experience in the areas below

• Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes)

• Experience managing GPU compute clusters (NVIDIA / CUDA, AMD / ROCm)

• Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.)

• Strong scripting and coding skills (Bash, Python, or Go)

• Exposure to secure multi-tenant environments and zero trust architectures

• Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics

• Excellent collaboration and communication skills for cross-team, partner, and customer initiatives


Preferred:

• Understanding of AI/ML reference architectures and experience with workflows, MLflow, or Kubeflow

• Familiarity with storage backends optimized for AI (CephFS, BeeGFS, WekaFS)

• Prior experience in bare-metal provisioning via PXE, Ironic, or Foreman

• Understanding of NVIDIA GPU telemetry and NCCL testing for performance benchmarking

• Familiarity with ITIL processes or structured change management in production systems is a plus

• Certifications: CKA, CKAD, Linux, or related credentials


Company:

Super Micro Computer Inc., founded in 1993 in California, USA, is a leading manufacturer of motherboards, chassis, and high-performance servers. Founded in 2000, the company is headquartered in 's-Hertogenbosch, Noord-Brabant, NLD, with a team of 11-50 employees. The company is currently Early Stage.

This job has expired.