What are the responsibilities and job description for the Senior Site Reliability Engineer position at Recurring Decimal?
We are seeking a seasoned Site Reliability Engineer (SRE) to join our Cloud Operations and Observability team. You’ll be instrumental in driving resiliency, performance, automation, and AI-driven observability across hybrid cloud environments (Azure and GCP). You will design, implement, and manage infrastructure, with a strong focus on Kubernetes and Terraform and on integrating AI/LLM solutions into observability and operational workflows.
Required Qualifications:
- At least 5 years of experience as an SRE, DevOps Engineer, or Cloud Infrastructure Engineer.
- Strong expertise in Azure and GCP cloud platforms (certifications a plus).
- Proficiency in Splunk (Enterprise Observability) for monitoring, alerting, and log analytics.
- Hands-on experience with Terraform for infrastructure automation.
- In-depth knowledge of Kubernetes (AKS, GKE), Helm, and container lifecycle.
- Familiarity with AI/ML and LLM-based tools (e.g., OpenAI, Hugging Face, Azure OpenAI) for observability or automation use cases.
- Experience with CI/CD pipelines, GitOps, and secure deployment practices.
- Programming/scripting skills in Python, Go, or Bash.
- Strong understanding of SRE principles: SLAs, SLIs, SLOs, error budgets, and incident management.
Preferred Qualifications:
- Experience building AI-enabled runbooks or copilots.
- Exposure to FinOps or cost-optimization strategies in cloud environments.
- Knowledge of distributed tracing and event correlation using OpenTelemetry.
- Familiarity with Kafka, Pub/Sub, or other messaging systems for observability data.
Key Responsibilities:
- Build and operate scalable, secure, and highly available infrastructure in Azure and GCP using IaC (Terraform).
- Design and maintain observability platforms leveraging Splunk, Prometheus, Grafana, OpenTelemetry, and cloud-native monitoring tools.
- Develop and support AI/LLM-driven automation solutions to improve incident triage, alert correlation, and root cause analysis.
- Partner with application and data teams to define SLOs, SLIs, and error budgets.
- Drive operational excellence through automation, chaos testing, and proactive reliability improvements.
- Optimize Kubernetes environments (GKE/AKS) for performance, security, and cost-efficiency.
- Integrate observability data pipelines with LLMs for anomaly detection, summarization, and proactive remediation.
- Participate in on-call rotations, incident response, and postmortem reviews.
- Implement runbooks, auto-remediation scripts, and AI copilots for operations.