What are the responsibilities and job description for the Lead Site Reliability Engineer position at pro/source?
- Location: Miami, FL (Hybrid – 3 days onsite per week)
We are seeking a Lead Site Reliability Engineer to drive the performance, reliability, and operational excellence of critical customer-facing web applications. In this role, you’ll lead a large global engineering team while remaining hands-on with Java and Spring Boot development.
What You’ll Do
- Lead and manage a distributed engineering team of 20 developers
- Ensure the reliability and support of key customer care and retail web applications
- Design and build tools and automation to accelerate incident resolution and improve application health
- Implement AIOps and leverage LLM technologies for smarter support operations
- Develop dashboards, observability tools, and monitoring systems to detect and resolve issues proactively
- Define and manage operational standards, including SLOs and SLIs
- Conduct in-depth root cause analysis (RCA) and post-incident reviews
- Drive cross-system collaboration and optimize end-to-end application performance
- Build frameworks to support data remediation and incident resolution
- Translate technical concepts for business and leadership stakeholders
- Lead documentation and training initiatives to foster knowledge sharing
Qualifications
- Bachelor’s degree or equivalent professional experience
- 8 years of hands-on engineering and application support experience
- Deep expertise with:
  - Java & Spring Boot (middle-tier development)
  - Databases (Oracle, Cassandra)
  - Front-end technologies (Angular, Next.js)
  - ITSM tools (ServiceNow, Jira)
  - Monitoring and APM tools
- Proven experience managing and leading large global engineering teams (20 people, including offshore teams)
- Ability to drive technical initiatives across cross-functional teams
- Background in full-stack Java/Spring microservices development
- Experience with BPM tools (Pega, Camunda)
- Proficiency in anomaly detection and log analysis (OpenSearch, ELK, Splunk)
- Familiarity with observability and tracing tools (Jaeger, OpenTelemetry)
- Experience with user experience monitoring platforms (New Relic, Quantum Metric, Glassbox)
- Knowledge of L1/L2 support architectures and modern SRE best practices
- Strong communication, leadership, and cross-functional collaboration skills