Sr. DevOps Lead - SRE Job
Bangalore, KA, IN
YASH Technologies is a leading technology integrator specializing in helping clients reimagine operating models, enhance competitiveness, optimize costs, foster exceptional stakeholder experiences, and drive business transformation.
At YASH, we’re a cluster of the brightest stars working with cutting-edge technologies. Our purpose is anchored in a single truth – bringing real positive changes in an increasingly virtual world, and it drives us beyond generational gaps and disruptions of the future.
We are looking to hire Site Reliability Engineering (SRE) professionals in the following areas:
Job description:
Site Reliability Engineering – SRE Manager
Overview
Client Digital is seeking a Site Reliability Manager to join our growing team. We have ambitious plans to drive digital commerce growth globally and are looking for a hands-on, passionate, and skilled leader with an unwavering focus on system reliability, performance, and automation.
As a critical member of our team, you will be the owner of the reliability engineering practice for our Digital and Commerce platforms, including the Adobe Experience suite, SAP Commerce Cloud, eProcurement gateways, and our AWS Cloud infrastructure. This role is essential to our global digital transformation journey, helping to guarantee the availability and scalability needed to grow our digital and B2B eCommerce business.
About the Role
The Site Reliability Manager is a technical leadership position responsible for establishing, implementing, and maturing the Site Reliability Engineering (SRE) charter. This role is focused on driving the elimination of toil, enhancing operational stability, and ensuring the highly available and reliable performance of our entire digital ecosystem.
You will lead, mentor, and grow a team of Site Reliability Engineers, influencing the technical direction, defining key reliability metrics (SLOs/SLIs), and ensuring the health and integrity of our production systems globally.
- SRE Practice & Metrics: You will champion the adoption of SRE principles, defining and monitoring Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for our critical platforms to align engineering efforts with customer experience goals.
- Automation & Toil Reduction: You will be responsible for identifying operational toil and leading the team to develop and deploy automation solutions to maximize efficiency and stability.
- Platform Availability & Performance: You will manage the overall availability and performance of the digital platform, ensuring system robustness, rapid incident response, and effective post-mortem analysis to prevent recurrence.
Years of Experience: 10+
Technologies: Adobe Experience Suite, React, Java, Commerce (preferably SAP Hybris), application monitoring tools, and GitHub
Process: ITIL, Scrum, and a support mindset
Responsibilities:
- Manage and lead our Web/eCommerce Site Reliability Engineering team as part of the Digital Transformation, with the primary goal of making the production platform stable and reliable.
- Serve as the point of contact for Site Reliability Engineering and production support of the Web/eCommerce platform, which includes the Adobe Experience Management suite, SAP Commerce Cloud, Solr Search, MuleSoft-based API services, etc.
- Responsible for managing production incidents and owning their closure
- Responsible for delivering clear, concise, timely communication to our customers to ensure their confidence in our team's passion to provide them with the best customer experience possible.
- Manage on-call rotations across continents, using a follow-the-sun model.
- Lead the SRE team and continuously assess and implement industry best practices for SRE
- Own incident management, problem management, and service request management
- Accountable for the production platform and its uptime, availability, stability, and capacity planning
- Monitor baselines of technical KPIs such as uptime, performance, and error rate of the web/eCommerce platform and drive efforts to improve them, with the help of other teams as needed
- Drive the team to enhance monitoring and alerting for all technical components by creating dashboards, visualizations, baselines, and alerts
- Provide 24x7 on-call support during on-call rotation and be available during non-working hours when needed for critical incidents or production releases
Skills/Qualifications:
- Experience in managing shift schedules, Kanban execution, and SLA/SLO metric reporting
- Recent 5+ years of experience in managing L2 technical application production support for a large B2C/B2B eCommerce website
- Recent 3+ years of experience with the ITIL framework, including incident management, problem management, and change management
- Experience in driving SWAT and managing SLA/SLO metrics
- Experience in driving root cause analysis and driving closure on a permanent fix
- Experience in managing technical KPIs - availability, performance, error rate, etc.
- Good communicator, both written and spoken, able to explain complex IT issues in language that business stakeholders can understand
- Able to articulate business solutions to both technical and non-technical audiences.
- Strong understanding of SDLC methodologies (Agile, Scrum).
At YASH, you are empowered to create a career that will take you to where you want to go while working in an inclusive team environment. We leverage career-oriented skilling models and optimize our collective intelligence aided with technology for continuous learning, unlearning, and relearning at a rapid pace and scale.
Our Hyperlearning workplace is grounded upon four principles:
- Flexible work arrangements, free spirit, and emotional positivity
- Agile self-determination, trust, transparency, and open collaboration
- All support needed for the realization of business goals
- Stable employment with a great atmosphere and ethical corporate culture