Senior Engineer - Site Reliability Engineering
Bangalore, KA, IN, 560100
SRE Staff Engineer:
- Experience Level: 5-8 years
- Location: Onsite Client (Bangalore)
- Nature of work: Hybrid
- Shift: 24x5 Rotational Shifts
About the Role:
We are seeking an experienced and motivated Senior Engineer who can work as an Individual Contributor to drive the performance of the Observability Team (PMG). The ideal candidate will have a strong technical background in Site Reliability Engineering (SRE) with expertise in observability, alerting frameworks, incident management, and automation.
Key Responsibilities:
1. Payment Monitoring and Alert Triage:
- Monitor payment-flow-based alerts across multiple applications in rotational 24x7 shifts and identify issues proactively.
- Triage alerts by analysing trends in the affected dimensions of the payment flow, and correlate them with metrics, logs, and traces from other services to find the root cause, documenting each triage.
- Ensure timely escalation and closure of reported issues, working with the engineering teams of the payment services.
2. Observability Development:
- Design and implement alerting frameworks using tools like Datadog, Grafana, Kibana, Splunk, and Prometheus.
- Set up custom dashboards and streamline alerting to reduce noise while ensuring critical issues are addressed.
- Drive the adoption of SLO-based alerting, burn rate metrics, and anomaly detection techniques.
3. Incident Management:
- Lead incident management efforts from identification to resolution.
- Conduct post-incident reviews and implement preventive measures to avoid recurring issues.
- Maintain detailed documentation and performance reports on incident trends and team efficiency.
4. Automation and Optimization:
- Automate repetitive processes using programming languages like Python or Java.
- Develop and refine scripts to manage and fine-tune alerts.
- Collaborate with engineering teams to implement solutions that reduce manual effort and operational toil.
Required Skills and Qualifications:
- Proven expertise in SRE observability concepts and monitoring architecture design.
- Extensive experience with alerting frameworks like Prometheus, Grafana, Kibana, Splunk, and Datadog.
- Hands-on experience with alert noise reduction and advanced alerting techniques such as anomaly detection and burn rate alerting.
- Strong proficiency in incident management, including analysis, root cause identification, and preventive measures.
- Familiarity with payment monitoring systems and operational requirements.
- Proficiency in automation tools and scripting languages like Python or Java.
- Excellent collaboration and communication skills to interact with cross-functional teams.
- Flexibility to work in rotational 24x7 shifts from the office.
Education and Experience Required:
- B.E./B.Tech