Position Summary
The IT OCC Engineer owns the end-to-end assessment, triage, communication, and resolution of production incidents within a 24/7 Operations Command Center (OCC). This role serves as the central command point, ensuring rapid detection of incidents and timely restoration of services using real-time monitoring and alerting tools. The OCC Engineer collaborates closely across the entire IT organization to maintain and protect system stability, minimize downtime, and continuously improve operational reliability.
Essential Responsibilities
- Minimize Mean Time to Acknowledge (MTTA) and Mean Time to Restore (MTTR) through effective triage, troubleshooting, and escalation
- Serve as incident leader for high-severity (SEV1/SEV2) incidents as well as lower-severity incidents
- Lead incident bridges, ensuring clear ownership, driving resolution, and enforcing adherence to recovery timelines
- Monitor enterprise environments using tools such as PagerDuty, Azure Monitor, Grafana, and AppDynamics
- Evaluate alerts and events against defined SLIs, SLOs, and error budgets to determine business impact and response urgency
- Escalate issues affecting mission-critical applications promptly, following the predefined escalation levels of the Incident Management methodology
- Track, document, and manage incidents using ServiceNow, ensuring accurate updates, clear timelines, and proper closure
- Coordinate incident communications, including internal updates, status notifications, and executive-level briefings as required
- Partner with Network, Infrastructure, Cloud, and Application teams to support root cause analysis (RCA) and post-incident after-action reviews (AAR)
- Identify trends, recurring issues, and operational gaps to support continuous improvement and incident reduction initiatives
- Maintain and continuously improve runbooks, escalation procedures, and incident playbooks
- Participate in shift handovers, ensuring operational continuity and visibility into active or high-risk issues
- Take a proactive approach to problem resolution to reduce the number of reported incidents and improve the technology experience for the business
- Partner with engineering and operations teams to improve monitoring coverage and reliability
- Remain calm, decisive, and organized during critical, high-pressure incidents
- Communicate clearly, in writing and verbally, with both technical and business audiences
- Other duties as assigned