Knowledge Hub / eG Innovations

Partner Spotlight

eG Innovations

Partner Spotlight

When All Dashboards Report Green During a Production Outage

A retail ERP system underwent a vertical scaling operation to support growth from 3,000 to 10,000 stores on AWS. Immediately following the cutover, users experienced widespread HTTP 503 (“Service Unavailable”) errors and checkout failures. Yet, standard performance dashboards indicated a healthy environment.

 

During the incident response, each team reviewed their respective telemetry, which indicated normal operation:

 

Database Team: “Query latency is flat at sub-millisecond levels. The database is executing requests instantly.” 
Application Team: “JVM threads are in a WAIT state on sun.nio.ch.SocketDispatcher.read. The code is blocked, waiting for database responses.” 
Infrastructure Team: “CPU is at 9%, storage IOPS is at 8%, and bandwidth is within SLA. We have substantial headroom.”

 

While component-level metrics appeared healthy, system-wide transactions were failing.

 

Case Study: Non-Linear Failure at 3X Scale

To understand why this happens, we have to look outside standard telemetry. This article breaks down a real production incident where the root cause was an invisible bottleneck: the EC2 instance had hit a hard packets-per-second (PPS) ceiling, not a bandwidth limit.
 
The system looked perfectly healthy at 9% CPU and under 10% storage IOPS. It wasn’t; it was silently discarding traffic. TCP retransmissions had climbed past 20% at peak (with spikes to 50%), database insert latency jumped from 1ms to 150ms, and connection time to the SQL service ballooned to 3 seconds.
 
The standard monitoring stack saw none of it.

 

This postmortem documents how cross-layer correlation—specifically overlaying synthetic connection probes, network stack metrics, and application thread states on a single timeline—exposed what siloed monitoring missed, and exactly what SRE teams must instrument to catch it early.

 

(Note: This article summarizes a 15-page forensic postmortem. Download the full technical case study (PDF) for the complete timeline, configuration diffs, and TCP tuning parameters.)