From Metrics to Root Cause
Metrics Look Fine.
Alerts Are Silent.
Users Are Calling.
From Metrics to Root Cause: JVM Diagnostics in Production
Mateusz Grużewski · Software Architect · Asseco Data Systems
You've Been Here Before.
✅ CPU: normal
✅ Memory: normal
✅ Error rate: 0%
✅ Alerts: silent
"The app has been slow for 20 minutes."
— your biggest client, 9:47 PM
Dashboards Don't Show Causes.
Dashboards show symptoms. Not causes.
The Real Question Is Not "What".
You can see what is slow.
You can see when it started.
The hard part is why.
Observability is the ability to answer "why" in a system you cannot stop.
The First Instinct During an Incident
Every production incident starts the same way.
Engineers open Grafana.
They scan dashboards.
They check CPU, memory, error rate.
And for a moment…
everything looks fine.
Dashboards confirm the symptom.
They rarely reveal the cause.
Three Signals. One Answer.
Every production incident leaves evidence in three layers: metrics, logs, and traces.
The Workflow We'll Follow.
Not a checklist. A way of thinking.
Every transition narrows the hypothesis. Not a random click.
The System We'll Diagnose.
Two services. One hidden problem.
Architecture:
client → api-service :8080
api-service → per-user cache (lock)
api-service → external-service :8081
Premium users: cache TTL = 5s
Observability stack:
Prometheus — metrics
Loki — logs
Tempo — traces
Grafana — everything
The problem is already running. We just don't know where it is yet.
Before We Touch Any Tools.
Let's make a hypothesis.
What we know:
Latency increased.
But:
✅ CPU is normal
✅ Memory is normal
✅ Error rate is zero
Possible explanations:
• Thread contention
• Slow dependency
• GC pause
Only one of these is true. Let's find out which one.
DEMO
Root Cause Found.
No guessing. No restarting. No hoping.
The chain:
external-service is slow (~2500ms)
api-service cache lock serializes all parallel requests
Every premium user request waits for the same slow call
The code:
// external-service
db_query_duration_ms=2500;
In production, this stands in for a slow DB query, a timeout, or a GC pause.
One trace. Ten minutes. Zero guessing.
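The mechanism, as a minimal sketch with invented names (the actual demo code is in the repo linked at the end):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the demo's actual code: one coarse lock guards
// the whole per-user cache. When the 5s TTL expires, every parallel premium
// request queues behind the same ~2500ms refresh, one after another.
class UserCache {
    private static final long TTL_MS = 5_000;   // premium TTL from the demo
    private final Map<String, Entry> entries = new HashMap<>();

    synchronized String get(String userId) {    // serializes ALL callers
        Entry e = entries.get(userId);
        if (e == null || System.currentTimeMillis() - e.createdAt() > TTL_MS) {
            e = new Entry(callExternalService(userId), System.currentTimeMillis());
            entries.put(userId, e);             // refresh happens under the lock
        }
        return e.value();
    }

    private String callExternalService(String userId) {
        // stand-in for the slow HTTP call to external-service :8081
        try { Thread.sleep(2_500); } catch (InterruptedException ex) { Thread.currentThread().interrupt(); }
        return "payload-for-" + userId;
    }

    private record Entry(String value, long createdAt) {}
}

Ten concurrent premium requests behind this lock take roughly 25 seconds end to end, while CPU, memory, and error rate all stay green.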
What We Just Did.
Every transition was a decision.
Metrics → identified the spike
Logs → found the pattern, got traceId
Trace → pinpointed the bottleneck
Root cause → confirmed in 10 minutes
The tool doesn't matter. The question does.
Same Workflow, Different Incidents.
The scenario changes. The thinking stays the same.
Lock Contention
p95 latency increases
Threads in BLOCKED state
Trace shows waiting on monitor
→ synchronization hotspot in your code
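A hedged way to surface this from inside the JVM, using the standard ThreadMXBean API (the same picture a jstack dump gives you):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: list threads that are BLOCKED on a monitor, plus the lock they
// are waiting for and the thread that currently holds it.
public class BlockedThreadScan {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo t : mx.dumpAllThreads(false, false)) {
            if (t.getThreadState() == Thread.State.BLOCKED) {
                System.out.printf("%s BLOCKED on %s held by %s%n",
                        t.getThreadName(), t.getLockName(), t.getLockOwnerName());
            }
        }
    }
}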
Memory Pressure
Heap usage looks stable
Allocation rate keeps increasing
Old generation slowly fills up
→ OOM waiting to happen
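The classic shape of this leak, as a hypothetical sketch: a per-request cache with no eviction.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: entries are added per request and never removed. Every heap
// snapshot looks "stable", but the old generation creeps up until a
// full GC can no longer reclaim anything.
class SessionRegistry {
    private final Map<String, byte[]> sessions = new ConcurrentHashMap<>();

    void onRequest(String sessionId) {
        // ~64 KB retained per new session, with no TTL and no eviction
        sessions.computeIfAbsent(sessionId, id -> new byte[64 * 1024]);
    }
}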
Same Workflow, Different Incidents.
The scenario changes. The thinking stays the same.
GC Pauses
p99 spikes periodically
Pause time aligns with GC cycles
Stop-the-world visible in logs
→ GC is stealing your latency budget
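A hedged sketch using the JVM's standard GarbageCollectorMXBean API; deltas between two samples show how much wall-clock time GC consumed between them:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: cumulative collection counts and times, straight from the
// running JVM. Sample twice and diff to correlate with p99 spikes.
public class GcBudget {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}

For the stop-the-world details themselves, unified GC logging (-Xlog:gc*) remains the ground truth.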
The scenario changes. The thinking stays the same.
Metrics → what is wrong?
Logs → what exactly happened?
Traces → where is the time going?
Three Rules for Diagnostic Thinking.
1. Start with the symptom, not the tool.
Don't open Tempo because you like traces. Open it because logs gave you a traceId.
2. Each layer answers a different question.
Metrics: what and how much? Logs: who, when, exactly what? Traces: where in the system?
3. The cause is rarely where you look first.
Lock contention in api-service
→ cause in external-service
In distributed systems, symptom and cause live in different places.
Observability is the path between them.
Thank You.
In distributed systems, the symptom and the cause are rarely in the same place.
Mateusz Grużewski

gruzewski.dev
Rate this talk

Scan to leave feedback
Source code
github.com/lshadown/from-metrics-to-root-cause-jvm