From Metrics to Root Cause


Metrics Look Fine.

Alerts Are Silent.

Users Are Calling.

From Metrics to Root Cause: JVM Diagnostics in Production

Mateusz Grużewski · Software Architect · Asseco Data Systems

You've Been Here Before.

✅ CPU: normal

✅ Memory: normal

✅ Error rate: 0%

✅ Alerts: silent

"The app has been slow for 20 minutes."

— your biggest client, 9:47 PM

Dashboards Don't Show Causes.

Dashboards show symptoms. Not causes.

The Real Question Is Not "What".

You can see what is slow.

You can see when it started.

The hard part is why.

Observability is the ability to answer "why" in a system you cannot stop.

The First Instinct During an Incident

Every production incident starts the same way.

Engineers open Grafana.

They scan dashboards.

They check CPU, memory, error rate.

And for a moment…

everything looks fine.

Dashboards confirm the symptom.
They rarely reveal the cause.

Three Signals. One Answer.

Every production incident leaves traces in three layers: metrics, logs, and traces.

The Workflow We'll Follow.

Not a checklist. A way of thinking.

Every transition narrows the hypothesis. Not a random click.

The System We'll Diagnose.

Two services. One hidden problem.

Architecture:

client → api-service :8080

api-service → per-user cache (lock)

api-service → external-service :8081

Premium users: cache TTL = 5s

Observability stack:

Prometheus — metrics

Loki — logs

Tempo — traces

Grafana — everything

The problem is already running. We just don't know where it is yet.

Before We Touch Any Tools.

Let's make a hypothesis.

What we know:

Latency increased.

But:

✅ CPU is normal

✅ Memory is normal

✅ Error rate is zero

Possible explanations:

• Thread contention

• Slow dependency

• GC pause

Only one of these is true. Let's find out which one.

DEMO

Root Cause Found.

No guessing. No restarting. No hoping.

The chain:

external-service is slow (~2500ms)

api-service cache lock serializes all parallel requests

Every premium user request waits for the same slow call

The code:

    // external-service
    db_query_duration_ms=2500;

In production: slow DB query, timeout, GC pause

One trace. Ten minutes. Zero guessing.
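
A minimal sketch of the api-service side of that chain (class names and the endpoint are illustrative, not the demo's actual source): the slow remote call happens inside the lock, so parallel premium requests serialize behind it.

    // Illustrative sketch: a per-user cache refreshed under one lock,
    // with the 5s premium TTL from the architecture slide.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.time.Instant;
    import java.util.HashMap;
    import java.util.Map;

    class PremiumProfileCache {
        private record Entry(String value, Instant loadedAt) {}

        private static final Duration TTL = Duration.ofSeconds(5); // premium users
        private final Map<String, Entry> cache = new HashMap<>();
        private final HttpClient http = HttpClient.newHttpClient();

        // The whole lookup is synchronized: while one thread sits in the
        // ~2500ms external call, every other request queues here, hit or miss.
        synchronized String get(String userId) throws Exception {
            Entry e = cache.get(userId);
            if (e == null || e.loadedAt().plus(TTL).isBefore(Instant.now())) {
                HttpRequest req = HttpRequest.newBuilder(
                        URI.create("http://external-service:8081/profile/" + userId)).build();
                String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
                e = new Entry(body, Instant.now());
                cache.put(userId, e);
            }
            return e.value();
        }
    }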

What We Just Did.

Every transition was a decision.

Metrics → identified the spike

Logs → found the pattern, got traceId

Trace → pinpointed the bottleneck

Root cause → confirmed in 10 minutes

The tool doesn't matter. The question does.
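
A hedged sketch of why the logs → trace jump works, assuming SLF4J with MDC (a tracing agent such as OpenTelemetry typically injects the traceId automatically): the traceId printed with every log line is the key you paste into Tempo.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    class RequestLogging {
        private static final Logger log = LoggerFactory.getLogger(RequestLogging.class);

        void handle(String userId, String traceId, Runnable work) {
            MDC.put("traceId", traceId);   // included in every log line below
            try {
                long start = System.nanoTime();
                work.run();
                long ms = (System.nanoTime() - start) / 1_000_000;
                log.info("request finished userId={} durationMs={}", userId, ms);
            } finally {
                MDC.remove("traceId");
            }
        }
    }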

Same Workflow, Different Incidents.

The scenario changes. The thinking stays the same.

Lock Contention

p95 latency increases

Threads in BLOCKED state

Trace shows waiting on monitor

synchronization hotspot in your code
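
A hedged sketch of confirming the BLOCKED-threads signal from inside a running JVM, using the standard ThreadMXBean (the same data a jstack thread dump shows):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    class BlockedThreadReport {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // Dump all threads with monitor and synchronizer info, keep the blocked ones.
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                if (info.getThreadState() == Thread.State.BLOCKED) {
                    System.out.printf("%s BLOCKED on %s held by %s%n",
                            info.getThreadName(), info.getLockName(), info.getLockOwnerName());
                }
            }
        }
    }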

Memory Pressure

Heap usage looks stable

Allocation rate keeps increasing

Old generation slowly fills up

OOM waiting to happen
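
A hedged sketch of watching old-generation occupancy with the standard MemoryPoolMXBean (pool names vary by collector, e.g. "G1 Old Gen" or "Tenured Gen"):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    class OldGenWatch {
        public static void main(String[] args) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                String name = pool.getName().toLowerCase();
                if (name.contains("old") || name.contains("tenured")) {
                    MemoryUsage u = pool.getUsage();
                    long max = u.getMax(); // -1 if the pool has no defined max
                    System.out.printf("%s: %d MB used / %s MB max%n",
                            pool.getName(), u.getUsed() >> 20,
                            max < 0 ? "?" : Long.toString(max >> 20));
                }
            }
        }
    }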

Same Workflow, Different Incidents.

The scenario changes. The thinking stays the same.

GC Pauses

p99 spikes periodically

Pause time aligns with GC cycles

Stop-the-world visible in logs

GC is stealing your latency budget
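
A hedged sketch of attributing latency spikes to GC via the standard GarbageCollectorMXBean (collection time is cumulative, so diff it between samples; for concurrent collectors it is collection time, not pure pause time):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    class GcTimeReport {
        public static void main(String[] args) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // Counts and times are cumulative since JVM start.
                System.out.printf("%s: %d collections, %d ms total collection time%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }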

The scenario changes. The thinking stays the same.

Metrics → what is wrong?

Logs → what exactly happened?

Traces → where is the time going?

Three Rules for Diagnostic Thinking.

1. Start with the symptom, not the tool.

Don't open Tempo because you like traces. Open it because logs gave you a traceId.

2. Each layer answers a different question.

Metrics: what and how much? Logs: who, when, exactly what? Traces: where in the system?

3. The cause is rarely where you look first.

Lock contention in api-service

→ cause in external-service

In distributed systems, symptom and cause live in different places.

Observability is the path between them.

Thank You.

In distributed systems, the symptom and the cause are rarely in the same place.

Mateusz Grużewski

gruzewski.dev

Rate this talk

Source code
github.com/lshadown/from-metrics-to-root-cause-jvm