Observability in Agile Teams: From Incidents to Feedback
Metrics Look Fine.
Alerts Are Silent.
Users Are Calling.
Observability in Agile Teams: From Incidents to Feedback
Mateusz Grużewski
Assistant Professor, West Pomeranian University of Technology in Szczecin
Principal Software Architect, Asseco Data Systems
You've Been Here Before.
✅ CPU: normal
✅ Memory: normal
✅ Error rate: 0%
✅ Alerts: silent
"The app has been slow for 20 minutes."
— your biggest client, 9:47 PM
Why this matters in Agile teams
Agile teams don't just need to deliver fast.
They also need to learn fast.
After a release, teams need to understand:
- what changed,
- what users experience,
- where problems come from,
- what to improve next.
Observability shortens the feedback loop between release and learning.
Dashboards Don't Show Causes.
Dashboards show symptoms. Not causes.
The Real Question Is Why.
You can see what is slow.
You can see when it started.
The hard part is why.
Observability is the ability to answer "why" in a system you cannot stop.
The First Instinct During an Incident
Every production incident starts the same way.
Engineers open Grafana.
They scan dashboards.
They check CPU, memory, error rate.
And for a moment…
everything looks fine.
Dashboards confirm the symptom.
They rarely reveal the cause.
Three Signals. One Answer.
Whether we investigate an incident or evaluate a release, we usually need three kinds of signals.
The Workflow We'll Follow.
Not a checklist. A way of thinking.
Every transition is a narrowing of the hypothesis. Not a random click.
From incidents to feedback
This workflow is useful during incidents.
But in Agile teams — it is used every day.
The same thinking helps teams answer:
- Did latency change after the last release?
- Is this a regression or expected behavior?
- Which users are affected?
- Should we roll back or move forward?
- What do we improve in the next sprint?
Observability turns production into input for the next iteration.
Observability in the Agile Loop
Observability connects production with Agile decision-making.
The System We'll Investigate.
Two services. One hidden problem.
Architecture:
client → api-service :8080
api-service → per-user cache (lock)
api-service → external-service :8081
Premium users: cache TTL = 5s
Observability stack:
Prometheus — metrics
Loki — logs
Tempo — traces
Grafana — everything
The problem is already running. We just don't know where it is yet.
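The hidden problem can be sketched in a few lines. This is a hypothetical model of the api-service pattern, not its real code: the names are assumptions and the ~2500ms external call is shrunk to 50ms. The key detail is that the slow call happens while holding a single cache lock, so parallel requests serialize.

```python
import threading
import time

cache = {}
cache_lock = threading.Lock()  # one lock for the whole cache (the flaw)

def call_external_service(user_id):
    time.sleep(0.05)  # stand-in for the ~2500ms external call
    return f"data-for-{user_id}"

def handle_request(user_id):
    # The slow call runs inside the critical section, so every
    # concurrent request queues behind the one currently filling the cache.
    with cache_lock:
        if user_id not in cache:
            cache[user_id] = call_external_service(user_id)
        return cache[user_id]

def total_time(n_requests):
    """Run n parallel cache-miss requests and return the wall-clock time."""
    threads = [threading.Thread(target=handle_request, args=(f"user-{i}",))
               for i in range(n_requests)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start
```

Four parallel cache misses finish in roughly 4 × 50ms instead of ~50ms, while CPU, memory, and error rate all stay healthy: exactly the symptom from the opening slide.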
Before We Touch Any Tools.
Let's make a hypothesis.
What we know:
Latency increased.
But:
✅ CPU is normal
✅ Memory is normal
✅ Error rate is zero
Possible explanations:
- Something is blocking requests
- Something outside the service is slow
- Something inside the app pauses execution
Only one of these is true. Let's find out which one.
You Are Ready for This.
You don’t need production experience to follow this.
We focus on thinking, not tools.
DEMO
Root Cause Found.
No guessing. No restarting. No hoping.
The chain:
external-service is slow (~2500ms)
api-service cache lock serializes all parallel requests
Every premium user request waits for the same slow call
The logs:
// external-service
db_query_duration_ms=2500
In production, this could be a slow DB query, a timeout, or a GC pause.
One trace. Ten minutes. Zero guessing.
What We Just Did.
Every transition was a decision.
Metrics → identified the spike
Logs → found the pattern, got traceId
Trace → pinpointed the bottleneck
Root cause → confirmed in 10 minutes
The tool doesn't matter. The question does.
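The Logs → traceId hand-off above can be sketched as simple key=value parsing. The log lines and field names here are made up in the spirit of the demo's `db_query_duration_ms` entry; real formats will differ.

```python
# Hypothetical structured log lines; the fields are assumptions.
LOG_LINES = [
    'level=info traceId=abc123 route=/quote duration_ms=2510 user=premium-1',
    'level=info traceId=def456 route=/quote duration_ms=35 user=free-7',
    'level=info traceId=9f0e11 route=/quote duration_ms=2498 user=premium-2',
]

def parse(line):
    """Turn a 'k=v k=v ...' line into a dict."""
    return dict(kv.split("=", 1) for kv in line.split())

def slow_trace_ids(lines, threshold_ms=1000):
    """Return traceIds of requests slower than the threshold,
    ready to paste into the tracing backend."""
    return [rec["traceId"] for rec in map(parse, lines)
            if int(rec["duration_ms"]) > threshold_ms]
```

This is the narrowing step in miniature: logs tell you *which* requests were slow and hand you the traceId that the trace view needs.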
The Pattern Repeats.
The scenario changes. The thinking stays the same.
Memory Pressure
Heap usage looks stable
Allocation rate keeps increasing
Old generation slowly fills up
→ OOM waiting to happen
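The "waiting to happen" part can be made concrete with a back-of-the-envelope projection. A sketch with made-up numbers, assuming roughly linear old-generation growth between samples:

```python
def minutes_to_oom(samples_mb, interval_min, heap_limit_mb):
    """Project when a steadily filling old generation hits the heap limit,
    from the average growth rate over the observed samples.
    Real heaps grow in GC sawtooths, so treat this as a rough estimate."""
    growth_per_sample = (samples_mb[-1] - samples_mb[0]) / (len(samples_mb) - 1)
    rate_mb_per_min = growth_per_sample / interval_min
    return (heap_limit_mb - samples_mb[-1]) / rate_mb_per_min

# Old gen sampled every 5 minutes: 100, 120, 140, 160 MB against a 400 MB
# limit → ~4 MB/min, so roughly an hour until OOM
```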
The scenario changes. The thinking stays the same.
Metrics → what is wrong?
Logs → what exactly happened?
Traces → where is the time going?
Three Rules for Diagnostic Thinking.
1. Start with the symptom, not the tool.
Don't open Tempo because you like traces. Open it because logs gave you a traceId.
2. Each layer answers a different question.
Metrics: what and how much? Logs: who, when, exactly what? Traces: where in the system?
3. The cause is rarely where you look first.
Lock contention in api-service
→ cause in external-service
In distributed systems, symptom and cause live in different places.
Observability is the path between them.
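Rule 2's metrics question, "what and how much?", often comes down to percentiles: averages hide a handful of slow premium requests, while high percentiles expose them. A minimal nearest-rank sketch with synthetic durations:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least
    p% of the samples at or below it."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Mostly-fast traffic with a few ~2.5s outliers, like the demo
durations_ms = [35, 40, 38, 42, 2510, 37, 2498, 41, 39, 36]
```

Here the median stays in the tens of milliseconds while p95 jumps to 2510 ms, which is why latency dashboards should chart percentiles, not averages.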
What to expect from the workshop
In the workshop, you will:
- investigate a realistic production-like problem step by step
- learn how to go from "something is slow" to root cause
- use metrics, logs, and traces together — not separately
- understand how observability supports Agile decisions
- practice thinking about real systems
You do not need prior experience.
The goal is to learn how to ask the right questions about a running system.
Thank You.
In Agile teams, delivery is only half of the story.
The other half is learning from production.
If you've ever thought:
"Everything looks fine… but something is wrong"
You are ready for this workshop.
See you in May.