Observability in Agile Teams: From Incidents to Feedback


Metrics Look Fine.

Alerts Are Silent.

Users Are Calling.

Observability in Agile Teams: From Incidents to Feedback

Mateusz Grużewski
Assistant Professor, West Pomeranian University of Technology in Szczecin
Principal Software Architect, Asseco Data Systems

You've Been Here Before.

✅ CPU: normal

✅ Memory: normal

✅ Error rate: 0%

✅ Alerts: silent

"The app has been slow for 20 minutes."

— your biggest client, 9:47 PM

Why this matters in Agile teams

Agile teams don't just need to deliver fast.
They also need to learn fast.

After a release, teams need to understand:

  • what changed,
  • what users experience,
  • where problems come from,
  • what to improve next.

Observability shortens the feedback loop between release and learning.

Dashboards Don't Show Causes.

Dashboards show symptoms. Not causes.

The Real Question Is Why.

You can see what is slow.

You can see when it started.

The hard part is why.

Observability is the ability to answer "why" in a system you cannot stop.

The First Instinct During an Incident

Every production incident starts the same way.

Engineers open Grafana.

They scan dashboards.

They check CPU, memory, error rate.

And for a moment…

everything looks fine.

Dashboards confirm the symptom.
They rarely reveal the cause.

Three Signals. One Answer.

Whether we investigate an incident or evaluate a release, we usually need three kinds of signals.

The Workflow We'll Follow.

Not a checklist. A way of thinking.

Every transition is a narrowing of the hypothesis. Not a random click.

From incidents to feedback

This workflow is useful during incidents.
But in Agile teams — it is used every day.

The same thinking helps teams answer:

  • Did latency change after the last release?
  • Is this a regression or expected behavior?
  • Which users are affected?
  • Should we roll back or move forward?
  • What do we improve in the next sprint?

Observability turns production into input for the next iteration.

Observability in the Agile Loop

Observability connects production with Agile decision-making.

The System We'll Investigate.

Two services. One hidden problem.

Architecture:

client → api-service :8080

api-service → per-user cache (lock)

api-service → external-service :8081

Premium users: cache TTL = 5s

Observability stack:

Prometheus — metrics

Loki — logs

Tempo — traces

Grafana — everything

The problem is already running. We just don't know where it is yet.

Before We Touch Any Tools.

Let's make a hypothesis.

What we know:

Latency increased.

But:

✅ CPU is normal

✅ Memory is normal

✅ Error rate is zero

Possible explanations:

  • Something is blocking requests
  • Something outside the service is slow
  • Something inside the app pauses execution

Only one of these is true. Let's find out which one.

You Are Ready for This.

You don’t need production experience to follow this.

We focus on thinking, not tools.

DEMO

Root Cause Found.

No guessing. No restarting. No hoping.

The chain:

external-service is slow (~2500ms)

api-service cache lock serializes all parallel requests

Every premium user request waits for the same slow call
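The serialization step is easy to see in a few lines of code. This is a minimal Python stand-in, not the demo services: the names, timings, and cache shape are assumptions. The point is the single lock guarding the whole per-user cache, which forces independent requests to queue behind one slow call.

```python
import threading
import time

cache = {}
cache_lock = threading.Lock()   # ONE lock for the whole cache

# bookkeeping to observe how many requests run concurrently
in_flight = 0
max_in_flight = 0
stats_lock = threading.Lock()

def slow_external_call(user):
    time.sleep(0.05)            # stands in for the ~2500 ms external-service call
    return f"profile:{user}"

def get_profile(user):
    global in_flight, max_in_flight
    with cache_lock:            # serializes ALL users, not just this one
        with stats_lock:
            in_flight += 1
            max_in_flight = max(max_in_flight, in_flight)
        try:
            if user not in cache:
                cache[user] = slow_external_call(user)
            return cache[user]
        finally:
            with stats_lock:
                in_flight -= 1

# Ten different premium users arrive in parallel...
threads = [threading.Thread(target=get_profile, args=(f"user-{i}",))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# ...but the shared lock lets only one request proceed at a time.
print(max_in_flight)  # 1 -> the lock, not the cache, is the bottleneck
```

A per-user lock (or a concurrent map with per-key computation) would let unrelated users proceed in parallel; the trace makes the difference visible as stacked, non-overlapping spans.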

The logs:

    // external-service
    db_query_duration_ms=2500

In production: slow DB query, timeout, GC pause

One trace. Ten minutes. Zero guessing.

What We Just Did.

Every transition was a decision.

Metrics → identified the spike

Logs → found the pattern, got traceId

Trace → pinpointed the bottleneck

Root cause → confirmed in 10 minutes
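The logs-to-trace handoff is nothing more than pulling a field out of a structured log line. A hypothetical logfmt line (field names and values are illustrative, not the actual demo output) makes the step concrete:

```python
import shlex

# Hypothetical logfmt line from api-service; fields are assumptions.
line = 'level=warn msg="cache lock wait" user=premium-42 wait_ms=2480 traceId=9f3a1c'

# shlex.split respects the quotes around msg, so every token is key=value
fields = dict(token.split("=", 1) for token in shlex.split(line))

print(fields["traceId"])  # 9f3a1c -> paste into Tempo to open the trace
print(fields["wait_ms"])  # 2480   -> how long this request waited on the lock
```

This is exactly why structured logs matter: a grep-able `traceId` turns "found the pattern" into "opened the trace" in one step.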

The tool doesn't matter. The question does.

The Pattern Repeats.

The scenario changes. The thinking stays the same.

Memory Pressure

Heap usage looks stable

Allocation rate keeps increasing

Old generation slowly fills up

OOM waiting to happen
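JVM specifics aside, the shape of a slow leak fits in a few lines. This is a hypothetical unbounded cache, not the demo code: no single request is wrong, so dashboards look stable, but every request leaves one entry behind.

```python
cache = {}

def handle_request(request_id):
    # each request stores a result keyed by a unique id...
    cache[request_id] = "x" * 1024
    # ...and nothing ever evicts it
    return cache[request_id]

for i in range(10_000):
    handle_request(i)

print(len(cache))  # 10000 entries waiting to become an OOM
```

The symptom only appears in the trend (allocation rate, heap after GC), never in a single snapshot. That is what makes metrics the right first layer here.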

The scenario changes. The thinking stays the same.

Metrics → what is wrong?

Logs → what exactly happened?

Traces → where is the time going?

Three Rules for Diagnostic Thinking.

1. Start with the symptom, not the tool.

Don't open Tempo because you like traces. Open it because logs gave you a traceId.

2. Each layer answers a different question.

Metrics: what and how much? Logs: who, when, exactly what? Traces: where in the system?

3. The cause is rarely where you look first.

Lock contention in api-service

→ cause in external-service

In distributed systems, symptom and cause live in different places.

Observability is the path between them.

What to expect from the workshop

In the workshop, you will:

  • investigate a realistic production-like problem step by step
  • learn how to go from "something is slow" to root cause
  • use metrics, logs and traces together — not separately
  • understand how observability supports Agile decisions
  • practice thinking about real systems

You do not need prior experience.

The goal is to learn how to ask the right questions about a running system.

Thank You.

In Agile teams, delivery is only half of the story.
The other half is learning from production.

If you've ever thought:

"Everything looks fine… but something is wrong"

You are ready for this workshop.

See you in May.