Observability in Agile Teams
From Signals to Decisions
Workshop · 21.05.2026
Mateusz Grużewski · Assistant Professor, West Pomeranian University of Technology in Szczecin · Principal Software Architect, Asseco Data Systems
Three Weeks Ago
We talked about why observability matters.
Today we go deeper into what it actually is — and how it changes the way Agile teams work.
What You'll Leave With Today
By 14:45 you will:
- understand what metrics, logs, and traces really are
- know which signal answers which question
- have diagnosed real problems in a real system, yourself
- see how this connects to the Agile work you've studied this week
A Quick Reminder
You've all seen this picture before.
✅ CPU: normal
✅ Memory: normal
✅ Error rate: 0%
✅ Alerts: silent
"Something is wrong with the app. The price shows zero. Customers are confused."
— your support team, 9:47 AM
Dashboards confirm symptoms. They rarely reveal causes.
Today We Look Inside Each Signal
Three pillars. We'll examine each one separately.
What it is. What it can do. What it cannot.
By the end you'll know not just the names — but when to reach for which.
Pillar 1: Metrics

What a Metric Actually Is
A number that changes over time.
That's it.
http_request_duration_seconds = 0.234
active_users = 1847
error_rate = 0.02
Aggregated across many events. Cheap to store. Cheap to query.
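To make "aggregated across many events" concrete, here is a minimal Python sketch. The request records and the `summarize` helper are made up for illustration; in practice a metrics library or backend computes these numbers.

```python
from statistics import mean

# Hypothetical raw events: one record per HTTP request.
requests = [
    {"duration_s": 0.120, "ok": True},
    {"duration_s": 0.348, "ok": True},
    {"duration_s": 0.234, "ok": False},
    {"duration_s": 0.234, "ok": True},
]

def summarize(events):
    """Collapse many events into a few numbers. Individual requests are lost."""
    total = len(events)
    failed = sum(1 for e in events if not e["ok"])
    return {
        "request_count": total,
        "error_rate": failed / total,
        "mean_duration_s": mean(e["duration_s"] for e in events),
    }

print(summarize(requests))
```

Four events go in, three numbers come out. That is why metrics are cheap to store and query.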
What Metrics Are Good At
Trends — "error rate has been rising for 30 minutes"
Comparisons — "p95 latency this week vs last week"
Alerts — "page someone if errors exceed 5%"
Capacity planning — "we're at 70% of disk usage, growing 5% per week"
Metrics are the early warning system.
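A rough sketch of two ideas from this slide: a p95 latency and a "page someone if errors exceed 5%" rule. The function names and sample numbers are assumptions for illustration. Real alerting is normally configured in the monitoring tool, not hand-written.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def should_page(error_count, request_count, threshold=0.05):
    """True when the error ratio is above the threshold (5% by default, as on the slide)."""
    if request_count == 0:
        return False
    return error_count / request_count > threshold

# Hypothetical latencies in seconds for one time window.
latencies = [0.12, 0.15, 0.11, 0.90, 0.14, 0.13, 0.16, 0.12, 0.18, 0.11]
print("p95 latency:", percentile(latencies, 95))
print("page on-call:", should_page(error_count=6, request_count=100))
```

Note how the p95 surfaces the one slow request that an average would smooth over.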
What Metrics Cannot Do
They tell you something is wrong.
They cannot tell you:
- which user was affected
- why a specific request failed
- what the exact error message was
Metrics aggregate. Aggregation hides individuals.
Metrics in an Agile Team
Every iteration produces metrics that inform the next one.
- Did latency improve after our last sprint's optimization?
- Is the new feature actually used? How often?
- Are we delivering faster, or just shipping more bugs?
- DORA metrics: deployment frequency, lead time, change failure rate, MTTR
Without metrics, retrospectives are opinions. With metrics, retrospectives are evidence.
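As a sketch of "retrospectives as evidence", the four DORA metrics can be computed from a list of deployments. The record layout, the dates, and the `dora_summary` helper are hypothetical, and teams define lead time and restore time in slightly different ways.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment records for a two-week sprint.
deployments = [
    {"committed": datetime(2026, 5, 4, 9), "deployed": datetime(2026, 5, 4, 15),
     "failed": False, "restored": None},
    {"committed": datetime(2026, 5, 6, 10), "deployed": datetime(2026, 5, 7, 10),
     "failed": True, "restored": datetime(2026, 5, 7, 11)},
    {"committed": datetime(2026, 5, 11, 9), "deployed": datetime(2026, 5, 11, 21),
     "failed": False, "restored": None},
    {"committed": datetime(2026, 5, 13, 9), "deployed": datetime(2026, 5, 13, 11),
     "failed": True, "restored": datetime(2026, 5, 13, 14)},
]

def dora_summary(deploys, period_days):
    """Summarize delivery performance for a period with at least one deployment."""
    failures = [d for d in deploys if d["failed"]]
    lead_times = [d["deployed"] - d["committed"] for d in deploys]
    restore_times = [d["restored"] - d["deployed"] for d in failures]
    mttr = sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
    return {
        "deployments_per_week": len(deploys) * 7 / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": mttr,
    }

print(dora_summary(deployments, period_days=14))
```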
Pillar 2: Logs
What a Log Actually Is
An event that happened, with a timestamp and a description.
2026-04-25 14:32:15 [ERROR] payment failed for user=42 reason=timeout
2026-04-25 14:32:16 [INFO] retry scheduled in 5s
One event = one log entry. Specific, not aggregated.
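A small sketch of reading such a line programmatically. The regular expression assumes exactly the layout of the two sample lines above (timestamp, level in brackets, free text with optional `key=value` pairs). Other systems format their logs differently.

```python
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] "
    r"(?P<message>.*)"
)

def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match the layout."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return {
        "timestamp": datetime.strptime(match["timestamp"], "%Y-%m-%d %H:%M:%S"),
        "level": match["level"],
        "message": match["message"],
        # key=value pairs embedded in the message, e.g. user=42
        "fields": dict(re.findall(r"(\w+)=(\S+)", match["message"])),
    }

entry = parse_line("2026-04-25 14:32:15 [ERROR] payment failed for user=42 reason=timeout")
print(entry["level"], entry["fields"])
```

Unlike a metric, the parsed entry still knows which user was affected and why.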
What Logs Are Good At
Specific incidents — "what happened to user 12345 at 10:23?"
Context — full error messages, stack traces, request bodies
Audit trail — who did what, when
Debugging — when you know roughly what to search for
Logs are the eyewitness account.
What Logs Cannot Do
Show you trends without aggregation.
Tell you about errors that weren't logged.
Easily connect across services — unless they share a traceId.
Logs only show what the developer chose to log. Silence in logs does not mean absence of problems.
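Why a shared traceId matters, as a sketch: with it, log entries from different services can be stitched into one story. The entries, service names, and the `story_of` helper below are invented for illustration.

```python
# Hypothetical log entries from three services, already parsed into dicts.
logs = [
    {"ts": "14:32:15.120", "service": "checkout", "trace_id": "t-1", "body": "placing order"},
    {"ts": "14:32:15.480", "service": "payment", "trace_id": "t-1", "body": "charge failed: timeout"},
    {"ts": "14:32:15.200", "service": "checkout", "trace_id": "t-2", "body": "placing order"},
    {"ts": "14:32:15.300", "service": "shipping", "trace_id": "t-1", "body": "quote requested"},
]

def story_of(trace_id, entries):
    """All log entries that belong to one request, across services, in time order."""
    related = [e for e in entries if e.get("trace_id") == trace_id]
    return sorted(related, key=lambda e: e["ts"])

for entry in story_of("t-1", logs):
    print(entry["ts"], entry["service"], entry["body"])
```

Without the `trace_id` field, the only way to connect these lines would be guessing from timestamps.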
Logs in an Agile Team
After every release, production becomes input for the next sprint.
- Bug reports become reproducible — "here's the log line where it broke"
- Postmortems become factual — "here's the exact sequence of events"
- User-reported issues stop being mysteries
Without logs, the team builds blind. With logs, every incident becomes a learning opportunity.
Pillar 3: Traces
What a Trace Actually Is
The complete journey of a single request through your system.
Made up of spans — each one is "this service did this operation, this took that long."

A trace is a causal graph. It shows what called what, in what order, and how long each part took.
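A minimal sketch of a trace as data, assuming each span records its parent. The services, operations, and durations are hypothetical. The helper answers "where did the time go?" by subtracting the time spent in direct children, which is only accurate when children run one after another.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    duration_ms: int

# One hypothetical 2-second request.
spans = [
    Span("a", None, "frontend", "GET /product", 2000),
    Span("b", "a", "catalog", "GetProduct", 1800),
    Span("c", "b", "database", "SELECT", 1500),
    Span("d", "a", "ads", "GetAds", 100),
]

def self_times(trace):
    """Time spent in each span itself, excluding its direct children."""
    child_total = {}
    for span in trace:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0) + span.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in trace}

def slowest(trace):
    """The span with the most self time: the likely bottleneck."""
    times = self_times(trace)
    return max(trace, key=lambda s: times[s.span_id])

bottleneck = slowest(spans)
print(bottleneck.service, bottleneck.operation)
```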
What Traces Are Good At
Cross-service debugging — "which of our 12 microservices is slow?"
Bottleneck analysis — "this request took 2 seconds: where did the time go?"
System topology — "oh, I didn't realize that service even called this one"
Span events — structured details about what happened, attached to a specific operation
Traces show the system as it actually behaves, not as the diagram on the wall.
What Traces Cannot Do
Show you aggregate behavior — use metrics for that
Capture every request — usually sampled at 1% or 10%
Help you if your services aren't instrumented — no spans, no traces
Traces are deep, not wide. One trace = one request, deeply.
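One way sampling can work, as a sketch: decide from the trace ID alone, so every service keeps or drops the same requests. This hash-based rule is an illustration of the idea, not the algorithm of any particular tracing SDK.

```python
import hashlib

def keep_trace(trace_id: str, ratio: float) -> bool:
    """Keep roughly `ratio` of all traces, always making the same choice for the same ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # spread evenly over 0 .. 2**64 - 1
    return bucket < int(ratio * 2**64)

# Hypothetical trace IDs; real ones are random hex strings.
trace_ids = [f"trace-{n}" for n in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, 0.10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

At 10%, nine out of ten requests leave no trace at all, which is why a rare failure may be missing from the tracing backend.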
Traces in an Agile Team
Microservices are easier to understand when you can read traces.
- New team members onboard faster — "watch the trace, learn the system"
- Architecture diagrams stop lying — traces show real dependencies
- Performance issues become visible — not theoretical
Traces turn distributed systems from abstract to observable.
Three Signals, Three Strengths
Metrics · what and how much · Wide. Cheap. Fast.
Logs · who and exactly what · Specific. Detailed. Sometimes silent.
Traces · where and how · Deep. Causal. Sampled.
No single signal is enough. The skill is knowing which to use when.
One Investigation, Three Signals
A single bug. Three different views.
- Metrics: "GetProduct error rate jumped from 0% to 100%"
- Logs: "only a few unrelated errors logged"
- Traces: "every failed request involves product ID OLJCESPC7Z"
Each signal showed a different part of the truth. Combined, they told the whole story.
How This Connects to Agile
Observability is not a DevOps tool.
It is the empirical layer that Agile assumes you have.
Every Agile principle — inspect and adapt, fast feedback, learning over guessing — needs real data from production.
Without observability, you're inspecting nothing. With it, every release becomes a hypothesis you can test.
The System You'll Work With
OpenTelemetry Demo — a microservices application built to teach observability.


Your Tools Today

Grafana — dashboards for metrics, search interface for logs
Jaeger — distributed traces and span events
Feature Flag UI — to inject failures into the system
All accessible at: workshop.gruzewski.dev
What Happens Next
Right after this lecture (with a short break):
I'll show you one scenario, end-to-end. You watch.
Then:
You split into groups. Each group gets a scenario card.
You diagnose. You report what you found.
By 14:45:
You'll have done this 2–3 times. You'll know how it feels.
A Final Thought Before We Begin
You don't need to be an expert in software engineering to do this.
You need to ask good questions and read evidence.
That's what we'll practice.
Observability is a way of thinking, more than a set of tools.
Questions?
Anything before we move into the demo?
Break: 15 minutes
Form Your Groups
There are 23 of us. Let's make 5 groups of 4 to 5 people each.
Quickest way: count off 1, 2, 3, 4, 5 going around the room. All 1s sit together. All 2s sit together. And so on.
You'll work with this group for the rest of the day.
Two minutes. One laptop per group is enough. Pick someone who'll submit answers on Moodle.
Scenario 0: A Real Customer Complaint
We start with a real diagnostic — together.
I drive. You watch. You follow along on your laptops.
The same picture you saw at the start of today. Now we go inside it.
The Ticket That Just Came In
"Your product page looks broken. The price shows zero, the name is missing, but the page still loads. Customers are confused."
— customer support, this morning
No alert fired. No service crashed. Every infrastructure dashboard is green.
But the customer is right. Something is wrong.
Follow Along & Submit
Watch the demo. I'll move through Grafana and Jaeger step by step.
Open in your browser:
- Shop — workshop.gruzewski.dev
- Grafana — workshop.gruzewski.dev/grafana
- Jaeger — workshop.gruzewski.dev/jaeger
Then in your group, answer four questions on Moodle:
- What did the metrics show?
- What did the logs say — and not say?
- What did the traces reveal?
- Which signal told you the most?
One submission per group. Discuss before you answer.
Scenario 1: The Payment Problem
Pillar focus: Metrics
Another Ticket Arrived
URGENT — Priority: HIGH
"Multiple customer complaints since this morning — they can't complete their orders. Payments are failing during checkout."
"We're not sure how many are affected. Call volume is spiking. Sales is panicking — every minute we lose orders."
"Can someone find out HOW BAD this is and WHEN it started?"
— customer support, 10:23 AM
Your Mission
Find which service is causing the problem.
You have access to:
- The shop — workshop.gruzewski.dev
- Grafana — metrics, dashboards, logs
- Jaeger — distributed traces
- Your Quick Reference card on the table
Work in your groups. Don't worry about the "right" answer yet — explore.
You have 10 minutes.
Need a Hint?
Did you reproduce the issue in the shop yet?
Try it: add a product to cart → checkout → fill any data → Place Order.
Open DevTools → Network tab while you do this.
What do you see?
Still Stuck?
OK, the issue reproduces. But — how widespread is it?
Open Grafana → Dashboards → Demo Dashboard.
The dashboard has a Service selector at the top.
Try different services. Which one shows an elevated error rate?
What You Should Have Found
The checkout service shows an elevated error rate of around 20–30%.
Two spans tell the story:
- oteldemo.CheckoutService/PlaceOrder
- oteldemo.PaymentService/Charge

This is RED metrics in action — Rate, Errors, Duration — generated automatically from spans by the OpenTelemetry Collector.
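To show what "generated from spans" means, here is a simplified sketch. In the workshop system the OpenTelemetry Collector does this work. The code below is not the Collector's implementation, and the span records and numbers are hypothetical.

```python
from collections import defaultdict

# Hypothetical finished spans observed during a 10-second window.
spans = [
    {"service": "checkout", "name": "oteldemo.CheckoutService/PlaceOrder", "duration_ms": 200, "error": False},
    {"service": "checkout", "name": "oteldemo.CheckoutService/PlaceOrder", "duration_ms": 220, "error": False},
    {"service": "checkout", "name": "oteldemo.CheckoutService/PlaceOrder", "duration_ms": 240, "error": True},
    {"service": "checkout", "name": "oteldemo.CheckoutService/PlaceOrder", "duration_ms": 260, "error": False},
    {"service": "checkout", "name": "oteldemo.CheckoutService/PlaceOrder", "duration_ms": 280, "error": False},
    {"service": "checkout", "name": "oteldemo.PaymentService/Charge", "duration_ms": 100, "error": True},
    {"service": "checkout", "name": "oteldemo.PaymentService/Charge", "duration_ms": 140, "error": False},
]

def red_metrics(finished_spans, window_s):
    """Rate, Errors, Duration per (service, span name)."""
    grouped = defaultdict(list)
    for span in finished_spans:
        grouped[(span["service"], span["name"])].append(span)
    return {
        key: {
            "rate_per_s": len(group) / window_s,
            "error_ratio": sum(s["error"] for s in group) / len(group),
            "mean_duration_ms": sum(s["duration_ms"] for s in group) / len(group),
        }
        for key, group in grouped.items()
    }

for key, values in red_metrics(spans, window_s=10).items():
    print(key, values)
```

Because the grouping key includes the service, the same span name can produce different numbers on different dashboards.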
Going Deeper
If PaymentService/Charge is failing,
shouldn't the payment service show errors too?
Switch the Demo Dashboard to Service: payment.
Compare what you see there with what we just found on checkout.
You have 5 more minutes.
Need a Nudge?
Compare the same span on both dashboards:
oteldemo.PaymentService/Charge
- on the checkout dashboard
- on the payment dashboard
Do the numbers match? What does each one tell you about the failure?
Let Me Show You The Path
The same ticket. The same answer. Cleaner route.
The answer matters less than the workflow. Compare what I do to what your group did.
What Metrics Taught Us Today
Scale and rate · "20% of checkouts fail" — not just "is it broken?"
Pattern over time · "Started at 10:23. Steady at 20%." · When did this begin? Is it growing?
Where to look · "Checkout has the spike — others are flat." · In a system of dozens of services, metrics narrow the search.
The dark lesson: one dashboard never tells the whole story.
Group Task — 10 minutes
Discuss with your group and submit on Moodle:
- Which service did you identify first as having issues, and why that one?
- When did the problem start? How did you find that from metrics?
- What alert would have caught this before customer complaints?
- Compare to Scenario 0: how was your investigation different this time?
One submission per group. Numbered answers, 2–4 sentences each.
End of Scenario 1
Next: Scenario 2 — Logs Pillar
Lunch break: we will see each other again at 1:15 PM
Scenario 2: The Accounting Errors
Pillar focus: Logs
Another Alert Fired
URGENT — Monitoring alert
"Hi team, our monitoring has been firing alerts since this morning. The accounting service is generating thousands of error logs per minute, all saying 'Order parsing failed:'."
"But here's the weird thing: the shop works fine. Customers can place orders. Orders appear in our database. Nothing seems broken from the customer's point of view."
"So either we have a real bug that doesn't affect users — or these errors are misleading us."
"Can you figure out what's actually happening?"
— Anna W., Operations team, 8:47 AM
Your Mission
Figure out what's actually happening with the accounting errors.
You have access to:
- The shop — workshop.gruzewski.dev
- Grafana — metrics, dashboards, logs
- Jaeger — distributed traces
- Your Quick Reference card on the table
Work in your groups. Don't worry about the "right" answer yet — explore.
You have 10 minutes.
Need a Hint?
Start with metrics — confirm what Anna said.
Open Grafana → Dashboards → Demo Dashboard.
Try Service: accounting.
What does the Error Rate panel show? Does it match the ticket?
Still Stuck?
OK — accounting has errors. But metrics only tell us that something is wrong.
To find what is wrong, go to the logs.
Open Grafana → Explore → OpenSearch.
Try filtering:
resource.service.name:"accounting" AND severity.text:"Error"
Look at the body field. What does it say?
What You Should Have Found
The accounting service has thousands of ERROR logs, all with the same body:
"Order parsing failed:"

So Anna was right — something IS failing. Looks like the order data is somehow malformed.
Or is it?
Going Deeper
The body says "Order parsing failed" — but log bodies are short summaries.
The real information is in the other fields of the log entry.
Click on one error log entry to expand it. What fields do you see? What do they tell you?
You have 5 more minutes.
Need a Nudge?
Look specifically at these fields inside the log entry:
- attributes.exception.type
- attributes.exception.message
- attributes.exception.stacktrace
What do they tell you about the real cause? Is parsing really what's failing?
The Twist
The body says "Order parsing failed."
The attributes tell a completely different story:
- attributes.exception.type: DbUpdateException
- attributes.exception.message: "An error occurred while saving the entity changes."
- attributes.exception.stacktrace: "duplicate key value violates unique constraint order_pkey"
This isn't a parsing bug. The database is rejecting duplicate orders.
The body misled us. The structured attributes held the truth.
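The same habit can be sketched in code: count errors by their structured attributes, not only by the body. The dict layout below is an assumption about how exported log entries might look, and the three entries are made up from the values on the previous slide.

```python
from collections import Counter

# Hypothetical exported log entries; real entries carry many more fields.
error_logs = [
    {"body": "Order parsing failed:",
     "attributes": {"exception.type": "DbUpdateException",
                    "exception.message": "An error occurred while saving the entity changes."}},
    {"body": "Order parsing failed:",
     "attributes": {"exception.type": "DbUpdateException",
                    "exception.message": "An error occurred while saving the entity changes."}},
    {"body": "Order parsing failed:",
     "attributes": {"exception.type": "DbUpdateException",
                    "exception.message": "An error occurred while saving the entity changes."}},
]

def count_by_body(entries):
    """What the headline says."""
    return Counter(e["body"] for e in entries)

def count_by_exception(entries):
    """What actually went wrong, according to the structured attributes."""
    return Counter(e["attributes"].get("exception.type", "<none>") for e in entries)

print(count_by_body(error_logs))
print(count_by_exception(error_logs))
```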
But Where Do The Duplicates Come From?
The database is rejecting duplicates — fine. But why are duplicates arriving in the first place?
Accounting doesn't create orders. Another service does.
Go check the logs of the service that publishes orders to Kafka. Look at the checkout service logs. What do you see?
You have 5 more minutes.
Need One More Hint?
Try these searches in checkout's logs:
resource.service.name:"checkout" AND body:"FeatureFlag"
or
resource.service.name:"checkout" AND body:"overload"
What do you find?
The Root Cause
Checkout's logs reveal the smoking gun:
⚠️ "FeatureFlag 'kafkaQueueProblems' is activated"
ℹ️ "Done with #100 messages for overload simulation"
Checkout intentionally publishes 100 copies of every order because the feature flag is enabled.
Accounting receives all 100, tries to save them, fails on duplicates.
The error in accounting is real. But the cause is in checkout. The root is a feature flag.
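The mechanism can be reproduced in a few lines. This is a sketch only: an in-memory SQLite table stands in for the real database, a Python list stands in for Kafka, and the table layout and `consume` helper are invented for illustration.

```python
import sqlite3

def consume(messages):
    """Save each order message; count how many were stored and how many were rejected."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, total REAL)")
    saved = rejected = 0
    for msg in messages:
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(
                    "INSERT INTO orders (order_id, total) VALUES (?, ?)",
                    (msg["order_id"], msg["total"]),
                )
            saved += 1
        except sqlite3.IntegrityError:
            # The primary key already exists: this is the duplicate-key error.
            rejected += 1
    conn.close()
    return saved, rejected

# The publisher sends 100 copies of the same order, as with the feature flag on.
order = {"order_id": "order-1", "total": 49.99}
print(consume([order] * 100))  # (1, 99)
```

One order, 99 error logs. The consumer is behaving correctly. The flood starts upstream.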
Let Me Show You The Path
Same ticket. Same answer. Cleaner route.
Watch the workflow. Compare what I do to what your group did.
What Logs Taught Us Today
Specific context · exact exception, stack trace, attributes · not just "something failed"
Multi-service narrative · follow the chain across services · when traceId can't help us
The hidden truth · body is the headline, attributes are the article · Always expand. Always check.
The dark lesson: a clear error message can be wrong.
Group Task — 10 minutes
Discuss with your group and submit on Moodle:
- Are customers affected? What evidence supports your answer?
- The body says 'Order parsing failed:'. What is the real cause of the error, and which field told you?
- Where do the duplicate messages come from? What do that service's logs reveal?
- Compare to Scenario 1: what kind of questions can logs answer that metrics can't?
One submission per group. Numbered answers, 2–4 sentences each.
End of Scenario 2
Short break: 10 minutes
Three Pillars, One Story
You came in this morning with a question: how do you actually know what's happening in production?
You leave with three tools to answer it.
Metrics · what and how much · Scenario 0, Scenario 1
Logs · who, why, and exactly what · Scenario 0, Scenario 2
Traces · where and how, across services · Scenario 0, and what's next
Each pillar answers questions the others cannot.
Back To Where We Started
This morning I showed you this picture:
✅ CPU: normal
✅ Memory: normal
✅ Error rate: 0%
✅ Alerts: silent
"Something is wrong with the app. The price shows zero. Customers are confused." — support team, 9:47 AM
Today, you went inside that picture. The tools to see beyond green dashboards are no longer abstract — they're yours.
The Third Pillar — Briefly
We didn't have time for a deep dive on traces today.
But you've already seen them — in Scenario 0, and in passing throughout.
Let me show you one trace in Jaeger, so you know what to come back to.
Watch the call chain. Notice how many services touched a single request.
What Will You Remember?
Turn to your group right now.
Each of you — one thing you'll remember a year from now.
60 seconds.
Where To Go Next
If you want to continue learning:
- The OpenTelemetry Demo — github.com/open-telemetry/opentelemetry-demo (everything you used today is open source, and you can run it locally)
- DORA Metrics — if you're studying Agile delivery (deployment frequency, lead time, change failure rate, MTTR)
- Honeycomb / Grafana / Datadog blogs — production observability writing
- Your own projects — add basic logs to one you're building
The tools are free. The skill is in knowing what to look for.
Thank You
You worked hard today.
Three pillars. Three investigations. One real skill that will outlast any specific tool.
Materials are on Moodle. Any questions before we finish?