🛒 E-Commerce Checkout Latency Crisis
Mob Troubleshooting Workshop Scenario
🎬 The Story
Your e-commerce platform is experiencing a checkout slowdown during peak hours. Customer complaints about "slow payment processing" are rolling in, and cart abandonment is climbing. Revenue is at risk!
🧪 Chaos Experiment Design
Target System Architecture
```
[Frontend] → [API Gateway] → [Checkout Service] → [Payment Service]
                                     ↓
                            [Inventory Service]
                                     ↓
                         [Database (PostgreSQL)]
```
The Experiment: "Payment Service Latency Injection"
Hypothesis: "Our checkout flow gracefully handles payment service degradation without cascading failures"
Chaos Action:
- Target: Payment Service containers
- Fault: Inject 2-5 second latency on 70% of HTTP requests (see the tc-level sketch after this list)
- Duration: 10 minutes
- Blast Radius: Production-like environment (not actual prod!)
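At the network level, a fault like this is usually realized with Linux traffic control. A minimal sketch of the equivalent tc netem rule, run inside the payment-service container's network namespace (the interface name is an assumption; the Steadybit action defined later applies and rolls back the fault for you):

```bash
# Illustration only: add ~2s of delay with ±1s jitter on eth0
# (values mirror the experiment template later in this document).
tc qdisc add dev eth0 root netem delay 2000ms 1000ms

# Remove the rule when the experiment window ends.
tc qdisc del dev eth0 root netem
```

Note that plain netem delays all traffic on the interface; scoping the fault to ~70% of requests needs additional filtering, which is exactly what the tooling handles for you.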
Expected Symptoms:
- ⚠️ Checkout completion time increases from 2s → 8s average
- 📊 Payment success rate drops slightly due to timeouts
- 🚨 APM alerts fire for "High Response Time"
- 💸 Revenue dashboard shows declining checkout conversion
🔍 What Teams Will Discover
Layer 1: Obvious Symptoms
- High-level dashboards show checkout latency spike
- User-facing alerts firing
- Customer support tickets increasing
Layer 2: Service-Level Investigation
- Payment service response times elevated
- Database connection pool not saturated
- No obvious infrastructure issues (CPU, memory normal)
Layer 3: Deep Dive Troubleshooting
- Network latency between services artificially increased
- Distributed tracing shows the bottleneck location (see the trace query sketch after this list)
- Circuit breaker patterns (if implemented) may activate
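If Jaeger is the tracing backend, the slow traces can be pulled straight from its query API to confirm where the time goes. A sketch, assuming the query service is reachable at jaeger-query:16686 and the entry-point service is named checkout-service (both assumptions):

```bash
# List the slowest spans from recent checkout traces that took longer than 2s.
curl -s "http://jaeger-query:16686/api/traces?service=checkout-service&minDuration=2s&lookback=1h&limit=10" \
  | jq '[.data[].spans[] | {operation: .operationName, ms: (.duration / 1000)}] | sort_by(-.ms) | .[:5]'
```

If the injected payment-service latency is the culprit, its spans should dominate the top of that list.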
🎯 Multiple Valid Troubleshooting Paths
Path A: Dashboard-Driven
- Start with business KPI dashboard
- Drill down to service health dashboard
- Identify payment service as bottleneck
- Check service dependencies
Path B: Alert-Driven
- Check recent alerts/incidents
- Follow alert runbook procedures
- Validate alert accuracy with manual checks
- Trace root cause through service mesh
Path C: User Journey-Driven
- Simulate customer checkout flow
- Time each step manually
- Use browser dev tools to identify slow API calls (a curl equivalent is sketched after this list)
- Follow the request path backwards
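The manual timing step can also be scripted from a shell, mirroring the waterfall the browser dev tools show; the staging hostname and endpoint below are assumptions:

```bash
# Time one checkout API call, broken into phases.
curl -s -o /dev/null \
  -w "dns %{time_namelookup}s | connect %{time_connect}s | ttfb %{time_starttransfer}s | total %{time_total}s\n" \
  "https://staging.shop.example.com/api/checkout"
```

Repeat it a few times to see the spread, since only a share of requests is delayed.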
Path D: Metrics Query-Driven
- Start with SLI/SLO violations
- Query APM tools for service latency percentiles (an example query is sketched after this list)
- Correlate with infrastructure metrics
- Build hypothesis from data patterns
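If Prometheus is the metrics backend, the percentile query might look like the sketch below; the metric and label names are assumptions and depend on your instrumentation:

```bash
# p95 latency of the payment service over the last 5 minutes.
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="payment-service"}[5m])) by (le))' \
  | jq '.data.result'
```

Running the same query for checkout-service and payment-service quickly shows whether the slowness originates downstream.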
📝 Expected Alternative Approaches
Engineers might write down:
- "I'd check the database first - it's usually the bottleneck"
- "I'd look at the service mesh metrics instead of APM"
- "I'd run a curl test from inside the cluster"
- "I'd check if it's a DNS resolution issue"
- "I'd look at error rates, not just latency"
🎪 Workshop Flow Integration
Pre-Workshop Setup (Day Before)
- Deploy experiment template in Steadybit
- Verify monitoring coverage exists
- Create shared troubleshooting document
- Set up screen sharing for mob format
During Workshop - Experiment Trigger
```bash
# Facilitator runs this (or clicks in the Steadybit UI):
#   Experiment: "Payment Latency Injection"
#   Duration:   15 minutes (gives buffer for the workshop)
#   Target:     payment-service pods in the staging environment
```
Rotation Strategy
- Driver 1 (2 min): Notices the business impact, opens main dashboard
- Driver 2 (2 min): Drills into service-level metrics, identifies payment service
- Driver 3 (2 min): Investigates payment service health, checks dependencies
- Driver 4 (2 min): Discovers the latency injection through distributed tracing
- Continue rotating until the root cause is identified or the experiment ends
🎓 Learning Outcomes
Technical Skills
- Practice systematic troubleshooting methodology
- Learn different tool approaches (APM, metrics, logs, tracing)
- Understand service dependency investigation
Collaboration Skills
- Experience time-boxed troubleshooting pressure
- See how different engineers approach the same problem
- Practice explaining technical reasoning quickly
System Understanding
- Map service dependencies under stress
- Identify monitoring blind spots
- Understand cascade failure patterns
🛠️ Required Steadybit Setup
```yaml
# Experiment Template
name: "Workshop - Payment Service Latency"
hypothesis: "Checkout survives payment service degradation"
environment: "staging"
lanes:
  - steps:
      - type: "action"
        actionType: "com.steadybit.extension_container.network_delay"
        parameters:
          duration: "10m"
          delay: "2000ms"
          jitter: "1000ms"
        radius:
          targetType: "com.steadybit.extension_container.container"
          predicate:
            operator: "AND"
            predicates:
              - key: "k8s.container.name"
                operator: "EQUALS"
                values: ["payment-service"]
          percentage: 70
```
Monitoring Requirements
- Business KPI dashboard (checkout conversion, revenue)
- Service health dashboard (latency, error rates)
- Infrastructure dashboard (CPU, memory, network)
- Distributed tracing (Jaeger/Zipkin)
- Log aggregation (ELK/Loki)
💡 Pro Tips for Facilitators
- Have a backup plan - If the experiment doesn't trigger obvious symptoms, fall back to pre-recorded screenshots
- Control the chaos - Be ready to stop the experiment if it causes real issues
- Time strictly - Use a visible timer and be firm about the 2-minute rotations
- Capture divergence - The different approaches are the real learning gold
- Follow up - Schedule time to implement identified monitoring gaps
Ready to run this chaos experiment and see how your team troubleshoots under pressure? The real insights come from comparing different engineering mental models!