🛒 E-Commerce Checkout Latency Crisis
Mob Troubleshooting Workshop Scenario
🎬 The Story
Your e-commerce platform is experiencing a checkout slowdown during peak hours. Customer complaints about "slow payment processing" are rolling in, and cart abandonment is climbing. Revenue is at risk!
🧪 Chaos Experiment Design
Target System Architecture
```
[Frontend] → [API Gateway] → [Checkout Service] → [Payment Service]
                                     ↓
                            [Inventory Service]
                                     ↓
                         [Database (PostgreSQL)]
```
The Experiment: "Payment Service Latency Injection"
Hypothesis: "Our checkout flow gracefully handles payment service degradation without cascading failures"
Chaos Action:
- Target: Payment Service containers
- Fault: Inject 2-5 second latency on 70% of HTTP requests (see the tc-level sketch after this list)
- Duration: 10 minutes
- Blast Radius: Production-like environment (not actual prod!)
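At the network level, a fault like this is usually realized with Linux traffic control. A minimal sketch of the equivalent tc netem rule, run inside the payment-service container's network namespace (the interface name is an assumption; the Steadybit action defined later applies and rolls back the fault for you):

```bash
# Illustration only: add ~2s of delay with ±1s jitter on eth0
# (values mirror the experiment template later in this document).
tc qdisc add dev eth0 root netem delay 2000ms 1000ms

# Remove the rule when the experiment window ends.
tc qdisc del dev eth0 root netem
```

Note that plain netem delays all traffic on the interface; scoping the fault to ~70% of requests needs additional filtering, which is exactly what the tooling handles for you.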
Expected Symptoms:
- ⚠️ Checkout completion time increases from 2s → 8s average
- 📊 Payment success rate drops slightly due to timeouts
- 🚨 APM alerts fire for "High Response Time"
- 💸 Revenue dashboard shows declining checkout conversion
🔍 What Teams Will Discover
Layer 1: Obvious Symptoms
- High-level dashboards show checkout latency spike
- User-facing alerts firing
- Customer support tickets increasing
Layer 2: Service-Level Investigation
- Payment service response times elevated
- Database connection pool not saturated
- No obvious infrastructure issues (CPU, memory normal)
Layer 3: Deep Dive Troubleshooting
- Network latency between services artificially increased
- Distributed tracing shows the bottleneck location (see the trace query sketch after this list)
- Circuit breaker patterns (if implemented) may activate
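If Jaeger is the tracing backend, the slow traces can be pulled straight from its query API to confirm where the time goes. A sketch, assuming the query service is reachable at jaeger-query:16686 and the entry-point service is named checkout-service (both assumptions):

```bash
# List the slowest spans from recent checkout traces that took longer than 2s.
curl -s "http://jaeger-query:16686/api/traces?service=checkout-service&minDuration=2s&lookback=1h&limit=10" \
  | jq '[.data[].spans[] | {operation: .operationName, ms: (.duration / 1000)}] | sort_by(-.ms) | .[:5]'
```

If the injected payment-service latency is the culprit, its spans should dominate the top of that list.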
🎯 Multiple Valid Troubleshooting Paths
Path A: Dashboard-Driven
- Start with business KPI dashboard
- Drill down to service health dashboard
- Identify payment service as bottleneck
- Check service dependencies
Path B: Alert-Driven
- Check recent alerts/incidents
- Follow alert runbook procedures
- Validate alert accuracy with manual checks
- Trace root cause through service mesh
Path C: User Journey-Driven
- Simulate customer checkout flow
- Time each step manually
- Use browser dev tools to identify slow API calls (a curl equivalent is sketched after this list)
- Follow the request path backwards
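The manual timing step can also be scripted from a shell, mirroring the waterfall the browser dev tools show; the staging hostname and endpoint below are assumptions:

```bash
# Time one checkout API call, broken into phases.
curl -s -o /dev/null \
  -w "dns %{time_namelookup}s | connect %{time_connect}s | ttfb %{time_starttransfer}s | total %{time_total}s\n" \
  "https://staging.shop.example.com/api/checkout"
```

Repeat it a few times to see the spread, since only a share of requests is delayed.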
Path D: Metrics Query-Driven
- Start with SLI/SLO violations
- Query APM tools for service latency percentiles (an example query is sketched after this list)
- Correlate with infrastructure metrics
- Build hypothesis from data patterns
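If Prometheus is the metrics backend, the percentile query might look like the sketch below; the metric and label names are assumptions and depend on your instrumentation:

```bash
# p95 latency of the payment service over the last 5 minutes.
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="payment-service"}[5m])) by (le))' \
  | jq '.data.result'
```

Running the same query for checkout-service and payment-service quickly shows whether the slowness originates downstream.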
📝 Expected Alternative Approaches
Engineers might write down:
- "I'd check the database first - it's usually the bottleneck"
- "I'd look at the service mesh metrics instead of APM"
- "I'd run a curl test from inside the cluster"
- "I'd check if it's a DNS resolution issue"
- "I'd look at error rates, not just latency"
🎪 Workshop Flow Integration
Pre-Workshop Setup (Day Before)
- Deploy experiment template in Steadybit
- Verify monitoring coverage exists
- Create shared troubleshooting document
- Set up screen sharing for mob format
During Workshop - Experiment Trigger
```bash
# Facilitator runs this (or clicks in the Steadybit UI):
#   Experiment: "Payment Latency Injection"
#   Duration:   15 minutes (gives buffer for the workshop)
#   Target:     payment-service pods in the staging environment
```
Rotation Strategy
- Driver 1 (2 min): Notices the business impact, opens main dashboard
- Driver 2 (2 min): Drills into service-level metrics, identifies payment service
- Driver 3 (2 min): Investigates payment service health, checks dependencies
- Driver 4 (2 min): Discovers the latency injection through distributed tracing
- Continue rotating until the root cause is identified or the experiment ends
🎓 Learning Outcomes
Technical Skills
- Practice systematic troubleshooting methodology
- Learn different tool approaches (APM, metrics, logs, tracing)
- Understand service dependency investigation
Collaboration Skills
- Experience time-boxed troubleshooting pressure
- See how different engineers approach the same problem
- Practice explaining technical reasoning quickly
System Understanding
- Map service dependencies under stress
- Identify monitoring blind spots
- Understand cascade failure patterns
🛠️ Required Steadybit Setup
```yaml
# Experiment Template
name: "Workshop - Payment Service Latency"
hypothesis: "Checkout survives payment service degradation"
environment: "staging"
lanes:
  - steps:
      - type: "action"
        actionType: "com.steadybit.extension_container.network_delay"
        parameters:
          duration: "10m"
          delay: "2000ms"
          jitter: "1000ms"
        radius:
          targetType: "com.steadybit.extension_container.container"
          predicate:
            operator: "AND"
            predicates:
              - key: "k8s.container.name"
                operator: "EQUALS"
                values: ["payment-service"]
          percentage: 70
```
Monitoring Requirements
- Business KPI dashboard (checkout conversion, revenue)
- Service health dashboard (latency, error rates)
- Infrastructure dashboard (CPU, memory, network)
- Distributed tracing (Jaeger/Zipkin)
- Log aggregation (ELK/Loki)
💡 Pro Tips for Facilitators
- Have a backup plan - If the experiment doesn't trigger obvious symptoms, fall back to pre-recorded screenshots
- Control the chaos - Be ready to stop the experiment if it causes real issues
- Time strictly - Use a visible timer and be firm about the 2-minute rotations
- Capture divergence - The different approaches are the real learning gold
- Follow up - Schedule time to implement identified monitoring gaps
Ready to run this chaos experiment and see how your team troubleshoots under pressure? The real insights come from comparing different engineering mental models!