Logging, Monitoring & Observability
Logs, metrics, traces — the three pillars. RED method. What to alert on.
The Three Pillars of Observability
Observability is the ability to understand what's happening inside your system from its external outputs. Three pillars:
Logs — Discrete events with timestamps. "User 123 logged in at 10:32:15."
Metrics — Numeric measurements over time. "95th percentile response time: 245ms."
Traces — Request flows across services. "Request R1 went through: API→Auth→DB→Cache."
You need all three. Logs tell you WHAT happened. Metrics tell you WHEN and HOW OFTEN. Traces tell you WHERE time was spent.
Structured Logging
Never log plain strings in production. Log structured JSON so logs are searchable and parseable.
Bad:
console.log("User 123 created order 456")
Good:
logger.info({
  event: "order.created",
  userId: "123",
  orderId: "456",
  amount: 49.99,
  currency: "USD",
  duration_ms: 145,
  requestId: "req_abc123"
});
Tools: Pino (Node.js), Zap (Go), structlog (Python).
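A minimal sketch of this with Pino; the logger options, event names, and req.id field are illustrative assumptions, not a required schema:
// Structured logging with Pino (npm install pino).
const pino = require("pino");

const logger = pino({
  level: process.env.LOG_LEVEL || "info", // debug locally, info in production
});

// A child logger stamps shared context (like requestId) onto every line,
// so individual log calls can't forget to include it.
function handleRequest(req) {
  const log = logger.child({ requestId: req.id }); // req.id: whatever ID your framework assigns
  log.info({ event: "order.created", userId: "123", orderId: "456" });
}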
Log levels:
DEBUG — verbose, dev only (never in production)
INFO — normal operations, key events
WARN — unexpected but handled (deprecated API used, retry succeeded)
ERROR — something failed and needs attention
FATAL — system cannot continue, about to crash
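Picking the level in practice, using the Pino logger sketched above (events and values are hypothetical):
const orderId = "456";
logger.debug({ event: "cache.lookup", orderId });               // verbose detail, dev only
logger.info({ event: "order.created", orderId });               // key business event
logger.warn({ event: "payment.retried", orderId, attempt: 2 }); // unexpected but handled
logger.error({ event: "payment.failed", orderId });             // failed, needs attention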
Metrics to Track
The RED Method (for services):
• Rate — requests per second
• Errors — error rate (%)
• Duration — response time (p50, p95, p99)
The USE Method (for resources):
• Utilization — CPU %, memory %, disk %
• Saturation — queue depth, pending requests
• Errors — hardware errors, packet drops
Key metrics for a backend API:
• Request rate by endpoint
• Error rate by status code and endpoint
• p50/p95/p99 latency by endpoint
• DB query duration
• Cache hit/miss ratio
• Queue depth and processing rate
• Active connections
• Memory and CPU usage
Tools: Prometheus + Grafana, DataDog, New Relic, CloudWatch.
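To make RED concrete, here is a sketch for an Express API using prom-client; the metric name, labels, and bucket boundaries are illustrative choices, not standards:
// RED metrics with prom-client (npm install prom-client express).
const express = require("express");
const client = require("prom-client");

client.collectDefaultMetrics(); // CPU, memory, event loop lag (USE-style resource metrics)

const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Request duration by method, route, and status code",
  labelNames: ["method", "route", "status"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2], // spans typical p50 through p99 targets
});

const app = express();

// Middleware: time every request. Rate = count of observations,
// Errors = observations labeled with 5xx status, Duration = the histogram itself.
app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on("finish", () => {
    end({
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status: res.statusCode,
    });
  });
  next();
});

app.get("/metrics", async (req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
The /metrics endpoint is what a Prometheus server scrapes; request rate, error percentage, and latency percentiles are then computed in queries, not in the app itself.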
Distributed Tracing
In a microservices system, a single user request can touch five or more services. Tracing follows it across all of them.
Each request gets a Trace ID (unique per user request) and is broken into spans (timed operations within the trace).
Request → [API Gateway] → [Auth Service] → [Order Service] → [DB]
          span1 (50ms)    span2 (10ms)     span3 (200ms)     span4 (30ms)
A trace visualization shows the total time and exactly where latency lives.
How it works:
1. API Gateway creates Trace ID, injects into request headers
2. Each service extracts Trace ID, creates a child span
3. Spans are sent to a collector (Jaeger, Zipkin, DataDog APM)
4. Collector assembles the full trace
Standard: OpenTelemetry (OTel) has unified this space. Instrument once, export to any backend.
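A sketch with the OpenTelemetry API for Node; it assumes the OTel SDK is initialized elsewhere with an exporter, and saveToDb is a hypothetical downstream call:
// Manual spans with @opentelemetry/api (SDK + exporter setup is assumed).
const { trace, SpanStatusCode } = require("@opentelemetry/api");

const tracer = trace.getTracer("order-service");

// Hypothetical downstream call; with auto-instrumentation enabled,
// a real DB client here would show up as a child span automatically.
async function saveToDb(userId) {
  return { id: "order_1", userId };
}

async function createOrder(userId) {
  // startActiveSpan makes this span the active parent, so spans created
  // inside it (DB clients, outgoing HTTP calls) join the same trace.
  return tracer.startActiveSpan("order.create", async (span) => {
    try {
      span.setAttribute("user.id", userId);
      return await saveToDb(userId);
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}
Propagation across services happens via the W3C traceparent header, which OTel's HTTP instrumentation injects and extracts automatically, so every service's spans land in the same trace.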
Alerting
Monitoring without alerting is useless. Set up alerts for:
- Error rate > 1% for 5 minutes → PagerDuty alert
- p99 latency > 2 seconds → Slack warning
- CPU > 90% for 10 minutes → Scale-up trigger
- DB connection pool exhausted → Immediate alert
- Queue depth > 10,000 → Scale workers alert
- Disk > 85% full → Warning (at 95%: critical)
- Memory leak detected (ever-increasing memory) → Alert
Alert fatigue is real. Only alert on things that require human action. Too many alerts → people start ignoring them.
On-call rotation: Someone is always responsible. Incidents have runbooks. Postmortems are blameless.
The Backend from First Principles series is based on what I learnt from Sriniously's YouTube playlist — a thoughtful, framework-agnostic walk through backend engineering. If this material helped you, please go check the original out: youtube.com/@Sriniously. The notes here are my own restatement for revisiting later.