Logging, Monitoring & Observability
Logs, metrics, traces — the three pillars. RED method. What to alert on.
The Three Pillars of Observability
Observability is the ability to understand what's happening inside your system from its external outputs. Three pillars:
Logs — Discrete events with timestamps. "User 123 logged in at 10:32:15."
Metrics — Numeric measurements over time. "95th percentile response time: 245ms."
Traces — Request flows across services. "Request R1 went through: API→Auth→DB→Cache."
You need all three. Logs tell you WHAT happened. Metrics tell you WHEN and HOW OFTEN. Traces tell you WHERE time was spent.
Structured Logging
Never log plain strings in production. Log structured JSON so logs are searchable and parseable.
Bad:
console.log("User 123 created order 456")
Good:
logger.info({
  event: "order.created",
  userId: "123",
  orderId: "456",
  amount: 49.99,
  currency: "USD",
  duration_ms: 145,
  requestId: "req_abc123"
});
Tools: Pino (Node.js), Zap (Go), structlog (Python).
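A minimal sketch of this with Pino; the logger options, event names, and req.id field are illustrative assumptions, not a required schema:
// Structured logging with Pino (npm install pino).
const pino = require("pino");

const logger = pino({
  level: process.env.LOG_LEVEL || "info", // debug locally, info in production
});

// A child logger stamps shared context (like requestId) onto every line,
// so individual log calls can't forget to include it.
function handleRequest(req) {
  const log = logger.child({ requestId: req.id }); // req.id: whatever ID your framework assigns
  log.info({ event: "order.created", userId: "123", orderId: "456" });
}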
Log levels:
DEBUG — verbose, dev only (never in production)
INFO — normal operations, key events
WARN — unexpected but handled (deprecated API used, retry succeeded)
ERROR — something failed and needs attention
FATAL — system cannot continue, about to crash
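Picking the level in practice, using the Pino logger sketched above (events and values are hypothetical):
const orderId = "456";
logger.debug({ event: "cache.lookup", orderId });               // verbose detail, dev only
logger.info({ event: "order.created", orderId });               // key business event
logger.warn({ event: "payment.retried", orderId, attempt: 2 }); // unexpected but handled
logger.error({ event: "payment.failed", orderId });             // failed, needs attention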
Metrics to Track
The RED Method (for services):
• Rate — requests per second
• Errors — error rate (%)
• Duration — response time (p50, p95, p99)
The USE Method (for resources):
• Utilization — CPU %, memory %, disk %
• Saturation — queue depth, pending requests
• Errors — hardware errors, packet drops
Key metrics for a backend API:
• Request rate by endpoint
• Error rate by status code and endpoint
• p50/p95/p99 latency by endpoint
• DB query duration
• Cache hit/miss ratio
• Queue depth and processing rate
• Active connections
• Memory and CPU usage
Tools: Prometheus + Grafana, DataDog, New Relic, CloudWatch.
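To make RED concrete, here is a sketch for an Express API using prom-client; the metric name, labels, and bucket boundaries are illustrative choices, not standards:
// RED metrics with prom-client (npm install prom-client express).
const express = require("express");
const client = require("prom-client");

client.collectDefaultMetrics(); // CPU, memory, event loop lag (USE-style resource metrics)

const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Request duration by method, route, and status code",
  labelNames: ["method", "route", "status"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2], // spans typical p50 through p99 targets
});

const app = express();

// Middleware: time every request. Rate = count of observations,
// Errors = observations labeled with 5xx status, Duration = the histogram itself.
app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on("finish", () => {
    end({
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status: res.statusCode,
    });
  });
  next();
});

app.get("/metrics", async (req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
The /metrics endpoint is what a Prometheus server scrapes; request rate, error percentage, and latency percentiles are then computed in queries, not in the app itself.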
Distributed Tracing
In a microservices system, a single user request can touch five or more services. Tracing follows it across all of them.
Each request gets a Trace ID (unique per user request) and is broken into spans (timed operations within the trace).
Request → [API Gateway] → [Auth Service] → [Order Service] → [DB]
          span1 (50ms)    span2 (10ms)     span3 (200ms)     span4 (30ms)
A trace visualization shows the total time and exactly where latency lives.
How it works:
1. API Gateway creates Trace ID, injects into request headers
2. Each service extracts Trace ID, creates a child span
3. Spans are sent to a collector (Jaeger, Zipkin, DataDog APM)
4. Collector assembles the full trace
Standard: OpenTelemetry (OTel) has unified this space. Instrument once, export to any backend.
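A sketch with the OpenTelemetry API for Node; it assumes the OTel SDK is initialized elsewhere with an exporter, and saveToDb is a hypothetical downstream call:
// Manual spans with @opentelemetry/api (SDK + exporter setup is assumed).
const { trace, SpanStatusCode } = require("@opentelemetry/api");

const tracer = trace.getTracer("order-service");

// Hypothetical downstream call; with auto-instrumentation enabled,
// a real DB client here would show up as a child span automatically.
async function saveToDb(userId) {
  return { id: "order_1", userId };
}

async function createOrder(userId) {
  // startActiveSpan makes this span the active parent, so spans created
  // inside it (DB clients, outgoing HTTP calls) join the same trace.
  return tracer.startActiveSpan("order.create", async (span) => {
    try {
      span.setAttribute("user.id", userId);
      return await saveToDb(userId);
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}
Propagation across services happens via the W3C traceparent header, which OTel's HTTP instrumentation injects and extracts automatically, so every service's spans land in the same trace.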
Alerting
Monitoring without alerting is useless. Set up alerts for:
- Error rate > 1% for 5 minutes → PagerDuty alert
- p99 latency > 2 seconds → Slack warning
- CPU > 90% for 10 minutes → Scale-up trigger
- DB connection pool exhausted → Immediate alert
- Queue depth > 10,000 → Scale workers alert
- Disk > 85% full → Warning (at 95%: critical)
- Memory leak detected (ever-increasing memory) → Alert
Alert fatigue is real. Only alert on things that require human action. Too many alerts → people start ignoring them.
On-call rotation: Someone is always responsible. Incidents have runbooks. Postmortems are blameless.
The Backend from First Principles series is based on what I learnt from Sriniously's YouTube playlist — a thoughtful, framework-agnostic walk through backend engineering. If this material helped you, please go check the original out: youtube.com/@Sriniously. The notes here are my own restatement for revisiting later.