DevPath · Learn to code

Observability and performance

The three pillars of observability

Monitoring vs. observability

Monitoring is watching signals you already know matter (is the server alive? how much CPU is it using?). Observability is being able to answer questions you had not anticipated from what the system emits. The practical difference: when something breaks at 3 in the morning, do you have enough data to understand why, without deploying new code?

That data rests on three pillars: logs, metrics and traces. They are complementary, not interchangeable.

1. Structured logs

A log is a record of a single event. The classic mistake is writing free text (console.log("user " + id + " logged in")): readable for a person, useless for a machine. A structured log is an object (serialized to JSON) with consistent fields:

{ "level": "info", "message": "login", "userId": 42, "ts": "2026-06-22T10:00:00Z", "requestId": "abc-123" }

This lets you filter and aggregate: "give me every error for userId 42 in the last hour". Each log also carries a level (such as "info" or "error") that indicates its severity, which lets you silence the noise in production by emitting only entries at or above a chosen level.
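A minimal sketch of both ideas: a severity threshold and the "every error for userId 42 in the last hour" query over parsed log lines. The level names and their numeric order, the helper names `shouldLog` and `findEntries`, and the sample lines are assumptions made for illustration; a log platform would normally run this query for you.

```javascript
// Hypothetical severity order; higher numbers are more severe.
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

// True when an entry at `level` should be emitted given the configured minimum.
function shouldLog(level, minLevel) {
  return LEVELS[level] >= LEVELS[minLevel];
}

// Every entry at `level` for one user whose timestamp falls inside the window.
function findEntries(lines, { level, userId, sinceMs, nowMs }) {
  return lines
    .map((line) => JSON.parse(line))
    .filter((e) => e.level === level && e.userId === userId)
    .filter((e) => {
      const t = Date.parse(e.ts);
      return t >= nowMs - sinceMs && t <= nowMs;
    });
}

// Made-up log lines, one JSON object per line.
const lines = [
  '{"level":"info","message":"login","userId":42,"ts":"2026-06-22T10:00:00Z","requestId":"abc-123"}',
  '{"level":"error","message":"payment declined","userId":42,"ts":"2026-06-22T10:05:00Z","requestId":"abc-124"}',
  '{"level":"error","message":"payment declined","userId":7,"ts":"2026-06-22T10:06:00Z","requestId":"abc-125"}',
  '{"level":"error","message":"timeout","userId":42,"ts":"2026-06-22T08:00:00Z","requestId":"abc-090"}',
];

const nowMs = Date.parse("2026-06-22T10:30:00Z");
const oneHourMs = 60 * 60 * 1000;
const hits = findEntries(lines, { level: "error", userId: 42, sinceMs: oneHourMs, nowMs });

console.log(hits.map((e) => e.requestId)); // [ 'abc-124' ]
console.log(shouldLog("debug", "info")); // false
```

With free-text logs the same question would need fragile string matching; with consistent fields it is a plain filter.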

2. Metrics

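A metric is a number aggregated over time, cheap to store (you do not keep every event, just its count or distribution). Common types include counters (a total that only goes up), gauges (a value that goes up and down) and histograms (a distribution of observations grouped into buckets).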
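To make "count or distribution" concrete, here is a minimal sketch of three metric types that are common in metrics libraries: a counter, a gauge and a histogram. The class names, bucket bounds and sample values are hypothetical and do not mirror any specific library's API.

```javascript
// Counter: a total that only ever goes up (for example, requests served).
class Counter {
  constructor() { this.value = 0; }
  inc(n = 1) { this.value += n; }
}

// Gauge: a value that can go up or down (for example, connections currently open).
class Gauge {
  constructor() { this.value = 0; }
  set(v) { this.value = v; }
}

// Histogram: counts observations per bucket instead of storing each one.
class Histogram {
  constructor(upperBounds) {
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    // One slot per bound plus a final overflow slot.
    this.counts = new Array(this.upperBounds.length + 1).fill(0);
  }
  observe(v) {
    const i = this.upperBounds.findIndex((bound) => v <= bound);
    this.counts[i === -1 ? this.upperBounds.length : i] += 1;
  }
}

const requests = new Counter();
const openConnections = new Gauge();
const latencyMs = new Histogram([100, 200, 500]);

for (const ms of [80, 120, 150, 900]) {
  requests.inc();
  latencyMs.observe(ms);
}
openConnections.set(3);

console.log(requests.value); // 4
console.log(openConnections.value); // 3
console.log(latencyMs.counts); // [ 1, 2, 0, 1 ]
```

Note that the histogram keeps four numbers no matter how many requests arrive, which is what makes metrics cheap compared with logs.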

On latency, never look only at the mean: a mean of about 200 ms can hide that 1 in every 20 users waits 3 seconds (19 requests at 50 ms plus one at 3 s average out to roughly 200 ms). That is why percentiles are used. The p95 is the value below which 95% of requests fall: "95% respond in under 200 ms". The p99 captures the tail, the worst experience. Improving the p95/p99 usually matters more than improving the mean.
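A small worked example of how the mean hides the tail. The sample is made up: 95 requests at 50 ms and 5 at 3 seconds, that is, 1 slow request in every 20. The percentile uses the nearest-rank method, which is one of several common definitions.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  const ordered = [...values].sort((a, b) => a - b);
  const i = Math.ceil((p / 100) * ordered.length) - 1;
  return ordered[Math.max(0, i)];
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Made-up sample: 95 fast requests and 5 slow ones.
const sample = [...Array(95).fill(50), ...Array(5).fill(3000)];

console.log("mean:", mean(sample), "ms"); // mean: 197.5 ms
console.log("p50:", percentile(sample, 50), "ms"); // p50: 50 ms
console.log("p99:", percentile(sample, 99), "ms"); // p99: 3000 ms
```

Nobody in this sample actually waited 197.5 ms: the typical user saw 50 ms and the unlucky ones saw 3 seconds, which only the high percentile reveals.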

3. Traces (distributed tracing)

In a system with several services, a single user request crosses many hops: gateway → orders service → database → payments service. A trace follows that request end to end. Each hop is a span with its duration, and they all share the same request id (or trace id) that propagates in the headers. That way you see where the time goes: if the request takes 800 ms, the trace tells you 700 of them went into a slow DB query.
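The "where does the time go" question can be sketched with plain objects. The span shape (`spanId`, `parentId`, `durationMs`) and the numbers are hypothetical, chosen to match the 800 ms example above, and the calculation assumes each span's children run one after another rather than in parallel.

```javascript
// Hypothetical spans for a single request; all share the same traceId.
const spans = [
  { traceId: "abc-123", spanId: "s1", parentId: null, name: "gateway", durationMs: 800 },
  { traceId: "abc-123", spanId: "s2", parentId: "s1", name: "orders service", durationMs: 780 },
  { traceId: "abc-123", spanId: "s3", parentId: "s2", name: "database query", durationMs: 700 },
  { traceId: "abc-123", spanId: "s4", parentId: "s2", name: "payments service", durationMs: 50 },
];

// Self time = a span's duration minus the time spent in its direct children.
function selfTimes(allSpans) {
  return allSpans.map((span) => {
    const childTotal = allSpans
      .filter((s) => s.parentId === span.spanId)
      .reduce((sum, s) => sum + s.durationMs, 0);
    return { name: span.name, selfMs: span.durationMs - childTotal };
  });
}

const bySelfTime = selfTimes(spans).sort((a, b) => b.selfMs - a.selfMs);
console.log(bySelfTime[0]); // { name: 'database query', selfMs: 700 }
```

The gateway span lasts the full 800 ms, but almost all of that is time spent waiting on its children; self time is what points at the database query.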

That same requestId must also go into the logs: it is the thread that sews the three pillars together. With it you jump from "this request was slow" (trace) to "and it also logged this error" (log).
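One way to keep that thread intact, sketched under assumptions: the header name `x-request-id` and the helper names below are hypothetical (many tracing setups use the W3C `traceparent` header instead), and the id generator is passed in so the sketch does not depend on any particular library.

```javascript
// Hypothetical header name used to carry the id between services.
const REQUEST_ID_HEADER = "x-request-id";

// Reuse the incoming id when there is one; otherwise start a new one.
function resolveRequestId(incomingHeaders, generateId) {
  return incomingHeaders[REQUEST_ID_HEADER] ?? generateId();
}

// A logger bound to one request: every entry carries the same requestId.
function requestLogger(requestId) {
  return (level, message, context = {}) =>
    JSON.stringify({ level, message, ts: new Date().toISOString(), requestId, ...context });
}

// Headers for a downstream call, so the next hop joins the same trace.
function outgoingHeaders(requestId, extra = {}) {
  return { ...extra, [REQUEST_ID_HEADER]: requestId };
}

const requestId = resolveRequestId({ "x-request-id": "abc-123" }, () => "generated-id");
const logForRequest = requestLogger(requestId);

console.log(logForRequest("error", "payment declined", { userId: 42 }));
console.log(outgoingHeaders(requestId, { accept: "application/json" }));
```

Because the id is attached once, at the edge of the request, individual log calls cannot forget it.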

What you build on top

Practical rule: instrument the four golden signals first (latency, traffic, errors and saturation). They cover most incidents.
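As a rough sketch, the four signals can be derived from one time window of request records. The record shape, the function name `goldenSignals`, the choice of p95 for latency, the 5xx rule for errors and "in-flight requests over capacity" for saturation are all assumptions for illustration; the sketch also assumes the window contains at least one request.

```javascript
// Summarise one time window of requests into the four golden signals.
function goldenSignals(requests, windowSeconds, inFlight, capacity) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const p95Index = Math.max(0, Math.ceil(0.95 * durations.length) - 1);
  const errors = requests.filter((r) => r.status >= 500).length;
  return {
    latencyP95Ms: durations[p95Index], // latency
    requestsPerSecond: requests.length / windowSeconds, // traffic
    errorRate: errors / requests.length, // errors
    saturation: inFlight / capacity, // saturation
  };
}

// Made-up window: four requests observed over two seconds.
const windowOfRequests = [
  { durationMs: 80, status: 200 },
  { durationMs: 120, status: 200 },
  { durationMs: 150, status: 500 },
  { durationMs: 900, status: 200 },
];

const signals = goldenSignals(windowOfRequests, 2, 45, 50);
console.log(signals);
// { latencyP95Ms: 900, requestsPerSecond: 2, errorRate: 0.25, saturation: 0.9 }
```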

Examples

A structured log in JSON, ready to index

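// Build one structured log entry as a JSON string with consistent fields.
function log(level, message, context = {}) {
  return JSON.stringify({
    level,
    message,
    ts: new Date().toISOString(), // ISO 8601 timestamp in UTC
    ...context, // extra fields such as userId or requestId
  });
}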

console.log(log("error", "payment declined", { userId: 42, requestId: "abc-123" }));

p95: the 95th percentile of a list of latencies (ms)

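// Nearest-rank percentile: the smallest value with at least p% of the samples at or below it.
function percentile(values, p) {
  const ordered = [...values].sort((a, b) => a - b); // numeric sort on a copy
  const i = Math.ceil((p / 100) * ordered.length) - 1; // rank converted to a zero-based index
  return ordered[Math.max(0, i)]; // undefined for an empty list
}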

const latencies = [80, 90, 95, 100, 110, 120, 130, 140, 150, 900];
console.log("p95:", percentile(latencies, 95), "ms"); // the tail shows up