Monitoring vs. observability
Monitoring is watching signals you already know matter (is the server alive? how much CPU is it using?). Observability is being able to answer questions you had not anticipated from what the system emits. The practical difference: when something breaks at 3 in the morning, do you have enough data to understand why, without deploying new code?
That data rests on three pillars: logs, metrics and traces. They are complementary, not interchangeable.
1. Structured logs
A log is a record of a single event. The classic mistake is writing free
text (console.log("user " + id + " logged in")): readable for a person,
useless for a machine. A structured log is an object (serialized to
JSON) with consistent fields:
{ "level": "info", "message": "login", "userId": 42, "ts": "2026-06-22T10:00:00Z", "requestId": "abc-123" }
This lets you filter and aggregate: "give me every error for userId 42
in the last hour". Each log carries a level that indicates its severity and
lets you silence the noise in production:
- debug: internal detail, development only.
- info: normal expected events (a request, a login).
- warn: something odd but recoverable (a retry, a cold cache).
- error: an operation failed and someone should look at it.
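A minimal sketch of how levels silence noise and how consistent fields make logs queryable. The `createLogger` helper, the numeric ranks and the `minLevel` setting are assumptions for illustration, not the API of any real logging library:

```javascript
// Severity ranks; the numbers are an assumption of this sketch.
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

// createLogger is a hypothetical helper, not a real library API.
// It drops entries below minLevel and writes the rest as one JSON line each.
function createLogger(minLevel, write = console.log) {
  return (level, message, context = {}) => {
    if (LEVELS[level] < LEVELS[minLevel]) return; // below the threshold: silenced
    write(JSON.stringify({ level, message, ts: new Date().toISOString(), ...context }));
  };
}

// Collect lines in memory instead of printing, so they can be queried below.
const lines = [];
const log = createLogger("info", (line) => lines.push(line)); // production-like setting

log("debug", "cache lookup", { userId: 42 });      // dropped
log("info", "login", { userId: 42 });              // kept
log("error", "payment declined", { userId: 42 });  // kept
log("error", "payment declined", { userId: 7 });   // kept

// "Give me every error for userId 42 in the last hour."
const oneHourAgo = Date.now() - 60 * 60 * 1000;
const errorsForUser42 = lines
  .map((line) => JSON.parse(line))
  .filter((e) => e.level === "error" && e.userId === 42 && Date.parse(e.ts) >= oneHourAgo);

console.log(errorsForUser42);
```

In a real setup the filtering would happen in the log store rather than in application code; the point is only that the query is trivial once every entry has the same fields.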
2. Metrics
A metric is a number aggregated over time, cheap to store (you do not keep every event, just its count or distribution). Common types:
- Counter: only goes up. "Number of requests", "total errors". You derive the rate (requests per second) from its slope.
- Gauge: goes up and down. "Open connections", "memory used".
- Histogram: groups many measurements into buckets to see their distribution. It is what you use for latency.
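The three types can be sketched in a few lines. The `Counter`, `Gauge` and `Histogram` classes and the `rate` helper below are hypothetical in-memory versions written for illustration; a real service would normally use a metrics library, and the bucket limits are made up:

```javascript
// Hypothetical in-memory metric types, for illustration only.
class Counter {
  constructor() { this.value = 0; }
  inc(n = 1) { this.value += n; } // only ever increases
}

class Gauge {
  constructor() { this.value = 0; }
  set(v) { this.value = v; } // can move in either direction
}

class Histogram {
  // bounds: upper bucket limits in ascending order; one extra bucket catches the rest.
  constructor(bounds) {
    this.bounds = bounds;
    this.counts = new Array(bounds.length + 1).fill(0);
  }
  observe(v) {
    const i = this.bounds.findIndex((b) => v <= b);
    this.counts[i === -1 ? this.bounds.length : i] += 1;
  }
}

// Rate = slope of a counter between two samples, in units per second.
function rate(prev, curr) {
  return (curr.value - prev.value) / ((curr.tsMs - prev.tsMs) / 1000);
}

const requests = new Counter();
const before = { value: requests.value, tsMs: 0 };
requests.inc(300);
const after = { value: requests.value, tsMs: 60000 }; // sampled one minute later
console.log("requests/s:", rate(before, after));

const connections = new Gauge();
connections.set(12);
connections.set(7); // a gauge may go down
console.log("open connections:", connections.value);

const latency = new Histogram([100, 250, 1000]); // bucket limits in ms (assumed)
[80, 90, 120, 240, 900, 3000].forEach((ms) => latency.observe(ms));
console.log("latency buckets:", latency.counts);
```

Note that the histogram keeps only four counts however many requests it observes, which is what makes metrics cheap to store.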
On latency, never look only at the mean: a mean of 100 ms can hide that 1 in every 20 users waits more than a second. That is why percentiles are used. The p95 is the value below which 95% of requests fall: "95% respond in under 200 ms". The p99 captures the tail, the worst experience. Improving the p95/p99 usually matters more than improving the mean.
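To see how a mean can look healthy while the tail suffers, here is a sketch with made-up latencies (19 fast requests and one slow one). It uses the nearest-rank definition of a percentile, which is one of several in use; `nearestRank` is a name chosen for this example:

```javascript
// Made-up sample: 19 requests at 50 ms and one at 1050 ms.
const samples = [...Array(19).fill(50), 1050];

const mean = samples.reduce((sum, ms) => sum + ms, 0) / samples.length;

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function nearestRank(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

console.log("mean:", mean, "ms");
console.log("p50:", nearestRank(samples, 50), "ms");
console.log("p99:", nearestRank(samples, 99), "ms");
```

The mean and the median both look fine here; only the high percentile exposes the request that took over a second.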
3. Traces (distributed tracing)
In a system with several services, a single user request crosses many hops: gateway → orders service → database → payments service. A trace follows that request end to end. Each hop is a span with its duration, and they all share the same request id (or trace id) that propagates in the headers. That way you see where the time goes: if the request takes 800 ms, the trace tells you 700 of them went into a slow DB query.
That same requestId must also go into the logs: it is the thread that
sews the three pillars together. With it you jump from "this request was slow"
(trace) to "and it also logged this error" (log).
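A minimal sketch of that idea, with two simulated hops in one process. The `startSpan` and `logEvent` helpers, the in-memory `spans` and `logs` arrays and the `x-request-id` header name are all assumptions made for illustration; a production system would typically rely on a tracing library:

```javascript
// Hypothetical in-memory stores standing in for a tracing backend and a log store.
const spans = [];
const logs = [];

// Start a span; calling end() records its duration under the shared requestId.
function startSpan(requestId, name) {
  const startedAt = Date.now();
  return {
    end() {
      spans.push({ requestId, name, durationMs: Date.now() - startedAt });
    },
  };
}

function logEvent(level, message, requestId) {
  logs.push({ level, message, requestId, ts: new Date().toISOString() });
}

// The downstream hop reads the id from the incoming headers instead of creating a new one.
// "x-request-id" is an assumed header name for this sketch.
function ordersService(headers) {
  const requestId = headers["x-request-id"];
  const span = startSpan(requestId, "orders-service");
  logEvent("error", "payment declined", requestId);
  span.end();
}

function gateway() {
  const requestId = "abc-123"; // would normally be generated per request
  const span = startSpan(requestId, "gateway");
  ordersService({ "x-request-id": requestId }); // propagate the id to the next hop
  span.end();
  return requestId;
}

const id = gateway();

// Jump from the trace to the logs using the shared id.
console.log(spans.filter((s) => s.requestId === id));
console.log(logs.filter((l) => l.requestId === id));
```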
What you build on top
- Dashboards: charts of the key metrics (error rate, p95, traffic).
- Alerts: rules that warn when a metric crosses a threshold ("p95 > 1 s for 5 min"). A good alert is actionable: if no one is going to do anything when they receive it, it is noise.
- Health checks: an endpoint (/health) that reports whether the service is healthy. The load balancer polls it and stops sending traffic to instances that are down.
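The alert rule and the health check can both be sketched as small pure functions. Everything here is hypothetical: the `shouldFire` and `healthResponse` names, the sample shape, and the choice of 503 for an unhealthy instance (a common convention, not something the text above prescribes):

```javascript
// Alert rule "p95 > 1 s for 5 min": fire only if the threshold has been
// breached continuously for at least forMs, so a single spike pages no one.
// samples: [{ tsMs, p95Ms }] in ascending time order.
function shouldFire(samples, thresholdMs = 1000, forMs = 5 * 60 * 1000) {
  if (samples.length === 0) return false;
  let breachStart = null;
  for (const s of samples) {
    if (s.p95Ms > thresholdMs) {
      if (breachStart === null) breachStart = s.tsMs; // breach begins
    } else {
      breachStart = null; // recovered: reset the timer
    }
  }
  const latest = samples[samples.length - 1];
  return breachStart !== null && latest.tsMs - breachStart >= forMs;
}

// Body for a /health endpoint: 200 when every dependency check passes, 503 otherwise.
function healthResponse(checks) {
  const failing = Object.entries(checks)
    .filter(([, ok]) => !ok)
    .map(([name]) => name);
  const healthy = failing.length === 0;
  return { status: healthy ? 200 : 503, body: { healthy, failing } };
}

const minute = 60 * 1000;
const spike = [
  { tsMs: 0, p95Ms: 800 },
  { tsMs: 1 * minute, p95Ms: 1500 },
  { tsMs: 2 * minute, p95Ms: 700 },
];
const sustained = [0, 1, 2, 3, 4, 5].map((m) => ({ tsMs: m * minute, p95Ms: 1200 }));

console.log("spike fires:", shouldFire(spike));
console.log("sustained fires:", shouldFire(sustained));
console.log(healthResponse({ database: true, cache: false }));
```

Requiring the breach to last the full window is one way to keep the alert actionable; the right duration depends on the service.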
Practical rule: instrument the four golden signals first (latency, traffic, errors and saturation). They cover most incidents.
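As a rough sketch, the four signals can be derived from a window of request records plus one capacity figure. The record shape, the `goldenSignals` name, the use of status codes of 500 and above as "errors", and the connection-pool numbers are assumptions for illustration; the function expects a non-empty window:

```javascript
// Made-up request records observed during a 10-second window.
const windowSeconds = 10;
const records = [
  { durationMs: 80, status: 200 },
  { durationMs: 120, status: 200 },
  { durationMs: 95, status: 500 },
  { durationMs: 900, status: 200 },
];

// Hypothetical connection pool used as the saturation measure.
const usedConnections = 45;
const maxConnections = 50;

function goldenSignals(requests, seconds, used, capacity) {
  const sorted = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank p95 over the window.
  const p95 = sorted[Math.max(Math.ceil(0.95 * sorted.length), 1) - 1];
  const errors = requests.filter((r) => r.status >= 500).length;
  return {
    latencyP95Ms: p95,                         // latency
    trafficPerSecond: requests.length / seconds, // traffic
    errorRate: errors / requests.length,       // errors
    saturation: used / capacity,               // saturation
  };
}

const signals = goldenSignals(records, windowSeconds, usedConnections, maxConnections);
console.log(signals);
```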
Examples
A structured log in JSON, ready to index
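// Build one structured log entry and serialize it as a single JSON line.
function log(level, message, context = {}) {
  return JSON.stringify({
    level,
    message,
    ts: new Date().toISOString(), // ISO 8601 timestamp in UTC
    ...context, // extra fields such as userId or requestId
  });
}

console.log(log("error", "payment declined", { userId: 42, requestId: "abc-123" }));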
p95: the 95th percentile of a list of latencies (ms)
// Nearest-rank percentile: sort a copy, then take the value at rank ceil(p% of n).
function percentile(values, p) {
  const ordered = [...values].sort((a, b) => a - b);
  const i = Math.ceil((p / 100) * ordered.length) - 1;
  return ordered[Math.max(0, i)]; // clamp so p = 0 returns the minimum
}

const latencies = [80, 90, 95, 100, 110, 120, 130, 140, 150, 900];
console.log("p95:", percentile(latencies, 95), "ms"); // p95: 900 ms, the tail shows up