Observability begins as a virtue and often ends as a liability. The same instrumentation that illuminates system behavior at small scale becomes a performance tax—and a storage catastrophe—when traffic multiplies by an order of magnitude. Teams discover this only after their telemetry pipeline starts dropping data, their dashboards time out, and their cloud bill eclipses the cost of the services being monitored.
The architectural problem is deceptive. Metrics, logs, and traces each scale along different dimensions, and each creates distinct failure modes when treated as an afterthought. What works for a dozen services rarely works for several hundred. What captures every request at a thousand RPS collapses at a million.
Designing observability as a first-class subsystem requires the same rigor applied to any distributed system: explicit trade-offs between fidelity and cost, deliberate decisions about what to retain and what to discard, and a clear-eyed view of how telemetry infrastructure itself must be operated. Done well, observability becomes a strategic asset. Done poorly, it becomes the bottleneck it was meant to prevent.
Sampling Strategy Design
Full-fidelity telemetry is a myth at scale. Every high-volume system eventually confronts the same decision: keep everything and drown in cost, or sample intelligently and accept that some signals will be lost. The question is not whether to sample, but how.
Head-based sampling makes its decision at the start of a trace, typically through a probabilistic filter applied at the entry point. It is cheap, stateless, and predictable—properties that make it attractive for high-throughput environments. Its weakness is equally clear: the sampling decision is made before the system knows whether the request is interesting. Rare errors, latency outliers, and unusual code paths are discarded at the same rate as routine traffic.
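A minimal sketch of a head-based sampler illustrates the point. Hashing the trace ID (rather than rolling a fresh random number) keeps the decision deterministic, so every service that sees the same trace reaches the same verdict; the function name and rate are illustrative, not from any particular tracing library.

```python
import hashlib

def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start whether to keep the trace. Hashing the
    trace ID makes the choice stateless and consistent across services,
    but it is made before anything is known about the request."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash onto [0, 1) and compare to the rate.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# A 10% rate keeps roughly one trace in ten, chosen blindly:
kept = sum(head_sample(f"trace-{i}", 0.10) for i in range(10_000))
```

The cheapness is visible: no buffering, no coordination, one hash per trace. So is the weakness: an error trace and a routine trace face identical odds.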
Tail-based sampling inverts the trade-off. The decision is deferred until the trace completes, allowing the collector to retain traces that exhibit errors, exceed latency thresholds, or match specific attributes. The cost is architectural complexity: collectors must buffer complete traces in memory, coordinate across distributed spans, and handle the operational burden of a stateful pipeline.
Mature architectures rarely choose one. They combine aggressive head-based sampling for baseline traffic with targeted tail-based policies for anomalies, supplemented by dynamic sampling that adjusts rates based on current load and signal value.
Takeaway: Sampling is not a loss of information—it is a deliberate redistribution of observability budget toward signals that matter most. The goal is not completeness but representativeness.

Metric Cardinality Control
Metrics appear cheap until they aren't. A single counter is negligible; the same counter labeled with user ID, request path, and region can produce millions of unique time series. This phenomenon—cardinality explosion—is the silent killer of metric systems, degrading ingestion, ballooning storage, and rendering queries unusably slow.
The root cause is usually well-intentioned. Engineers add labels for debuggability, assuming each dimension costs little. In practice, cardinality multiplies: ten values on one label combined with a hundred on another yields a thousand series per metric. Add a third unbounded dimension and the collapse is inevitable.
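The multiplication is worth making concrete. The worst-case series count for a metric is simply the product of each label's distinct-value count; the label names and cardinalities below are hypothetical.

```python
from math import prod

def series_count(label_cardinalities: dict) -> int:
    """Worst-case unique time series for one metric: the product of
    the distinct-value counts of its labels."""
    return prod(label_cardinalities.values())

# Bounded labels stay manageable...
bounded = series_count({"status": 5, "method": 8, "region": 4})
# ...until one unbounded dimension multiplies everything.
exploded = series_count({"status": 5, "method": 8, "region": 4,
                         "user_id": 100_000})
```

Three bounded labels yield 160 series; adding a single user-ID label turns the same metric into sixteen million.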
Disciplined architectures treat label design as a governance concern. High-cardinality attributes—user identifiers, session tokens, full URLs—belong in traces or logs, not metrics. Metrics should capture aggregate behavior across bounded dimensions: status codes, service names, deployment environments. When finer granularity is genuinely needed, exemplars can link aggregate metrics to representative traces, preserving drill-down without paying the cardinality price.
Operationally, this requires tooling: cardinality budgets per service, alerts on series count growth, and periodic audits of label usage. The discipline is cultural as much as technical. Without it, every incident investigation tempts someone to add another label, and the system degrades one well-meaning commit at a time.
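One piece of that tooling can be sketched in a few lines: a budget tracker that counts the distinct label combinations seen per metric and flags offenders. Class, method names, and the limit are illustrative assumptions, not the interface of any real monitoring system.

```python
class CardinalityBudget:
    """Tracks distinct label combinations per metric and flags any
    metric that exceeds its configured series budget."""

    def __init__(self, max_series_per_metric: int = 1000):
        self.limit = max_series_per_metric
        self._seen: dict = {}

    def record(self, metric: str, labels: dict) -> None:
        # A sorted tuple of label pairs identifies one unique series.
        key = tuple(sorted(labels.items()))
        self._seen.setdefault(metric, set()).add(key)

    def over_budget(self) -> list:
        return [m for m, series in self._seen.items()
                if len(series) > self.limit]
```

In practice this check runs in the collection pipeline or as a periodic audit, so a runaway label is caught as an alert rather than as a storage bill.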
Takeaway: Metrics answer how much and how often across populations; they should not try to answer who and what specifically. Respecting that boundary is the difference between a queryable system and an expensive one.
Log Architecture Patterns
Logs are the most abused telemetry signal. They are easy to produce, trivially easy to overproduce, and expensive to retain. A mature log architecture begins with the recognition that not all logs serve the same purpose, and therefore should not be stored or kept for the same duration.
Structured logging is the foundation. Free-text logs resist aggregation and force expensive parsing at query time. Emitting logs as structured events—with consistent fields for service, trace ID, severity, and domain-specific attributes—transforms them from forensic artifacts into queryable data. The incremental cost at write time is trivial; the compounding benefit at read time is enormous.
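A minimal emitter shows what "structured" means in practice. The field names here (ts, severity, service, trace_id) are an illustrative convention, not a standard schema.

```python
import json
import sys
import time

def log_event(severity: str, message: str, *, service: str,
              trace_id: str = None, **fields) -> str:
    """Emit one log line as a single JSON object. Consistent keys are
    what make the stream aggregatable and queryable later."""
    event = {
        "ts": time.time(),
        "severity": severity,
        "service": service,
        "trace_id": trace_id,
        "message": message,
        **fields,  # domain-specific attributes ride along as typed fields
    }
    line = json.dumps(event, separators=(",", ":"))
    print(line, file=sys.stderr)
    return line
```

Because every line is a parseable object, a query like "all ERROR events for service=checkout with this trace_id" becomes a field filter instead of a regular expression over free text.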
Aggregation topology matters next. A common pattern places lightweight collectors adjacent to workloads, forwarding to a regional aggregation tier that handles enrichment, routing, and buffering before persistence. This separation absorbs bursts, isolates failure domains, and allows retention policies to be applied upstream of storage—dropping debug logs that no one will ever query, routing audit logs to compliance-grade storage, and tiering operational logs based on age.
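The routing logic applied in that aggregation tier can be sketched as a single policy function; the severity values, category field, and tier names are hypothetical placeholders for whatever a real pipeline defines.

```python
def route(event: dict):
    """Apply retention policy upstream of storage: drop debug noise,
    send audit events to compliance-grade storage, and route everything
    else to the operational tier. Returns None to drop the event."""
    if event.get("severity") == "DEBUG":
        return None                      # dropped before it costs anything
    if event.get("category") == "audit":
        return "compliance-archive"      # long-horizon, write-once storage
    return "operational-store"           # default hot/warm tier
```

The point of placing this upstream is economic: a debug line dropped at the aggregation tier never touches storage, indexing, or replication.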
Retention strategy is where economics and utility meet. Hot storage for recent data where query latency matters, warm storage for investigation windows measured in weeks, cold storage for compliance horizons measured in years. Each tier has an order-of-magnitude cost difference, and treating them uniformly is the most expensive mistake a team can make.
Takeaway: Logs are not a single workload but three: debugging, auditing, and analytics. Designing for the blend rather than the aggregate is what keeps log systems sustainable over years.
Observability at scale is not a tooling problem; it is an architectural one. The decisions that determine whether a telemetry system supports or strangles its business are made long before the first alert fires—in sampling policies, cardinality conventions, and retention tiers that either compound in value or compound in cost.
The systems that endure treat observability as infrastructure worthy of the same design scrutiny applied to their primary workloads. They accept that fidelity has a price and spend that budget deliberately, favoring representative signals over exhaustive ones.
The architectural principle is simple to state and difficult to practice: observability should scale sublinearly with the system it observes. When telemetry grows as fast as traffic, the architecture has already failed.