Softmax is one of the most ubiquitous operations in modern machine learning, yet its deeper geometric meaning is rarely examined with the rigor it deserves. Most practitioners encounter it as a convenient normalization that converts raw logits into probabilities. But this framing obscures a richer story — one told in the language of affine hyperplanes, log-probability ratios, and the curvature of entropy surfaces.
When we view softmax classification through the lens of geometry, several phenomena that seem disconnected suddenly unify. The linearity of decision boundaries, the effect of temperature on prediction sharpness, and the pathologies of miscalibration all emerge as consequences of how the softmax function maps feature vectors into the probability simplex via log-space arithmetic. This perspective transforms softmax from a computational convenience into a principled geometric operator.
In this article, we develop the geometric interpretation of softmax classification from first principles. We begin with the log-odds structure that induces linear decision boundaries, proceed to temperature scaling as a dilation in log-probability space, and conclude with calibration — the degree to which geometric confidence aligns with empirical frequency. The goal is a unified framework that makes these concepts not just intuitive but mathematically precise, offering researchers and advanced practitioners a deeper vocabulary for reasoning about classification at the output layer.
Log-Odds Geometry: Linear Boundaries in Feature Space
The softmax function for class k over logits z is defined as σ(z)_k = exp(z_k) / Σ_j exp(z_j). The critical observation is what happens when we examine the ratio of two class probabilities. Taking the logarithm of σ(z)_k / σ(z)_m yields z_k − z_m directly. This log-probability ratio — the log-odds — is a linear function of the logits, and when logits are themselves affine functions of the input features, the log-odds become affine in the feature space.
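This identity is easy to verify numerically. A minimal numpy sketch (the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
p = softmax(z)

# The log-probability ratio between classes k and m is exactly z_k - z_m.
k, m = 0, 2
assert np.isclose(np.log(p[k] / p[m]), z[k] - z[m])
```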
This linearity is geometrically significant. The decision boundary between class k and class m is the locus of points where their log-odds equal zero, i.e., z_k = z_m. If z_k = w_k^T x + b_k, then the boundary satisfies (w_k − w_m)^T x + (b_k − b_m) = 0. This is a hyperplane in ℝ^d with normal vector w_k − w_m. For K classes, softmax classification partitions feature space into convex polytopes — one per class — bounded by these pairwise hyperplanes.
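One can check the hyperplane claim directly: projecting any point onto the plane (w_k − w_m)^T x + (b_k − b_m) = 0 should yield equal softmax probabilities for the two classes. A small sketch with randomly drawn weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 3
W = rng.normal(size=(K, d))   # one weight vector per class
b = rng.normal(size=K)

k, m = 0, 1
n = W[k] - W[m]               # normal vector of the k-vs-m boundary
c = b[k] - b[m]

# Project an arbitrary point onto the hyperplane n·x + c = 0.
x = rng.normal(size=d)
x_on = x - (n @ x + c) / (n @ n) * n

z = W @ x_on + b
p = np.exp(z - z.max()); p /= p.sum()
# On the boundary, the two classes receive equal probability.
assert np.isclose(p[k], p[m])
```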
This polytope structure has deep connections to Voronoi tessellations. In general, the decision regions of an affine-softmax classifier form a power diagram — a weighted generalization of the Voronoi diagram in which each cell is the region with minimal "power distance" to a class prototype, with weights determined by the norms of the weight vectors and the biases. When the weight vectors have equal norms and the biases are zero, this reduces to an ordinary Voronoi diagram. The softmax classification map is thus a weighted nearest-centroid classifier operating in the geometry induced by the learned weight matrix.
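The power-diagram correspondence follows from expanding the squared distance: comparing affine scores w_k^T x + b_k across classes is equivalent to comparing power distances ‖x − w_k‖² − (‖w_k‖² + 2b_k), since the ‖x‖² term is shared. A numerical sanity check:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 2, 4
W = rng.normal(size=(K, d))
b = rng.normal(size=K)

for _ in range(100):
    x = rng.normal(size=d)
    linear_winner = np.argmax(W @ x + b)
    # Power-diagram view: each class k is a site W[k] with power weight
    # ||W[k]||^2 + 2*b[k]; the winning cell minimizes the power distance.
    power_dist = np.sum((x - W) ** 2, axis=1) - (np.sum(W**2, axis=1) + 2 * b)
    assert np.argmin(power_dist) == linear_winner
```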
A subtlety worth appreciating: the log-odds formulation reveals that softmax does not treat classes independently. Every decision depends on the relative geometry of weight vectors. Adding a constant vector to all weight vectors leaves the classification unchanged — this is the well-known translational redundancy of softmax parameterization. Geometrically, this means softmax operates on the quotient space of weight configurations modulo uniform translation, and the effective degrees of freedom are (K−1) × d rather than K × d.
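The translational redundancy is straightforward to demonstrate: adding the same vector v to every weight row shifts all logits by v·x, a per-input constant that cancels in the softmax normalization. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
d, K = 5, 3
W = rng.normal(size=(K, d))
b = rng.normal(size=K)
v = rng.normal(size=d)        # an arbitrary shared translation

def probs(W, b, x):
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=d)
# Translating every weight vector by v leaves the output unchanged.
assert np.allclose(probs(W, b, x), probs(W + v, b, x))
```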
Understanding this log-odds geometry clarifies why deep networks place a linear layer before softmax. The nonlinear feature extractor's job is to warp the input space so that classes become linearly separable — so that the polytope partition induced by the final affine-softmax layer correctly assigns regions. The expressiveness of the classifier thus reduces entirely to the quality of the learned representation, while the output layer performs a geometrically constrained linear operation in log-probability space.
Takeaway: Softmax decision boundaries are hyperplanes in feature space because the log-odds between any two classes are affine functions of the input. The entire classification geometry reduces to how well the learned representation makes these linear separations meaningful.
Temperature and Sharpness: Dilation in Log-Probability Space
Temperature scaling replaces the standard softmax with σ(z/T)_k = exp(z_k/T) / Σ_j exp(z_j/T), where T > 0 is the temperature parameter. The geometric effect is elegant: dividing logits by T uniformly dilates log-probability space around the origin. As T → 0, the dilation magnifies differences between logits, concentrating probability mass on the argmax class. As T → ∞, differences shrink, and the distribution approaches uniformity. The decision boundaries themselves — defined by z_k = z_m — are invariant to temperature, because dividing both sides by T preserves equality.
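Both properties, the invariance of the argmax and the sharpening at low T, can be confirmed in a few lines:

```python
import numpy as np

def softmax_T(z, T):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.5, 0.3])
for T in (0.1, 1.0, 10.0):
    p = softmax_T(z, T)
    assert np.argmax(p) == np.argmax(z)   # predicted class is T-invariant

# Low temperature concentrates mass; high temperature flattens it.
assert softmax_T(z, 0.1).max() > softmax_T(z, 10.0).max()
```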
This invariance is the key geometric insight. Temperature does not change which class is predicted; it changes how confidently the prediction is made. In the probability simplex, low temperature pushes the output distribution toward the vertices (one-hot predictions), while high temperature pulls it toward the centroid (uniform distribution). The trajectory of the output as T varies traces a curve on the simplex that approaches the centroid as T → ∞ and a vertex as T → 0.
The connection to entropy is immediate. The entropy H of the softmax output is a concave function on the simplex that achieves its maximum at the centroid and its minimum at the vertices. Temperature scaling smoothly interpolates along this entropy landscape. A short calculation gives ∂H/∂T = Var_p(z)/T³ ≥ 0, where Var_p(z) is the variance of the logits under the current output distribution, so increasing temperature monotonically increases entropy. The same identity shows that the rate of entropy change encodes the logit spread — the larger the variance of the logits under the current distribution, the more sensitive the output is to temperature perturbation.
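The identity ∂H/∂T = Var_p(z)/T³, a standard exponential-family calculation, can be checked against a finite-difference estimate (the logit values and temperature here are arbitrary):

```python
import numpy as np

def softmax_T(z, T):
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

z = np.array([3.0, 1.0, 0.0, -2.0])
T = 1.5
p = softmax_T(z, T)

# Analytic rate: dH/dT = Var_p(z) / T^3, variance taken under p.
var = np.sum(p * z**2) - np.sum(p * z) ** 2
analytic = var / T**3

# Central finite-difference check.
h = 1e-5
numeric = (entropy(softmax_T(z, T + h)) - entropy(softmax_T(z, T - h))) / (2 * h)
assert np.isclose(analytic, numeric, rtol=1e-4)
```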
From an information-geometric perspective, temperature scaling defines a one-parameter family of distributions that forms a curve on the statistical manifold of categorical distributions. The Fisher information metric along this curve quantifies how distinguishable nearby temperatures are. At low temperatures, the Fisher information is high — small changes in T produce statistically distinguishable outputs. At high temperatures, the Fisher information diminishes because the distribution flattens and becomes insensitive to further increases.
Practically, this geometric view explains why temperature is such an effective post-hoc calibration tool. Because it preserves decision boundaries while reshaping confidence, it decouples the accuracy question ("which class?") from the calibration question ("how sure?"). This separation is not an accident — it is a direct consequence of the dilation symmetry in log-probability space. Any calibration method that preserves this symmetry will share this property, and temperature scaling is the simplest non-trivial member of that family.
Takeaway: Temperature scaling is a dilation in log-probability space that reshapes confidence without moving decision boundaries. It decouples what the model predicts from how certain it claims to be — a geometric separation that makes it uniquely suited for post-hoc calibration.
Calibration and Reliability: When Geometry Meets Frequency
A classifier is calibrated when its predicted probabilities match empirical frequencies: among all instances where the model assigns probability 0.8 to a class, roughly 80% should indeed belong to that class. Calibration is fundamentally about the alignment between the geometric confidence encoded by the softmax output and the statistical reality of the data distribution. When these diverge, the model's position on the probability simplex is geometrically consistent but statistically misleading.
Modern deep networks are notoriously miscalibrated, typically exhibiting overconfidence — their softmax outputs cluster near the vertices of the simplex even when accuracy does not warrant such certainty. Geometrically, this means the learned logit magnitudes are too large relative to the true class-separation difficulty. The feature extractor produces representations that are well-separated (high accuracy) but excessively separated (logits scaled beyond what the Bayes-optimal classifier would produce).
The reliability diagram — which plots predicted confidence against observed accuracy in binned intervals — can be understood as a diagnostic of geometric distortion. A perfectly calibrated model produces the identity line. Overconfident models bend the curve below the diagonal: high predicted probabilities correspond to lower-than-expected empirical frequencies. This bending quantifies how far the model's simplex trajectory deviates from the trajectory that a Bayesian oracle would trace.
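The binned statistic behind the reliability diagram is the expected calibration error. A minimal sketch, using synthetic predictions that are calibrated by construction so the ECE should come out near zero:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: |accuracy - confidence| gap per bin, weighted by bin mass."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = np.abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Synthetic predictor whose stated confidence equals its hit rate.
rng = np.random.default_rng(3)
conf = rng.uniform(0.5, 1.0, size=50_000)
correct = (rng.uniform(size=conf.size) < conf).astype(float)
ece = expected_calibration_error(conf, correct)
print(ece)  # small, close to 0
```

An overconfident model would instead show `correct[mask].mean()` falling below `confidences[mask].mean()` in the high-confidence bins, which is exactly the below-diagonal bending described above.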
Temperature scaling corrects this by finding the T that minimizes the expected calibration error (ECE) or negative log-likelihood on a validation set. Because temperature is a single scalar operating on the dilation symmetry, it can correct global miscalibration — a uniform scaling error across the confidence range. But it cannot fix local miscalibration, where different confidence regimes exhibit different biases. For that, richer transformations in log-probability space are needed: vector and matrix scaling introduce affine maps on the logits (the multiclass analogues of binary Platt scaling), and isotonic regression fits a nonparametric monotone correction.
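A toy sketch of the fitting procedure, with synthetic data and a simple grid search standing in for an optimizer: labels are drawn from softmax(z), the "model" emits logits inflated by a factor of 3, and minimizing validation NLL over T should recover a temperature near 3.

```python
import numpy as np

def nll(T, logits, labels):
    # Mean negative log-likelihood of temperature-scaled softmax.
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(4)
N, K = 5000, 5
true_logits = rng.normal(size=(N, K))
p = np.exp(true_logits) / np.exp(true_logits).sum(axis=1, keepdims=True)
labels = np.array([rng.choice(K, p=pi) for pi in p])
model_logits = 3.0 * true_logits          # overconfident by a factor of 3

# 1-D grid search over T; the minimizer should sit near T = 3.
grid = np.linspace(0.5, 6.0, 56)
T_hat = grid[np.argmin([nll(T, model_logits, labels) for T in grid])]
print(T_hat)
```

Because the objective is one-dimensional, a grid or any scalar minimizer suffices; in practice the search is done on a held-out validation split, never on the training set.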
The deeper lesson is that calibration is a property of the mapping from feature-space geometry to simplex geometry. A model can have perfect decision boundaries — placing every hyperplane optimally — and still be catastrophically miscalibrated if the magnitudes are wrong. This dissociation between discrimination and calibration is a structural feature of the softmax parameterization. It suggests that future architectures should consider learning calibrated magnitudes directly, perhaps through loss functions that jointly optimize both the angular geometry of weight vectors (for discrimination) and their norms (for calibration).
Takeaway: Calibration measures whether geometric confidence in log-probability space corresponds to real-world frequency. Perfect decision boundaries do not guarantee calibrated uncertainty — discrimination and calibration are structurally independent properties of the softmax parameterization.
The geometric lens transforms softmax from a computational afterthought into a structured mathematical object. Decision boundaries are hyperplanes in feature space, temperature is a dilation operator in log-probability space, and calibration measures the fidelity of the mapping from learned geometry to statistical reality.
These three perspectives are not independent — they form a coherent framework. The log-odds structure creates the boundaries, temperature controls how aggressively the model commits to one side of those boundaries, and calibration evaluates whether that commitment is warranted by the data. Each layer of analysis adds precision to our understanding of what the output layer actually computes.
For researchers pushing the boundaries of classification architectures, this geometric vocabulary offers more than intuition. It suggests concrete design principles: representations should be learned for linear separability, logit magnitudes should be regularized for calibration, and post-hoc corrections should respect the symmetries of log-probability space. The geometry is not a metaphor — it is the mechanism.