On Boltzmann Rationality

Author

Shenghui Chen

Published

December 18, 2024

Modeling Decision-Making

A decision-maker selects an action \(a_i\) from a set of \(n\) actions \(\{a_1,\dots,a_n\}\), where each action has an associated utility \(U(a_i)\).

Example

A driver is approaching an intersection where the traffic light has just turned yellow. There is a 98% chance of safely crossing (+100), a 2% chance of causing an accident (-200), and a 100% chance of stopping safely (+80). The driver must choose between:

  • \(a_1\): Accelerate to make it through the light: \[ \mathbb{E}[U(a_1)] = (0.98 \cdot 100) + (0.02 \cdot -200) = 94. \]
  • \(a_2\): Decelerate and stop at the light: \[ \mathbb{E}[U(a_2)] = 1.0 \cdot 80 = 80. \]
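
The two expected utilities can be verified with a few lines of Python (a minimal sketch; the probabilities and payoffs are the ones stated above):

```python
# Expected utility of an action given (probability, utility) outcome pairs.
def expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)

# a1: accelerate -- 98% safe crossing (+100), 2% accident (-200).
eu_accelerate = expected_utility([(0.98, 100), (0.02, -200)])
# a2: decelerate and stop -- 100% safe stop (+80).
eu_stop = expected_utility([(1.0, 80)])
```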

Perfect Rationality

In perfect rationality, one always chooses the action with the highest utility: \[ p(a_i) = \begin{cases} 1 & \text{if } a_i = \arg\max_{j} U(a_j), \\ 0 & \text{otherwise}. \end{cases} \]

Example

Under perfect rationality, the driver would always choose \(a_1\) (Accelerate). However, human decision-making often involves uncertainty, emotions, or incomplete information, leading to different choices.
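Perfect rationality is simply a deterministic argmax over expected utilities; a minimal sketch (the action names are illustrative):

```python
# A perfectly rational agent deterministically picks the highest-utility action.
def perfectly_rational_choice(utilities):
    return max(utilities, key=utilities.get)

# Expected utilities from the driver example.
choice = perfectly_rational_choice({"accelerate": 94.0, "stop": 80.0})
```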

Boltzmann Rationality

Definition: Boltzmann Rationality

The action probability is given by: \[ p(a_i) = \frac{\exp(\beta\cdot U(a_i))}{\sum_{j=1}^n\exp(\beta\cdot U(a_j))}. \] Here \(\beta \ge 0\) is the rationality coefficient:

  • \(\beta\to\infty\): Perfect rationality.
  • \(\beta\to 0\): Uniform random choice.

Example

The probability of the driver choosing \(a_1\) (Accelerate) is proportional to the exponential of its utility, \(\exp(\beta\cdot U(a_1))\), so \(a_2\) (Stop) is still chosen with positive probability.
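
Concretely, the driver's choice probabilities can be computed for a few values of \(\beta\) (a sketch using weights \(\exp(\beta\, U)\), consistent with the limits above; the \(\beta\) values are arbitrary):

```python
import math

# Boltzmann (softmax) choice probabilities with rationality coefficient beta.
# Subtracting the max utility avoids overflow without changing the result.
def boltzmann_probs(utilities, beta):
    m = max(utilities)
    weights = [math.exp(beta * (u - m)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Driver example: U(a1) = 94 (accelerate), U(a2) = 80 (stop).
p_random = boltzmann_probs([94, 80], beta=0.0)  # uniform choice
p_soft = boltzmann_probs([94, 80], beta=0.1)    # accelerate favored, stop still likely
p_sharp = boltzmann_probs([94, 80], beta=5.0)   # nearly deterministic
```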

Analogy to Boltzmann Distribution

Boltzmann rationality is analogous to the Boltzmann distribution in statistical mechanics: \[ p_i = \frac{\exp\left(-\frac{\epsilon_i}{\kappa T}\right)}{\sum_{j=1}^n\exp\left(-\frac{\epsilon_j}{\kappa T}\right)}, \] where \(\epsilon_i\) is energy, \(\kappa\) is the Boltzmann constant, and \(T\) is temperature. The correspondence is \(U(a_i) \leftrightarrow -\epsilon_i\) (high utility plays the role of low energy) and \(\beta \leftrightarrow \frac{1}{\kappa T}\), so the perfectly rational limit \(\beta \to \infty\) corresponds to the zero-temperature limit \(T \to 0\).

History

  • Duncan Luce (1959) derived the logit formula from assumptions about the characteristics of choice probabilities, namely the "independence from irrelevant alternatives" property [1, 2].
  • Jacob Marschak (1960) showed that these axioms implied that the model is consistent with utility maximization.
  • Tony Marley developed the relation of the logit formula to the distribution of unobserved utility.
  • Luce and Suppes (1965) showed that the extreme value distribution leads to the logit formula.
  • Daniel McFadden (1974) completed the analysis by showing that the logit formula implies that unobserved utility is distributed extreme value [3].
  • Ziebart et al. (2008) proposed the Boltzmann model of noisily rational behavior in trajectory space [4].

See Chapter 3 (Logit) of [5] for an overview.

Interpretation 1: Solving Max-Entropy Problem

Definition: Max-Entropy Principle

\[ \max_{p(a_i)} \quad \mathbb{E}_{p}[U(a_i)] + H(p) \] \[ \text{s.t.} \quad \sum_{i=1}^n p(a_i) = 1, \] where:

  • \(\mathbb{E}_{p}[U(a_i)] = \sum_{i=1}^n p(a_i) U(a_i)\): Expected utility of the action distribution.
  • \(H(p) = -\sum_{i=1}^n p(a_i) \log p(a_i)\): Entropy of the probability distribution \(p(a_i)\).

This optimization balances exploitation (maximizing utility) and exploration (maximizing entropy). Using Lagrange multipliers, we derive the Boltzmann rationality equation.

Derivation

First, write out the Lagrangian with the multiplier \(\lambda\):

\[ \mathcal{L}(p(a_i), \lambda) = \sum_{i=1}^n p(a_i) U(a_i) - \sum_{i=1}^n p(a_i) \log p(a_i) + \lambda \left( \sum_{i=1}^n p(a_i) - 1 \right) \]

Since this is an equality-constrained, concave maximization problem (and the implicit constraints \(p(a_i) \ge 0\) are inactive at the optimum, because the entropy term forces \(p(a_i) > 0\)), the KKT conditions reduce to the following, which fully characterize the solution:

\[ \frac{\partial \mathcal{L}}{\partial p(a_i)} = U(a_i) - \log p(a_i) - 1 + \lambda = 0 \quad \forall i \tag{stationarity} \]

\[ \frac{\partial \mathcal{L}}{\partial \lambda} = \sum_{i=1}^n p(a_i) - 1 = 0 \tag{primal feasibility} \]

Rearranging the stationarity condition gives \(\log p(a_i) = U(a_i) - 1 + \lambda\).

Exponentiate both sides to solve for \(p(a_i)\):

\[ p(a_i) = \exp\left(U(a_i) + \lambda - 1\right) = e^{\lambda - 1}\cdot \exp\left(U(a_i)\right) \]

To ensure the constraint \(\sum_{i=1}^n p(a_i) = 1\), we need:

\[ \sum_{i=1}^n e^{\lambda - 1}\cdot \exp\left(U(a_i)\right) = 1 \implies e^{\lambda - 1} = \frac{1}{\sum_{i=1}^n \exp\left(U(a_i)\right)}. \]

Thus, the solution is the Boltzmann distribution:

\[ p(a_i) = \frac{\exp\left(U(a_i)\right)}{\sum_{j=1}^n \exp\left(U(a_j)\right)}. \]
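
This solution can be checked numerically: the softmax distribution should satisfy stationarity (the quantity \(U(a_i) - \log p(a_i)\) must equal the same constant \(1 - \lambda\) for every \(i\)) and should score at least as high on \(\mathbb{E}_{p}[U] + H(p)\) as any nearby distribution on the simplex. A sketch with arbitrary utilities:

```python
import math
import random

def softmax(utilities):
    m = max(utilities)  # shift for numerical stability
    w = [math.exp(u - m) for u in utilities]
    z = sum(w)
    return [x / z for x in w]

def objective(p, utilities):
    # The max-entropy objective E_p[U] + H(p).
    expected = sum(pi * ui for pi, ui in zip(p, utilities))
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return expected + entropy

U = [1.0, 0.5, 0.0]  # arbitrary example utilities
p_star = softmax(U)

# Stationarity: U_i - log p_i is the same constant for every action.
residuals = [u - math.log(pi) for u, pi in zip(U, p_star)]

# Optimality: random perturbations inside the simplex never score higher.
rng = random.Random(0)
best = objective(p_star, U)
never_beaten = True
for _ in range(1000):
    q = [max(pi + rng.uniform(-0.05, 0.05), 1e-9) for pi in p_star]
    s = sum(q)
    q = [qi / s for qi in q]
    if objective(q, U) > best + 1e-9:
        never_beaten = False
```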

Interpretation 2: Utility Perturbed by Gumbel Noise

Definition

Let each action's utility \(U(a_i)\) be perturbed by independent and identically distributed (i.i.d.) Gumbel noise \(\varepsilon_i\): \[ \tilde{U}(a_i) = U(a_i) + \varepsilon_i, \quad \varepsilon_i \sim \text{Gumbel}(0, 1). \]

Recall the Gumbel distribution \(\varepsilon_i \sim \text{Gumbel}(0,1)\): \[ \text{PDF: } f(\varepsilon_i) = e^{-\varepsilon_i} e^{-e^{-\varepsilon_i}}, \] \[ \text{CDF: } F(\varepsilon_i) = e^{-e^{-\varepsilon_i}}. \]
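
Gumbel(0, 1) noise is easy to sample by inverting the CDF: setting \(F(\varepsilon) = u\) for \(u \sim \text{Uniform}(0,1)\) gives \(\varepsilon = -\log(-\log u)\). A quick sanity check (the sample size and seed are arbitrary):

```python
import math
import random

# Inverse-CDF sampling: F(e) = exp(-exp(-e))  =>  e = -log(-log(u)).
def sample_gumbel(rng):
    return -math.log(-math.log(rng.random()))

rng = random.Random(42)
samples = [sample_gumbel(rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
# The Gumbel(0, 1) mean is the Euler-Mascheroni constant, about 0.5772.
```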

Derivation

The decision-maker selects the action \(a_i\) that maximizes the perturbed utility \(\tilde{U}(a_i)\). The probability of action \(a_i\) being selected is:

\[ p(a_i) = \Pr\left(a_i = \arg\max_{j} \tilde{U}(a_j)\right) \]

\[ = \Pr\left(\tilde{U}(a_i) \ge \tilde{U}(a_j) \; \forall j \ne i\right). \]

Substituting the perturbed utility \(\tilde{U}(a_i) = U(a_i) + \varepsilon_i\) into the probability condition, we have:

\[ p(a_i) = \Pr\left(U(a_i) + \varepsilon_i \ge U(a_j) + \varepsilon_j, \, \forall j \ne i \right) \]

\[ = \Pr\left(\varepsilon_j \le \varepsilon_i + U(a_i) - U(a_j), \, \forall j \ne i \right). \]

For a fixed \(\varepsilon_i\), each condition is the CDF of \(\varepsilon_j\) evaluated at \(\varepsilon_i + U(a_i) - U(a_j)\):

\[ F(\varepsilon_i + U(a_i) - U(a_j)) = e^{-e^{-(\varepsilon_i + U(a_i) - U(a_j))}}. \]

Since the \(\varepsilon_j\) are independent, the probability that all \(j \ne i\) satisfy this condition is the product of the individual CDFs:

\[ p_{a_i|\varepsilon_i} = \prod_{j\ne i} F(\varepsilon_i+U(a_i)-U(a_j)) = \prod_{j\ne i} e^{-e^{-(\varepsilon_i+U(a_i)-U(a_j))}} \]

Since \(\varepsilon_i\) is not given, the choice probability is the integral of \(p_{a_i|\varepsilon_i}\) over all values of \(\varepsilon_i\) weighted by its density:

\[ p(a_i) = \int \left(\prod_{j\ne i} F(\varepsilon_i+U(a_i)-U(a_j))\right) f(\varepsilon_i) d\varepsilon_i \]

\[ = \int \left(\prod_{j\ne i} e^{-e^{-(\varepsilon_i+U(a_i)-U(a_j))}}\right) e^{-e^{-\varepsilon_i}} d\varepsilon_i \]

With some algebraic manipulation, this integral results in:

\[ p(a_i) = \frac{e^{U(a_i)}}{\sum_{j=1}^n e^{U(a_j)}} = \text{softmax}(U(a_i)). \]
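
The integration can be sketched as follows. Since the \(j = i\) factor would be \(F(\varepsilon_i + U(a_i) - U(a_i)) = e^{-e^{-\varepsilon_i}}\), which is exactly the second factor of the density \(f(\varepsilon_i)\), the product can be extended over all \(j\):

\[ p(a_i) = \int_{-\infty}^{\infty} e^{-\varepsilon_i} \exp\left(-e^{-\varepsilon_i} \sum_{j=1}^n e^{-(U(a_i) - U(a_j))}\right) d\varepsilon_i. \]

Substituting \(t = e^{-\varepsilon_i}\), so that \(dt = -e^{-\varepsilon_i}\, d\varepsilon_i\), and writing \(S = \sum_{j=1}^n e^{U(a_j) - U(a_i)}\):

\[ p(a_i) = \int_0^{\infty} e^{-tS}\, dt = \frac{1}{S} = \frac{e^{U(a_i)}}{\sum_{j=1}^n e^{U(a_j)}}. \]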

The algebra is given in equation (3.10) (pages 85-86) in [5].
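
This equivalence is also easy to verify by simulation: the empirical frequency with which each Gumbel-perturbed utility wins should match the softmax probabilities (the utilities, seed, and trial count below are arbitrary):

```python
import math
import random

def softmax(utilities):
    m = max(utilities)
    w = [math.exp(u - m) for u in utilities]
    z = sum(w)
    return [x / z for x in w]

# Inverse-CDF Gumbel(0, 1) sampling.
def sample_gumbel(rng):
    return -math.log(-math.log(rng.random()))

U = [1.0, 0.5, 0.0]
rng = random.Random(0)
trials = 200_000
counts = [0] * len(U)
for _ in range(trials):
    perturbed = [u + sample_gumbel(rng) for u in U]
    winner = max(range(len(U)), key=lambda i: perturbed[i])
    counts[winner] += 1

empirical = [c / trials for c in counts]
predicted = softmax(U)
```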

References

1. R. D. Luce, Individual Choice Behavior (Wiley, New York, 1959).
2. R. D. Luce, The choice axiom after twenty years. Journal of Mathematical Psychology, 15 (1977) 215–233.
3. D. McFadden, Economic choices. American Economic Review, 91 (2001) 351–378.
4. B. D. Ziebart, A. L. Maas, J. A. Bagnell, A. K. Dey, et al., Maximum entropy inverse reinforcement learning. AAAI (Chicago, IL, USA, 2008), pp. 1433–1438.
5. K. E. Train, Discrete Choice Methods with Simulation (Cambridge University Press, 2009).