Reference: Bishop 1.5

Decision theory

Basic concepts

Consider an input vector $\textbf{x}$ together with a corresponding vector $\textbf{t}$ of target variables. Our goal is to predict $\textbf{t}$ given a new value of $\textbf{x}$.

The joint probability distribution $p(\textbf{x},\textbf{t})$ provides a complete summary of the uncertainty associated with these variables. Determination of $p(\textbf{x},\textbf{t})$ from a set of training data is an example of inference.

In a practical application, we must often make a specific prediction for the value of $\textbf{t}$, or more generally take a specific action based on our understanding of the values $\textbf{t}$ is likely to take, and this aspect is the subject of decision theory.

Decision regions: regions $\mathcal{R}_k$, one per class, into which a decision rule divides the input space; all points in $\mathcal{R}_k$ are assigned to class $\mathcal{C}_k$.
Decision boundaries (or decision surfaces): the boundaries between decision regions.

How?

Minimising the misclassification rate:

$$p(\text{mistake}) = p(\mathbf{x}\in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x}\in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(\mathbf{x},\mathcal{C}_2)\,d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x},\mathcal{C}_1)\,d\mathbf{x}$$

Or, maximising the probability of being correct:

$$p(\text{correct}) = \sum_{k=1}^K p(\mathbf{x}\in \mathcal{R}_k, \mathcal{C}_k) = \sum_{k=1}^K \int_{\mathcal{R}_k} p(\mathbf{x},\mathcal{C}_k)\,d\mathbf{x}$$

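Both criteria lead to the same decision rule: assign each $\mathbf{x}$ to the class with the largest posterior probability $p(\mathcal{C}_k|\mathbf{x})$. This follows because $p(\mathbf{x},\mathcal{C}_k) = p(\mathcal{C}_k|\mathbf{x})p(\mathbf{x})$ and the factor $p(\mathbf{x})$ is common to all classes.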
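As a minimal sketch of this rule at a single input point, the snippet below normalises the joint values $p(\mathbf{x},\mathcal{C}_k)$ into posteriors and picks the largest. The function names `posteriors` and `classify` and the numbers in `joint_at_x` are made up for illustration; they are not from the reference.

```python
def posteriors(joint):
    """Turn joint values p(x, C_k) at a fixed x into posteriors p(C_k | x)."""
    evidence = sum(joint)  # p(x) = sum_k p(x, C_k)
    return [p / evidence for p in joint]


def classify(joint):
    """Return the index k of the class with the largest posterior."""
    post = posteriors(joint)
    return max(range(len(post)), key=lambda k: post[k])


# Hypothetical joint values p(x, C_k) for K = 3 classes at one point x.
joint_at_x = [0.02, 0.06, 0.12]
predicted_class = classify(joint_at_x)
print(predicted_class)
```

Because the normalisation divides every class by the same $p(\mathbf{x})$, taking the largest joint value directly would give the same decision.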

Minimising the expected loss

loss function / cost function: a single, overall measure of loss incurred in taking any of the available decisions or actions. Our goal is to minimise it.

utility function (alternatively): a similar measure whose value we aim to maximise.

loss matrix: $L_{kj}$, the $(k, j)$ element of the loss matrix, is the loss incurred when an input whose true class is $\mathcal{C}_k$ is assigned to class $\mathcal{C}_j$.

We seek to minimise the average loss, where the average is computed with respect to the joint distribution ($p(\textbf{x}, \mathcal{C}_k)$),

$$ \mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x},\mathcal{C}_k)\,d\mathbf{x} $$

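Our goal is to choose the regions $\mathcal{R}_j$ that minimise the expected loss above. Since $p(\mathbf{x},\mathcal{C}_k) = p(\mathcal{C}_k|\mathbf{x})p(\mathbf{x})$ and $p(\mathbf{x})$ is common to all terms, this reduces to assigning each $\mathbf{x}$ to the class $j$ that minimises the quantity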

$$ \sum_k L_{kj} p(\mathcal{C}_k|\mathbf{x}) $$
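The rule of choosing the decision $j$ with the smallest value of $\sum_k L_{kj}\, p(\mathcal{C}_k|\mathbf{x})$ can be sketched as follows. The helper names `expected_losses` and `decide`, the loss values, and the posterior are all illustrative assumptions; rows of `loss` index the true class $k$ and columns index the decision $j$.

```python
def expected_losses(loss, post):
    """For each decision j, compute sum_k loss[k][j] * p(C_k | x)."""
    n_classes = len(post)
    n_decisions = len(loss[0])
    return [
        sum(loss[k][j] * post[k] for k in range(n_classes))
        for j in range(n_decisions)
    ]


def decide(loss, post):
    """Return the decision j with the smallest expected loss."""
    risks = expected_losses(loss, post)
    return min(range(len(risks)), key=lambda j: risks[j])


# Illustrative asymmetric loss: missing class 0 costs far more than a false alarm.
loss = [[0, 1000],
        [1, 0]]
post = [0.01, 0.99]  # hypothetical posteriors p(C_0 | x), p(C_1 | x)

print(expected_losses(loss, post))
print(decide(loss, post))
```

With these numbers the rule picks class 0 even though its posterior is only 0.01, because the expected loss of deciding class 1 is dominated by the large penalty $L_{01}$. With a 0-1 loss matrix the same function falls back to choosing the class with the largest posterior.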

Reject option

In some regions of input space the largest posterior probability $p(\mathcal{C}_k|\mathbf{x})$ is well below one, so the class is uncertain, and it can be appropriate to avoid making a decision on such difficult cases. This is known as the reject option. It can be achieved by introducing a threshold $\theta$ and rejecting those inputs $\mathbf{x}$ for which the largest of the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ is less than or equal to $\theta$.
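A minimal sketch of the reject option, assuming the posteriors for one input are already available as a list. The function name `classify_with_reject` and the use of `None` to signal rejection are choices made for this example only.

```python
def classify_with_reject(post, theta):
    """Return the most probable class index, or None if the input is rejected.

    An input is rejected when its largest posterior is <= theta.
    """
    best = max(range(len(post)), key=lambda k: post[k])
    if post[best] <= theta:
        return None  # too uncertain: defer the decision
    return best


theta = 0.8  # hypothetical threshold
print(classify_with_reject([0.95, 0.05], theta))  # confident case
print(classify_with_reject([0.55, 0.45], theta))  # ambiguous case
```

Raising `theta` towards 1 rejects more inputs. For $K$ classes the largest posterior is always at least $1/K$, so any `theta` below $1/K$ rejects nothing.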