Reference: Bishop 1.5
Consider an input vector $\textbf{x}$ together with a corresponding vector of target variables $\textbf{t}$. Our goal is to predict $\textbf{t}$ given a new value of $\textbf{x}$.
The joint probability distribution $p(\textbf{x},\textbf{t})$ provides a complete summary of the uncertainty associated with these variables. Determination of $p(\textbf{x},\textbf{t})$ from a set of training data is an example of inference.
In a practical application, we must often make a specific prediction for the value of $\textbf{t}$, or more generally take a specific action based on our understanding of the values $\textbf{t}$ is likely to take, and this aspect is the subject of decision theory.
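To minimise the misclassification rate, we divide the input space into decision regions $\mathcal{R}_k$, one per class, such that every point in $\mathcal{R}_k$ is assigned to class $\mathcal{C}_k$. For two classes, a mistake occurs when an input belonging to one class is assigned to the other:
$$p(\text{mistake}) = p(\mathbf{x}\in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x}\in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(\mathbf{x},\mathcal{C}_2)\,d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x},\mathcal{C}_1)\,d\mathbf{x}$$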
Or, equivalently, for the general case of $K$ classes, maximising the probability of being correct:
$$p(\text{correct}) = \sum_{k=1}^K p(\mathbf{x}\in \mathcal{R}_k, \mathcal{C}_k) = \sum_{k=1}^K \int_{\mathcal{R}_k} p(\mathbf{x},\mathcal{C}_k)\,d\mathbf{x}$$
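Both criteria lead to the same rule. Since $p(\mathbf{x},\mathcal{C}_k) = p(\mathcal{C}_k|\mathbf{x})\,p(\mathbf{x})$ and the factor $p(\mathbf{x})$ is common to all classes, each $\mathbf{x}$ should be assigned to the class with the largest posterior probability $p(\mathcal{C}_k|\mathbf{x})$.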
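As a minimal sketch of the maximum-posterior rule, the snippet below computes the posteriors from class-conditional densities and priors via Bayes' theorem and then picks the most probable class. The function names (`posteriors`, `decide_max_posterior`) and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def posteriors(likelihoods, priors):
    """Bayes' theorem: p(C_k|x) = p(x|C_k) p(C_k) / sum_j p(x|C_j) p(C_j)."""
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    return joint / joint.sum(axis=-1, keepdims=True)

def decide_max_posterior(post):
    """Return the index k of the class with the largest posterior p(C_k|x)."""
    return int(np.argmax(post, axis=-1))

# Hypothetical values at a single input x (assumed, not from the text):
# p(x|C_1) = 0.2, p(x|C_2) = 0.6, with priors p(C_1) = 0.7, p(C_2) = 0.3.
post = posteriors([0.2, 0.6], [0.7, 0.3])
decision = decide_max_posterior(post)  # 0-based index into the classes
```

Note that the class with the larger prior is not chosen here, because the likelihood of the second class is high enough to outweigh it.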
loss function / cost function: a single, overall measure of loss incurred in taking any of the available decisions or actions. Our goal is to minimise it.
utility function (alternatively): a similar measure whose value we aim to maximise.
loss matrix: the element $L_{kj}$ of the loss matrix is the loss incurred when an input whose true class is $\mathcal{C}_k$ is assigned to class $\mathcal{C}_j$.
We seek to minimise the average loss, where the average is computed with respect to the joint distribution $p(\textbf{x}, \mathcal{C}_k)$:
$$ \mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x},\mathcal{C}_k)\,d\mathbf{x} $$
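Our goal is to choose the regions $\mathcal{R}_j$ that minimise the expected loss above. Using $p(\mathbf{x},\mathcal{C}_k) = p(\mathcal{C}_k|\mathbf{x})\,p(\mathbf{x})$ and dropping the common factor $p(\mathbf{x})$, this reduces to assigning each $\mathbf{x}$ to the class $j$ that minimises the quantity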
$$ \sum_k L_{kj} p(\mathcal{C}_k|\mathbf{x}) $$
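The minimum-expected-loss rule can be sketched as a single matrix-vector product followed by an `argmin`. The function name `decide_min_expected_loss`, the loss matrix, and the posterior values below are assumptions made for illustration. The loss matrix is deliberately asymmetric to show that the decision can differ from the maximum-posterior choice.

```python
import numpy as np

def decide_min_expected_loss(post, loss):
    """Pick the class j that minimises sum_k L[k, j] * p(C_k|x).

    post: shape (K,), posterior probabilities p(C_k|x).
    loss: shape (K, K), loss[k, j] = loss when true class k is assigned to class j.
    """
    post = np.asarray(post, dtype=float)
    loss = np.asarray(loss, dtype=float)
    expected = post @ loss  # expected[j] = sum_k post[k] * loss[k, j]
    return int(np.argmin(expected)), expected

# Hypothetical two-class problem (assumed numbers): missing class 0 is
# 1000 times more costly than a false alarm for class 0.
loss = [[0.0, 1000.0],
        [1.0, 0.0]]
post = [0.01, 0.99]

j, expected = decide_min_expected_loss(post, loss)
```

With these numbers the expected losses are 0.99 for deciding class 0 and 10.0 for deciding class 1, so class 0 is chosen even though its posterior is only 0.01.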
In regions of input space where the largest posterior probability is well below one, the class membership is uncertain, and it can be appropriate to avoid making a decision on such difficult cases. This is known as the reject option. It can be achieved by introducing a threshold $\theta$ and rejecting those inputs $\mathbf{x}$ for which the largest of the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ is less than or equal to $\theta$.
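A minimal sketch of the reject option follows. The sentinel value `REJECT = -1` and the function name `decide_with_reject` are hypothetical choices, and the posterior values are made up for illustration.

```python
import numpy as np

REJECT = -1  # hypothetical sentinel meaning "no decision made"

def decide_with_reject(post, theta):
    """Return the most probable class index, or REJECT if its posterior <= theta."""
    post = np.asarray(post, dtype=float)
    k = int(np.argmax(post))
    return REJECT if post[k] <= theta else k

uncertain = decide_with_reject([0.55, 0.45], theta=0.8)  # largest posterior is below theta
confident = decide_with_reject([0.95, 0.05], theta=0.8)  # largest posterior exceeds theta
```

Setting $\theta = 1$ rejects every input, while for $K$ classes any $\theta < 1/K$ rejects none, because the largest posterior is always at least $1/K$.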