Reference: Bishop 1.2
Probability provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition.
(skipping quick intro of probability, sum rule, product rule, joint probability, marginal probability, conditional probability, Bayes' theorem, prior probability, posterior probability, independent)
Note that if $x$ is a discrete variable, then $p(x)$ is sometimes called a probability mass function because it can be regarded as a set of "probability masses" concentrated at the allowed values of $x$.
If the probability of a real-valued variable $x$ falling in the interval $(x, x+\delta x)$ is given by $p(x)\delta x$ for $\delta x \rightarrow 0$, then $p(x)$ is called the probability density over $x$.
The probability that $x$ will lie in an interval $(a,b)$ is then given by
$$ p(x\in (a,b)) = \int_a^b p(x) dx $$
The cumulative distribution function is defined as the probability that $x$ lies in the interval $(-\infty,z)$:
$$ P(z) = \int_{-\infty}^z p(x) dx $$
which satisfies $P'(x) = p(x)$.
The sum and product rules of probability, as well as Bayes' theorem, apply equally to the case of probability densities, or to combinations of discrete and continuous variables.
Under a nonlinear change of variable, a probability density transforms differently from a simple function, due to the Jacobian factor. For instance, if we consider a change of variables $x=g(y)$, then a function $f(x)$ becomes $\tilde{f}(y)=f(g(y))$. Now consider a probability density $p_x(x)$ that corresponds to a density $p_y(y)$ with respect to the new variable $y$, where the subscripts denote the fact that $p_x(x)$ and $p_y(y)$ are different densities. Observations falling in the range $(x,x+\delta x)$ will, for small values of $\delta x$, be transformed into the range $(y,y+\delta y)$, where $p_x(x)\delta x \simeq p_y(y)\delta y$, and hence
$$ p_y(y) = p_x(x)\left|\frac{dx}{dy}\right| = p_x(g(y))|g'(y)| $$
One consequence of this property is that the concept of the maximum of a probability density is dependent on the choice of variable.
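As a quick numerical check of this transformation rule (a sketch, assuming NumPy and SciPy are available): take $x\sim\mathcal{N}(0,1)$ with the change of variables $x=g(y)=\ln y$, so that $p_y(y)$ should be the standard log-normal density.

```python
import numpy as np
from scipy.stats import lognorm

# If x ~ N(0, 1) and x = g(y) = ln(y) (i.e. y = exp(x)),
# then p_y(y) = p_x(ln y) * |d(ln y)/dy| = p_x(ln y) / y.
def p_x(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def p_y(y):
    return p_x(np.log(y)) / y        # Jacobian factor |g'(y)| = 1/y

y = np.linspace(0.1, 5.0, 50)
print(np.allclose(p_y(y), lognorm.pdf(y, s=1.0)))   # True: matches the log-normal pdf
```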
For a discrete distribution, $\mathbb{E}[f]=\sum_x p(x)f(x)$.
For a continuous distribution, $\mathbb{E}[f]=\int p(x)f(x)dx$.
In either case, if we are given a finite number $N$ of points drawn from the probability distribution or probability density, then the expectation can be approximated as a finite sum over these points, $\mathbb{E}[f]\simeq \frac{1}{N}\sum_{n=1}^N f(x_n)$.
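A minimal sketch of this finite-sample approximation (assuming NumPy): estimate $\mathbb{E}[x^2]$, which equals 1 under a standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # samples drawn from p(x) = N(0, 1)
f = x**2                           # f(x) = x^2, so E[f] = 1 exactly
print(f.mean())                    # finite-sum approximation, close to 1
```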
$\mathbb{E}_x[f(x,y)]$ denotes the average of the function $f(x,y)$ with respect to the distribution of $x$. Note that this will be a function of $y$.
For conditional expectations, $\mathbb{E}_x[f|y]=\sum_x p(x|y)f(x)$ for discrete variables, and $\mathbb{E}_x[f|y]=\int p(x|y)f(x)\,dx$ for continuous variables.
The variance of $f(x)$ is defined by
$$ \begin{aligned} var[f] &= \mathbb{E}[(f(x)-\mathbb{E}[f(x)])^2] \\ &= \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2 \end{aligned} $$
and provides a measure of how much variability there is in $f(x)$ around its mean value $\mathbb{E}[f(x)]$.
If we consider the variance of the variable $x$ itself, then it's $var[x]=\mathbb{E}[x^2]-\mathbb{E}[x]^2$.
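A small numerical check of this identity on simulated data (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)   # var[x] should be about 4
print(np.mean(x**2) - np.mean(x)**2)               # E[x^2] - E[x]^2
print(np.var(x))                                   # same quantity, via np.var
```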
For two random variables $x$ and $y$, the covariance is defined by
$$ \begin{aligned} cov[x,y] &= \mathbb{E}_{x,y} [\{x-\mathbb{E}[x]\} \{y-\mathbb{E}[y]\}] \\ &= \mathbb{E}[xy]-\mathbb{E}[x]\mathbb{E}[y] \end{aligned} $$
which expresses the extent to which $x$ and $y$ vary together. If $x$ and $y$ are independent, then their covariance vanishes.
In the case of two vectors of random variables $x$ and $y$, the covariance is a matrix
$$ \begin{aligned} cov[\mathbf{x},\mathbf{y}] &= \mathbb{E}_{\mathbf{x},\mathbf{y}} [\{\mathbf{x}-\mathbb{E}[\mathbf{x}]\} \{\mathbf{y}^T - \mathbb{E}[\mathbf{y}^T]\}] \\ &= \mathbb{E}_{\mathbf{x},\mathbf{y}} [\mathbf{x}\mathbf{y}^T] - \mathbb{E}[\mathbf{x}]\mathbb{E}[\mathbf{y}^T] \end{aligned} $$
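A sketch (NumPy assumed) of the special case $cov[\mathbf{x},\mathbf{x}]$ for a two-component vector, comparing the definition above with NumPy's built-in estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(50_000)
y = 0.5 * x + rng.standard_normal(50_000)     # y is correlated with x
X = np.stack([x, y])                          # rows are the components of the vector

# Sample version of E[x x^T] - E[x] E[x]^T for the stacked 2-D vector
mean = X.mean(axis=1, keepdims=True)
print((X @ X.T) / X.shape[1] - mean @ mean.T)
print(np.cov(X, bias=True))                   # same matrix (bias=True normalises by N)
```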
We refer to the above view of probabilities in terms of the frequencies of random, repeatable events as the classical or frequentist interpretation of probability. Now we turn to the more general Bayesian view, in which probabilities provide a quantification of uncertainty.
$$ p(\mathbf{w}|\mathcal{D}) = \frac{p(\mathcal{D}|\mathbf{w})p(\mathbf{w})}{p(\mathcal{D})} $$
$$ \text{posterior} \propto \text{likelihood} \times \text{prior} $$
The denominator in Bayes' theorem is a normalisation constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one.
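A minimal sketch of posterior $\propto$ likelihood $\times$ prior evaluated on a grid, assuming a Bernoulli likelihood for some hypothetical coin-flip data and a uniform prior over a single parameter $w$ (NumPy assumed):

```python
import numpy as np

# Grid over a single parameter w in [0, 1]
w = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(w)                                  # uniform prior p(w)

D = np.array([1, 0, 1, 1, 0, 1, 1, 1])                   # hypothetical observed flips
likelihood = w**D.sum() * (1 - w)**(len(D) - D.sum())    # p(D|w) for i.i.d. Bernoulli data

unnormalised = likelihood * prior
posterior = unnormalised / np.trapz(unnormalised, w)     # dividing by p(D) normalises it
print(w[np.argmax(posterior)])                           # posterior mode, here 0.75
```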
In both the Bayesian and frequentist paradigms, the likelihood function $p(\mathcal{D}|\mathbf{w})$ plays a central role.
There is no unique frequentist, or even Bayesian, viewpoint.
The normal / Gaussian distribution for a single real-valued variable $x$
$$ \mathcal{N}(x|\mu,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}} \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\} $$
is governed by two parameters: $\mu$, called the mean, and $\sigma^2$, called the variance. The square root of the variance, $\sigma$, is called the standard deviation, and the reciprocal of the variance, written $\beta=\frac{1}{\sigma^2}$, is called the precision.
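A small sketch (NumPy assumed): evaluate $\mathcal{N}(x|\mu,\sigma^2)$ directly from the formula and check that it integrates to one.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) for a single real-valued variable x."""
    return np.exp(-0.5 * (x - mu)**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

x = np.linspace(-10.0, 10.0, 10_001)
print(np.trapz(gaussian_pdf(x, mu=1.0, sigma2=2.0), x))   # approximately 1.0
```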
The Gaussian distribution defined over a $D$-dimensional vector $\mathbf{x}$ of continuous variables is given by
$$ \mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{\frac{D}{2}}} \frac{1}{|\boldsymbol{\Sigma}|^{\frac{1}{2}}} \exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} $$
where the $D$-dimensional vector $\boldsymbol{\mu}$ is called the mean, the $D\times D$ matrix $\boldsymbol{\Sigma}$ is called the covariance, and $|\boldsymbol{\Sigma}|$ denotes the determinant of $\boldsymbol{\Sigma}$.
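A corresponding sketch for the multivariate density (NumPy assumed); `mvn_pdf` here is just an illustrative helper, not a library function:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) for a D-dimensional vector x."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)            # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi)**(D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))
```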
We shall determine values for the unknown parameters $\mu$ and $\sigma^2$ in the Gaussian by maximising the likelihood function. In practice, it is more convenient to maximise the log of the likelihood function.
$$ \ln p(\mathbf{X}|\mu,\sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^N (x_n-\mu)^2-\frac{N}{2}\ln\sigma^2- \frac{N}{2}\ln(2\pi) $$
Maximising the log likelihood above with respect to $\mu$ and $\sigma^2$, we obtain the solutions
$$ \mu_{ML} = \frac{1}{N}\sum_{n=1}^N x_n, \qquad \sigma_{ML}^2 = \frac{1}{N}\sum_{n=1}^N (x_n-\mu_{ML})^2 $$
Note that $\sigma_{ML}^2$ is the sample variance measured with respect to the sample mean $\mu_{ML}$.
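A minimal illustration (NumPy assumed), fitting $\mu_{ML}$ and $\sigma_{ML}^2$ to data simulated from a known Gaussian:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=1_000)   # data drawn from N(2, 1.5^2)

mu_ml = x.mean()                                 # (1/N) sum_n x_n
sigma2_ml = np.mean((x - mu_ml)**2)              # (1/N) sum_n (x_n - mu_ML)^2
print(mu_ml, sigma2_ml)                          # roughly 2.0 and 2.25
```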
Limitations: the maximum likelihood approach systematically underestimates the variance of the distribution; in fact $\mathbb{E}[\sigma_{ML}^2] = \frac{N-1}{N}\sigma^2$, so the underestimate only becomes negligible for large $N$. This is an example of a phenomenon called bias and is related to the problem of over-fitting.
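A quick simulation of this bias (NumPy assumed): averaging $\sigma_{ML}^2$ over many small datasets gives roughly $\frac{N-1}{N}\sigma^2$ rather than $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials, true_var = 5, 200_000, 1.0

data = rng.standard_normal((trials, N))          # many datasets of size N
mu_ml = data.mean(axis=1, keepdims=True)
sigma2_ml = np.mean((data - mu_ml)**2, axis=1)   # ML variance for each dataset

print(sigma2_ml.mean())                          # about 0.8, not 1.0
print((N - 1) / N * true_var)                    # the predicted (N-1)/N factor
```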