Reference: Bishop 1.2
Probability provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition.
(skipping quick intro of probability, sum rule, product rule, joint probability, marginal probability, conditional probability, Bayes' theorem, prior probability, posterior probability, independent)
Note that if $x$ is a discrete variable, then $p(x)$ is sometimes called a probability mass function because it can be regarded as a set of "probability masses" concentrated at the allowed values of $x$.
If the probability of a real-valued variable $x$ falling in the interval $(x, x+\delta x)$ is given by $p(x)\delta x$ for $\delta x \rightarrow 0$, then $p(x)$ is called the probability density over $x$.
The probability that $x$ will lie in an interval $(a,b)$ is then given by
$$ p(x\in (a,b)) = \int_a^b p(x) dx $$
The cumulative distribution function is defined as the probability that $x$ lies in the interval $(-\infty,z)$:
$$ P(z) = \int_{-\infty}^z p(x) dx $$
which satisfies $P'(x) = p(x)$.
The sum and product rules of probability, as well as Bayes' theorem, apply equally to the case of probability densities, or to combinations of discrete and continuous variables.
Under a nonlinear change of variable, a probability density transforms differently from a simple function, due to the Jacobian factor. For instance, if we consider a change of variables $x=g(y)$, then a function $f(x)$ becomes $\tilde{f}(y)=f(g(y))$. Now consider a probability density $p_x(x)$ that corresponds to a density $p_y(y)$ with respect to the new variable $y$, where the subscripts denote the fact that $p_x(x)$ and $p_y(y)$ are different densities. Observations falling in the range $(x,x+\delta x)$ will, for small values of $\delta x$, be transformed into the range $(y,y+\delta y)$, where $p_x(x)\delta x \simeq p_y(y)\delta y$, and hence
$$ p_y(y) = p_x(x)\left|\frac{dx}{dy}\right| = p_x(g(y))|g'(y)| $$
One consequence of this property is that the concept of the maximum of a probability density is dependent on the choice of variable.
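As a quick numerical check of this transformation rule (a sketch, assuming NumPy and SciPy are available): take $x\sim\mathcal{N}(0,1)$ with the change of variables $x=g(y)=\ln y$, so that $p_y(y)$ should be the standard log-normal density.

```python
import numpy as np
from scipy.stats import lognorm

# If x ~ N(0, 1) and x = g(y) = ln(y) (i.e. y = exp(x)),
# then p_y(y) = p_x(ln y) * |d(ln y)/dy| = p_x(ln y) / y.
def p_x(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def p_y(y):
    return p_x(np.log(y)) / y        # Jacobian factor |g'(y)| = 1/y

y = np.linspace(0.1, 5.0, 50)
print(np.allclose(p_y(y), lognorm.pdf(y, s=1.0)))   # True: matches the log-normal pdf
```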
For a discrete distribution, $\mathbb{E}[f]=\sum_x p(x)f(x)$.
For a continuous distribution, $\mathbb{E}[f]=\int p(x)f(x)dx$.
In either case, if we are given a finite number $N$ of points drawn from the probability distribution or probability density, then the expectation can be approximated as a finite sum over these points, $\mathbb{E}[f]\simeq \frac{1}{N}\sum_{n=1}^N f(x_n)$.
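A minimal sketch of this finite-sample approximation (assuming NumPy): estimate $\mathbb{E}[x^2]$, which equals 1 under a standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # samples drawn from p(x) = N(0, 1)
f = x**2                           # f(x) = x^2, so E[f] = 1 exactly
print(f.mean())                    # finite-sum approximation, close to 1
```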
$\mathbb{E}_x[f(x,y)]$ denotes the average of the function $f(x,y)$ with respect to the distribution of $x$. Note that this will be a function of $y$.
For conditional expectations, $\mathbb{E}_x[f|y]=\sum_x p(x|y)f(x)$ for discrete variables, and $\mathbb{E}_x[f|y]=\int p(x|y)f(x)\,dx$ for continuous variables.
The variance of $f(x)$ is defined by
$$ \begin{aligned} var[f] &= \mathbb{E}[(f(x)-\mathbb{E}[f(x)])^2] \\ &= \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2 \end{aligned} $$
and provides a measure of how much variability there is in $f(x)$ around its mean value $\mathbb{E}[f(x)]$.
If we consider the variance of the variable $x$ itself, then it's $var[x]=\mathbb{E}[x^2]-\mathbb{E}[x]^2$.
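A small numerical check of this identity on simulated data (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)   # var[x] should be about 4
print(np.mean(x**2) - np.mean(x)**2)               # E[x^2] - E[x]^2
print(np.var(x))                                   # same quantity, via np.var
```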
For two random variables $x$ and $y$, the covariance is defined by
$$ \begin{aligned} cov[x,y] &= \mathbb{E}_{x,y} [\{x-\mathbb{E}[x]\} \{y-\mathbb{E}[y]\}] \\ &= \mathbb{E}[xy]-\mathbb{E}[x]\mathbb{E}[y] \end{aligned} $$
which expresses the extent to which $x$ and $y$ vary together. If $x$ and $y$ are independent, then their covariance vanishes.
In the case of two vectors of random variables $x$ and $y$, the covariance is a matrix
$$ \begin{aligned} cov[\mathbf{x},\mathbf{y}] &= \mathbb{E}_{\mathbf{x},\mathbf{y}} [\{\mathbf{x}-\mathbb{E}[\mathbf{x}]\} \{\mathbf{y}^T - \mathbb{E}[\mathbf{y}^T]\}] \\ &= \mathbb{E}_{\mathbf{x},\mathbf{y}} [\mathbf{x}\mathbf{y}^T] - \mathbb{E}[\mathbf{x}]\mathbb{E}[\mathbf{y}^T] \end{aligned} $$
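A sketch (NumPy assumed) of the special case $cov[\mathbf{x},\mathbf{x}]$ for a two-component vector, comparing the definition above with NumPy's built-in estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(50_000)
y = 0.5 * x + rng.standard_normal(50_000)     # y is correlated with x
X = np.stack([x, y])                          # rows are the components of the vector

# Sample version of E[x x^T] - E[x] E[x]^T for the stacked 2-D vector
mean = X.mean(axis=1, keepdims=True)
print((X @ X.T) / X.shape[1] - mean @ mean.T)
print(np.cov(X, bias=True))                   # same matrix (bias=True normalises by N)
```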
We refer to the above view of probabilities in terms of the frequencies of random, repeatable events as the classical or frequentist interpretation of probability. Now we turn to the more general Bayesian view, in which probabilities provide a quantification of uncertainty.
$$ p(\mathbf{w}|\mathcal{D}) = \frac{p(\mathcal{D}|\mathbf{w})p(\mathbf{w})}{p(\mathcal{D})} $$
$$ \text{posterior} \propto \text{likelihood} \times \text{prior} $$
The denominator in Bayes' theorem is a normalisation constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one.
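A minimal sketch of posterior $\propto$ likelihood $\times$ prior evaluated on a grid, assuming a Bernoulli likelihood for some hypothetical coin-flip data and a uniform prior over a single parameter $w$ (NumPy assumed):

```python
import numpy as np

# Grid over a single parameter w in [0, 1]
w = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(w)                                  # uniform prior p(w)

D = np.array([1, 0, 1, 1, 0, 1, 1, 1])                   # hypothetical observed flips
likelihood = w**D.sum() * (1 - w)**(len(D) - D.sum())    # p(D|w) for i.i.d. Bernoulli data

unnormalised = likelihood * prior
posterior = unnormalised / np.trapz(unnormalised, w)     # dividing by p(D) normalises it
print(w[np.argmax(posterior)])                           # posterior mode, here 0.75
```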
In both the Bayesian and frequentist paradigms, the likelihood function $p(\mathcal{D}|\mathbf{w})$ plays a central role.
There is no unique frequentist, or even Bayesian, viewpoint.
The normal / Gaussian distribution for a single real-valued variable $x$
$$ \mathcal{N}(x|\mu,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}} \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\} $$
is governed by two parameters: $\mu$, called the mean, and $\sigma^2$, called the variance. The square root of the variance, $\sigma$, is called the standard deviation, and the reciprocal of the variance, written $\beta=\frac{1}{\sigma^2}$, is called the precision.
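A small sketch (NumPy assumed): evaluate $\mathcal{N}(x|\mu,\sigma^2)$ directly from the formula and check that it integrates to one.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) for a single real-valued variable x."""
    return np.exp(-0.5 * (x - mu)**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

x = np.linspace(-10.0, 10.0, 10_001)
print(np.trapz(gaussian_pdf(x, mu=1.0, sigma2=2.0), x))   # approximately 1.0
```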
The Gaussian distribution defined over a $D$-dimensional vector $\mathbf{x}$ of continuous variables is given by
$$ \mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{\frac{D}{2}}} \frac{1}{|\boldsymbol{\Sigma}|^{\frac{1}{2}}} \exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} $$
where the $D$-dimensional vector $\boldsymbol{\mu}$ is called the mean, the $D\times D$ matrix $\boldsymbol{\Sigma}$ is called the covariance, and $|\boldsymbol{\Sigma}|$ denotes the determinant of $\boldsymbol{\Sigma}$.
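A corresponding sketch for the multivariate density (NumPy assumed); `mvn_pdf` here is just an illustrative helper, not a library function:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) for a D-dimensional vector x."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)            # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi)**(D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))
```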
We shall determine values for the unknown parameters $\mu$ and $\sigma^2$ in the Gaussian by maximising the likelihood function. In practice, it is more convenient to maximise the log of the likelihood function.
$$ \ln p(\mathbf{X}|\mu,\sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^N (x_n-\mu)^2-\frac{N}{2}\ln\sigma^2- \frac{N}{2}\ln(2\pi) $$
Maximising the log likelihood above with respect to $\mu$ and $\sigma^2$, we obtain the solutions
$$ \mu_{ML} = \frac{1}{N}\sum_{n=1}^N x_n, \qquad \sigma_{ML}^2 = \frac{1}{N}\sum_{n=1}^N (x_n-\mu_{ML})^2 $$
Note that $\sigma_{ML}^2$ is the sample variance measured with respect to the sample mean $\mu_{ML}$.
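A minimal illustration (NumPy assumed), fitting $\mu_{ML}$ and $\sigma_{ML}^2$ to data simulated from a known Gaussian:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=1_000)   # data drawn from N(2, 1.5^2)

mu_ml = x.mean()                                 # (1/N) sum_n x_n
sigma2_ml = np.mean((x - mu_ml)**2)              # (1/N) sum_n (x_n - mu_ML)^2
print(mu_ml, sigma2_ml)                          # roughly 2.0 and 2.25
```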
Limitations: the maximum likelihood approach systematically underestimates the variance of the distribution; in fact $\mathbb{E}[\sigma_{ML}^2] = \frac{N-1}{N}\sigma^2$, so the underestimate only becomes negligible for large $N$. This is an example of a phenomenon called bias and is related to the problem of over-fitting.
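A quick simulation of this bias (NumPy assumed): averaging $\sigma_{ML}^2$ over many small datasets gives roughly $\frac{N-1}{N}\sigma^2$ rather than $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials, true_var = 5, 200_000, 1.0

data = rng.standard_normal((trials, N))          # many datasets of size N
mu_ml = data.mean(axis=1, keepdims=True)
sigma2_ml = np.mean((data - mu_ml)**2, axis=1)   # ML variance for each dataset

print(sigma2_ml.mean())                          # about 0.8, not 1.0
print((N - 1) / N * true_var)                    # the predicted (N-1)/N factor
```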