Reference: Bishop 1.1, 1.3, 1.4
We shall fit the data using a polynomial function of the form
$$ y(x,\mathbf{w}) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j=0}^M w_j x^j $$
where $M$ is the order of the polynomial, and $x^j$ denotes $x$ raised to the power of $j$. The polynomial coefficients $w_0, ..., w_M$ are collectively denoted by the vector $\mathbf{w}$.
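For concreteness, a minimal NumPy sketch of evaluating this model (the name `poly_predict` is illustrative, not from Bishop):

```python
import numpy as np

def poly_predict(x, w):
    """Evaluate y(x, w) = sum_j w_j * x**j for a scalar or 1-D array x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # Columns are the powers x**0, x**1, ..., x**M
    powers = np.vander(x, len(w), increasing=True)
    return powers @ np.asarray(w, dtype=float)
```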
An error function measures the misfit between the function $y(x, \mathbf{w})$, for any given value of $\mathbf{w}$, and the training set data points.
One simple choice is given by the sum of the squares of the errors between the predictions $y(x_n,\mathbf{w})$ for each data point $x_n$ and the corresponding target values $t_n$, so that we minimise
$$ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^N \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 $$
where the factor of $\frac{1}{2}$ is included for later convenience.
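A sketch of this error function, building on the `poly_predict` helper above:

```python
def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    residuals = poly_predict(x, w) - np.asarray(t, dtype=float)
    return 0.5 * np.sum(residuals ** 2)
```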
Since the error function is a quadratic function of the coefficients $\mathbf{w}$, its derivatives with respect to the coefficients will be linear in the elements of $\mathbf{w}$, and so the minimisation of the error function has a unique solution, denoted by $\mathbf{w}^*$, which can be found in closed form.
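Concretely, writing the design matrix $\Phi$ with entries $\Phi_{nj} = x_n^j$, setting $\nabla E(\mathbf{w}) = \Phi^T(\Phi\mathbf{w} - \mathbf{t}) = 0$ gives the normal equations $\Phi^T\Phi\,\mathbf{w}^* = \Phi^T\mathbf{t}$. A minimal sketch of solving them (via `np.linalg.lstsq`, which is numerically preferable to forming $\Phi^T\Phi$ explicitly):

```python
def fit_polynomial(x, t, M):
    """Closed-form minimiser w* of E(w) for a polynomial of order M."""
    Phi = np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)  # Phi[n, j] = x_n ** j
    # Least-squares solution of Phi @ w = t (equivalent to the normal equations)
    w_star, *_ = np.linalg.lstsq(Phi, np.asarray(t, dtype=float), rcond=None)
    return w_star
```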
The problem of choosing the right order $M$ involves an important concept called model comparison or model selection.
Note that it is sometimes more convenient to use the root-mean-square (RMS) error defined by
$$ E_{RMS} = \sqrt {2E(\mathbf{w}^*)/N} $$
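The division by $N$ allows data sets of different sizes to be compared on an equal footing, and the square root ensures that $E_{RMS}$ is measured on the same scale (and in the same units) as the target variable $t$. A sketch, continuing the helpers above:

```python
def rms_error(w, x, t):
    """E_RMS = sqrt(2 * E(w*) / N), on the same scale as the targets."""
    N = np.asarray(t).shape[0]
    return np.sqrt(2.0 * sum_of_squares_error(w, x, t) / N)
```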
As $M$ increases, the magnitude of the coefficients typically gets larger.
For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases.
Put another way: the larger the data set, the more complex (in other words, more flexible) the model we can afford to fit to the data.
One technique to control the over-fitting phenomenon is regularisation, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values. The simplest example (sum of squares of all of the coefficients) leads to a modified error function of the form:
$$ \tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^N \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \frac{\lambda}{2} ||\mathbf{w}||^2 $$
where $||\mathbf{w}||^2 = \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + ... + w_M^2$, and the coefficient $\lambda$ governs the relative importance of the regularisation term compared with the sum-of-squares error term.
Techniques such as this are known in the statistics literature as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regulariser is called ridge regression. In the context of neural networks, this approach is known as weight decay.
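Setting the gradient of $\tilde{E}$ to zero gives $(\Phi^T\Phi + \lambda I)\mathbf{w}^* = \Phi^T\mathbf{t}$, so the regularised solution is also available in closed form. A sketch, under the same assumptions as the helpers above:

```python
def fit_polynomial_ridge(x, t, M, lam):
    """Closed-form minimiser of the regularised error E~(w)."""
    Phi = np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)   # regularised normal equations
    b = Phi.T @ np.asarray(t, dtype=float)
    return np.linalg.solve(A, b)
```

For example, the value $\ln\lambda = -18$ used in Bishop's $M = 9$ example corresponds to calling this with `lam=np.exp(-18)`.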
The severe difficulty that can arise in spaces of many dimensions is sometimes called the curse of dimensionality.