Reference: Bishop 1.1, 1.3, 1.4
We shall fit the data using a polynomial function of the form
$$ y(x,\mathbf{w}) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j=0}^M w_j x^j $$
where $M$ is the order of the polynomial, and $x^j$ denotes $x$ raised to the power of $j$. The polynomial coefficients $w_0, ..., w_M$ are collectively denoted by the vector $\mathbf{w}$.
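For concreteness, a minimal NumPy sketch of evaluating this model (the name `poly_predict` is illustrative, not from Bishop):

```python
import numpy as np

def poly_predict(x, w):
    """Evaluate y(x, w) = sum_j w_j * x**j for a scalar or 1-D array x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # Columns are the powers x**0, x**1, ..., x**M
    powers = np.vander(x, len(w), increasing=True)
    return powers @ np.asarray(w, dtype=float)
```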
An error function measures the misfit between the function $y(x, \mathbf{w})$, for any given value of $\mathbf{w}$, and the training set data points.
One simple choice is given by the sum of the squares of the errors between the predictions $y(x_n,\mathbf{w})$ for each data point $x_n$ and the corresponding target values $t_n$, so that we minimise
$$ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^N \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 $$
where the factor of $\frac{1}{2}$ is included for later convenience.
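A sketch of this error function, building on the `poly_predict` helper above:

```python
def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    residuals = poly_predict(x, w) - np.asarray(t, dtype=float)
    return 0.5 * np.sum(residuals ** 2)
```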
Since the error function is a quadratic function of the coefficients $\mathbf{w}$, its derivatives with respect to the coefficients will be linear in the elements of $\mathbf{w}$, and so the minimisation of the error function has a unique solution, denoted by $\mathbf{w}^*$, which can be found in closed form.
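Concretely, writing the design matrix $\Phi$ with entries $\Phi_{nj} = x_n^j$, setting $\nabla E(\mathbf{w}) = \Phi^T(\Phi\mathbf{w} - \mathbf{t}) = 0$ gives the normal equations $\Phi^T\Phi\,\mathbf{w}^* = \Phi^T\mathbf{t}$. A minimal sketch of solving them (via `np.linalg.lstsq`, which is numerically preferable to forming $\Phi^T\Phi$ explicitly):

```python
def fit_polynomial(x, t, M):
    """Closed-form minimiser w* of E(w) for a polynomial of order M."""
    Phi = np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)  # Phi[n, j] = x_n ** j
    # Least-squares solution of Phi @ w = t (equivalent to the normal equations)
    w_star, *_ = np.linalg.lstsq(Phi, np.asarray(t, dtype=float), rcond=None)
    return w_star
```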
The problem of choosing the right order $M$ involves an important concept called model comparison or model selection.
Note that it is sometimes more convenient to use the root-mean-square (RMS) error defined by
$$ E_{RMS} = \sqrt {2E(\mathbf{w}^*)/N} $$
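The division by $N$ allows data sets of different sizes to be compared on an equal footing, and the square root ensures that $E_{RMS}$ is measured on the same scale (and in the same units) as the target variable $t$. A sketch, continuing the helpers above:

```python
def rms_error(w, x, t):
    """E_RMS = sqrt(2 * E(w*) / N), on the same scale as the targets."""
    N = np.asarray(t).shape[0]
    return np.sqrt(2.0 * sum_of_squares_error(w, x, t) / N)
```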
As $M$ increases, the magnitude of the coefficients typically gets larger.
For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases.
Put another way: the larger the data set, the more complex (in other words, more flexible) the model we can afford to fit to the data.
One technique to control the over-fitting phenomenon is regularisation, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values. The simplest example (sum of squares of all of the coefficients) leads to a modified error function of the form:
$$ \tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^N \left\{ y(x_n, \mathbf{w}) - t_n \right\}^2 + \frac{\lambda}{2} ||\mathbf{w}||^2 $$
where $||\mathbf{w}||^2 = \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + ... + w_M^2$, and the coefficient $\lambda$ governs the relative importance of the regularisation term compared with the sum-of-squares error term.
Techniques such as this are known in the statistics literature as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regulariser is called ridge regression. In the context of neural networks, this approach is known as weight decay.
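Setting the gradient of $\tilde{E}$ to zero gives $(\Phi^T\Phi + \lambda I)\mathbf{w}^* = \Phi^T\mathbf{t}$, so the regularised solution is also available in closed form. A sketch, under the same assumptions as the helpers above:

```python
def fit_polynomial_ridge(x, t, M, lam):
    """Closed-form minimiser of the regularised error E~(w)."""
    Phi = np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)   # regularised normal equations
    b = Phi.T @ np.asarray(t, dtype=float)
    return np.linalg.solve(A, b)
```

For example, the value $\ln\lambda = -18$ used in Bishop's $M = 9$ example corresponds to calling this with `lam=np.exp(-18)`.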
The severe difficulty that can arise in spaces of many dimensions is sometimes called the curse of dimensionality.