Bayesian Curve Fitting and Gaussian Processes
So far, we have examined least squares regression in terms of empirical error minimization. The same problem can also be viewed from a probabilistic perspective. Let the input values be x_1, ..., x_N and the corresponding target values be t_1, ..., t_N. We fit a function y(x; w) to the data, and probabilistically we can write
p(t_n | x_n, w, β) = N(t_n | y(x_n, w), inv(β))

Assuming the data points are drawn independently, the likelihood of the whole data set is

p(t | X, w, β) = Π_n N(t_n | y(x_n, w), inv(β))

Taking the log of the likelihood function, we have

ln p(t | X, w, β) = -β/2 Σ_n (t_n - y(x_n, w))^2 + N/2 ln(β) - N/2 ln(2π)

Hence, maximizing the likelihood with respect to w (equivalently, minimizing the negative log likelihood) is the same as minimizing the sum-of-squares error, which is exactly our least squares formulation. This is known as the Maximum Likelihood or ML formulation.
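To make the ML formulation concrete, here is a minimal matlab sketch (not part of the original notes; the sine data and the polynomial degree are arbitrary choices for illustration). The ML weights come straight out of the ordinary least squares solve, and the ML noise precision is the inverse of the mean squared residual.

% Toy data: noisy samples of a sine curve
N = 25;
x = linspace(0, 1, N)';
t = sin(2*pi*x) + 0.2*randn(N, 1);

% Design matrix for a degree-M polynomial y(x, w) = sum_j w_j x^j
M = 3;
Phi = zeros(N, M+1);
for j = 0:M
    Phi(:, j+1) = x.^j;
end

% ML estimate of w: identical to the least squares solution
w_ml = Phi \ t;

% ML estimate of the noise precision beta
beta_ml = N / sum((t - Phi*w_ml).^2);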
Further, we can improve our estimate of w by considering a prior on w of the form

p(w | α) = N(w | 0, inv(α) I) = (α / 2π)^((M+1)/2) exp{ -α w'w / 2 }

By Bayes' rule, the posterior over w is proportional to the likelihood times the prior:

p(w | X, t, α, β) ∝ p(t | X, w, β) p(w | α)

Maximizing this posterior is the Maximum A Posteriori or MAP formulation. Taking the negative log, the MAP formulation leads to minimization of

β/2 Σ_n (t_n - y(x_n, w))^2 + α w'w / 2

This is the same objective as ridge regression, so the MAP formulation amounts to adding an L2 regularization term to the ML estimate.
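Reusing Phi, t, M and N from the sketch above, the MAP estimate has the familiar ridge regression closed form (the values of alpha and beta below are arbitrary illustrative choices):

% MAP / ridge regression estimate of w
alpha = 5e-3;
beta = 11;
w_map = (alpha*eye(M+1) + beta*(Phi'*Phi)) \ (beta*Phi'*t);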
The MAP formulation, however, still yields a single point estimate of w. The true Bayesian setting requires integrating over w:

p(t | x, X, t) = ∫ p(t | x, w) p(w | X, t) dw

Carrying out this integration (see PRML, Section 1.2.6), the resulting predictive distribution is a Gaussian, p(t | x, X, t) = N(t | m(x), s^2(x)), with parameters

m(x) = β φ(x)' S Σ_n φ(x_n) t_n
s^2(x) = inv(β) + φ(x)' S φ(x)
inv(S) = α I + β Σ_n φ(x_n) φ(x_n)'

where φ(x) is the vector of basis functions evaluated at x.
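Continuing the polynomial sketch from above (again an illustration, not code from the notes), the predictive mean and variance can be evaluated at test inputs as follows:

% Fully Bayesian predictive distribution for the polynomial model
S = inv(alpha*eye(M+1) + beta*(Phi'*Phi));
x_test = linspace(0, 1, 100)';
m_pred = zeros(100, 1);
s2_pred = zeros(100, 1);
for i = 1:100
    phi = (x_test(i).^(0:M))';   % basis vector at the test input
    m_pred(i) = beta * phi' * S * (Phi' * t);
    s2_pred(i) = 1/beta + phi' * S * phi;
end

Note that the predictive variance now has two parts: the noise inv(β) and the uncertainty in w itself.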
Gaussian Processes:
Applying the concept of kernels to the framework of Bayesian fitting, we arrive at Gaussian processes. The Gaussian process framework is in fact closely related to kernel ridge regression. A Gaussian process, however, operates directly in function space: it is a distribution over functions, without explicit use of the weight vector w, and is therefore non-parametric. Going back to the regression problem, the target variable t is the value of a function y(x) plus additive noise, and in Gaussian process regression this noise is assumed to be Gaussian.
p(t | y) = N(t | y, inv(β) I)
p(y) = N(y | 0, K)
p(t) = ∫ p(t | y) p(y) dy = N(t | 0, C), where C_ij = K_ij + inv(β) δ_ij

When we see the (n+1)-th example, the joint distribution over the first n+1 targets is p(t_{n+1}) = N(t_{n+1} | 0, C_{n+1}).
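As an aside, the prior p(y) = N(y | 0, K) really is a distribution over functions: drawing a sample from it produces an entire function evaluated at the chosen inputs. Here is a minimal sketch, assuming a squared-exponential kernel (this is an illustration, not the kerrbf helper used in the code below):

% Draw three sample functions from a GP prior with a squared-exponential kernel
n = 50;
x = linspace(-5, 5, n)';
sig2 = 1.0;                                % kernel length-scale parameter
D2 = bsxfun(@minus, x, x').^2;             % matrix of squared distances
K = exp(-D2 / (2*sig2));
L = chol(K + 1e-8*eye(n), 'lower');        % small jitter for numerical stability
samples = L * randn(n, 3);                 % each column is one sample function
figure; plot(x, samples);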
The conditional distribution of t_{n+1} given the observed targets t is again a Gaussian, with mean and variance

m(x_{n+1}) = k' inv(C_n) t
σ^2(x_{n+1}) = k(x_{n+1}, x_{n+1}) + inv(β) - k' inv(C_n) k

where k is the vector of kernel values k(x_i, x_{n+1}) between the training inputs and the new input. The real hurdle for Gaussian processes is learning the right parameters of the covariance function. This can be done by maximizing the log likelihood of the data given the parameters. We have
log p(t | θ) = -1/2 log|C_n| - 1/2 t' inv(C_n) t - N/2 log(2π)

This can be maximized with respect to the covariance parameters θ using gradient-based optimization (a sketch of one gradient step is given after the code below). Finally, the matlab code for Gaussian process regression is as follows:
% Load the hahn1 data set (thermal expansion of copper): inputs temp, targets thermex
load hahn1;
n = size(temp, 1);
X = temp;
Y = thermex;

% Initial hyperparameter values and search step sizes
isig2 = 7000;
ibeta = 1e-3;
stepsig = 100;
stepbeta = 0.1;

% minimize_hyperparameters, create_kernel and kerrbf are helper functions
% (RBF kernel, kernel matrix construction and log-likelihood maximization)
[sig2, beta] = minimize_hyperparameters(X, Y, @kerrbf, isig2, ibeta, stepsig, stepbeta);
covK = create_kernel(@kerrbf, X, sig2);

% Test inputs and the inverse of the covariance matrix C = K + beta*I
ind = 1:1:900;
sz = size(ind, 2);
invK = inv(covK + eye(n)*beta);

f_x = zeros(1, sz);
var_x = zeros(1, sz);
for i = 1:sz
    % kernel vector between the test input and all training inputs
    k = zeros(n, 1);
    for j = 1:n
        k(j) = kerrbf(ind(i), X(j), sig2);
    end
    % predictive mean and variance (the variance follows the formula given above)
    f_x(i) = k'*invK*Y;
    var_x(i) = kerrbf(ind(i), ind(i), sig2) + beta - k'*invK*k;
end

% Plot the data, the predictive mean and a two-standard-deviation band
figure; hold on;
plot(X, Y, '+');
plot(ind, f_x, 'g.');
plot(ind, f_x + 2*sqrt(var_x), 'r');
plot(ind, f_x - 2*sqrt(var_x), 'r');

Here is a figure showing the converged results.
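For the hyperparameter search itself (the minimize_hyperparameters call above), the gradient of the log likelihood with respect to a covariance parameter θ is

∂/∂θ log p(t | θ) = -1/2 tr( inv(C_n) ∂C_n/∂θ ) + 1/2 t' inv(C_n) (∂C_n/∂θ) inv(C_n) t

Here is a minimal sketch of one gradient ascent step for the length-scale sig2 of a squared-exponential kernel; it is an illustrative assumption, not the actual implementation behind minimize_hyperparameters.

% One gradient ascent step on log p(t | theta) for the length-scale sig2
function sig2 = gp_loglik_step(x, t, sig2, beta, rate)
    n = size(x, 1);
    D2 = bsxfun(@minus, x, x').^2;         % squared distances between inputs
    K = exp(-D2 / (2*sig2));
    C = K + (1/beta)*eye(n);               % C = K + inv(beta)*I
    dC = K .* D2 / (2*sig2^2);             % derivative dC/dsig2
    Cinv = inv(C);
    a = Cinv * t;
    grad = -0.5*trace(Cinv*dC) + 0.5*(a'*dC*a);
    sig2 = sig2 + rate*grad;               % gradient ascent update
end

In practice this update is iterated (or replaced by a standard quasi-Newton or conjugate-gradient optimizer) until the log likelihood converges.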
Note that there is a very nice matlab package GPML which can be used.
Helpful Links: Pattern Recognition and Machine Learning, Chapter 1
Learning with Gaussian Processes - Rasmussen