Bayesian Curve Fitting and Gaussian Processes
So far, we have examined least squares regression in terms of empirical error minimization. The same problem can also be viewed from a probabilistic perspective. Let the input values be x_1, ..., x_N and the corresponding target values be t_1, ..., t_N. We fit a function y(x; w) to the data, and probabilistically we can write
p(t_n | x_n, w, β) = N(t_n | y(x_n, w), inv(β))

Assuming the data points are drawn independently, the likelihood of the whole data set is

p(t | X, w, β) = Π_n N(t_n | y(x_n, w), inv(β))

Taking the log of the likelihood function, we have

ln p(t | X, w, β) = -β/2 Σ_n (t_n - y(x_n, w))^2 + N/2 ln(β) - N/2 ln(2π)

Hence, maximizing the likelihood with respect to w (equivalently, minimizing the negative log likelihood) is the same as minimizing the sum-of-squares error, which is exactly our least squares formulation. This is known as the Maximum Likelihood or ML formulation.
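To make the ML formulation concrete, here is a minimal matlab sketch (not part of the original notes; the sine data and the polynomial degree are arbitrary choices for illustration). The ML weights come straight out of the ordinary least squares solve, and the ML noise precision is the inverse of the mean squared residual.

% Toy data: noisy samples of a sine curve
N = 25;
x = linspace(0, 1, N)';
t = sin(2*pi*x) + 0.2*randn(N, 1);

% Design matrix for a degree-M polynomial y(x, w) = sum_j w_j x^j
M = 3;
Phi = zeros(N, M+1);
for j = 0:M
    Phi(:, j+1) = x.^j;
end

% ML estimate of w: identical to the least squares solution
w_ml = Phi \ t;

% ML estimate of the noise precision beta
beta_ml = N / sum((t - Phi*w_ml).^2);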
Further, we can improve our estimate of w by considering a prior on w of the form

p(w | α) = N(w | 0, inv(α) I) = (α / 2π)^((M+1)/2) exp{ -α w'w / 2 }

By Bayes' rule, the posterior over w is proportional to the likelihood times the prior:

p(w | X, t, α, β) ∝ p(t | X, w, β) p(w | α)

Maximizing this posterior is the Maximum A Posteriori or MAP formulation. Taking the negative log, the MAP formulation leads to minimization of

β/2 Σ_n (t_n - y(x_n, w))^2 + α w'w / 2

This is the same objective as ridge regression, so the MAP formulation amounts to adding an L2 regularization term to the ML estimate.
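Reusing Phi, t, M and N from the sketch above, the MAP estimate has the familiar ridge regression closed form (the values of alpha and beta below are arbitrary illustrative choices):

% MAP / ridge regression estimate of w
alpha = 5e-3;
beta = 11;
w_map = (alpha*eye(M+1) + beta*(Phi'*Phi)) \ (beta*Phi'*t);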
The MAP formulation, however, still yields a single point estimate of w. The true Bayesian setting requires integrating over w:

p(t | x, X, t) = ∫ p(t | x, w) p(w | X, t) dw

Carrying out this integration (see PRML, Section 1.2.6), the resulting predictive distribution is a Gaussian, p(t | x, X, t) = N(t | m(x), s^2(x)), with parameters

m(x) = β φ(x)' S Σ_n φ(x_n) t_n
s^2(x) = inv(β) + φ(x)' S φ(x)
inv(S) = α I + β Σ_n φ(x_n) φ(x_n)'

where φ(x) is the vector of basis functions evaluated at x.
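Continuing the polynomial sketch from above (again an illustration, not code from the notes), the predictive mean and variance can be evaluated at test inputs as follows:

% Fully Bayesian predictive distribution for the polynomial model
S = inv(alpha*eye(M+1) + beta*(Phi'*Phi));
x_test = linspace(0, 1, 100)';
m_pred = zeros(100, 1);
s2_pred = zeros(100, 1);
for i = 1:100
    phi = (x_test(i).^(0:M))';   % basis vector at the test input
    m_pred(i) = beta * phi' * S * (Phi' * t);
    s2_pred(i) = 1/beta + phi' * S * phi;
end

Note that the predictive variance now has two parts: the noise inv(β) and the uncertainty in w itself.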
Gaussian Processes:
Applying the concept of kernels to the framework of Bayesian fitting, we arrive at Gaussian processes. The Gaussian process framework is in fact closely related to kernel ridge regression. A Gaussian process, however, operates directly in function space: it is a distribution over functions, without explicit use of the weight vector w, and is therefore non-parametric. Going back to the regression problem, the target variable t is the value of a function y(x) plus additive noise, and in Gaussian process regression this noise is assumed to be Gaussian.
p(t | y) = N(t | y, inv(β) I)
p(y) = N(y | 0, K)
p(t) = ∫ p(t | y) p(y) dy = N(t | 0, C), where C_ij = K_ij + inv(β) δ_ij

When we see the (n+1)-th example, the joint distribution over the first n+1 targets is p(t_{n+1}) = N(t_{n+1} | 0, C_{n+1}).
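As an aside, the prior p(y) = N(y | 0, K) really is a distribution over functions: drawing a sample from it produces an entire function evaluated at the chosen inputs. Here is a minimal sketch, assuming a squared-exponential kernel (this is an illustration, not the kerrbf helper used in the code below):

% Draw three sample functions from a GP prior with a squared-exponential kernel
n = 50;
x = linspace(-5, 5, n)';
sig2 = 1.0;                                % kernel length-scale parameter
D2 = bsxfun(@minus, x, x').^2;             % matrix of squared distances
K = exp(-D2 / (2*sig2));
L = chol(K + 1e-8*eye(n), 'lower');        % small jitter for numerical stability
samples = L * randn(n, 3);                 % each column is one sample function
figure; plot(x, samples);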
The conditional distribution of t_{n+1} given the observed targets t is again a Gaussian, with mean and variance

m(x_{n+1}) = k' inv(C_n) t
σ^2(x_{n+1}) = k(x_{n+1}, x_{n+1}) + inv(β) - k' inv(C_n) k

where k is the vector of kernel values k(x_i, x_{n+1}) between the training inputs and the new input. The real hurdle for Gaussian processes is learning the right parameters of the covariance function. This can be done by maximizing the log likelihood of the data given the parameters. We have
log p(t | θ) = -1/2 log|C_n| - 1/2 t' inv(C_n) t - N/2 log(2π)

This can be maximized with respect to the covariance parameters θ using gradient-based optimization (a sketch of one gradient step is given after the code below). Finally, the matlab code for Gaussian process regression is as follows:
% Load the hahn1 data set (thermal expansion of copper): inputs temp, targets thermex
load hahn1;
n = size(temp, 1);
X = temp;
Y = thermex;

% Initial hyperparameter values and search step sizes
isig2 = 7000;
ibeta = 1e-3;
stepsig = 100;
stepbeta = 0.1;

% minimize_hyperparameters, create_kernel and kerrbf are helper functions
% (RBF kernel, kernel matrix construction and log-likelihood maximization)
[sig2, beta] = minimize_hyperparameters(X, Y, @kerrbf, isig2, ibeta, stepsig, stepbeta);
covK = create_kernel(@kerrbf, X, sig2);

% Test inputs and the inverse of the covariance matrix C = K + beta*I
ind = 1:1:900;
sz = size(ind, 2);
invK = inv(covK + eye(n)*beta);

f_x = zeros(1, sz);
var_x = zeros(1, sz);
for i = 1:sz
    % kernel vector between the test input and all training inputs
    k = zeros(n, 1);
    for j = 1:n
        k(j) = kerrbf(ind(i), X(j), sig2);
    end
    % predictive mean and variance (the variance follows the formula given above)
    f_x(i) = k'*invK*Y;
    var_x(i) = kerrbf(ind(i), ind(i), sig2) + beta - k'*invK*k;
end

% Plot the data, the predictive mean and a two-standard-deviation band
figure; hold on;
plot(X, Y, '+');
plot(ind, f_x, 'g.');
plot(ind, f_x + 2*sqrt(var_x), 'r');
plot(ind, f_x - 2*sqrt(var_x), 'r');

Here is a figure showing the converged results.
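For the hyperparameter search itself (the minimize_hyperparameters call above), the gradient of the log likelihood with respect to a covariance parameter θ is

∂/∂θ log p(t | θ) = -1/2 tr( inv(C_n) ∂C_n/∂θ ) + 1/2 t' inv(C_n) (∂C_n/∂θ) inv(C_n) t

Here is a minimal sketch of one gradient ascent step for the length-scale sig2 of a squared-exponential kernel; it is an illustrative assumption, not the actual implementation behind minimize_hyperparameters.

% One gradient ascent step on log p(t | theta) for the length-scale sig2
function sig2 = gp_loglik_step(x, t, sig2, beta, rate)
    n = size(x, 1);
    D2 = bsxfun(@minus, x, x').^2;         % squared distances between inputs
    K = exp(-D2 / (2*sig2));
    C = K + (1/beta)*eye(n);               % C = K + inv(beta)*I
    dC = K .* D2 / (2*sig2^2);             % derivative dC/dsig2
    Cinv = inv(C);
    a = Cinv * t;
    grad = -0.5*trace(Cinv*dC) + 0.5*(a'*dC*a);
    sig2 = sig2 + rate*grad;               % gradient ascent update
end

In practice this update is iterated (or replaced by a standard quasi-Newton or conjugate-gradient optimizer) until the log likelihood converges.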
Note that there is a very nice matlab package GPML which can be used.
Helpful Links: Pattern Recognition and Machine Learning, Chapter 1
Learning with Gaussian Processes - Rasmussen