Kernel density estimation is a method to obtain smooth approximations of a distribution through nonparametric techniques. This method allows us to get a good approximation to the distribution of any data sample, even when we have no idea what its true (population) distribution is.
If we know the true distribution of the data we are working with then we are very lucky. In that case, we can use the density function or cumulative distribution function, whether it be a closed form, a series or a recursive expression, to perform any calculation we want, including calculating the different moments of the distribution.
However, this is rarely the case. For data we observe, we basically never know the data-generating process. We can use the Central Limit Theorem (CLT) to approximate the distribution of some statistics or parameter estimates with a normal distribution. Of special interest is the CLT approximation to the distribution of the sample mean as the sample size becomes large. Still, in the general case, the distribution of the observed data itself remains unknown.
What does it do?
Kernel density estimation achieves smooth approximations to the data distribution through nonparametric techniques. The method intends to extrapolate the sample data to a density for the whole population. Think of it as smoothing a histogram of the data to attain the shape we would have if more and more data were obtained and the histogram bins were made finer and finer. It can be shown that, when the true density is smooth enough, the kernel estimator converges to it faster than the histogram does (in other words, it's very good!). Given a sample $x_1, x_2, \ldots, x_B$ drawn from a true density $p(x)$, the kernel density approximation is given by
$\widehat{p}_h(x) = \frac{1}{Bh} \sum_{i=1}^B K\left( \frac{x-x_i}{h}\right)$
where $K$ is a probability kernel (a non-negative function that integrates to one), and $h > 0$ is a bandwidth or smoothing parameter.
The typical implementation uses a standard normal density as the kernel:
$ K(z) = {1 \over \sqrt{2\pi} }\,e^{-\frac{z^2}{2}}$
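To make the formula concrete, here is a minimal Python sketch of the estimator with a standard normal kernel, using only the standard library. Python is used purely for illustration; the function names, the five data points and the bandwidth are made up for this example and do not come from any particular package.

```python
import math


def gaussian_kernel(z):
    """Standard normal density K(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h):
    """Kernel density estimate at x: (1 / (B h)) * sum_i K((x - x_i) / h)."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    B = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (B * h)


# Illustrative data and bandwidth (assumptions, chosen by hand).
sample = [-1.2, -0.4, 0.1, 0.3, 1.5]
h = 0.5

# Evaluate the estimate on an evenly spaced grid from -4 to 4.
step = 0.01
grid = [-4.0 + step * k for k in range(801)]
density = [kde(x, sample, h) for x in grid]

# Sanity check: a density should integrate to roughly one (Riemann sum).
area = sum(density) * step
print(f"approximate integral of the estimate: {area:.4f}")
```

Because each bump is itself a density scaled by $1/B$, the sum integrates to one whatever the sample is, which is what the last lines check numerically.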
The following figure (see footnote 1) shows what our approximations are doing:
The estimator does not group observations in bins. Instead, it places a bump at each observation, with the shape of the bump determined by the kernel function. Thus, the resulting density is smoother than a histogram. In the normal kernel case, we are placing a normal density centered at each data point, with standard deviation equal to the bandwidth parameter $h$. One way to think about this parameter is that it chooses how smooth we are making the approximation. For low $h$, we are undersmoothing: we get a narrow bump around each point, and the estimate looks as jagged as a histogram with very narrow bins. For high $h$ we are oversmoothing the approximation, and the structure that we want to capture can be lost.
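One rough way to see the effect of $h$ without a plot is to count the local maxima (bumps) of the estimate for several bandwidths. The Python sketch below does this on a made-up sample with two clusters; the data, the three bandwidth values and the helper `count_modes` are all assumptions chosen for illustration.

```python
import math


def kde(x, sample, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    B = len(sample)
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
    return total / (B * h * math.sqrt(2.0 * math.pi))


def count_modes(sample, h, lo, hi, steps=2000):
    """Count strict local maxima of the estimate on an evenly spaced grid."""
    step = (hi - lo) / steps
    values = [kde(lo + k * step, sample, h) for k in range(steps + 1)]
    return sum(
        1
        for k in range(1, steps)
        if values[k] > values[k - 1] and values[k] > values[k + 1]
    )


# Made-up sample with two clusters, around -2 and around +2.
sample = [-2.3, -2.1, -1.9, -1.6, 1.7, 2.0, 2.2, 2.4]

modes_small = count_modes(sample, 0.05, -6, 6)  # undersmoothed
modes_medium = count_modes(sample, 0.8, -6, 6)  # moderate smoothing
modes_large = count_modes(sample, 5.0, -6, 6)   # oversmoothed

print(modes_small, modes_medium, modes_large)
```

With the tiny bandwidth every observation gets its own spike, with the moderate one the two clusters show up as two bumps, and with the very large one the two-cluster structure is smoothed away into a single bump.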
Choosing the bandwidth
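Usually the software you use to do the kernel density estimation will choose $h$ for you, and that choice depends on the sample size. For the normal kernel and univariate distributions, a common rule of thumb, which is optimal when the underlying distribution is itself normal, is given by
$ h = \left( \dfrac{4 \widehat{\sigma}^5}{3B} \right)^\frac{1}{5} $
where $\widehat{\sigma}$ is the sample standard deviation and $B$ is the sample size.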
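As a quick check of the bandwidth formula, here is a small Python sketch that computes $h$ for a made-up sample. The function name and the data are hypothetical, and I am assuming $\widehat{\sigma}$ is the usual sample standard deviation (with the $B-1$ divisor).

```python
import statistics


def rule_of_thumb_bandwidth(sample):
    """h = (4 * sigma_hat**5 / (3 * B)) ** (1/5) for a univariate sample."""
    B = len(sample)
    sigma_hat = statistics.stdev(sample)  # sample standard deviation (B - 1 divisor)
    return (4.0 * sigma_hat ** 5 / (3.0 * B)) ** 0.2


# Made-up data for illustration.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 3.9]
h = rule_of_thumb_bandwidth(sample)

# The same rule written as constant * sigma_hat * B^(-1/5).
constant = (4.0 / 3.0) ** 0.2
h_alt = constant * statistics.stdev(sample) * len(sample) ** -0.2

print(f"h = {h:.4f}, constant = {constant:.4f}")
```

Rewriting the rule as a constant (roughly 1.06) times $\widehat{\sigma} B^{-1/5}$ makes the dependence on the sample size explicit: the bandwidth shrinks slowly as more data arrive.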
Using it in multivariate distributions
There is also a multivariate version of the kernel density estimation. The development is analogous to the explanation above but for more than 1 variable. It is very useful that we can make such an extension. For example, in some of my research, I had an estimation problem involving 2 parameters $\theta$ and $\lambda$. In particular, I needed the distribution of $\theta$ conditional on $\lambda = 0$. Thus,
- I used multivariate kernel density estimation to calculate the joint distribution of $\theta$ and $\lambda$,
- Then used univariate kernel density estimation for the marginal density of $\lambda$, evaluated at $\lambda = 0$ to get $p(\lambda=0)$,
- and finally Bayes' theorem to obtain the conditional density $p(\theta\mid\lambda=0) = p(\theta, \lambda=0) / p(\lambda=0)$.
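The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used in the research mentioned: it uses a product of two normal kernels for the joint density, synthetic draws of $(\theta, \lambda)$ generated on the spot, and hand-picked bandwidths `h_theta` and `h_lam`.

```python
import math
import random


def K(z):
    """Standard normal kernel."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def joint_kde(theta, lam, draws, h_theta, h_lam):
    """Bivariate estimate with a product kernel (an assumption of this sketch)."""
    B = len(draws)
    total = sum(K((theta - t) / h_theta) * K((lam - l) / h_lam) for t, l in draws)
    return total / (B * h_theta * h_lam)


def marginal_kde_lam(lam, draws, h_lam):
    """Univariate estimate of the marginal density of lambda."""
    B = len(draws)
    return sum(K((lam - l) / h_lam) for _, l in draws) / (B * h_lam)


def conditional_kde(theta, lam0, draws, h_theta, h_lam):
    """p(theta | lambda = lam0) = p(theta, lam0) / p(lam0)."""
    joint = joint_kde(theta, lam0, draws, h_theta, h_lam)
    return joint / marginal_kde_lam(lam0, draws, h_lam)


# Synthetic draws: theta = 1 + 0.8 * lambda + noise (made up for illustration).
rng = random.Random(0)
draws = []
for _ in range(500):
    lam = rng.gauss(0.0, 1.0)
    theta = 1.0 + 0.8 * lam + rng.gauss(0.0, 0.5)
    draws.append((theta, lam))

h_theta, h_lam = 0.3, 0.3  # hand-picked bandwidths, not tuned

# Evaluate the conditional density of theta given lambda = 0 on a grid.
step = 0.02
grid = [-4.0 + step * k for k in range(501)]  # from -4 to 6
cond = [conditional_kde(t, 0.0, draws, h_theta, h_lam) for t in grid]

area = sum(cond) * step  # should be close to 1
cond_mean = sum(t * c for t, c in zip(grid, cond)) * step  # should be near 1 here
print(f"integral: {area:.4f}, conditional mean: {cond_mean:.3f}")
```

With a product kernel and the same `h_lam` in both steps, the ratio is a weighted kernel estimate in $\theta$, where draws with $\lambda_i$ close to zero get the most weight, so it integrates to one by construction.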
Implement it in Matlab
For us Matlab users, it is possible to implement univariate kernel density estimation with the function ksdensity, which ships with the Statistics and Machine Learning Toolbox.
For the multivariate case, you are going to have to install kde2d.
Cheers
1 Taken from University of British Columbia Department of Geography