Friday, April 19, 2013

Kernel Density Estimation

Kernel density estimation is a method to obtain smooth approximations of a distribution through nonparametric techniques. This method allows us to get a good approximation to the distribution of any data sample, even when we have no idea what its true (population) distribution is. 

If we know the true distribution of the data we are working with then we are very lucky. In that case, we can use the density function or cumulative distribution function, whether it be a closed form, a series or a recursive expression, to perform any calculation we want, including calculating the different moments of the distribution.

However, this is rarely the case. For data we observe, we basically never know the data generating process. We can use the Central Limit Theorem (CLT) to approximate the distribution of some statistics or parameter estimates with a normal distribution. Of special interest is the approximation of the distribution of the sample mean as the sample size becomes large. Still, in the general case, the distribution of the observed data itself remains unknown.


What does it do?

Kernel density estimation achieves smooth approximations to the data distribution through nonparametric techniques. The method aims to extrapolate the sample data to a density for the whole population. Think of it as smoothing a histogram of the data to attain the shape we would have if more and more data were obtained and the histogram bins were made finer and finer. It can be shown that, under smoothness conditions on the true density, the kernel estimator converges to it faster than the histogram does (in other words, it's very good!).

Given a sample $x_1, x_2, \ldots, x_B$ drawn from a true distribution $p(x)$, the kernel density approximation is given by
$\widehat{p}_h(x) = \frac{1}{Bh} \sum_{i=1}^B K\left( \frac{x-x_i}{h}\right)$

where $K$ is a probability kernel and $h$ is a bandwidth or smoothing parameter. The typical implementation uses a standard normal density as the kernel:

$K(z) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{z^2}{2}}$
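
To make the formula concrete, here is a minimal Python sketch of the estimator with the standard normal kernel. The toy sample and the bandwidth value are made up for illustration, and the function names are my own, not from any library.

```python
import math

def gaussian_kernel(z):
    """Standard normal density K(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    """Kernel density estimate at x: (1 / (B h)) * sum_i K((x - x_i) / h)."""
    B = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (B * h)

# Hypothetical toy sample and bandwidth, for illustration only.
sample = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5]
h = 0.5

# Sanity check: the estimate should integrate to about 1.
# A crude Riemann sum over a wide grid is enough to see this.
step = 0.01
grid = [-10.0 + step * k for k in range(2001)]
area = sum(kde(x, sample, h) for x in grid) * step
print(round(area, 3))
```

Each observation contributes one kernel scaled by $1/(Bh)$, so the estimate is itself a proper density.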

The following figure (see footnote 1) shows what the approximation is doing:



The estimator does not group observations in bins; instead it places a bump, whose shape is determined by the kernel function, at each observation. Thus, the resulting density is smoother than a histogram. In the normal kernel case, each data point gets a normal density centered at that point with standard deviation equal to the bandwidth parameter $h$. One way to think about this parameter is that it controls how smooth we make the approximation. For low $h$ we are undersmoothing: we get a separate bump around each point, closer to the behavior of a histogram with very narrow bins. For high $h$ we are oversmoothing, and the structure that we want to capture can be lost.
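
A quick way to see under- and oversmoothing is to count the bumps (local maxima) of the estimate for a small and a large bandwidth. The sketch below does that in plain Python. The two-cluster sample, the grid, and both bandwidth values are arbitrary choices of mine for illustration.

```python
import math

def kde(x, sample, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / norm

def count_modes(sample, h, lo=-5.0, hi=5.0, n=1001):
    """Count strict local maxima of the estimate on an evenly spaced grid."""
    step = (hi - lo) / (n - 1)
    values = [kde(lo + step * k, sample, h) for k in range(n)]
    return sum(
        1
        for k in range(1, n - 1)
        if values[k] > values[k - 1] and values[k] > values[k + 1]
    )

# Hypothetical sample with two clusters, for illustration only.
sample = [-2.1, -1.9, -1.5, 1.4, 2.0, 2.2]

modes_small_h = count_modes(sample, h=0.05)  # undersmoothed: many bumps
modes_large_h = count_modes(sample, h=3.0)   # oversmoothed: clusters merge
print(modes_small_h, modes_large_h)
```

With the small bandwidth every observation gets its own bump, while the large bandwidth merges the two clusters into a single hump and hides the structure.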

Choosing the bandwidth

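Usually the software you use to do the kernel density estimation will choose a bandwidth $h$ for you, which depends on the sample size and the spread of the data. For the normal kernel and univariate data, a common rule-of-thumb choice, which is optimal when the underlying density is itself normal, is
$ h = \left( \dfrac{4 \widehat{\sigma}^5}{3B} \right)^\frac{1}{5} \approx 1.06\, \widehat{\sigma}\, B^{-\frac{1}{5}} $
where $\widehat{\sigma}$ is the sample standard deviation.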
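
As a small sketch, the bandwidth formula above can be computed directly in Python. I am assuming that $\widehat{\sigma}$ is the usual sample standard deviation (the $B-1$ version); the function name and the data are made up for illustration.

```python
import statistics

def rule_of_thumb_bandwidth(sample):
    """Bandwidth h = (4 * sigma^5 / (3 * B))^(1/5) for a normal kernel."""
    B = len(sample)
    sigma = statistics.stdev(sample)  # assumption: sample std. dev. with B - 1
    return (4.0 * sigma ** 5 / (3.0 * B)) ** 0.2

# Hypothetical data, for illustration only.
sample = [2.3, 1.9, 3.1, 2.8, 2.2, 3.5, 1.7, 2.6]
h = rule_of_thumb_bandwidth(sample)
print(h)
```

Note that $h$ shrinks only at rate $B^{-1/5}$, so even large samples keep a noticeable amount of smoothing.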


Using it in multivariate distributions

There is also a multivariate version of kernel density estimation. The development is analogous to the explanation above, but for more than one variable. It is very useful that we can make such an extension. For example, in some of my research, I had an estimation problem involving 2 parameters $\theta$ and $\lambda$. In particular, I needed the density of $\theta$ conditional on $\lambda = 0$. Thus,
  • I used multivariate kernel density estimation to estimate the joint density of $\theta$ and $\lambda$,
  • then used univariate kernel density estimation for the marginal density of $\lambda$, evaluated at zero to get $p(\lambda=0)$,
  • and finally Bayes' theorem to obtain the conditional density $p(\theta\mid\lambda=0) = p(\theta,\lambda=0)/p(\lambda=0)$.
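
The three steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: a product of two Gaussian kernels for the joint density, hand-picked bandwidths, and invented $(\theta, \lambda)$ draws. It is not the code used in the research mentioned above.

```python
import math

def K(z):
    """Standard normal kernel."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def joint_kde(theta, lam, pairs, h_theta, h_lam):
    """Bivariate estimate using a product of two Gaussian kernels."""
    total = sum(K((theta - t) / h_theta) * K((lam - l) / h_lam) for t, l in pairs)
    return total / (len(pairs) * h_theta * h_lam)

def marginal_kde(lam, pairs, h_lam):
    """Univariate estimate of the marginal density of lambda."""
    return sum(K((lam - l) / h_lam) for _, l in pairs) / (len(pairs) * h_lam)

def conditional_kde(theta, lam0, pairs, h_theta, h_lam):
    """p(theta | lambda = lam0) = p(theta, lam0) / p(lam0)."""
    return joint_kde(theta, lam0, pairs, h_theta, h_lam) / marginal_kde(lam0, pairs, h_lam)

# Hypothetical (theta, lambda) draws and bandwidths, for illustration only.
pairs = [(0.8, -0.3), (1.1, 0.1), (1.4, 0.4), (0.9, -0.1), (1.6, 0.6), (1.2, 0.0)]
h_theta, h_lam = 0.3, 0.3

# The conditional estimate should integrate to about 1 over theta.
step = 0.01
grid = [-4.0 + step * k for k in range(1001)]
area = sum(conditional_kde(t, 0.0, pairs, h_theta, h_lam) for t in grid) * step
print(round(area, 3))
```

Observations whose $\lambda$ is close to zero get the largest weight in the conditional density, which is what conditioning on $\lambda = 0$ should do.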


Implement it in Matlab

For us Matlab users, it is possible to implement univariate kernel density estimation with the function ksdensity, which ships with the Statistics Toolbox.
For the multivariate case, you are going to have to install kde2d separately.

Cheers






1 Taken from University of British Columbia Department of Geography

Monday, April 15, 2013

Simple Linear Regression Assumptions

A very basic review of the assumptions of simple linear regression. Future posts will expand on the effects we get when we relax these assumptions, as well as assumptions we use in other econometric models.


Simple Linear Regression Setup

  • Population Regression Line.
$E(y\vert x) = \mu_{y \vert x} = \beta_0 + \beta_1 x$
  • Individual response.
$y_i = \beta_0 + \beta_1 x_i + e_i $
where the parameters have the same interpretation and the error $e_i$ represents the difference between the observed $y_i$ and the conditional mean $\mu_{y \vert x}$.
  • $e_i \sim \mathcal{N}(0,\sigma_e^2)$: Homoskedasticity and normality.
  • $e_i \sim iid $ (of concern when working with time series): Nonautocorrelation.

Main assumptions on variables

  • $x_i$ and $y_i$ are observed and nonrandom after observation.
  • The $K$ regressors $(x_i)$ are linearly independent (absence of perfect multicollinearity).
  • Errors and independent variables are uncorrelated: $E(x' e) = 0$.

Consequences of the assumption on parameter estimates


  • Sampling distribution of $\hat{\beta}_0$ 
  1. $E(\hat{\beta}_0)=\beta_0$
  2. $var(\hat{\beta}_0) = \sigma_e^2 \left( \frac{1}{N} + \frac{\bar{x}^2}{\sum_{i=1}^N ( x_i-\bar{x} )^2} \right )  = \sigma_e^2 \left( \frac{1}{N} + \frac{\bar{x}^2}{(N-1) s_x^2 } \right )$
  • Sampling distribution of $\hat{\beta}_1$ 
  1. $E(\hat{\beta}_1)=\beta_1$
  2. $var(\hat{\beta}_1) = \frac{\sigma_e^2}{\sum_{i=1}^N ( x_i-\bar{x} )^2} =  \frac{\sigma_e^2}{(N-1) s_x^2 }$ where $s_x^2 = \frac{ \sum_{i=1}^N (x_i-\bar{x})^2}{(N-1)} $
  3. The sampling distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ are normal.
  • Estimate of $\sigma_e^2$: 
          $s_e^2=\frac{\sum_i (y_i-\hat{y_i})^2}{N-2}=\frac{SSE}{N-2}=MSE$
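
Here is a minimal Python sketch that plugs $s_e^2$ into the variance formulas above. The least squares formulas for the estimates themselves (slope = $S_{xy}/S_{xx}$, intercept = $\bar{y} - \hat{\beta}_1 \bar{x}$) are the standard ones and are not spelled out in this post; the function name and the data are made up for illustration.

```python
def simple_ols(x, y):
    """Least squares fit of y = b0 + b1 * x with estimated variances."""
    N = len(x)
    x_bar = sum(x) / N
    y_bar = sum(y) / N
    sxx = sum((xi - x_bar) ** 2 for xi in x)  # equals (N - 1) * s_x^2
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx              # standard OLS slope
    b0 = y_bar - b1 * x_bar     # standard OLS intercept

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2_e = sse / (N - 2)        # MSE, the estimate of sigma_e^2

    return {
        "b0": b0,
        "b1": b1,
        "s2_e": s2_e,
        "var_b0": s2_e * (1.0 / N + x_bar ** 2 / sxx),
        "var_b1": s2_e / sxx,
    }

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_ols(x, y)
print(fit)
```

Since $\sigma_e^2$ is unknown in practice, the sketch replaces it with $s_e^2$, so `var_b0` and `var_b1` are estimated variances.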


Characteristics of the parameter estimates

  • Unbiased Estimators: The mean of the sampling distribution is equal to the population parameter being estimated. 
  • Consistent Estimators: As $N$ increases, the probability that the estimator will be close to the true parameter increases: as $N\rightarrow \infty$, $\hat{\beta}$ converges in probability to $\beta$.
  • Minimum variance estimators: The variance of $\hat{\beta}$ is no larger than the variance of any other linear unbiased estimator of $\beta$.
These are the most basic assumptions of simple linear regression. The consequences above require these assumptions to be satisfied. Empirical analyses that use simple linear regression make these assumptions implicitly. There are many cases in which it should be obvious that they will not be satisfied. Future posts will analyze these cases and their consequences.
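
A small simulation is a handy way to see unbiasedness and the variance formula at work. The sketch below is only an illustration: the true parameters, the design points, the seed, and the number of replications are all arbitrary choices of mine.

```python
import random

def ols_fit(x, y):
    """Return (b0, b1) from a simple least squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

rng = random.Random(0)                     # fixed seed for reproducibility
beta0, beta1, sigma_e = 1.0, 2.0, 1.0      # hypothetical true parameters
x = [float(i) for i in range(20)]          # fixed (nonrandom) regressor
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Redraw the normal errors many times and refit the line each time.
draws = []
for _ in range(2000):
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma_e) for xi in x]
    draws.append(ols_fit(x, y))

mean_b0 = sum(b0 for b0, _ in draws) / len(draws)
mean_b1 = sum(b1 for _, b1 in draws) / len(draws)
var_b1 = sum((b1 - mean_b1) ** 2 for _, b1 in draws) / (len(draws) - 1)
theory_var_b1 = sigma_e ** 2 / sxx         # sigma_e^2 / sum (x_i - x_bar)^2

print(mean_b0, mean_b1, var_b1, theory_var_b1)
```

The averages of the estimates should land close to the true parameters, and the simulated variance of the slope should be close to $\sigma_e^2 / \sum_i (x_i - \bar{x})^2$.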



Sunday, April 14, 2013

Giving Academic Talks

I am preparing my thesis proposal talk. I found the following notes very useful:

http://www.cs.berkeley.edu/~jrs/speaking.html

and thought I should share them.

It will not happen this time, but I am really looking forward to being able to use this minimalist, few-bullets style, where most of the text is delivered by the speaker and not the slides.

Cheers