Why Do We Need to Care About Endogeneity?

Jun 28, 2025 · Shonn Cheng

Let’s consider a simple linear regression model:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Where:

  • $y$: outcome variable
  • $x$: predictor
  • $\varepsilon$: error term (also called the disturbance; the residual is its estimated counterpart)
  • $\beta_0, \beta_1$: parameters to estimate

What OLS Assumes

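For the OLS (Ordinary Least Squares) estimator to be unbiased, the error term must have zero conditional mean given the predictor. This is the exogeneity assumption among the Gauss-Markov conditions:

$$\mathbb{E}[\varepsilon \mid x] = 0 \quad \Rightarrow \quad \text{Cov}(x, \varepsilon) = 0$$

Zero conditional mean implies that the error term is uncorrelated with the predictor $x$. The zero-covariance condition is the weaker of the two, and it is the one the derivation below relies on.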


What If $\text{Cov}(x, \varepsilon) \neq 0$?

Let’s derive the OLS estimate of $\beta_1$:

$$\hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)}$$

Substitute $y = \beta_0 + \beta_1 x + \varepsilon$:

$$\text{Cov}(x, y) = \text{Cov}(x, \beta_0 + \beta_1 x + \varepsilon) = \beta_1 \text{Var}(x) + \text{Cov}(x, \varepsilon)$$
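
The middle step relies on the linearity of covariance and on the fact that a constant has zero covariance with any variable. Spelled out term by term, as a sketch of the intermediate algebra:

$$
\begin{aligned}
\text{Cov}(x, \beta_0 + \beta_1 x + \varepsilon) &= \text{Cov}(x, \beta_0) + \beta_1 \text{Cov}(x, x) + \text{Cov}(x, \varepsilon) \\
&= 0 + \beta_1 \text{Var}(x) + \text{Cov}(x, \varepsilon)
\end{aligned}
$$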

So the estimator becomes:

$$\hat{\beta}_1 = \frac{\beta_1 \text{Var}(x) + \text{Cov}(x, \varepsilon)}{\text{Var}(x)}$$

$$\Rightarrow \hat{\beta}_1 = \beta_1 + \frac{\text{Cov}(x, \varepsilon)}{\text{Var}(x)}$$
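
To make the decomposition concrete, here is a minimal numerical sketch in Python with NumPy. Everything in it is an assumption made for illustration: the parameter values, the `shared` component used to induce correlation between `x` and the error term, and the helper `sample_cov`. In simulated data the error term is observable, so the slope and the bias term can be computed side by side.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 10_000
beta0, beta1 = 1.0, 2.0  # illustrative "true" parameters (assumed)

# Induce Cov(x, eps) != 0 through a component shared by x and eps.
shared = rng.normal(size=n)
x = shared + rng.normal(size=n)
eps = 0.5 * shared + rng.normal(size=n)
y = beta0 + beta1 * x + eps


def sample_cov(a, b):
    """Sample covariance of two 1-D arrays (1/n normalisation)."""
    return np.mean((a - a.mean()) * (b - b.mean()))


# OLS slope as Cov(x, y) / Var(x).
beta1_hat = sample_cov(x, y) / sample_cov(x, x)

# The extra term from the derivation: Cov(x, eps) / Var(x).
bias_term = sample_cov(x, eps) / sample_cov(x, x)

print(f"OLS slope:          {beta1_hat:.4f}")
print(f"beta1 + bias term:  {beta1 + bias_term:.4f}")
print(f"true beta1:         {beta1:.4f}")
```

The first two printed values should agree up to floating-point error, since the decomposition is an algebraic identity when sample covariances are used. Under this setup the population bias term is 0.5 / 2 = 0.25, so the slope should land near 2.25 rather than 2.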

Consequence

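If $\text{Cov}(x, \varepsilon) \neq 0$, then:

  • $\hat{\beta}_1 \neq \beta_1$: the estimate is off by $\text{Cov}(x, \varepsilon) / \text{Var}(x)$, and collecting more data does not shrink this term
  • Your estimate is biased, upward if the covariance is positive and downward if it is negative
  • You cannot trust the regression results to reflect a true causal relationship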

Interpretation

Imagine you’re studying the effect of study hours ($x$) on test scores ($y$).
If motivation affects both study hours and test scores but is not included in the model, it ends up in the error term $\varepsilon$.
Because study hours ($x$) are related to motivation, $x$ is now correlated with $\varepsilon$.
As a result, your estimate of the effect of study hours on test scores ($\hat{\beta}_1$) will be biased. If more motivated students both study more and score higher, then $\text{Cov}(x, \varepsilon) > 0$ and the estimate overstates the effect of study hours.
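
The study-hours story can be simulated as a rough sketch. The data-generating process below is entirely hypothetical: the coefficients, the variable names, and the assumption that motivation is normally distributed are invented for illustration and are not estimates from real data. The sketch fits one regression that omits motivation and one that includes it.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
n = 50_000

# Hypothetical data-generating process (all numbers are made up).
motivation = rng.normal(size=n)
hours = 5.0 + 1.0 * motivation + rng.normal(size=n)
true_effect = 2.0  # assumed causal effect of one extra study hour
score = 50.0 + true_effect * hours + 3.0 * motivation + rng.normal(size=n)

# Short regression: score on hours only.
# Motivation is left out, so it is absorbed into the error term.
X_short = np.column_stack([np.ones(n), hours])
coef_short, *_ = np.linalg.lstsq(X_short, score, rcond=None)

# Long regression: score on hours and motivation.
X_long = np.column_stack([np.ones(n), hours, motivation])
coef_long, *_ = np.linalg.lstsq(X_long, score, rcond=None)

print(f"true effect of hours:         {true_effect:.3f}")
print(f"estimate without motivation:  {coef_short[1]:.3f}")
print(f"estimate with motivation:     {coef_long[1]:.3f}")
```

In this setup the omitted term contributes $3 \cdot \text{Cov}(\text{hours}, \text{motivation}) / \text{Var}(\text{hours}) = 3 \cdot 1 / 2 = 1.5$ to the slope, so the short regression should report a value near 3.5 while the long regression should recover something close to 2. In real data motivation is rarely measured, which is why the bias cannot simply be removed by adding a column.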