Why Do We Need to Care About Endogeneity?

Let’s consider a simple linear regression model:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

Where:

$y$: outcome variable
$x$: predictor
$\varepsilon$: error term (disturbance), which is unobserved and distinct from the fitted residual
$\beta_0, \beta_1$: parameters to estimate

What OLS Assumes

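For the OLS (Ordinary Least Squares) estimator to be unbiased, the error term must have zero conditional mean given the predictor. This is the exogeneity condition among the Gauss-Markov assumptions:

$$ \mathbb{E}[\varepsilon \mid x] = 0 \quad \Rightarrow \quad \text{Cov}(x, \varepsilon) = 0 $$

Zero conditional mean implies that the error term is uncorrelated with the predictor $x$. The weaker condition $\text{Cov}(x, \varepsilon) = 0$ is what OLS needs to be consistent.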

What If $\text{Cov}(x, \varepsilon) \neq 0$?

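Let’s derive the OLS estimate of $\beta_1$. In a simple regression, the slope estimate is the sample covariance of $x$ and $y$ divided by the sample variance of $x$: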

$$ \hat{\beta}_1 = \frac{\text{Cov}(x, y)}{\text{Var}(x)} $$

Substitute $y = \beta_0 + \beta_1 x + \varepsilon$:

$$ \text{Cov}(x, y) = \text{Cov}(x, \beta_0 + \beta_1 x + \varepsilon) = \beta_1 \text{Var}(x) + \text{Cov}(x, \varepsilon) $$

So the estimator becomes:

$$ \hat{\beta}_1 = \frac{\beta_1 \text{Var}(x) + \text{Cov}(x, \varepsilon)}{\text{Var}(x)} $$

$$ \Rightarrow \hat{\beta}_1 = \beta_1 + \frac{\text{Cov}(x, \varepsilon)}{\text{Var}(x)} $$
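The decomposition above can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up data-generating process in which the error term is built to be correlated with $x$; the parameter values (`beta0`, `beta1`, the 0.5 loading) are illustrative assumptions, not taken from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
beta0, beta1 = 1.0, 2.0  # hypothetical true parameters

# Build an error term that is correlated with x by construction.
x = rng.normal(size=n)
eps = 0.5 * x + rng.normal(size=n)
y = beta0 + beta1 * x + eps


def cov(a, b):
    """Sample covariance of a and b (divides by n - 1)."""
    return np.cov(a, b)[0, 1]


# OLS slope: sample Cov(x, y) / sample Var(x).
slope = cov(x, y) / cov(x, x)

# Right-hand side of the decomposition: beta1 plus the bias term.
decomposed = beta1 + cov(x, eps) / cov(x, x)

print(slope, decomposed)
```

The two printed numbers should agree up to floating-point error, because the identity holds exactly for sample moments. Both should sit well above the true slope of 2, since the bias term is roughly 0.5 under these assumptions.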

Consequence

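If $\text{Cov}(x, \varepsilon) \neq 0$, the second term does not vanish, even in large samples. Then:

$\hat{\beta}_1$ converges to $\beta_1 + \text{Cov}(x, \varepsilon) / \text{Var}(x)$, not to $\beta_1$
Your estimate is biased, and collecting more data does not remove the bias
You cannot trust the regression results to reflect a true causal relationship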
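The bias described above does not shrink as the sample grows. A small simulation sketch can illustrate this, assuming NumPy and a hypothetical setup where $\varepsilon = \rho x + \text{noise}$; the helper names `ols_slope` and `simulate_slope` and all parameter values are illustrative assumptions.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def simulate_slope(n, rho, rng, beta0=1.0, beta1=2.0):
    """Draw one sample with eps = rho * x + noise and return the OLS slope."""
    x = rng.normal(size=n)
    eps = rho * x + rng.normal(size=n)
    y = beta0 + beta1 * x + eps
    return ols_slope(x, y)


rng = np.random.default_rng(42)
results = {}
for n in (100, 10_000, 1_000_000):
    exogenous = simulate_slope(n, rho=0.0, rng=rng)   # Cov(x, eps) = 0
    endogenous = simulate_slope(n, rho=0.5, rng=rng)  # Cov(x, eps) != 0
    results[n] = (exogenous, endogenous)
    print(n, exogenous, endogenous)
```

With `rho=0.0` the slope should settle near the true value of 2 as `n` grows. With `rho=0.5` it should settle near 2.5 instead: a larger sample makes the estimate more precise, but precise around the wrong number.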

Interpretation

Imagine you’re studying the effect of study hours ($x$) on test scores ($y$).
If motivation affects both study hours and test scores but is not included in the model, it ends up in the error term $\varepsilon$.
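Since study hours ($x$) are related to motivation, $x$ is correlated with $\varepsilon$.
As a result, your estimate of the effect of study hours on test scores ($\hat{\beta}_1$) will be biased. If motivation raises both study hours and test scores, the estimate overstates the effect of studying.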
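The study-hours example can be simulated as a rough sketch. It assumes NumPy and an invented data-generating process in which motivation raises both study hours and test scores; every coefficient below is hypothetical and chosen only for illustration. The helper `ols` is likewise an assumption, a thin wrapper around `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data-generating process; all coefficients are illustrative.
motivation = rng.normal(size=n)
hours = 5.0 + 1.5 * motivation + rng.normal(size=n)
score = 50.0 + 2.0 * hours + 4.0 * motivation + rng.normal(size=n)


def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


short_fit = ols(score, hours)             # motivation left in the error term
full_fit = ols(score, hours, motivation)  # motivation included as a regressor

print("omitting motivation: ", short_fit[1])
print("including motivation:", full_fit[1])
```

Under these assumptions the true effect of an extra study hour is 2. The regression that omits motivation should report a slope close to $2 + 4 \cdot 1.5 / 3.25 \approx 3.85$, which matches $\beta_1 + \text{Cov}(x, \varepsilon) / \text{Var}(x)$, while the regression that includes motivation should recover a slope close to 2.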