Let’s consider a simple linear regression model:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where:
- $y$: outcome variable
- $x$: predictor
- $\varepsilon$: error term (also called the disturbance)
- $\beta_0, \beta_1$: parameters to estimate
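To make the setup concrete, here is a minimal simulation sketch in numpy; the true values $\beta_0 = 2$, $\beta_1 = 0.5$, and the sample size are assumptions chosen for illustration, not part of the original text:

```python
# Minimal simulation of the model above (illustrative parameter values).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
beta0, beta1 = 2.0, 0.5                # true parameters (assumed for illustration)

x = rng.normal(size=n)                 # predictor
eps = rng.normal(size=n)               # error term, independent of x here
y = beta0 + beta1 * x + eps            # outcome

# OLS slope estimate: Cov(x, y) / Var(x)
beta1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(beta1_hat)                       # ~0.5, since Cov(x, eps) = 0
```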
What OLS Assumes
For the OLS (Ordinary Least Squares) estimator to be unbiased, one
of the Gauss-Markov assumptions is:
$$E[\varepsilon \mid x] = 0 \quad \text{or} \quad \operatorname{Cov}(x, \varepsilon) = 0$$

This means the error term must not be correlated with the predictor $x$.
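As a side note (a standard law-of-iterated-expectations argument, not spelled out in the original), the conditional-mean condition is the stronger of the two and implies the covariance condition:

$$\operatorname{Cov}(x, \varepsilon) = E[x\varepsilon] - E[x]\,E[\varepsilon] = E\big[x \, E[\varepsilon \mid x]\big] - E[x]\,E\big[E[\varepsilon \mid x]\big] = 0 - 0 = 0$$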
What If $\operatorname{Cov}(x, \varepsilon) \neq 0$?
Let’s derive the OLS estimate of $\beta_1$:

$$\hat{\beta}_1 = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}$$

Substitute $y = \beta_0 + \beta_1 x + \varepsilon$:

$$\operatorname{Cov}(x, y) = \operatorname{Cov}(x, \beta_0 + \beta_1 x + \varepsilon) = \beta_1 \operatorname{Var}(x) + \operatorname{Cov}(x, \varepsilon)$$

(the constant $\beta_0$ drops out of the covariance). So the estimator becomes:

$$\hat{\beta}_1 = \frac{\beta_1 \operatorname{Var}(x) + \operatorname{Cov}(x, \varepsilon)}{\operatorname{Var}(x)} \quad \Rightarrow \quad \hat{\beta}_1 = \beta_1 + \frac{\operatorname{Cov}(x, \varepsilon)}{\operatorname{Var}(x)}$$
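This decomposition can be checked numerically. A sketch in the same style as above, where the dependence $\varepsilon = 0.3x + \text{noise}$ is an assumed structure chosen to break the exogeneity condition:

```python
# Induce Cov(x, eps) != 0 and check beta1_hat = beta1 + Cov(x, eps) / Var(x).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta0, beta1 = 2.0, 0.5                # true parameters (assumed, as above)

x = rng.normal(size=n)
eps = 0.3 * x + rng.normal(size=n)     # error now correlated with x (assumed structure)
y = beta0 + beta1 * x + eps

beta1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
bias = np.cov(x, eps)[0, 1] / np.var(x, ddof=1)

print(beta1_hat)        # ~0.8: pulled away from the true 0.5
print(beta1 + bias)     # matches beta1_hat, confirming the decomposition
```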
Consequence
If $\operatorname{Cov}(x, \varepsilon) \neq 0$, then:
- $\hat{\beta}_1 \neq \beta_1$
- Your estimate is biased
- You cannot trust the regression results to reflect a true causal relationship
Interpretation
Imagine you’re studying the effect of study hours ($x$) on test scores ($y$). If motivation affects both study hours and test scores but is not included in the model, it ends up in the error term $\varepsilon$. Since study hours are related to motivation, $x$ is now correlated with $\varepsilon$. As a result, your estimate of the effect of study hours on test scores ($\hat{\beta}_1$) will be biased. This is the classic omitted-variable bias.
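A sketch of this story in the same simulation style; all coefficients, including the assumed true effect of study hours (2.0), are illustrative:

```python
# Omitted-variable sketch: motivation drives both study hours and test scores.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

motivation = rng.normal(size=n)                        # unobserved confounder
hours = 1.0 + 0.8 * motivation + rng.normal(size=n)    # study hours depend on motivation
scores = 50 + 2.0 * hours + 5.0 * motivation + rng.normal(size=n)  # true effect: 2.0

# Regress scores on hours alone; motivation is absorbed into the error term,
# so the regressor is correlated with the error.
slope = np.cov(hours, scores)[0, 1] / np.var(hours, ddof=1)
print(slope)   # ~4.4, well above the true 2.0: upward omitted-variable bias
```

If motivation were observed and held fixed, the estimated effect of hours would return to roughly 2.0.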