Stata - currently the most popular econometrics package
R - an all-rounder for statistics and econometrics
Python - a rising star in numerical computing, statistics, and econometrics
import pandas as pd
import statsmodels.formula.api as smf
# import dataset
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")
# ols
results = smf.ols(formula='lwage ~ educ + exper + expersq', data=df).fit()
print(results.summary())
The multiple linear regression model has the basic form
$$y=\beta_{0}+\beta_{1} x_{1}+\beta_{2} x_{2}+\beta_{3} x_{3}+\ldots+\beta_{k} x_{k}+\mu$$
or, in matrix form,
$$\mathbf{y}=\mathbf{X} \boldsymbol{\beta}+\boldsymbol{\varepsilon}$$
The OLS estimator of the coefficients is
$$\hat{\boldsymbol{\beta}}=\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \mathbf{y}$$
The Gauss-Markov assumptions for the multiple linear regression model:
MLR.1 Linearity in parameters
MLR.2 Random sampling
MLR.3 No perfect collinearity: no independent variable is constant, and there are no exact linear relationships among the independent variables
MLR.4 Zero conditional mean: $E(\mu|x_{1},x_{2},\ldots,x_{k}) = 0$
MLR.5 Homoskedasticity: $Var(\mu|x_{1},x_{2},\ldots,x_{k}) = \sigma^{2}$
When assumption MLR.4 holds, we say the model has exogenous explanatory variables; otherwise, $x_{j}$ is called an endogenous explanatory variable.
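A quick simulation illustrates why endogeneity is a problem. All coefficients and distributions below are invented for illustration: an unobserved factor drives both the regressor and the outcome, so the simple OLS slope is biased away from the true coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Unobserved "ability" affects both the regressor and the outcome,
# so leaving it out violates MLR.4 and makes x endogenous.
ability = rng.normal(size=n)
x = 0.5 * ability + rng.normal(size=n)
y = 1.0 * x + 2.0 * ability + rng.normal(size=n)

# Simple OLS slope: Cov(x, y) / Var(x)
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(slope)  # around 1.8, well above the true coefficient 1.0
```

The population bias here is $Cov(x, 2 \cdot ability)/Var(x) = 1/1.25 = 0.8$, so OLS converges to 1.8 rather than 1.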
Sampling variance of the OLS slope estimators
Under assumptions MLR.1 through MLR.5, conditional on the sample values of the independent variables, for all $j=1,2,\ldots,k$,
$$Var(\hat{\beta}_{j}) = \frac{\sigma^{2}}{SST_{j}(1-R_{j}^{2})}$$
where $SST_{j} = \sum^{n}_{i=1} (x_{ij}-\bar{x}_{j})^{2}$ is the total sample variation in $x_{j}$, and $R_{j}^{2}$ is the $R^{2}$ from regressing $x_{j}$ on all other independent variables (including an intercept).
Or, in matrix form,
$$\operatorname{Var}[\hat{\boldsymbol{\beta}} | \mathbf{X}]=s^{2}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1}$$
where $s^{2}=\frac{\mathbf{e}^{\prime} \mathbf{e}}{n-K}$ and $K=k+1$ is the number of regressors including the constant.
The Gauss-Markov theorem
Under assumptions MLR.1 through MLR.5, $\hat{\beta}_{0},\hat{\beta}_{1},\ldots,\hat{\beta}_{k}$ are the best linear unbiased estimators (BLUE) of $\beta_{0},\beta_{1},\ldots,\beta_{k}$.
Testing a hypothesis about a single population parameter: the $t$ test
$$\left(\hat{\beta}_{j}-\beta_{j}\right) / \operatorname{se}\left(\hat{\beta}_{j}\right) \sim t_{n-k-1}=t_{d f}$$
where $k+1$ is the number of unknown parameters in the population model $y=\beta_{0}+\beta_{1} x_{1}+\ldots+\beta_{k} x_{k}+\mu$ ($k$ slope parameters plus the intercept $\beta_{0}$), and $n-k-1$ is the degrees of freedom (df).
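A minimal sketch of the $t$ calculation, using hypothetical numbers for the estimate, standard error, and degrees of freedom:

```python
from scipy import stats

# Hypothetical values: beta_hat = 0.092, se(beta_hat) = 0.007, df = n - k - 1 = 522
beta_hat, se, df_resid = 0.092, 0.007, 522

# Test H0: beta_j = 0 against a two-sided alternative
t_stat = beta_hat / se
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)
print(t_stat, p_value)
```

With a $t$ statistic this large, the $p$-value is essentially zero and $H_0$ is rejected at any conventional level.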
Testing multiple linear restrictions: the $F$ test
$$F \equiv \frac{\left(\mathrm{SSR}_{r}-\mathrm{SSR}_{u r}\right) / q}{\mathrm{SSR}_{u r} /(n-k-1)}$$
where $\mathrm{SSR}_{r}$ is the sum of squared residuals of the restricted model, $\mathrm{SSR}_{u r}$ is that of the unrestricted model, and $q$ is the number of restrictions imposed.
Under $H_{0}$, the $F$ statistic is distributed as an $F$ random variable with $(q, n-k-1)$ degrees of freedom:
$$F \sim F_{q, n-k-1}$$
statsmodels is a Python module for estimating and testing many kinds of statistical models.
It can be installed with pip:
pip install statsmodels
OLS estimation
import statsmodels.api as sm
import statsmodels.formula.api as smf
# load the Guerry dataset used in the statsmodels documentation example
df = sm.datasets.get_rdataset("Guerry", "HistData").data
mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
res = mod.fit()
print(res.summary())
import numpy as np
import pandas as pd
import patsy
# mroz data, as in the earlier examples
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")
y, X = patsy.dmatrices("lwage ~ educ + exper + expersq", data=df, return_type="dataframe")
Xm, ym = X.to_numpy(), y.to_numpy()
# beta_hat = (X'X)^{-1} X'y
beta = np.linalg.inv(Xm.T @ Xm) @ Xm.T @ ym
coefs = pd.DataFrame(beta, index=["constant", "educ", "exper", "expersq"], columns=["coef."])
coefs
# residuals
e = ym - Xm @ beta
# s^2 = e'e / (n - K), the estimate of the error variance
s2 = (e.T @ e).item() / (Xm.shape[0] - Xm.shape[1])
# Var(beta_hat | X) = s^2 (X'X)^{-1}; standard errors are the square roots of its diagonal
std_err = np.sqrt(s2 * np.diagonal(np.linalg.inv(Xm.T @ Xm)))
std_errs = pd.DataFrame(std_err, index=["constant", "educ", "exper", "expersq"], columns=["Std. Err."])
std_errs
For OLS regression results, hypotheses about the coefficients are tested with $t$ and $F$ tests.
import pandas as pd
import statsmodels.formula.api as smf
# import dataset
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")
# ols
results = smf.ols(formula='lwage ~ educ + exper + expersq', data=df).fit()
# t test of H0: the coefficient on educ equals 0.10
hypotheses = 'educ = 0.10'
t_test = results.t_test(hypotheses)
print(t_test)
# fitted values
print(results.predict())
import pandas as pd
import statsmodels.formula.api as smf
# import dataset
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")
# ols
results = smf.ols(formula='lwage ~ educ + exper + expersq + motheduc + fatheduc', data=df).fit()
print(results.summary(), "\n\n")
# F test of H0: motheduc = 0 and fatheduc = 0 (joint significance of parents' education)
hypotheses = '(motheduc = 0), (fatheduc = 0)'
f_test = results.f_test(hypotheses)
print("F statistic: {:.4}, p-value: {:.4}".format(float(f_test.fvalue), float(f_test.pvalue)))
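The joint test can also be computed by hand from the restricted and unrestricted residual sums of squares, using the $F$ formula given earlier. The sketch below uses simulated data (an assumption, so the example is self-contained) and checks the hand-computed value against the built-in `f_test`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: y depends on x1 only, so x2 and x3 are jointly irrelevant
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=200),
                   "x2": rng.normal(size=200),
                   "x3": rng.normal(size=200)})
df["y"] = 1 + 2 * df["x1"] + rng.normal(size=200)

unrestricted = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
restricted = smf.ols("y ~ x1", data=df).fit()  # imposes the q = 2 restrictions x2 = x3 = 0

q = 2
# F = [(SSR_r - SSR_ur) / q] / [SSR_ur / (n - k - 1)]
F = ((restricted.ssr - unrestricted.ssr) / q) / (unrestricted.ssr / unrestricted.df_resid)
fv = float(np.squeeze(unrestricted.f_test("(x2 = 0), (x3 = 0)").fvalue))
print(F, fv)  # the two values agree
```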
import pandas as pd
import statsmodels.formula.api as smf
# import dataset
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/wage1.dta")
# ols
results = smf.ols(formula='lwage ~ educ + exper + expersq + female*educ', data=df).fit()
print(results.summary(), '\n\n')
# F test of H0: female = 0 and female:educ = 0 (no gender difference in the wage equation)
hypotheses = '(female = 0), (female:educ = 0)'
f_test = results.f_test(hypotheses)
print("F statistic: {:.4}, p-value: {:.4}".format(float(f_test.fvalue), float(f_test.pvalue)))
For example, consider a model with a single explanatory variable:
$$y_{it} = \beta_{1}x_{it} + \alpha_{i} + \mu_{it},\quad t=1,2,\ldots,T$$
If the unobserved effect $\alpha_{i}$ is the source of the endogeneity problem, it can be dealt with by fixed-effects estimation.
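The fixed-effects (within) estimator removes $\alpha_{i}$ by subtracting each individual's time average from the equation:
$$y_{it}-\bar{y}_{i}=\beta_{1}\left(x_{it}-\bar{x}_{i}\right)+\left(\mu_{it}-\bar{\mu}_{i}\right)$$
where $\bar{y}_{i}=\frac{1}{T}\sum_{t=1}^{T}y_{it}$, and similarly for $\bar{x}_{i}$ and $\bar{\mu}_{i}$. Because $\alpha_{i}$ is constant over time, it drops out, and pooled OLS on the demeaned data yields the fixed-effects estimator.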
import pandas as pd
import patsy
from linearmodels.panel import PanelOLS
# import dataset
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/jtrain.dta")
df = df.set_index(["fcode", "year"])
y, X = patsy.dmatrices("lscrap ~ d88 + d89 + grant + grant_1 + lsales + lemploy", data=df, return_type="dataframe")
mod = PanelOLS(y, X, entity_effects=True)
res = mod.fit()
print(res)
Example: a wage equation
Basic assumptions for instrumental variables
Suppose there is an observable variable $z$ that satisfies two assumptions:
Instrument exogeneity: $Cov(z,\mu) = 0$
Instrument relevance: $Cov(z,x) \neq 0$
Then $z$ is called an instrumental variable for $x$, or simply an instrument for $x$.
import pandas as pd
import patsy
from linearmodels.iv import IV2SLS
# import dataset
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")
y, X = patsy.dmatrices("lwage ~ educ + exper + expersq + motheduc + fatheduc", data=df, return_type="dataframe")
res_ols = IV2SLS(y, X[['Intercept','educ','exper','expersq']], None, None).fit(cov_type='unadjusted')
print(res_ols)
res_second = IV2SLS(y, X[['Intercept','exper','expersq']], X[['educ']], X[['motheduc','fatheduc']]).fit(cov_type='unadjusted')
print(res_second)
from linearmodels.iv import compare
print(compare({'OLS':res_ols, '2SLS':res_second}))
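The mechanics of 2SLS can be sketched by running the two stages manually. The data below are simulated (an illustrative assumption, not the mroz file); note that this naive second-stage regression reproduces the 2SLS coefficients but not the correct standard errors, which is why `IV2SLS` should be used in practice.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: z is a valid instrument (correlated with x, unrelated to u)
rng = np.random.default_rng(3)
n = 5000
u = rng.normal(size=n)                        # structural error
z = rng.normal(size=n)                        # instrument
x = 0.8 * z + 0.5 * u + rng.normal(size=n)    # endogenous: correlated with u
df = pd.DataFrame({"y": 1 + 2 * x + u, "x": x, "z": z})

# Stage 1: regress the endogenous regressor on the instrument
stage1 = smf.ols("x ~ z", data=df).fit()
df["x_hat"] = stage1.fittedvalues
# Stage 2: replace x with its first-stage fitted values
stage2 = smf.ols("y ~ x_hat", data=df).fit()
print(stage2.params["x_hat"])  # close to the true coefficient 2; plain OLS of y on x is biased upward
```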
The Rubin causal model
Potential outcomes of an individual:
$$Y_{i}=\left\{\begin{array}{ll} Y_{1 i} & D_{i}=1 \\ Y_{0 i} & D_{i}=0 \end{array}\right. =Y_{0 i}+\left(Y_{1 i}-Y_{0 i}\right) D_{i}$$
For any individual, only one of $Y_{1 i}$ and $Y_{0 i}$ is observed, so causal inference can be understood as a missing-data problem. The average treatment effect (ATE) is
$$\tau_{A T E}=E\left(Y_{1 i}-Y_{0 i}\right)$$
and the average treatment effect on the treated (ATT) is
$$\tau_{A T T}=E\left(Y_{1 i}-Y_{0 i} | D_{i}=1\right)$$
Therefore,
$$\begin{aligned} E\left[Y_{i} | D_{i}=1\right]-E\left[Y_{i} | D_{i}=0\right] &=E\left[Y_{1 i} | D_{i}=1\right]-E\left[Y_{0 i} | D_{i}=1\right] \\ &+E\left[Y_{0 i} | D_{i}=1\right]-E\left[Y_{0 i} | D_{i}=0\right] \end{aligned}$$
The first term is the average causal effect of the treatment on the treated; the second term is selection bias. Under random assignment, $D_{i}$ is independent of the potential outcomes, and the causal effect simplifies further:
$$E\left[Y_{1 i} | D_{i}=1\right]-E\left[Y_{0 i} | D_{i}=1\right]=E\left[Y_{1 i}-Y_{0 i} | D_{i}=1\right]=E\left[Y_{1 i}-Y_{0 i}\right]$$
Matching and propensity score matching
Both regression and matching are strategies for controlling for covariates, and regression can be viewed as a special kind of matching estimator: a particular weighted matching estimator (Angrist, 2008).
The propensity score theorem
If the conditional independence assumption holds, i.e. $\left\{Y_{0 i}, Y_{1 i}\right\} \perp D_{i} | X_{i}$, then conditional on the propensity score $p\left(X_{i}\right)$, a function of the covariate vector, the potential outcomes are still independent of treatment status:
$$\left\{Y_{0 i}, Y_{1 i}\right\} \perp D_{i} | p\left(X_{i}\right)$$
where
$$p\left(X_{i}\right) \equiv E\left[D_{i} | X_{i}\right]=P\left[D_{i}=1 | X_{i}\right]$$
import numpy as np
import pandas as pd
from pscore_match.pscore import PropensityScore
from pscore_match.match import Match, whichMatched
df = pd.read_stata("https://www.stata-press.com/data/r16/cattaneo2.dta")
# binary treatment indicator: 1 if the mother smokes
df["treatment"] = 0
df.loc[df["mbsmoke"] == "smoker", "treatment"] = 1
treatment = np.array(df["treatment"])
covariates = df.loc[:,["medu","mage","alcohol"]]
pscore = PropensityScore(treatment, covariates).compute()
pairs = Match(treatment, pscore)
pairs.create(method='one-to-one', replace=True)
data_matched = whichMatched(pairs, pd.DataFrame({'pscore': pscore, 'treatment' :treatment, 'bweight':df["bweight"]}))
# compare mean birth weight within the matched sample, not the full data
treated_bweight = data_matched.loc[data_matched["treatment"] == 1, "bweight"].mean()
control_bweight = data_matched.loc[data_matched["treatment"] == 0, "bweight"].mean()
ATT = treated_bweight - control_bweight
print("ATT:", ATT)