Applications of Python in Econometrics

Outline

  • Introduction
    • Econometrics
    • Choosing software and languages
  • The never-ending regression: linear regression models and OLS in Python
    • A first attempt at linear regression in Python: the log wage equation
    • The basic framework of linear regression: estimation and inference
    • A relevant Python library: Statsmodels
    • A DIY attempt: OLS estimation
    • Extending the log wage equation, part 1: adding hypothesis tests
    • Extending the log wage equation, part 2: adding dummy variables
  • The curse of endogeneity and its remedies: more models and methods in Python
    • The unavoidable endogeneity problem
    • Panel data models: addressing endogeneity from time-invariant factors
    • Instrumental variable models: the ultimate weapon?
    • A different path: the counterfactual framework and matching methods in Python
  • Further topics on econometrics and Python
    • The constraints and challenges of linear models
    • Python and the path of empirical research: a promising future

Introduction

Econometrics

Keywords

  • Observational data

  • Causality

Choosing software and languages

  • Stata - the most popular econometrics package today

  • R - an all-rounder for statistics and econometrics

  • Python - a rising star in numerical computing, statistics, and econometrics

The never-ending regression: linear regression models and OLS in Python

A first attempt at linear regression in Python: the log wage equation

In [1]:
import pandas as pd
import statsmodels.formula.api as smf

# import dataset
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")

# ols
results = smf.ols(formula='lwage ~ educ + exper + expersq', data=df).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  lwage   R-squared:                       0.157
Model:                            OLS   Adj. R-squared:                  0.151
Method:                 Least Squares   F-statistic:                     26.29
Date:                Mon, 06 Jul 2020   Prob (F-statistic):           1.30e-15
Time:                        09:27:59   Log-Likelihood:                -431.60
No. Observations:                 428   AIC:                             871.2
Df Residuals:                     424   BIC:                             887.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.5220      0.199     -2.628      0.009      -0.912      -0.132
educ           0.1075      0.014      7.598      0.000       0.080       0.135
exper          0.0416      0.013      3.155      0.002       0.016       0.067
expersq       -0.0008      0.000     -2.063      0.040      -0.002   -3.82e-05
==============================================================================
Omnibus:                       77.792   Durbin-Watson:                   1.961
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              300.917
Skew:                          -0.753   Prob(JB):                     4.54e-66
Kurtosis:                       6.822   Cond. No.                     2.21e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.21e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The basic framework of linear regression: estimation and inference

The basic form of the multiple linear regression model is

$$y=\beta_{0}+\beta_{1} x_{1}+\beta_{2} x_{2}+\beta_{3} x_{3}+\ldots+\beta_{k} x_{k}+\mu$$

or, in matrix form,

$$\mathbf{y}=\mathbf{X} \boldsymbol{\beta}+\boldsymbol{\varepsilon}$$

The OLS estimator of the coefficients is

$$\mathbf{\hat{\beta}}=\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \mathbf{y}$$


The Gauss-Markov assumptions for the multiple linear regression model

  • MLR.1 Linearity in parameters

  • MLR.2 Random sampling

  • MLR.3 No perfect collinearity: no explanatory variable is constant, and there are no exact linear relationships among the explanatory variables

  • MLR.4 Zero conditional mean: $E(\mu|x_{1},x_{2},...,x_{k}) = 0$

When assumption MLR.4 holds, we say the model has exogenous explanatory variables; otherwise, $x_{j}$ is called an endogenous explanatory variable.

  • MLR.5 Homoskedasticity: $Var(\mu|x_{1},x_{2},...,x_{k}) = \sigma^{2}$


Sampling variances of the OLS slope estimators

Under assumptions MLR.1 through MLR.5, conditional on the sample values of the explanatory variables, for all $j=1,2,...,k$,

$$Var(\hat{\beta}_{j}) = \frac{\sigma^{2}}{SST_{j}(1-R_{j}^{2})}$$

where $SST_{j} = \sum^{n}_{i=1} (x_{ij}-\bar{x}_{j})^{2}$ is the total sample variation in $x_{j}$, and $R_{j}^{2}$ is the $R^{2}$ from regressing $x_{j}$ on all the other explanatory variables (including an intercept).

In matrix form,

$$\operatorname{Var}[\mathbf{\hat{\beta}} | \mathbf{X}]=s^{2}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1}$$

where $s^{2}=\frac{\mathbf{e}^{\prime} \mathbf{e}}{n-K}$.
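
The $1/(1-R_{j}^{2})$ term is the variance inflation factor (VIF). As a quick illustration, it can be computed with statsmodels; a minimal sketch, assuming the df loaded in In [1]:

In [ ]:
import patsy
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the other
# regressors; the value for the intercept column is not meaningful
y_vif, X_vif = patsy.dmatrices("lwage ~ educ + exper + expersq", data=df, return_type="dataframe")
vifs = pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])], index=X_vif.columns)
print(vifs)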


The Gauss-Markov theorem

Under assumptions MLR.1 through MLR.5, $\hat{\beta}_{0},\hat{\beta}_{1},...,\hat{\beta}_{k}$ are the best linear unbiased estimators (BLUE) of $\beta_{0},\beta_{1},...,\beta_{k}$.


Testing a hypothesis about a single population parameter: the $t$ test

$$\left(\hat{\beta}_{j}-\beta_{j}\right) / \operatorname{se}\left(\hat{\beta}_{j}\right) \sim t_{n-k-1}=t_{d f}$$

where $k+1$ is the number of unknown parameters in the population model $y=\beta_{0}+\beta_{1} x_{1}+\ldots+\beta_{k} x_{k}+\mu$ ($k$ slope parameters plus the intercept $\beta_{0}$), and $n-k-1$ is the degrees of freedom (df).
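
As a sanity check, the $t$ statistics and $p$ values reported by summary() can be reproduced by hand; a minimal sketch, assuming the results object from In [1]:

In [ ]:
import numpy as np
from scipy import stats

# t_j = beta_hat_j / se(beta_hat_j) for H0: beta_j = 0
t_stats = results.params / results.bse

# two-sided p values from the t distribution with n - k - 1 degrees of freedom
p_values = 2 * stats.t.sf(np.abs(t_stats), df=results.df_resid)
print(t_stats)
print(p_values)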


Testing multiple linear restrictions: the $F$ test

$$F \equiv \frac{\left(\mathrm{SSR}_{r}-\mathrm{SSR}_{u r}\right) / q}{\mathrm{SSR}_{u r} /(n-k-1)}$$

where $\mathrm{SSR}_{r}$ is the residual sum of squares of the restricted model, $\mathrm{SSR}_{ur}$ is that of the unrestricted model, and $q$ is the number of restrictions imposed.

Under $H_{0}$, the $F$ statistic follows an $F$ distribution with $(q, n-k-1)$ degrees of freedom, written

$$F \sim F_{q, n-k-1}$$
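
The $F$ statistic can likewise be computed by hand from the two residual sums of squares; a minimal sketch, assuming the mroz data from In [1], which previews the parental-education restriction tested later in this section:

In [ ]:
from scipy import stats

# unrestricted vs. restricted model for H0: motheduc = 0 and fatheduc = 0 (q = 2)
res_ur = smf.ols('lwage ~ educ + exper + expersq + motheduc + fatheduc', data=df).fit()
res_r = smf.ols('lwage ~ educ + exper + expersq', data=df).fit()

q = 2
F = ((res_r.ssr - res_ur.ssr) / q) / (res_ur.ssr / res_ur.df_resid)
p_value = stats.f.sf(F, q, res_ur.df_resid)
print(F, p_value)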

A relevant Python library: Statsmodels

statsmodels is a Python module for estimating and testing a wide range of statistical models.

It can be installed via pip:

pip install statsmodels

OLS estimation

  • Using the ols function
import statsmodels.formula.api as smf

mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
res = mod.fit()
print(res.summary())
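
The formula mini-language (implemented by patsy) also accepts transformations directly inside the string; for instance, a hypothetical formula assuming a df with wage, educ, exper, and region columns:

import numpy as np
import statsmodels.formula.api as smf

# np.log() applies a function, I() protects arithmetic, C() marks a categorical variable
mod = smf.ols(formula='np.log(wage) ~ educ + exper + I(exper**2) + C(region)', data=df)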

A DIY attempt: OLS estimation

  • How to obtain the OLS coefficient estimates
In [2]:
import patsy
import numpy as np

# design matrices from the same formula as In [1]
y, X = patsy.dmatrices("lwage ~ educ + exper + expersq", data=df, return_type="dataframe")

# beta_hat = (X'X)^{-1} X'y
beta = np.linalg.inv(np.transpose(X).dot(X)).dot(np.transpose(X)).dot(y)

coefs = pd.DataFrame(beta, index=["constant", "educ", "exper", "expersq"], columns=["coef."])
coefs
Out[2]:
              coef.
constant  -0.522041
educ       0.107490
exper      0.041567
expersq   -0.000811
  • How to obtain the standard errors?
In [3]:
# residuals
e = np.array(y) - np.array(X.dot(beta))

# s^2 = e'e / (n - k); Var(beta_hat) = s^2 * diag((X'X)^{-1})
xtx_inv_diag = np.diagonal(np.linalg.inv(np.transpose(X).dot(X)))
std_err = np.sqrt(np.transpose(e).dot(e)/(X.shape[0]-X.shape[1])*xtx_inv_diag)

std_errs = pd.DataFrame(np.transpose(std_err), index=["constant", "educ", "exper", "expersq"], columns=["Std. Err."])
std_errs
Out[3]:
          Std. Err.
constant   0.198632
educ       0.014146
exper      0.013175
expersq    0.000393
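
The DIY estimates can be checked directly against statsmodels, using the results object from In [1]:

In [ ]:
# should reproduce the DIY coefficients and standard errors above
print(results.params)
print(results.bse)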

Extending the log wage equation, part 1: adding hypothesis tests

For the OLS results, we use $t$ and $F$ tests to test hypotheses about the coefficients.

  • Test 1: Is the return to education 10%?
In [4]:
import pandas as pd
import statsmodels.formula.api as smf

# import dataset
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")

# ols
results = smf.ols(formula='lwage ~ educ + exper + expersq', data=df).fit()

# t test
hypotheses = 'educ = 0.10'
t_test = results.t_test(hypotheses)
print(t_test)
                             Test for Constraints                             
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0             0.1075      0.014      0.529      0.597       0.080       0.135
==============================================================================
  • Test 2: Does parents' education affect wages?
In [6]:
import pandas as pd
import statsmodels.formula.api as smf

# import dataset
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")

# ols
results = smf.ols(formula='lwage ~ educ + exper + expersq + motheduc + fatheduc', data=df).fit()
print(results.summary(), "\n\n")

# F test: H0: motheduc = 0 and fatheduc = 0
hypotheses = '(motheduc = 0), (fatheduc = 0)'
f_test = results.f_test(hypotheses)

# f_test.fvalue may be a scalar or a 1x1 array, depending on the statsmodels version
print("F-statistic: {:.4}, P value: {:.4}".format(float(f_test.fvalue), f_test.pvalue))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  lwage   R-squared:                       0.163
Model:                            OLS   Adj. R-squared:                  0.153
Method:                 Least Squares   F-statistic:                     16.45
Date:                Mon, 06 Jul 2020   Prob (F-statistic):           7.65e-15
Time:                        09:28:01   Log-Likelihood:                -430.00
No. Observations:                 428   AIC:                             872.0
Df Residuals:                     422   BIC:                             896.3
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.4704      0.201     -2.345      0.020      -0.865      -0.076
educ           0.1196      0.016      7.534      0.000       0.088       0.151
exper          0.0415      0.013      3.156      0.002       0.016       0.067
expersq       -0.0008      0.000     -2.136      0.033      -0.002   -6.69e-05
motheduc      -0.0158      0.012     -1.317      0.189      -0.039       0.008
fatheduc      -0.0052      0.011     -0.459      0.646      -0.028       0.017
==============================================================================
Omnibus:                       74.708   Durbin-Watson:                   1.933
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              276.114
Skew:                          -0.735   Prob(JB):                     1.10e-60
Kurtosis:                       6.650   Cond. No.                     2.24e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.24e+03. This might indicate that there are
strong multicollinearity or other numerical problems. 


F-statistic: 1.587, P value: 0.2058

Extending the log wage equation, part 2: adding dummy variables

  • Question: Do women face a disadvantage in the labor market?
In [7]:
import pandas as pd
import statsmodels.formula.api as smf

# import dataset
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/wage1.dta")

# ols
results = smf.ols(formula='lwage ~ educ + exper + expersq + female*educ', data=df).fit()
print(results.summary(), '\n\n')

# F test: H0: female = 0 and female:educ = 0
hypotheses = '(female = 0), (female:educ = 0)'
f_test = results.f_test(hypotheses)

# f_test.fvalue may be a scalar or a 1x1 array, depending on the statsmodels version
print("F-statistic: {:.4}, P value: {:.4}".format(float(f_test.fvalue), f_test.pvalue))
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  lwage   R-squared:                       0.400
Model:                            OLS   Adj. R-squared:                  0.394
Method:                 Least Squares   F-statistic:                     69.22
Date:                Mon, 06 Jul 2020   Prob (F-statistic):           2.01e-55
Time:                        09:28:02   Log-Likelihood:                -279.27
No. Observations:                 526   AIC:                             570.5
Df Residuals:                     520   BIC:                             596.1
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.3874      0.123      3.156      0.002       0.146       0.628
educ            0.0844      0.009      9.639      0.000       0.067       0.102
exper           0.0389      0.005      8.056      0.000       0.029       0.048
expersq        -0.0007      0.000     -6.377      0.000      -0.001      -0.000
female         -0.3294      0.172     -1.910      0.057      -0.668       0.009
female:educ    -0.0006      0.013     -0.046      0.963      -0.027       0.026
==============================================================================
Omnibus:                        7.135   Durbin-Watson:                   1.775
Prob(Omnibus):                  0.028   Jarque-Bera (JB):               10.584
Skew:                           0.008   Prob(JB):                      0.00503
Kurtosis:                       3.695   Cond. No.                     8.29e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.29e+03. This might indicate that there are
strong multicollinearity or other numerical problems. 


F-statistic: 43.01, P value: 5.181e-18

The curse of endogeneity and its remedies: more models and methods in Python

The unavoidable endogeneity problem

For econometrics, if the goal is causal interpretation, then endogeneity is the biggest obstacle.

For example, the omitted variable problem is a typical endogeneity problem.


Case study: the wage equation
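
In the wage equation, unobserved ability is the classic omitted variable: it affects wages and is correlated with education. In the two-regressor case, if the true model is $y=\beta_{0}+\beta_{1} x_{1}+\beta_{2} x_{2}+\mu$ but $x_{2}$ is omitted, the simple regression estimator $\tilde{\beta}_{1}$ satisfies

$$E(\tilde{\beta}_{1})=\beta_{1}+\beta_{2} \tilde{\delta}_{1}$$

where $\tilde{\delta}_{1}$ is the slope from regressing $x_{2}$ on $x_{1}$; the bias term $\beta_{2}\tilde{\delta}_{1}$ vanishes only if $x_{2}$ is irrelevant ($\beta_{2}=0$) or uncorrelated with $x_{1}$ ($\tilde{\delta}_{1}=0$).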

Panel data models: addressing endogeneity from time-invariant factors

For example, consider a model with a single explanatory variable:

$$y_{it} = \beta_{1}x_{it} + \alpha_{i} + \mu_{it},\quad t=1,2,...,T$$

If $\alpha_{i}$ is the source of the endogeneity, fixed effects estimation removes it.
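
To see why, average the equation over $t$ for each $i$ and subtract the result from the original equation (the within transformation), which eliminates $\alpha_{i}$:

$$y_{it}-\bar{y}_{i}=\beta_{1}\left(x_{it}-\bar{x}_{i}\right)+\left(\mu_{it}-\bar{\mu}_{i}\right)$$

where $\bar{y}_{i}=T^{-1} \sum_{t=1}^{T} y_{it}$, and similarly for $\bar{x}_{i}$ and $\bar{\mu}_{i}$. This is the estimator that PanelOLS computes below with entity_effects=True.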

In [8]:
import pandas as pd
import patsy
from linearmodels.panel import PanelOLS

# import dataset and set the (entity, time) MultiIndex required by linearmodels
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/jtrain.dta")
df = df.set_index(["fcode", "year"])

y, X = patsy.dmatrices("lscrap ~ d88 + d89 + grant + grant_1 + lsales + lemploy", data=df, return_type="dataframe")
mod = PanelOLS(y, X, entity_effects=True)
res = mod.fit()
print(res)
Note: linearmodels is a separate package from statsmodels and must be installed first, e.g. pip install linearmodels.

Instrumental variable models: the ultimate weapon?

Case study: the wage equation again


Basic assumptions for an instrumental variable

Suppose there is an observable variable $z$ that satisfies two assumptions:

  • Instrument exogeneity: $Cov(z,\mu) = 0$

  • Instrument relevance: $Cov(z,x) \neq 0$

Then $z$ is called an instrumental variable for $x$, or simply an instrument for $x$.
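
Under these two assumptions, in the just-identified case the IV estimator has the closed form

$$\hat{\boldsymbol{\beta}}_{IV}=\left(\mathbf{Z}^{\prime} \mathbf{X}\right)^{-1} \mathbf{Z}^{\prime} \mathbf{y}$$

and with more instruments than endogenous regressors, 2SLS first projects $\mathbf{X}$ onto the column space of $\mathbf{Z}$:

$$\hat{\boldsymbol{\beta}}_{2SLS}=\left(\mathbf{X}^{\prime} \mathbf{P}_{Z} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \mathbf{P}_{Z} \mathbf{y}, \quad \mathbf{P}_{Z}=\mathbf{Z}\left(\mathbf{Z}^{\prime} \mathbf{Z}\right)^{-1} \mathbf{Z}^{\prime}$$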

In [ ]:
import pandas as pd
import patsy
from linearmodels.iv import IV2SLS

# import dataset
df = pd.read_stata("http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta")

y, X = patsy.dmatrices("lwage ~ educ + exper + expersq + motheduc + fatheduc", data=df, return_type="dataframe")
  • OLS estimation
In [ ]:
# with no endogenous regressors and no instruments, IV2SLS reduces to OLS
res_ols = IV2SLS(y, X[['Intercept','educ','exper','expersq']], None, None).fit(cov_type='unadjusted')
print(res_ols)
  • IV/2SLS estimation
In [ ]:
# IV2SLS(dependent, exog, endog, instruments): educ is instrumented by motheduc and fatheduc
res_second = IV2SLS(y, X[['Intercept','exper','expersq']], X[['educ']], X[['motheduc','fatheduc']]).fit(cov_type='unadjusted')
print(res_second)
In [ ]:
from linearmodels.iv import compare

print(compare({'OLS':res_ols, '2SLS':res_second}))

A different path: the counterfactual framework and matching methods in Python

The Rubin Causal Model

The potential outcomes of an individual are

$$Y_{i}=\begin{cases} Y_{1 i} & D_{i}=1 \\ Y_{0 i} & D_{i}=0 \end{cases} = Y_{0 i}+\left(Y_{1 i}-Y_{0 i}\right) D_{i}$$

For any individual, only one of $Y_{1 i}$ and $Y_{0 i}$ is observed, so causal inference can be viewed as a missing data problem. The average treatment effect (ATE) is

$$\tau_{A T E}=E\left(Y_{1 i}-Y_{0 i}\right)$$

and the average treatment effect on the treated (ATT) is

$$\tau_{A T T}=E\left(Y_{1 i}-Y_{0 i} | D_{i}=1\right)$$

The observed difference in group means can therefore be decomposed as

$$\begin{aligned} E\left[Y_{i} | D_{i}=1\right]-E\left[Y_{i} | D_{i}=0\right] &=E\left[Y_{1 i} | D_{i}=1\right]-E\left[Y_{0 i} | D_{i}=1\right] \\ &+E\left[Y_{0 i} | D_{i}=1\right]-E\left[Y_{0 i} | D_{i}=0\right] \end{aligned}$$

The first term is the average causal effect of the treatment; the second term is selection bias. Under random assignment, $D_{i}$ is independent of the potential outcomes, and the causal effect simplifies further:

$$E\left[Y_{1 i} | D_{i}=1\right]-E\left[Y_{0 i} | D_{i}=1\right]=E\left[Y_{1 i}-Y_{0 i} | D_{i}=1\right]=E\left[Y_{1 i}-Y_{0 i}\right]$$

Matching and the propensity score

Regression and matching are both strategies for controlling for covariates; in fact, regression can be viewed as a particular matching estimator, one that applies a specific weighting scheme (Angrist, 2008).

The propensity score theorem

If the conditional independence assumption holds, i.e. $\left\{Y_{0 i}, Y_{1 i}\right\} \perp D_{i} | X_{i}$, then conditional on the propensity score $p\left(X_{i}\right)$, a scalar function of the covariate vector, the potential outcomes remain independent of treatment status:

$$\left\{Y_{0 i}, Y_{1 i}\right\} \perp D_{i} | p\left(X_{i}\right)$$

where

$$p\left(X_{i}\right) \equiv E\left[D_{i} | X_{i}\right]=P\left[D_{i}=1 | X_{i}\right]$$
In [ ]:
import numpy as np
import pandas as pd
from pscore_match.pscore import PropensityScore
from pscore_match.match import Match, whichMatched

df = pd.read_stata("https://www.stata-press.com/data/r16/cattaneo2.dta")
df["treatment"] = 0
df.loc[df["mbsmoke"]=="smoker", "treatment"] = 1
treatment = np.array(df["treatment"])
covariates = df.loc[:, ["medu", "mage", "alcohol"]]

# estimate propensity scores and form one-to-one matches (with replacement)
pscore = PropensityScore(treatment, covariates).compute()
pairs = Match(treatment, pscore)
pairs.create(method='one-to-one', replace=True)
data_matched = whichMatched(pairs, pd.DataFrame({'pscore': pscore, 'treatment': treatment, 'bweight': df["bweight"]}))

# ATT: difference in mean birth weight within the matched sample (not the raw data)
treated_mean = data_matched.loc[data_matched["treatment"] == 1, "bweight"].mean()
control_mean = data_matched.loc[data_matched["treatment"] == 0, "bweight"].mean()
ATT = treated_mean - control_mean
print("ATT: " + str(ATT))

Further topics on econometrics and Python

The constraints and challenges of linear models

Python and the path of empirical research: a promising future