CS计算机代考程序代写 Panel Data in R
Panel Data in R
Panel Data in R
Davide Benedetti
Working with Panel datasets
Let’s import a dataset and perform some analysis on it.
# After changing the dir let’s load our dataset as usual
# setwd(“/your/folder”)
library(foreign)
prod <- read.dta("prod_balanced.dta")
library(stargazer)
stargazer(prod, type = "text")
##
## ==============================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------------------
## year 14,504 82.500 2.291 79 86
## sic4dig 14,504 3,344.621 249.930 3,111 3,909
## id 14,504 54,266.710 28,160.680 72 99,513
## go 14,504 102,449.400 302,269.600 665.176 5,539,650.000
## m 14,504 48,869.760 147,895.700 155.503 3,134,312.000
## l 14,504 62.029 107.409 5 1,696
## k 14,504 43,538.290 206,041.100 8.130 6,947,903.000
## sic3dig 14,504 334.093 25.020 311 390
## sic2dig 14,504 33.223 2.454 31 39
## minyear 14,504 79.000 0.000 79 79
## maxyear 14,504 86.000 0.000 86 86
## countyear 14,504 8.000 0.000 8 8
## va 14,504 53,579.600 177,919.800 143.804 4,870,663.000
## lnvaOl 14,504 5.941 0.891 2.501 9.094
## lnk 14,504 8.854 1.655 2.096 15.754
## --------------------------------------------------------------
head(prod)
## year sic4dig id go m l k sic3dig sic2dig minyear
## 1 86 3813 72 9824.879 5117.624 23 1575.752 381 38 79
## 2 79 3813 72 11606.232 5185.277 15 2402.043 381 38 79
## 3 85 3813 72 7129.540 4211.143 19 1727.427 381 38 79
## 4 80 3813 72 9808.034 6124.000 15 2602.932 381 38 79
## 5 82 3813 72 11243.599 3619.531 16 2335.772 381 38 79
## 6 83 3813 72 5957.458 2257.693 13 2102.915 381 38 79
## maxyear countyear va lnvaOl lnk
## 1 86 8 4707.255 5.321366 7.362488
## 2 86 8 6420.956 6.059272 7.784075
## 3 86 8 2918.397 5.034351 7.454388
## 4 86 8 3684.034 5.503714 7.864394
## 5 86 8 7624.067 6.166477 7.756098
## 6 86 8 3699.764 5.651075 7.651080
The object prod si our usual data.frame, i.e. a dataset with named columns. The difference from a standard cross-sectional data however, is that it contains two index columns, one for the id of the individual measured (e.g., people, stocks, firms, countries, etc.), and the other for the time the individual was measured (e.g., days, months, quarters, years, etc.).
A panel dataset is simply a cross-sectional data measured at different points in time. The additional time dimension will enable as to perform additional analyis as well as visualize the data either by time or by cross-section.
The number of units in a panel dataset could be the same every period. This case is known as balanced panel, and it happens when the same individuals are observed in every period with no changes. However, more often the number of individuals changes from one period to another. Think about a panel of companies where some companies might default and exit the sample after some years while new ones, that did not exist before, are created.
Our datset contains variables of several companies measured over a few years. We can use tidyverse to count how many companies we have every year, and see what years we are considering. And we have a constant of 1813 companies from 1979 to 1986 (balanced panel).
library(tidyverse)
prod %>%
group_by(year) %>% # this gives me the number of companies every year
summarise(n = n())
## # A tibble: 8 x 2
## year n
##
## 1 79 1813
## 2 80 1813
## 3 81 1813
## 4 82 1813
## 5 83 1813
## 6 84 1813
## 7 85 1813
## 8 86 1813
Plotting variables of Panel datasets
As previously said, we can exploit the two dimensions of the data to say something about both about the cross-sectional distribution of the variables, and well as their dynamic evolution over time.
In the following code, we are averaging variable K (capital) by id (company number) thus removing the time dimension, and we are then plotting its distribution with an histogram.
You can try to visualize the ranking of which company had highest average which geom_bar.
library(tidyverse)
prod %>%
group_by(id) %>%
summarise(k_ave = mean(lnk)) %>%
ggplot(aes(k_ave)) +
geom_histogram(aes(y = ..density..)) +
ggtitle(“Histogram of cross-sectional average of log of Capital”) +
xlab(“”) +
theme_minimal()
Below we are instead averaging the same variable by time, therefore killing the cross-section and transforming the panel into a time-series of the average of log of capital over time.
prod %>%
group_by(year) %>%
summarise(k_ave = mean(lnk), k_se = sd(lnk)/sqrt(n())) %>%
ggplot(aes(x = year, y = k_ave)) +
geom_line() +
geom_point(size = 0.5) +
geom_ribbon(aes(ymin = k_ave – 1.96*k_se,
ymax = k_ave + 1.96*k_se),
alpha = 0.2) +
ylab(“”) +
ggtitle(“Time-series of average of log of Capital”) +
theme_minimal()
We can also perform scatter plots of two variables colouring by time or id. In this dataset we have many id’s, so we are plotting only a few of them by selecting 8 random id’s among all unique id’s.
set.seed(800)
prod %>%
filter(id %in% sample(unique(id), size = 8)) %>%
mutate(`Company ID` = factor(id)) %>%
ggplot(aes(x = lnk, y = lnvaOl, color = `Company ID`)) +
geom_point(shape = 3) +
ylab(“”) +
xlab(“”) +
ggtitle(“Log of Production over Labour by log of Capital”) +
theme_minimal()
Working with dummies
Most of the variables in the dataset are sector codes, identifying the sector of the company. These variables are recorded as numbers, but they are qualitative variables. However, R will recognize and analyse them as quantitative variables unless we tell him not to do it.
In the following model, we am regressing log of production over labour (lnvaOl) on the 2 -digits sector code (sec2dig). However, R is estimating the model in the wrong way, considering this covariate as quantitative.
prod %>%
lm(lnvaOl ~ sic2dig, data = .) %>%
stargazer(type = “text”)
##
## ===============================================
## Dependent variable:
## —————————
## lnvaOl
## ———————————————–
## sic2dig 0.057***
## (0.003)
##
## Constant 4.060***
## (0.099)
##
## ———————————————–
## Observations 14,504
## R2 0.024
## Adjusted R2 0.024
## Residual Std. Error 0.880 (df = 14502)
## F Statistic 361.865*** (df = 1; 14502)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We need to convert this variable as factor, if we want that R interprets it as a qualitative variable. This way R will estimate a beta for every sector code.
Note that one sector code is missing (sector #31). The value of this sector is absorbed by the constant. This means that sector 31 has an average log productio over labour of 5.787, and all the other values have to be interpreted as difference from it. E.g., companies in industrial sector 32 have on average 7.1% more production over labour than compnaies in sector 31.
prod %>%
mutate(sic2dig = factor(sic2dig)) %>%
lm(lnvaOl ~ sic2dig, data = .) %>%
stargazer(type = “text”)
##
## ===============================================
## Dependent variable:
## —————————
## lnvaOl
## ———————————————–
## sic2dig32 0.071***
## (0.020)
##
## sic2dig33 -0.136***
## (0.027)
##
## sic2dig34 0.319***
## (0.030)
##
## sic2dig35 0.714***
## (0.024)
##
## sic2dig36 0.387***
## (0.042)
##
## sic2dig37 0.621***
## (0.077)
##
## sic2dig38 0.212***
## (0.023)
##
## sic2dig39 -0.337
## (0.215)
##
## Constant 5.787***
## (0.012)
##
## ———————————————–
## Observations 14,504
## R2 0.075
## Adjusted R2 0.074
## Residual Std. Error 0.857 (df = 14495)
## F Statistic 146.253*** (df = 8; 14495)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In the next two models, we are regressing the same variable (lnvaOl) on year (column 1), and on year transformed as factor (column 2). Can you explain the difference between the two models?
mod1 <- lm(lnvaOl ~ year, data = prod)
mod2 <- lm(lnvaOl ~ factor(year), data = prod)
stargazer(mod1, mod2, type = "text")
##
## =======================================================================
## Dependent variable:
## ---------------------------------------------------
## lnvaOl
## (1) (2)
## -----------------------------------------------------------------------
## year -0.012***
## (0.003)
##
## factor(year)80 0.072**
## (0.030)
##
## factor(year)81 0.151***
## (0.030)
##
## factor(year)82 0.064**
## (0.030)
##
## factor(year)83 -0.027
## (0.030)
##
## factor(year)84 0.001
## (0.030)
##
## factor(year)85 -0.075**
## (0.030)
##
## factor(year)86 0.036
## (0.030)
##
## Constant 6.952*** 5.913***
## (0.266) (0.021)
##
## -----------------------------------------------------------------------
## Observations 14,504 14,504
## R2 0.001 0.005
## Adjusted R2 0.001 0.005
## Residual Std. Error 0.890 (df = 14502) 0.889 (df = 14496)
## F Statistic 14.443*** (df = 1; 14502) 11.066*** (df = 7; 14496)
## =======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Panel regression
We can run panel regression in R with function plm from the homonymous package. To use plm, the dataset needs to be converted in panel format using function pdata.frame, in which we specify the names of the columns in the data containing the id and time index respectively.
library(plm)
prod_plm <- pdata.frame(prod, index = c("id", "year"))
Panel models with fixed effects are the most common types of panel regressions used in economics. They are simply linear models with a different intercept for either every individual or period in the dataset, or both. By default, function plm runs panel data models with individual fixed effects.
Below we are comparing the results from lm (column 1) and plm with individual fixed effects (column 2). While lm estimates one unique constant, plm estimates one for every company (they are not printed out but can be shown with function fixef).
mod1 <- lm(lnvaOl ~ lnk, data = prod)
mod2 <- plm(lnvaOl ~ lnk, data = prod_plm,
effect = "individual")
stargazer(mod1, mod2, type = "text",
omit.stat = c("rsq","adj.rsq"))
##
## ==========================================================================
## Dependent variable:
## ------------------------------------------------------
## lnvaOl
## OLS panel
## linear
## (1) (2)
## --------------------------------------------------------------------------
## lnk 0.283*** 0.062***
## (0.004) (0.014)
##
## Constant 3.434***
## (0.034)
##
## --------------------------------------------------------------------------
## Observations 14,504 14,504
## Residual Std. Error 0.757 (df = 14502)
## F Statistic 5,556.180*** (df = 1; 14502) 19.109*** (df = 1; 12690)
## ==========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
You can use argument effect in function plm to specify whether you want to use individual or time fixed effects, or both. See example below.
mod1 <- plm(lnvaOl ~ lnk, data = prod_plm,
effect = "individual")
mod2 <- plm(lnvaOl ~ lnk, data = prod_plm,
effect = "time")
mod3 <- plm(lnvaOl ~ lnk, data = prod_plm,
effect = "twoways")
stargazer(mod1, mod2, mod3, type = "text",
omit.stat = c("rsq","adj.rsq"))
##
## ============================================================================================
## Dependent variable:
## -------------------------------------------------------------------------------
## lnvaOl
## (1) (2) (3)
## --------------------------------------------------------------------------------------------
## lnk 0.062*** 0.283*** 0.038***
## (0.014) (0.004) (0.014)
##
## --------------------------------------------------------------------------------------------
## Observations 14,504 14,504 14,504
## F Statistic 19.109*** (df = 1; 12690) 5,554.331*** (df = 1; 14495) 7.061*** (df = 1; 12683)
## ============================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Technically, estimating an individual fixed effect panel model is the same as estimating a linear model including dummy variables for each individual. At the same time, estimating a time fixed effect panel model is the same as estimating a linear model including dummy variables for each time period, while a panel with both individual and time effects can be obtained by estimating a linear model with dummy variables for both individuals and times.
The difference is that plm performs the estimating much faster than lm and does not print automatically the different intercepts, but only the betas.
Below, you can see the difference between lm with time dummies and plm with time effects. The beta for lnk is identical, but lm prints all intercepts for every year while plm does not.
mod1 <- lm(lnvaOl ~ 0 + lnk + factor(year), data = prod)
mod2 <- plm(lnvaOl ~ lnk, data = prod_plm,
effect = "time")
stargazer(mod1, mod2, type = "text",
omit.stat = c("rsq","adj.rsq"))
##
## ===============================================================================
## Dependent variable:
## -----------------------------------------------------------
## lnvaOl
## OLS panel
## linear
## (1) (2)
## -------------------------------------------------------------------------------
## lnk 0.283*** 0.283***
## (0.004) (0.004)
##
## factor(year)79 3.412***
## (0.038)
##
## factor(year)80 3.469***
## (0.038)
##
## factor(year)81 3.542***
## (0.038)
##
## factor(year)82 3.465***
## (0.038)
##
## factor(year)83 3.384***
## (0.038)
##
## factor(year)84 3.422***
## (0.038)
##
## factor(year)85 3.352***
## (0.038)
##
## factor(year)86 3.467***
## (0.038)
##
## -------------------------------------------------------------------------------
## Observations 14,504 14,504
## Residual Std. Error 0.756 (df = 14495)
## F Statistic 100,275.000*** (df = 9; 14495) 5,554.331*** (df = 1; 14495)
## ===============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As a final note, you can estimate a linear model with function plm by setting model to pooling.
mod1 <- lm(lnvaOl ~ lnk, data = prod)
mod2 <- plm(lnvaOl ~ lnk, data = prod_plm,
model = "pooling")
stargazer(mod1, mod2, type = "text",
omit.stat = c("rsq","adj.rsq"))