Date Columns in tidyfinance

library(tidyfinance)
library(dplyr)
library(lubridate)

Working with dates can be cumbersome, in particular when combining data sets from different contexts. When designing the output tables for the download_data() function we faced a trade-off between (i) preserving original column names that experienced users might know and for which the original sources provide documentation, and (ii) introducing consistent columns names across data sets where necessary. We decided to consistently use the date column with <date> column type across various data sets to facilitate joins and avoid unnecessary conversions. In this vignette, we walk you through some examples.

Let us start with the popular Fama-French factors that come in daily, weekly, and monthly versions. After downloading the monthly data, you can see that each row refers to a specific month of a year and the date column refers to the first day of a month.

factors_ff_3_monthly <- download_data("factors_ff_3_monthly")
#> No start_date or end_date provided. Returning the full data set.
factors_ff_3_monthly
#> # A tibble: 1,175 × 5
#>    date       mkt_excess     smb     hml risk_free
#>    <date>          <dbl>   <dbl>   <dbl>     <dbl>
#>  1 1926-07-01     0.0296 -0.0256 -0.0243    0.0022
#>  2 1926-08-01     0.0264 -0.0117  0.0382    0.0025
#>  3 1926-09-01     0.0036 -0.014   0.0013    0.0023
#>  4 1926-10-01    -0.0324 -0.0009  0.007     0.0032
#>  5 1926-11-01     0.0253 -0.001  -0.0051    0.0031
#>  6 1926-12-01     0.0262 -0.0003 -0.0005    0.0028
#>  7 1927-01-01    -0.0006 -0.0037  0.0454    0.0025
#>  8 1927-02-01     0.0418  0.0004  0.0294    0.0026
#>  9 1927-03-01     0.0013 -0.0165 -0.0261    0.003 
#> 10 1927-04-01     0.0046  0.003   0.0081    0.0025
#> # ℹ 1,165 more rows

Why did we use the first day of a month here? This convention has the big advantage that you can easily add or subtract months or compute date differences. For instance, in the following code chunk, we subtract 3 months from the original date and then compute the difference in months:

factors_ff_3_monthly |> 
  select(date) |> 
  mutate(date_lag3 = date %m-% months(3),
         date_difference = interval(date_lag3, date) %/% months(1))
#> # A tibble: 1,175 × 3
#>    date       date_lag3  date_difference
#>    <date>     <date>               <dbl>
#>  1 1926-07-01 1926-04-01               3
#>  2 1926-08-01 1926-05-01               3
#>  3 1926-09-01 1926-06-01               3
#>  4 1926-10-01 1926-07-01               3
#>  5 1926-11-01 1926-08-01               3
#>  6 1926-12-01 1926-09-01               3
#>  7 1927-01-01 1926-10-01               3
#>  8 1927-02-01 1926-11-01               3
#>  9 1927-03-01 1926-12-01               3
#> 10 1927-04-01 1927-01-01               3
#> # ℹ 1,165 more rows

As another example, the date column is arguably straight forward for daily data, such as daily Fama-French factors:

factors_ff_3_daily <- download_data("factors_ff_3_daily")
#> No start_date or end_date provided. Returning the full data set.
factors_ff_3_daily
#> # A tibble: 25,754 × 5
#>    date       mkt_excess     smb     hml risk_free
#>    <date>          <dbl>   <dbl>   <dbl>     <dbl>
#>  1 1926-07-01     0.001  -0.0025 -0.0027   0.00009
#>  2 1926-07-02     0.0045 -0.0033 -0.0006   0.00009
#>  3 1926-07-06     0.0017  0.003  -0.0039   0.00009
#>  4 1926-07-07     0.0009 -0.0058  0.0002   0.00009
#>  5 1926-07-08     0.0021 -0.0038  0.0019   0.00009
#>  6 1926-07-09    -0.0071  0.0043  0.0057   0.00009
#>  7 1926-07-10     0.0062 -0.0053 -0.001    0.00009
#>  8 1926-07-12     0.0004 -0.0003  0.0064   0.00009
#>  9 1926-07-13     0.0048 -0.0028 -0.002    0.00009
#> 10 1926-07-14     0.0004  0.0007 -0.0043   0.00009
#> # ℹ 25,744 more rows

Each row refers to a trading date, so date refers to the actual date. So computing date differences to the last row might yield an integer larger than 1.

factors_ff_3_daily |> 
  select(date) |> 
  mutate(date_difference = interval(lag(date), date) %/% days(1)) 
#> # A tibble: 25,754 × 2
#>    date       date_difference
#>    <date>               <dbl>
#>  1 1926-07-01              NA
#>  2 1926-07-02               1
#>  3 1926-07-06               4
#>  4 1926-07-07               1
#>  5 1926-07-08               1
#>  6 1926-07-09               1
#>  7 1926-07-10               1
#>  8 1926-07-12               2
#>  9 1926-07-13               1
#> 10 1926-07-14               1
#> # ℹ 25,744 more rows

Similarly, working with daily CRSP data is straight-forward:

crsp_daily <- download_data("wrds_crsp_daily")
crsp_daily |> 
  select(permno, date)

However, enforcing this consistency poses a challenge for monthly CRSP data. As CRSP calculates the monthly returns based on the last available price for each month, we decided to use date following our convention of beginning-of-month and use calculation_date as the date that CRSP uses to calculate the returns. In fact, the CRSP 2.0 update introduced the column mthcaldt instead of date, hence our name calculation_date. For most applications, you will not need calculation_date, so it serves primarily an informational purpose.

crsp_monthly <- download_data("wrds_crsp_monthly")
crsp_monthly |> 
  select(permno, date, calculation_date)

A similar challenge arises with Compustat data. The datadate column of the original Compustat tables refers to the date of the fiscal year end or quarter. For many cases, this date is the December 31st of a fiscal year, but they can be any end of month throughout a calendar year. To comply with our notion of consistency, we introduce the column date, which is simply the beginning of the month of each datadate:

compustat_annual <- download_data("wrds_compustat_annual")
compustat_annual |> 
  select(gvkey, date, datadate)

If you want to use year or month information in your application, you can simply parse them from date:

compustat_annual |> 
  mutate(year = year(date),
         month = month(date)) |> 
  select(gvkey, date, datadate, year, month)

For completeness, you can do the same for quarterly Compustat data:

compustat_quarterly <- download_data("wrds_compustat_quarterly")
compustat_quarterly |> 
  select(gvkey, date, datadate) |> 
  mutate(year = year(date),
         month = month(date),
         quarter = quarter(date))

Joining multiple data sources

As a first example, let us combine the Fama-French factors with monthly CRSP returns. We can simply run:

crsp_monthly |> 
  left_join(factors_ff_3_monthly, join_by(date)) |> 
  select(permno, date, risk_free, mkt_excess, smb, hml)

Similarly, the same logic applies to daily CRSP returns:

crsp_daily |> 
  left_join(factors_ff_3_daily, join_by(date)) |> 
  select(permno, date, risk_free, mkt_excess, smb, hml)

Of course, be careful not to join data sets from different contexts, e.g., joining daily Fama-French data to monthly CRSP data. However, the worst that can happen is that you get meaningless or missing rows.

Now, let us move to the involved case of joining CRSP with Compustat data. To achieve that, we have to first load the CCM links and join them to the monthly CRSP data (the same logic applies to daily CRSP data).

ccm_links <- download_data("wrds_ccm_links")
crsp_monthly <- crsp_monthly |> 
  left_join(ccm_links, join_by(permno), relationship = "many-to-many") |> 
  filter(between(date, linkdt, linkenddt)) |> 
  select(-c(linkdt, linkenddt))

So if you want to join information from Compustat for the same date, you just join by gvkey and date:

crsp_monthly |> 
  left_join(
    compustat_annual, join_by(gvkey, date)
  )

If you want to lag the Compustat information by 6 months to incorporate the fact that there is usually a considerable lag between fiscal year ends and the release of accounting reports, you can just add months to the date column in Compustat:

crsp_monthly |> 
  left_join(
    compustat_annual |> 
      mutate(date = date %m+% months(6)), 
    join_by(gvkey, date)
  )

If you prefer to follow the Fama-French protocol for joining company fundamentals to stock returns (see our chapter on this procedure here), you should do:

crsp_monthly |> 
  left_join(
    compustat_annual |>
      mutate(date = ymd(paste0(year(date) + 1, "0701"))),
    join_by(gvkey, date)
  )