0. Overview

The Panel Study of Income Dynamics (PSID) is the longest-running longitudinal household survey in the world. It provides invaluable data on topics including marriage, income, wealth, and health. However, converting raw PSID data files into analysis-ready datasets is complex and challenging, especially for new users.

This package addresses these challenges entirely within R, without requiring any other statistical software. By bridging these gaps, it aims to make PSID datasets more usable and manageable for researchers and analysts.

The package is available on GitHub. To install it, run the following in your R console:

devtools::install_github(repo = "Qcrates/psidread")
library(DiagrammeR)
library(psidread)

1. Introduction

File Structure of PSID: The main PSID data consist of two types of files: 1. single-year family files and 2. a cross-year individual file. The single-year family files contain the data collected in each wave from 1968 through 2021, with one record for each family interviewed in that year. These files hold family-level variables and are identified by the family Interview Number for that year. The cross-year individual file, by contrast, contains all individual-level variables collected from 1968 to 2021 in a single file. It includes both respondents and non-respondents, identified by the 1968 family Interview Number and Person Number (ER30001 and ER30002). Whenever family-level variables are involved, the files from multiple waves must therefore be merged before any further analysis.
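
As a rough illustration of why this merge is needed, the sketch below joins a toy individual file to a toy single-year family file in base R. The data values and the family-file column names (fam_id_2013, hh_age_2013) are hypothetical placeholders; only ER30001, ER30002 and ER34201 (the 2013 interview number stored in the individual file) are codes mentioned in this document. It is not how psidread performs the merge internally.

```r
# Toy cross-year individual file: one row per person.
# ER30001/ER30002 identify the person; ER34201 is the 2013 interview number.
ind <- data.frame(
  ER30001 = c(4, 4, 4),
  ER30002 = c(6, 7, 31),
  ER34201 = c(8684, 8684, 7300)
)

# Toy 2013 family file: one row per family.
# These column names are hypothetical placeholders, not real PSID codes.
fam_2013 <- data.frame(
  fam_id_2013 = c(8684, 7300),
  hh_age_2013 = c(55, 41)
)

# Attach the family-level value to every individual in that family.
merged <- merge(ind, fam_2013,
                by.x = "ER34201", by.y = "fam_id_2013",
                all.x = TRUE)
merged
```

Every individual keeps one row, and members of the same family share the same family-level value. Repeating this for each wave by hand is the work the package is meant to take over.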

Data Downloading Approach: The PSID website offers two primary methods for downloading data: 1. packaged files and 2. a customized shopping cart containing only selected variables. Each method has pros and cons:

Packaged Files
  Pros
    1. Reusable
    2. Avoid redundant downloading
    3. Easy to manage if you have multiple projects using PSID
  Cons
    1. Hard to merge manually
    2. Provided in ASCII format and require additional software such as SAS or Stata to process before being imported into R
    3. Even with paid software, packaged files still need to be unzipped and converted one by one before analysis

Customized File
  Pros
    1. Takes less space
    2. Options available for downloading data in a format that can be processed directly in R
    3. Waves already merged by PSID
  Cons
    1. The download procedure needs to be repeated if additional variables are added to the analysis
    2. The same variables are downloaded repeatedly if the user has multiple research projects using PSID data

Variable Name: A significant challenge when analyzing PSID data is that its variable names are not intuitive or interpretable (e.g. ER00000, V000). Renaming these variables manually can be a heavy workload for researchers working with multiple waves of data.
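
To make the renaming burden concrete, here is a minimal base R sketch that renames raw codes through a lookup vector. The readable names are chosen for illustration, and this is not the package's own renaming code.

```r
# Toy data frame that uses raw PSID-style codes as column names.
raw <- data.frame(ER53017 = c(55, 41), ER66017 = c(59, 43))

# Lookup from raw code to a readable name (names chosen for illustration).
lookup <- c(ER53017 = "hh_age_2013", ER66017 = "hh_age_2017")

# Rename only the columns that appear in the lookup; leave the rest as-is.
hit <- names(raw) %in% names(lookup)
names(raw)[hit] <- unname(lookup[names(raw)[hit]])
names(raw)
```

A lookup like this has to be maintained for every variable in every wave, which quickly becomes unwieldy when done by hand.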

Missing Waves: Survey questions in the PSID vary across waves, so some variables are not available in every wave. Detailed information about which questions were asked in each wave is available only on the cross-year index webpage, which is not convenient for quick reference. Creating a list of variables for each year by hand is possible, but tedious and inconvenient.

What the psidread package is designed to help with:

  • Create a table of data structure across multiple waves using the text that can be copied and pasted from the website

  • Unzip and convert the zipped files without additional help of other software

  • Read and merge the data files from multiple waves

  • Rename and reshape the dataset to fit the need for advanced analysis

2. Workflow of psidread Package

Although you can jump directly to a specific step, I strongly advise following the procedure in order without skipping any steps. Doing so yields replicable code for importing the PSID dataset. Skipping steps may also cause later steps to fail, particularly when a prerequisite for the code has not been met.

3. psid_str(): Build Your Table of Structure

3.2 If you copy and paste your Stata code…

This input format is inspired by psidtools, a Stata package developed by Professor Ulrich Kohler. The function therefore also offers an option for users who would like to move their work from Stata to R. You can copy and paste the part of your Stata code that follows psid use without making any changes. The only change required is to set the type argument to "integrated".

For example:

psid_varlist <- "|| religion_hh /// Household head's religious preference
    [97]ER11895 [99]ER15977 [01]ER20038 [03]ER23474 [05]ER27442 [07]ER40614 ///
    || denom_hh /// Household head's religious denominations
    [97]ER11896 [99]ER15978 [03]ER23475 [05]ER27443 [07]ER40615 ///"
psid_str(
  varlist = psid_varlist,
  type = "integrated"
)
##   year religion_hh denom_hh
## 2 1997     ER11895  ER11896
## 3 1999     ER15977  ER15978
## 4 2001     ER20038     <NA>
## 5 2003     ER23474  ER23475
## 6 2005     ER27442  ER27443
## 7 2007     ER40614  ER40615

Please note that it is the user's responsibility to make sure that the years and variable codes are correct. Do not include any ALL-YEAR variables (e.g. individual's sex, individual's birth order) in this function. They are declared in the idvars argument of psid_read() instead.
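
The bracketed tokens in the input can be read as year and code pairs. The sketch below shows one way to parse them in base R and how a wave that lacks a variable ends up as NA. It is an illustrative sketch only, not the parser used inside psid_str(), and the rule for expanding two-digit years is an assumption.

```r
# Parse "[yy]CODE" tokens into a year/code table.
# Illustrative only; this is not the parser used inside psid_str().
tokens <- c("[97]ER11896", "[99]ER15978", "[03]ER23475")

yy   <- as.integer(sub("^\\[([0-9]{2})\\].*$", "\\1", tokens))
code <- sub("^\\[[0-9]{2}\\]", "", tokens)

# Assumption: two-digit years of 68 or above are 19xx, the rest are 20xx,
# because the PSID began in 1968.
year <- ifelse(yy >= 68, 1900L + yy, 2000L + yy)

denom <- data.frame(year = year, denom_hh = code)

# A wave requested for another variable but absent here becomes NA after a merge.
waves <- data.frame(year = c(1997L, 1999L, 2001L, 2003L))
tab <- merge(waves, denom, all.x = TRUE)
tab
```

The 2001 row has no code for denom_hh, which mirrors the NA shown in the table of structure above.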

4. psid_unzip(): Prepare Data Files

This function unzips the data files downloaded from the PSID website and converts them to .rda files, a data format that is easier to manage in R.

4.1 Packaged Files

Please put your packaged files, in .zip format, in one directory. You can use the same directory for input and output, or set exdir to a different directory so that the output .rda files are kept separate from the original downloaded data files.

Please note that the example below uses system.file... and tempdir() only because it reads the data file shipped in the package folder. In practice, use your own directory path, in a format like "your/directory/pathway/psid/file/folder".

input_directory <- system.file(package = "psidread","extdata")
output_directory <- tempdir()
psid_unzip(indir = input_directory,
           exdir = output_directory,
           zipped = TRUE,
           type = "package",
           filename = NA)

If you have already unzipped ALL the .zip data files, you can skip the unzipping by setting the zipped argument to FALSE:

psid_unzip(indir = input_directory,
           exdir = output_directory,
           zipped = FALSE,
           type = "package",
           filename = NA)

Unzipping and converting all the packaged files takes some time if your analysis involves many waves of data. Once this function has been executed and all the .rda files needed to generate your dataset are ready, you do not have to run it again each time before you run the psid_read() and psid_reshape() functions.
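
One way to avoid repeating the conversion is to check whether the expected .rda files already exist before calling psid_unzip(). The helper rda_ready() and the file names below are hypothetical and only illustrate the idea; the actual names of the .rda files that psid_unzip() writes are not assumed here.

```r
# Return TRUE when every expected .rda file is already present in `dir`.
# `rda_ready()` and the file names below are hypothetical, for illustration only.
rda_ready <- function(dir, expected) {
  all(file.exists(file.path(dir, expected)))
}

# Demonstration with a throwaway directory and a dummy .rda file.
demo_dir <- file.path(tempdir(), "psid_rda_demo")
dir.create(demo_dir, showWarnings = FALSE)
dummy <- data.frame(x = 1)
save(dummy, file = file.path(demo_dir, "wave_a.rda"))

rda_ready(demo_dir, "wave_a.rda")                   # the one file is present
rda_ready(demo_dir, c("wave_a.rda", "wave_b.rda"))  # the second file is missing

# In a real script the check could guard the conversion step, e.g.
# if (!rda_ready(output_directory, expected_files)) psid_unzip(...)
```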

4.2 Single Customized File

If you download the dataset from your shopping cart with selected variables, you can also use this function to unzip and convert the files. Note that you should choose the "ASCII Data With SAS Statements" option when downloading. Compared to packaged files, you will need to

  1. specify the name of your data file to unzip and convert in the filename argument
  2. change the type argument to "single"

For example:

psid_unzip(indir = input_directory,
           exdir = output_directory,
           zipped = TRUE,
           type = "single",
           filename = "J327825.zip")

The user can also use psid_unzip() in this way to unzip and convert specific packaged data files, especially when they are adding one wave to their dataset but do not want to go through the whole directory again.

5. psid_read(): Read Data

Please make sure you have completed the checklist below before you run the psid_read() function:

  • Run the psid_str() function and get the table of data structure stored in the global environment.

  • Run the psid_unzip() function and have all the data files prepared in .rda format.

  • Download the cross-year individual packaged file (if you use packaged files), or include at least one individual-level variable in your customized dataset (if you use a customized file). Please do this even if you do not need individual-level variables; psid_reshape() can collapse your dataset to the household level later if needed.

All items above checked? Let's move on to this core step!

5.1 Packaged Files

The package is especially useful when processing data across multiple packaged datasets. For example:

psid_varlist <- c(" hh_age || [13]ER53017 [17]ER66017", " p_age || [13]ER34204")
str_df <- psid_str(varlist = psid_varlist, type = "separated")
input_directory <- system.file(package = "psidread", "extdata")
psid_df <- psid_read(indir = input_directory, str_df = str_df,
                     idvars = c("ER30000"), type = "package", filename = NA)
## Data for year 2013 has been added!
## Data for year 2017 has been added!
str(psid_df)
## 'data.frame':    5 obs. of  11 variables:
##  $ ER34201: num  8684 8684 7300 8569 8698
##  $ ER34501: num  5620 8559 6510 8691 7682
##  $ ER34202: num  1 2 2 1 3
##  $ ER34502: num  1 1 1 2 2
##  $ ER34203: num  10 40 20 10 30
##  $ ER34503: num  10 10 10 20 30
##  $ ER30000: num  1 1 1 1 1
##  $ pid    : num  4006 4007 4031 4038 4049
##  $ ER34204: num  55 53 39 25 4
##  $ ER53017: num  55 55 41 25 32
##  $ ER66017: num  59 57 43 34 27
##  - attr(*, "problems")=<externalptr>

5.2 Single Customized File

If you are reading all the variables from one single file, you only need to change two things:

  1. Specify the name of the data file
  2. Change your type argument to "single"

psid_df <- psid_read(indir = input_directory, str_df = str_df,
                     idvars = c("ER30000"), type = "single", filename = "J327825")
str(psid_df)
## 'data.frame':    5 obs. of  11 variables:
##  $ ER53017: num  55 55 41 25 32
##  $ ER66017: num  59 57 43 34 27
##  $ ER34204: num  55 53 39 25 4
##  $ ER34202: num  1 2 2 1 3
##  $ ER34502: num  1 1 1 2 2
##  $ ER34203: num  10 40 20 10 30
##  $ ER34503: num  10 10 10 20 30
##  $ ER34201: num  8684 8684 7300 8569 8698
##  $ ER34501: num  5620 8559 6510 8691 7682
##  $ ER30000: num  1 1 1 1 1
##  $ pid    : num  4006 4007 4031 4038 4049
##  - attr(*, "problems")=<externalptr>

5.3 Notes

Please note that the indir argument of this function should be the directory where you store the .rda files. If you prepared the data with psid_unzip(), it should be the same directory you passed as exdir there.

You may notice that some additional variables, which are not declared in your table of data structure, are also added to the data frame:

  1. Sequence number for each year (e.g. ER34202 for 2013 and ER34502 for 2017)
  2. Relation to household head (e.g. ER34203 for 2013 and ER34503 for 2017)
  3. Interview number (e.g. ER34201 for 2013 and ER34501 for 2017)
  4. pid: The individual-level identification key, equal to ER30001 * 1000 + ER30002

These are survey information variables. Please do not drop them before you run psid_reshape(). I strongly recommend keeping them in the final output as well, because they can be very useful in the analysis.
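
The pid formula can be checked with a few lines of arithmetic. The ER30001 and ER30002 values below are toy inputs chosen so that the results match pid values shown in the output above. Splitting pid back into its parts assumes the person number is always below 1000.

```r
# Build pid from the 1968 interview number and the person number,
# following pid = ER30001 * 1000 + ER30002.
ER30001 <- c(4, 4, 4)
ER30002 <- c(6, 7, 31)
pid <- ER30001 * 1000 + ER30002
pid

# Recover the two components from pid
# (assumes the person number is below 1000).
fam68  <- pid %/% 1000
person <- pid %% 1000
```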

6. psid_reshape(): Format Data

We finally come to the last step! The psid_reshape() function renames and reshapes the data into the final output, ready for your next-step analysis!

All the variables are renamed to the names you defined in psid_str(). You can also reshape the dataset to a long format if you want to process the data from multiple waves together. For example:

df <- psid_reshape(psid_df = psid_df, str_df = str_df, shape = "long", level = "individual")
df
## # A tibble: 10 × 8
##    ER30000   pid year  hh_age p_age xsqnr rel2hh indfid
##      <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
##  1       1  4006 2013      55    55     1     10   8684
##  2       1  4006 2017      59    NA     1     10   5620
##  3       1  4007 2013      55    53     2     40   8684
##  4       1  4007 2017      57    NA     1     10   8559
##  5       1  4031 2013      41    39     2     20   7300
##  6       1  4031 2017      43    NA     1     10   6510
##  7       1  4038 2013      25    25     1     10   8569
##  8       1  4038 2017      34    NA     2     20   8691
##  9       1  4049 2013      32     4     3     30   8698
## 10       1  4049 2017      27    NA     2     30   7682

If you would like to keep the wide shape of the data frame, the variable names will follow the pattern varname_YYYY. For example:

df <- psid_reshape(psid_df = psid_df, str_df = str_df, shape = "wide", level = "individual")
df
##   hh_age_2013 hh_age_2017 p_age_2013 xsqnr_2013 xsqnr_2017 rel2hh_2013
## 1          55          59         55          1          1          10
## 2          55          57         53          2          1          40
## 3          41          43         39          2          1          20
## 4          25          34         25          1          2          10
## 5          32          27          4          3          2          30
##   rel2hh_2017 indfid_2013 indfid_2017 ER30000  pid
## 1          10        8684        5620       1 4006
## 2          10        8684        8559       1 4007
## 3          10        7300        6510       1 4031
## 4          20        8569        8691       1 4038
## 5          30        8698        7682       1 4049
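
The relationship between the wide and long shapes can be reproduced on a small scale with base R's reshape(). This is only a sketch of the idea on two toy rows taken from the output above. It does not show how psid_reshape() is implemented.

```r
# Two individuals in wide form, using the varname_YYYY naming pattern.
wide <- data.frame(
  pid         = c(4006, 4007),
  hh_age_2013 = c(55, 55),
  hh_age_2017 = c(59, 57)
)

# Stack the year-specific columns into one hh_age column plus a year column.
long <- reshape(
  wide,
  direction = "long",
  idvar     = "pid",
  varying   = c("hh_age_2013", "hh_age_2017"),
  v.names   = "hh_age",
  timevar   = "year",
  times     = c(2013, 2017)
)

# Sort by person and year, and drop the generated row names.
long <- long[order(long$pid, long$year), ]
rownames(long) <- NULL
long
```

Each person now has one row per wave, which is the layout most panel-data routines expect.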

You can also collapse the data to household level in this step. Only one record will be kept here for each household at each wave:

df <- psid_reshape(psid_df = psid_df, str_df = str_df, shape = "long", level = "household")
df
## # A tibble: 5 × 8
##   ER30000   pid year  hh_age p_age xsqnr rel2hh indfid
##     <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1       1  4006 2013      55    55     1     10   8684
## 2       1  4006 2017      59    NA     1     10   5620
## 3       1  4007 2017      57    NA     1     10   8559
## 4       1  4031 2017      43    NA     1     10   6510
## 5       1  4038 2013      25    25     1     10   8569
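
On this example data, the household-level output above coincides with keeping only the rows of the household head (rel2hh equal to 10, the PSID code for the head or reference person). The sketch below reproduces that selection in base R from the long-format values shown earlier. Whether psid_reshape() uses exactly this rule internally is an assumption.

```r
# Long-format rows copied from the individual-level output shown earlier.
long <- data.frame(
  pid    = c(4006, 4006, 4007, 4007, 4031, 4031, 4038, 4038, 4049, 4049),
  year   = rep(c(2013, 2017), times = 5),
  rel2hh = c(10, 10, 40, 10, 20, 10, 10, 20, 30, 30),
  indfid = c(8684, 5620, 8684, 8559, 7300, 6510, 8569, 8691, 8698, 7682)
)

# Keep one row per household and wave: the head's row (rel2hh == 10).
# Assumption: this mirrors, but may not equal, the package's own rule.
hh <- long[long$rel2hh == 10, ]

# Sanity check: no household appears twice within a wave.
stopifnot(!any(duplicated(hh[, c("indfid", "year")])))
hh
```

Households whose head is not among the sampled individuals in a given wave have no row in that wave, which is why fewer records remain than household-wave combinations in the long data.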

Feel free to reshape the data based on your own needs!