dataMeta: Making and appending a data dictionary to an R dataset

Dania M. Rodriguez

2017-08-11

Introduction

Datasets generally require a data dictionary that documents the dataset’s variables, their descriptions, and the description of any variable options. This information is crucial, particularly when a dataset will be shared for further analyses, but constructing it can be time-consuming. The dataMeta package provides a collection of functions that construct a data dictionary and append it to the original dataset as an attribute, along with other information that other software generally provides as metadata: the date and time of the last edit, the user name, and a main description of the dataset. Finally, the dataset is saved as an R dataset (.rds).

There are three basic steps to building a data dictionary with this package, which are outlined in the figure below. First, a “linker” data frame is created, in which the user adds a description of each variable and provides a variable “type” or key. These keys/variable types are explained below. The variable type depends on whether the user wants to list all available variable options or whether a range of variable values would suffice. Second, the main data dictionary is created from the original dataset and the linker data frame. Here, the user can add descriptions of the variable options as needed, or just build a dictionary with variable names, their descriptions and options. Finally, the user can append the dictionary to the original dataset as an R attribute, along with the date on which the dictionary was created, the author name, and the general R attributes included for data frames. The new dataset with its attributes is then saved as an R dataset (.rds).

 

 

Data

The data used for this vignette come from the Centers for Disease Control and Prevention (CDC) GitHub repository for publicly available Zika data: https://github.com/cdcepi/zika. This repo contains Zika data scraped from the published records of the health departments of various countries. The dataset used here contains information related to Zika infection in the United States Virgin Islands (USVI), as published by its Health Department in its January 3, 2017 report, and can be loaded using the code below. Please note that the data is read with stringsAsFactors = FALSE. Below is a portion of this dataset:

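library(dataMeta)  # provides my.data and the dictionary functions
library(knitr)     # provides kable()

data(my.data)

kable(head(my.data, 10), format = "html", caption = "Portion of dataset")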
Portion of dataset

report_date  location                                   location_type  data_field         data_field_code  time_period  time_period_type  value  unit
2017-01-03   United_States_Virgin_Islands               territory      zika_reported      VI0001           NA           NA                1930   cases
2017-01-03   United_States_Virgin_Islands-Saint_Thomas  county         zika_reported      VI0001           NA           NA                1163   cases
2017-01-03   United_States_Virgin_Islands-Saint_Croix   county         zika_reported      VI0001           NA           NA                646    cases
2017-01-03   United_States_Virgin_Islands-Saint_John    county         zika_reported      VI0001           NA           NA                121    cases
2017-01-03   United_States_Virgin_Islands               territory      zika_lab_positive  VI0002           NA           NA                916    cases
2017-01-03   United_States_Virgin_Islands-Saint_Thomas  county         zika_lab_positive  VI0002           NA           NA                645    cases
2017-01-03   United_States_Virgin_Islands-Saint_Croix   county         zika_lab_positive  VI0002           NA           NA                198    cases
2017-01-03   United_States_Virgin_Islands-Saint_John    county         zika_lab_positive  VI0002           NA           NA                73     cases
2017-01-03   United_States_Virgin_Islands               territory      zika_not           VI0003           NA           NA                932    cases
2017-01-03   United_States_Virgin_Islands               territory      zika_pending       VI0004           NA           NA                81     cases

 

Another way to load the data is to use its raw link from the GitHub repo where it is stored:

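# Build the raw-file URL from the repository root and the file path
path <- "http://raw.githubusercontent.com/cdcepi/zika/master/"
path2 <- "USVI/USVI_Zika/data/USVI_Zika-2017-01-03.csv"
url <- paste0(path, path2)

# Read strings as character vectors rather than factors
my.data <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)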
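The effect of stringsAsFactors = FALSE can be seen on a small inline CSV. This is only a sketch: the two rows below are a toy stand-in for the downloaded file, and the object names are made up for illustration. Since R 4.0.0, FALSE is the default, so the argument mainly matters on older R versions.

```r
# Toy CSV text standing in for the downloaded file (illustrative rows only).
csv_text <- "location,value\nSaint_Thomas,1163\nSaint_Croix,646"

# Read the same text twice, changing only stringsAsFactors.
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_char$location)    # "character"
class(as_factor$location)  # "factor"
```

With character columns, the text values are kept as plain strings rather than being recoded as factor levels.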

Data Dictionary

Linker:

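To build a data dictionary for the previous dataset, a linker data frame is constructed. This linker serves as an intermediary to build the data dictionary. It contains the names of the variables, a description of each variable provided by the user, and a “variable type.” The variable type gives each variable a value of 1 or 0, as follows:

  0: a range of values is sufficient to describe the variable (for example, dates or counts).
  1: every available option of the variable is listed in the dictionary.

The linker is built using one of the following two functions:

  build_linker: the user supplies the variable descriptions and variable types as vectors, as in the code below.
  prompt_linker: the user is prompted in the console for the description and type of each variable.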

 

var_desc <- c("Date when report was published", "Regional location", 
             "Description of regional location", "Type of case",
             "A specific code for each data field", "The time period of each week",
             "The type of time period", "The number of cases per data field type",
             "The unit in which cases are reported")

var_type <- c(0, 1, 0, 1, 0, 0, 0, 0, 1)
                          
linker <- build_linker(my.data, variable_description = var_desc, variable_type = var_type)

The linker data frame will look like this:

Linker data frame

var_name          var_desc                                 var_type
report_date       Date when report was published           0
location          Regional location                        1
location_type     Description of regional location         0
data_field        Type of case                             1
data_field_code   A specific code for each data field      0
time_period       The time period of each week             0
time_period_type  The type of time period                  0
value             The number of cases per data field type  0
unit              The unit in which cases are reported     1
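To make the structure of the linker concrete, the sketch below builds a data frame of the same shape by hand in base R. It is not how build_linker is implemented. The toy dataset and the names toy, toy_desc, toy_type and toy_linker are assumptions made for illustration. Only the column names var_name, var_desc and var_type are taken from the table above.

```r
# Toy dataset; column names mirror three of the USVI columns.
toy <- data.frame(
  location = c("Saint_Thomas", "Saint_Croix", "Saint_John"),
  value    = c(1163, 646, 121),
  unit     = c("cases", "cases", "cases"),
  stringsAsFactors = FALSE
)

# One description and one type per column, in column order.
toy_desc <- c("Regional location",
              "The number of cases per data field type",
              "The unit in which cases are reported")
toy_type <- c(1, 0, 1)  # 1 = list every option, 0 = a range suffices

# Catch mismatched lengths or invalid types before building anything.
stopifnot(length(toy_desc) == ncol(toy),
          length(toy_type) == ncol(toy),
          all(toy_type %in% c(0, 1)))

# Hand-built stand-in for a linker: one row per variable.
toy_linker <- data.frame(var_name = names(toy),
                         var_desc = toy_desc,
                         var_type = toy_type,
                         stringsAsFactors = FALSE)
print(toy_linker)
```

Checking the vector lengths against ncol() up front is a cheap way to notice a missing or extra description before the dictionary is built.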

Dictionary build:

Next, the dictionary is built with the build_dict function, using the linker created above and the original dataset. Here, option_description is set to NULL because the options of each variable are self-explanatory, so the dictionary will not contain a description for each variable option. If option descriptions are to be added, a vector containing one description per option is constructed. The length of this vector must equal the number of rows in the dictionary, which depends on the variable types used in the linker. Another way to add option descriptions is the prompt_varopts argument, which prompts the user for a description of each variable option, without the need to write the descriptions vector beforehand. To use it, prompt_varopts must be set to TRUE and option_description must be NULL. The following code builds a data dictionary for the USVI report:

 

dict <- build_dict(my.data = my.data, linker = linker, option_description = NULL, 
                   prompt_varopts = FALSE)

The dictionary will look as follows:

kable(dict, format = "html", caption = "Data dictionary for original dataset")
Data dictionary for original dataset

variable_name     variable_description                     variable_options
data_field        Type of case                             zika_reported
                                                           zika_lab_positive
                                                           zika_not
                                                           zika_pending
                                                           confirmed_age_under20
                                                           confirmed_age_20to39
                                                           confirmed_age_40to59
                                                           confirmed_age_over59
                                                           confirmed_age_unk
                                                           confirmed_male
                                                           confirmed_female
                                                           confirmed_fever
                                                           confirmed_acute_fever
                                                           confirmed_arthralgia
                                                           confirmed_arthritis
                                                           confirmed_rash
                                                           confirmed_conjunctivitis
                                                           confirmed_eyepain
                                                           confirmed_headache
                                                           confirmed_malaise
                                                           zika_no_specimen
                                                           zika_tested_pregnant
                                                           zika_positive_pregnant
                                                           zika_negative_pregnant
                                                           zika_pending_pregnant
                                                           zika_no_specimen_pregnant
data_field_code   A specific code for each data field      VI0001 to VI0026
location          Regional location                        United_States_Virgin_Islands
                                                           United_States_Virgin_Islands-Saint_Thomas
                                                           United_States_Virgin_Islands-Saint_Croix
                                                           United_States_Virgin_Islands-Saint_John
location_type     Description of regional location         county to territory
report_date       Date when report was published           2017-01-03 to 2017-01-03
time_period       The time period of each week             NA to NA
time_period_type  The type of time period                  NA to NA
unit              The unit in which cases are reported     cases
value             The number of cases per data field type  0 to 1930
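Because an option description vector must have one entry per dictionary row, it helps to know the row count in advance. The sketch below estimates it in base R under the assumption, suggested by the table above, that a type 1 variable contributes one row per distinct value and a type 0 variable contributes a single range row. The helper expected_dict_rows is hypothetical and not part of dataMeta, and how build_dict treats NA among the options of a type 1 variable is not covered here.

```r
# Hypothetical helper (not part of dataMeta): estimate dictionary rows.
expected_dict_rows <- function(data, var_type) {
  stopifnot(length(var_type) == ncol(data))
  rows_per_var <- vapply(seq_along(data), function(i) {
    # Type 1: one row per distinct value; type 0: one range row.
    if (var_type[i] == 1) length(unique(data[[i]])) else 1L
  }, integer(1))
  sum(rows_per_var)
}

# Toy dataset with three distinct locations and a single unit.
toy <- data.frame(
  location = c("Saint_Thomas", "Saint_Croix", "Saint_John"),
  value    = c(1163, 646, 121),
  unit     = c("cases", "cases", "cases"),
  stringsAsFactors = FALSE
)

n_rows <- expected_dict_rows(toy, var_type = c(1, 0, 1))
n_rows  # 3 locations + 1 range row for value + 1 unit = 5

# A placeholder description vector of the required length.
option_desc <- character(n_rows)
```

Applied to the USVI dictionary, the same counting gives 26 data_field options, 4 locations, 1 unit and 6 range rows, for a total of 37 rows.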

 

Adding the dictionary as a data attribute and incorporating other attributes:

To add the dictionary as an attribute to the original dataset, the function incorporate_attr is used. This function requires the user to provide a main description of the dataset, as shown below. There is also a prompt_attr function that prompts the user to add this description through the console. Once the attributes are added to the dataset, the function save_it saves the file as an R data file (.rds), which preserves all of the dataset’s attributes. All attributes of the data are accessed with the command attributes(my.new.data), and a single attribute, such as the main description of the dataset, is obtained with attributes(my.new.data)$main, as shown below:

 

data_desc = "This data set portrays Zika infection related cases as reported by USVI."

my.new.data <- incorporate_attr(my.data = my.data, data.dictionary = dict, main_string = data_desc)

attributes(my.new.data)
$names
[1] "report_date"      "location"         "location_type"    "data_field"       "data_field_code" 
[6] "time_period"      "time_period_type" "value"            "unit"            

$class
[1] "data.frame"

$row.names
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

$main
[1] "This data set portrays Zika infection related cases as reported by USVI."

$dictionary
      variable_name                    variable_description                          variable_options
1        data_field                            Type of case                             zika_reported
2                                                                                   zika_lab_positive
3                                                                                            zika_not
4                                                                                        zika_pending
5                                                                               confirmed_age_under20
6                                                                                confirmed_age_20to39
7                                                                                confirmed_age_40to59
8                                                                                confirmed_age_over59
9                                                                                   confirmed_age_unk
10                                                                                     confirmed_male
11                                                                                   confirmed_female
12                                                                                    confirmed_fever
13                                                                              confirmed_acute_fever
14                                                                               confirmed_arthralgia
15                                                                                confirmed_arthritis
16                                                                                     confirmed_rash
17                                                                           confirmed_conjunctivitis
18                                                                                  confirmed_eyepain
19                                                                                 confirmed_headache
20                                                                                  confirmed_malaise
21                                                                                   zika_no_specimen
22                                                                               zika_tested_pregnant
23                                                                             zika_positive_pregnant
24                                                                             zika_negative_pregnant
25                                                                              zika_pending_pregnant
26                                                                          zika_no_specimen_pregnant
27  data_field_code     A specific code for each data field                          VI0001 to VI0026
28         location                       Regional location              United_States_Virgin_Islands
29                                                          United_States_Virgin_Islands-Saint_Thomas
30                                                           United_States_Virgin_Islands-Saint_Croix
31                                                            United_States_Virgin_Islands-Saint_John
32    location_type        Description of regional location                       county to territory
33      report_date          Date when report was published                  2017-01-03 to 2017-01-03
34      time_period            The time period of each week                                  NA to NA
35 time_period_type                 The type of time period                                  NA to NA
36             unit    The unit in which cases are reported                                     cases
37            value The number of cases per data field type                                 0 to 1930

$last_edit_date
[1] "2017-08-11 20:01:20 -04"

$author
[1] "Dania Rodriguez"
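The mechanism behind this output can be sketched with base R alone: custom attributes are attached with attr(), and they survive a round trip through an .rds file. The sketch below does not reproduce the internals of incorporate_attr or save_it. It uses saveRDS and readRDS directly, and the toy objects and the temporary file path are assumptions made for illustration.

```r
# Toy data frame standing in for the USVI dataset.
toy <- data.frame(value = c(1163, 646, 121))

# Attach custom attributes next to the usual names/class/row.names.
attr(toy, "main") <- "Toy dataset used to illustrate attributes."
attr(toy, "dictionary") <- data.frame(
  variable_name    = "value",
  variable_options = "121 to 1163",
  stringsAsFactors = FALSE
)

# Save to a temporary .rds file and read it back.
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
restored <- readRDS(rds_path)

# The custom attributes are still present after the round trip.
attributes(restored)$main
attributes(restored)$dictionary
```

A plain-text format such as .csv stores only the table itself, which is why the .rds format is used when the attributes need to travel with the data.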

 

The user can also export the data dictionary by itself as a .csv or .xlsx file, and save the data with all of its attributes, as shown below:

# Exporting dictionary only:
dict_only <- attributes(my.new.data)$dictionary
write.csv(dict_only, "dict_only.csv")

# Saving as .rds (dataset with appended dictionary)
save_it(complete_dict = my.new.data, name_of_file = "My Complete Dataset")
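One detail worth knowing when exporting the dictionary: by default, write.csv adds a column of row names to the file. The sketch below shows a round trip with row.names = FALSE so that the file contains only the dictionary columns. The two-row dictionary and the temporary file path are toy stand-ins, not output of the package.

```r
# Toy dictionary with the same column names as the one above.
dict_toy <- data.frame(
  variable_name        = c("unit", "value"),
  variable_description = c("The unit in which cases are reported",
                           "The number of cases per data field type"),
  variable_options     = c("cases", "0 to 1930"),
  stringsAsFactors = FALSE
)

# Write without the row-name column, then read the file back.
csv_path <- tempfile(fileext = ".csv")
write.csv(dict_toy, csv_path, row.names = FALSE)
dict_back <- read.csv(csv_path, stringsAsFactors = FALSE)

names(dict_back)  # the three dictionary columns, no extra index column
```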

 

References

  1. Dania M. Rodriguez, Michael A Johansson, Luis Mier-y-Teran-Romero, moiradillon2, eyq9, YoJimboDurant, … Daniel Mietchen. (2017). cdcepi/zika: March 31, 2017 [Data set]. Zenodo. http://doi.org/10.5281/zenodo.439543