---
title: "Adding to the dictionary"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Adding to the dictionary}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, echo = FALSE, message=FALSE}
df <- data.frame(
  gender = c("male", "enby", "womn", "mlae", "mann", "frau", "femme", "homme", "nin"),
  stringsAsFactors = FALSE
)
```

## Outline

While the `gendercoder` dictionaries aim to be as comprehensive as possible, new
typos and variations will inevitably turn up in real-world data. Moreover, the
dictionaries are currently limited to English-language data that the authors
have had access to. If you are collecting data, you will therefore at some point
want to add to the dictionaries or create your own. If you do, we strongly
encourage contributions, either as
[a pull request via GitHub](https://github.com/ropensci/gendercoder) or by
[raising an issue](https://github.com/ropensci/gendercoder/issues/new) so the
team can help.

## Adding to the dictionary

Let's say I have free-text gender data, but some of it is not in English.

```{r example-data, message=FALSE}
library(gendercoder)
df
```

I can create a new dictionary as a named vector, where the names are the raw, 
uncoded values and the values are the desired outputs. This vector can then be 
passed as the `dictionary` argument of the `recode_gender()` function.
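
To make the name/value structure concrete, here is a minimal base R sketch that 
builds such a vector from a two-column lookup table with `setNames()`. The 
`lookup` table and `sketch_dictionary` are hypothetical names used only for 
illustration. Lower-casing and trimming the names is an assumption made to 
match the lower-case example data in this vignette.

```{r named-vector-sketch}
# Hypothetical lookup table, e.g. typed up while reviewing uncoded responses.
lookup <- data.frame(
  raw = c("Mann", "Frau ", "femme"),
  coded = c("man", "woman", "woman"),
  stringsAsFactors = FALSE
)

# Names are the raw responses; values are the desired output categories.
# Tidying the names (lower case, no surrounding whitespace) is an assumption.
sketch_dictionary <- setNames(lookup$coded, tolower(trimws(lookup$raw)))
sketch_dictionary

# In base R, a named vector can be indexed by name.
unname(sketch_dictionary["frau"])
```

The dictionary defined next has the same shape, only written out by hand.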

```{r new-dictionary}
new_dictionary <- c(
  mann = "man", 
  frau = "woman", 
  femme = "woman", 
  homme = "man", 
  nin = "man"
)

new_dictionary_df <- df
new_dictionary_df$recoded_gender <- recode_gender(
  df$gender,
  dictionary = new_dictionary,
  retain_unmatched = TRUE
)
new_dictionary_df
```

However, as you can see, using just this new dictionary leaves uncoded a number 
of responses that the built-in dictionaries could handle. Because the 
dictionaries are just named vectors, we can concatenate them with `c()` to use 
both at the same time. 

We can do this in-line...

```{r inline-augmentation}
inline_df <- df
inline_df$recoded_gender <- recode_gender(
  df$gender,
  dictionary = c(manylevels_en, new_dictionary),
  retain_unmatched = TRUE
)
inline_df
```

Alternatively, we can store the combined dictionary under its own name and use 
it later. This is useful if you want to save an augmented dictionary for later 
use or contribute it to the package.

```{r stepped-augmentation}
manylevels_plus <- c(manylevels_en, new_dictionary)

stepped_df <- df
stepped_df$recoded_gender <- recode_gender(
  df$gender,
  dictionary = manylevels_plus,
  retain_unmatched = TRUE
)
stepped_df
```
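
Concatenating two named vectors keeps every element, so a name that appears in 
both dictionaries appears twice in the result. This vignette does not cover how 
`recode_gender()` treats repeated names, so it may be worth checking for them 
before saving or sharing a combined dictionary. The following is a minimal base 
R sketch using small stand-in vectors. `base_dictionary` and `extra_dictionary` 
are hypothetical and are not the package data.

```{r duplicate-names-sketch}
# Two small stand-in dictionaries (hypothetical values, not the package data).
base_dictionary <- c(male = "man", female = "woman", frau = "woman")
extra_dictionary <- c(frau = "woman", homme = "man")

combined <- c(base_dictionary, extra_dictionary)

# Names that occur more than once after concatenation.
unique(names(combined)[duplicated(names(combined))])

# Keep only the first occurrence of each name.
deduplicated <- combined[!duplicated(names(combined))]
deduplicated
```

Keeping the first occurrence means the entry from the dictionary listed first 
in `c()` wins. Reverse the order if the newer entries should take precedence.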

## Making it official

Let's say you are happy with your `manylevels_plus` dictionary and think its 
additions should be part of the `manylevels_en` dictionary in the package. All 
you need to do is [fork the 
gendercoder repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo), 
[clone it to your local device](https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository), 
rename your vector, and use the `usethis::use_data()` function to overwrite the 
`manylevels_en` dictionary as shown below. 

```{r replacing-dictionaries, eval=FALSE}
manylevels_en <- manylevels_plus
usethis::use_data(manylevels_en, overwrite = TRUE)
```
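
Before overwriting the packaged data, a quick sanity check on the vector can 
catch accidental problems such as missing or repeated names. The sketch below 
is one possible check in base R. `check_dictionary()` is a hypothetical helper 
and not part of `gendercoder`, and the conditions it enforces are assumptions 
about what a tidy dictionary looks like rather than requirements stated by the 
package.

```{r dictionary-check-sketch}
# Hypothetical helper, not part of gendercoder.
check_dictionary <- function(dictionary) {
  stopifnot(
    is.character(dictionary),            # values are character strings
    !is.null(names(dictionary)),         # the vector is named
    !anyNA(dictionary),                  # no missing output values
    !anyNA(names(dictionary)),           # no missing names
    all(nzchar(names(dictionary))),      # no empty names
    !anyDuplicated(names(dictionary))    # each raw value appears once
  )
  invisible(dictionary)
}

# Passes silently for a well-formed dictionary.
check_dictionary(c(mann = "man", frau = "woman"))
```

If any condition fails, `stopifnot()` raises an error naming the failed 
expression, which points to what needs fixing before the data is saved.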

Once you've pushed the changes to your fork, you can [make a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork). Please tell us 
what you're adding so we know what to look out for and how to test it.