---
title: "Introduction to gendercoder"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to gendercoder}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The goal of gendercoder is to allow simple recoding of free-text gender responses.

## Installation

You can install the development version from GitHub with:

``` r
devtools::install_github("ropensci/gendercoder")
library(gendercoder)
```

## Why would we do this?

Researchers who collect self-reported demographic data from respondents 
occasionally collect gender using a free-text response option. This has the 
advantage of respecting the gender diversity of respondents without prompting 
users and potentially including misleading responses. However, this presents a 
challenge to researchers in that some inconsistencies in typography and spelling 
create a larger set of responses than would be required to fully capture the 
demographic characteristics of the sample. 

For example, male participants may provide free-text responses as "male", "man", 
"mail", "mael". Non-binary participants may provide responses as "nonbinary", 
"enby", "non-binary", "non binary"

Manually coding of such free-text responses this is often not feasible with 
larger datasets. `recode_gender()` uses dictionaries of common
misspellings to re-code free-text responses into a consistent set of 
responses. The small number of responses not automatically re-coded by 
recode_gender() can then be feasibly manually recoded.

## Motivating example

`gendercoder` includes a sample dataset with actual free-text
responses to the question "What is your gender?" from a number of studies of 
English-speaking participants. The sample dataset includes responses from 7756
participants. Naive coding identifies 103 unique responses to this item. 


```{r message = FALSE, warning= FALSE}
library(gendercoder)

count_values <- function(x, name = "value") {
  counts <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
  out <- data.frame(value = names(counts), count = as.integer(counts))
  names(out)[1] <- name
  out
}

knitr::kable(
  count_values(sample$Gender, "Gender"),
  caption = "Summary of gender categories before coding"
)
```

Recoding using the `gender_coder()` function classifies all but 28 responses 
into pre-defined response categories. 

```{r}
manylevels_sample <- sample
manylevels_sample$recoded_gender <- recode_gender(
  gender = manylevels_sample$Gender,
  dictionary = manylevels_en
)

knitr::kable(
  head(manylevels_sample, 10),
  caption = "The manylevels_en dictionary applied to `head(sample)`"
)

knitr::kable(
  count_values(na.omit(manylevels_sample$recoded_gender), "recoded_gender"),
  caption = "Summary of gender categories after use of the *manylevels_en* dictionary"
)

```

In this dataset unclassified responses are a mix of unusual responses and 
apparent response errors (e.g. numbers and symbols). While some of these are
genuinely missing (i.e. Gender = 40), other could be manually recoded, or added
to a custom dictionary. 

```{r}

knitr::kable(
  manylevels_sample[is.na(manylevels_sample$recoded_gender), ],
  caption = "All responses not classified by the built-in dictionary"
)

```

## Options within the function

### dictionary

The package provides two built-in dictionaries. The use of these is controlled 
using the `dictionary` argument. The first `dictionary = manylevels_en` provides 
corrects spelling and standardises terms while maintaining the diversity of responses. 
This is the default dictionary for `recode_gender()` as it preserves as much gender
diversity as possible. 

However in some cases you may wish to collapse gender into a smaller set of 
categories by using the `fewlevels_en` dictionary (`dictionary = fewlevels_en`). This 
dictionary contains fewer gender categories, "man", "woman", 
"boy", "girl", and "sex and gender diverse". 

The "man" category includes all participants who indicate that they are

- male
- trans male  (including female to male transgender respondents)
- cis male

The "woman" category includes all participants who indicate that they are

- female
- trans female (including male to female transgender respondents)
- cis female

The "sex and gender diverse" category includes all participants who indicate 
that they are

- agender
- androgynous
- intersex
- non-binary
- gender-queer

```{r} 
fewlevels_sample <- sample
fewlevels_sample$recoded_gender <- recode_gender(
  gender = fewlevels_sample$Gender,
  dictionary = fewlevels_en
)

knitr::kable(
  head(fewlevels_sample, 10),
  caption = "The fewlevels_en dictionary applied to `head(sample)`"
)

knitr::kable(
  count_values(fewlevels_sample$recoded_gender, "recoded_gender"),
  caption = "Summary of gender categories after use of the *fewlevels_en* dictionary"
)
```

You can also specify a custom dictionary to replace or supplement the built-in 
dictionary. The custom dictionary should be a list in the following 
format. 

```{r}
# name of the vector element is the user input value and the vector element is the 
# replacement value corresponding to that name as a lower case string.
custom_dictionary <- c(
  masculino = "man",
  hombre = "man",
  mujer = "woman",
  femenina = "woman"
)

str(custom_dictionary)
```

Custom dictionaries can be used in place of a built-in dictionary or can 
supplement the built-in dictionary by providing a vector of vectors to the 
dictionary argument. Where the lists contain duplicated elements, the last 
version of the duplicated value will be used for recoding. This allows you to 
use the built-in dictionary but change the coding of one or more responses from 
that dictionary. Here the addition of Spanish terms allows for recoding of 11 
previously uncoded responses.

```{r}

combined_sample <- sample
combined_sample$recoded_gender <- recode_gender(
  gender = combined_sample$Gender,
  dictionary = c(fewlevels_en, custom_dictionary)
)

knitr::kable(
  count_values(combined_sample$recoded_gender, "recoded_gender"),
  caption = "Summary of gender categories after use of the combined dictionaries"
)

knitr::kable(
  combined_sample[is.na(combined_sample$recoded_gender), ],
  caption = "All responses not classified by the combined dictionaries"
)
```

### retain_unmatched

The `retain_unmatched` argument is used to determine the handling for recoding of values
not contained in the dictionary. By default, unmatched values are coded as NA. 
`retain_unmatched = TRUE` will fill unmatched responses with the participant provided 
response. 

```{r}
retained_sample <- sample
retained_sample$recoded_gender <- recode_gender(
  gender = retained_sample$Gender,
  dictionary = c(fewlevels_en, custom_dictionary),
  retain_unmatched = TRUE
)

knitr::kable(
  count_values(retained_sample$recoded_gender, "recoded_gender"),
  caption = "Summary of gender categories after use of the combined dictionary and `retain_unmatched = TRUE`"
)
```

# A disclaimer on handling gender responses

This package attempts to remove typographical errors from free text gender data.
The defaults that we used are specific to our context and the time at which the 
package was developed and your data or context may be different. 

We offer two built-in dictionaries, manylevels_en and fewlevels_en. Both are necessarily 
opinionated about how gender descriptors collapse into categories. 

However, as these are culturally specific, they may not be suitable for your data. 
In particular the fewlevels_en option makes opinionated choices about some responses that we want to acknowledge are potentially problematic. Specifically,

- In 'fewlevels_en' coding intersex responses are recoded as 'sex and gender
diverse'
- In 'fewlevels_en' responses where people indicate they are trans and
indicate their identified gender are recoded as the identified gender
(e.g. 'Male to Female' is recoded as 'woman'). We wish to acknowledge
that this may not reflect how some individuals would classify
themselves when given these categories and in some contexts may make
systematic errors. The manylevels_en coding dictionary attempts to avoid these
issues as much as possible - however users can provide a custom
dictionary to add to or overwrite our coding decisions if they feel
this is more appropriate. We welcome people to update the built-in dictionary 
where desired responses are missing.
- In both dictionaries, we assume that typographical features such as spacing
are not relevant to recoding the gender response (e.g. we assume that
"genderqueer" and "gender queer" are equivalent). This is unlikely to be 
true for all contexts. 


The 'manylevels_en' coding separates out those who identify as trans
female/male or cis female/male into separate categories it should not
be assumed that all people who describe as male/female are cis, if you
are assessing trans status we recommend a two part question see:

Bauer, Greta & Braimoh, Jessica & Scheim, Ayden & Dharma, Christoffer.
(2017). Transgender-inclusive measures of sex/gender for population surveys:
Mixed-methods evaluation and recommendations. PLoS ONE. 12.

# Contributing to this package

This package is a reflection of cultural context of the package contributors. 
We acknowledge that understandings of gender are bound by both culture and time 
and are continually changing. As such, we welcome issues and pull requests to 
make the package more inclusive, more reflective of current understandings of 
gender inclusive languages and/or suitable for a broader range of cultural 
contexts. We particularly welcome addition of non-English dictionaries or of 
other gender-diverse responses to the manylevels_en and fewlevels_en dictionaries.

The "Adding to the dictionary" vignette includes information about how to make changes to the dictionary either for your own use or when contributing to the gendercoder package.

# Acknowledgement of Country

We acknowledge the Wurundjeri people of the Kulin Nation as the custodians of 
the land on which this package was developed and pay respects to elders past, 
present and future.