The forwards package provides anonymized data from surveys conducted by Forwards, the R Foundation task force on women and other under-represented groups. The package currently contains a single data set, useR2016, with results from a survey of participants at the useR! 2016 conference. The questions and form of responses are described in the help file (?useR2016). This vignette provides a few examples of how to obtain results equivalent to those presented in reports on this survey (note that in most cases it is not possible to reproduce the exact results, due to the aggregation necessary to protect respondents’ privacy).

Descriptive statistics

Q7 gives the highest education level of the respondents. We can cross-tabulate this by gender (Q2) as follows:

library(forwards)  # provides the useR2016 data set
library(knitr)     # provides kable()
tab <- with(useR2016,
            prop.table(table(Q7, Q2), margin = 2))
kable(tab*100, digits = 1)

	Men	Non-Binary/Unknown	Women
Doctorate/Professional	48.6	NaN	43.6
Masters or lower	51.4	NaN	56.4

The results are missing for the non-binary/unknown group because all demographic variables have been suppressed for these individuals. The education levels have been aggregated into “Doctorate/Professional” and “Masters or lower”. This gives two groups of roughly similar size; in addition, the Doctorate and Professional qualification groups were observed to be separated from the lower education groups (Master, Undergraduate, and High School or lower) in the multivariate analyses (see useR! 2016 participants and R programming: a multivariate analysis and useR! 2016 participants and the R community: a multivariate analysis). Even with this heavy aggregation, we can still observe the high proportion of people with advanced qualifications and the tendency for men to have higher qualifications than women (as noted in our reports, women attendees were generally younger than men). For more discussion of the respondent demographics, see the blog post mapping useRs.
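The NaN entries come from the column-wise normalization: with margin = 2, prop.table() divides each column by its own total, and a column of zero counts gives 0/0. A minimal sketch with made-up data (the toy data frame and its values are illustrative only, not taken from useR2016):

```r
# Toy data (hypothetical values): education is NA for the suppressed group
toy <- data.frame(
  gender = factor(c("Men", "Men", "Men", "Women", "Women", "Unknown"),
                  levels = c("Men", "Unknown", "Women")),
  educ   = factor(c("Doctorate", "Masters", "Doctorate",
                    "Masters", "Doctorate", NA))
)

# table() drops NA values by default, so the "Unknown" column is all zeros
counts <- with(toy, table(educ, gender))

# margin = 2 divides each column by its column total; 0/0 yields NaN
props <- prop.table(counts, margin = 2)
round(props * 100, 1)
```

The columns for groups with observed education levels each sum to 100, while the suppressed group shows NaN in every row.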

Q15 asked for respondents’ opinions on several statements about R. The following code collects these responses and shows the percentage in each opinion category for each statement:

library(likert)   # provides likert() and its plot method
library(ggplot2)  # provides scale_x_discrete() and ggtitle()
ldat <- likert(useR2016[c("Q15", "Q15_B", "Q15_C", "Q15_D")])
plot(ldat) +
    scale_x_discrete(labels = 
                       rev(c("fun", "considered cool/interesting\n by my peers",
                             "difficult", "monotonous task"))) +
    ggtitle("useR! 2016 attendees' opinions on writing R")

This plot was presented in the blog post useRs relationship with R which covers all the programming related questions in the survey.
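The percentages behind a plot of this kind can also be computed directly. The following base R sketch tabulates each statement separately; the response scale, item names and answers are hypothetical and are not the actual Q15 data:

```r
# Toy responses (hypothetical) on a five-point agreement scale
lev <- c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree")
items <- data.frame(
  fun       = factor(c("Agree", "Strongly agree", "Agree", "Neutral"),
                     levels = lev),
  difficult = factor(c("Disagree", "Neutral", "Agree", "Disagree"),
                     levels = lev)
)

# Percentage of respondents in each category, one row per statement
pct <- t(sapply(items, function(x) prop.table(table(x)) * 100))
pct
```

Because every column shares the same factor levels, each row of pct covers the full scale and sums to 100.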

Q24 asked respondents whether certain options would make them more likely to participate in the R community, or improve their experience. The following code gathers all the responses together and summarizes the percentage selecting each category, for men and women separately.

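library(dplyr)  # filter(), select(), group_by(), summarize(), mutate(), %>%
library(tidyr)  # gather(), separate()
dat <- useR2016 %>%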
    filter(Q2 %in% c("Men", "Women")) %>%
    select(Q2, Q24, Q24_B, Q24_C, Q24_D, Q24_E, Q24_F, Q24_G, Q24_H, Q24_I, 
           Q24_J, Q24_K, Q24_L) %>%
    group_by(Q2) %>%
    summarize_all(list(Yes = ~ sum(!is.na(.)),
                       No = ~ sum(is.na(.)))) %>%
    gather(Response, Count, -Q2) %>%
    separate(Response, c("Q", "Answer"), sep = "_(?=[^_]+$)") %>%
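    arrange(Q2, Q, Answer) %>%  # within each Q2/Q pair: "No" row first, then "Yes"
    group_by(Q2, Q) %>%
    summarize(Yes = Count[2],   # after the sort above, Count[2] is the "Yes" count
              Percentage = Count[2]/sum(Count) * 100) %>%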
    ungroup() %>%
    filter(Yes > 4) %>%
    mutate(Q = factor(Q, labels = 
                        c("New R user group near me",#A
                          "New R user group near me aimed at my demographic",#B
                          "Free local introductory R workshops",#C
                          "Paid local advanced R workshops",#D
                          "R workshop at conference in my domain", #E
                          "R workshop aimed at my demographic",#F
                          "Mentoring (e.g. CRAN/useR! submission, GitHub contribution)", #G
                          #"Training in non-English language",
                          #"Training that accommodates my disability",
                          "Online forum to discuss R-related issues", #J
                          "Online support group for my demographic"#, #K
                          #"Special facilities at R conferences"
                          ))) 
kable(dat, digits = 1)

Q2	Q	Yes	Percentage
Men	New R user group near me	78	27.6
Men	New R user group near me aimed at my demographic	9	3.2
Men	Free local introductory R workshops	37	13.1
Men	Paid local advanced R workshops	32	11.3
Men	R workshop at conference in my domain	39	13.8
Men	R workshop aimed at my demographic	7	2.5
Men	Mentoring (e.g. CRAN/useR! submission, GitHub contribution)	46	16.3
Men	Online forum to discuss R-related issues	28	9.9
Men	Online support group for my demographic	5	1.8
Women	New R user group near me	44	26.0
Women	New R user group near me aimed at my demographic	22	13.0
Women	Free local introductory R workshops	24	14.2
Women	Paid local advanced R workshops	28	16.6
Women	R workshop at conference in my domain	33	19.5
Women	R workshop aimed at my demographic	13	7.7
Women	Mentoring (e.g. CRAN/useR! submission, GitHub contribution)	40	23.7
Women	Online forum to discuss R-related issues	34	20.1
Women	Online support group for my demographic	13	7.7

Note that respondents could select multiple options, so the percentages do not add up to 100% for men or for women. Also, the code keeps only options with more than four selections in a group (filter(Yes > 4)), so options that were rarely or never selected do not appear in the summary. The following code visualizes these percentages:
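The same idea, that a non-missing value means the option was ticked, can be sketched in base R. The data frame below is made up for illustration (column names opt_a and opt_b are hypothetical):

```r
# Toy multi-select data (hypothetical): NA means the option was not ticked
toy <- data.frame(
  gender = c("Men", "Men", "Men", "Men", "Women", "Women"),
  opt_a  = c("A", NA, "A", NA, "A", "A"),
  opt_b  = c(NA, NA, "B", NA, "B", NA),
  stringsAsFactors = FALSE
)

# Logical indicator per option: TRUE where the option was ticked
ticked <- as.data.frame(!is.na(toy[c("opt_a", "opt_b")]))

# Percentage of each group ticking each option (mean of the indicator)
pct <- aggregate(ticked, by = list(gender = toy$gender),
                 FUN = function(x) mean(x) * 100)
pct
```

Since each option is summarized independently, the percentages within a group need not sum to 100.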

library(ggplot2)  # plotting
library(forcats)  # provides fct_rev()
ggplot(dat, aes(x = fct_rev(Q),  y = Percentage, fill = Q2)) + 
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(x = NULL, y = "%", title = "Options that would encourage participation in the R community", fill = NULL) +
  scale_y_continuous(breaks = seq(0, 100, 20), limits = c(0, 100)) +
  scale_fill_hue(h = c(110,250), direction = -1, breaks = c("Women", "Men"))

Men and women are about equally interested in local user groups and free introductory workshops, but women are more interested than men in mentoring, online forums and support groups, and the other types of workshop. For more on the community questions in the survey, see the blog post useRs participation in the R community.

Logistic regression analysis

Logistic regression analysis can be used to explore the relationships between contribution to the R project and other survey variables. For example, the following code creates a contributor response, which is equal to 1 if respondents have contributed to R packages on CRAN or elsewhere (Q13_D), have written their own R package (Q13_E), or have written their own R package and released it on CRAN or Bioconductor or shared it on GitHub, R-Forge or similar platforms (Q13_F), and is equal to 0 otherwise. This response is then modelled by gender (Q2), length of R usage (Q11), employment status (Q8) and whether the respondent feels a part of the R community (Q18):

response <- with(useR2016,
    ifelse(!is.na(Q13_D) | !is.na(Q13_E) | !is.na(Q13_F), 1, 0))
summary(glm(response ~ Q2 + Q11 + Q8 + Q18, data = useR2016))

## 
## Call:
## glm(formula = response ~ Q2 + Q11 + Q8 + Q18, data = useR2016)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.92897  -0.38417   0.07103   0.34218   0.90121  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.27420    0.05918   4.633 4.84e-06 ***
## Q2Women       -0.10034    0.04423  -2.269   0.0238 *  
## Q112-5 years   0.26890    0.06665   4.034 6.53e-05 ***
## Q115-10 years  0.37641    0.06501   5.790 1.40e-08 ***
## Q11> 10 years  0.43971    0.06849   6.420 3.77e-10 ***
## Q8Academic     0.21506    0.04329   4.968 9.95e-07 ***
## Q18No         -0.34397    0.05864  -5.866 9.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1809193)
## 
##     Null deviance: 102.050  on 417  degrees of freedom
## Residual deviance:  74.358  on 411  degrees of freedom
##   (37 observations deleted due to missingness)
## AIC: 480.52
## 
## Number of Fisher Scoring iterations: 2

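This model suggests that women are slightly less likely to contribute; however, more important factors are length of R usage (the longer respondents have used R, the more likely they are to have contributed), type of employment (academic, which includes retired, unemployed and student respondents, more likely to contribute) and sense of belonging to the R community (people who do not feel part of the R community are less likely to contribute). Note that the glm() call above does not set family = binomial, so the output shown comes from the default Gaussian family (see the dispersion parameter line) and the coefficients are differences in probability rather than log-odds. A working paper is in progress on this and related models.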
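glm() fits a Gaussian model unless a family is given, so a logistic regression needs family = binomial. The sketch below uses simulated data (the variable names and effect sizes are made up, not taken from useR2016) to show the two calls side by side:

```r
set.seed(1)

# Simulated data (hypothetical): contribution depends on years of R use
n <- 500
years <- runif(n, 0, 15)
contributed <- rbinom(n, 1, plogis(-2 + 0.3 * years))

# No family argument: glm() defaults to gaussian (a linear probability model)
fit_lpm <- glm(contributed ~ years)

# family = binomial: logistic regression, coefficients on the log-odds scale
fit_logit <- glm(contributed ~ years, family = binomial)

family(fit_lpm)$family    # "gaussian"
family(fit_logit)$family  # "binomial"

# Estimated odds ratio per additional year of use
exp(coef(fit_logit)["years"])
```

In the Gaussian fit a coefficient is read as a change in the probability of contributing; in the binomial fit, exponentiating a coefficient gives an odds ratio.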

Multivariate analysis

A multiple correspondence analysis may be used to explore multivariate relationships between a set of questions. The following code considers questions relating to length of R usage (Q11), how the respondent uses R (Q13) and why they use R (Q14). Previous programming experience (Q12) and the demographic variables gender (Q2), age (Q3), highest education level (Q7) and employment type (Q8) are used as supplementary variables; that is, they are not used to build the dimensions of variability, but are projected a posteriori to aid interpretation.

library(dplyr)       # mutate() and %>%
library(FactoMineR)  # MCA() and dimdesc()
demo <- c("Q2", "Q3", "Q7", "Q8")
suppl <- c(demo, "Q12")
ruses <- c("Q11", "Q13", "Q13_B", "Q13_C", "Q13_D", "Q13_E", "Q13_F", "Q14")
don.mca <- useR2016[, c(suppl, ruses)] %>%
    mutate(Q12 = factor(ifelse(Q12 == "Yes", "prg_exp_yes", "prg_exp_no")),
           Q13 = factor(ifelse(!is.na(Q13), "use_func_yes", "use_func_no")),
           Q13_B = factor(ifelse(!is.na(Q13_B), "wrt_code_yes", "wrt_code_no")),
           Q13_C = factor(ifelse(!is.na(Q13_C), "wrt_func_yes", "wrt_func_no")),
           Q13_D = factor(ifelse(!is.na(Q13_D), "ctb_pkg_yes", "ctb_pkg_no")),
           Q13_E = factor(ifelse(!is.na(Q13_E), "wrt_pkg_yes", "wrt_pkg_no")),
           Q13_F = factor(ifelse(!is.na(Q13_F), "rel_pkg_yes", "rel_pkg_no")))
rownames(don.mca) <- seq(nrow(don.mca))
res.mca <- MCA(don.mca, graph =  FALSE, quali.sup =  seq(length(suppl)))
plot(res.mca, invisible = c("ind", "quali.sup"), cex = 0.8)

The plot above summarizes the main dimensions of variability in the responses to the programming experience questions. Two categories are close on the graph when individuals who have selected the first category also tend to select the other category. The main feature of this plot is the gradient from bottom right to top left, showing increasing experience and greater contribution, including in the respondents’ free time.
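To illustrate why co-selected categories end up close together, the following base R sketch carries out a correspondence analysis of an indicator matrix, which is the textbook construction of MCA. It is not a description of FactoMineR's internals, and the toy variables (writes_pkg, on_cran, uses_ide) and their values are hypothetical:

```r
# Toy data (hypothetical): writes_pkg and on_cran usually agree,
# uses_ide is unrelated to both
toy <- data.frame(
  writes_pkg = factor(c("yes", "yes", "yes", "yes", "no", "no", "no", "no")),
  on_cran    = factor(c("yes", "yes", "yes", "no", "no", "no", "no", "yes")),
  uses_ide   = factor(c("yes", "no", "yes", "no", "yes", "no", "yes", "no"))
)

# One 0/1 column per category of a factor
indicator <- function(f) {
  m <- outer(as.character(f), levels(f), "==") * 1
  colnames(m) <- levels(f)
  m
}

# Indicator matrix for all variables, columns named variable_category
Z <- do.call(cbind, lapply(names(toy), function(v) {
  m <- indicator(toy[[v]])
  colnames(m) <- paste(v, colnames(m), sep = "_")
  m
}))

# Correspondence analysis: SVD of the standardized residuals
P  <- Z / sum(Z)
r  <- rowSums(P)   # row masses
cm <- colSums(P)   # column masses
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
dec <- svd(S)

# Principal coordinates of the categories (rows) on each dimension (columns)
coords <- (dec$v / sqrt(cm)) %*% diag(dec$d)
rownames(coords) <- colnames(Z)
round(coords[, 1:2], 2)
```

On the first dimension, writes_pkg_yes and on_cran_yes fall on the same side and their "no" counterparts on the opposite side, while the unrelated uses_ide categories sit at the origin.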

The following code then projects the demographic variables onto the same dimensions:

res.dimdesc <- dimdesc(res.mca)  
# demographic variables linked to the dimension 1 or 2 
varselect <- 
    demo[which(demo%in%unique(c(rownames(res.dimdesc$'Dim 1'$quali),
                                rownames(res.dimdesc$'Dim 2'$quali))))]
# vector with the categories for such demographic variables
modeselect <- unlist(sapply(don.mca[, varselect],levels))      
# discriminant categories for the position of the individuals on dimension 1 or 2
getlabel <- function(x) sub("[^=]+=(.*)", "\\1", x)
lab1 <- getlabel(rownames(res.dimdesc$'Dim 1'$category))
lab2 <- getlabel(rownames(res.dimdesc$'Dim 2'$category))
modeselect <- modeselect[modeselect %in% unique(c(lab1, lab2))]
plot(res.mca, invisible=c("ind", "var"), cex = 0.8,
     selectMod = modeselect, autoLab = "yes",
     xlim = c(-1.5,1.5), ylim = c(-1,1))

This shows that the more experienced programmers tend to be men working in academia. Further multivariate analysis of the programming questions can be found in useR! 2016 participants and R programming: a multivariate analysis while a similar analysis of the community questions is reported in useR! 2016 participants and the R community: a multivariate analysis.

Overview of forwards package

Heather Turner

2019-07-30
