---
title: "State of the AnVIL 2024"
subtitle: "Supplementary Analysis and Alternative Plots"
author: "Kate Isaac, Elizabeth Humphries, & Ava Hoffman"
date: "`r Sys.Date()`"
output: html_document
---

```{r results='hide', warning=FALSE, message=FALSE}
library(here)
library(patchwork)
library(ggVennDiagram)

knitr::knit_child(here("anvilPoll2024MainAnalysis.Rmd"))
```

# Supplemental Analyses and Graphs

## Identify User Type (supplemental)

*No supplements at this time*


## Demographics: Highest Degree (supplemental)

<details><summary>Description of variable definitions and steps</summary>

First we group `resultsTidy` by the columns of interest, `Degrees` and `UserType`, and use `summarize(n = n())` to count how many times each combination is observed in the data.

Then we send this data to ggplot and make a bar chart. The x-axis represents the degrees, `reorder`ed by the summed count so that degrees with higher totals come first. Passing `sum` to `reorder` matters because otherwise the 2 MDs are placed after the high school and master's in progress bars (1 each). The y-axis represents the count, and the fill specifies user type (current or potential AnVIL users). We use a stacked bar chart and label each bar with the total for that degree type.

We used [this Stack Overflow post to label sums above the bars](https://stackoverflow.com/questions/30656846/draw-the-sum-value-above-the-stacked-bar-in-ggplot2) and [this Stack Overflow post to remove NA from the legend](https://stackoverflow.com/questions/45493163/ggplot-remove-na-factor-level-in-legend).

The rest of the changes are related to theme and labels and making sure that the numerical bar labels aren't cut off at the top.

</details>

```{r}
resultsTidy %>%
  group_by(Degrees, UserType) %>% 
  summarize(n = n()) %>%
  ggplot(aes(x = reorder(Degrees, -n, sum), 
             y = n, 
             fill = UserType
             )) +
      geom_bar(position = "stack", stat="identity") +
      geom_text(
                  aes(label = after_stat(y), group = Degrees), 
                  stat = 'summary', fun = sum, vjust = -1, size=2
                ) +
      theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1)) +
      xlab("Degree") +
      ylab("Count") +
      coord_cartesian(clip = "off") +
      scale_fill_manual(values = c("#E0DD10", "#035C94"), na.translate = FALSE) +
      ggtitle("What is the highest degree you have attained?") +
  theme(legend.title = element_blank())

ggsave(here("plots/degree_usertype.png")) #set plot size
```
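
As a minimal sketch of why `sum` is passed to `reorder`, the toy data below (made-up counts, not survey results) compares the default ordering function (`mean`) with `sum`. With the default, a degree split across two user types with 1 response each ties with a degree that has a single response.

```r
# Hypothetical toy counts for illustration only
toy <- data.frame(
  Degrees  = c("PhD", "PhD", "MD", "MD", "High school"),
  UserType = c("Current User", "Potential User",
               "Current User", "Potential User", "Potential User"),
  n        = c(10, 8, 1, 1, 1)
)

# Default FUN is mean: MD (mean 1) ties with High school (mean 1)
by_mean <- levels(reorder(toy$Degrees, -toy$n))

# With sum: MD (total 2) is placed ahead of High school (total 1)
by_sum <- levels(reorder(toy$Degrees, -toy$n, sum))
by_sum
# "PhD" "MD" "High school"
```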

## Demographics: Kind of Work (supplemental)

*No supplements at this time*


## Demographics: Institutional Affiliation (supplemental)

### Number of institutions represented in responses

```{r}
length(unique(resultsTidy$InstitutionalAffiliation))
```

### Institution type

Let's make a bar chart that shows how many responses came from each institution, colored by institution type.

<details><summary>Description of variable definitions and steps</summary>

We first prepare the data by grouping `resultsTidy` on the columns of interest, `InstitutionalAffiliation` and `InstitutionalType`, and using `summarize(InstitutionalCount = n())` to add a count (`InstitutionalCount`) for every `InstitutionalAffiliation`. We include `InstitutionalType` in the `group_by` even though it's redundant for the counts, because we want to keep that column in order to color by institution type.

We then plot the data with the Affiliation on the y-axis (reordered by the count so largest count is on top),
the count on the x-axis, and the fill color being the institutional type. 

We change some theme and label elements and add a grob annotation to specify how many unique institutions are represented in this graph. 

</details>

```{r}
resultsTidy %>%
  group_by(InstitutionalAffiliation, InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
  ggplot(aes(
    y = reorder(InstitutionalAffiliation, InstitutionalCount),
    x = InstitutionalCount,
    fill = InstitutionalType
  )) + geom_bar(stat = "identity") +
  ggtitle("What institution are you affiliated with?")+
  annotation_custom(textGrob(paste("There are\n", length(unique(resultsTidy$InstitutionalAffiliation))  ,"\nunique institutions"), gp=gpar(fontsize=8, fontface = "bold")),xmin=7,xmax=7,ymin=3,ymax=3) +
  coord_cartesian(clip = "off") +
  theme_classic() + 
  xlab("Count")

ggsave(here("plots/institutionalAffilition_allResponses.png"))
```
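
The grouping step described above can be sketched on toy data. The institution names and counts here are made up for illustration; the point is that adding the redundant `InstitutionalType` to `group_by` keeps it in the summary without changing the counts.

```r
library(dplyr)

# Hypothetical responses: each institution has exactly one type
toy <- tibble(
  InstitutionalAffiliation = c("Univ A", "Univ A", "Univ A",
                               "Center B", "Center B", "School C"),
  InstitutionalType = c("R1 University", "R1 University", "R1 University",
                        "Research Center", "Research Center",
                        "Medical Center or School")
)

counts <- toy %>%
  group_by(InstitutionalAffiliation, InstitutionalType) %>%
  summarize(InstitutionalCount = n(), .groups = "drop")

# One row per institution, with the type column still available for `fill`
counts
```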

Taking a less granular approach, we aggregate by institution type rather than looking at the names of institutions.

```{r}
instPlot <- resultsTidy %>%
  group_by(UserType, InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
  ggplot(aes(
    y = reorder(InstitutionalType, InstitutionalCount, sum),
    x = InstitutionalCount,
    fill = UserType
  )) + geom_bar(position = "stack", stat = "identity") +
  annotation_custom(textGrob(paste("There are\n", length(unique(resultsTidy$InstitutionalAffiliation))  ,"\nunique institutions"), gp=gpar(fontsize=8, fontface = "bold")),xmin=34,xmax=34,ymin=2.5,ymax=2.5) +
  coord_cartesian(clip = "off") +
  ggtitle("What institution are you affiliated with?")

stylize_bar(instPlot)

ggsave(here("plots/institutionalType_allResponses_colorUserType.png"))

```

#### Just for Current/Returning Users

The above plot was for all survey responses. Here we want to focus on institutions represented by just current users of AnVIL. 

<details><summary>Description of variable definitions and steps</summary>

We first select rows/responses that are just from Current users. Then we prepare the data and plot following the same scheme as above.

</details>


```{r}
resultsTidy %>%
  filter(UserType == "Current User") %>%
  group_by(InstitutionalAffiliation, InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
  ggplot(aes(
    y = reorder(InstitutionalAffiliation, InstitutionalCount),
    x = InstitutionalCount,
    fill = InstitutionalType
  )) + geom_bar(stat = "identity") +
  theme_bw() +
  theme(
    panel.background = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank()
  ) +
  ylab("Institutional Affiliation") + xlab("Count") +
  ggtitle(bquote('Institutional Affiliation for' ~ bold('Current User') ~ 'Respondents')) +
  annotation_custom(textGrob(paste("There are\n", nrow(unique(resultsTidy[which(resultsTidy$UserType == "Current User"), "InstitutionalAffiliation"]))  ,"\nunique institutions"), gp=gpar(fontsize=8, fontface = "bold")),xmin=5.5,xmax=5.5,ymin=3,ymax=3) +
  coord_cartesian(clip = "off")

ggsave(here("plots/institutionalAffilition_currentUserResponses.png"))
```

Taking a less granular approach, we look at institution type rather than the names of institutions. We save the plot into a variable so that we can combine it with the one for potential users later.

Note that the y-axis label is turned off, while the x-axis label ("Count") is kept since this will be the bottom plot when combined. We also simplified the title to just say Current Users and turned off the legend.

We also used `scale_fill_manual` to set specific colors for the institution types so that the colors match between this plot and the potential users version (`institutionTypePotential`); more info on this with that plot below.

```{r}
institutionTypeCurrent <- resultsTidy %>%
  filter(UserType == "Current User") %>%
  group_by(InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
  ggplot(aes(
    y = reorder(InstitutionalType, InstitutionalCount),
    x = InstitutionalCount,
    fill = InstitutionalType
  )) + geom_bar(stat = "identity") +
  theme_bw() +
  theme(
    panel.background = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank()
  ) +
  ylab("") +
  xlab("Count") + 
  #xlab("") +
  ggtitle(bquote(bold("Current Users"))) +
  coord_cartesian(clip = "off") +
  scale_fill_manual(values = c("R1 University" = "#FDB462",
                    "Research Center" = "#FCCDE5",
                    "Medical Center or School" = "#FB8072",
                    "R2 University" = "#B3DE69")) +
  theme(legend.position = "none")

institutionTypeCurrent

#ggsave(here("plots/institutionalType_currentUserResponses.png"), plot = institutionTypeCurrent)
```

#### Just for Potential Users

Here we want to focus on institutions represented by just potential users of AnVIL. 

<details><summary>Description of variable definitions and steps</summary>

We first select rows/responses that are just from potential users. Then we prepare the data and plot following the same scheme as above.

</details>

```{r}
resultsTidy %>%
  filter(UserType == "Potential User") %>%
  group_by(InstitutionalAffiliation, InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
  ggplot(aes(
    y = reorder(InstitutionalAffiliation, InstitutionalCount),
    x = InstitutionalCount,
    fill = InstitutionalType
  )) + geom_bar(stat = "identity") +
  theme_bw() +
  theme(
    panel.background = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank()
  ) +
  ylab("Institutional Affiliation") + xlab("Count") +
  ggtitle(bquote('Institutional Affiliation for' ~ bold('Potential User') ~ 'Respondents')) +
  annotation_custom(textGrob(paste("There are\n", nrow(unique(resultsTidy[which(resultsTidy$UserType == "Potential User"), "InstitutionalAffiliation"]))  ,"\nunique institutions"), gp=gpar(fontsize=8, fontface = "bold")),xmin=6,xmax=6,ymin=1.5,ymax=1.5) +
  coord_cartesian(clip = "off")
  

ggsave(here("plots/institutionalAffilition_potentialUserResponses.png"))
```

Taking a less granular approach, we look at institution type rather than the names of institutions.

We wanted to sync the colors between the current and potential institutional types, so we used the Set3 palette for `scale_fill_brewer`, as it has 12 colors (we need 9 for potential users) and it seemed more accessible than the Paired palette. To see the hex codes that were assigned to the shared institution types in this plot, I used the `scales` library and `brewer_pal(palette = "Set3")(9)`.

Turned off both the x- and y-axis labels since this will be the top plot when combined with the current user version (`institutionTypeCurrent`), which carries the x-axis label. Also used `xlim` to sync the x-axis limits between the two.

Simplified the title to just be Potential Users. Turned off the legend.

```{r}
institutionTypePotential <- resultsTidy %>%
  filter(UserType == "Potential User") %>%
  group_by(InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
  ggplot(aes(
    y = reorder(InstitutionalType, InstitutionalCount),
    x = InstitutionalCount,
    fill = InstitutionalType
  )) + geom_bar(stat = "identity") +
  theme_bw() +
  theme(
    panel.background = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank()
  ) +
  ylab("") +
  xlab("") +
  #xlab("Count") +
  xlim(0,15) +
  ggtitle(bquote(bold("Potential Users"))) +
  coord_cartesian(clip = "off") +
  scale_fill_brewer(palette = "Set3") +
  theme(legend.position = "none")

institutionTypePotential

#ggsave(here("plots/institutionalType_potentialUserResponses.png"), plot = institutionTypePotential)
```
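
As a quick check of the color syncing described above, the sketch below looks up the first 9 Set3 colors and confirms that the hex codes hard-coded in `institutionTypeCurrent` come from that palette. It assumes the `scales` package (and its `RColorBrewer` dependency) is installed.

```r
library(scales)

# First 9 colors of the ColorBrewer Set3 palette
set3 <- brewer_pal(palette = "Set3")(9)

# Colors set manually for the shared institution types in the current-user plot
shared <- c("R1 University" = "#FDB462",
            "Research Center" = "#FCCDE5",
            "Medical Center or School" = "#FB8072",
            "R2 University" = "#B3DE69")

# Position of each shared color within the palette
positions <- match(shared, set3)
positions
```

Because `scale_fill_brewer` assigns colors in the order of the fill levels, these positions should line up with where each shared type falls among the potential-user institution types.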

We combined the two plots for institutional type (`institutionTypeCurrent` and `institutionTypePotential`) using patchwork, stacking them on top of each other (`/`). We use `plot_layout` to set the heights: there are more institution types for potential users than for current users, so we want the current users panel to be shorter than the default.


```{r}
combined_plot <-  institutionTypePotential / institutionTypeCurrent + plot_layout(heights = unit(c(4, 2),'cm')) + plot_annotation("What institution are you affiliated with?")
  
combined_plot

ggsave(here("plots/institutionalType_facetedUserType.png"), plot = combined_plot)
```
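
A minimal, self-contained sketch of the same patchwork pattern, using two made-up toy plots: `/` stacks the plots, `plot_layout(heights = ...)` fixes the panel heights, and `plot_annotation()` adds a shared title. The data and titles are placeholders.

```r
library(ggplot2)
library(patchwork)

# Hypothetical toy data: the top plot has more categories than the bottom one
toy <- data.frame(type = c("a", "b", "c"), count = c(3, 2, 1))

p_top <- ggplot(toy, aes(x = count, y = type)) +
  geom_col() +
  ggtitle("Top")

p_bottom <- ggplot(toy[1:2, ], aes(x = count, y = type)) +
  geom_col() +
  ggtitle("Bottom")

# `/` binds tighter than `+`, so the layout and annotation apply to the stack
combined <- p_top / p_bottom +
  plot_layout(heights = grid::unit(c(4, 2), "cm")) +
  plot_annotation(title = "Overall title")

combined
```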


## Demographics: Consortia Affiliations (supplemental)

*No supplements at this time*

## Experience: Tool & Resource Knowledge/Comfort level (supplemental)

### Plot y-axis ordered by potential user ratings

```{r}

# Provide a list of AnVIL only Tools
AnVIL_only <-
  setdiff(toPlotToolKnowledgeSeparateBR[toPlotToolKnowledgeSeparateBR$UserType == "Current Users" &
                                toPlotToolKnowledgeSeparateBR$AnVILorNo == "On the AnVIL", ]$Tool,
          toPlotToolKnowledgeSeparateBR[toPlotToolKnowledgeSeparateBR$UserType == "Potential Users", ]$Tool)

# Order dummy column based only on Potential users
toPlotToolKnowledgeSeparateBR <-
  toPlotToolKnowledgeSeparateBR %>% mutate(ToolOrder = case_when(
    UserType == "Potential Users" | Tool %in% AnVIL_only ~ avgScore,
    TRUE ~ 0
  ))


PlotToolKnowledge_potential_user_score <-
  ggplot(data = toPlotToolKnowledgeSeparateBR) +
  geom_point(data = toPlotToolKnowledgeSeparateBR[toPlotToolKnowledgeSeparateBR$UserType == "Potential Users" | toPlotToolKnowledgeSeparateBR$Tool %in% AnVIL_only ,],
             aes(color = UserType, shape = AnVILorNo, y = reorder(Tool, ToolOrder), x = avgScore)) +
  geom_point(data = toPlotToolKnowledgeSeparateBR[toPlotToolKnowledgeSeparateBR$UserType == "Current Users",],
             aes(color = UserType, shape = AnVILorNo, y = Tool, x = avgScore))

PlotToolKnowledge_customization(PlotToolKnowledge_potential_user_score)
ggsave(here("plots/tooldataresourcecomfortscore_singlepanel_by_potential_users.png"), w = 2200, h = 1350, units = "px")
```
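
The dummy-column trick used above can be sketched on toy data. Scores for the group that should drive the ordering are copied into `ToolOrder`, and every other row gets 0, so that summing `ToolOrder` per tool recovers only the potential-user score. The tools and scores below are made up, and `sum` is used here (rather than subsetting the layer data as in the chunk above) to keep the example short.

```r
library(dplyr)

# Hypothetical average scores for three tools and two user types
toy <- data.frame(
  Tool     = c("A", "A", "B", "B", "C", "C"),
  UserType = rep(c("Current Users", "Potential Users"), 3),
  avgScore = c(4, 1, 2, 3, 5, 2)
)

toy <- toy %>%
  mutate(ToolOrder = case_when(
    UserType == "Potential Users" ~ avgScore,
    TRUE ~ 0
  ))

# Tools ordered by potential-user score only: A (1), C (2), B (3)
tool_levels <- levels(reorder(toy$Tool, toy$ToolOrder, sum))
tool_levels
# "A" "C" "B"
```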

### Simpler plots focusing on a subset of the data

```{r}
#only separate from the AnVIL data

simplerPlot <- toPlotToolKnowledge %>%
  filter(AnVILorNo == "Separate from the AnVIL") %>% 
  ggplot(aes(y = reorder(Tool, avgScore), x=avgScore)) + geom_point(aes(color = UserType)) + 
  geom_line() + 
  scale_x_continuous(breaks = 0:5, labels = 0:5, limits = c(0,5)) + ylab("Tool or Resource") + xlab("Average Knowledge or Comfort Score") + theme_bw() + theme(panel.background = element_blank(), panel.grid.minor.x = element_blank()) + 
  annotation_custom(textGrob("Don't know\nat all", gp=gpar(fontsize=8, fontface = "bold")),xmin=0,xmax=0,ymin=-1,ymax=-1) + 
  annotation_custom(textGrob("Extremely\ncomfortable", gp=gpar(fontsize=8, fontface= "bold")),xmin=5,xmax=5,ymin=-1,ymax=-1) +
  coord_cartesian(clip = "off") +
  theme(plot.margin = margin(1,1,1,1.1, "cm"))+
  ggtitle("How would you rate your knowledge of or\ncomfort with these technologies\n(separate from the AnVIL)?") +
  theme(legend.title = element_blank())

simplerPlot

ggsave(here("plots/toolsSeparateFromAnVIL_comfortscore.png"), plot = simplerPlot)
```

```{r}
#add in purple points of comparison for On the AnVIL

toPlot_simplified <- toPlotToolKnowledge %>%
  filter(AnVILorNo == "Separate from the AnVIL")

onAnVIL <- toPlotToolKnowledge %>%
  filter(AnVILorNo == "On the AnVIL") %>%
  right_join(., toPlot_simplified,by = "Tool") %>%
  bind_rows(., 
            data.frame(Tool = "RStudio", 
                       avgScore.x = toPlotToolKnowledge[which(toPlotToolKnowledge$Tool == "Bioconductor & RStudio"),"avgScore"],
                       UserType.x = "Current Users",
                       AnVILorNo.x = "On the AnVIL"),
            data.frame(Tool = "Bioconductor", 
                       avgScore.x = toPlotToolKnowledge[which(toPlotToolKnowledge$Tool == "Bioconductor & RStudio"),"avgScore"],
                       UserType.x = "Current Users",
                       AnVILorNo.x = "On the AnVIL")
            ) %>% drop_na(avgScore.x)
```
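
The `.x` suffixes in the chunk above come from joining two tables that share column names other than the join key. A small sketch with made-up tools and scores shows where `avgScore.x` comes from and why rows without an AnVIL counterpart end up with `NA` (which `drop_na(avgScore.x)` then removes).

```r
library(dplyr)

# Hypothetical scores; tool names and values are placeholders
separate_scores <- data.frame(
  Tool      = c("Tool A", "Tool B"),
  avgScore  = c(3.0, 2.5),
  AnVILorNo = "Separate from the AnVIL"
)
anvil_scores <- data.frame(
  Tool      = "Tool A",
  avgScore  = 2.0,
  AnVILorNo = "On the AnVIL"
)

# Columns from the left table get ".x", columns from the right table get ".y"
joined <- right_join(anvil_scores, separate_scores, by = "Tool")
names(joined)
```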


```{r}
simplerPlot + geom_point(data = onAnVIL, aes(x=avgScore.x,y=Tool,colour="#C77CFF")) + 
  scale_color_manual( 
    values = c("#F8766D", "#00BFC4", "#C77CFF"), labels = c("Potential Users", "Current Users", "Current User Ratings\nfor related AnVIL tools")) + theme(legend.title = element_blank())

ggsave(here("plots/tools_comfortscore.png"))
```

```{r}
#only the data resources

toPlotToolKnowledge %>%
  filter(Tool == "DUOS" | Tool == "Access controlled access data" | Tool == "TDR" | Tool == "Terra Workspaces") %>%
  ggplot(aes(y = reorder(Tool, avgScore), x=avgScore)) + geom_point(colour = "#F8766D") + 
  scale_x_continuous(breaks = 0:5, labels = 0:5, limits = c(0,5)) + ylab("Data Resource") + xlab("Average Knowledge or Comfort Score") + theme_bw() + theme(panel.background = element_blank(), panel.grid.minor.x = element_blank()) + 
  annotation_custom(textGrob("Don't know\nat all", gp=gpar(fontsize=8, fontface = "bold")),xmin=0,xmax=0,ymin=-0.35,ymax=-0.35) + 
  annotation_custom(textGrob("Extremely\ncomfortable", gp=gpar(fontsize=8, fontface= "bold")),xmin=5,xmax=5,ymin=-0.35,ymax=-0.35) +
  coord_cartesian(clip = "off") +
  theme(plot.margin = margin(1,1,1,1.1, "cm"))+
  ggtitle("How would you rate your knowledge of or\ncomfort with these AnVIL data features?") +
  theme(legend.title = element_blank())

ggsave(here("plots/dataresources_comfortscore.png"))
```

## Experience: Types of data analyzed (supplemental)

*No supplements at this time*

## Experience: Genomics and Clinical Research Experience (supplemental)

### Should we split current users vs potential users?

Here we use two different plots to show that the distribution of experience level among these three research types is similar when comparing the distribution of current users vs potential users. In this first plot, we have the experience level on the x-axis, the count on the y-axis, and color the bars by research type. We stack the user type responses using `facet_wrap` and `nrow=2` as an argument within that. We use a `position="dodge"` to cluster the similar research type bars next to each other. And we use geom_text to label the bars with the actual count. This requires `group = researchType` within the `geom_text()` `aes()` and `position = position_dodge(width = 0.9)` within the general `geom_text()` function. 

We then also make some theme changes like rotating the x-axis tick labels and changing the y- and x- axis labels and using a minimal theme to turn off borders, and then turning off grids, etc.

```{r}
ggplot(experienceDf, aes(x=experienceLevel,y=n, fill=researchType)) + 
  facet_wrap(~UserType, nrow=2) + 
  geom_bar(stat="identity", position="dodge") + 
  theme_minimal() + 
  theme(panel.background = element_blank(), panel.grid = element_blank()) + 
  theme(axis.text.x = element_text(angle = 45, hjust=1)) + 
  geom_text(
    aes(label = n, group = researchType), 
    size=2, position = position_dodge(width = .9), vjust=-0.5
) + 
  ylab("Count") + xlab ("Reported Experience Level") +
  coord_cartesian(clip = "off")
  

ggsave(here("plots/researchExperienceLevel_colorResearchType.png"))
```

In this second plot, we have the experience level on the x-axis, the count on the y-axis, and color the bars by experience level. We stack the user type responses and separate out the research types into separate facets using `facet_grid`. And we use geom_text to label the bars with the actual count. This uses `group = experienceLevel` within the `geom_text()` `aes()`. 

We then also make some theme changes like rotating the x-axis tick labels and changing the y- and x- axis labels, expanding the left plot margin, and using a minimal theme to turn off borders, and then turning off grids, etc.

```{r}
ggplot(experienceDf, aes(x=experienceLevel,y=n, fill=experienceLevel)) + 
  facet_grid(UserType~researchType) + 
  geom_bar(stat="identity") + 
  theme_classic() + 
  theme(panel.background = element_blank(), panel.grid = element_blank()) + 
  theme(axis.text.x = element_text(angle = 45, hjust=1)) + 
  geom_text(
    aes(label = n, group = experienceLevel), vjust = -1, size=2
) + 
  ylab("Count") + xlab ("Reported Experience Level") +
  coord_cartesian(clip = "off") + 
  theme(plot.margin = margin(1,1,1,1.05, "cm")) +
  theme(legend.position = "none")
  

ggsave(here("plots/researchExperienceLevel_colorExperience.png"))
```

Both of these give us confidence that current and potential user counts for reported experience level in these research areas show similar distributions. So we'll go ahead and plot it without splitting out `UserType`.

### Alternate plot

<details><summary>Description of variable definitions and steps</summary>

This bar plot is the same as in the main analysis, but it doesn't use a fill for experience level. It has the experience level on the x-axis, the count on the y-axis. We facet the research category type and label the bars. We keep a summary stat and sum function and after_stat(y) for the label since the data has splits like UserType that we're not visualizing here. 

We adjust various aspects of the theme like turning off the grid and background and rotating the x-tick labels and changing the x- and y-axis labels. We also slightly widen the left axis so that the tick labels aren't cut off. 

</details>

```{r}
ggplot(experienceDf, aes(x=experienceLevel,y=n)) + 
  facet_grid(~researchType) + 
  geom_bar(stat="identity") + 
  theme_bw() + 
  theme(panel.background = element_blank(), panel.grid = element_blank()) + 
  theme(axis.text.x = element_text(angle = 45, hjust=1)) + 
  geom_text(
    aes(label = after_stat(y), group = experienceLevel), 
    stat = 'summary', fun = sum, vjust = -0.5, size=2
) + 
  ylab("Count") + xlab ("Reported Experience Level") +
  coord_cartesian(clip = "off") + 
  theme(plot.margin = margin(1,1,1,1.05, "cm")) +
  theme(legend.position = "none")+
  ggtitle("How much experience do you have analyzing the following data categories?")
  

ggsave(here("plots/researchExperienceLevel_noColor_noUserTypeSplit.png"))
```
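
To illustrate why the labels use `stat = 'summary'` with `fun = sum`, here is a small sketch on made-up counts. The data has a hidden split (`UserType`) that is not drawn, so each bar is built from several rows, and the label layer has to sum them. `layer_data()` is used only to inspect what the label layer computed.

```r
library(ggplot2)

# Hypothetical counts: "Low" is split across two user types
toy <- data.frame(
  experienceLevel = c("Low", "Low", "High"),
  UserType        = c("Current User", "Potential User", "Current User"),
  n               = c(1, 2, 5)
)

p <- ggplot(toy, aes(x = experienceLevel, y = n)) +
  geom_bar(stat = "identity") +
  geom_text(
    aes(label = after_stat(y), group = experienceLevel),
    stat = "summary", fun = sum, vjust = -0.5
  )

# y values computed for the label layer (layer 2): one summed value per bar
label_values <- layer_data(p, 2)$y
label_values
```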

### Follow-up: Overlap in experience levels for moderate or extreme experience categories for respondents

A potential follow-up question we had from examining the results of the "Experience: Genomics and Clinical Research Experience" section was "What's the overlap like for those moderately or extremely experienced in these various categories?" The results of that question follow.

```{r}
resultsTidy %>% 
  select(Timestamp, HumanGenomicExperience, HumanClinicalExperience, NonHumanGenomicExperience, UserType) %>%
  pivot_longer(c(HumanGenomicExperience, 
                 HumanClinicalExperience, 
                 NonHumanGenomicExperience), 
               names_to = "researchType", 
               values_to = "experienceLevel") %>%
  mutate(experienceLevel = 
           factor(experienceLevel, 
                  levels = c("Not at all experienced", 
                             "Slightly experienced", 
                             "Somewhat experienced", 
                             "Moderately experienced", 
                             "Extremely experienced")),
         researchType = case_when(researchType == "HumanClinicalExperience" ~ "Human Clinical\nResearch",
                                  researchType == "HumanGenomicExperience" ~ "Human Genomic\nResearch",
                                  researchType == "NonHumanGenomicExperience" ~ "Non-human\nGenomic Research"),
         Timestamp = factor(Timestamp)) %>%
  ggplot(aes(y = factor(experienceLevel,
                        levels = rev(c("Not at all experienced", 
                                       "Slightly experienced", 
                                       "Somewhat experienced", 
                                       "Moderately experienced", 
                                       "Extremely experienced"))), 
             x = Timestamp, 
             fill = experienceLevel)) +
  geom_tile() +
  scale_fill_manual(values = c("#035C94","#035385","#024A77","#024168", "#02395B")) +
  theme_bw() +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y = element_blank(), 
        axis.ticks.y = element_blank(),
        legend.position = "left") +
  ylab("") +
  ggtitle("How much experience do you have analyzing\nthe following data categories?") +
  xlab("Respondent") +
  facet_wrap(~researchType, nrow=3, strip.position="right")
```

```{r}
inputList <- list(ClinicalExperience = which(resultsTidy$clinicalFlag),
                  HumanGenomicsExperience = which(resultsTidy$humanGenomicFlag),
                  NonHumanGenomicsExperience = which(resultsTidy$nonHumanGenomicFlag))

ggVennDiagram(inputList, 
              category.names = c("Clinical\nExperience", "Human Genomics\nExperience", " Non-human Genomics Experience")) +
  scale_x_continuous(expand = expansion(mult = .2))
```
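
The Venn diagram input is a list of row indices, one vector per category, built with `which()` on logical flag columns. A base R sketch with made-up flags shows how the overlaps drawn in the diagram correspond to set intersections of those indices.

```r
# Hypothetical flags for five respondents
clinicalFlag        <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
humanGenomicFlag    <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
nonHumanGenomicFlag <- c(FALSE, FALSE, TRUE, TRUE, TRUE)

inputList <- list(
  ClinicalExperience         = which(clinicalFlag),
  HumanGenomicsExperience    = which(humanGenomicFlag),
  NonHumanGenomicsExperience = which(nonHumanGenomicFlag)
)

# Respondents flagged in all three categories
all_three <- Reduce(intersect, inputList)

# Respondents flagged in both clinical and human genomics
clinical_and_human <- intersect(inputList$ClinicalExperience,
                                inputList$HumanGenomicsExperience)
```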

## Experience: Controlled Access Datasets (supplemental)

*No supplements at this time*


## Awareness: Monthly AnVIL Demos (supplemental)

### Utilization

```{r}
demoPlotUtil <- resultsTidy %>%
  group_by(UserType, AnVILDemoUse) %>%
  summarize(count = n()) %>%
  ggplot(aes(y=reorder(AnVILDemoUse, count),
             x= count,
             fill = UserType)) + 
  geom_bar(stat = "identity") +
  ggtitle("Have you attended a monthly AnVIL Demo?")

stylize_bar(demoPlotUtil)
```  

### Awareness or utilization as fill color rather than y-axis split

```{r}
pd1 <- resultsTidy %>%
  group_by(UserType, AnVILDemoUse) %>%
  summarize(count = n()) %>%
  ggplot(aes(y=UserType,
             x= count,
             fill = AnVILDemoUse)) + 
  geom_bar(stat = "identity") +
  ggtitle("Have you attended a monthly AnVIL Demo?") +
  theme_classic() +
  xlab("") +
  ylab(" ") +
  scale_fill_manual(values = c("#25445A", "#7EBAC0")) +
  scale_x_continuous(breaks= pretty_breaks()) +
  theme(legend.title = element_blank())
```

```{r}
pd2 <- resultsTidy %>%
  group_by(UserType, AnVILDemoAwareness) %>%
  summarize(count = n()) %>%
  ggplot(aes(y=UserType,
             x=count,
             fill = AnVILDemoAwareness)) +
  geom_bar(stat = "identity") +
  #ggtitle("Have you attended a monthly AnVIL Demo?") +
  theme_classic() +
  xlab("Count") + 
  ylab(" ") +
  scale_fill_manual(values = c("#25445A", "#7EBAC0")) +
  scale_x_continuous(breaks= pretty_breaks()) +
  theme(legend.title = element_blank())

```

```{r}
pd1 / pd2
```

## Awareness: AnVIL Support Forum (supplemental)

```{r}
# Collapse the detailed forum interaction descriptions into a two-level factor
forumdf <- forumdf %>%
  mutate(
    forumUse = factor(
      case_when(
        forumInteractionDescription == "Posted in" ~ "Have utilized",
        forumInteractionDescription == "Answered someone's post" ~ "Have utilized",
        forumInteractionDescription == "Read through others' posts" ~ "Have utilized",
        forumInteractionDescription == "No but aware of" ~ "Have not utilized",
        forumInteractionDescription == "No didn't know of" ~ "Have not utilized"
      ),
      levels = c("Have not utilized", "Have utilized")
    )
  )
```

### Utilization

```{r}
forumPlotUtil <- forumdf %>%
  group_by(UserType, forumUse) %>%
  summarize(count = n()) %>%
  ggplot(aes(y=reorder(forumUse, count),
             x= count,
             fill = UserType)) + 
  geom_bar(stat = "identity") +
  ggtitle("Have you ever read or posted in our AnVIL Support Forum?")

stylize_bar(forumPlotUtil)
```  

### Awareness or utilization as fill color rather than y-axis split

```{r}
pf1 <- forumdf %>%
  group_by(UserType, forumUse) %>%
  summarize(count = n()) %>%
  ggplot(aes(y=UserType,
             x= count,
             fill = forumUse)) + 
  geom_bar(stat = "identity") +
  ggtitle("Have you ever read or posted in our AnVIL Support Forum?") +
  theme_classic() +
  xlab("") +
  ylab(" ") +
  scale_fill_manual(values = c("#25445A", "#7EBAC0")) +
  scale_x_continuous(breaks= pretty_breaks()) +
  theme(legend.title = element_blank())
```

```{r}
pf2 <- forumdf %>%
  group_by(UserType, forumAwareness) %>%
  summarize(count = n()) %>%
  ggplot(aes(y=UserType,
             x=count,
             fill = forumAwareness)) +
  geom_bar(stat = "identity") +
  #ggtitle("Have you ever read or posted in our AnVIL Support Forum?") +
  theme_classic() +
  xlab("Count") + 
  ylab(" ") +
  scale_fill_manual(values = c("#25445A", "#7EBAC0")) +
  scale_x_continuous(breaks= pretty_breaks()) +
  theme(legend.title = element_blank())

```

```{r}
pf1 / pf2
```

## Preferences: Feature Importance Ranking (supplemental)

### Numerical response bias

<details><summary>Visualizing the numerical response bias since there were non-unique ranks assigned by some respondents</summary>

```{r}
resultsTidy %>%
  select(starts_with("PotentialRank")) %>%
  rowSums(na.rm = TRUE) %>%
  table() %>% as.data.frame()
```

We would expect a row sum of 21 if a 6, 5, 4, 3, 2, and 1 were each selected once. We see row sums ranging from 6 (ranking everything 1) to 24. Only 8 out of 28 responses have a row sum of 21, and even that doesn't guarantee that all choices received a unique ranking for those 8 responses (e.g., three 2's, one 4, one 5, and one 6 sum to 21). So this table instead shows that 20 responses definitely did not use unique ranks for all 6 questions. Given that most of the observed sums are less than 21, people showed a bias towards ranking things as more important (closer to 1).

```{r}
resultsTidy %>% 
  select(starts_with("CurrentRank")) %>% 
  rowSums(na.rm = TRUE) %>%
  table() %>% as.data.frame()
```

We again would expect a row sum of 21 if a 6, 5, 4, 3, 2, and 1 were each selected once. We see row sums ranging from 6 (ranking everything 1) to 26. Only 9 out of 22 responses have a row sum of 21, and even that doesn't guarantee that all choices received a unique ranking for those 9 responses (e.g., three 2's, one 4, one 5, and one 6 sum to 21). So this table instead shows that 13 responses definitely did not use unique ranks for all 6 questions. Given that most of the observed sums are less than 21, people showed a bias towards ranking things as more important (closer to 1).
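
The row-sum reasoning can be sketched in base R with three made-up responses: a sum of 21 is necessary but not sufficient for unique ranks, so a direct duplicate check per row is the stricter test.

```r
# Hypothetical rankings of six features by three respondents
ranks <- rbind(
  unique_ranks = c(1, 2, 3, 4, 5, 6),
  tied_but_21  = c(2, 2, 2, 4, 5, 6),
  all_ones     = c(1, 1, 1, 1, 1, 1)
)

# Row sums: 21, 21, and 6
row_sums <- rowSums(ranks)

# TRUE only when a respondent used each rank at most once
all_unique <- apply(ranks, 1, function(r) anyDuplicated(r) == 0)
all_unique
```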

We can visualize this numerical response bias, where people tended to rate things as more important, by creating a density plot of all rankings no matter the feature queried or the user type.

```{r}
resultsTidy %>%
  select(starts_with(c("CurrentRank", "PotentialRank"))) %>%
  pivot_longer(cols = everything()) %>%
  drop_na() %>%
  ggplot(aes(x = value)) +
  geom_density() +
  theme_bw() + theme(panel.background = element_blank()) +
  xlab("Rank") + scale_x_continuous(breaks = 1:6, labels = 1:6)
```

</details>

### Plot Density plot

#### Prepare data

<details><summary>Description of variable definitions and steps</summary>

Here, we just want all of the numerical ranks in one column and we can have additional columns that describe if that rank was from a current or potential user and which feature it corresponds to.

So to make a dataframe `densitydf`, we

* start by selecting the columns of interest from `resultsTidy` using `select(starts_with(c("PotentialRank", "CurrentRank")))`.
* take this "wide" dataframe and pivot it to a longer one where the values all go to a `value` column, and the column name associated with each value goes into a `name` column.
* drop rows that have NA with `drop_na()` since, as described earlier, not every survey respondent was asked each question; e.g., if they were a current user they weren't asked as a potential user.
* `separate` the `name` column on the word "Rank", which removes the `name` column we just made and creates two new columns (`Usertype` and `Feature`), where `Usertype` is either "Current" or "Potential" and `Feature` holds the short feature codes listed in the code below.
* use a `case_when` within a `mutate()` to expand those feature codes so they're more informative and show the choices survey respondents were given.
* add another `case_when` within that `mutate()` to add the word "Users" to the `Usertype` column values.

</details>

```{r}
densitydf <- resultsTidy %>% 
    select(starts_with(c("PotentialRank", "CurrentRank"))) %>% pivot_longer(cols = everything()) %>% drop_na() %>%
  separate(name, c("Usertype", "Feature"), sep = "Rank", remove = TRUE) %>%
    mutate(Feature = 
               case_when(Feature == "EasyBillingSetup" ~ "Easy billing setup",
                         Feature == "FlatRateBilling" ~ "Flat-rate billing rather than use-based",
                         Feature == "FreeVersion" ~ "Free version with limited compute or storage",
                         Feature == "SupportDocs" ~ "On demand support and documentation",
                         Feature == "ToolsData" ~ "Specific tools or datasets are available/supported",
                         Feature == "CommunityAdoption" ~ "Greater adoption of the AnVIL by the scientific community"),
           Usertype =
             case_when(Usertype == "Current" ~ "Current Users",
                       Usertype == "Potential" ~ "Potential Users")
    )
```
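
To make the reshaping steps concrete, here is a minimal sketch on made-up data. The `toyWide` and `toyLong` names and their values are hypothetical and not part of the survey; only the column-naming pattern (`<Usertype>Rank<Feature>`) is borrowed from `resultsTidy`. The block is display-only so it does not affect the knitted results.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-respondent survey: each person answered only one set of rank questions
toyWide <- tibble(
  PotentialRankFreeVersion = c(1, NA),
  PotentialRankSupportDocs = c(2, NA),
  CurrentRankFreeVersion   = c(NA, 2),
  CurrentRankSupportDocs   = c(NA, 1)
)

toyLong <- toyWide %>%
  select(starts_with(c("PotentialRank", "CurrentRank"))) %>%
  pivot_longer(cols = everything()) %>%  # creates the `name` and `value` columns
  drop_na() %>%                          # drops questions a respondent was never asked
  separate(name, c("Usertype", "Feature"), sep = "Rank", remove = TRUE)

toyLong
```

Because `sep = "Rank"` is treated as a pattern to split on, this approach assumes that no feature code itself contains the word "Rank".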


#### Density plot

<details><summary>Description of variable definitions and steps</summary>

We use the `densitydf` dataframe we just made. The x-axis shows the raw rank from the `value` column, and the y-axis shows the density. The density curves are grouped and filled by the feature they represent, and we use `facet_wrap` to split the plot into two rows, one per user type. We set the alpha value within `geom_density` because so many of the curves overlap.

We also add annotations (using [Grobs, explained in this Stack Overflow post answer](https://stackoverflow.com/a/31081162)) to mark which rank was "Most important" and which was "Least important". Because these annotations sit below the panel, we turn off clipping with `coord_cartesian(clip = "off")` and increase the bottom margin so they aren't cut off.

Finally, we change some theme settings, add labels and titles, and then display and save the plot.

</details>

```{r}
ggplot(densitydf, aes(x=value, group = Feature, fill = Feature)) +
  facet_wrap(~Usertype, nrow = 2) +
  geom_density(alpha=0.3) + 
  theme_bw() + theme(panel.background = element_blank()) +
  xlab("Rank") + scale_x_continuous(breaks = 1:6, labels= 1:6, limits = c(1, 6)) +
  ggtitle("Rank the following features according to\ntheir importance to you as a potential user\nor for your continued use of the AnVIL")+
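  # textGrob() and gpar() come from the grid package; namespacing them avoids relying on grid being attached
  annotation_custom(grid::textGrob("Most\nimportant", gp = grid::gpar(fontsize = 8, fontface = "bold")),
                    xmin = 1, xmax = 1, ymin = -0.85, ymax = -0.85) +
  annotation_custom(grid::textGrob("Least\nimportant", gp = grid::gpar(fontsize = 8, fontface = "bold")),
                    xmin = 6, xmax = 6, ymin = -0.85, ymax = -0.85) +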
  coord_cartesian(clip = "off") +
  theme(plot.margin = margin(1,1,1.25,1, "cm"))
ggsave(here("plots/densityplot_rankfeatures.png"))
```
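
The annotation technique can be isolated in a small sketch like the one below. The `toyRanks` data are randomly generated and purely illustrative, and the `ymin`/`ymax` offset of `-0.1` is an assumption that would need tuning to the y-axis range of any real plot. The block is display-only.

```r
library(ggplot2)
library(grid)  # textGrob() and gpar() come from grid

set.seed(1)
# Hypothetical ranks, not survey data
toyRanks <- data.frame(value = sample(1:6, 60, replace = TRUE))

p <- ggplot(toyRanks, aes(x = value)) +
  geom_density(alpha = 0.3, fill = "grey50") +
  scale_x_continuous(breaks = 1:6, limits = c(1, 6)) +
  # A grob placed at a negative y value lands below the panel
  annotation_custom(textGrob("Most\nimportant", gp = gpar(fontsize = 8, fontface = "bold")),
                    xmin = 1, xmax = 1, ymin = -0.1, ymax = -0.1) +
  coord_cartesian(clip = "off") +                   # allow drawing outside the panel
  theme(plot.margin = margin(1, 1, 1.25, 1, "cm"))  # extra room at the bottom

p
```

Without `clip = "off"`, anything drawn outside the panel area is clipped, so the label would not appear.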

#### Density plot with facets for feature

<details><summary>Description of variable definitions and steps</summary>

We use the `densitydf` dataframe we just made, but we shorten the `Feature` labels so that they fit in the legend. For the plot, the x-axis shows the raw rank from the `value` column, and the y-axis shows the density. The density curves are grouped and filled by the feature they represent, and we use `facet_grid` to split the plot into two rows and six columns, so that there's one row per user type and one column per feature. We move the row/y-axis strip labels to the left (using `switch = "y"`) and remove the column/x-axis strip labels (using `theme(strip.background.x = element_blank(), strip.text.x = element_blank())`).

We use the `unit()` function to create some margins, and then set the plot margins and legend position within another `theme()`. I used [this Stack Overflow post to find this method.](https://stackoverflow.com/questions/29808620/ggplot2-move-legend-to-corner-but-keep-it-in-margin)

We change some theme settings, add labels and titles, and then display and save the plot.

</details>

```{r}
margins <- unit(c(1, 10, 1, 1), 'lines')

densitydf %>% 
  mutate(Feature = 
           case_when(Feature == "Easy billing setup" ~ "Easy billing setup",
                     Feature == "Flat-rate billing rather than use-based" ~ "Flat-rate billing",
                     Feature == "Free version with limited compute or storage" ~ "Free version",
                     Feature == "On demand support and documentation" ~ "Support & documentation",
                     Feature == "Specific tools or datasets are available/supported" ~ "Specific tools or datasets",
                     Feature == "Greater adoption of the AnVIL by the scientific community" ~ "More community adoption")
        ) %>% 
  ggplot(aes(x=value, group = Feature, fill = Feature)) +
  facet_grid(Usertype~Feature, switch = "y") +
  geom_density() + 
  theme_bw() + theme(panel.background = element_blank()) + #theme(legend.position = "bottom") +
  theme(strip.background.x = element_blank(), strip.text.x = element_blank()) +
  theme(plot.margin=margins, legend.position=c(1.25, 0.5)) +
  xlab("Rank") + scale_x_continuous(breaks = 1:6, labels= 1:6, limits = c(1, 6)) +
  ggtitle("Rank the following features according to their importance to you as a\npotential user or for your continued use of the AnVIL")+
  coord_cartesian(clip = "off")
  
ggsave(here("plots/densityplot_rankfeatures_faceted.png"))
```

### Plot Stacked Bar Chart showing number of times for each rank rather than average

#### Prepare data (count)

<details><summary>Description of variable definitions and steps</summary>

For this, we want a dataframe that gives counts for all of the ranks given to each feature by each user type.

To do this we

  * select the relevant columns from `resultsTidy`, specifically using `select(starts_with(c("PotentialRank", "CurrentRank")))`.
  * pivot this "wide" dataframe to a longer one (`pivot_longer`), so that every rank goes into a `value` column and the name of the column it came from goes into a `name` column.
  * drop rows that contain `NA` with `drop_na()` since, as described earlier, not every survey respondent was asked each question; e.g., if they were a current user they weren't asked as a potential user.
  * group by `name` (feature and user type combined) and `value` (the rank), and count how many times each rank was given for each feature/user type combination.
  * rename the columns so they're easier to follow: `name` stays `name`, `value` becomes `rank`, and `n` holds the count.
  * `separate` the `name` column on the word "Rank". This removes the `name` column and creates two new columns, `Usertype` and `Feature`, where `Usertype` is either "Current" or "Potential" and `Feature` holds the short feature codes listed in the code below.
  * use a `case_when` within a `mutate()` to expand those feature codes so they're more informative and show the choices survey respondents were given.
  * add another `case_when` within that `mutate()` to append the word "Users" to the `Usertype` column values.
  * convert `rank` to a factor with levels ordered from 6 to 1. Treating it as categorical gives a discrete color scheme rather than a continuous one, and the level order makes the most important rank the first bar segment on the left when we plot.

</details>

```{r}
countdf <- resultsTidy %>% 
    select(starts_with(c("PotentialRank", "CurrentRank"))) %>% 
  pivot_longer(cols = everything()) %>% 
  drop_na() %>% 
  group_by(name, value) %>% 
  count() %>% 
  rename(rank = value) %>%
  separate(name, c("Usertype", "Feature"), sep = "Rank", remove = TRUE) %>%
  mutate(Feature = 
           case_when(Feature == "EasyBillingSetup" ~ "Easy billing setup",
                     Feature == "FlatRateBilling" ~ "Flat-rate billing rather than use-based",
                     Feature == "FreeVersion" ~ "Free version with limited compute or storage",
                     Feature == "SupportDocs" ~ "On demand support and documentation",
                     Feature == "ToolsData" ~ "Specific tools or datasets are available/supported",
                     Feature == "CommunityAdoption" ~ "Greater adoption of the AnVIL by the scientific community"),
        Usertype =
            case_when(Usertype == "Current" ~ "Current Users",
                      Usertype == "Potential" ~ "Potential Users"),
        rank = factor(rank, levels = c(6:1))
    )
```
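
The counting and factor steps can be checked on a small made-up example. The `toyLong` and `toyCounts` objects below are hypothetical, and the `ungroup()` call is an addition for tidiness in this sketch rather than something the chunk above does. The block is display-only.

```r
library(dplyr)

# Hypothetical long-format ranks: three current-user answers and one potential-user answer
toyLong <- tibble(
  name  = c("CurrentRankFreeVersion", "CurrentRankFreeVersion",
            "CurrentRankFreeVersion", "PotentialRankFreeVersion"),
  value = c(1, 1, 2, 1)
)

toyCounts <- toyLong %>%
  group_by(name, value) %>%
  count() %>%                 # one row per name/value combination, with the count in `n`
  ungroup() %>%               # assumption: dropped here only to keep the sketch simple
  rename(rank = value) %>%
  mutate(rank = factor(rank, levels = 6:1))

toyCounts
levels(toyCounts$rank)
```

Listing the levels as `6:1` keeps all six ranks in the factor even when some ranks never occur in the data, so the legend order stays fixed.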

#### Stacked bar chart

<details><summary>Description of variable definitions and steps</summary>

Using the `countdf` dataframe that we just made, we put the count (`n`) column on the x-axis, the `Feature` on the y-axis, and fill the bars by `rank` (categorical 1, 2, 3, 4, 5, 6). We facet wrap on `Usertype` with two rows so that each facet represents a different user type.

We use the `position = "fill"` argument in `geom_bar()` so that it's a percent stacked bar instead of raw counts (since current and potential users had a different number of respondents).

We set the legend labels so that they specify which rank is least important and which is most important, and we reverse the legend order so that 1 is at the top.

Finally, we set labels and titles and change the theme a bit.

</details>

```{r}
ggplot(countdf, aes(fill=rank, y=Feature, x=n)) +
  facet_wrap(~Usertype, nrow=2) +
    geom_bar(position="fill", stat="identity") +
  scale_fill_discrete(labels=c('6 (Least\n    Important)', '5', '4', '3', '2', '1 (Most\n    Important)')) +
  guides(fill = guide_legend(reverse = TRUE)) +
  xlab("Percent Responses") +
  ggtitle("Rank the following features according to\ntheir importance to you as a potential user\nor for your continued use of the AnVIL") +
  theme_bw() + theme(panel.background = element_blank(), panel.grid = element_blank())

ggsave(here("plots/stackedbarplot_rankfeatures.png"))
```
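
To see what `position = "fill"` does to the counts, the sketch below computes the equivalent proportions by hand: each count is divided by the total for its bar. The `toyCounts` values are invented for illustration and `prop` is a hypothetical column name. The block is display-only.

```r
library(dplyr)

# Hypothetical counts: 4 current-user responses and 20 potential-user responses
toyCounts <- tibble(
  Usertype = c("Current Users", "Current Users", "Potential Users", "Potential Users"),
  Feature  = "Free version",
  rank     = c(1, 2, 1, 2),
  n        = c(3, 1, 10, 10)
)

toyProps <- toyCounts %>%
  group_by(Usertype, Feature) %>%   # one group per bar in the faceted plot
  mutate(prop = n / sum(n)) %>%     # share of that bar taken by each rank
  ungroup()

toyProps
```

The resulting shares sum to 1 within each bar, which is why the x-axis of the plot runs from 0 to 1 regardless of how many respondents each user type had.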


## Preferences: Training Workshop Modality Ranking (supplemental)

*No supplements at this time*

## Preferences: Where analyses are currently run (supplemental)

*No supplements at this time*

## Preferences: DMS compliance/data repositories (supplemental)

*No supplements at this time*

## Preferences: Source for cloud computing funds (supplemental)

### Alternate plot

```{r}
toPlotFundingSource %>% 
  mutate(UserType = case_when(
    UserType == "Current User" ~ "Current",
    UserType == "Potential User" ~ "Potential"
         ),
   whichFundingSource = factor(whichFundingSource, levels = rev(c("NHGRI", "Other NIH", "Institutional funds", "Foundation Grant", "NSF", "Only use free options", "Don't know")))
  ) %>%
  ggplot(aes(y = UserType, x = count, fill = whichFundingSource)) +
  geom_bar(position = "fill", stat = "identity") +
  scale_fill_manual(values = rev(c("#035C94", "#012840", "#F2F2F2", "#E0DD10", "#AEEBF2", "#7EBAC0", "#333333"))) +
  theme_bw() +
  ggtitle("What source(s) of funds do you use to pay for cloud computing?") +
  xlab("Fraction of responses") +
  ylab("User Type") +
  theme(panel.background = element_blank(),
        panel.grid.minor.x = element_blank(), 
        panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank()) +
  labs(fill="Funding Source")
  

ggsave(here("plots/fundingsources_colorSource.png"))
```

## Returning User: Length of Use of the AnVIL (supplemental)

*No supplements at this time*

## Returning User: Foreseeable Computational Needs (supplemental)

*No supplements at this time*

## Returning User: Recommendation likelihood (supplemental)

*No supplements at this time*

## Session Info

```{r}
sessionInfo()
```