---
title: "State of the AnVIL 2024"
subtitle: "Supplementary Analysis and Alternative Plots"
author: "Kate Isaac, Elizabeth Humphries, & Ava Hoffman"
date: "`r Sys.Date()`"
output: html_document
---
```{r results='hide', warning=FALSE, message=FALSE}
library(here)
library(patchwork)
library(ggVennDiagram)
knitr::knit_child(here("anvilPoll2024MainAnalysis.Rmd"))
```
# Supplemental Analyses and Graphs
## Identify User Type (supplemental)
*No supplements at this time*
## Demographics: Highest Degree (supplemental)
<details><summary>Description of variable definitions and steps</summary>
First we group `resultsTidy` by the columns of interest, `Degrees` and `UserType`, and use `summarize(n = n())` to count how many times each combination is observed in the data.
Then we send this data to ggplot and make a bar chart with the degrees on the x-axis. The degrees are `reorder`ed by the summed count so that higher totals come first; without the `sum`, the bar for the 2 MDs would be placed after the high school and master's in progress bars (1 each). The y-axis represents the count, and the fill specifies user type (current or potential AnVIL users). We use a stacked bar chart and label each bar with the total sum for that degree type.
We used [this Stack Overflow post to label sums above the bars](https://stackoverflow.com/questions/30656846/draw-the-sum-value-above-the-stacked-bar-in-ggplot2)
and [this Stack Overflow post to remove NA from the legend](https://stackoverflow.com/questions/45493163/ggplot-remove-na-factor-level-in-legend).
The rest of the changes are related to theme and labels and to making sure that the numerical bar labels aren't cut off at the top.
</details>
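As a rough illustration of the counting and ordering steps described above, here is a minimal sketch on made-up data. The `toy` tibble and its values are hypothetical and only stand in for `resultsTidy`; the sketch assumes `dplyr` is available.

```r
library(dplyr)

# Hypothetical toy responses; the real `resultsTidy` has many more columns.
toy <- tibble(
  Degrees  = c("PhD", "PhD", "PhD", "MD", "MD", "High school"),
  UserType = c("Current User", "Potential User", "Potential User",
               "Current User", "Potential User", "Potential User")
)

# One row per Degrees/UserType combination, with its count in `n`.
counts <- toy %>%
  group_by(Degrees, UserType) %>%
  summarize(n = n(), .groups = "drop")

# reorder() sums -n within each degree, so the degree with the largest
# total comes first regardless of how it is split across user types.
degree_order <- levels(reorder(counts$Degrees, -counts$n, sum))
degree_order
```

In this toy case MD appears once per user type, so ordering by the default (mean) would tie it with the single high school response; summing is what moves it ahead.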
```{r}
resultsTidy %>%
group_by(Degrees, UserType) %>%
summarize(n = n()) %>%
ggplot(aes(x = reorder(Degrees, -n, sum),
y = n,
fill = UserType
)) +
geom_bar(position = "stack", stat="identity") +
geom_text(
aes(label = after_stat(y), group = Degrees),
stat = 'summary', fun = sum, vjust = -1, size=2
) +
theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust=1)) +
xlab("Degree") +
ylab("Count") +
coord_cartesian(clip = "off") +
scale_fill_manual(values = c("#E0DD10", "#035C94"), na.translate = F) +
ggtitle("What is the highest degree you have attained?") +
theme(legend.title = element_blank())
ggsave(here("plots/degree_usertype.png")) #set plot size
```
## Demographics: Kind of Work (supplemental)
*No supplements at this time*
## Demographics: Institutional Affiliation (supplemental)
### Number of institutions represented in responses
```{r}
length(unique(resultsTidy$InstitutionalAffiliation))
```
### Institution type
Let's make a bar chart that shows how many responses came from each institution, colored by institution type.
<details><summary>Description of variable definitions and steps</summary>
We first prepare the data by grouping `resultsTidy` by the columns of interest, `InstitutionalAffiliation` and `InstitutionalType`, and using `summarize(InstitutionalCount = n())` to add a count (`InstitutionalCount`) for every `InstitutionalAffiliation`. We include `InstitutionalType` in the `group_by` even though it's redundant for the counts, because we want to color by institution type.
We then plot the data with the Affiliation on the y-axis (reordered by the count so largest count is on top),
the count on the x-axis, and the fill color being the institutional type.
We change some theme and label elements and add a grob annotation to specify how many unique institutions are represented in this graph.
</details>
```{r}
resultsTidy %>%
group_by(InstitutionalAffiliation, InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
ggplot(aes(
y = reorder(InstitutionalAffiliation, InstitutionalCount),
x = InstitutionalCount,
fill = InstitutionalType
)) + geom_bar(stat = "identity") +
ggtitle("What institution are you affiliated with?")+
annotation_custom(textGrob(paste("There are\n", length(unique(resultsTidy$InstitutionalAffiliation)) ,"\nunique institutions"), gp=gpar(fontsize=8, fontface = "bold")),xmin=7,xmax=7,ymin=3,ymax=3) +
coord_cartesian(clip = "off") +
theme_classic() +
xlab("Count")
ggsave(here("plots/institutionalAffilition_allResponses.png"))
```
Taking a less granular approach, we aggregate by institution type rather than looking at the names of institutions.
```{r}
instPlot <- resultsTidy %>%
group_by(UserType, InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
ggplot(aes(
y = reorder(InstitutionalType, InstitutionalCount, sum),
x = InstitutionalCount,
fill = UserType
)) + geom_bar(position = "stack", stat = "identity") +
annotation_custom(textGrob(paste("There are\n", length(unique(resultsTidy$InstitutionalAffiliation)) ,"\nunique institutions"), gp=gpar(fontsize=8, fontface = "bold")),xmin=34,xmax=34,ymin=2.5,ymax=2.5) +
coord_cartesian(clip = "off") +
ggtitle("What institution are you affiliated with?")
stylize_bar(instPlot)
ggsave(here("plots/institutionalType_allResponses_colorUserType.png"))
```
#### Just for Current/Returning Users
The above plot was for all survey responses. Here we want to focus on institutions represented by just current users of AnVIL.
<details><summary>Description of variable definitions and steps</summary>
We first select rows/responses that are just from Current users. Then we prepare the data and plot following the same scheme as above.
</details>
```{r}
resultsTidy %>%
filter(UserType == "Current User") %>%
group_by(InstitutionalAffiliation, InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
ggplot(aes(
y = reorder(InstitutionalAffiliation, InstitutionalCount),
x = InstitutionalCount,
fill = InstitutionalType
)) + geom_bar(stat = "identity") +
theme_bw() +
theme(
panel.background = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank()
) +
ylab("Institutional Affiliation") + xlab("Count") +
ggtitle(bquote('Institutional Affiliation for' ~ bold('Current User') ~ 'Respondents')) +
annotation_custom(textGrob(paste("There are\n", nrow(unique(resultsTidy[which(resultsTidy$UserType == "Current User"), "InstitutionalAffiliation"])) ,"\nunique institutions"), gp=gpar(fontsize=8, fontface = "bold")),xmin=5.5,xmax=5.5,ymin=3,ymax=3) +
coord_cartesian(clip = "off")
ggsave(here("plots/institutionalAffilition_currentUserResponses.png"))
```
Taking a less granular approach, we look at institution type rather than the names of institutions. We save the plot into a variable so that we can combine it with the one for potential users later.
Note that the y-axis label is turned off, while the x-axis label is kept since this will be the bottom plot when combined. The title is simplified to just say Current Users, and the legend is turned off.
We also used `scale_fill_manual` to set specific colors for the institution types, so that the colors match between this plot and the potential users version (`institutionTypePotential`); more info on this is given with that plot below.
```{r}
institutionTypeCurrent <- resultsTidy %>%
filter(UserType == "Current User") %>%
group_by(InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
ggplot(aes(
y = reorder(InstitutionalType, InstitutionalCount),
x = InstitutionalCount,
fill = InstitutionalType
)) + geom_bar(stat = "identity") +
theme_bw() +
theme(
panel.background = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank()
) +
ylab("") +
xlab("Count") +
#xlab("") +
ggtitle(bquote(bold("Current Users"))) +
coord_cartesian(clip = "off") +
scale_fill_manual(values = c("R1 University" = "#FDB462",
"Research Center" = "#FCCDE5",
"Medical Center or School" = "#FB8072",
"R2 University" = "#B3DE69")) +
theme(legend.position = "none")
institutionTypeCurrent
#ggsave(here("plots/institutionalType_currentUserResponses.png"), plot = institutionTypeCurrent)
```
#### Just for Potential Users
Here we want to focus on institutions represented by just potential users of AnVIL.
<details><summary>Description of variable definitions and steps</summary>
We first select rows/responses that are just from potential users. Then we prepare the data and plot following the same scheme as above.
</details>
```{r}
resultsTidy %>%
filter(UserType == "Potential User") %>%
group_by(InstitutionalAffiliation, InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
ggplot(aes(
y = reorder(InstitutionalAffiliation, InstitutionalCount),
x = InstitutionalCount,
fill = InstitutionalType
)) + geom_bar(stat = "identity") +
theme_bw() +
theme(
panel.background = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank()
) +
ylab("Institutional Affiliation") + xlab("Count") +
ggtitle(bquote('Institutional Affiliation for' ~ bold('Potential User') ~ 'Respondents')) +
annotation_custom(textGrob(paste("There are\n", nrow(unique(resultsTidy[which(resultsTidy$UserType == "Potential User"), "InstitutionalAffiliation"])) ,"\nunique institutions"), gp=gpar(fontsize=8, fontface = "bold")),xmin=6,xmax=6,ymin=1.5,ymax=1.5) +
coord_cartesian(clip = "off")
ggsave(here("plots/institutionalAffilition_potentialUserResponses.png"))
```
Taking a less granular approach, we look at institution type rather than the names of institutions.
We wanted to sync the colors between the current and potential institutional types, so we used the Set3 palette with `scale_fill_brewer`: it has 12 colors (the potential users plot needs 9) and it seemed more accessible than the Paired palette. To see the hex codes that were assigned to the shared institution types in this plot, I used the `scales` library and `brewer_pal(palette = "Set3")(9)`.
Turned off both axis labels since this will be the top plot when combined with the current user version (`institutionTypeCurrent`). Also used `xlim` to sync the x-axis limits between the two.
Simplified the title to just be Potential Users. Turned off the legend.
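The color lookup described above can be sketched as follows. This is only an illustration: the nine type names and their order are assumptions (five of them are invented placeholders), and the real order comes from the levels of `InstitutionalType` in the data.

```r
library(scales)

# First nine Set3 colors, in the order a brewer fill scale hands them out.
pal <- brewer_pal(palette = "Set3")(9)

# Hypothetical level order; only the four shared types below appear in the
# source, the other names are placeholders for illustration.
types <- c("Community College", "Government", "Industry",
           "Medical Center or School", "Nonprofit", "R1 University",
           "R2 University", "Research Center", "Unknown")

# Pair each level with its palette color, then keep the shared types.
type_colors <- setNames(pal, types)
shared <- type_colors[c("R1 University", "Research Center",
                        "Medical Center or School", "R2 University")]
shared
```

A named vector like `shared` is the form that `scale_fill_manual(values = ...)` accepts, which keeps a given type the same color even when a plot shows only a subset of the types.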
```{r}
institutionTypePotential <- resultsTidy %>%
filter(UserType == "Potential User") %>%
group_by(InstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
ggplot(aes(
y = reorder(InstitutionalType, InstitutionalCount),
x = InstitutionalCount,
fill = InstitutionalType
)) + geom_bar(stat = "identity") +
theme_bw() +
theme(
panel.background = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank()
) +
ylab("") +
xlab("") +
#xlab("Count") +
xlim(0,15) +
ggtitle(bquote(bold("Potential Users"))) +
coord_cartesian(clip = "off") +
scale_fill_brewer(palette = "Set3") +
theme(legend.position = "none")
institutionTypePotential
#ggsave(here("plots/institutionalType_potentialUserResponses.png"), plot = institutionTypePotential)
```
Combined the two plots for institutional type (`institutionTypeCurrent` and `institutionTypePotential`) using patchwork, stacking them on top of each other (`/`) with the potential users plot on top. We use `plot_layout` to set the heights: there are more institution types for potential users than for current users, so the current users panel is made shorter than the default.
```{r}
combined_plot <- institutionTypePotential / institutionTypeCurrent + plot_layout(heights = unit(c(4, 2),'cm')) + plot_annotation("What institution are you affiliated with?")
combined_plot
ggsave(here("plots/institutionalType_facetedUserType.png"), plot = combined_plot)
```
## Demographics: Consortia Affiliations (supplemental)
*No supplements at this time*
## Experience: Tool & Resource Knowledge/Comfort level (supplemental)
### Plot y-axis ordered by potential user ratings
```{r}
# Provide a list of AnVIL only Tools
AnVIL_only <-
setdiff(toPlotToolKnowledgeSeparateBR[toPlotToolKnowledgeSeparateBR$UserType == "Current Users" &
toPlotToolKnowledgeSeparateBR$AnVILorNo == "On the AnVIL", ]$Tool,
toPlotToolKnowledgeSeparateBR[toPlotToolKnowledgeSeparateBR$UserType == "Potential Users", ]$Tool)
# Order dummy column based only on Potential users
toPlotToolKnowledgeSeparateBR <-
toPlotToolKnowledgeSeparateBR %>% mutate(ToolOrder = case_when(
UserType == "Potential Users" | Tool %in% AnVIL_only ~ avgScore,
TRUE ~ 0
))
PlotToolKnowledge_potential_user_score <-
ggplot(data = toPlotToolKnowledgeSeparateBR) +
geom_point(data = toPlotToolKnowledgeSeparateBR[toPlotToolKnowledgeSeparateBR$UserType == "Potential Users" | toPlotToolKnowledgeSeparateBR$Tool %in% AnVIL_only ,],
aes(color = UserType, shape = AnVILorNo, y = reorder(Tool, ToolOrder), x = avgScore)) +
geom_point(data = toPlotToolKnowledgeSeparateBR[toPlotToolKnowledgeSeparateBR$UserType == "Current Users",],
aes(color = UserType, shape = AnVILorNo, y = Tool, x = avgScore))
PlotToolKnowledge_customization(PlotToolKnowledge_potential_user_score)
ggsave(here("plots/tooldataresourcecomfortscore_singlepanel_by_potential_users.png"), w = 2200, h = 1350, units = "px")
```
### Simpler plots focusing on a subset of the data
```{r}
#only separate from the AnVIL data
simplerPlot <- toPlotToolKnowledge %>%
filter(AnVILorNo == "Separate from the AnVIL") %>%
ggplot(aes(y = reorder(Tool, avgScore), x=avgScore)) + geom_point(aes(color = UserType)) +
geom_line() +
scale_x_continuous(breaks = 0:5, labels = 0:5, limits = c(0,5)) + ylab("Tool or Resource") + xlab("Average Knowledge or Comfort Score") + theme_bw() + theme(panel.background = element_blank(), panel.grid.minor.x = element_blank()) +
annotation_custom(textGrob("Don't know\nat all", gp=gpar(fontsize=8, fontface = "bold")),xmin=0,xmax=0,ymin=-1,ymax=-1) +
annotation_custom(textGrob("Extremely\ncomfortable", gp=gpar(fontsize=8, fontface= "bold")),xmin=5,xmax=5,ymin=-1,ymax=-1) +
coord_cartesian(clip = "off") +
theme(plot.margin = margin(1,1,1,1.1, "cm"))+
ggtitle("How would you rate your knowledge of or\ncomfort with these technologies\n(separate from the AnVIL)?") +
theme(legend.title = element_blank())
simplerPlot
ggsave(here("plots/toolsSeparateFromAnVIL_comfortscore.png"), plot = simplerPlot)
```
```{r}
#add in purple points of comparison for On the AnVIL
toPlot_simplified <- toPlotToolKnowledge %>%
filter(AnVILorNo == "Separate from the AnVIL")
onAnVIL <- toPlotToolKnowledge %>%
filter(AnVILorNo == "On the AnVIL") %>%
right_join(., toPlot_simplified,by = "Tool") %>%
bind_rows(.,
data.frame(Tool = "RStudio",
avgScore.x = toPlotToolKnowledge[which(toPlotToolKnowledge$Tool == "Bioconductor & RStudio"),"avgScore"],
UserType.x = "Current Users",
AnVILorNo.x = "On the AnVIL"),
data.frame(Tool = "Bioconductor",
avgScore.x = toPlotToolKnowledge[which(toPlotToolKnowledge$Tool == "Bioconductor & RStudio"),"avgScore"],
UserType.x = "Current Users",
AnVILorNo.x = "On the AnVIL")
) %>% drop_na(avgScore.x)
```
```{r}
simplerPlot + geom_point(data = onAnVIL, aes(x=avgScore.x,y=Tool,colour="#C77CFF")) +
scale_color_manual(
values = c("#F8766D", "#00BFC4", "#C77CFF"), labels = c("Potential Users", "Current Users", "Current User Ratings\nfor related AnVIL tools")) + theme(legend.title = element_blank())
ggsave(here("plots/tools_comfortscore.png"))
```
```{r}
#only the data resources
toPlotToolKnowledge %>%
filter(Tool == "DUOS" | Tool == "Access controlled access data" | Tool == "TDR" | Tool == "Terra Workspaces") %>%
ggplot(aes(y = reorder(Tool, avgScore), x=avgScore)) + geom_point(colour = "#F8766D") +
scale_x_continuous(breaks = 0:5, labels = 0:5, limits = c(0,5)) + ylab("Data Resource") + xlab("Average Knowledge or Comfort Score") + theme_bw() + theme(panel.background = element_blank(), panel.grid.minor.x = element_blank()) +
annotation_custom(textGrob("Don't know\nat all", gp=gpar(fontsize=8, fontface = "bold")),xmin=0,xmax=0,ymin=-0.35,ymax=-0.35) +
annotation_custom(textGrob("Extremely\ncomfortable", gp=gpar(fontsize=8, fontface= "bold")),xmin=5,xmax=5,ymin=-0.35,ymax=-0.35) +
coord_cartesian(clip = "off") +
theme(plot.margin = margin(1,1,1,1.1, "cm"))+
ggtitle("How would you rate your knowledge of or\ncomfort with these AnVIL data features?") +
theme(legend.title = element_blank())
ggsave(here("plots/dataresources_comfortscore.png"))
```
## Experience: Types of data analyzed (supplemental)
*No supplements at this time*
## Experience: Genomics and Clinical Research Experience (supplemental)
### Should we split current users vs potential users?
Here we use two different plots to show that the distribution of experience level among these three research types is similar when comparing the distribution of current users vs potential users. In this first plot, we have the experience level on the x-axis, the count on the y-axis, and color the bars by research type. We stack the user type responses using `facet_wrap` and `nrow=2` as an argument within that. We use a `position="dodge"` to cluster the similar research type bars next to each other. And we use geom_text to label the bars with the actual count. This requires `group = researchType` within the `geom_text()` `aes()` and `position = position_dodge(width = 0.9)` within the general `geom_text()` function.
We then also make some theme changes like rotating the x-axis tick labels and changing the y- and x- axis labels and using a minimal theme to turn off borders, and then turning off grids, etc.
```{r}
ggplot(experienceDf, aes(x=experienceLevel,y=n, fill=researchType)) +
facet_wrap(~UserType, nrow=2) +
geom_bar(stat="identity", position="dodge") +
theme_minimal() +
theme(panel.background = element_blank(), panel.grid = element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
geom_text(
aes(label = n, group = researchType),
size=2, position = position_dodge(width = .9), vjust=-0.5
) +
ylab("Count") + xlab ("Reported Experience Level") +
coord_cartesian(clip = "off")
ggsave(here("plots/researchExperienceLevel_colorResearchType.png"))
```
In this second plot, we have the experience level on the x-axis, the count on the y-axis, and color the bars by experience level. We stack the user type responses and separate out the research types into separate facets using `facet_grid`. And we use geom_text to label the bars with the actual count. This uses `group = experienceLevel` within the `geom_text()` `aes()`.
We then also make some theme changes like rotating the x-axis tick labels, changing the y- and x-axis labels, expanding the left plot margin, using a classic theme, and turning off grids and the legend.
```{r}
ggplot(experienceDf, aes(x=experienceLevel,y=n, fill=experienceLevel)) +
facet_grid(UserType~researchType) +
geom_bar(stat="identity") +
theme_classic() +
theme(panel.background = element_blank(), panel.grid = element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
geom_text(
aes(label = n, group = experienceLevel), vjust = -1, size=2
) +
ylab("Count") + xlab ("Reported Experience Level") +
coord_cartesian(clip = "off") +
theme(plot.margin = margin(1,1,1,1.05, "cm")) +
theme(legend.position = "none")
ggsave(here("plots/researchExperienceLevel_colorExperience.png"))
```
Both of these give us confidence that current and potential user counts for reported experience level in these research areas show similar distributions. So we'll go ahead and plot it without splitting out `UserType`.
### Alternate plot
<details><summary>Description of variable definitions and steps</summary>
This bar plot is the same as in the main analysis, but it doesn't use a fill for experience level. It has the experience level on the x-axis, the count on the y-axis. We facet the research category type and label the bars. We keep a summary stat and sum function and after_stat(y) for the label since the data has splits like UserType that we're not visualizing here.
We adjust various aspects of the theme like turning off the grid and background, rotating the x-tick labels, and changing the x- and y-axis labels. We also slightly widen the left plot margin so that the tick labels aren't cut off.
</details>
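To make the role of the summed label concrete, here is a small sketch that computes by hand the totals a `stat = 'summary', fun = sum` label layer would display. The `toy_experience` tibble and its counts are hypothetical stand-ins for `experienceDf`; the sketch assumes `dplyr` is available.

```r
library(dplyr)

# Hypothetical counts, split by UserType the way `experienceDf` is.
toy_experience <- tibble(
  UserType        = rep(c("Current Users", "Potential Users"), each = 2),
  researchType    = "Human Genomic\nResearch",
  experienceLevel = rep(c("Slightly experienced", "Extremely experienced"),
                        times = 2),
  n               = c(3, 5, 4, 1)
)

# Collapse the UserType split: one total per facet and experience level.
bar_labels <- toy_experience %>%
  group_by(researchType, experienceLevel) %>%
  summarize(label = sum(n), .groups = "drop")
bar_labels
```

Labeling with the raw `n` instead would print one number per user type on top of each other, which is why the summed value is used when `UserType` is not shown.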
```{r}
ggplot(experienceDf, aes(x=experienceLevel,y=n)) +
facet_grid(~researchType) +
geom_bar(stat="identity") +
theme_bw() +
theme(panel.background = element_blank(), panel.grid = element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
geom_text(
aes(label = after_stat(y), group = experienceLevel),
stat = 'summary', fun = sum, vjust = -0.5, size=2
) +
ylab("Count") + xlab ("Reported Experience Level") +
coord_cartesian(clip = "off") +
theme(plot.margin = margin(1,1,1,1.05, "cm")) +
theme(legend.position = "none")+
ggtitle("How much experience do you have analyzing the following data categories?")
ggsave(here("plots/researchExperienceLevel_noColor_noUserTypeSplit.png"))
```
### Follow-up: Overlap in experience levels for moderate or extreme experience categories for respondents
A potential follow-up question we had from examining the results of the "Experience: Genomics and Clinical Research Experience" section was "What's the overlap like for those moderately or extremely experienced in these various categories?" The results of that question follow.
```{r}
resultsTidy %>%
select(Timestamp, HumanGenomicExperience, HumanClinicalExperience, NonHumanGenomicExperience, UserType) %>%
pivot_longer(c(HumanGenomicExperience,
HumanClinicalExperience,
NonHumanGenomicExperience),
names_to = "researchType",
values_to = "experienceLevel") %>%
mutate(experienceLevel =
factor(experienceLevel,
levels = c("Not at all experienced",
"Slightly experienced",
"Somewhat experienced",
"Moderately experienced",
"Extremely experienced")),
researchType = case_when(researchType == "HumanClinicalExperience" ~ "Human Clinical\nResearch",
researchType == "HumanGenomicExperience" ~ "Human Genomic\nResearch",
researchType == "NonHumanGenomicExperience" ~ "Non-human\nGenomic Research"),
Timestamp = factor(Timestamp)) %>%
ggplot(aes(y = factor(experienceLevel,
levels = rev(c("Not at all experienced",
"Slightly experienced",
"Somewhat experienced",
"Moderately experienced",
"Extremely experienced"))),
x = Timestamp,
fill = experienceLevel)) +
geom_tile() +
scale_fill_manual(values = c("#035C94","#035385","#024A77","#024168", "#02395B")) +
theme_bw() +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "left") +
ylab("") +
ggtitle("How much experience do you have analyzing\nthe following data categories?") +
xlab("Respondent") +
facet_wrap(~researchType, nrow=3, strip.position="right")
```
```{r}
inputList <- list(ClinicalExperience = which(resultsTidy$clinicalFlag),
HumanGenomicsExperience = which(resultsTidy$humanGenomicFlag),
NonHumanGenomicsExperience = which(resultsTidy$nonHumanGenomicFlag))
ggVennDiagram(inputList,
category.names = c("Clinical\nExperience", "Human Genomics\nExperience", " Non-human Genomics Experience")) +
scale_x_continuous(expand = expansion(mult = .2))
```
## Experience: Controlled Access Datasets (supplemental)
*No supplements at this time*
## Awareness: Monthly AnVIL Demos (supplemental)
### Utilization
```{r}
demoPlotUtil <- resultsTidy %>%
group_by(UserType, AnVILDemoUse) %>%
summarize(count = n()) %>%
ggplot(aes(y=reorder(AnVILDemoUse, count),
x= count,
fill = UserType)) +
geom_bar(stat = "identity") +
ggtitle("Have you attended a monthly AnVIL Demo?")
stylize_bar(demoPlotUtil)
```
### Awareness or Utilization Color rather than y-axis split
```{r}
pd1 <- resultsTidy %>%
group_by(UserType, AnVILDemoUse) %>%
summarize(count = n()) %>%
ggplot(aes(y=UserType,
x= count,
fill = AnVILDemoUse)) +
geom_bar(stat = "identity") +
ggtitle("Have you attended a monthly AnVIL Demo?") +
theme_classic() +
xlab("") +
ylab(" ") +
scale_fill_manual(values = c("#25445A", "#7EBAC0")) +
scale_x_continuous(breaks= pretty_breaks()) +
theme(legend.title = element_blank())
```
```{r}
pd2 <- resultsTidy %>%
group_by(UserType, AnVILDemoAwareness) %>%
summarize(count = n()) %>%
ggplot(aes(y=UserType,
x=count,
fill = AnVILDemoAwareness)) +
geom_bar(stat = "identity") +
#ggtitle("Have you attended a monthly AnVIL Demo?") +
theme_classic() +
xlab("Count") +
ylab(" ") +
scale_fill_manual(values = c("#25445A", "#7EBAC0")) +
scale_x_continuous(breaks= pretty_breaks()) +
theme(legend.title = element_blank())
```
```{r}
pd1 / pd2
```
## Awareness: AnVIL Support Forum (supplemental)
```{r}
forumdf %<>% mutate(
forumUse = factor(
case_when(
forumInteractionDescription == "Posted in" ~ "Have utilized",
forumInteractionDescription == "Answered someone's post" ~ "Have utilized",
forumInteractionDescription == "Read through others' posts" ~ "Have utilized",
forumInteractionDescription == "No but aware of" ~ "Have not utilized",
forumInteractionDescription == "No didn't know of" ~ "Have not utilized"
), levels = c("Have not utilized", "Have utilized")))
```
### Utilization
```{r}
forumPlotUtil <- forumdf %>%
group_by(UserType, forumUse) %>%
summarize(count = n()) %>%
ggplot(aes(y=reorder(forumUse, count),
x= count,
fill = UserType)) +
geom_bar(stat = "identity") +
ggtitle("Have you ever read or posted in our AnVIL Support Forum?")
stylize_bar(forumPlotUtil)
```
### Awareness or Utilization Color rather than y-axis split
```{r}
pf1 <- forumdf %>%
group_by(UserType, forumUse) %>%
summarize(count = n()) %>%
ggplot(aes(y=UserType,
x= count,
fill = forumUse)) +
geom_bar(stat = "identity") +
ggtitle("Have you ever read or posted in our AnVIL Support Forum?") +
theme_classic() +
xlab("") +
ylab(" ") +
scale_fill_manual(values = c("#25445A", "#7EBAC0")) +
scale_x_continuous(breaks= pretty_breaks()) +
theme(legend.title = element_blank())
```
```{r}
pf2 <- forumdf %>%
group_by(UserType, forumAwareness) %>%
summarize(count = n()) %>%
ggplot(aes(y=UserType,
x=count,
fill = forumAwareness)) +
geom_bar(stat = "identity") +
#ggtitle("Have you ever read or posted in our AnVIL Support Forum?") +
theme_classic() +
xlab("Count") +
ylab(" ") +
scale_fill_manual(values = c("#25445A", "#7EBAC0")) +
scale_x_continuous(breaks= pretty_breaks()) +
theme(legend.title = element_blank())
```
```{r}
pf1 / pf2
```
## Preferences: Feature Importance Ranking (supplemental)
### Numerical response bias
<details><summary>Visualizing the numerical response bias since there were non-unique ranks assigned by some respondents</summary>
```{r}
resultsTidy %>%
select(starts_with("PotentialRank")) %>%
rowSums(na.rm = TRUE) %>%
table() %>% as.data.frame()
```
We would expect a row sum of 21 if a 6, 5, 4, 3, 2, and 1 were each selected once. We see row sums ranging from 6 (ranking everything 1) to 24. Only 8 out of 28 responses have a row sum of 21, and even that doesn't guarantee that all choices received a unique ranking for those 8 responses (e.g., three 2s, one 4, one 5, and one 6 also sum to 21). So this table is instead showing that 20 responses definitely did not use unique ranks for all 6 questions. Given that most of these observed sums are less than 21, people showed a bias towards ranking things as more important (closer to 1).
```{r}
resultsTidy %>%
select(starts_with("CurrentRank")) %>%
rowSums(na.rm = TRUE) %>%
table() %>% as.data.frame()
```
We again would expect a row sum of 21 if a 6, 5, 4, 3, 2, and 1 were each selected once. We see row sums ranging from 6 (ranking everything 1) to 26. Only 9 out of 22 responses have a row sum of 21, and even that doesn't guarantee that all choices received a unique ranking for those 9 responses (e.g., three 2s, one 4, one 5, and one 6 also sum to 21). So this table is instead showing that 13 responses definitely did not use unique ranks for all 6 questions. Given that most of these observed sums are less than 21, people showed a bias towards ranking things as more important (closer to 1).
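The row-sum reasoning can be checked on a few made-up responses. The `ranks` matrix here is hypothetical and uses base R only; it shows that a sum of 21 is necessary but not sufficient for six unique ranks.

```r
# Hypothetical rank responses: one row per respondent, one column per feature.
ranks <- rbind(
  unique_ranks = c(1, 2, 3, 4, 5, 6),
  tied_ranks   = c(2, 2, 2, 4, 5, 6),
  all_ones     = c(1, 1, 1, 1, 1, 1)
)

# Row sums: the first two rows both total 21, the last totals 6.
row_totals <- rowSums(ranks)

# A direct uniqueness check separates the first two rows.
all_unique <- apply(ranks, 1, function(r) !anyDuplicated(r))

data.frame(row_totals, all_unique)
```

A per-row check like `all_unique` would give an exact count of respondents who used each rank once, rather than the lower bound on non-unique responses that the sum table provides.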
We can visualize the numerical response bias, where people tended to rate things as more important, by creating a density plot of all rankings regardless of the feature queried or the user type.
```{r}
resultsTidy %>%
select(starts_with(c("CurrentRank", "PotentialRank"))) %>%
pivot_longer(cols = everything()) %>%
drop_na() %>%
ggplot(aes(x = value)) +
geom_density() +
theme_bw() + theme(panel.background = element_blank()) +
xlab("Rank") + scale_x_continuous(breaks = 1:6, labels = 1:6)
```
</details>
### Plot Density plot
#### Prepare data
<details><summary>Description of variable definitions and steps</summary>
Here, we just want all of the numerical ranks in one column and we can have additional columns that describe if that rank was from a current or potential user and which feature it corresponds to.
So to make a dataframe `densitydf`, we
* start by selecting the columns of interest from `resultsTidy` using `select(starts_with(c("PotentialRank", "CurrentRank")))`
* tell it to take this "wide" dataframe and pivot it to a longer one (`pivot_longer`) where the values all go to a `value` column, and the column name associated with each value goes into a `name` column.
* drop rows that have `NA` with `drop_na()` since, as described earlier, not every survey respondent was asked each question; e.g., if they were a current user they weren't asked as a potential user.
* then `separate` the `name` column on the word "Rank", which removes the `name` column we just made and makes two new columns (`Usertype` and `Feature`), where `Usertype` is either "Current" or "Potential" and `Feature` holds the abbreviated feature names listed in the code below, because...
* we then use a `case_when` within a `mutate()` to fill out those features so they're more informative and show the choices survey respondents were given.
* add another `case_when` within that `mutate` to add the word "Users" to the `Usertype` column values.
</details>
```{r}
densitydf <- resultsTidy %>%
  select(starts_with(c("PotentialRank", "CurrentRank"))) %>%
  pivot_longer(cols = everything()) %>%
  drop_na() %>%
  separate(name, c("Usertype", "Feature"), sep = "Rank", remove = TRUE) %>%
  mutate(
    Feature =
      case_when(Feature == "EasyBillingSetup" ~ "Easy billing setup",
                Feature == "FlatRateBilling" ~ "Flat-rate billing rather than use-based",
                Feature == "FreeVersion" ~ "Free version with limited compute or storage",
                Feature == "SupportDocs" ~ "On demand support and documentation",
                Feature == "ToolsData" ~ "Specific tools or datasets are available/supported",
                Feature == "CommunityAdoption" ~ "Greater adoption of the AnVIL by the scientific community"),
    Usertype =
      case_when(Usertype == "Current" ~ "Current Users",
                Usertype == "Potential" ~ "Potential Users")
  )
```
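
To see what the pivot, drop, and separate steps do, here is a small sketch on made-up data. The `toyWide` tibble is hypothetical (two respondents, two features) and only mimics the `PotentialRank...`/`CurrentRank...` column naming used above; it assumes the `tidyr` and `dplyr` packages are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: respondent 1 is a potential user, respondent 2 a current user
toyWide <- tibble(
  PotentialRankFreeVersion = c(1, NA),
  PotentialRankSupportDocs = c(2, NA),
  CurrentRankFreeVersion   = c(NA, 3),
  CurrentRankSupportDocs   = c(NA, 1)
)

toyLong <- toyWide %>%
  # One row per (respondent, column): column names go to `name`, ranks to `value`
  pivot_longer(cols = everything()) %>%
  # Remove the questions a respondent was never asked
  drop_na() %>%
  # Split e.g. "PotentialRankFreeVersion" into "Potential" and "FreeVersion"
  separate(name, c("Usertype", "Feature"), sep = "Rank", remove = TRUE)

toyLong
```

The result has one row per answered question, with columns `Usertype`, `Feature`, and `value`. Splitting on "Rank" works here because none of the feature abbreviations contain that word themselves.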
#### Density plot
<details><summary>Description of variable definitions and steps</summary>
We use the `densitydf` dataframe we just made. The x-axis is the raw rank from the `value` column, and the y-axis shows the density. The different density curves are grouped and color filled based on which feature they represent, and we use `facet_wrap` to split the plot into two rows so that there's one facet for each user type. We set the alpha value within `geom_density` since so many of the curves are on top of each other.
Some theme elements are changed, labels and titles are added, and then we display and save the plot.
It also adds annotations (using [Grobs, explained in this Stack Overflow post answer](https://stackoverflow.com/a/31081162)) to specify which rank was "Most important" and which was "Least important".
And it increases the bottom margin so those grob annotations aren't cut off.
</details>
```{r}
ggplot(densitydf, aes(x=value, group = Feature, fill = Feature)) +
facet_wrap(~Usertype, nrow = 2) +
geom_density(alpha=0.3) +
theme_bw() + theme(panel.background = element_blank()) +
xlab("Rank") + scale_x_continuous(breaks = 1:6, labels= 1:6, limits = c(1, 6)) +
ggtitle("Rank the following features according to\ntheir importance to you as a potential user\nor for your continued use of the AnVIL")+
annotation_custom(textGrob("Most\nimportant", gp=gpar(fontsize=8, fontface = "bold")),xmin=1,xmax=1,ymin=-0.85,ymax=-0.85) +
annotation_custom(textGrob("Least\nimportant", gp=gpar(fontsize=8, fontface= "bold")),xmin=6,xmax=6,ymin=-0.85,ymax=-0.85) +
coord_cartesian(clip = "off") +
theme(plot.margin = margin(1,1,1.25,1, "cm"))
ggsave(here("plots/densityplot_rankfeatures.png"))
```
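
The annotation approach used above can be looked at in isolation. This is a minimal sketch on made-up data (the `toyDensity` data frame and the `-0.1` offset are hypothetical choices for illustration); it assumes `ggplot2` is installed and loads `grid`, which provides `textGrob()` and `gpar()`.

```r
library(ggplot2)
library(grid)  # provides textGrob() and gpar()

# Hypothetical toy ranks, only for illustration
toyDensity <- data.frame(value = c(1, 1, 2, 2, 3, 4, 5, 6))

p <- ggplot(toyDensity, aes(x = value)) +
  geom_density() +
  scale_x_continuous(breaks = 1:6, limits = c(1, 6)) +
  # A zero-size box (xmin == xmax, ymin == ymax) centers the text on one point;
  # the negative y value places it below the x-axis, outside the panel
  annotation_custom(
    textGrob("Most\nimportant", gp = gpar(fontsize = 8, fontface = "bold")),
    xmin = 1, xmax = 1, ymin = -0.1, ymax = -0.1
  ) +
  # Without clip = "off", anything drawn outside the panel is clipped away
  coord_cartesian(clip = "off") +
  # margin(top, right, bottom, left, unit): extra room at the bottom for the label
  theme(plot.margin = margin(1, 1, 1.25, 1, "cm"))

p
```

The y offset has to be tuned by hand because it is expressed in data units (density), so a value that works for one plot may not suit another.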
#### Density plot with facets for feature
<details><summary>Description of variable definitions and steps</summary>
We use the `densitydf` dataframe we just made, but we shorten the `Feature` labels again so that they'll fit in the legend. For the plot, the x-axis is the raw rank from the `value` column, and the y-axis shows the density. The different density curves are grouped and color filled based on which feature they represent, and we use `facet_grid` to split the plot into two rows and 6 columns so that there's one row for each user type and one column per feature. We switch the row/y-axis strip labels over to the left (using `switch = "y"`) and remove the column/x-axis strip labels (using `theme(strip.background.x = element_blank(), strip.text.x = element_blank())`).
We use the `unit()` function to create some margins, and then set the plot margins and legend position within another `theme()`. I used [this Stack Overflow post to find this method.](https://stackoverflow.com/questions/29808620/ggplot2-move-legend-to-corner-but-keep-it-in-margin)
Some theme elements are changed, labels and titles are added, and then we display and save the plot.
</details>
```{r}
margins = unit(c(1, 10, 1, 1), 'lines')
densitydf %>%
mutate(Feature =
case_when(Feature == "Easy billing setup" ~ "Easy billing setup",
Feature == "Flat-rate billing rather than use-based" ~ "Flat-rate billing",
Feature == "Free version with limited compute or storage" ~ "Free version",
Feature == "On demand support and documentation" ~ "Support & documentation",
Feature == "Specific tools or datasets are available/supported" ~ "Specific tools or datasets",
Feature == "Greater adoption of the AnVIL by the scientific community" ~ "More community adoption")
) %>%
ggplot(aes(x=value, group = Feature, fill = Feature)) +
facet_grid(Usertype~Feature, switch = "y") +
geom_density() +
theme_bw() + theme(panel.background = element_blank()) + #theme(legend.position = "bottom") +
theme(strip.background.x = element_blank(), strip.text.x = element_blank()) +
theme(plot.margin=margins, legend.position=c(1.25, 0.5)) +
xlab("Rank") + scale_x_continuous(breaks = 1:6, labels= 1:6, limits = c(1, 6)) +
ggtitle("Rank the following features according to their importance to you as a\npotential user or for your continued use of the AnVIL")+
coord_cartesian(clip = "off")
ggsave(here("plots/densityplot_rankfeatures_faceted.png"))
```
### Plot Stacked Bar Chart showing number of times for each rank rather than average
#### Prepare data (count)
<details><summary>Description of variable definitions and steps</summary>
For this, we want a data frame that gives counts for all of the ranks given to each feature by each user type.
To do this we
* select the relevant columns from `resultsTidy`, specifically using `select(starts_with(c("PotentialRank", "CurrentRank")))`
* tell it to take this "wide" dataframe and pivot it to a longer one (`pivot_longer`) where the values all go to a `value` column, and the column name associated with each value goes into a `name` column.
* drop rows that have `NA` with `drop_na()` since, as described earlier, not every survey respondent was asked each question; e.g., if they were a current user they weren't asked as a potential user.
* group by the `name` (feature and user type combined) and `value` (the rank) and have it count the number of times that specific rank was given for each feature/user type combo
* rename the columns because it's getting confusing: `name` stays `name`, `value` changes to `rank`, and `n` is used for the count.
* then `separate` the `name` column on the word "Rank", which removes the `name` column and makes two new columns (`Usertype` and `Feature`), where `Usertype` is either "Current" or "Potential" and `Feature` holds the abbreviated feature names listed in the code below, because...
* we then use a `case_when` within a `mutate()` to fill out those features so they're more informative and show the choices survey respondents were given.
* add another `case_when` within that `mutate` to add the word "Users" to the `Usertype` column values.
* set the ranks to be a factor with specified levels. If we didn't do this, `rank` would be treated as a continuous variable; as a factor it is treated as categorical, which gives a better color scheme. The level order is chosen so that the most important rank is the first bar on the left when we plot.
</details>
```{r}
countdf <- resultsTidy %>%
select(starts_with(c("PotentialRank", "CurrentRank"))) %>%
pivot_longer(cols = everything()) %>%
drop_na() %>%
group_by(name, value) %>% count() %>%
`colnames<-`(c("name", "rank", "n")) %>%
separate(name, c("Usertype", "Feature"), sep = "Rank", remove = TRUE) %>%
mutate(Feature =
case_when(Feature == "EasyBillingSetup" ~ "Easy billing setup",
Feature == "FlatRateBilling" ~ "Flat-rate billing rather than use-based",
Feature == "FreeVersion" ~ "Free version with limited compute or storage",
Feature == "SupportDocs" ~ "On demand support and documentation",
Feature == "ToolsData" ~ "Specific tools or datasets are available/supported",
Feature == "CommunityAdoption" ~ "Greater adoption of the AnVIL by the scientific community"),
Usertype =
case_when(Usertype == "Current" ~ "Current Users",
Usertype == "Potential" ~ "Potential Users"),
rank = factor(rank, levels = c(6:1))
)
```
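
The effect of `factor(rank, levels = c(6:1))` can be checked on a small vector. This is just a sketch with made-up ranks (`toyRank` is a hypothetical name), using base R only.

```r
# Hypothetical ranks, only for illustration
toyRank <- c(1, 3, 6, 2)

# Reverse the level order so that 6 is the first level and 1 is the last
toyRankFactor <- factor(toyRank, levels = 6:1)

levels(toyRankFactor)      # level order used for stacking and the legend
as.integer(toyRankFactor)  # position of each value within that level order
```

The values themselves are unchanged; only the order of the levels differs from the default, and that order is what ggplot2 uses when stacking the bars and building the legend.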
#### Stacked bar chart
<details><summary>Description of variable definitions and steps</summary>
Using the `countdf` dataframe that we just made, we put the count (`n` column) on the x-axis, the `Feature` on the y-axis, and set the fill of the bars to be the `rank` (categorical 1, 2, 3, 4, 5, 6). We facet wrap on `Usertype` with two rows so that each facet represents a different user type.
We use the `position = "fill"` argument in `geom_bar()` so that it's a percent stacked bar instead of raw counts (since current and potential users had a different number of respondents).
We set the labels for the legend so that it specifies which rank is least important and which is most important, and we reverse the order in the legend so that 1 is on top.
Finally, we set labels and titles and change the theme a bit.
</details>
```{r}
ggplot(countdf, aes(fill=rank, y=Feature, x=n)) +
facet_wrap(~Usertype, nrow=2) +
geom_bar(position="fill", stat="identity") +
scale_fill_discrete(labels=c('6 (Least\n Important)', '5', '4', '3', '2', '1 (Most\n Important)')) +
guides(fill = guide_legend(reverse = TRUE)) +
xlab("Percent Responses") +
ggtitle("Rank the following features according to\ntheir importance to you as a potential user\nor for your continued use of the AnVIL") +
theme_bw() + theme(panel.background = element_blank(), panel.grid = element_blank())
ggsave(here("plots/stackedbarplot_rankfeatures.png"))
```
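
To make explicit what `position = "fill"` displays, the sketch below computes the same within-bar fractions by hand on made-up counts. The `toyCounts` data frame is hypothetical and the calculation uses base R only; it is meant as an illustration of the idea, not as a description of how ggplot2 computes it internally.

```r
# Hypothetical counts: two features, two ranks each
toyCounts <- data.frame(
  Feature = c("A", "A", "B", "B"),
  rank    = c(1, 2, 1, 2),
  n       = c(3, 1, 2, 2)
)

# Divide each count by the total for its feature, so every bar sums to 1
toyCounts$fraction <- toyCounts$n / ave(toyCounts$n, toyCounts$Feature, FUN = sum)

toyCounts
```

Because each bar is rescaled to sum to 1, the axis runs from 0 to 1, which makes bars comparable even when the groups have different numbers of respondents.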
## Preferences: Training Workshop Modality Ranking (supplemental)
*No supplements at this time*
## Preferences: Where analyses are currently run (supplemental)
*No supplements at this time*
## Preferences: DMS compliance/data repositories (supplemental)
*No supplements at this time*
## Preferences: Source for cloud computing funds (supplemental)
### Alternate plot
```{r}
toPlotFundingSource %>%
mutate(UserType = case_when(
UserType == "Current User" ~ "Current",
UserType == "Potential User" ~ "Potential"
),
whichFundingSource = factor(whichFundingSource, levels = rev(c("NHGRI", "Other NIH", "Institutional funds", "Foundation Grant", "NSF", "Only use free options", "Don't know")))
) %>%
ggplot(aes(y = UserType, x = count, fill = whichFundingSource)) +
geom_bar(position = "fill", stat = "identity") +
scale_fill_manual(values = rev(c("#035C94", "#012840", "#F2F2F2", "#E0DD10", "#AEEBF2", "#7EBAC0", "#333333"))) +
theme_bw() +
ggtitle("What source(s) of funds do you use to pay for cloud computing?") +
xlab("Fraction of responses") +
ylab("User Type") +
theme(panel.background = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank()) +
labs(fill="Funding Source")
ggsave(here("plots/fundingsources_colorSource.png"))
```
## Returning User: Length of Use of the AnVIL (supplemental)
*No supplements at this time*
## Returning User: Foreseeable Computational Needs (supplemental)
*No supplements at this time*
## Returning User: Recommendation likelihood (supplemental)
*No supplements at this time*
## Session Info
```{r}
sessionInfo()
```