generated from jhudsl/OTTR_Template_Website
anvilPoll2024MainAnalysis.Rmd
---
title: "State of the AnVIL 2024"
subtitle: "Main analysis"
author: "Kate Isaac, Elizabeth Humphries, & Ava Hoffman"
date: "`r Sys.Date()`"
output: html_document
---
```{r message=FALSE, results='hide', warning=FALSE}
library(here)
library(grid) #for Grobs
library(scales)
knitr::knit_child(here("TidyData.Rmd")) #inherit resultsTidy
source(here("resources/scripts/shared_functions.R"))
```
```{r, setup, include=FALSE}
knitr::opts_chunk$set(
message = FALSE, echo = FALSE, warning = FALSE
)
```
# Main Analysis and Insights
## Identify User Type
**Takeaway:** Of the `r nrow(resultsTidy)` responses, `r nrow(resultsTidy %>% filter(UserType == "CurrentUser"))` were current users and `r nrow(resultsTidy %>% filter(UserType == "PotentialUser"))` were potential users. Most current users use the AnVIL for ongoing projects, while most potential users were split evenly between those who have never used the AnVIL (but have heard of it) and those who used the AnVIL in the past but don't currently.
**Potential Follow-ups:**
- Check whether potential users who previously used the AnVIL show overall trends similar to the rest of the potential users
- Directly ask why they no longer use the AnVIL (Elizabeth mentioned the possibility that the AnVIL is sometimes used in courses or workshops and students may not use it after that)
### Prepare and plot the data
<details><summary>Description of variable definitions and steps</summary>
First, we group the data by the assigned UserType labels/categories and their related more detailed descriptions. Then we use `summarize` to count the occurrences for each of those categories. We use a mutate statement to better fit the detailed descriptions on the plot. We then send this data to ggplot with the count on the x-axis, and the usage descriptions on the y-axis (ordered by count so highest count is on the top). We fill with the `UserType` description we've assigned. We manually scale the fill to be AnVIL colors and specify we want this to be a stacked bar chart. We then make edits for the theme and labels and finally add a geom_text label for the count next to the bars before we save the plot.
</details>
```{r}
typeOfUserPlot <- resultsTidy %>%
group_by(UserType, CurrentUsageDescription) %>%
summarize(count = n()) %>%
mutate(CurrentUsageDescription = case_when(
CurrentUsageDescription == "For ongoing projects (e.g., consistent project development and/or work)" ~ "For ongoing projects:\nconsistent project development\nand/or work",
CurrentUsageDescription == "For completed/long-term projects (e.g., occasional updates/maintenance as needed)" ~ "For completed/long-term projects:\noccasional updates/maintenance\nas needed",
CurrentUsageDescription == "For short-term projects (e.g., short, intense bursts separated by a few months)" ~ "For short-term projects:\nshort, intense bursts\nseparated by a few months",
CurrentUsageDescription == "I do not currently use the AnVIL, but have in the past" ~ "I do not currently use the AnVIL,\nbut have in the past",
CurrentUsageDescription == "I have never used the AnVIL, but have heard of it" ~ "I have never\nused the AnVIL",
CurrentUsageDescription == "I have never heard of the AnVIL" ~ "I have never\nheard of the AnVIL"
)) %>%
ggplot(aes(x = count, y = reorder(CurrentUsageDescription, count), fill = UserType)) +
geom_bar(stat="identity", position ="stack") +
ggtitle("How would you describe your current usage\nof the AnVIL platform?") +
geom_text(aes(label = count, group = CurrentUsageDescription),
hjust = -0.5, size=2)
typeOfUserPlot %<>% stylize_bar()
typeOfUserPlot
ggsave(here("plots/respondent_usagedescription.png"), plot = typeOfUserPlot) #set plot size
```
## Demographics: Highest Degree
**Takeaway:** Most of the respondents have a PhD or are currently working on a PhD, though a range of career stages are represented.
### Prepare and plot the data
<details><summary>Description of variable definitions and steps</summary>
First we use `group_by()` on `FurtherSimplifiedDegrees` and `UserType` in conjunction with `summarize(n = n())` to count how many of each combination are observed in the data.
Then we send this data to ggplot and make a bar chart with the y-axis representing the degrees, `reorder`ed by the summed count so that degrees with higher totals come first. We order by the sum because otherwise the 2 MDs are located after the high school and master's in progress bars (1 each). The x-axis represents the count, and the fill is used to specify user type (current or potential AnVIL users). We use a stacked bar chart and label each bar with the total sum for that degree type.
Used [this Stack Overflow post to label sums above the bars](https://stackoverflow.com/questions/30656846/draw-the-sum-value-above-the-stacked-bar-in-ggplot2)
and used [this Stack Overflow post to remove NA from the legend](https://stackoverflow.com/questions/45493163/ggplot-remove-na-factor-level-in-legend)
The rest of the changes are related to theme and labels and making sure that the numerical bar labels aren't cut off on the top.
</details>
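As a minimal sketch of why the ordering uses `sum`, the toy example below compares the default `reorder()` summary (the mean) with the summed version. The data frame `toy` and its counts are made up for illustration and are not survey results.

```r
# Toy counts (made-up numbers): one row per degree / user type combination
toy <- data.frame(
  degree   = c("PhD", "PhD", "MD", "MD", "High school", "Master's in progress"),
  UserType = c("Current User", "Potential User", "Current User", "Potential User",
               "Potential User", "Current User"),
  n        = c(20, 15, 1, 1, 1, 1)
)

# Default summary is the mean of n within each degree, so MD (mean 1)
# ties with the single-response degrees and its position depends on tie-breaking
by_mean <- levels(reorder(toy$degree, toy$n))

# Passing sum orders by the total bar length instead, so MD (total 2)
# is ranked above the single-response degrees
by_sum <- levels(reorder(toy$degree, toy$n, sum))

print(by_mean)
print(by_sum)
```

Levels are returned in increasing order of the summary, and on a discrete y-axis the last level is drawn at the top, so the degree with the largest total ends up as the top bar.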
```{r}
degreePlot <- resultsTidy %>%
group_by(FurtherSimplifiedDegrees, UserType) %>%
summarize(n = n()) %>%
ggplot(aes(y = reorder(FurtherSimplifiedDegrees, n, sum),
x = n,
fill = UserType
)) +
geom_bar(position = "stack", stat="identity") +
geom_text(
aes(label = after_stat(x), group = FurtherSimplifiedDegrees),
stat = 'summary', fun = sum, hjust = -1, size=2
) +
coord_cartesian(clip = "off") +
ggtitle("What is the highest degree you have attained?")
degreePlot %<>% stylize_bar()
degreePlot
ggsave(here("plots/degree_furthersimplified_usertype.png"), plot = degreePlot) #set plot size
```
## Demographics: Kind of Work
**Takeaway:** Only a few responses report project management, leadership or administration as their only kind of work. This increases our confidence that this won't confound later questions asking about usage of datasets or tools.
**Potential Follow-ups:**
- Use this information (together with other info?) to try to cluster respondents/users into personas; see `PersonaStats.Rmd`
### Prepare and plot the data
<details><summary>Description of variable definitions and steps</summary>
Note: Can I bring what we used within the persona's work over here to make this code cleaner?
</details>
```{r}
dfForPlotKOW <- resultsTidy %>%
separate(KindOfWork,
c("whichWorkA", "whichWorkB", "whichWorkC", "whichWorkD", "whichWorkE", "whichWorkF", "whichWorkG", "whichWorkH", "whichWorkI", "whichWorkJ"),
sep=", ", fill="right") %>%
pivot_longer(starts_with("whichWork"), values_to = "whichWorkDescription") %>%
select(Timestamp, UserType, whichWorkDescription) %>%
mutate(whichWorkDescription =
recode(whichWorkDescription,
"computational education" = "Computational education",
"Program administration," = "Program administration"),
whichWorkDescription = factor(whichWorkDescription),
Timestamp = factor(Timestamp)
) %>%
drop_na()
factorLevel <- as.data.frame(table(dfForPlotKOW$whichWorkDescription)) %>% arrange(-Freq) %>% select(Var1) %>% unlist() %>% unname() %>% rev()
kowPlot <- ggplot(dfForPlotKOW,
aes(x = Timestamp,
y = factor(whichWorkDescription, levels = factorLevel),
fill = whichWorkDescription
)) +
geom_tile() +
theme_bw() +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
legend.position = "None") +
ylab("") +
ggtitle("What kind of work do you do?") +
xlab("Respondent") +
facet_wrap(~UserType)
kowPlot
#save and set save size
```
## Demographics: Institutional Affiliation
**Takeaway:**
### Prepare and plot the data
<details><summary>Description of variable definitions and steps</summary>
First, we set the factor level for the further simplified institutional type column (`FurtherSimplifiedInstitutionalType`) so that we know the order on the y-axis when plotting. We then use `group_by()` together with `summarize()` to count the number of each further simplified institutional type for each `UserType`. We plot this as a bar plot with the institutional type on the y-axis and the count on the x-axis and fill the stacked bars according to `UserType`. We add text labels to the bars displaying the sum for each institutional type. We also use custom annotation grobs that break down which institutional types are part of each further simplified institutional type (as defined in `TidyData.Rmd`). Note the liberal use of spaces to try to align these sub-labels. Finally, we pass the plot to the shared function `stylize_bar()` to change axis labels, fill colors, etc.
</details>
```{r}
instTypePlot <- resultsTidy %>%
mutate(FurtherSimplifiedInstitutionalType = factor(FurtherSimplifiedInstitutionalType, levels = c("Industry & Other", "Education Focused", "Research Intensive"))) %>%
group_by(UserType, FurtherSimplifiedInstitutionalType) %>% summarize(InstitutionalCount = n()) %>%
ggplot(aes(
y = FurtherSimplifiedInstitutionalType,
x = InstitutionalCount,
fill = UserType
)) + geom_bar(position = "stack", stat = "identity") +
geom_text(
aes(label = after_stat(x), group = FurtherSimplifiedInstitutionalType),
stat = 'summary', fun = sum, hjust = -1, size=2
) +
ggtitle("Institutional Affiliation for All Survey Respondents") +
annotation_custom(textGrob("- R1 University \n- Med Campus \n- Research Center\n- NIH ", gp = gpar(fontsize = 8)), xmin = -8.5, xmax = -8.5, ymin = 2.65, ymax = 2.65) +
annotation_custom(textGrob("- Industry \n- International Loc\n- Unknown ", gp = gpar(fontsize = 8)), xmin = -8.5, xmax = -8.5, ymin = .7, ymax = .7) +
annotation_custom(textGrob("- R2 University \n- Community College", gp=gpar(fontsize=8)),xmin=-8.5,xmax=-8.5,ymin=1.75,ymax=1.75) +
coord_cartesian(clip = "off") +
ggtitle("What institution are you affiliated with?")
instTypePlot %<>% stylize_bar()
instTypePlot
ggsave(here("plots/institutionalType_simplified_allResponses_colorUserType.png"), plot = instTypePlot) #set plot size
```
## Demographics: Consortia Affiliations
```{r}
consortiaTable <- resultsTidy %>%
mutate(ConsortiaAffiliations = str_replace_all(ConsortiaAffiliations, c(";|&| and"), ",")) %>%
separate(ConsortiaAffiliations,
c("whichConsortiumA", "whichConsortiumB", "whichConsortiumC", "whichConsortiumD"),
sep=", ", fill = "right") %>%
pivot_longer(starts_with("whichConsortium"), values_to = "whichConsortiumName") %>%
group_by(whichConsortiumName) %>%
summarize(count = n()) %>%
drop_na() %>%
arrange(count)
```
**Takeaway:** Of `r nrow(resultsTidy)` responses, `r sum(!is.na(resultsTidy$ConsortiaAffiliations))` provide an affiliation, with `r nrow(consortiaTable)` unique affiliations represented across those responses (respondents could select more than one consortium). The following table shows the most represented consortia.
### Prepare and display the data
```{r, message = FALSE, echo = FALSE}
consortia_df <-
consortiaTable[which(consortiaTable$count >1),] %>%
rename(`consortium` = whichConsortiumName)
kableExtra::kable(consortia_df, table.attr = "style='width:20%;'")
```
## Experience: Tool & Resource Knowledge/Comfort level
**Takeaway:** Except for Galaxy, potential users tend to report lower comfort levels for the various tools and technologies when compared to current users. Where tools were present on and off AnVIL, current users report similar comfort levels.
Overall, there is less comfort with containers or workflows than using various programming languages and integrated development environments (IDEs).
### Prepare and plot the data
<details><summary>Description of variable definitions and steps for preparing the data </summary>
We bind the rows of two dataframes, one for current users and one for potential users. The steps for building the dataframes are essentially the same once the first `filter` and `select` steps are completed. The first step of building each data frame is to filter based on the `UserType` of interest. We then select the score columns that we created in `TidyData.Rmd`: all columns starting with "Score_" for current users, and only the columns starting with "Score_AllTech" for potential users, since they have no "Score_CurrentAnVILTech" scores. Because the scores are integers and we want to sum the scores across responses, we use a column sum function and send those sums to a data frame where the rowname is the previous column name and the summed scores are stored in the `totalScore` column. We add columns `nscores`, `avgScore`, and `UserType` that store the number of responses or scores, the average score (total divided by the number of responses), and the applicable type of user. Rownames are then moved to a column called `WhereTool`, and this column is separated into two columns on the word "Tech", such that the new `AnVILorNo` column will contain either "Score_All" or "Score_CurrentAnVIL". We translate those to be "Separate from the AnVIL" or "On the AnVIL" respectively. The new `Tool` column will contain the shorthand tool names, which we recode to add spaces or more info.
</details>
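The column-sum and split-on-"Tech" steps described above can be sketched on a toy data frame. The two score columns and their values here are hypothetical; they only follow the "Score_...Tech..." naming pattern that the description mentions.

```r
library(dplyr)
library(tidyr)

# Hypothetical score columns following the "Score_<where>Tech<tool>" naming pattern
toyScores <- data.frame(
  Score_AllTechPython = c(3, 4, 2),
  Score_CurrentAnVILTechWDL = c(1, 0, 2)
)

toolScores <- toyScores %>%
  colSums() %>%                              # named vector: one sum per column
  as.data.frame() %>%                        # column names become row names
  `colnames<-`("totalScore") %>%
  mutate(nscores = nrow(toyScores),
         avgScore = totalScore / nscores,
         WhereTool = rownames(.)) %>%        # move row names into a column
  separate(WhereTool, c("AnVILorNo", "Tool"), sep = "Tech") %>%
  mutate(AnVILorNo = case_when(
    AnVILorNo == "Score_CurrentAnVIL" ~ "On the AnVIL",
    AnVILorNo == "Score_All" ~ "Separate from the AnVIL"
  ))

print(toolScores)
```

Splitting on "Tech" only works cleanly as long as no tool shorthand itself contains the string "Tech", which is an assumption of this sketch.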
```{r}
toPlotToolKnowledge <- bind_rows(
resultsTidy %>%
filter(UserType == "Current User") %>%
select(starts_with("Score_")) %>%
colSums() %>%
as.data.frame() %>% `colnames<-`(c("totalScore")) %>%
mutate(nscores = sum(resultsTidy$UserType == "Current User"),
avgScore = totalScore / nscores,
UserType = "Current Users") %>%
mutate(WhereTool = rownames(.)) %>%
separate(WhereTool, c("AnVILorNo", "Tool"), sep = "Tech", remove = TRUE) %>%
mutate(AnVILorNo =
case_when(AnVILorNo == "Score_CurrentAnVIL" ~ "On the AnVIL",
AnVILorNo == "Score_All" ~ "Separate from the AnVIL"
),
Tool =
recode(Tool, "JupyterNotebooks" = "Jupyter Notebooks",
"WDL" = "Workflows",
"CommandLine" = "Unix / Command Line",
"AccessData" = "Access controlled access data",
"Terra" = "Terra Workspaces",
"BioconductorRStudio" = "Bioconductor & RStudio"
)
),
resultsTidy %>%
filter(UserType == "Potential User") %>%
select(starts_with("Score_AllTech")) %>%
colSums() %>%
as.data.frame() %>% `colnames<-`(c("totalScore")) %>%
mutate(nscores = sum(resultsTidy$UserType == "Potential User"),
avgScore = totalScore / nscores,
UserType = "Potential Users") %>%
mutate(WhereTool = rownames(.)) %>%
separate(WhereTool, c("AnVILorNo", "Tool"), sep = "Tech", remove = TRUE) %>%
mutate(AnVILorNo =
case_when(AnVILorNo == "Score_CurrentAnVIL" ~ "On the AnVIL",
AnVILorNo == "Score_All" ~ "Separate from the AnVIL"
),
Tool =
recode(Tool, "JupyterNotebooks" = "Jupyter Notebooks",
"WDL" = "Workflows",
"CommandLine" = "Unix / Command Line",
"AccessData" = "Access controlled access data",
"Terra" = "Terra Workspaces",
"BioconductorRStudio" = "Bioconductor & RStudio"
)
)
) %>%
mutate(UserType = factor(UserType, levels = c("Potential Users", "Current Users")))
```
```{r}
roi <- toPlotToolKnowledge[which(toPlotToolKnowledge$Tool == "Bioconductor & RStudio"),]
toPlotToolKnowledgeSeparateBR <- rows_append(toPlotToolKnowledge, data.frame(
UserType = rep(roi$UserType,2),
avgScore = rep(roi$avgScore,2),
AnVILorNo = rep(roi$AnVILorNo,2),
Tool = c("Bioconductor", "RStudio")
)) %>%
rows_delete(., data.frame(roi))
```
<details><summary>Description of variable definitions and steps for plotting the dumbbell like plot </summary>
Used [this Stack Overflow response](https://stackoverflow.com/a/72309061) to get the values for the `scale_shape_manual()`
</details>
```{r}
PlotToolKnowledge_avg_score <-
ggplot(toPlotToolKnowledgeSeparateBR, aes(y = reorder(Tool, avgScore), x = avgScore)) +
geom_point(aes(color = UserType, shape = AnVILorNo))
PlotToolKnowledge_avg_score %<>% PlotToolKnowledge_customization()
PlotToolKnowledge_avg_score
ggsave(here("plots/tooldataresourcecomfortscore_singlepanel.png"), width = 2200, height = 1350, units = "px")
```
## Experience: Types of Data Analyzed
<details><summary>Question and possible answers</summary>
>What types of data do you or would you analyze using the AnVIL?
Possible answers include
* Genomes/exomes
* Transcriptomes
* Metagenomes
* Proteomes
* Metabolomes
* Epigenomes
* Structural
* Single Cell
* Imaging
* Phenotypic
* Electronic Health Record
* Metadata
* Survey
* Other (with free text response)
</details>
**Takeaway:**
### Prepare and plot the data
<details><summary>Description of variable definitions and steps for preparing the data </summary>
</details>
```{r}
typeOfDataDf <- resultsTidy %>% prep_df_typeData()
typeDataClinicalSubset <- resultsTidy %>%
filter(clinicalFlag == TRUE) %>%
prep_df_typeData()
typeDataHumanGenomicSubset <- resultsTidy %>%
filter(humanGenomicFlag == TRUE) %>%
prep_df_typeData()
```
<details><summary>Description of variable definitions and steps for plotting the bar graphs</summary>
</details>
```{r}
everyone_type_data <- plot_type_data(typeOfDataDf)
everyone_type_data
ggsave(here("plots/typesOfData.png"), plot=everyone_type_data) #add plot size
```
```{r}
clinical_type_data <- plot_type_data(typeDataClinicalSubset, subtitle = "Respondents moderately or extremely experienced with clinical data")
clinical_type_data
ggsave(here("plots/typesOfData_clinical.png"), plot=clinical_type_data)
```
```{r}
humangenomic_type_data <- plot_type_data(typeDataHumanGenomicSubset, subtitle = "Respondents moderately or extremely experienced with human genomic data")
humangenomic_type_data
ggsave(here("plots/typesOfData_humangenomic.png"), plot=humangenomic_type_data)
```
## Experience: Genomics and Clinical Research Experience
**Takeaway:** 21 respondents report that they are extremely experienced in analyzing human genomic data, while only 6 respondents report that they are not at all experienced in analyzing human genomic data. However, for human clinical data and non-human genomic data, more respondents report being not at all experienced in analyzing those data than report being extremely experienced.
**Potential Follow-ups**
- What's the overlap like for those moderately or extremely experienced in these various categories? (Note: Found in the supplemental analyses)
<details><summary>Question and possible answers</summary>
>How much experience do you have analyzing the following data categories?
The data categories were
* Human genomic
* Non-human genomic
* Human clinical
and for each category, possible options were
* Not at all experienced
* Slightly experienced
* Somewhat experienced
* Moderately experienced
* Extremely experienced
</details>
### Prepare and plot the data
<details><summary>Description of variable definitions and steps for preparing the data</summary>
Here we select the columns containing answers for each data category: `HumanGenomicExperience`, `HumanClinicalExperience`, and `NonHumanGenomicExperience`. We also select `UserType` in case we want to split user type out at all in viewing the data. We use a `pivot_longer` to make a long dataframe that can be grouped and groups counted. The category/column names go to a new column, `researchType` and the values in those columns go to a new column `experienceLevel`. Before we use group by and count, we set the factor level on the new `experienceLevel` column to match the progression from not at all experienced to extremely experienced, and we rename the research categories so that the words have spaces, and we say research instead of experience. Then we use `group_by` and `summarize` to add counts for each combination of research category, experience level, and `UserType`. These counts are in the new `n` column.
</details>
```{r}
experienceDf <- resultsTidy %>% select(HumanGenomicExperience, HumanClinicalExperience, NonHumanGenomicExperience, UserType) %>%
pivot_longer(c(HumanGenomicExperience, HumanClinicalExperience, NonHumanGenomicExperience), names_to = "researchType", values_to = "experienceLevel") %>%
mutate(experienceLevel =
factor(experienceLevel, levels = c("Not at all experienced", "Slightly experienced", "Somewhat experienced", "Moderately experienced", "Extremely experienced")),
researchType = case_when(researchType == "HumanClinicalExperience" ~ "Human Clinical Research",
researchType == "HumanGenomicExperience" ~ "Human Genomic Research",
researchType == "NonHumanGenomicExperience" ~ "Non-human\nGenomic Research")) %>%
group_by(researchType, experienceLevel, UserType) %>% summarize(n = n())
```
<details><summary>Description of variable definitions and steps for plotting the bar graph</summary>
We didn't observe big differences between current and potential users, so we believe this grouped plot is useful for understanding the community as a whole.
This bar plot has the experience level on the x-axis, the count on the y-axis, and fills the bars according to the experience level (though the fill/color legend is turned off by setting legend.position to none). We facet the research category type and label the bars. We keep a summary stat and sum function and after_stat(y) for the label since the data has splits like `UserType` that we're not visualizing here.
We adjust various aspects of the theme like turning off the grid and background and rotating the x-tick labels and changing the x- and y-axis labels. We also slightly widen the left axis so that the tick labels aren't cut off.
</details>
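To illustrate why the labels need a sum, the sketch below uses made-up counts that are split by `UserType`, just as `experienceDf` is. It computes in base R the per-bar totals that a `stat = 'summary', fun = sum` label layer would be expected to display. The data frame `toyExperience` is hypothetical.

```r
# Made-up counts, split by a UserType column that the plot does not show
toyExperience <- data.frame(
  experienceLevel = c("Not at all experienced", "Not at all experienced",
                      "Extremely experienced", "Extremely experienced"),
  UserType = c("Current User", "Potential User", "Current User", "Potential User"),
  n = c(3, 2, 5, 4)
)

# Each bar stacks the rows for one experience level, so its label should be
# the sum of n across user types rather than any single row's n
barLabels <- tapply(toyExperience$n, toyExperience$experienceLevel, sum)
print(barLabels)
```

Labeling with the raw `n` instead would print one number per underlying row, so each bar would carry overlapping labels for its current-user and potential-user segments.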
```{r}
genomicsExpPlot <- ggplot(experienceDf, aes(x=experienceLevel,y=n, fill = experienceLevel)) +
facet_grid(~researchType) +
geom_bar(stat="identity") +
geom_text(
aes(label = after_stat(y), group = experienceLevel),
stat = 'summary', fun = sum, vjust = -0.5, size=2
) +
coord_cartesian(clip = "off") +
theme(plot.margin = margin(1,1,1,1.05, "cm")) +
ggtitle("How much experience do you have analyzing the following data categories?")
genomicsExpPlot %<>% stylize_bar(usertypeColor = FALSE, sequentialColor = TRUE, ylabel = "Count", xlabel = "Reported Experience Level", rotate=55, hjustv = 1)
genomicsExpPlot
ggsave(here("plots/researchExperienceLevel_sequentialColor_noUserTypeSplit.png")) #set plot size
```
## Experience: Controlled Access Datasets
**Takeaway:** Generally, over half of respondents report they are extremely interested in working with controlled access datasets.
For specific controlled access datasets ...
- Of the survey provided choices, respondents have accessed or are particularly interested in accessing [All of Us](https://www.researchallofus.org/), [UK Biobank](https://www.ukbiobank.ac.uk/enable-your-research/about-our-data), and [GTEx](https://anvilproject.org/data/consortia/GTEx) (though All of Us and UK Biobank are not currently AnVIL hosted).
- 2 respondents (moderately or extremely experienced with genomic data) specifically wrote in ["TCGA"](https://www.cancer.gov/ccg/research/genome-sequencing/tcga).
- The trend of All of Us, UK Biobank, and GTEx being chosen the most is consistent across all 3 research categories (moderately or extremely experienced with clinical, human genomic, or non-human genomic data).
<details><summary>Question and possible answers</summary>
>What large, controlled access datasets do you access or would you be interested in accessing using the AnVIL?
* All of Us*
* Centers for Common Disease Genomics (CCDG)
* The Centers for Mendelian Genomics (CMG)
* Clinical Sequencing Evidence-Generating Research (CSER)
* Electronic Medical Records and Genomics (eMERGE)
* Gabriella Miller Kids First (GMKF)
* Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR)
* The Genotype-Tissue Expression Project (GTEx)
* The Human Pangenome Reference Consortium (HPRC)
* Population Architecture Using Genomics and Epidemiology (PAGE)
* Undiagnosed Disease Network (UDN)
* UK Biobank*
* None
* Other (Free Text Response)
Since this is a select-all-that-apply question, we expect multiple responses that are comma separated. The free text responses will likely need to be recoded as well. The responses are in the `AccessWhichControlledData` column.
</details>
### Prepare and plot the data
<details><summary>Description of variable definitions and steps for preparing the data</summary>
</details>
```{r}
dataInterest <- resultsTidy %>%
group_by(InterestControlledData) %>%
summarize(count = n())
```
<details><summary>Description of variable definitions and steps for preparing bar plot</summary>
</details>
```{r}
dataInterestPlot <- dataInterest %>%
ggplot(aes(x = InterestControlledData,
y = count,
fill = as.factor(InterestControlledData))) +
geom_bar(stat="identity") +
ggtitle("How interested are you in working with controlled access datasets?") +
coord_cartesian(clip = "off") +
theme(plot.margin = margin(1,1,1,1.1, "cm")) +
annotation_custom(textGrob("Extremely\ninterested", gp=gpar(fontsize=8, fontface = "bold")),xmin=5,xmax=5,ymin=-3.5,ymax=-3.5) +
annotation_custom(textGrob("Not at all\ninterested", gp=gpar(fontsize=8, fontface= "bold")),xmin=1,xmax=1,ymin=-3.5,ymax=-3.5) +
scale_y_continuous(breaks= pretty_breaks()) +
geom_text(aes(label = count, group = InterestControlledData),
vjust = -1, size=2)
dataInterestPlot %<>% stylize_bar(usertypeColor = FALSE, sequentialColor = TRUE, xlabel = "Interest level", ylabel = "Count")
dataInterestPlot
```
<details><summary>Description of variable definitions and steps for preparing the data</summary>
Using a function `prep_df_whichData()` which is in the `shared_functions.R` script since we'll be using this workflow a few times for different subsets of the data, because we want to be able to differentially display the data based on the experience status (experienced with clinical research, human genomics research, etc.) of the person saying they'd like access to the data.
We want to color the bars based on whether or not the controlled access dataset is available on the AnVIL currently. We create a dataframe `onAnVILDF` to report this. Used the [AnVIL dataset catalog/browser](https://explore.anvilproject.org/datasets) to find out this information. However, HPRC and GREGoR don't show up in that resource, but are both available per these sources: [Announcement for HPRC](https://anvilproject.org/news/2021/03/11/hprc-on-anvil), [Access for HPRC](https://anvilproject.org/data/consortia/HPRC), [Access for GREGoR](https://anvilproject.org/data/consortia/GREGoR). Both GMKF and TCGA are data hosted on other NCPI platforms that are accessible via AnVIL because of interoperability. (See: https://www.ncpi-acc.org/ and https://ncpi-data.org/platforms). We list these as non-AnVIL hosted since while accessible, they are not AnVIL hosted and inaccessible without NCPI. Finally, UDN is described as non-AnVIL hosted as it is in the Data submission pipeline and not yet available.
We'll join this anvil-hosted or not data with the actual data at the end.
Given the input `subset_df`, we expect several answers to be comma separated. Since there are 12 set possible responses (not including "None") and one possible free response answer, we separate the `AccessWhichControlledData` column into 13 columns ("WhichA" through "WhichN"), separating on a comma followed by a space (", "); separating on the comma alone produced duplicates that differed only by a leading space. Alternative approaches should [consider using `str_trim`](https://stringr.tidyverse.org/reference/str_trim.html). We set fill to "right", but this shouldn't really matter. It just suppresses the unnecessary warning that NAs are being added when there aren't 13 responses. If there's only one response, it'll put that response in `WhichA` and fill the rest of them with `NA`. If there are two responses, it'll put those two responses in `WhichA` and `WhichB` and fill the rest of them with `NA`, etc.
We then use `pivot_longer` to grab these columns we just made and put the column names in a new column `WhichChoice` and the values in each column in a new column `whichControlledAccess`. We drop all the NAs in this new `whichControlledAccess` column (and there are a lot of them).
Then we group by the new `whichControlledAccess` column and summarize a count for how many there are for each response.
Then we pass this to a mutate and recode function to simplify the fixed responses to be just their acronyms, to remove asterisks (that let the survey respondent know that that dataset wasn't available because of policy restrictions), and to recode the free text responses (details below in "Notes on free text response recoding").
We use a `left_join()` to join the cleaned data with a dataframe that specifies whether that dataset is currently available on the AnVIL or not. It's a left join rather than a full join so it's only adding the annotation for datasets that are available in the results.
Finally, we return this subset and cleaned dataframe so that it can be plotted.
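
The steps above can be sketched on toy data. This is a minimal illustration, not the actual `prep_df_whichData()` from `shared_functions.R`: the responses, the three `Which*` columns, and the `AnVIL_Availability` labels are made-up assumptions, and only a single free text recode is shown.

```r
library(dplyr)
library(tidyr)

# Toy survey responses (hypothetical values, not the real poll data)
toy_df <- tibble(
  AccessWhichControlledData = c("GTEx, UDN", "GTEx", "UDN, TCGA, GTEx",
                                "None", "GnomAD and ClinVar")
)

# Toy stand-in for `onAnVILDF`; the availability labels are assumptions
toy_onAnVIL <- tibble(
  whichControlledAccess = c("GTEx", "UDN", "TCGA", "HPRC"),
  AnVIL_Availability = c("AnVIL Hosted", "non-AnVIL Hosted",
                         "non-AnVIL Hosted", "AnVIL Hosted")
)

toy_counts <- toy_df %>%
  # one column per possible answer; short responses are padded with NA
  separate(AccessWhichControlledData,
           c("WhichA", "WhichB", "WhichC"),
           sep = ", ", fill = "right") %>%
  # stack the answer columns into a single column
  pivot_longer(starts_with("Which"),
               names_to = "WhichChoice",
               values_to = "whichControlledAccess") %>%
  drop_na(whichControlledAccess) %>%
  # count each distinct response
  group_by(whichControlledAccess) %>%
  summarize(count = n()) %>%
  # recode after counting, so "None" now appears in two rows
  mutate(whichControlledAccess = recode(whichControlledAccess,
                                        "GnomAD and ClinVar" = "None")) %>%
  # annotate only the datasets that appear in the results
  left_join(toy_onAnVIL, by = "whichControlledAccess")

toy_counts
```

Because the recode happens after counting, the same label can appear in more than one row, and HPRC never shows up because the left join only annotates datasets that were actually selected.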
</details>
<details><summary> Additional notes on free text response recoding</summary>
There were 4 "Other" free text responses:
* "Being able to pull other dbGap data as needed."
--> We recoded this to be an "Other"
* "GnomAD and ClinVar"
--> GnomAD and ClinVar are not controlled access datasets so we recoded that response to be "None"
* "Cancer omics datasets"
--> We recoded this to be an "Other"
* "TCGA"
--> This response was left as is since there is a controlled access tier.
</details>
```{r}
onAnVILDF <- read_delim(here("data/controlledAccessData_codebook.txt"), delim = "\t", col_select = c(whichControlledAccess, AnVIL_Availability))
```
<details><summary>Description of variable definitions and steps for preparing the data continued</summary>
Here we set up 4 data frames for plotting
* The first uses all of the responses and sends them through the `prep_df_whichData()` function to clean the data for plotting to see which controlled access datasets are the most popular.
* The second filters to grab just the responses from those experienced in clinical research using the `clinicalFlag` column (described earlier in the Clean Data -> Simplified experience status for various research categories (clinical, human genomics, non-human genomics) subsection)
* The third filters to grab just the responses from those experienced in human genomic research using the `humanGenomicFlag` column (described earlier in the Clean Data -> Simplified experience status for various research categories (clinical, human genomics, non-human genomics) subsection)
* The fourth filters to grab just the responses from those experienced in non-human genomic research using the `nonHumanGenomicFlag` column (described earlier in the Clean Data -> Simplified experience status for various research categories (clinical, human genomics, non-human genomics) subsection)
</details>
```{r}
whichDataDf <- resultsTidy %>% prep_df_whichData(onAnVILDF = onAnVILDF)
whichDataClinicalSubset <- resultsTidy %>%
filter(clinicalFlag == TRUE) %>%
prep_df_whichData(onAnVILDF = onAnVILDF)
whichDataHumanGenomicSubset <- resultsTidy %>%
filter(humanGenomicFlag == TRUE) %>%
prep_df_whichData(onAnVILDF = onAnVILDF)
whichDataNonHumanGenomicSubset <- resultsTidy %>%
filter(nonHumanGenomicFlag == TRUE) %>%
prep_df_whichData(onAnVILDF = onAnVILDF)
```
<details><summary>Description of variable definitions and steps for plotting the bar graphs</summary>
We also have a function from `shared_functions.R` for this, because the plotting steps are the same for each plot; only the subtitle and the input dataframe change.
This takes the input dataframe and plots a bar plot with the x-axis having the controlled access datasets listed (reordering the listing based off of the count so most popular is on the left), the count number/popularity of requested is on the y-axis, and the fill is based on whether the dataset is available on AnVIL or not.
We change the theme elements like removing panel borders, panel background, and panel grid, and rotate the x-axis tick labels. We add an x- and y- axis label and add a title (and subtitle if specified - which it will be when we're looking at just a subset like those who are experienced with clinical data)
We also add text labels above the bars to say how many times each dataset was marked/requested. Note that we again have to use the `after_stat`, summary, and `sum` approach: because we count with `group_by` and `summarize` before recoding, several rows can end up with the same recoded label, and the text label is only accurate if it sums over all of them. The function uses `coord_cartesian(clip = "off")` so these bar text labels aren't cut off, and finally returns the plot.
We call this function 4 times
* once for all the data (and don't use a subtitle)
* next for just those experienced with clinical data (using a subtitle to specify this)
* next for just those experienced with human genomic data (using a subtitle to specify this)
* and finally for just those experienced with non-human genomic data (using a subtitle to specify this)
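
As a rough sketch of the plotting steps described above (this is not the actual `plot_which_data()` from `shared_functions.R`; the toy data, axis labels, and title are assumptions), the summed text labels can be produced like this:

```r
library(ggplot2)

# Toy counts in which recoding left two rows labelled "None" (made-up values)
toy_counts <- data.frame(
  whichControlledAccess = c("GTEx", "UDN", "None", "None"),
  count = c(3, 2, 1, 1),
  AnVIL_Availability = c("AnVIL Hosted", "non-AnVIL Hosted", NA, NA)
)

toy_plot <- ggplot(toy_counts,
                   aes(x = reorder(whichControlledAccess, -count, FUN = sum),
                       y = count)) +
  # stacked bars, filled by availability
  geom_bar(aes(fill = AnVIL_Availability), stat = "identity") +
  # one label per dataset: the sum of count over all rows sharing that label
  geom_text(aes(label = after_stat(y), group = whichControlledAccess),
            stat = "summary", fun = sum, vjust = -1) +
  # keep labels above the tallest bar from being clipped
  coord_cartesian(clip = "off") +
  theme(panel.border = element_blank(),
        panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Controlled access dataset", y = "Count",
       title = "Toy example: requested controlled access datasets")

toy_plot
```

Here the "None" bar is labelled with 2 rather than 1, because the label is computed by summing over both rows.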
</details>
```{r}
everyoneDataPlot <- plot_which_data(whichDataDf)
everyoneDataPlot
ggsave(here("plots/whichcontrolleddata.png"), plot = everyoneDataPlot) #add plot size
```
```{r}
clinicalDataPlot <- plot_which_data(whichDataClinicalSubset, subtitle = "Respondents moderately or extremely experienced with clinical data")
clinicalDataPlot
ggsave(here("plots/whichcontrolleddata_clinical.png"), plot = clinicalDataPlot) #add plot size
```
```{r}
humanGenomicDataPlot <- plot_which_data(whichDataHumanGenomicSubset, subtitle = "Respondents moderately or extremely experienced with human genomic data")
humanGenomicDataPlot
ggsave(here("plots/whichcontrolleddata_humangenomic.png"), plot = humanGenomicDataPlot) #add plot size
```
```{r}
nonHumanGenomicDataPlot <- plot_which_data(whichDataNonHumanGenomicSubset, subtitle = "Respondents moderately or extremely experienced with non-human genomic data")
nonHumanGenomicDataPlot
ggsave(here("plots/whichcontrolleddata_nonhumangenomic.png"), plot = nonHumanGenomicDataPlot) #add plot size
```
## Awareness: Monthly AnVIL Demos
**Takeaway:** Most respondents have not attended an AnVIL Demo. To investigate whether this is an awareness issue, we aggregated all responses except `No, didn't know of`. We see that the majority of respondents are aware of AnVIL Demos. These responses are just distributed among different ways of utilizing the demos. Further, there's awareness among both current and potential AnVIL users.
### Prepare and plot the data
#### Raw responses
```{r}
demoPlotRaw <- resultsTidy %>%
group_by(UserType, AnVILDemo) %>%
summarize(count = n()) %>%
ggplot(aes(y=reorder(AnVILDemo, count),
x = count,
fill = UserType)) +
geom_bar(stat = "identity") +
ggtitle("Have you attended a monthly AnVIL Demo?")
demoPlotRaw %<>% stylize_bar()
demoPlotRaw
```
#### Responses recoded to focus on awareness
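
The chunk that follows relies on an `AnVILDemoAwareness` column prepared earlier in the analysis. A minimal sketch of that kind of recoding, assuming everything except `No, didn't know of` counts as aware (the other response strings below are hypothetical, and the level labels are borrowed from the support forum section):

```r
library(dplyr)

# Toy responses; only "No, didn't know of" is taken from the text above
toy_demo <- tibble(
  AnVILDemo = c("Yes, multiple", "No, but aware of",
                "No, didn't know of", "No, didn't know of")
)

toy_demo <- toy_demo %>%
  mutate(AnVILDemoAwareness = factor(
    if_else(AnVILDemo == "No, didn't know of", "Not Aware of", "Aware of"),
    levels = c("Not Aware of", "Aware of")
  ))

toy_demo
```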
```{r}
demoPlot <- resultsTidy %>%
group_by(UserType, AnVILDemoAwareness) %>%
summarize(count = n()) %>%
ggplot(aes(y = AnVILDemoAwareness,
x = count,
fill = UserType)) +
geom_bar(stat = "identity") +
ggtitle("Have you attended a monthly AnVIL Demo?")
demoPlot %<>% stylize_bar(ylabel = "Awareness")
demoPlot
```
## Awareness: AnVIL Support Forum
**Takeaway:** Most respondents have not used the AnVIL support forum.
- We aggregated these responses to examine awareness. We observe that there is awareness of the support forum across potential and current users.
- While utilization in some form is reported by about 20% of respondents, reading through others' posts is the most common way of utilizing the support forum within this sample.
### Prepare and plot the data
```{r}
forumdf <- resultsTidy %>%
mutate(AnVILSupportForum = str_replace(AnVILSupportForum,
pattern = "No, ",
replacement= "No ")) %>%
separate(AnVILSupportForum,
c("forumInteractionA", "forumInteractionB", "forumInteractionC"),
sep = ", ",
fill = "right") %>%
pivot_longer(starts_with("forumInteraction"), values_to = "forumInteractionDescription") %>%
group_by(UserType, CurrentUsageDescription, forumInteractionDescription) %>%
summarize(count = n()) %>%
drop_na() %>%
mutate(forumInteractionDescription =
factor(forumInteractionDescription,
levels = c("Posted in", "Answered someone's post", "Read through others' posts", "No but aware of", "No didn't know of")),
forumAwareness = factor(
case_when(
forumInteractionDescription == "Posted in" ~ "Aware of",
forumInteractionDescription == "Answered someone's post" ~ "Aware of",
forumInteractionDescription == "Read through others' posts" ~ "Aware of",
forumInteractionDescription == "No but aware of" ~ "Aware of",
forumInteractionDescription == "No didn't know of" ~ "Not Aware of"
), levels = c("Not Aware of", "Aware of"))
)
```
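
The `str_replace` at the start of the chunk above matters because the responses are later split on ", ". A small sketch with toy responses (assumed wording, not taken from the survey export) shows what would happen without it:

```r
library(stringr)

# Toy responses: one multi-select answer and one answer that contains "No, "
toy_responses <- c("Read through others' posts, Posted in", "No, but aware of")

# Splitting directly breaks the second answer into two pieces
naive_split <- strsplit(toy_responses, ", ", fixed = TRUE)

# Removing the comma after "No" first keeps that answer intact
protected <- str_replace(toy_responses, pattern = "No, ", replacement = "No ")
safe_split <- strsplit(protected, ", ", fixed = TRUE)

naive_split[[2]]
safe_split[[2]]
```

This is also why the factor levels in the chunk use "No but aware of" and "No didn't know of" without the comma.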
#### Raw responses
```{r}
forumPlotRaw <- ggplot(forumdf,
aes(y = reorder(forumInteractionDescription, count),
x = count,
fill = UserType)) +
geom_bar(stat = "identity") +
ggtitle("Have you ever read or posted in our AnVIL Support Forum?")
forumPlotRaw %<>% stylize_bar()
forumPlotRaw
```
#### Responses recoded to focus on awareness
```{r}
forumPlot <- ggplot(forumdf, aes(y = forumAwareness, x = count, fill = UserType)) +
geom_bar(stat = "identity") +
ggtitle("Have you ever read or posted in our AnVIL Support Forum?")
forumPlot %<>% stylize_bar(ylabel = "Awareness")
forumPlot
```
## Preferences: Feature Importance Ranking
**Takeaway:** All respondents rate having specific tools or datasets supported/available as a very important feature for using AnVIL. Compared to current users, potential users rate having a free-version with limited compute or storage as the most important feature for their potential use of the AnVIL.
<details><summary>Question and possible answers</summary>
>Rank the following features or resources according to their importance for your continued use of the AnVIL
>Rank the following features or resources according to their importance to you as a potential user of the AnVIL?
* Easy billing setup
* Flat-rate billing rather than use-based
* Free version with limited compute or storage
* On demand support and documentation
* Specific tools or datasets are available/supported
* Greater adoption of the AnVIL by the scientific community
We're going to look at a comparison of the assigned ranks for these features, comparing between current users and potential users.
</details>
### Prepare and plot the data
Average rank is the total rank (sum of given ranks) divided by the number of votes (number of given ranks).
<details><summary>Description of variable definitions and steps for preparing the data </summary>
We make two different dataframes that find the total ranks (column name: `totalRank`) and average ranks (column name: `avgRank`) for each feature, and then row bind (`bind_rows`) these two dataframes together to make `totalRanksdf`. We make them separately because one is for potential users (`starts_with("PotentialRank")`) and one is for current users (`starts_with("CurrentRank")`). They have a different number of votes (`nranks`), so it made more sense to work with them separately, following the same steps, and then row bind them together.
The individual steps for each of these dataframes are to
* `select` the relevant columns from `resultsTidy`
* perform sums with `colSums`, adding together the ranks in those columns (each column corresponds to a queried feature); We set `na.rm = TRUE` to ignore the NAs (since not every survey respondent was asked each question; e.g., if they were a current user they weren't asked as a potential user)
* send those sums to a data frame such that the selected column names from the first step are now the row names and the total summed rank is the only column with values in each row corresponding to each queried feature
* Use a `mutate` to
* add a new column `nranks` that holds the number of survey responses that are from potential users (i.e., the number that would have assigned ranks to the PotentialRank questions) or the number that are from current/returning users (i.e., the number that would have assigned ranks to the CurrentRank questions).
* add a new column `avgRank` that divides the `totalRank` by the `nranks`
After these two dataframes are bound together (`bind_rows`), the rest of the steps are for aesthetics in plotting and making sure ggplot knows the `UserType` and the feature of interest, etc.
* We move the rownames to their own column `UsertypeFeature` (with the `mutate(UsertypeFeature = rownames(.))`).
* We separate the values in that column on the word "Rank". This removes the `UsertypeFeature` column we just made and creates two new columns (`Usertype` and `Feature`), where `Usertype` is either "Current" or "Potential" and `Feature` holds the short feature names listed in the code below.
* We then use a `case_when` within a `mutate()` to expand those short feature names so they're more informative and show the choices survey respondents were given.
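
To make the intermediate shapes concrete, here is a small worked example on toy data. It is a sketch under assumptions: the ranks are made up, `avg_ranks()` is a hypothetical helper, and it builds the name column from `names()` rather than from row names as the analysis code does.

```r
library(dplyr)
library(tidyr)

# Toy responses: each respondent only ranks the questions for their user type
toy <- data.frame(
  UserType = c("Current User", "Current User", "Potential User"),
  CurrentRankFreeVersion = c(1, 3, NA),
  CurrentRankToolsData = c(2, 1, NA),
  PotentialRankFreeVersion = c(NA, NA, 1),
  PotentialRankToolsData = c(NA, NA, 2)
)

# Hypothetical helper: total and average rank for one user type
avg_ranks <- function(df, prefix, user_type) {
  sums <- df %>%
    select(starts_with(prefix)) %>%
    colSums(na.rm = TRUE)  # ignore questions a respondent was not asked
  data.frame(UsertypeFeature = names(sums), totalRank = unname(sums)) %>%
    mutate(nranks = sum(df$UserType == user_type),
           avgRank = totalRank / nranks)
}

toyRanks <- bind_rows(
  avg_ranks(toy, "PotentialRank", "Potential User"),
  avg_ranks(toy, "CurrentRank", "Current User")
) %>%
  # "CurrentRankFreeVersion" becomes Usertype "Current", Feature "FreeVersion"
  separate(UsertypeFeature, c("Usertype", "Feature"), sep = "Rank")

toyRanks
```

For the two current users, the free version ranks 1 and 3 sum to 4, giving an average rank of 2.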
</details>
```{r}
totalRanksdf <-
bind_rows(
resultsTidy %>%
select(starts_with("PotentialRank")) %>%
colSums(na.rm = TRUE) %>%
as.data.frame() %>% `colnames<-`(c("totalRank")) %>%
mutate(nranks = sum(resultsTidy$UserType == "Potential User"),
avgRank = totalRank / nranks),
resultsTidy %>%
select(starts_with("CurrentRank")) %>%
colSums(na.rm = TRUE) %>%
as.data.frame() %>% `colnames<-`(c("totalRank")) %>%
mutate(nranks = sum(resultsTidy$UserType == "Current User"),
avgRank = totalRank /nranks)
) %>%
mutate(UsertypeFeature = rownames(.)) %>%
separate(UsertypeFeature, c("Usertype", "Feature"), sep = "Rank", remove = TRUE) %>%
mutate(Feature =
case_when(Feature == "EasyBillingSetup" ~ "Easy billing setup",
Feature == "FlatRateBilling" ~ "Flat-rate billing rather than use-based",
Feature == "FreeVersion" ~ "Free version with limited compute or storage",
Feature == "SupportDocs" ~ "On demand support and documentation",
Feature == "ToolsData" ~ "Specific tools or datasets are available/supported",
Feature == "CommunityAdoption" ~ "Greater adoption of the AnVIL by the scientific community"),
Usertype = factor(case_when(Usertype == "Potential" ~ "Potential Users",
Usertype == "Current" ~ "Current Users"), levels = c("Potential Users", "Current Users"))
)
```
<details><summary>Description of variable definitions and steps for plotting the dumbbell plot</summary>
We use the `totalRanksdf` we just made. The x-axis is the `avgRank` values, and the y-axis displays the informative `Feature` values. We `reorder` the y-axis so that more important (lower number) `avgRank` features are displayed higher in the plot.
`geom_point` and `geom_line` are used in conjunction to produce the dumbbell look of the plot, and we set the color of the points to correspond to the `Usertype`.
Some theme things are changed, labels and titles added, setting the color to match AnVIL colors, and then we display and save that plot.
The first version of the plot has trimmed limits, so the second version sets limits on the x-axis of 1 to 6 since those were the options survey respondents were given for ranking. It also adds annotations (using [Grobs, explained in this Stack Overflow post answer](https://stackoverflow.com/a/31081162)) to specify which rank was "Most important" and which was "Least important".
Then we've also adjusted the left margin so that the annotation isn't cut off.
We then display and save that version as well.
Finally, we'll reverse the x-axis so that most important is on the right and least important is on the left. We use `scale_x_reverse()` for that. We have to change our group annotations so that they are now on the negative number version of `xmin` and `xmax` that we were using previously. We then display and save that version as well.
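
A stripped-down sketch of the dumbbell idea with a reversed x-axis follows. It does not reproduce `stylize_dumbbell()`, whose internals are not shown here; the toy average ranks and labels are assumptions, and the grob annotations are left out.

```r
library(ggplot2)

# Toy average ranks (made-up numbers)
toy_ranks <- data.frame(
  Feature = rep(c("Free version", "Easy billing setup", "Tools or datasets"),
                times = 2),
  Usertype = rep(c("Potential Users", "Current Users"), each = 3),
  avgRank = c(2.1, 3.8, 2.5, 3.4, 3.6, 2.2)
)

toy_dumbbell <- ggplot(toy_ranks,
                       aes(x = avgRank,
                           # lower (more important) average rank is drawn higher up
                           y = reorder(Feature, -avgRank))) +
  # the bar of the dumbbell: one line per feature joining the two user types
  geom_line(aes(group = Feature)) +
  # the ends of the dumbbell, colored by user type
  geom_point(aes(color = Usertype), size = 3) +
  # ranks run from 1 to 6; reversing puts rank 1 (most important) on the right
  scale_x_reverse(limits = c(6, 1)) +
  labs(x = "Average rank", y = "Feature", color = "User type")

toy_dumbbell
```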
</details>
```{r}
gdumbbell <- ggplot(totalRanksdf,
aes(x = avgRank,
y = reorder(Feature, -avgRank))) +
geom_line() +
geom_point(aes(color = Usertype), size = 3) +
ggtitle("Rank the following features\naccording to their importance to\nyou as a potential user or for\nyour continued use of the AnVIL")
gdumbbell %<>% stylize_dumbbell(xmax=6, importance = TRUE)
gdumbbell
ggsave(here("plots/dumbbellplot_xlim16_revaxis_rankfeatures.png"), plot = gdumbbell) #set plot size
```
## Preferences: Training Workshop Modality Ranking
**Takeaway:** Both current and potential users vastly prefer virtual training workshops.
<details><summary>Question and possible answers</summary>
>Please rank how/where you would prefer to attend AnVIL training workshops.
Possible answers include
* On-site at my institution: `AnVILTrainingWorkshopsOnSite`
* Virtual: `AnVILTrainingWorkshopsVirtual`
* Conference (e.g., CSHL, AMIA): `AnVILTrainingWorkshopsConference`
* AnVIL-specific event: `AnVILTrainingWorkshopsSpecEvent`
* Other: `AnVILTrainingWorkshopsOther`
The responses are stored in the columns whose names start with `AnVILTrainingWorkshops`.
</details>
### Prepare and plot the data
<details><summary>Description of variable definitions and steps for preparing the data</summary>
</details>
```{r}
toPlotTrainingRanks <- bind_rows(
resultsTidy %>%
filter(UserType == "Current User") %>%
select(starts_with("AnVILTrainingWorkshops")) %>%
colSums(na.rm = TRUE) %>%
as.data.frame() %>% `colnames<-`(c("totalRank")) %>%
mutate(nranks = sum(resultsTidy$UserType == "Current User"),
avgRank = totalRank / nranks,
UserType = "Current Users") %>%
mutate(TrainingType = rownames(.)) %>%
mutate(TrainingType = str_replace(TrainingType, "AnVILTrainingWorkshops", "")),
resultsTidy %>%
filter(UserType == "Potential User") %>%
select(starts_with("AnVILTrainingWorkshops")) %>%
colSums(na.rm = TRUE) %>%
as.data.frame() %>% `colnames<-`(c("totalRank")) %>%
mutate(nranks = sum(resultsTidy$UserType == "Potential User"),
avgRank = totalRank / nranks,
UserType = "Potential Users") %>%
mutate(TrainingType = rownames(.)) %>%
mutate(TrainingType = str_replace(TrainingType, "AnVILTrainingWorkshops", ""))
) %>% mutate(TrainingType = recode(TrainingType, "SpecEvent" = "AnVIL-specific event", "OnSite" = "On-site at my institution", "Conference" = "Conference (e.g., CSHL, AMIA)")) %>%
mutate(UserType = factor(UserType, levels = c("Potential Users", "Current Users")))
```
<details><summary>Description of variable definitions and steps for plotting the dumbbell plot</summary>
</details>
```{r}
tdumbbell <- ggplot(toPlotTrainingRanks, aes(x = avgRank, y = reorder(TrainingType, -avgRank))) +
geom_line() +
geom_point(aes(color = UserType), size = 3) +
ggtitle("Please rank how/where you would prefer to attend\nAnVIL training workshops.")
tdumbbell %<>% stylize_dumbbell(preference = TRUE, xlabel = "Average Rank", ylabel = "Training Workshop Modality", xmax=5)
tdumbbell
ggsave(here("plots/dumbbellplot_xlim15_revaxis_trainingmodalitypref.png"), plot = tdumbbell) #set plot size
```
## Preferences: Where analyses are currently run
**Takeaway:** Institutional HPC and local/personal computers are the most common responses.
- Google Cloud Platform (GCP) is reported as used more than other cloud providers within this sample.
- We also see that potential users report using Galaxy (a free option) more than current users do.
### Prepare and plot the data
```{r}
whereRunPlot <- resultsTidy %>%
separate(WhereAnalysesRun,
c("whereRunA", "whereRunB", "whereRunC", "whereRunD", "whereRunE", "whereRunF", "whereRunG"),
sep = ", ", fill = "right") %>%
pivot_longer(starts_with("whereRun"), values_to = "wherePlatforms") %>%
mutate(wherePlatforms =
recode(wherePlatforms,
"Amazon Web Services (AWS)" = "AWS",
"Galaxy (usegalaxy.org)" = "Galaxy",
"Galaxy Australia" = "Galaxy",
"Google Cloud Platform (GCP)" = "GCP",
"Institutional High Performance Computing cluster (HPC)" = "Institutional HPC",
"Personal computer (locally)," = "Personal computer (locally)",
"local server" = "Institutional HPC")
) %>%
group_by(UserType, wherePlatforms) %>%
summarize(count = n()) %>%
drop_na() %>%
ggplot(aes(x = count,
y = reorder(wherePlatforms, count),
fill = UserType)) +
geom_bar(stat="identity") +
ggtitle("Where do you currently run analyses?")
whereRunPlot %<>% stylize_bar(ylabel = "Platform")
whereRunPlot
```
## Preferences: DMS compliance/data repositories
NEED TO FILL OUT