Update First Section #17

Open
wants to merge 11 commits into
base: main
from
26 changes: 13 additions & 13 deletions 01-intro.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -16,13 +16,13 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/10nOR2t1-F0E01fIt

## Target Audience

The course is intended for individuals in biomedical scientists and program managers who want to learn the best practices and techniques for data management and sharing.
The course is intended for individuals in the cancer research community who want to learn the best practices and techniques for data management and sharing.

_This course is written for individuals who:_

- Have or plan to have biomedical data they need to manage
- Have or plan to have or apply for NIH funding
- Either work directly with the data or help mentor those who work with data
- Have or plan to apply for NIH funding
- Work directly with the cancer research data or mentor those who work with data
- Have not had much training or background in data handling practices


Expand All @@ -41,6 +41,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/10nOR2t1-F0E01fIt

- Understand what data sharing is and why it is important
- Effectively manage your data including the associated skills relating to data heavy projects
- Maintain data privacy and comply with data privacy laws
- Maintain and write effective documentation
- Keep effective records that will help you track your project properly but securely
- Create good metadata that can enhance the use of your data
Expand All @@ -50,23 +51,22 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/10nOR2t1-F0E01fIt
- Store and submit your data to repositories


```{r, echo=FALSE, fig.alt='CAPTION HERE', out.width = '100%', fig.align = 'center'}
ottrpal::include_slide("https://docs.google.com/presentation/d/10nOR2t1-F0E01fItN_l8uYRWslH2PmebPvhQzCBeCPM/edit#slide=id.g1173f7473f7_0_0")
```

## Motivation

Biomedical projects are increasingly data based. Not only is proper data sharing and management a non negotiable necessity for data heavy projects but NIH and other grant funding institutions have mandates that require you to do so. But it can be difficult to be brought up to speed on these effective management especially as data requirements and best practices continue to change rapidly.
The cancer research discipline has evolved into an increasingly complex mix of datasets - research projects are typically cross-disciplinary and contain many types of data in various formats. They often involve multiple collaborators generating data across different sites, with different data standards and infrastructure used to generate the data. Therefore, it is more important than ever to be well-versed in the best practices of data management and sharing.

Not only is proper data management and sharing a necessity for cancer research projects to succeed in positively impacting cancer care, but it is now also increasingly a necessity to obtain funding as the NIH and other cancer research funders have implemented mandates that require you to proactively plan to manage and share your data.

As a member of the cancer research community, it is imperative that you maintain well-documented metadata and properly share your data. This will benefit you, your colleagues, and the larger community by broadening the reach of your data, enabling data reuse by others, and ultimately accelerating the pace of scientific discovery. This course aims to serve as a starting point to cover the basics of good data management and sharing practices.

Effective data management and sharing can mean the difference between more funding or not. Well documented and properly shared data is not only the right thing to do as a part of a scientific community but it will help shed a spotlight on your work in the community. It also is not only helpful for yourself but for you and your future colleagues should you need to return to your work or build upon it (which is what science usually aims to do).

## Curriculum

**Goal of this course:**
This course will be a great reference of the best practices for data management and sharing practices and a lot of the associated skills that are needed in data based biomedical projects.
**How to use this course:**
This course contains high-level concepts for data management and sharing and can be used as a reference of suggested best practices and associated skills needed for data management and sharing in biomedical research.

**What is not the goal**
Data projects come in all shapes and sizes. Not all the advice in this book might work for you right out the gate. We encourage you to continue to consult data experts especially when it comes to matters of data security and privacy. This course will prepare you with the basics and best practices for data but it will not make you an end all be all expert in every topic you'd need to do when managing and sharing data.
**Keep in mind:**
Scientific data and research projects come in many different forms, and some content in this course may not apply, especially as the research landscape evolves to adapt and support new technology, methods, and techniques. Therefore, the goal of this course is not to prescribe rigid rules for how to conduct research, but rather serve as a guide to approach data management and sharing in the spirit of the FAIR principles (Findable, Accessible, Interoperable, Reusable). We encourage you to continue to consult with data management experts to suit the needs of your particular project and/or research goals.

<div class = disclaimer>
`r config::get("disclaimer")`
Expand Down
10 changes: 5 additions & 5 deletions 02-data-sharing-is-important.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -2,17 +2,17 @@

# Data Sharing is Important

Sharing data is critical for optimizing the advancement of scientific understanding. Now that labs all over the world are producing massive amounts of data, there are many discoveries that can be made by just using this existing data.
Sharing data is critical for optimizing the advancement of scientific understanding. Now that labs all over the world are producing massive amounts of data, there are many discoveries that can be made by simply re-using this existing data.

This is so important, that starting in January, 2023 the NIH will require specific sharing practices for data management and sharing. See the announcement [here](https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html).
The concept of data re-use is so important that, in January 2023, the NIH began requiring specific practices for data management and sharing. See the announcement [here](https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html).

See this [course](https://hutchdatascience.org/NIH_Data_Sharing/) for more information about how to comply with this policy.

<div class = "warning">
Note that many institutes and funding agencies or mechanisms have requirements about how your data can be shared. Typically data sharing of protected data also requires Institutional Review Board (IRB) approval before the study is conducted. Ensure that you are following those requirements before you share your data.
Note that many institutes and funding agencies or mechanisms have requirements about how your data can be shared. Typically, data sharing of protected data also requires Institutional Review Board (IRB) approval before the study is conducted. Ensure that you are following those requirements before you share your data. A later section in this course will cover data privacy.
</div>

There's so many excellent reasons to put your data in a repository whether or not a journal requires it:
There are so many excellent reasons to put your data in a repository, whether or not a journal requires it:

**Sharing your data...**

Expand All @@ -22,7 +22,7 @@ There's so many excellent reasons to put your data in a repository whether or no
ottrpal::include_slide("https://docs.google.com/presentation/d/1SRokLaGAc2hiwJSN26FHE0ZEEhPr3KQdyMICic8kAcs/edit#slide=id.g117c57cc481_0_636")
```

2. Helps your relieve your own workload so your email inbox isn't loaded by requests you probably don't have time to respond to.
2. Helps you reduce your own workload so your email inbox isn't overloaded by requests you probably don't have time to respond to.

```{r, fig.align='center', echo = FALSE, fig.alt= "Ruby is reading a journal article with data and code she is interested in. The journal article says ‘Code and data are available upon request by email’. Ruby sends an email that says ‘ The email is going to an inbox with 999,999,565473 emails in it and it is labeled ‘the corresponding author’s inbox’.", out.width="100%"}
ottrpal::include_slide("https://docs.google.com/presentation/d/1SRokLaGAc2hiwJSN26FHE0ZEEhPr3KQdyMICic8kAcs/edit#slide=id.g117c57cc481_0_616")
Expand Down
23 changes: 12 additions & 11 deletions 04-defining-reproducibility.Rmd → 03-defining-reproducibility.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF

## What is reproducibility

There's been a lot of discussion about what is included in the term `reproducibility` and there is some discrepancy between fields. For the purposes of informatics and data analysis, a _reproducible analysis is one that can be re-run by a different researcher and the same result and conclusion is found_.
There's been a lot of discussion about what is included in the term `reproducibility` and there is some discrepancy between fields. For the broad field of cancer research, a _reproducible analysis is one that can be re-run by a different researcher and the same result and conclusion is found_.

```{r, fig.align='center', echo = FALSE, fig.alt= "Reproducibility is a different analyst re-performing the same analysis with the same code and data."}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf1accd298e_0_146")
Expand Down Expand Up @@ -68,15 +68,15 @@ Let's say Ruby's results are repeatable in her own hands and she excitedly tells
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf1accd298e_0_179")
```

Avi is also interested in Ruby's analysis methods and results. So Ruby sends Avi the code and data she used to obtain the results. Now, whether or not Avi is able to obtain the same exact results with this same data and same analysis code will indicate if Ruby's analysis is reproducible.
Avi is also interested in Ruby's analysis methods and results. So Ruby sends Avi the code, data, and methods she used to obtain the results. Now, whether or not Avi is able to obtain the same exact results with this same data and same analysis code will indicate if Ruby's analysis is reproducible.

```{r, fig.align='center', echo = FALSE, fig.alt= "Ruby the researcher has her computer with a plot and a significant and exciting research result. Ruby says 'Here, Avi, this code runs well on my computer, let me email it to you!' Avi the associate says 'so exciting'"}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf1accd298e_0_944")
```

Ruby may have spent a lot of time on her code and getting it to work on her computer, but whether it will successfully run on Avi's computer is another story. Often when researchers share their analysis code it leads to a substantial amount of effort on the part of the researcher who has received the code to get it working and this often cannot be done successfully without help from the original code author [@BeaulieuJones2017].
Ruby may have spent a lot of time on her code and getting it to work on her computer, but whether it will successfully run on Avi's computer is another story. Often when researchers share their analysis code it leads to a substantial amount of effort on the part of the researcher who has received the code to get it working and this often cannot be done successfully without help from the original code author [@BeaulieuJones2017]. This same concept applies to experimental research methods in a laboratory setting.

Avi is encountering errors because Ruby's code was written with Ruby's computer and local setup in mind and she didn't know how to make it more generally applicable. Avi is spending a lot of time just trying to re-run Ruby's same analysis on her same data; he has yet to be able to try the code on any additional data (which will likely bring up even more errors).
Avi is encountering errors because Ruby's code was written with Ruby's computer and local setup in mind and she didn't know how to make it more generally applicable. Avi is spending a lot of time just trying to re-run Ruby's same analysis on her same data; he has yet to be able to try the code on any additional data (which will likely bring up even more errors). Imagine trying to follow an experimental research method in the lab with vague or unclear instructions!

```{r, fig.align='center', echo = FALSE, fig.alt= "Avi the associate is confused and sweating. His computer has the word ‘error’ written all over it and its on fire trying to use Ruby’s code on Ruby’s data. This is using a substantial amount of time and effort on Avi’s part. "}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf1accd298e_0_993")
Expand All @@ -94,11 +94,12 @@ Perhaps at some point Avi is able to successfully run Ruby's code on Ruby's same
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf1accd298e_0_673")
```

Lack of errors also doesn't mean that either Ruby or Avi's runs of the code ran with high accuracy or that the results can be trusted. Even a small difference in decimal point may indicate a more fundamental difference in how the analysis was performed and this could be due to differences in software versions, settings, or any number of items in their computing environments.
Lack of errors also doesn't mean that either Ruby or Avi's runs of the code ran with high accuracy or that the results can be trusted. Even a small difference in decimal point may indicate a more fundamental difference in how the analysis was performed and this could be due to differences in software versions, settings, or any number of items in their computing environments. This challenge also exists when trying to repeat a research method that you or someone else has written, especially if there aren’t enough details to precisely describe the conditions in which the data was collected.
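
As a minimal sketch of how Ruby and Avi might compare their runs, the base R snippet below checks two values both exactly and within a numeric tolerance, and then prints the computing environment so it can be saved alongside the results. The variable names `ruby_result` and `avi_result` and their values are hypothetical stand-ins, not output from any real analysis.

```r
# Hypothetical stand-ins for the same summary statistic computed by two
# people; the values are chosen only to show a tiny floating-point difference.
ruby_result <- 0.1 + 0.2
avi_result <- 0.3

# Exact comparison: flags even the smallest numerical difference.
exactly_equal <- identical(ruby_result, avi_result)

# Tolerance-based comparison: treats very small numerical differences as equal.
nearly_equal <- isTRUE(all.equal(ruby_result, avi_result))

print(exactly_equal)
print(nearly_equal)

# Record the R version, operating system, and loaded package versions so a
# mismatch between two runs can be traced back to the environment.
print(sessionInfo())
```

A difference that shows up in the exact comparison but not in the tolerance-based one is a prompt to compare the two `sessionInfo()` outputs, not proof that either run is wrong.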


## Reproducibility is worth the effort!

Perhaps you've found yourself in a situation like Ruby and Avi; struggling to re-run code that you thought for sure was working a minute ago. In the upcoming chapters, we will discuss how to bolster your projects' reproducibility.
Perhaps you've found yourself in a situation like Ruby and Avi's, struggling to re-run code or a method that you were sure was working a minute ago. In the upcoming chapters, we will discuss how to bolster your projects' reproducibility.

As you apply these reproducible techniques to your own projects, you may feel like it is taking more time to reach endpoints, but keep in mind that reproducible analyses and projects have higher upfront costs but these will absolutely pay off in the long term.

Expand All @@ -112,26 +113,26 @@ Reproducibility in your analyses is not only a time saver for yourself, but also
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf1cd772e00_0_5")
```

You might not change a single character in your code but then return to it in a a few days/months/years and find that it no longer runs! Reproducible code stands the test of time longer, making 'future you' glad you spent the time to work on it. It's said that your closest collaborator is you from 6 months ago but you don't reply to email [@Broman].
You might not change a single character in your code or a step in your method but then return to it in a few days/months/years and find that it no longer runs! Reproducible code and research methods stand the test of time longer, making 'future you' glad you spent the time to work on them. It's said that your closest collaborator is you from 6 months ago but you don't reply to email [@Broman].

```{r, fig.align='center', echo = FALSE, fig.alt= "Ruby the researcher’s code works now as represented on her computer by a check mark. But Future Ruby, who has gray hair has an error running the same code!"}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf1cd772e00_0_330")
```

Many a data scientist has referred to their frustration with their past selves:
Many a researcher has referred to their frustration with their past selves:

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Dear past-Hadley: PLEASE COMMENT YOUR CODE BETTER. Love present-Hadley</p>&mdash; Hadley Wickham (\@hadleywickham) <a href="https://twitter.com/hadleywickham/status/718203628528349184?ref_src=twsrc%5Etfw">April 7, 2016</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

The more you comment your code, and make it clear and readable, your future self will thank you.
The more you comment your code or detail your method, and the clearer and more readable you make it, the more your future self will thank you.

Reproducible code also saves your colleagues time! The more reproducible your code is, the less time all of your collaborators will need to spend troubleshooting it. The more people who use your code and need to try to fix it, the more time is wasted. This can add up to a lot of wasted researcher time and effort.
Reproducible code and research protocols also save your colleagues time! The more reproducible your methods are, the less time all of your collaborators will need to spend troubleshooting them. The more people who use your methods and need to troubleshoot them, the more time is wasted. This can add up to a lot of wasted researcher time and effort.

```{r, fig.align='center', echo = FALSE, fig.alt= "If Ruby’s code is less reproducible, every researcher who attempts to use Ruby’s code will encounter the same errors and each person will have to fix it. This adds up to a lot of spent researcher time and effort."}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf1cd772e00_0_160")
```

But, reproducible code saves everyone exponential amounts of time and effort! It will also motivate individuals to use and cite your code and analyses in the future!
But reproducible code and methods save everyone exponential amounts of time and effort! They will also motivate individuals to use and cite your methods in the future!

```{r, fig.align='center', echo = FALSE, fig.alt= "If Ruby’s code is built in a sturdier manner, it will save others’ time who might also need to perform a similar analysis. Ruby’s code is made reproducibly in this example and only one of her seven colleagues that are using her code needed to troubleshoot an error."}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf1cd772e00_0_53")
Expand Down
File renamed without changes.
File renamed without changes.