---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  eval = FALSE,
  message = FALSE,
  warning = FALSE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
library(medrxivr)
```
# medrxivr <img src="man/figures/logo.png" align="right" width="20%" height="20%" />
<!-- badges: start -->
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![CRAN Downloads.](https://cranlogs.r-pkg.org/badges/grand-total/medrxivr)](https://CRAN.R-project.org/package=medrxivr)
[![R build status](https://github.com/ropensci/medrxivr/workflows/R-CMD-check/badge.svg)](https://github.com/ropensci/medrxivr/actions)
[![Status at rOpenSci software peer-review](https://badges.ropensci.org/380_status.svg)](https://github.com/ropensci/onboarding/issues/380)
<!-- badges: end -->
Preprints - preliminary versions of research articles that have yet to undergo peer review - are an increasingly important source of health-related bibliographic content. The two preprint repositories most relevant to health-related sciences are [medRxiv](https://www.medrxiv.org/) and [bioRxiv](https://www.biorxiv.org/), both of which are operated by the Cold Spring Harbor Laboratory.
The goal of the `medrxivr` R package is two-fold. First, it provides programmatic access to the [Cold Spring Harbor Laboratory (CSHL) API](https://api.biorxiv.org/), allowing users to easily download medRxiv and bioRxiv preprint metadata (e.g. title, abstract, publication date, author list) into R. The package also provides access to a maintained static snapshot of the medRxiv repository (see [Data sources](#medrxiv-data)). Second, `medrxivr` provides functions to search the downloaded preprint records using regular expressions and Boolean logic, as well as helper functions that allow users to export their search results to a .BIB file for easy import into a reference manager and to download the full-text PDFs of preprints matching their search criteria.
## Installation
To install the stable version of the package from CRAN:
``` {r}
install.packages("medrxivr")
library(medrxivr)
```
Alternatively, to install the development version from GitHub, use the following code:
``` {r}
install.packages("devtools")
devtools::install_github("ropensci/medrxivr")
library(medrxivr)
```
## Data sources
### medRxiv data
`medrxivr` provides two ways to access medRxiv data:
- `mx_api_content(server = "medrxiv")` creates a local copy of all data available from the medRxiv API at the time the function is run.
``` {r}
# Get a copy of the database from the live medRxiv API endpoint
preprint_data <- mx_api_content()
```
- `mx_snapshot()` provides access to a static snapshot of the medRxiv database. The snapshot is created each morning at 6am using `mx_api_content()` and is stored as a CSV file in the [medrxivr-data repository](https://github.com/mcguinlu/medrxivr-data). This method does not rely on the API (which can become unavailable during peak usage times) and is usually faster (as it reads data from a CSV rather than having to re-extract it from the API). Discrepancies between the most recent static snapshot and the live database can be assessed using `mx_crosscheck()`.
``` {r}
# Get a copy of the database from the daily snapshot
preprint_data <- mx_snapshot()
```
The relationship between the two methods for the medRxiv database is summarised in the figure below:
``` {r eval = TRUE, echo = FALSE, out.width = "500px", out.height = "400px"}
knitr::include_graphics("vignettes/data_sources.png")
```
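Because the API can be unavailable at times, one option is to try the live API first and fall back to the snapshot if the call fails. The following is a minimal sketch of that pattern, not part of `medrxivr`: `with_fallback()` is a hypothetical helper, and the two toy functions stand in for the real data sources so the example runs without network access.

```r
# Hypothetical helper (not part of medrxivr): run `primary()`, and if it
# signals an error, report the problem and run `fallback()` instead.
with_fallback <- function(primary, fallback) {
  tryCatch(
    primary(),
    error = function(e) {
      message("Primary source failed (", conditionMessage(e), "); using fallback.")
      fallback()
    }
  )
}

# Toy stand-ins so the sketch runs offline
failing_api  <- function() stop("API unavailable")
toy_snapshot <- function() data.frame(title = "Example preprint",
                                      stringsAsFactors = FALSE)

# The primary source errors, so the fallback result is returned
preprint_data <- with_fallback(failing_api, toy_snapshot)
```

With `medrxivr` loaded, the same helper could be called as `with_fallback(mx_api_content, mx_snapshot)`. Note that the snapshot may lag behind the live database.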
### bioRxiv data
Only one data source exists for the bioRxiv repository:
- `mx_api_content(server = "biorxiv")` creates a local copy of all data available from the bioRxiv API endpoint at the time the function is run. __Note__: due to its size, downloading a complete copy of the bioRxiv repository in this manner takes a long time (~ 1 hour).
``` {r}
# Get a copy of the database from the live bioRxiv API endpoint
preprint_data <- mx_api_content(server = "biorxiv")
```
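Given how long a full download takes, it may be worth saving the result to disk and reusing it in later sessions. The sketch below shows one way to do this with base R's `saveRDS()` and `readRDS()`. The `cached()` helper and the toy `slow_fetch()` function are hypothetical and not part of `medrxivr`.

```r
# Hypothetical helper (not part of medrxivr): return the object cached at
# `path` if the file exists; otherwise call `fetch()`, save the result to
# `path`, and return it.
cached <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch()
  saveRDS(result, path)
  result
}

# Toy fetch function standing in for a slow download
slow_fetch <- function() {
  data.frame(doi = c("10.1101/a", "10.1101/b"), stringsAsFactors = FALSE)
}

cache_file <- tempfile(fileext = ".rds")

# First call runs slow_fetch() and writes the cache file
first <- cached(cache_file, slow_fetch)

# Second call reads the cache file; its fetch function is never called
second <- cached(cache_file, function() stop("should not be called"))
```

With `medrxivr` loaded, this could be used as `cached("biorxiv.rds", function() mx_api_content(server = "biorxiv"))`, where the file name is an arbitrary choice. A cached copy will not include preprints posted after it was created, so delete the file to force a fresh download.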
## Performing your search
Once you have created a local copy of either the medRxiv or bioRxiv preprint database, you can pass this object (`preprint_data` in the examples above) to `mx_search()` to search the preprint records using an advanced search strategy.
``` {r, eval = TRUE, message = TRUE}
# Import the medrxiv database
preprint_data <- mx_snapshot()

# Perform a simple search
results <- mx_search(data = preprint_data,
                     query = "dementia")

# Perform an advanced search
topic1 <- c("dementia", "vascular", "alzheimer's") # Combined with Boolean OR
topic2 <- c("lipids", "statins", "cholesterol")    # Combined with Boolean OR
myquery <- list(topic1, topic2)                    # Combined with Boolean AND
results <- mx_search(data = preprint_data,
                     query = myquery)
```
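To make the OR/AND structure of the advanced query concrete, the sketch below applies the same logic to a toy data frame using only base R. It illustrates the Boolean combination, not how `mx_search()` is implemented. The `title` and `abstract` column names, and the choice to search both fields, are assumptions made for the example.

```r
# Toy records; the column names are assumptions for illustration
records <- data.frame(
  title = c("Statins and dementia risk",
            "Cholesterol in young adults",
            "Vascular outcomes after surgery"),
  abstract = c("We study statin use.",
               "Lipids were measured.",
               "No lipid data here."),
  stringsAsFactors = FALSE
)

topic1 <- c("dementia", "vascular", "alzheimer's")
topic2 <- c("lipids", "statins", "cholesterol")
myquery <- list(topic1, topic2)

# Text to search: title and abstract pasted together
text <- paste(records$title, records$abstract)

# Within a topic, terms are joined with "|" so that ANY term may match (OR)
topic_hits <- lapply(myquery, function(terms) {
  grepl(paste(terms, collapse = "|"), text, ignore.case = TRUE)
})

# Across topics, a record is kept only if EVERY topic matched (AND)
keep <- Reduce(`&`, topic_hits)
matches <- records[keep, ]
```

Only the first record is kept: it matches "dementia" from `topic1` and "statins" from `topic2`. The third record matches "vascular" but no term in `topic2`, since the pattern "lipids" does not match the singular "lipid".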
You can also explore which search terms are contributing most to your search by setting `report = TRUE`:
```{r, eval = TRUE, message = TRUE}
results <- mx_search(data = preprint_data,
                     query = myquery,
                     report = TRUE)
```
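The idea of a per-term breakdown can be sketched in base R by counting, for each term on its own, how many records it matches. This is an illustration on toy data only and does not reproduce the output format of `report = TRUE`. The `records_text` vector is made up for the example.

```r
# Toy record text, invented for illustration
records_text <- c(
  "Statins and dementia risk",
  "Cholesterol and vascular dementia",
  "Lipids in Alzheimer's disease"
)

terms <- c("dementia", "vascular", "alzheimer's",
           "lipids", "statins", "cholesterol")

# For each term, count the records it matches (case-insensitive)
hits_per_term <- vapply(
  terms,
  function(term) sum(grepl(term, records_text, ignore.case = TRUE)),
  integer(1)
)

# Terms sorted from most to fewest matching records
hits_sorted <- sort(hits_per_term, decreasing = TRUE)
```

Terms with zero hits are candidates for removal or respelling, while terms with very many hits may be too broad.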
## Further functionality
### Export records identified by your search to a .BIB file
Pass the results of your search above (the `results` object) to `mx_export()` to export references for preprints matching your search to a .BIB file so that they can be easily imported into a reference manager (e.g. Zotero, Mendeley).
```{r, eval = FALSE}
mx_export(data = results,
          file = "mx_search_results.bib")
```
### Download PDFs for records returned by your search
Pass the results of your search above (the `results` object) to the `mx_download()` function to download a copy of the PDF for each record found by your search.
```{r}
mx_download(results,        # Object returned by mx_search(), above
            "pdf/",         # Directory to save PDFs to
            create = TRUE)  # Create the directory if it doesn't exist
```
## Accessing the raw API data
By default, the `mx_api_*()` functions clean the data returned by the API for use with other `medrxivr` functions.
To access the raw data returned by the API, the `clean` argument should be set to `FALSE`:
``` {r}
mx_api_content(to_date = "2019-07-01", clean = FALSE)
```
See [this article](https://docs.ropensci.org/medrxivr/articles/medrxiv-api.html#accessing-the-raw-api-data) for more details.
## Detailed guidance
Detailed guidance, including advice on how to design complex search strategies, is available on the [`medrxivr` website.](https://docs.ropensci.org/medrxivr/)
## Linked repositories
See here for the [code used to take the daily snapshot](https://github.com/mcguinlu/medrxivr-data) and [the code that powers the `medrxivr` web app](https://github.com/mcguinlu/medrxivr-app).
## Other tools/packages for working with medRxiv/bioRxiv data
The focus of `medrxivr` is on providing tools to allow users to import and then search medRxiv and bioRxiv data. Below is a list of complementary packages that provide distinct but related functionality when working with medRxiv and bioRxiv data:
* [`rbiorxiv`](https://github.com/nicholasmfraser/rbiorxiv) by [Nicholas Fraser](https://github.com/nicholasmfraser) provides access to the same medRxiv and bioRxiv _content_ data as `medrxivr`, but also provides access to the _usage_ data (e.g. downloads per month) that the Cold Spring Harbor Laboratory API offers. This is useful if you wish to explore, for example, [how the number of PDF downloads from bioRxiv has grown over time.](https://github.com/nicholasmfraser/rbiorxiv#pdf-downloads-over-time)
## Code of conduct
Please note that this package is released with a [Contributor Code of Conduct](https://ropensci.org/code-of-conduct/). By contributing to this project, you agree to abide by its terms.
## Disclaimer
This package and the data it accesses/returns are provided "as is", with no guarantee of accuracy. Please be sure to check the accuracy of the data yourself (and do let me know if you find an issue so I can fix it for everyone!)