Save pages returned by `oa_request()` and do not process further #181

Conversation
Thanks for the PR! I left a few minor (design-related) comments for your consideration (haven't tested out how the code runs yet).
R/oa_fetch.R
Outdated
@@ -163,6 +163,15 @@ oa_fetch <- function(entity = if (is.null(identifier)) NULL else id_type(shorten
#' Either "cursor" for cursor paging or "page" for basic paging.
#' When used with options$sample, please set `paging = "page"`
#' to avoid duplicates.
#' @param save_pages Character.
Two quick thoughts about `save_pages`:

- I think the name could be a little clearer: people might confuse this with a toggle taking TRUE/FALSE. Maybe something like `out_dir`, which makes it explicit that it expects a directory path as its value?
- The argument description should ideally be concise and document just its function. The motivation and other notes (about memory issues, etc.) could move to their own `@section` or `@note`.
- Agreed. `out_dir` sounds a bit like where to save the output, which it is not. Rather `output_pages_to`?
- OK, I will streamline it.
Yeah, `output_pages_to` sounds great!
Changed in last commit
I'm not sure it's optimal to use `saveRDS` within the package.

I'm currently working on a different branch improving paging in `oa_fetch` and `oa_request`. Let me finish some final touches and perhaps you can help me test my implementation, @rkrug? The idea is to use basic paging for fetching individual pages. Personally, I think this change would be a little less disruptive than `saveRDS`. I will give some examples in the PR. And whether we move forward with `output_pages_to` or not, I think my branch would still improve the paging implementation.

Would love to hear your thoughts @rkrug @yjunechoe. Stay tuned!
From my experience with snowballing starting from more than 3000 papers, the limitation is really memory, i.e. the memory management of R. Just my thoughts, and looking forward to your pull request, @trangdata
@@ -312,6 +322,7 @@ oa_request <- function(query_url,
per_page = 200,
paging = "cursor",
pages = NULL,
output_pages_to = NULL,
count_only = FALSE,
This is no more than a proof-of-concept implementation.

I'll wait to comment more until we resolve our discussion elsewhere on the design-related issues, but I'll just add that I'm also a little wary of using `saveRDS`. It's a weird middle ground where you apply some but not all processing to the data received from OA. I think we should either just dump out the JSON response exactly as we receive it, OR process it a bit further (e.g. at least make it tabular) so that we can use a more efficient storage format (even CSV is better than RDS).
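A minimal sketch (not from the PR) of the "dump the raw JSON" alternative proposed here: each response body is written to disk exactly as received, so any downstream tool can parse it later. `resp_body` is a hypothetical stand-in for the raw response string.

```r
# Write one raw JSON page to disk, untouched, and return its path.
# `resp_body` stands in for the raw API response; `i` is the page index.
dump_json_page <- function(resp_body, i, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(out_dir, sprintf("page_%05d.json", i))
  writeLines(resp_body, path)  # no processing: byte-for-byte what OA sent
  path
}
```

Because nothing is deserialized, memory use stays at one response string regardless of how many pages are fetched.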
Dumping the JSON response would make perfect sense, but probably as an RDS to save space when downloading large sets. Let's discuss this further when the other discussion has reached consensus.
@yjunechoe Not sure you saw my comment in my Improve paging control PR, but I'll reiterate it here: at the moment, I think it's out of scope for openalexR to handle any type of serialization/IO. We currently provide enough flexibility for the user to do this themselves. In general, I think making many requests to the OpenAlex database is not a common use case, and I would generally advise against it. One can download the snapshot themselves and proceed with the filtering/processing without needing openalexR. Therefore, my current suggestion would be to close this PR.
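To illustrate the "users can do this themselves" point, here is a hypothetical user-side helper (not part of openalexR): once pages are on disk as RDS files, read them back one at a time and keep only the fields you need, so peak memory stays at roughly one page. `extract` is a user-supplied function.

```r
# Read saved pages back one at a time and row-bind only the extracted
# fields; `paths` is a character vector of RDS files, `extract` maps one
# page object to a data frame.
combine_pages <- function(paths, extract) {
  do.call(rbind, lapply(paths, function(p) extract(readRDS(p))))
}
```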
Ah, I'd missed the discussion going on over there! Yes, I don't disagree with just closing this as out of scope (my comment was more of an "if we do implement this" comment).
This is inspired by #166.

It is not possible, due to memory limitations, to download a large number of works (thousands, millions). This pull request introduces an argument `save_pages` to the function `oa_request` which, if provided, saves all downloaded pages in the directory specified by that argument. No further processing of the downloaded pages is done, as this would again lead to memory problems; instead, a vector containing the file names of the pages is returned.

This is at the moment a very rudimentary implementation and could be tweaked further for performance, but I think this is not needed: quick profiling indicated that the API calls take about 80% of the time, while `saveRDS()` takes only about 10%.
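The mechanism described above can be sketched as follows. This is a simplified stand-alone illustration, not the PR's actual code: `fetch_page` is a placeholder for the real API call inside `oa_request`, and `output_pages_to` uses the argument name settled on earlier in the thread.

```r
# Save each fetched page as RDS and return only the file paths, so at most
# one page is held in memory at a time. `fetch_page(i)` stands in for the
# actual OpenAlex API call for page i.
save_pages_demo <- function(fetch_page, n_pages, output_pages_to) {
  dir.create(output_pages_to, showWarnings = FALSE, recursive = TRUE)
  vapply(seq_len(n_pages), function(i) {
    page <- fetch_page(i)                                   # one page of results
    path <- file.path(output_pages_to, sprintf("page_%05d.rds", i))
    saveRDS(page, path)                                     # write, then discard
    path                                                    # keep only the path
  }, character(1))
}
```

The returned vector of paths matches what the PR description says `oa_request` gives back when the argument is set.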