Is it possible to sample only a subset of works when using oa_snowball? #279

Closed
adrientaudiere opened this issue Sep 24, 2024 · 6 comments
Labels
question Further information is requested vignette

Comments

@adrientaudiere

Hi all,
Thanks for your very useful package.

I wonder if it is possible to add an equivalent of options = list(sample = 10, seed = 1) to each snowball query? For the moment, if I add options = list(sample = 10, seed = 1), it only applies to the first oa_fetch query. And the options citing_params and cited_by_params add a filter to the API request, so the API parameter sample is not usable.

Best,

@trangdata
Collaborator

Hi @adrientaudiere, could you elaborate on the motivation for performing such an analysis? Why would you want a subset of the cited/citing works?

@adrientaudiere
Author

Hi @trangdata,
I run some "big" analyses, and sometimes I only want to perform the analysis on a small random subset of works to test the entire pipeline. I know that I can filter the dataset after snowballing, but when I saw the sample option in oa_fetch, I thought that it would be useful to be able to add sample to oa_snowball. It's only a small feature request, I understand that it is not very important (-;.
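
For reference, the "filter after snowball" fallback mentioned above could look roughly like the sketch below. It is only an illustration: sample_snowball() is a hypothetical helper, not part of openalexR, and the mock object assumes the snowball output is a list with a nodes data frame (columns id and oa_input) and an edges data frame (columns from and to). Unlike the API's sample parameter, this still downloads the full snowball before subsetting.

```r
# Mock snowball-like object (made-up data; column names are assumed to
# mirror the oa_snowball() output: nodes$id, nodes$oa_input, edges$from/to)
snowball <- list(
  nodes = data.frame(
    id = paste0("W", 1:8),
    oa_input = c(TRUE, rep(FALSE, 7))
  ),
  edges = data.frame(
    from = c("W2", "W3", "W4", "W5", "W1", "W1", "W1"),
    to   = c("W1", "W1", "W1", "W1", "W6", "W7", "W8")
  )
)

# Hypothetical helper: keep all seed works plus a random sample of the
# remaining nodes, then drop edges that point outside the kept set
sample_snowball <- function(snowball, size, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  nodes <- snowball$nodes
  seeds <- nodes$id[nodes$oa_input]
  others <- nodes$id[!nodes$oa_input]
  n <- min(size, length(others))
  keep <- c(seeds, others[sample.int(length(others), n)])
  edges <- snowball$edges
  list(
    nodes = nodes[nodes$id %in% keep, , drop = FALSE],
    edges = edges[edges$from %in% keep & edges$to %in% keep, , drop = FALSE]
  )
}

small <- sample_snowball(snowball, size = 3, seed = 1)

# One seed work plus three sampled neighbours
nrow(small$nodes)
#> [1] 4
```

Passing the same seed gives the same subset on every run, which is what a pipeline test usually needs.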

@yjunechoe
Collaborator

yjunechoe commented Sep 26, 2024

The idea is not completely out of scope, but I'd also like to hear more before considering implementing this (though it seems like sampling, specifically, is not the point?)

If the point is to test a pipeline with a small oa_snowball() output, I'd probably prefer using the to/from_publication_date filter in the citing/cited_by params - this is how we also write our own tests for oa_snowball()! That will also give you a stable subset to work with (unless you're specifically looking to test against output variability?).

library(openalexR)

x <- oa_fetch(doi = "10.1038/s41586-022-05258-z")

# >200 connections
x$cited_by_count + length(x$referenced_works[[1]])
#> [1] 221

# Search the range of 1 year forward, 10 years back
pub_date <- as.Date(x$publication_date)
snowball <- oa_snowball(
  identifier = x$id,
  citing_params = list(to_publication_date = pub_date + 365),
  cited_by_params = list(from_publication_date = pub_date - 3650)
)

# Returns a subset
nrow(snowball$nodes)
#> [1] 112

hist(as.Date(snowball$nodes$publication_date), "years", freq = TRUE)
abline(v = pub_date, col = "red", lwd = 3)

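[Image: histogram of node publication dates by year, with a red vertical line at the publication date of the seed work]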

@adrientaudiere
Author

Oh yes, it's a perfect trick. Maybe it deserves to be present in the example of the function. Thank you.

@yjunechoe
Collaborator

> Oh yes, it's a perfect trick. Maybe it deserves to be present in the example of the function. Thank you.

We had actually planned for a full snowballing vignette but haven't got around to it. We'll keep this example in mind!

@yjunechoe yjunechoe added the question Further information is requested label Sep 26, 2024
@yjunechoe
Collaborator

Update: in the dev version, it is no longer necessary to convert publication_date with as.Date() before doing date arithmetic on it.

library(openalexR)

x <- oa_fetch(doi = "10.1038/s41586-022-05258-z")

# >200 connections
x$cited_by_count + length(x$referenced_works[[1]])
#> [1] 223

# Search the range of 1 year forward, 10 years back
snowball <- oa_snowball(
  identifier = x$id,
  citing_params = list(to_publication_date = x$publication_date + 365),
  cited_by_params = list(from_publication_date = x$publication_date - 3650)
)

# Returns a subset
nrow(snowball$nodes)
#> [1] 114

hist(snowball$nodes$publication_date, "years", freq = TRUE)
abline(v = x$publication_date, col = "red", lwd = 3)

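[Image: histogram of node publication dates by year, with a red vertical line at the publication date of the seed work]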

Closing to track vignette discussion in #284
