Reviewers using anonymous private URL might learn dataset author's identity from information about the Dataverse installation or collection #8184

Open
jggautier opened this issue Oct 25, 2021 · 20 comments · May be fixed by #10961
Assignees
Labels
FY25 Sprint 7 (2024-09-25 - 2024-10-09) · FY25 Sprint 8 (2024-10-09 - 2024-10-23) · Size: 10 (a percentage of a sprint; 7 hours) · UX & UI: Design (this issue needs input on the design of the UI and from the product owner)

Comments

@jggautier
Contributor

jggautier commented Oct 25, 2021

Information about the installation and about the Dataverse collection that the dataset is in could help reveal the dataset author's identity to the dataset reviewer.

Information about the repository housing the dataset:

The anonymous private URL page shows the name of the Dataverse repository/installation that the dataset is in, and the reviewer can navigate around the repository website to find more information about the repository. This could be an issue for Dataverse repositories with a narrower, more focused audience, like the repositories that only allow researchers affiliated with a certain institution to deposit datasets.

Information about the Dataverse collection housing the dataset:

The anonymous private URL page shows the name of the Dataverse collection that the dataset is in, even if the Dataverse collection is unpublished. This feature was meant mostly for "Journal Dataverse collections," (#1724) so we should expect that the reviewer would already know, before ever visiting the anonymous private URL page, that the dataset is associated with a particular journal.

But the depositor's dataset could be in a Dataverse collection whose information (such as collection name or description) could be used to identify the author. This point was also brought up in two comments (1, 2) in the original GitHub issue. For example, many collections include the researcher's name because, when people create Dataverse collections, the Dataverse software prefills the "Dataverse Name" field with the name of the Dataverse repository account that created the Dataverse collection. This is often the author's name, and the reviewer can see that Dataverse collection name in the breadcrumbs on the anonymous private URL page.

@djbrooke
Contributor

Thanks @jggautier!

I mentioned in Slack that I think it would be challenging to implement a programmatic fix for this, as you'd need to obscure the collection name and also potentially obscure names of other datasets, subcollections, parent collections etc. We could also revisit the functionality generally in order to not allow navigation off the dataset, but this would be a big change as well - right now the application just creates a temporary user that allows the access. Food for thought if we're able to prioritize this at some point in the future.

@jggautier
Contributor Author

jggautier commented Oct 25, 2021

Thanks. Do you think users might share anonymous URLs before they realize that reviewers might see information about the repository or the dataset's Dataverse collection that could give away who the author is?

A careful depositor might check the URL before sharing it and realize this, but I think there are things we could do to increase the chances that most users will realize this, like adding this info in the User Guides or in the popup.

@djbrooke
Contributor

@jggautier Oh yeah, option three - better explanatory text. :) All for it!

@jggautier
Contributor Author

jggautier commented Nov 5, 2021

Hi everyone. Changes in the UI that @TaniaSchlatter and I are proposing are in the PDF, Proposed changes to Anonymous Private URL.pdf. The PDF has two boxes, the first describing how the feature works as designed now (v5.7) and the second describing changes based on reviews by the curation team at Harvard Dataverse Repository and @kmika11's review with some researchers who've needed to share their datasets anonymously.

Changes to the User Guides section about this feature are in the Google Doc at https://docs.google.com/document/d/1bn4fIPr_yhOj_DYDldzdKEZjmETV-WLYc98sWTgcg58.

The changes are meant to address the issue described in this GitHub issue as well as address confusion about the differences between the two types of URLs (#8185). (@jeisner brought up other points in an older GitHub issue, particularly about being able to anonymously share a dataset that's already been published, that this feature doesn't address.)

The next steps are:

  • Gathering feedback...
  • Iterate on the proposed changes based on the feedback

@philippconzett
Contributor

philippconzett commented Nov 6, 2021

Thanks for sharing the progress on this feature! The proposed changes all look good to me. The term "Prepublish URL" is clearer than "Private URL", and the descriptions in the pop-up windows and in the user guide are all very clear. I think the anonymized version of the Prepublish URL is mainly useful in cases where a dataset is part of a double-blind peer-review process. I have added a note on this in the Google doc.

As mentioned earlier, in DataverseNO, we use a special, unpublished collection for datasets that are part of a double-blind peer review process. Page 12 in this presentation summarizes how DataverseNO currently supports double-blind peer review. See also this fake example of an anonymized dataset in our double-blind peer review collection.

Maybe an easy(?) way to enhance the Prepublish URL feature even more could work like this:
When the depositor or curator (depending on the access rights) clicks the Prepublish URL button and selects Anonymous Review, a copy of the dataset will be pushed into an anonymized collection like the double-blind peer review collection at DataverseNO, and the copy will be anonymized following the current Anonymous Review feature.

That way, the name of the repository would still be revealed, but the collection would be anonymized.

@jggautier
Contributor Author

jggautier commented Dec 16, 2021

Thanks @philippconzett :) I think for now we've decided to change the layout and the text on the popup to help the depositor understand the limitations of the feature, like how the collection name can help reveal the authors' identities.

I'm all for opening another issue specifically for discussing ways to remove that limitation. @djbrooke and @TaniaSchlatter, what do you think?

To get more feedback about the redesigned popup, we reviewed it with 6 people - 5 people who I found used workarounds to deposit datasets in Harvard Dataverse Repository for anonymous review and 1 person who manages a journal's Dataverse Collection in the repository and has been interested in support for anonymous review. The redesign seemed to work well and I made small text-based adjustments based on the feedback:

  • One depositor who saw the button for disabling the Anonymous Review URL said he worried that if he disabled the link, the people who he gave the link to would then be able to see the metadata that would identify him as an author. (During the review the depositor didn't click the button, which would show a confirmation popup that says that others "will no longer be able to use it to access your unpublished dataset".)
  • Another person suggested including that the "General Review" page will display metadata that could identify the author.

These screenshots of the popup show the text changes:

[Image: Screen Shot 2021-12-16 at 9 22 52 AM]

I also split the last block of text in two to improve readability, clarified that the files will be "accessible" if they're not restricted, and changed "data files" to "dataset's files". We heard during the review of the metadata tooltips that "data files" could be interpreted to exclude other types of files like "documentation files" and "code files", so I think it's better to use broader language here.

We learned more about the feature in general, including how discoverable it is (or isn't), and we heard things about the journal review process that I think we need to learn more about, so I'm working on summarizing that feedback and recommending next steps.

@TaniaSchlatter TaniaSchlatter added the UX & UI: Design This issue needs input on the design of the UI and from the product owner label Jan 3, 2022
@TaniaSchlatter
Member

The wording and layout changes outlined above should help from the UI perspective; however, moving them forward is not a complete programmatic fix.

@meghangoodchild
Contributor

Thanks for the opportunity to provide feedback. The anonymous review feature is certainly a desired feature.

Based on some discussions with members from our community, we learned about several experiences where researchers have used the private URL in their article's data availability statement (instead of the DOI). We would like to stress the importance of using terminology that emphasizes the temporary nature of the URL, such as "temporary prepublish URL" or "prepublish preview URL".

@philippconzett
Contributor

We have had the same experience as @meghangoodchild describes - although we have emphasized for the depositors that they must replace the private URL with the DOI before the article is published. Right now, this is the case in a Nature article that was published several months ago and so far, we have not been able to make Nature replace the private URL with the DOI. As a result, we cannot publish the dataset, because that would cause the private URL to be deactivated and the dataset URL in the reference list of the article would thus no longer resolve.

Maybe you could include some explicit wording in the private URL feature that makes depositors aware of the importance of making sure that the dataset reference in the final article must contain the dataset DOI, not the private URL.

@jggautier
Contributor Author

jggautier commented Jan 12, 2022

Thanks @meghangoodchild and @philippconzett. We've heard the same thing, and I saw that figshare mentions in their guides that their "private link" shouldn't be used to cite the data in publications. I'm proposing adjusting the name of the feature and adding a line in the popup (and in the User Guides) about how the dataset's PID should be used to cite the data in publications:

[Image: Screen Shot 2022-01-12 at 3 08 38 PM]

Because the name of the feature is in the URL, too, if the name makes the temporary nature of the link more obvious, hopefully it'll be more obvious to researchers and journal editors just by looking at the URL, e.g. https://demo.dataverse.org/previewurl.xhtml?token=0f04F8c2-bcer-4adf-816d-3b950c73ddce

But like I mentioned in emails, we'll be trying to contact journals and publishers to learn more about why authors have been adding this temporary URL to their articles in the first place and why there's friction when that URL needs to be replaced by the PID before the article is published. We've seen journal and publisher policies, like Springer Nature's policies, that I'd think are pretty explicit about using persistent IDs in articles to cite data. Is there anything about a publisher's or journal's processes that contributes to this friction?

We've also seen that sometimes researchers don't realize that the PIDs of unpublished datasets will "work" (lead to the datasets) once the datasets are published. Would making this fact more obvious encourage researchers to cite datasets with PIDs instead of private URLs?
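
As a rough illustration of how a curator or journal might catch a temporary link before it ends up in an article, here is a minimal sketch that flags token-based preview links. The page name `previewurl.xhtml` is the one proposed in this thread; `privateurl.xhtml` is an assumption about the current page name, and the function name is hypothetical, not part of the Dataverse software.

```python
from urllib.parse import urlparse, parse_qs

# Page names are assumptions: "previewurl.xhtml" is the name proposed in this
# thread, and "privateurl.xhtml" is assumed to be the current page name.
TEMPORARY_PAGES = ("previewurl.xhtml", "privateurl.xhtml")


def is_temporary_dataset_url(url: str) -> bool:
    """Return True if the URL looks like a token-based preview link."""
    parsed = urlparse(url)
    # Last path segment, e.g. "previewurl.xhtml"
    page = parsed.path.rsplit("/", 1)[-1].lower()
    # A preview link carries a non-empty "token" query parameter
    has_token = "token" in parse_qs(parsed.query)
    return page in TEMPORARY_PAGES and has_token
```

A check like this only recognizes the link's shape; it does not contact the repository or confirm that the token is still active.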

@jggautier
Contributor Author

@TaniaSchlatter agrees that the redesign of the feature name, the popup, banner messages, and relevant guide pages is done and can be moved to development when possible.

The changes are illustrated in mockups in an image and in a section of a virtual whiteboard. They include changes to:

  • The name of the feature in the Edit Dataset dropdown on the dataset page
  • The text, layout, and interaction of the feature's popup
  • The text of the "Disable URL" confirmation popup
  • The name of the feature in the URL (e.g. previewurl in https://demo.dataverse.org/previewurl.xhtml?token=39b07d51-e0aa-4a89-a179-cacd63c94d72)
  • The banner messages shown when using the feature

The changes to the guide pages - pages in the User, API, Installation, Developer and Style guides - are in the Google Doc at https://docs.google.com/document/d/1bn4fIPr_yhOj_DYDldzdKEZjmETV-WLYc98sWTgcg58

The change to the name of the feature will require changes to the names of associated code files, e.g. PrivateUrlUtil.java

@mreekie

mreekie commented Sep 26, 2022

Worked on by

  • Tania
  • Julian

@scolapasta
Contributor

scolapasta commented Feb 7, 2024

Removing information about the Dataverse collection should be relatively straightforward: we would simply not render it (we would have to see how it looks), i.e. don't show the Dataverse collection header and don't show the breadcrumbs.

Repository name can't be hidden as it's part of the URL.

Currently not sure about the past versions - since I'm not sure if the persistent ID is exposed in this; if it is, then an end user could use that to find the dataset. If it's not, then we should be able to not render the version tabs.

We still need to determine how to deal with published versions: would we have published versions of a dataset needing anonymous review? @jggautier, if this is the case, then we could add code to disallow creating an anonymous link once a dataset has a published version.


Other considerations:
  • We still have the lingering issue of small repositories being at risk of exposing identifying information.
  • Julian had suggested not providing a URL (which contains the repository name) but instead providing a PDF of the data, to avoid interacting with the repository's identifying information.
  • What alternatives can we consider?
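
The rendering rules discussed in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the section names, the `PreviewContext` type, and both functions are hypothetical and do not correspond to actual Dataverse code. Version tabs are left visible because their handling is still undecided above.

```python
from dataclasses import dataclass


@dataclass
class PreviewContext:
    """Hypothetical description of how a dataset page is being viewed."""
    anonymized: bool               # True when accessed via an anonymous review link
    published_version_count: int   # 0 for an initial draft


def sections_to_render(ctx: PreviewContext) -> list[str]:
    """Return the page sections to show (section names are illustrative)."""
    sections = ["collection_header", "breadcrumbs", "version_tabs", "dataset_body"]
    if ctx.anonymized:
        # Hide anything that names the Dataverse collection.
        hidden = {"collection_header", "breadcrumbs"}
        sections = [s for s in sections if s not in hidden]
    return sections


def can_create_anonymous_link(ctx: PreviewContext) -> bool:
    """Allow an anonymous link only while no version has been published."""
    return ctx.published_version_count == 0
```

Note that the repository name is not covered by this sketch, since it is part of the URL and cannot be hidden by skipping page sections.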

@sbarbosadataverse

From the comments on Jan 25, 2022, this task needs review (@scolapasta): "Changes to the text, layout, and interaction of the feature's popup".

@jggautier can you explain further what this particular change would fix?

@scolapasta
Contributor

So, reviewing with @qqmyers, it does seem that anonymous peer review can only happen for initial drafts, which means there are no previously published versions. That means there's no code/logic to worry about there, but also that the suggested popups above don't even need to mention previous versions in that case.

@jggautier
Contributor Author

The changes are meant to:

  • Help depositors be aware that reviewers might figure out who they are based on the name of the repository and the name of the Dataverse collection that the dataset is deposited in
  • Help depositors be sure about the differences between the "Private URL" and "Anonymous Private URL". We wrote more about this in Some researchers unsure of difference between "Private URL" and "Anonymous Private URL" #8185
  • Encourage depositors to use the PID when citing the dataset, instead of using the private URL
  • Make depositors aware that their files are not being anonymized or changed in any way

I didn't know that anonymous peer review is available only for initial drafts of the dataset, but that's great!

@adam3smith
Contributor

So this only changes the guidance, correct? I think the changes look good and worthwhile, though ime you shouldn't expect big effects on user behavior from any written text.

@jggautier
Contributor Author

Yeah changes to text, and also the popup's layout and interaction

@sbarbosadataverse

My next question is: can we get this done and on the prioritization list, @scolapasta @jggautier?
Are there any blockers to the changes we need to make to have this work on HDV?

@qqmyers
Member

qqmyers commented Mar 25, 2024

FWIW: The text proposed, at least as of #8184 (comment), indicated that restricted files are not available with the anonymized preview URL. That is not currently the case. If this is desired, I suspect it may need to be an option (presumably some reviews require looking at the restricted files?). #10403 proposes allowing this in general (giving users the ability to view unpublished datasets but not restricted files), but that doesn't necessarily change how anonymized preview URLs work, so that would have to be handled somewhere. In any case, the text change here shouldn't include that unless/until the functionality is changed.

@cmbz cmbz added the Size: 3 A percentage of a sprint. 2.1 hours. label Mar 26, 2024
@sekmiller sekmiller added the FY25 Sprint 7 FY25 Sprint 7 (2024-09-25 - 2024-10-09) label Oct 3, 2024
@sekmiller sekmiller self-assigned this Oct 3, 2024
sekmiller added a commit that referenced this issue Oct 8, 2024
@cmbz cmbz added Size: 10 A percentage of a sprint. 7 hours. FY25 Sprint 8 FY25 Sprint 8 (2024-10-09 - 2024-10-23) and removed Size: 3 A percentage of a sprint. 2.1 hours. labels Oct 9, 2024
sekmiller added a commit that referenced this issue Oct 16, 2024
sekmiller added a commit that referenced this issue Oct 22, 2024
@sekmiller sekmiller linked a pull request Oct 23, 2024 that will close this issue
Projects
Status: ⚠️ Needed/Important