caifand/softcite-dataset

softcite-dataset

We are building a dataset of software mentions in research publications. We have annotated thousands of software mentions, most of them informal, in thousands of published academic papers. The result is an annotated corpus suitable for training entity recognition algorithms. We expect it to support further development of machine-learning-based text mining tools, both for analyzing software use and development in science and for making software entities more visible in the existing scientific literature.

Visibility matters for the underacknowledged software work in science, work that is critical to scientific progress. We hope our effort helps software work receive its due credit on the honor wall of science, and thereby encourages more investment in quality software work for better science.

softcite-dataset: from PDF annotation to output

softcite-dataset: from manual annotation of PDF documents to a corpus for machine learning use

Documentation

API for knowledge base access

CiteAs.org helps find the citation that a piece of software requests (and will eventually use data from the knowledge base).

Thanks to the Sloan Foundation for funding.