softcite-dataset

We are building a dataset of software mentions in research publications. We have annotated thousands of software mentions, most of them informal, in thousands of published academic papers. The result is an annotated corpus suitable for training entity recognition algorithms. We expect it to support further development of machine-learning-based text mining tools, both for analyzing how software is used and developed in science and for making software more visible in the existing scientific literature.

Visibility matters because software work in science is underacknowledged, even though it is critical to scientific progress. We hope our effort helps software work earn its due credit on the honor wall of science, and thereby encourages more investment in quality software for better science.

softcite-dataset: from manual annotation of PDF documents to a corpus for machine learning use

Documentation

API for knowledge base access

CiteAs.org helps find the requested citation for software (and will eventually use data from the knowledge base).

Thanks to the Sloan Foundation for funding.


caifand/softcite-dataset

Folders and files

Latest commit

History

Repository files navigation

softcite-dataset

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages