Blogpost for GenR and FORCE11

I would like to put out one or more blogposts about openVirus. The first would take the angle of innovating Open Science systems in relation to COVID-19. This is for GenR, a blog I edit, where we are running a theme along these lines. The blogpost can also be posted on FORCE11, as there is an invitation to do this. See the GenR theme call: https://genr.eu/wp/covid-19-and-innovating-open-science-systems/

The idea is that this is a post open to all on openVirus.

I need to do this first one quite quickly, as there are two online conferences next week. So I want to publish on Tuesday 23 before 10am CEST.

But we can do more, and of different lengths.

What needs to be done?

Collect ideas for blog post
Write blog post - word length of 800 words is good
Review blog post
Publish

An outline

What is the overall innovation and then some specific examples
Describe the openVirus project
Give a timeline and motivation
Invitation to get involved
Include the mini project - https://github.com/petermr/openVirus/tree/master/miniproject

Innovation

bioRxiv in Citizen Health Search (CHS)? - can this be used?

openVirus

openVirus - Scientific Knowledge for citizens in the time of COVID. A new knowledgebase by/for citizens.

'The software will demonstrate how we can search in future'

The Ebola paper: A serological survey on viral haemorrhagic fevers in liberia - https://doi.org/10.1016/S0769-2617(82)80028-2, 1982

Ebola, Liberia

Explain what openVirus does:

Mining:

build scrapers or API query tools for openly readable sources
query or scrape user questions
download raw content (PDF, HTML, images) - 10 - 10,000 articles
clean and semantify
annotate with dictionaries
expose, analyze, display
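
The mining steps above can be sketched in a few lines of Python. This is only an illustrative outline, not the ami or getpapers implementation: the function names (`clean`, `annotate`, `run_pipeline`) and the tiny `DICTIONARY` are assumptions made up for this example, and the download step is replaced by HTML strings held in memory.

```python
import re
from collections import Counter
from html import unescape
from typing import Dict

# Illustrative dictionary (an assumption): surface term -> category label.
DICTIONARY: Dict[str, str] = {
    "ebola": "disease",
    "covid-19": "disease",
    "liberia": "country",
}


def clean(raw_html: str) -> str:
    """Strip tags and collapse whitespace (stand-in for 'clean and semantify')."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", unescape(text)).strip()


def annotate(text: str, dictionary: Dict[str, str]) -> Counter:
    """Count case-insensitive whole-term matches for each dictionary entry."""
    hits: Counter = Counter()
    lowered = text.lower()
    for term in dictionary:
        # Lookarounds keep 'ebola' from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        count = len(re.findall(pattern, lowered))
        if count:
            hits[term] = count
    return hits


def run_pipeline(documents: Dict[str, str]) -> Dict[str, Counter]:
    """Clean and annotate every document; keys are document identifiers."""
    return {doc_id: annotate(clean(raw), DICTIONARY) for doc_id, raw in documents.items()}
```

Calling `run_pipeline({"a": "<p>Ebola in <b>Liberia</b></p>"})` returns per-document term counts that could then be exposed, analysed, or displayed.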

tools:

framework: ami + CProject data
scrapers: getpapers, Ferret, curl, scrapy
cleaners: PDFBox, Tidy/Jsoup, Grobid, etc.
transformers: xml2html, ami ocr, KNIME
dictionaries: ami dictionary
indexing and annotation: Solr, ami
Analysis and display: R, KNIME

Sources: EuropePMC, bioRxiv and medRxiv, DOAJ, EThOS, Redalyc (MX)

timeline

Invite

Join us! Testers, graphics, software, queries, scraper, and documenting.

Miniproject

Write the blogpost here in MD:

Image: https://unsplash.com/photos/gj4HyoAWrCE Photo by Vincent Ghilione on Unsplash

Title: openVirus Knowledge in the Hands of Citizens

By: #openVirus

openVirus is developing new types of search for research literature, using data mining technologies to enable citizens to make use of scientific knowledge. The COVID-19 pandemic has created a variety of crises. Health and the economy are the most obvious, but serious issues are also occurring in education, social cohesion, transport, manufacturing, and supply chains. The vast majority of citizens working in these areas are locked out of the scientific literature. As an example, if a doctor searched for 'social distancing' on a publisher's site such as Taylor and Francis, they would find only 5% (21,919) of the research papers available as open access (Murray-Rust 2020); the rest (426,613) are paywalled.

It is worth noting that data mining of paywalled research is permitted under EU copyright directives (1), although publishers are hostile to upholding this legal right and are known to take punitive action to prevent it, such as completely disconnecting paying clients. With the COVID-19 crisis, Open Science is now on the public's radar, and all stakeholders in scholarly communications will need to make themselves relevant to the situation, as science's own crisis of 'designing new systems' (Thaney 2020) is coming down the line.

openVirus works by speedily downloading full-text papers from open repositories (EuropePMC, bioRxiv and medRxiv, DOAJ, EThOS, Redalyc (MX), etc.) at an average rate of fifty papers a second, then searching those papers on your local machine with 'dictionaries' that you build yourself or reuse from others, based on Wikidata's 50 million items. Searches can be pinpointed on parts of a document, for example graphs or conclusions, and indexing with Wikidata allows semantic queries. For example, if you had a question about COVID-19 infection rates and altitude, Wikidata can return all cities above 2,000 metres with a population over 50,000. New types of search are important because they allow scientific knowledge to be put into action: a citizen outside academia can share research related to an idea they are working on with others, and that research remains linked to and identified with its source, say EuropePMC.
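
As a rough sketch of the altitude example, the Python function below builds a SPARQL query for the Wikidata Query Service. It is an illustration rather than the query openVirus actually runs: the function name `build_high_city_query` is an assumption, and the query assumes that elevation values (property P2044) are recorded in metres, which is not guaranteed for every item.

```python
def build_high_city_query(min_elevation_m: int = 2000, min_population: int = 50000) -> str:
    """Return a SPARQL query for cities above an elevation and a population threshold.

    Assumption: elevation statements (P2044) are expressed in metres.
    """
    return f"""
SELECT ?city ?cityLabel ?elevation ?population WHERE {{
  ?city wdt:P31/wdt:P279* wd:Q515 ;   # instance of (a subclass of) city
        wdt:P2044 ?elevation ;        # elevation above sea level
        wdt:P1082 ?population .       # population
  FILTER(?elevation > {min_elevation_m} && ?population > {min_population})
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


if __name__ == "__main__":
    # Print the query so it can be pasted into https://query.wikidata.org/
    print(build_high_city_query())
```

The returned city labels could then serve as a dictionary for searching the downloaded papers.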

Image: Schematic of ContentMine software

The project uses a software framework called ContentMine as a foundation for data mining, interfaces with Wikidata to add semantic enrichment, and uses a variety of other frameworks for dedicated tasks. The technology is being rapidly developed and is designed to be put in the hands of the public, but it also serves a wide set of communities, such as research repositories looking to serve clients, or researchers needing to speed up scoping for literature reviews.

At the start of the COVID-19 pandemic openVirus sprang into action as an open research project on GitHub and Slack. Currently there are thirty-eight members working globally, and there is an open invitation for anyone to get involved. In April openVirus took part in the #EUvsVirus global hackathon to look at innovation for the pandemic on healthcare issues. In the three days of the hackathon the team made significant developments for its submission (openVirus 2020): establishing openVirus as the first system to annotate the scientific literature corpus with Wikidata; bringing thesis analysis on board by working on full-text indexing of UK PhD theses; adding the Ferret scraping system; interfacing with DOAJ and searching four million abstracts; and moving towards containerization of the system. The EUvsVirus hackathon was important for understanding the breadth of innovation challenges posed by the crisis, and it made clear that Open Science systems need to accelerate innovation to meet these needs. The challenges spanned health (lack of skilled caregivers), business (efficient team work), social cohesion (support for arts & entertainment), remote education (e-learning methods & tools, family life during remote working & education), and digital finance (speeding up access to financial support).

Andy Jackson of the British Library Web Archive team published a blogpost, 'Searching eTheses for the openVirus project' (Jackson 2020), about a contribution he made to openVirus. It responds to the point that libraries may already hold knowledge that could be made available to help in the crisis. Andy took up this challenge and applied the UK Web Archive software tools to analysing EThOS, the British Library's holding of over half a million UK theses. Legally these cannot be redistributed, but data mining to generate statistical summaries of the contents of the documents (for example, word frequencies) is permitted, and these show the likely relevance of a document. An API was made to access the theses and encapsulated in a Jupyter Notebook.
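
To illustrate how a word-frequency summary can indicate relevance without redistributing the text, here is a minimal Python sketch. It is not the British Library's code: `word_frequencies` and `relevance` are hypothetical helper names, and the score (the share of words that match the query terms) is just one simple choice among many.

```python
import re
from collections import Counter
from typing import Iterable


def word_frequencies(text: str) -> Counter:
    """Count lower-cased alphabetic words in a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def relevance(text: str, query_terms: Iterable[str]) -> float:
    """Fraction of the document's words that match any of the query terms."""
    freqs = word_frequencies(text)
    total = sum(freqs.values())
    if total == 0:
        return 0.0
    # Counter returns 0 for terms that never occur in the document.
    return sum(freqs[term.lower()] for term in query_terms) / total
```

Only the counts, not the full text, need to leave the archive for a reader to rank documents by such a score.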

“Our digital libraries and archives may hold crucial clues and content about how to help with the #covid19 outbreak: particularly this is the case with scientific literature. Now is the time for institutional bravery around access!” – @melissaterras

Embed: @melissaterras on library collections and COVID-19 https://twitter.com/melissaterras/status/1245645959876378625

Dr Gita Yadav from the National Institute of Plant Genome Research (NIPGR) in New Delhi is involved in running an innovative program on the next Green Revolution (TIGR2ESS). Last year Peter Murray-Rust helped her run a workshop, with fifty participants, on extracting knowledge from the scientific literature on plants. NIPGR selects interns (mainly Masters students) for 2-6 month research projects, and because of the pandemic we've asked them to switch to researching COVID-19 literature. They love it! The seven interns (Ambreen, Charles, Kareena, Priya, Rajan, Vaishali, and Vanisha), who have very little programming experience, have learnt GitHub, Slack, Maven, and the ContentMine toolstack (getpapers and ami).

They're now starting miniprojects, each downloading a thousand Open Access papers and extracting knowledge. The miniprojects are all aimed at ‘viral epidemics’ (not just COVID-19) and look for:

which countries they occur in,
what drugs are used,
what other diseases co-occur,
who funds research into epidemics,
what other viruses are involved,
non-pharmacological measures (e.g., social distancing, public health), and
experiences of testing and tracing.

At the end of four weeks they'll have a spreadsheet of key papers, all annotated in a semantic manner, and will explore data analyses and display.
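
One possible shape for such a spreadsheet is a CSV file with one row per paper and one column per question. The Python sketch below is hypothetical: the function name `to_spreadsheet`, the column names, and the sample identifier are placeholders for illustration, not output from the miniprojects.

```python
import csv
import io
from typing import Dict, List, Set


def to_spreadsheet(rows: Dict[str, Dict[str, Set[str]]], categories: List[str]) -> str:
    """Render per-paper annotations as CSV text, one column per category."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["paper_id"] + categories)
    for paper_id, annotations in rows.items():
        # Join the terms found for each category; leave the cell empty if none.
        cells = ["; ".join(sorted(annotations.get(c, set()))) for c in categories]
        writer.writerow([paper_id] + cells)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder data for illustration only.
    sample = {"paper-001": {"country": {"Liberia", "Guinea"}, "drug": {"remdesivir"}}}
    print(to_spreadsheet(sample, ["country", "drug", "funder"]))
```

A file in this form opens directly in a spreadsheet program or in R and KNIME for further analysis.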

The software is still at an alpha stage but is moving forward at speed, with development and documentation being added on a daily basis. All of this is possible because of years of work put in by Peter Murray-Rust, the ContentMine team, and collaborators. At some stage UIs will be put in place so that the public can use the framework and methods. Ideas for the application have already been worked out. Below is a schematic of a health knowledge application, 'bioRxiv in Citizen Health Search (CHS)', which was made for a funding proposal. It traverses a variety of sources (EuropePMC, bioRxiv, and emerging community sources such as Crossref, Unpaywall, Zenodo, and Wikidata) and illustrates the potential value of a 'Citizen Dashboard'.

Image: 'Citizen Dashboard' bioRxiv in Citizen Health Search (CHS)

Infobox: You can find out more information about #openVirus here on GitHub and on Slack.

Footnote

(1) From Wikipedia: Directive on Copyright in the Digital Single Market https://en.wikipedia.org/wiki/Directive_on_Copyright_in_the_Digital_Single_Market#cite_note-41

See: "24 organisations urge Rapporteur Axel Voss MEP to strike a more ambitious deal on TDM – European Alliance for Research Excellence". European Alliance for Research Excellence. 8 June 2018. Archived from the original on 12 June 2018. Retrieved 12 June 2018.

References

Murray-Rust, Peter. ‘OpenVirus - Tools for Discovering Literature on Viruses’. Science, 28 May 2020. https://www.slideshare.net/petermurrayrust/openvirus-tools-for-discovering-literature-on-viruses/6?src=clipshare.

Thaney, Kaitlin. ‘The Open Scholarship Ecosystem Faces Collapse; It’s Also Our Best Hope for a More Resilient Future’. Impact of Social Sciences (blog), 19 June 2020. https://blogs.lse.ac.uk/impactofsocialsciences/2020/06/19/the-open-scholarship-ecosystem-faces-collapse-its-also-our-best-hope-for-a-more-resilient-future/.

openVirus. ‘EUvsVirus: ContentMine – Scientific Knowledge for All’. Devpost, 26 April 2020. http://devpost.com/software/contentmine-scientific-knowledge-for-all.

Jackson, Andy. ‘Searching ETheses for the OpenVirus Project’. Digital Scholarship Blog (blog), 14 May 2020. https://blogs.bl.uk/digital-scholarship/2020/05/searching-etheses-for-the-openvirus-project.html.

END

Upload previous presentation slides to Zenodo COVID community - currently offline 20.6 https://zenodo.org/communities/covid-19/

EU Virus hack
Pubfest
Summer school
