
Blogpost for GenR and FORCE11

Simon Worthington edited this page Jun 21, 2020 · 29 revisions

I would like to put out one or more blogposts about openVirus. The first would take the angle of innovating Open Science systems in relation to COVID-19. This is for GenR, a blog I edit, where we are running a theme along these lines. The blogpost can also be posted on FORCE11, as there is an invitation to do this. See the GenR theme call: https://genr.eu/wp/covid-19-and-innovating-open-science-systems/

The idea is that this is a post about openVirus, open to all.

I need to do this first one quite quickly as there are two online conferences next week, so I want to publish on Tuesday 23 before 10am CEST.

But we can do more, and of different lengths.

What needs to be done?

  • Collect ideas for blog post
  • Write blog post - word length of 800 words is good
  • Review blog post
  • Publish

An outline

Innovation

  • bioRxiv in Citizen Health Search (CHS) - can this be used?

openVirus

openVirus - Scientific Knowledge for citizens in the time of COVID. A new knowledgebase by/for citizens.

'The software will demonstrate how we can search in future'

The Ebola paper: A serological survey on viral haemorrhagic fevers in Liberia - https://doi.org/10.1016/S0769-2617(82)80028-2, 1982

Ebola, Liberia

Explain what openVirus does:

Mining:

  • build scrapers or API query tools for Openly readable sources.
  • query or scrape user questions
  • download raw content (PDF, HTML, images) - 10 - 10,000 articles
  • clean and semantify
  • annotate with dictionaries
  • expose, analyze, display
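
As an illustration of the "clean" step above, here is a minimal sketch that strips an HTML page down to plain text using only the Python standard library. It is not how ami or Tidy/Jsoup do it; the class name `TextExtractor`, the function `html_to_text` and the sample page are all made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical sample page, standing in for a downloaded article.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Face masks</h1><p>Masks reduce <b>viral</b> spread.</p></body></html>"
)
clean_text = html_to_text(sample_html)
print(clean_text)
```

Real articles need much more than this (PDF extraction, section detection, character repair), which is what the dedicated cleaners in the tools list are for.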

tools:

  • framework: ami + CProject data
  • scrapers: getpapers, Ferret, curl, scrapy
  • cleaners: PDFBox, Tidy/Jsoup, etc. Grobid
  • transformers: xml2html, ami ocr, KNIME
  • dictionaries: ami dictionary
  • indexing and annotation: Solr, ami
  • Analysis and display: R, KNIME

Sources: EuropePMC, bioRxiv and medRxiv, DOAJ, EThOS, Redalyc (MX)

timeline

Invite

Join us! Testers, graphics, software, queries, scraper, and documenting.

Miniproject


Write the blogpost here in MD:

openVirus is innovating new types of search for research literature, using data mining technologies to enable citizens to make use of scientific knowledge. The COVID-19 pandemic has created a variety of crises, health and economic being the most obvious. The vast majority of citizens working in these areas are locked out of accessing scientific literature: a doctor with a question about 'social distancing' would find only 5% (21,919) of research papers available as open access (Murray-Rust 2020); the rest (426,613) are paywalled.

It is worth noting that the right to data mine paywalled research is permitted under EU copyright law, although publishers are hostile to upholding this legal right and are known to take punitive action to prevent it, such as completely disconnecting paying clients. With the COVID-19 crisis, Open Science is now on the public's radar, and all stakeholders involved in scholarly communications are going to need to make themselves relevant to the crisis or face a reckoning down the line.

openVirus works by speedily downloading full-text papers from open repositories at a rate of fifty papers a second, then searching those papers on your local machine with 'dictionaries', which you build yourself or reuse from others, based on Wikidata's 50 million items. [I need an example here to make it relevant to the reader about annotation]. This new type of search is important because scientific knowledge can be put into action: someone, such as a citizen outside academia, can share research related to an idea they are working on with others, and that research remains linked to and identified with its source, say EPMC.
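
To give a feel for what dictionary annotation means, here is a minimal sketch in Python. It is not the ami implementation: the function `annotate`, the three terms and the `Q-EXAMPLE-n` identifiers are placeholders invented for this example, where a real dictionary would carry actual Wikidata item IDs.

```python
import re

# Hypothetical mini-dictionary: term -> placeholder identifier.
# A real dictionary would map each term to a Wikidata item ID.
dictionary = {
    "face mask": "Q-EXAMPLE-1",
    "social distancing": "Q-EXAMPLE-2",
    "quarantine": "Q-EXAMPLE-3",
}


def annotate(text, dictionary):
    """Return sorted (start, end, matched_text, identifier) tuples for each hit."""
    hits = []
    for term, identifier in dictionary.items():
        # Whole-word, case-insensitive match, allowing a simple plural "s".
        pattern = re.compile(r"\b" + re.escape(term) + r"s?\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), match.group(0), identifier))
    return sorted(hits)


text = "Face masks and social distancing were studied; quarantine was not."
hits = annotate(text, dictionary)
for hit in hits:
    print(hit)
```

Each hit keeps its character offsets, so an annotation can always be traced back to the exact place in the source paper.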

The project uses a software framework called ContentMine as a foundation for data mining, interfacing with Wikidata to add semantic enrichment, and a variety of other frameworks for other dedicated tasks. The technology is being rapidly developed and is designed to be put in the hands of the public, but it also serves a purpose for a wide set of communities: research repositories looking to service clients, or researchers needing to speed up scoping on literature reviews.

At the start of the COVID-19 pandemic openVirus sprang into action as an open research project on GitHub and Slack. Currently there are thirty-eight members working globally, and there is an open invitation for anyone to get involved. In April openVirus took part in the #EUvsVirus global hackathon to look at innovation for the pandemic. In the three days of the hackathon the team made significant progress: establishing openVirus as the first system to annotate the scientific literature corpus with Wikidata; bringing thesis analysis on board and working on full-text indexing of UK PhD theses; adding the Ferret scraping system; interfacing with DOAJ and searching four million abstracts; and moving ahead on containerization of the system. The EUvsVirus hackathon was important for understanding the wide breadth of innovation challenges posed by the crisis, which makes it clear that Open Science systems need to accelerate innovation to meet these needs. The challenges ranged across health (lack of skilled caregivers), business (efficient team work), social cohesion (support for arts & entertainment), remote education (e-learning methods & tools), family life during remote working & education, and digital finance (speeding up access to financial support).

Andy Jackson of the British Library Web Archive team posted a blogpost, 'Searching eTheses for the openVirus project' (Jackson 2020), about a contribution he made to openVirus in response to the issue that libraries may already hold knowledge that could be made available and help in the crisis. Andy took up this challenge and applied the UK Web Archive software tools to analysing the British Library's EThOS holding of over half a million UK theses. Legally these cannot be redistributed, but data mining can generate statistical summaries of the contents of the documents, for example word frequencies, that show the likely relevance of a document. An API was made to access the theses and encapsulated in a Jupyter Notebook.
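
A word-frequency summary of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the British Library's code: the function `word_frequencies`, the tiny stop-word list and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stop-words as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)


# Hypothetical text standing in for the full text of a thesis.
sample = "Virus transmission and virus mutation: the virus spreads."
summary = word_frequencies(sample)
print(summary)
```

Only the counts leave the library, never the text itself, which is why such summaries can be shared when the documents cannot.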

Embed: @melissaterras on library collections and COVID-19 https://twitter.com/melissaterras/status/1245645959876378625

A number of students are also working with the openVirus project, and recently a 'mini project' has been created as a gateway into learning about the software, with the purpose of creating a useful dataset for machine learning exercises. With the package the user can create a dictionary around a COVID-19 topic such as 'face masks in viral epidemics' and refine that dictionary depending on the results. In the process of running the software the user will: create a query, run it, and refine it iteratively; download up to 1000 articles (your COVID-19 Project); search them with 3-6 dictionaries for co-occurrence; manually evaluate how useful the co-occurrence is; and refine the dictionaries.
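
The co-occurrence step can be pictured with a short Python sketch. It is a simplification rather than what the mini project software does: the three dictionaries, the three one-line "articles" and the function `dictionaries_hit` are invented for illustration, and terms are matched by plain substring search.

```python
from collections import Counter
from itertools import combinations

# Hypothetical dictionaries: name -> set of lowercase terms.
dictionaries = {
    "disease": {"covid-19", "influenza"},
    "intervention": {"face mask", "quarantine"},
    "country": {"liberia", "italy"},
}

# Hypothetical articles, standing in for downloaded full texts.
articles = {
    "a1": "Face mask use during the COVID-19 outbreak in Italy.",
    "a2": "Quarantine policy and influenza.",
    "a3": "A survey of hospitals in Liberia.",
}


def dictionaries_hit(text):
    """Return the names of dictionaries with at least one term in the text."""
    lowered = text.lower()
    return {
        name
        for name, terms in dictionaries.items()
        if any(term in lowered for term in terms)
    }


# Count, for each pair of dictionaries, the articles in which both are hit.
cooccurrence = Counter()
for text in articles.values():
    for pair in combinations(sorted(dictionaries_hit(text)), 2):
        cooccurrence[pair] += 1

for pair, count in cooccurrence.most_common():
    print(pair, count)
```

A pair of dictionaries that keeps turning up together suggests a fruitful topic; a pair that never does suggests the query or the dictionaries need refining.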

About the 'miniproject'. 100 words

Mention 'bioRxiv in Citizen Health Search (CHS)' as it illustrates the potential.

Maybe add right to mine

References

Murray-Rust, Peter. ‘OpenVirus - Tools for Discovering Literature on Viruses’. Science, 28 May 2020. https://www.slideshare.net/petermurrayrust/openvirus-tools-for-discovering-literature-on-viruses/6?src=clipshare.

openVirus. ‘EUvsVirus: ContentMine – Scientific Knowledge for All’. Devpost, 26 April 2020. http://devpost.com/software/contentmine-scientific-knowledge-for-all.

Jackson, Andy. ‘Searching ETheses for the OpenVirus Project’. Digital Scholarship Blog (blog), 14 May 2020. https://blogs.bl.uk/digital-scholarship/2020/05/searching-etheses-for-the-openvirus-project.html.


Upload previous presentation slides to Zenodo COVID community - currently offline 20.6 https://zenodo.org/communities/covid-19/

  • EU Virus hack
  • Pubfest
  • Summer school