Jupyter Notebooks
Ambreen H edited this page Sep 20, 2020 · 3 revisions
https://jupyter.readthedocs.io/en/latest/install.html#install-and-use
Has anyone managed to get getpapers or ami running in a notebook?
Contributor: Ambreen H
Python was used to remove flag symbols from the XML dictionaries:
- The SPARQL endpoint file was first converted into the standard format using amidict (for reference see above)
- The new XML file was imported into Python and all characters within the grandchild elements (i.e. synonyms) were converted to ASCII. Because the flag symbols are non-ASCII, this emptied the synonym elements that contained only flags.
PYTHON CODE

```python
import re

iname = "E:\\ami_try\\Dictionaries\\country_converted.xml"
oname = "E:\\ami_try\\Dictionaries\\country_converted2.xml"

pat = re.compile(r'(\s*<synonym>)(.*?)(</synonym>\s*)', re.U)

with open(iname, "rb") as fin:
    with open(oname, "wb") as fout:
        for line in fin:
            # Drop any non-ASCII characters; this is what removes the flags
            line = line.decode('ascii', errors='ignore')
            m = pat.search(line)
            if m:
                # Normalise matched synonym lines to lowercase
                g = m.groups()
                line = g[0].lower() + g[1].lower() + g[2].lower()
            fout.write(line.encode('utf-8'))
```
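The reason this works: flag emoji are built from non-ASCII code points, so an ASCII decode with `errors='ignore'` silently drops them while keeping the surrounding text. A minimal illustration:

```python
# Flag symbols are encoded as non-ASCII bytes, so decoding as ASCII
# with errors='ignore' drops them and keeps the plain text.
raw = "France \U0001F1EB\U0001F1F7".encode("utf-8")  # "France " + FR flag
cleaned = raw.decode("ascii", errors="ignore")
print(repr(cleaned))  # the flag characters disappear, leaving "France "
```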
- The empty elements were then deleted using Python to create a new .xml file containing all synonyms except the flags.
PYTHON CODE

```python
from lxml import etree

def remove_empty_tag(tag, original_file, new_file):
    root = etree.parse(original_file)
    for element in root.xpath(f".//*[self::{tag} and not(node())]"):
        element.getparent().remove(element)
    # Serialize "root" and create a new tree using an XMLParser to clean up
    # formatting caused by removing elements.
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.fromstring(etree.tostring(root), parser=parser)
    # Write to new file.
    etree.ElementTree(tree).write(new_file, pretty_print=True,
                                  xml_declaration=True, encoding="utf-8")

remove_empty_tag("synonym",
                 "E:\\ami_try\\Dictionaries\\country_converted2.xml",
                 "E:\\ami_try\\Dictionaries\\country_converted3.xml")
```
All code is reusable with a little modification
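In that spirit, if lxml is not available, a rough standard-library equivalent might look like the sketch below (an assumption, not part of the original notebook; `xml.etree` has no `getparent()` or `pretty_print`, so the output formatting will differ):

```python
import xml.etree.ElementTree as ET

def remove_empty_tag_stdlib(tag, original_file, new_file):
    """Remove elements named `tag` that have no children and no text."""
    tree = ET.parse(original_file)
    root = tree.getroot()
    # ElementTree elements have no getparent(), so scan each parent
    # and remove its empty <tag> children directly.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag and len(child) == 0 and not (child.text or "").strip():
                parent.remove(child)
    tree.write(new_file, xml_declaration=True, encoding="utf-8")
```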
Tester: Ambreen H
The code was written in Python to import the data from the XML files and cleanse it, creating a CSV document for binary classification of the data. Proper data preparation is necessary before running the machine learning model.
- The following libraries were used: xml.etree.ElementTree as ET, string, os and re
- A function was written to locate the XML files and extract the abstract from each
- This was done on a small number of papers (11 positives and 11 negatives)
- The abstract was cleaned by removing unnecessary characters, converting to lowercase and removing subheadings like 'abstract'
- Finally, a single data file was created in CSV format with 3 columns: the name of the file, the entire cleaned text of the abstract, and whether the result is a false positive or true positive.
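The steps above might be sketched as follows, using only the libraries the notebook names. The `<abstract>` element name, directory layout, and 0/1 label encoding are assumptions (the original notebook is not shown):

```python
import csv
import os
import re
import string
import xml.etree.ElementTree as ET

def extract_abstract(xml_path):
    """Pull the text of the first <abstract> element (tag name assumed)."""
    root = ET.parse(xml_path).getroot()
    node = root.find(".//abstract")
    if node is None:
        return ""
    return "".join(node.itertext())

def clean_text(text):
    # Lowercase, drop punctuation, strip a leading 'abstract' subheading,
    # and collapse whitespace.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"^\s*abstract\s*", "", text)
    return re.sub(r"\s+", " ", text).strip()

def build_csv(labelled_dirs, out_csv):
    """labelled_dirs maps a directory of XML files to a 0/1 label."""
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "text", "label"])
        for directory, label in labelled_dirs.items():
            for name in sorted(os.listdir(directory)):
                if name.endswith(".xml"):
                    path = os.path.join(directory, name)
                    writer.writerow([name, clean_text(extract_abstract(path)), label])
```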
Jupyter was also used to run a smoke test for the Binary Classification using Machine Learning. More Information
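The actual smoke-test notebook is not shown here; one illustrative form such a test might take is a quick sanity check of the prepared CSV before any model is trained (a sketch, with column layout assumed as above):

```python
import csv

def smoke_test_csv(path):
    """Basic sanity checks on the prepared data file: three columns,
    non-empty abstract text, and strictly binary labels."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    assert len(header) == 3, "expected file, text, label columns"
    assert all(len(r) == 3 for r in data), "ragged row"
    assert all(r[1].strip() for r in data), "empty abstract text"
    assert all(r[2] in {"0", "1"} for r in data), "non-binary label"
    return len(data)
```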