Modules
Each module (a class implementing the Module interface) is responsible for a different area of MathML canonicalization. Currently there are two types of modules. Stream modules implement the StreamModule interface and perform canonicalization directly on the input stream. DOM modules implement the DOMModule interface and process the document using the Document Object Model; a JDOM2 Document has to be created before execution. AbstractModule allows loading the default configuration from a property file. AbstractModuleTest provides an easy way to test modules by comparing the desired and actual output using XMLUnit.
Stream module that removes elements and attributes that are insignificant for formula searching and indexing, e.g. appearance-altering tags and attributes. You can specify which elements are removed while their children are kept, and which elements are removed together with their children.
For attributes the opposite approach is used: all attributes are removed except those we want to keep. Any element can be configured to have its own attribute whitelist. Attributes with specific values can also be marked to be kept.
This module also removes XML comments.
This module is based on chapters 3 and 4.3.2 of the MathML Version 2.0 (Second Edition) W3C Recommendation.
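The filtering rules described above can be sketched with the JDK's StAX API. This is only an illustration under stated assumptions: the class name MinimizerSketch, the three configuration sets and the single global attribute whitelist are hypothetical, and the real module reads its lists from its property file and supports per-element whitelists.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

final class MinimizerSketch {
    // Hypothetical configuration; the real values come from the module's property file.
    static final Set<String> REMOVE_KEEP_CHILDREN = Set.of("mstyle", "mpadded");
    static final Set<String> REMOVE_WITH_CHILDREN = Set.of("mspace", "annotation");
    static final Set<String> KEPT_ATTRIBUTES = Set.of("mathvariant");

    /** Copies the input to the output, dropping configured elements, attributes and comments. */
    static String minimize(String xml) {
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            int skipDepth = 0; // > 0 while inside an element removed together with its children
            while (in.hasNext()) {
                int event = in.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = in.getLocalName();
                    if (skipDepth > 0 || REMOVE_WITH_CHILDREN.contains(name)) {
                        skipDepth++;
                    } else if (!REMOVE_KEEP_CHILDREN.contains(name)) {
                        writer.writeStartElement(name);
                        // Whitelist approach: only explicitly kept attributes are copied.
                        for (int i = 0; i < in.getAttributeCount(); i++) {
                            String attribute = in.getAttributeLocalName(i);
                            if (KEPT_ATTRIBUTES.contains(attribute)) {
                                writer.writeAttribute(attribute, in.getAttributeValue(i));
                            }
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (skipDepth > 0) {
                        skipDepth--;
                    } else if (!REMOVE_KEEP_CHILDREN.contains(in.getLocalName())) {
                        writer.writeEndElement();
                    }
                } else if (event == XMLStreamConstants.CHARACTERS && skipDepth == 0) {
                    writer.writeCharacters(in.getText());
                }
                // All other events, including comments, are not copied.
            }
            writer.flush();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

With this hypothetical configuration, an mstyle wrapper disappears while its children stay, mspace and annotation vanish together with their content, and only mathvariant survives among the attributes.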
DOM module replacing all occurrences of mfenced elements. Fenced formulae are converted to mrow elements containing the delimiters and separators (taken from the mfenced attributes) in mo elements. The inner content is placed into another mrow element. The module can be configured not to add the outer and inner mrow, and you can specify your own fixed or default parentheses and separators for fenced expressions.
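A rough sketch of the conversion follows. To stay self-contained it uses the JDK's org.w3c.dom API rather than JDOM2, and the class name MfencedSketch is hypothetical. It always adds both mrow elements and falls back to the MathML default attribute values ("(", ")" and ","), so the module's configuration options are not modelled.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MfencedSketch {

    static String replaceMfenced(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Snapshot the live NodeList first, because the tree is modified below.
            NodeList live = doc.getElementsByTagName("mfenced");
            List<Element> fenced = new ArrayList<>();
            for (int i = 0; i < live.getLength(); i++) {
                fenced.add((Element) live.item(i));
            }
            for (Element element : fenced) {
                replace(doc, element);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process MathML", e);
        }
    }

    private static void replace(Document doc, Element fenced) {
        // Defaults as defined for mfenced by the MathML specification.
        String open = fenced.hasAttribute("open") ? fenced.getAttribute("open") : "(";
        String close = fenced.hasAttribute("close") ? fenced.getAttribute("close") : ")";
        String separators = (fenced.hasAttribute("separators")
                ? fenced.getAttribute("separators") : ",").replaceAll("\\s", "");
        Element outer = doc.createElement("mrow");
        Element inner = doc.createElement("mrow");
        if (!open.isEmpty()) outer.appendChild(mo(doc, open));
        outer.appendChild(inner);
        if (!close.isEmpty()) outer.appendChild(mo(doc, close));
        int count = 0; // number of arguments moved so far
        while (fenced.getFirstChild() != null) {
            Node child = fenced.getFirstChild();
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                fenced.removeChild(child); // whitespace between arguments
                continue;
            }
            if (count > 0 && !separators.isEmpty()) {
                // The last separator is repeated when there are too few of them.
                int position = Math.min(count - 1, separators.length() - 1);
                inner.appendChild(mo(doc, String.valueOf(separators.charAt(position))));
            }
            inner.appendChild(child); // appendChild moves the node out of mfenced
            count++;
        }
        fenced.getParentNode().replaceChild(outer, fenced);
    }

    private static Element mo(Document doc, String text) {
        Element mo = doc.createElement("mo");
        mo.setTextContent(text);
        return mo;
    }
}
```

Separator characters outside the Basic Multilingual Plane are not handled by this sketch, since it indexes the separators string by char.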
DOM module which removes unnecessary mrow elements and also adds mrow elements around detected parentheses so that the result matches the output of MfencedReplacer. The operators to be detected as parentheses can be configured. The module can also be configured whether to wrap the detected parentheses and/or their content in mrow. mrow elements are removed according to their parent's required child count, as specified in chapter 3.1.3 of the MathML Version 3.0 W3C Recommendation, but this can also be configured.
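The removal part can be illustrated with a simplified sketch. It assumes two rules only: an mrow with exactly one child element is replaced by that child, and an mrow that is the only child of an element treating its content as an inferred mrow (a hypothetical hard-coded list based on MathML 3.0, chapter 3.1.3) is unwrapped. The class name MrowSketch is hypothetical, org.w3c.dom stands in for JDOM2, and the wrapping of parentheses is not modelled.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class MrowSketch {
    // Assumption: parents whose children form one inferred mrow (MathML 3.0, chapter 3.1.3).
    static final Set<String> INFERRED_MROW = Set.of("math", "msqrt", "mstyle", "merror",
            "mpadded", "mphantom", "menclose", "mtd", "mscarry");

    static String normalize(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Repeat until a pass removes nothing; every change deletes one mrow, so this ends.
            while (removeOnce(doc.getDocumentElement())) {
                // intentionally empty
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process MathML", e);
        }
    }

    /** One post-order pass; returns true if at least one mrow was removed. */
    private static boolean removeOnce(Element element) {
        boolean changed = false;
        for (Element child : childElements(element)) {
            changed |= removeOnce(child);
        }
        if (!element.getTagName().equals("mrow")
                || !(element.getParentNode() instanceof Element)) {
            return changed;
        }
        Element parent = (Element) element.getParentNode();
        List<Element> kids = childElements(element);
        boolean onlyChild = childElements(parent).size() == 1;
        if (kids.size() == 1 || (onlyChild && INFERRED_MROW.contains(parent.getTagName()))) {
            for (Element kid : kids) {
                parent.insertBefore(kid, element); // move the children up one level
            }
            parent.removeChild(element);
            return true;
        }
        return changed;
    }

    private static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) result.add((Element) n);
        }
        return result;
    }
}
```

The mrow holding several children inside mfrac is kept in this sketch, because mfrac requires exactly two children and the grouping is therefore significant.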
DOM module which removes all empty operator elements and/or operator elements containing given operators (specified in the config file). It also normalizes function application written in three formats: an <mi> element with the function name, followed by an <mo> element containing one of the operators specified in the config file, and then followed by either a single element, an <mrow> element, or an <mo> element with the opening bracket (, in which case the function is applied to all elements up to the matching ). The function application is converted to the format: <mi>f</mi><mrow><mo>(</mo> arguments <mo>)</mo></mrow>
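The three function application formats can be handled roughly as in the sketch below. It assumes that the configured operator is U+2061 FUNCTION APPLICATION and that only round brackets are used; the class name FunctionApplicationSketch is hypothetical and org.w3c.dom replaces JDOM2. Removal of empty or configured operators is left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class FunctionApplicationSketch {
    // Assumption: U+2061 FUNCTION APPLICATION is the operator listed in the config file.
    static final String APPLY = "\u2061";

    static String normalize(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            rewrite(doc, doc.getDocumentElement());
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process MathML", e);
        }
    }

    private static void rewrite(Document doc, Element parent) {
        List<Element> kids = childElements(parent); // snapshot taken before any change
        for (int i = 0; i + 2 < kids.size(); i++) {
            Element apply = kids.get(i + 1);
            Element next = kids.get(i + 2);
            if (!kids.get(i).getTagName().equals("mi") || !isMo(apply, APPLY)) continue;
            Element wrapper = doc.createElement("mrow");
            parent.replaceChild(wrapper, apply); // the wrapper takes the operator's place
            int last = i + 2; // index of the last sibling consumed as an argument
            if (isMo(next, "(")) {
                int depth = 0; // move siblings up to and including the matching ")"
                for (int j = i + 2; j < kids.size(); j++) {
                    Element kid = kids.get(j);
                    wrapper.appendChild(kid);
                    last = j;
                    if (isMo(kid, "(")) depth++;
                    else if (isMo(kid, ")") && --depth == 0) break;
                }
            } else {
                List<Element> args = next.getTagName().equals("mrow")
                        ? childElements(next) : List.of(next);
                // Heuristic: an mrow that already starts with "(" and ends with ")" is kept as is.
                boolean bracketed = args.size() >= 2 && isMo(args.get(0), "(")
                        && isMo(args.get(args.size() - 1), ")");
                if (!bracketed) wrapper.appendChild(mo(doc, "("));
                for (Element arg : args) wrapper.appendChild(arg);
                if (!bracketed) wrapper.appendChild(mo(doc, ")"));
                if (next.getParentNode() == parent) parent.removeChild(next); // emptied mrow
            }
            i = last;
        }
        for (Element child : childElements(parent)) rewrite(doc, child);
    }

    private static boolean isMo(Element element, String text) {
        return element.getTagName().equals("mo") && element.getTextContent().trim().equals(text);
    }

    private static Element mo(Document doc, String text) {
        Element mo = doc.createElement("mo");
        mo.setTextContent(text);
        return mo;
    }

    private static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) result.add((Element) n);
        }
        return result;
    }
}
```

The scan runs before the recursion into child elements, so nested applications that end up inside a newly created mrow are still visited.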
DOM module which removes all empty script elements specified in the config file, and replaces script elements specified in the config that have just one child element with that child. It also converts <msubsup> elements and/or <msub> elements with <msup> inside to <msup> elements with <msub> inside.
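A minimal sketch of these script rewrites, assuming the configured script elements are msub, msup and msubsup (a hypothetical list) and that all conversions are switched on. The class name ScriptSketch is hypothetical and org.w3c.dom is used in place of JDOM2.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class ScriptSketch {
    // Hypothetical configuration; the real list comes from the module's config file.
    static final Set<String> SCRIPTS = Set.of("msub", "msup", "msubsup");

    static String normalize(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            rewrite(doc, doc.getDocumentElement());
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process MathML", e);
        }
    }

    private static void rewrite(Document doc, Element element) {
        for (Element child : childElements(element)) rewrite(doc, child);
        Node parent = element.getParentNode();
        String tag = element.getTagName();
        if (!SCRIPTS.contains(tag) || !(parent instanceof Element)) return;
        List<Element> kids = childElements(element);
        if (kids.isEmpty()) {
            parent.removeChild(element); // empty script element
        } else if (kids.size() == 1) {
            parent.insertBefore(kids.get(0), element); // the only child takes its place
            parent.removeChild(element);
        } else if (tag.equals("msubsup") && kids.size() == 3) {
            Element converted = supOverSub(doc, kids.get(0), kids.get(1), kids.get(2));
            parent.replaceChild(converted, element);
        } else if (tag.equals("msub") && kids.size() == 2
                && kids.get(0).getTagName().equals("msup")) {
            List<Element> inner = childElements(kids.get(0)); // base and superscript
            if (inner.size() == 2) {
                Element converted = supOverSub(doc, inner.get(0), kids.get(1), inner.get(1));
                parent.replaceChild(converted, element);
            }
        }
    }

    /** Builds msup(msub(base, sub), sup); the three nodes are moved out of their old parents. */
    private static Element supOverSub(Document doc, Element base, Element sub, Element sup) {
        Element msub = doc.createElement("msub");
        msub.appendChild(base);
        msub.appendChild(sub);
        Element msup = doc.createElement("msup");
        msup.appendChild(msub);
        msup.appendChild(sup);
        return msup;
    }

    private static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) result.add((Element) n);
        }
        return result;
    }
}
```

Both an msubsup and an msub wrapping an msup end up in the same msup-over-msub shape, so equivalent notations compare equal afterwards.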
DOM module removing selected unary operators (e.g. +, -, −, ∓, ∔, ∸, ⊕, ⊖, ⊝, ⊞). This changes the meaning of the input MathML but increases the similarity of related formulae, which is useful for similarity search.
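How an operator is recognized as unary is not described here, so the following sketch makes its own assumption: an mo counts as unary when it is the first child element of its parent or directly follows another mo that is not a closing bracket. The operator set, the class name UnaryOperatorSketch and the use of org.w3c.dom instead of JDOM2 are likewise assumptions for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class UnaryOperatorSketch {
    // Hypothetical configuration: a few of the operators that may be removed.
    static final Set<String> REMOVABLE = Set.of("+", "-", "\u2212");
    static final Set<String> CLOSING = Set.of(")", "]", "}");

    static String strip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            stripChildren(doc.getDocumentElement());
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process MathML", e);
        }
    }

    private static void stripChildren(Element parent) {
        Element previous = null; // last sibling that was kept
        for (Element child : childElements(parent)) {
            stripChildren(child);
            // Assumed heuristic for "unary position", see the lead-in.
            boolean unaryPosition = previous == null
                    || (isMo(previous) && !CLOSING.contains(text(previous)));
            if (isMo(child) && REMOVABLE.contains(text(child)) && unaryPosition) {
                parent.removeChild(child);
            } else {
                previous = child;
            }
        }
    }

    private static boolean isMo(Element element) {
        return element.getTagName().equals("mo");
    }

    private static String text(Element element) {
        return element.getTextContent().trim();
    }

    private static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) result.add((Element) n);
        }
        return result;
    }
}
```

Under this heuristic the binary minus in a - b stays, while the leading sign and the sign after + in -a + -b are both dropped.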