Michal Růžička edited this page Dec 10, 2015 · 8 revisions

Modules

Each module (a class implementing the Module interface) is responsible for a different area of MathML canonicalization. At present there are two types of modules. Stream modules implement the StreamModule interface and perform canonicalization directly on the input stream. DOM modules implement the DOMModule interface and process the document through the Document Object Model; a JDOM2 Document has to be created before they are executed. AbstractModule allows the default configuration to be loaded from a property file. AbstractModuleTest provides an easy way to test modules by comparing the desired and the actual output using XMLUnit.

ElementMinimizer

Stream module that removes elements and attributes that are insignificant for formula searching and indexing, e.g. appearance-altering tags and attributes. Elements can be configured to be removed in one of two ways: the element itself is removed while its children are kept, or the element is removed together with its children.

For attributes, the opposite approach is used: all attributes are removed except those that are explicitly kept. Any element can be configured to have its own attribute whitelist. An attribute can also be marked to be kept only when it has a given exact value.

This module also removes XML comments.

This module is based on information from chapters 3 and 4.3.2 of MathML Version 2.0 (Second Edition) W3C Recommendation.
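
The removal rules described above can be sketched roughly as follows. This is only an illustration, not the module's actual code: it uses the standard org.w3c.dom API instead of stream processing for brevity, and the class name MinimizerSketch and the three configuration sets are hypothetical stand-ins for values that would come from the configuration file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class MinimizerSketch {
    // Hypothetical configuration; the real module reads its lists from a property file.
    static final Set<String> REMOVE_KEEP_CHILDREN = new HashSet<>(Arrays.asList("mstyle", "mpadded"));
    static final Set<String> REMOVE_WITH_CHILDREN = new HashSet<>(Arrays.asList("mspace", "maligngroup"));
    static final Set<String> KEEP_ATTRIBUTES = new HashSet<>(Arrays.asList("mathvariant"));

    // Parses an XML string; checked exceptions are wrapped to keep the example short.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Recursively removes comments, configured elements and non-whitelisted attributes.
    static void minimize(Node node) {
        // Snapshot the children first, because the loop below modifies the child list.
        List<Node> children = new ArrayList<>();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            children.add(c);
        }
        for (Node child : children) {
            if (child.getNodeType() == Node.COMMENT_NODE) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                String name = child.getNodeName();
                if (REMOVE_WITH_CHILDREN.contains(name)) {
                    node.removeChild(child);          // drop the whole subtree
                    continue;
                }
                minimize(child);                      // clean the subtree first
                if (REMOVE_KEEP_CHILDREN.contains(name)) {
                    // Move the children up to the parent, then drop the wrapper element.
                    while (child.getFirstChild() != null) {
                        node.insertBefore(child.getFirstChild(), child);
                    }
                    node.removeChild(child);
                } else {
                    stripAttributes((Element) child);
                }
            }
        }
    }

    // Removes every attribute that is not on the whitelist.
    static void stripAttributes(Element element) {
        NamedNodeMap attrs = element.getAttributes();
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            String attrName = attrs.item(i).getNodeName();
            if (!KEEP_ATTRIBUTES.contains(attrName)) {
                element.removeAttribute(attrName);
            }
        }
    }
}
```

Calling MinimizerSketch.minimize(doc) on a parsed document processes the whole tree in place. A per-element whitelist or a whitelist of exact attribute values would replace the single KEEP_ATTRIBUTES set with a map keyed by element name.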

MfencedReplacer

DOM module that replaces all occurrences of mfenced elements. Fenced formulae are converted to mrow elements containing the delimiters and separators (taken from the mfenced attributes) in mo elements. The inner content is placed into another mrow element. The module can be configured not to add the outer and inner mrow elements, and you can specify your own fixed or default parentheses and separators for fenced expressions.
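
As an illustration of the conversion, here is a minimal sketch using the standard org.w3c.dom API (the module itself works on JDOM2 documents). The class name MfencedSketch is hypothetical. The defaults "(", ")" and "," follow the MathML definition of mfenced. The sketch always adds both mrow elements, ignores the configuration options, and treats each separator as a single Java char.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MfencedSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Replaces every mfenced element in the document.
    static void replaceAll(Document doc) {
        NodeList live = doc.getElementsByTagName("mfenced");
        // The NodeList is live, so copy it before modifying the tree.
        List<Element> fenced = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            fenced.add((Element) live.item(i));
        }
        for (Element element : fenced) {
            replace(element);
        }
    }

    // Turns one mfenced into mrow(mo open, mrow(children with separators), mo close).
    static void replace(Element fenced) {
        Document doc = fenced.getOwnerDocument();
        String open = fenced.hasAttribute("open") ? fenced.getAttribute("open") : "(";
        String close = fenced.hasAttribute("close") ? fenced.getAttribute("close") : ")";
        String seps = fenced.hasAttribute("separators")
                ? fenced.getAttribute("separators").replaceAll("\\s", "") : ",";

        Element inner = doc.createElement("mrow");
        int index = 0;
        while (fenced.getFirstChild() != null) {
            Node child = fenced.getFirstChild();
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                fenced.removeChild(child);            // skip whitespace and comments
                continue;
            }
            if (index > 0 && !seps.isEmpty()) {
                // If there are too few separators, the last one is repeated.
                int s = Math.min(index - 1, seps.length() - 1);
                inner.appendChild(mo(doc, String.valueOf(seps.charAt(s))));
            }
            inner.appendChild(child);                 // moves the child out of mfenced
            index++;
        }
        Element outer = doc.createElement("mrow");
        outer.appendChild(mo(doc, open));
        outer.appendChild(inner);
        outer.appendChild(mo(doc, close));
        fenced.getParentNode().replaceChild(outer, fenced);
    }

    static Element mo(Document doc, String text) {
        Element mo = doc.createElement("mo");
        mo.setTextContent(text);
        return mo;
    }
}
```

With this sketch, an mfenced with open="[" and close="]" around a and b becomes an mrow holding [, an inner mrow with a , b, and ].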

MrowNormalizer

DOM module that removes unnecessary mrow elements and also adds mrow elements around detected parentheses so that the result matches the output of MfencedReplacer. The operators to be detected as parentheses can be configured, as can whether the detected parentheses and/or their content are wrapped in mrow. mrow elements are removed according to their parent's required child count as specified in chapter 3.1.3 of the MathML Version 3.0 W3C Recommendation, but this can also be configured.
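
The removal part can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's actual logic: it uses org.w3c.dom, the class name MrowSketch is hypothetical, and the parent set is an illustrative subset of the elements that, per chapter 3.1.3, accept any number of children through an inferred mrow. An mrow is unwrapped when it has exactly one child element or when its parent is in that set. Parenthesis detection and wrapping are not shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class MrowSketch {
    // Illustrative subset of parents that treat their children as an inferred mrow.
    static final Set<String> INFERRED_MROW_PARENTS = new HashSet<>(Arrays.asList(
            "math", "msqrt", "mstyle", "merror", "mpadded", "mphantom", "menclose", "mtd"));

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static List<Element> elementChildren(Node parent) {
        List<Element> result = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                result.add((Element) c);
            }
        }
        return result;
    }

    // Depth-first: children are normalized before the element itself is examined.
    static void normalize(Element element) {
        for (Element child : elementChildren(element)) {
            normalize(child);
        }
        Node parent = element.getParentNode();
        if (!"mrow".equals(element.getNodeName()) || !(parent instanceof Element)) {
            return;
        }
        // Recount, since nested mrow elements may have been unwrapped above.
        boolean singleChild = elementChildren(element).size() == 1;
        boolean parentTakesAnyCount = INFERRED_MROW_PARENTS.contains(parent.getNodeName());
        if (singleChild || parentTakesAnyCount) {
            // Move the children up, then remove the now empty mrow.
            while (element.getFirstChild() != null) {
                parent.insertBefore(element.getFirstChild(), element);
            }
            parent.removeChild(element);
        }
    }
}
```

In the sketch, an mrow with several children directly inside mfrac is kept, because mfrac requires exactly two children and unwrapping would change the child count.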

OperatorNormalizer

DOM module that removes all empty operator elements and/or operator elements containing given operators (specified in the config file). It also normalizes function application written in any of three formats. In each of them, an <mi> element with the function name is followed by an <mo> element containing one of the operators specified in the config file, which is in turn followed by either a single element, an <mrow> element, or an <mo> element with the opening bracket (, in which case the function is applied to all elements up to the matching ). The function application is converted to the format: <mi>f</mi><mrow><mo>(</mo> arguments <mo>)</mo></mrow>

ScriptNormalizer

DOM module that removes all empty script elements specified in the config file. Script elements specified in the config file that have just one child element are also removed, and that child is put in their place. The module also converts <msubsup> elements and/or <msub> elements with <msup> inside to <msup> elements with <msub> inside.
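
The script conversion can be sketched as below. This is a minimal illustration using org.w3c.dom rather than JDOM2. The class name ScriptSketch is hypothetical, and the removal of empty and single-child script elements is not shown. Both input forms end up as <msup><msub>base sub</msub> sup</msup>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ScriptSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Copies a live NodeList so the tree can be modified while iterating.
    static List<Element> snapshot(NodeList live) {
        List<Element> result = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            result.add((Element) live.item(i));
        }
        return result;
    }

    static List<Element> elementChildren(Node parent) {
        List<Element> result = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                result.add((Element) c);
            }
        }
        return result;
    }

    static void normalize(Document doc) {
        // msubsup(base, sub, sup) -> msup(msub(base, sub), sup)
        for (Element e : snapshot(doc.getElementsByTagName("msubsup"))) {
            List<Element> kids = elementChildren(e);
            if (kids.size() == 3) {                   // leave malformed input untouched
                replaceWithNested(e, kids.get(0), kids.get(1), kids.get(2));
            }
        }
        // msub(msup(base, sup), sub) -> msup(msub(base, sub), sup)
        for (Element e : snapshot(doc.getElementsByTagName("msub"))) {
            List<Element> kids = elementChildren(e);
            if (kids.size() != 2 || !"msup".equals(kids.get(0).getNodeName())) {
                continue;
            }
            List<Element> inner = elementChildren(kids.get(0));
            if (inner.size() == 2) {
                replaceWithNested(e, inner.get(0), kids.get(1), inner.get(1));
            }
        }
    }

    // Builds msup(msub(base, sub), sup) and puts it in place of the old element.
    static void replaceWithNested(Element old, Element base, Element sub, Element sup) {
        Document doc = old.getOwnerDocument();
        Element msub = doc.createElement("msub");
        msub.appendChild(base);                       // appendChild moves the node
        msub.appendChild(sub);
        Element msup = doc.createElement("msup");
        msup.appendChild(msub);
        msup.appendChild(sup);
        old.getParentNode().replaceChild(msup, old);
    }
}
```

Mapping both notations onto one nesting order means that formulae differing only in how the scripts were written compare as equal after canonicalization.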

UnaryOperatorRemover

DOM module that removes selected unary operators (e.g. +, -, −, ∓, ∔, ∸, ⊕, ⊖, ⊝, ⊞). This modifies the meaning of the input MathML but makes different formulae more similar to each other, which is useful for similarity search.
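
A minimal sketch of the idea follows, written with org.w3c.dom. How the module decides that an operator is unary is not described here, so the rule used below is an assumption: an mo counts as unary when it is the first child element of its parent and is followed by at least one more element. The class name UnarySketch and the shortened operator set are hypothetical as well.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UnarySketch {
    // Hypothetical, shortened operator list: plus, hyphen-minus, minus sign, minus-or-plus.
    static final Set<String> OPERATORS = new HashSet<>(Arrays.asList("+", "-", "\u2212", "\u2213"));

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Returns true if an element sibling exists in the given direction.
    static boolean hasElementSibling(Node node, boolean forward) {
        Node n = forward ? node.getNextSibling() : node.getPreviousSibling();
        while (n != null) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
            n = forward ? n.getNextSibling() : n.getPreviousSibling();
        }
        return false;
    }

    static void removeUnary(Document doc) {
        NodeList live = doc.getElementsByTagName("mo");
        // Copy the live list before removing nodes from the tree.
        List<Element> operators = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            operators.add((Element) live.item(i));
        }
        for (Element mo : operators) {
            if (!OPERATORS.contains(mo.getTextContent().trim())) {
                continue;
            }
            // Assumed rule: unary means "leading operator with an operand after it".
            boolean leading = !hasElementSibling(mo, false);
            boolean hasOperand = hasElementSibling(mo, true);
            if (leading && hasOperand) {
                mo.getParentNode().removeChild(mo);
            }
        }
    }
}
```

Under this rule, the leading minus in -a+b is dropped while the binary + is kept, so -a+b and a+b canonicalize to the same tree.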