Kitty edited this page Nov 10, 2021 · 23 revisions

MathMLCan

MathML Canonicalizer is a tool that canonicalizes MathML expressions. The original version can be found here.

Introduction

The goal of this project is to create a Java application that canonicalizes mathematical expressions written in MathML (Mathematical Markup Language).

The output should be the canonical form of the given MathML document. This canonical form makes it easy to decide whether two differently written MathML formulae represent the same expression, and it can also be used by MathML search and comparison engines.
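To illustrate why a canonical form simplifies comparison, here is a minimal sketch that does not use MathMLCan itself. It assumes a toy canonicalization (drop whitespace-only text nodes, unwrap attribute-free `<mrow>` elements that contain a single element) and then compares the resulting DOM trees. The class and method names (`CanonicalCompareSketch`, `toyCanonicalize`, `sameCanonicalForm`) are hypothetical, and the rules MathMLCan really applies are more extensive.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical illustration only; this is not part of MathMLCan. */
final class CanonicalCompareSketch {

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /**
     * Toy canonicalization (an assumption, not MathMLCan's rule set):
     * removes whitespace-only text nodes and unwraps attribute-free
     * mrow elements that hold exactly one element child.
     */
    static void toyCanonicalize(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before mutating
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                toyCanonicalize(child); // bottom-up: normalize children first
                if ("mrow".equals(child.getLocalName())
                        && !child.hasAttributes()
                        && child.getChildNodes().getLength() == 1
                        && child.getFirstChild().getNodeType() == Node.ELEMENT_NODE) {
                    node.replaceChild(child.getFirstChild(), child);
                }
            }
            child = next;
        }
    }

    /** Two inputs count as the same if their toy canonical DOM trees are equal. */
    static boolean sameCanonicalForm(String a, String b) throws Exception {
        Document da = parse(a);
        Document db = parse(b);
        toyCanonicalize(da);
        toyCanonicalize(db);
        return da.getDocumentElement().isEqualNode(db.getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        String ns = "http://www.w3.org/1998/Math/MathML";
        String wrapped = "<math xmlns=\"" + ns + "\">\n  <mrow><mi>x</mi></mrow>\n</math>";
        String plain = "<math xmlns=\"" + ns + "\"><mi>x</mi></math>";
        System.out.println(sameCanonicalForm(wrapped, plain));
    }
}
```

Once both inputs are normalized, the comparison is a plain structural equality check, which is the property a search or comparison engine would rely on.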

Architecture

The canonicalizer's functionality is divided into modules. The MathMLCanonicalizer class can be initialized from an XML configuration, manually by adding initialized modules, or with the default settings stored in property files. It then takes an input stream containing a MathML document and produces a canonicalized output stream. The Settings class provides static helper methods and loads global settings. MathMLCanonicalizerCommandLineTool is the runnable class that connects the canonicalizer to the command-line interface.
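The module-based design described above can be sketched roughly as follows. This is an illustration built only on the JDK's XML APIs, not MathMLCan's actual code: `SketchCanonicalizer`, `SketchModule`, `addModule`, and the comment-removing toy module are all hypothetical names chosen for this example, and the real class signatures may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

/** Hypothetical stand-in for a modular canonicalizer; not the MathMLCan class. */
final class SketchCanonicalizer {
    private final List<SketchModule> modules = new ArrayList<>();

    /** Manual initialization: modules run in the order they are added. */
    SketchCanonicalizer addModule(SketchModule module) {
        modules.add(module);
        return this;
    }

    /** Parses the input stream, applies every module, serializes the result. */
    void canonicalize(InputStream in, OutputStream out) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(in);
        for (SketchModule module : modules) {
            module.execute(doc);
        }
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        serializer.transform(new DOMSource(doc), new StreamResult(out));
    }

    /** Toy module (an assumption): removes every XML comment below the node. */
    static void removeComments(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.COMMENT_NODE) {
                node.removeChild(child);
            } else {
                removeComments(child);
            }
            child = next;
        }
    }

    /** Convenience wrapper for the demo: string in, string out. */
    static String demo(String xml) throws Exception {
        SketchCanonicalizer canonicalizer = new SketchCanonicalizer()
                .addModule(SketchCanonicalizer::removeComments);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        canonicalizer.canonicalize(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), out);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo("<math><!-- draft --><mi>x</mi></math>"));
    }
}

/** Hypothetical module contract: each module rewrites the DOM in place. */
interface SketchModule {
    void execute(Document doc);
}
```

Keeping each transformation behind a small interface lets modules be configured, reordered, and tested independently, which matches the per-module tests listed under Contributors.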

How to build the project

The project is built as a Maven project from the root directory:

mvn clean install

The executable .jar file is located in the target directory.

Invocation

Usage:
	java -jar mathml-canonicalizer.jar [ -c </path/to/config.xml> ] [ -w ] [ -d ] </path/to/input>...
	java -jar mathml-canonicalizer.jar -p | --print-default-config-file
	java -jar mathml-canonicalizer.jar -h | --help

NB: </path/to/input> is /path/to/file.xhtml or /path/to/directory

Options:
        -c,--config-file <arg>                  Load configuration file.
        -d,--inject-xhtml-mathml-svg-dtd        Enforce injection of XHTML 1.1
                                                plus MathML 2.0 plus SVG 1.1 DTD
                                                reference into input documents.
        -h,--help                               Print help (this screen).
        -p,--print-default-config-file          Print default configuration that
                                                will be used if no config file
                                                is supplied.
        -w,--overwrite-inputs                   Overwrite input files by
                                                produced canonical outputs.

File encoding on Windows

On Windows, the file encoding defaults to a legacy encoding specific to the system language (typically single-byte). To ensure the JVM uses UTF-8, start it with the command-line argument -Dfile.encoding=UTF-8:

java -Dfile.encoding=UTF-8 -jar mathml-canonicalizer.jar

However, be aware that the default Windows command-line shell handles Unicode poorly in its default configuration. Try the Lucida Console font together with the appropriate shell code page, set via the chcp 65001 command.
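For code that reads or writes MathML files directly, the same concern can be avoided by naming the charset explicitly instead of relying on the JVM default. The following sketch is a general Java illustration, not MathMLCan code; `EncodingSketch` and `roundTrip` are hypothetical names. It prints the default charset the JVM picked up and round-trips non-ASCII MathML content through a temporary file using explicit UTF-8. On JDK 18 and later the default charset is UTF-8 on all platforms, so the flag mainly matters on older JVMs.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration only; this is not part of MathMLCan. */
final class EncodingSketch {

    /** Writes and re-reads the text with an explicit UTF-8 charset. */
    static String roundTrip(String text) throws IOException {
        Path tmp = Files.createTempFile("mathml-encoding-sketch", ".xml");
        try {
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // Reflects -Dfile.encoding on older JVMs; UTF-8 by default on JDK 18+.
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // Unicode escapes keep this source file ASCII-only:
        // Greek alpha, less-than-or-equal sign, Greek beta.
        String formula = "<mi>\u03B1</mi><mo>\u2264</mo><mi>\u03B2</mi>";
        System.out.println("Round trip preserved content: "
                + formula.equals(roundTrip(formula)));
    }
}
```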

Contributors

  • Michal Růžička
  • David Formánek
    • architecture
    • class for module testing and some tests
    • MfencedReplacer tests and implementation
  • Jakub Adler
    • MrowNormalizer tests and implementation
  • Jaroslav Dufek
    • OperatorNormalizer tests and implementation
    • ScriptNormalizer tests and implementation
  • Robert Šiška
    • XML properties loading
    • CLI and GUI
    • ElementMinimizer improvements

Licence

MathMLCan's code is licensed under the terms of the Apache License, Version 2.0.