Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrote the LaTeX parser to be more flexible #10

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

Happypig375
Copy link
Collaborator

No description provided.

@Happypig375
Copy link
Collaborator Author

If no more review comments, then I'm merging.

@charlesroddie
Copy link
Owner

Only had time for a very brief review. Can look again this weekend.

@charlesroddie
Copy link
Owner

This comment here as it may get some discussion:
toAtom gives a Result<MathAtom, string>
Should it give this or a MathAtom option?

The former will have more performant processing of error cases, since they are not exceptional. But it will need more work to maintain, as results are harder to compose. It will also be slower in the happy case, especially if computation expressions are needed to deal with the complexity of Results.

The alternative is an option, where failure to find things gives a None, and where invalid syntax gives an exception.

I think invalid syntax can be treated as exceptional, so we can make our lives easier by avoiding Result. Any other opinions?

@charlesroddie
Copy link
Owner

In general it's good that you have got something working but a lot of simplification and explanation will be needed. I wasn't able to discover the strategy used for parsing LaTeX because it was hidden behind functions with highly obscure type signatures.

Names should be informative and their should be xml docs but it's only possible to name and describe objects once they have been simplified enough to be comprehensible.

I will try to do some constructive work in this direction this weekend.

@charlesroddie
Copy link
Owner

Possibly useful?
https://github.com/charlesroddie/StyleGuide

@Happypig375
Copy link
Collaborator Author

But it will need more work to maintain, as results are harder to compose.

Result.map and Result.bind can be used.

It will also be slower in the happy case, especially if computation expressions are needed to deal with the complexity of Results.

I read about the slowness of computation expressions. I wouldn't use them for Results.

In general it's good that you have got something working but a lot of simplification and explanation will be needed. I wasn't able to discover the strategy used for parsing LaTeX because it was hidden behind functions with highly obscure type signatures.

Will simplify the code a bit.

@charlesroddie
Copy link
Owner

I have thought more about Results and also considered a simple issue in another repo.
The alternatives are:

  • Return results locally during parsing, catching errors.
  • Throwing exceptions for a parsing failure and catching them in the main parse function (generating a None or a Result.Error).

I believe exceptions is better, because the inner functions are simplified to have types which return 'a instead of types which return Option<'a> or Result<'a,'e>. Of course in .Net exceptions should only be used for exceptional data.

A parsing function routine which for example looks for a certain pattern should return None if it fails to find the pattern; these are not exceptional.

A parsing function which is used when a certain pattern is known to be implied should throw an exception on failure. For example after seeing \color the algorithm is allowed to find { and } and fail if it doesn't find them.

The global algorithm can catch failures and either return Error("invalid LaTeX") or fail with an InvalidLatexException.

Copy link
Owner

@charlesroddie charlesroddie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

MathDisplay/DataTypes/AliasDictionary.fs Outdated Show resolved Hide resolved
loop []

let mapFoldResult f (state:'State) (list: 'T list) : Result<'Result list * 'State, 'Error> =
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs XML comment.

}


[<Struct>] type internal Read = Until of char | UntilRightDelimiter | OneArgument | All
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XML doc for this type?

MathDisplay/DataTypes/AliasDictionary.fs Outdated Show resolved Hide resolved
MathDisplay/MathAtom/LaTeX.fs Outdated Show resolved Hide resolved
let (|CommandName|) cs = match cs with
| c::cs when not <| isAlphabet c -> string c, cs
| PartitionAlphabets (ab, cs) -> System.String.Concat ab, skipSpaces cs
let rec read tableEnv until cs list =
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type signature of read is very complex. From the name and the type signature (and failing that an xml doc) you should be able to work our what any function does.

MathDisplay/MathAtom/LaTeX_Old.fs Show resolved Hide resolved
| PartitionAlphabets (ab, cs) -> System.String.Concat ab, skipSpaces cs
let rec read tableEnv until cs list =
///Reads a delimiter
let readDelimiter cs =
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is almost comprehensible.
Why does it return a Delimiter * char list instead of just a Delimiter?
If it returned a Delimiter then the type signature and name would indicate what it does.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It takes a list of to-be-read characters, slices out a delimiter, returns it along with the to-be-read characters after the delimiter.

| '['::cs -> readArgUntil' readOptionalArgUntil id (Until ']') argDict argDict.AddOptional tableEnv cs
| _ -> Ok (tableEnv, ValueNone, cs)

let rec replaceArguments atom state =
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaceArguments is obviously a key function but has an obfuscated type signature. The same goes for the inner functions replace1 and replace2.

MathDisplay.Tests.Benchmarks/AliasMap.fs Show resolved Hide resolved
@charlesroddie
Copy link
Owner

I've resolved the AliasDictionary issues (simplification and documentation).
Can move to LaTeX.fs now. Would appreciate documentation or simplification explanation of the methods so I can get started with this @Happypig375 . Thanks.

@Happypig375
Copy link
Collaborator Author

Now AliasDictionary, as a collection type, does not implement normal collection interfaces. Like MathCollection, it will surprise the user when trying to use AliasDictionary in functions taking e.g. IEnumerable<T>.

@charlesroddie
Copy link
Owner

charlesroddie commented Mar 3, 2019

Why does a user need to interact with an alias dictionary? If there is a need for extension in future it can be done but it is comprehensible now, as simple as possible for what it is doing, documented, and definitely provides the functionality we are using at the moment (or else the project wouldn't compile).

@Happypig375
Copy link
Collaborator Author

Happypig375 commented Mar 3, 2019

For example, LaTeXDefaultMaps.Commands.Where(command => command.Key.StartsWith("a")) does not compile because AliasDictionary does not implement IEnumerable<T>.

@charlesroddie
Copy link
Owner

charlesroddie commented Mar 3, 2019

I removed as much as possible while keeping the main project buildable. Ok do do one or two extra things like expose the keys or remove private on the dictionaries as needed in future. But we shouldn't add them until then. It's important to keep good discipline. I think the file went from 100 lines to 50. Maybe in future it goes back up to 55 if we need something extra.

@Happypig375
Copy link
Collaborator Author

Happypig375 commented Mar 3, 2019

While I agree on YANGI in general, I think all collection types should at least act like collections, have collection methods, and implement collection interfaces.

Also, why make Delimiter a DU which limits its possible values? It also kind of repeats information in LaTeXDefaultMaps.Delimiters.

@charlesroddie
Copy link
Owner

charlesroddie commented Mar 3, 2019

I think all collection types should at least act like collections, have collection methods, and implement collection interfaces

Not all interfaces; otherwise it gets too large to maintain (as the previous code was). We should consider this an internal type, and add these things as needed. The correct API if we do add them is not obvious because there are two dictionaries. For example member __.Aliases = d.Keys is probably OK (I have added this back as it should be useful for parsing) but should we have member __.Values = d.Values or member __.Values = e.Keys; these actually have different, and rather obscure, types. The original code always chose the dictionary from aliases to values when doing the various additional properties but that is not a clear decision. Maybe casting both dictionaries to IReadOnlyDictionary might give us everything we could possibly need in future. But no need to spend time on that (writing this conversion, code review, documentation) at the moment. I have limited time for code review and documentation and would rather spend this on important stuff like the parser than on methods that don't end up getting used.

Also, why make Delimiter a DU which limits its possible values

There are a limited number of delimiters so representing these as a limited number of values is better. Easier to understand code with a finite number of values than infinite number of possible values of which most are invalid. There is the option of adding an Delimiter.UserDefined of string case if absolutely necessary (much) later.

@charlesroddie
Copy link
Owner

You had enumerated all of the LaTeX delimiters:
image

@charlesroddie
Copy link
Owner

charlesroddie commented Mar 3, 2019

This idea of yours is interesting:

let Commands =
    [   ["frac"], Fraction (Argument 1, Argument 2, Center, Center, ValueNone)
        ["sqrt"], Radical (Argument_Optional (1, Row []), Argument 1)
    |> AliasDictionary

While we may not be able to do exactly this (since it requires usage of an undisplayable internal Latex-specific Argument atom), we could adjust it like this:

[<Struct>][<RequireQualifiedAccess>]
type Command =
    | Frac
    /// A square root or nth root
    | Sqrt
    | Color

let Commands =
    [   ["frac"], Command.Frac
        ["sqrt"], Command.Sqrt
        ["color"], Command.Color]
    |> AliasDictionary

let parseColor:string -> System.Drawing.Color = failwith "not implemented"

let rec parseCommand(command:Command,
                     optionals:string[],
                     args:string[],
                     parseAtom:string->MathAtom) =
    let require (n:int) =
        if args.Length <> n
        then failwith <| (string n) + " inputs required by " + (Commands.GetPrimaryAlias command)
    let maxOptional (n:int) =
        if optionals.Length > n
        then failwith <| "At most " + (string n) + " optional inputs allowed in " + (Commands.GetPrimaryAlias command)
    let getOptional n =
        if n < optionals.Length then Some(optionals.[n]) else None
    match command with
    | Command.Frac ->
        require 2
        Fraction(parseAtom args.[0], parseAtom args.[1], ValueNone, ValueNone, ValueNone)
    | Command.Sqrt ->
        require 1
        maxOptional 1
        Radical(getOptional 0 |> Option.map parseAtom, parseAtom args.[0])
    | Command.Color ->
        require 2
        Colored(parseAtom args.[1], parseColor args.[0])

This also works with things to be parsed that are not Atoms (e.g. Color as shown).

@Happypig375
Copy link
Collaborator Author

Happypig375 commented Mar 4, 2019

It gets complicated when we are going to implement "\binom" or "\TeX" which give atoms with more complicated structures. These two commands are both present in CSharpMath.

["binom"], Delimited (Delimiter.LeftBracket, Frac (Argument 1, Argument 2, Center, Center, ValueNone), Delimiter.RightBracket)
["TeX"], Styled (Row [Ordinary "T", Space -.1667, Offsetted (Ordinary "E", 0, -4.5), Space -.125, Ordinary "X"], FontStyles.Roman)

@Happypig375
Copy link
Collaborator Author

Regarding Result vs Exception:

Method | Mean | Error | StdDev |
---------------- |-------------:|------------:|------------:|
Ok_Result | 46.72 ns | 0.2317 ns | 0.2167 ns |
Ok_Exception | 17.36 ns | 0.1735 ns | 0.1623 ns |
Error_Result | 37.50 ns | 0.1186 ns | 0.1110 ns |
Error_Exception | 39,582.16 ns | 204.3940 ns | 181.1899 ns |

While Result is 2.7 times slower than Exception when the result is Ok, Exception is 1055.5 times slower than Result when the result is Error.

@Happypig375
Copy link
Collaborator Author

This will probably be very noticeable when using the Example app which displays user-inputted valid/invalid LaTeX real-time.

@charlesroddie
Copy link
Owner

It gets complicated when we are going to implement "\binom" or "\TeX" which give atoms with more complicated structures. These two commands are both present in CSharpMath.

Those examples easy with the draft scheme I gave.

    | Command.Binom ->
        require 2
        Delimited (Delimiter.LeftBracket, Frac (parseAtom args.[0], parseAtom args.[1], Center, Center, ValueNone), Delimiter.RightBracket)
    | Command.TeX ->
        require 0
        Styled (Row [Ordinary "T", Space -.1667, Offsetted (Ordinary "E", 0, -4.5), Space -.125, Ordinary "X"], FontStyles.Roman)

require may not be the best way to do it. The best structure may depend on the answer to an important question:

Is it possible to tokenize a LaTeX string (split it into a list of commands, delimiters, ordinaries...) before implementing any logic?

@Happypig375
Copy link
Collaborator Author

Now we need to define new Commands if we are going to add new ones. That goes against the point of a Commands dictionary, which is to reduce duplicate information.

Is it possible to tokenize a LaTeX string (split it into a list of commands, delimiters, ordinaries...) before implementing any logic?

That is the purpose of LaTeX.ToAtom instead of drawing them directly.

@charlesroddie
Copy link
Owner

That goes against the point of a Commands dictionary
A commands dictionary containing code is no longer a two-way dictionary. (The current code in the PR at this point has a two way dictionary but the second direction won't be usable or useful.)

The purpose of AliasDictionary is to convert multiple alias strings into the things they represent and to give the primary alias of a represented object.

A one-way dictionary is OK from commands to behavior, which is defined on all commands, is better known/represented as a function.

That is the purpose of LaTeX.ToAtom

This does more than lexing.

@Happypig375
Copy link
Collaborator Author

A one-way dictionary is OK from commands to behavior, which is defined on all commands, is better known/represented as a function.

What I'm trying to recreate with Commands is two-way pattern matching.

This does more than lexing.

It's just some sort of pattern matching that converts between LaTeX strings and MathAtoms.

@charlesroddie
Copy link
Owner

Some miscommunication. I am asking whether lexing of LaTeX is possible.

@Happypig375
Copy link
Collaborator Author

Lexing itself is possible, just that I don't see a reason for the extra step.

@charlesroddie
Copy link
Owner

charlesroddie commented Mar 9, 2019

Two way pattern matching: I'm skeptical that will work. I think we need to get this repo into a clean state with comprehensible code and any attempts like that can branch off as experiments. Don't want to tackle extremely difficult stuff before we have good basics in place.

@charlesroddie
Copy link
Owner

I think I will need to implement a simple parser.

@charlesroddie
Copy link
Owner

Serialization of some simple mathatoms into latex will have some utility. And some sharing of data, especially names and aliases, could be shared between parser and serializer. But it's important not to couple parsing to a complex system to create a difficult coupling between parser and serializer and mathatom itself. That will prevent getting master into any sort of reasonable shape in the near future.

@charlesroddie
Copy link
Owner

I would like this pr to go into master which means keeping it clean. I will write a parser but am travelling for next 2 weeks (including HK, could meet there). Anything to do with Serialization (beyond aliases and a list of commands) shouldn't go in this pr. Once we have a good pr to merge then master will be in a good state to branch off from with feature branches.

@Happypig375
Copy link
Collaborator Author

When will you be in HK?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants