Powers #36

Open
sashaalesin opened this issue May 7, 2020 · 7 comments

Comments

@sashaalesin

>>> latex = 'e^{2x}'
>>> LatexNodes2Text().latex_to_text(latex)
'e^2x'

How can I get e^(2x)?

Thanks.

@phfaist
Owner

phfaist commented May 8, 2020

Hi, thanks for the issue. I've been thinking a bit about how to handle powers and subscripts but there are some edge cases I want to make sure that I can polish before providing support for these out of the box. Here is a simple solution you can use in the meantime:

from pylatexenc import macrospec, latexwalker, latex2text

# define ^/_ for the parser as accepting a mandatory argument
lwc = latexwalker.get_default_latex_context_db()
lwc.add_context_category('powers', specials=[
    macrospec.SpecialsSpec('^', args_parser=macrospec.MacroStandardArgsParser('{')),
    macrospec.SpecialsSpec('_', args_parser=macrospec.MacroStandardArgsParser('{')),
])

# define the replacement string for ^
l2tc = latex2text.get_default_latex_context_db()
l2tc.add_context_category('powers', specials=[
    latex2text.SpecialsTextSpec('^', simplify_repl='^(%s)'),
    latex2text.SpecialsTextSpec('_', simplify_repl='_(%s)'),
])

latex_text = r'e^{x_1}'

lw = latexwalker.LatexWalker(latex_text, latex_context=lwc)
l2t = latex2text.LatexNodes2Text(latex_context=l2tc)

print(l2t.nodelist_to_text(lw.get_latex_nodes()[0]))
# e^(x_(1))

The above code has the following caveats:

  • In LaTeX the syntax X_\mathrm{min} can be used instead of X_{\mathrm{min}}. With my above snippet, pylatexenc will not recognize the first syntax because it will parse the argument of _ as a macro argument, here capturing only the token \mathrm without {min}.

  • If you have underscores outside math mode, e.g., See \url{example.com/some_page}, then with my above snippet pylatexenc will attempt to parse the _ as a math subscript (whereas in LaTeX the \url command does some catcode magic to avoid that).

I'm currently thinking about some ideas for ways to avoid these caveats, in order to support ^ and _ by default in pylatexenc, but I haven't found the time to work these out fully yet.
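As a stopgap for the first caveat, the input can be preprocessed so that a bare `_\mathrm{min}` becomes `_{\mathrm{min}}` before it reaches the parser. The sketch below is not part of pylatexenc: the helper name `brace_script_args` and the macro whitelist `ONE_ARG_MACROS` are hypothetical, and the regular expression only handles arguments without nested braces.

```python
import re

# Hypothetical whitelist: macros assumed to take exactly one braced argument.
ONE_ARG_MACROS = ("mathrm", "mathbf", "mathit", "mathsf")

# Matches ^ or _ followed directly by a whitelisted macro and one brace group
# that contains no nested braces.
_BARE_SCRIPT = re.compile(
    r"([_^])\s*(\\(?:%s)\s*\{[^{}]*\})" % "|".join(ONE_ARG_MACROS)
)

def brace_script_args(latex):
    """Wrap bare `_\\macro{arg}` / `^\\macro{arg}` in an extra pair of braces."""
    return _BARE_SCRIPT.sub(r"\1{\2}", latex)

print(brace_script_args(r"X_\mathrm{min}"))    # X_{\mathrm{min}}
print(brace_script_args(r"X_{\mathrm{min}}"))  # already braced, left unchanged
```

The whitelist is deliberate: wrapping any macro followed by a brace group would mis-handle input such as `a_\alpha{b}`, where `{b}` is not an argument of `\alpha`. The second caveat (underscores outside math mode) is not addressed by this sketch.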

@sashaalesin
Author

Thank you!
I need to use pylatexenc only for parsing math formulas, so it's good for me :)

@phfaist
Owner

phfaist commented Jun 22, 2020

Hi @nemeer, I think your issue is an inherent limitation of Unicode: as far as I know, there is no way to represent M_\odot in Unicode. (Only a very specific subset of characters can be set in subscript/superscript in Unicode, as far as I can tell from "Unicode subscripts and superscripts" on Wikipedia.) In my snippet above, I used the text replacement "M_(⊙)", which is the closest I could think of. You can change the replacement text to whatever you like via the simplify_repl='^(%s)' arguments of the SpecialsTextSpec(...) calls. You can specify a function, too, to implement more complicated logic; see the docs. Hope this helps.

@nemeer

nemeer commented Jun 23, 2020

Thanks a lot for the update.

@Konfekt

Konfekt commented Aug 18, 2021

Could this special subset then be replaced?

@hckiang

hckiang commented Aug 22, 2021

+1! Having θᵢ would be nice

@phfaist
Owner

phfaist commented Aug 29, 2021

Thanks for the pointer to the list of Unicode sub/superscripts. For the reasons that were mentioned above, it isn't straightforward to implement this. I'm thinking about some upgrades to how LaTeX gets converted to Unicode text, and I'll try to integrate Unicode super/subscripts as much as possible. (Plus, there would be additional design decisions, e.g., what should happen to subscripts where not all characters have a Unicode subscript variant, such as $\theta_{i,j*k}$?)
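One possible answer to that design question is an all-or-nothing rule: use Unicode sub/superscript characters only when every character of the argument has a variant, and otherwise fall back to the `_(...)` / `^(...)` form used earlier in this thread. The sketch below is plain Python and independent of pylatexenc; the names `SUBSCRIPTS`, `SUPERSCRIPTS` and `script_repl` are hypothetical, and the tables cover only a subset of the available characters. Hooking it into a `simplify_repl` callable is left to the pylatexenc docs.

```python
# Partial tables of Unicode subscript/superscript variants (not exhaustive).
SUBSCRIPTS = dict(zip("0123456789+-=()", "₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎"))
SUBSCRIPTS.update(zip("aehijklmnoprstuvx", "ₐₑₕᵢⱼₖₗₘₙₒₚᵣₛₜᵤᵥₓ"))

SUPERSCRIPTS = dict(zip("0123456789+-=()ni", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿⁱ"))

def script_repl(text, table, fallback):
    """Return `text` in Unicode script characters if every character has a
    variant in `table`; otherwise return `fallback % text`."""
    if text and all(ch in table for ch in text):
        return "".join(table[ch] for ch in text)
    return fallback % text

print("θ" + script_repl("i", SUBSCRIPTS, "_(%s)"))      # θᵢ
print("θ" + script_repl("i,j*k", SUBSCRIPTS, "_(%s)"))  # θ_(i,j*k)
print("e" + script_repl("2x", SUPERSCRIPTS, "^(%s)"))   # e^(2x)
```

Mixing styles within one subscript (for example converting `i` but not `,`) would arguably be harder to read than the plain fallback, which is why this sketch never converts partially.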


5 participants