Add ability to read math in PDF documents (#17276)
Fixes #9288

Summary of the issue:
PDF 2.0 added Associated Files (AF). It also describes a method for Formula tags to make use of AF that contain MathML. The LaTeX Project (the group that maintains LaTeX) has released an update to LaTeX that uses this technique. Hence, there will soon be a large body of PDF documents generated from LaTeX (pdflatex and lualatex) that contain MathML. 

Together with Foxit, and with informal agreement from someone at Adobe, we settled on a method to expose the MathML in an AF without changing the PDF accessibility interface: the Formula tag gets role=Math (on Windows, ROLE_SYSTEM_EQUATION) and the content of the tag is the MathML.
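
As a rough illustration of that convention from a screen reader's point of view, the check can be sketched as below. This is only a sketch: `FormulaNode` and `exposes_mathml` are hypothetical names invented for this example and are not part of NVDA or of any PDF reader's API.

```python
from dataclasses import dataclass

# MSAA role constant for equations (value from oleacc.h).
ROLE_SYSTEM_EQUATION = 0x37


@dataclass
class FormulaNode:
    """Hypothetical stand-in for an accessible node exposed by a PDF reader."""

    role: int
    value: str


def exposes_mathml(node: FormulaNode) -> bool:
    """Return True if the node has the equation role and its value is MathML."""
    return node.role == ROLE_SYSTEM_EQUATION and node.value.startswith("<math")
```

A node with the equation role whose value is plain text (for example alt text) fails this check and would need a different code path.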

Note: this does not change the legality of the previous method of fully tagging the PDF math with child elements pointing to subexpressions in the PDF. However, that method has proved difficult for PDF generators to implement. The new method seems to be much simpler and hence is the one likely to be used.

The latest release of Foxit supports AFs containing MathML. So far, Adobe has not made a change, but with Foxit and NVDA supporting this, there will be more of an impetus to do so. According to the Foxit implementer, it took only 1-2 days to implement.

Description of user-facing changes:
Math in PDF documents will be spoken and brailled just as it is in HTML documents. It will also be navigable. This should work with any of the MathML add-ons.

Description of development approach:
Support required adding only about three lines to `adobeAcrobat.py`. I changed a few more lines to log debug warnings when various COM interfaces are not found.

A commit in January 2024 wiped out the MathML support in PDF in favor of alt text. That change was in the .cpp file that is part of this PR, and this PR mostly reverts it. Alt text is still supported by wrapping it in a MathML `<mtext>` element. Potentially, this is a better solution because the alt text is sometimes LaTeX, and LaTeX contains many punctuation characters that NVDA does not speak by default. Pushing this to the math handler gives it the ability to override that behavior and speak all the characters. Currently MathCAT just passes the `mtext` content directly to NVDA, but I will look into making it smarter about that.
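
The alt text fallback can be sketched roughly as follows. `alt_text_to_mathml` is a hypothetical helper name used only for illustration. The code in this commit interpolates the node value directly into the `<mtext>` element; the XML escaping step below is an assumption added for this sketch, not something the PR does.

```python
from xml.sax.saxutils import escape


def alt_text_to_mathml(alt_text: str) -> str:
    """Wrap plain alt text (possibly LaTeX) in a minimal MathML document.

    Escaping of &, < and > is an addition made for this sketch so that
    the result stays well-formed XML.
    """
    return f"<math><mtext>{escape(alt_text)}</mtext></math>"
```

With this shape, the math handler receives the raw alt text inside `<mtext>` and can decide for itself how to speak punctuation such as backslashes and braces.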

Because Adobe Reader currently does not handle AFs, the alt text will get read if a formula has both an AF and alt text.

Testing strategy:
Here are two PDF files for testing:
1. [Several inline and display equations](https://github.com/user-attachments/files/17334945/mathml-AF-ex2.pdf)
2. [Some equations with alt text](https://github.com/user-attachments/files/17334946/formula-alt-text.pdf)

Known issues with pull request:
None
NSoiffer authored Oct 25, 2024
1 parent 8d6a860 commit 38e12d3
Showing 3 changed files with 36 additions and 18 deletions.
17 changes: 2 additions & 15 deletions nvdaHelper/vbufBackends/adobeAcrobat/adobeAcrobat.cpp
@@ -437,7 +437,6 @@ AdobeAcrobatVBufStorage_controlFieldNode_t* AdobeAcrobatVBufBackend_t::fillVBuf(
}

BSTR stdName = NULL;
BSTR tagName = NULL;
int textFlags = 0;
// Whether to render just a space in place of the content.
bool renderSpace = false;
@@ -454,18 +453,8 @@ AdobeAcrobatVBufStorage_controlFieldNode_t* AdobeAcrobatVBufBackend_t::fillVBuf(
// This is an inline element.
parentNode->isBlock=false;
}
}

// Get tagName.
if ((res = domElement->GetTagName(&tagName)) != S_OK) {
LOG_DEBUG(L"IPDDomElement::GetTagName returned " << res);
tagName = NULL;
}
if (tagName) {
parentNode->addAttribute(L"acrobat::tagname", tagName);
if (wcscmp(tagName, L"math") == 0) {
// We don't want the content of math nodes here,
// As it will be fetched by NVDAObjects outside of the virtualBuffer.
if (wcscmp(stdName, L"Formula") == 0) {
// We don't want the content of formulas,
// but we still want a space so the user can get at them.
renderSpace = true;
}
@@ -747,8 +736,6 @@ AdobeAcrobatVBufStorage_controlFieldNode_t* AdobeAcrobatVBufBackend_t::fillVBuf(
delete pageNum;
if (stdName)
SysFreeString(stdName);
if (tagName)
SysFreeString(tagName);
if (domElement) {
LOG_DEBUG(L"Releasing IPDDomElement");
domElement->Release();
34 changes: 31 additions & 3 deletions source/NVDAObjects/IAccessible/adobeAcrobat.py
@@ -76,11 +76,17 @@ def initOverlayClass(self):
self.accID = None

# Get the IPDDomNode.
try:
serv.QueryService(SID_GetPDDomNode, IGetPDDomNode)
except COMError:
log.debugWarning("FAILED: QueryService(SID_GetPDDomNode, IGetPDDomNode)")
self.pdDomNode = None
try:
self.pdDomNode = serv.QueryService(SID_GetPDDomNode, IGetPDDomNode).get_PDDomNode(
self.IAccessibleChildID,
)
except COMError:
log.debugWarning("FAILED: get_PDDomNode")
self.pdDomNode = None

if self.pdDomNode:
@@ -136,16 +142,38 @@ def _getNodeMathMl(self, node):
yield sub
yield "</%s>" % tag

def _get_mathMl(self):
# There could be other stuff before the math element. Ug.
def _get_mathMl(self) -> str:
"""Return the MathML associated with a Formula tag"""
if self.pdDomNode is None:
log.debugWarning("_get_mathMl: self.pdDomNode is None!")
raise LookupError
mathMl = self.pdDomNode.GetValue()
if log.isEnabledFor(log.DEBUG):
log.debug(
(
f"_get_mathMl: math recognized: {mathMl.startswith('<math')}, "
f"child count={self.pdDomNode.GetChildCount()},"
f"\n name='{self.pdDomNode.GetName()}', value='{mathMl}'"
),
)
# this test and the replacement doesn't work if someone uses a namespace tag (which they shouldn't, but..)
if mathMl.startswith("<math"):
return mathMl.replace('xmlns:mml="http://www.w3.org/1998/Math/MathML"', "")
# Alternative for tagging: all the sub expressions are tagged -- gather up the MathML
for childNum in range(self.pdDomNode.GetChildCount()):
try:
child = self.pdDomNode.GetChild(childNum).QueryInterface(IPDDomElement)
except COMError:
log.debugWarning(f"COMError trying to get childNum={childNum}")
continue
if log.isEnabledFor(log.DEBUG):
log.debug(f"\tget_mathMl: tag={child.GetTagName()}")
if child.GetTagName() == "math":
return "".join(self._getNodeMathMl(child))
raise LookupError
# fall back to return the contents, which is hopefully alt text
if log.isEnabledFor(log.DEBUG):
log.debug("_get_mathMl: didn't find MathML -- returning value as mtext")
return f"<math><mtext>{self.pdDomNode.GetValue()}</mtext></math>"


class RootNode(AcrobatNode):
3 changes: 3 additions & 0 deletions user_docs/en/changes.md
@@ -6,6 +6,9 @@

### New Features

* Support for math in PDFs has been added.
This works for formulas with associated MathML, such as some files generated by newer versions of TeX/LaTeX.
Currently this is only supported in Foxit Reader & Foxit Editor. (#9288, @NSoiffer)
* Commands to adjust the volume of other applications besides NVDA have been added.
To use this feature, "allow NVDA to control the volume of other applications" must be enabled in the audio settings panel. (#16052, @mltony, @codeofdusk)
* `NVDA+alt+pageUp`: Increase the volume of all other applications.