Figures inside a <p> aren't recognized correctly. #19

marciuz · 2020-06-18T09:59:11Z

It seems that <fig> tags are not correctly recognized if present inside a <p>.

According to the JATS documentation, this placement is allowed (https://jats.nlm.nih.gov/publishing/tag-library/1.1/element/fig.html).

I tried replacing ./fig with .//fig here, but it doesn't seem to work correctly:

JATSParser/src/JATSParser/Body/Document.php

Line 118 in 2e2a8ae

    
           foreach (self::$xpath->evaluate(".//sec|./p|./list|./table-wrap|./fig|./media|./disp-quote|./verse-group", $body) as $content) {
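To illustrate the difference between the two XPath steps, here is a minimal sketch on a made-up document (the XML, the `describe()` helper and the shortened expressions are assumptions for illustration, not code from JATSParser). With `.//fig` the nested figure is returned as its own item, but it also stays a child of the `<p>` node, which may be one reason the simple swap does not behave as hoped.

```php
<?php
// Hypothetical sample: one figure nested in a paragraph, one directly under <body>.
$xml = <<<XML
<article>
  <body>
    <p>Text before <fig id="f1"><label>Figure 1</label></fig> text after.</p>
    <fig id="f2"><label>Figure 2</label></fig>
  </body>
</article>
XML;

// Illustrative helper: list the names (plus id, if any) of the nodes a query returns.
function describe(DOMXPath $xpath, string $expression, DOMNode $context): array
{
    $names = [];
    foreach ($xpath->query($expression, $context) as $node) {
        $id = $node instanceof DOMElement ? $node->getAttribute('id') : '';
        $names[] = $id === '' ? $node->nodeName : $node->nodeName . '#' . $id;
    }
    return $names;
}

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$body = $xpath->query('/article/body')->item(0);

// "./fig" matches only direct children of <body>; ".//fig" matches any descendant.
$childOnly = describe($xpath, './p|./fig', $body);
$descendant = describe($xpath, './p|.//fig', $body);

print_r($childOnly);
print_r($descendant);
```

The first query never sees the figure inside the paragraph. The second one does, but the figure is then reachable twice: once as a top-level result and once through its parent paragraph.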

Vitaliy-1 · 2020-06-18T14:12:37Z

Yeah, I was planning to support only a subset of JATS, namely DAR: https://github.com/substance/dar/blob/master/DarArticle.md

The current JATS Parser object model allows block elements, such as figures, tables and lists, only inside a section or the document body. It also parses the elements of the document sequentially, to preserve the structure.

If implementing this feature, I would create an array inside the paragraph element, with a getter (e.g., getBlockElements()): https://github.com/Vitaliy-1/JATSParser/blob/2e2a8aeddc4a2423e8afcb642cf6e836e6eeaa11/src/JATSParser/Body/Par.php. The array would hold the block elements (e.g., figures and tables) found in the paragraph, which would then be written into the resulting HTML after the paragraph in which they appear.

Say, checking after this line:

JATSParser/src/JATSParser/HTML/Document.php

Line 118 in 2e2a8ae

$par->setContent($articleSection);

where the paragraph data is set, whether it contains block elements ($par->getBlockElements()), and then adding those elements to the array just after the paragraph. I don't remember if PHP allows that dynamically.

I'll see whether there is an easy way to do this without changing the logic and while keeping the code readable.
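A rough sketch of that idea follows. The `Par` and `Figure` classes here are simplified stand-ins written for this example, and `flattenContent()` is a hypothetical helper; none of them is the real JATSParser API. Building a new array avoids the question of changing an array while looping over it, since a by-value `foreach` in PHP iterates over a copy anyway.

```php
<?php
// Hypothetical stand-ins for the parser's classes; not the real JATSParser API.
class Figure
{
    public $label;

    public function __construct(string $label)
    {
        $this->label = $label;
    }
}

class Par
{
    public $text;

    /** @var Figure[] block elements found inside this paragraph */
    private $blockElements = [];

    public function __construct(string $text)
    {
        $this->text = $text;
    }

    public function addBlockElement(Figure $element): void
    {
        $this->blockElements[] = $element;
    }

    public function getBlockElements(): array
    {
        return $this->blockElements;
    }
}

// Copy each item to the output; after a paragraph, append its block elements.
function flattenContent(array $items): array
{
    $result = [];
    foreach ($items as $item) {
        $result[] = $item;
        if ($item instanceof Par) {
            foreach ($item->getBlockElements() as $block) {
                $result[] = $block;
            }
        }
    }
    return $result;
}

$first = new Par('Paragraph that contained a figure.');
$first->addBlockElement(new Figure('Figure 1'));
$output = flattenContent([$first, new Par('Next paragraph.')]);
```

Here `$output` holds the first paragraph, then its figure, then the second paragraph, so the HTML writer can keep treating the content as a flat list.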

Vitaliy-1 · 2020-06-19T16:51:55Z

Let me know if it works.

I definitely need to refactor JATSParser\Body\Document::getContent() and JATSParser\BodySection::getContent().

Do you have a strong opinion regarding lists inside a paragraph? They can't just be put after the paragraph because they are semantically linked to the text inside it. One option could be to break the paragraph in two and place the list between the parts.
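The "break the paragraph in two" option could look roughly like the sketch below. It is only an illustration under stated assumptions: `splitParagraph()` and the `BLOCK_NAMES` list are hypothetical, and inline markup is reduced to plain text to keep the example short.

```php
<?php
// Element names treated as block-level here; an assumed subset for illustration.
const BLOCK_NAMES = ['fig', 'list', 'table-wrap'];

// Split a <p> into ordered parts: runs of inline content become 'p' parts,
// and every block child becomes a part of its own, in document order.
function splitParagraph(DOMElement $p): array
{
    $parts = [];
    $buffer = '';
    foreach ($p->childNodes as $child) {
        $isBlock = $child instanceof DOMElement
            && in_array($child->nodeName, BLOCK_NAMES, true);
        if (!$isBlock) {
            // Inline content: keep accumulating text for the current paragraph part.
            $buffer .= $child->textContent;
            continue;
        }
        // Block element: close the current paragraph part, then emit the block.
        if (trim($buffer) !== '') {
            $parts[] = ['type' => 'p', 'text' => trim($buffer)];
        }
        $buffer = '';
        $parts[] = ['type' => $child->nodeName, 'text' => trim($child->textContent)];
    }
    if (trim($buffer) !== '') {
        $parts[] = ['type' => 'p', 'text' => trim($buffer)];
    }
    return $parts;
}

$doc = new DOMDocument();
$doc->loadXML(
    '<p>Items follow: <list><list-item><p>one</p></list-item></list> and the text continues.</p>'
);
$parts = splitParagraph($doc->documentElement);
print_r($parts);
```

The list ends up between two paragraph parts, so its position relative to the surrounding text is preserved in the output.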

marciuz · 2020-06-21T12:44:21Z

Hi,

Yes, the code probably needs a general refactoring. Here is what I've found so far:

- I think a namespace has to be created outside of Body, Front or Back, named something like "Shared" or "Common". For example, I created an abstract class Shared\AbstractParser with some functions like extractFromElement, but also this one:

    /**
     * @link (e.g.) https://jats.nlm.nih.gov/publishing/tag-library/1.1/element/journal-id.html
     * @param \DOMElement $reference
     * @param string $xpathExpression
     * @param string $searchAttribute
     * @return array
     */
    protected function extractIdType(\DOMElement $reference, string $xpathExpression, string $searchAttribute): array
    {
        $entityIdType = array();
        $entityIdNodes = $this->xpath->query($xpathExpression, $reference);
        if ($entityIdNodes->length > 0) {
            /* @var $entityIdNode \DOMElement */
            foreach ($entityIdNodes as $entityIdNode) {
                if ($entityIdNode->getAttribute($searchAttribute)) {
                    $entityIdKey = $entityIdNode->getAttribute($searchAttribute);
                    $entityIdType[$entityIdKey] = trim($entityIdNode->nodeValue);
                }
            }
        }
        return $entityIdType;
    }

- The handling of elements within paragraphs needs to be reviewed. The image should simply be shown where it is, neither before nor after. If the image appears at a certain point in the paragraph, the HTML should reflect this. You can choose to close the paragraph, display the image, and artificially open a second part of the paragraph, but the HTML result should be true to the XML (this is my humble opinion, of course).

- I'm working on the front matter. I'm kind of "betraying" the general philosophy of your library, because I'm generating a "metadata" array where I populate all the fields I need. Now I'm working on categories (ahhh, there are self-referencing tags!), keywords and abstracts. It's quite complex, but something is coming up.

Please let me know how I can help.

Best regards,
Marcello

Vitaliy-1 · 2020-06-22T09:03:56Z

The idea of placing block elements outside the paragraph arose from the need for compatibility with WYSIWYG editors such as Texture, TinyMCE, or ProseMirror. The JATS XML standard is quite flexible about where tags or mixed elements can go, but that flexibility makes documents harder to process by machine. I think that was the reason behind the creation of the DAR subset of JATS XML and the JATS4R initiative.

Regarding metadata, I think it's better to use an object-oriented approach with getters for data extraction. Another possibility would be to create a generic service class that allows simple interactions with the XML's metadata (something similar can be seen in the Laravel framework). I was also considering this approach for the whole document, but the way data is presented in the article's body doesn't allow it, especially with mixed elements inside paragraphs.
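As a rough illustration of the getter-based approach, here is a small sketch. The `ArticleMeta` class and its method names are hypothetical and only cover two fields; the XPath expressions assume the standard JATS `article-meta` layout.

```php
<?php
// Hypothetical class; the name and getters are illustrative, not part of JATSParser.
class ArticleMeta
{
    /** @var DOMXPath */
    private $xpath;

    public function __construct(DOMDocument $document)
    {
        $this->xpath = new DOMXPath($document);
    }

    // Return the article title, or null when the element is missing.
    public function getTitle(): ?string
    {
        $node = $this->xpath
            ->query('//front/article-meta/title-group/article-title')
            ->item(0);
        return $node === null ? null : trim($node->textContent);
    }

    /** @return string[] keywords in the order they are found */
    public function getKeywords(): array
    {
        $keywords = [];
        foreach ($this->xpath->query('//front/article-meta/kwd-group/kwd') as $kwd) {
            $keywords[] = trim($kwd->textContent);
        }
        return $keywords;
    }
}

$doc = new DOMDocument();
$doc->loadXML(
    '<article><front><article-meta>'
    . '<title-group><article-title>Sample title</article-title></title-group>'
    . '<kwd-group><kwd>xml</kwd><kwd>jats</kwd></kwd-group>'
    . '</article-meta></front></article>'
);
$meta = new ArticleMeta($doc);
$title = $meta->getTitle();
$keywords = $meta->getKeywords();
```

Compared with one large metadata array, each getter documents a single field and can return a typed value, at the cost of more methods to maintain.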

The current approach to parsing paragraphs is similar to how it's done in OOXML, where paragraphs are flat and contain only text runs. Although in OOXML figures are commonly placed inside a paragraph, they are treated as separate elements. I recently explored ProseMirror and found that it does the same thing, flattening a paragraph's content. See: https://prosemirror.net/docs/guide/#doc.structure
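To make "flattening" concrete, here is a minimal sketch that turns nested inline markup into a flat list of text runs, each carrying the names of the elements that wrapped it. The `flattenRuns()` function and the run structure are assumptions for illustration, not how JATSParser or ProseMirror are actually implemented.

```php
<?php
// Walk the inline tree and return a flat list of runs. Each run stores its text
// and the names of the enclosing inline elements ("marks"), outermost first.
function flattenRuns(DOMNode $node, array $marks = []): array
{
    $runs = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            if ($child->nodeValue !== '') {
                $runs[] = ['text' => $child->nodeValue, 'marks' => $marks];
            }
        } elseif ($child instanceof DOMElement) {
            // Recurse, adding this element's name to the marks of its content.
            $nested = flattenRuns($child, array_merge($marks, [$child->nodeName]));
            $runs = array_merge($runs, $nested);
        }
    }
    return $runs;
}

$doc = new DOMDocument();
$doc->loadXML('<p>This is <bold>strong text with <italic>emphasis</italic></bold></p>');
$runs = flattenRuns($doc->documentElement);
print_r($runs);
```

The nested `<bold><italic>` tree becomes three sibling runs, the last of which carries both marks. A flat sequence like this is the kind of structure that is easier for an editor to handle than arbitrarily nested inline elements.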

My aim right now is to create a full JATS XML workflow: parsing the author's manuscript (with Grobid, meTypeset or my own docxToJats converter), editing it in a WYSIWYG editor, and presenting it on the front end as HTML and PDF. I must confess that my knowledge in this area is not great, and I'm open to suggestions as long as they are in line with the current plan.

marciuz changed the title ~~If the figure is inside a <p> isn't recognized.~~ Figures inside a <p> aren't recognized correctly. Jun 18, 2020

Vitaliy-1 added a commit that referenced this issue Jun 19, 2020

#19 parse block elements inside paragraphs

d30d3ae

Vitaliy-1 mentioned this issue Aug 3, 2020

Issue with inline-graphic, display-formula and fig tags #23

Open
