Figures inside a <p> aren't recognized correctly. #19

Open
marciuz opened this issue Jun 18, 2020 · 4 comments

Comments

@marciuz

marciuz commented Jun 18, 2020

It seems that <fig> tags are not correctly recognized if present inside a <p>.

Actually, according to the documentation this is possible (https://jats.nlm.nih.gov/publishing/tag-library/1.1/element/fig.html).

I tried replacing ./fig with .//fig here, but it doesn't seem to work correctly...

foreach (self::$xpath->evaluate(".//sec|./p|./list|./table-wrap|./fig|./media|./disp-quote|./verse-group", $body) as $content) {
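For reference, a minimal sketch of what the two XPath steps match, using PHP's built-in DOMDocument and DOMXPath. The sample XML, the ids, and the helper figIds() are made up for illustration; nothing here is taken from the JATSParser source. It only shows that ./fig selects direct children of the context node, while .//fig selects figures at any depth, which would also include figures nested in sections that are presumably handled elsewhere.

```php
<?php
// Hypothetical minimal JATS body: one figure nested inside a <p>,
// one figure directly under <body>.
$xml = <<<XML
<body>
  <p>Intro text <fig id="f-in-p"><label>Figure 1</label></fig> continues.</p>
  <fig id="f-top"><label>Figure 2</label></fig>
</body>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$body = $doc->documentElement;

// Collect the id attribute of every node matched by $expr relative to $context.
// figIds() is an illustrative helper, not part of any library.
function figIds(DOMXPath $xpath, string $expr, DOMNode $context): array
{
    $ids = [];
    foreach ($xpath->evaluate($expr, $context) as $node) {
        $ids[] = $node->getAttribute('id');
    }
    return $ids;
}

$childOnly  = figIds($xpath, './fig', $body);   // direct children of <body> only
$descendant = figIds($xpath, './/fig', $body);  // figures at any depth below <body>
$insideP    = figIds($xpath, './p/fig', $body); // figures that are children of a top-level <p>

print_r(['./fig' => $childOnly, './/fig' => $descendant, './p/fig' => $insideP]);
```

With .//fig the nested figure is matched, but the same union still matches its parent through ./p, so the figure may end up being processed both as part of the paragraph and as a standalone block. That is a guess at why the change "doesn't go right", not something confirmed against the parser.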

@marciuz changed the title from "If the figure is inside a <p> isn't recognized." to "Figures inside a <p> aren't recognized correctly." on Jun 18, 2020
@Vitaliy-1
Owner

Yeah, I was planning to support only a subset of JATS, namely DAR: https://github.com/substance/dar/blob/master/DarArticle.md

The current JATS Parser object model allows block elements, like figures, tables and lists, only inside a section or the document body. It also parses the elements of the document sequentially, in document order, to preserve the structure.

If implementing this feature, I would add an array to the paragraph element (https://github.com/Vitaliy-1/JATSParser/blob/2e2a8aeddc4a2423e8afcb642cf6e836e6eeaa11/src/JATSParser/Body/Par.php) together with a getter (e.g., getBlockElements()). The array would hold the paragraph's block elements (e.g., figures and tables), which would then be written into the resulting HTML right after the paragraph in which they appear.

Say, after this line, where the paragraph data is set:

$par->setContent($articleSection);
check whether the paragraph contains block elements ($par->getBlockElements()), and if so add those elements to the array just after the paragraph. I don't remember whether PHP allows that dynamically.
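A rough sketch of that idea, assuming PHP 7.4 or later. The classes Figure and Paragraph and the function flattenContent() are hypothetical stand-ins, not the real JATSParser classes; only the getter name getBlockElements() comes from the comment above. PHP does allow this: an array can be appended to while it is being built, and array_splice() can insert elements into the middle of an existing array.

```php
<?php
// Hypothetical stand-ins for the parser's classes; not the real JATSParser API.
final class Figure
{
    public string $id;

    public function __construct(string $id)
    {
        $this->id = $id;
    }
}

final class Paragraph
{
    public string $text;

    /** @var Figure[] block elements found inside the <p> */
    private array $blockElements = [];

    public function __construct(string $text)
    {
        $this->text = $text;
    }

    public function addBlockElement(Figure $figure): void
    {
        $this->blockElements[] = $figure;
    }

    /** @return Figure[] */
    public function getBlockElements(): array
    {
        return $this->blockElements;
    }
}

// Build a flat content list in which every paragraph is
// immediately followed by the block elements it contained.
function flattenContent(array $paragraphs): array
{
    $content = [];
    foreach ($paragraphs as $par) {
        $content[] = $par;
        foreach ($par->getBlockElements() as $block) {
            $content[] = $block;
        }
    }
    return $content;
}

$first = new Paragraph('Text that mentions a figure.');
$first->addBlockElement(new Figure('fig1'));
$second = new Paragraph('Plain text.');

// Option 1: append while building the list.
$content = flattenContent([$first, $second]);

// Option 2: insert into an already built list, right after index 0.
$existing = [$first, $second];
array_splice($existing, 1, 0, $first->getBlockElements());
```

Both options yield the order Paragraph, Figure, Paragraph. Appending while building keeps the loop simple; array_splice() is only needed if the list already exists when the block elements are discovered.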

I'll check whether there is an easy workaround that doesn't change the logic and keeps the code readable.

@Vitaliy-1
Owner

Let me know if it works.

I definitely need to refactor JATSParser\Body\Document::getContent() and JATSParser\BodySection::getContent().

Do you have a strong opinion regarding lists inside a paragraph? They can't simply be placed after the paragraph, because they are semantically linked to the text inside it. One option would be to break the paragraph in two and place the list between the two parts.
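The split option could look roughly like this. It is a sketch only: splitAroundLists() and the array shape it returns are invented for illustration, inline markup is reduced to plain text, and nested lists are not handled separately (getElementsByTagName() returns list items at every depth).

```php
<?php
// Hypothetical sample: a paragraph with a list in the middle of its text.
$xml = '<p>Before the list <list><list-item><p>one</p></list-item>'
     . '<list-item><p>two</p></list-item></list> after the list.</p>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$p = $doc->documentElement;

// Walk the direct children of <p>. Inline content is buffered as text;
// each <list> closes the current text run and is emitted as its own block.
function splitAroundLists(DOMElement $p): array
{
    $blocks = [];
    $buffer = '';
    foreach ($p->childNodes as $child) {
        if ($child instanceof DOMElement && $child->nodeName === 'list') {
            if (trim($buffer) !== '') {
                $blocks[] = ['type' => 'p', 'text' => trim($buffer)];
            }
            $buffer = '';
            $items = [];
            foreach ($child->getElementsByTagName('list-item') as $item) {
                $items[] = trim($item->textContent);
            }
            $blocks[] = ['type' => 'list', 'items' => $items];
        } else {
            $buffer .= $child->textContent;
        }
    }
    // Flush whatever text followed the last list.
    if (trim($buffer) !== '') {
        $blocks[] = ['type' => 'p', 'text' => trim($buffer)];
    }
    return $blocks;
}

$blocks = splitAroundLists($p);
print_r($blocks);
```

For the sample input this produces a paragraph, a list, and a second paragraph, which keeps the list at the position where it appeared in the text.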

@marciuz
Author

marciuz commented Jun 21, 2020 via email

@Vitaliy-1
Owner

The idea of placing block elements outside the paragraph arose from compatibility with WYSIWYG editors, like Texture, TinyMCE, or ProseMirror. The JATS XML standard is quite flexible about where tags and mixed elements may be placed, but that flexibility makes machine processing harder. I think that was the reason behind the creation of the DAR subset of JATS XML and the JATS4R initiative.

Regarding metadata, I think it's better to use an object-oriented approach with getters for data extraction. Another possibility would be to create a generic service class that allows simple interactions with the XML's metadata (something similar can be seen in the Laravel framework). I was also considering this approach for the whole document, but the way data is presented in the article's body doesn't allow it, especially when it comes to mixed elements inside paragraphs.
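As a sketch of the generic service idea, assuming nothing beyond PHP's built-in DOM extension: the class MetadataReader and its methods get() and all() are hypothetical names, not part of JATSParser or Laravel. The element names in the sample (article-meta, title-group, article-title, contrib, surname) are standard JATS.

```php
<?php
// Hypothetical generic metadata helper; class and method names are illustrative.
final class MetadataReader
{
    private DOMXPath $xpath;

    public function __construct(DOMDocument $doc)
    {
        $this->xpath = new DOMXPath($doc);
    }

    // Trimmed text of the first node matching $expr, or $default if none matches.
    public function get(string $expr, ?string $default = null): ?string
    {
        $nodes = $this->xpath->query($expr);
        if ($nodes === false || $nodes->length === 0) {
            return $default;
        }
        return trim($nodes->item(0)->textContent);
    }

    // Trimmed text of every node matching $expr.
    public function all(string $expr): array
    {
        $values = [];
        $nodes = $this->xpath->query($expr);
        if ($nodes !== false) {
            foreach ($nodes as $node) {
                $values[] = trim($node->textContent);
            }
        }
        return $values;
    }
}

$xml = '<article><front><article-meta>'
     . '<title-group><article-title>Sample title</article-title></title-group>'
     . '<contrib-group>'
     . '<contrib><name><surname>Doe</surname></name></contrib>'
     . '<contrib><name><surname>Roe</surname></name></contrib>'
     . '</contrib-group>'
     . '</article-meta></front></article>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$meta = new MetadataReader($doc);

$title    = $meta->get('/article/front/article-meta/title-group/article-title');
$surnames = $meta->all('//contrib/name/surname');
$doi      = $meta->get('//article-id[@pub-id-type="doi"]', 'n/a'); // absent in the sample
```

This works for metadata because each value sits at a predictable path. Mixed content in the body does not reduce to single path lookups in the same way, which matches the limitation described above.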

The current approach to parsing paragraphs is similar to how it's done in OOXML, where paragraphs are flat and contain only text runs. Although figures in OOXML are commonly placed inside a paragraph, a figure is treated as a separate element. I recently explored ProseMirror and discovered that it does the same thing: it flattens the paragraph's content. See: https://prosemirror.net/docs/guide/#doc.structure

My aim right now is to create a full JATS XML workflow: parsing the author's manuscript (with Grobid, meTypeset, or my own docxToJats converter), editing it in a WYSIWYG editor, and presenting it on the front end as HTML and PDF. I have to admit that my knowledge in this area is limited, and I'm open to suggestions as long as they are in line with the current plan.
