Figures inside a <p> aren't recognized correctly. #19
Comments
Yeah, I was planning to support only a subset of JATS, DAR: https://github.com/substance/dar/blob/master/DarArticle.md
The current JATS Parser object model allows block elements, like figures, tables and lists, only inside a section or the document body. Also, it parses the elements inside the document sequentially, to preserve the structure.
If implementing this feature, I would create an array inside the paragraph element and a getter (e.g., $par->getBlockElements()), and then add the elements from that array just after the paragraph, say, by checking after this line: JATSParser/src/JATSParser/HTML/Document.php, line 118 in 2e2a8ae. I don't remember if PHP allows that dynamically.
I'll take a look if there is an easy way around without changing the logic and leaving code readable.
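A minimal sketch of the getter idea described above, assuming a paragraph object that sets block-level children aside while it collects its own text. The class name ParagraphSketch, its constructor, and the list of block element names are hypothetical and are not taken from JATSParser; only getBlockElements() is named in the comment.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch: apart from getBlockElements(), none of these names
// come from the JATSParser code base.
final class ParagraphSketch
{
    /** @var DOMElement[] block elements found directly inside the <p> */
    private $blockElements = [];

    /** @var string inline text of the paragraph, without the block elements */
    private $text = '';

    public function __construct(DOMElement $p)
    {
        foreach ($p->childNodes as $child) {
            $isBlock = $child instanceof DOMElement
                && in_array($child->nodeName, ['fig', 'table-wrap', 'list'], true);
            if ($isBlock) {
                // Set aside; the caller emits these right after the paragraph.
                $this->blockElements[] = $child;
            } else {
                $this->text .= $child->textContent;
            }
        }
    }

    public function getText(): string
    {
        return trim($this->text);
    }

    /** @return DOMElement[] */
    public function getBlockElements(): array
    {
        return $this->blockElements;
    }
}

$doc = new DOMDocument();
$doc->loadXML('<p>Before <fig id="f1"><caption><title>Cap</title></caption></fig> after.</p>');
$par = new ParagraphSketch($doc->documentElement);

echo $par->getText(), PHP_EOL;
foreach ($par->getBlockElements() as $block) {
    echo $block->nodeName, '#', $block->getAttribute('id'), PHP_EOL;
}
```

A caller walking the document would output the paragraph first and then loop over getBlockElements(), which keeps the paragraph itself flat at the cost of moving the figure away from its original position in the text.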
Let me know if it works.
I definitely need to refactor JATSParser\Body\Document::getContent() and JATSParser\BodySection::getContent().
Do you have a strong opinion regarding lists inside a paragraph? They can't just be put after the paragraph because they are semantically linked to the text inside it. One option could be to break the paragraph into two and place the list between them.
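The option of breaking a paragraph in two around a list could look roughly like the sketch below. The function splitParagraph() and the segment array shape are hypothetical, chosen only for illustration; the sketch also flattens inline markup to plain text, which a real implementation would not want to do.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: splits a <p> into ordered segments, closing the
 * running paragraph whenever a block element is met.
 *
 * @return array<int, array{type: string, content: string}>
 */
function splitParagraph(DOMElement $p, array $blockNames = ['list', 'fig', 'table-wrap']): array
{
    $segments = [];
    $buffer = '';
    foreach ($p->childNodes as $child) {
        $isBlock = $child instanceof DOMElement
            && in_array($child->nodeName, $blockNames, true);
        if (!$isBlock) {
            $buffer .= $child->textContent; // inline content stays in the current part
            continue;
        }
        if (trim($buffer) !== '') {
            $segments[] = ['type' => 'p', 'content' => trim($buffer)];
        }
        $buffer = '';
        $segments[] = ['type' => $child->nodeName, 'content' => trim($child->textContent)];
    }
    if (trim($buffer) !== '') {
        $segments[] = ['type' => 'p', 'content' => trim($buffer)];
    }
    return $segments;
}

$doc = new DOMDocument();
$doc->loadXML('<p>Three steps:<list><list-item><p>one</p></list-item></list>as shown.</p>');
foreach (splitParagraph($doc->documentElement) as $segment) {
    echo $segment['type'], ': ', $segment['content'], PHP_EOL;
}
```

Here the single source paragraph yields a paragraph, the list, and a second paragraph, in document order, so the list keeps its place relative to the surrounding text.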
Hi,
Yes, the code probably needs a general refactoring.
I'll tell you what I've found so far:
- I think a namespace has to be created outside Body, Front and Back, called "Shared", "Common" or something like that. For example, I created an abstract class Shared\AbstractParser with some functions like extractFromElement, but also this one:
    /**
     * Collects ID values keyed by the value of their type attribute.
     *
     * @link (e.g.) https://jats.nlm.nih.gov/publishing/tag-library/1.1/element/journal-id.html
     * @param \DOMElement $reference       context node for the query
     * @param string      $xpathExpression XPath expression, evaluated relative to $reference
     * @param string      $searchAttribute attribute whose value becomes the array key
     * @return array attribute value => trimmed node value
     */
    protected function extractIdType(\DOMElement $reference, string $xpathExpression, string $searchAttribute): array {
        $entityIdType = array();
        $entityIdNodes = $this->xpath->query($xpathExpression, $reference);
        /* @var $entityIdNode \DOMElement */
        foreach ($entityIdNodes as $entityIdNode) {
            // Skip nodes where the attribute is missing or empty.
            $entityIdKey = $entityIdNode->getAttribute($searchAttribute);
            if ($entityIdKey) {
                $entityIdType[$entityIdKey] = trim($entityIdNode->nodeValue);
            }
        }
        return $entityIdType;
    }
- The handling of elements within paragraphs really needs to be reviewed. The image should simply be shown where it is, neither before nor after. If the image appears at a certain point in the paragraph, the HTML should reflect this. You can choose to close the paragraph, display the image, and artificially open a second part of the paragraph, but the HTML result should be true to the XML (this is my humble opinion, of course).
- I'm working on the front. I'm kind of "betraying" the general philosophy of your library, because I'm generating a "metadata" array where I populate all the fields I need. Now I'm working on categories (ahhh, there are self-referencing tags!), keywords and abstracts. It's quite complex, but something is coming up.
Please let me know how I can help.
Best regards
Marcello
On Fri, 19 Jun 2020 at 18:52, Vitaliy <[email protected]> wrote:
… Let me know if it works.
I definitely need to refactor JATSParser\Body\Document::getContent() and
JATSParser\BodySection::getContent().
Do you have a strong opinion regarding lists inside paragraph? They can't
just be put after the paragraph because semantically are linked to the text
inside it. One of the options could be to break paragraph into 2 and place
list between.
The idea of placing block elements outside the paragraph arose from compatibility with WYSIWYG editors, like Texture, TinyMCE, or ProseMirror. The JATS XML standard is quite flexible regarding where to put tags or mixed elements, but that makes machine readability harder. I think that was the reason behind the creation of the DAR subset of JATS XML and the JATS4R initiative.
Regarding metadata, I think it's better to use an object-oriented approach with getters for data extraction. Another possibility would be to create a generic service class that allows simple interactions with the XML's metadata (something similar can be seen in the Laravel framework). I was also thinking about this approach for the whole document, but the way the data is presented in the article's body doesn't allow it, especially when it comes to mixed elements inside paragraphs.
The current approach to parsing paragraphs is similar to how it's done in OOXML, where paragraphs are flat and contain only text runs. Although in OOXML figures are usually placed inside a paragraph, a figure is still treated as a separate element. Recently I've explored ProseMirror and discovered that it does the same thing: it flattens the paragraph's content. See: https://prosemirror.net/docs/guide/#doc.structure
My aim right now is to create a full JATS XML workflow: parsing the author's manuscript (with Grobid, meTypeset or my own docxToJats converter), editing it with a WYSIWYG editor, and presenting it on the front end as HTML and PDF. But I need to confess that my knowledge in this area is not great, and I'm open to suggestions as long as they are in line with the current plan.
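As an illustration of the getter-based approach to metadata mentioned above, here is a small sketch. The class JournalMetaSketch and its two getters are hypothetical and not part of JATSParser; the XPath expressions assume the usual journal-meta layout of a JATS front section.

```php
<?php
declare(strict_types=1);

// Hypothetical class: the name and getters are illustrative only.
final class JournalMetaSketch
{
    /** @var DOMXPath */
    private $xpath;

    public function __construct(DOMDocument $document)
    {
        $this->xpath = new DOMXPath($document);
    }

    /** @return array<string, string> journal-id values keyed by journal-id-type */
    public function getJournalIds(): array
    {
        $ids = [];
        foreach ($this->xpath->query('//journal-meta/journal-id[@journal-id-type]') as $node) {
            $ids[$node->getAttribute('journal-id-type')] = trim($node->textContent);
        }
        return $ids;
    }

    public function getJournalTitle(): ?string
    {
        $node = $this->xpath->query('//journal-meta//journal-title')->item(0);
        return $node === null ? null : trim($node->textContent);
    }
}

$xml = <<<XML
<article>
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">demo</journal-id>
      <journal-id journal-id-type="nlm-ta">Demo J</journal-id>
      <journal-title-group>
        <journal-title>Demo Journal</journal-title>
      </journal-title-group>
    </journal-meta>
  </front>
</article>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$meta = new JournalMetaSketch($doc);

print_r($meta->getJournalIds());
echo $meta->getJournalTitle(), PHP_EOL;
```

Compared with a single "metadata" array, each getter documents its own return type, and callers do not need to know the array keys in advance.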
It seems that <fig> tags are not correctly recognized if present inside a <p>.
Actually, according to the documentation this is possible (https://jats.nlm.nih.gov/publishing/tag-library/1.1/element/fig.html).
I've tried replacing ./fig with .//fig here, but that doesn't seem to work...
JATSParser/src/JATSParser/Body/Document.php
Line 118 in 2e2a8ae
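For reference, the difference between the two XPath expressions can be reproduced in isolation with PHP's DOM extension. The XML fragment and variable names below are made up for the example and do not come from the parser.

```php
<?php
declare(strict_types=1);

$xml = <<<XML
<body>
  <sec>
    <fig id="direct"/>
    <p>Text <fig id="nested"/> more text.</p>
  </sec>
</body>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$sec = $xpath->query('//sec')->item(0);

$children = $xpath->query('./fig', $sec);     // direct children of <sec> only
$descendants = $xpath->query('.//fig', $sec); // any depth, including inside <p>

echo $children->length, ' vs ', $descendants->length, PHP_EOL; // prints "1 vs 2"
```

So .//fig does select the nested figure, but the result is a flat node list that no longer records where inside the paragraph the figure sat. If the surrounding code expects only top-level block elements, that may be why widening the query alone is not enough.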