Should the Document constructor use a different parser for collapse_whitespace? #54

JKatzwinkel · 2022-07-28T20:01:14Z

lxml.etree.tostring with pretty_print=True has this caveat:

If lxml cannot distinguish between whitespace and data, it will not alter your data. Whitespace is therefore only added between nodes that do not contain data. This is always the case for trees constructed element-by-element, so no problems should be expected here. For parsed trees, a good way to assure that no conflicting whitespace is left in the tree is the remove_blank_text option [...]

Now, instantiating a delb.Document with the collapse_whitespace flag feels like it should do away with whitespace in a way that makes the parsed XML suitable for custom formatting, e.g. by calling:

  lxml.etree.tostring(document.root._etree_obj, pretty_print=True)

...or something like this. However, in order to be able to pretty print delb content, it is still necessary to use a custom parser on instantiation, e.g.

  document = Document(source, parser=etree.XMLParser(remove_blank_text=True))

...in which case the collapse_whitespace flag of the Document constructor isn't even relevant.
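For illustration, here is a minimal sketch of the lxml caveat quoted above, using plain lxml without delb. The sample document and variable names are made up for this example; only remove_blank_text and pretty_print are taken from the discussion.

```python
from lxml import etree

# Hypothetical sample: elements separated by whitespace-only text nodes.
SOURCE = b"<root>\n<a>1</a>\n<b>2</b>\n</root>"

# Default parser: the whitespace-only text nodes stay in the tree, so
# pretty_print cannot tell them apart from data and leaves the layout alone.
default_root = etree.fromstring(SOURCE)
kept = etree.tostring(default_root, pretty_print=True)

# Parser that drops whitespace-only text nodes: pretty_print is now free
# to insert its own line breaks and indentation.
stripping_parser = etree.XMLParser(remove_blank_text=True)
stripped_root = etree.fromstring(SOURCE, parser=stripping_parser)
reindented = etree.tostring(stripped_root, pretty_print=True)

print(kept.decode())
print(reindented.decode())
```

Only the second output should come back indented. Whether collapse_whitespace ought to imply such a parser is exactly the open question here.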

I feel like wanting to pretty-print delb objects is a somewhat justified use case (I needed it today in order to simplify a test), and I think that this behaviour is somewhat obscured right now and should at least be documented in some way. But maybe this could even be handled in a more user-friendly way. Is there a point in using delb.Document with collapse_whitespace without an lxml parser that also removes whitespace, or could the use of such a parser perhaps be implied by collapse_whitespace in general?

Should TagNode have a tostring method with an optional pretty_print flag as well?

funkyfuture · 2022-07-30T13:24:22Z

after taking a few superficial glances, just some open ended threads:

the _collapse_whitespace method implements the recommendations linked in Document.collapse_whitespace, iirc
i assume that these recommendations generally make sense for text documents
i also assume that remove_blank_text is rather harsh in its doings, i guess i would have used that option if it was compatible with the aforementioned recommendations
does lxml's pretty serialization comply with these recommendations?
a serialization must not add whitespace here, for example: la<hi rendition="u">la</hi>la
can we agree that a pretty serialization must yield a result that is, when parsed again with the collapse_whitespace option, identical to the result of parsing the non-pretty serialization?
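A rough sketch of that round-trip criterion, checked on text content only. It uses the standard library rather than delb or lxml: xml.dom.minidom stands in as an example of a serializer that indents text nodes too, and collapsed_text and survives_round_trip are hypothetical helpers. The whitespace collapsing here is a crude approximation, not delb's collapse_whitespace.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# The mixed-content sample from the list above, wrapped in a parent element.
SAMPLE = '<p>la<hi rendition="u">la</hi>la</p>'


def collapsed_text(xml: str) -> str:
    """Return the text content with runs of whitespace collapsed to one space."""
    text = "".join(ET.fromstring(xml).itertext())
    return " ".join(text.split())


def survives_round_trip(original: str, pretty: str) -> bool:
    """Check the proposed invariant on text content only (a simplification)."""
    return collapsed_text(original) == collapsed_text(pretty)


# minidom puts every child node of a mixed-content element on its own
# indented line, text nodes included.
pretty = minidom.parseString(SAMPLE).toprettyxml(indent="  ")

print(pretty)
print(survives_round_trip(SAMPLE, SAMPLE))  # True
print(survives_round_trip(SAMPLE, pretty))  # False
```

The indented output reads as "la la la" after collapsing, whereas the original reads as "lalala", so a serializer behaving like this would fail the criterion.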

funkyfuture · 2022-08-07T15:22:33Z

i've been thinking that whitespace would make a good major topic for the 0.5 version. and, instead of relying on libxml specifics, we can implement serialization natively. that's to be engineered for the Rust implementation anyway and we can look at the API design (pretty-formatted string representations? always move all namespace declarations to the root node when serializing a document?).

funkyfuture · 2022-11-02T13:21:07Z

the last cell in docs/getting_started.ipynb also gives a great example where pretty printing is broken. so that demo case could be taken as one test.

funkyfuture · 2023-04-27T09:28:20Z

i started looking into this which led me to realise that the serialization doesn't consider the xml:space attribute yet.
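As a sketch of what considering xml:space could involve: the attribute is inherited by descendants until one of them overrides it, so a serializer has to track it while walking the tree. The helper below is hypothetical and uses the standard library's ElementTree, not delb's API.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def preserving_elements(element, inherited=False):
    """Yield tags of elements whose whitespace a serializer must leave alone."""
    value = element.get(XML_SPACE)
    # An explicit value wins; otherwise the state of the parent applies.
    preserve = inherited if value is None else value == "preserve"
    if preserve:
        yield element.tag
    for child in element:
        yield from preserving_elements(child, preserve)


root = ET.fromstring(
    "<text><p>prose</p>"
    '<lg xml:space="preserve"><l>  verse  </l>'
    '<note xml:space="default">aside</note></lg></text>'
)
print(list(preserving_elements(root)))  # ['lg', 'l']
```

A pretty printer would then skip indentation and collapsing inside the reported elements.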

funkyfuture · 2024-08-22T08:40:12Z

quick update: yesterday i was honest enough to meself to realize that i'm actually traumatized by the task of producing properly placed whitespace. but i still think the target is in eye's sight. let's hope the XML Foundation covers rehab.

funkyfuture · 2024-09-10T15:36:39Z

right now i'm in a manic phase (yes, digging to solve the problem got me to new experiences) and i imagine that the implementation will produce the most beautifullest XML that the world has ever seen and only the radiated überhumen on Mars will be able to deliver something better. anyway, after a few hints by @zed-g i have the idea to compile an appendix for the documentation that compares "pretty" XML serializations produced by different serializers for a small variety of samples.

funkyfuture added the design Proposals and discussion of API changes label Jul 30, 2022

funkyfuture added this to the 0.5 milestone Oct 9, 2022

funkyfuture mentioned this issue Sep 18, 2024

Serializations for mixed content documents #94

Closed
