-
-
Notifications
You must be signed in to change notification settings - Fork 320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure get_contents() method of Node objects return 'bytes' and fix decoding errors in Value.get_csig() #3738
base: master
Are you sure you want to change the base?
Conversation
Dir, Alias, ActionCaller compute their contents. The get_contents() of Dir and Alias now return a utf-8 formatted string. The content of ActionCaller is usually the 'co_code' attribute of the code object of the function it wraps, when this is unavailable, return the 'repr' of the wrapped function encoded with default options (currently 'utf-8').
Some mock objects' get_contents() were returning strings, modify them to return bytes. Some tests were asserting get_contents() equals a string, modify expected values to be bytes. For tests that check the value of Dir.get_contents() and Alias.get_contents(), make the tests explicitly expect 'utf-8' encoded content.
Value.get_contents() is currently calculating its binary contents by getting its text contents via Value.get_text_contents() and encoding the returned text content. This can be a problem if get_text_contents() fails. The current implementation of File.get_text_contents() never fails, as a last resort, undecodable bytes are handled by 'backslashreplace', but Value.get_text_contents() calculates its text content by decoding child node binary contents manually using bytes.decode() with default options (encoding='utf-8' and errors='strict') so has the potential to fail (e.g. a child File node contains non-utf8 binary data). Reimplement get_contents() to mirror get_text_contents() implementation as a potential solution. The current get_csig() also uses the return of get_text_contents() as the content signature. Because get_text_contents() might fail, make get_csig() return get_contents() decoded with default encoding and errors='backslashreplace'.
SCons/Node/Python.py
Outdated
return text_contents | ||
contents = str(self.value).encode('utf-8') | ||
for kid in self.children(None): | ||
contents = contents + kid.get_contents() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change this to contents = contents + kid.get_csig()? Otherwise the memory usage of Value targets gets out of control.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a few extra commits, first changing Value.get_csig()
to use child csigs instead of content, and then making Value.get_text_contents()
to just be a concatenation of child csigs prepended with the stringified value
attribute like you suggest. The Value.get_contents()
then just return an encoded get_text_contents()
and Value.get_csig()
just wraps get_text_contents()
as well.
I'm not sure where the 'contents' of a Value
node is actually used, but the second change failed these tests:
SCons/SConfTests.py
test/Configure/ConfigureDryRunError.py
test/Configure/VariantDir-SConscript.py
test/Configure/VariantDir.py
test/Configure/basic.py
test/Configure/cache-not-ok.py
test/Configure/cache-ok.py
test/Configure/config-h.py
test/Configure/custom-tests.py
test/Configure/issue-3469/issue-3469.py
test/Configure/option--config.py
test/Value.py
test/explain/basic.py
test/option-n.py
test/question/Configure.py
test/sconsign/script/Configure.py
test/textfile/textfile.py
The first change passes existing tests (after they've been modified to expect bytes from get_contents()
calls in the previous PR commits).
I am running tests with Python 3.8.2 on Ubuntu 20.04 in a virtual environment.
SCons/Node/Python.py
Outdated
# Already encoded as python2 str are bytes | ||
return text_contents | ||
contents = str(self.value).encode('utf-8') | ||
for kid in self.children(None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
child
seems to be in use everywhere else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer child
as well but it looks like kid
is still used frequently in SCons/Node/*.
Please add a blurb to CHANGES.txt |
Just a note this seems to be related to #3384 as well. I would love to have these fixes in |
Current implementation might lead to high memory usage. New implementation depends on children csigs instead. Remove test that makes sure Value.get_csig() works even if child binary contents are invalid utf-8 since get_csig() no longer depends on children content.
Current implementation of Value.get_text_contents() returns a concatenation of all child contents. Instead, make the contents of a Value similar to an Alias, except prepend the stringified 'value' attribute as well. Since Value.get_text_contents() will no longer fail (as long as child nodes implement a working get_csig() which they should), make get_contents() return get_text_contents() but utf-8 encoded. Also make get_csig() just return get_text_contents() Add new tests for testing get_contents() and get_text_contents() only rely on child node csigs, and not directly on child content.
The Lines 2043 to 2055 in 9861322
Should I remove it? |
contents = str(self.value) | ||
for kid in self.children(None): | ||
contents = contents + kid.get_contents().decode() | ||
contents += ''.join(n.get_csig() for n in self.children()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I've said before I'm not sure concatenating the contents of the child was correct, and this seems even less correct.
Does this also cover issue #3093? |
I'd say hold off on any more work on this until we determine if anything breaks if we remove including child nodes contents in get_contents(). |
Seeing this stalled out completely. I've been gradually sneaking in typing annotations in the codebase - we can do more now that 3.6 is the baseline Python versions, I could really only add return types before. One of the things that shows up as a consistency check problem - it would be nice to declare all |
I believe the sticking point from last time was what Using concat of child csigs leads to some tests to fail, and I'm not sure if it was figured out what the rationale for using the concat of child contents was. I had run out of time to look into these two issues unfortunately. On the topic of ensuring Deciding what |
Honestly it should do neither. The node's children shouldn't be added into it's contents. |
Fix #3737
Using
Value
as a target node sometimes fails in the calculation of its content signature inget_csig()
.Most existing
get_contents()
methods already returnbytes
/bytearray
s, but a few are returningstr
s. This is causing problems in some functions (e.g.Value.get_text_contents()
) asstr
s do not have adecode()
method.The current implementation of
Value.get_csig()
uses the return ofValue.get_text_contents()
as the csig. The methodValue.get_text_contents()
calculates its return value by converting theValue
'svalue
attribute to a string, and then appending its children's content. Child node content are calculated by calling theirget_contents()
to get their binary data, and then calling.decode()
on that. This can cause aUnicodeDecodeError
if the binary content of a child node is not valid utf-8.To fix this, I have looked through the
get_contents()
method in eachNode
subclass (the ones I could find at least) and made appropriate changes to make sure they always returnbytes
. TheDir
andAlias
nodes construct their binary content entirely as a string, they now return the utf-8 encoding of their original return values.I have also reimplemented
Value.get_contents()
similarly to howValue.get_text_contents()
works. I don't think this changes the output ofValue.get_contents()
.I've also reimplemented
Value.get_csig()
to useValue.get_contents()
anddecode()
that (witherrors='backslashreplace'
) to use as the csig.Alternatively,
Value.get_text_contents()
could be modified to work around decoding errors (using something likeerrors='backslashreplace'
). I'm not sure ifget_text_contents()
is meant to always return something, even if the binary contents of the node is not encoded text. The implementation ofFile.get_text_contents()
useserrors='backslashreplace'
to handle decoding errors.I've added two rough tests with duplicated code to test that
Value.get_contents()
andValue.get_csig()
work even when it has a child node with non-utf8 binary contents.I've also modified tests that were comparing
get_contents()
against strings for equality, and some mock objects that were returning strings in theirget_contents()
method. I don't think all the modifications to the test code were necessary to get my changes to pass the tests, but I've modified them anyway for completeness.Contributor Checklist:
CHANGES.txt
(and read theREADME.rst
)