Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

findtext returns empty string on integer zero value #91447

Closed
bitsgalore opened this issue Apr 11, 2022 · 16 comments · Fixed by #91486
Closed

findtext returns empty string on integer zero value #91447

bitsgalore opened this issue Apr 11, 2022 · 16 comments · Fixed by #91486
Labels
expert-XML type-bug An unexpected behavior, bug, or error

Comments

@bitsgalore
Copy link

bitsgalore commented Apr 11, 2022

ElementTree's "findtext" function returns an empty string value if the element's text field contains an integer with value 0. Below example illustrates the issue:

import xml.etree.ElementTree as ET

# Create root element
test = ET.Element("test")

# Compute sum and difference on two numbers
firstNumber = 5000
secondNumber = 5000

sum = firstNumber + secondNumber
difference = firstNumber - secondNumber

# Create subelements for sum and difference values
el = ET.SubElement(test, "sum")
el.text = sum

el = ET.SubElement(test, "diff")
el.text = difference

# Print data types
print("type(sum): " + str(type(sum)))
print("type(diff): " + str(type(difference)))
print("type(sum subelt): " + str(type(test.findtext('./sum'))))
print("type(diff subelt): " + str(type(test.findtext('./diff'))))

# Print values of sub elements
print("sum subelt: " + str(test.findtext('./sum')))
print("diff subelt: " + str(test.findtext('./diff')))

This gives me the following output:

type(sum): <class 'int'>
type(diff): <class 'int'>
type(sum subelt): <class 'int'>
type(diff subelt): <class 'str'>
sum subelt: 10000
diff subelt: 

Note how the data type of the "diff" sub-element is "string", even though the source data is an integer.

I'm using Python 3.8.10 on Linux Mint 20.1 Ulyssa (based on Ubuntu Focal Fossa 20.04).

@AlexWaygood AlexWaygood added type-bug An unexpected behavior, bug, or error expert-XML labels Apr 11, 2022
@eugenetriguba
Copy link
Contributor

eugenetriguba commented Apr 12, 2022

I anticipate this would happen for any "falsey" text attribute value. The findtext function is returning elem.text or "" and 0 is leading it to return the empty string.

For instance:

>>> import xml.etree.ElementTree as ET
>>> test = ET.Element("test")
>>> el = ET.SubElement(test, "diff")
>>> el.text = False
>>> print(test.findtext('./diff'))

>>> 

You can stringify the value for the text attribute as a workaround.

>>> el.text = str(0)
>>> print(test.findtext('./diff'))
0

But I'm not familiar with why the current behavior is the way it is. From the API docs: "Note that if the matching element has no text content an empty string is returned."

What is considered "no text content"? Any reason not to do something like the following?

if elem.text is None:
    return ""

return elem.text

@bitsgalore
Copy link
Author

bitsgalore commented Apr 12, 2022

@eugenetriguba Good call, hadn't tried the Boolean False myself! I ran into this problem while writing some unit tests that take a large element tree with low-level properties of an ISO image file as input. Since these values are used internally by the software for various calculations, stringifying them isn't really an option, as that would make things unnecessarily complex.

My own interpretation of "no text content" would also be a "None" match. After posting this, I realized I'd come across this very same issue some ten years ago in another project (for reasons I can't remember I never got round to reporting it back then!). At the time we created a replacement for the findText function (link here), which is along similar lines as your suggestion.

@JelleZijlstra
Copy link
Member

JelleZijlstra commented May 2, 2022

Is it even meaningful for elements to have an int or boolean as their .text attribute? I'd expect text to be either a string or None.

@eugenetriguba
Copy link
Contributor

eugenetriguba commented May 11, 2022

@JelleZijlstra I thought about that as well but felt like this was the less controversial change.

It fixes what certainly seems like unintended and incorrect behavior. I assumed we could revisit the text attribute returning different types in a different issue since that is what the behavior seems to have been anyway. What are your thoughts there?

@bitsgalore
Copy link
Author

bitsgalore commented May 11, 2022

@JelleZijlstra @eugenetriguba although I completely agree it's somewhat counter-intuitive that the .text attribute can have pretty much any type, some thoughts on this:

First, the current behavior has been around for a very long time. Because of this, I suspect restricting the type to string-only in some future Python release would break a lot of existing code.

Second, I think it's good to keep in mind that ElementTree and the Element object were originally designed to store/represent arbitrary hierarchical data structures, not just XML (even though this is an important use case). See also this 2014 archived snapshot from the (now defunct) effbot site:

The Element type is a simple but flexible container object, designed to store hierarchical data structures, such as simplified XML infosets, in memory. The element type can be described as a cross between a Python list and a Python dictionary.

I was surprised to see that this scope appears to have narrowed somewhat in the current documentation, which just describes ElementTree as an XML library.

In my own case I'm using ElementTree extensively in the jpylyzer software, of which I'm the lead developer. This is a validator tool for the JP2 (JPEG 2000 Part 1) still image format. Since the JP2 format follows a hierarchical tree structure, I'm using the Element object to create an internal representation of a JP2 file. Within this Element object, all extracted header fields from a JP2 are stored as .text fields, using their original type (e.g. short/long/signed/unsigned integer, etc., just following the JP2 filespec). This is then used to do a bunch of checks and analyses. In the end the tool does write an XML representation to file, but for that it does some extra work and sanity checking.

Any future change to ElementTree that restricts the .text attribute to strings-only would completely break my code, and it would also make the Element object far less useful for representing hierarchical data structures that aren't XML. I suspect other projects would be similarly affected by this, probably even a lot.

My suggestion would be to address this (as a separate issue) primarily as a documentation issue, i.e. clearly explain the current behavior in the ElementTree documentation. I think it would also be helpful to be more explicit about the scope of ElementTree and its Element object (generic structure for hierarchical data structures vs structure for just XML), since there seems to be a discrepancy here between the current docs and the original Effbot documentation.

@bitsgalore
Copy link
Author

bitsgalore commented May 11, 2022

Oh, almost forgot, @eugenetriguba, many thanks (belatedly on my part) for fixing the original issue with #91486, seems I somehow overlooked the notification for this fix!

@JelleZijlstra
Copy link
Member

JelleZijlstra commented May 11, 2022

Thanks @bitsgalore, that's very helpful! I didn't know ElementTree is useful for use cases other than xml (it's in the xml package, after all). Given your use case, we should continue to support non-string text fields. I agree this could be documented better.

I'm still hesitant about changing the behavior as proposed in #91486 though; it could break other users who put non-string content in their text fields.

@eugenetriguba
Copy link
Contributor

eugenetriguba commented May 11, 2022

@JelleZijlstra Could you explain that? I'm not quite understanding your concern.

Currently falsey values (like the integer 0) return "" rather than the actual .text attribute that was set. That MR aims to resolve it so the falsey value is actually returned from findtext rather than giving an empty string for any falsey value. It doesn't change non-string values to be strings.

@JelleZijlstra
Copy link
Member

JelleZijlstra commented May 11, 2022

@JelleZijlstra Could you explain that? I'm not quite understanding your concern.

Currently falsey values (like the integer 0) return "" rather than the actual .text attribute that was set. That MR aims to resolve it so the falsey value is actually returned from findtext rather than giving an empty string for any falsey value. It doesn't change non-string values to be strings.

It changes user-visible behavior that users could be relying on. Sure, it's not very intuitive behavior and we wouldn't design it like this from scratch, but it's been like this for a long time, and there may well be code out there relying on this behavior.

@eugenetriguba
Copy link
Contributor

eugenetriguba commented May 11, 2022

@JelleZijlstra Edited my comment and then saw yours come through. Here was my expanded portion

I assume you're saying that if someone has the text content set to [] or False or whatever, they may be relying on the fact that it returns an empty string?

The docs say it should return an empty string for the following: "Note that if the matching element has no text content an empty string is returned."

It depends on how you want to interpret "no text content." I interpret it as being None and not 0 or an empty list.

@eugenetriguba
Copy link
Contributor

eugenetriguba commented May 11, 2022

@JelleZijlstra But yes, I agree. It does still have that potential. But I would think it is closer aligned to how it is currently documented.

@eugenetriguba
Copy link
Contributor

eugenetriguba commented Jul 21, 2022

@JelleZijlstra Is there someone or somewhere I could reach out to who could make a decision on the best way to move forward here? I tried sending an email to the core mentorship mailing list a bit over 2 weeks ago, but I got an automated email back that it is pending approval to be sent.

@JelleZijlstra
Copy link
Member

JelleZijlstra commented Jul 21, 2022

Maybe start a discussion at discuss.python.org?

@gvanrossum
Copy link
Member

gvanrossum commented Jul 22, 2022

The original test case here seems pretty awkward. Is this a proper demonstration of the issue?

~/cpython$ python3
Python 3.11.0b4 (v3.11.0b4:5a7e1e0a92, Jul 11 2022, 12:47:14) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> test = ET.Element("test")
>>> el = ET.SubElement(test, "sum")
>>> el.text = 42
>>> el = ET.SubElement(test, "diff")
>>> el.text = 0
>>> test.findtext('./sum')
42
>>> test.findtext('./diff')
''
>>> ^D
~/cpython$ 

IIUC you'd like the final output to be the integer 0, not an empty string object. Right? Or is there a reason the example is so convoluted?

@bitsgalore
Copy link
Author

bitsgalore commented Aug 1, 2022

@gvanrossum Yes, your example demonstrates the issue perfectly. Looking back at my original example, I agree it's a bit overly convoluted.

Also, as @eugenetriguba pointed out above, the same thing happens for other "falsey" text attributes, e.g.:

>>> el = ET.SubElement(test, "false")
>>> el.text = False
>>> test.findtext('./false')
''

@gvanrossum
Copy link
Member

gvanrossum commented Aug 1, 2022

Honestly I think I should just merge your PR GH-91486 and be done with it. It will not land in 3.11 though, I don't feel this is a release blocker since the buggy behavior has been around for such a long time. We might fix it in 3.11.1 but that's up to the release manager.

miss-islington pushed a commit that referenced this issue Aug 1, 2022
The API documentation for [findtext](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findtext) states that this function gives back an empty string on "no text content." With the previous implementation, this would give back a empty string even on text content values such as 0 or False. This patch attempts to resolve that by only giving back an empty string if the text attribute is set to `None`. Resolves #91447.

Automerge-Triggered-By: GH:gvanrossum
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Aug 1, 2022
…thonGH-91486)

The API documentation for [findtext](https://docs.python.org/3/library/xml.etree.elementtree.htmlGH-xml.etree.ElementTree.Element.findtext) states that this function gives back an empty string on "no text content." With the previous implementation, this would give back a empty string even on text content values such as 0 or False. This patch attempts to resolve that by only giving back an empty string if the text attribute is set to `None`. Resolves pythonGH-91447.

Automerge-Triggered-By: GH:gvanrossum
(cherry picked from commit a95e60d)

Co-authored-by: Eugene Triguba <eugenetriguba@gmail.com>
miss-islington added a commit that referenced this issue Aug 1, 2022
The API documentation for [findtext](https://docs.python.org/3/library/xml.etree.elementtree.htmlGH-xml.etree.ElementTree.Element.findtext) states that this function gives back an empty string on "no text content." With the previous implementation, this would give back a empty string even on text content values such as 0 or False. This patch attempts to resolve that by only giving back an empty string if the text attribute is set to `None`. Resolves GH-91447.

Automerge-Triggered-By: GH:gvanrossum
(cherry picked from commit a95e60d)

Co-authored-by: Eugene Triguba <eugenetriguba@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expert-XML type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants