findtext returns empty string on integer zero value #91447

bitsgalore · 2022-04-11T16:34:11Z

ElementTree's "findtext" function returns an empty string value if the element's text field contains an integer with value 0. Below example illustrates the issue:

import xml.etree.ElementTree as ET

# Create root element
test = ET.Element("test")

# Compute sum and difference on two numbers
firstNumber = 5000
secondNumber = 5000

sum = firstNumber + secondNumber
difference = firstNumber - secondNumber

# Create subelements for sum and difference values
el = ET.SubElement(test, "sum")
el.text = sum

el = ET.SubElement(test, "diff")
el.text = difference

# Print data types
print("type(sum): " + str(type(sum)))
print("type(diff): " + str(type(difference)))
print("type(sum subelt): " + str(type(test.findtext('./sum'))))
print("type(diff subelt): " + str(type(test.findtext('./diff'))))

# Print values of sub elements
print("sum subelt: " + str(test.findtext('./sum')))
print("diff subelt: " + str(test.findtext('./diff')))

This gives me the following output:

type(sum): <class 'int'>
type(diff): <class 'int'>
type(sum subelt): <class 'int'>
type(diff subelt): <class 'str'>
sum subelt: 10000
diff subelt:

Note how the data type of the "diff" sub-element is "string", even though the source data is an integer.

I'm using Python 3.8.10 on Linux Mint 20.1 Ulyssa (based on Ubuntu Focal Fossa 20.04).

The text was updated successfully, but these errors were encountered:

eugenetriguba · 2022-04-12T03:10:48Z

I anticipate this would happen for any "falsey" text attribute value. The findtext function is returning elem.text or "" and 0 is leading it to return the empty string.

For instance:

>>> import xml.etree.ElementTree as ET
>>> test = ET.Element("test")
>>> el = ET.SubElement(test, "diff")
>>> el.text = False
>>> print(test.findtext('./diff'))

>>>

You can stringify the value for the text attribute as a workaround.

>>> el.text = str(0)
>>> print(test.findtext('./diff'))
0

But I'm not familiar with why the current behavior is the way it is. From the API docs: "Note that if the matching element has no text content an empty string is returned."

What is considered "no text content"? Any reason not to do something like the following?

if elem.text is None:
    return ""

return elem.text

bitsgalore · 2022-04-12T10:51:10Z

@eugenetriguba Good call, hadn't tried the Boolean False myself! I ran into this problem while writing some unit tests that take a large element tree with low-level properties of an ISO image file as input. Since these values are used internally by the software for various calculations, stringifying them isn't really an option, as that would make things unnecessarily complex.

My own interpretation of "no text content" would also be a "None" match. After posting this, I realized I'd come across this very same issue some ten years ago in another project (for reasons I can't remember I never got round to reporting it back then!). At the time we created a replacement for the findText function (link here), which is along similar lines as your suggestion.

JelleZijlstra · 2022-05-02T16:43:37Z

Is it even meaningful for elements to have an int or boolean as their .text attribute? I'd expect text to be either a string or None.

eugenetriguba · 2022-05-11T01:38:41Z

@JelleZijlstra I thought about that as well but felt like this was the less controversial change.

It fixes what certainly seems like unintended and incorrect behavior. I assumed we could revisit the text attribute returning different types in a different issue since that is what the behavior seems to have been anyway. What are your thoughts there?

bitsgalore · 2022-05-11T13:43:12Z

@JelleZijlstra @eugenetriguba although I completely agree it's somewhat counter-intuitive that the .text attribute can have pretty much any type, some thoughts on this:

First, the current behavior has been around for a very long time. Because of this, I suspect restricting the type to string-only in some future Python release would break a lot of existing code.

Second, I think it's good to keep in mind that ElementTree and the Element object were originally designed to store/represent arbitrary hierarchical data structures, not just XML (even though this is an important use case). See also this 2014 archived snapshot from the (now defunct) effbot site:

The Element type is a simple but flexible container object, designed to store hierarchical data structures, such as simplified XML infosets, in memory. The element type can be described as a cross between a Python list and a Python dictionary.

I was surprised to see that this scope appears to have narrowed somewhat in the current documentation, which just describes ElementTree as an XML library.

In my own case I'm using ElementTree extensively in the jpylyzer software, of which I'm the lead developer. This is a validator tool for the JP2 (JPEG 2000 Part 1) still image format. Since the JP2 format follows a hierarchical tree structure, I'm using the Element object to create an internal representation of a JP2 file. Within this Element object, all extracted header fields from a JP2 are stored as .text fields, using their original type (e.g. short/long/signed/unsigned integer, etc., just following the JP2 filespec). This is then used to do a bunch of checks and analyses. In the end the tool does write an XML representation to file, but for that it does some extra work and sanity checking.

Any future change to ElementTree that restricts the .text attribute to strings-only would completely break my code, and it would also make the Element object far less useful for representing hierarchical data structures that aren't XML. I suspect other projects would be similarly affected by this, probably even a lot.

My suggestion would be to address this (as a separate issue) primarily as a documentation issue, i.e. clearly explain the current behavior in the ElementTree documentation. I think it would also be helpful to be more explicit about the scope of ElementTree and its Element object (generic structure for hierarchical data structures vs structure for just XML), since there seems to be a discrepancy here between the current docs and the original Effbot documentation.

bitsgalore · 2022-05-11T15:42:41Z

Oh, almost forgot, @eugenetriguba, many thanks (belatedly on my part) for fixing the original issue with #91486, seems I somehow overlooked the notification for this fix!

JelleZijlstra · 2022-05-11T16:31:09Z

Thanks @bitsgalore, that's very helpful! I didn't know ElementTree is useful for use cases other than xml (it's in the xml package, after all). Given your use case, we should continue to support non-string text fields. I agree this could be documented better.

I'm still hesitant about changing the behavior as proposed in #91486 though; it could break other users who put non-string content in their text fields.

eugenetriguba · 2022-05-11T16:37:48Z

@JelleZijlstra Could you explain that? I'm not quite understanding your concern.

Currently falsey values (like the integer 0) return "" rather than the actual .text attribute that was set. That MR aims to resolve it so the falsey value is actually returned from findtext rather than giving an empty string for any falsey value. It doesn't change non-string values to be strings.

JelleZijlstra · 2022-05-11T16:40:02Z

@JelleZijlstra Could you explain that? I'm not quite understanding your concern.

Currently falsey values (like the integer 0) return "" rather than the actual .text attribute that was set. That MR aims to resolve it so the falsey value is actually returned from findtext rather than giving an empty string for any falsey value. It doesn't change non-string values to be strings.

It changes user-visible behavior that users could be relying on. Sure, it's not very intuitive behavior and we wouldn't design it like this from scratch, but it's been like this for a long time, and there may well be code out there relying on this behavior.

eugenetriguba · 2022-05-11T16:42:55Z

@JelleZijlstra Edited my comment and then saw yours come through. Here was my expanded portion

I assume you're saying that if someone has the text content set to [] or False or whatever, they may be relying on the fact that it returns an empty string?

The docs say it should return an empty string for the following: "Note that if the matching element has no text content an empty string is returned."

It depends on how you want to interpret "no text content." I interpret it as being None and not 0 or an empty list.

eugenetriguba · 2022-05-11T16:43:49Z

@JelleZijlstra But yes, I agree. It does still have that potential. But I would think it is closer aligned to how it is currently documented.

eugenetriguba · 2022-07-21T00:33:40Z

@JelleZijlstra Is there someone or somewhere I could reach out to who could make a decision on the best way to move forward here? I tried sending an email to the core mentorship mailing list a bit over 2 weeks ago, but I got an automated email back that it is pending approval to be sent.

JelleZijlstra · 2022-07-21T03:57:12Z

Maybe start a discussion at discuss.python.org?

gvanrossum · 2022-07-22T02:09:24Z

The original test case here seems pretty awkward. Is this a proper demonstration of the issue?

~/cpython$ python3
Python 3.11.0b4 (v3.11.0b4:5a7e1e0a92, Jul 11 2022, 12:47:14) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> test = ET.Element("test")
>>> el = ET.SubElement(test, "sum")
>>> el.text = 42
>>> el = ET.SubElement(test, "diff")
>>> el.text = 0
>>> test.findtext('./sum')
42
>>> test.findtext('./diff')
''
>>> ^D
~/cpython$

IIUC you'd like the final output to be the integer 0, not an empty string object. Right? Or is there a reason the example is so convoluted?

bitsgalore · 2022-08-01T13:49:25Z

@gvanrossum Yes, your example demonstrates the issue perfectly. Looking back at my original example, I agree it's a bit overly convoluted.

Also, as @eugenetriguba pointed out above, the same thing happens for other "falsey" text attributes, e.g.:

>>> el = ET.SubElement(test, "false")
>>> el.text = False
>>> test.findtext('./false')
''

gvanrossum · 2022-08-01T16:25:36Z

Honestly I think I should just merge your PR GH-91486 and be done with it. It will not land in 3.11 though, I don't feel this is a release blocker since the buggy behavior has been around for such a long time. We might fix it in 3.11.1 but that's up to the release manager.

The API documentation for [findtext](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findtext) states that this function gives back an empty string on "no text content." With the previous implementation, this would give back a empty string even on text content values such as 0 or False. This patch attempts to resolve that by only giving back an empty string if the text attribute is set to `None`. Resolves #91447. Automerge-Triggered-By: GH:gvanrossum

…thonGH-91486) The API documentation for [findtext](https://docs.python.org/3/library/xml.etree.elementtree.htmlGH-xml.etree.ElementTree.Element.findtext) states that this function gives back an empty string on "no text content." With the previous implementation, this would give back a empty string even on text content values such as 0 or False. This patch attempts to resolve that by only giving back an empty string if the text attribute is set to `None`. Resolves pythonGH-91447. Automerge-Triggered-By: GH:gvanrossum (cherry picked from commit a95e60d) Co-authored-by: Eugene Triguba <eugenetriguba@gmail.com>

The API documentation for [findtext](https://docs.python.org/3/library/xml.etree.elementtree.htmlGH-xml.etree.ElementTree.Element.findtext) states that this function gives back an empty string on "no text content." With the previous implementation, this would give back a empty string even on text content values such as 0 or False. This patch attempts to resolve that by only giving back an empty string if the text attribute is set to `None`. Resolves GH-91447. Automerge-Triggered-By: GH:gvanrossum (cherry picked from commit a95e60d) Co-authored-by: Eugene Triguba <eugenetriguba@gmail.com>

AlexWaygood added type-bug An unexpected behavior, bug, or error expert-XML labels Apr 11, 2022

eugenetriguba mentioned this issue Apr 12, 2022

gh-91447: Fix findtext to only give an empty string on None #91486

Merged

miss-islington closed this as completed in #91486 Aug 1, 2022

miss-islington mentioned this issue Aug 1, 2022

[3.11] gh-91447: Fix findtext to only give an empty string on None (GH-91486) #95543

Merged

findtext returns empty string on integer zero value #91447

findtext returns empty string on integer zero value #91447

bitsgalore commented Apr 11, 2022

eugenetriguba commented Apr 12, 2022 •

edited

bitsgalore commented Apr 12, 2022 •

edited

JelleZijlstra commented May 2, 2022

eugenetriguba commented May 11, 2022

bitsgalore commented May 11, 2022 •

edited

bitsgalore commented May 11, 2022 •

edited

JelleZijlstra commented May 11, 2022

eugenetriguba commented May 11, 2022 •

edited

JelleZijlstra commented May 11, 2022

eugenetriguba commented May 11, 2022

eugenetriguba commented May 11, 2022

eugenetriguba commented Jul 21, 2022

JelleZijlstra commented Jul 21, 2022

gvanrossum commented Jul 22, 2022

bitsgalore commented Aug 1, 2022

gvanrossum commented Aug 1, 2022

findtext returns empty string on integer zero value #91447

findtext returns empty string on integer zero value #91447

Comments

bitsgalore commented Apr 11, 2022

eugenetriguba commented Apr 12, 2022 • edited

bitsgalore commented Apr 12, 2022 • edited

JelleZijlstra commented May 2, 2022

eugenetriguba commented May 11, 2022

bitsgalore commented May 11, 2022 • edited

bitsgalore commented May 11, 2022 • edited

JelleZijlstra commented May 11, 2022

eugenetriguba commented May 11, 2022 • edited

JelleZijlstra commented May 11, 2022

eugenetriguba commented May 11, 2022

eugenetriguba commented May 11, 2022

eugenetriguba commented Jul 21, 2022

JelleZijlstra commented Jul 21, 2022

gvanrossum commented Jul 22, 2022

bitsgalore commented Aug 1, 2022

gvanrossum commented Aug 1, 2022

eugenetriguba commented Apr 12, 2022 •

edited

bitsgalore commented Apr 12, 2022 •

edited

bitsgalore commented May 11, 2022 •

edited

bitsgalore commented May 11, 2022 •

edited

eugenetriguba commented May 11, 2022 •

edited