ElementTree should use UTF-8 for xml declaration. #91810

Closed
methane opened this issue Apr 22, 2022 · 10 comments · Fixed by #91903
Labels
expert-XML type-feature

Comments

@methane (Member) commented Apr 22, 2022

Feature or enhancement

Currently, ElementTree.tostring(root, encoding="unicode", xml_declaration=True) writes the locale encoding into the XML declaration.

I think ElementTree should declare UTF-8 instead of the locale encoding.

Example:

$ LANG=ja_JP.eucJP ./python.exe
Python 3.11.0a7+ (heads/bytes-alloc-dirty:7fbc7f6128, Apr 19 2022, 16:53:54) [Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> et = ET.fromstring("<t>hello</t>")
>>> ET.tostring(et, encoding="unicode", xml_declaration=True)
"<?xml version='1.0' encoding='eucJP'?>\n<t>hello</t>"

Code:

with _get_writer(file_or_filename, enc_lower) as write:
    if method == "xml" and (xml_declaration or
            (xml_declaration is None and
             enc_lower not in ("utf-8", "us-ascii", "unicode"))):
        declared_encoding = encoding
        if enc_lower == "unicode":
            # Retrieve the default encoding for the xml declaration
            import locale
            declared_encoding = locale.getpreferredencoding()
        write("<?xml version='1.0' encoding='%s'?>\n" % (
            declared_encoding,))
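
A minimal sketch for checking what a given interpreter writes into the declaration. The helper name `declared_encoding` is hypothetical, and the printed values depend on the Python version and the active locale, so no particular output is assumed here.

```python
import locale
import re
import xml.etree.ElementTree as ET


def declared_encoding(xml_text):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = re.match(r"<\?xml[^>]*\bencoding=['\"]([^'\"]+)['\"]", xml_text)
    return match.group(1) if match else None


root = ET.fromstring("<t>hello</t>")
text = ET.tostring(root, encoding="unicode", xml_declaration=True)

# Compare what ElementTree declared with what the locale reports.
print("declared:", declared_encoding(text))
print("locale:  ", locale.getpreferredencoding(False))
```

Running this under different LANG settings makes it easy to see whether the declaration follows the locale on the interpreter at hand.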

Pitch

  • UTF-8 is the most common encoding for XML.
  • Locale encoding names (e.g. cp932 or eucJP) can differ from the XML encoding names recommended by the W3C (e.g. Shift_JIS or EUC-JP).

@serhiy-storchaka (Member) commented Apr 22, 2022

Look at dump(). It writes an element to stdout, which usually uses the locale encoding. I think that is the rationale for using the locale encoding here. You would need to change dump() to use stdout's encoding explicitly.

@serhiy-storchaka (Member) commented Apr 22, 2022

Or maybe change write() to get the encoding from the output stream if available.

@methane (Member, Author) commented Apr 23, 2022

dump() doesn't use xml_declaration=True, so this issue doesn't affect it.

@methane (Member, Author) commented Apr 25, 2022

@scoder, could you give us some advice? (You are listed as the etree expert in the experts index.)

There is no single correct behavior, because the output is Unicode and etree doesn't know the real output encoding.

In some cases the current behavior is better, e.g. when writing to a file opened with the default encoding (open(filename, 'w')).
On the other hand, encoding="cp932" (on Japanese Windows) is non-portable (encoding="Shift_JIS" should be used), and UTF-8 is the most recommended encoding for XML.

I have two ideas:

a. Make UTF-8 the default. This is the simplest.
b. Keep using locale.getpreferredencoding() and wait for PEP 686 to be accepted. (But it should be replaced with locale.getpreferredencoding(False) anyway.)

@serhiy-storchaka (Member) commented Apr 25, 2022

Adding encoding="UTF-8" and using cp932 to encode the content would be even worse.

Maybe add a simple mapping from Python encodings to XML encodings (for example we need to write "ascii" as "us-ascii")? Later we can discuss adding a public API for this.
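
A rough sketch of what such a mapping could look like. The table contents and the function name `xml_encoding_name` are assumptions for illustration, not an existing ElementTree API; only `codecs.lookup()` is a real standard-library call.

```python
import codecs

# Hypothetical table: normalized Python codec name -> name for the XML declaration.
# cp932 is deliberately left out: whether it should be declared as Shift_JIS
# is one of the open questions in this thread.
_XML_NAMES = {
    "ascii": "us-ascii",
    "utf-8": "UTF-8",
    "iso8859-1": "ISO-8859-1",
    "euc_jp": "EUC-JP",
    "shift_jis": "Shift_JIS",
}


def xml_encoding_name(python_encoding):
    """Map a Python encoding name or alias to a name for the XML declaration."""
    # codecs.lookup() resolves aliases such as "eucJP" to the codec's own name.
    normalized = codecs.lookup(python_encoding).name
    # Fall back to the caller's spelling for codecs the table does not cover.
    return _XML_NAMES.get(normalized, python_encoding)
```

Normalizing through `codecs.lookup()` first keeps the table small, since aliases collapse to one key per codec.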

@serhiy-storchaka (Member) commented Apr 25, 2022

I proposed to get the default encoding from the file object if available. #91812 (comment)

@methane (Member, Author) commented Apr 25, 2022

> Adding encoding="UTF-8" and using cp932 to encode the content would be even worse.

Of course, we should recommend using UTF-8.
Note that declaring encoding='cp932' while actually encoding as UTF-8 is already a possible bug today.
Any default value may cause bugs. There is no single correct default, but UTF-8 may be the best for now.

> Maybe add a simple mapping from Python encodings to XML encodings (for example we need to write "ascii" as "us-ascii")? Later we can discuss adding a public API for this.

We may not know the Python encoding, because the output is Unicode (e.g. a Unicode string or StringIO).
That idea works only when the output is a TextIOWrapper. (And there is no guarantee that TextIOWrapper.encoding is really the final encoding.)

If we want to support arbitrary encodings, we should add another option like xml_declaration_encoding="Shift_JIS".
But this is not strictly necessary.
Users can choose xml_declaration=False and prepend <?xml version="1.0" encoding="Shift_JIS" ?> manually when they really need an encoding other than UTF-8.
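
The manual workaround could look roughly like this. It is a sketch that assumes the caller encodes the bytes itself and keeps the codec in step with the name written in the declaration; the sample text is made up.

```python
import xml.etree.ElementTree as ET

root = ET.Element("t")
root.text = "こんにちは"

# Serialize to a str without any XML declaration.
body = ET.tostring(root, encoding="unicode", xml_declaration=False)

# Prepend the declaration by hand, then encode with the matching codec.
document = '<?xml version="1.0" encoding="Shift_JIS"?>\n' + body
data = document.encode("shift_jis")
```

The declaration and the codec passed to encode() are chosen independently here, so keeping them consistent is the caller's responsibility.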

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 25, 2022
…ML declaration

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.

@serhiy-storchaka (Member) commented Apr 25, 2022

> Note that declaring encoding='cp932' while actually encoding as UTF-8 is already a possible bug today.

Yes, it is a bug, and #91903 fixes it.

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 27, 2022
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Apr 27, 2022
pythonGH-91989)

(cherry picked from commit f60b4c3)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Apr 27, 2022
…II data (pythonGH-91989).

(cherry picked from commit f60b4c3)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
miss-islington added a commit that referenced this issue Apr 27, 2022
…91989)

(cherry picked from commit f60b4c3)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

@serhiy-storchaka (Member) commented Apr 29, 2022

How #91903 differs from #91812:

  • If a file path is passed, it declares the default file encoding (the one used to encode the content) instead of UTF-8. After PEP 686 is implemented the result will be the same, but older versions need a fix.
    It also correctly escapes non-encodable characters.
  • If an open text file is passed, it declares the file's encoding instead of UTF-8. PEP 686 will not fix this.

What #91812 and #91903 have in common, and how they differ from the current code:

  • If a StringIO or a custom stream without an encoding attribute is passed, they declare UTF-8 instead of the locale encoding.
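
The three cases above can be sketched as a small selection function. This is an illustration of the rules as described in this comment, not the code from either pull request; `declared_encoding_for` is a hypothetical name.

```python
import io
import locale
import os


def declared_encoding_for(target):
    """Hypothetical helper: pick the encoding to name in the XML declaration."""
    if isinstance(target, (str, bytes, os.PathLike)):
        # A file path: the file would be opened with the default text encoding.
        return locale.getpreferredencoding(False)
    # An open text stream: use its encoding when it reports one.
    # StringIO and custom streams without the attribute fall back to UTF-8.
    return getattr(target, "encoding", None) or "UTF-8"


wrapped = io.TextIOWrapper(io.BytesIO(), encoding="euc_jp")
print(declared_encoding_for(io.StringIO()))
print(declared_encoding_for(wrapped))
```

Using `getattr` with a default covers both streams that lack the attribute and streams, such as StringIO, whose encoding is None.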

@serhiy-storchaka (Member) commented May 10, 2022

What do you think about this @methane?

miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 11, 2022
…ML declaration (pythonGH-91903)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
(cherry picked from commit 707839b)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue May 11, 2022
…laration (GH-91903)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
miss-islington added a commit that referenced this issue May 11, 2022
…XML declaration (GH-91903) (GH-92663)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
(cherry picked from commit 707839b)


Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

Automerge-Triggered-By: GH:serhiy-storchaka
miss-islington added a commit that referenced this issue May 11, 2022
…XML declaration (GH-91903) (GH-92664)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
(cherry picked from commit 707839b)


Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

Automerge-Triggered-By: GH:serhiy-storchaka
miss-islington added a commit that referenced this issue May 11, 2022
…ML declaration (GH-91903) (GH-92665)

ElementTree method write() and function tostring() now use the text file's
encoding ("UTF-8" if not available) instead of locale encoding in XML
declaration when encoding="unicode" is specified.
(cherry picked from commit 707839b)


Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>

Automerge-Triggered-By: GH:serhiy-storchaka
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Jun 2, 2022
…ncoding='unicode'

Suppress writing an XML declaration in open files in ElementTree.write()
with encoding='unicode' and xml_declaration=None.
serhiy-storchaka added a commit that referenced this issue Jun 14, 2022
…g='unicode' (GH-93426)

Suppress writing an XML declaration in open files in ElementTree.write()
with encoding='unicode' and xml_declaration=None.

If a file path is passed to ElementTree.write() with encoding='unicode',
always open a new file in UTF-8.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 14, 2022
…ncoding='unicode' (pythonGH-93426)

Suppress writing an XML declaration in open files in ElementTree.write()
with encoding='unicode' and xml_declaration=None.

If a file path is passed to ElementTree.write() with encoding='unicode',
always open a new file in UTF-8.
(cherry picked from commit d7db9dc3cc5b44d0b4ce000571fecf58089a01ec)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2 participants