EUC-JP codec fails to properly decode the "㎝" character #95734

Open
@Klim314

Minor bug: the EUC-JP codec fails to decode the character "㎝".

Bug report

The character "㎝" is part of the JIS X 0208 encoding. The Python core libraries include the EUC-JP codec, which covers the JIS X 0208, JIS X 0212, and JIS X 0201 encodings. However, attempting to decode the "㎝" character with the EUC-JP codec raises a UnicodeDecodeError.

Example

As taken from https://stackoverflow.com/questions/73255012/python-fails-to-decode-euc-jp-strings-with-the-character:

print(b"58\xad\xd1".decode("EUC-JP"))

throws

Traceback (most recent call last):
  File "<pyshell#53>", line 1, in <module>
    print(b"58\xad\xd1".decode("EUC-JP"))
UnicodeDecodeError: 'euc_jp' codec can't decode byte 0xad in position 2: illegal multibyte sequence
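
To see exactly where the codec gives up, the exception can be inspected programmatically. The following is a minimal sketch: the helper name `describe_decode_failure` is hypothetical, and it relies only on the standard `UnicodeDecodeError` attributes (`encoding`, `start`, `reason`).

```python
def describe_decode_failure(raw, encoding):
    """Return details of the first decode failure, or None if decoding succeeds."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return {
            "encoding": exc.encoding,  # normalized codec name
            "start": exc.start,        # index of the first rejected byte
            # The rejected byte plus the one after it, i.e. the two-byte
            # sequence that was meant to be a single character.
            "bad_bytes": raw[exc.start:exc.start + 2],
            "reason": exc.reason,
        }
    return None


failure = describe_decode_failure(b"58\xad\xd1", "EUC-JP")
print(failure)
```

On the interpreter versions listed in this report, the reported start index should match the "position 2" in the traceback above.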

However, decoding with an alternative codec works:

content = b"\xa5\xb5\xa5\xa4\xa5\xba\xa1\xa7XL \xcc\xf377\xad\xd1\xa1\xdf\xcc\xf358\xad\xd1"
print(b"58\xad\xd1".decode("euc_jisx0213"))
>58㎝
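
Until the `euc_jp` codec accepts this byte sequence, one possible workaround is to try `euc_jp` first and fall back to `euc_jisx0213` only when decoding fails. This is a sketch under assumptions, not a proposed fix: `decode_with_fallback` is a hypothetical helper, and the fallback codec may interpret some other byte sequences differently from `euc_jp`, so results should be checked against the data at hand.

```python
def decode_with_fallback(raw, encodings=("euc_jp", "euc_jisx0213")):
    """Try each codec in order and return the first successful decode."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError as exc:
            last_error = exc  # remember the failure and try the next codec
    # Every codec failed: re-raise the most recent error.
    raise last_error


decoded = decode_with_fallback(b"58\xad\xd1")
# ascii() keeps the output printable on consoles that cannot display "㎝".
print(ascii(decoded))
```

Input that `euc_jp` already decodes is never passed to the fallback codec, so existing behaviour is unchanged for it.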

Your environment

  • CPython versions tested on: 3.9, 3.10
  • Operating system and architecture: Windows x64

Labels: type-bug (An unexpected behavior, bug, or error)
