msgfmt cannot cope with BOM - improve error message #44827

Open
Cito mannequin opened this issue Apr 10, 2007 · 9 comments
Assignees
Labels
3.13 bugs and security fixes topic-unicode type-feature A feature request or enhancement

Comments

@Cito
Mannequin

Cito mannequin commented Apr 10, 2007

BPO 1697943
Nosy @loewis, @rhettinger, @Cito, @vstinner, @merwok, @serhiy-storchaka
Files
  • msgfmt.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/loewis'
    closed_at = None
    created_at = <Date 2007-04-10.20:58:04.000>
    labels = ['type-bug', 'expert-unicode', '3.11']
    title = 'msgfmt cannot cope with BOM - improve error message'
    updated_at = <Date 2021-04-22.15:26:29.553>
    user = 'https://github.com/Cito'

    bugs.python.org fields:

    activity = <Date 2021-04-22.15:26:29.553>
    actor = 'iritkatriel'
    assignee = 'loewis'
    closed = False
    closed_date = None
    closer = None
    components = ['Demos and Tools', 'Unicode']
    creation = <Date 2007-04-10.20:58:04.000>
    creator = 'cito'
    dependencies = []
    files = ['2348']
    hgrepos = []
    issue_num = 1697943
    keywords = ['patch']
    message_count = 9.0
    messages = ['31755', '31756', '31757', '31758', '70042', '125940', '125941', '290519', '290524']
    nosy_count = 6.0
    nosy_names = ['loewis', 'rhettinger', 'cito', 'vstinner', 'eric.araujo', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue1697943'
    versions = ['Python 3.11']

    @Cito
    Mannequin Author

    Cito mannequin commented Apr 10, 2007

    If a .po file has a BOM (byte order mark) at the beginning, as is often the case for UTF-8 files created on Windows, msgfmt.py complains about a syntax error.

    The attached patch fixes this problem.

    @Cito Cito mannequin assigned loewis Apr 10, 2007
    @rhettinger
    Contributor

    Martin, is this your code?

    @loewis
    Mannequin

    loewis mannequin commented Apr 11, 2007

    It's my code, but I will need to establish first whether it's a bug. That depends on what the PO specification says, and, if it is silent on the matter, what GNU gettext does.

    @Cito
    Mannequin Author

    Cito mannequin commented Apr 12, 2007

    It may well be that GNU gettext also chokes on a BOM, because they aren't used under Linux. But I think as a Python tool it should be more Windows-tolerant. The annoying thing is that you get a syntax error, but everything looks right because the BOM is usually invisible. Such error messages are really frustrating. Either the BOM should be silently ignored (as in the patch) or the users should get a friendly error message asking them to save the file without BOM. If GNU behaves badly to Windows users, that's not an excuse to do the same. They are already suffering enough because of their (or their bosses') bad choice of OS ;-)

    @Cito
    Mannequin Author

    Cito mannequin commented Jul 19, 2008

    Small improvement of the patch: Instead of hardcoding the BOM as
    '\xef\xbb\xbf', we should use codecs.BOM_UTF8.
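A minimal sketch of the idea in this comment, not the attached msgfmt.diff (which is not reproduced in this thread). The helper name `strip_utf8_bom` and the sample PO bytes are hypothetical; only `codecs.BOM_UTF8` comes from the comment itself.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for illustration; not part of msgfmt.py.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Use the named constant instead of hardcoding b'\xef\xbb\xbf'.
        return data[len(codecs.BOM_UTF8):]
    return data


# Sample input: a tiny PO fragment saved with a BOM, as some Windows editors do.
po_bytes = codecs.BOM_UTF8 + b'msgid "hello"\nmsgstr "hallo"\n'
cleaned = strip_utf8_bom(po_bytes)
```

Using the constant keeps the intent readable and avoids a typo in the three-byte literal. Whether silently dropping the BOM is the right behaviour is exactly what the rest of this thread debates.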

    @devdanzin devdanzin mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels May 15, 2009
    @vstinner
    Member

    Extract of the Unicode standard: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature".

    See also the following section explaining issues with the UTF-8 BOM:
    http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

    I agree that Python should handle (UTF-8) BOM to read a CSV file (bpo-7185), because the file format is common on Windows.

    But msgfmt is a UNIX tool: I would expect Python to behave like the original msgfmt tool and fail with a fatal error on the BOM "invisible character". How do you explain to a user that msgfmt fails but msgfmt.py does not?

    About the patch: *ignoring* the BOM is not a good idea. The BOM announces the encoding (e.g. UTF-8): if a Content-Type header announces another encoding, you should raise an error.
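The consistency check described in this comment could look roughly like the sketch below. It assumes the charset has already been extracted from the PO header's Content-Type line and is passed in as a string; the function name `check_bom_matches_charset` is hypothetical and is not something msgfmt.py provides.

```python
import codecs


def check_bom_matches_charset(data: bytes, declared_charset: str) -> None:
    """Raise ValueError if a UTF-8 BOM contradicts the declared charset.

    Hypothetical helper: declared_charset is assumed to be the value taken
    from the PO header, e.g. "UTF-8" or "ISO-8859-1".
    """
    if not data.startswith(codecs.BOM_UTF8):
        return  # No BOM, so there is nothing to contradict.
    # codecs.lookup() normalizes spelling variants such as "UTF-8" and "utf8".
    normalized = codecs.lookup(declared_charset).name
    if normalized not in ("utf-8", "utf-8-sig"):
        raise ValueError(
            "file starts with a UTF-8 byte order mark, "
            f"but the header declares charset {declared_charset!r}"
        )


# A BOM with a matching declaration passes silently.
check_bom_matches_charset(codecs.BOM_UTF8 + b'msgid ""\n', "UTF-8")
```

An unknown charset name makes `codecs.lookup` raise `LookupError`, which a real tool would probably want to report separately.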

    @vstinner
    Member

    See also issue bpo-7651: "Python3: guess text file charset using the BOM".

    @serhiy-storchaka
    Member

    Corresponding GNU gettext issue [1] was closed as "Not a Bug".

    [1] https://savannah.gnu.org/bugs/?18345

    @Cito
    Mannequin Author

    Cito mannequin commented Mar 26, 2017

    > Corresponding GNU gettext issue [1] was closed as "Not a Bug".

    Though I think the rationale given there pointing to RFC3629 section 6 is wrong, since that section explicitly refers to Internet protocols, but PO files are not an Internet protocol.

    Anyway, if silently ignoring BOMs is considered a bad idea, then at least there should be a more helpful error message. Because the BOM is invisible, users, who may not even be aware that something like a BOM exists or that their editor saves files with a BOM, may be frustrated by the current error message because they don't see any invalid character when they open the PO file in their editor. A more explicit error message like "PO files should not be saved with a byte order mark" might point users in the right direction.

    After all, these tools are supposed to be used directly by human beings on the command line. Who said that command line tools must not be user friendly?
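As a rough sketch of the alternative proposed in this comment, a tool could detect the BOM up front and report it by name instead of a generic syntax error. The function `reject_bom` and the file name `messages.po` are hypothetical; the message wording is taken from the comment above, and none of this reflects what msgfmt.py actually does.

```python
import codecs


def reject_bom(data: bytes, filename: str) -> None:
    """Raise ValueError with an explicit message if data starts with a UTF-8 BOM.

    Hypothetical helper for illustration; not part of msgfmt.py.
    """
    if data.startswith(codecs.BOM_UTF8):
        raise ValueError(
            f"{filename}: PO files should not be saved with a byte order mark"
        )


# Sample input with a BOM; the error text names the invisible culprit.
try:
    reject_bom(codecs.BOM_UTF8 + b'msgid "hello"\n', "messages.po")
except ValueError as exc:
    message = str(exc)
```

Checking before parsing means the user sees the cause directly, rather than a syntax error on a line that looks correct in their editor.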

    @iritkatriel iritkatriel added the 3.11 only security fixes label Apr 22, 2021
    @iritkatriel iritkatriel changed the title msgfmt cannot cope with BOM msgfmt cannot cope with BOM - improve error message Apr 22, 2021
    @iritkatriel iritkatriel added 3.11 only security fixes and removed invalid labels Apr 22, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added type-feature A feature request or enhancement 3.12 bugs and security fixes and removed type-bug An unexpected behavior, bug, or error 3.11 only security fixes labels Oct 4, 2022
    @erlend-aasland erlend-aasland removed the 3.12 bugs and security fixes label Jan 5, 2024
    @erlend-aasland erlend-aasland added the 3.13 bugs and security fixes label Jan 5, 2024

    5 participants