Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

email parseaddr and formataddr should be IDNA aware #55992

Open
bitdancer opened this issue Apr 6, 2011 · 32 comments
Open

email parseaddr and formataddr should be IDNA aware #55992

bitdancer opened this issue Apr 6, 2011 · 32 comments
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir topic-email type-feature A feature request or enhancement

Comments

@bitdancer
Copy link
Member

BPO 11783
Nosy @warsaw, @macfreek, @tiran, @bitdancer, @zvyn
PRs
  • bpo-11783: Make parseaddr, formataddr IDNA aware. #1900
  • Files
  • issue-11783-v1.patch: IDNA awareness for formataddr() and parseaddr() + tests + docs
  • issue-11783-v2.patch: Fix for local-part only addresses + test case
  • email_address_idna.patch
  • issue-11783-v4.patch: getaddresses() also decodes IDNs
  • issue11783.patch: Update issue-11783-v4.patch to be applieable on 3.5.
  • issue11783-rdm-fixed.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2011-04-06.14:12:00.758>
    labels = ['type-feature', '3.7', 'expert-email']
    title = 'email parseaddr and formataddr should be IDNA aware'
    updated_at = <Date 2017-06-02.16:29:14.305>
    user = 'https://github.com/bitdancer'

    bugs.python.org fields:

    activity = <Date 2017-06-02.16:29:14.305>
    actor = 'r.david.murray'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['email']
    creation = <Date 2011-04-06.14:12:00.758>
    creator = 'r.david.murray'
    dependencies = []
    files = ['21595', '21614', '21653', '21698', '35565', '35566']
    hgrepos = []
    issue_num = 11783
    keywords = ['patch']
    message_count = 32.0
    messages = ['133133', '133335', '133338', '133343', '133381', '133421', '133512', '133519', '133522', '133686', '133696', '133699', '133754', '133757', '133767', '133768', '133935', '133936', '161807', '170381', '213328', '213332', '220240', '220253', '294937', '294938', '294987', '295023', '295027', '295028', '295031', '295035']
    nosy_count = 8.0
    nosy_names = ['barry', 'macfreek', 'christian.heimes', 'r.david.murray', 'jesstess', 'sdaoden', 'torsten.becker', 'zvyn']
    pr_nums = ['1900']
    priority = 'normal'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue11783'
    versions = ['Python 3.7']

    @bitdancer
    Copy link
    Member Author

    The patch for bpo-1690608 adds support for unicode in the realname field to formataddr. To complete the currently-workable internationalization of address specs, both parseaddr and formataddr should be made IDNA aware. It is probably a good idea to do some research on this to double check, but it seems as though recognizing IDNA during parsing and generating it during formatting when the domain contains non-ASCII characters should be both safe and useful.

    See also bpo-963906, which contains some test cases that could be adapted.

    @bitdancer bitdancer added the stdlib Python modules in the Lib dir label Apr 6, 2011
    @bitdancer bitdancer self-assigned this Apr 6, 2011
    @bitdancer bitdancer added easy type-feature A feature request or enhancement labels Apr 6, 2011
    @sdaoden
    Copy link
    Mannequin

    sdaoden mannequin commented Apr 8, 2011

    If Torsten does not have time for this i'll do that next.
    I hate this useless Punycode (but for nameprep),
    so it's exactly the right thing for me to do next in 2011.
    Unless someone complains.
    I've not looked at formataddr/parseaddr, so that's a motivation.
    Torsten?

    @bitdancer
    Copy link
    Member Author

    I believe Torsten is interested, but he can of course speak for himself.

    Just to make sure you are aware: Python has a built in idna codec, so this should be a fairly simple patch.

    @torstenbecker
    Copy link
    Mannequin

    torstenbecker mannequin commented Apr 8, 2011

    I was about to look into this over the weekend, but of course I don't
    want to steal your fun, Steffen. :)

    @sdaoden
    Copy link
    Mannequin

    sdaoden mannequin commented Apr 9, 2011

    On Fri, Apr 08, 2011 at 09:03:51PM +0000, Torsten Becker wrote:

    I was about to look into this over the weekend, but of course I don't
    want to steal your fun, Steffen. :)

    Toll, toll, toll!!
    Still cherry blossom, thanks to the weather, apple blossom
    started, more than 20 degrees.
    I really understand if a young german man rather prefers staying
    at home over the weekend, looking at its beautiful, intelligent
    chancellor whenever the TV is started, than doing whatever else!

    Dear god, all these nice memories ...
    14" BW CRT monitors, 30 hours at a time ...
    Have a nice weekend!

    @torstenbecker
    Copy link
    Mannequin

    torstenbecker mannequin commented Apr 9, 2011

    Have a nice weekend!

    Thank you for the wishes, I hope yours is going well, too!

    I added IDNA awareness to formataddr() and parseaddr(), updated the docs and wrote 2 tests for it.

    I wasn't sure if the IDNA awareness should be optional via a argument or always automatically enabled, I favored the latter.

    Also, is it safe to split at "@" and encode/decode the last component? I am not familiar with all the weird variants a email address could be in strictly after the RFCs.

    @bitdancer
    Copy link
    Member Author

    Patch mostly looks good to me, modulo some English wording that I'll fix up when I commit it.

    The issue with the '@' is that it might not be there. So you do need to check for that case (it means the domain part defaults to the 'local' domain). I should double check the RFC to make sure that's the only special case, but I believe it is.

    @torstenbecker
    Copy link
    Mannequin

    torstenbecker mannequin commented Apr 11, 2011

    modulo some English wording that I'll fix up when I commit it.

    Yeah, sorry for that, I seem to have trouble with writing good documentation. :) I'll have a look at the documents referenced by 1 to improve my writing.

    The issue with the '@' is that it might not be there.

    I added a fix and a test for this in v2. However, when reading through the RFC 2 and Wikipedia 3, it seems like this is not actually allowed.

    Is there a way to internationalize the local-part as well? That is the only part which is missing now that domain and real name are covered.

    @bitdancer
    Copy link
    Member Author

    Hmm. You are correct. I thought the RFC's covered this case, but apparently they don't.

    The email package gets used in MUA contexts, where the domain part of the address may be omitted and the MUA must fill it in before transmitting the message to the MTA. And some MTAs will fill in the local domain if it is omitted, so actually it applies in an MTA context, too. So we need to support this case even if it isn't covered by the RFCs.

    Thanks, the revised patch looks good.

    @bitdancer
    Copy link
    Member Author

    OK, so when I went to apply this, I figured out that the patch isn't quite right. I've redone the doc updates, and am attaching a version of the patch containing them.

    The issue is that the place that the IDNA decode support needs to be added isn't in parseaddr, it's in _parseaddr.py's AddresslistClass. Tests are then needed to make sure that the IDNA decoding gets done both when parseaddr and getaddresslist are used.

    Do you want to tackle this, Torsten?

    @torstenbecker
    Copy link
    Mannequin

    torstenbecker mannequin commented Apr 13, 2011

    OK, so when I went to apply this, I figured out that the patch isn't quite right.  I've redone the doc updates, and am attaching a version of the patch containing them.

    The issue is that the place that the IDNA decode support needs to be added isn't in parseaddr, it's in _parseaddr.py's AddresslistClass.  Tests are then needed to make sure that the IDNA decoding gets done both when parseaddr and getaddresslist are used.

    Do you want to tackle this, Torsten?

    I would like to, but I probably will not get to it before Monday. So
    if anybody wants to work on this before that time, please feel free to
    fix it properly. :)

    Just two questions for the implementation:

    1. Would it be fine to move the helper _encode_decode_addr() into
      _parseaddr.py and then import it in util.py, so it can be shared
      between the two?
    2. Would line 232 in _parseaddr.py (AddrlistClass.getaddrlist) be a
      good place to integrate it?

    @bitdancer
    Copy link
    Member Author

    Yes, putting the function in _parseaddr is fine. And yes, 232 looks like a good place. The alternative would be understanding the rfc822 parser, which is pretty mind bending, and of doubtful additional benefit. (At most it would save a pair of split/join calls.)

    @sdaoden
    Copy link
    Mannequin

    sdaoden mannequin commented Apr 14, 2011

    On Wed, Apr 13, 2011 at 09:42:22PM +0000, Torsten Becker wrote:

    So if anybody wants to work on this before that time,
    please feel free to fix it properly. :)

    (The word anybody made me think.
    But "fix properly" ... i'm sure you cannot refer to myself.
    :))

    @torstenbecker
    Copy link
    Mannequin

    torstenbecker mannequin commented Apr 14, 2011

    (The word anybody made me think.
    But "fix properly" ... i'm sure you cannot refer to myself.
    :))

    "fix properly" referred to my inferior implementation and "anybody"
    should probably have been worded "Steffen or David". So sure .. go
    ahead. :)

    @sdaoden
    Copy link
    Mannequin

    sdaoden mannequin commented Apr 14, 2011

    On Thu, Apr 14, 2011 at 06:05:59PM +0000, Torsten Becker wrote:

    So sure .. go ahead. :)

    Grrrrr....
    Wuah, Wuuuah, Wuuuuuuaaaah!
    No. Now i'm exhausted.

    @sdaoden
    Copy link
    Mannequin

    sdaoden mannequin commented Apr 14, 2011

    On Thu, Apr 14, 2011 at 06:05:59PM +0000, Torsten Becker wrote:

    :)

    Mumble...
    :)

    @torstenbecker
    Copy link
    Mannequin

    torstenbecker mannequin commented Apr 17, 2011

    Hi, here is my revised patch with email.utils.getaddresses() also decoding IDNs.

    I decided to integrate IDN decoding in AddrlistClass.getaddress() instead of AddrlistClass.getaddrlist() since that function is one level lower and if somebody should ever all it directly, the conversion would not happen.

    I also fixed a glitch in the docs, "versionchanged" seems to need two colons to end up in the generated HTML.

    As a follow up, wouldn't it be helpful if email.Message would do the conversions directly? So when you parse a mail into a Message and access the "To" field, you get a list of tuples which are decoded properly?

    For example the following test currently still fails because the quoted header value is not decoded by email.feedparser.FeedParser nor email.Message:

        def test_email_decodes_idns_and_unicode(self):
            text = '''\
    To: =?utf-8?b?SMOkbnMgV8O8cnN0?= <hans@xn--dm-fka.ain>

    Hello World!'''
    msg = Parser().parsestr(text)
    self.assertEqual(utils.getaddresses(msg.get_all('To')),
    [('H\xe4ns W\xfcrst', 'hans@d\xf6m.ain')])

    Am I using the package wrong here or is this actually missing? email.header.decode_header seems to be able to do this already but it is not used. Would it be safe to integrate this into the email.message._sanitize_header helper?

    @bitdancer
    Copy link
    Member Author

    Thanks. I should be able to look at this tomorrow.

    You are correct about the fact that Message currently doesn't do any decoding. That is part of the design: you get the string out of Message and use the helper decoding functions (decode_header, getaddresses, etc) to get actual usable data out of it.

    Part of the email6 design addresses this with a new API that will make it much easier to get at the parsed data (methods on a header object returned from msg['xxx']). But for the current package the way it works it the way it is supposed to work ;)

    @bitdancer bitdancer added topic-email and removed stdlib Python modules in the Lib dir labels May 28, 2012
    @bitdancer
    Copy link
    Member Author

    I'm finally getting back around to this. Torsten, could you submit a contributor agreement, please? (http://www.python.org/psf/contrib/)

    And to answer the question you had about the 'still failing' test, parseaddr isn't currently doing the encoded-word decode. Either it should do that too, or it shouldn't do the IDNA decode. I think it should do the decode, because parseaddr and formataddr are supposed to be inverses. And decoding when decoding didn't used to be done seems like a backward compatible change: a program doing its own decoding just won't find anything that needs decoding.

    @bitdancer bitdancer removed their assignment May 28, 2012
    @tiran
    Copy link
    Member

    tiran commented Sep 12, 2012

    3.3 is in feature freeze mode. This new feature has to go into 3.4.

    @macfreek
    Copy link
    Mannequin

    macfreek mannequin commented Mar 12, 2014

    @r.david.murray
    Quote from bpo-20083:

    "You can also help me to remember to commit 11783 after the final release of 3.4.0."

    Ping!

    @bitdancer
    Copy link
    Member Author

    Final isn't out yet :)

    @zvyn
    Copy link
    Mannequin

    zvyn mannequin commented Jun 11, 2014

    Ping :)

    @zvyn
    Copy link
    Mannequin

    zvyn mannequin commented Jun 11, 2014

    Here comes an updated patch based on 'email_address_idna.patch' without breaking smtplib (as the previous patches did).

    @tiran
    Copy link
    Member

    tiran commented Jun 1, 2017

    zvyn, thanks for your patch.

    However I'm sorry to say that Python stdlib's IDNA support is fundamentally broken by design. Therefore I'm against any IDNA related patches until we have addresses multiple issues with internationalized domain names. Our naive support of IDNA in socket module and ssl module is a security issue waiting to be happening.

    • Python blindly assume that 'idna' is the only transformation of IDN U-labels into IDN A-labels. That's just plain wrong. Python's idna is really IDNA-2003.
    • Besides IDNA 2003 there is also IDNA 2008. Of course the encodings are not compatible to each other.
    • The old encoding IDNA-2003 and *MUST NOT* be used for some TLDs like .de because has an incorrect mapping for several characters like 'ß'.
    • IDNA-2008 does not support upper case letters. Most applications want to use UTR46 mapping for IDNA-2008.
    • On the application side, mapping of IDN U-labels must go through an additional validation layer to counteract homoglyphic confusion attacks. (e.g. cyrillic 'r' looks like latin 'p').

    Before we add more security issues to libraries, we should come up with a plan to address this mess. First step: add IDNA-2008 and UTR46 support to stdlib.

    I'm deeply sorry for dragging you into this mess. :/

    PS: I have removed the 'easy' keyword.

    @tiran tiran removed the easy label Jun 1, 2017
    @tiran
    Copy link
    Member

    tiran commented Jun 1, 2017

    http://www.unicode.org/reports/tr46/#IDNA2008-Section

    Additions. Some IDNs are invalid in IDNA2003, but valid in IDNA2008.
    Subtractions. Some IDNs are valid in IDNA2003, but invalid in IDNA2008.
    Deviations. Some IDNs are valid in both, but resolve to different destinations.
    

    @zvyn
    Copy link
    Mannequin

    zvyn mannequin commented Jun 2, 2017

    I see your point, but I'm not fully convinced it relates to this PR directly: the code here just uses the standard interface to use an 'idna' codec. If that codec is buggy that is a different issue.

    If it's so buggy that using it is absolutely pointless, it should not exist in the first place IMHO. But if we need to keep a useless codec for compabilety reasons and you prefer adding a now codec (e.g. 'idna-2008') I agree that we should put this PR on hold until a codec it could use exists.

    @tiran
    Copy link
    Member

    tiran commented Jun 2, 2017

    We can't replace IDNA-2003 with IDNA-2008 either. Some applications / protocols / TLDs still use IDNA-2003. I see two options, (1) make encoding configurable somehow, (2) drop implicit IDNA encoding/decoding and force applications to perform explicit IDNA.

    @bitdancer
    Copy link
    Member Author

    The email package currently forces explicit IDNA use currently. The new policies are supposed to support it automatically but I they currently don't. I'm not sure we should add it to the old interface (parseaddr/formataddr) any longer, but I don't object to doing it, either.

    Not handling idna automatically would go against the entire design of the new email policies, which is to produce unicode from the wire encoding for programs to work with, and convert back to wire protocol on output.

    The work on resolving the idna2008 issue belongs in issue bpo-17305, where MvL (who wrote the original idna codec) points to IMO the correct solution (a uts46 codec) in msg217218.

    @bitdancer bitdancer added the 3.7 (EOL) end of life label Jun 2, 2017
    @tiran
    Copy link
    Member

    tiran commented Jun 2, 2017

    Adding automatic IDNA decoding is like opening Pandora's box.

    uts46 is not necessarily the correct solution. The correct handling of internationalized domain names depends on multiple factors: TLD, remote application and more. You actually have to know if the remote peer uses 2003 or 2008. For any user facing application you cannot simply expose all of IDNA to the user (homoglypic confusion attacks).

    @reaperhulk
    Copy link
    Mannequin

    reaperhulk mannequin commented Jun 2, 2017

    As someone who built an idna aware API for pyca/cryptography and deeply regrets it I'd like to weigh in on the side of saying that IDNA is a presentation issue and that supporting it in lower level APIs is the cause of many bugs, some of which can potentially be security issues. Users wanting to make requests to IDNA domains should be responsible for the encoding themselves so that impedance mismatches in encoding version are both discoverable and correctable.

    @bitdancer
    Copy link
    Member Author

    In other words, this was a major standards screwup and we get to deal with the consequences :(

    All right, since I'm hardly likely to have time to deal with it anyway, we'll just say that email isn't going to handle unicode domain names until *someone* figures out how to do this right. And it sounds like that may be never. Because saying that "users that want to make requests to IDNA domains should be responsible for the encoding themselves" is, really, a *complete* non-starter from any perspective I can think of, and has its own security issues. If UTF46 does not do the job, we are just out of luck and the users will pay the price.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 23, 2023
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life stdlib Python modules in the Lib dir topic-email type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants