-
-
Notifications
You must be signed in to change notification settings - Fork 30.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
email parseaddr and formataddr should be IDNA aware #55992
Comments
The patch for bpo-1690608 adds support for unicode in the realname field to formataddr. To complete the currently-workable internationalization of address specs, both parseaddr and formataddr should be made IDNA aware. It is probably a good idea to do some research on this to double check, but it seems as though recognizing IDNA during parsing and generating it during formatting when the domain contains non-ASCII characters should be both safe and useful. See also bpo-963906, which contains some test cases that could be adapted. |
If Torsten does not have time for this i'll do that next. |
I believe Torsten is interested, but he can of course speak for himself. Just to make sure you are aware: Python has a built in idna codec, so this should be a fairly simple patch. |
I was about to look into this over the weekend, but of course I don't |
On Fri, Apr 08, 2011 at 09:03:51PM +0000, Torsten Becker wrote:
Toll, toll, toll!! Dear god, all these nice memories ... |
Thank you for the wishes, I hope yours is going well, too! I added IDNA awareness to formataddr() and parseaddr(), updated the docs and wrote 2 tests for it. I wasn't sure if the IDNA awareness should be optional via a argument or always automatically enabled, I favored the latter. Also, is it safe to split at "@" and encode/decode the last component? I am not familiar with all the weird variants a email address could be in strictly after the RFCs. |
Patch mostly looks good to me, modulo some English wording that I'll fix up when I commit it. The issue with the '@' is that it might not be there. So you do need to check for that case (it means the domain part defaults to the 'local' domain). I should double check the RFC to make sure that's the only special case, but I believe it is. |
Yeah, sorry for that, I seem to have trouble with writing good documentation. :) I'll have a look at the documents referenced by 1 to improve my writing.
I added a fix and a test for this in v2. However, when reading through the RFC 2 and Wikipedia 3, it seems like this is not actually allowed. Is there a way to internationalize the local-part as well? That is the only part which is missing now that domain and real name are covered. |
Hmm. You are correct. I thought the RFC's covered this case, but apparently they don't. The email package gets used in MUA contexts, where the domain part of the address may be omitted and the MUA must fill it in before transmitting the message to the MTA. And some MTAs will fill in the local domain if it is omitted, so actually it applies in an MTA context, too. So we need to support this case even if it isn't covered by the RFCs. Thanks, the revised patch looks good. |
OK, so when I went to apply this, I figured out that the patch isn't quite right. I've redone the doc updates, and am attaching a version of the patch containing them. The issue is that the place that the IDNA decode support needs to be added isn't in parseaddr, it's in _parseaddr.py's AddresslistClass. Tests are then needed to make sure that the IDNA decoding gets done both when parseaddr and getaddresslist are used. Do you want to tackle this, Torsten? |
I would like to, but I probably will not get to it before Monday. So Just two questions for the implementation:
|
Yes, putting the function in _parseaddr is fine. And yes, 232 looks like a good place. The alternative would be understanding the rfc822 parser, which is pretty mind bending, and of doubtful additional benefit. (At most it would save a pair of split/join calls.) |
On Wed, Apr 13, 2011 at 09:42:22PM +0000, Torsten Becker wrote:
(The word anybody made me think. |
"fix properly" referred to my inferior implementation and "anybody" |
On Thu, Apr 14, 2011 at 06:05:59PM +0000, Torsten Becker wrote:
Grrrrr.... |
On Thu, Apr 14, 2011 at 06:05:59PM +0000, Torsten Becker wrote:
Mumble... |
Hi, here is my revised patch with email.utils.getaddresses() also decoding IDNs. I decided to integrate IDN decoding in AddrlistClass.getaddress() instead of AddrlistClass.getaddrlist() since that function is one level lower and if somebody should ever all it directly, the conversion would not happen. I also fixed a glitch in the docs, "versionchanged" seems to need two colons to end up in the generated HTML. As a follow up, wouldn't it be helpful if email.Message would do the conversions directly? So when you parse a mail into a Message and access the "To" field, you get a list of tuples which are decoded properly? For example the following test currently still fails because the quoted header value is not decoded by email.feedparser.FeedParser nor email.Message: def test_email_decodes_idns_and_unicode(self):
text = '''\
To: =?utf-8?b?SMOkbnMgV8O8cnN0?= <hans@xn--dm-fka.ain> Hello World!''' Am I using the package wrong here or is this actually missing? email.header.decode_header seems to be able to do this already but it is not used. Would it be safe to integrate this into the email.message._sanitize_header helper? |
Thanks. I should be able to look at this tomorrow. You are correct about the fact that Message currently doesn't do any decoding. That is part of the design: you get the string out of Message and use the helper decoding functions (decode_header, getaddresses, etc) to get actual usable data out of it. Part of the email6 design addresses this with a new API that will make it much easier to get at the parsed data (methods on a header object returned from msg['xxx']). But for the current package the way it works it the way it is supposed to work ;) |
I'm finally getting back around to this. Torsten, could you submit a contributor agreement, please? (http://www.python.org/psf/contrib/) And to answer the question you had about the 'still failing' test, parseaddr isn't currently doing the encoded-word decode. Either it should do that too, or it shouldn't do the IDNA decode. I think it should do the decode, because parseaddr and formataddr are supposed to be inverses. And decoding when decoding didn't used to be done seems like a backward compatible change: a program doing its own decoding just won't find anything that needs decoding. |
3.3 is in feature freeze mode. This new feature has to go into 3.4. |
@r.david.murray "You can also help me to remember to commit 11783 after the final release of 3.4.0." Ping! |
Final isn't out yet :) |
Ping :) |
Here comes an updated patch based on 'email_address_idna.patch' without breaking smtplib (as the previous patches did). |
zvyn, thanks for your patch. However I'm sorry to say that Python stdlib's IDNA support is fundamentally broken by design. Therefore I'm against any IDNA related patches until we have addresses multiple issues with internationalized domain names. Our naive support of IDNA in socket module and ssl module is a security issue waiting to be happening.
Before we add more security issues to libraries, we should come up with a plan to address this mess. First step: add IDNA-2008 and UTR46 support to stdlib. I'm deeply sorry for dragging you into this mess. :/ PS: I have removed the 'easy' keyword. |
http://www.unicode.org/reports/tr46/#IDNA2008-Section
|
I see your point, but I'm not fully convinced it relates to this PR directly: the code here just uses the standard interface to use an 'idna' codec. If that codec is buggy that is a different issue. If it's so buggy that using it is absolutely pointless, it should not exist in the first place IMHO. But if we need to keep a useless codec for compabilety reasons and you prefer adding a now codec (e.g. 'idna-2008') I agree that we should put this PR on hold until a codec it could use exists. |
We can't replace IDNA-2003 with IDNA-2008 either. Some applications / protocols / TLDs still use IDNA-2003. I see two options, (1) make encoding configurable somehow, (2) drop implicit IDNA encoding/decoding and force applications to perform explicit IDNA. |
The email package currently forces explicit IDNA use currently. The new policies are supposed to support it automatically but I they currently don't. I'm not sure we should add it to the old interface (parseaddr/formataddr) any longer, but I don't object to doing it, either. Not handling idna automatically would go against the entire design of the new email policies, which is to produce unicode from the wire encoding for programs to work with, and convert back to wire protocol on output. The work on resolving the idna2008 issue belongs in issue bpo-17305, where MvL (who wrote the original idna codec) points to IMO the correct solution (a uts46 codec) in msg217218. |
Adding automatic IDNA decoding is like opening Pandora's box. uts46 is not necessarily the correct solution. The correct handling of internationalized domain names depends on multiple factors: TLD, remote application and more. You actually have to know if the remote peer uses 2003 or 2008. For any user facing application you cannot simply expose all of IDNA to the user (homoglypic confusion attacks). |
As someone who built an idna aware API for pyca/cryptography and deeply regrets it I'd like to weigh in on the side of saying that IDNA is a presentation issue and that supporting it in lower level APIs is the cause of many bugs, some of which can potentially be security issues. Users wanting to make requests to IDNA domains should be responsible for the encoding themselves so that impedance mismatches in encoding version are both discoverable and correctable. |
In other words, this was a major standards screwup and we get to deal with the consequences :( All right, since I'm hardly likely to have time to deal with it anyway, we'll just say that email isn't going to handle unicode domain names until *someone* figures out how to do this right. And it sounds like that may be never. Because saying that "users that want to make requests to IDNA domains should be responsible for the encoding themselves" is, really, a *complete* non-starter from any perspective I can think of, and has its own security issues. If UTF46 does not do the job, we are just out of luck and the users will pay the price. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: