Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some code paths in ssl and _socket still import idna unconditionally #90906

Open
slingamn mannequin opened this issue Feb 14, 2022 · 7 comments
Open

some code paths in ssl and _socket still import idna unconditionally #90906

slingamn mannequin opened this issue Feb 14, 2022 · 7 comments
Labels
3.12 expert-SSL interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@slingamn
Copy link
Mannequin

slingamn mannequin commented Feb 14, 2022

BPO 46750
Nosy @tiran, @slingamn
PRs
  • bpo-46750: only import idna when required on several common code paths #31328
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2022-02-14.15:55:15.583>
    labels = ['interpreter-core', 'expert-SSL', '3.11', 'library', 'performance']
    title = 'some code paths in ssl and _socket still import idna unconditionally'
    updated_at = <Date 2022-02-16.22:58:31.065>
    user = 'https://github.com/slingamn'

    bugs.python.org fields:

    activity = <Date 2022-02-16.22:58:31.065>
    actor = 'slingamn'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core', 'Library (Lib)', 'SSL']
    creation = <Date 2022-02-14.15:55:15.583>
    creator = 'slingamn'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 46750
    keywords = ['patch']
    message_count = 7.0
    messages = ['413229', '413231', '413237', '413242', '413245', '413252', '413364']
    nosy_count = 2.0
    nosy_names = ['christian.heimes', 'slingamn']
    pr_nums = ['31328']
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue46750'
    versions = ['Python 3.11']

    @slingamn
    Copy link
    Mannequin Author

    slingamn mannequin commented Feb 14, 2022

    Importing the idna encoding has a significant time and memory cost. Therefore, the standard library tries to avoid importing it when it's not needed (i.e. when the domain name is already pure ASCII), e.g. in Lib/http/client.py and Modules/socketmodule.c with idna_converter.

    However, there are code paths that still attempt to encode or decode as idna unconditionally, in particular Lib/ssl.py and _socket.getaddrinfo. Here's a one-line test case:

    python3 -c "import sys, urllib.request; urllib.request.urlopen('https://www.google.com'); assert 'encodings.idna' not in sys.modules"

    These code paths can be converted using existing code to do the import conditionally (I'll send a PR).

    @slingamn slingamn mannequin assigned tiran Feb 14, 2022
    @slingamn slingamn mannequin added 3.11 interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir expert-SSL labels Feb 14, 2022
    @slingamn slingamn mannequin assigned tiran Feb 14, 2022
    @slingamn slingamn mannequin added performance Performance or resource usage 3.11 interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir expert-SSL labels Feb 14, 2022
    @tiran
    Copy link
    Member

    tiran commented Feb 14, 2022

    Please provide benchmarks and data for your claim that encodings.idna is a performance bottleneck.

    encodings.idna is a simple, short module without state. On my system it takes about 0.15 msec to import the module. When unicodedata and stringprep aren't loaded yet, it still takes less than 0.5 msec. The stringprep and unicodedata modules are used by other modules, e.g. urllib parse. It's likely that any non-trivial program with network access has both imported already.

    $ python3 -m timeit -s "import sys" "import encodings.idna; sys.modules.pop('encodings.idna'); sys.modules.pop('stringprep'); sys.modules.pop('unicodedata')"
    500 loops, best of 5: 488 usec per loop

    The IDNA codec performs additional verification of the input. You cannot replace it with a simple "try encode to ASCII" block:

    >>> ("a"*65).encode('idna')
    Traceback (most recent call last):
      File "/usr/lib64/python3.10/encodings/idna.py", line 167, in encode
        raise UnicodeError("label too long")
    UnicodeError: label too long
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)

    @tiran tiran unassigned tiran Feb 14, 2022
    @slingamn
    Copy link
    Mannequin Author

    slingamn mannequin commented Feb 14, 2022

    Thanks for the prompt response. As evidence that this was of concern to the development team in the past, here's an issue where the unnecessary import of idna was treated as a regression:

    https://bugs.python.org/issue22127

    The discussion there also examines the semantic change produced by the optimization (some invalid labels making it to a DNS lookup instead of being rejected) and doesn't consider it to be a breaking change (albeit a reason not to backport).

    (I also see references in documentation to a discussion labeled "RFE bpo-1472176", but am unable to find the actual bug tracker or database entry this refers to.)

    A time cost of 15 milliseconds seems accurate to me. The RAM cost on my release build of Python 3.8.10 is about 600 KB in RSS (this is approximately 5% of the baseline interpreter usage).

    I cannot reproduce the claim that urllib.parse imports stringprep or unicodedata:

    python3 -c "import sys, urllib.parse; assert 'stringprep' not in sys.modules"
    
    python3 -c "import sys, urllib.parse; assert 'unicodedata' not in sys.modules"
    

    I am developing a new lightweight http library that does use urllib.parse; on my system, these patches allow it to function without importing stringprep, idna, or unicodedata:

    https://github.com/slingamn/mureq

    @tiran
    Copy link
    Member

    tiran commented Feb 14, 2022

    It's 100 times faster. The initial import takes about 150 μsec (0.15msec) and it's a one time cost. IDNA encoding of a typical hostname is about 70nsec. A DNS lookup is three magnitudes slower.

    @slingamn
    Copy link
    Mannequin Author

    slingamn mannequin commented Feb 14, 2022

    Sorry, I should have been more clear: I am including the initial costs of importing stringprep and unicodedata. On my system:

    $ python3 -c "import time; start = time.time(); r = 'a'.encode('idna'); elapsed = time.time() - start; print(elapsed)"
    0.0053806304931640625

    So the earlier measurement of 15 milliseconds was excessive (I'm not sure what happened) but it's the right order of magnitude: I can reproduce 5 milliseconds reliably.

    @slingamn
    Copy link
    Mannequin Author

    slingamn mannequin commented Feb 14, 2022

    (Looks like it was 15 milliseconds when measuring inside python -i; I'm not sure what the root cause of the difference is, but clearly the 5 millisecond measurement from regular python is more accurate.)

    @slingamn
    Copy link
    Mannequin Author

    slingamn mannequin commented Feb 16, 2022

    I wanted to check in about the status of this patch. Here's the case for the patch, as I understand it:

    1. It is not a novel optimization, it just consistently applies design decisions that were made previously (RFE bpo-1472176 and bpo-22127).
    2. The performance impact of the initial import of encodings.idna and its transitive dependencies is in fact macroscopic relative to the baseline costs of the interpreter: 5 milliseconds to import the modules and 500 KB in increased RSS, relative to baselines of approximately 50 milliseconds to set up and tear down an interpreter and 10 MB in RSS.

    Here are the relevant benchmarks, first for time:

    import time
    start = time.time()
    'a'.encode('idna')
    print(time.time() - start)

    and for memory:

    import os
    def rss():
        os.system('grep VmRSS /proc/' + str(os.getpid()) + '/status')
    rss()
    'a'.encode('idna')
    rss()

    Are there potential changes to this patch that would mitigate your concerns?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @iritkatriel iritkatriel added 3.12 and removed 3.11 labels Sep 9, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.12 expert-SSL interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants