Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-36216: Add check for characters in netloc that normalize to separators #12201

Merged
merged 2 commits into from Mar 7, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
@@ -124,6 +124,11 @@ or on combining URL components into a URL string.
Unmatched square brackets in the :attr:`netloc` attribute will raise a
:exc:`ValueError`.

Characters in the :attr:`netloc` attribute that decompose under NFKC
normalization (as used by the IDNA encoding) into any of ``/``, ``?``,
``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
decomposed before parsing, no error will be raised.

.. versionchanged:: 3.2
Added IPv6 URL parsing capabilities.

@@ -136,6 +141,10 @@ or on combining URL components into a URL string.
Out-of-range port numbers now raise :exc:`ValueError`, instead of
returning :const:`None`.

.. versionchanged:: 3.8
Characters that affect netloc parsing under NFKC normalization will
now raise :exc:`ValueError`.


.. function:: parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None)

@@ -259,10 +268,19 @@ or on combining URL components into a URL string.
Unmatched square brackets in the :attr:`netloc` attribute will raise a
:exc:`ValueError`.

Characters in the :attr:`netloc` attribute that decompose under NFKC
normalization (as used by the IDNA encoding) into any of ``/``, ``?``,
``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
decomposed before parsing, no error will be raised.

.. versionchanged:: 3.6
Out-of-range port numbers now raise :exc:`ValueError`, instead of
returning :const:`None`.

.. versionchanged:: 3.8
Characters that affect netloc parsing under NFKC normalization will
now raise :exc:`ValueError`.


.. function:: urlunsplit(parts)

@@ -1,3 +1,5 @@
import sys
import unicodedata
import unittest
import urllib.parse

@@ -994,6 +996,27 @@ def test_all(self):
expected.append(name)
self.assertCountEqual(urllib.parse.__all__, expected)

def test_urlsplit_normalization(self):
# Certain characters should never occur in the netloc,
# including under normalization.
# Ensure that ALL of them are detected and cause an error
illegal_chars = '/:#?@'
hex_chars = {'{:04X}'.format(ord(c)) for c in illegal_chars}
denorm_chars = [
c for c in map(chr, range(128, sys.maxunicode))
if (hex_chars & set(unicodedata.decomposition(c).split()))
and c not in illegal_chars
]
# Sanity check that we found at least one such character
self.assertIn('\u2100', denorm_chars)
self.assertIn('\uFF03', denorm_chars)

for scheme in ["http", "https", "ftp"]:
for c in denorm_chars:
url = "{}://netloc{}false.netloc/path".format(scheme, c)
with self.subTest(url=url, char='{:04X}'.format(ord(c))):
with self.assertRaises(ValueError):
urllib.parse.urlsplit(url)

class Utility_Tests(unittest.TestCase):
"""Testcase to test the various utility functions in the urllib."""
@@ -396,6 +396,21 @@ def _splitnetloc(url, start=0):
delim = min(delim, wdelim) # use earliest delim position
return url[start:delim], url[delim:] # return (domain, rest)

def _checknetloc(netloc):
if not netloc or netloc.isascii():
return
# looking for characters like \u2100 that expand to 'a/c'
# IDNA uses NFKC equivalence, so normalize for this check
import unicodedata
netloc2 = unicodedata.normalize('NFKC', netloc)
if netloc == netloc2:
return
_, _, netloc = netloc.rpartition('@') # anything to the left of '@' is okay
Copy link
Contributor

@mcepl mcepl May 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zooba @tiran Could you tell me something about this line (it is now https://github.com/python/cpython/blob/master/Lib/urllib/parse.py#L405)? It seems to me that it exactly makes the first example from https://bugs.python.org/issue36216 fail as before:

>>> u = "https://example.com\uFF03@bing.com"
>>> urlsplit(u).netloc.rpartition("@")[2]
bing.com

for c in '/?#@:':
if c in netloc2:
raise ValueError("netloc '" + netloc2 + "' contains invalid " +
"characters under NFKC normalization")

def urlsplit(url, scheme='', allow_fragments=True):
"""Parse a URL into 5 components:
<scheme>://<netloc>/<path>?<query>#<fragment>
@@ -424,6 +439,7 @@ def urlsplit(url, scheme='', allow_fragments=True):
url, fragment = url.split('#', 1)
if '?' in url:
url, query = url.split('?', 1)
_checknetloc(netloc)
v = SplitResult('http', netloc, url, query, fragment)
_parse_cache[key] = v
return _coerce_result(v)
@@ -447,6 +463,7 @@ def urlsplit(url, scheme='', allow_fragments=True):
url, fragment = url.split('#', 1)
if '?' in url:
url, query = url.split('?', 1)
_checknetloc(netloc)
v = SplitResult(scheme, netloc, url, query, fragment)
_parse_cache[key] = v
return _coerce_result(v)
@@ -0,0 +1,3 @@
Changes urlsplit() to raise ValueError when the URL contains characters that
decompose under IDNA encoding (NFKC-normalization) into characters that
affect how the URL is parsed.