urllib: urlretrieve() seems to ignore provided host header #96287

Open
Evernow opened this issue Aug 25, 2022 · 1 comment
Labels
type-bug An unexpected behavior, bug, or error

Comments


Evernow commented Aug 25, 2022

Bug report

I have had trouble confirming this, but urlretrieve seems to ignore a Host header supplied through opener.addheaders. It does appear to honor User-agent and Referer set the same way. I have two functions doing the same download, one with urlretrieve and one with requests. The requests version works as expected, and if I remove its Host header it fails in the same way the urlretrieve version does.

import ssl
import urllib.request

def download_helper(url, fname):
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0'),
                         ('Referer', "https://www.amd.com/en/support/graphics/amd-radeon-6000-series/amd-radeon-6700-series/amd-radeon-rx-6700-xt"),
                         ('Host', 'us.download.nvidia.com')]
    urllib.request.install_opener(opener)
    # Disable certificate verification (the download URL below uses a bare IP address).
    ssl._create_default_https_context = ssl._create_unverified_context
    urllib.request.urlretrieve(url, filename=fname)

import requests

def download_helper2(url, fname):
    my_referer = "https://www.amd.com/en/support/graphics/amd-radeon-6000-series/amd-radeon-6700-series/amd-radeon-rx-6700-xt"
    resp = requests.get(url, verify=False, stream=True, headers={
        'referer': my_referer,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0',
        'Host': 'us.download.nvidia.com'
        })
    # Stream the response body to disk in 1 KiB chunks.
    with open(fname, 'wb') as file:
        for data in resp.iter_content(chunk_size=1024):
            file.write(data)

download_helper2('https://192.229.211.70/Windows/516.94/516.94-desktop-win10-win11-64bit-international-dch-whql.exe', r'516.94-desktop-win10-win11-64bit-international-dch-whql.exe')

Your environment

  • CPython versions tested on: 3.10.6
  • Operating system and architecture: Windows Server 2022 10.0.20348.887 ; AMD64
@Evernow Evernow added the type-bug An unexpected behavior, bug, or error label Aug 25, 2022
Member

tirkarthi commented Sep 3, 2022

urlretrieve is a legacy interface that the docs do not recommend. As for the issue: it seems the request object is constructed without a Host header, so the host parsed from the URL is set as Host first. The loop over addheaders that follows then skips your Host entry, because the request already has one. You might want to construct your own Request object with the appropriate headers instead, along the lines of the example below, which is adapted from this page: https://docs.python.org/3/howto/urllib2.html?highlight=urllib2#fetching-urls

https://docs.python.org/3/library/urllib.request.html?highlight=urlretrieve#legacy-interface

cpython/Lib/urllib/request.py

Lines 1293 to 1302 in 837ce64

        sel_host = host
        if request.has_proxy():
            scheme, sel = _splittype(request.selector)
            sel_host, sel_path = _splithost(sel)
        if not request.has_header('Host'):
            request.add_unredirected_header('Host', sel_host)
        for name, value in self.parent.addheaders:
            name = name.capitalize()
            if not request.has_header(name):
                request.add_unredirected_header(name, value)
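
The ordering in that excerpt can be reproduced without any network access. The following is only a sketch: `apply_headers` is a hypothetical helper that mimics the excerpt (with the proxy branch left out), not a function in urllib, and the URL path and Referer value are placeholders. It compares a Host that exists only in addheaders with one set directly on the Request.

```python
import urllib.request


def apply_headers(request, addheaders):
    """Hypothetical helper mimicking the excerpt above (proxy branch omitted)."""
    # Step 1: Host is filled in from the URL if the request has none yet.
    if not request.has_header("Host"):
        request.add_unredirected_header("Host", request.host)
    # Step 2: addheaders entries are applied only for headers not already set.
    for name, value in addheaders:
        name = name.capitalize()
        if not request.has_header(name):
            request.add_unredirected_header(name, value)
    return request


addheaders = [
    ("Host", "us.download.nvidia.com"),
    ("Referer", "https://example.com/"),  # placeholder value
]

# Case 1: Host appears only in addheaders, so the URL's host is set first.
req1 = apply_headers(
    urllib.request.Request("https://192.229.211.70/file.exe"), addheaders
)
host_from_addheaders = req1.get_header("Host")

# Case 2: Host is set on the Request itself, so step 1 is skipped.
req2 = apply_headers(
    urllib.request.Request(
        "https://192.229.211.70/file.exe",
        headers={"Host": "us.download.nvidia.com"},
    ),
    addheaders,
)
host_from_request = req2.get_header("Host")

print(host_from_addheaders)  # the IP taken from the URL
print(host_from_request)     # the custom Host
```

Referer still comes through from addheaders in both cases, which matches the observation that only Host appears to be ignored.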

# https://docs.python.org/3/howto/urllib2.html?highlight=urllib2#fetching-urls
import shutil
import urllib.request

def download_helper2(url, fname):
    HOST = "us.download.nvidia.com"
    headers = dict(
        [
            (
                "User-agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0",
            ),
            (
                "Referer",
                "https://www.amd.com/en/support/graphics/amd-radeon-6000-series/amd-radeon-6700-series/amd-radeon-rx-6700-xt",
            ),
            ("Host", HOST),
            ("test", "test"),
        ]
    )

    request = urllib.request.Request(url=url, headers=headers)

    with urllib.request.urlopen(request) as response:
        with open(fname, "wb") as file_:
            shutil.copyfileobj(response, file_)


download_helper2("http://localhost:8000/test", "/tmp/test")
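
To check what actually goes over the wire, one option is a throwaway local server that echoes back the Host header it received. This is a rough sketch, not part of the original report: `EchoHostHandler` and `observed_hosts` are hypothetical names, the server binds to an arbitrary free port on 127.0.0.1, and proxies are disabled explicitly so the request reaches the local server directly.

```python
import http.server
import threading
import urllib.request


class EchoHostHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical test handler: replies with the Host header it received."""

    def do_GET(self):
        body = self.headers.get("Host", "").encode("ascii")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def observed_hosts(custom_host="us.download.nvidia.com"):
    """Return the Host seen by the server (via addheaders, via Request headers)."""
    server = http.server.HTTPServer(("127.0.0.1", 0), EchoHostHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/test" % server.server_address[1]
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        opener.addheaders = [("Host", custom_host)]

        # Host supplied only through addheaders.
        with opener.open(url) as response:
            via_addheaders = response.read().decode("ascii")

        # Host supplied on the Request object itself.
        request = urllib.request.Request(url, headers={"Host": custom_host})
        with opener.open(request) as response:
            via_request = response.read().decode("ascii")
    finally:
        server.shutdown()
        server.server_close()
    return via_addheaders, via_request


via_addheaders, via_request = observed_hosts()
print("addheaders:", via_addheaders)
print("Request:   ", via_request)
```

If the explanation above holds, the first line shows the local address and port taken from the URL, and the second shows the custom Host.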
