Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP Client retry different IP address #50333

Open
GrahamCampbell opened this issue May 15, 2023 · 9 comments
Open

HTTP Client retry different IP address #50333

GrahamCampbell opened this issue May 15, 2023 · 9 comments

Comments

@GrahamCampbell
Copy link
Contributor

Description

If an A or AAAA record (or similarly for CNAMEs) resolves to multiple IP addresses, and a request fails, it would be great to be able to request a retry using a different IP address, rather than retries using the original cached IP. It is pretty common that a particular IP can be experiencing failures and not cycled out on the other end for multiple minutes, or never at all.

Example

An example use case is sending requests to AWS SNS in us-east-1. I have experienced being routed to a server that always returns a 5xx error, and retries would never recover - being able to retry against a different IP address would have resolved the issue, allowing the retry to succeed.

@nicolas-grekas
Copy link
Member

This can already be managed from the caller side by doing DNS resolution before issuing the request and using the resolve option.
But I would welcome a more friendly way to do that. How would that work in your opinion?

@GrahamCampbell
Copy link
Contributor Author

Exactly - the proposal was for a more friendly way to do this. I'm not exactly sure how this could work. ;)

@danielburger1337
Copy link
Contributor

danielburger1337 commented May 17, 2023

Couldn't we add a option like "cycle_ips" to RetryableHttpClient, resolve the domain name, select one at random and pass it to the decorated HttpClient? Thats basically how the very similar "base_uri" option works.

This shouldnt be a big problem or am I missing something?

@GrahamCampbell
Copy link
Contributor Author

The problem is that the curl driver does the dns resolution async, and if we do it ourselves, that blocks, as far as I'm aware.

@danielburger1337
Copy link
Contributor

Yes, but why does that matter? It is a very nieche and optin feature and we only have to resolve the DNS once manually, because we pass the result via the resolve option. to the CurlHttpClient or what ever implementation is used in the end.

A possible solution which I haven't tested right now but will most likely work is that we only resolve the IPs manually AFTER the first request has failed. There is a good chance that when cURL resolves a domain, it uses the same DNS cache as PHP will access when calling dns_get_record.

Most likely we would have to do that anyway because otherwise we do not have access to the full URL that is requested because of relative paths. When only doing that after the first request has failed, we can use the async context ($context->getInfo('url')) or, if configured, the base_uri option.

<?php

$start = microtime(true);

$ips = [];
foreach (dns_get_record('symfony.com', \DNS_A | \DNS_AAAA) as $dnsRecord) {
    $ips[] = $dnsRecord['ip'] ?? $dnsRecord['ipv6'];
}

$duration = microtime(true) - $start;

var_dump([
    'IPs' => $ips,
    'Duration' => $duration
]);

The first run took about 0.05 seconds, subsequent runs took on average 0.002 seconds. I couldn't find it documented which DNS cache PHP uses, but it will most likely be from the OS which cURL should use as well.

I would give it a shot of implementing it after my previous PR to the RetryableHttpClient is merged.

@GrahamCampbell
Copy link
Contributor Author

I wouldn't say it is niche at all. Anyone using AWS services would benefit from this, because it is supper common to get routed to a bad machine and for retries to fail in the way that I described, once you are running at scale. I see it happen most weeks, at a scale of around ~100M requests to SNS per month. It would have also mitigated an outage that occurred last year where SNS was unavailable for ~90 minutes at half of the IP addresses in the A-record, resulting in half my requests failing.

@GrahamCampbell
Copy link
Contributor Author

The reason non-async-dns matters is because a use case is sending 10k parallel requests to SNS, and we don't want to have blocking setting up the requests,

@danielburger1337
Copy link
Contributor

danielburger1337 commented May 17, 2023

I think I am missing something here.

I proposed only resolving the DNS hostname AFTER a request has failed. Wouldn't the blocking actually work in our favor then because it is only resolved once (cause parallel execution is halted and later resumed) and after that the local DNS cache is used and latency is at around 0.002 seconds (2ms)?

At 10k requests, that would add about 20seconds of delay.

If that is a problem, I dont think there is a general way the component can fix that besides maybe using a static class variable as DNS cache? Then we would need some kind of policy to invalidate the variable cache to avoid memory leaks. Maybe the TTL of the DNS record? Calling time and comparing it against a cached timestamp in the dns record is way less than 2ms.

@carsonbot
Copy link

Thank you for this suggestion.
There has not been a lot of activity here for a while. Would you still like to see this feature?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants