HTTP Client retry different IP address #50333

GrahamCampbell · 2023-05-15T23:43:40Z

Description

If an A or AAAA record (or similarly for CNAMEs) resolves to multiple IP addresses, and a request fails, it would be great to be able to request a retry using a different IP address, rather than retries using the original cached IP. It is pretty common that a particular IP can be experiencing failures and not cycled out on the other end for multiple minutes, or never at all.

Example

An example use case is sending requests to AWS SNS in us-east-1. I have experienced being routed to a server that always returns a 5xx error, and retries would never recover - being able to retry against a different IP address would have resolved the issue, allowing the retry to succeed.

nicolas-grekas · 2023-05-16T13:37:14Z

This can already be managed from the caller side by doing DNS resolution before issuing the request and using the resolve option.
But I would welcome a more friendly way to do that. How would that work in your opinion?

GrahamCampbell · 2023-05-16T17:26:43Z

Exactly - the proposal was for a more friendly way to do this. I'm not exactly sure how this could work. ;)

danielburger1337 · 2023-05-17T02:37:20Z

Couldn't we add a option like "cycle_ips" to RetryableHttpClient, resolve the domain name, select one at random and pass it to the decorated HttpClient? Thats basically how the very similar "base_uri" option works.

This shouldnt be a big problem or am I missing something?

GrahamCampbell · 2023-05-17T07:54:48Z

The problem is that the curl driver does the dns resolution async, and if we do it ourselves, that blocks, as far as I'm aware.

danielburger1337 · 2023-05-17T17:21:57Z

Yes, but why does that matter? It is a very nieche and optin feature and we only have to resolve the DNS once manually, because we pass the result via the resolve option. to the CurlHttpClient or what ever implementation is used in the end.

A possible solution which I haven't tested right now but will most likely work is that we only resolve the IPs manually AFTER the first request has failed. There is a good chance that when cURL resolves a domain, it uses the same DNS cache as PHP will access when calling dns_get_record.

Most likely we would have to do that anyway because otherwise we do not have access to the full URL that is requested because of relative paths. When only doing that after the first request has failed, we can use the async context ($context->getInfo('url')) or, if configured, the base_uri option.

<?php

$start = microtime(true);

$ips = [];
foreach (dns_get_record('symfony.com', \DNS_A | \DNS_AAAA) as $dnsRecord) {
    $ips[] = $dnsRecord['ip'] ?? $dnsRecord['ipv6'];
}

$duration = microtime(true) - $start;

var_dump([
    'IPs' => $ips,
    'Duration' => $duration
]);

The first run took about 0.05 seconds, subsequent runs took on average 0.002 seconds. I couldn't find it documented which DNS cache PHP uses, but it will most likely be from the OS which cURL should use as well.

I would give it a shot of implementing it after my previous PR to the RetryableHttpClient is merged.

GrahamCampbell · 2023-05-17T17:27:49Z

I wouldn't say it is niche at all. Anyone using AWS services would benefit from this, because it is supper common to get routed to a bad machine and for retries to fail in the way that I described, once you are running at scale. I see it happen most weeks, at a scale of around ~100M requests to SNS per month. It would have also mitigated an outage that occurred last year where SNS was unavailable for ~90 minutes at half of the IP addresses in the A-record, resulting in half my requests failing.

GrahamCampbell · 2023-05-17T17:28:32Z

The reason non-async-dns matters is because a use case is sending 10k parallel requests to SNS, and we don't want to have blocking setting up the requests,

danielburger1337 · 2023-05-17T18:00:10Z

I think I am missing something here.

I proposed only resolving the DNS hostname AFTER a request has failed. Wouldn't the blocking actually work in our favor then because it is only resolved once (cause parallel execution is halted and later resumed) and after that the local DNS cache is used and latency is at around 0.002 seconds (2ms)?

At 10k requests, that would add about 20seconds of delay.

If that is a problem, I dont think there is a general way the component can fix that besides maybe using a static class variable as DNS cache? Then we would need some kind of policy to invalidate the variable cache to avoid memory leaks. Maybe the TTL of the DNS record? Calling time and comparing it against a cached timestamp in the dns record is way less than 2ms.

carsonbot · 2023-11-21T13:03:43Z

Thank you for this suggestion.
There has not been a lot of activity here for a while. Would you still like to see this feature?

xabbuh added the HttpClient label May 17, 2023

javiereguiluz added the Feature label May 20, 2023

carsonbot added the Stalled label Nov 21, 2023

HTTP Client retry different IP address #50333

HTTP Client retry different IP address #50333

GrahamCampbell commented May 15, 2023

nicolas-grekas commented May 16, 2023

GrahamCampbell commented May 16, 2023

danielburger1337 commented May 17, 2023 •

edited

GrahamCampbell commented May 17, 2023

danielburger1337 commented May 17, 2023

GrahamCampbell commented May 17, 2023

GrahamCampbell commented May 17, 2023

danielburger1337 commented May 17, 2023 •

edited

carsonbot commented Nov 21, 2023

HTTP Client retry different IP address #50333

HTTP Client retry different IP address #50333

Comments

GrahamCampbell commented May 15, 2023

Description

Example

nicolas-grekas commented May 16, 2023

GrahamCampbell commented May 16, 2023

danielburger1337 commented May 17, 2023 • edited

GrahamCampbell commented May 17, 2023

danielburger1337 commented May 17, 2023

GrahamCampbell commented May 17, 2023

GrahamCampbell commented May 17, 2023

danielburger1337 commented May 17, 2023 • edited

carsonbot commented Nov 21, 2023

danielburger1337 commented May 17, 2023 •

edited

danielburger1337 commented May 17, 2023 •

edited