I’ve added a proxy server to handle 429 responses: it sleeps for 5 minutes before retrying, and keeps retrying until it gets a successful response; only then does it return the response to the client.
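In case it helps, the proxy logic boils down to something like this (a simplified sketch rather than the exact code; the upstream address, port, and error handling are placeholders):

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"time"
)

// Placeholders: the real proxy points at the Azure management endpoint and
// sleeps 5 minutes between retries.
const (
	upstream   = "https://management.azure.com"
	retryDelay = 5 * time.Minute
)

func handler(w http.ResponseWriter, r *http.Request) {
	// Buffer the body so the request can be replayed on each retry.
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}

	for {
		req, err := http.NewRequestWithContext(r.Context(), r.Method, upstream+r.RequestURI, bytes.NewReader(body))
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		req.Header = r.Header.Clone()

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}

		// On 429, sleep and replay the same request instead of passing it on.
		if resp.StatusCode == http.StatusTooManyRequests {
			resp.Body.Close()
			time.Sleep(retryDelay)
			continue
		}

		// Anything else is forwarded back to the client as-is.
		for k, vals := range resp.Header {
			for _, v := range vals {
				w.Header().Add(k, v)
			}
		}
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
		resp.Body.Close()
		return
	}
}

func main() {
	log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(handler)))
}
```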
Context deadline exceeded: our tooling will raise that error if the server does not respond within the expected timeframe. This is different from a 429, which is a meaningful response from the server.
The Azure Go SDK already has retry logic; see this link.
So you shouldn’t need to handle 429 yourself. We could consider exposing those settings via CloudQuery config if you submit a feature request for it here. Please describe the conditions under which you needed to handle the 429 error yourself.
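For reference, the settings in question are the azcore policy.RetryOptions that every ARM client accepts. A minimal sketch of what tuning them directly against the SDK looks like (the values, the storage client, and the placeholder subscription ID are just for illustration; the plugin doesn’t expose these today):

```go
package main

import (
	"log"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/arm"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/storage/armstorage"
)

func main() {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Illustrative values: these are the SDK-level retry settings that a
	// CloudQuery config option would have to map onto. Fields left at zero
	// fall back to the SDK defaults.
	opts := &arm.ClientOptions{
		ClientOptions: azcore.ClientOptions{
			Retry: policy.RetryOptions{
				MaxRetries:    5,
				RetryDelay:    2 * time.Second,
				MaxRetryDelay: 2 * time.Minute,
			},
		},
	}

	client, err := armstorage.NewAccountsClient("<subscription-id>", cred, opts)
	if err != nil {
		log.Fatal(err)
	}
	_ = client // this client now retries throttled requests using the options above
}
```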
As for context deadline exceeded, that might also indicate a memory issue, so I’d try reducing concurrency. You can refer to the documentation here.
Finally, please make sure you’re not using az login, as it can cause performance issues; see this link.
The reason I added this 429 handling is that I noticed inconsistent behavior when throttling occurs; see this thread: Discord Thread.
After adding this proxy as a workaround, I noticed that it did solve the inconsistency problem.
BUT
I then ran this solution on one huge Azure subscription with a lot of storage resources, and now I get context deadline exceeded; I believe there are simply too many resources and some requests sleep for too long. I changed the sleep time to 30 seconds, but some requests can still end up in a sleep loop if they keep hitting 429s.
I’ll try reducing the concurrency and let you know. I’m not using az login; I pass the following environment variables to the CQ CLI:
I wasn’t able to monitor the resources of the machine. Sorry about that.
So I did some research and found something interesting about the 429 problem mentioned in the other thread. The Azure Go SDK CHANGELOG mentions:
Don't retry a request if the Retry-After delay is greater than the configured RetryOptions.MaxRetryDelay.
Using the proxy, I was able to see that most 429 responses from azure_storage_* come back with a Retry-After header greater than the default RetryOptions.MaxRetryDelay (60 seconds), which means the SDK won’t retry them.
Correct me if I’m wrong, but CQ uses the default value and that’s why it does not handle 429 responses well (at least not from azure_storage_*). This can explain the inconsistent behavior.
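To make that concrete, here’s a tiny illustration of the rule from the CHANGELOG (not the actual azcore source, just the comparison as I understand it): with the default 60-second cap, a Retry-After of two minutes means the 429 is returned to the caller instead of being retried.

```go
package main

import (
	"fmt"
	"time"
)

// Not the actual azcore implementation -- just the rule from the CHANGELOG:
// a 429 whose Retry-After exceeds MaxRetryDelay is not retried.
func willRetry(retryAfter, maxRetryDelay time.Duration) bool {
	return retryAfter <= maxRetryDelay
}

func main() {
	const defaultMaxRetryDelay = 60 * time.Second // azcore default, as discussed above

	fmt.Println(willRetry(30*time.Second, defaultMaxRetryDelay)) // true: retried after 30s
	fmt.Println(willRetry(2*time.Minute, defaultMaxRetryDelay))  // false: the 429 surfaces to CloudQuery
}
```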
If you could also monitor the resources of the machine that runs the sync, that would help us debug.
Could you point me to how I can monitor this? I’m running on my local machine now (Linux, if that makes a difference).
That makes sense about the Azure Go SDK. Thanks for digging into it. I assume they do that so it won’t sleep/hang for too long.
So I think the solution would be to expose those as options in the spec, or even have different defaults for different tables (per issue #9860), since depending on Retry-After we might not retry some resources at all. I’ll take a look at that.
About monitoring, if you’re running locally, I would use a tool like htop to get some basic visibility on the machine resources while running a sync.