I’ve added a proxy server to handle 429 responses: it sleeps for 5 minutes before retrying, and keeps retrying until it gets a successful response; only then does it return the response to the client.
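In case it helps, the proxy logic boils down to something like this (a simplified sketch rather than the exact code; the upstream address, port, and error handling are placeholders):

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"time"
)

// Placeholders: the real proxy points at the Azure management endpoint and
// sleeps 5 minutes between retries.
const (
	upstream   = "https://management.azure.com"
	retryDelay = 5 * time.Minute
)

func handler(w http.ResponseWriter, r *http.Request) {
	// Buffer the body so the request can be replayed on each retry.
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}

	for {
		req, err := http.NewRequestWithContext(r.Context(), r.Method, upstream+r.RequestURI, bytes.NewReader(body))
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		req.Header = r.Header.Clone()

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}

		// On 429, sleep and replay the same request instead of passing it on.
		if resp.StatusCode == http.StatusTooManyRequests {
			resp.Body.Close()
			time.Sleep(retryDelay)
			continue
		}

		// Anything else is forwarded back to the client as-is.
		for k, vals := range resp.Header {
			for _, v := range vals {
				w.Header().Add(k, v)
			}
		}
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
		resp.Body.Close()
		return
	}
}

func main() {
	log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(handler)))
}
```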
Context deadline exceeded: our tooling will raise that error if the server does not respond within the expected timeframe. This is different from a 429, which is a meaningful response from the server.
The Azure Go SDK already has retry logic; see this link.
So you shouldn’t need to handle 429 yourself. We could consider exposing those settings via CloudQuery config if you submit a feature request for it here. Please describe the conditions under which you needed to handle the 429 error yourself.
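For reference, the settings in question are the azcore policy.RetryOptions that every ARM client accepts. A minimal sketch of what tuning them directly against the SDK looks like (the values, the storage client, and the placeholder subscription ID are just for illustration; the plugin doesn’t expose these today):

```go
package main

import (
	"log"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/arm"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/storage/armstorage"
)

func main() {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Illustrative values: these are the SDK-level retry settings that a
	// CloudQuery config option would have to map onto. Fields left at zero
	// fall back to the SDK defaults.
	opts := &arm.ClientOptions{
		ClientOptions: azcore.ClientOptions{
			Retry: policy.RetryOptions{
				MaxRetries:    5,
				RetryDelay:    2 * time.Second,
				MaxRetryDelay: 2 * time.Minute,
			},
		},
	}

	client, err := armstorage.NewAccountsClient("<subscription-id>", cred, opts)
	if err != nil {
		log.Fatal(err)
	}
	_ = client // this client now retries throttled requests using the options above
}
```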
As for context deadline exceeded, that might also indicate a memory issue, so I’d try reducing concurrency. You can refer to the documentation here.
Finally, please make sure you’re not using az login, as it can cause performance issues; see this link.
The reason I added this 429 handling is that I noticed inconsistent behavior when throttling occurs; see this thread: Discord Thread.
After adding this proxy as a workaround, I noticed that it did solve the inconsistency problem.
BUT
I then ran this solution on one huge Azure subscription with a lot of storage resources, and now I get context deadline exceeded; I believe there are simply too many resources and some requests sleep for too long. I changed the sleep time to 30 seconds, but some requests can still end up in a sleep loop if they keep hitting 429s.
I’ll try reducing the concurrency and let you know. I’m not using az login; I pass the following environment variables to the CQ CLI:
I wasn’t able to monitor the resources of the machine. Sorry about that.
So I did some research and found something interesting about the 429 problem mentioned in the other thread. The Azure Go SDK CHANGELOG mentions:
Don't retry a request if the Retry-After delay is greater than the configured RetryOptions.MaxRetryDelay.
Using the proxy, I was able to see that most 429 responses from azure_storage_* come back with a Retry-After header greater than the default RetryOptions.MaxRetryDelay (60 seconds), which means the SDK won’t retry them.
Correct me if I’m wrong, but CQ uses the default value and that’s why it does not handle 429 responses well (at least not from azure_storage_*). This can explain the inconsistent behavior.
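To make that concrete, here’s a tiny illustration of the rule from the CHANGELOG (not the actual azcore source, just the comparison as I understand it): with the default 60-second cap, a Retry-After of two minutes means the 429 is returned to the caller instead of being retried.

```go
package main

import (
	"fmt"
	"time"
)

// Not the actual azcore implementation -- just the rule from the CHANGELOG:
// a 429 whose Retry-After exceeds MaxRetryDelay is not retried.
func willRetry(retryAfter, maxRetryDelay time.Duration) bool {
	return retryAfter <= maxRetryDelay
}

func main() {
	const defaultMaxRetryDelay = 60 * time.Second // azcore default, as discussed above

	fmt.Println(willRetry(30*time.Second, defaultMaxRetryDelay)) // true: retried after 30s
	fmt.Println(willRetry(2*time.Minute, defaultMaxRetryDelay))  // false: the 429 surfaces to CloudQuery
}
```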
If you could also monitor the resources of the machine that runs the sync, that would help us debug.
Could you point me to how I can monitor this? I’m running on my local machine now (Linux, if that makes a difference).
That makes sense about the Azure Go SDK. Thanks for digging into it. I assume they do that so it won’t sleep/hang for too long.
So I think the solution would be to expose those as options in the spec, or even have different defaults for different tables (per issue #9860), since depending on Retry-After we might not retry some resources at all. I’ll take a look at that.
About monitoring, if you’re running locally, I would use a tool like htop to get some basic visibility on the machine resources while running a sync.