Inconsistent sync results and rate limit issues in CloudQuery Azure configuration

Hi,

I’ve noticed that when an Azure sync hits rate limits, the results are inconsistent.

For example:

  1. With tables azure_storage_accounts and azure_storage_containers, the logs show no rate limiting and azure_storage_containers syncs 866 resources with 0 errors, which is the expected result. I’ve run multiple syncs with this configuration and the count stays the same.
  2. With tables azure_storage_accounts, azure_storage_containers, azure_storage_file_shares, and azure_storage_blob_services, the logs show rate limiting and azure_storage_containers syncs 780 resources with 20 errors. I’ve run multiple syncs with this configuration and the count varies from run to run (sometimes X, sometimes Y), but is not the expected 866.
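
One way to flag incomplete runs like the second one is to compare each run against the known-good baseline. This is a minimal sketch, assuming the counts and error totals are copied by hand from each sync's summary; `check_sync` is a hypothetical helper and not part of CloudQuery.

```python
def check_sync(table, synced, errors, expected):
    """Return a list of problems for one table's sync result.

    An empty list means the run matched the baseline and had no errors.
    """
    problems = []
    if errors:
        problems.append(f"{table}: {errors} errors")
    if synced != expected:
        problems.append(f"{table}: synced {synced}, expected {expected}")
    return problems


# Numbers taken from the two cases described above.
clean_run = check_sync("azure_storage_containers", synced=866, errors=0, expected=866)
throttled_run = check_sync("azure_storage_containers", synced=780, errors=20, expected=866)

print(clean_run)
print(throttled_run)
```

Treating any run with errors as incomplete avoids mistaking a partial result for a real change in the number of containers.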

I’m using the Azure source plugin v9.3.0, if that matters.
Has anyone else experienced this?
Thanks!

That is interesting… Just opening an issue to track this.

Here’s the link for tracking: GitHub Issue #13944

Thanks for reporting.

@thorough-bear Reducing the concurrency value should help with this. But we can also investigate on our side to see if there’s a way to better schedule the queries so that they don’t hit Azure rate limits.

@herman I’m using concurrency=1, but it doesn’t seem to help.

Could you share your (redacted) config, @thorough-bear?

kind: source
spec:
  name: AZURE_135c8a6b-2a45-4225-9aee-87dde3de856b_09063151-829b-4cce-b6ba-8d89873620ed
  path: "cloudquery/azure"
  version: v9.3.0
  tables: ['azure_storage_accounts', 'azure_storage_containers']
  destinations: ['elasticsearch']
  skip_dependent_tables: True
  deterministic_cq_id: True
  spec:
    subscriptions: [4d9b301c-ceab-4682-ba37-88700e6f5d0b]
    cloud_name: AzurePublic
    concurrency: 1
    discovery_concurrency: 1

This is the rendered config. It is loaded as part of a ConfigMap into a K8s job, so I had to print it.
Here is the template before it is rendered:

kind: source
spec:
  name: {{ config.name }}
  path: "cloudquery/{{ config.source_name }}"
  version: {{ config.version }}
  tables: {{ config.tables }}
  destinations: {{ config.destinations }}
  skip_dependent_tables: {{ config.skip_dependent_tables }}
  deterministic_cq_id: {{ config.deterministic_cq_id }}
  spec:
    subscriptions: [{{ config.subscription_id }}]
    cloud_name: {{ config.cloud_name }}
    concurrency: {{ config.concurrency }}
    discovery_concurrency: {{ config.discovery_concurrency }}

And you’re saying this happens even with concurrency set to 1? That’s quite strange; we’ll have to investigate. :thinking:

Yes, I set concurrency=1 every time. While preparing the config to send you, I had to run my program a couple of times, which triggered a couple of syncs. I noticed the following:

When I synced the tables azure_storage_accounts and azure_storage_containers twice in a row, the logs for the second sync showed many 429 TooManyRequests responses, and only 720 azure_storage_containers resources were synced, with 43 errors. The first sync had no errors and synced 866 azure_storage_containers resources (as expected).
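
One possible reading of this, not confirmed anywhere in the thread, is that the limit counts requests over a time window rather than requests in flight. In that case concurrency=1 would not help, and a second sync started right after the first would inherit a nearly exhausted quota. The sketch below is a toy model of that idea only: `FixedWindowLimiter`, the quota of 1000 requests per 300 seconds, and the 900 requests per sync are all made-up values for illustration, and Azure's real throttling rules may differ.

```python
from dataclasses import dataclass


@dataclass
class FixedWindowLimiter:
    """Toy server-side limiter: at most `limit` requests per `window` seconds."""
    limit: int
    window: float
    _window_start: float = 0.0
    _used: int = 0

    def allow(self, now: float) -> bool:
        # Start a fresh window once the current one has elapsed.
        if now - self._window_start >= self.window:
            self._window_start = now
            self._used = 0
        if self._used < self.limit:
            self._used += 1
            return True
        return False  # a real server would answer 429 here


def run_sync(limiter, start, requests, seconds_per_request):
    """Issue requests strictly one at a time (concurrency = 1).

    Returns (ok, throttled, end_time).
    """
    ok = throttled = 0
    now = start
    for _ in range(requests):
        if limiter.allow(now):
            ok += 1
        else:
            throttled += 1
        now += seconds_per_request
    return ok, throttled, now


# Hypothetical quota and request counts, chosen only to show the effect.
limiter = FixedWindowLimiter(limit=1000, window=300.0)
first = run_sync(limiter, start=0.0, requests=900, seconds_per_request=0.1)
second = run_sync(limiter, start=first[2], requests=900, seconds_per_request=0.1)

print("first sync:  ok=%d throttled=%d" % first[:2])
print("second sync: ok=%d throttled=%d" % second[:2])
```

In this model the first sync finishes cleanly, while the second one is throttled for most of its requests even though no two requests ever overlap. If the real limit behaves similarly, spacing syncs apart would matter more than lowering concurrency.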