How to reduce CloudQuery sync time for multiple AWS accounts

We have started deploying this solution in our production environment, and the sync is taking a long time.

Background: We have 300+ AWS accounts in this organization.

I have reduced the set of tables I sync. Currently, I am following the CloudQuery Policies documentation to sync the data, and I only include the tables referenced on the Policies page.

You can reference my GitHub for the tables I target: CloudQuery Best Practices.

When the command

cloudquery sync

starts, it runs for more than 12 hours for a single sync, and I can’t finish all of the accounts within 2 days. It is currently still running, so I can’t give the final time yet.

Is there any way to reduce the total sync time?

  • You could use a larger concurrency value, but would need a much bigger instance to run CloudQuery on.
  • Disable unused regions by including only the ones you know you have data for. Obviously, this doesn’t really work from a security standpoint, as ideally you’d want to know about a single rogue resource in an obscure region.
  • Another way would be to separate your accounts either logically or otherwise (group by even-odd IDs, or IDs starting or ending with a specific digit) and then run sync concurrently on separate machines (ideally managed by a central CI of some sort).
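
For illustration, a rough sketch of that split: each machine gets its own source spec covering a subset of accounts, all writing to the same destination. The account IDs, role names, plugin version, and the grouping itself below are just placeholders.

# Machine A — aws_group_a.yml (placeholder accounts)
kind: source
spec:
  name: aws-group-a
  path: cloudquery/aws
  version: "v22.12.0"   # pin whichever version you actually use
  destinations: ["postgresql"]
  tables: ["*"]
  accounts:
    - id: "111111111111"
      role_arn: "arn:aws:iam::111111111111:role/cloudquery-readonly"
    - id: "222222222222"
      role_arn: "arn:aws:iam::222222222222:role/cloudquery-readonly"

# Machine B — aws_group_b.yml (placeholder accounts)
kind: source
spec:
  name: aws-group-b
  path: cloudquery/aws
  version: "v22.12.0"
  destinations: ["postgresql"]
  tables: ["*"]
  accounts:
    - id: "333333333333"
      role_arn: "arn:aws:iam::333333333333:role/cloudquery-readonly"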

There are some tables you might want to skip by default, as they are known to be slow due to the APIs or the complexity/duplication… I’d suggest you update skip_tables to exclude these: skip_tables documentation.
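
For reference, skip_tables is a top-level field on the source spec. The table names below are only illustrative examples, not the official recommended list; check the skip_tables documentation for the tables that are actually recommended to skip.

kind: source
spec:
  name: aws
  path: cloudquery/aws
  version: "v22.12.0"   # example version
  destinations: ["postgresql"]
  tables: ["*"]
  # Illustrative examples only — see the skip_tables documentation
  # for the currently recommended list of slow or low-value tables.
  skip_tables:
    - aws_ec2_vpc_endpoint_services
    - aws_rds_engine_versions
    - aws_docdb_engine_versions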

Also, you should upgrade the PostgreSQL destination to the latest version, or at least 6.x, which has significant speed improvements. v6.0.6 is currently the latest version of the PostgreSQL destination plugin.
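
A minimal sketch of the upgraded destination spec (reading the connection string from an environment variable is just one option):

kind: destination
spec:
  name: postgresql
  path: cloudquery/postgresql
  version: "v6.0.6"
  write_mode: overwrite-delete-stale
  spec:
    # Connection string supplied via environment variable expansion
    connection_string: ${PG_CONNECTION_STRING}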

Hope these tips help you out. If you run into more questions or need more help, just let us know :slightly_smiling_face:

Oh, another point would be that you’re fetching the same table multiple times because they’re included in all configs. CloudQuery processes each source config separately, so the ‘sync scheduler’ for each config wouldn’t know about the previous one. If you include all the necessary tables in a single list (maybe combine the configs if possible), that should cut down the time because you wouldn’t be fetching the same table multiple times.
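
As a rough sketch, a single combined source spec could look like this; the table list below is only a placeholder for whatever your policies actually need.

kind: source
spec:
  name: aws
  path: cloudquery/aws
  version: "v22.12.0"
  destinations: ["postgresql"]
  regions: ["ap-southeast-2"]
  # One de-duplicated list covering every policy you run,
  # so each table is fetched only once per sync.
  tables:
    - aws_iam_users
    - aws_ec2_instances
    - aws_s3_buckets
    - aws_securityhub_findings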

I would also suggest using the latest version of the AWS plugin (v22.12.0)… It greatly reduces the resources required to run a sync. Also, the time to sync aws_ec2_images has been significantly reduced for accounts with a large number of images.

@kemal

Please check my repo for your reference.

(1) You could use a larger concurrency value, but would need a much bigger instance to run CloudQuery on.

I have adjusted the instance size to medium, and the concurrency is the default value, which is 10K.

Do you mean that increasing the EC2 instance size, for example to large or xlarge, would help? In fact, CPU and memory on the instance aren’t a bottleneck while the cloudquery sync command is running.
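
For reference, concurrency is a top-level field on the source spec; the value below is only an example to experiment with, not a recommendation.

kind: source
spec:
  name: aws
  path: cloudquery/aws
  version: "v22.12.0"
  destinations: ["postgresql"]
  tables: ["*"]
  # Top-level source concurrency; raise it gradually and watch
  # CPU, memory, and AWS API throttling before going higher.
  concurrency: 20000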

(2) Disable unused regions by including only the ones you know you have data for. Obviously, that doesn’t really work from a security standpoint, as ideally, you’d want to know about a single rogue resource in an obscure region.

If you check my repo, I have limited the region to Sydney, Australia (ap-southeast-2) only. It still takes a long time to sync.

(3) Another way would be to separate your accounts either logically or otherwise (group by even-odd IDs, or IDs starting or ending with a specific digit) and then run sync concurrently on separate machines (ideally managed by a central CI of some sort).

If you check my repo, I sync from the Organization root. How would you manage 300+ account IDs by even and odd IDs, especially when new accounts are added later and others are suspended?

That’s not convenient at all.

I have discussed this in my blog about CloudQuery Best Practices for AWS, please also take a look: CloudQuery Best Practices for AWS

(4) There are some tables you might want to skip by default.

If you check my repo, I don’t sync all tables; I only sync the tables picked up from the CloudQuery Policies pages.

If I enable all tables, even for 10 accounts, it takes forever. Thanks for the advice, I will test again with the latest version.

Is there any way to always use the latest version, rather than hardcoding the version in the source file?

Here’s a doc which explains our reasoning behind pinning versions, which also links to a how-to that describes how to automatically use the latest version every time.

The slowest tables are Security Hub related (Security Hub findings). That table takes 50+ hours to finish. :eyes:

My question is: could we sync only the findings that are less than 1 year old?

Secondly, I found that the resource count in the Grafana AWS Asset Inventory dashboard reached 659,754 and stopped increasing, even though I got 2M+ resources when syncing the Security Hub findings table.

Sync completed with errors, see logs for details. Resources: 21858342, Errors: 1, Warnings: 1, Time: 59h22m30s

There’s a way to add additional options/filters for aws_securityhub_findings using the table_options feature (which is currently deprecated, BTW… it might be a while before it gets completely removed, though). The docs for it are under the AWS configuration docs; you’d need to set something like:

table_options:
  aws_securityhub_findings:
    get_findings:
      - <GetFindings options here>

And the list of options is located in AWS’s own docs here. You could, for instance, add a ‘CreatedAt’ filter to filter out anything before a certain date.

Some AWS resources don’t have an arn, so they may not be included in the AWS Asset Inventory (I need to check that query, but if it’s based on the aws_resources view, that’s how it works…). Also, some aren’t really AWS ‘assets’ to be fair (off the top of my head: aws_cloudformation_stack_set_operations, aws_apprunner_operations, aws_glue_job_runs, etc…). We’ve even had a pull request to incorporate anything with an id column into that view, which is still open, but due to these types of resources, it didn’t go anywhere.

Hope this helps. Please reach out if you have more questions :slightly_smiling_face:

I’d like to set the time range (in the last two weeks) when getting Security Hub findings.

table_options:
  aws_securityhub_findings:
    get_findings:
      - <GetFindings options here>

What should I put?

Going through the AWS API Reference document, the AWS API shape is:

      "CreatedAt": [ 
         { 
            "DateRange": { 
               "Unit": "string",
               "Value": number
            },
            "End": "string",
            "Start": "string"
         }
      ],

How can I convert it into CloudQuery configuration?

Would this work?

  spec:
    regions:
      - ap-southeast-2
    table_options:
      aws_securityhub_findings:
        get_findings:
          - filters:
              created_at:
                - date_range:
                    unit: DAYS
                    value: 14
                  # end: string
                  # start: string

Are the keys in the API case-sensitive?

The original key is Filters.

Is there any specific guidance on this?

I don’t think they’re case-sensitive, but it should be Filters, since it’s from the original AWS struct, a field inside GetFindingsInput.

Seems my code works; it does reduce the sync time.

With yours, the sync took about the same time to finish as usual.

So you used lowercase filters?

Yes, lowercase, and I changed the key names to snake_case. I ran three tests:

Without a filter: 1h26m.

With your code: 1h27m (similar time, no errors reported).

With mine: 37m.

You’re right, the code checks out as well. I had forgotten about how we convert from one casing to another for the user.