How to reduce CloudQuery sync time for multiple AWS accounts

We have started deploying this solution in our production environment, and the sync is taking a long time.

Background: We have 300+ AWS accounts in this organization.

I have reduced the set of tables I sync. Currently, I am following the CloudQuery Policies documentation to sync the data, and I only include the tables referenced on the Policies page.

You can reference my GitHub for the tables I target: CloudQuery Best Practices.

When the command

cloudquery sync

starts, it runs for more than 12 hours for a single sync, and I can’t finish all of the accounts within 2 days. It is currently still running, so I can’t give the final time yet.

Is there any way to reduce the total sync time?

  • You could use a larger concurrency value, but would need a much bigger instance to run CloudQuery on.
  • Disable unused regions by including only the ones you know you have data for. Obviously, this doesn’t really work from a security standpoint, as ideally you’d want to know about a single rogue resource in an obscure region.
  • Another way would be to separate your accounts either logically or otherwise (group by even-odd IDs, or IDs starting or ending with a specific digit) and then run sync concurrently on separate machines (ideally managed by a central CI of some sort).
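
For illustration, a rough sketch of that split: each machine gets its own source spec covering a subset of accounts, all writing to the same destination. The account IDs, role names, plugin version, and the grouping itself below are just placeholders.

# Machine A — aws_group_a.yml (placeholder accounts)
kind: source
spec:
  name: aws-group-a
  path: cloudquery/aws
  version: "v22.12.0"   # pin whichever version you actually use
  destinations: ["postgresql"]
  tables: ["*"]
  accounts:
    - id: "111111111111"
      role_arn: "arn:aws:iam::111111111111:role/cloudquery-readonly"
    - id: "222222222222"
      role_arn: "arn:aws:iam::222222222222:role/cloudquery-readonly"

# Machine B — aws_group_b.yml (placeholder accounts)
kind: source
spec:
  name: aws-group-b
  path: cloudquery/aws
  version: "v22.12.0"
  destinations: ["postgresql"]
  tables: ["*"]
  accounts:
    - id: "333333333333"
      role_arn: "arn:aws:iam::333333333333:role/cloudquery-readonly"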

There are some tables you might want to skip by default, as they are known to be slow due to the APIs or the complexity/duplication… I’d suggest you update skip_tables to exclude these: skip_tables documentation.
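
For reference, skip_tables is a top-level field on the source spec. The table names below are only illustrative examples, not the official recommended list; check the skip_tables documentation for the tables that are actually recommended to skip.

kind: source
spec:
  name: aws
  path: cloudquery/aws
  version: "v22.12.0"   # example version
  destinations: ["postgresql"]
  tables: ["*"]
  # Illustrative examples only — see the skip_tables documentation
  # for the currently recommended list of slow or low-value tables.
  skip_tables:
    - aws_ec2_vpc_endpoint_services
    - aws_rds_engine_versions
    - aws_docdb_engine_versions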

Also, you should upgrade the PostgreSQL destination to the latest version, or at least 6.x, which has significant speed improvements. v6.0.6 is currently the latest version of the PostgreSQL destination plugin.
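
A minimal sketch of the upgraded destination spec (reading the connection string from an environment variable is just one option):

kind: destination
spec:
  name: postgresql
  path: cloudquery/postgresql
  version: "v6.0.6"
  write_mode: overwrite-delete-stale
  spec:
    # Connection string supplied via environment variable expansion
    connection_string: ${PG_CONNECTION_STRING}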

Hope these tips help you out. If you run into more questions or need more help, just let us know :slightly_smiling_face:

Oh, another point would be that you’re fetching the same table multiple times because they’re included in all configs. CloudQuery processes each source config separately, so the ‘sync scheduler’ for each config wouldn’t know about the previous one. If you include all the necessary tables in a single list (maybe combine the configs if possible), that should cut down the time because you wouldn’t be fetching the same table multiple times.
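
As a rough sketch, a single combined source spec could look like this; the table list below is only a placeholder for whatever your policies actually need.

kind: source
spec:
  name: aws
  path: cloudquery/aws
  version: "v22.12.0"
  destinations: ["postgresql"]
  regions: ["ap-southeast-2"]
  # One de-duplicated list covering every policy you run,
  # so each table is fetched only once per sync.
  tables:
    - aws_iam_users
    - aws_ec2_instances
    - aws_s3_buckets
    - aws_securityhub_findings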

I would also suggest using the latest version of the AWS plugin (v22.12.0)… It greatly reduces the resources required to run a sync. Also, the time to sync aws_ec2_images has been significantly reduced for accounts with a large number of images.

@kemal

Please check my repo for your reference.

(1) You could use a larger concurrency value, but would need a much bigger instance to run CloudQuery on.

I have adjusted the instance size to medium, and the concurrency is the default value, which is 10K.

Do you mean that increasing the EC2 instance size, for example to large or xlarge, would help? In fact, CPU and memory on the instance aren’t a bottleneck while the cloudquery sync command is running.
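
For reference, concurrency is a top-level field on the source spec; the value below is only an example to experiment with, not a recommendation.

kind: source
spec:
  name: aws
  path: cloudquery/aws
  version: "v22.12.0"
  destinations: ["postgresql"]
  tables: ["*"]
  # Top-level source concurrency; raise it gradually and watch
  # CPU, memory, and AWS API throttling before going higher.
  concurrency: 20000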

(2) Disable unused regions by including only the ones you know you have data for. Obviously, that doesn’t really work from a security standpoint, as ideally, you’d want to know about a single rogue resource in an obscure region.

If you check my repo, I have limited the region to Sydney, Australia (ap-southeast-2) only. It still takes a long time to sync.

(3) Another way would be to separate your accounts either logically or otherwise (group by even-odd IDs, or IDs starting or ending with a specific digit) and then run sync concurrently on separate machines (ideally managed by a central CI of some sort).

If you check my repo, I sync from the Organization root. How would you manage 300+ account IDs by even and odd IDs, especially when new accounts are added later and others are suspended?

That’s not convenient at all.

I have discussed this in my blog about CloudQuery Best Practices for AWS, please also take a look: CloudQuery Best Practices for AWS

(4) There are some tables you might want to skip by default.

If you check my repo, I don’t sync all tables; I only sync the tables picked up from the CloudQuery Policies pages.

If I enable all tables, even for 10 accounts, it takes forever. Thanks for the advice, I will test again with the latest version.

Is there any way to always use the latest version, rather than hardcoding the version in the source file?

Here’s a doc which explains our reasoning behind pinning versions, which also links to a how-to that describes how to automatically use the latest version every time.

The slowest tables are Security Hub related (Security Hub findings). That table takes 50+ hours to finish. :eyes:

My question is: could we sync only the findings that are less than 1 year old?

Secondly, I found that the resource count in the Grafana AWS Asset Inventory dashboard reached 659,754 and stopped increasing, even though I got 2M+ resources when syncing the Security Hub findings table.

Sync completed with errors, see logs for details. Resources: 21858342, Errors: 1, Warnings: 1, Time: 59h22m30s

There’s a way to add additional options/filters for aws_securityhub_findings using the table_options feature (which is currently deprecated, BTW… it might be a while before it gets completely removed, though). The docs for it are under the AWS configuration docs; you’d need to set something like:

table_options:
  aws_securityhub_findings:
    get_findings:
      - <GetFindings options here>

And the list of options is located in AWS’s own docs here. You could, for instance, add a ‘CreatedAt’ filter to filter out anything before a certain date.

Some AWS resources don’t have an arn, so they may not be included in the AWS Asset Inventory (I need to check that query, but if it’s based on the aws_resources view, that’s how it works…). Also, some aren’t really AWS ‘assets’ to be fair (off the top of my head: aws_cloudformation_stack_set_operations, aws_apprunner_operations, aws_glue_job_runs, etc…). We’ve even had a pull request to incorporate anything with an id column into that view, which is still open, but due to these types of resources, it didn’t go anywhere.

Hope this helps. Please reach out if you have more questions :slightly_smiling_face:

I’d like to set the time range (in the last two weeks) when getting Security Hub findings.

table_options:
  aws_securityhub_findings:
    get_findings:
      - <GetFindings options here>

What should I put?

Going through the AWS API Reference document, the AWS API shape is:

      "CreatedAt": [ 
         { 
            "DateRange": { 
               "Unit": "string",
               "Value": number
            },
            "End": "string",
            "Start": "string"
         }
      ],

How can I convert it into CloudQuery configuration?

Would this work?

  spec:
    regions:
      - ap-southeast-2
    table_options:
      aws_securityhub_findings:
        get_findings:
          - filters:
              created_at:
                - date_range:
                    unit: DAYS
                    value: 14
                  # end: string
                  # start: string

Are the keys in the API case-sensitive?

The original key is Filters.

Is there any specific guidance on this?

I don’t think they’re case-sensitive, but it should be Filters, since it’s from the original AWS struct, a field inside GetFindingsInput.

Seems my code works; it does reduce the sync time.

With yours, the sync took about the same time to finish as usual.

So you used lowercase filters?

Yes, lowercase, and I changed the key names to snake_case. I ran three tests:

Without a filter: 1h26m.

With your code: 1h27m (similar time, no errors reported).

With mine: 37m.

You’re right, the code checks out as well. I had forgotten about how we convert from one casing to another for the user.