Filtering tables before database dump in CloudQuery for large datasets

Hi, CloudQuery has evolved quite a bit since I last played around with it, so I'm wondering: is it now possible to filter tables before they get dumped into the database? My use case involves many millions of rows, but I'm only interested in a very small subset.

Hi @devoted-oryx,

There are a couple of ways to filter data prior to insertion in the destination. What tables are you interested in using?

Say AWS Security Hub.

I want to ingest only active findings for that specific service.

For aws_securityhub_findings, you can now override the parameters of the API call made to AWS, so you only sync the exact findings that you are interested in.

How could this be accomplished?

If you can, please share an example finding that you would like the filter to match. I can help create the config.

Let’s say findings generated by Config that are in the ACTIVE state.

I think I found what you were referring to: the CloudQuery AWS Configuration docs and the table options.

So basically, given that I want to sync just a subset, let’s say 10k rows out of 10M findings in total, would I be billed only for the 10k rows?

Exactly! You get both a performance and a cost benefit, since you only pull the data that is most impactful to your org. Here is an example of how to use the table_options functionality to grab active findings generated by Config:

kind: source
spec:
  name: aws_xxxxxx
  path: cloudquery/aws
  version: "v27.17.0"
  tables: ["aws_securityhub_findings"]
  destinations: ["postgresql"]
  spec:
    table_options:
      aws_securityhub_findings:
        # Each entry corresponds to a Security Hub GetFindings API call
        get_findings:
          - filters:
              # Only findings whose workflow status is ACTIVE
              WorkflowStatus:
                - Value: "ACTIVE"
                  Comparison: "EQUALS"
              # Only findings generated by AWS Config
              ProductFields:
                - Key: "aws/securityhub/ProductName"
                  Value: "Config"
                  Comparison: "EQUALS"

Would it be possible to filter to just the findings that have been updated within the last 7 days?

Looking at the GetFindings API, it looks like this should work:

kind: source
spec:
  name: aws_xxxxxx
  path: cloudquery/aws
  version: "v27.17.0"
  tables: ["aws_securityhub_findings"]
  destinations: ["postgresql"]
  spec:
    table_options:
      aws_securityhub_findings:
        get_findings:
          - filters:
              # Only findings updated within the last 7 days
              UpdatedAt:
                - DateRange:
                    Unit: DAYS
                    Value: 7
              WorkflowStatus:
                - Value: "ACTIVE"
                  Comparison: "EQUALS"
              ProductFields:
                - Key: "aws/securityhub/ProductName"
                  Value: "Config"
                  Comparison: "EQUALS"

That’s nice, thanks Ben!

You’re welcome! Let me know if you run into any issues with this or have other questions!