Issue with CloudQuery affecting multiple AWS RDS parameter tables

honest-snipe · April 17, 2024, 10:45pm

This problem affects the aws_rds_cluster_parameters (123M), aws_rds_cluster_parameter_group_parameters (93M), and aws_rds_db_parameter_group_db_parameters (31M) tables.

ben · April 17, 2024, 10:46pm

Would you mind sharing your configurations?
This will help us to debug the issue.

honest-snipe · April 17, 2024, 11:14pm

It is very simple, and the version is probably out of date.

ben · April 17, 2024, 11:19pm

That config is helpful! Thank you. What is the query you used to grab the data you shared?

honest-snipe · April 17, 2024, 11:20pm

Here are some interesting stats…
But I am just chasing the ones > 5M.

ben · April 17, 2024, 11:21pm

Can you share the config for the destination as well?

honest-snipe · April 17, 2024, 11:21pm

The destination is just PostgreSQL.

ben · April 17, 2024, 11:22pm

Are you specifying write_mode?
In your source config, you specify the name to be {{.Name}}. Is that value deterministic? Is it the same value each time you run a sync?

honest-snipe · April 17, 2024, 11:27pm

Here is what I found… this section:

syncPolicy:
    syncOptions:
      - ApplyOutOfSyncOnly=true
      - CreateNamespace=true

Everything else seems to be standard defaults. For the AWS plugin, we use the defaults. We also developed two custom plugins, and for those we have write_mode: append in the config file.

ben · April 17, 2024, 11:29pm

So the append write mode is designed to never delete any data.
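As a rough illustration of what that means for row counts, here is a toy model in Python using an in-memory SQLite table. The table name `params`, its columns other than `_cq_source_name` and `_cq_sync_time`, and the helper `sync_append` are all made up for this sketch; it is not the destination plugin's actual code, only the general idea of an insert-only write.

```python
import sqlite3


def sync_append(conn, source_name, sync_time, resources):
    """Toy 'append' write: every sync only INSERTs, nothing is updated or deleted."""
    conn.executemany(
        "INSERT INTO params (_cq_source_name, _cq_sync_time, name, value) "
        "VALUES (?, ?, ?, ?)",
        [(source_name, sync_time, name, value) for name, value in resources],
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE params ("
    "_cq_source_name TEXT, _cq_sync_time TEXT, name TEXT, value TEXT)"
)

# The same two (hypothetical) parameters are seen by two separate syncs.
resources = [("max_connections", "100"), ("timezone", "UTC")]
sync_append(conn, "aws", "2023-06-01T00:00:00", resources)
sync_append(conn, "aws", "2024-04-17T00:00:00", resources)

row_count = conn.execute("SELECT COUNT(*) FROM params").fetchone()[0]
print(row_count)  # 4: both syncs' rows are kept
```

With an insert-only write, the table grows by the full result set on every run, so old sync times staying in the table is expected in that mode.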

honest-snipe · April 17, 2024, 11:30pm

Unless the AWS plugin inherits this somehow.

ben · April 17, 2024, 11:31pm

How do you define the {{.Name}} variable that gets injected into the spec name? If that is unique every single run, then no data will ever be deleted.

You can see more information about that here: CloudQuery Documentation.

honest-snipe · April 17, 2024, 11:32pm

My reading of the code is that the write_mode we use in our custom plugins is not inherited by the standard AWS plugin. We dynamically create the AWS plugin name for each table and run each extractor as a cron job (K8s).

ben · April 17, 2024, 11:38pm

When you say it is dynamically generated, does that mean that each time your cron job runs it is a different value?

honest-snipe · April 17, 2024, 11:39pm

No, only at deployment time, to set a different cron schedule for each extractor so they do not swarm the AWS APIs.

Question: Each time the extractor runs (based on the AWS source plugin), will it repopulate the entire table? What is the default behavior (trim and write)? You suspect the data is being appended, right? The sync time seems to indicate that the data is outdated (2023?).

ben · April 18, 2024, 12:06am

The behavior is that it will do an upsert and then a delete for any record where the _cq_sync_time is outdated and the source name is the same. Can you share the following:

The result of this query:

SELECT COUNT(DISTINCT "_cq_source_name") FROM aws_rds_cluster_parameters;

The schema of your aws_rds_cluster_parameters table.
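To make the upsert-then-delete behavior concrete, here is a small toy model in Python with an in-memory SQLite table. Everything except the `_cq_source_name` and `_cq_sync_time` column names is an assumption for illustration (the `params` table, the `id` key, the helper names, the source names); it is a sketch of the idea, not the plugin's implementation. It also shows why a source name that changes on every run would leave stale rows behind.

```python
import sqlite3


def sync_overwrite_delete_stale(conn, source_name, sync_time, resources):
    """Toy model: upsert the current rows, then delete rows from the same
    source name whose sync time is older than this sync."""
    conn.executemany(
        """
        INSERT INTO params (id, _cq_source_name, _cq_sync_time, value)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            _cq_source_name = excluded._cq_source_name,
            _cq_sync_time = excluded._cq_sync_time,
            value = excluded.value
        """,
        [(rid, source_name, sync_time, value) for rid, value in resources],
    )
    # The cleanup is scoped to the source name of the current sync.
    conn.execute(
        "DELETE FROM params WHERE _cq_source_name = ? AND _cq_sync_time < ?",
        (source_name, sync_time),
    )


def rows_after_two_syncs(first_name, second_name):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE params (id TEXT PRIMARY KEY, "
        "_cq_source_name TEXT, _cq_sync_time TEXT, value TEXT)"
    )
    # First sync sees three resources; by the second sync, "c" no longer exists.
    sync_overwrite_delete_stale(
        conn, first_name, "2023-06-01T00:00:00", [("a", "1"), ("b", "2"), ("c", "3")]
    )
    sync_overwrite_delete_stale(
        conn, second_name, "2024-04-17T00:00:00", [("a", "1"), ("b", "2")]
    )
    return conn.execute("SELECT COUNT(*) FROM params").fetchone()[0]


stable_rows = rows_after_two_syncs("aws-rds", "aws-rds")
unique_rows = rows_after_two_syncs("aws-rds-run-1", "aws-rds-run-2")
print(stable_rows)  # 2: the stale row "c" is removed
print(unique_rows)  # 3: "c" survives because its source name never matches again
```

That is why the number of distinct `_cq_source_name` values matters: if it is 1 and old sync times are still present, a changing source name is unlikely to be the cause and the delete step itself is worth checking in the logs.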

honest-snipe · April 18, 2024, 12:26am

It is taking a long, long time. I ran the following query:

SELECT COUNT(*) 
FROM aws_rds_cluster_parameters 
WHERE _cq_sync_time < TO_DATE('2024-01-01', 'YYYY-MM-DD') 
LIMIT 10;

This was to prove that there are many records with an outdated timestamp.

The primary key is defined as:

PRIMARY KEY, BTREE (_cq_id);

ben · April 18, 2024, 2:40am

Can you share a copy of the redacted logs?
Specifically, I am looking for lines that contain grpc.method.

honest-snipe · April 18, 2024, 7:46pm

Also, I have the result of the query you requested:

postgres=> select COUNT(DISTINCT "_cq_source_name") from aws_rds_cluster_parameters;
 count
-------
     1
(1 row)

ben · April 18, 2024, 7:48pm

OK! Thank you for that information!

Are there any other logs? Because I am not seeing the events that correlate to the end of the sync. I would assume that the issues that you are facing have been resolved by now, as we are on v8 of the Postgres plugin and v26 of the AWS plugin. If you upgrade to the latest versions and still see issues persisting, we would be more than happy to dive in to better understand the issue and find a fix for it.
