This problem affects the aws_rds_cluster_parameters
(123M), aws_rds_cluster_parameter_group_parameters
(93M), and aws_rds_db_parameter_group_db_parameters
(31M) tables.
Would you mind sharing your configurations?
This will help us to debug the issue.
It is a very simple and the version is probably out of date.
That config is helpful! Thank you. What is the query you used to grab the data you shared?
Here are some interesting stats…
But I am just chasing the ones > 5M.
Can you share the config for the destination as well?
The destination is just PostgreSQL.
Are you specifying write_mode
?
In your source config, you specify the name to be {{.Name}}
. Is that value a deterministic value? Is it the same value each time you run a sync?
Here is what I found… this section:
syncPolicy:
syncOptions:
- ApplyOutOfSyncOnly=true
- CreateNamespace=true
Everything else seems standard = defaults. For the AWS plugin, we use defaults. We also developed 2 custom plugins, and for those, we have write_mode: append
in the config file.
So the append
write mode is designed to never delete any data.
Unless the AWS plugin inherits this somehow.
How do you define the {{.Name}}
variable that gets injected into the spec name? If that is unique every single run, then no data will ever be deleted.
You can see more information about that here: CloudQuery Documentation.
My reading of the code is that write_mode
that we use in our custom plugin is not inherited by the standard AWS plugin. We dynamically create the AWS plugin name for each table and run each extractor as a cron (K8s).
When you say it is dynamically generated, does that mean that each time your cron job runs it is a different value?
No, only at deployment time to set different crontime
for each extractor so they do not swarm the AWS APIs.
Question: Each time the extractor runs (based on AWS source plugin), will it repopulate the entire table? What is the default behavior (trim
and write
)? You are suspecting the data is appended, right? The sync time seems to indicate that the data is outdated (2023?).
The behavior is that it will do an upsert and then a delete for any record where the _cq_sync_time
is outdated and the source name is the same. Can you share the following:
-
The result of this query:
SELECT COUNT(DISTINCT "_cq_source_name") FROM aws_rds_cluster_parameters;
-
The schema of your
aws_rds_cluster_parameters
table.
It is taking a long, long time. I ran the following query:
SELECT COUNT(*)
FROM aws_rds_cluster_parameters
WHERE _cq_sync_time < TO_DATE('2024-01-01', 'YYYY-MM-DD')
LIMIT 10;
This was to prove that there are many records with an outdated timestamp.
The primary key is defined as:
PRIMARY KEY, BTREE (_cq_id);
Can you share a copy of the redacted logs?
Specifically, I am looking for lines that contain grpc.method
.
also I have got the result of the query you requested:
postgres=> select COUNT(DISTINCT "_cq_source_name") from aws_rds_cluster_parameters;
count
-------
1
(1 row)
OK! Thank you for that information!
Are there any other logs? Because I am not seeing the events that correlate to the end of the sync. I would assume that the issues that you are facing have been resolved by now, as we are on v8 of the Postgres plugin and v26 of the AWS plugin. If you upgrade to the latest versions and still see issues persisting, we would be more than happy to dive in to better understand the issue and find a fix for it.