Hey, how do you guys handle duplicates in larger-scale deployments of CloudQuery? For example, overwrite-delete-stale works when running CloudQuery in one container, but running 200 containers at once seems to leave a lot of orphaned resources behind.
Does the deletion of old records happen after the sync? Would it be better to have a PostgreSQL trigger that automatically deletes stale rows based on NOW() - _cq_sync_time?
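I'm picturing something that runs on a schedule, roughly like this (just a sketch, not a CloudQuery feature; the table name and interval are made up):

```sql
-- Hypothetical scheduled cleanup (not a built-in CloudQuery feature).
-- Table name and interval are placeholders; every synced table would
-- need an equivalent statement.
DELETE FROM aws_ec2_instances
WHERE _cq_sync_time < NOW() - INTERVAL '24 hours';
```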
overwrite-delete-stale is designed to work when running in parallel containers. The key is to ensure that each config in each container uses a unique name. More details can be found here.
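For example, giving each container its own source name in its spec (a rough sketch; the version and tables here are just placeholders):

```yaml
# Sketch of one container's source spec: the name is what ends up in
# _cq_source_name, so it must be unique per container for delete-stale
# to only touch that container's rows.
kind: source
spec:
  name: "aws-container-001"   # unique per container
  path: "cloudquery/aws"
  version: "v22.0.0"          # placeholder
  tables: ["*"]
  destinations: ["postgresql"]
```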
I currently have each _cq_source_name set to a unique name. The issue is that if a sync fails, the rows are not deleted. It appears as though the delete only occurs after a successful sync. Is that the case? In our case, these rows sometimes end up orphaned.
Yes, deletion only occurs after a successful sync. Only a panic or some other very serious error should result in the sync failing. Which plugin are you using that doesn’t reliably sync successfully?
Also, the next time the sync runs, it should clean up all of the stranded records, so nothing should stay orphaned permanently.
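Conceptually, the cleanup at the end of a successful sync boils down to something like this (simplified; the table name and sync-time value are only illustrative):

```sql
-- Simplified view of the post-sync cleanup: rows belonging to this source
-- that were not re-written during the current sync (i.e. have an older
-- _cq_sync_time) get deleted.
DELETE FROM aws_s3_buckets
WHERE _cq_source_name = 'aws-container-001'
  AND _cq_sync_time < :current_sync_time;
```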
The AWS plugin is the one with the problem. Now, it has only occurred twice in ~300 syncs, but the issue is that on the next successful run, the resources are not removed.