Sync runs duplicating data temporarily in CloudQuery need a solution

comic-pup · March 21, 2024, 6:51am

Hey, every time a sync runs, it doubles the data temporarily. Is there a good solution to prevent this from happening? Maybe Source → Postgres (Schema 1) then Postgres (Schema 1) → Postgres (Final Schema)?

herman · March 21, 2024, 9:09am

Hi @comic-pup,

Yes, using different schemas is one good way of doing it. Some users also use views to select only data from a particular sync.
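To make the view idea concrete, here is a minimal sketch of one way it could work. It is not an official CloudQuery recipe: it uses Python's built-in sqlite3 in place of Postgres so it runs anywhere, and the table `example_resources`, the pointer table `current_sync`, and the view `example_resources_current` are all hypothetical names. The only real CloudQuery columns assumed are `_cq_id` and `_cq_sync_time`. The view exposes only rows from the sync recorded in the pointer table, and the pointer is moved once a new sync has finished.

```python
import sqlite3

# In-memory SQLite stands in for the Postgres destination (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical destination table; only _cq_id and _cq_sync_time are
-- real CloudQuery column names.
CREATE TABLE example_resources (
    _cq_id TEXT PRIMARY KEY,
    _cq_sync_time TEXT NOT NULL,
    name TEXT NOT NULL
);
-- Hypothetical single-row pointer to the last *completed* sync.
CREATE TABLE current_sync (sync_time TEXT NOT NULL);
-- Readers query this view instead of the raw table.
CREATE VIEW example_resources_current AS
SELECT r.*
FROM example_resources AS r
JOIN current_sync AS c ON r._cq_sync_time = c.sync_time;
""")


def visible_names(connection):
    """Return the resource names currently exposed through the view."""
    query = "SELECT name FROM example_resources_current ORDER BY name"
    return [row[0] for row in connection.execute(query)]


OLD_SYNC = "2024-03-20T06:00:00Z"
NEW_SYNC = "2024-03-21T06:00:00Z"
INSERT = "INSERT INTO example_resources VALUES (?, ?, ?)"

# State after the previous sync: two rows, pointer on the old sync.
conn.executemany(INSERT, [("old-1", OLD_SYNC, "a"), ("old-2", OLD_SYNC, "b")])
conn.execute("INSERT INTO current_sync VALUES (?)", (OLD_SYNC,))

# A new sync writes its rows while the old ones are still present.
conn.executemany(
    INSERT,
    [("new-1", NEW_SYNC, "a"), ("new-2", NEW_SYNC, "b"), ("new-3", NEW_SYNC, "c")],
)
raw_count_during = conn.execute(
    "SELECT COUNT(*) FROM example_resources"
).fetchone()[0]
names_during = visible_names(conn)  # the view still shows only the old sync

# Once the sync has finished, move the pointer in a single statement.
conn.execute("UPDATE current_sync SET sync_time = ?", (NEW_SYNC,))
names_after = visible_names(conn)
```

In this sketch the raw table briefly holds rows from both syncs, but anything reading through the view sees either the old sync or the new one, never a mix. Something outside CloudQuery (for example, a step that runs after the sync command exits) would have to update the pointer.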

That said, assuming you’re using the overwrite-delete-stale write mode, data isn’t really doubled as such. New resources appear before the stale resources are deleted, and resources with the same PK are replaced in place. If the table only has a _cq_id primary key, then you may see temporary doubling, because nothing matches the existing rows and every resource is written as a new row.
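The difference between the two cases can be illustrated with a small simulation. This is a simplified model, not CloudQuery's actual implementation: the `sync` function, the `arn` key, and the integer sync times are made up for illustration, and it assumes `_cq_id` is a fresh random value on every sync. It upserts each resource by primary key, then deletes rows left over from earlier syncs, and records the largest row count seen along the way.

```python
import uuid


def sync(table, resources, sync_time, pk):
    """Simplified overwrite-delete-stale: upsert by key, then drop stale rows.

    `table` maps primary-key value -> row. Returns the peak row count
    observed while the sync was in progress.
    """
    peak = len(table)
    for resource in resources:
        # Assumption: _cq_id is a new random UUID on every sync.
        row = dict(resource, _cq_sync_time=sync_time, _cq_id=str(uuid.uuid4()))
        table[row[pk]] = row  # same key replaces in place, new key adds a row
        peak = max(peak, len(table))
    # Delete-stale step: remove rows not touched by this sync.
    stale_keys = [k for k, r in table.items() if r["_cq_sync_time"] < sync_time]
    for key in stale_keys:
        del table[key]
    return peak


resources = [{"arn": "arn:a"}, {"arn": "arn:b"}]

# Case 1: the table has a real primary key (hypothetical "arn" column).
with_pk = {}
sync(with_pk, resources, sync_time=1, pk="arn")
peak_with_pk = sync(with_pk, resources, sync_time=2, pk="arn")

# Case 2: the only primary key is _cq_id.
cq_id_only = {}
sync(cq_id_only, resources, sync_time=1, pk="_cq_id")
peak_cq_id_only = sync(cq_id_only, resources, sync_time=2, pk="_cq_id")
```

With a real primary key the second sync never exceeds two rows, while the `_cq_id`-only table peaks at four before the stale rows are removed. Both tables end up with two rows once the sync completes.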

I think this is a problem we can still improve on, so if you have any ideas, let us know. Maybe we can raise an issue to look into this more.

I’ve raised an issue on GitHub here with a suggested feature that I think would solve this problem in a better way: https://github.com/cloudquery/cloudquery/issues/17291
