Need help partitioning S3 parquet files by Postgres column value

organic-goblin · November 2, 2023, 2:53pm

Hey! I’m trying to load a large Postgres table to S3 as Parquet. It’s a batch job, so no CDC. I can see that the file/S3 destination supports certain path placeholders for partitioning, such as {{TABLE}} and the current year, month, day, etc.

However, I’d like to partition the resulting Parquet files by a value from a Postgres column, for example, {{USERID}}. Is there any way I can achieve that?

herman · November 2, 2023, 3:29pm

Hi @organic-goblin!

I don’t think that’s possible right now. Feel free to raise an issue if you think this is something you’ll need again in the future; this sounds like something that would be useful to others as well.

If you need a solution right now, I think the only way would be a pre-transformation step. For example, first do a transformation in the database so every user ID gets its own table, then sync to S3. I don’t know how feasible this would be in your case!
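To make the pre-transformation idea concrete, here is a minimal sketch in Python that only generates the SQL for one table per user ID. The table name `events`, the column `user_id`, and the helper names are hypothetical, and it assumes integer IDs; adapt it to your schema before running anything against a real database.

```python
def quote_ident(name: str) -> str:
    """Quote a Postgres identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def per_user_table_statements(source_table: str, column: str, user_ids: list[int]) -> list[str]:
    """Build one CREATE TABLE ... AS SELECT statement per user ID.

    Hypothetical helper: it returns SQL strings and does not touch a database.
    """
    statements = []
    for user_id in user_ids:
        uid = int(user_id)  # assumption: IDs are integers, so they are safe to inline
        target = quote_ident(f"{source_table}_user_{uid}")
        statements.append(
            f"CREATE TABLE {target} AS "
            f"SELECT * FROM {quote_ident(source_table)} "
            f"WHERE {quote_ident(column)} = {uid};"
        )
    return statements


# "events" and "user_id" are placeholder names for illustration only.
for stmt in per_user_table_statements("events", "user_id", [7, 42]):
    print(stmt)
```

Each generated table could then be synced separately, so the {{TABLE}} placeholder in the destination path would effectively split the files per user. With a very large number of users this creates a very large number of tables, which may not be practical.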

Alternatively, you can do a post-transformation step with something like Glue to achieve the same thing.
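As a rough illustration of what such a post-transformation step would produce, the sketch below groups already-loaded rows by a column value and assigns each group a Hive-style `column=value` path, which is the layout tools such as Spark and Glue write when partitioning by a column. It uses plain Python dictionaries instead of real Parquet files; the function name, the prefix `events`, and the file name `data.parquet` are assumptions for illustration.

```python
from collections import defaultdict


def partition_keys(rows: list[dict], column: str, prefix: str) -> dict[str, list[dict]]:
    """Group rows by the value of `column` under Hive-style object keys.

    Hypothetical helper: it only plans the layout and does not write to S3.
    """
    groups = defaultdict(list)
    for row in rows:
        key = f"{prefix}/{column}={row[column]}/data.parquet"
        groups[key].append(row)
    return dict(groups)


# Placeholder rows standing in for the synced table contents.
rows = [
    {"user_id": 7, "action": "login"},
    {"user_id": 42, "action": "upload"},
    {"user_id": 7, "action": "logout"},
]

for key, group in partition_keys(rows, "user_id", "events").items():
    print(key, len(group))
```

A real job would read the synced Parquet files and write one file (or set of files) per key rather than holding everything in memory.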

organic-goblin · November 3, 2023, 11:13am

Thanks for the detailed answer, @herman. Yes, I think this would be a common request for anyone doing large batch jobs.
