Help with incremental syncing of CloudTrail management and data events in CloudQuery

Hello :wave:

I am looking to do some incremental syncing of CloudTrail management and data events. For management events, it appears the CloudTrail events table supports incremental syncs, but do I just point it at a metadata table where I want that state stored via backend_options, or do I need to use table_options for this?

For the data events, these are in S3 and I have already transformed them to Parquet via a Glue ETL process. Does the S3 plugin support incremental syncs, or does it automatically pull only new files, meaning I should rely on partitioning? Ideally, I could get the raw output of both to be very similar, but the S3 plugin seems very reliant on the folder structure/partitioning scheme. Then again, maybe that’s desirable?

This is my first time trying incremental syncs with CloudQuery, so please bear with me.

Hi,

You don’t need to point the plugin to any metadata in particular. In our docs for the aws_cloudtrail_events table (link), you will find that it supports incremental syncing based on the event_time column. This is done automatically. You just need to pass a state backend (e.g., sqlite) where that incremental state will be stored.
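For reference, a minimal spec would look roughly like this (the plugin versions, state table name, and file path are placeholders; check the docs for the current values):

```yaml
kind: source
spec:
  name: aws
  path: cloudquery/aws
  version: "vX.Y.Z" # placeholder: use the latest published version
  tables: ["aws_cloudtrail_events"]
  destinations: ["s3"] # whichever destination(s) you sync to
  # Incremental state (the last synced event_time) is stored in this table
  # in the sqlite backend and picked up automatically on the next run.
  backend_options:
    table_name: "cq_state_aws"
    connection: "@@plugins.sqlite.connection"
---
kind: destination
spec:
  name: sqlite
  path: cloudquery/sqlite
  version: "vX.Y.Z" # placeholder
  spec:
    connection_string: ./cq_state.db
```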

For the S3 plugin, have a look at the no_rotate option in this link. Leaving it set to false means a new file is written for each table on every sync, which should support your need for incremental syncing. The default value is already false.
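For example (bucket, region, and path below are placeholders):

```yaml
kind: destination
spec:
  name: s3
  path: cloudquery/s3
  version: "vX.Y.Z" # placeholder: use the latest published version
  spec:
    bucket: "my-raw-bucket" # placeholder
    region: "us-east-1"     # placeholder
    path: "cloudtrail/{{TABLE}}/{{UUID}}.parquet"
    format: "parquet"
    # no_rotate defaults to false: each sync writes new files (hence the
    # {{UUID}} in the path) instead of overwriting a single file per table.
    no_rotate: false
```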

Apologies for not seeing this in the morning and for getting back with a delay. Feel free to ping me with any other questions you have.

No worries! I’m realizing you may be in a different time zone. I just felt kind of helpless being unable to go through the code to figure it out myself.

And thank you for the detailed response!

@marcel, can S3 itself be used as a backing store for incremental syncs?

And could you help me understand a bit better: say my tables are partitioned by service and then by year/day/hour, but the CloudQuery S3 source sync runs in the middle of an hour. What happens to the files that appear in the folder for the latest hour after that sync has run?

And are the tables all going to have the year, month, and hour in them, or can I set it up so they’re conceptually all the same table even while preserving the partitioning (or maybe that’s fundamentally how it should work)? Sorry, this is a bit new to me; I’m used to API dev.

Where are you looking to sync these tables? From S3 to where?

Back to S3

Multiple accounts. Right now I have a Glue ETL process transforming the data events to Parquet. I want to grab all the new data from each of the accounts periodically and sync it into a single account where I have all my raw Parquet files, then process them from there via DBT.

Okay, I understand a bit better now. The S3 source plugin does not support incremental syncing. What you can do is limit what gets synced by scoping the source to a particular bucket.
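Something along these lines in the source spec (I’m quoting the option names from memory, so treat them as approximate and double-check the S3 source plugin docs):

```yaml
kind: source
spec:
  name: s3
  path: cloudquery/s3
  version: "vX.Y.Z" # placeholder
  tables: ["*"]
  destinations: ["s3"] # the S3 destination in your central account
  spec:
    bucket: "account-a-cloudtrail-parquet" # placeholder
    region: "us-east-1"                    # placeholder
    # Assumed option: scope the listing to a key prefix so each sync only
    # walks the relevant part of the bucket.
    path_prefix: "data-events/"
```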

I’m not sure if you have the option to remove stale data from the buckets using another process?

Oof, okay. Well, please consider that as a feature request. :smile: I realize you’d need something to key off of, but I’d be fine with that being something simple like a date that has to be in an opinionated format, with us providing a field selector so you know where to find it.

We are limited to what the S3 API is able to provide. In this case, it’s this endpoint: ListObjectsV2.

I’m looking at the docs, and I think it’s the start_after parameter that you would be after, is that correct? Could you look at the example below and tell me if it would satisfy your use case? Note the field uses lexicographical matching:

Example 3

Hmm, good question. I see the problem I think you’re trying to get at.
Not great :sob:
A couple of ideas here. I may honestly take a stab at a custom plugin.

This could potentially be exposed as an option in the spec, if you could define the criteria for us. Otherwise, a custom plugin would make sense.
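Purely to make that concrete, a hypothetical option could look something like this (to be clear, nothing like this exists in the S3 source spec today; the option name and semantics are made up for illustration):

```yaml
kind: source
spec:
  name: s3
  path: cloudquery/s3
  version: "vX.Y.Z" # placeholder
  tables: ["*"]
  destinations: ["s3"]
  spec:
    bucket: "account-a-cloudtrail-parquet" # placeholder
    # Hypothetical option mirroring ListObjectsV2's StartAfter parameter:
    # only list keys that sort lexicographically after this value, e.g. a
    # date-partitioned prefix recorded at the end of the previous run.
    start_after: "data-events/service=s3/2024/08/23/18/"
```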

Yeah, I’m torn here. On the one hand, I have all sorts of ideas with Glue ETL and crawlers to automatically infer the schema, but that introduces its own set of complexities and costs (probably a lot of extra cost). I’ve seen crawlers go really wrong when the schema changes significantly. Yeah… I’ve convinced myself that’s a bad idea; plus, you don’t get the benefits of CQ if you have to provision a bunch of resources in every account.

I suppose if the files aren’t already in Parquet, you don’t have any easy tools to introspect the schema from JSON, so I imagine some way to provide a schema would also be needed, and probably desirable for various reasons. At which point, what’s the difference between providing a JSON schema and typing out the Arrow schema in Go or Python?

Meh, yeah, I think I’ll try out the custom plugin for now. If I find there are lots of generic use cases beyond this, then maybe I’ll circle back and try to make the case for this being something you all could maintain (after all, maintaining the mapping is a significant part of the value you add, which is why I can appreciate your open-source posture despite it being frustrating sometimes).

Potentially, there could be some middle ground of tooling built into the SDK for making it easier to deal with blob storage sources, but maybe you already have that - about to find out :slightly_smiling_face:

By the way, looks like you can do something like this:

```
aws s3api list-objects-v2 --bucket <mybucketid> --query "Contents[?LastModified>'2024-08-23T19:04:45+00:00']"
```

I think working on a custom plugin might be a good solution, as it could help you clarify your requirements. Our plugins aim to be as generic as possible by keeping the original structures and formats intact. They are essentially copy-and-paste mechanisms that let you transform the data after the fact; your own plugin would let you do transformations on the fly.

Another option: you could use an interim store (e.g. a database like Postgres), transform the data there with a custom script, and sync it to S3 using CQ. That would mean you don’t have to write a custom plugin, and you’d have more control over the transformation (it wouldn’t have to happen on the fly, which can be RAM/CPU intensive).
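As a rough sketch of that second leg (the table name and connection details are placeholders, and incremental behaviour would depend on the PostgreSQL source plugin’s own options):

```yaml
kind: source
spec:
  name: postgresql
  path: cloudquery/postgresql
  version: "vX.Y.Z" # placeholder
  tables: ["cloudtrail_data_events"] # hypothetical interim table you load and transform
  destinations: ["s3"]
  spec:
    connection_string: "postgresql://user:pass@interim-db:5432/cloudtrail" # placeholder
```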

Yeah, the more I thought through it and started to write some code, the more I realized maybe there’s a good reason you don’t have large demand for S3 as a source. It’s possibly a better task for something like AWS’s built-in S3 transfer capabilities plus Spark and/or Glue ETL, at least for larger files. Even the files I’m dealing with, which usually aren’t larger than 5 MB, are still going to have pretty big RAM requirements if I want to do large amounts of parallel processing.

I’m really not sure this is practical to do without pre-processing in each of the accounts themselves, and at that point, maybe I’m better off just using AWS Config and fetching data from the API that way. It would be more appealing to not have to do that, but we knew we’d have some use cases that CloudQuery wouldn’t be the right fit for.

Also, we currently have everything set up for the Python SDK, while your Go tooling is much more mature and has utilities for batch stream processing. So if we still decide this is the best route, I’ll likely push for us to leverage your Go SDK for certain cases. I prefer Go anyway :slightly_smiling_face:

Let us know if you require any further assistance!