Struggling with incremental extractions from Stripe to S3 using CloudQuery

Hey all, I’m trying to set up extraction of data from a Stripe account into an S3 bucket, and so far I’ve managed to get it working for full extractions (without the incremental part) using the following configuration:

kind: source
spec:
  name: "stripe"
  path: "cloudquery/stripe"
  version: "v2.1.11"
  registry: github
  tables: ["stripe_charges"]
  destinations: ["s3"]
  spec:
    api_key: ${STRIPE_TOKEN}
---
kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  version: "v4.8.3"
  write_mode: "append" 
  spec:
    bucket: "datalake-bucket"
    region: "us-east-1"
    path: "raw_data/stripe_raw_data/{{TABLE}}/dt={{YEAR}}-{{MONTH}}-{{DAY}}/{{UUID}}.parquet"
    format: "parquet"
    athena: true

Now I’m trying to change it to get incremental extractions working, but I keep getting this error:

Error: failed to sync v3 source stripe: unexpected error from sync client receive: rpc error: code = Unknown desc = failed to sync records: failed to sync unmanaged client: failed to create state client: rpc error: code = Unknown desc = failed to sync records: failed to read: operation error S3: GetObject, https response error StatusCode: 404, RequestID: XXXXXXXXXXXXXXX, HostID: (...), NoSuchKey: 

My suspicion is that an S3 destination can’t be used to store and manage state tables, but since I can’t find any documentation about this detail, I’m not sure how to get incremental extractions working. Can anyone help?

My current configuration is as follows:

kind: source
spec:
  name: "stripe"
  (...)
  backend_options:
    table_name: "cq_state_stripe"
    connection: "@@plugins.s3.connection"
---
kind: destination
spec:
  (...)
  spec:
    bucket: "datalake-bucket"
    region: "us-east-1"
    path: "raw_data/stripe_raw_data/{{TABLE}}/dt={{YEAR}}-{{MONTH}}-{{DAY}}/{{YEAR}}-{{MONTH}}-{{DAY}}.parquet"
    format: "parquet"
    athena: true
    no_rotate: true

Hi @immune-sculpin,

For incremental syncs to work, you’ll need a destination that supports overwrite mode. You can find more information on this in the CloudQuery documentation.

Keep in mind that S3 only supports append mode, as noted in the CloudQuery S3 plugin documentation.

You might want to consider using Postgres as your destination.
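
If you do go the Postgres route, the wiring would look roughly like the sketch below: the data still lands in S3, and the state table lives in a separate Postgres destination that the source points at via backend_options. The destination name postgres-state, the POSTGRES_CONNECTION_STRING variable, and the version placeholder are just illustrative (not from your setup), and I haven’t checked whether the state destination also needs to be listed under destinations, so please verify against the docs:

kind: destination
spec:
  name: "postgres-state"
  path: "cloudquery/postgresql"
  version: "vX.Y.Z"   # placeholder, pin to an actual release
  write_mode: "overwrite"
  spec:
    connection_string: ${POSTGRES_CONNECTION_STRING}
---
kind: source
spec:
  name: "stripe"
  path: "cloudquery/stripe"
  version: "v2.1.11"
  registry: github
  tables: ["stripe_charges"]
  destinations: ["s3"]
  backend_options:
    table_name: "cq_state_stripe"
    # point the state connection at the Postgres destination instead of S3
    connection: "@@plugins.postgres-state.connection"
  spec:
    api_key: ${STRIPE_TOKEN}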

:thinking: That makes sense! The problem with Postgres is that I’d need to spin up an RDS instance just to store that state, which feels like overkill.

Currently, in our stack, we are only using S3 with Glue and Athena to store and query data. Is there some other destination that I could use that would fit into this stack?

Something like sqlite, with the actual sqlite database persisted in an S3 bucket?

sqlite supports overwrite, so it should work. Please let me know if it doesn’t.
You might need to download the DB file from S3 before the sync, though.
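
Off the top of my head (untested), the wiring would look roughly like this; the destination name sqlite-state, the ./cq_state.db path, and the version placeholder are just illustrative:

kind: destination
spec:
  name: "sqlite-state"
  path: "cloudquery/sqlite"
  version: "vX.Y.Z"   # placeholder, pin to an actual release
  write_mode: "overwrite"
  spec:
    connection_string: "./cq_state.db"   # local file used as the state store
---
kind: source
spec:
  name: "stripe"
  (...)
  destinations: ["s3"]
  backend_options:
    table_name: "cq_state_stripe"
    connection: "@@plugins.sqlite-state.connection"

You’d then wrap the sync with something like aws s3 cp to pull cq_state.db down before the sync and push it back afterwards, so the state survives between runs.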