CloudQuery S3 destination: JSON fields landing as base64-encoded. Is this configurable?

Hello folks! We are using the S3 destination, writing in Parquet format. We have a number of JSON fields across our data sources. They seem to be landing in S3 as base64-encoded.

Can you please confirm this behavior? Is it configurable? Can you point me to the source code?

Many questions :slightly_smiling_face: Thank you!

Hey :wave:,

The docs for the S3 plugins are here: S3 Plugin Documentation and the source is here: S3 Plugin Source.

Which warehouse do you use to query it? Is it Athena? If so, there is a flag called athena (see the S3 Overview) that you need to set to make JSON columns compatible with Athena, since different query engines expect JSON columns in slightly different ways.

I’ll also let other folks weigh in here on Monday in case I’m not up to date with the S3 destination.

Hi :wave:

So the CloudQuery JSON type is an Arrow extension type that uses binary as its base type. I can imagine how a bug there, or maybe in the Arrow Parquet writer when dealing with extension types, could lead to the data being stored as base64, but as far as I know that shouldn't be the case.

I also just tested this myself: I downloaded a Parquet file with JSON columns from S3 and imported it with DuckDB, and the columns showed up as varchar, not base64-encoded.
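If it helps with checking your own files, here is a quick stdlib-only sketch (the function name is just illustrative) that distinguishes a plain JSON string from one that landed as base64-encoded JSON:

```python
import base64
import json


def looks_base64_encoded_json(value: str) -> bool:
    """Return True if `value` is base64 whose decoded bytes parse as JSON,
    False if it is already plain JSON text or neither."""
    try:
        json.loads(value)
        return False  # already plain JSON text
    except (ValueError, TypeError):
        pass
    try:
        decoded = base64.b64decode(value, validate=True)
        json.loads(decoded)
        return True  # base64 wrapping valid JSON
    except (ValueError, TypeError):
        return False
```

You could run this over a sample of the suspect column values (e.g. after loading the Parquet file with DuckDB) to confirm which encoding you actually have.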

Could you share a reproduction by any chance? Maybe the source plugin is using an older version of the SDK, or a different language?

Hi folks,

Thank you for the responses, and I apologize for my own delay :slightly_smiling_face: Here is the configuration:

kind: source
spec:
  name: "aws"
  path: "cloudquery/aws"
  registry: "cloudquery"
  version: "v23.6.1"
  ## See all tables: https://www.cloudquery.io/docs/plugins/sources/aws/tables
  tables: ["aws_s3_buckets", "aws_ec2_instances", "aws_ec2_images"]
  skip_tables:
    - "aws_s3_bucket_objects"
    - "aws_s3_bucket_object_grants"
  destinations: ["s3"]
  spec:
    aws_debug: true
    accounts:
      - id: <ID>
    regions:
      - "us-east-1"
      - "us-west-2"
---
kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  registry: "cloudquery"
  version: "v4.8.9"
  write_mode: "append"
  spec:
    bucket: <BUCKET>
    region: "us-east-1" # Example: us-east-1
    path: "raw/{{TABLE}}/{{UUID}}.parquet"
    format: "parquet" # options: parquet, json, csv
    format_spec:
      # CSV-specific parameters:
      # delimiter: ","
      # skip_header: false

    # Optional parameters
    athena: true # <- set this to true for Athena compatibility

We are observing this behavior both in the CloudQuery-provided AWS plugin, as well as a custom plugin we have implemented in Python.

Hi @herman, do you have any advice here? TYIA

Looping in @KernelKnight as Herman is out this week.

Hi all!

I’ll take a look and post here once I have some ideas.

OK, I do see a way to improve things: we call Append on bytes that are already marshaled, which we shouldn't, since JSONBuilder's Append marshals the data itself.
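To illustrate the mechanism (a Python sketch mimicking Go's encoding/json, which serializes a []byte as a base64 string): if bytes that are already marshaled JSON get marshaled a second time, the result is a quoted base64 string rather than JSON text, which matches what's landing in the Parquet files.

```python
import base64
import json

payload = {"bucket": "my-bucket"}

# First marshal: the source plugin already produced JSON bytes.
marshaled = json.dumps(payload).encode()

# Second marshal: Go's json.Marshal turns a []byte into a base64 string;
# mimic that here to show what would land in the column.
double_marshaled = json.dumps(base64.b64encode(marshaled).decode())

print(double_marshaled)  # a quoted base64 string, not JSON text
```

So skipping the redundant marshal (or passing the raw value instead of pre-marshaled bytes) would keep the column as plain JSON text.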

I’ll put together a quick fix, but I’d like to hand it to you to test (along with the PR, if you want to see the changes), so that we’re on the same page here.

Yes! That would be fantastic! Thank you, Alex!

PR: cloudquery/cloudquery/pull/16942
Version to try: v4.10.2-rc
Tag for reference: plugins-destination-s3-v4.10.2-rc
@pure-quetzal please let us know when you’ve run a sync & verified results :pleading_face:

Looks like I need to do cloudquery login now. Any guidance on how to achieve this in an AWS Airflow environment?

Can you create a token in the UI and set it as an environment variable in the Airflow environment?

I think CloudQuery Documentation - Generate API Key should do it.
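Something like this should work (assuming the CLI reads the CLOUDQUERY_API_KEY environment variable in place of an interactive cloudquery login; please double-check the exact variable name against the linked docs):

```shell
# Generate a token in the CloudQuery Cloud UI, then expose it to the
# Airflow workers (e.g. via the managed environment's configuration):
export CLOUDQUERY_API_KEY="<your API key>"

# The CLI picks up the key from the environment; no interactive login needed.
cloudquery sync spec.yml
```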

This should work, thank you.
I’m getting Firebase errors when attempting to create a CloudQuery Cloud account.

What error do you see?

Firebase: Error (auth/internal-error).

Checking this out now. Are you trying to sign up via: https://cloud.cloudquery.io/auth/register?

The above error is with G Suite. When I try with email, I get the following error:

Firebase: Error (auth/firebase-app-check-token-is-invalid.).

We just tried registering on our end from several machines and it seems to work. Can you try from an incognito window? (Just to rule out cache issues.) Also, which browser are you using? (We'll try to replicate the issue.)

Hm, okay. I’m using Chrome.

Maybe some browser extensions are blocking something, but it’s pretty weird indeed.

Same results with Incognito.
I’m pretty sure it’s something on my end at this point.

I tried from both Chrome and Brave. Even if the cause is on your end, the message should be more descriptive, though it’s a Firebase error, so we don’t control that part much. I’m still investigating. Do you have Firefox or anything else available? (Just trying to narrow down the cause.)