Issue querying JSON data from ECS TASK module stored in S3 with Athena

To be more precise, the data was from the ECS TASK module. I stored it in S3 as Parquet and crawled it with Glue. However, in Athena, I got some JSON data (like container info) that I could not query.

hi @above-sawfly :wave: What do you mean you could not query it? Is the data there but in JSON format? Did you try some of the Athena JSON extraction functions, like json_extract?

The data is there but in JSON format.
I did not know about the JSON extract function… I will test it.

Additionally, v5.0.0 was released only a couple of days ago: CloudQuery S3 Plugin Release Notes

Ok, I will retest it… in case the athena:true functionality has changed since last week.

Ok, tested again.
Seems some schema for tag is not working properly.

HIVE_INVALID_METADATA: Glue table 'foo_aws_ecs_cluster_tasks' column 'tags' has invalid data type: struct<Application:string,Core:string,Env:string,Environment:string,Name:string,Repo:string,Stack:string,App:string,StackName:string,Version:string,aws\:ecs\:clusterName:string,aws\:ecs\:serviceName:string>

Hmm, if I had to guess, it’s not happy about having a struct with keys aws:ecs:clusterName and aws:ecs:serviceName, which contain colons. I assume there are tags in your account with those keys? Athena has many limitations, and we had a similar issue with duplicate keys a while back, if you’d like to read some background there: GitHub Issue #7639.

Maybe this is something we can handle as part of the athena: true delivery mode. Would you mind opening an issue about it with as much detail as possible, so we can prioritize and take a look? Open a New Issue.

@above-sawfly I also just emailed you regarding a private channel so we can support the PoC faster.