To be more precise, the data was from the ECS TASK module. I stored it in S3 as Parquet and crawled it with Glue. However, in Athena, I got some JSON data (like container info) that I could not query.
hi @above-sawfly What do you mean you could not query it? Is the data there but in JSON format? Did you try some of the Athena JSON extraction functions, like json_extract
?
The data is there but in JSON format.
I did not know about the JSON extract function… I will test it.
Additionally, v5.0.0 was released only a couple of days ago: CloudQuery S3 Plugin Release Notes
Ok, I will retest it… in case the athena:true
functionality has changed since last week.
Ok, tested again.
Seems some schema for tag
is not working properly.
HIVE_INVALID_METADATA: Glue table 'foo_aws_ecs_cluster_tasks' column 'tags' has invalid data type: struct<Application:string,Core:string,Env:string,Environment:string,Name:string,Repo:string,Stack:string,App:string,StackName:string,Version:string,aws\:ecs\:clusterName:string,aws\:ecs\:serviceName:string>
Hmm, if I had to guess, it’s not happy about having a struct with keys aws:ecs:clusterName
and aws:ecs:serviceName
, which contain colons. I assume there are tags in your account with those keys? Athena has many limitations, and we had a similar issue with duplicate keys a while back, if you’d like to read some background there: GitHub Issue #7639.
Maybe this is something we can handle as part of the athena: true
delivery mode. Would you mind opening an issue about it with as much detail as possible, so we can prioritize and take a look? Open a New Issue.
@above-sawfly I also just emailed you regarding a private channel so we can support the PoC faster.