Maybe it’s just my lack of strong SQL skills that’s hurting me here, but the biggest challenge we’re running into is accurately extracting/expanding all the JSON objects being written to the DB (BigQuery in our case). It’s gotten to the point where I’m considering adding another ETL tool (a la DBT or similar) on top of this data collection just so I can unravel all this JSON into its own columns.
But before I go down that very painful route, I figured I’d ask/make a suggestion… Is CloudQuery considering any feature/update that will start parsing these JSON objects and writing them as individual columns?
Yeah, we ran into similar issues and added DBT into the mix, but really CQ is the E and the L of ELT, not so much the T. And that’s where tools like DBT shine, in my opinion.
The searchable table documentation alone makes it well worth it. Though I’m curious what you had in mind for column extraction?
What we found is that getting analysts familiar with the JSON syntax was awkward at first, but once they understood it, things got easier. Plus, with all the various JSON structures across all the different providers, I was worried we’d end up having to manage our own schemas/tables for things we rarely used more than once.
I will say we’re mostly GCP, so I’m a bit less worried (maybe inaccurately) about trying to maintain a schema ourselves in that regard. But as an example that I’ve been working on this morning…
The gcp_storage_bucket_policies table returns a bindings column structured something like:

[
  {"members": ["lol", "lol2"], "role": "lolrole"},
  {"members": ["lol3", "lol4"], "role": "lolrole2"}
]

So I guess in my mind there would be something like bindings.role and bindings.members columns (or whatever the names would be) that extract everything out into ordered pairs/arrays, so that bindings.members[0] and bindings.role[0] are the matching element pair?
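For context, this is the kind of query I end up hand-writing today for each of these blobs (just a sketch; it assumes bindings serializes cleanly with TO_JSON_STRING and that the dataset is named cloudquery):

-- Sketch only: flatten each binding object into matching role/members pairs.
SELECT
  JSON_VALUE(binding, '$.role') AS role,
  JSON_VALUE_ARRAY(binding, '$.members') AS members
FROM
  `cloudquery.gcp_storage_bucket_policies`,
  UNNEST(JSON_EXTRACT_ARRAY(TO_JSON_STRING(bindings))) AS binding

Having real columns (or a child table) would make that per-table boilerplate go away.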
TBH, I’m much better at complaining than figuring out good solutions, but this JSON is whooping my ass right now.
@sure-hound Thanks for the feedback! I’d like to better understand what you were trying to do and see if there’s some low-hanging fruit we could pick. Would you have some time for a quick chat with me and the team this week?
@pilvikala - Absolutely! Anything to help give back
I’ll DM you with my calendar link
I think some of these could easily be expanded into child tables within the provider, so something like a gcp_storage_bucket_policies_bindings table with just members and role columns?
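Roughly this shape, I mean (purely an illustration of the idea, not an existing table; shown as BigQuery DDL, with the usual CloudQuery bookkeeping columns assumed):

-- Hypothetical child table, for illustration only.
CREATE TABLE `cloudquery.gcp_storage_bucket_policies_bindings` (
  _cq_id        STRING,  -- CloudQuery row id
  _cq_parent_id STRING,  -- points back to the parent gcp_storage_bucket_policies row
  role          STRING,
  members       ARRAY<STRING>
);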
Something like that, yeah. Are all the JSON objects identical to that? If it’s just those two properties in that JSON blob, though, perhaps just a new resolver within the fetching function is all that’s needed.
I think most of the GCP CQ SDK was autogenerated, so it might just be that no one’s hit the issue before. I’d agree that “list based” JSON (stuff that is wrapped in square brackets) is very irritating to work with.
Unpacking JSON offends me too. Busy work.
We have some static dbt models that unpack generators. The bigquery-schema-generator is useful.
My preference would be for CloudQuery to include a configuration option for schema evolution:
On Success:
- Add new columns as type/name evolve
- Raise warning when new columns are added
On Failure:
- Dump payload to object storage
- Raise error
- Mechanism to restart from object storage
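For the happy path, at least on BigQuery, the “add new columns” part maps to a cheap DDL statement. Purely as an illustration (the column name is made up):

-- Illustration only: the kind of statement the destination could issue when a new key appears.
ALTER TABLE `cloudquery.gcp_storage_bucket_policies`
  ADD COLUMN IF NOT EXISTS newly_seen_key STRING;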
Does it feel cluttered for every ingestor app to have a recovery partner app that has object storage as the input? I’d prefer this to be a standard pattern.
@closing-mink Thanks for sharing on how you approached this!
Just to make sure I understand your proposal: Are you suggesting to expand each key/value pair into a column named by the key? How do you approach expanding JSON objects with multiple levels or arrays?
Actually ran into an even worse table in terms of parsing JSON (gcp_compute_osconfig_inventories). But what I’ve just started doing is creating user-defined functions (it’s a BigQuery thing; I think other SQL DBs call them stored procedures) that parse these JSON objects. In BigQuery you can actually run JavaScript in your UDF, so I’m just passing the column into a bunch of JavaScript and parsing it there. This way I’m not having to maintain an ETL process outside of it. For example:
CREATE TEMP FUNCTION extract_package(json STRING)
RETURNS ARRAY<STRUCT<package_name STRING, version STRING, architecture STRING>>
LANGUAGE js AS """
  const data = JSON.parse(json);
  const result = [];
  for (const key in data) {
    let package_name, version, architecture;
    // Guard each level of nesting so entries that lack a branch fall through
    // instead of throwing a TypeError on the deep property access.
    const details = (data[key] || {})['Details'] || {};
    const cos = ((details['InstalledPackage'] || {})['Details'] || {})['CosPackage'];
    const apt = ((details['AvailablePackage'] || {})['Details'] || {})['AptPackage'];
    if (cos) {
      package_name = cos['package_name'];
      version = cos['version'];
      architecture = cos['architecture'];
    } else if (apt) {
      package_name = apt['package_name'];
      version = apt['version'];
      architecture = apt['architecture'];
    }
    result.push({ package_name, version, architecture });
  }
  return result;
""";

SELECT DISTINCT
  project_id AS `Project`,
  STRING(os_info.hostname) AS `Compute Name`,
  STRING(os_info.short_name) AS `Operating System`,
  STRING(os_info.version) AS `OS Version`,
  SPLIT(pkgs.package_name, '/')[SAFE_OFFSET(1)] AS `Package Name`,
  pkgs.version AS `Package Version`,
  pkgs.architecture AS `Package Arch`
FROM
  `cloudquery.gcp_compute_osconfig_inventories`,
  UNNEST(extract_package(TO_JSON_STRING(items))) AS pkgs
WHERE
  TIMESTAMP_TRUNC(_cq_sync_time, DAY) = TIMESTAMP("2023-10-25")
LIMIT 1
Now all I have to do is iterate over that JSON column to find any variations of the JSON, add them to the if/else, and move on with life, saving me from having to maintain additional infrastructure.
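One quick way to spot variants I haven’t handled yet (my own sketch; it assumes the variant keys all end in “Package”, like CosPackage and AptPackage do):

-- Sketch: list the distinct *Package variants present in items so new ones
-- can be added to the if/else in the UDF above.
SELECT DISTINCT pkg_type
FROM
  `cloudquery.gcp_compute_osconfig_inventories`,
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(items), r'"(\w+Package)"')) AS pkg_type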
We have some tables that land JSON and then use the BigQuery JSON functions in dbt to parse it.
BigQuery JSON Functions Documentation
Like:
SELECT
  STRUCT(
    JSON_VALUE(json_data.path.to.foo) AS foo,
    JSON_VALUE_ARRAY(json_data.path.to.`myArray`) AS my_array
  ) AS my_struct
FROM foo
It works, but it just gets us to the starting position; it’s not a value add in itself.
Yes. I imagine a flag in the destination config, something like parse_json as a boolean defaulting to false (for backwards compatibility), and/or a path to a schema.json file for more complex selection (e.g., key_a is a struct, key_b is JSON).
@closing-mink Great, thank you for the feedback. Could you please upvote that issue?