Maybe it’s just my lack of strong SQL skills that’s hurting me here, but the biggest challenge we’re running into is accurately extracting/expanding all the JSON objects being written to the DB (BigQuery in our case). It’s gotten to the point where I’m considering adding another ETL tool (a la DBT or similar) on top of this data collection just so I can unravel all this JSON into its own columns.
But before I go down that very painful route, I figured I’d ask/make a suggestion… Is CloudQuery considering any feature/update that will start parsing these JSON objects and writing them as individual columns?
Yeah, we ran into similar issues and added DBT into the mix, but really CQ is the E and the L of ELT, not so much the T. And that’s where tools like DBT shine, in my opinion.
The searchable table documentation alone makes it well worth it. Though I’m curious what you had in mind for column extraction?
What we found is that getting analysts familiar with the JSON syntax was awkward at first, but once they understood it, things got easier. Plus, with all the various JSON structures across all the different providers, I was worried we’d end up having to manage our own schemas/tables for things we rarely used more than once.
I will say we’re mostly GCP, so I’m a bit less worried (maybe inaccurately) about trying to maintain a schema ourselves in that regard. But as an example that I’ve been working on this morning…
The gcp_storage_bucket_policies table returns a bindings column structured something like:

[
  {"members": ["lol", "lol2"], "role": "lolrole"},
  {"members": ["lol3", "lol4"], "role": "lolrole2"}
]

So I guess in my mind there would be something like bindings.role and bindings.members columns (or whatever the names would be) that extract everything out into ordered pairs/arrays, so that bindings.members[0] and bindings.role[0] are the matching element pair?
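For context, this is the kind of query I end up hand-writing today for each of these blobs (just a sketch; it assumes bindings serializes cleanly with TO_JSON_STRING and that the dataset is named cloudquery):

-- Sketch only: flatten each binding object into matching role/members pairs.
SELECT
  JSON_VALUE(binding, '$.role') AS role,
  JSON_VALUE_ARRAY(binding, '$.members') AS members
FROM
  `cloudquery.gcp_storage_bucket_policies`,
  UNNEST(JSON_EXTRACT_ARRAY(TO_JSON_STRING(bindings))) AS binding

Having real columns (or a child table) would make that per-table boilerplate go away.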
TBH, I’m much better at complaining than figuring out good solutions, but this JSON is whooping my ass right now.
@sure-hound Thanks for the feedback! I’d like to better understand what you were trying to do and see if there’s some low-hanging fruit we could pick. Would you have some time for a quick chat with me and the team this week?
@pilvikala - Absolutely! Anything to help give back
I’ll DM you with my calendar link
I think some of these could easily be expanded into child tables within the provider, so something like a gcp_storage_bucket_policies_bindings table with just members and role columns?
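Roughly this shape, I mean (purely an illustration of the idea, not an existing table; shown as BigQuery DDL, with the usual CloudQuery bookkeeping columns assumed):

-- Hypothetical child table, for illustration only.
CREATE TABLE `cloudquery.gcp_storage_bucket_policies_bindings` (
  _cq_id        STRING,  -- CloudQuery row id
  _cq_parent_id STRING,  -- points back to the parent gcp_storage_bucket_policies row
  role          STRING,
  members       ARRAY<STRING>
);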
Something like that, yeah. Are all the JSON objects identical to that? If it’s just those two properties in that JSON blob, though, perhaps just a new resolver within the fetching function is all that’s needed.
I think most of the GCP CQ SDK was autogenerated, so it might just be that no one’s hit the issue before. I’d agree that “list based” JSON (stuff that is wrapped in square brackets) is very irritating to work with.
Unpacking JSON offends me too. Busy work.
We have some static dbt models that unpack generators. The bigquery-schema-generator is useful.
My preference would be for CloudQuery to include a configuration option for schema evolution:
On Success:
- Add new columns as type/name evolve
- Raise warning when new columns are added
On Failure:
- Dump payload to object storage
- Raise error
- Mechanism to restart from object storage
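For the happy path, at least on BigQuery, the “add new columns” part maps to a cheap DDL statement. Purely as an illustration (the column name is made up):

-- Illustration only: the kind of statement the destination could issue when a new key appears.
ALTER TABLE `cloudquery.gcp_storage_bucket_policies`
  ADD COLUMN IF NOT EXISTS newly_seen_key STRING;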
Does it feel cluttered for every ingestor app to have a recovery partner app that has object storage as the input? I’d prefer this to be a standard pattern.
@closing-mink Thanks for sharing on how you approached this!
Just to make sure I understand your proposal: Are you suggesting to expand each key/value pair into a column named by the key? How do you approach expanding JSON objects with multiple levels or arrays?
Actually ran into an even worse table in terms of parsing JSON (gcp_compute_osconfig_inventories). But what I’ve just started doing is creating user-defined functions (it’s a BigQuery thing; I think other SQL DBs call them stored procedures) that parse these JSON objects. In BigQuery you can actually run JavaScript in your UDF, so I’m just passing the column into a bunch of JavaScript and parsing it there. This way I’m not having to maintain an ETL process outside of it. For example:
CREATE TEMP FUNCTION extract_package(json STRING)
RETURNS ARRAY<STRUCT<package_name STRING, version STRING, architecture STRING>>
LANGUAGE js AS """
  const data = JSON.parse(json);
  const result = [];
  for (const key in data) {
    let package_name, version, architecture;
    // Guard each level of nesting so entries that lack a branch fall through
    // instead of throwing a TypeError on the deep property access.
    const details = (data[key] || {})['Details'] || {};
    const cos = ((details['InstalledPackage'] || {})['Details'] || {})['CosPackage'];
    const apt = ((details['AvailablePackage'] || {})['Details'] || {})['AptPackage'];
    if (cos) {
      package_name = cos['package_name'];
      version = cos['version'];
      architecture = cos['architecture'];
    } else if (apt) {
      package_name = apt['package_name'];
      version = apt['version'];
      architecture = apt['architecture'];
    }
    result.push({ package_name, version, architecture });
  }
  return result;
""";

SELECT DISTINCT
  project_id AS `Project`,
  STRING(os_info.hostname) AS `Compute Name`,
  STRING(os_info.short_name) AS `Operating System`,
  STRING(os_info.version) AS `OS Version`,
  SPLIT(pkgs.package_name, '/')[SAFE_OFFSET(1)] AS `Package Name`,
  pkgs.version AS `Package Version`,
  pkgs.architecture AS `Package Arch`
FROM
  `cloudquery.gcp_compute_osconfig_inventories`,
  UNNEST(extract_package(TO_JSON_STRING(items))) AS pkgs
WHERE
  TIMESTAMP_TRUNC(_cq_sync_time, DAY) = TIMESTAMP("2023-10-25")
LIMIT 1
Now all I have to do is iterate over that JSON column to find any variations of the JSON, add them to the if/else, and move on with life, saving me from having to maintain additional infrastructure.
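One quick way to spot variants I haven’t handled yet (my own sketch; it assumes the variant keys all end in “Package”, like CosPackage and AptPackage do):

-- Sketch: list the distinct *Package variants present in items so new ones
-- can be added to the if/else in the UDF above.
SELECT DISTINCT pkg_type
FROM
  `cloudquery.gcp_compute_osconfig_inventories`,
  UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(items), r'"(\w+Package)"')) AS pkg_type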
We have some tables that land JSON and then use the BigQuery JSON functions in dbt to parse it.
BigQuery JSON Functions Documentation
Like:
SELECT
  STRUCT(
    JSON_VALUE(json_data.path.to.foo) AS foo,
    JSON_VALUE_ARRAY(json_data.path.to.`myArray`) AS my_array
  ) AS my_struct
FROM foo
It works, but it just gets us to the starting position; it’s not a value add in itself.
Yes. I imagine a flag in the destination config, something like parse_json as a boolean defaulting to false (for backwards compatibility), and/or a path to a schema.json file for more complex selection (e.g., key_a is a struct, key_b is JSON).
@closing-mink Great, thank you for the feedback. Could you please upvote that issue?