Plugin schema requirements for CloudQuery and enhancing discovery capabilities

Plugin Requirements - Specs and Maybe Discovery

I don’t know if this use-case is just a “me” issue or something more significant, and please don’t take this as me complaining about how this tool is built.

My use case is building a workflow around CloudQuery: a fully managed ETL process with multiple destinations and humans in the loop for version changes, since downstream changes could impact production-level services.

To meet my own goal, I need more tooling than CloudQuery comes with. I need more precise error detection than parsing log files. I have started to build a wrapper interface around specs, and this is where I am hitting my first major issue: Spec requirements.

The CloudQuery CLI does not know which details each source and destination plugin needs or expects; those definitions live in many different repos across GitHub. That makes sense for a self-contained tool driven by YAML files, but for what I am building it turns into a rather significant pain.

I don’t have a complete solution at this time, but here are my initial thoughts on something that would be helpful for my needs and useful for the project generally:

  • Each plugin would implement a gRPC call that returns a schema describing the spec options that plugin supports (a rough sketch follows this list).
    • My wrapper could then discover and expose those options without per-plugin hand-coding.
    • The CLI could output helpful details such as example YAML or the schema itself:
      • cloudquery plugin source cloudquery/github yaml
      • cloudquery plugin source cloudquery/github schema
      • cloudquery plugin destination cloudquery/s3 schema
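To make that concrete, here is a rough Go sketch of what such a call could look like; the interface and method names are hypothetical, and nothing like this exists in the SDK today.

package plugindiscovery

import "context"

// SpecSchemaProvider is a hypothetical interface a plugin could implement so
// the CLI and other tools can discover what its spec expects.
type SpecSchemaProvider interface {
    // GetSpecSchema returns a JSON Schema document describing the plugin's
    // spec fields, their types, defaults, and descriptions.
    GetSpecSchema(ctx context.Context) ([]byte, error)
}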

I don’t know yet what the correct answer is, but I wanted to continue in a public forum rather than just me thinking before coding. Also happy to move this to GitHub if that forum makes more sense.

Hi @bella-spark :wave:

Can you explain a bit about the problem you’re trying to solve with CloudQuery and what the wrapper you built solves? I think the expectation (at the moment) is that specs are written manually, at least the first time.

We have some users who automate spec creation, since the YAML can easily be generated from code, but it’s not super generic and is usually ad hoc.
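For instance, a spec YAML can be generated from a small Go struct and marshaled out; the struct shape, plugin version, and tables below are purely illustrative.

package main

import (
    "fmt"

    "gopkg.in/yaml.v3"
)

// SourceSpec models just the handful of source-spec fields this example needs.
type SourceSpec struct {
    Name         string   `yaml:"name"`
    Path         string   `yaml:"path"`
    Version      string   `yaml:"version"`
    Tables       []string `yaml:"tables"`
    Destinations []string `yaml:"destinations"`
}

func main() {
    doc := map[string]any{
        "kind": "source",
        "spec": SourceSpec{
            Name:         "aws",
            Path:         "cloudquery/aws",
            Version:      "v22.0.0", // illustrative; pin a real release
            Tables:       []string{"aws_ec2_instances"},
            Destinations: []string{"postgresql"},
        },
    }
    out, err := yaml.Marshal(doc)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(out))
}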

Maybe the following resources can help:

  1. How to Guides - Datadog Observability (since you mentioned error detection)
  2. How to Guides - Update Plugins Using Renovate - automatically handle plugin updates. Note that we follow SemVer, so a tool like Renovate gives you full control over how you roll out updates.

Howdy,

This Discord thread has very little context so far, so let me fill it in.

I use Temporal.io (workflow orchestration) to build views of a sizeable hybrid infrastructure (AWS, OpenStack, VMware, and a large global network). The goal is to bring an understanding of all the services into a single database that can serve higher-level BI, operational engineering, and security use cases. Teams and external parties depend on tables within that database, so I need to ensure some standards and consistent expectations.

Right now, the system builds out views with the following steps (a rough Temporal sketch follows the list):

  1. CloudQuery and plugins sync into raw_table.*
  2. Workflow verifies relationships and ensures integration with the on-prem ownership model.
  3. Workflow processes and creates versioned views of raw_table.* for downstream real-time clients (think DevOps-type work).
  4. Workflow processes and creates a view of the dataset for BI (breaking this is often not a big deal).
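For illustration only, a stripped-down version of that orchestration using the Temporal Go SDK might look like the sketch below; the activity names are hypothetical, and each activity wraps one of the steps above (step 1 simply shells out to the CloudQuery CLI).

package pipeline

import (
    "context"
    "time"

    "go.temporal.io/sdk/workflow"
)

// Hypothetical activity stubs; the real implementations shell out to the
// CloudQuery CLI, run SQL against the database, and so on.
func RunCloudQuerySync(ctx context.Context) error   { return nil }
func VerifyRelationships(ctx context.Context) error { return nil }
func BuildVersionedViews(ctx context.Context) error { return nil }
func BuildBIView(ctx context.Context) error         { return nil }

// InventoryWorkflow runs the four steps above in order, failing fast if any
// step returns an error.
func InventoryWorkflow(ctx workflow.Context) error {
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 2 * time.Hour,
    })
    steps := []any{
        RunCloudQuerySync,   // 1. sync into raw_table.*
        VerifyRelationships, // 2. check the on-prem ownership model
        BuildVersionedViews, // 3. versioned views for real-time clients
        BuildBIView,         // 4. BI view of the dataset
    }
    for _, step := range steps {
        if err := workflow.ExecuteActivity(ctx, step).Get(ctx, nil); err != nil {
            return err
        }
    }
    return nil
}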

By and large I have everything working with the CloudQuery CLI, but I am not happy with error detection, or even with pinpointing which error occurred, for something as large as the AWS source plugin.

(Side note: love the Renovate stuff; I had never seen that before and will check whether it reduces the code I have already built around this problem space.)

Thanks for the added context so we can continue in the original thread.
FYI, Renovate can be self-hosted (that’s what we do), see more in Self-Hosting.
We (CloudQuery) use the Renovate GitHub Action.

I am thinking about doing the following and would like to know if it makes sense for CloudQuery, though I also hate adding JSON Schema when Protobuf is already part of the dependency chain.

Add JSON Schema tags, something like this (from cloudquery/cloudquery/plugins/source/aws/client/spec.go):

type Spec struct {
    Regions       []string   `json:"regions,omitempty" jsonschema:"oneof_type=string;array,default=us-east-1,description=Array of AWS regions"`
    Accounts      []Account  `json:"accounts"`
    Organization  *AwsOrg    `json:"org"`
    AWSDebug      bool       `json:"aws_debug,omitempty" jsonschema:"default=false,description=enable aws sdk debug logging"`
    // Removed things here 
}

This would allow the plugin to generate a schema that the CLI or other tools could use to make plugins more self-discoverable, validate specs against expectations, and even drive a GUI for input.
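For completeness, here is a minimal sketch of how the plugin could emit that schema, reusing the Spec struct above and assuming the github.com/invopop/jsonschema reflector (which is where tags like oneof_type and default come from):

package main

import (
    "encoding/json"
    "fmt"

    "github.com/invopop/jsonschema"
)

func main() {
    // Reflect the Spec struct above into a JSON Schema document that the CLI,
    // a validator, or a GUI could consume.
    schema := jsonschema.Reflect(&Spec{})
    out, err := json.MarshalIndent(schema, "", "  ")
    if err != nil {
        panic(err)
    }
    fmt.Println(string(out))
}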

That is the idea. Please let me know if this was clear or ask any clarifying questions.

That’s an interesting approach. I don’t think the plugin spec is part of the gRPC protocol, since the protocol is part of the SDK; plugins just get a JSON string that represents the spec and unmarshal it.
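Roughly, the current pattern looks like the sketch below; this is illustrative only (the hook name and exact signature vary by SDK version) and it reuses the Spec type from your example.

package client

import (
    "encoding/json"
    "fmt"
)

type Client struct {
    spec Spec
}

// newClient is an illustrative sketch of what happens today: the SDK hands the
// plugin its spec as raw JSON and the plugin unmarshals it into its own struct,
// so the spec shape never crosses the wire as a typed message.
func newClient(specBytes []byte) (*Client, error) {
    var s Spec
    if err := json.Unmarshal(specBytes, &s); err != nil {
        return nil, fmt.Errorf("invalid spec: %w", err)
    }
    return &Client{spec: s}, nil
}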

Assuming you know each plugin’s schema, how would the discovery work?

Maybe it’s better discussed in a GitHub issue here: GitHub Issue, so we can document the full context of what you’re trying to accomplish.

Moving this to GitHub.