Hey everyone,
We’ve been working on running CloudQuery syncs directly within Databricks, and after hitting a few walls, we’ve got a solid setup that’s working well in production. Figured I’d share the complete walkthrough since several folks have asked about this integration.
Our team needed centralized cloud asset inventory across AWS accounts, but didn’t want to spin up separate infrastructure just for CloudQuery. Since we’re already heavy Databricks users, running everything as Jobs made sense.
The setup breakdown:
Secrets management (this part’s crucial): We use the Databricks CLI to create a secret scope (see the sketch after the key list). You’ll need these keys:
AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN
CLOUDQUERY_API_KEY
DATABRICKS_CATALOG / DATABRICKS_SCHEMA
DATABRICKS_ACCESS_TOKEN / DATABRICKS_HOSTNAME / DATABRICKS_HTTP_PATH
DATABRICKS_STAGING_PATH
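The walkthrough does this with the Databricks CLI; if you’d rather script it, here’s a rough equivalent using the Databricks Python SDK (`databricks-sdk`). The scope name `cloudquery` and the idea of sourcing values from local env vars are our choices for the sketch, not requirements:

```python
# Sketch: create the secret scope and load the keys via the Databricks
# Python SDK (equivalent to the CLI commands in the walkthrough).
# Assumes `pip install databricks-sdk` and that SDK auth is already
# configured (e.g. DATABRICKS_HOST / DATABRICKS_TOKEN in your env).
import os

from databricks.sdk import WorkspaceClient

SCOPE = "cloudquery"  # scope name is our choice; use whatever fits your workspace

KEYS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "CLOUDQUERY_API_KEY",
    "DATABRICKS_CATALOG",
    "DATABRICKS_SCHEMA",
    "DATABRICKS_ACCESS_TOKEN",
    "DATABRICKS_HOSTNAME",
    "DATABRICKS_HTTP_PATH",
    "DATABRICKS_STAGING_PATH",
]

w = WorkspaceClient()

# Create the scope once; this call raises if the scope already exists.
w.secrets.create_scope(scope=SCOPE)

# Here we copy values out of local env vars; source them however you like.
for key in KEYS:
    w.secrets.put_secret(scope=SCOPE, key=key, string_value=os.environ[key])
```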
You can check out the complete walkthrough here: How to Work with CloudQuery Syncs within Databricks | CloudQuery Blog
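To give a feel for what the sync job actually does, here’s a stripped-down sketch of the entry point: pull the secrets into the environment, then shell out to the CloudQuery CLI. The binary path, spec filename, and the assumption that an earlier job task installed the `cloudquery` binary are ours for illustration; see the main sync job repo below for the real thing:

```python
# Sketch of the sync task: read secrets with dbutils and run `cloudquery sync`.
# Assumes an earlier job task downloaded the CloudQuery CLI to /tmp/cloudquery
# (hypothetical path) and that sync_spec.yml references these env vars via
# CloudQuery's ${VAR} expansion.
import os
import subprocess

SCOPE = "cloudquery"  # must match the secret scope created earlier

env = dict(os.environ)
for key in [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "CLOUDQUERY_API_KEY",
    "DATABRICKS_CATALOG",
    "DATABRICKS_SCHEMA",
    "DATABRICKS_ACCESS_TOKEN",
    "DATABRICKS_HOSTNAME",
    "DATABRICKS_HTTP_PATH",
    "DATABRICKS_STAGING_PATH",
]:
    # dbutils is a built-in global inside Databricks notebooks/jobs.
    env[key] = dbutils.secrets.get(scope=SCOPE, key=key)

# `cloudquery sync <spec>` is the standard CLI invocation; the spec file
# pulls its connection details from the env vars set above.
subprocess.run(["/tmp/cloudquery", "sync", "sync_spec.yml"], env=env, check=True)
```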
Additional Resources:
- Main sync job: https://github.com/cloudquery/databricks-simple-cli-job
- Transformation code: https://github.com/cloudquery/databricks-sync-transformations (private repo, ping us if you need access)
- Visualization app: https://github.com/cloudquery/databricks-sample-app
