Issue with collecting sync metrics from CloudQuery logs

Hi CQ team!

I have a question about metrics collection. What is the preferred method of collecting sync stats (resources, errors, etc.)?

For a long time I have used a wrapper for running syncs, which collects metrics by reading the log files/stdout of the CLI and the source plugin:

  1. For the source plugin, it extracts the resources value from every log record with message == "table sync finished" and sums them up (a simplified sketch of this parsing is included below), e.g.

    {"level":"info","module":"yc-source","table":"yc_access_bindings_organizationmanager_organizations","client":"org:bpxxxm","resources":1048,"errors":0,"message":"table sync finished"}
    
  2. This summed value does not match the total in the CLI’s final "Sync summary" log record, e.g.

    {"level":"info","module":"cli","resources":115030,"errors":0,"warnings":0,"duration":"1m54s","result":"Sync completed successfully","time":"2024-08-08T14:28:18Z","message":"Sync summary"}
    
  3. I see multiple "table sync finished" messages for the same (client, table) tuple:

    {"level":"info","module":"yc-source","table":"yc_access_bindings_organizationmanager_organizations","client":"org:bpxxxm","resources":1048,"errors":0,"message":"table sync finished"}
    {"level":"info","module":"yc-source","table":"yc_access_bindings_organizationmanager_organizations","client":"org:bpxxxm","resources":1062,"errors":0,"message":"table sync finished"}
    

So, what should I really use to collect metrics?
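
For context, here is a minimal sketch of what my wrapper does for step 1 (the JSON field names mirror the log records above; everything else, including reading from stdin, is illustrative):

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "os"
)

// logRecord captures only the fields the wrapper cares about.
type logRecord struct {
    Message   string `json:"message"`
    Table     string `json:"table"`
    Client    string `json:"client"`
    Resources int64  `json:"resources"`
    Errors    int64  `json:"errors"`
}

func main() {
    var totalResources, totalErrors int64
    scanner := bufio.NewScanner(os.Stdin) // plugin log lines piped to stdin
    for scanner.Scan() {
        var rec logRecord
        if err := json.Unmarshal(scanner.Bytes(), &rec); err != nil {
            continue // skip non-JSON lines
        }
        if rec.Message == "table sync finished" {
            totalResources += rec.Resources
            totalErrors += rec.Errors
        }
    }
    fmt.Printf("resources=%d errors=%d\n", totalResources, totalErrors)
}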


Plugin-sdk version for source plugin: v4.54.0
CLI version: v5.18.0

Hi @valid-lab,

The best way would be to use our OpenTelemetry integration. You can find the documentation here: OpenTelemetry Integration.

For ongoing syncs, you can use --tables-metrics-location. More information can be found here: Identifying Slow Tables.

You’ll need the latest CLI and SDK to have the full set of traces, metrics, and logs we send via OpenTelemetry.
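
For example, the source spec can point the sync at an OpenTelemetry collector roughly like this (see the linked docs for the exact fields; the endpoint below is just a placeholder for your collector):

kind: source
spec:
  name: "yc"
  registry: grpc
  path: "[::1]:7777"
  otel_endpoint: "localhost:4318" # placeholder: OTLP endpoint of your collector
  destinations: ['postgres']
  tables: ["*"]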

As for the counts not adding up, we would need to see the spec file to try and reproduce the issue.

kind: source
spec:
  name: "yc"
  registry: grpc
  path: "[::1]:7777"
  version: Development
  destinations: ['postgres']
  tables: ["*"]
  spec:
    organization_ids:
      - bxxxm
    backoff_retries: 3
---
kind: destination
spec:
  name: "postgres"
  registry: grpc
  path: "[::1]:7778"
  version: Development
  write_mode: "overwrite-delete-stale" # overwrite, overwrite-delete-stale, append
  migrate_mode: forced
  spec:
    connection_string: "${PG_CONNECTION_STRING}"

My wrapper starts the source at :7777 and the destination at :7778.

I tested different destinations and it seems the destination doesn’t affect the counts. Also, with the last open-source version of the AWS plugin, the counters match the CLI stats.

Source: Yandex Cloud Source Plugin Documentation or Yandex Cloud Source Release v1.1.1
Destination: PostgreSQL Destination Plugin Documentation v7.3.5

Hi @valid-lab,

Can you try running the plugin via registry: local? You can build it with go build and pass path: <full-path-to-plugin-binary>.

Running as gRPC, where the plugin is a long-running process, could be related. Also, if this community plugin hasn’t been updated to the latest SDK, it won’t have OpenTelemetry support.
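
For example, the source part of the spec would become something like this (the path below is a placeholder for wherever the built binary lives):

kind: source
spec:
  name: "yc"
  registry: local
  path: "/full/path/to/cq-source-yc" # placeholder: binary produced by go build
  destinations: ['postgres']
  tables: ["*"]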

Hi! Sure, I’ll try it.

Hi again! I’ve debugged the issue.

So apparently the CLI deduplicates records sent by the source plugin, but the source can still send duplicate events. In my case this happened because the table resolver was started twice for each table client (my multiplex function returned the same client twice), so the logs contained two "table sync finished" records for the same table.

My faulty sort + dedup implementation:

https://github.com/yandex-cloud/cq-source-yc/blob/v1.1.1/client/resourcehierarchy.go#L235-L239
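
A correct sort + dedup boils down to something like this (simplified sketch using string IDs; not the exact code from the repo):

package client

import "slices"

// dedupClientIDs sorts the multiplexed client IDs and drops adjacent
// duplicates, so the table resolver runs exactly once per (client, table).
func dedupClientIDs(ids []string) []string {
    slices.Sort(ids)           // make duplicates adjacent
    return slices.Compact(ids) // remove consecutive duplicates in place
}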

I remember that at some point the CLI or plugin would print a warning in that case, something like "resolver started for the same client more than one time".

The warning is still there: link to code.

Yeah, we collect metrics in a map by client ID, so if there’s a duplicate client ID, it breaks the metrics. Here is the relevant code.
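
As a rough illustration (not the actual SDK code), a duplicate client ID collapses what should be two entries in the per-client metrics map into one:

package main

import "fmt"

func main() {
    // Per-client metrics keyed by client ID (simplified illustration).
    metrics := map[string]int64{}
    runs := []struct {
        clientID  string
        resources int64
    }{
        {"org:bpxxxm", 1048},
        {"org:bpxxxm", 1062}, // same client ID returned twice by the multiplexer
    }
    for _, r := range runs {
        metrics[r.clientID] = r.resources // second run overwrites the first
    }
    fmt.Println(metrics) // map[org:bpxxxm:1062] -- one entry instead of two
}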

I looked at the other schedulers; it seems the warning is printed only for the DFS one?
I currently use shuffle, which is probably why I don’t see it in my logs.

Ah, you’re right. We should add that warning to the other schedulers. I’m happy to accept a contribution, or you can open a bug report via GitHub Issues.

Thanks for the help! Hopefully, I will send a PR. :smiley: