Issue with collecting sync metrics from CloudQuery logs

Hi CQ team!

I have a question about metrics collection. What is the preferred method of collecting sync stats (resources, errors, etc.)?

For a long time I have used a wrapper for running syncs, which collects metrics by reading the log files/stdout of the CLI and the source plugin:

  1. For the source plugin, it extracts the resources value from every log record with message == "table sync finished" and sums them up (a simplified sketch of this parsing is included below), e.g.

    {"level":"info","module":"yc-source","table":"yc_access_bindings_organizationmanager_organizations","client":"org:bpxxxm","resources":1048,"errors":0,"message":"table sync finished"}
    
  2. This summed value does not match the total in the CLI’s final "Sync summary" log record, e.g.

    {"level":"info","module":"cli","resources":115030,"errors":0,"warnings":0,"duration":"1m54s","result":"Sync completed successfully","time":"2024-08-08T14:28:18Z","message":"Sync summary"}
    
  3. I see multiple "table sync finished" messages for the same (client, table) tuple:

    {"level":"info","module":"yc-source","table":"yc_access_bindings_organizationmanager_organizations","client":"org:bpxxxm","resources":1048,"errors":0,"message":"table sync finished"}
    {"level":"info","module":"yc-source","table":"yc_access_bindings_organizationmanager_organizations","client":"org:bpxxxm","resources":1062,"errors":0,"message":"table sync finished"}
    

So, what should I really use to collect metrics?
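
For context, here is a minimal sketch of what my wrapper does for step 1 (the JSON field names mirror the log records above; everything else, including reading from stdin, is illustrative):

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "os"
)

// logRecord captures only the fields the wrapper cares about.
type logRecord struct {
    Message   string `json:"message"`
    Table     string `json:"table"`
    Client    string `json:"client"`
    Resources int64  `json:"resources"`
    Errors    int64  `json:"errors"`
}

func main() {
    var totalResources, totalErrors int64
    scanner := bufio.NewScanner(os.Stdin) // plugin log lines piped to stdin
    for scanner.Scan() {
        var rec logRecord
        if err := json.Unmarshal(scanner.Bytes(), &rec); err != nil {
            continue // skip non-JSON lines
        }
        if rec.Message == "table sync finished" {
            totalResources += rec.Resources
            totalErrors += rec.Errors
        }
    }
    fmt.Printf("resources=%d errors=%d\n", totalResources, totalErrors)
}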


Plugin-sdk version for source plugin: v4.54.0
CLI version: v5.18.0

Hi @valid-lab,

The best way would be to use our OpenTelemetry integration. You can find the documentation here: OpenTelemetry Integration.

For ongoing syncs, you can use --tables-metrics-location. More information can be found here: Identifying Slow Tables.

You’ll need the latest CLI and SDK to have the full set of traces, metrics, and logs we send via OpenTelemetry.
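
For example, the source spec can point the sync at an OpenTelemetry collector roughly like this (see the linked docs for the exact fields; the endpoint below is just a placeholder for your collector):

kind: source
spec:
  name: "yc"
  registry: grpc
  path: "[::1]:7777"
  otel_endpoint: "localhost:4318" # placeholder: OTLP endpoint of your collector
  destinations: ['postgres']
  tables: ["*"]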

As for the counts not adding up, we would need to see the spec file to try and reproduce the issue.

kind: source
spec:
  name: "yc"
  registry: grpc
  path: "[::1]:7777"
  version: Development
  destinations: ['postgres']
  tables: ["*"]
  spec:
    organization_ids:
      - bxxxm
    backoff_retries: 3
---
kind: destination
spec:
  name: "postgres"
  registry: grpc
  path: "[::1]:7778"
  version: Development
  write_mode: "overwrite-delete-stale" # overwrite, overwrite-delete-stale, append
  migrate_mode: forced
  spec:
    connection_string: "${PG_CONNECTION_STRING}"

My wrapper starts the source at :7777 and the destination at :7778.

I tested different destinations and it seems the destination doesn’t affect the counts. Also, with the last open-source version of the AWS plugin, the counters match the CLI stats.

Source: Yandex Cloud Source Plugin Documentation or Yandex Cloud Source Release v1.1.1
Destination: PostgreSQL Destination Plugin Documentation v7.3.5

Hi @valid-lab,

Can you try running the plugin via registry: local? You can build it with go build and pass path: <full-path-to-plugin-binary>.

Running as gRPC, where the plugin is a long-running process, could be related. Also, if this community plugin hasn’t been updated to the latest SDK, it won’t have OpenTelemetry support.
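
For example, the source part of the spec would become something like this (the path below is a placeholder for wherever the built binary lives):

kind: source
spec:
  name: "yc"
  registry: local
  path: "/full/path/to/cq-source-yc" # placeholder: binary produced by go build
  destinations: ['postgres']
  tables: ["*"]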

Hi! Sure, I’ll try it.

Hi again! I’ve debugged the issue.

So apparently the CLI deduplicates records sent by the source plugin, but the source can still send duplicate events. In my case this happened because the table resolver was started twice for each table client (my multiplex function returned the same client twice), so the logs contained two "table sync finished" records for the same table.

My faulty sort + dedup implementation:

https://github.com/yandex-cloud/cq-source-yc/blob/v1.1.1/client/resourcehierarchy.go#L235-L239
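
A correct sort + dedup boils down to something like this (simplified sketch using string IDs; not the exact code from the repo):

package client

import "slices"

// dedupClientIDs sorts the multiplexed client IDs and drops adjacent
// duplicates, so the table resolver runs exactly once per (client, table).
func dedupClientIDs(ids []string) []string {
    slices.Sort(ids)           // make duplicates adjacent
    return slices.Compact(ids) // remove consecutive duplicates in place
}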

I remember that at some point the CLI or plugin would print a warning in that case, something like "resolver started for the same client more than one time".

The warning is still there: link to code.

Yeah, we collect metrics in a map by client ID, so if there’s a duplicate client ID, it breaks the metrics. Here is the relevant code.
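
As a rough illustration (not the actual SDK code), a duplicate client ID collapses what should be two entries in the per-client metrics map into one:

package main

import "fmt"

func main() {
    // Per-client metrics keyed by client ID (simplified illustration).
    metrics := map[string]int64{}
    runs := []struct {
        clientID  string
        resources int64
    }{
        {"org:bpxxxm", 1048},
        {"org:bpxxxm", 1062}, // same client ID returned twice by the multiplexer
    }
    for _, r := range runs {
        metrics[r.clientID] = r.resources // second run overwrites the first
    }
    fmt.Println(metrics) // map[org:bpxxxm:1062] -- one entry instead of two
}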

I looked at the other schedulers; it seems the warning is printed only for the DFS one?
I currently use shuffle, which is probably why I don’t see it in my logs.

Ah, you’re right. We should add that warning to the other schedulers. I’m happy to accept a contribution, or you can open a bug report via GitHub Issues.

Thanks for the help! Hopefully, I will send a PR. :smiley: