Hi CQ team!
I have a question about metrics collection. What is the preferred method of collecting sync stats (resources, errors, etc)?
For a long time, I have used a wrapper for running syncs, which collects metrics by reading the log files/stdout of the CLI and the source plugin:
For the source plugin, it extracts the resources value of every log record with message == "table sync finished" and sums them up, e.g.:
{"level":"info","module":"yc-source","table":"yc_access_bindings_organizationmanager_organizations","client":"org:bpxxxm","resources":1048,"errors":0,"message":"table sync finished"}
This summed value does not equal the one in the CLI's final log record, e.g.:
{"level":"info","module":"cli","resources":115030,"errors":0,"warnings":0,"duration":"1m54s","result":"Sync completed successfully","time":"2024-08-08T14:28:18Z","message":"Sync summary"}
I see multiple "table sync finished" messages for the same (client, table) tuple:
{"level":"info","module":"yc-source","table":"yc_access_bindings_organizationmanager_organizations","client":"org:bpxxxm","resources":1048,"errors":0,"message":"table sync finished"}
{"level":"info","module":"yc-source","table":"yc_access_bindings_organizationmanager_organizations","client":"org:bpxxxm","resources":1062,"errors":0,"message":"table sync finished"}
So, what should I really use to collect metrics?
Plugin-sdk version for source plugin: v4.54.0
CLI version: v5.18.0
erez
August 9, 2024, 2:07pm
Hi @valid-lab ,
The best way would be to use our OpenTelemetry integration. You can find the documentation here: OpenTelemetry Integration.
For ongoing syncs, you can use --tables-metrics-location. More information can be found here: Identifying Slow Tables.
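For example, you would pass it when running the sync, something like this (the spec and output paths here are just placeholders; see the linked docs for the exact usage):

cloudquery sync ./config.yml --tables-metrics-location ./metrics.txt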
You’ll need the latest CLI and SDK to have the full set of traces, metrics, and logs we send via OpenTelemetry.
As for the counts not adding up, we would need to see the spec file to try and reproduce the issue.
kind: source
spec:
  name: "yc"
  registry: grpc
  path: "[::1]:7777"
  version: Development
  destinations: ['postgres']
  tables: ["*"]
  spec:
    organization_ids:
      - bxxxm
    backoff_retries: 3
---
kind: destination
spec:
  name: "postgres"
  registry: grpc
  path: "[::1]:7778"
  version: Development
  write_mode: "overwrite-delete-stale" # overwrite, overwrite-delete-stale, append
  migrate_mode: forced
  spec:
    connection_string: "${PG_CONNECTION_STRING}"
My wrapper starts the source at :7777 and the destination at :7778.
I tested different destinations and it doesn't seem to affect the counts. Also, with the last open-source AWS plugin version, the counters match the CLI stats.
Source: Yandex Cloud Source Plugin Documentation or Yandex Cloud Source Release v1.1.1
Destination: PostgreSQL Destination Plugin Documentation v7.3.5
erez
August 9, 2024, 2:59pm
Hi @valid-lab ,
Can you try running the plugin via registry: local? You can go build it and pass path: <full-path-to-plugin-binary>.
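For example, the source part of your spec would then look something like this (the binary path is a placeholder for wherever you built it):

kind: source
spec:
  name: "yc"
  registry: local
  path: "/full/path/to/cq-source-yc"
  destinations: ['postgres']
  tables: ["*"]
  spec:
    organization_ids:
      - bxxxm
    backoff_retries: 3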
Running via gRPC, where the plugin is a long-running process, could be related. Also, if this community plugin hasn't updated to the latest SDK, it won't have OpenTelemetry support.
Hi! Sure, I’ll try it.
Hi again! I’ve debugged the thing.
So apparently the CLI deduplicates records sent by the source plugin, but the source can still send duplicate events. In my case, this happened because the table resolver was started twice for each table client (my multiplex function returned the same client twice), so the logs contained two records for the same table.
My faulty sort + dedup implementation:
https://github.com/yandex-cloud/cq-source-yc/blob/v1.1.1/client/resourcehierarchy.go#L235-L239
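For illustration only (this is not the actual code from the link above), the fix boils down to a plain sort-and-compact over the client IDs so the multiplexer returns each client exactly once:

package main

import (
    "fmt"
    "sort"
)

// dedup sorts the IDs and drops consecutive duplicates in place.
func dedup(ids []string) []string {
    sort.Strings(ids)
    out := ids[:0]
    for i, id := range ids {
        if i == 0 || id != ids[i-1] {
            out = append(out, id)
        }
    }
    return out
}

func main() {
    fmt.Println(dedup([]string{"org:bpxxxm", "org:bpxxxm", "org:other"}))
    // Output: [org:bpxxxm org:other]
}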
I remember that the CLI or plugin used to print a warning in that case, something like "resolver started for the same client more than one time".
erez
August 13, 2024, 8:40am
The warning is still there: link to code
erez
August 13, 2024, 8:41am
Yeah, we collect metrics in a map by client ID, so if there’s a duplicate client ID, it breaks the metrics. Here is the relevant code.
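As a made-up toy example (not the actual SDK code), a duplicate client ID collapses into a single map entry, so the aggregated totals no longer match what the per-table log lines report:

package main

import "fmt"

type tableMetrics struct{ Resources int }

func main() {
    metrics := map[string]*tableMetrics{}
    runs := []struct {
        client    string
        resources int
    }{
        {"org:bpxxxm", 1048},
        {"org:bpxxxm", 1062}, // duplicate client ID from the multiplexer
    }
    for _, r := range runs {
        // Keyed by client ID: the second write replaces the first.
        metrics[r.client] = &tableMetrics{Resources: r.resources}
    }
    fmt.Println(len(metrics), metrics["org:bpxxxm"].Resources) // 1 1062
}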
I looked at the other schedulers; it seems the warning is printed only for the DFS one?
I currently use shuffle, which is probably why I don't see it in the logs.
erez
August 13, 2024, 8:46am
Ah, you're right. We should add that warning to the other schedulers. I'm happy to accept a contribution, or you can open a bug report via GitHub Issues.
Thanks for the help! Hopefully, I will send a PR.