Concurrency on child tables in the Go SDK depends on how you insert into the result channel

Hey team,

I’ve been working on a plugin recently, looking at the traces produced with OpenTracing, and noticed something.

It seems the concurrency on the child tables is not the same when adding parent items one by one versus in one big array on the result channel.

I’ve set up a repository to show the behavior: GitHub - jeromewir/cq-source-concurrency-childtable-example

When adding an array of 10 items to the channel directly, all items are processed in parallel (up to the child table concurrency setting), but when adding items one at a time, they are processed synchronously, waiting for all the child relations of one item to complete before processing the next.

I’ve tried buffering the items before adding them to the channel as a workaround, but this isn’t great: the slowest item in a batch makes all the other items wait.
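To make it concrete, here is roughly the shape of the two resolver variants I’m comparing (the type and fetch helper are made up for this example, and I’m assuming the v4 table resolver signature with a `chan<- any` result channel):

```go
// Sketch of the two resolver shapes discussed above. fetchItems is a
// hypothetical helper standing in for the real API call.
package mytables

import (
	"context"

	"github.com/cloudquery/plugin-sdk/v4/schema"
)

type Item struct{ ID string }

func fetchItems(ctx context.Context) ([]Item, error) { /* call your API here */ return nil, nil }

// Variant A: send items one at a time. With the DFS scheduler each send
// effectively waits until that item's child tables finish resolving.
func resolveOneByOne(ctx context.Context, meta schema.ClientMeta, parent *schema.Resource, res chan<- any) error {
	items, err := fetchItems(ctx)
	if err != nil {
		return err
	}
	for _, item := range items {
		res <- item
	}
	return nil
}

// Variant B: send the whole slice. The scheduler fans the elements out
// to child table workers concurrently (up to the concurrency setting).
func resolveAsSlice(ctx context.Context, meta schema.ClientMeta, parent *schema.Resource, res chan<- any) error {
	items, err := fetchItems(ctx)
	if err != nil {
		return err
	}
	res <- items
	return nil
}
```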

Thank you.

Hi @jeromewir.

Your observation is correct, and thanks for the reproduction. I believe the reason is the code here: plugin-sdk/scheduler/scheduler_dfs.go at 698c50f7a942b45e4e7055a12bcc9e148a78661d · cloudquery/plugin-sdk · GitHub

and then here: plugin-sdk/scheduler/scheduler_dfs.go at 698c50f7a942b45e4e7055a12bcc9e148a78661d · cloudquery/plugin-sdk · GitHub

We use an unbuffered channel to receive resources to resolve, so until an item (single or array) finishes resolving, the next one won’t be processed.
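As a tiny standalone illustration (not SDK code) of why that serializes the work:

```go
// Standalone demo: with an unbuffered channel, each send blocks until the
// receiver is ready again, i.e. until it has finished handling the previous
// value, so a slow consumer keeps the producer in lockstep with it.
package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan int) // unbuffered

	go func() {
		for v := range ch {
			time.Sleep(200 * time.Millisecond) // simulate resolving child tables
			fmt.Println("resolved", v)
		}
	}()

	start := time.Now()
	for i := 0; i < 5; i++ {
		ch <- i // blocks until the consumer is done with the previous item
	}
	close(ch)
	time.Sleep(300 * time.Millisecond) // let the last item finish
	fmt.Println("elapsed:", time.Since(start))
}
```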

Usually that’s not an issue because:

  1. Most APIs that plugins use return arrays, so there’s no need to buffer on the plugin side
  2. A common sync has more than one top-level table, so the overall sync time stays the same

A potential improvement is to pass the strategy option to the scheduler with the shuffle-queue value (plugin-sdk/scheduler/strategy.go at 698c50f7a942b45e4e7055a12bcc9e148a78661d · cloudquery/plugin-sdk · GitHub).

That strategy uses a different concurrency pattern and should not wait for the child table to fully resolve before moving on to the next item.
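If it helps, wiring it up looks roughly like this (option and constant names are from memory, so double-check them against the strategy.go file linked above for your SDK version):

```go
// Sketch of constructing the plugin's scheduler with the shuffle-queue
// strategy. Verify the exact option/constant names in scheduler/strategy.go.
package client

import (
	"github.com/cloudquery/plugin-sdk/v4/scheduler"
)

func newScheduler() *scheduler.Scheduler {
	return scheduler.NewScheduler(
		scheduler.WithStrategy(scheduler.StrategyShuffleQueue),
	)
}
```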

Please let me know if that helps, and also if you can share where you are sourcing data from; that could help us optimize the plugin further.

Hi @erez,

Thank you for the quick response!
I’ve tested the shuffle-queue strategy and it did work. It seems this strategy doesn’t have the same limitation as the DFS one, does it?

I was actually using the Microsoft Graph SDK; it provides a paginate function where you get only one item at a time (GitHub - microsoftgraph/msgraph-sdk-go: Microsoft Graph SDK for Go).

To be honest, I didn’t expect the behavior to differ between passing an array and passing a single item, so this surprised me and I thought it could be a bug.
I think having some documentation or a comment could be helpful for plugin developers.

I’ve now updated my code to iterate manually over the results, and I’m getting much faster syncs for tables that have relations with the DFS scheduler.
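For reference, this is roughly the shape of the change (the pagination helper is hypothetical and stands in for the msgraph-sdk-go calls; the point is sending each page as a slice instead of one item at a time):

```go
// Roughly what I ended up doing: walk the pages myself and push each page's
// slice of items to the result channel. listUsersPage is a hypothetical
// helper wrapping the Graph call; it returns one page plus a next-page token.
package mytables

import (
	"context"

	"github.com/cloudquery/plugin-sdk/v4/schema"
)

type User struct{ ID string }

// Hypothetical pagination helper; replace with your msgraph-sdk-go calls.
func listUsersPage(ctx context.Context, pageToken string) ([]User, string, error) {
	return nil, "", nil
}

func resolveUsers(ctx context.Context, meta schema.ClientMeta, parent *schema.Resource, res chan<- any) error {
	token := ""
	for {
		page, next, err := listUsersPage(ctx, token)
		if err != nil {
			return err
		}
		if len(page) > 0 {
			res <- page // one slice per page, so child tables resolve concurrently
		}
		if next == "" {
			return nil
		}
		token = next
	}
}
```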

Thank you for the great work on CloudQuery :pray:

Agreed, this can be considered a bug. We could fix it by making the channel buffered (say, size 20), but then if someone passes multiple arrays (e.g. each of size 100), that could increase memory usage.

The shuffle-queue scheduler uses a worker-queue pattern, so each child table does not resolve resources immediately; instead, resources are inserted into a queue (which is fast) and later picked up for resolving by a free worker.
This pattern distributes the work more efficiently.
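As a simplified standalone illustration of that pattern (not the actual scheduler code):

```go
// Simplified worker-queue illustration: producers enqueue resources cheaply,
// and a fixed pool of workers picks them up and resolves them.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const workers = 4
	queue := make(chan int, 64) // enqueueing is fast and rarely blocks

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for item := range queue {
				time.Sleep(100 * time.Millisecond) // simulate resolving a child resource
				fmt.Printf("worker %d resolved item %d\n", id, item)
			}
		}(w)
	}

	// The "resolver" just pushes work and moves on.
	for i := 0; i < 10; i++ {
		queue <- i
	}
	close(queue)
	wg.Wait()
}
```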

:100: I’ll add that to our docs

Hi @jeromewir, instead of adding the limitation to our docs, we fixed it in the SDK. See Release v4.71.0 · cloudquery/plugin-sdk · GitHub.

That should make it so table resolvers don’t need to batch items to benefit from concurrent resolving.
