Running a CloudQuery plugin recursively for large dataset retrieval

Can we run the destination CloudQuery plugin recursively? For example, the goal is to fetch data via an API and save it into a database, and the dataset has more than 10,000,000 items. The API returns 5,500 items at a time via pagination.

Hi @natural-beagle,

Source plugins can sync any number of resources to any destination. For example, a specific resource in the Azure plugin returns about 30,000 items in a single API call and paginates over the rest.

Not sure what you mean by recursively; maybe you could give a more concrete example of the source, destination, and the API data you’re syncing?

Ok, thanks for the reply. Let me give you an example.
I am developing a custom source plugin. I want to save items immediately as they are fetched, and each request returns at most 5,500 items. Overall, there are more than 10,000,000 items.

So to fetch all items, I need to send many requests (10,000,000 / 5,500, so roughly 1,800 of them). Does this make sense so far?

Makes sense. You should send several requests and send each response over the channel. See an example of pagination in this GitHub link.

The code will vary depending on how pagination works with the API you’re using. Each group of 5,500 items will be streamed to the destination to be written without blocking the source.

Sorry, I’m new to Go.
JavaScript is the language I know well.

Ah cool, are you using the JS SDK?

Yes, I am building with JS. Is there any other example?

Sure, you can see the example plugin in our JS SDK. It works in a similar way to the Go one; you get a stream instance you can write plain JS objects to.

Here’s the link to the code.

So for each object on each page, you would need to do a

stream.write(obj)
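
Strung together, the loop might look roughly like this (just a sketch; fetchPage is a stand-in for however you call your API and track the next page):

// fetchPage is a hypothetical helper wrapping one API request; each call
// returns up to 5,500 items plus a token for the next page (or undefined).
let nextPage: string | undefined = undefined;
do {
  const { items, nextPageToken } = await fetchPage(nextPage);
  for (const obj of items) {
    stream.write(obj); // every object is streamed to the destination
  }
  nextPage = nextPageToken;
} while (nextPage !== undefined);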

Does that help?

Thanks so much for your help.
I’d like to reach out again if I run into issues in the future, too.

Hello @erez,
Sorry to bother you.
Can you help me please?

Hi :wave: Not bothering, it’s all good.

:smiling_face_with_three_hearts: Thanks
Still, I don’t understand how to use the stream.
I don’t see how it would work in my case.

Can you share some code? Or maybe the repo with the plugin’s code?

// pluginClient is a module-level object that holds the plugin state
const newClient: NewClientFunction = async (
  logger,
  spec,
) => {
  pluginClient.allTables = [];
  pluginClient.spec = parseSpec(spec);
  pluginClient.client = { id: () => "cve-sync" };
  // getTables currently fetches and saves the data itself
  await getTables(logger, pluginClient.spec);

  return pluginClient;
};

pluginClient.plugin = newPlugin("cve-sync", version, newClient, {
  kind: "source",
  team: "cloudquery",
});

return pluginClient.plugin;

I am saving data in the getTables function and setting pluginClient.allTables to an empty array.

So you should set pluginClient.allTables to an array of the tables the plugin supports, and implement the sync method to handle getting data:

Line 62 of memdb.ts

Pagination should happen inside each table resolver:

Line 15 of tables.ts

The table resolver API receives the stream instance that is used to pass data to the destination. See memdb.ts for an example of how to set up a plugin and tables.ts for an example of how to define tables.
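
For your plugin, that might look roughly like this (a sketch only; cveResolver, fetchCvePages, and the exact resolver signature are my assumptions, so double-check against tables.ts):

// fetchCvePages is a hypothetical async generator that yields one page
// (up to 5,500 items) per API request until there are no more pages.
const cveResolver = async (client, parent, stream) => {
  for await (const page of fetchCvePages()) {
    for (const item of page) {
      stream.write(item); // each row goes to the destination via the stream
    }
  }
};

Then pluginClient.allTables should be set to the table(s) that use this resolver (see tables.ts for how the table objects are built) instead of being left empty, and getTables wouldn’t need to save data itself.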

Thanks so much @erez.
I figured it out.
Now I understand what to do.
Would you check this, please?
I am saving items one at a time.
Can I save as an array?
@erez

This looks good. To save an array of items, you can use:

for (const item of items) {
  stream.write(item);
}

Is that the only way?

Yes, or items.forEach(). Any kind of loop you’d like.

Yes, yes.
So, we can’t do batch saving?

So the stream accepts a single item at a time, but the data is saved in batches at the destination. Each destination can optimize the batching based on its capabilities, so it shouldn’t matter much from a performance or memory standpoint whether the stream could accept an array.