CloudQuery RDS latency issues with joins exploring views and MongoDB switch possibilities

composed-ringtail · January 19, 2024, 5:24pm

Hello All,

I need your inputs on this.

Currently, we are using RDS (PostgresDB) and running into latency issues as we are doing joins with multiple tables. I assume that VIEWS could help in this case, but I wanted to check with you before we take that route. If it does, do you have any examples on how to implement VIEWS on Postgres for CloudQuery?
If we have to switch our DB from Postgres to MongoDB (NoSQL), do you think we will be in better shape regarding the latency issues?

ben · January 19, 2024, 5:29pm

There are multiple ways you can approach your latency issues and both databases can be configured to support your use cases. That being said, if you are trying to join multiple tables, your best bet would probably be Postgres. If you could share your queries that you are seeing issues with, I can give some more specific suggestions.

composed-ringtail · January 19, 2024, 9:57pm

Thanks @ben. This is one of the queries we are doing for compute resources to fetch the data.

ben · January 19, 2024, 10:17pm

So looking at this impressive query over…

I have some suggestions:

I would first try to update the query to use the existing predicates in as many of the joins and selects as possible. This will help the Postgres query planner to interact with as little data as possible, as it will be able to exclude irrelevant data.
Add indexes to the columns you are using to do your joins. This will help the query time by enabling PG to skip doing an entire table scan of every table to grab the data you are looking for.
Depending on the size of the RDS instance, I would consider upgrading it to a larger machine. I would make sure that you are using an instance size that enables Postgres to keep your data in memory for fast processing.

I could make even more specific recommendations if you were to share the result of the EXPLAIN ANALYZE <sql query> where the <sql query> is the query you shared above.

Here is an example of the updated query I suggested in (1):

composed-ringtail · January 22, 2024, 2:35pm

Thanks @ben for the inputs. Let me try it out and will let you know.

Hi @ben, is the attached the analyze plan for the above query?

ben · January 22, 2024, 5:13pm

How long is the entire query taking to run currently?

composed-ringtail · January 22, 2024, 5:18pm

7 row(s) fetched - 0.146s, on 2024-01-22 at 11:18:23
I am using DBeaver

ben · January 22, 2024, 5:20pm

Is 0.146s creating latency issues or does that represent an improvement over the initial query?

composed-ringtail · January 22, 2024, 5:48pm

A bit of improvement for sure. Do you think VIEWS will help for us in this case?

ben · January 22, 2024, 5:50pm

Views will not help performance unless you use materialized views. Views are just wrappers around the SQL statement, while materialized views are actually persisted to the disk. If you decide to use materialized views, you will have to be aware that you will need to refresh the view after every sync.

I have an even more optimized version of that query where it removes some of the duplicate table scans:

SELECT EC2.ACCOUNT_ID,
    EC2._CQ_SOURCE_NAME,
    EC2.TAGS,
    EC2.INSTANCE_ID,
    EC2.PRIVATE_IP_ADDRESS,
    EC2.PUBLIC_IP_ADDRESS,
    EC2.SECURITY_GROUPS,
    EC2.NETWORK_INTERFACES,
    EC2.GROUP_INFO ->> 'GroupId' AS SECURITY_GROUP_ID,
    SEC_GROUP.GROUP_NAME AS SECURITY_GROUP_NAME,
    SEC_GROUP.IP_PERMISSIONS AS SECURITY_GROUP_IP_PERMISSIONS,
    I_I_P.INSTANCE_PROFILE_NAME AS INSTANCE_PROFILE_NAME,
    EC2.REGION,
    I_I_P.ROLE_INFO ->> 'RoleId' AS IAM_ROLE_ID,
    I_I_P.ROLE_INFO ->> 'Arn' AS IAM_ROLE_ARN,
    I_I_P.ROLE_INFO ->> 'RoleName' AS IAM_ROLE_NAME,
    IAM_ROLE_ATTACHED_POLICIES.POLICY_ARN AS POLICY_ARN,
    IAM_ROLE_POLICIES.POLICY_DOCUMENT AS POLICY_DOCUMENT,
    AWS_IAM_POLICIES.POLICY_VERSION_LIST AS POLICY_VERSION_LIST
FROM
    (SELECT AWS_EC2_INSTANCES.*, JSONB_ARRAY_ELEMENTS(SECURITY_GROUPS) AS GROUP_INFO FROM AWS_EC2_INSTANCES WHERE account_id=':accountId') AS EC2
JOIN AWS_EC2_SECURITY_GROUPS AS SEC_GROUP ON (EC2.GROUP_INFO ->> 'GroupId')::text = SEC_GROUP.GROUP_ID 
LEFT JOIN
    (SELECT AWS_IAM_INSTANCE_PROFILES.*, JSONB_ARRAY_ELEMENTS(AWS_IAM_INSTANCE_PROFILES.ROLES) AS ROLE_INFO FROM AWS_IAM_INSTANCE_PROFILES WHERE account_id=':accountId') AS I_I_P ON (EC2.IAM_INSTANCE_PROFILE ->> 'Id')::text = I_I_P.ID AND I_I_P.account_id = EC2.account_id
LEFT JOIN AWS_IAM_ROLE_ATTACHED_POLICIES AS IAM_ROLE_ATTACHED_POLICIES ON I_I_P.ROLE_INFO ->> 'Arn' = IAM_ROLE_ATTACHED_POLICIES.ROLE_ARN
LEFT JOIN AWS_IAM_ROLE_POLICIES AS IAM_ROLE_POLICIES ON I_I_P.ROLE_INFO ->> 'Arn' = IAM_ROLE_POLICIES.ROLE_ARN
LEFT JOIN AWS_IAM_POLICIES ON IAM_ROLE_ATTACHED_POLICIES.POLICY_ARN = AWS_IAM_POLICIES.ARN
WHERE AWS_EC2_INSTANCES.PUBLIC_IP_ADDRESS IS NOT NULL

Just so you are aware, if you are running DBeaver locally and running the query against the RDS instance, then the latency of the actual query could be much lower as that 0.146s value might well be including the round trip network time. You can check the latency between your machine and different AWS regions here: AWS Latency Test

composed-ringtail · January 22, 2024, 5:58pm

Yes, I agree. But the above query you sent is returning more records than the previous one.

ben · January 22, 2024, 5:58pm

Oh, okay, I can take a look at it and see what is going on in a bit…

Sorry that query left out the WHERE AWS_EC2_INSTANCES.PUBLIC_IP_ADDRESS IS NOT NULL (we don’t have any publicly available instances to test on). I have updated the query to add it back.

composed-ringtail · January 22, 2024, 7:37pm

Thanks for the query. You are really great with it and we will have to revisit all of our queries. So, I have a couple of questions:

Fixing the queries should improve our latency issues. We have a lot of queries.
Having views doesn’t really help. (Please correct me if I am wrong.)
Having indexes on the columns we fetch will help. (Do you have examples or documentation on how to do it?)
Moving to MongoDB? (Does this help to fetch the records from various documents?)

ben · January 22, 2024, 7:47pm

Yes - altering your queries to reduce duplicate actions like sequential table scans will result in more efficient and performant queries.
Having regular Views will do nothing for performance; having Materialized Views can definitely help, but comes at the cost of keeping the indexes refreshed after every sync.
Adding indexes on columns you are using to filter or join tables should help with performance, as Postgres will not have to read all of the data, but will be able to just grab the data it needs.
From what I have gathered about your use case, MongoDB will not be a magical solution; it will have its own set of optimizations and tunings that will be required.

If you need more in-depth customized support, we offer support plans that might help you and your team! https://www.cloudquery.io/pricing

certain-alpaca · January 24, 2024, 2:01pm

@composed-ringtail Just out of curiosity, what are you trying to get here? Data related to the roles attached to the EC2 instances exposed to the internet?

Topic		Replies	Views
Issue with synchronizing large data sources in CloudQuery CloudQuery Plugins	8	40	June 24, 2024
CloudQuery sync slow and failing with aws_inspector_findings and aws_inspector2_findings tables CloudQuery Plugins	2	5	March 13, 2024
How to reduce cloudquery sync time for multiple aws accounts CloudQuery Plugins	14	27	September 28, 2023
CloudQuery sync time for AWS data in PG database on EKS CloudQuery Plugins	24	39	October 23, 2023
CloudQuery AWS Postgres need help with fetching data without SQL CloudQuery Plugins	1	4	January 8, 2024

CloudQuery RDS latency issues with joins exploring views and MongoDB switch possibilities

Related topics