Subject: Performance Issue with CONNECT BY for Finding Connected Components (Oracle 10g)
Hello all,
I'm facing a significant performance issue when trying to identify connected groups of PERSON_CODEs using a recursive query in an Oracle 10g database.
Data Structure
I have two main tables:
- PERSON: Contains personal data (name, DOB, etc.). There are multiple records (with different PERSON_CODEs) that can belong to the same physical person.
- PERSON_RELATIONS (approx. 250k rows): Stores the relationships between PERSON_CODEs that belong to the same person.
The relationship is bidirectional (if A is related to B, then B is related to A) and transitive (if A is related to B, and B is related to C, then A, B, and C all belong to the same group). This represents a connected component or a "friendship-style" network.
The Goal
For every PERSON_CODE, I need to find the unique identifier for its entire connected group. I'm choosing the lowest (minimum) PERSON_CODE within that group as the group identifier (GROUP_CODE).
Example, input (real example):
`PERSON_CODE PERSON_CODE_REL`
`BTE/20065/2017 BTE/8073/2012`
`BTE/20065/2017 BTE/7912/2012`
`BTE/20065/2017 BTE/21223/2017`
`BTE/7912/2012 BTE/8073/2012`
`BTE/21223/2017 BTE/8073/2012`
`BTE/21223/2017 BTE/7912/2012`
Output:
`PERSON_CODE_START GROUP_CODE`
`BTE/20065/2017 BTE/20065/2017`
`BTE/8073/2012 BTE/20065/2017`
`BTE/21223/2017 BTE/20065/2017`
`BTE/7912/2012 BTE/20065/2017`
Current Query (Using CONNECT BY)
My current approach uses a CONNECT BY clause to find all related codes, followed by a MIN() aggregation.
SQL
`WITH PersonCodes AS`
`(`
`-- As the relationship es bidirectional, we have to union both 'ways': PERSON_CODE -> PERSON_CODE_REL and PERSON_CODE_REL -> PERSON_CODE`
`SELECT PERSON_CODE, PERSON_CODE_REL FROM PERSON_RELATIONS`
`WHERE PERSON_CODE IN ('BTE/20065/2017','BTE/7912/2012','BTE/21223/2017','BTE/8073/2012') OR PERSON_CODE_REL IN ('BTE/20065/2017','BTE/7912/2012','BTE/21223/2017','BTE/8073/2012')`
`UNION`
`SELECT PERSON_CODE_REL AS PERSON_CODE, PERSON_CODE AS PERSON_CODE_REL FROM PERSON_RELATIONS`
`WHERE PERSON_CODE IN ('BTE/20065/2017','BTE/7912/2012','BTE/21223/2017','BTE/8073/2012') OR PERSON_CODE_REL IN ('BTE/20065/2017','BTE/7912/2012','BTE/21223/2017','BTE/8073/2012')`
`),`
`ConnectedCodes AS (`
`-- Get all related PERSON_CODE_REL for each PERSON_CODE`
`SELECT DISTINCT`
`CONNECT_BY_ROOT PERSON_CODE AS PERSON_CODE_START,`
`PERSON_CODE_REL AS GROUP_CODE`
`FROM`
`PersonCodes`
`START WITH PERSON_CODE IS NOT NULL`
`CONNECT BY NOCYCLE`
`PRIOR PERSON_CODE_REL = PERSON_CODE`
`)`
`SELECT PERSON_CODE_START, MIN(GROUP_CODE) AS GROUP_CODE`
`FROM ConnectedCodes`
`GROUP BY PERSON_CODE_START;`
The Problem:
The query works perfectly for small subsets of data (e.g., up to 1,000 relationship rows). However, when running it against the full 250,000 rows in PERSON_RELATIONS, the query hangs indefinitely. The bidirectional and transitive nature seems to cause the CONNECT BY to perform an excessive number of steps.
Since I'm on Oracle 10g, I cannot use Recursive CTEs (WITH RECURSIVE).
Question: Is there a more performant, set-based, or non-hierarchical way in Oracle 10g to efficiently find these fully connected components (groups) to determine the unique MIN(PERSON_CODE) for each member? I've spent a couple of days trying to find a solution with no luck :( Or maybe some out-of-the-box solution that could solve the problem…
Thank you for your help!