I came across a query at work today. It's trying to see if two sets contain the same IDs. It's doing this by calculating the standard deviation and average of the IDs in each set. Any set with the same SD/Avg are considered a match. It seems to be working, but it doesn't feel like a bullet-proof method to compute a checksum (we're on Oracle 12.2.0.1 and I'm not aware of a checksum-like function in this version, but could be wrong).
I started to see if I could find example values that create false matches. The things I know are:
- Order within the set does't matter.
- ID will be a positive integer from 1 to ~27,000
- A set will contain anywhere from 1 to ~1,200 IDs
- The universe of all possible sets is less than the sum-total of all combinations of all IDs, of all set sizes. In reality, IDs are unique to an entity, and each set belongs to a single entity. The largest entity has about 1,200 IDs and can have some/all of those IDs in a set.
It seems difficult/impossible to cause a collision with a set of 2 numbers. But as the sets grow, it seems easier to get values to converge. I'm wondering if:
- anyone has seen this approach used to checksum a column.
- there's an existing function/method that actually does a checksum.
- there's a sql/plsql approach that could actually to predict sets of IDs that have collisions (and not take years).
I know of other SQL approaches to verify the sets match and not worried about that. I'm more focused in figuring out if/when the existing approach will have issues. I was thinking about trying to checking procedurally, or mathematically to try an backtrack into a formula that shows a breakdown somewhere.
-- Set 1: (1,2,3,4,5,6,7,8,9,10) | Avg = 5.5, SD = 3.0276503540974916654225328097181936992
-- Set 2: (2,3,4,5,6,7,8,9) | Avg = 5.5, SD = 2.44948974278317809819728407470589139197
SELECT
grp,
AVG(id),
STDDEV(id)
FROM
(
SELECT
1 AS grp,
level AS id
FROM
dual
CONNECT BY
level <= 10
UNION ALL
SELECT
2 AS grp,
level + 1 AS id
FROM
dual
CONNECT BY
level <= 8
)
GROUP BY
grp;