SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Makeshift Checksum

_Dylan_Aug 7 2020 — edited Aug 11 2020

I came across a query at work today. It's trying to see if two sets contain the same IDs. It's doing this by calculating the standard deviation and average of the IDs in each set. Any set with the same SD/Avg are considered a match. It seems to be working, but it doesn't feel like a bullet-proof method to compute a checksum (we're on Oracle 12.2.0.1 and I'm not aware of a checksum-like function in this version, but could be wrong).

I started to see if I could find example values that create false matches. The things I know are:

Order within the set does't matter.
ID will be a positive integer from 1 to ~27,000
A set will contain anywhere from 1 to ~1,200 IDs
The universe of all possible sets is less than the sum-total of all combinations of all IDs, of all set sizes. In reality, IDs are unique to an entity, and each set belongs to a single entity. The largest entity has about 1,200 IDs and can have some/all of those IDs in a set.

It seems difficult/impossible to cause a collision with a set of 2 numbers. But as the sets grow, it seems easier to get values to converge. I'm wondering if:

anyone has seen this approach used to checksum a column.
there's an existing function/method that actually does a checksum.
there's a sql/plsql approach that could actually to predict sets of IDs that have collisions (and not take years).

I know of other SQL approaches to verify the sets match and not worried about that. I'm more focused in figuring out if/when the existing approach will have issues. I was thinking about trying to checking procedurally, or mathematically to try an backtrack into a formula that shows a breakdown somewhere.

-- Set 1: (1,2,3,4,5,6,7,8,9,10) | Avg = 5.5, SD = 3.0276503540974916654225328097181936992
-- Set 2: (2,3,4,5,6,7,8,9) | Avg = 5.5, SD = 2.44948974278317809819728407470589139197
SELECT
    grp,
    AVG(id),
    STDDEV(id)
FROM
    (
        SELECT
            1 AS grp,
            level AS id
        FROM
            dual
        CONNECT BY
            level <= 10
        UNION ALL
        SELECT
            2 AS grp,
            level + 1 AS id
        FROM
            dual
        CONNECT BY
            level <= 8
    )
GROUP BY
    grp;

This post has been answered by mathguy on Aug 8 2020

Jump to Answer

Added on Aug 7 2020

19 comments

1,352 views