Best way to remove duplicates based on multiple tables
Pyrocks · Mar 22 2012 (edited Mar 29 2012)

Hi,
I have a mechanism which loads flat files into multiple tables (can be up to 6 different tables) using external tables.
Whenever a new file arrives, I need to divert its duplicate rows into a side table, where duplicates are detected against all 6 tables based on a given set of columns that exists in all of them.
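For context, the load side looks roughly like this (a simplified sketch; the directory object, file name, and columns are invented for illustration):

    -- External table over an arriving flat file (names are placeholders)
    CREATE TABLE staging_ext (
      key_a  VARCHAR2(50),
      key_b  VARCHAR2(50),
      col_c  VARCHAR2(200)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY load_dir
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
      )
      LOCATION ('incoming_file.csv')
    );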
In the SQL Server version of the same mechanism (which I'm migrating to Oracle), an additional "UNIQUE" table with only 2 columns (Checksum1, Checksum2) holds the checksum values of 2 different sets of columns per inserted record. When a new file arrives, it computes these 2 checksums for every record and looks them up in the unique table, avoiding a search across all the different tables.
We know that checksums are not bulletproof, but with those sets of fields the approach seems to work.
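If I went the same route in Oracle, I picture something roughly like this (just a sketch: the unique table, the side table, and the column sets fed to the checksums are all invented, and whether owa_opt_lock.checksum is the right function is exactly what I'm asking):

    -- Lookup table mirroring the SQL Server "UNIQUE" table
    CREATE TABLE unique_checksums (
      checksum1 NUMBER NOT NULL,
      checksum2 NUMBER NOT NULL
    );

    -- Route rows from the new file whose checksum pair already exists
    -- into the duplicates side table (staging_ext is the external table
    -- over the arriving file; the concatenated column sets are invented)
    INSERT INTO duplicates_side
    SELECT s.*
    FROM staging_ext s
    WHERE EXISTS (
      SELECT 1
      FROM unique_checksums u
      WHERE u.checksum1 = owa_opt_lock.checksum(s.key_a || '|' || s.key_b)
        AND u.checksum2 = owa_opt_lock.checksum(s.col_c || '|' || s.key_a)
    );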
My questions are:
Should I use the same checksum mechanism? If so, should I use the owa_opt_lock.checksum function to calculate the checksums?
Or should I look for duplicates in all the tables one after the other (after indexing the columns we check duplicates on), as sketched below?
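The direct-search alternative I have in mind would be something along these lines (again just a sketch; the target tables and key columns are invented, and only 2 of the 6 tables are shown):

    -- A staged row is a duplicate if its key columns already
    -- appear in any of the 6 target tables
    INSERT INTO duplicates_side
    SELECT s.*
    FROM staging_ext s
    WHERE EXISTS (
      SELECT 1 FROM target_1 t WHERE t.key_a = s.key_a AND t.key_b = s.key_b
      UNION ALL
      SELECT 1 FROM target_2 t WHERE t.key_a = s.key_a AND t.key_b = s.key_b
      -- ... and so on for target_3 through target_6
    );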
Note:
These tables are partitioned by day and can get very large.
Any advice would be welcome.
Thanks.