
SQL & PL/SQL


Ideas on cleaning up large table with massive duplicate rows.

EdStevens, Feb 2 2017 (edited Feb 23 2017)

This thread is a branch from “Analytics to identify duplicate rows not working as expected”, opened to have a broader discussion of the core problem I am working on.

Oracle 11.2.0.4.0, StdEd One, 64-bit

Oracle Grid Infrastructure 11.2.0.4.0, standalone ASM, no cluster, no RAC

Oracle Linux 6.

Here’s the situation. It’s a bit lengthy, as I’ll try to provide as much relevant information as possible and avoid you having to drag it out a bit at a time.

I have one table, used as a staging table for a data warehouse app. The only source of data for this table is a weekly load from a file provided by an outside source; the entire file is loaded into the table in a weekly job. The primary business analyst says the non-duplicated rows must be retained for historical research.

Each week the data file has all of the content from the previous week, plus new content. Thus, all rows loaded last week will be loaded again this week, along with new data, and all data loaded this week will be loaded again next week, along with next week’s new data. Once per quarter the source does some cleanup and housekeeping, but the data file is never restarted from zero. You can see how this produces a massive build-up of duplicated rows.

Currently the table has 1.13 billion rows in 29,655,781 blocks, with an average row length of 184 bytes. The table has 37 columns, of which a match on 35 defines a ‘duplicate’. The two columns excluded from the comparison are the PK, a surrogate generated from a sequence at time of insert, and INSERT_USER_DATE, populated by SYSDATE at time of insert. The table has its own dedicated tablespace, currently at 9 data files totaling 242 GB. There is an MV dependent on this table.
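As a sizing check, a query along these lines should count how many rows the cleanup stands to remove. This is a sketch using the abridged columns from the POC script below; the real version would group by all 35 comparison columns:

-- count rows that are duplicates beyond the first of each group
-- (GROUP BY treats NULLs as equal, matching the duplicate definition)
SELECT SUM(cnt) - COUNT(*) AS rows_to_remove
FROM  (SELECT COUNT(*) AS cnt
       FROM   customer_account
       GROUP  BY company_code, sales_division_code,
                 purchase_account_num, account_name);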

 

There are three indexes, all in one tablespace that contains only these 3 segments. The index tablespace has 4 data files totaling 72.1 GB. I can provide index stats if deemed relevant.

So, that’s the landscape. With that, I have two objectives:

First, stop the bleeding.

Second, clean up the duplicates.

On ‘stop the bleeding’, I thought I had a good method: replace the INSERT with a MERGE that only exercises the ‘insert’ option. In a low-volume proof of concept this worked exactly as expected, but given the current size of the table, it ran for two hours before I killed it. (The current production procedure runs in about 45 minutes, inserting approx. 2.2 million records.) So it appears I’m not going to achieve acceptable run-time on the load until I can get the trash removed.
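For reference, a minimal sketch of that MERGE against the abridged POC columns. WEEKLY_FEED and CA_SEQ are hypothetical stand-ins for the real staging source and sequence, and the real ON clause would compare all 35 columns, with NULL-safe handling for every nullable one:

MERGE INTO customer_account t
USING (SELECT DISTINCT company_code, sales_division_code,
                       purchase_account_num, account_name
       FROM   weekly_feed) s   -- de-dup within the feed itself first
ON (    t.company_code        = s.company_code
    AND t.sales_division_code = s.sales_division_code
    AND t.account_name        = s.account_name
    AND (   t.purchase_account_num = s.purchase_account_num
         OR (t.purchase_account_num IS NULL AND s.purchase_account_num IS NULL)))
WHEN NOT MATCHED THEN
  INSERT (customer_account_id, company_code, sales_division_code,
          purchase_account_num, account_name, insert_user_date)
  VALUES (ca_seq.NEXTVAL, s.company_code, s.sales_division_code,
          s.purchase_account_num, s.account_name, SYSDATE);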

On ‘clean up the duplicates’, I see four methods.

  One, copy the data to be retained to a new table, then drop the original, rename the new, and rebuild all dependencies (see the sketch after this list).

  Two, copy the data to be retained to a new table, truncate the original, then copy the data from the new back to the original.

  Three, working in identified subsets of the original data (say, all purchase_account_num between 10000 and 20000), copy data to be retained to ‘working table’, delete all identified data from original, copy everything from working table back to original.

  Four, working with identified chunks of data, delete in place.
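For option one, the data to be retained can be isolated in a single CTAS with an analytic row-number. A sketch against the abridged POC columns (the real PARTITION BY would list all 35 comparison columns; no PARALLEL here, since this is Standard Edition One):

CREATE TABLE customer_account_new NOLOGGING   -- direct-path load, minimal redo
AS
SELECT customer_account_id, company_code, sales_division_code,
       purchase_account_num, account_name, insert_user_date
FROM  (SELECT ca.*,
              ROW_NUMBER() OVER (
                PARTITION BY company_code, sales_division_code,
                             purchase_account_num, account_name
                ORDER BY insert_user_date) AS rn
       FROM   customer_account ca)
WHERE  rn = 1;

Option two's copy-back step could take the same shape, using a direct-path INSERT /*+ APPEND */ ... SELECT after the truncate.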

I ran a full-volume trial of the INSERT (SELECT .. FROM) method. I killed it after it had run 24 hours with no end in sight.

I’ve been reading and re-reading the discussion on this at AskTom, but am still spinning my wheels.

Here's a script that creates and loads a POC table. This is an abridged version of the real table, but it does include the three indexed columns along with a representative group of other columns.

--
-- create test table
--
drop TABLE CUSTOMER_ACCOUNT purge;

CREATE TABLE CUSTOMER_ACCOUNT
   (    CUSTOMER_ACCOUNT_ID  NUMBER(*,0) NOT NULL ENABLE,
        COMPANY_CODE         CHAR(2),
        SALES_DIVISION_CODE  CHAR(2),
        PURCHASE_ACCOUNT_NUM CHAR(10),
        ACCOUNT_NAME         VARCHAR2(20),
        INSERT_USER_DATE     DATE,
        constraint ca_pk primary key (CUSTOMER_ACCOUNT_ID)
   );

CREATE INDEX IX_INSERT_USER_DATE
ON CUSTOMER_ACCOUNT (INSERT_USER_DATE);

CREATE INDEX IX_PURCHASE_ACCOUNT_NUM
ON CUSTOMER_ACCOUNT (PURCHASE_ACCOUNT_NUM);

--
-- insert test data
--
insert into CUSTOMER_ACCOUNT values (1,'AA','AA','AAAA','HELEN BACH',to_date('2017-01-01','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (2,'AA','AA','AAAA','HELEN BACH',to_date('2017-01-02','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (3,'AA','AA','AAAA','HELEN BACH',to_date('2017-01-03','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (4,'BB','BB',NULL,'HELEN BACH',to_date('2017-01-03','yyyy-mm-dd'));
--
insert into CUSTOMER_ACCOUNT values (5,'AA','AA','AAAA','CHEATEM AND HOWE',to_date('2017-02-01','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (6,'AA','AA','AAAA','CHEATEM AND HOWE',to_date('2017-02-02','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (7,'AA','AA','AAAA','CHEATEM AND HOWE',to_date('2017-02-03','yyyy-mm-dd'));
--
insert into CUSTOMER_ACCOUNT values (8,'AA','AA','AAAA','Pikof Andropov',to_date('2017-03-01','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (9,'AA','AA','AAAA','Pikof Andropov',to_date('2017-03-02','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (10,'AA','AA','AAAA','Pikof Andropov',to_date('2017-03-03','yyyy-mm-dd'));
--
insert into CUSTOMER_ACCOUNT values (11,'AA','AA','AAAA','MURKEY RESEARCH INC.',to_date('2017-03-03','yyyy-mm-dd'));
--
commit;

From the above data:

- PK 1 to 3 are duplicated on all columns except the PK and insert date. Retain the one with the earliest insert date.

- PK 4 has no dupes across all relevant columns. Keep it.

- PK 5 to 7 are duplicated on all columns except the PK and insert date. Retain the one with the earliest insert date.

- PK 8 to 10 are duplicated on all columns except the PK and insert date. Retain the one with the earliest insert date.

- PK 11 has no dupes across all relevant columns. Keep it.
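To make that concrete, here is a delete-in-place sketch at POC scale (method four would do this in committed chunks on the real table, and the PARTITION BY would list all 35 comparison columns):

-- keep the first row of each duplicate group (earliest insert date,
-- lowest PK as tie-breaker); partitions treat NULLs as equal
DELETE FROM customer_account
WHERE  rowid IN (
         SELECT rid
         FROM  (SELECT rowid AS rid,
                       ROW_NUMBER() OVER (
                         PARTITION BY company_code, sales_division_code,
                                      purchase_account_num, account_name
                         ORDER BY insert_user_date, customer_account_id) AS rn
                FROM   customer_account)
         WHERE  rn > 1);

Against the test data this should leave exactly PKs 1, 4, 5, 8, and 11.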

I am open to any and all approaches.

This post has been answered by Paulzip on Feb 10 2017