SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Delete duplicate rows from a huge table

marindoSep 29 2017 — edited Nov 8 2017

Hi!

I've got a table with about 50 million records. However, about 20 millions of these are considered to be duplicates, as the relevant information is the same. I am trying to delete those duplicates, but I run into that TEMP space error: "ORA-01652: unable to extend temp segment by 128 in tablespace TEMP"...

I am operating in verison 11g.

My initial approach to delete the duplicates has been:

DELETE FROM my_table WHERE key in (
SELECT
key
FROM (
SELECT
    key,
    row_number() OVER (PARTITION BY
      ... -- all relevant columns here
      ORDER BY key DESC
    ) AS dup_num
FROM my_table
)
WHERE dup_num>1

);

In the inner SELECT, all rows that have identical values in the relevant columns are numbered. As the key was assigned in chronological order (rows added later have a higher key), I sorted DESC to have the row in a partition added latest given number 1. The outer SELECTS picks all duplicates. The DELETE is to remove these.

This is logically correct, but runs into memory issues (ORA-01652).

My next attempt was to restrict the number of rows to be deleted in one go (with the idea to repeat). I used another row_number() to count the duplicates:

DELETE FROM my_table WHERE key in (
SELECT
key
FROM (
SELECT
key,
row_number() OVER (ORDER BY key) as count,
FROM (
SELECT
    key,
    row_number() OVER (PARTITION BY
      ... -- all relevant columns here
      ORDER BY key DESC
    ) AS dup_num
FROM my_table
)
WHERE dup_num>1
)
WHERE count<1000000
);

But this did not help at all. Same memory issue.

A simple answer here might be: Increase that TEMP space. But if this is not possible, what else can I do?

Is there any way to restrict the number of rows returned without sorting?

What I intend to do is basically:

DELELTE FROM my_table WHERE key IN (

SELECT key FROM my_table a WHERE exists (SELECT 1 FROM my_table b WHERE b.key<>a.key AND a.col1=b.col1 AND a.col2=b.col2 AND ..)

But I assume that would run for ages. Also, when I write a.col1=b.col1 I actually mean (a.col1=b.col1 OR a.col1 is null AND b.col1 is null). I.e. two column values are considered to be the same if both are null. Is there any way (function?) to express that in a shorter way?

ORA-01652: unable to extend temp segment by 128 in tablespace TEMP

Locked Post

New comments cannot be posted to this locked post.

Locked on Dec 6 2017

Added on Sep 29 2017

22 comments

6,387 views