Background:
Our system has a dumping table into which multiple threads simultaneously insert 100k records -> let's call it the EVENT_DUMP table
We want to de-duplicate these events and insert them into another table -> let's call it the PROBABLE_EVENTS table. At the same time, the processed records must be deleted from EVENT_DUMP.
Other Details:
EVENT_DUMP can have approximately 7-8M records
But only about 1K of those records are distinct
Some of these records may already exist in PROBABLE_EVENTS
PROBABLE_EVENTS can already hold around 15K records
The question is:
What is the best approach to
de-duplicate the records from EVENT_DUMP ---> using E_KEY as the unique identifier of a record,
insert the ones that do not already exist into PROBABLE_EVENTS,
and delete those records from EVENT_DUMP,
ALL IN ONE TRANSACTION? (a minimal sketch of the intent follows this list)
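To make the goal concrete, here is a minimal sketch in Oracle SQL of the intended single transaction, assuming E_KEY alone determines duplicates. It is not a vetted solution: ROW_NUMBER stands in for DISTINCT (CONFIG is a CLOB, and CLOB columns cannot appear in a DISTINCT), and the comment flags where it breaks under concurrent inserts.

-- minimal sketch only: de-duplicate, insert missing keys, then purge
INSERT INTO PROBABLE_EVENTS
       (Version, E_KEY, ENTITY_ID, IDENTIFIER, NAME, DESCRIPTION, CONFIG)
SELECT Version, E_KEY, ENTITY_ID, IDENTIFIER, NAME, DESCRIPTION, CONFIG
FROM  (SELECT d.*,
              ROW_NUMBER() OVER (PARTITION BY d.E_KEY ORDER BY d.ROWID) AS rn
       FROM   EVENT_DUMP d)
WHERE rn = 1
  AND E_KEY NOT IN (SELECT E_KEY FROM PROBABLE_EVENTS);

-- caveat: rows committed by other threads between the INSERT above and
-- this DELETE are removed without ever being de-duplicated
DELETE FROM EVENT_DUMP;

COMMIT;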
What I have tried so far:
Using triggers, but I found this post arguing that triggers are slow: http://rwijk.blogspot.com/2007/09/database-triggers-are-evil.html
Using SELECT INTO within a procedure, but this page from the Oracle docs advises loading only small datasets into memory: https://docs.oracle.com/cd/B14117_01/appdev.101/b10807/13_elems045.htm (the usual workaround is sketched below)
But I have not yet arrived at a complete solution that both maintains integrity and is efficient.
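On the memory point: the standard PL/SQL idiom for bounding memory is BULK COLLECT with the LIMIT clause, fetching in batches instead of loading the whole table at once. A minimal illustration, where the cursor, the l_keys collection, and the 10,000-row batch size are illustrative choices, not part of my actual code:

declare
    type key_tab is table of EVENT_DUMP.E_KEY%type;
    l_keys key_tab;
    cursor c is select E_KEY from EVENT_DUMP;
begin
    open c;
    loop
        -- fetch at most 10,000 rows per round trip instead of the whole table
        fetch c bulk collect into l_keys limit 10000;
        exit when l_keys.count = 0;
        -- ... process the batch here ...
    end loop;
    close c;
end;
/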
Sharing Scripts for TEST DATA:
CREATE TABLE EVENT_DUMP (
    Version     SMALLINT      NOT NULL,
    E_KEY       VARCHAR(512)  NOT NULL,
    ENTITY_ID   VARCHAR(255)  NOT NULL,
    IDENTIFIER  VARCHAR(4000),
    NAME        VARCHAR(4000) NOT NULL,
    DESCRIPTION VARCHAR(4000),
    CONFIG      CLOB          NOT NULL
);
CREATE TABLE PROBABLE_EVENTS (
    Version              SMALLINT      NOT NULL,
    E_KEY                VARCHAR(512)  NOT NULL,
    ENTITY_ID            VARCHAR(255)  NOT NULL,
    IDENTIFIER           VARCHAR(4000),
    NAME                 VARCHAR(4000) NOT NULL,
    DESCRIPTION          VARCHAR(4000),
    CONFIG               CLOB,
    ARCHIVAL_STATUS      VARCHAR(16),
    ARCHIVAL_RETRY_COUNT SMALLINT,
    ARCHIVAL_ATTEMPT_TS  TIMESTAMP
);
-- TO GENERATE TEST DATA
-- two nested loops are executed: 20 (outer) x 4 (inner)
-- as written this creates 80 records in EVENT_DUMP (E_KEYs 1-4, 20 copies each)
-- and 2 records in PROBABLE_EVENTS (E_KEYs 2 and 4, kept unique by the MERGE);
-- scale the bounds up (e.g. 2,000,000 x 4) to generate ~8M rows
begin
    for cnt in 1..20 loop
        for n IN 1..4 loop
            -- seed a few pre-existing records in PROBABLE_EVENTS (even E_KEYs only)
            if mod(n,2) = 0 THEN
                MERGE INTO PROBABLE_EVENTS USING DUAL ON (E_KEY = n)
                WHEN NOT MATCHED THEN
                INSERT (Version, E_KEY, ENTITY_ID, IDENTIFIER, NAME, DESCRIPTION, CONFIG, ARCHIVAL_STATUS, ARCHIVAL_RETRY_COUNT)
                VALUES (2, n, 'entityID-' || n, 'some event identifier', 'name', 'description', '{"config":"some value"}', 'Pending', 0);
            end if;
            -- create an entry in EVENT_DUMP (duplicates accumulate here)
            insert into EVENT_DUMP (Version, E_KEY, ENTITY_ID, IDENTIFIER, NAME, DESCRIPTION, CONFIG)
            values (2, n, 'entityID-' || n, 'some event identifier', 'name', 'description', '{"config":"some value"}');
        end loop;
    end loop;
end;
/
Data generated by the above script: 80 rows in EVENT_DUMP (E_KEYs 1-4, 20 copies each) and 2 rows in PROBABLE_EVENTS (E_KEYs 2 and 4).
Expected output (for the sample data) after running the solution:
No records in the EVENT_DUMP table
2 new records in the PROBABLE_EVENTS table (E_KEYs 1 and 3), i.e. 4 rows in total

FYI: We found a solution for Postgres; sharing it for reference:
with deleted_rows as (
    delete from EVENT_DUMP
    where ctid in (select ctid from EVENT_DUMP limit 1000000)
    returning *
)
insert into PROBABLE_EVENTS (version, e_key, ENTITY_ID, IDENTIFIER, NAME, DESCRIPTION, CONFIG)
select distinct on (dr.e_key)
       dr.Version, dr.e_key, dr.ENTITY_ID, dr.IDENTIFIER, dr.NAME, dr.DESCRIPTION, dr.CONFIG
from deleted_rows dr
left join PROBABLE_EVENTS pe on dr.e_key = pe.e_key
where pe.e_key is null;
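For Oracle, a rough analogue of this batched pattern might look like the sketch below. It is only a sketch, not something we have validated: the rid_tab type, the l_rids variable, and the 1,000,000 batch size are illustrative, and it assumes one batch of ROWIDs fits in session memory. Only the ROWIDs captured at the start are deleted, so rows committed mid-flight survive until the next run; any duplicates this batch leaves behind are simply deleted on that next run, because their E_KEY will already exist in PROBABLE_EVENTS.

declare
    -- illustrative names; a batch of up to 1,000,000 ROWIDs is held in memory
    type rid_tab is table of rowid;
    l_rids rid_tab;
begin
    -- capture a bounded batch of physical rows, mirroring the ctid/limit trick
    select rowid bulk collect into l_rids
    from EVENT_DUMP
    where rownum <= 1000000;

    -- insert one row per E_KEY not yet in PROBABLE_EVENTS;
    -- ROW_NUMBER instead of DISTINCT because CONFIG is a CLOB
    insert into PROBABLE_EVENTS (Version, E_KEY, ENTITY_ID, IDENTIFIER, NAME, DESCRIPTION, CONFIG)
    select Version, E_KEY, ENTITY_ID, IDENTIFIER, NAME, DESCRIPTION, CONFIG
    from (select d.*,
                 row_number() over (partition by d.E_KEY order by d.rowid) as rn
          from EVENT_DUMP d)
    where rn = 1
      and E_KEY not in (select E_KEY from PROBABLE_EVENTS);

    -- delete exactly the captured rows; later arrivals wait for the next batch
    forall i in 1 .. l_rids.count
        delete from EVENT_DUMP where rowid = l_rids(i);

    commit;
end;
/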