How to efficiently de-dupe a table in Oracle

A bad approach to de-dupe a table is to delete the dupes. Unfortunately, when dealing with tables that have millions of rows the delete method is going to take a long time and consume a lot of server resources. Below is a step by step example that shows how we can de-dupe a table using “Partition By” Oracle command.

Step 1:
Let us create a table and load data into it.

CREATE TABLE TEST_DEDUPE
(
CUSTOMER_ID NUMBER(10),
BRAND       VARCHAR2(20),
TRANSACTION_DATE DATE
);
INSERT
INTO TEST_DEDUPE VALUES
(
1 ,
'GREAT VALUE' ,
TO_DATE('01/20/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
1 ,
'GREAT VALUE' ,
TO_DATE('01/20/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
1 ,
'GREAT VALUE' ,
TO_DATE('01/20/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
1 ,
'GREAT VALUE' ,
TO_DATE('01/22/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
1 ,
'GREAT VALUE' ,
TO_DATE('01/22/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
2 ,
'MEMBERS MARK' ,
TO_DATE('01/22/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
2 ,
'MEMBERS MARK' ,
TO_DATE('01/22/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
2 ,
'MEMBERS MARK' ,
TO_DATE('01/22/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
2 ,
'MEMBERS MARK' ,
TO_DATE('01/15/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
2 ,
'MEMBERS MARK' ,
TO_DATE('01/15/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
3 ,
'MEMBERS MARK' ,
TO_DATE('01/15/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
3 ,
'MEMBERS MARK' ,
TO_DATE('01/15/2011' , 'MM/DD/YYYY')
);
INSERT
INTO TEST_DEDUPE VALUES
(
4 ,
'MEMBERS MARK' ,
TO_DATE('01/15/2011' , 'MM/DD/YYYY')
);

Now let us look at the data:

SELECT * FROM TEST_DEDUPE
CUSTOMER_IDBRANDTRANSACTION_DATE
1GREAT VALUE20-JAN-11
1GREAT VALUE20-JAN-11
1GREAT VALUE20-JAN-11
1GREAT VALUE22-JAN-11
1GREAT VALUE22-JAN-11
2MEMBERS MARK22-JAN-11
2MEMBERS MARK22-JAN-11
2MEMBERS MARK22-JAN-11
2MEMBERS MARK15-JAN-11
2MEMBERS MARK15-JAN-11
3MEMBERS MARK15-JAN-11
3MEMBERS MARK15-JAN-11
4MEMBERS MARK15-JAN-11

 

We noticed that Customer_IDs {1, 2, 3} have duplicates. For instance, Customer_ID 1 have three rows with Brand = “Great Value” and Purchase date = “20-JAN-11”. After de-duping we should only have one row.


Step 2:

Now let us rename the table that we need to de-dupe.

ALTER TABLE TEST_DEDUPE
RENAME TO TEST_DEDUPE_OLD;

Step 3:
Now let’s create a new table that has the same name as the original table.

CREATE TABLE TEST_DEDUPE AS
(SELECT CUSTOMER_ID ,
BRAND ,
TRANSACTION_DATE
FROM
(SELECT CUSTOMER_ID ,
BRAND ,
TRANSACTION_DATE ,
ROW_NUMBER()
OVER
(PARTITION BY CUSTOMER_ID , BRAND , TRANSACTION_DATE
ORDER BY TRANSACTION_DATE ) AS RN
FROM TEST_DEDUPE_OLD
)
WHERE RN = 1
);

Final Step:
Let us view the data without dupes:

SELECT * FROM TEST_DEDUPE;
CUSTOMER_IDBRANDTRANSACTION_DATE
1GREAT VALUE20-Jan-11
1GREAT VALUE22-Jan-11
2MEMBERS MARK15-Jan-11
2MEMBERS MARK22-Jan-11
3MEMBERS MARK15-Jan-11
4MEMBERS MARK15-Jan-11