Thursday 8 May 2014

oracle template for de-duplication dedupe

This is an often fought with concept.  You need to create a PK over a table based upon a list of columns, but it’s hard to effectively identify which row to delete, because they have the same key value.  Now please remember, I’m considering these rows to be duplicates, and that I don’t really care which one I keep and which one I delete.

My situation is quite specific, so I’ll elaborate.  One of my table conversions is falling back to RAT because of duplicates in the F4072TEM based upon the new PK that is in the new F4072.  Darn it…  I cannot afford to have this go back to RAT.  What happens when something is RAT – the dupes are just chucked away anyway….  So, I’m going get rid of them with some set based logic.

Desc the PK, and you get a list of columns:

ADITM, ADAST, ADAN8, ADIGID, ADCGID, ADOGID, ADCRCD, ADUOM, ADMNQ, ADEXDJ, ADUPMJ, ADTDAY

So, use this list of columns in the template SQL below:

delete from tablename
where (rowid,COLUMN LIST) in (
select max(rowid), COLUMN LIST
from tablename
group by COLUMN LIST
having count(1) > 1);

To create

delete from proddta.f4072tem
where (rowid,ADITM, ADAST, ADAN8, ADIGID, ADCGID, ADOGID, ADCRCD, ADUOM, ADMNQ, ADEXDJ, ADUPMJ, ADTDAY) in (
select max(rowid), ADITM, ADAST, ADAN8, ADIGID, ADCGID, ADOGID, ADCRCD, ADUOM, ADMNQ, ADEXDJ, ADUPMJ, ADTDAY
from proddta.f4072tem
group by ADITM, ADAST, ADAN8, ADIGID, ADCGID, ADOGID, ADCRCD, ADUOM, ADMNQ, ADEXDJ, ADUPMJ, ADTDAY
having count(1) > 1);

Nice, it deletes the newest record, you could use min to delete the oldest record.

Wait… Update…  The above is good if there is only every 1 other duplicate.  When there are a lot of possible duplicates that match the key, then the above will have to be run multiple times, which is not good…  unless you put some recursion on your scripts, so…

Although quite a bit slower, the following works:  thanks to http://www.dba-oracle.com/t_delete_duplicate_table_rows.htm for some help on this one.

delete from proddta.f4072tem t1
where t1.rowid >
ANY (
select
t2.rowid
from proddta.f4072tem t2
WHERE t1.ADITM = t2.ADITM
and t1.ADAST = t2.ADAST
and t1.ADAN8 = t2.ADAN8
and t1.ADIGID = t2.ADIGID
and t1.ADCGID = t2.ADCGID
and t1.ADOGID = t2.ADOGID
and t1.ADCRCD = t2.ADCRCD
and t1.ADUOM = t2.ADUOM
and t1.ADMNQ = t2.ADMNQ
and t1.ADEXDJ = t2.ADEXDJ
and t1.ADUPMJ = t2.ADUPMJ
and t1.ADTDAY = t2.ADTDAY);

And finally…  another one.  The above is way too slow, I’m not too sure that it’ll ever return, so: my final instalment:  I really must pay homage to burleson consulting, their website is gold when it comes to quick oracle tips.  http://www.dba-oracle.com/ 

delete from proddta.F4072tem where rowid in
  (
  select "rowid" from
     (select "rowid", rank_n from
         (select rank() over (partition by  ADITM, ADAST, ADAN8, ADIGID, ADCGID, ADOGID, ADCRCD, ADUOM, ADMNQ, ADEXDJ, ADUPMJ, ADTDAY order by rowid) rank_n, rowid as "rowid"
             from proddta.F4072tem
             where (ADITM, ADAST, ADAN8, ADIGID, ADCGID, ADOGID, ADCRCD, ADUOM, ADMNQ, ADEXDJ, ADUPMJ, ADTDAY) in
                (select ADITM, ADAST, ADAN8, ADIGID, ADCGID, ADOGID, ADCRCD, ADUOM, ADMNQ, ADEXDJ, ADUPMJ, ADTDAY from proddta.F4072tem
                  group by ADITM, ADAST, ADAN8, ADIGID, ADCGID, ADOGID, ADCRCD, ADUOM, ADMNQ, ADEXDJ, ADUPMJ, ADTDAY
                  having count(*) > 1
                )
             )
         )
     where rank_n > 1
  );

No comments: