De-duplicating across multiple tables in MySQL

Posted by Jedd on 2013-03-21

I'd accidentally let a few duplicate records slip into a system that I'd inherited. Being a close-to-3NF schema meant I was using a bunch of linking tables, and I needed to correlate those both when identifying the duplicates and when tidying them up.

Loosely what I had was something like:

    person  [  id , firstname , surname , ... ]
    phone   [  id , num ]
    link_person_phone [ id , person_id , phone_id ]

    // And two more pairs of tables similar to phone, that provide:
    // address + link_address_person,
    // email   + link_email_person

The owner of the data had agreed that a person is unique if they match on ( firstname + surname + phone number ).

In my phone (and address and email) tables I already had unique constraints on the data columns, but obviously there could be (and were) multiple link records to a given phone (and address, and email).
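For what it's worth, the relevant bits of DDL would look something like the following (the column types and constraint details here are my reconstruction, not the actual schema):

    -- Sketch only: types and names are illustrative, not the real definitions
    CREATE TABLE phone (
      id  INTEGER UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      num VARCHAR(32) NOT NULL,
      UNIQUE KEY (num)               -- unique constraint on the data column
    ) ENGINE=InnoDB;

    CREATE TABLE link_person_phone (
      id        INTEGER UNSIGNED AUTO_INCREMENT PRIMARY KEY,
      person_id INTEGER UNSIGNED NOT NULL,
      phone_id  INTEGER UNSIGNED NOT NULL,
      -- no uniqueness here, so several links can point at the same phone
      FOREIGN KEY (person_id) REFERENCES person (id),
      FOREIGN KEY (phone_id)  REFERENCES phone (id)
    ) ENGINE=InnoDB;

    -- address + link_address_person and email + link_email_person follow the same pattern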

So I needed to first identify someone with their phone number, and then the older duplicates, and then remove those duplicates.
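Before writing anything destructive, a query along these lines (just a sketch against the tables above) shows how many ( firstname + surname + number ) combinations are affected, and how badly:

    -- How many people share each (firstname, surname, number) combination?
    SELECT person.firstname, person.surname, phone.num,
           COUNT(DISTINCT person.id) AS copies
    FROM person, link_person_phone, phone
    WHERE person.id = link_person_phone.person_id
      AND link_person_phone.phone_id = phone.id
    GROUP BY person.firstname, person.surname, phone.num
    HAVING copies > 1
    ORDER BY copies DESC;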

Originally, because I was using a hammer at the time, I thought I'd fix this with PHP (I was young & foolish).

A rinky-dink prototype script seemed to run quite nicely, and extrapolating from the timing against 10% of the full data I calculated it'd take about an hour to run, which seemed reasonable enough as there are always other things to be getting on with. However, on returning to my byobu session, I discovered it was taking much, much longer.

It was a good opportunity (in disguise) to begrudgingly do it properly, so I returned to writing what ended up being some fairly basic SQL.

I'd originally looked at doing it this way, but the requirement to double-link put me off, plus I'd anticipated being able to (quickly) reuse some PHP fragments from another importer script I was working on at the time.

Here's what I came up with. Comments follow.

    -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    -- De-duplication for person (and related lookup) tables

    CREATE TABLE 
      -- Without an index it was taking ~30s per 1,000 DELETES per table
      dups (person2_id INTEGER UNSIGNED, INDEX(person2_id)) ENGINE=InnoDB  
      AS

      SELECT
      --  person1.id  AS person1_id,
          DISTINCT person2.id  AS  person2_id
      --  person1.firstname,
      --  person1.surname,

      --  phone1.num AS phone1_num,
      --  phone2.num AS phone2_num,

      --  link1.id  AS link1_id,
      --  link2.id  AS link2_id

      FROM
        person person1,
        person person2,

        link_person_phone link1,
        link_person_phone link2,

        phone phone1,
        phone phone2

      WHERE

        person1.firstname = person2.firstname
        AND
        person1.surname   = person2.surname
        AND
        phone1.num = phone2.num

        AND
        person1.id = link1.person_id
        AND
        link1.phone_id = phone1.id

        AND
        person2.id = link2.person_id
        AND
        link2.phone_id = phone2.id

        -- This ensures we retain the most recent (largest id) row
        AND
        person1.id > person2.id

      ORDER BY person2.id DESC;

    -- Delete the link rows referencing person (FK constraints) first
    DELETE FROM link_person_phone   WHERE person_id IN (SELECT person2_id FROM dups);
    DELETE FROM link_address_person WHERE person_id IN (SELECT person2_id FROM dups);
    DELETE FROM link_email_person   WHERE person_id IN (SELECT person2_id FROM dups);

    -- The good stuff
    DELETE FROM person              WHERE person.id IN (SELECT person2_id FROM dups);

    -- Tidy up after ourselves
    DROP TABLE dups;

    -- Once this is done, these two should return identical results:
    -- SELECT firstname, surname, phone.num 
    -- FROM person, phone, link_person_phone 
    -- WHERE 
    --   person.id=link_person_phone.person_id 
    --   AND 
    --   link_person_phone.phone_id = phone.id 
    -- GROUP BY firstname, surname, phone.num ;

    -- and without grouping:

    -- SELECT firstname, surname, phone.num 
    -- FROM person, phone, link_person_phone 
    -- WHERE 
    --   person.id=link_person_phone.person_id 
    --   AND
    --   link_person_phone.phone_id = phone.id;
    -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 

It's reasonably well documented inline, but loosely: we first identify the duplicates and store them in a working table (dups), then delete the link table references, and then delete the actual person records.

Originally, before explicitly defining the dups table (you can just do CREATE TABLE ... AS SELECT ...), MySQL created the table without an index, and the resulting deletes were mind-bogglingly slow despite the data being in good order. Evidently the optimiser wasn't able to do much with that.
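If you do end up with an un-indexed dups table from a plain CREATE TABLE ... AS SELECT, you don't need to rebuild it; adding the index afterwards, before the deletes run, works just as well:

    -- Add the missing index to an existing working table before running the deletes
    ALTER TABLE dups ADD INDEX (person2_id);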

For completeness, I've left in place the (commented-out) extra fields I was SELECTing while developing the script.

The trick here is, obviously enough, joining the same table to itself wherever I need to do a comparison. A basic trick, to be sure, but still neat to my humble mind.
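Stripped right back (and ignoring the phone linking for a moment), the self-join pattern is just this:

    -- Pair every newer person with each older person sharing the same name
    SELECT person1.id AS keep_id, person2.id AS dup_id
    FROM person person1, person person2
    WHERE person1.firstname = person2.firstname
      AND person1.surname   = person2.surname
      AND person1.id > person2.id;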

The AND person1.id > person2.id condition ensures that person1.id (which we want to retain) is the most recent, i.e. the one with the biggest id. That is, the multiple person2.ids (of which there were, in some cases, a few dozen) are guaranteed to be the older duplicates. If you wanted to retain the oldest (original) record for a given person instead, invert that logic.
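That is, to keep the oldest record and discard the newer ones, the only change to the script above is:

    -- Retain the oldest (smallest id) record; collect the newer ones for deletion
    AND
    person1.id < person2.id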

Because I have FOREIGN KEYS in play, I had to delete the link table entries first, and then the actual person records.

Finally, tidy up by removing the dups table.

I've included the basic test queries I ran before and after the script, so I could work out precisely how many records I expected to lose, and then confirm that was the actual number that went away.
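If you'd rather compare a single number than eyeball two result sets, the expected loss can be computed up front; note this sketch, like the test queries above, only lines up exactly when each person carries a single phone link:

    -- Surplus join rows = rows that should disappear once duplicates are removed
    SELECT COUNT(*) - COUNT(DISTINCT person.firstname, person.surname, phone.num)
             AS expected_deletions
    FROM person, phone, link_person_phone
    WHERE person.id = link_person_phone.person_id
      AND link_person_phone.phone_id = phone.id;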

Needless to say this script was run after I fixed up the legacy code that allowed duplicates to slip in, but it's the kind of thing that can be re-run without any great risk, and, as it turns out, has a tiny impact on the server.
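To keep re-runs painless it's worth guarding the working table, and (since the FOREIGN KEYs imply InnoDB) the deletes can optionally be wrapped in a transaction so a partial failure rolls back cleanly:

    -- Guard against a leftover working table from an earlier (interrupted) run
    DROP TABLE IF EXISTS dups;

    -- Optionally make the deletes atomic
    START TRANSACTION;
    -- ... the four DELETE statements from above ...
    COMMIT;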

For the record, with a modest configuration (16MB caches for most things) on my test data setup where I was deleting 30% out of 150,000 records, it took about 15s to create the dups table, and a second or so to delete the data from the other tables.