Deleting duplicate Maildir emails using PHP

In an attempt to speed up my IMAP access, I moved old emails into folders based on year. This speed-up majorly backfired when IMAP/Maildir decided to make 7 copies of each email in the same folder. I ended up with 27,000 emails in my 2006 folder!

With my mailbox quota full I needed a quick solution… and couldn’t find one! Thunderbird has a plugin that will search for and delete duplicate messages, but it runs over IMAP, and handling all those requests crippled the server.

Using Google I stumbled across this solution for finding and deleting duplicate messages using reformail. After getting reformail installed, though, I found it to be very slow, and the number of messages to delete didn’t add up, so I had to abandon this approach.

In the end I decided to write my own PHP script that would cycle through the specific mail directory, search for duplicate messages based on the Message-Id (or a checksum of the email if one isn’t available) and then delete the unnecessary duplicates. It worked a treat, and went through the 27,000 emails in less than 5 minutes! If anybody wants the code, it’s below!
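As an aside before the full script, here is a minimal sketch of just the matching rule described above: use the Message-ID if there is one, otherwise fall back to a checksum of the whole email. The function name `dedupe_key` is made up for illustration, and restricting the search to the header block is my assumption here; the full script further down searches the entire message.

```php
<?php
// Hypothetical helper: returns the key used to decide whether two emails match.
function dedupe_key(string $rawEmail): string
{
    // Headers end at the first blank line. Only look there (an assumption of
    // this sketch) so a Message-ID quoted in the body is not picked up.
    $parts = preg_split('/\r?\n\r?\n/', $rawEmail, 2);
    $headers = $parts[0];

    // Header names are case-insensitive, so "Message-Id" matches too.
    if (preg_match('/^Message-ID:\s*<([^>]+)>/im', $headers, $m)) {
        return $m[1];
    }

    // No usable Message-ID: fall back to a checksum of the whole message.
    return 'md5_' . md5($rawEmail);
}

$a = "From: a@example.com\r\nMessage-ID: <abc123@example.com>\r\n\r\nHello";
$b = "Message-Id: <abc123@example.com>\nFrom: a@example.com\n\nHello";
$c = "From: a@example.com\n\nNo id here";

var_dump(dedupe_key($a) === dedupe_key($b));   // bool(true)
var_dump(dedupe_key($c) === 'md5_' . md5($c)); // bool(true)
```

Two copies of the same email end up with the same key even if their line endings or header order differ, which is what makes the Message-ID a better first choice than a plain checksum.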

All you need to do is change the $dir variable at the top to the location of the Maildir folder that contains all the duplicates. In my case it was a folder called 2006, so you’d use …../.2006/cur/. I suggest you run the script once with $delete set to false to check that the stats it echoes out sound correct. If they do, just change $delete to true and let it run! Hope this helps somebody.

<?php

$dir = '/home/username/mail/domain.com/user/.2006/cur/';
$delete = false; // set to true after testing to actually delete them

$emails = scandir($dir);

$found = array();
$dups = 0;
$actual = 0;
$blank = 0;

$i = 0;
foreach ($emails as $file) {

    set_time_limit(20); // restart the timeout for each email

    if ($file == '.' OR $file == '..') continue;

    $i++;
    //if($i > 1000) break; // temp stopper

    $messageIds = false;

    $email = file_get_contents($dir . $file);
    // $messageIds[2] holds the ID between the angle brackets
    preg_match('#Message-ID:(\s+)<([^>]+)>#is', $email, $messageIds);

    if (!is_array($messageIds) OR !isset($messageIds[2]) OR empty($messageIds[2])) {
        // no Message-ID found, fall back to a checksum of the email
        $messageIds[2] = 'md5_' . md5($email);
        $blank++;
    }

    $messageId = $messageIds[2];

    if (in_array($messageId, $found)) {
        // message is a dup, delete
        if ($delete) unlink($dir . $file);
        $dups++;
    } else {
        $found[] = $messageId;
        $actual++;
    }

}

echo 'Found ' . $i . ' emails' . "<br/>\n";
echo 'Found ' . $dups . ' duplicates' . "<br/>\n";
echo 'Leaving ' . $actual . ' originals' . "<br/>\n";
echo 'With ' . $blank . ' without message IDs' . "<br/>\n";

?>
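
If your folder is a lot bigger than mine was, the in_array() check may become the slow part, because it scans the whole list of IDs for every email. Below is a rough sketch of the same idea using array keys and isset() instead. The function name `find_duplicate_files` is hypothetical and not part of the script above; it only reports the duplicate paths and leaves the deleting to you.

```php
<?php
// Hypothetical helper, not part of the original script.
// Returns the paths of files in $dir whose key has already been seen.
function find_duplicate_files(string $dir): array
{
    $seen = [];        // key => true, so each lookup is a single isset()
    $duplicates = [];

    // scandir() returns names in ascending order, so the first file by
    // name is the copy that gets kept.
    foreach (scandir($dir) as $file) {
        $path = rtrim($dir, '/') . '/' . $file;
        if (!is_file($path)) {
            continue;  // skips '.', '..' and any sub-directories
        }

        $email = file_get_contents($path);
        if (preg_match('#Message-ID:\s+<([^>]+)>#is', $email, $m)) {
            $key = $m[1];
        } else {
            $key = 'md5_' . md5($email); // same fallback as the script
        }

        if (isset($seen[$key])) {
            $duplicates[] = $path;
        } else {
            $seen[$key] = true;
        }
    }

    return $duplicates;
}
```

Printing the returned list first works as a dry run, much like leaving $delete set to false. Once the list looks right, passing each path to unlink() would do the actual delete.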