In an attempt to speed up my IMAP access I moved old emails into folders based on year. This speed-up majorly backfired when IMAP/Maildir decided to make 7 copies of each email in the same folder. I ended up with 27,000 emails in my 2006 folder!
With my mailbox quota full I needed a quick solution… and couldn’t find one! Thunderbird has a plugin that will search for and delete duplicate messages, but it runs over IMAP, and the server was crippled trying to handle all the requests.
Using Google I stumbled across this solution for finding and deleting duplicate messages using reformail. After getting reformail installed, though, I found it to be very slow, and the number of messages to delete didn’t add up, so I had to abandon this approach.
In the end I decided to write my own PHP script that would cycle through a specific mail directory, search for duplicate messages based on the Message-ID (or a checksum of the whole email if there isn’t one) and then delete the unnecessary duplicates. It worked a treat, and went through the 27,000 emails in less than 5 minutes! If anybody wants the code, it’s below!
All you need to do is change the $dir variable at the top to the location of the Maildir folder that contains all the duplicates. In my case it was a folder called 2006, so you’d use …../.2006/cur/. I suggest you run the script once with $delete set to false to check that the stats it echoes out sound correct. If they do, just change $delete to true and let it run! Hope this helps somebody.
<?php
$dir = '/home/username/mail/domain.com/user/.2006/cur/';
$delete = false; // set to true after testing to actually delete them
$emails = scandir($dir);
$found = array(); // message IDs seen so far, stored as array keys for fast lookups
$dups = 0;
$actual = 0;
$blank = 0;
$i = 0;
foreach ($emails as $file) {
    set_time_limit(20); // reset the time limit for every message
    if ($file == '.' || $file == '..') continue;
    $i++;
    //if ($i > 1000) break; // temp stopper
    $messageIds = array();
    $email = file_get_contents($dir . $file);
    // $messageIds[2] holds the ID between the angle brackets
    preg_match('#Message-ID:(\s+)<([^>]+)>#is', $email, $messageIds);
    if (empty($messageIds[2])) {
        // no Message-ID found, fall back to a checksum of the whole email
        $messageIds[2] = 'md5_' . md5($email);
        $blank++;
    }
    $messageId = $messageIds[2];
    if (isset($found[$messageId])) {
        // message is a dup, delete
        if ($delete) unlink($dir . $file);
        $dups++;
    } else {
        $found[$messageId] = true;
        $actual++;
    }
}
echo 'Found ' . $i . ' emails' . "<br/>\n";
echo 'Found ' . $dups . ' duplicates' . "<br/>\n";
echo 'Leaving ' . $actual . ' originals' . "<br/>\n";
echo 'With ' . $blank . ' without message IDs' . "<br/>\n";
?>
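If you would rather not load every email into memory in full, here is a rough sketch of a variation on the same idea. It is not the script above: it only reads each file up to the first blank line (the end of the headers) and looks for the Message-ID there, so a Message-ID quoted inside a message body is ignored. The function names messageKey() and findDuplicates() are made up for this example, and it only lists the duplicates rather than deleting them.

```php
<?php
// Hypothetical helper: returns the key used to decide whether two files
// are the same message. Reads only the header block of the file.
function messageKey(string $path): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $headers = '';
    // The headers end at the first blank line.
    while (($line = fgets($handle)) !== false) {
        if (rtrim($line, "\r\n") === '') {
            break;
        }
        $headers .= $line;
    }
    fclose($handle);

    // Header names are case-insensitive; \s* also allows a folded header
    // where the ID sits on the following line.
    if (preg_match('/^Message-ID:\s*<([^>]+)>/im', $headers, $m)) {
        return $m[1];
    }
    // No Message-ID header: fall back to a checksum of the whole file.
    return 'md5_' . md5_file($path);
}

// Hypothetical helper: returns the paths of every file whose key was
// already seen earlier in the (alphabetically sorted) directory listing.
function findDuplicates(string $dir): array
{
    $files = scandir($dir);
    if ($files === false) {
        throw new RuntimeException("Cannot read $dir");
    }
    $seen = array();
    $duplicates = array();
    foreach ($files as $file) {
        $path = rtrim($dir, '/') . '/' . $file;
        if (!is_file($path)) {
            continue; // skips '.', '..' and any sub-directories
        }
        $key = messageKey($path);
        if (isset($seen[$key])) {
            $duplicates[] = $path;
        } else {
            $seen[$key] = true;
        }
    }
    return $duplicates;
}
```

You could print the result of findDuplicates($dir) first, check that the list looks right, and only then loop over it with unlink(), in the same spirit as the $delete flag above.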
Your blog editor did a number on the single and double quotes. I’ve fixed that and added options to show the duplicate emails and limit the number of emails matched. I’ve uploaded the modified version to
http://www.badcode.org/blog_random_duplicate_mdir_email_finder.txt
You didn’t specify a license, so I noted that the license is unknown.
I am trying to run your script on a directory that has 39525 messages in it. I needed to increase the PHP memory limit in php.ini to 200M to get the script to run. So far it’s found a ton. My Evolution remove-duplicate-emails has been running for about 5 hours now, and I only selected a few messages. I don’t know how long this will take, but I am glad I found this script. Thanks for sharing it.
Hi
I’ve got about 1000 email addresses (in Thunderbird) and I’ve got many duplicates. I was wondering if a similar script could work for me?
I’ve found many programs that do that under Windows but only one in Ubuntu, and it doesn’t work properly.
I’m looking for exact duplicates here.
Thanks