Converting Emails: Merge & Purge

Over far too many years I had amassed a collection of mail archives in different formats, with a lot of overlap but also enough unique emails that I had to keep every archive. This article is about the effort to unify them (merge), eliminate most of the duplication, delete emails older than my new ‘personal data retention policy’ allows (purge), and organize what was left into yearly archives so that it is easy to remove no-longer-relevant emails each year.

All the tools I’ve used to do this are free, but they do require Linux (sorry, Windows users), although they might also be available for the Mac OS X command line. That is because a) I’m broke and b) figuring out what is actually worth paying for in the world of proprietary data conversion is non-trivial (with the SEO games some companies play, it is not enough that a tool shows up high in search rankings, or even in reviews). In any event, I had free tools available and the skills to use them, so I did.

The first thing to do is to figure out the email format required by the tools one is going to use to do the merge and purge. In my case, I used Thunderbird, because I knew it supported what I wanted and it comes preinstalled on the distro I use. It may also have been possible to use Evolution (which I have added to the system and use for accessing Microsoft Exchange email servers), but I have found its de-duplication not as helpful as the addon I used in Thunderbird.

Since the intermediate tool for the emails was Thunderbird, the first trick was to get all the email into Thunderbird’s ‘mbox’ and subfolder format. That’s typically not directly possible, so I chose ‘Maildir with subdirectories’ as the intermediate format.

The first set of conversions was easy; I had a number of PST format files from Outlook backups. To convert them I used the ‘pst-utils’ package, specifically the ‘readpst’ program, and ended up with Maildirs holding the contents of the PST files.

The next set wasn’t much more difficult; I had different styles of ‘Maildir’ that I needed to convert to my chosen Maildir format. To do that I used a tool called ‘mbsync’. It lives in the package ‘isync’ on Debian and Ubuntu (and derivatives). The biggest trick was identifying the need for different ‘SubFolders’ format options in the ‘.mbsyncrc’ file; after that it was a matter of letting mbsync convert the mails.
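As a rough illustration of what such a configuration can look like, here is a sketch that writes a throwaway mbsync config with two local Maildir stores that differ only in their ‘SubFolders’ style. It is not my actual ‘.mbsyncrc’: the store names, the channel name, the old-mail path and the function name write_mbsyncrc are all placeholders, and the Far/Near keywords assume a reasonably recent isync (older releases call them Master/Slave), so check the mbsync man page for your version.

```shell
#!/bin/bash
# Sketch only: write a throwaway mbsync config that copies one local
# Maildir tree into another that uses a different subfolder naming style.
# Every path, store name and channel name below is a placeholder.
write_mbsyncrc() {
	cat > "$1" <<'EOF'
# Source tree in Maildir++ layout (folders stored as .Top.Sub inside Inbox)
MaildirStore old-mail
Inbox /path/to/old/Maildir
SubFolders Maildir++

# Destination tree with plain nested directories
MaildirStore archive
Path /path/to/ArchivedMail/Maildir/
Inbox /path/to/ArchivedMail/Maildir/INBOX
SubFolders Verbatim

# One-way copy from the old tree (Far) into the new one (Near)
Channel convert
Far :old-mail:
Near :archive:
Patterns *
Create Near
Sync Pull
SyncState *
EOF
}
```

With the file written (for example by running write_mbsyncrc ./mbsyncrc-convert), something like ‘mbsync -c ./mbsyncrc-convert convert’ would then run the channel. Try it on a copy of the mail first.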

Converting the Maildirs to Thunderbird’s mbox-based format turned out to be a bit of a pain (it turns out that Maildir with subdirectories is not actually a proper standards-compliant format), so I wrote a couple of Bash scripts to do the folder hierarchy conversion, with the help of a Perl script from the Dovecot Wiki, found under the section titled ‘Converting Maildir to mbox’.

Here are the two Bash scripts I hacked together (obviously you will need to edit the paths to suit your own situation, should you choose to use them for your own conversion).

md2mb-top.sh

#!/bin/bash

SOFAR="$1"
MAILTREE="$2"

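# The destination mbox path is required; the Maildir subtree is optional
# (an empty value means the top-level Maildir).
if [ -z "$SOFAR" ]; then
	echo "Usage: $0 <destination-mbox> [maildir-subtree]" >&2
	exit 1
fi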

export SOFAR MAILTREE

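RET=0
# Bail out if the archive directory is missing instead of reporting success.
cd /path/to/ArchivedMail || exit 1
# Convert only when the Maildir's 'cur' directory exists and is non-empty;
# otherwise just touch the destination so an (empty) mbox file exists.
if [ "$(find "Maildir/${MAILTREE:+$MAILTREE/}cur" -mindepth 0 -maxdepth 0 -type d -a ! -empty 2>/dev/null)" = "Maildir/${MAILTREE:+$MAILTREE/}cur" ]; then
	/home/daniel/bin/dw-maildirtombox.pl "${MAILTREE:-.}" "$SOFAR"
	RET="$?"
else
	touch "$SOFAR"
	RET="$?"
fi
if [ "$RET" != "0" ]; then
	echo "Error converting '${MAILTREE}'" >&2
	exit 1
fi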

exit 0

md2mb-depth1.sh

#!/bin/bash

set -e

# Both arguments are optional: SOFAR is the destination mbox directory
# (empty at the top level) and MAILTREE is the Maildir subtree to process.
SOFAR="${1:-}"
MAILTREE="${2:-}"

export SOFAR MAILTREE

if [ "$MAILTREE" = "" ]; then
	MAILBASE="/path/to/ArchivedMail/Maildir"
else
	MAILBASE="/path/to/ArchivedMail/Maildir/${MAILTREE}"
fi

# Only once; convert topmost directory to mbox (if maildir)
# Create mbox subdirs for all subdirs in root dir that have subdirs
# Process all maildirs in current root dir
# Process all subdirs in root dir (recursive)

(
	cd "${MAILBASE}" && \
	if [ -n "$SOFAR" ] && ls -A . 2>/dev/null | grep -qvE '^(cur|tmp|new|\.uidvalidity)$'; then mkdir -p "${SOFAR}"; fi && \
	if [ -z "$MAILTREE" ]; then /home/daniel/bin/md2mb-top.sh "${SOFAR:-/path/to/ArchivedMail/mbox/INBOX}" "$MAILTREE"; fi && \
	find * -mindepth 1 -maxdepth 1 -type d -a ! -empty -a ! -name 'cur' -a ! -name 'tmp' -a ! -name 'new' -exec sh -c 'mkdir -p "$1/$(dirname "$2").sbd"' sh "${SOFAR:-/path/to/ArchivedMail/mbox}" '{}' \; && \
	find * -mindepth 0 -maxdepth 0 -type d -a ! -empty -a ! -name 'cur' -a ! -name 'tmp' -a ! -name 'new' -execdir sh -c "/home/daniel/bin/md2mb-top.sh '${SOFAR:-/path/to/ArchivedMail/mbox}'/'{}' '${MAILTREE:+$MAILTREE/}''{}'" \; && \
	find * -mindepth 0 -maxdepth 0 -type d -a ! -empty -a ! -name 'cur' -a ! -name 'tmp' -a ! -name 'new' -execdir sh -c "if ls -A '{}' 2>/dev/null | grep -qvE '^(cur|tmp|new|\.uidvalidity)$'; then /home/daniel/bin/md2mb-depth1.sh '${SOFAR:-/path/to/ArchivedMail/mbox}'/'{}'.sbd '${MAILTREE:+$MAILTREE/}''{}'; fi" \;

	exit 0
)

exit 0

Now, one creates a ‘fake’ email account (used for neither sending nor receiving actual mail) in Thunderbird (which is trickier than it needs to be, in my opinion) and then modifies the account’s ‘Properties’ to point to the ‘/path/to/ArchivedMail/mbox’ folder created by the above scripts.

Next comes a fair bit of ‘manual’ work. Because of the lack of organization in the email collection I started with, I had to copy mail from a mess of folders into a coherent hierarchy. There is nothing remarkable about that; it’s basically tedious housekeeping (one of the reasons it was put off for so long).

The next step is to eliminate duplicates. For that I used the first addon that came up in a search of Thunderbird addons. It’s fairly widely used and works well. I did turn off ‘Message-ID’ as a matching criterion because I knew I had quite a mess of duplicate mails with different ID strings. I tweaked a few other settings so that I am fairly confident that mails would only be marked as duplicates if they actually were, without missing (many) actual duplicates. Again, not terribly exciting.
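To make the idea of matching on something other than ‘Message-ID’ concrete, here is a small sketch that fingerprints each message in a Maildir by its From, Date and Subject headers plus its body, and reports files whose fingerprint has already been seen. This is not how the addon works internally (I don’t know its internals), and the function names fingerprint and list_duplicates are made up for the illustration. It assumes GNU coreutils (sha256sum) and file names without tabs or newlines, and it ignores folded header continuation lines.

```shell
#!/bin/bash
# Sketch only: report probable duplicates in a Maildir tree while
# ignoring the Message-ID header entirely.

# Print a SHA-256 fingerprint of one message: the From, Date and Subject
# header lines plus everything after the first blank line (the body).
fingerprint() {
	awk '
		body { print; next }                              # body lines are kept as-is
		/^$/ { body = 1; next }                           # first blank line ends the headers
		tolower($0) ~ /^(from|date|subject):/ { print }   # selected headers only
	' "$1" | sha256sum | cut -d' ' -f1
}

# Walk the cur/ and new/ directories below $1 in sorted order and print
# "<file> duplicates <first file seen with the same fingerprint>".
list_duplicates() {
	find "$1" -type f \( -path '*/cur/*' -o -path '*/new/*' \) | sort |
	while IFS= read -r file; do
		printf '%s\t%s\n' "$(fingerprint "$file")" "$file"
	done |
	awk -F'\t' '($1 in seen) { print $2 " duplicates " seen[$1]; next }
	            { seen[$1] = $2 }'
}
```

Running list_duplicates on a Maildir only prints candidates; nothing is deleted, so the list can be reviewed by hand first.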

I then created a special folder for emails I wanted to keep regardless of age. Next I searched all folders (except the keep folder) for emails older than 2555 days (7 years), selected everything that came up, and deleted it (really, moved it to the trash).
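The 2555 figure is simply 7 × 365, so it ignores leap days and the real cutoff lands a day or two later than exactly seven calendar years ago. A quick sketch of the arithmetic, assuming GNU date (the variable and function names are just for illustration):

```shell
#!/bin/bash
# Sketch only: turn a retention period in years into the day count used
# in the search, and show which calendar date that corresponds to today.
RETENTION_YEARS=7
RETENTION_DAYS=$((RETENTION_YEARS * 365))   # 2555; leap days are ignored

# Print the date that lies the given number of days in the past.
# Relies on GNU date's -d option.
cutoff_date() {
	date -d "$1 days ago" +%Y-%m-%d
}

echo "Mail dated before $(cutoff_date "$RETENTION_DAYS") is older than $RETENTION_DAYS days"
```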

Then, for each of the seven years of mail, I created a search folder with a search that picked up the mail for that year. I modified the ‘Archive’ preferences in Thunderbird to use year-based folders and to keep the folder hierarchy (so I wouldn’t end up with an entire year in one mailbox folder). I then selected all the messages in the search and pressed ‘A’ (which is the keyboard shortcut for ‘Archive’).

Once all the folders were organized into years, and all the original folders were empty, I removed the original folders. I made sure I emptied the ‘Trash’, expunged all folders, and compacted everything. I then exited Thunderbird.

I then changed to ‘/path/to/ArchivedMail/mbox/Archive’ which is the folder Thunderbird created containing the yearly archives of my emails.

I used the package ‘mb2md’ to convert each year’s folder (recursively, using the -R option) to a Maildir. I then tarred and compressed all years and moved them to my personal archives.
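The tar-and-compress step can be sketched as a small loop. This assumes each year’s converted Maildir sits in its own directory named after the year under a common parent; the function name archive_years and the ‘mail-YEAR.tar.gz’ file naming are made up for the example, not what I necessarily used.

```shell
#!/bin/bash
# Sketch only: create one compressed tarball per year directory.
# $1 = directory holding one sub-directory per year (e.g. 2019, 2020)
# $2 = directory in which to place the resulting archives
archive_years() {
	mkdir -p "$2" || return 1
	for year_dir in "$1"/*/; do
		[ -d "$year_dir" ] || continue          # nothing matched the glob
		year="$(basename "$year_dir")"
		# -C keeps the paths inside the tarball relative to the parent directory
		tar -C "$1" -czf "$2/mail-$year.tar.gz" "$year" || return 1
	done
}
```

Listing a result with ‘tar -tzf’ before removing anything is a cheap way to confirm the archive really contains the messages.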

The most recent year I uploaded to my email server using ‘mbsync’, as various hosting changes had mangled the emails that were on the server (I don’t think I need more than a year ‘active’, since I have archives if I need them).

And finally I removed the various working folders and originals.

I’ve now got a reasonably organized historical archive of emails and a year’s worth of email ‘active’, and I have saved a lot of hard drive space.

I guess Covid is good for something…

By Daniel F. Dickinson

Daniel Dickinson has been using and exploring computer and electronic technologies for over three decades. He is proficient with various flavours of Linux (including varieties of desktop, server, and embedded systems) as well as modern Windows. Daniel still enjoys software and firmware development, but prefers integrating and extending existing open source solutions to quickly achieve desired goals. For more details see https://www.wildtechgarden.com/about/about-daniel-f
