In order to learn some more about data analytics, I am working through Russell Jurney's book "Agile Data Science". In it, he uses data downloaded from Gmail to illustrate the principles, and he has helpfully set up a GitHub repository for all the code used throughout the book (https://github.com/rjurney/Agile_Data_Code). I will be analysing data obtained from Microsoft Outlook instead. Since getting the data prepared was a bit of a hassle, I will document the steps here.
1) Export all emails into a .pst file.
2) Get readpst and transform the data into mbox format. On Ubuntu, this should be as easy as typing
sudo apt-get install pst-utils
readpst -r emails.pst
This creates an mbox file for each folder in the pst file containing all emails in the folder.
3) Reading the mbox file in Python is pretty easy once you know about the mailbox module:
import mailbox

mboxfile = 'mbox'
m = []
for message in mailbox.mbox(mboxfile):
    m.append(message)

m[0].keys()      # header names of the first message
m[0]['Subject']  # subject line of the first message
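Since readpst -r writes one mbox file per Outlook folder, it can be handy to collect all of them before reading. The following is a minimal sketch, assuming each folder's file is literally named "mbox" (as in the snippet above); the helper name find_mbox_files is my own and not part of any library.

```python
import os


def find_mbox_files(root):
    """Return the sorted paths of all files named 'mbox' below root.

    Assumption: readpst -r produced one directory per Outlook folder,
    each containing a single file called 'mbox'.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name == "mbox":
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Calling find_mbox_files on the directory where readpst wrote its output should then give one path per exported folder, and each path can be passed to mailbox.mbox in turn.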
The next step will be getting all those mails from the mbox files into Avro format.
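As a first move in that direction, each message could be flattened into a plain dictionary, which is the kind of record a serialisation format expects. This is only a sketch under my own assumptions: the field names (from, to, subject, date, body) and the helpers plain_text_body, message_to_record and read_records are made up for illustration, and only the first text/plain part of each mail is kept.

```python
import mailbox


def plain_text_body(message):
    """Return the first text/plain part as a string, or '' if there is none."""
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name in the mail header: fall back to UTF-8.
            return payload.decode("utf-8", errors="replace")
    return ""


def message_to_record(message):
    """Flatten one message into a dict (hypothetical field names)."""
    return {
        "from": message["From"],
        "to": message["To"],
        "subject": message["Subject"],
        "date": message["Date"],
        "body": plain_text_body(message),
    }


def read_records(mbox_path):
    """Read every message in an mbox file into a list of dicts."""
    return [message_to_record(msg) for msg in mailbox.mbox(mbox_path)]
```

Headers that are missing from a mail come back as None, so whatever schema is used later would have to allow empty values for those fields.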