Email Classification vastly outperforms other word embedding approaches

The Email Classification approach was compared against several widely used text representation models. Both classification accuracy and speed were taken into account in order to evaluate each model's suitability for production-grade, real-time applications that process large volumes of text data.

The results of the evaluations are summarized in the table below:

Summary table of the results

The experiments show that Email Classification achieves state-of-the-art accuracy on all three data sets, outperforming Word2Vec, FastText and Doc2Vec by a significant margin. Although additional hyperparameter tuning could possibly increase the BERT model's accuracy scores, it is far too slow for many real-world use cases, even when run on GPUs.

While the pure software version of Email Classification already classifies hundreds of emails per second, a hardware-accelerated version running on FPGAs is currently in development and will deliver significant further speed-ups.


Three subsets of the well-known Enron corpus were chosen to measure the performance of the different models. This corpus records how several Enron employees organized their emails into topical folders (such as “Personal”, “Hiring”, “Project A”). Factors such as the number of emails and the number of distinct classes were taken into account when selecting which employees' mailboxes to use. The subsets used were Kaminski (3,994 emails and 19 classes after removing folders with too few samples), Farmer (3,451 emails, 11 classes) and Lokay (2,321 emails, 8 classes). A fixed split of 75% of each data set was used for training the models, while the remaining 25% were held back for evaluation.
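The fixed 75/25 split can be sketched as follows. This is a minimal illustration, assuming a simple shuffled (un-stratified) split; the function name and seed are illustrative, not taken from the actual experiments:

```python
import random

def train_test_split(items, train_frac=0.75, seed=0):
    """Shuffle a data set and split it: 75% for training,
    the remaining 25% held back for evaluation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

# e.g. for a data set the size of the Kaminski subset:
train, test = train_test_split(range(3994))
```

With 3,994 items this yields 2,995 training examples and 999 evaluation examples.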

Widely used approaches for generating word and text embeddings include Word2Vec, FastText and Doc2Vec, all of which were included in our benchmark analysis.

A common approach to generating a representation of a text from dense word embeddings is to average them, weighting each word vector by the term's tf-idf value, which measures how important a term is to a document. This method was used to create email representations from the official Word2Vec vectors (GoogleNews-vectors-negative300), which were then used to train a linear classifier.
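The tf-idf-weighted averaging can be sketched as follows. This is an illustrative implementation with toy embeddings standing in for the GoogleNews Word2Vec vectors; the function name and the handling of out-of-vocabulary terms are assumptions, not details from the actual pipeline:

```python
import math
import numpy as np

def tfidf_weighted_embedding(doc_tokens, corpus, word_vectors, dim=300):
    """Average a document's word vectors, weighting each term by its
    tf-idf score (tf = count in this document, idf = log(N / df))."""
    n_docs = len(corpus)
    # document frequency of each term across the corpus
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vec = np.zeros(dim)
    total_weight = 0.0
    for term in set(doc_tokens):
        if term not in word_vectors:
            continue  # skip out-of-vocabulary terms
        tf = doc_tokens.count(term)
        idf = math.log(n_docs / df.get(term, 1))
        weight = tf * idf
        vec += weight * word_vectors[term]
        total_weight += weight
    return vec / total_weight if total_weight > 0 else vec
```

A term that occurs in every document gets an idf of zero and therefore contributes nothing to the averaged email representation.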

An embedding approach more suited to documents is Doc2Vec. Since no official implementation has been released, the Gensim version was used to train a Doc2Vec model on the email data. The resulting document embeddings were again used as input to a linear classifier.

Additionally, the Python version of the official Facebook implementation of FastText was used to directly train a classifier.

The current state-of-the-art results for text classification can be achieved with contextualized word embedding models such as BERT. To compare the Email Classification approach to such models, the official bert-base-uncased model was fine-tuned on the email classification task using the PyTorch implementation.
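The fine-tuning setup can be sketched with a plain PyTorch training loop. To keep the snippet self-contained, the encoder below is a toy stand-in for the pretrained bert-base-uncased model, and all class names, sizes, and hyperparameters are illustrative assumptions rather than the settings used in the experiments:

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the pretrained BERT encoder (illustrative only)."""
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)

    def forward(self, token_ids):
        # Mean-pool token embeddings as a crude substitute for the
        # [CLS] representation a real BERT model would provide.
        return self.emb(token_ids).mean(dim=1)

class EmailClassifier(nn.Module):
    """Encoder plus a linear classification head over the folder classes."""
    def __init__(self, encoder, hidden=32, n_classes=19):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):
        return self.head(self.encoder(token_ids))

model = EmailClassifier(ToyEncoder())
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)  # a typical fine-tuning lr
loss_fn = nn.CrossEntropyLoss()

token_ids = torch.randint(0, 100, (8, 16))  # batch of 8 "emails", 16 tokens each
labels = torch.randint(0, 19, (8,))
for _ in range(3):  # a few fine-tuning steps
    opt.zero_grad()
    loss = loss_fn(model(token_ids), labels)
    loss.backward()
    opt.step()
```

In the actual experiments the encoder is the full pretrained model, so every fine-tuning step updates hundreds of millions of parameters, which is what makes BERT so much slower than the other approaches in the benchmark.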

With the exception of the GPU-accelerated BERT implementation that was executed on an AWS g3.8xlarge EC2 instance with two M60 GPUs, all experiments were performed on the same hardware (a 2015 MacBook Pro with 8 CPUs). The reported runtime specifies the number of minutes it took to classify all emails in the evaluation set.

Read more about the science behind Email Classification