Machine learning: a powerful spam filtering tool

Machine learning seems to be the latest IT hype. But what is all the excitement about? What exactly is machine learning? It’s the science that enables computers to learn how to perform a task without being explicitly told how to do it. In ZEROSPAM’s use of machine learning, this task is the classification of emails as spam or legitimate. To perform this task, a learning program is presented with numerous email examples, each labeled with its correct category. These are called training samples. There can be thousands or millions of training samples: the more the better. During the training process, the program builds a model of the email characteristics that help determine whether a message is spammy or legitimate. The program has learned something if it can then generalize from what it has seen and apply it to email samples it has never seen before. These are called test samples and are used to evaluate the model’s performance.
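
To make the idea of training and test samples concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sample emails, the 25% test fraction, the function names `split_samples` and `accuracy`, and the keyword rule that stands in for a trained model. It is not a description of ZEROSPAM's actual system.

```python
import random

def split_samples(samples, test_fraction=0.25, seed=0):
    """Shuffle labeled samples and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(classify, samples):
    """Fraction of samples whose predicted category matches the correct one."""
    correct = sum(1 for text, label in samples if classify(text) == label)
    return correct / len(samples)

# Hypothetical labeled emails: (text, correct category).
samples = [
    ("win a free prize now", "spam"),
    ("meeting moved to 3pm", "legitimate"),
    ("cheap pills free shipping", "spam"),
    ("please review the attached report", "legitimate"),
    ("claim your free money", "spam"),
    ("lunch tomorrow?", "legitimate"),
    ("free offer just for you", "spam"),
    ("minutes from today's call", "legitimate"),
]

train, test = split_samples(samples)

# Stand-in for a trained model (an assumption): flag any email containing "free".
def classify(text):
    return "spam" if "free" in text.split() else "legitimate"

# The model is judged only on emails it was not trained on.
print(len(train), len(test), accuracy(classify, test))
```

The important design point is that the test samples are set aside before training, so the score reflects how well the model generalizes rather than how well it memorized.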



The learning program isn’t designed to read an email the way a person would. It requires a more abstract representation of the email. Each email message is deconstructed and reconstructed following a representation called the bag of words. It’s like putting all the words contained in an email in a bag and then shaking it. No information whatsoever on word order is preserved. The information on whether a word appears more than once in a given email may or may not be preserved. This basic representation can be enhanced in a variety of ways. For example, consecutive words can be counted in groups of two or more, or characters can be counted instead of words. As you can imagine, countless more complex ways of reconstructing an email can also be used.
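
The bag of words and its variants can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tokenizer (lowercase words made of letters, digits and apostrophes) and the function names `bag_of_words`, `word_ngrams` and `char_ngrams` are hypothetical choices, not the representation ZEROSPAM uses.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words (a simplistic tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text, keep_counts=True):
    """Count the words of an email; word order is discarded.

    With keep_counts=False, only the presence of each word is recorded.
    """
    words = tokenize(text)
    return Counter(words) if keep_counts else Counter(set(words))

def word_ngrams(text, n=2):
    """Count groups of n consecutive words."""
    words = tokenize(text)
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def char_ngrams(text, n=3):
    """Count groups of n consecutive characters instead of words."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

print(bag_of_words("Free money, free prizes"))
print(bag_of_words("Free money, free prizes", keep_counts=False))
print(word_ngrams("win free money"))
print(char_ngrams("spam"))
```

Because order is thrown away, "free money" and "money free" produce exactly the same bag, which is why counting pairs of consecutive words can recover some of the lost context.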



Before training, a model is defined only in very general terms. As the learning program goes through the training email samples, it tries to establish the classification parameters that distinguish a spam email from a legitimate one. It does so by fitting the model to the known classification of the samples, so that the parameters can be deduced. These parameters are like a set of dials that need to be set correctly. The more parameters the model has, the more complex it is.

Models used in machine learning range from the simplest to the very complex. One simple model consists of giving all the words in our bag of words a score on the spammy/legitimate spectrum. A set of dials (the parameters of the model) controls how each word scores. When the number on a dial is positive, that word is on the spammy side. Conversely, if the number on the dial is negative, that word is considered to be on the legitimate side. The scores of all the different dials are then translated into a global spam/legitimate verdict using a special function. The performance that can be achieved by carefully adjusting the dials to fit the training samples is surprisingly good. The beauty of the whole process is that no manual adjustment of each individual dial is required.
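
Here is a minimal sketch of such a simple model in Python. Several things are assumptions for the sake of the example: the "special function" is taken to be the logistic (sigmoid) function, the dials are adjusted with a basic gradient step after each sample, and the four training emails, the learning rate and the number of passes are made up. The text above does not say which function or training rule is actually used.

```python
import math
from collections import Counter

def bag_of_words(text):
    return Counter(text.lower().split())

def spam_probability(weights, bias, bag):
    """Add up the dials of the words present, then squash the total into 0..1."""
    score = bias + sum(weights.get(word, 0.0) * count for word, count in bag.items())
    return 1.0 / (1.0 + math.exp(-score))

def train(samples, learning_rate=0.5, epochs=50):
    """Adjust the dials automatically to fit the labeled samples."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, is_spam in samples:
            bag = bag_of_words(text)
            error = is_spam - spam_probability(weights, bias, bag)
            # Nudge every dial involved in this email toward the correct answer.
            for word, count in bag.items():
                weights[word] = weights.get(word, 0.0) + learning_rate * error * count
            bias += learning_rate * error
    return weights, bias

# Hypothetical training samples: 1 = spam, 0 = legitimate.
samples = [
    ("free prize claim now", 1),
    ("free money offer", 1),
    ("meeting agenda attached", 0),
    ("project report attached", 0),
]

weights, bias = train(samples)
print(weights["free"], weights["attached"])
print(spam_probability(weights, bias, bag_of_words("claim your free offer")))
print(spam_probability(weights, bias, bag_of_words("agenda for the project meeting")))
```

After training, the dial for "free" ends up positive and the dial for "attached" ends up negative, without anyone setting them by hand. Words never seen in training, such as "your", simply contribute nothing to the score.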



You may also have heard about deep learning. Deep learning models use networks of artificial neurons (neural networks). An artificial neuron is a mathematical function conceived as a model of biological neurons. The simple machine learning model we have just described is actually a single artificial neuron. Artificial neurons are joined to form an artificial neural network. In a neural network, neurons don’t just compute scores coming directly from an email representation. They also take as input the results of other neurons’ computations. To form a network, neurons are arranged in layers. The layer that treats only the original email representation is the input layer. The last layer, the one that gives the spam verdict, is the output layer. The layers in between are called the hidden layers. Each hidden layer combines the scores computed by the preceding layer and passes its own results on to the next layer.
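
The way scores flow from layer to layer can be sketched as follows. This is a toy illustration, not a trained network: the four-word vocabulary, the layer sizes, every weight and bias, and the use of the sigmoid function in each neuron are all hypothetical values chosen for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of its inputs passed through a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def layer(inputs, weight_rows, biases):
    """A layer is a group of neurons that all read the same inputs."""
    return [neuron(inputs, weights, bias) for weights, bias in zip(weight_rows, biases)]

def forward(inputs, layers):
    """Feed the results of each layer into the next one."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations

# Hypothetical vocabulary: counts of ["free", "prize", "meeting", "report"].
email = [2.0, 1.0, 0.0, 0.0]

# Hypothetical, untrained weights: (one weight row per neuron, one bias per neuron).
network = [
    # Input layer: 3 neurons, each reading the 4 word counts.
    ([[1.2, 0.8, -0.5, -1.0], [0.3, 1.5, -0.7, -0.2], [-0.9, -0.4, 1.1, 0.6]],
     [0.0, -0.1, 0.2]),
    # Hidden layer: 2 neurons, each reading the 3 results of the input layer.
    ([[1.0, 0.7, -1.3], [-0.8, -0.5, 1.4]], [0.1, -0.1]),
    # Output layer: 1 neuron giving the spam verdict.
    ([[2.0, -2.0]], [0.0]),
]

verdict = forward(email, network)
print(verdict)
```

A network with a single layer containing a single neuron reduces to the simple model described earlier, which is why that model can be seen as one artificial neuron.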



The real strength of a neural network is that its hidden layers have the ability to learn by themselves how to best construct their own email representations. Those representations build on the basic representations we feed them and implicitly identify more abstract concepts that refine the model. For instance, two sentences from the same spam campaign could be phrased differently but have the same meaning, and the hidden layers can learn to treat them alike. This is called representation learning and has the potential to greatly enhance spam filtering. The only caveat is that in order to produce good results, the neural network needs a lot of email samples to work with.

Apart from parameters, a model may also use specific settings that must be chosen before training can even begin. Those are called hyperparameters. One example of a hyperparameter in our simple model is how much a dial should be moved (one way or the other) for every email sample, a value often called the learning rate. In addition to this, the hyperparameters of a neural network include the number of hidden layers and the number of neurons per layer. Many other important details must be analyzed and set in order for the model to work properly. Adjusting hyperparameters correctly can be very tricky. It’s as much an art as it is a science.
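
To see why the "how much to move a dial" hyperparameter matters, consider this small Python sketch. It assumes a single dial, a sigmoid scoring function and a gradient-style update, and all the numeric values (including the layer counts in the settings dictionary) are hypothetical.

```python
import math

# Hyperparameters are fixed before training starts (all values hypothetical).
hyperparameters = {
    "learning_rate": 0.1,     # how far a dial moves for each email sample
    "hidden_layers": 2,       # only meaningful for a neural network
    "neurons_per_layer": 16,  # only meaningful for a neural network
}

def update_dial(dial, word_count, is_spam, learning_rate):
    """Move one dial once, for one email sample."""
    predicted = 1.0 / (1.0 + math.exp(-dial * word_count))
    return dial + learning_rate * (is_spam - predicted) * word_count

# Show the same spam sample 20 times under three different learning rates.
final_dials = {}
for learning_rate in (0.01, 0.1, 1.0):
    dial = 0.0
    for _ in range(20):
        dial = update_dial(dial, word_count=1, is_spam=1, learning_rate=learning_rate)
    final_dials[learning_rate] = dial
    print(learning_rate, round(dial, 3))
```

With a small learning rate the dial barely moves after 20 samples, while a large one moves it much further. On real, noisy data, a value that is too large can make the dials swing back and forth instead of settling, which is part of what makes tuning tricky.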

Machine learning at ZEROSPAM

At ZEROSPAM, machine learning has always been part of our spam-fighting arsenal, in one form or another. Over the years, we have gradually improved our models. However, we recently decided to launch a special research project that will enable us to leverage recent advances in the fields of deep learning and machine learning. New discoveries in these fields are in the process of transforming many areas of our society. We have a responsibility to make the best use of cutting-edge technology to protect our customers from phishing, spear phishing, ransomware, fraudulent and dangerous messages, unwanted ads and all other email-borne threats. And we take that responsibility to heart.