Types of Learning That Cybersecurity AI Should Leverage

by Sohrob Kazerounian

Despite the recent explosion in machine learning and artificial intelligence (AI) research, no single method or algorithm works best in all cases. In fact, this notion has been formalized and shown mathematically in a result known as the No Free Lunch theorem (Wolpert and Macready 1997).

No single algorithm will outperform all other algorithms, across all possible problem spaces, particularly when considered under various real-world constraints like space and time complexity and availability of training data.

As such, AI systems that are designed to detect advanced cybersecurity threats must be tailored to the specific problems for which they are being deployed and should make use of the best available tools and algorithms for the types of detections they are designed to trigger.

As in other domains, AI systems in cybersecurity must be validated on the following criteria:

Can the AI system detect, cluster, classify and make predictions that would not have been possible to detect, cluster, classify or predict by humans alone?
Does the AI system make predictions and classifications that reduce the amount of human intervention and analysis required, rather than increase it?

Designing an AI system that is capable of learning to simultaneously achieve both goals requires a deep understanding of the problem space and a breadth of understanding across machine learning algorithms in general. Attempts to use monolithic solutions that uniformly learn about the myriad security threats and intrusions across modern networks are bound to fall short on the former goal, while creating too many false detections to provide any benefit towards the latter.

Similarly, the use of multiple techniques or algorithms to detect each type of threat independently requires an intricate knowledge of how each algorithm functions and the ways in which it might fall short. Incomplete knowledge of an algorithm can easily lead to subpar detection performance and to extra work for network administrators in the form of false positives.

Problem scope

Because of the wide-ranging nature of cybersecurity threats today, many algorithms should be in the arsenal for any team developing AI solutions that automate the detection of cyberattacks. These include techniques from time-series analysis, NLP, statistics, neural networks, linear algebra, and topology. Nevertheless, the first determination that needs to be made about an algorithm is whether it should learn to make predictions in a supervised or unsupervised manner.

Does there exist a dataset of labeled data from which an algorithm can learn to map inputs to labels? Or does the algorithm need to learn which inputs are malicious and which aren’t, in an unsupervised fashion, without the use of labels? If a labeled dataset exists, is it sufficiently representative of the attack surface that the algorithm is being designed for? Is the data drawn from a distribution that covers the space of network, device, user and connection types that will be observed when the system is put into production? Even if these criteria hold, are there reasons to prefer unsupervised learning methods that instead ignore the class labels altogether?

For example, consider domain generation algorithms (DGAs), in which an infected host connects to domains whose names are randomly generated so that defenders cannot simply blacklist a fixed domain. Several large datasets contain examples of known good domains (labeled as Class 0 in the table below) and known DGA domains (Class 1). Such a labeled training set can be used to learn a functional mapping from the domain name to its class (normal vs. DGA, 0 vs. 1). It is also possible to use unsupervised methods that learn the underlying statistics of normal domains and label anything out of the ordinary as having been generated by a DGA.

Unsupervised learning could be advantageous if the labeled datasets in question are out of date or contain errors. Relying on labeled data could be even more damaging if attackers have prior knowledge of the training sets and adapt their DGAs to avoid detection.

Normal domain (class label 0)  |  DGA domain (class label 1)
google.com                     |  tmwqfxrmb.ac
soundcloud.com                 |  pkmeprkwtxigpnjshcsddhkgn.in
litetech.eu                    |  nawntgvcbixvwh.net
urban-research.jp              |  gujtvpqvd.com
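
The unsupervised option can be sketched in a few lines. The example below is a minimal illustration, not a production detector: it fits a smoothed character-bigram model on normal domain names only and scores any name by its average "surprise" (negative log-probability per bigram). The class name BigramModel, the helper functions, the smoothing constant and the choice to score only the leftmost label are all assumptions made for illustration. The training list is just the four normal domains from the table, so their low scores hold by construction; a real model would be fit on a large corpus of known good domains, with an alert threshold chosen on held-out data.

```python
import math
from collections import Counter

# Characters allowed in a domain label (assumption: lowercase ASCII hostnames only).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"


def first_label(domain):
    """Return the leftmost label, e.g. 'google' for 'google.com'."""
    return domain.lower().split(".")[0]


def bigrams(label):
    """Character bigrams of a label, with start (^) and end ($) markers."""
    padded = "^" + label + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class BigramModel:
    """Add-alpha smoothed character-bigram model of normal domain names."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.pair_counts = Counter()     # how often each bigram "ab" was seen
        self.context_counts = Counter()  # how often a bigram started with "a"

    def fit(self, domains):
        for domain in domains:
            for bg in bigrams(first_label(domain)):
                self.pair_counts[bg] += 1
                self.context_counts[bg[0]] += 1
        return self

    def surprise(self, domain):
        """Average negative log-probability per bigram; higher means more unusual."""
        n_next = len(ALPHABET) + 1  # any alphabet character, or the end marker
        grams = bigrams(first_label(domain))
        total = 0.0
        for bg in grams:
            p = (self.pair_counts[bg] + self.alpha) / (
                self.context_counts[bg[0]] + self.alpha * n_next
            )
            total -= math.log(p)
        return total / len(grams)


# Toy data: the example domains from the table. No class labels are used in training.
normal_domains = ["google.com", "soundcloud.com", "litetech.eu", "urban-research.jp"]
dga_domains = [
    "tmwqfxrmb.ac",
    "pkmeprkwtxigpnjshcsddhkgn.in",
    "nawntgvcbixvwh.net",
    "gujtvpqvd.com",
]

model = BigramModel().fit(normal_domains)
for domain in normal_domains + dga_domains:
    print(f"{domain:32s} surprise = {model.surprise(domain):.2f}")
```

Because the model only ever sees normal names, it does not depend on any particular DGA family appearing in a training set. The flip side is that anything statistically unusual is flagged, including legitimate domains that simply look odd.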

Making such a determination requires an understanding of the attack under consideration. It also necessitates knowing the proper techniques for training, testing and validating models to quantify over-fitting to a specific dataset, while allowing for generalization to new and unseen data.
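
One common way to quantify over-fitting is to hold out part of the labeled data and compare accuracy on the training portion with accuracy on the unseen portion. The sketch below does this for a supervised character-bigram naive Bayes classifier. Everything in it is an assumption made for illustration: the data is synthetic (made-up syllable names as Class 0, uniformly random strings as Class 1), not a real DGA corpus, and the class name, split ratio and smoothing constant are hypothetical choices. The resulting numbers say nothing about detection rates on real traffic.

```python
import math
import random
from collections import Counter


def bigrams(label):
    """Character bigrams of a label, with start (^) and end ($) markers."""
    padded = "^" + label + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def make_synthetic_dataset(n_per_class, rng):
    """Synthetic stand-in for a labeled corpus (NOT real domains or real DGAs)."""
    syllables = ["net", "web", "cloud", "shop", "mail", "news",
                 "data", "tech", "soft", "line", "blog", "home"]
    letters = "abcdefghijklmnopqrstuvwxyz"
    examples = []
    for _ in range(n_per_class):
        normal = "".join(rng.choice(syllables) for _ in range(rng.randint(2, 3)))
        random_name = "".join(rng.choice(letters) for _ in range(rng.randint(8, 20)))
        examples.append((normal, 0))
        examples.append((random_name, 1))
    # Keep each name once so it cannot land in both the training and held-out sets.
    return list(dict(examples).items())


class NaiveBayes:
    """Multinomial naive Bayes over character bigrams with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, examples):
        self.gram_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter()
        for name, y in examples:
            self.gram_counts[y].update(bigrams(name))
            self.class_counts[y] += 1
        vocab = set(self.gram_counts[0]) | set(self.gram_counts[1])
        self.n_types = len(vocab) + 1  # +1 bucket for bigrams never seen in training
        self.gram_totals = {y: sum(c.values()) for y, c in self.gram_counts.items()}
        return self

    def log_score(self, name, y):
        # Assumes both classes occur in the training data.
        score = math.log(self.class_counts[y] / sum(self.class_counts.values()))
        denom = self.gram_totals[y] + self.alpha * self.n_types
        for bg in bigrams(name):
            score += math.log((self.gram_counts[y][bg] + self.alpha) / denom)
        return score

    def predict(self, name):
        return max((0, 1), key=lambda y: self.log_score(name, y))


def accuracy(model, examples):
    return sum(model.predict(name) == y for name, y in examples) / len(examples)


rng = random.Random(0)  # fixed seed so the split is reproducible
data = make_synthetic_dataset(200, rng)
rng.shuffle(data)
cut = int(0.75 * len(data))  # 75% for training, 25% held out
train, held_out = data[:cut], data[cut:]

model = NaiveBayes().fit(train)
train_acc = accuracy(model, train)
held_out_acc = accuracy(model, held_out)
print(f"train accuracy:    {train_acc:.3f}")
print(f"held-out accuracy: {held_out_acc:.3f}")
print(f"gap (train - held-out): {train_acc - held_out_acc:.3f}")
```

A large positive gap between the two figures is a sign that the model has memorized its training set. A small gap on a single dataset is still not proof of generalization, since the held-out portion comes from the same distribution as the training data and may not match what the system sees in production.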