Until recently, using the terms “data science” and “cybersecurity” in the same sentence would have seemed odd. Cybersecurity solutions have traditionally been based on signatures – relying on matches to patterns derived from previously identified malware to capture attacks in real time. In this context, advanced analytical techniques, big data and the other components that have come to represent “data science” have not been at the center of cybersecurity solutions focused on identifying and preventing cyber attacks.
This is not surprising. In a signature-based solution, any given malware, or a new variant of it, must be identified, sometimes reverse-engineered, and matched by a signature deployed in a product update before it becomes “detectable.” For this reason, signature-based solutions cannot prevent zero-day attacks and provide very limited benefit compared to the predictive power offered by data science.
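The limitation can be illustrated with a deliberately simplified sketch. It assumes a signature is just a SHA-256 hash of the sample; real products also use byte patterns and other rules, so this is not how any particular IPS works. The sample bytes and the names `signature_db` and `is_detected` are hypothetical.

```python
import hashlib


def signature(payload: bytes) -> str:
    """Return a SHA-256 digest, used here as a stand-in for a malware signature."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical sample bytes for illustration only (not real malware).
known_sample = b"MZ\x90\x00 example payload v1"

# The "signature database" shipped in a product update.
signature_db = {signature(known_sample)}


def is_detected(payload: bytes) -> bool:
    """A payload is detected only if its exact signature is already known."""
    return signature(payload) in signature_db


# A variant that differs from the known sample by a single appended byte.
variant = known_sample + b"\x00"

print(is_detected(known_sample))  # True: exact match with a known signature
print(is_detected(variant))       # False: the one-byte change evades the match
```

Until the variant is analyzed and its signature is added to the database, it passes undetected, which is the gap that zero-day attacks exploit.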
Among the many definitions of data science that have emerged in the last few years, “gaining knowledge from data using a scientific approach” best captures some of the different components that characterize it.
In this series of posts, we will investigate how data science can be used to extract knowledge that identifies malware and potential persistent cybersecurity threats.
The unprecedented number of companies that reported breaches in 2014 is evidence that existing cybersecurity solutions are not effective at identifying malware or detecting attackers inside an organization’s network. The list of companies that have reported breaches and exfiltration of sensitive data grows at an alarming rate: from the large-volume data breaches at Target and Home Depot earlier in 2014, to the recent breaches at Sony Entertainment, JP Morgan and the most recent attack at Anthem in February, where personally identifiable information (PII) for 80 million Americans was stolen. Breaches affect companies big and small, showing that the time has come for a different approach to the identification and prevention of malware and malicious network activity.
Three technological advances enable data science to deliver innovative cybersecurity solutions:
- Storage – the ease of collecting and storing large amounts of data to which analytics techniques can be applied (for example, distributed systems deployed as clusters).
- Computing – the prompt availability of large computing power allows easy use of sophisticated machine learning techniques to build models for malware identification.
- Behavior – the fundamental transition from identifying malware with signatures to identifying the particular behaviors an infected computer will exhibit.
Let's look in more depth at how each of these advances supports a rigorous application of data science techniques to today's cybersecurity problems.
Having a large amount of data is of paramount importance in building analytical models that identify cyber attacks. For either a heuristic model or a more refined model based on machine learning, large numbers of data samples need to be analyzed to identify the relevant characteristics that will be part of the model – this is usually referred to as “feature engineering”. Data is then needed to cross-check and evaluate the performance of the model – this should be thought of as a process of training, cross-validating and testing a given “machine learning” approach.
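The training, cross-validation and testing cycle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production approach: the data is synthetic, the single feature (outbound connections per hour) is hypothetical, and the "model" is just a threshold chosen to maximize accuracy.

```python
import random


def make_samples(n, seed):
    """Generate synthetic (feature, label) pairs; label 1 means malicious."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        malicious = rng.random() < 0.3
        # Hypothetical feature: outbound connections per hour.
        center = 60.0 if malicious else 20.0
        samples.append((rng.gauss(center, 10.0), int(malicious)))
    return samples


def accuracy(threshold, samples):
    """Fraction of samples classified correctly by 'feature > threshold'."""
    correct = sum((x > threshold) == bool(y) for x, y in samples)
    return correct / len(samples)


def fit_threshold(samples, candidates):
    """'Training': pick the candidate threshold with the best accuracy."""
    return max(candidates, key=lambda t: accuracy(t, samples))


def cross_validate(samples, candidates, k=5):
    """k-fold cross-validation: fit on k-1 folds, score on the held-out fold."""
    folds = [samples[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(accuracy(fit_threshold(training, candidates), held_out))
    return sum(scores) / k


data = make_samples(500, seed=7)
train, test = data[:400], data[400:]          # the test set is never used for fitting
candidates = [float(t) for t in range(0, 101, 5)]

cv_score = cross_validate(train, candidates)  # estimate of generalization
threshold = fit_threshold(train, candidates)  # final model from all training data
test_score = accuracy(threshold, test)        # one-time check on unseen data

print(f"cv accuracy={cv_score:.2f} threshold={threshold} test accuracy={test_score:.2f}")
```

The point of keeping the test set apart is that the reported test accuracy reflects data the model never influenced, which is why larger data volumes make each of the three stages more trustworthy.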
In a separate post, we will discuss in more detail how and why data collection is a crucial part in the data science approach to cybersecurity, and why it presents unique challenges.
One of the reasons for the recent increase in machine learning’s popularity is the ready availability of large computing resources: Moore’s law, the observation that the number of transistors on a chip doubles approximately every two years, has translated into steadily growing processing power and storage capacity.
These advances have enabled the introduction of many off-the-shelf machine learning packages that allow training and testing of machine learning algorithms of increasing complexity on large data samples. Together, abundant computing power and readily available tooling make machine learning practical for cybersecurity solutions.
There is a distinction between data science and machine learning, and we will discuss in a dedicated post how machine learning can be used in cybersecurity solutions, and how it fits into the more generic solution of applying data science in malware identification and attack detection.
The fundamental transition from signatures to behavior for malware identification is the most important enabler of applying data science to cybersecurity. Intrusion Prevention System (IPS) and Next-generation Firewall (NGFW) perimeter security solutions inspect network traffic for matches with a signature that has been created in response to analysis of specific malware samples. Minor changes to malware reduce the efficacy of IPS and NGFW solutions. However, machines infected with malware can be identified by observing their abnormal post-infection behavior. Identifying abnormal behavior requires first establishing what is normal and then using rigorous analytical methods – data science – to identify anomalies.
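As a rough sketch of the "learn normal, then flag deviations" idea, the example below builds a per-host baseline and applies a z-score test. The metric (hourly outbound connections), the sample values and the threshold of three standard deviations are all assumptions made for illustration; real behavioral models use many more signals.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize normal behavior as the mean and sample standard deviation."""
    return mean(history), stdev(history)


def is_anomalous(value, baseline, z_threshold=3.0):
    """Flag a value that lies more than z_threshold deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # No variation observed in the baseline: any change counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical hourly outbound-connection counts for one host, pre-infection.
normal_hours = [18, 22, 20, 19, 23, 21, 20, 17, 22, 21]
baseline = build_baseline(normal_hours)

print(is_anomalous(21, baseline))   # False: within the host's usual range
print(is_anomalous(140, baseline))  # True: far outside the learned baseline
```

Unlike a signature match, this check needs no prior knowledge of the specific malware, only enough history to characterize what the host normally does.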
We have identified several key aspects that innovative cybersecurity solutions need to have. These require the analysis of large data samples and the application of advanced analytical methods in order to build data-driven solutions for malware identification and attack detection. A rigorous application of data science techniques is a natural solution to this problem, and represents a dramatic advancement in cybersecurity efficacy.
Watch this short video to see how Vectra learns, detects threats and reports the highest-priority risks.