We often receive questions about our decision to anchor network visibility to network metadata, as well as how we choose and design the algorithmic models that further enrich it for data lakes and even SIEMs. The story of Goldilocks and the Three Bears offers a good analogy: she stumbles across a cabin in the woods in search of creature comforts that strike her as being just right.
Random forest, an ensemble method
The random forest (RF) model, first proposed by Tin Kam Ho in 1995, is a subclass of ensemble learning methods that is applied to classification and regression. An ensemble method constructs a set of classifiers – a group of decision trees, in the case of RF – and determines the label for each data instance by aggregating the individual outputs: a majority vote for classification, or an average of the trees’ predictions for regression.
The learning algorithm takes a divide-and-conquer approach: each tree is trained on a bootstrap sample of the data, which reduces the high variance inherent in any single decision tree. “Ensembling” a group of weaker classifiers in this way boosts performance, and the resulting aggregated classifier is a stronger model.
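The bootstrap-and-vote idea can be sketched in a few lines of pure Python, using decision stumps (depth-1 trees on a single feature) as stand-ins for full decision trees. The toy dataset, the stump learner and all parameter names below are illustrative assumptions, not part of any particular library:

```python
import random

# Toy data: one feature x in 0..10, binary label (1 when x > 5).
data = [(x, int(x > 5)) for x in range(11)]

def train_stump(sample):
    """Return the threshold t whose rule 'predict 1 if x > t' best fits the sample."""
    best_t, best_acc = 0, -1.0
    for t in range(11):
        acc = sum(int(x > t) == y for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def train_forest(data, n_trees=25, seed=0):
    """Bagging: each stump is trained on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def predict(forest, x):
    """Aggregate the weak classifiers by majority vote."""
    votes = sum(int(x > t) for t in forest)
    return int(2 * votes >= len(forest))

forest = train_forest(data)
print(predict(forest, 8), predict(forest, 2))  # 1 0
```

A real random forest additionally samples a random subset of features at each split and grows full-depth trees; scikit-learn’s RandomForestClassifier implements both, so this sketch only illustrates the bagging-and-voting half of the design.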
We live in an age where big data and data science are used to predict everything from what I might want to buy on Amazon to the outcome of an election.
The results of the Brexit referendum caught many by surprise because pollsters suggested that a “stay” vote would prevail. And we all know how that turned out.
History repeated itself on Nov. 8 when U.S. president-elect Donald Trump won his bid for the White House. Most polls and pundits predicted there would be a Democratic victory, and few questioned their validity.
The Wall Street Journal article, “Election Day Forecasts Deal Blow to Data Science,” made three important points about big data and data science:
- Dark data, data that is unknown, can result in misleading predictions.
- Asking simplistic questions yields a limited data set that produces ineffective conclusions.
- “Without comprehensive data, you tend to get non-comprehensive predictions.”
Big data is all around us. Yet it is common to hear data scientists and researchers doing analytics say that they need more data. How is that possible, and where does this eagerness to get more data come from?
Very often, data scientists need lots of data to train sophisticated machine-learning models. The same applies when using machine-learning algorithms for cybersecurity. Lots of data is needed in order to build classifiers that identify, among many different targets, malicious behavior and malware infections. In this context, the eagerness to get vast amounts of data comes from the need to have enough positive samples — such as data from real threats and malware infections — that can be used to train machine-learning classifiers.
Is the need for large amounts of data really justified? It depends on the problem that machine learning is trying to solve. But exactly how much data is needed to train a machine-learning model should always be associated with the choice of features that are used.
Time is a critical cost when it comes to detecting cyber threats and malware. The proliferation of new malware variants makes it impossible to detect and prevent zero-day threats in real time. Sandboxing takes at least 30 minutes to analyze a file and deliver a signature – and by then, threats will have spread to many more endpoints.
In big-data discussions, the value of data sometimes refers to the predictive capability of a given data model and other times to the hidden insights that emerge when rigorous analytical methods are applied to the data itself. From a cybersecurity point of view, I believe the value of data refers first to the "nature" of the data itself. Positive data – i.e. malicious network traffic from malware and cyberattacks – is far more valuable, and far harder to come by, than positive samples in most other data science problems. To better understand this, let's discuss how a wealth of network traffic data can be used to build network security models through machine learning techniques.
The main reason behind the rising popularity of data science is the incredible amount of digital data that gets stored and processed daily. Usually, this abundant data is referred to as "big data" and it's no surprise that data science and big data are often paired in the same discussion and used almost synonymously. While the two are related, the existence of big data prompted the need for a more scientific approach – data science – to the consumption and analysis of this incredible wealth of data.
Security breaches have not stopped making headlines in recent months, and while hackers still go after credit card data, the trend goes toward richer data records and the exploitation of various key assets inside an organization. As a consequence, organizations need to develop new schemes to identify and track key information assets.
The biggest recent breach in the financial industry occurred at JP Morgan Chase, with an estimated 76 million customer records and another 8 million records belonging to businesses stolen from several internal servers. At Morgan Stanley, an employee of the company’s wealth management group was fired after information from up to 10% of Morgan Stanley’s wealthiest clientele was leaked. Even more sensitive was the largest health-care breach thus far: at Anthem, over 80 million records containing personally identifiable information (PII), including Social Security numbers, were exposed. Less well known, but potentially more costly in terms of damage and litigation, is the alleged theft of trade secrets by the former CEO of Chesapeake Energy (NYSE: CHK).
Until recently, using the terms “data science” and “cybersecurity” in the same sentence would have seemed odd. Cybersecurity solutions have traditionally been based on signatures – relying on matches to patterns extracted from previously identified malware to capture attacks in real time. In this context, advanced analytical techniques, big data and the other components that have become representative of “data science” have not been at the center of cybersecurity solutions focused on the identification and prevention of cyber attacks.
This is not surprising. In a signature-based solution, any given malware – or a new flavor of it – must be identified, sometimes reverse-engineered, and have a matching signature deployed in a product update before it becomes “detectable.” For this reason, signature-based solutions cannot prevent zero-day attacks and provide very limited benefit compared to the predictive power offered by data science.
In the previous posts, we have examined the insider threat from various angles and we have seen that insider threat prevention involves the information security, legal and human resources (HR) departments of an organization. In this post, we want to examine what information security departments can actually do to detect ongoing insider threats, and even prevent them before they happen.
The literal needle in the haystack
Overall, insider threats represent only a small proportion of employee behavior. And while only the ‘black swan’ incidents become public knowledge, minor incidents such as theft of IP or customer contact lists will add up to major costs for organizations.
In addition, insiders are by default authorized to be inside the network and are both granted access to and make use of an organization’s key resources. Given the vast number of access patterns visible in an organization’s network, how is one to know which ones reflect negligent, harmful or malicious behavior?