Cybersecurity and machine learning: The right features can lead to success

Posted by David Pegna on Sep 15, 2015 9:52:24 AM

Cybersecurity and machine learning: The right features can lead to successBig data is around us. However, it is common to hear from a lot of data scientists and researchers doing analytics that they need more data. How is that possible, and where does this eagerness to get more data come from?

Very often, data scientists need lots of data to train sophisticated machine-learning models. The same applies when using machine-learning algorithms for cybersecurity. Lots of data is needed in order to build classifiers that identify, among many different targets, malicious behavior and malware infections. In this context, the eagerness to get vast amounts of data comes from the need to have enough positive samples — such as data from real threats and malware infections — that can be used to train machine-learning classifiers.

Is the need for large amounts of data really justified? It depends on the problem that machine learning is trying to solve. But exactly how much data is needed to train a machine-learning model should always be associated with the choice of features that are used.

Features are the set of information that’s provided to characterize a given data sample. Sometimes the number of features available is not directly under control because it comes from sophisticated data pipelines that can’t be easily modified. In other cases, it’s relatively easy to access new features from existing data samples, or properly pre-process data to build new and more interesting features. This process is sometimes known as "feature engineering."

This article was originally published as part of the IDG Contributor Network.

Topics: Data Science, cyber security

Subscribe to the Vectra Blog

Recent Posts

Posts by Topic

Follow us