Why Network Metadata Is Just Right for Your Data Lake by Vectra AI Security Research team

We often receive questions about our decision to anchor network visibility to network metadata as well as how we choose and design the algorithmic models to further enrich it for data lakes and even SIEMs. The story of Goldilocks and the Three Bears offers a pretty good analogy as she stumbles across a cabin in the woods in search of creature comforts that strike her as being just right.

As security operations teams search for the best threat data to analyze in their data lakes, network metadata often lands in the category of being just right. Here’s what I mean: NetFlow offers incomplete data and was originally conceived to manage network performance. PCAPs are performance-intensive and expensive to store in a way that ensures fidelity in post-forensics investigations. The tradeoffs between NetFlow and PCAPs leaves security practitioners in an untenable state.

NetFlow: Too little

As the former chief of analysis for US-CERT recommended in a recent article, “Many organizations feed a steady stream of Layer 3 or Layer 4 data to their security teams. But what does this data, with its limited context, really tell us about modern attacks? Unfortunately, not much.”

That’s NetFlow.

Originally designed for network performance management and repurposed for security, NetFlow fails when used in forensics scenarios. What’s missing are attributes like port, application, and host context that are foundational to threat hunting and incident investigations. What if you need to go deep into the connections themselves? How do you know if there are SMBv1 connection attempts, the main infection vector for WannaCry ransomware? You might know if a connection on Port 445 exists between hosts, but how do you see into the connection without protocol-level details?

You can’t. And that’s the problem with NetFlow.

PCAPs: Too much

Used in post-forensic investigations, PCAPs are handy for payload analysis and to reconstruct files to determine the scale and scope of an attack and identify malicious activity.

But as Gartner analysts Augusto Barros, Anton Chuvakin, and Anna Belak wrote in their research note “Applying Network-Centric Approaches for Threat Detection and Response,” published on March 18, 2019 (ID: G00373460), “Years ago, network forensic tools (NFTs) sought to collect raw packets at large scale, but today’s fast networks made this approach impractical for nearly all organizations.”

An analysis of full PCAPs in Security Intelligence magazine explains how the simplest networks would require hundreds of terabytes, if not petabytes, of storage for PCAPs. Because of that—not to mention the exorbitant cost—organizations that rely on PCAPs rarely store more than a week’s worth of data, which is useless when you have a large data lake. A week’s worth of data is also insufficient when you consider that security operations teams don’t often know for weeks or months that they’ve been breached.

Add to that the huge performance degradation—I mean frustratingly slow—when conducting post-forensic investigations across large data sets. Why would anyone pay to store PCAPs in return for lackluster performance?

Network metadata: Just right

The collection and storage of network metadata strikes a balance that is just right for data lakes and SIEMs. Barros, Chuvakin and Belak later wrote in the same research note, “Hence, rich metadata and file capture deliver much better investigative value—it is easier and faster to find things—at a much lower computational and storage cost.”

Zeek-formatted metadata gives you the proper balance between network telemetry and price/performance. You get rich, organized and easily searchable data with traffic attributes relevant to security detections and investigation use-cases (e.g. the connection ID attribute). Metadata also enables security operations teams to craft queries that interrogate the data and lead to deeper investigations. From there, progressively targeted queries can be constructed as more and more attack context is extracted.

And it does so without the performance and big-data limitations common with PCAPs. Network metadata reduces storage requirements by over 99%, compared to PCAPs. And you can selectively store the right PCAPs, requiring them only after metadata-based forensics have pinpointed payload data that is relevant.