When applying data science to cyber security, providing insightful and unbiased analytics on any data presents a variety of challenges. To name but a few, the supporting data platform must be ready to ingest data in virtually any format, deal with changing data rates and ultimately cater for a broad range of analytical use cases.
Apache Spark™ is an obvious choice as the backbone of an ETL architecture. Using Spark allows us to leverage in-house experience with the Hadoop ecosystem. While Apache Hadoop® is invaluable for data analysis and modelling, Spark enables a near real-time processing pipeline via its low-latency capabilities and streaming API.
Consider a classic threat detection use case: correlating technical Threat Intelligence, i.e. Indicators of Compromise (IOCs) such as known bad IP addresses, with log data such as web proxy logs. Rather than discussing the details of malicious content identification (missing FQDN, domain masquerading, typosquatting etc.), let’s focus on an actual end-to-end workflow built on Spark for Threat Intelligence sweeping.
Proxy logs are continuously intercepted by a remote Apache Flume™ agent and streamed to a Kafka channel via a Flume agent local to the cluster.
The ETL framework makes use of Spark's seamless integration with Kafka to extract new log lines from the incoming messages. With streaming analysis, data can be processed as it becomes available, thus reducing the time to detection.
The workflow revolves around DStreams, a convenient abstraction that represents a stream as a sequence of micro batches, each of which is an RDD.
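To make the micro batch idea concrete, here is a minimal sketch in plain Python rather than Spark itself. The function name `micro_batches`, the `(timestamp, line)` event shape and the one-second interval are all assumptions chosen for illustration; in Spark Streaming the batch interval is set on the streaming context and each batch would be an RDD rather than a list.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def micro_batches(
    events: Iterable[Tuple[float, str]], interval: float
) -> Iterator[Tuple[int, List[str]]]:
    """Group (timestamp, line) pairs into consecutive fixed-width batches.

    Assumes events arrive ordered by timestamp (hypothetical input shape).
    Intervals with no events are simply skipped here, whereas a real
    streaming engine would still emit an empty batch for them.
    """
    for batch_id, group in groupby(events, key=lambda e: int(e[0] // interval)):
        yield batch_id, [line for _, line in group]


# Four log lines arriving over a few seconds, batched per second.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = list(micro_batches(events, interval=1.0))
```

Each downstream step (parsing, rule matching, storage) then operates on one batch at a time, which is what keeps the latency bounded by the batch interval.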
The modified stream of textual data is ready to be passed down the pipeline. Log entries are interpreted and transformed into database records. Entries that fail to meet the expectations set by a schema are marked as invalid.
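The parsing step might look roughly like the sketch below. It is not the framework's actual parser: the whitespace-separated log layout, the `ProxyRecord` fields and the `parse_line` name are assumptions made for illustration. The key point is that a bad line never raises; it comes back as a record flagged invalid, with the raw text preserved. In PySpark a function of this shape could be applied per line with `DStream.map`.

```python
import ipaddress
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ProxyRecord:
    """Hypothetical schema for one proxy log entry."""
    raw: str
    valid: bool
    timestamp: Optional[datetime] = None
    client_ip: Optional[str] = None
    method: Optional[str] = None
    host: Optional[str] = None
    status: Optional[int] = None
    error: Optional[str] = None


def parse_line(line: str) -> ProxyRecord:
    """Parse '<iso-timestamp> <client-ip> <method> <url> <status>' (assumed layout)."""
    try:
        ts, ip, method, url, status = line.split()  # wrong field count -> ValueError
        host = urlsplit(url).hostname
        if not host:
            raise ValueError("URL has no host")
        return ProxyRecord(
            raw=line,
            valid=True,
            timestamp=datetime.fromisoformat(ts),
            client_ip=str(ipaddress.ip_address(ip)),
            method=method.upper(),
            host=host,
            status=int(status),
        )
    except ValueError as exc:
        # Schema violations are recorded on the record instead of being raised.
        return ProxyRecord(raw=line, valid=False, error=str(exc))


good = parse_line("2016-01-01T10:00:00 10.0.0.5 GET http://example.com/index.html 200")
bad = parse_line("garbage line")
```

Keeping the raw line on invalid records means nothing is lost: rejected entries can still be stored and inspected later.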
Successfully parsed input is then scrutinized by a Threat Intelligence rules engine.
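As a rough illustration of what one such rule could do, the sketch below checks a record's destination host against sets of known bad IPs and domains. The function `match_iocs`, the dictionary record shape and the indicator values (a documentation-range IP address and a reserved `.example` domain) are all hypothetical; a real rules engine would load indicators from a Threat Intelligence feed and support many more rule types.

```python
from typing import Dict, FrozenSet, List

# Placeholder indicators: a documentation-range IP and a reserved .example domain.
BAD_IPS: FrozenSet[str] = frozenset({"203.0.113.7"})
BAD_DOMAINS: FrozenSet[str] = frozenset({"evil.example"})


def match_iocs(
    record: Dict[str, str], bad_ips: FrozenSet[str], bad_domains: FrozenSet[str]
) -> List[str]:
    """Return the indicators hit by the record's destination host."""
    host = record.get("host", "").lower().rstrip(".")
    if host in bad_ips:
        return [f"ip:{host}"]
    hits = []
    labels = host.split(".")
    # Check the host itself and each parent domain with at least two labels,
    # so that 'cdn.evil.example' matches an indicator for 'evil.example'.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in bad_domains:
            hits.append(f"domain:{candidate}")
    return hits


ip_hits = match_iocs({"host": "203.0.113.7"}, BAD_IPS, BAD_DOMAINS)
domain_hits = match_iocs({"host": "cdn.evil.example"}, BAD_IPS, BAD_DOMAINS)
clean_hits = match_iocs({"host": "good.example"}, BAD_IPS, BAD_DOMAINS)
```

Set lookups keep the per-record cost low, which matters when every record in every micro batch is swept against the full indicator list.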
Once analyzed, records can be stored in any number of data stores, e.g. HDFS or HBase, for downstream analysis and presentation. Exceptions, and more generally any lines that failed to parse, can also be passed directly to the persistence layer.
In summary, Apache Spark has evolved into a full-fledged ETL engine with DStream and RDD as ubiquitous data formats suitable both for streaming and batch processing. Only a thin abstraction layer is needed to come up with a customizable framework. The example below depicts the idea of a fluent API backed by Apache Spark.
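One way to picture such a thin layer is the plain Python sketch below, in which an ordinary iterable stands in for a DStream or RDD. The `Pipeline` class and its `transform`, `where` and `store` methods are hypothetical names, not part of Spark or of the framework described here. As with Spark transformations and actions, the chained stages are lazy and only run when the terminal `store` call is made.

```python
from typing import Any, Callable, Iterable, List


class Pipeline:
    """Hypothetical fluent wrapper; an iterable stands in for a DStream/RDD."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = source
        self._stages: List[Callable[[Iterable[Any]], Iterable[Any]]] = []

    def transform(self, fn: Callable[[Any], Any]) -> "Pipeline":
        # Lazy, like a Spark map: nothing runs until store() is called.
        self._stages.append(lambda items: map(fn, items))
        return self

    def where(self, predicate: Callable[[Any], bool]) -> "Pipeline":
        # Lazy, like a Spark filter.
        self._stages.append(lambda items: filter(predicate, items))
        return self

    def store(self, sink: Callable[[Any], None]) -> int:
        """Terminal step: run all stages, push each item to sink, return the count."""
        items: Iterable[Any] = self._source
        for stage in self._stages:
            items = stage(items)
        count = 0
        for item in items:
            sink(item)
            count += 1
        return count


stored: List[List[str]] = []
count = (
    Pipeline(["GET evil.example", "GET good.example", ""])
    .where(bool)                                       # drop empty lines
    .transform(str.split)                              # parse into fields
    .where(lambda parts: parts[1] == "evil.example")   # toy detection rule
    .store(stored.append)                              # persist the matches
)
```

Because each method returns the pipeline itself, the extract, parse, analyze and store steps read top to bottom in the same order as the workflow described above.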