Your Elastic stack is up and running, and you’re using Logstash for SIEM purposes. But you’re overwhelmed to discover that while every new system produces heaps and heaps of logs, each vendor uses its own data format and its own set of values for describing events, so one product’s logs are materially different from another’s. Adherence to standards seems to range from slight to non-existent…
To make things even more challenging, the data itself cannot simply be taken at “face value”. Many a time, what was written does not present a faithful image of what was really intended, and security analysts are left to sift through the sea of data, in a laborious effort to parse and extract the real information.
I have often witnessed how this problem, which may be referred to as the “data-information gap”, can take a heavy toll. The process of data normalization can be tedious, time-consuming and error-prone.
For example, let’s take a look at a typical, undigested piece of log, generated in this case by a Symantec product:
CEF:0|Symantec|Threat Isolation|1.0|Network Request|Network Request|6|rt=Jun 03 2018 12:40:48.123 UTC end=Jun 03 2018 12:40:48.123 UTC start=Jun 03 2018 12:40:48.123 UTC externalId=fcbc2792-a604-40e0-833f-9a3c9cf364ec cat=Network Request sproc=Chrome sourceServiceName=Threat Isolation Engine request= https[://]usermatch[.]krxd[.]net/um/v2?partner=vdna requestMethod=GET requestContext= src=10.0.80.80 spt= dst=126.96.36.199 dhost= usermatch-krxd-net dlat=54.0 dlong=-2.0 dpt=80 requestClientApplication=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36 fname= filePath= fileType= fileHash= fileId= fsize= suser=test sntdom= app=http in=0 out=5242880 act=Isolate dvchost=fireglass1 reason= outcome= cn1=70 cn2=200 cn2Label=Response Status Code cn3=0 cn3Label=URL Risk Level cs1=1 cs1Label=Matched Rule ID cs2Label=Isolation Unique Session ID cs3=Policy Rule cs3Label=Action Reason cs4=Remote Access Tools cs4Label=URL Categories cs5=Default Rule:default cs5Label=Rule Name (At Log Time) cs6= http[://]www[.]bbc[.]com/earth/world cs6Label=Website URL SymantecThreatIsolationTenantId=
A beautiful chunk of data - what would you do if you had to analyze and try to extract information from this log?
Some information can be extracted without expertise, such as the timestamp in the first field after the header: 6|rt=Jun 03 2018, and fields like the source and destination IP addresses: src=10.0.80.80 spt= dst=126.96.36.199 (we’re all programmers after all!).
But what about the rest? What does cn1Label=Policy Version mean? Or cs2=a42070ac1bb18de7_17641_6? For this you’d need to consult your security gurus - and guess what, they’re always busy…
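To make the mechanical part concrete, here is a minimal Python sketch of how a CEF line like the one above could be split into its pipe-delimited header and its key=value extension, and how each csN/cnN value could be paired with its Label field. The names parse_cef, HEADER_NAMES and EXTENSION_RE are illustrative and not taken from the empow repository, the sample is a shortened excerpt of the log above, and the sketch assumes that neither escaped pipes nor escaped equals signs appear in the record.

```python
import re

# Shortened excerpt of the Symantec sample above (most extension fields omitted).
SAMPLE = (
    "CEF:0|Symantec|Threat Isolation|1.0|Network Request|Network Request|6|"
    "rt=Jun 03 2018 12:40:48.123 UTC src=10.0.80.80 spt= dpt=80 requestMethod=GET "
    "act=Isolate cn2=200 cn2Label=Response Status Code "
    "cs4=Remote Access Tools cs4Label=URL Categories"
)

# Illustrative names for the seven CEF header fields.
HEADER_NAMES = ["cef_version", "vendor", "product", "device_version",
                "signature_id", "name", "severity"]

# A key is a word followed by "="; its value runs until the next " key=" or the end.
EXTENSION_RE = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_cef(line):
    """Split one CEF record into header fields, extension fields and labeled fields."""
    parts = line.split("|", 7)  # 7 header fields, then the extension
    if len(parts) < 8 or not parts[0].startswith("CEF:"):
        raise ValueError("not a CEF record")
    parts[0] = parts[0][len("CEF:"):]
    event = dict(zip(HEADER_NAMES, parts[:7]))

    extension = {key: value.strip() for key, value in EXTENSION_RE.findall(parts[7])}

    # Pair custom fields with their labels, e.g. cs4 + cs4Label.
    labeled = {}
    for key, label in extension.items():
        if key.endswith("Label"):
            base = key[:-len("Label")]
            if base in extension:
                labeled[label] = extension[base]

    event["extension"] = extension
    event["labeled"] = labeled
    return event


if __name__ == "__main__":
    parsed = parse_cef(SAMPLE)
    print(parsed["vendor"], parsed["labeled"])
```

Splitting the record is the easy part. Knowing what a value such as “Remote Access Tools” should mean for an analyst is where the security expertise comes in.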
Log and order
If you’ve read this far, I imagine you’d agree with me that writing parsers can be hard, and analyzing the logs themselves even harder. empow takes care of this for you, and publishes its parsers for free use in an open-source GitHub repository under the Apache License 2.0. In the spirit of the open-source movement, we constantly encourage our open-source communities to take part and help keep the plugins updated and nifty.
Together with these powerful communities of security specialists, parsers are created and maintained for the different logs generated by various products and vendors, based on log samples and vendors’ documentation. Each product has its own parser pipeline that extracts relevant information using various Logstash filter plugins. The repository contains a readme.md file, detailing exactly how to install and configure the plugins in Logstash.
In order to simplify the usage of Elastic for SIEM purposes, in early 2019 Elastic introduced the Elastic Common Schema or ECS, “a new specification that provides a consistent and customizable way to structure your data in Elasticsearch, facilitating the analysis of data from diverse sources”. While this has not eliminated the need for data normalization, it was an important step towards standardization.
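As a rough illustration of what such a common schema buys you, the Python sketch below renames a handful of CEF extension keys to ECS-style nested fields and sets aside everything it does not recognize. The mapping table is an assumption made for this example, not the mapping used by the empow parsers, and to_ecs is a hypothetical helper name.

```python
# Assumed CEF-key -> ECS-field mapping, for illustration only.
CEF_TO_ECS = {
    "src": "source.ip",
    "dst": "destination.ip",
    "dpt": "destination.port",
    "requestMethod": "http.request.method",
    "suser": "user.name",
    "requestClientApplication": "user_agent.original",
}


def to_ecs(extension):
    """Return (ecs_document, leftovers) for a dict of CEF extension fields."""
    document = {}
    leftovers = {}
    for key, value in extension.items():
        target = CEF_TO_ECS.get(key)
        if target is None:
            # No ECS counterpart in our table: still needs hand-written normalization.
            leftovers[key] = value
            continue
        *parents, leaf = target.split(".")
        node = document
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return document, leftovers


if __name__ == "__main__":
    doc, rest = to_ecs({"src": "10.0.80.80", "dpt": "80",
                        "requestMethod": "GET", "act": "Isolate"})
    print(doc)
    print(rest)
```

Whatever ends up in the leftovers is exactly the part that a schema alone does not solve.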
Back to our Symantec log example. As long as a field has no counterpart in ECS, we need to do the work of normalizing it ourselves. A full breakdown of the log would be extremely long, but even this simple example requires some security expertise to implement. The rules below outline part of the logic in a “.conf” file, which is loaded into Logstash to digest the Symantec log.
IF "Web domain" eq <IP>
ELSE IF "Web domain" eq <string>
THEN USE as is
IF "Web domain" = null
IF "Good" THEN Benign
IF "Suspicious" THEN Suspicious
IF "Bad" THEN Malicious
ELSE THEN null
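One possible reading of the rules above, written as a minimal Python sketch rather than as Logstash configuration. The rules do not say what happens when the web domain is an IP address or is missing, so the choices made here (keep the IP in a separate slot, return nothing for a missing value) are assumptions for illustration only. The function names are hypothetical as well.

```python
import ipaddress

# Vendor verdicts mapped to normalized values, as listed in the rules above.
REPUTATION = {"Good": "Benign", "Suspicious": "Suspicious", "Bad": "Malicious"}


def normalize_reputation(raw):
    """Return the normalized verdict, or None for any unlisted value."""
    return REPUTATION.get(raw)


def normalize_web_domain(raw):
    """Return a (domain, ip) pair for the raw "Web domain" value."""
    if raw is None or not raw.strip():
        return None, None  # assumption: a missing value yields no domain and no IP
    value = raw.strip()
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return value, None  # a plain string is used as is
    return None, value  # assumption: an IP is kept apart from the domain


if __name__ == "__main__":
    print(normalize_web_domain("usermatch-krxd-net"))
    print(normalize_web_domain("10.0.80.80"))
    print(normalize_reputation("Bad"))
```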
As we can see, one must be very precise in order to achieve normalized data that can be effectively processed. But as the saying goes, “don’t reinvent the wheel” - we’re here to provide you with solutions, save you time and energy, and allow you to focus on securing your organization, instead of crunching infinite logs. And so are our global communities of open source security experts.
I’d love to hear from you with any questions or comments at firstname.lastname@example.org. If you’d like to learn about more of our open-source tools for the Elastic community, visit our website at https://empow.co/opensource/.