In this four-part blog series, “Lessons learned from building Cybersecurity Lakehouses,” we are discussing a number of challenges organizations face with data engineering when building out a Lakehouse for cybersecurity data, and offer some solutions, tips, tricks, and best practices that we have used in the field to overcome them.
In part one, we began with uniform event timestamp extraction. In part two, we looked at how to spot and handle delays in log ingestion. In this third blog, we tackle some of the issues related to parsing semi-structured machine-generated data, using the medallion architecture as our guiding principle.
This blog will outline some of the challenges faced when parsing log-generated data and offer some guidance and best practices for generating data captured and parsed accurately for analysts to gain insights into abnormal behavior, potential breaches, and indicators of compromise. By the end of this blog, you will have a solid understanding of some of the issues faced when capturing and parsing data into the Cybersecurity Lakehouse and some techniques we can use to overcome them.
Introduction
Parsing machine-generated logs in the context of cybersecurity is the cornerstone of understanding data and gaining visibility and insights from activity within your organization. Parsing can be a gnarly and challenging task, but it is a necessary one if data is to be analyzed, reported on, and visualized. Without generating accurate and structured formats, organizations are blind to the many traces of information left in machine-generated data by cyber attacks.
Parsing Challenges
There are many challenges faced when capturing raw data, mainly when machine-generated data is in a streaming format as is the case with many sources.
Timeliness: Data may arrive delayed or out of order. We discussed this in <<part two>> if you have been following the blog series. Initial data capture can be brittle, and making only the minimum transformation actions before an initial write is necessary.
Data Format: Log files are typically read by a forwarding agent and transmitted to their destination (possibly via third party systems). The same data may be formatted differently depending on the agent or intermediary hosts. For instance, a JSON record written directly to cloud storage will not be wrapped with any other system information. However, a record received by a Kafka cluster will have the JSON record encapsulated in a Kafka wrapper. This makes parsing the same data from different systems an adaptive process.
Data Inconsistency: Generating schemas for incoming data can lead to parsing errors. Fields may not exist in records they are expected to appear in, or unpacking nested fields may lead to duplicate column names, which must be appropriately handled.
Metadata Extraction: To understand the origins of data sources, we need a mechanism to extract imply, or transmit metadata fields such as:
- Source host
- File name (if file source)
- Sourcetype for parsing purposes
Wire data may have traversed several network systems, and the originating network host is no longer apparent. File data may be stored in directory structures partitioned by network host names, or originating sources. Capturing this information at the initial ingest is required to completely understand our data.
Retrospective Parsing: Critical incident response or detection data may require extracting only parts of a string.
Event Time: Systems output event timestamps in many different formats. The system must accurately parse timestamps. Check out part one of this blog series for detailed information about this topic.
Changing log formats: Log file formats change frequently. New fields are added, old ones go away, and standards for field naming are just an illusion!
Parsing Principles
Given the challenges outlined above, parsing raw data is a brittle task and needs to be treated with care and methodically. Here are some guiding principles for capturing and parsing raw log data.
Think about the parsing operations occurring in at least three distinct stages:
- Capture the raw data and parse only what is necessary to store the data for further transformations
- Extract columns from the captured data
- Filter and normalize events into a Common Information Model
- Optionally, enrich data either before or after (or both) the normalization process
Initial Data Capture
The initial read of data from log files and streaming sources is the most important and brittle part of data parsing. At this stage, make only the bare minimum changes to the data. Changes should be limited to:
- Exploding blobs of data into a single record per event
- Metadata extraction and addition (_event_time, _ingest_time, _source, _sourcetype, _input_filename, _dvc_hostname)
Capturing raw unprocessed data in this way allows for data re-ingestion at a later point should downstream errors occur.
Extracting Columns
The second phase focuses on extracting columns from their original structures where needed. Flattening STRUCTS and MAPs ensures the normalization phase can be completed easily without the need for complex PySpark code to access key information required for cyber analysts. Column flattening should be evaluated on a case-by-case basis, as some use cases may also benefit from remaining MAP<STRING, STRING> formats.
Event Normalization
Typically, a single data source can represent tens or hundreds of event types within a single feed. Event normalization requires filtering specific event types into an event-specific Common Information Model. For example, a CrowdStrike data source may have endpoint process activity that should be filtered into a process-specific table but also has Windows Management Instrumentation (WMI) events that should be filtered and normalized into a WMI-specific table. Event normalization is the topic of our next blog. Stay tuned for that.
Databricks recommends a data design pattern to logically organize these tasks in the Lakehouse called the ‘Medallion Architecture‘.
Parsing Example
The example below shows how to put into practice the parsing principles applied to the Apache access_combined log format.
Below, we read the raw data as a text file.

As described above, we want to keep any transformations to extracting or adding metadata needed to represent the data source. Since this data source is already represented as one row per event, no explode functionality is required.
source = "apache"
sourcetype = "access_combined"
timestamp_col = "value"
timestamp_regex = '^([^ ]*) [^ ]* ([^ ]*) \[([^\]]*)\]'
df = df.select(
    current_timestamp().cast(TimestampType()).alias("_ingest_time"),
    to_timestamp(unix_timestamp(col(timestamp_col), timestamp_format).cast("timestamp"),
"dd-MM-yyyy HH:mm:ss.SSSZ").alias("_event_time"),
    lit(source).alias("_source"),
    lit(sourcetype).alias("_sourcetype"),
    input_file_name().alias("_source_file"),
    "*").withColumn("_event_date", to_date(col("_event_time")))In this command, we extract the _event_time only from the record and add new columns of metadata, capturing the input_file_name
At this stage, we should write the bronze delta table before making any transformations to extract the columns from this data source. Once done, we can create a silver table by applying a regular expression to extract the individual columns.
ex = r"^([\d.]+) (\S+) (\S+) \[.+] \"(\w+) (\S+) .+\" (\d{3}) (\d+) \"(.+)\" \"(.+)\"?$"
df = (df.select('*',
                 regexp_extract("value", ex, 1).alias('host'),
                 regexp_extract("value", ex, 2).alias('user'),
                 regexp_extract("value", ex, 4).alias('method'),
                 regexp_extract("value", ex, 5).alias('path'),
                 regexp_extract("value", ex, 6).alias('code'),
                 regexp_extract("value", ex, 7).alias('size'),
                 regexp_extract("value", ex, 8).alias('referer'),
                 regexp_extract("value", ex, 9).alias('agent')
                 )
                 .withColumn("query_parameters", expr("""transform(split(parse_url(path,
"QUERY"), "&"), x -> url_decode(x))""")))In this command we parse out the individual columns and return a dataframe that can be used to write to the silver stage table. At this point, a well-partitioned table can be used for performing queries and creating dashboards, reports, and alerts. However, the final stage for this datasource should be to apply a common information model normalization process. This is the topic of the next part of this blog series. Stay tuned!
Tips and best practices
Along our journey helping customers with log source parsing, we have developed a number of tips and best practices, some of which are presented below.
- Log formats change. Develop reusable and version-controlled parsers.
- Use the medallion architecture to parse and transform dirty data into clean structures.
- Allow for schema evolution in your tables.
- Machine-generated data is messy and changes often between software releases.
- New attack vectors will require extraction of new columns (some you may want to write to tables, not just create on the fly).
 
- Think about storage retention requirements at the different stages of the medallion architecture. Do you need to keep the raw capture as long as the silver or gold tables?
Conclusion
Parsing and normalizing semi-structured machine-generated data is a requirement for obtaining and maintaining good security posture. There are a number of factors to consider, and the Delta Lake architecture is well-positioned to accelerate cybersecurity analytics. Some features not discussed in this blog are schema evolution, data lineage, data quality, and ACID transactions, which are left for the interested reader.
Get in Touch
If you are interested to learn more about how Databricks cyber solutions can empower your organization to identify and mitigate cyber threats, reach out to [email protected] and check out our Lakehouse for Cybersecurity Applications webpage.




