Description
This dataset contains network traffic and vulnerability scan reports for networks with different characteristics:
- vlan11 is a public network with low traffic and ~30 hosts
- cloud is a public network with moderate traffic and ~100 hosts from a cloud environment
- vlan23 is a private network with high traffic and ~200 hosts
Data formats
- NetFlow data is provided in CSV, JSON, and RAW formats for a 30-day period
- security scan reports are provided in CSV, filtered CSV, HTML, and XML formats
Data is compressed in many cases to save repository space and network bandwidth. Decompress with
xz
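Where the xz command-line tool is not available, the same files can be decompressed with Python's standard lzma module. The following is a minimal sketch, not part of the dataset tooling; the function name decompress_xz and the example file name are hypothetical.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src: Path) -> Path:
    """Decompress an .xz file next to itself and return the new path."""
    if src.suffix != ".xz":
        raise ValueError(f"expected a .xz file, got {src.name}")
    dst = src.with_suffix("")  # e.g. flows.csv.xz -> flows.csv
    # Stream in chunks so large captures do not have to fit in memory.
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, decompress_xz(Path("flows.csv.xz")) would write flows.csv in the same directory, assuming a file with that (hypothetical) name exists.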
Anonymization
The anonymized dataset comprises a collection of network traffic and domain-related information derived from the described environments.
The source information includes sensitive IPv4 addresses and domain hostnames, vital for network analysis, vulnerability assessments, and security research.
However, due to the sensitive nature of the data, anonymization is employed to protect personal and organizational privacy.
Anonymization Methodology
The main objective is to preserve the utility of network patterns and relationships while masking specific addresses so that they cannot be traced back to individual devices or networks. To ensure privacy while retaining the dataset's analytical value, the following anonymization techniques are applied:
IPv4 Address Anonymization
Each IPv4 address in the dataset has its first two octets anonymized, using a consistent mapping system that replaces these octets with random, uniquely assigned numbers.
This transformation is deterministic, meaning that the same original address segments always map to the same anonymized segments, thus preserving relationships and patterns critical for analysis.
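The behaviour described above can be illustrated with a short sketch. This is not the tool used to produce the dataset. It assumes that the first two octets are mapped together as a pair and that uniqueness means no two original pairs share a replacement; the class name PrefixAnonymizer and the seed parameter are hypothetical.

```python
import ipaddress
import random


class PrefixAnonymizer:
    """Replace the first two octets of IPv4 addresses via a consistent mapping."""

    def __init__(self, seed=None):
        rng = random.Random(seed)
        # Shuffled pool of all 65536 possible (octet1, octet2) pairs;
        # popping from it guarantees each replacement is assigned only once.
        self._pool = list(range(256 * 256))
        rng.shuffle(self._pool)
        self._mapping = {}

    def anonymize(self, address: str) -> str:
        # IPv4Address validates the input; .packed yields the four octets.
        o1, o2, o3, o4 = ipaddress.IPv4Address(address).packed
        key = (o1, o2)
        if key not in self._mapping:
            self._mapping[key] = divmod(self._pool.pop(), 256)
        a, b = self._mapping[key]
        # The last two octets are kept, so hosts stay distinguishable.
        return f"{a}.{b}.{o3}.{o4}"


anonymizer = PrefixAnonymizer(seed=1)
first = anonymizer.anonymize("192.168.1.10")
second = anonymizer.anonymize("192.168.7.20")
# Both addresses share the original 192.168 prefix, so they also share
# the anonymized prefix, while the host part (1.10, 7.20) is unchanged.
```

In this sketch the mapping lives in memory, so repeatable output across runs would require the same seed and input order, or persisting the mapping table.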
Domain Name Anonymization
The hostnames within domain names are anonymized by substituting them with a randomly generated string.
These new hostnames follow a structured anonymized format: <randomname>.random.xyz.
As with IP anonymization, the mapping is consistent across the dataset: each original hostname is always replaced with the same anonymized version.
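A consistent hostname mapping of this kind might look like the following sketch. It is an illustration only: the class name HostnameAnonymizer, the label length, the lowercase alphabet, and the case-insensitive normalization of hostnames are all assumptions, since the source only specifies the <randomname>.random.xyz format.

```python
import secrets
import string


class HostnameAnonymizer:
    """Map each hostname to a stable <randomname>.random.xyz replacement."""

    SUFFIX = "random.xyz"

    def __init__(self, length: int = 10):
        self._length = length  # assumed label length, not given in the source
        self._mapping = {}
        self._used = set()

    def _new_label(self) -> str:
        # Draw random labels until one is found that has not been assigned.
        while True:
            label = "".join(
                secrets.choice(string.ascii_lowercase)
                for _ in range(self._length)
            )
            if label not in self._used:
                self._used.add(label)
                return label

    def anonymize(self, hostname: str) -> str:
        # Assumption: names are compared case-insensitively and without
        # a trailing dot, as DNS names usually are.
        key = hostname.strip().lower().rstrip(".")
        if key not in self._mapping:
            self._mapping[key] = f"{self._new_label()}.{self.SUFFIX}"
        return self._mapping[key]


anonymizer = HostnameAnonymizer()
alias = anonymizer.anonymize("mail.example.org")
# Repeated lookups of the same hostname return the same alias.
```

Because the labels are random rather than derived from the input, the original hostname cannot be recovered from the alias without the mapping table.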
Privacy Considerations
- Consistency: The anonymization process employs a reproducible mapping system, ensuring that every occurrence of a unique IP address segment or domain hostname is anonymized identically across the dataset. This consistency allows for meaningful analysis of trends and repeated interactions without exposing raw data.
- Data Integrity: By focusing the anonymization on specific segments of IP addresses and hostnames, the overall structure of the data remains intact. This integrity is crucial for operations such as network flow analysis and anomaly detection, which rely on the continuity of data patterns.
- Data Minimization: Alongside anonymizing critical fields, the dataset also undergoes a process of column removal, where non-essential fields that might contain sensitive information are excluded. This further reduces the risk of unintended information exposure.
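The column-removal step mentioned under Data Minimization can be sketched with Python's csv module. The source does not list which fields were removed, so the column names in the usage example (src_ip, dst_ip, username) are purely hypothetical, as is the function name drop_columns.

```python
import csv
import io


def drop_columns(src, dst, drop):
    """Copy CSV rows from src to dst, omitting the columns named in drop."""
    reader = csv.DictReader(src)
    kept = [name for name in (reader.fieldnames or []) if name not in drop]
    # extrasaction="ignore" silently skips the dropped keys in each row.
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


source = io.StringIO("src_ip,dst_ip,username\n1.2.3.4,5.6.7.8,alice\n")
target = io.StringIO()
drop_columns(source, target, {"username"})
# target now holds only the src_ip and dst_ip columns.
```

An allowlist of columns to keep would be a stricter alternative to the denylist used here, since newly added fields would then be excluded by default.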
(2023-06-01)