Summary
This proposal suggests enhancing Elastic Common Schema (ECS) by adding explicit metadata to field definitions, allowing fields to be marked as containing Personally Identifiable Information (PII). This metadata would facilitate automated data governance, filtering, and compliance efforts downstream.
Background
Currently, identifying and managing PII within logs often relies on manually curated lists of sensitive fields. This approach is error-prone, difficult to scale, and requires constant maintenance as the ECS schema evolves or as new integrations introduce new fields.
Standardizing PII metadata directly within the ECS field definitions could offer several benefits:
- Automated Data Governance: Tools and processes can programmatically identify and handle PII fields, enabling automated redaction, anonymization, or access controls.
- Improved Compliance: Provides a clearer, auditable mechanism for demonstrating compliance with data privacy regulations (e.g., GDPR, CCPA) by identifying sensitive data at its source.
- Enhanced Data Understanding: Data producers and consumers gain a standardized understanding of which fields inherently carry PII risk, leading to better data hygiene from ingestion to analysis.
- Reduced Manual Effort: Eliminates the need for maintaining separate, external PII field lists, reducing operational overhead and the risk of human error.
- Consistency Across Ecosystem: Extends a consistent PII identification strategy across ECS fields and integrated vendor fields.
Detailed Design
The proposed solution involves adding new optional metadata fields directly to the ECS field definitions. This metadata would signal that a given field is expected to contain PII.
Proposed Field Names
- pii: A boolean field indicating whether the field is considered PII.
- pii_category (Optional): A keyword field to classify the type of PII (e.g., "Name/Identifier", "Email Address", "IP Address", "Free-text/Context-dependent"). This provides more granular context for handling.
- pii_reason (Optional): A text field to briefly explain why the field is classified as PII.
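To make the shape of these three keys concrete, here is a minimal sketch of how a linter might check them on a single field definition. The function name `validate_pii_metadata` is hypothetical and not part of any ECS tooling, and the rule that `pii_category` and `pii_reason` only make sense alongside `pii: true` is an assumption of this sketch, not something the proposal states.

```python
from typing import Any


# Hypothetical validator for the proposed keys; not part of any ECS tooling.
def validate_pii_metadata(field: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one field definition (empty if none)."""
    problems = []
    name = field.get("name", "<unnamed>")
    pii = field.get("pii")

    # 'pii' is optional, but when present it must be a real boolean.
    if pii is not None and not isinstance(pii, bool):
        problems.append(f"{name}: 'pii' must be a boolean")

    for key in ("pii_category", "pii_reason"):
        value = field.get(key)
        if value is None:
            continue  # both keys are optional
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{name}: '{key}' must be a non-empty string")
        elif pii is not True:
            # Assumption of this sketch: category/reason require pii: true.
            problems.append(f"{name}: '{key}' is set but 'pii' is not true")
    return problems
```

A definition such as `{"name": "user.name", "pii": True, "pii_category": "Name/Identifier"}` would pass with no problems reported, while `{"name": "x", "pii": "yes"}` would be flagged.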
Example Values for the Fields:
Below are examples showing how the proposed pii metadata would be integrated into existing ECS field definitions:
# Example 1: User Name
- name: user.name
  type: keyword
  description: "Name of the user."
  pii: true
  pii_category: "Name/Identifier"
  pii_reason: "Usernames can directly identify individuals."
# Example 2: User Email Address
- name: user.email
  type: keyword
  description: "Email address of the user."
  pii: true
  pii_category: "Email Address"
  pii_reason: "Personal email addresses directly identify individuals."
# Example 3: Source IP Address
- name: source.ip
  type: ip
  description: "IP address of the source."
  pii: true
  pii_category: "IP Address"
  pii_reason: "Public IP addresses can sometimes be linked to individuals, especially in home or small business contexts."
# Example 4: Hostname (potentially PII)
- name: host.hostname
  type: keyword
  description: "Hostname of the host."
  pii: true
  pii_category: "Hostname/Device Name"
  pii_reason: "Hostnames or device names, especially for personal devices, may contain personal identifying information."
# Example 5: File Path (context-dependent PII)
- name: file.path
  type: keyword
  description: "Full path to the file."
  pii: true
  pii_category: "File Path"
  pii_reason: "File paths often contain usernames or other identifying information (e.g., `/Users/johndoe/Documents/`)."
# Example 6: Process Command Line (free-text PII)
- name: process.command_line
  type: keyword
  description: "The full command line that was used to start a process, including the process name and all arguments."
  pii: true
  pii_category: "Free-text/Context-dependent"
  pii_reason: "Command-line arguments can contain arbitrary sensitive data, including credentials, filenames, or user-specific information."
# Example 7: Email Sender Address
- name: email.sender.address
  type: keyword
  description: "Email address of the sender."
  pii: true
  pii_category: "Email Address"
  pii_reason: "Sender's email address directly identifies an individual."
Alternatively, this metadata could reside in a single top-level pii field list rather than as inline metadata on every field definition.
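As a rough illustration of the downstream use case, the sketch below shows how a tool might consume the inline metadata once the YAML has been parsed into Python dictionaries. The helper names `pii_field_names` and `redact`, the `[REDACTED]` placeholder, and the sample event are all hypothetical. The set returned by `pii_field_names` is also roughly what the alternative top-level list would contain, so either layout could feed the same redaction step.

```python
from typing import Any

# Field definitions as they might look after parsing the YAML above.
# event.action is included only as an example of a field without the flag.
FIELDS = [
    {"name": "user.name", "type": "keyword", "pii": True},
    {"name": "source.ip", "type": "ip", "pii": True},
    {"name": "event.action", "type": "keyword"},
]


def pii_field_names(fields: list[dict[str, Any]]) -> set[str]:
    """Collect the dotted names of fields flagged with pii: true."""
    return {f["name"] for f in fields if f.get("pii") is True}


def redact(event: dict[str, Any], pii_names: set[str], prefix: str = "") -> dict[str, Any]:
    """Return a copy of a nested event with PII field values replaced."""
    out: dict[str, Any] = {}
    for key, value in event.items():
        path = f"{prefix}{key}"
        if path in pii_names:
            out[key] = "[REDACTED]"  # placeholder chosen for this sketch
        elif isinstance(value, dict):
            # Recurse so that nested objects map onto dotted field names.
            out[key] = redact(value, pii_names, prefix=f"{path}.")
        else:
            out[key] = value
    return out


event = {
    "user": {"name": "johndoe"},
    "source": {"ip": "203.0.113.7"},
    "event": {"action": "login"},
}
redacted = redact(event, pii_field_names(FIELDS))
```

Here `redacted` keeps `event.action` intact and replaces the values of `user.name` and `source.ip`, without modifying the original `event`.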
Extension to Vendor Fields (Integration Fields)
This PII metadata standard should explicitly extend to fields defined within Elastic Integrations (e.g., in packages//data_stream//fields/fields.yml). When integrations define or map fields, they should be able to apply the pii, pii_category, and pii_reason metadata to their respective field definitions.
For instance, considering the AWS CloudTrail integration fields:
# Example: AWS CloudTrail fields with proposed PII metadata
- name: aws.cloudtrail
  type: group
  fields:
    # ... other fields
    - name: user_identity
      type: group
      fields:
        - name: type
          type: keyword
          description: The type of the identity
        - name: arn
          type: keyword
          description: The Amazon Resource Name (ARN) of the principal that made the call.
          pii: true
          pii_category: "Identifier"
          pii_reason: "ARNs can uniquely identify cloud identities, which may be tied to individuals or roles."
        - name: principal_id
          type: keyword
          description: The internal ID of the entity that was used to get credentials.
          pii: true
          pii_category: "Identifier"
          pii_reason: "Principal IDs are unique identifiers that can be linked to individuals or roles."
        # ... other fields in user_identity
    # ... other fields in aws.cloudtrail
# Example: target.entity.id and actor.entity.id
- name: related.entity
  description: "A collection of all entity identifiers associated with the document."
  type: keyword
  pii: true
  pii_category: "Identifier"
  pii_reason: "This field aggregates various entity identifiers (e.g., cloud resource IDs, ARNs, email addresses) which can be PII."
- name: target
  type: group
  fields:
    - name: entity
      type: group
      fields:
        - name: id
          type: keyword
          pii: true
          pii_category: "Identifier"
          pii_reason: "Target entity IDs can be linked to specific users or resources that contain PII."
- name: actor
  type: group
  fields:
    - name: entity
      type: group
      fields:
        - name: id
          type: keyword
          pii: true
          pii_category: "Identifier"
          pii_reason: "Actor entity IDs represent the initiating entity, often a user, and therefore contain PII."
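Because integration field files nest definitions under `group` entries, a consumer would likely need to walk that structure to recover full dotted field names. The following is one possible sketch, assuming the YAML has already been parsed into dictionaries; `flatten_fields` is a hypothetical helper, and treating a missing `pii` key as "not PII" is an assumption of this example.

```python
from typing import Any, Iterator


def flatten_fields(fields: list[dict[str, Any]], prefix: str = "") -> Iterator[tuple[str, bool]]:
    """Yield (dotted_name, is_pii) for every non-group field definition."""
    for field in fields:
        name = f"{prefix}{field['name']}"
        if field.get("type") == "group":
            # Groups only contribute a name prefix; descend into their children.
            yield from flatten_fields(field.get("fields", []), prefix=f"{name}.")
        else:
            # Assumption: a missing 'pii' key means the field is not flagged.
            yield name, field.get("pii") is True


# A trimmed-down version of the CloudTrail definitions shown above.
definitions = [
    {
        "name": "aws.cloudtrail",
        "type": "group",
        "fields": [
            {
                "name": "user_identity",
                "type": "group",
                "fields": [
                    {"name": "type", "type": "keyword"},
                    {"name": "arn", "type": "keyword", "pii": True},
                ],
            }
        ],
    }
]

flat = dict(flatten_fields(definitions))
```

With these definitions, `flat` maps `aws.cloudtrail.user_identity.arn` to `True` and `aws.cloudtrail.user_identity.type` to `False`.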
It's also important to consider "free-text" or "flattened" fields (like aws.cloudtrail.request_parameters, response_elements, additional_eventdata, service_event_details) where PII might be present within unstructured JSON. While marking the top-level flattened field as pii: true provides a general warning, explicit guidance or tooling for inspecting the contents of such fields for PII at ingest time would be beneficial, as granular PII tagging isn't feasible at the schema level for dynamic content. This recommendation primarily focuses on explicitly defined fields.
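To illustrate what such ingest-time inspection might look like, here is a deliberately simple sketch that walks the contents of a flattened field and reports string values matching a pattern. The function `scan_for_pii`, the two regular expressions, and the sample payload keys are all hypothetical; real detection would need far more robust patterns and would still produce false positives and negatives.

```python
import re
from typing import Any, Iterator

# Deliberately simple patterns for illustration; real detection needs more care.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan_for_pii(value: Any, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (path, pattern_name) for each string leaf that matches a pattern."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from scan_for_pii(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from scan_for_pii(child, f"{path}[{index}]")
    elif isinstance(value, str):
        for label, pattern in PATTERNS.items():
            if pattern.search(value):
                yield path, label


# Hypothetical contents of a flattened field such as request_parameters.
request_parameters = {
    "userName": "jane",
    "tags": [{"key": "owner", "value": "jane.doe@example.com"}],
    "sourceAddress": "198.51.100.23",
}
hits = sorted(scan_for_pii(request_parameters))
```

In this example the scan reports `sourceAddress` as an IPv4 match and `tags[0].value` as an email match, while the bare username goes undetected, which shows why pattern matching alone cannot replace schema-level tagging.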
Relevant Existing Issues
This proposal builds upon the general concept of enriching field metadata, as discussed in elastic/ecs#68, which, if I'm not mistaken, explores adding a type property to describe the nature of fields. While that issue is broader in scope, the introduction of pii metadata aligns with the idea of providing more descriptive and actionable information directly within the ECS field definitions.